Python multithreaded picture crawler (2)

In the previous article, the threads were created by subclassing a thread class. This time, plain functions are used to create the threads, crawling the same website: https://www.quanjing.com/category/1286521/1.html

The code is as follows:

# Multithreaded crawler: folders are created automatically, one folder per page

import requests
import threading
import re
import time
import queue
import os
from bs4 import BeautifulSoup


string = 'https://www.quanjing.com/category/1286521/'
url_queue = queue.Queue()
pipei = re.compile('lowsrc="(.*?)" m=')        # Regular expression that matches the link of each picture


def get_url(page):          # Build one URL per page from the number of pages passed in
    for i in range(1, page + 1):
        url = string + str(i) + '.html'      # Splice the page URL
        url_queue.put(url)                   # Put each URL into the queue
    # print(url_queue.queue)


def spider(url_queue):          # Crawling function
    url = url_queue.get()       # Take the next URL from the queue
    floder_count = url[-7:-5]   # Slice out the page number of the current URL, used below to name the folder.
                                # If the page number has two digits the slice is the number itself;
                                # if it has one digit the slice also contains the preceding '/' character.
    if floder_count[0] == '/':
        floder_name = floder_count[1]
    else:
        floder_name = floder_count
    os.mkdir('Page {0}'.format(floder_name))    # Create the folder for this page
    html = requests.get(url=url).text
    soup = BeautifulSoup(html, 'lxml')          # Parse the page source
    ul = soup.find_all(attrs={"class": "gallery_list"})    # Extract the part of the page that holds the picture links
    # print(ul)
    lianjies = re.findall(pipei, str(ul))       # Match the link of each picture; the regex needs a string argument
    i = 1
    for lianjie in lianjies:
        # print(lianjie)
        result = requests.get(url=lianjie).content    # Request each picture in binary mode and store it
        with open('Page {0}/{1}.jpg'.format(floder_name, i), 'ab') as f:
            f.write(result)
        print('Page {0}, picture {1} saved'.format(floder_name, i))
        i += 1

    if not url_queue.empty():    # If the queue is not empty, the thread keeps working and takes the next URL
        spider(url_queue)


def main():      # Main function: creates and starts the threads
    queue_list = []    # Thread list
    queue_count = 3    # Number of threads
    for i in range(queue_count):
        t = threading.Thread(target=spider, args=(url_queue, ))  # Create a thread: the first argument is the function the thread runs, the second is that function's arguments
        queue_list.append(t)    # Collect the threads
    for t in queue_list:        # Start the threads
        t.start()
    for t in queue_list:        # Wait for all threads to finish
        t.join()


if __name__ == '__main__':
    page = int(input("Please enter the number of pages to crawl:"))
    get_url(page)
    start_time = time.time()
    main()
    print("test3 took: %f" % (time.time() - start_time))    # Crawling time

There were two tricky parts in writing this code. The first was how to keep a thread working as long as the queue is not empty. At first I called main() after the if check, but that only created brand-new threads rather than keeping the existing thread working, and the crawl did not always behave correctly. Calling spider() recursively instead keeps the same thread pulling URLs until the queue runs dry.
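Another way around this, not used in the code above, is to loop over the queue inside the worker instead of recursing: get_nowait() raises queue.Empty once the queue is drained, and the thread simply exits. A minimal sketch of that idea, where download_page is only a stand-in for the body of spider():

import queue
import threading


def download_page(url):
    # Stand-in for the body of spider() above: fetch the page and save its pictures.
    print('crawling', url)


def worker(url_queue):
    # Keep pulling URLs until the queue is empty, without recursion.
    while True:
        try:
            url = url_queue.get_nowait()   # non-blocking get; raises queue.Empty when drained
        except queue.Empty:
            break                          # nothing left, let the thread finish
        download_page(url)


if __name__ == '__main__':
    url_queue = queue.Queue()
    for i in range(1, 6):
        url_queue.put('https://www.quanjing.com/category/1286521/{0}.html'.format(i))
    threads = [threading.Thread(target=worker, args=(url_queue,)) for _ in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()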

The second difficulty was creating the folder. At first there was no check on the two sliced characters, so folder creation failed for single-digit pages. Searching on Baidu suggested os.makedirs, which can create multi-level directories, but that still did not work here (probably because of the '/' character). In the end, adding the check on the sliced characters solved the problem.
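For reference, the page number can also be pulled out of the URL with a regular expression instead of fixed-position slicing, which sidesteps the '/' problem entirely. A small sketch (page_folder is just an illustrative helper name, not part of the code above):

import os
import re


def page_folder(url):
    # Take the digits right before '.html', e.g. '.../1286521/12.html' -> '12'
    match = re.search(r'/(\d+)\.html$', url)
    page_number = match.group(1)
    folder = 'Page {0}'.format(page_number)
    os.makedirs(folder, exist_ok=True)   # no error if the folder already exists
    return folder


print(page_folder('https://www.quanjing.com/category/1286521/3.html'))   # Page 3
print(page_folder('https://www.quanjing.com/category/1286521/12.html'))  # Page 12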

Writing these two multithreaded crawlers was mostly an exercise in understanding how threads work. PS: if anything here is wrong, please feel free to correct me. Hehe :)

