In the previous article, multithreading was implemented with classes. This time, functions are used to build a multithreaded crawler for the same website: https://www.quanjing.com/category/1286521/1.html
The code is as follows:
# Multithreading, automatically create folders, each page is stored in its own folder

import requests
import threading
import re
import time
import queue
import os
from bs4 import BeautifulSoup


string = 'https://www.quanjing.com/category/1286521/'
url_queue = queue.Queue()
pipei = re.compile('lowsrc="(.*?)" m=')  # Regular expression that matches the link of each picture


def get_url(page):  # Build one url per page based on the number of pages passed in
    for i in range(1, page + 1):
        url = string + str(i) + '.html'  # Splice the url
        url_queue.put(url)               # Put each url into the queue
    # print(url_queue.queue)


def spider(url_queue):  # Crawling function
    url = url_queue.get()      # Take the url at the front of the queue
    floder_count = url[-7:-5]  # Slice out the current page number, used to create the folder later.
                               # For a two-digit page this is the page number itself; for a one-digit
                               # page it is the preceding '/' plus the digit.
    if floder_count[0] == '/':
        floder_name = floder_count[1]
    else:
        floder_name = floder_count
    os.mkdir('Page {0}'.format(floder_name))  # mkdir creates the folder for this page
    html = requests.get(url=url).text
    soup = BeautifulSoup(html, 'lxml')                   # Parse the page source
    ul = soup.find_all(attrs={"class": "gallery_list"})  # Extract the part containing the picture links
    # print(ul)
    lianjies = re.findall(pipei, str(ul))  # Match each picture link; regular matching needs a string
    i = 1
    for lianjie in lianjies:
        # print(lianjie)
        result = requests.get(url=lianjie).content  # Request each picture in binary mode and store it
        with open('Page {0}/{1}.jpg'.format(floder_name, i), 'ab') as f:
            f.write(result)
        print('Page {0}, picture {1} stored'.format(floder_name, i))
        i += 1

    if not url_queue.empty():  # If the queue is not empty, the thread keeps working and takes the next url
        spider(url_queue)


def main():  # Main function, used for thread creation and startup
    queue_list = []  # Thread list
    queue_count = 3  # Number of threads
    for i in range(queue_count):
        t = threading.Thread(target=spider, args=(url_queue, ))  # The first argument is the function the
                                                                 # thread runs, the second is its arguments
        queue_list.append(t)  # Collect the threads
    for t in queue_list:      # Start the threads
        t.start()
    for t in queue_list:      # Wait for all threads to finish
        t.join()


if __name__ == '__main__':
    page = int(input("Please enter the number of pages to crawl:"))
    get_url(page)
    start_time = time.time()
    main()
    print("test3 took: %f" % (time.time() - start_time))  # Measure the crawling time
There were two difficulties in writing this code. The first was how to keep a thread working as long as the queue is not empty. At first, main() was called again after the if check, but that only created brand-new threads instead of keeping the existing thread working, and sometimes the crawl went wrong. In the end, calling the spider function itself again after the check solved it.
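An alternative worth noting, shown only as a sketch and not the code above: instead of checking empty() and then recursing, a worker can loop and call get() in non-blocking mode, catching queue.Empty when the queue runs dry. That avoids deep recursion and the small race where another thread grabs the last url between the empty() check and the get(). The worker name and the print body below are my own illustration:

import queue
import threading

url_queue = queue.Queue()
for i in range(1, 6):
    url_queue.put('https://www.quanjing.com/category/1286521/{0}.html'.format(i))

def spider_worker(url_queue):
    # Loop until the queue runs dry instead of recursing.
    while True:
        try:
            url = url_queue.get(block=False)  # raises queue.Empty when nothing is left
        except queue.Empty:
            break                             # queue drained, this thread exits
        print(threading.current_thread().name, 'would crawl', url)
        # ... download and store the pictures for this url, same as spider() above ...

threads = [threading.Thread(target=spider_worker, args=(url_queue,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()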
The second difficulty was creating the folder. At first there was no check on the two characters sliced from the url, so folder creation failed. A search on Baidu suggested makedirs, but after trying it, creating a multi-level directory still did not work (possibly because of the '/' character in the slice). Later, adding a check on the first character solved the problem.
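Another way to handle this, again just a sketch rather than the code above: pull the page number out of the url with a regular expression instead of a fixed slice, so the '/' never ends up in the name, and create the directory with os.makedirs(..., exist_ok=True) so repeated runs do not fail. The helper name is my own:

import os
import re

def folder_for(url):
    # Extract the page number from urls like '.../1286521/12.html', whatever its width.
    page = re.search(r'/(\d+)\.html$', url).group(1)
    name = 'Page {0}'.format(page)
    os.makedirs(name, exist_ok=True)  # does not raise if the folder already exists
    return name

print(folder_for('https://www.quanjing.com/category/1286521/3.html'))   # Page 3
print(folder_for('https://www.quanjing.com/category/1286521/12.html'))  # Page 12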
Writing these two multithreaded crawlers was mainly an exercise in understanding how threads work. PS: if anything here is wrong, please point it out. :)