This is my first blog post here, and I'm a little excited.
I have also only recently picked up Python, and I've found that Python is really powerful and handles complex things easily. Recently I wanted to write a crawler but wasn't at that level yet. Then CSDN College ran an open class, Mr. Huang Yong's "Master the Python multithreaded crawler in 90 minutes (full hands-on walkthrough)". I watched the live broadcast at 8 p.m. on March 6 but couldn't keep up at the time; I only really understood it after watching the replay (it turned out I was running Python 3 while the class used Python 2 -- mystery solved, O(∩_∩)O haha~).
First, the notes.
The workflow of a crawler:
1. Request the data: the requests library (it makes requesting data over the network easy)
*Installation: pip install requests
2. Parse the requested data: keep the data we want and discard the rest
*BeautifulSoup: pip install bs4
*lxml: pip install lxml
3. Save the parsed data: text can go to a file, a database, or a cache; files such as pictures and videos can be saved to disk
4. Whether your crawler is large or small, it is made up of these steps. (A minimal single-threaded sketch of the whole flow follows this list.)
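To make these three steps concrete before the multithreaded version, here is a minimal single-threaded sketch. It reuses the same doutula list page and the same img class that appear in the full code further down; this summary is my own, not part of Mr. Huang's course code, so treat the URL and the class name as placeholders that may change if the site changes.

# coding: utf-8
# Minimal single-threaded sketch of the three steps: request -> parse -> save.
import os
import requests
from bs4 import BeautifulSoup
from urllib import request

headers = {'User-Agent': 'Mozilla/5.0'}                                              # identity camouflage, as in the full code
resp = requests.get("https://www.doutula.com/photo/list/?page=1", headers=headers)   # 1. request the data
soup = BeautifulSoup(resp.text, 'lxml')                                              # 2. parse it with BeautifulSoup + lxml
if not os.path.exists("images"):
    os.mkdir("images")
for img in soup.find_all("img", attrs={"class": "img-responsive lazy image_dta"}):
    img_url = img['data-original']                                                   # the real image source on this site
    filename = img_url.split("/")[-1]
    request.urlretrieve(img_url, os.path.join("images", filename))                   # 3. save the parsed data to disk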
Thanks again to Mr. Huang Yong. Without further ado, here is the code.
# coding: utf-8
import os
import threading

import requests
from bs4 import BeautifulSoup
from urllib import request  # in Python 3, urlretrieve lives in urllib.request

# First of all, identity camouflage: pretend to be a normal browser
Headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'}

PAGE_URLS = []  # list pages waiting to be crawled
IMG_URLS = []   # global variable IMG_URLS: just a list that stores the links of the emoticon images
gLock = threading.Lock()


def producer():
    """Producer: fetches emoticon URLs from the site, i.e. adds data to IMG_URLS."""
    while True:
        gLock.acquire()               # operations on global variables must be locked in multithreading
        if len(PAGE_URLS) == 0:
            gLock.release()           # you must also unlock before breaking out of the loop
            break
        page_url = PAGE_URLS.pop()    # pop() removes and returns the last item of the list
        gLock.release()               # remember to unlock after the operation

        response = requests.get(page_url, headers=Headers)  # request the page; headers must be passed as the headers keyword
        text = response.text
        # print(text)
        soup = BeautifulSoup(text, 'lxml')  # parse with the lxml engine; other parsers or regular expressions also work, just more verbose
        # find the tags we want by their attributes and discard the other, non-emoticon images
        img_list = soup.find_all("img", attrs={"class": "img-responsive lazy image_dta"})
        for img in img_list:
            img_url = img['data-original']  # the real image source; img['src'] is only the lazy-load placeholder on this site
            gLock.acquire()
            IMG_URLS.append(img_url)
            gLock.release()


def consumer():
    """Consumer: downloads the images behind the URLs in IMG_URLS, i.e. consumes the data."""
    while True:
        gLock.acquire()
        # IMG_URLS may be temporarily empty while producers are still working,
        # so only when PAGE_URLS is empty as well do we treat it as the end condition
        if len(IMG_URLS) == 0 and len(PAGE_URLS) == 0:
            gLock.release()
            break
        if len(IMG_URLS) > 0:
            img_url = IMG_URLS.pop()  # pop() on an empty list raises an error, hence the check
        else:
            img_url = ''
        gLock.release()

        if img_url:
            filename = img_url.split("/")[-1]            # split the image URL and take the last part as the file name
            fullpath = os.path.join("images", filename)  # join the images directory and the file name; os.path.join handles the "/" on every system
            try:
                # urlretrieve downloads the target URL to a local path; in Python 3 it is in urllib.request
                request.urlretrieve(img_url, fullpath)
                # print(img_url, "Download complete")
            except Exception as e:
                print(e)
                # a 10054 error here usually means the server noticed the crawler and closed the connection
                print(img_url, "Download failed")


def main():
    if not os.path.exists("images"):  # urlretrieve cannot create the directory, so make sure it exists
        os.mkdir("images")
    for x in range(1, 100):           # crawl pages 1-99
        page_url = "https://www.doutula.com/photo/list/?page=" + str(x)
        PAGE_URLS.append(page_url)
    for x in range(5):                # start 5 producer threads
        th = threading.Thread(target=producer)
        th.start()
    for x in range(5):                # start 5 consumer threads
        th = threading.Thread(target=consumer)
        th.start()


if __name__ == '__main__':  # run main() only when executed as a script, not when imported as a module
    main()
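One design note: the manual gLock plus the two global lists can also be replaced by queue.Queue from the standard library, which is thread-safe on its own. The sketch below only shows that alternative shape; it is not part of the course code, and the None "poison pill" used to stop the consumers is my own assumption.

# coding: utf-8
# Sketch only: the same producer/consumer idea with queue.Queue instead of a manual Lock.
import threading
from queue import Queue, Empty

page_queue = Queue()   # list pages waiting to be parsed
img_queue = Queue()    # image URLs waiting to be downloaded

def producer():
    while True:
        try:
            page_url = page_queue.get_nowait()   # non-blocking; raises Empty when no pages are left
        except Empty:
            break
        # ... request and parse page_url as in the code above, then img_queue.put(img_url)

def consumer():
    while True:
        img_url = img_queue.get()    # blocks until an item is available
        if img_url is None:          # poison pill: tells this consumer thread to stop
            break
        # ... download img_url as in the code above

for x in range(1, 100):
    page_queue.put("https://www.doutula.com/photo/list/?page=" + str(x))
producers = [threading.Thread(target=producer) for _ in range(5)]
consumers = [threading.Thread(target=consumer) for _ in range(5)]
for th in producers + consumers:
    th.start()
for th in producers:
    th.join()                        # wait until every page has been handled
for _ in range(5):
    img_queue.put(None)              # one poison pill per consumer thread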