Crawling emoticon packs

This is my first time blogging here, and I'm still a little excited.

I've also only just gotten started with Python, and I've found that Python code is really powerful and handles complex things easily. Recently I wanted to write a crawler, but I wasn't at that level yet. Then CSDN College ran an open class, Mr. Huang Yong's "Mastering the Python multithreaded crawler in 90 minutes (full hands-on practice)". I watched the live broadcast at 20:00 on March 6 but couldn't keep up at the time; I only understood it after watching the replay (maybe because I was writing Python 3 while the class used Python 2, which took a while to figure out O(∩_∩)O haha~).

First, my notes from the class:

Crawler workflow analysis:

1. Request the data: the requests library (this library makes it easy to fetch data over the network)
*Installation: pip install requests
2. Parse the response, keep the data we want, and discard the rest
*BeautifulSoup: pip install bs4
*lxml: pip install lxml
3. Save the parsed data. If it is text, it can be saved to a file, a database, or a cache; if it is a file type such as pictures or videos, it can be saved to disk
4. Whether your crawler is large or small, it is composed of these steps (a minimal single-threaded sketch of them follows this list).
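
To make the three steps concrete before the multithreaded version, here is a minimal single-threaded sketch. It assumes the same doutula.com listing page and the same img markup as the class code further down; the site's markup may have changed since then, so treat the selectors as an example rather than a guarantee.

# coding: utf-8
# Minimal sketch of the three steps: request -> parse -> save.
import os
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # pretend to be a browser
url = "https://www.doutula.com/photo/list/?page=1"

response = requests.get(url, headers=headers)        # 1. request the data
soup = BeautifulSoup(response.text, 'lxml')          # 2. parse it with the lxml engine
img_list = soup.find_all("img", attrs={"class": "img-responsive lazy image_dta"})

os.makedirs("images", exist_ok=True)                 # 3. save: make sure the target directory exists
for img in img_list:
    img_url = img["data-original"]                   # the real image source on this site
    filename = img_url.split("/")[-1]
    data = requests.get(img_url, headers=headers).content
    with open(os.path.join("images", filename), "wb") as f:
        f.write(data)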

Thank you, Mr. Huang Yong. Without further ado, on to the code.

#coding:utf-8
import requests
import os
from bs4 import BeautifulSoup
import urllib.request
import threading
# First, disguise the requests as a normal browser by setting a User-Agent
Headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'}
PAGE_URLS=[]
IMG_URLS=[]#global variable IMG_URLS: just a list that stores the emoticon image links
gLock=threading.Lock()
def producer():
    '''Producer: fetches the emoticon image URLs from the listing pages, i.e. adds data to IMG_URLS'''
    while True:
        gLock.acquire()#operations on shared globals must be locked when multithreading
        if len(PAGE_URLS)==0:
            gLock.release()#You must also unlock before exiting the loop
            break
        page_url=PAGE_URLS.pop()#pop removes the last item from the list and returns it
        gLock.release()#Remember to unlock after operation
        response = requests.get(page_url, headers=Headers)#fetch the page; headers must be passed as a keyword argument, otherwise requests treats the second positional argument as params
        text = response.text
        # print(text)
        soup = BeautifulSoup(text, 'lxml')#build a BeautifulSoup object with the lxml engine to parse the data; other parsers or regular expressions would also work, just more complex
        img_list = soup.find_all("img", attrs={"class": "img-responsive lazy image_dta"})#select only the emoticon images by their tag attributes and discard everything else
        for img in img_list:
            img_url = img['data-original']#the real image source is in data-original; img['src'] is just a lazy-load placeholder on this site
            gLock.acquire()
            IMG_URLS.append(img_url)
            gLock.release()

def consumer():
    '''Consumer: downloads the pictures from the emoticon URLs, i.e. consumes the data in IMG_URLS'''
    while True:
        gLock.acquire()
        if len(IMG_URLS)==0 and len(PAGE_URLS)==0:#the producers may not have filled IMG_URLS yet, so the exit condition is that both IMG_URLS and PAGE_URLS are empty
            gLock.release()
            break
        if len(IMG_URLS)>0:
            img_url=IMG_URLS.pop()#pop() on an empty list would raise an error, hence the length check above
        else:
            img_url=''
        gLock.release()
        if img_url:
            filename = img_url.split("/")[-1]#split the image URL and take the last part as the file name
            fullpath = os.path.join("images", filename)#join the images directory and the file name; os.path.join picks the right separator for the OS instead of hard-coding "/"
            try:
                urllib.request.urlretrieve(img_url, fullpath)#urlretrieve downloads the target URL to a local path; in Python 3 it lives in urllib.request
                # print(img_url, "Download complete")
            except Exception as e:
                print(e)
                print(img_url, "Download failed")#10054 errors also show up here; the server probably detects the crawler and forcibly closes the connection

def main():
    os.makedirs("images", exist_ok=True)#make sure the images directory exists before the consumers start writing to it
    for x in range(1,100):#crawl pages 1-99
        page_url="https://www.doutula.com/photo/list/?page="+str(x)
        PAGE_URLS.append(page_url)
    for x in range(5):#Open 5 producer threads
        th=threading.Thread(target=producer)
        th.start()
    for x in range(5):#Open 5 consumer threads
        th =threading.Thread(target=consumer)
        th.start()
if __name__ == '__main__':#run only when executed as a script, not when imported as a module
    main()
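
One note on the structure: main() just starts the threads and returns, and the non-daemon producer and consumer threads keep the program alive until both lists are drained. If you want main() to wait for everything and print a message at the end, a small variation is to keep the thread objects and join them (a sketch, not part of the class code):

def main():
    os.makedirs("images", exist_ok=True)
    for x in range(1, 100):
        PAGE_URLS.append("https://www.doutula.com/photo/list/?page=" + str(x))
    threads = []
    for x in range(5):
        threads.append(threading.Thread(target=producer))#5 producer threads
    for x in range(5):
        threads.append(threading.Thread(target=consumer))#5 consumer threads
    for th in threads:
        th.start()
    for th in threads:
        th.join()#block until every producer and consumer has finished
    print("All downloads finished")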
