Crawling emoticon packs

This is my first blog post here, and I'm still a little excited.

I have only just started with Python, and I've found it really powerful: it handles complex tasks with very little code. I recently wanted to write a crawler but wasn't yet up to it. Then CSDN College ran an open class, Mr. Huang Yong's "Mastering Python multithreaded crawler in 90 minutes (whole process practice)". I watched the live broadcast at 20:00 on March 6 but couldn't keep up at the time, and only understood it after watching the replay (maybe because I was using Python 3 in a Python 2 class; at least I found the reason, O(∩_∩)O haha~).

Notes first:

How a crawler works:

1. Request the data: the requests library (it makes fetching data over the network easy)
*Install: pip install requests
2. Parse the response, keep the data we want, and discard the rest
*BeautifulSoup: pip install bs4
*lxml: pip install lxml
3. Save the parsed data. Text can be saved to a file, a database, or a cache; files such as pictures and videos can be saved to disk

Whether your crawler is large or small, it is built from these three steps.
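
To make steps 2 and 3 concrete, here is a minimal sketch that runs offline on a made-up HTML fragment. So that it needs no installs, it swaps in the standard library's html.parser for BeautifulSoup. The class ImgLinkParser, the helper local_path, and the sample HTML and URLs are all assumptions for illustration, not part of the class code.

```python
import os
from html.parser import HTMLParser

# Made-up page fragment: lazy-loaded images keep the real link in data-original
SAMPLE_HTML = '''
<div>
  <img class="img-responsive lazy image_dta" src="loading.gif"
       data-original="https://example.com/pics/a1.jpg">
  <img class="logo" src="logo.png">
  <img class="img-responsive lazy image_dta" src="loading.gif"
       data-original="https://example.com/pics/b2.gif">
</div>
'''


class ImgLinkParser(HTMLParser):
    """Collects data-original links from <img> tags with a given class value."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("class") == self.wanted_class:
            link = attrs.get("data-original")
            if link:
                self.links.append(link)


def local_path(img_url, folder="images"):
    """Step 3 helper: build the save path from the last segment of the URL."""
    return os.path.join(folder, img_url.split("/")[-1])


parser = ImgLinkParser("img-responsive lazy image_dta")
parser.feed(SAMPLE_HTML)  # step 2: parse and keep only the images we want
for link in parser.links:
    print(link, "->", local_path(link))
```

The logo image is skipped because its class does not match, which is the "discard the rest" part of step 2.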

Thanks to Mr. Huang Yong. Enough talk, on to the code.

import os
import threading
import urllib.request

import requests
from bs4 import BeautifulSoup

# First of all, disguise the crawler as a browser
Headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36'}
# Placeholder: the listing-page URL is missing from this post, substitute the real one
PAGE_URL_TEMPLATE = 'https://example.com/list?page=%d'
PAGE_URLS = []  # global list of listing pages that still have to be crawled
IMG_URLS = []  # global list that stores the emoticon image links
gLock = threading.Lock()  # guards PAGE_URLS and IMG_URLS across threads


def producer():
    '''Producer: collects emoticon image URLs from the site, i.e. adds data to IMG_URLS'''
    while True:
        gLock.acquire()  # with multiple threads, operations on globals must be locked
        if len(PAGE_URLS) == 0:
            gLock.release()  # unlock before leaving the loop as well
            break
        page_url = PAGE_URLS.pop()  # pop removes the last item of the list and returns it
        gLock.release()  # remember to unlock after the operation
        # headers must be passed by keyword; the second positional argument is params
        response = requests.get(page_url, headers=Headers)
        text = response.text
        # Parse with BeautifulSoup using the lxml engine; other parsers or regular
        # expressions would also work, but are more complicated
        soup = BeautifulSoup(text, 'lxml')
        # Pick the images we want by tag attribute and drop everything else
        img_list = soup.find_all("img", attrs={"class": "img-responsive lazy image_dta"})
        for img in img_list:
            img_url = img['data-original']  # the real image URL; img['src'] is not the real source on this site
            gLock.acquire()
            IMG_URLS.append(img_url)
            gLock.release()


def consumer():
    '''Consumer: downloads the images behind the URLs, i.e. consumes data from IMG_URLS'''
    while True:
        gLock.acquire()
        # At startup IMG_URLS may still be empty, so the end condition also requires
        # PAGE_URLS to be empty. A producer that is still parsing the last page is
        # not detected by this check.
        if len(IMG_URLS) == 0 and len(PAGE_URLS) == 0:
            gLock.release()
            break
        if len(IMG_URLS) > 0:
            img_url = IMG_URLS.pop()  # pop() on an empty list raises an error, hence the check
        else:
            img_url = ''
        gLock.release()
        if img_url:
            filename = img_url.split("/")[-1]  # split the URL on "/" and take the last part as the file name
            fullpath = os.path.join("images", filename)  # join directory and file name; the separator depends on the system
            try:
                # urlretrieve downloads the target URL to a local path; in Python 3 it
                # lives in urllib.request (in Python 2 it was urllib.urlretrieve)
                urllib.request.urlretrieve(img_url, fullpath)
                # print(img_url, "Download complete")
            except Exception as e:
                print(e)
                # Error 10054 also shows up sometimes, probably because the server
                # detects the crawler and forcibly closes the connection
                print(img_url, "Download failed")


def main():
    os.makedirs("images", exist_ok=True)  # the download directory must exist
    for x in range(1, 100):  # crawl pages 1-99
        PAGE_URLS.append(PAGE_URL_TEMPLATE % x)
    for x in range(5):  # start 5 producer threads
        th = threading.Thread(target=producer)
        th.start()
    for x in range(5):  # start 5 consumer threads
        th = threading.Thread(target=consumer)
        th.start()


if __name__ == '__main__':  # runs only when executed as a script, not when imported as a module
    main()
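
One thing to watch in the threaded version: a consumer stops as soon as it finds both lists empty, but a producer may still be parsing the last page at that moment, so a few images can be missed. Below is a minimal sketch of one possible way around this, counting the producers that are still running. It does no network access. The names fake_parse, active_producers and run, and the page names, are made up for illustration and are not from the class code.

```python
import threading
import time

pages = ["page-%d" % n for n in range(1, 7)]  # stand-ins for listing pages
images = []                                    # filled by the producers
done = []                                      # items that were "downloaded"
lock = threading.Lock()
active_producers = 0                           # producers that are still working


def fake_parse(page):
    """Pretend to fetch and parse a page; returns two made-up image names."""
    time.sleep(0.01)
    return [page + "-img1", page + "-img2"]


def producer():
    global active_producers
    while True:
        with lock:
            if not pages:
                active_producers -= 1  # this producer has nothing more to add
                return
            page = pages.pop()
        found = fake_parse(page)       # slow work happens outside the lock
        with lock:
            images.extend(found)


def consumer():
    while True:
        with lock:
            if images:
                item = images.pop()
            elif active_producers == 0:
                return                 # nothing left and nobody will add more
            else:
                item = None
        if item is None:
            time.sleep(0.001)          # wait briefly instead of spinning
        else:
            with lock:
                done.append(item)


def run(n_producers=3, n_consumers=3):
    global active_producers
    active_producers = n_producers     # set before any thread starts
    threads = [threading.Thread(target=producer) for _ in range(n_producers)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(done)


print(run())  # 6 pages x 2 images each -> prints 12
```

The "with lock:" form acquires and releases the lock automatically, so a forgotten release() before a break or return cannot leave other threads blocked.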

Keywords: Python pip network Database

Added by mgason on Mon, 09 Dec 2019 07:12:41 +0200