Python - crawling hot pictures from Qiushibaike with gevent coroutines

Regular expressions did not solve the matching problem here, so the bs4 library is used to do the extraction instead. After repeated testing the crawl works, and the exercise deepens the understanding of the bs4.BeautifulSoup module.
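As a quick illustration of why bs4 is more convenient here than a regular expression, the snippet below (using made-up HTML that mimics the structure of the target page) pulls out an image link in a few readable lines:

from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure of the target page.
html = '<div class="thumb"><a href="#"><img src="//pic.example.com/1.jpg"></a></div>'

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', 'thumb'):
    for img in div.find_all('img'):
        print(img.get('src'))  # prints //pic.example.com/1.jpg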

Crawling process

Prelude:

Since the content has to be crawled across multiple pages, first analyze how the URL of the hot-picture section of Qiushibaike (the "Encyclopedia of Embarrassing Things") changes from one page to the next.
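Inspecting the navigation shows that page n of the section lives at https://www.qiushibaike.com/imgrank/page/n (the same URL used in main() below), so the per-page URLs can be built by appending the page number; a minimal sketch:

base_url = 'https://www.qiushibaike.com/imgrank/page/'
depth = 20  # number of pages to crawl, matching main() below
page_urls = [base_url + str(i + 1) for i in range(depth)]
# ['https://www.qiushibaike.com/imgrank/page/1', ..., '.../page/20']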

Specific steps:

1. Fetch the page content (urllib.request). Qiushibaike has an anti-crawler mechanism, so request headers must be added to disguise the program as a browser (see the sketch after this list)

2. Parse the page content and extract the picture links (from bs4 import BeautifulSoup)

3. Download each picture through its link (urllib.request) and store it locally
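Step 1 rests on one key trick, shown here in isolation: a minimal sketch of the same build_opener/install_opener technique used in get_html_text below (the exact header string is a shortened placeholder):

import urllib.request

# urllib's default User-Agent ("Python-urllib/3.x") is easy for the site
# to block, so install an opener whose requests carry a browser-like header.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64)')]
urllib.request.install_opener(opener)

# Every subsequent urlopen() call now sends the disguised header.
html = urllib.request.urlopen('https://www.qiushibaike.com/imgrank/page/1').read()
print(len(html))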

Remarks:

The specific crawling steps are explained in detail in the code comments below.

import os
import urllib.request

import bs4
from bs4 import BeautifulSoup
import gevent
from gevent import monkey

# Patch the standard library so blocking I/O such as urlopen() becomes
# cooperative and can yield to other greenlets.
monkey.patch_all()


def get_html_text(url, raw_html_text, depth):

    # Crawl the raw HTML of each page.

    # Qiushibaike has an anti-crawler mechanism, so set a request header
    # that disguises the program as a browser.
    hd = ('User-Agent',
          'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
          '(KHTML, like Gecko) Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0')

    # Build an opener object that carries the header
    opener = urllib.request.build_opener()
    opener.addheaders = [hd]

    # Install the opener globally so every urlopen() call uses it
    urllib.request.install_opener(opener)

    # Fetch the HTML of every page
    for i in range(depth):
        # From the URL analysis above, page i+1 lives at url + str(i+1)
        url_real = url + str(i + 1)
        try:
            html_data = urllib.request.urlopen(url_real).read().decode('utf-8', 'ignore')
            raw_html_text.append(html_data)
            # Test code
            # print(len(html_data))
        except Exception as result:
            print('Error type:', result)

    print('Extraction of web page information complete...')
    return raw_html_text
    # Test code
    # print(len(raw_html_text))


def parser_html_text(raw_html_text, done_img):

    # Walk through the crawled pages and extract the picture links.

    for html_text in raw_html_text:
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(html_text, 'html.parser')
        # soup.find_all('div', 'thumb') finds every div tag whose class is
        # "thumb"; the page source shows each picture link sits in the src
        # attribute of an img tag nested inside such a div.
        for tag in soup.find_all('div', 'thumb'):
            # Check that tag is a bs4.element.Tag, because not every node
            # under the div is an element tag
            if isinstance(tag, bs4.element.Tag):
                # Walk all descendants of the div
                for img in tag.descendants:
                    # When a tag named 'img' is found, take the value of
                    # its src attribute
                    if img.name == 'img':
                        link = img.get('src')
                        done_img.append(link)
    # Test code
    # print(done_img)
    print('Page parsing complete...')
    return done_img


def save_crawler_data(done_img):
    # Store the pictures locally; './' is the current directory
    path = './img/'
    os.makedirs(path, exist_ok=True)
    # enumerate(list) yields each index together with the element at that index
    for i, j in enumerate(done_img):
        # The crawled links are protocol-relative and lack 'https:' in
        # front, so add it by string concatenation
        j = 'https:' + j
        # Download the picture via urllib.request.urlopen()
        try:
            img_data = urllib.request.urlopen(j).read()
            path_real = path + str(i + 1)
            with open(path_real, 'wb') as f:
                f.write(img_data)
        except Exception:
            continue
    print('Picture storage complete')


def main():
    url = 'https://www.qiushibaike.com/imgrank/page/'
    depth = 20
    raw_html_text = list()
    done_img = list()
    # Each stage consumes the previous stage's output, so join each
    # greenlet before spawning the next; the monkey patching lets the
    # network I/O inside each stage yield cooperatively.
    gevent.spawn(get_html_text, url, raw_html_text, depth).join()
    gevent.spawn(parser_html_text, raw_html_text, done_img).join()
    gevent.spawn(save_crawler_data, done_img).join()


if __name__ == '__main__':
    main()
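In main() above the three stages run one after another, so the greenlets mostly add structure rather than speed. Where gevent really pays off is fetching many pages at once. Here is a minimal sketch of that variant, assuming the same page URL scheme (the page range and the short User-Agent string are arbitrary placeholders):

from gevent import monkey
monkey.patch_all()

import urllib.request
import gevent


def fetch(page):
    # The site blocks urllib's default User-Agent, so send a browser-like one.
    req = urllib.request.Request(
        'https://www.qiushibaike.com/imgrank/page/' + str(page),
        headers={'User-Agent': 'Mozilla/5.0'})
    return urllib.request.urlopen(req).read().decode('utf-8', 'ignore')


# One greenlet per page: the patched urlopen() yields during network waits,
# so the requests overlap instead of running back to back.
jobs = [gevent.spawn(fetch, page) for page in range(1, 6)]
gevent.joinall(jobs)
pages = [job.value for job in jobs]
print(len(pages), 'pages fetched')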


Added by Stryves on Mon, 02 Dec 2019 20:56:03 +0200