Today we could not solve the practical problem with regular expressions alone, so we used the BS4 library to do the matching instead. After repeated testing we finally solved the problem, which deepened our understanding of the bs4 BeautifulSoup module.
Crawling process
Prelude:
Because the content has to be crawled page by page, the first step is to analyse how the URLs of the different pages of Qiushibaike's hot-image section are built, so that the crawler can turn pages by constructing each page's address; the pattern is sketched below.
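Looking at the address bar while turning pages, the page number simply sits at the end of the URL, so each page's address can be built by string concatenation. A minimal sketch of that pattern (the base URL and the depth of 20 pages are the ones used in the code further down):

base_url = 'https://www.qiushibaike.com/imgrank/page/'

# Page 1 is .../page/1, page 2 is .../page/2, and so on,
# so turning pages is just appending the page index to the base URL.
page_urls = [base_url + str(page) for page in range(1, 21)]
print(page_urls[0])    # https://www.qiushibaike.com/imgrank/page/1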
Specific steps:
1. Fetch the page content with urllib.request. Qiushibaike has an anti-crawler mechanism, so a User-Agent header has to be added to disguise the program as a browser.
2. Parse the page content and extract the image links (from bs4 import BeautifulSoup); a small standalone sketch of this step follows the list.
3. Download each picture through its link (urllib.request) and store it locally.
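Before the full program, here is a minimal standalone sketch of step 2, assuming the page structure described in the code comments (a div tag with class thumb whose grandchild img tag carries the picture link in its src attribute). The HTML fragment is hand-written for illustration, not copied from the live site:

import bs4
from bs4 import BeautifulSoup

# Hand-written fragment imitating the structure the parser expects (illustrative only)
html_text = '''
<div class="thumb">
  <a href="/article/1"><img src="//pic.qiushibaike.com/system/pictures/demo.jpg" alt="demo"></a>
</div>
'''

soup = BeautifulSoup(html_text, 'html.parser')
links = []
for tag in soup.find_all('div', 'thumb'):        # every div whose class is "thumb"
    for img in tag.descendants:                  # walk all descendants of the div
        # keep only real tags named img and read their src attribute
        if isinstance(img, bs4.element.Tag) and img.name == 'img':
            links.append(img.get('src'))
print(links)    # ['//pic.qiushibaike.com/system/pictures/demo.jpg']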
Remarks:
The specific crawling steps are explained in detail in the comments of the code below.
import os
import urllib.request
import requests
from bs4 import BeautifulSoup
# import re
import gevent
from gevent import monkey
import bs4

# Patch blocking I/O so that network calls can yield to gevent greenlets
monkey.patch_all()


def get_html_text(url, raw_html_text, depth):

    # Crawl the raw HTML of each page

    # Qiushibaike has an anti-crawler mechanism, so set a User-Agent request header
    # to disguise the program as a browser
    hd = ('User-Agent',
          'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
          'Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0')

    # Build an opener object carrying the header
    opener = urllib.request.build_opener()
    opener.addheaders = [hd]

    # Install the opener globally so that every urlopen() call uses it
    urllib.request.install_opener(opener)

    # Crawl the html_text page by page
    for i in range(depth):
        # From the URL analysis, each page is the base URL plus the page number
        url_real = url + str(i + 1)
        try:
            html_data = urllib.request.urlopen(url_real).read().decode('utf-8', 'ignore')
            raw_html_text.append(html_data)
            # Test code
            # print(len(html_data))
        except Exception as result:
            print('Error type:', result)

    print('Web page extraction complete...')
    return raw_html_text
    # Test code
    # print(len(raw_html_text))


def parser_html_text(raw_html_text, done_img):

    # Walk through every crawled page

    for html_text in raw_html_text:
        # Parse the page with BeautifulSoup
        soup = BeautifulSoup(html_text, 'html.parser')
        # soup.find_all('div', 'thumb') finds every div tag whose class is "thumb";
        # analysing the page source shows that the image link sits in the src attribute
        # of an img tag that is a grandchild of that div
        for tag in soup.find_all('div', 'thumb'):
            # Make sure the element is a bs4.element.Tag, since not everything under the div is a tag
            if isinstance(tag, bs4.element.Tag):
                # Walk all descendants of the div
                for img in tag.descendants:
                    # If the descendant is an img tag, take the value of its src attribute
                    if isinstance(img, bs4.element.Tag) and img.name == 'img':
                        link = img.get('src')
                        done_img.append(link)
    # Test code
    # print(done_img)
    print('Page parsing complete...')
    return done_img


def save_crawler_data(done_img):
    # Store the images locally; './' is the current directory
    path = './img/'
    # Make sure the target directory exists, otherwise every open() below would fail
    os.makedirs(path, exist_ok=True)
    # enumerate(list) yields each index together with the element at that index
    for i, j in enumerate(done_img):
        # The crawled links are missing the leading 'https:', so prepend it by string concatenation
        j = 'https:' + j
        # Download the picture with urllib.request.urlopen()
        try:
            img_data = urllib.request.urlopen(j).read()
            path_real = path + str(i + 1)
            with open(path_real, 'wb') as f:
                f.write(img_data)
        except Exception:
            continue
    print('Picture storage complete')


def main():
    url = 'https://www.qiushibaike.com/imgrank/page/'
    depth = 20
    raw_html_text = list()
    done_img = list()

    # Crawling, parsing and saving each depend on the previous step's result,
    # so the first two run in order and the download step is handed to a greenlet
    raw_html_text = get_html_text(url, raw_html_text, depth)
    done_img = parser_html_text(raw_html_text, done_img)
    gevent.joinall([
        gevent.spawn(save_crawler_data, done_img)
    ])


if __name__ == '__main__':
    main()