Learning Python Crawlers: Crawling the 169 Picture Website

169 Beauty Picture Network describes its aesthetic as health, beauty, youth and fashion, and presents pictures of contemporary young women to a large audience of netizens. This post crawls its galleries with requests and pyquery.

import os

import requests
from pyquery import PyQuery as pq

headers={
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}

# Download every picture on one gallery page into F:/picture/<file>
def Download_the_module(file,tehurl):
    count = 1
    # Request the gallery page and parse it
    The_second_request = requests.get(tehurl, headers=headers).text
    The_doc = pq(The_second_request)
    # Every <img> inside the .big_img container is a picture to download
    Extract_the=The_doc('.big_img').find('img').items()
    # Create the target directory first if it does not exist yet
    Save_the_address = 'F:/picture/'+file
    os.makedirs(Save_the_address, exist_ok=True)
    for i in Extract_the:
        save=i.attr('src')
        if not save:
            continue
        #print(save)
        The_sponse=requests.get(save,headers=headers)
        with open(Save_the_address+'/%s.jpg'%count,'wb') as f:
            f.write(The_sponse.content)
        print('Downloaded picture %s'%count)
        count += 1

# Crawl one list page and download every gallery linked from it
def Climb_to_address(page):
    # The list-page URL is blank in the original post; it must be a
    # template containing one %s for the page number, or this line fails
    URL=''%page
    sponse=requests.get(URL,headers=headers)
    sponse.encoding='gbk'
    encodin=sponse.text
    doc=pq(encodin)
    extract=doc('.pic').items()
    for i in extract:
        # Gallery title, used as the directory name
        The_file_name=i.text()
        # Link to the gallery page
        The_url=i.attr('href')
        Download_the_module(The_file_name,The_url)

# There are 616 pages altogether.
a=int(input('Please enter the page number to start crawling from:'))
b=int(input('Please enter the page number to stop crawling at:'))
# Climb_to_address takes a single page, so loop over the inclusive range
for page in range(a, b+1):
    Climb_to_address(page)
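
The save step of the crawler can also be looked at in isolation. The following is only a sketch: the helper name `save_images` and its parameters are hypothetical and not part of the original script, and it takes the already-downloaded image bytes as input so that it runs without any network access. It creates the directory first and then writes every image under a numbered name.

```python
import os


def save_images(directory, blobs):
    """Write each bytes object in `blobs` to directory/1.jpg, 2.jpg, ...

    `save_images` is a hypothetical helper for illustration only.
    Returns the list of paths that were written.
    """
    # exist_ok=True makes this a no-op when the directory already exists,
    # so directory creation never takes the place of writing a file
    os.makedirs(directory, exist_ok=True)
    paths = []
    for count, data in enumerate(blobs, start=1):
        path = os.path.join(directory, '%s.jpg' % count)
        with open(path, 'wb') as f:
            f.write(data)
        paths.append(path)
    return paths
```

In the crawler, `blobs` would be the `.content` of each image response, and `directory` would be the per-gallery folder under `F:/picture/`.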


One advantage of using Python is that it can do repetitive work for us, freeing up our time for the things we like (that is, being lazy).

There are two things to note about this crawler. First, the website has no anti-crawling mechanism, so the request headers need little beyond a user-agent, and there is no need to set up a session or cookies. That is a great convenience and keeps the code simple. Second, the program cannot resume after an interruption: once it is interrupted, you have to start downloading from the beginning again, so there should be a way to set where the crawler starts. This problem is not difficult to solve. I leave it as homework, and you can try it when you have time!
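
One possible approach to the homework is to record the next page to crawl in a small checkpoint file. This is a minimal sketch, not the only solution: the function names `load_checkpoint`, `save_checkpoint` and `crawl_range`, and the file name `checkpoint.txt`, are all assumptions introduced here for illustration. The per-page function is passed in as `crawl_page`, so the sketch runs without any network access.

```python
import os


def load_checkpoint(path, default):
    """Return the next page recorded in `path`, or `default` if the
    file is missing or does not contain an integer."""
    try:
        with open(path, 'r', encoding='utf-8') as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return default


def save_checkpoint(path, next_page):
    """Record the next page to crawl, replacing the file atomically."""
    tmp = path + '.tmp'
    with open(tmp, 'w', encoding='utf-8') as f:
        f.write(str(next_page))
    os.replace(tmp, path)


def crawl_range(start, end, crawl_page, checkpoint_path='checkpoint.txt'):
    """Crawl pages start..end inclusive, resuming from the checkpoint
    if an earlier run was interrupted."""
    first = max(start, load_checkpoint(checkpoint_path, start))
    for page in range(first, end + 1):
        crawl_page(page)
        # Written only after the page finished without an exception
        save_checkpoint(checkpoint_path, page + 1)
```

With the script above, the call would look like `crawl_range(a, b, Climb_to_address)`. An interrupted page is crawled again from its first gallery on the next run. Checking `os.path.exists` for each image file before requesting it would avoid downloading those pictures a second time.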

Keywords: Python network Windows encoding

Added by llcoollasa on Thu, 03 Oct 2019 23:50:30 +0300