Preface
I have a hunch that I'll be needing a crawler soon. After spending the past two days learning a bit about Python crawlers, I wanted a case to practice on, and unluckily for Douban, it got picked. It has only itself to blame for rendering its HTML on the server, which makes it all the more tempting to scrape.
Sorry, Douban
Get ready
The packages used are:

requests: sends the HTTP requests and receives the responses
re: regular expressions, used to extract the useful fields
csv: writes the scraped data to a CSV file, which makes later data processing easier
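Of the three, only requests is third-party; re and csv ship with the standard library. A quick sanity check that everything imports (assuming requests was installed with pip install requests):

```python
# re and csv are in the standard library; requests needs: pip install requests
import requests
import re
import csv

print(requests.__version__)  # confirms the third-party package is present
```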
Approach
Let's begin:
Note: what follows is the order of thinking, not the order in which the code was written; it starts from the big picture, and the complete, directly runnable code is at the end.
1. Reading the page information
1. First, get the URL of the page to crawl. Here it is Douban's movie Top 250: https://movie.douban.com/top250
2. Viewing Douban's page source shows that the page is HTML rendered on the server, so regular expressions are a workable choice (there are many other methods; regex is simply the first to try).
3. Chrome's built-in capture tool (the Network panel in the developer tools) shows that the request method is GET.
4. Because Douban gets crawled so often, it has anti-crawling measures in place 🤣. Fortunately, it only checks the UA. Grab the UA that an ordinary browser carries when visiting Douban and write it into the requests headers. It is in the same captured request, near the bottom: the UA, that is, the User-Agent.
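As a quick sketch of steps 3 and 4 (using the same UA as the complete code at the end; without a browser UA, Douban typically rejects the request):

```python
import requests

# UA copied from an ordinary browser request to Douban
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}
resp = requests.get('https://movie.douban.com/top250', headers=headers)
print(resp.status_code)  # expect 200 once the UA check passes
resp.close()
```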
5. Next, precompile the regular expression with re. On a large site like Douban, the front-end markup is very disciplined: for example, the div holding the movie information has a class that differs from the other page-building components, which makes regex easy to apply.
On this page, for instance, the movie names all sit in a span tag with a specific class, so that span plus its class is enough to pinpoint the movie name. With Python's re, the results can be stored in a dictionary: ?P<name> becomes the dictionary key, and the matched text from the source code becomes the value:
```
<span class="title">(?P<name>.*?)</span>
```
It is worth noting that "." in a regex matches any character except a newline, yet HTML source is full of newlines. Fortunately, re.compile accepts the re.S flag, which lets "." match newlines ("\n") as well.
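A minimal sketch of both ideas, using a made-up two-line HTML snippet rather than real Douban markup:

```python
import re

# The title is split across a newline, as real HTML often is
html = '<span class="title">\nThe Shawshank Redemption</span>'

# re.S lets '.' match the newline; without it, search() returns None here
obj = re.compile(r'<span class="title">(?P<name>.*?)</span>', re.S)
m = obj.search(html)
print(m.groupdict())  # {'name': '\nThe Shawshank Redemption'}
```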
6. Douban, like most websites, paginates its results, but the URLs of page 2 and page 3 only reveal a start parameter. I suspect there is also an end parameter, but I stuck with the fiddlier start parameter as a small test for myself, generating it with a for loop:
```python
# The page URL only carries a start offset, so step through 0, 25, ..., 225
for i in range(0, 250, 25):
```
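Passing the offset through params lets requests build the query string itself; a small sketch, reusing the headers from above:

```python
# requests appends params to the URL as ?start=...&filter=
resp = requests.get('https://movie.douban.com/top250',
                    headers=headers,
                    params={'start': 25, 'filter': ''})
print(resp.url)  # https://movie.douban.com/top250?start=25&filter=
resp.close()
```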
2. Writing the CSV file
1. There is not much to say here, but note three parameters. mode uses 'a+' (append) rather than 'w' (overwrite), because the page data arrives in a loop, one page at a time, rather than from a single request. On Windows, encoding='utf-8' must be set explicitly, otherwise the file defaults to GBK. And newline='' is needed because csv.writer already writes its own line endings; newline='' stops open() from adding another layer of newline translation on top of them.
```python
# Open the receiving csv file
f = open('data.csv', mode='a+', encoding='utf-8', newline='')
# Create the writer
csvwriter = csv.writer(f)
```
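To see what newline='' prevents, a self-contained sketch with hypothetical rows (not real scraped data):

```python
import csv

# csv.writer ends each row with '\r\n' itself; newline='' stops open() from
# translating that '\n' again on Windows, which would produce blank rows
with open('demo.csv', mode='w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'year', 'score', 'num'])
    writer.writerow(['The Shawshank Redemption', '1994', '9.7', '2500000'])
```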
2. Finally, be sure to close both the response and the file stream.
```python
# Close the connection to avoid being blacklisted
resp.close()
# Close the file stream
f.close()
```
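An alternative sketch: with-blocks close both automatically, even if an exception is raised mid-loop (reusing url, headers, and data from the snippets above):

```python
# Both the file and the response are closed when their with-blocks exit
with open('data.csv', mode='a+', encoding='utf-8', newline='') as f:
    with requests.get(url, headers=headers, params=data) as resp:
        page_content = resp.text
```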
Finally, here is the complete code, with plenty of comments:
```python
# requests fetches the page source code
import requests
# re further extracts the useful fields
import re
# csv handles the output file
import csv

# URL to crawl
url = 'https://movie.douban.com/top250'

# Set the UA of the request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
}

# Precompiled regex; re.S lets '.' cross the newlines in the HTML.
# '人评价' ("people rated") is the literal text in Douban's markup, and the
# year in the source is followed by '&nbsp;'.
obj = re.compile(
    r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>'
    r'.*?<p class="">.*?<br>(?P<year>.*?)&nbsp;'
    r'.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>'
    r'.*?<span>(?P<num>.*?)人评价</span>', re.S)

# Open the receiving csv file
f = open('data.csv', mode='a+', encoding='utf-8', newline='')
# Create the writer
csvwriter = csv.writer(f)

# The page URL only carries a start offset, so step through 0, 25, ..., 225
for i in range(0, 250, 25):
    # Request parameters, i.e. start=25&filter=
    data = {
        'start': i,
        'filter': ''
    }
    # Get the page
    resp = requests.get(url, headers=headers, params=data)
    # Convert to a string
    page_content = resp.text
    # Close the connection to avoid being blacklisted
    resp.close()
    # Collect the regex matches
    ret = obj.finditer(page_content)
    # Loop over the results ('item' avoids shadowing the outer loop's i)
    for item in ret:
        dic = item.groupdict()
        dic['year'] = dic['year'].strip()
        csvwriter.writerow(dic.values())

# Close the file stream
f.close()
```
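A quick way to sanity-check the result is to read the file back; a sketch:

```python
import csv

# Print the first five scraped rows from data.csv
with open('data.csv', encoding='utf-8') as f:
    for row in list(csv.reader(f))[:5]:
        print(row)
```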
Drunk, I forget the sky lies in the water; a boat full of clear dreams presses on the River of Stars.