1. Analyze web pages
When we want to crawl a web page, the first thing to do is analyze its structure and look for a pattern we can exploit, as follows:
Generate the links: the page links follow a regular pattern, so we can write a for loop to generate them. The start parameter increases by 25 per page. The program is as follows:
for page in range(0, 226, 25):
    url = "https://movie.douban.com/top250?start=%s&filter=" % page
    print(url)
The results are as follows:
https://movie.douban.com/top250?start=0&filter=
https://movie.douban.com/top250?start=25&filter=
...
https://movie.douban.com/top250?start=225&filter=
2. Request server
Before crawling the web page, we need to send a request to the server.
2.1 import package
If the requests package is not installed, install it first. The steps are: 1. press Win + R to open Run; 2. type cmd and press Enter; 3. run the command pip install requests.
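To confirm the installation worked, a quick check like this can be run (a minimal sketch; the printed version will differ by machine):

import requests  # raises ImportError if the package is not installed

print(requests.__version__)  # any version string, e.g. '2.23.0', means it is available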
2.2 setting up the browser User-Agent
The code to set the User-Agent, so that the request looks like it comes from a browser, is as follows:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}
2.3 format of a request to the server
requests.get() sends the request to the server and returns a response object. If .text is appended, the text content of the page is returned, as follows:
requests.get(url = test_url, headers = headers)
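For reference, a minimal sketch of a complete request that checks the status code before reading the text (assuming test_url holds one of the links generated in section 1 and headers is the dictionary defined above):

import requests

test_url = 'https://movie.douban.com/top250?start=0&filter='
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}

response = requests.get(url=test_url, headers=headers)
print(response.status_code)  # 200 means the request succeeded

html = response.text  # the page source as a string
print(html[:100])     # first 100 characters, as a sanity check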
2.4 request server code summary
import requests  # pip install requests -> Win + R, run -> cmd -> pip install requests

test_url = 'https://movie.douban.com/top250?start=0&filter='  # format the URL as a string

# Set up the browser User-Agent, which is a dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Request the source code: send the request to the server (status 200 means success)
response = requests.get(url=test_url, headers=headers).text  # run with Ctrl+Enter
3. Extract information with xpath
3.1 method of getting the xpath of a node
In Chrome, open the developer tools (F12), find the element in the Elements panel, right-click it, and choose Copy > Copy XPath; the copied path is what the following examples use.
3.2 xpath extracts content
from lxml import etree  # import the parsing library

html_etree = etree.HTML(response)  # think of it as a sieve: the page parsed into a tree
3.2.1 extract text
When we extract the text inside a tag, we need to append /text() to the copied xpath.
For example, for Farewell My Concubine:
<span class="title">Farewell to my concubine</span>
xpath:
//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]
Extract text:
name = html_etree.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')
print("This is in array form:", name)
print("This is in string form:", name[0])
3.2.2 extract links
When we extract a link, we need to append /@href to the copied xpath to specify that the link should be extracted:
movie_url = html_etree.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/@href')
print("This is in array form:", movie_url)
print("This is in string form:", movie_url[0])
The results are as follows:
3.2.3 extract label elements
Extracting a tag's attribute works the same way as extracting a link, except that /@class is appended:
rating = html_etree.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[2]/div/span[1]/@class')
print("This is in array form:", rating)
print("This is in string form:", rating[0])
The results are as follows:
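The class value extracted this way is a name such as 'rating45-t' (the format the summary code in section 7 relies on). Turning it into a star count uses a regular expression, which the next section introduces; a minimal sketch, assuming that class format:

import re

rating = "rating45-t"  # class value from the rating span; format assumed from section 7
num = re.findall('rating(.*?)-t', rating)[0]  # -> '45'

# Two digits such as '45' mean 4.5 stars; anything else is kept as-is
star = int(num) / 10 if len(num) == 2 else num
print(star)  # 4.5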
4. Regular expressions
4.1 extract information at a fixed position
In regular expressions, we use (.*?) to capture the information we want. Before using regular expressions, we usually import the re package first. For example:
import re

test = "I am js"
text = re.findall("I (.*?) js", test)  # (.*?) captures the text between the fixed parts
print(text)
The results are as follows:
['am']
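The question mark is what makes (.*?) non-greedy: it stops at the first possible match instead of the longest one. A minimal sketch of the difference, using a made-up string:

import re

s = "<b>one</b><b>two</b>"
print(re.findall("<b>(.*)</b>", s))   # greedy: ['one</b><b>two']
print(re.findall("<b>(.*?)</b>", s))  # non-greedy: ['one', 'two']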
4.2 match numbers
For example, if we want to match how many people rated a movie, we can write:
import re

data = "1059232 Human evaluation"
num = re.sub(r'\D', "", data)  # \D matches any non-digit, so this deletes everything except the digits
print("The number here is:", num)
The results are as follows:
The number here is: 1059232
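An equivalent approach (a small sketch, not the tutorial's own method) is to find the run of digits directly instead of deleting everything that is not a digit:

import re

data = "1059232 Human evaluation"
num = re.findall(r'\d+', data)[0]  # take the first run of digits
print("The number here is:", num)  # 1059232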
5. Extract all information from a page
For example, here we extract every movie name on the last page, as follows:
li = html_etree.xpath('//*[@id="content"]/div/div[1]/ol/li')  # one <li> per movie
for item in li:
    name = item.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]  # relative xpath inside the <li>
    print(name)
The results are as follows:
In this way, we can pull down everything we need.
6. Write the content to a csv file
The code is as follows:
import csv

# Create the file and open it in append mode
fp = open("./Douban top250.csv", 'a', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)

# Write the header row
writer.writerow(('ranking', 'Name', 'link', 'Star class', 'score', 'Number of people assessed'))

# Close the file
fp.close()
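The utf-8-sig encoding writes a byte order mark so that Excel recognizes the file as UTF-8 and displays non-ASCII titles correctly. Each movie is then written as one more row before fp.close() is called; a sketch with hypothetical values (the real ones come from the xpath extraction in section 5):

# Hypothetical values for illustration; the summary code in section 7 fills these in
writer.writerow(('1', 'Farewell My Concubine', 'https://movie.douban.com/subject/.../', 4.5, '9.6', '1059232'))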
7. Summary of all the code
import requests, csv, re
from lxml import etree

# Set up the browser User-Agent, which is a dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'
}

# Create the csv file and open it in append mode
fp = open("./Douban top250.csv", 'a', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)

# Write the header row
writer.writerow(('ranking', 'Name', 'link', 'Star class', 'score', 'Number of people assessed'))

for page in range(0, 226, 25):
    print("Getting page %s" % page)
    url = 'https://movie.douban.com/top250?start=%s&filter=' % page

    # Request the source code: send the request to the server (status 200 means success)
    response = requests.get(url=url, headers=headers).text
    html_etree = etree.HTML(response)  # think of it as a sieve: the page parsed into a tree

    # Filter out the <li> node of each movie
    li = html_etree.xpath('//*[@id="content"]/div/div[1]/ol/li')
    for item in li:
        # ranking
        rank = item.xpath('./div/div[1]/em/text()')[0]
        # movie title
        name = item.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
        # link
        dy_url = item.xpath('./div/div[2]/div[1]/a/@href')[0]
        # star class: a class name such as 'rating45-t' becomes 4.5
        rating = item.xpath('./div/div[2]/div[2]/div/span[1]/@class')[0]
        rating = re.findall('rating(.*?)-t', rating)[0]
        if len(rating) == 2:
            star = int(rating) / 10  # int() converts the string to a number
        else:
            star = rating
        # score
        rating_num = item.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
        # number of people assessed: keep the digits only
        content = item.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0]
        content = re.sub(r'\D', "", content)
        # print(rank, name, dy_url, star, rating_num, content)

        # Write one row per movie
        writer.writerow((rank, name, dy_url, star, rating_num, content))

fp.close()
The results are as follows:
Results in csv file:
With that, the crawl is complete.