L3 & L4 first crawler project

1 basic steps of a crawler

Send a request to the server - parse the web page source code - extract data - save data

  1. Send a request to the web page to obtain the web page source code;
  2. Import the new module and analyze the web page source code;
  3. View the data nodes in the web page;
  4. Analyze the content and extract the data of the node;
  5. Learn the method of finding nodes and extract the contents of nodes.

2 get the web page source code

To get the data in a web page, we first obtain the page's HTML code and then extract the data from it.
We send a request to the web page's server, and the response returned by the server contains the HTML code of the page.

# TODO: use import to import the requests module
import requests

# TODO: assign the URL address to the variable url
url = "https://www.baidu.com"

# TODO: pass the variable url into requests.get() and assign the result to response
response = requests.get(url)

# TODO: use print to output response
print(response)
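
Note that printing response only shows the Response object itself (something like <Response [200]>, where 200 means the request succeeded). To see the HTML source code, read the response's text attribute, as in this short sketch that reuses the response from the code above:

# The status_code attribute holds the HTTP status code of the response
print(response.status_code)

# The text attribute holds the HTML source code as a string
print(response.text)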

3 parsing the web page source code

3-1 node

A web page node can define an id, class, or other attributes, and there is a hierarchical relationship between nodes.
We can extract the desired information with the help of the structure and attributes of web page nodes.

Every part of a web page can be called a node. For example, HTML tags (such as the <h1> tag and the <p> tag), attributes, and text are all nodes.
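
For example, in this made-up HTML fragment, the div node is the parent of the h1 and p nodes, class="title" is an attribute of the div node, and the two pieces of visible text are text nodes:

<div class="title">
    <h1>First crawler</h1>
    <p>This is a paragraph</p>
</div>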

3-2 parsing tool -- Beautiful Soup

Beautiful Soup is a Python module for parsing HTML or XML, which can be used to extract the desired data from web pages.
BeautifulSoup is not a built-in module, so it should be installed in the terminal with pip install bs4 before use.
If the installation fails on your own computer or is slow, you can add -i https://mirrors.aliyun.com/pypi/simple/ after the command to speed it up.
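
For reference, the install commands look like this (run them in the terminal, not in Python):

pip install bs4
pip install bs4 -i https://mirrors.aliyun.com/pypi/simple/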

3-3 lxml parser

The ultimate purpose of a web crawler is to filter and select network information, so the parser can be said to be its most important part; the quality of the parser determines the speed and efficiency of the crawler.
Beautiful Soup officially recommends the lxml parser because of its higher efficiency, so we will also use the lxml parser.
lxml is not a built-in module, so it should be installed in the terminal with pip install lxml before use.
If the installation fails on your own computer or is slow, you can likewise add -i https://mirrors.aliyun.com/pypi/simple/ after the command to speed it up.

3-4 creating a BeautifulSoup object

soup = BeautifulSoup(html, "lxml")

The first parameter is the HTML text that needs to be parsed; here we pass in the variable html.
The second parameter is the type of parser; here we use lxml.

# Import requests module using import
import requests

# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Assign the URL address to the variable url
url = "https://www.baidu.com"

# Pass the variable url into requests.get() and assign the result to response
response = requests.get(url)

# Convert the server response content into string form and assign it to html
html = response.text

# TODO: use BeautifulSoup() to read html with the lxml parser and assign the result to soup
soup = BeautifulSoup(html, "lxml")

# TODO: use print to output soup
print(soup)

4 analyze the content and extract the data of the node

4-1 locate the node where the content is located and view the location of the extracted content in the source code

The text we want is contained in nodes such as XXX, which share the same tag.
We can use the find_all() function in Beautiful Soup to obtain all nodes that meet the specified conditions.

4-2 using the find_all() function to find nodes in the code

ps = soup.find_all(name="h1")

find_all(name="tag") queries nodes by tag name.
If you want to get the nodes where the h1 tag is located, you can pass in the name parameter with the value "h1".
The name keyword can also be omitted and the value passed in directly.
The output is a list of all h1 nodes.
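
As a minimal, self-contained sketch (the HTML string here is made up just for illustration):

from bs4 import BeautifulSoup

# A tiny made-up page with two h1 tags
html = "<h1>Title one</h1><h1>Title two</h1>"
soup = BeautifulSoup(html, "lxml")

# Both calls return the same list of h1 nodes
print(soup.find_all(name="h1"))
print(soup.find_all("h1"))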

# Import requests module using import
import requests

# Import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# Assign the URL address to the variable url
url = "https://www.baidu.com"

# Pass the variable url into requests.get() and assign the result to response
response = requests.get(url)

# Convert the server response content into string form and assign it to html
html = response.text

# Use BeautifulSoup() to read html, add an lxml parser, and assign a value to soup
soup = BeautifulSoup(html, "lxml")

# TODO: use find_all() to query the head node in soup and assign the result to content_all
content_all = soup.find_all(name="head")

# TODO: use print to output content_all
print(content_all)

4-3 get tag content

# Import requests module using import
import requests

# Import the BeautifulSoup module from bs4
from bs4 import BeautifulSoup

# Assign the URL address to the variable url
url = "https://www.icourse163.org/"

# Pass the variable url into requests.get() and assign the result to response
response = requests.get(url)

# Convert the server response content into string form and assign it to html
html = response.text

# Use BeautifulSoup() to read html, add an lxml parser, and assign a value to soup
soup = BeautifulSoup(html, "lxml")

# Use find_all() to query the div nodes in soup and assign them to content_all
content_all = soup.find_all(name="div")

# TODO: for loop to traverse content_all
for content in content_all:

    # TODO: get the content in the tag of each node and assign it to contentString
    contentString = content.string

    # TODO: use print to output contentString when it is not None
    if contentString is not None:
        print(contentString)

5 learn the method of finding nodes and extract the contents of nodes

5-1 the .string attribute

The .string attribute can only extract the content of a single node, or of a node whose content is uniform.
Extracting the content of a single node:
For example

<p><em>This is a paragraph</em></p>

The p node contains one child node, em. Because the p node has one and only one child node, when the .string attribute is used, the string content of the em node will be output.
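
A minimal sketch of this single-child case, using the fragment above:

from bs4 import BeautifulSoup

html = "<p><em>This is a paragraph</em></p>"
soup = BeautifulSoup(html, "lxml")

# p has exactly one child node (em), so .string returns the em node's text
p = soup.find_all(name="p")[0]
print(p.string)   # This is a paragraph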

When the extracted node contains multiple child nodes:
When the located node contains multiple child nodes at the same time, for example one of them is the em tag content and the others are plain text, the .string attribute cannot tell which child node it should return, so it returns the value None.

# Define html
html = """
<p>
    <em>Internet worm</em>
    This is a paragraph
</p>"""

# Import the BeautifulSoup module from bs4
from bs4 import BeautifulSoup

# Use BeautifulSoup() to read html, add an lxml parser, and assign a value to soup
soup = BeautifulSoup(html, "lxml")

# Use find_all() to query the p node in soup and assign it to content_all
content_all = soup.find_all(name="p")

# for loop to traverse content_all
for content in content_all:

    # TODO: get the content in the tag of each node and assign it to contentString
    contentString = content.string

    # Use print to output contentString (here it is None, because the p node has several children)
    print(contentString)

5-2 the .text attribute

When a node contains both other nodes and text, you can use the .text attribute to extract the content.
The .text attribute directly extracts all the text in the node and returns it as a string.

# Define html
html = """
<p>
    <em>Internet worm</em>
    This is a paragraph
</p>"""

# Import the BeautifulSoup module from bs4
from bs4 import BeautifulSoup

# Use BeautifulSoup() to read html, add an lxml parser, and assign a value to soup
soup = BeautifulSoup(html, "lxml")

# Use find_all() to query the p node in soup and assign it to content_all
content_all = soup.find_all(name="p")

# for loop to traverse content_all
for content in content_all:

    # TODO: get all the text in the node and assign it to contentString
    contentString = content.text
    
    # Use print to output contentString
    print(contentString)
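
To see the difference between the two attributes side by side, here is a small sketch that uses the same HTML and prints both:

from bs4 import BeautifulSoup

html = """
<p>
    <em>Internet worm</em>
    This is a paragraph
</p>"""

soup = BeautifulSoup(html, "lxml")
p = soup.find_all(name="p")[0]

# .string is None because the p node contains several child nodes
print(p.string)

# .text returns all the text inside p as one string
print(p.text)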

6 practice

The crawler requests the web pages and obtains all the user nicknames on the 5 pages.
How do we turn pages? The five page URLs are mostly identical; only the page number increases by 1 each time.
First page: https://www.qiushibaike.com/text/page/1/
Page 5: https://www.qiushibaike.com/text/page/5/
Sort out the following ideas:

  1. Import the requests and bs4 modules;
  2. Use a for loop and a formatted string to obtain the URLs of the 5 pages;
    For example: url = f"https://www.qiushibaike.com/text/page/{num}/"
  3. Pass the url parameter to requests.get() to get the HTML code of the web page;
  4. Create a BeautifulSoup object and use the find_all() function to obtain the nodes;
  5. Call the .string attribute to get the content in the tag of each node.

# TODO: use import to import the requests module
import requests

# TODO: import BeautifulSoup from bs4
from bs4 import BeautifulSoup

# TODO: use a for loop to traverse the numbers 1-5 generated by the range() function
for num in range(1, 6):

    # TODO: use a formatted string to generate the page link and assign it to the variable url
    url = f"https://www.qiushibaike.com/text/page/{num}/"

    # TODO: pass the variable url into requests.get() and assign the result to response
    response = requests.get(url)

    # TODO: convert the server response content into string form and assign it to html
    html = response.text

    # TODO: use BeautifulSoup() to read html with the lxml parser and assign the result to soup
    soup = BeautifulSoup(html, "lxml")

    # TODO: use find_all() to query the h2 nodes in soup and assign the result to name_all
    name_all = soup.find_all(name="h2")

    # TODO: for loop to traverse name_all
    for name in name_all:

        # TODO: get the content in the tag of each node and assign it to name
        name = name.string

        # TODO: use print to output name
        print(name)

Keywords: Python crawler

Added by bryanzera on Tue, 01 Feb 2022 22:02:18 +0200