Stage 12 - Crawler 02: requests; data extraction (regular expressions, Beautiful Soup, XPath)


1. URLError

  • Possible causes of URLError:

    • The network is down, i.e. the local machine cannot reach the Internet.

    • A specific server cannot be reached.

    • The server does not exist.

  • In code, we wrap the request in a try-except statement to catch the exception:

from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from urllib.error import URLError

# url = 'http://www.sxt.cn/index/login/login12353wfeds.html'  # server reachable, resource not available
url = 'http://www.sxt12412412.cn/index/login/login12353wfeds.html'  # server does not exist
headers = {'User-Agent': UserAgent().chrome}
try:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    info = resp.read().decode()
    print(info)
except URLError as e:
    if len(e.args) != 0:
        # a plain URLError: the address could not be resolved or reached
        print('Address acquisition error!')
    else:
        # an HTTPError: the server responded with an error status code
        print(e.code)
print('Crawling finished')

  • Debugging tip:

  • When we use the urlopen method to access a non-existent web site, the result looks like this:
[Errno 11004] getaddrinfo failed
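
  • An alternative sketch that separates the two failure cases explicitly, by catching HTTPError (which carries a status code) before the more general URLError (which carries a reason); the URL is the same placeholder used above:

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
from fake_useragent import UserAgent

url = 'http://www.sxt12412412.cn/index/login/login12353wfeds.html'  # non-existent server
headers = {'User-Agent': UserAgent().chrome}
try:
    resp = urlopen(Request(url, headers=headers))
    print(resp.read().decode())
except HTTPError as e:
    # The server exists but answered with an error status code (e.g. 404)
    print('HTTP error, status code:', e.code)
except URLError as e:
    # The server could not be reached at all (DNS failure, no network, ...)
    print('Address acquisition error:', e.reason)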

2. Using the requests library

2.1. Basic Introduction

  1. Introduction:

    We have already covered some basic crawler concepts and the overall crawling workflow. Now we need more advanced tools to make crawling easier, so this section gives a brief introduction to the basic usage of the requests library.

  2. install

    Install with pip:

pip install requests
  3. Basic requests (each HTTP verb has a corresponding function):
import requests

req = requests.get("http://www.baidu.com")
req = requests.post("http://www.baidu.com")
req = requests.put("http://www.baidu.com")
req = requests.delete("http://www.baidu.com")
req = requests.head("http://www.baidu.com")
req = requests.options("http://www.baidu.com")

2.2. get request

  • Query-string parameters are passed as a dictionary via the params argument:
  1. Using get, example 01:
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
resp.encoding='utf-8'
print(resp.text)

  2. Using get, example 02:
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com/s?'
params = {
    'wd': 'Black Horse Programmer'
}
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers, params=params)
resp.encoding = 'utf-8'
print(resp.text)

2.3. post request

  1. Form parameters are passed as a dictionary via the data argument (a JSON body can also be sent via the json argument):
  2. Code Example 01:
import requests
from fake_useragent import UserAgent

url = 'http://www.sxt.cn/index/login/login.html'
args = {
    'user': '17703181473',
    'password': '123456'
}
headers={'User-Agent':UserAgent().chrome}
resp = requests.post(url,headers=headers,data=args)
print(resp.text)

  3. Code example 02:
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'https://www.kuaidaili.com/login/'

headers = {'User-Agent': UserAgent().chrome}
data = {
    'username': '398707160@qq.com',
    'passwd': '123456abc'
}

resp = requests.post(login_url, headers=headers, data=data)
print(resp.text)


2.4. Custom request header

  • Disguised request headers are often used when scraping; we can hide the client's identity this way:
headers = {'User-Agent': 'python'}
r = requests.get('http://www.zhidaow.com', headers = headers)
print(r.request.headers['User-Agent'])

2.5. Setting timeout time

  • A timeout can be set with the timeout parameter; if no response arrives within that time, an exception is raised (see the sketch below).
requests.get('http://github.com', timeout=0.001)
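
  • If the timeout is exceeded, requests raises requests.exceptions.Timeout, which can be caught; a minimal sketch:

import requests

try:
    # 0.001 s is deliberately far too short, so this will almost always time out
    requests.get('http://github.com', timeout=0.001)
except requests.exceptions.Timeout as e:
    print('Request timed out:', e)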

2.6. Proxy access

  1. To avoid getting our IP blocked, proxies are often used. requests supports this through the proxies parameter:
import requests

proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "https://10.10.1.10:1080",
}

requests.get("http://www.zhidaow.com", proxies=proxies)
  2. If the proxy requires a username and password, write it like this:
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
  3. Code example:
import requests
from fake_useragent import UserAgent

url = 'http://httpbin.org/get'
headers = {'User-Agent': UserAgent().chrome}

# proxy = {
#     'type': 'type://ip:port',
#     'type': 'type://username:password@ip:port'
# }
proxy = {
    'http':'http://117.191.11.102:8080'
    #'http': 'http://398707160:j8inhg2g@58.87.79.136:16817'

}

resp = requests.get(url, headers=headers, proxies=proxy)
print(resp.text)

2.7. session automatically saves cookies

  1. A Session is used to maintain a session: for example, it lets you keep operating (with your identity information recorded) after logging in, whereas a plain requests call is a single request and no identity information is kept.
# Create a session object 
s = requests.Session()
# Set cookies by issuing get requests with session objects 
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
  2. Code example:
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'http://www.sxt.cn/index/login/login'

# Personal information
info_url = 'http://www.sxt.cn/index/user.html'
headers = {'User-Agent': UserAgent().chrome}
data = {
    'user': '17703181473',
    'password': '123456'
}
# Create a session object; cookies will be saved in the session
session = requests.Session()
resp = session.post(login_url, headers=headers, data=data)
# Get the response content (in strings)
print(resp.text)

info_resp = session.get(info_url, headers=headers)

print(info_resp.text)

2.8. ssl verification

# Disable Security Request Warning
requests.packages.urllib3.disable_warnings()

resp = requests.get(url, verify=False, headers=headers)
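
  • By default requests verifies TLS certificates and raises an SSLError when verification fails; passing verify=False skips the check. A self-contained sketch (the target URL is only an assumed example of a site whose certificate cannot be verified):

import requests
from fake_useragent import UserAgent

url = 'https://self-signed.badssl.com/'  # assumed example of a site with an untrusted certificate
headers = {'User-Agent': UserAgent().chrome}

# Disable the InsecureRequestWarning that verify=False would otherwise print
requests.packages.urllib3.disable_warnings()

resp = requests.get(url, verify=False, headers=headers)
print(resp.status_code)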

2.9. Response information

Code Meaning
resp.json() Get the response content (as parsed JSON)
resp.text Get the response content (as a string)
resp.content Get the response content (as bytes)
resp.headers Get the response headers
resp.url Get the requested URL
resp.encoding Get the page encoding
resp.request.headers Get the request headers
resp.cookies Get the cookies
resp.status_code Get the response status code
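
  • A small sketch exercising a few of these attributes (httpbin.org is just a convenient test service; any URL would do):

import requests
from fake_useragent import UserAgent

resp = requests.get('http://httpbin.org/get', headers={'User-Agent': UserAgent().chrome})
print(resp.status_code)       # response status code
print(resp.encoding)          # detected encoding
print(resp.headers)           # response headers
print(resp.cookies)           # cookies returned by the server
print(resp.json())            # body parsed as JSON (httpbin returns JSON)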

3. Data extraction

3.1. Regular expressions: re (the most hard-core, and the fastest)

1. Extracting data

  1. We have already worked out how to fetch page content, but we are still one step away: how do we extract what we want from all that messy markup and text? Let's start with a very powerful tool: regular expressions!

    A regular expression is a logical formula for operating on strings: pre-defined characters, and combinations of them, form a "pattern string" that expresses a filtering logic to apply to strings.

    Regular expressions are a powerful tool for matching strings. Other programming languages have them too, and Python is no exception. With regular expressions, it is easy to extract what we want from the returned page content.

  2. Rules:

Pattern Description
^ Matches the beginning of the string
$ Matches the end of the string
. Matches any character except a newline. When the re.DOTALL flag is specified, it matches any character including newlines.
[...] Represents a set of characters, listed individually: [amk] matches 'a', 'm' or 'k'
[^...] Matches characters not in the brackets: [^abc] matches any character other than a, b or c
re* Matches 0 or more occurrences of the preceding expression
re+ Matches 1 or more occurrences of the preceding expression
re? Matches 0 or 1 occurrence of the preceding expression (non-greedy)
re{n} Matches exactly n occurrences of the preceding expression
re{n,} Matches n or more occurrences of the preceding expression
re{n,m} Matches n to m occurrences of the preceding expression (greedy)
a|b Matches either a or b
(re) Matches the expression inside the parentheses and also denotes a group
(?imx) Turns on the i, m or x optional flags; affects only the area inside the parentheses
(?-imx) Turns off the i, m or x optional flags; affects only the area inside the parentheses
(?:re) Like (...), but does not denote a group
(?imx:re) Uses the i, m or x optional flags inside the parentheses
(?-imx:re) Does not use the i, m or x optional flags inside the parentheses
(?#...) Comment
(?=re) Positive lookahead assertion: succeeds if the contained expression matches at the current position, without consuming any of the string; the rest of the pattern is then tried to the right of the assertion
(?!re) Negative lookahead assertion: the opposite of the positive assertion; succeeds when the contained expression does not match at the current position
(?>re) Matches an independent pattern, eliminating backtracking
\w Matches letters, digits and underscores
\W Matches anything other than letters, digits and underscores
\s Matches any whitespace character, equivalent to [ \t\n\r\f]
\S Matches any non-whitespace character
\d Matches any digit, equivalent to [0-9]
\D Matches any non-digit
\A Matches the start of the string
\Z Matches the end of the string; if there is a trailing newline, it matches just before it
\z Matches the end of the string
\G Matches the position where the last match finished
\b Matches a word boundary, i.e. the position between a word and a space. For example, er\b matches 'er' in 'never' but not 'er' in 'verb'.
\B Matches a non-word boundary. er\B matches 'er' in 'verb' but not 'er' in 'never'.
\n, \t, etc. Matches a newline character, a tab character, etc.
\1...\9 Matches the content of the nth group
\10 Matches the content of the nth group if it has been matched; otherwise refers to an octal character code
[\u4e00-\u9fa5] Matches Chinese characters
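
  • A brief sketch of a few of the less obvious patterns above (group back-references, lookahead, word boundaries); the strings are made up for illustration:

import re

# Group back-reference \1: find words that are immediately repeated
print(re.findall(r'\b(\w+) \1\b', 'is is a test test case'))    # ['is', 'test']
# Positive lookahead (?=...): a number only if it is followed by 'px'
print(re.findall(r'\d+(?=px)', 'width: 20px, height: 10em'))    # ['20']
# Word boundary \b: 'er' only at the end of a word
print(re.findall(r'er\b', 'never verbose'))                     # ['er']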

2. Relevant Annotations of Regular Expressions

  1. The greedy and non-greedy modes of quantifiers:

    Regular expressions are often used to find matching strings in text.
    Quantifiers in Python are greedy by default (in a few other languages they may be non-greedy by default), always trying to match as many characters as possible; non-greedy quantifiers, on the contrary, always try to match as few characters as possible.

    • For example, the regular expression ab* applied to 'abbbc' finds 'abbb'. With the non-greedy quantifier ab*?, only 'a' is found (see the short demo after this list).
  2. Common methods:

    1. re.match

      • re.match tries to match a pattern at the beginning of the string; if the match is not successful, it returns None.
      • Syntax:
        re.match(pattern, string, flags=0)
    2. re.search

      • re.search scans the entire string and returns the first successful match.
      • Syntax:
        re.search(pattern, string, flags=0)
    3. re.sub

      • sub: substitutes every match in a string with a replacement

      • Syntax:

        re.sub(pattern, repl, string)

    4. re.findall

      • findall: finds all matches and returns them as a list

      • Syntax:

        re.findall(pattern, string, flags=0)

  3. Regular expression modifier-optional flag:

    Regular expressions can contain optional modifiers that control how matching works. Modifiers are specified as optional flags, and multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags (a short demo follows this table):

Modifier Description
re.I Makes matching case-insensitive
re.L Does locale-aware matching
re.M Multi-line matching, affecting ^ and $
re.S Makes . match any character, including newlines
re.U Interprets characters according to the Unicode character set; this flag affects \w, \W, \b, \B
re.X Verbose mode: allows a more flexible format so the regular expression is easier to read
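  • A short demo of the greedy/non-greedy distinction and of the re.I and re.S flags (the strings are made up for illustration):

import re

# Greedy vs non-greedy: * grabs as much as possible, *? as little as possible
print(re.match(r'ab*', 'abbbc').group())     # abbb
print(re.match(r'ab*?', 'abbbc').group())    # a

# Flags: re.I ignores case, re.S lets . match newlines as well
print(re.findall(r'python\d\.\d', 'Python3.6 and PYTHON2.7', re.I))   # ['Python3.6', 'PYTHON2.7']
print(re.findall(r'and.+7', 'and\nPYTHON2.7'))                        # [] - . does not cross the newline
print(re.findall(r'and.+7', 'and\nPYTHON2.7', re.S))                  # ['and\nPYTHON2.7']
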
  • Code example:
import re

str1 = 'I study Python3.6 everday!'

############ match ############
print('-' * 30, 'match()', '-' * 30)
# match() is anchored at the start of the string and matches left to right, piece by piece; if it fails it returns None

# m1 = re.match(r'I', str1)
# m1 = re.match(r'[I]', str1)
# m1 = re.match(r'\bI', str1)
# m1 = re.match(r'\w', str1)
# m1 = re.match(r'\S', str1) 
# m1 = re.match(r'(I)', str1)
# m1 = re.match(r'.', str1)
# m1 = re.match(r'\D', str1)
m1 = re.match(r'\w\s(study)', str1)
print(m1.group(1))

############ search ############
print('-' * 30, 'search()', '-' * 30)
# Scans the whole string from left to right and returns the first match found
s1 = re.search(r'study', str1)
s1 = re.search(r'y', str1)
print(s1.group())

############ findall ############
print('-' * 30, 'findall()', '-' * 30)
f1 = re.findall(r'y', str1)
f1 = re.findall(r'Python3.6', str1)
f1 = re.findall(r'P\w*.\d', str1)
print(f1)

############ sub ############
print('-' * 30, 'sub()', '-' * 30)
su1 = re.sub(r'everday', 'Everday', str1)
su1 = re.sub(r'ev.+', 'Everday', str1)
print(su1)

print('-' * 30, 'test()', '-' * 30)
str2 = '<span><a href="http://Www.bjstx.com "> Silicon Valley sxt</a> </span>'

# t1 = re.findall(r'[\u4e00-\u9fa5]+', str2)
# t1 = re.findall(r'>([\u4e00-\u9fa5]+)<', str2)
# t1 = re.findall(r'>(\S+?)<', str2)
t1 = re.findall(r'<a href=".*">(.+)</a>', str2)
t1 = re.findall(r'<a href="(.*)">.+</a>', str2)
print(t1)
t2 = re.sub(r'span', 'div', str2)
t2 = re.sub(r'<span>(.+)</span>', r'<div>\1</div>', str2)
print(t2)

  4. Exercise: crawl the first three pages of the Qiushibaike text section, keeping only the content of each joke.
import requests
from fake_useragent import UserAgent
import re

with open('duanzi.txt', 'w', encoding='utf-8') as f:
    for i in range(1, 4):
        url = 'https://www.qiushibaike.com/text/page/{}/'.format(i)
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        html = resp.text
        infos = re.findall(r'<div class="content">\s<span>\s+(.+)', html)
        for info in infos:
            f.write('-' * 30 + '\n')
            f.write(info.replace(r'<br/>','\n'))
            f.write('\n' + '-' * 30 + '\n')

3.2. Beautiful Soup

1. Introduction, Installation and Four Categories

  1. Beautiful Soup provides simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that extracts data for users by parsing documents. Because it is simple, a complete application does not take much code.

    Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings, unless the document does not specify one and Beautiful Soup cannot detect it automatically; in that case you just need to state the original encoding.

    Together with parsers such as lxml and html5lib, Beautiful Soup gives users flexible parsing strategies and good speed.

  2. Installation: Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It has been ported to the bs4 package, which means we need to import bs4 when importing.

    pip install beautifulsoup4
    pip install lxml
    

    Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, Python's built-in default parser is used. The lxml parser is more powerful and faster, and installing it is recommended.

Parser Usage Advantages Disadvantages
Python standard library BeautifulSoup(markup, "html.parser") Python's built-in standard library; moderate speed; reasonable fault tolerance Poor fault tolerance in Python versions before 2.7.3 / 3.2.2
lxml HTML parser BeautifulSoup(markup, "lxml") Fast; good fault tolerance Requires the C library to be installed
lxml XML parser BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml") Fast; the only parser with XML support Requires the C library to be installed
html5lib BeautifulSoup(markup, "html5lib") Best fault tolerance; parses documents the way a browser does; generates HTML5-format documents Slow; does not rely on external extensions
  1. Create Beautiful Soup Objects

    from bs4 import BeautifulSoup
    bs = BeautifulSoup(html,"lxml")
    
  2. Four categories of objects:

    Beautiful Soup converts complex HTML documents into a complex tree structure. Each node is a Python object. All objects can be summarized into four types:

    • Tag

    • NavigableString

    • Beautiful Soup

    • Comment (not often used)

2. Tag

  • Informally, a Tag is simply an HTML tag, for example <div> or <title>.

  • How to use it:

# Take the following code as an example
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
  1. Get tags:
# Parse with lxml
soup = BeautifulSoup(info, 'lxml')
print(soup.title)
# <title id="title">Ada Tam</title>

  • Note: accessing a tag this way only returns the first tag that matches.
  2. Get attributes:
# Get all attributes
print(soup.title.attrs)
# {'id': 'title'}

# Get the value of a single attribute
print(soup.div.get('class'))   # ['info']
print(soup.div['class'])       # ['info']
print(soup.a['href'])          # www.bjsxt.cn

3. NavigableString: getting content

print(soup.title.string)
print(soup.title.text)
#Ada Tam

4. BeautifulSoup

  • A BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object: it supports most of the tree-traversal and tree-search methods described in the documentation.

    Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes. But it is sometimes convenient to look at its .name attribute, so the BeautifulSoup object has a special .name attribute whose value is "[document]".

print(soup.name)
print(soup.head.name)
# [document]
# head

5. Comment

  • The Comment object is a special type of NavigableString. Its output does not include the comment markers, so if it is not handled specially it may cause unexpected trouble in our text processing.
if type(soup.strong.string) == Comment:
    print(soup.strong.prettify())
else:
    print(soup.strong.string)

6. Searching Document Tree

  • Beautiful Soup defines a number of search methods; here we focus on two: find() and find_all(). The parameters and usage of the other methods are similar and can be worked out by analogy.

  • Filters:

    Before introducing the find_all() method, let's first look at the types of filters that run through the entire search API. Filters can be applied to tag names, node attributes, strings, or a combination of these.

  1. Character string

The simplest filter is a string. When a string is passed into a search method, Beautiful Soup finds content that matches the string exactly. The following example finds all the <div> tags in the document.

# Returns all div Tags
print(soup.find_all('div'))

If a bytestring is passed in, Beautiful Soup assumes it is UTF-8 encoded; passing Unicode instead avoids possible parsing errors.

  2. Regular expression

If a regular expression is passed in as a parameter, Beautiful Soup matches against it using the regular expression's search() method.

# Returns all tags whose name starts with "div"
print(soup.find_all(re.compile("^div")))
  3. List

If a list parameter is passed in, Beautiful Soup returns the content that matches any element in the list.

# Returns all matched span, a Tags
print(soup.find_all(['span','a']))
  4. Keyword arguments

If a keyword argument's name is not one of the built-in search parameter names, it is treated as a search on a tag attribute of that name. For example, if an id argument is passed, Beautiful Soup searches each tag's "id" attribute.

# Returns the tags whose id is "welcom"
print(soup.find_all(id='welcom'))
  5. True

True matches any value. The following code finds all the tags in the document but does not return any string nodes:
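
# A minimal sketch, reusing the soup object from the other snippets
print(soup.find_all(True))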

  6. Search by CSS class

Searching for tags by CSS class name is very useful, but the keyword class, which identifies CSS class names, is a reserved word in Python, so using class as a parameter causes a syntax error. Since Beautiful Soup 4.1.1, tags with a given CSS class can be searched for through the class_ parameter.

# Returns a div whose class equals info
print(soup.find_all('div',class_='info'))
  7. Search by attribute
soup.find_all("div", attrs={"class": "info"})

7. CSS selector (extension)

  • soup.select(selector):
Expression Explanation
tag Select the given tag
* Select all nodes
#container Select the node whose id is container
.container Select all nodes whose class contains container
li a Select all a nodes under any li
ul + p (sibling) Select the first p element immediately following a ul
div#container > ul (parent-child) Select the ul children of the div whose id is container
table ~ div Select all div elements that are siblings of a table
a[title] Select all a elements that have a title attribute
a[class="title"] Select all a elements whose class attribute equals title
a[href*="sxt"] Select all a elements whose href attribute contains sxt
a[href^="http"] Select all a elements whose href attribute starts with http
a[href$=".png"] Select all a elements whose href attribute ends with .png
input[type="radio"]:checked Select the checked radio inputs
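
  • A few of these selectors applied with soup.select (a sketch reusing the soup object and attribute values from the code example in section 8 below):

print(soup.select('div.info'))             # div tags whose class contains info
print(soup.select("div[float='right']"))   # attribute equality
print(soup.select("a[href*='sxt']"))       # href containing "sxt"
print(soup.select("a[href$='.cn']"))       # href ending in ".cn"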

8. Code examples

# pip install beautifulsoup4
# pip install lxml
from bs4 import BeautifulSoup
from bs4.element import Comment

str1 = '''
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
'''
soup = BeautifulSoup(str1, 'lxml')
print('-' * 30, 'Get Tags', '-' * 30)
print(soup.title)
print(soup.span)
print(soup.div)

print('-' * 30, 'get attribute', '-' * 30)
print(soup.div.attrs)
print(soup.div.get('class'))
print(soup.a['href'])

print('-' * 30, 'Getting content', '-' * 30)
print(type(soup.title.string))
print(soup.title.text)

print(type(soup.strong.string))
print(soup.strong.text)

if type(soup.strong.string) == Comment:
    print('There are notes!')
    print(soup.strong.prettify())

print('-' * 30, 'find_all()', '-' * 30)
print(soup.find_all('div'))
print(soup.find_all(id='title'))
print(soup.find_all(class_='info'))
print(soup.find_all(attrs={'float': 'right'}))

print('-' * 30, 'select()', '-' * 30)
print(soup.select('a'))
print(soup.select('#title'))
print(soup.select('.info'))
print(soup.select('div span'))
print(soup.select('div > span'))

3.3. Xpath

  • You can install the Xpath Helper plug-in on Google Chrome.

1. Introduction and installation

  1. Beautiful Soup is already a very powerful library, but there are other popular parsing libraries, such as lxml, that use XPath syntax, which is also a very efficient way to parse. If Beautiful Soup is not to your liking, try XPath.

    Official website http://lxml.de/index.html

    w3c http://www.w3school.com.cn/xpath/index.asp

  2. Installation:

    pip install lxml
    

2. Xpath grammar

  1. XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

  2. Node relationship:

    • Parent
    • Children
    • Sibling
    • Ancestor
    • Descendant

3. Selecting nodes:

  1. Commonly used path expressions:
Expression Description
nodename Select all child nodes of this node
/ Select from the root node
// Select matching nodes anywhere in the document, regardless of their position
. Select the current node
.. Select the parent of the current node
@ Select attributes
  2. Wildcards: XPath wildcards can be used to select unknown XML elements.
Wildcard Description Example Result
* Matches any element node xpath('div/*') Gets all child nodes under div
@* Matches any attribute node xpath('div[@*]') Selects all div nodes that have attributes
node() Matches a node of any type
  3. Selecting several paths: by using the "|" operator in a path expression, you can select several paths.
Expression Result
xpath('//div|//table') Gets all div and table nodes
  4. Predicates: predicates are embedded in square brackets to find a particular node or a node containing a specified value.
Expression Result
xpath('/body/div[1]') Selects the first div node under body
xpath('/body/div[last()]') Selects the last div node under body
xpath('/body/div[last()-1]') Selects the next-to-last div node under body
xpath('/body/div[position()<3]') Selects the first two div nodes under body
xpath('/body/div[@class]') Selects the div nodes under body that have a class attribute
xpath('/body/div[@class="main"]') Selects the div nodes under body whose class attribute is main
xpath('/body/div[price>35.00]') Selects the div nodes under body whose price child element is greater than 35
  5. XPath operators
Operator Description Example Return value
| Computes two node-sets //book | //cd A node-set with all book and cd elements
+ Addition 6 + 4 10
- Subtraction 6 - 4 2
* Multiplication 6 * 4 24
div Division 8 div 4 2
= Equal to price=9.80 True if price is 9.80, false if price is 9.90
!= Not equal to price!=9.80 True if price is 9.90, false if price is 9.80
< Less than price<9.80 True if price is 9.00, false if price is 9.90
<= Less than or equal to price<=9.80 True if price is 9.00, false if price is 9.90
> Greater than price>9.80 True if price is 9.90, false if price is 9.80
>= Greater than or equal to price>=9.80 True if price is 9.90, false if price is 9.70
or Or price=9.80 or price=9.70 True if price is 9.80, false if price is 9.50
and And price>9.00 and price<9.90 True if price is 9.80, false if price is 8.50
mod Modulus (division remainder) 5 mod 2 1
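
  • A small sketch of the union operator and of an operator inside a predicate, using lxml (the HTML string is made up for illustration, and here price is an attribute rather than a child element):

from lxml import etree

doc = etree.HTML('<body><div>a</div><table><tr><td>b</td></tr></table>'
                 '<ul><li price="9.80">x</li><li price="9.90">y</li></ul></body>')

# Union "|": all div and table nodes
print(doc.xpath('//div | //table'))
# Operator in a predicate: li nodes whose price attribute is greater than 9.85
print(doc.xpath('//li[@price > 9.85]/text()'))   # ['y']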

4. Use

1. Examples:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

html = etree.HTML(text)
result = etree.tostring(html)
print(result)

First we import the etree module from lxml, then initialize an HTML document with etree.HTML, and finally serialize it with etree.tostring and print it.

This shows a very practical feature of lxml: it automatically fixes up HTML. Note that the last li tag is not closed (its closing tag was deliberately deleted), yet lxml, which inherits the behaviour of libxml2, repairs the HTML automatically.

So the output is as follows:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

</body></html>

It not only closes the li tag, but also adds the body and html tags.

File reading

In addition to reading strings directly, it also supports reading content from files. For example, we create a new file called hello.html, which contains:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

The parse method is used to read files:

from lxml import etree
html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)

The same results can also be obtained.

2. Using XPath:

  1. Get all <li> tags
from lxml import etree
html = etree.parse('hello.html')
print (type(html))
result = html.xpath('//li')
print (result)
print (len(result))
print (type(result))
print (type(result[0]))
  • Operation results:
<type 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]

<type 'list'>
<type 'lxml.etree._Element'>

It can be seen that etree.parse returns an ElementTree. After calling xpath we get a list of five <li> elements, each of which is of type Element.

  2. Get the class attribute of every <li> tag
result = html.xpath('//li/@class')
print (result)
  • Operation results:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
  3. Get the <a> tags under <li> whose href is link1.html
result = html.xpath('//li/a[@href="link1.html"]')
print (result)
  • Operation results:
[<Element a at 0x10ffaae18>]
  4. Get all <span> tags under the <li> tags

    Note: the following is NOT correct:

result = html.xpath('//li/span')

# Because / selects only direct children and <span> is not a direct child of <li>, use a double slash instead
result = html.xpath('//li//span')
print(result)
  • Operation results:
[<Element span at 0x10d698e18>]
  5. Get all class attributes of the <a> tags under <li> (not those of <li> itself)
result = html.xpath('//li/a//@class')
print(result)
# Operation results
# ['bold']
  6. Get the href of the <a> under the last <li>
result = html.xpath('//li[last()]/a/@href')
print (result)
  • Operation results:
['link5.html']
  7. Get the text content of the <a> under the next-to-last <li>
result = html.xpath('//li[last()-1]/a')
print (result[0].text)
  • Operation results:
fourth item
  8. Get the tag name of the element whose class is bold
result = html.xpath('//*[@class="bold"]')
print (result[0].tag)
  • Operation results:
span

Node types that can be selected in an XML document:

  • Element (element node)
  • Attribute (attribute node)
  • Text (text node)
  • Namespace (namespace node)
  • Comment (comment node)
  • Root (document/root node)

5. Code examples

  • On the Zongheng novel site (book.zongheng.com), crawl the book list page and keep only the book titles and authors
from lxml import etree
import requests
from fake_useragent import UserAgent

url = 'http://book.zongheng.com/store/c1/c0/b0/u0/p1/v9/s1/t0/u0/i1/ALL.html'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
html = resp.text

# Construct the etree parsing object
e = etree.HTML(html)

# Title
names = e.xpath('//div[@class="bookname"]/a/text()')
# author
authors = e.xpath('//div[@class="bookilnk"]/a[1]/text()')
# Approach 01: index-based pairing; if a book has no author, the pairing will be off
for i in range(len(names)):
    print('{}:{}'.format(names[i], authors[i]))
# Approach 02: zip stops at the shorter of the two iterables when their lengths differ
for n, a in zip(names, authors):
    print('{}:{}'.format(n, a))
