Articles Catalogue
- 1. URLError
- 2. Use of the requests Library
- 2.1. Basic Introduction
- 2.2. get request
- 2.3. post request
- 2.4. Custom request header
- 2.5. Setting the timeout
- 2.6. Proxy access
- 2.7. session automatically saves cookies
- 2.8. ssl verification
- 2.9. Response information
- 3. Data extraction
1. URLError
First, the possible causes of URLError:

- The network is not connected, i.e. the local machine cannot access the Internet.
- Unable to connect to the specific server.
- The server does not exist.
In the code, we need to wrap the request in a try-except statement to catch the corresponding exception. The code is as follows:
```python
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from urllib.error import URLError

url = 'http://www.sxt.cn/index/login/login12353wfeds.html'  # Server available, resource not available
url = 'http://www.sxt12412412.cn/index/login/login12353wfeds.html'  # Server does not exist
headers = {'User-Agent': UserAgent().chrome}
try:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    info = resp.read().decode()
    print(info)
except URLError as e:
    if len(e.args) != 0:
        print('Address acquisition error!')
    else:
        print(e.code)
print('Crawling finished')
```
- Debugging tip: when the urlopen method is used to access a non-existent website, the result is as follows:
[Errno 11004] getaddrinfo failed
2. Use of the requests Library
2.1. Basic Introduction
- Introduction:
  It is helpful to understand some basic concepts of crawlers and the overall crawling process. With that covered, we now need more advanced content and tools to make crawling easier, so this section gives a brief introduction to the basic usage of the requests library.
- Installation (with pip):
pip install requests
- Basic requests:

```python
import requests

req = requests.get("http://www.baidu.com")
req = requests.post("http://www.baidu.com")
req = requests.put("http://www.baidu.com")
req = requests.delete("http://www.baidu.com")
req = requests.head("http://www.baidu.com")
req = requests.options("http://www.baidu.com")
```
2.2. get request
- Query parameters are passed as a dictionary via the params argument:
- Use of get 01:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
print(resp.text)
```
- Use of get 02:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com/s?'
params = {
    'wd': 'Black Horse Programmer'
}
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers, params=params)
resp.encoding = 'utf-8'
print(resp.text)
```
2.3. post request
- Form parameters are passed as a dictionary via the data argument; JSON parameters can be passed via the json argument:
- Code Example 01:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.sxt.cn/index/login/login.html'
args = {
    'user': '17703181473',
    'password': '123456'
}
headers = {'User-Agent': UserAgent().chrome}
resp = requests.post(url, headers=headers, data=args)
print(resp.text)
```
- Code example 02:
```python
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'https://www.kuaidaili.com/login/'
headers = {'User-Agent': UserAgent().chrome}
data = {
    'username': '398707160@qq.com',
    'passwd': '123456abc'
}
resp = requests.post(login_url, headers=headers, data=data)
print(resp.text)
```
2.4. Custom request header
- Disguised request headers are often used when scraping; we can hide our identity this way:
```python
import requests

headers = {'User-Agent': 'python'}
r = requests.get('http://www.zhidaow.com', headers=headers)
print(r.request.headers['User-Agent'])
```
2.5. Setting the timeout
- A timeout can be set through the timeout parameter; if no response is received within that time, an error is raised.

```python
requests.get('http://github.com', timeout=0.001)
```
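A minimal sketch of catching the resulting exception; requests raises requests.exceptions.Timeout when the limit is exceeded (the URL and the deliberately tiny limit are only illustrative):

```python
import requests

try:
    resp = requests.get('http://github.com', timeout=0.001)
    print(resp.status_code)
except requests.exceptions.Timeout:
    print('The request timed out!')
```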
2.6. Proxy access
- A proxy is often used to avoid having our IP blocked. requests has a corresponding proxies parameter:
```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.10:1080",
}
requests.get("http://www.zhidaow.com", proxies=proxies)
```
- If the proxy requires a username and password, use this form:
```python
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
```
- Code example:
```python
import requests
from fake_useragent import UserAgent

url = 'http://httpbin.org/get'
headers = {'User-Agent': UserAgent().chrome}
# proxy = {
#     'type': 'type://ip:port',
#     'type': 'type://username:password@ip:port'
# }
proxy = {
    'http': 'http://117.191.11.102:8080'
    # 'http': 'http://398707160:j8inhg2g@58.87.79.136:16817'
}
resp = requests.get(url, headers=headers, proxies=proxy)
print(resp.text)
```
2.7. session automatically saves cookies
- A Session keeps a conversation alive: for example, it lets you keep operating after logging in, with identity information such as cookies recorded, whereas a plain requests call is a single request and records no identity information.
```python
import requests

# Create a session object
s = requests.Session()
# Set a cookie by issuing a get request with the session object
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
```
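To confirm that the session really carries the cookie into later requests, it can be read back from httpbin (a minimal sketch):

```python
import requests

s = requests.Session()
# The first request sets a cookie on the session
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# The second request sends that cookie back automatically
resp = s.get('http://httpbin.org/cookies')
print(resp.text)  # {"cookies": {"sessioncookie": "123456789"}}
```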
- Code example:
```python
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'http://www.sxt.cn/index/login/login'
# Personal information
info_url = 'http://www.sxt.cn/index/user.html'
headers = {'User-Agent': UserAgent().chrome}
data = {
    'user': '17703181473',
    'password': '123456'
}
# Open a session object; cookies are saved in the session
session = requests.Session()
resp = session.post(login_url, headers=headers, data=data)
# Get the response content (as a string)
print(resp.text)
info_resp = session.get(info_url, headers=headers)
print(info_resp.text)
```
2.8. ssl verification
- When a site's SSL certificate cannot be verified, verification can be skipped with verify=False (and the resulting warning silenced):

```python
import requests

# Disable the insecure-request warning
requests.packages.urllib3.disable_warnings()
resp = requests.get(url, verify=False, headers=headers)
```
2.9. Response information
Code | Meaning |
---|---|
resp.json() | Parse the response content as JSON |
resp.text | Get the response content (as a string) |
resp.content | Get the response content (as bytes) |
resp.headers | Get the response headers |
resp.url | Get the requested URL |
resp.encoding | Get the response encoding |
resp.request.headers | Get the request headers that were sent |
resp.cookies | Get the cookies |
resp.status_code | Get the response status code |
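A short sketch exercising these attributes against httpbin.org (the URL is just an example):

```python
import requests
from fake_useragent import UserAgent

resp = requests.get('http://httpbin.org/get', headers={'User-Agent': UserAgent().chrome})
print(resp.status_code)      # response status code, e.g. 200
print(resp.url)              # the address that was accessed
print(resp.encoding)         # the detected encoding
print(resp.headers)          # response headers
print(resp.request.headers)  # the request headers that were sent
print(resp.cookies)          # cookies returned by the server
print(resp.json())           # parse the body as JSON (httpbin returns JSON)
```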
3. Data extraction
3.1. Regular expression re (the most flexible; the fastest)
1. Extracting data
We have already figured out how to get the page content, but we are still one step away: how do we extract and organize the text we want from so much messy markup? Here we introduce a very powerful tool: regular expressions!

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering logic over strings.

Regular expressions are a powerful tool for matching strings. Other programming languages also have the concept of regular expressions, and Python is no exception. With regular expressions, it is easy to extract what we want from the returned page content.

- Rules:
Pattern | Description |
---|---|
$ | Matches the end of the string |
. | Matches any character except a newline. When the re.DOTALL flag is specified, it matches any character including a newline. |
[...] | Represents a set of characters, listed individually: [amk] matches 'a', 'm' or 'k' |
[^...] | Characters not in the set: [^abc] matches any character other than a, b or c |
re* | Matches 0 or more of the preceding expression |
^ | Matches the beginning of the string |
re+ | Matches 1 or more of the preceding expression |
re? | Matches 0 or 1 of the preceding expression, non-greedy |
re{ n} | Matches exactly n of the preceding expression |
re{ n,} | Matches n or more of the preceding expression |
re{ n,m} | Matches n to m of the preceding expression, greedy |
a\|b | Matches a or b |
(re) | Matches the expression in parentheses and also creates a group |
(?-imx) | Turns off the i, m or x optional flags. Affects only the area in parentheses |
(?imx) | Turns on the i, m or x optional flags. Affects only the area in parentheses |
(?: re) | Similar to (...), but does not create a group |
(?imx: re) | Uses the i, m or x optional flags within the parentheses |
(?-imx: re) | Turns off the i, m or x optional flags within the parentheses |
(?#...) | Comment |
(?= re) | Positive lookahead assertion. Succeeds if the contained expression matches at the current position, without consuming any of the string; the rest of the pattern is then tried to the right of the assertion |
(?! re) | Negative lookahead assertion. The opposite of the positive assertion; succeeds when the contained expression does not match at the current position |
(?> re) | Matches an independent pattern, eliminating backtracking |
\w | Matches letters, digits and underscore |
\W | Matches anything other than letters, digits and underscore |
\s | Matches any whitespace character, equivalent to [ \t\n\r\f\v] |
\S | Matches any non-whitespace character |
\d | Matches any digit, equivalent to [0-9] |
\D | Matches any non-digit |
\A | Matches the start of the string |
\Z | Matches the end of the string; if there is a trailing newline, it matches just before the newline |
\z | Matches the very end of the string |
\G | Matches the position where the last match finished |
\b | Matches a word boundary, i.e. the position between a word and a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb' |
\B | Matches a non-word boundary. 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never' |
\n, \t, etc. | Matches a newline, a tab, etc. |
\1...\9 | Matches the content of the nth group |
\10 | Matches the content of the nth group if it has been matched; otherwise it refers to the octal character code |
[\u4e00-\u9fa5] | Chinese characters |
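A quick sketch trying a few rows from the table above (the sample string is made up for illustration):

```python
import re

s = 'Tel: 010-12345678, email: sxt@bjsxt.cn, 你好'
print(re.findall(r'\d+', s))               # ['010', '12345678'] - runs of digits
print(re.findall(r'\w+@\w+\.\w+', s))      # ['sxt@bjsxt.cn'] - a simple e-mail pattern
print(re.findall(r'[\u4e00-\u9fa5]+', s))  # ['你好'] - Chinese characters
```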
2. Related Notes on Regular Expressions
- The greedy and non-greedy modes of quantifiers:
  Regular expressions are often used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers, on the contrary, try to match as few characters as possible. For example, the regular expression `ab*` applied to `abbbc` finds `abbb`, while the non-greedy form `ab*?` finds only `a`, as the sketch below shows.
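A minimal sketch of the difference:

```python
import re

s = 'abbbc'
print(re.findall(r'ab*', s))   # ['abbb'] - greedy: as many 'b's as possible
print(re.findall(r'ab*?', s))  # ['a']    - non-greedy: as few 'b's as possible
```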
- Common methods:

  - re.match: tries to match a pattern at the beginning of the string; if the match is not successful, it returns None.
    Syntax: re.match(pattern, string, flags=0)
  - re.search: scans the entire string and returns the first successful match.
    Syntax: re.search(pattern, string, flags=0)
  - re.sub: substitutes matched substrings.
    Syntax: re.sub(pattern, replace, string)
  - re.findall: finds all matches and returns them as a list.
    Syntax: re.findall(pattern, string, flags=0)
- Regular expression modifiers (optional flags):
  Regular expressions can include optional flag modifiers to control how matching is done. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:
Modifier | Description |
---|---|
re.I | Makes matching case insensitive |
re.L | Locale-aware matching |
re.M | Multi-line matching, affecting ^ and $ |
re.S | Makes . match any character, including newlines |
re.U | Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B |
re.X | Verbose mode: allows a more flexible format so the regular expression is easier to read |
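A small sketch of combining flags with bitwise OR (the HTML string is made up for illustration):

```python
import re

html = '<div>Hello\nWORLD</div>'
# re.I ignores case, re.S lets . match the newline as well
print(re.findall(r'<DIV>(.*?)</div>', html, re.I | re.S))  # ['Hello\nWORLD']
```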
- Code example:
```python
import re

str1 = 'I study Python3.6 everday!'

############ match ############
print('-' * 30, 'match()', '-' * 30)
# Matches from left to right, starting at the beginning of the string;
# if it cannot match there, it returns None directly
# m1 = re.match(r'I', str1)
# m1 = re.match(r'[I]', str1)
# m1 = re.match(r'\bI', str1)
# m1 = re.match(r'\w', str1)
# m1 = re.match(r'\S', str1)
# m1 = re.match(r'(I)', str1)
# m1 = re.match(r'.', str1)
# m1 = re.match(r'\D', str1)
m1 = re.match(r'\w\s(study)', str1)
print(m1.group(1))

############ search ############
print('-' * 30, 'search()', '-' * 30)
# Scans the whole string from left to right and returns the first match
s1 = re.search(r'study', str1)
s1 = re.search(r'y', str1)
print(s1.group())

############ findall ############
print('-' * 30, 'findall()', '-' * 30)
f1 = re.findall(r'y', str1)
f1 = re.findall(r'Python3.6', str1)
f1 = re.findall(r'P\w*.\d', str1)
print(f1)

############ sub ############
print('-' * 30, 'sub()', '-' * 30)
su1 = re.sub(r'everday', 'Everday', str1)
su1 = re.sub(r'ev.+', 'Everday', str1)
print(su1)

print('-' * 30, 'test()', '-' * 30)
str2 = '<span><a href="http://Www.bjstx.com "> Silicon Valley sxt</a> </span>'
# t1 = re.findall(r'[\u4e00-\u9fa5]+', str2)
# t1 = re.findall(r'>([\u4e00-\u9fa5]+)<', str2)
# t1 = re.findall(r'>(\S+?)<', str2)
t1 = re.findall(r'<a href=".*">(.+)</a>', str2)
t1 = re.findall(r'<a href="(.*)">.+</a>', str2)
print(t1)
t2 = re.sub(r'span', 'div', str2)
t2 = re.sub(r'<span>(.+)</span>', r'<div>\1</div>', str2)
print(t2)
```
- Exercise: crawl the first three pages of Qiushibaike, keeping only the text of each post.
```python
import requests
from fake_useragent import UserAgent
import re

with open('duanzi.txt', 'w', encoding='utf-8') as f:
    for i in range(1, 4):
        url = 'https://www.qiushibaike.com/text/page/{}/'.format(i)
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        html = resp.text
        infos = re.findall(r'<div class="content">\s<span>\s+(.+)', html)
        for info in infos:
            f.write('-' * 30 + '\n')
            f.write(info.replace(r'<br/>', '\n'))
            f.write('\n' + '-' * 30 + '\n')
```
3.2. Beautiful Soup
1. Introduction, Installation and Four Categories
Beautiful Soup provides some simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that extracts the data users need by parsing the document; because it is simple, a complete application does not require much code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding.

Beautiful Soup has become as excellent a Python parsing tool as lxml and html5lib, giving users flexible parsing strategies and good speed.
- Installation: Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It has been ported to the bs4 package, which means we import bs4 when importing it.
```
pip install beautifulsoup4
pip install lxml
```
Beautiful Soup supports the HTML parser in the Python standard library as well as some third-party parsers. If we do not install one, Python uses its default parser. The lxml parser is more powerful and faster, and installing it is recommended.
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | 1. Python's built-in standard library 2. Moderate speed 3. Reasonable fault tolerance | Poor fault tolerance in versions before Python 2.7.3 and 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | 1. Fast 2. Good fault tolerance | Requires the C library to be installed |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) BeautifulSoup(markup, "xml") | 1. Fast 2. The only parser that supports XML | Requires the C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | 1. Best fault tolerance 2. Parses documents the way a browser does 3. Generates documents in HTML5 format 4. Does not depend on external extensions | Slow |
- Create a Beautiful Soup object:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, "lxml")
```
- Four kinds of objects:
  Beautiful Soup converts a complex HTML document into a complex tree structure in which each node is a Python object. All objects can be summarized into four types (see the sketch after this list):

  - Tag
  - NavigableString
  - BeautifulSoup
  - Comment (not often used)
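A quick sketch showing which of the four types each object is (the one-line HTML here is just a simplified stand-in for the sample used below):

```python
from bs4 import BeautifulSoup

html = "<div class='info'>Welcome to SXT<!--Useless--></div>"
soup = BeautifulSoup(html, 'lxml')

print(type(soup))                  # <class 'bs4.BeautifulSoup'>
print(type(soup.div))              # <class 'bs4.element.Tag'>
print(type(soup.div.contents[0]))  # <class 'bs4.element.NavigableString'>
print(type(soup.div.contents[1]))  # <class 'bs4.element.Comment'>
```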
2. Tag
- Popularly speaking, a Tag is just a tag in HTML, for example `<div>` or `<title>`.
- How to use it (take the following code as an example):

```html
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
```
- Get tags:
```python
# Parse with lxml
soup = BeautifulSoup(info, 'lxml')
print(soup.title)
# <title id="title">Ada Tam</title>
```
- Note: accessing a tag this way only returns the first tag that matches.
- Get properties:
```python
# Get all attributes
print(soup.title.attrs)       # {'id': 'title'}
# Get the value of a single attribute
print(soup.div.get('class'))  # ['info']
print(soup.div['class'])      # ['info']
print(soup.a['href'])         # www.bjsxt.cn
```
3. NavigableString: Getting Content
```python
print(soup.title.string)
print(soup.title.text)
# Ada Tam
```
4. BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object: it supports traversing the document tree and most of the search methods described for the document tree.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attrs attributes. But sometimes it is convenient to look at its .name attribute, so the BeautifulSoup object has a special .name attribute whose value is "[document]".
```python
print(soup.name)       # [document]
print(soup.head.name)  # head
```
5. Comment
- The Comment object is a special type of NavigableString. Its output does not include the comment delimiters, but if it is not handled properly it may cause unexpected trouble in our text processing.
```python
from bs4.element import Comment

if type(soup.strong.string) == Comment:
    print(soup.strong.prettify())
else:
    print(soup.strong.string)
```
6. Searching Document Tree
- Beautiful Soup defines many search methods; here we focus on two: find() and find_all(). The parameters and usage of the other methods are similar and can be inferred by analogy.
- Filters:
  Before introducing the find_all() method, let us first introduce the kinds of filters that run through the entire search API. Filters can be used on tag names, node attributes, strings, or combinations of these.
- Character string
The simplest filter is a string. When a string is passed to a search method, Beautiful Soup finds content that matches that string exactly. The following example finds all div tags in the document:
```python
# Returns all div tags
print(soup.find_all('div'))
```
If a byte string is passed in, Beautiful Soup assumes it is UTF-8 encoded; passing a Unicode string instead avoids possible parsing errors.
- regular expression
If a regular expression is passed in as a parameter, Beautiful Soup matches content against that regular expression.
```python
import re

# Returns all tags whose name starts with "div"
print(soup.find_all(re.compile("^div")))
```
- list
If a list parameter is passed in, Beautiful Soup returns the content that matches any element in the list.
```python
# Returns all matched span and a tags
print(soup.find_all(['span', 'a']))
```
- keyword
If a keyword argument does not correspond to one of the built-in parameter names, it is treated as a search on a tag attribute of that name. For example, if an id argument is passed, Beautiful Soup searches each tag's "id" attribute.
```python
# Returns the tags whose id is "welcom"
print(soup.find_all(id='welcom'))
```
- True
True matches any value. It finds all tags but returns no string nodes, as in the sketch below.
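A minimal sketch, reusing the soup object built from the sample HTML above:

```python
# True matches every tag, so this prints the name of each tag in the document
for tag in soup.find_all(True):
    print(tag.name)
```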
- Search by CSS
Searching for tags by CSS class name is very useful, but the keyword class, which identifies CSS class names, is a reserved word in Python, and using class as a parameter causes a syntax error. Since version 4.1.1 of Beautiful Soup, tags with a given CSS class can be searched with the class_ parameter.
```python
# Returns the divs whose class equals "info"
print(soup.find_all('div', class_='info'))
```
- Search by property
```python
soup.find_all("div", attrs={"class": "info"})
```
7. CSS selector (extension)
- soup.select(selector):
Expression | Explanation |
---|---|
tag | Select the specified tag |
* | Select all nodes |
#container | Select the node whose id is container |
.container | Select all nodes whose class contains container |
li a | Select all a nodes inside li nodes |
ul + p | (sibling) Select the first p element immediately after a ul |
div#container > ul | (parent-child) Select the ul elements that are direct children of the div whose id is container |
table ~ div | Select all div siblings that follow a table |
a[title] | Select all a elements that have a title attribute |
a[class="title"] | Select all a elements whose class attribute equals title |
a[href*="sxt"] | Select all a elements whose href attribute contains sxt |
a[href^="http"] | Select all a elements whose href attribute starts with http |
a[href$=".png"] | Select all a elements whose href attribute ends with .png |
input[type="radio"]:checked | Select all checked radio inputs |
8. Code examples
```python
# pip install bs4
# pip install lxml
from bs4 import BeautifulSoup
from bs4.element import Comment

str1 = '''
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
'''
soup = BeautifulSoup(str1, 'lxml')

print('-' * 30, 'Get tags', '-' * 30)
print(soup.title)
print(soup.span)
print(soup.div)

print('-' * 30, 'Get attributes', '-' * 30)
print(soup.div.attrs)
print(soup.div.get('class'))
print(soup.a['href'])

print('-' * 30, 'Get content', '-' * 30)
print(type(soup.title.string))
print(soup.title.text)
print(type(soup.strong.string))
print(soup.strong.text)
if type(soup.strong.string) == Comment:
    print('There is a comment!')
    print(soup.strong.prettify())

print('-' * 30, 'find_all()', '-' * 30)
print(soup.find_all('div'))
print(soup.find_all(id='title'))
print(soup.find_all(class_='info'))
print(soup.find_all(attrs={'float': 'right'}))

print('-' * 30, 'select()', '-' * 30)
print(soup.select('a'))
print(soup.select('#title'))
print(soup.select('.info'))
print(soup.select('div span'))
print(soup.select('div > span'))
```
3.3. Xpath
- You can install the Xpath Helper plug-in on Google Chrome.
1. Introduction and installation
- Beautiful Soup is already a very powerful library, but there are other popular parsing libraries, such as lxml, which uses XPath syntax and is also a very efficient way to parse. If you are not comfortable with Beautiful Soup, try XPath.
- Installation:
pip install lxml
2. Xpath grammar
- XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
- Node relationships:
- Parent
- Children
- Sibling
- Ancestor
- Descendant
3. Selecting nodes:
- Commonly used path expressions:
Expression | describe |
---|---|
nodename | Select all child nodes of this node |
/ | Selection from the root node |
// | Select the nodes in the document from the current node that matches the selection, regardless of their location |
. | Select the current node |
.. | Select the parent of the current node |
@ | Select attributes |
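A small sketch of these expressions with lxml (the XML snippet is made up for illustration):

```python
from lxml import etree

xml = '<shop><book id="b1"><title>Python</title></book><book id="b2"><title>XPath</title></book></shop>'
root = etree.fromstring(xml)

print(root.xpath('//book'))          # all book nodes, wherever they are
print(root.xpath('//book/@id'))      # ['b1', 'b2'] - the id attribute of every book
print(root.xpath('//title/text()'))  # ['Python', 'XPath'] - the text of every title
print(root.xpath('//title/..'))      # the parent (book) of every title
```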
- Wildcards: XPath wildcards can be used to select unknown XML elements.
wildcard | describe | Give an example | Result |
---|---|---|---|
* | Match any element node | xpath('div/*') | Get all the child nodes under div |
@* | Match any attribute node | xpath('div[@*]') | Select all div nodes with attributes |
node() | Match any type of node | xpath('div/node()') | Get all nodes of any type under div |
- Select several paths: By using the "|" operator in the path expression, you can select several paths
Expression | Result |
---|---|
xpath('//div|//table') | Get all div and table nodes |
- Predicates: Predicates are embedded in square brackets to find a particular node or node containing a specified value
Expression | Result |
---|---|
xpath('/body/div[1]') | Select the first div node under body |
xpath('/body/div[last()]') | Select the last div node under body |
xpath('/body/div[last()-1]') | Select the second-to-last div node under body |
xpath('/body/div[position()<3]') | Select the first two div nodes under body |
xpath('/body/div[@class]') | Select div node with class attribute under body |
xpath('/body/div[@class="main"]') | Select div node whose class attribute is main under body |
xpath('/body/div[price>35.00]') | Selecting div nodes with price element greater than 35 under body |
- Xpath operator
operator | describe | Example | Return value |
---|---|---|---|
\| | Computes the union of two node sets | //book \| //cd | Returns a node set containing all book and cd elements |
+ | addition | 6 + 4 | 10 |
– | subtraction | 6 – 4 | 2 |
* | multiplication | 6 * 4 | 24 |
div | division | 8 div 4 | 2 |
= | Be equal to | price=9.80 | If price is 9.80, return true. If price is 9.90, return false. |
!= | Not equal to | price!=9.80 | If price is 9.90, return true. If price is 9.80, return false. |
< | less than | price<9.80 | If price is 9.00, return true. If price is 9.90, return false. |
<= | Less than or equal to | price<=9.80 | If price is 9.00, return true. If price is 9.90, return false. |
> | greater than | price>9.80 | If price is 9.90, return true. If price is 9.80, return false. |
>= | Greater than or equal to | price>=9.80 | If price is 9.90, return true. If price is 9.70, return false. |
or | or | price=9.80 or price=9.70 | If price is 9.80, return true. If price is 9.50, return false. |
and | and | price>9.00 and price<9.90 | If price is 9.80, return true. If price is 8.50, return false. |
mod | Calculate the remainder of division | 5 mod 2 | 1 |
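A sketch of predicates and operators combined (the XML is only illustrative):

```python
from lxml import etree

xml = '''<shop>
  <book><title>A</title><price>9.80</price></book>
  <book><title>B</title><price>35.50</price></book>
</shop>'''
root = etree.fromstring(xml)

# The union operator | combines two node sets
print(root.xpath('//title | //price'))
# Comparison operators work on element content converted to numbers
print(root.xpath('//book[price>9.00]/title/text()'))   # ['A', 'B']
print(root.xpath('//book[price>10.00]/title/text()'))  # ['B']
```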
4. Use
1. Examples:
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result)
```
First we import the etree module from lxml, initialize the HTML with etree.HTML, serialize it with etree.tostring and print the result.
This shows a very practical feature of lxml: it automatically fixes up HTML code. Note that the last li tag above is not closed; its closing tag was deliberately deleted. lxml inherits libxml2's behaviour and repairs the HTML automatically.
So the output is as follows:
```html
<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
```
It not only completes the li tag but also adds the html and body tags.

File reading:
In addition to reading strings directly, lxml also supports reading content from files. For example, we create a new file called hello.html with the following content:
```html
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
```
The parse method is used to read files:
```python
from lxml import etree

html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)
```
The same results can also be obtained.
2. XPath uses:
- Get all <li> tags:

```python
from lxml import etree

html = etree.parse('hello.html')
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))
```
- Operation results:

```
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
```
It can be seen that etree.parse returns an ElementTree. After calling xpath we get a list of five <li> elements, each of which is an Element.
- Get the class attribute of every <li> tag:

```python
result = html.xpath('//li/@class')
print(result)
```
- Operation results:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
- Get the <a> tags under <li> whose href is link1.html:

```python
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
```
- Operation results:
[<Element a at 0x10ffaae18>]
- Get all <span> tags under the <li> tags.
  Note: the following commented-out expression is not correct, because / only selects direct children and <span> is not a direct child of <li>; use a double slash instead.

```python
# result = html.xpath('//li/span')  # Wrong: / only selects direct children
result = html.xpath('//li//span')   # // selects descendants at any depth
print(result)
```
- Operation results:
[<Element span at 0x10d698e18>]
- Get the class attributes under the <li> tags (not of <li> itself):

```python
result = html.xpath('//li/a//@class')
print(result)
# Operation results
# ['bold']
```
- Get the href of the <a> under the last <li>:

```python
result = html.xpath('//li[last()]/a/@href')
print(result)
```
- Operation results:
['link5.html']
- Get the content of the <a> under the second-to-last <li>:

```python
result = html.xpath('//li[last()-1]/a')
print(result[0].text)
```
- Operation results:
fourth item
- Get the tag name of the element whose class is bold:

```python
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)
```
- Operation results:
span
Select the nodes in the XML file:
- Element (element node)
- Attribute (attribute node)
- Text (text node)
- Concat (element node, element node)
- Comment (comment node)
- Root (root node)
5. Code examples
- On the Zongheng Chinese website (book.zongheng.com), crawl the first three pages of data, keeping only the book titles and authors.
```python
from lxml import etree
import requests
from fake_useragent import UserAgent

url = 'http://book.zongheng.com/store/c1/c0/b0/u0/p1/v9/s1/t0/u0/i1/ALL.html'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
html = resp.text
# Build the etree parsing object
e = etree.HTML(html)
# Titles
names = e.xpath('//div[@class="bookname"]/a/text()')
# Authors
authors = e.xpath('//div[@class="bookilnk"]/a[1]/text()')
# Mode 01: if a book has no author, the indexes will not correspond
for i in range(len(names)):
    print('{}:{}'.format(names[i], authors[i]))
# Mode 02: zip stops at the shorter of the two iterables
for n, a in zip(names, authors):
    print('{}:{}'.format(n, a))
```