Articles Catalogue
- 1. URLError
- 2. Use of the requests Library
- 2.1. Basic Introduction
- 2.2. get request
- 2.3. post request
- 2.4. Custom request header
- 2.5. Setting the timeout
- 2.6. Proxy access
- 2.7. session automatically saves cookies
- 2.8. ssl verification
- 2.9. Response information
- 3. Data extraction
1. URLError
First, the possible causes of URLError:

- The network is not connected, i.e. the local machine cannot access the Internet.
- Unable to connect to the specific server.
- The server does not exist.
In the code, we need to wrap the request in a try-except statement to catch the corresponding exception. The code is as follows:
```python
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from urllib.error import URLError

url = 'http://www.sxt.cn/index/login/login12353wfeds.html'  # Server available, resource not available
url = 'http://www.sxt12412412.cn/index/login/login12353wfeds.html'  # Server does not exist
headers = {'User-Agent': UserAgent().chrome}
try:
    req = Request(url, headers=headers)
    resp = urlopen(req)
    info = resp.read().decode()
    print(info)
except URLError as e:
    if len(e.args) != 0:
        print('Address acquisition error!')
    else:
        print(e.code)
print('Crawling finished')
```
- Debugging tip: when the urlopen method is used to access a non-existent website, the result is as follows:
[Errno 11004] getaddrinfo failed
2. Use of the requests Library
2.1. Basic Introduction
- Introduction:
  It is helpful to understand some basic concepts of crawlers and the overall crawling process. With that covered, we now need more advanced content and tools to make crawling easier, so this section gives a brief introduction to the basic usage of the requests library.
- Installation (with pip):
pip install requests
- Basic requests:

```python
import requests

req = requests.get("http://www.baidu.com")
req = requests.post("http://www.baidu.com")
req = requests.put("http://www.baidu.com")
req = requests.delete("http://www.baidu.com")
req = requests.head("http://www.baidu.com")
req = requests.options("http://www.baidu.com")
```
2.2. get request
- Query parameters are passed as a dictionary via the params argument:
- Use of get 01:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
resp.encoding = 'utf-8'
print(resp.text)
```
- Use of get 02:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.baidu.com/s?'
params = {
    'wd': 'Black Horse Programmer'
}
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers, params=params)
resp.encoding = 'utf-8'
print(resp.text)
```
2.3. post request
- Form parameters are passed as a dictionary via the data argument; JSON parameters can be passed via the json argument:
- Code Example 01:
```python
import requests
from fake_useragent import UserAgent

url = 'http://www.sxt.cn/index/login/login.html'
args = {
    'user': '17703181473',
    'password': '123456'
}
headers = {'User-Agent': UserAgent().chrome}
resp = requests.post(url, headers=headers, data=args)
print(resp.text)
```
- Code example 02:
```python
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'https://www.kuaidaili.com/login/'
headers = {'User-Agent': UserAgent().chrome}
data = {
    'username': '398707160@qq.com',
    'passwd': '123456abc'
}
resp = requests.post(login_url, headers=headers, data=data)
print(resp.text)
```
2.4. Custom request header
- Disguised request headers are often used when scraping; we can hide our identity this way:
```python
import requests

headers = {'User-Agent': 'python'}
r = requests.get('http://www.zhidaow.com', headers=headers)
print(r.request.headers['User-Agent'])
```
2.5. Setting the timeout
- A timeout can be set through the timeout parameter; if no response is received within that time, an error is raised.

```python
requests.get('http://github.com', timeout=0.001)
```
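A minimal sketch of catching the resulting exception; requests raises requests.exceptions.Timeout when the limit is exceeded (the URL and the deliberately tiny limit are only illustrative):

```python
import requests

try:
    resp = requests.get('http://github.com', timeout=0.001)
    print(resp.status_code)
except requests.exceptions.Timeout:
    print('The request timed out!')
```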
2.6. Proxy access
- A proxy is often used to avoid having our IP blocked. requests has a corresponding proxies parameter:
```python
import requests

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "https://10.10.1.10:1080",
}
requests.get("http://www.zhidaow.com", proxies=proxies)
```
- If the proxy requires a username and password, use this form:
```python
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
```
- Code example:
```python
import requests
from fake_useragent import UserAgent

url = 'http://httpbin.org/get'
headers = {'User-Agent': UserAgent().chrome}
# proxy = {
#     'type': 'type://ip:port',
#     'type': 'type://username:password@ip:port'
# }
proxy = {
    'http': 'http://117.191.11.102:8080'
    # 'http': 'http://398707160:j8inhg2g@58.87.79.136:16817'
}
resp = requests.get(url, headers=headers, proxies=proxy)
print(resp.text)
```
2.7. session automatically saves cookies
- A Session keeps a conversation alive: for example, it lets you keep operating after logging in, with identity information such as cookies recorded, whereas a plain requests call is a single request and records no identity information.
```python
import requests

# Create a session object
s = requests.Session()
# Set a cookie by issuing a get request with the session object
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
```
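To confirm that the session really carries the cookie into later requests, it can be read back from httpbin (a minimal sketch):

```python
import requests

s = requests.Session()
# The first request sets a cookie on the session
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# The second request sends that cookie back automatically
resp = s.get('http://httpbin.org/cookies')
print(resp.text)  # {"cookies": {"sessioncookie": "123456789"}}
```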
- Code example:
```python
import requests
from fake_useragent import UserAgent

# Sign in
login_url = 'http://www.sxt.cn/index/login/login'
# Personal information
info_url = 'http://www.sxt.cn/index/user.html'
headers = {'User-Agent': UserAgent().chrome}
data = {
    'user': '17703181473',
    'password': '123456'
}
# Open a session object; cookies are saved in the session
session = requests.Session()
resp = session.post(login_url, headers=headers, data=data)
# Get the response content (as a string)
print(resp.text)
info_resp = session.get(info_url, headers=headers)
print(info_resp.text)
```
2.8. ssl verification
- When a site's SSL certificate cannot be verified, verification can be skipped with verify=False (and the resulting warning silenced):

```python
import requests

# Disable the insecure-request warning
requests.packages.urllib3.disable_warnings()
resp = requests.get(url, verify=False, headers=headers)
```
2.9. Response information
Code | Meaning |
---|---|
resp.json() | Parse the response content as JSON |
resp.text | Get the response content (as a string) |
resp.content | Get the response content (as bytes) |
resp.headers | Get the response headers |
resp.url | Get the requested URL |
resp.encoding | Get the response encoding |
resp.request.headers | Get the request headers that were sent |
resp.cookies | Get the cookies |
resp.status_code | Get the response status code |
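A short sketch exercising these attributes against httpbin.org (the URL is just an example):

```python
import requests
from fake_useragent import UserAgent

resp = requests.get('http://httpbin.org/get', headers={'User-Agent': UserAgent().chrome})
print(resp.status_code)      # response status code, e.g. 200
print(resp.url)              # the address that was accessed
print(resp.encoding)         # the detected encoding
print(resp.headers)          # response headers
print(resp.request.headers)  # the request headers that were sent
print(resp.cookies)          # cookies returned by the server
print(resp.json())           # parse the body as JSON (httpbin returns JSON)
```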
3. Data extraction
3.1. Regular expression re (the most flexible; the fastest)
1. Extracting data
We have already figured out how to get the page content, but we are still one step away: how do we extract and organize the text we want from so much messy markup? Here we introduce a very powerful tool: regular expressions!

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "pattern string" that expresses a filtering logic over strings.

Regular expressions are a powerful tool for matching strings. Other programming languages also have the concept of regular expressions, and Python is no exception. With regular expressions, it is easy to extract what we want from the returned page content.

- Rules:
Pattern | Description |
---|---|
$ | Matches the end of the string |
. | Matches any character except a newline. When the re.DOTALL flag is specified, it matches any character including a newline. |
[...] | Represents a set of characters, listed individually: [amk] matches 'a', 'm' or 'k' |
[^...] | Characters not in the set: [^abc] matches any character other than a, b or c |
re* | Matches 0 or more of the preceding expression |
^ | Matches the beginning of the string |
re+ | Matches 1 or more of the preceding expression |
re? | Matches 0 or 1 of the preceding expression, non-greedy |
re{ n} | Matches exactly n of the preceding expression |
re{ n,} | Matches n or more of the preceding expression |
re{ n,m} | Matches n to m of the preceding expression, greedy |
a\|b | Matches a or b |
(re) | Matches the expression in parentheses and also creates a group |
(?-imx) | Turns off the i, m or x optional flags. Affects only the area in parentheses |
(?imx) | Turns on the i, m or x optional flags. Affects only the area in parentheses |
(?: re) | Similar to (...), but does not create a group |
(?imx: re) | Uses the i, m or x optional flags within the parentheses |
(?-imx: re) | Turns off the i, m or x optional flags within the parentheses |
(?#...) | Comment |
(?= re) | Positive lookahead assertion. Succeeds if the contained expression matches at the current position, without consuming any of the string; the rest of the pattern is then tried to the right of the assertion |
(?! re) | Negative lookahead assertion. The opposite of the positive assertion; succeeds when the contained expression does not match at the current position |
(?> re) | Matches an independent pattern, eliminating backtracking |
\w | Matches letters, digits and underscore |
\W | Matches anything other than letters, digits and underscore |
\s | Matches any whitespace character, equivalent to [ \t\n\r\f\v] |
\S | Matches any non-whitespace character |
\d | Matches any digit, equivalent to [0-9] |
\D | Matches any non-digit |
\A | Matches the start of the string |
\Z | Matches the end of the string; if there is a trailing newline, it matches just before the newline |
\z | Matches the very end of the string |
\G | Matches the position where the last match finished |
\b | Matches a word boundary, i.e. the position between a word and a space. For example, 'er\b' matches the 'er' in 'never' but not the 'er' in 'verb' |
\B | Matches a non-word boundary. 'er\B' matches the 'er' in 'verb' but not the 'er' in 'never' |
\n, \t, etc. | Matches a newline, a tab, etc. |
\1...\9 | Matches the content of the nth group |
\10 | Matches the content of the nth group if it has been matched; otherwise it refers to the octal character code |
[\u4e00-\u9fa5] | Chinese characters |
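A quick sketch trying a few rows from the table above (the sample string is made up for illustration):

```python
import re

s = 'Tel: 010-12345678, email: sxt@bjsxt.cn, 你好'
print(re.findall(r'\d+', s))               # ['010', '12345678'] - runs of digits
print(re.findall(r'\w+@\w+\.\w+', s))      # ['sxt@bjsxt.cn'] - a simple e-mail pattern
print(re.findall(r'[\u4e00-\u9fa5]+', s))  # ['你好'] - Chinese characters
```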
2. Related Notes on Regular Expressions
- The greedy and non-greedy modes of quantifiers:
  Regular expressions are often used to find matching strings in text. Quantifiers in Python are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy quantifiers, on the contrary, try to match as few characters as possible. For example, the regular expression `ab*` applied to `abbbc` finds `abbb`, while the non-greedy form `ab*?` finds only `a`, as the sketch below shows.
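A minimal sketch of the difference:

```python
import re

s = 'abbbc'
print(re.findall(r'ab*', s))   # ['abbb'] - greedy: as many 'b's as possible
print(re.findall(r'ab*?', s))  # ['a']    - non-greedy: as few 'b's as possible
```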
- Common methods:

  - re.match: tries to match a pattern at the beginning of the string; if the match is not successful, it returns None.
    Syntax: re.match(pattern, string, flags=0)
  - re.search: scans the entire string and returns the first successful match.
    Syntax: re.search(pattern, string, flags=0)
  - re.sub: substitutes matched substrings.
    Syntax: re.sub(pattern, replace, string)
  - re.findall: finds all matches and returns them as a list.
    Syntax: re.findall(pattern, string, flags=0)
- Regular expression modifiers (optional flags):
  Regular expressions can include optional flag modifiers to control how matching is done. Multiple flags can be combined with bitwise OR (|); for example, re.I | re.M sets both the I and M flags:
Modifier | Description |
---|---|
re.I | Makes matching case insensitive |
re.L | Locale-aware matching |
re.M | Multi-line matching, affecting ^ and $ |
re.S | Makes . match any character, including newlines |
re.U | Interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B |
re.X | Verbose mode: allows a more flexible format so the regular expression is easier to read |
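A small sketch of combining flags with bitwise OR (the HTML string is made up for illustration):

```python
import re

html = '<div>Hello\nWORLD</div>'
# re.I ignores case, re.S lets . match the newline as well
print(re.findall(r'<DIV>(.*?)</div>', html, re.I | re.S))  # ['Hello\nWORLD']
```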
- Code example:
```python
import re

str1 = 'I study Python3.6 everday!'

############ match ############
print('-' * 30, 'match()', '-' * 30)
# Matches from left to right, starting at the beginning of the string;
# if it cannot match there, it returns None directly
# m1 = re.match(r'I', str1)
# m1 = re.match(r'[I]', str1)
# m1 = re.match(r'\bI', str1)
# m1 = re.match(r'\w', str1)
# m1 = re.match(r'\S', str1)
# m1 = re.match(r'(I)', str1)
# m1 = re.match(r'.', str1)
# m1 = re.match(r'\D', str1)
m1 = re.match(r'\w\s(study)', str1)
print(m1.group(1))

############ search ############
print('-' * 30, 'search()', '-' * 30)
# Scans the whole string from left to right and returns the first match
s1 = re.search(r'study', str1)
s1 = re.search(r'y', str1)
print(s1.group())

############ findall ############
print('-' * 30, 'findall()', '-' * 30)
f1 = re.findall(r'y', str1)
f1 = re.findall(r'Python3.6', str1)
f1 = re.findall(r'P\w*.\d', str1)
print(f1)

############ sub ############
print('-' * 30, 'sub()', '-' * 30)
su1 = re.sub(r'everday', 'Everday', str1)
su1 = re.sub(r'ev.+', 'Everday', str1)
print(su1)

print('-' * 30, 'test()', '-' * 30)
str2 = '<span><a href="http://Www.bjstx.com "> Silicon Valley sxt</a> </span>'
# t1 = re.findall(r'[\u4e00-\u9fa5]+', str2)
# t1 = re.findall(r'>([\u4e00-\u9fa5]+)<', str2)
# t1 = re.findall(r'>(\S+?)<', str2)
t1 = re.findall(r'<a href=".*">(.+)</a>', str2)
t1 = re.findall(r'<a href="(.*)">.+</a>', str2)
print(t1)
t2 = re.sub(r'span', 'div', str2)
t2 = re.sub(r'<span>(.+)</span>', r'<div>\1</div>', str2)
print(t2)
```
- Exercise: crawl the first three pages of Qiushibaike, keeping only the text of each post.
```python
import requests
from fake_useragent import UserAgent
import re

with open('duanzi.txt', 'w', encoding='utf-8') as f:
    for i in range(1, 4):
        url = 'https://www.qiushibaike.com/text/page/{}/'.format(i)
        headers = {'User-Agent': UserAgent().chrome}
        resp = requests.get(url, headers=headers)
        html = resp.text
        infos = re.findall(r'<div class="content">\s<span>\s+(.+)', html)
        for info in infos:
            f.write('-' * 30 + '\n')
            f.write(info.replace(r'<br/>', '\n'))
            f.write('\n' + '-' * 30 + '\n')
```
3.2. Beautiful Soup
1. Introduction, Installation and Four Categories
Beautiful Soup provides some simple, Pythonic functions for navigating, searching and modifying the parse tree. It is a toolbox that extracts the data users need by parsing the document; because it is simple, a complete application does not require much code.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case Beautiful Soup cannot detect the encoding automatically and you only need to state the original encoding.

Beautiful Soup has become as excellent a Python parsing tool as lxml and html5lib, giving users flexible parsing strategies and good speed.
- Installation: Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. It has been ported to the bs4 package, which means we import bs4 when importing it.
```
pip install beautifulsoup4
pip install lxml
```
Beautiful Soup supports the HTML parser in the Python standard library as well as some third-party parsers. If we do not install one, Python uses its default parser. The lxml parser is more powerful and faster, and installing it is recommended.
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | 1. Python's built-in standard library 2. Moderate speed 3. Reasonable fault tolerance | Poor fault tolerance in versions before Python 2.7.3 and 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | 1. Fast 2. Good fault tolerance | Requires the C library to be installed |
lxml XML parser | BeautifulSoup(markup, ["lxml", "xml"]) BeautifulSoup(markup, "xml") | 1. Fast 2. The only parser that supports XML | Requires the C library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | 1. Best fault tolerance 2. Parses documents the way a browser does 3. Generates documents in HTML5 format 4. Does not depend on external extensions | Slow |
- Create a Beautiful Soup object:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, "lxml")
```
- Four kinds of objects:
  Beautiful Soup converts a complex HTML document into a complex tree structure in which each node is a Python object. All objects can be summarized into four types (see the sketch after this list):

  - Tag
  - NavigableString
  - BeautifulSoup
  - Comment (not often used)
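A quick sketch showing which of the four types each object is (the one-line HTML here is just a simplified stand-in for the sample used below):

```python
from bs4 import BeautifulSoup

html = "<div class='info'>Welcome to SXT<!--Useless--></div>"
soup = BeautifulSoup(html, 'lxml')

print(type(soup))                  # <class 'bs4.BeautifulSoup'>
print(type(soup.div))              # <class 'bs4.element.Tag'>
print(type(soup.div.contents[0]))  # <class 'bs4.element.NavigableString'>
print(type(soup.div.contents[1]))  # <class 'bs4.element.Comment'>
```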
2. Tag
- Popularly speaking, a Tag is just a tag in HTML, for example `<div>` or `<title>`.
- How to use it (take the following code as an example):

```html
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
```
- Get tags:
```python
# Parse with lxml
soup = BeautifulSoup(info, 'lxml')
print(soup.title)
# <title id="title">Ada Tam</title>
```
- Note: accessing a tag this way only returns the first tag that matches.
- Get properties:
```python
# Get all attributes
print(soup.title.attrs)       # {'id': 'title'}
# Get the value of a single attribute
print(soup.div.get('class'))  # ['info']
print(soup.div['class'])      # ['info']
print(soup.a['href'])         # www.bjsxt.cn
```
3. NavigableString: Getting Content
```python
print(soup.title.string)
print(soup.title.text)
# Ada Tam
```
4. BeautifulSoup
The BeautifulSoup object represents the entire content of a document. Most of the time it can be treated as a Tag object: it supports traversing the document tree and most of the search methods described for the document tree.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attrs attributes. But sometimes it is convenient to look at its .name attribute, so the BeautifulSoup object has a special .name attribute whose value is "[document]".
```python
print(soup.name)       # [document]
print(soup.head.name)  # head
```
5. Comment
- The Comment object is a special type of NavigableString. Its output does not include the comment delimiters, but if it is not handled properly it may cause unexpected trouble in our text processing.
```python
from bs4.element import Comment

if type(soup.strong.string) == Comment:
    print(soup.strong.prettify())
else:
    print(soup.strong.string)
```
6. Searching Document Tree
- Beautiful Soup defines many search methods; here we focus on two: find() and find_all(). The parameters and usage of the other methods are similar and can be inferred by analogy.
- Filters:
  Before introducing the find_all() method, let us first introduce the kinds of filters that run through the entire search API. Filters can be used on tag names, node attributes, strings, or combinations of these.
- Character string
The simplest filter is a string. When a string is passed to a search method, Beautiful Soup finds content that matches that string exactly. The following example finds all div tags in the document:
```python
# Returns all div tags
print(soup.find_all('div'))
```
If a byte string is passed in, Beautiful Soup assumes it is UTF-8 encoded; passing a Unicode string instead avoids possible parsing errors.
- regular expression
If a regular expression is passed in as a parameter, Beautiful Soup matches content against that regular expression.
```python
import re

# Returns all tags whose name starts with "div"
print(soup.find_all(re.compile("^div")))
```
- list
If a list parameter is passed in, Beautiful Soup returns the content that matches any element in the list.
```python
# Returns all matched span and a tags
print(soup.find_all(['span', 'a']))
```
- keyword
If a keyword argument does not correspond to one of the built-in parameter names, it is treated as a search on a tag attribute of that name. For example, if an id argument is passed, Beautiful Soup searches each tag's "id" attribute.
```python
# Returns the tags whose id is "welcom"
print(soup.find_all(id='welcom'))
```
- True
True matches any value. It finds all tags but returns no string nodes, as in the sketch below.
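A minimal sketch, reusing the soup object built from the sample HTML above:

```python
# True matches every tag, so this prints the name of each tag in the document
for tag in soup.find_all(True):
    print(tag.name)
```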
- Search by CSS
Searching for tags by CSS class name is very useful, but the keyword class, which identifies CSS class names, is a reserved word in Python, and using class as a parameter causes a syntax error. Since version 4.1.1 of Beautiful Soup, tags with a given CSS class can be searched with the class_ parameter.
```python
# Returns the divs whose class equals "info"
print(soup.find_all('div', class_='info'))
```
- Search by property
```python
soup.find_all("div", attrs={"class": "info"})
```
7. CSS selector (extension)
- soup.select(selector):
Expression | Explanation |
---|---|
tag | Select the specified tag |
* | Select all nodes |
#container | Select the node whose id is container |
.container | Select all nodes whose class contains container |
li a | Select all a nodes inside li nodes |
ul + p | (sibling) Select the first p element immediately after a ul |
div#container > ul | (parent-child) Select the ul elements that are direct children of the div whose id is container |
table ~ div | Select all div siblings that follow a table |
a[title] | Select all a elements that have a title attribute |
a[class="title"] | Select all a elements whose class attribute equals title |
a[href*="sxt"] | Select all a elements whose href attribute contains sxt |
a[href^="http"] | Select all a elements whose href attribute starts with http |
a[href$=".png"] | Select all a elements whose href attribute ends with .png |
input[type="radio"]:checked | Select all checked radio inputs |
8. Code examples
```python
# pip install bs4
# pip install lxml
from bs4 import BeautifulSoup
from bs4.element import Comment

str1 = '''
<title id='title'>Ada Tam</title>
<div class='info' float='left'>Welcome to SXT</div>
<div class='info' float='right'>
    <span>Good Good Study</span>
    <a href='www.bjsxt.cn'></a>
    <strong><!--Useless--></strong>
</div>
'''
soup = BeautifulSoup(str1, 'lxml')

print('-' * 30, 'Get tags', '-' * 30)
print(soup.title)
print(soup.span)
print(soup.div)

print('-' * 30, 'Get attributes', '-' * 30)
print(soup.div.attrs)
print(soup.div.get('class'))
print(soup.a['href'])

print('-' * 30, 'Get content', '-' * 30)
print(type(soup.title.string))
print(soup.title.text)
print(type(soup.strong.string))
print(soup.strong.text)
if type(soup.strong.string) == Comment:
    print('There is a comment!')
    print(soup.strong.prettify())

print('-' * 30, 'find_all()', '-' * 30)
print(soup.find_all('div'))
print(soup.find_all(id='title'))
print(soup.find_all(class_='info'))
print(soup.find_all(attrs={'float': 'right'}))

print('-' * 30, 'select()', '-' * 30)
print(soup.select('a'))
print(soup.select('#title'))
print(soup.select('.info'))
print(soup.select('div span'))
print(soup.select('div > span'))
```
3.3. Xpath
- You can install the Xpath Helper plug-in on Google Chrome.
1. Introduction and installation
- Beautiful Soup is already a very powerful library, but there are other popular parsing libraries, such as lxml, which uses XPath syntax and is also a very efficient way to parse. If you are not comfortable with Beautiful Soup, try XPath.
- Installation:
pip install lxml
2. Xpath grammar
- XPath is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
- Node relationships:
- Parent
- Children
- Sibling
- Ancestor
- Descendant
3. Selecting nodes:
- Commonly used path expressions:
Expression | describe |
---|---|
nodename | Select all child nodes of this node |
/ | Selection from the root node |
// | Select the nodes in the document from the current node that matches the selection, regardless of their location |
. | Select the current node |
.. | Select the parent of the current node |
@ | Select attributes |
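A small sketch of these expressions with lxml (the XML snippet is made up for illustration):

```python
from lxml import etree

xml = '<shop><book id="b1"><title>Python</title></book><book id="b2"><title>XPath</title></book></shop>'
root = etree.fromstring(xml)

print(root.xpath('//book'))          # all book nodes, wherever they are
print(root.xpath('//book/@id'))      # ['b1', 'b2'] - the id attribute of every book
print(root.xpath('//title/text()'))  # ['Python', 'XPath'] - the text of every title
print(root.xpath('//title/..'))      # the parent (book) of every title
```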
- Wildcards: XPath wildcards can be used to select unknown XML elements.
wildcard | describe | Give an example | Result |
---|---|---|---|
* | Match any element node | xpath('div/*') | Get all the child nodes under div |
@* | Match any attribute node | xpath('div[@*]') | Select all div nodes with attributes |
node() | Match any type of node | xpath('div/node()') | Get all nodes of any type under div |
- Select several paths: By using the "|" operator in the path expression, you can select several paths
Expression | Result |
---|---|
xpath('//div|//table') | Get all div and table nodes |
- Predicates: Predicates are embedded in square brackets to find a particular node or node containing a specified value
Expression | Result |
---|---|
xpath('/body/div[1]') | Select the first div node under body |
xpath('/body/div[last()]') | Select the last div node under body |
xpath('/body/div[last()-1]') | Select the second-to-last div node under body |
xpath('/body/div[position()<3]') | Select the first two div nodes under body |
xpath('/body/div[@class]') | Select div node with class attribute under body |
xpath('/body/div[@class="main"]') | Select div node whose class attribute is main under body |
xpath('/body/div[price>35.00]') | Selecting div nodes with price element greater than 35 under body |
- Xpath operator
operator | describe | Example | Return value |
---|---|---|---|
\| | Computes the union of two node sets | //book \| //cd | Returns a node set containing all book and cd elements |
+ | addition | 6 + 4 | 10 |
– | subtraction | 6 – 4 | 2 |
* | multiplication | 6 * 4 | 24 |
div | division | 8 div 4 | 2 |
= | Be equal to | price=9.80 | If price is 9.80, return true. If price is 9.90, return false. |
!= | Not equal to | price!=9.80 | If price is 9.90, return true. If price is 9.80, return false. |
< | less than | price<9.80 | If price is 9.00, return true. If price is 9.90, return false. |
<= | Less than or equal to | price<=9.80 | If price is 9.00, return true. If price is 9.90, return false. |
> | greater than | price>9.80 | If price is 9.90, return true. If price is 9.80, return false. |
>= | Greater than or equal to | price>=9.80 | If price is 9.90, return true. If price is 9.70, return false. |
or | or | price=9.80 or price=9.70 | If price is 9.80, return true. If price is 9.50, return false. |
and | and | price>9.00 and price<9.90 | If price is 9.80, return true. If price is 8.50, return false. |
mod | Calculate the remainder of division | 5 mod 2 | 1 |
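A sketch of predicates and operators combined (the XML is only illustrative):

```python
from lxml import etree

xml = '''<shop>
  <book><title>A</title><price>9.80</price></book>
  <book><title>B</title><price>35.50</price></book>
</shop>'''
root = etree.fromstring(xml)

# The union operator | combines two node sets
print(root.xpath('//title | //price'))
# Comparison operators work on element content converted to numbers
print(root.xpath('//book[price>9.00]/title/text()'))   # ['A', 'B']
print(root.xpath('//book[price>10.00]/title/text()'))  # ['B']
```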
4. Use
1. Examples:
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
</div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result)
```
First we import the etree module from lxml, initialize the HTML with etree.HTML, serialize it with etree.tostring and print the result.
This shows a very practical feature of lxml: it automatically fixes up HTML code. Note that the last li tag above is not closed; its closing tag was deliberately deleted. lxml inherits libxml2's behaviour and repairs the HTML automatically.
So the output is as follows:
```html
<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
</body></html>
```
It not only completes the li tag but also adds the html and body tags.

File reading:
In addition to reading strings directly, lxml also supports reading content from files. For example, we create a new file called hello.html with the following content:
```html
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
```
The parse method is used to read files:
```python
from lxml import etree

html = etree.parse('hello.html')
result = etree.tostring(html, pretty_print=True)
print(result)
```
The same results can also be obtained.
2. XPath uses:
- Get all <li> tags:

```python
from lxml import etree

html = etree.parse('hello.html')
print(type(html))
result = html.xpath('//li')
print(result)
print(len(result))
print(type(result))
print(type(result[0]))
```
- Operation results:

```
<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>
```
It can be seen that etree.parse returns an ElementTree. After calling xpath we get a list of five <li> elements, each of which is an Element.
- Get the class attribute of every <li> tag:

```python
result = html.xpath('//li/@class')
print(result)
```
- Operation results:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']
- Get the <a> tags under <li> whose href is link1.html:

```python
result = html.xpath('//li/a[@href="link1.html"]')
print(result)
```
- Operation results:
[<Element a at 0x10ffaae18>]
- Get all <span> tags under the <li> tags.
  Note: the following commented-out expression is not correct, because / only selects direct children and <span> is not a direct child of <li>; use a double slash instead.

```python
# result = html.xpath('//li/span')  # Wrong: / only selects direct children
result = html.xpath('//li//span')   # // selects descendants at any depth
print(result)
```
- Operation results:
[<Element span at 0x10d698e18>]
- Get the class attributes under the <li> tags (not of <li> itself):

```python
result = html.xpath('//li/a//@class')
print(result)
# Operation results
# ['bold']
```
- Get the href of the <a> under the last <li>:

```python
result = html.xpath('//li[last()]/a/@href')
print(result)
```
- Operation results:
['link5.html']
- Get the content of the <a> under the second-to-last <li>:

```python
result = html.xpath('//li[last()-1]/a')
print(result[0].text)
```
- Operation results:
fourth item
- Get the tag name of the element whose class is bold:

```python
result = html.xpath('//*[@class="bold"]')
print(result[0].tag)
```
- Operation results:
span
Select the nodes in the XML file:
- Element (element node)
- Attribute (attribute node)
- Text (text node)
- Concat (element node, element node)
- Comment (comment node)
- Root (root node)
5. Code examples
- On the Zongheng Chinese website (book.zongheng.com), crawl the first three pages of data, keeping only the book titles and authors.
```python
from lxml import etree
import requests
from fake_useragent import UserAgent

url = 'http://book.zongheng.com/store/c1/c0/b0/u0/p1/v9/s1/t0/u0/i1/ALL.html'
headers = {'User-Agent': UserAgent().chrome}
resp = requests.get(url, headers=headers)
html = resp.text
# Build the etree parsing object
e = etree.HTML(html)
# Titles
names = e.xpath('//div[@class="bookname"]/a/text()')
# Authors
authors = e.xpath('//div[@class="bookilnk"]/a[1]/text()')
# Mode 01: if a book has no author, the indexes will not correspond
for i in range(len(names)):
    print('{}:{}'.format(names[i], authors[i]))
# Mode 02: zip stops at the shorter of the two iterables
for n, a in zip(names, authors):
    print('{}:{}'.format(n, a))
```