In the last section, we implemented a basic crawler and used regular expressions to extract page information. As we found, constructing a regular expression is tedious, and a single mistake can make the whole match fail, so extracting page information with regular expressions is inconvenient and error-prone.
Web page nodes can define id, class, and other attributes, and there are hierarchical relationships between nodes. In a web page, one or more nodes can be located with an XPath or CSS selector. So when parsing a page, we can use an XPath or CSS selector to extract a node and then call the corresponding method to get its text content or attributes. In this way, we can extract any information we want.
How do we do this in Python? Don't worry: there are many parsing libraries, among which the most powerful are lxml, Beautiful Soup, pyquery, and so on. In this chapter, we introduce the use of these three parsing libraries. With them, we no longer need to worry about regular expressions, and parsing efficiency improves greatly; they are essential tools for crawlers.
Use of XPath
XPath, whose full name is XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it also works well with HTML documents.
So when writing a crawler, we can use XPath to extract information. In this section, we introduce the basic usage of XPath.
1. Overview of XPath
XPath's selection functionality is very powerful. It provides very concise path selection expressions, as well as more than 100 built-in functions for strings, numeric values, time matching, node and sequence processing, and so on. Almost any node we want to locate can be selected with XPath.
XPath became a W3C standard on November 16, 1999. It was designed for use by XSLT, XPointer, and other XML processing software. More documentation is available on its official website: https://www.w3.org/TR/xpath/.
2. Common XPath rules
Let's enumerate some common rules in the table.
Expression | Description |
---|---|
nodename | Selects all child nodes of this node |
/ | Selects direct children from the current node |
// | Selects descendants from the current node |
. | Selects the current node |
.. | Selects the parent of the current node |
@ | Selects attributes |
These are the common matching rules of XPath: / selects a direct child node, // selects all descendant nodes, . selects the current node, .. selects the parent of the current node, and @ adds an attribute restriction to select the specific nodes matching a given attribute.
For example:
```
//title[@lang='eng']
```
This is an XPath rule that selects all nodes named title whose lang attribute has the value eng.
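As a quick illustration, here is a minimal, self-contained sketch of this rule in action. It anticipates the lxml usage introduced later in this section, and the bookstore snippet is invented purely for demonstration:

```python
from lxml import etree

# A made-up XML snippet with two title nodes carrying different lang attributes
text = '''
<bookstore>
  <title lang="eng">Harry Potter</title>
  <title lang="zh">XPath Tutorial</title>
</bookstore>
'''
root = etree.XML(text)

# //title[@lang='eng'] selects every title node whose lang attribute is eng
result = root.xpath("//title[@lang='eng']")
print([node.text for node in result])  # ['Harry Potter']
```

Only the first title node matches, because only its lang attribute equals eng.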
In the following sections, we introduce the detailed usage of XPath, using Python's lxml library to parse HTML with XPath.
3. Preparations
Before starting, make sure the lxml library is installed. If it is not, refer to the installation instructions in Chapter 1.
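As a quick sanity check (assuming lxml is already installed), you can verify that the library imports correctly and inspect its version:

```python
from lxml import etree

# LXML_VERSION is a tuple such as (4, 9, 3, 0)
print(etree.LXML_VERSION)
```

If this import fails, lxml is not installed in the current environment.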
4. Instance introduction
Now let's use an example to get a feel for parsing a web page with XPath. The code is as follows:
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))
```
Here we first import the etree module of the lxml library, then declare a piece of HTML text and pass it to the HTML class for initialization, which successfully constructs an XPath parsing object. Note that the last li node in the HTML text is not closed, but the etree module automatically corrects the HTML text.
Here we call the tostring() method to output the corrected HTML code, but the result is of type bytes, so we use the decode() method to convert it to str. The result is as follows:
```html
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>
```
We can see that after processing, the unclosed li tag is completed, and body and html nodes are added automatically.
In addition, we can read a text file directly and parse it, as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
```
The content of test.html is the HTML code in the above example, which is as follows:
```html
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
```
This time the output is slightly different, with an additional DOCTYPE declaration, but this has no effect on parsing. The result is as follows:
```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div></body></html>
```
5. All nodes
We usually start an XPath rule with // to select all nodes that meet the requirements. Taking the previous HTML text as an example, if we want to select all nodes, we can do this:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)
```
The result is as follows:
```
[<Element html at 0x10510d9c8>, <Element body at 0x10510da08>, <Element div at 0x10510da48>, <Element ul at 0x10510da88>, <Element li at 0x10510dac8>, <Element a at 0x10510db48>, <Element li at 0x10510db88>, <Element a at 0x10510dbc8>, <Element li at 0x10510dc08>, <Element a at 0x10510db08>, <Element li at 0x10510dc48>, <Element a at 0x10510dc88>, <Element li at 0x10510dcc8>, <Element a at 0x10510dd08>]
```
Here * matches all nodes, so every node in the entire HTML text is retrieved. You can see that the return value is a list in which each element is of the Element type, followed by the node name, such as html, body, div, ul, li, a, and so on; all nodes are included in the list.
Of course, the match can also specify a node name. If we want to get all li nodes, the example is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li')
print(result)
print(result[0])
```
To select all li nodes, we use // followed directly by the node name, and then call the xpath() method to extract them.
The result is as follows:
```
[<Element li at 0x105849208>, <Element li at 0x105849248>, <Element li at 0x105849288>, <Element li at 0x1058492c8>, <Element li at 0x105849308>]
<Element li at 0x105849208>
```
Here we can see that the extraction result is a list whose elements are Element objects. To extract one of them, index it with square brackets, for example [0].
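For instance, here is a small sketch (with an inline HTML snippet standing in for test.html) of how an individual Element can be inspected once extracted:

```python
from lxml import etree

text = '<ul><li class="item-0"><a href="link1.html">first item</a></li></ul>'
html = etree.HTML(text)

result = html.xpath('//li')
li = result[0]           # index into the result list with square brackets
print(li.tag)            # li
print(li.get('class'))   # item-0
```

Each Element exposes its tag name via .tag and its attributes via .get().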
6. Child nodes
We can find the child or descendant nodes of an element with / or //. Suppose we now want to select all the direct a children of the li nodes:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
```
Here we append /a to select all the direct a children of all li nodes: //li selects all li nodes, and /a selects their direct a children. Combining the two gives all the direct a children of all li nodes.
The result is as follows:
```
[<Element a at 0x106ee8688>, <Element a at 0x106ee86c8>, <Element a at 0x106ee8708>, <Element a at 0x106ee8748>, <Element a at 0x106ee8788>]
```
But if we want to get all the descendant nodes, we should use //. For example, to get all descendant a nodes under the ul node, we can do this:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
```
The result is the same.
But if we use //ul/a here, we get no results, because / selects direct children only, and the ul node has no direct a children, only li nodes, so nothing matches. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul/a')
print(result)
```
The result is as follows:
```
[]
```
So we have to pay attention to the difference between / and //: / gets direct children, while // gets descendants.
7. Parent node
We know that child or descendant nodes can be found with successive / or //, but how do we find the parent of a node we know? We can use .. to get the parent node.
For example, we first select the a node whose href is link4.html, then get its parent node, and then get that node's class attribute. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
The result is as follows:
```
['item-1']
```
Checking the result, it is indeed the class of the target li node, so getting the parent node succeeded.
We can also get the parent node through parent::. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
```
8. Attribute Matching
We can also use the @ symbol to filter by attribute when selecting. For example, to select the li nodes whose class is item-0, we can do this:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)
```
Here we restrict the node's class attribute to item-0 by adding [@class="item-0"]. There are two qualifying li nodes in the HTML text, so the result should contain two matched elements, as follows:
```
[<Element li at 0x10a399288>, <Element li at 0x10a3992c8>]
```
As you can see, there are exactly two matches. We will verify later whether they are the right ones.
9. Text Acquisition
We can get the text inside a node with the XPath text() method. Let's try to get the text in the li nodes above. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)
```
The result is as follows:
```
['\n     ']
```
Strangely, we did not get any text, only a newline character. Why? Because text() here is preceded by /, which selects direct child nodes only. The direct children of li are all a nodes, and the text sits inside those a nodes. So the match here is the newline inside the corrected li node, produced when the closing tag of the last li node was automatically added.
The two selected nodes are:
```html
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>
```
For one of these nodes, the closing li tag was added during the automatic correction, so the only text extracted is the newline between the closing tag of the a node and the closing tag of the li node.
So, if we want to get the text inside the li nodes, there are two ways: one is to select the a nodes first and then get their text; the other is to use //. Let's look at the difference between the two.
First, we select the a nodes and then get the text. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
```
The result is as follows:
```
['first item', 'fifth item']
```
You can see that two values are returned, whose content is the text of the li nodes with class item-0, which also confirms that the earlier attribute-matching result was correct.
Here we select the li nodes layer by layer, then use / to select their direct child a nodes, and then select the text. The result exactly matches our expectation.
Now let's look at the other way, selecting with //. The code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]//text()')
print(result)
```
The result is as follows:
```
['first item', 'fifth item', '\n     ']
```
Unexpectedly, three results are returned. Clearly, the text of all descendant nodes is selected here: the first two entries are the text inside the a children of the li nodes, and the third is the newline inside the last li node.
So, if we want all the text inside the descendants of a node, // plus text() gets the most comprehensive text, but it may mix in special characters such as newlines. If we want the text under a specific descendant, it is better to select that descendant first and then call text(), which keeps the result clean.
10. Attribute acquisition
We know that text() gets the text inside a node, so how do we get node attributes? We can use the @ symbol. For example, to get the href attributes of all a nodes under all li nodes, the code is as follows:
```python
from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
```
Here we get the nodes' href attribute with @href. Note that this differs from attribute matching: attribute matching uses square brackets with an attribute name and value to restrict an attribute, such as [@href="link1.html"], while @href here retrieves that attribute of the node. The two must be distinguished.
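To make the contrast concrete, here is a small sketch (with an invented two-link snippet) showing the two uses of @ side by side:

```python
from lxml import etree

text = '<div><a href="link1.html">first</a><a href="link2.html">second</a></div>'
html = etree.HTML(text)

# Attribute matching: [@href="..."] filters nodes by an attribute value
nodes = html.xpath('//a[@href="link1.html"]')
print([n.text for n in nodes])  # ['first']

# Attribute acquisition: /@href extracts the attribute values themselves
hrefs = html.xpath('//a/@href')
print(hrefs)                    # ['link1.html', 'link2.html']
```

The first query returns Element objects; the second returns plain strings.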
The result is as follows:
```
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
```
We have successfully retrieved the href attribute of the a node under each li node, returned as a list.
11. Attribute multi-value matching
Sometimes an attribute of a node may have multiple values, as in the following example:
```python
from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
```
Here the li node's class attribute has two values, li and li-first. If we try to match it with the previous attribute-matching approach, it fails. The result is:
```
[]
```
In this case, when the attribute has multiple values, we need the contains() function. The code can be rewritten as follows:
```python
from lxml import etree

text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
```
Here we pass the attribute name as the first parameter of contains() and the attribute value as the second, so the match succeeds as long as the attribute contains the given value.
The result is as follows:
```
['first item']
```
This selection method is commonly used when a node attribute has multiple values, as the class attribute of a node often does.
12. Multiple Attribute Matching
In addition, we may need to identify a node by multiple attributes, which requires matching several attributes at once. We can use the and operator to connect the conditions. An example is as follows:
```python
from lxml import etree

text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
```
Here the li node in the HTML text has an additional name attribute. To select by both class and name at once, we connect the two conditions with the and operator inside the square brackets. The result is as follows:
```
['first item']
```
Here and is actually an XPath operator; there are many others, such as or, mod, and so on, summarized below:
Operator | Description | Example | Return value |
---|---|---|---|
or | or | price=9.80 or price=9.70 | true if price is 9.80; false if price is 9.50 |
and | and | price>9.00 and price<9.90 | true if price is 9.80; false if price is 8.50 |
mod | remainder of division | 5 mod 2 | 1 |
\| | union of two node sets | //book \| //cd | a node set containing all book and cd elements |
+ | addition | 6 + 4 | 10 |
- | subtraction | 6 - 4 | 2 |
* | multiplication | 6 * 4 | 24 |
div | division | 8 div 4 | 2 |
= | equal to | price=9.80 | true if price is 9.80; false if price is 9.90 |
!= | not equal to | price!=9.80 | true if price is 9.90; false if price is 9.80 |
< | less than | price<9.80 | true if price is 9.00; false if price is 9.90 |
<= | less than or equal to | price<=9.80 | true if price is 9.00; false if price is 9.90 |
> | greater than | price>9.80 | true if price is 9.90; false if price is 9.80 |
>= | greater than or equal to | price>=9.80 | true if price is 9.90; false if price is 9.70 |
Reference sources for this table: http://www.w3school.com.cn/xp....
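For example, here is a short sketch (with an invented list snippet) of the or and mod operators inside predicates:

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0">first</li>
    <li class="item-1">second</li>
    <li class="item-2">third</li>
</ul>
'''
html = etree.HTML(text)

# or: match li nodes whose class is either item-0 or item-2
result = html.xpath('//li[@class="item-0" or @class="item-2"]/text()')
print(result)  # ['first', 'third']

# mod: select the li nodes at odd positions (1 and 3)
result = html.xpath('//li[position() mod 2 = 1]/text()')
print(result)  # ['first', 'third']
```

Both predicates happen to select the same two nodes here, by attribute value in the first case and by position in the second.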
13. Selecting by order
Sometimes an expression may match several nodes at once, but we only want one of them, such as the second node or the last node. What should we do then?
In this case, we can pass an index in square brackets to get a node at a particular position, as follows:
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)
```
In the first selection, we select the first li node by passing the number 1 in square brackets. Note that unlike Python list indexing, the ordinal here starts at 1, not 0.
In the second selection, we select the last li node by passing last() in square brackets, so the last li node is returned.
In the third selection, we select the li nodes at positions less than 3, that is, positions 1 and 2, so the result is the first two li nodes.
In the fourth selection, we select the third-to-last li node by passing last()-2 in square brackets: since last() is the last one, last()-2 is the third from the end.
The results are as follows:
```
['first item']
['fifth item']
['first item', 'second item']
['third item']
```
Here we used functions such as last() and position(). XPath provides more than 100 functions, covering access, numeric, string, logic, node, and sequence processing. For details, see: http://www.w3school.com.cn/xp....
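As a taste of these functions, here is a small sketch (snippet invented) using normalize-space(), starts-with(), and count():

```python
from lxml import etree

text = '<div><a href="link.html">   hello   world   </a></div>'
html = etree.HTML(text)

# normalize-space() trims and collapses whitespace in the matched text
normalized = html.xpath('normalize-space(//a/text())')
print(normalized)  # hello world

# starts-with() tests whether a string begins with a given prefix
hrefs = html.xpath('//a[starts-with(@href, "link")]/@href')
print(hrefs)  # ['link.html']

# count() returns the number of matched nodes as a float
n = html.xpath('count(//a)')
print(n)  # 1.0
```

Note that string and numeric functions at the top level of an expression return str and float values rather than node lists.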
14. Node axis selection
XPath provides many node-axis selection methods, called XPath Axes in English, including getting child elements, sibling elements, parent elements, ancestor elements, and so on. In some situations they make node selection very convenient. Let's try an example:
```python
from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')
print(result)
```
The result is as follows:
```
[<Element html at 0x107941808>, <Element body at 0x1079418c8>, <Element div at 0x107941908>, <Element ul at 0x107941948>]
[<Element div at 0x107941908>]
['item-0']
[<Element a at 0x1079418c8>]
[<Element span at 0x107941948>]
[<Element a at 0x1079418c8>]
[<Element li at 0x107941948>, <Element li at 0x107941988>, <Element li at 0x1079419c8>, <Element li at 0x107941a08>]
```
In the first selection, we call the ancestor axis to get all ancestor nodes. The axis is followed by two colons and then a node selector; here we use *, which matches all nodes, so the result is all ancestors of the first li node: html, body, div, and ul.
In the second selection, we add a qualifier, writing div after the two colons, so only the div ancestor node is returned.
In the third selection, we call the attribute axis to get all attribute values. The selector after it is *, which means getting all attributes of the node, so the return value is all attribute values of the li node.
In the fourth selection, we call the child axis to get all direct child nodes, with the added restriction of selecting the a node whose href attribute is link1.html.
In the fifth selection, we call the descendant axis to get all descendant nodes, with the added restriction of getting span nodes, so only the span node is returned, not the a node.
In the sixth selection, we call the following axis to get all nodes after the current node. Although we match with *, we also add an index selection, so only the second following node is obtained.
In the seventh selection, we call the following-sibling axis to get all siblings after the current node. Here we match with *, so all subsequent sibling nodes are obtained.
These are the simple uses of XPath axes.
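Other axes follow the same pattern. For instance, here is a sketch (snippet invented) of the preceding-sibling and self axes:

```python
from lxml import etree

text = '''
<ul>
    <li class="item-0">first</li>
    <li class="item-1">second</li>
    <li class="item-2">third</li>
</ul>
'''
html = etree.HTML(text)

# preceding-sibling:: selects the siblings before the current node
siblings = html.xpath('//li[@class="item-2"]/preceding-sibling::*')
print([li.get('class') for li in siblings])  # ['item-0', 'item-1']

# self:: refers to the current node itself
texts = html.xpath('//li/self::*[@class="item-1"]/text()')
print(texts)  # ['second']
```

Note that lxml returns node sets in document order, so the preceding siblings come back first-to-last.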
15. Concluding remarks
So far, we have covered essentially all the XPath selectors we are likely to use. XPath is very powerful and has many built-in functions; once mastered, it can greatly improve the efficiency of extracting HTML information.