I. Introduction to XPath
What exactly is XPath? Simply put, XPath is a language for finding information in XML documents.
An XML document is a tree composed of a series of nodes. For example, here is a simple XML document:
<html>
  <body>
    <div>
      <p>Hello world</p>
      <a href="/home">Click here</a>
    </div>
  </body>
</html>
Common nodes in XML documents include:
- Root node: html
- Element nodes: html, body, div, p, a
- Attribute node: href
- Text nodes: Hello world, Click here
Common relationships among nodes in XML documents include:
- Parent and child: for example, <p> and <a> are child nodes of <div>; conversely, <div> is the parent node of <p> and <a>.
- Siblings: for example, <p> and <a> share the same parent, so they are sibling nodes.
- Ancestor and descendant: for example, <body>, <div>, <p>, and <a> are descendant nodes of <html>; conversely, <html> is an ancestor node of <body>, <div>, <p>, and <a>.
For parsing web pages, XPath is often more convenient and concise than regular expressions (the re module). In Python, XPath support is available through the third-party lxml library, in its lxml.etree module.
We can install it with the command pip install lxml.
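With lxml installed, the node relationships described earlier can be checked directly. The following is a minimal sketch, assuming the small example document from the introduction with its <p> tag closed properly; the variable names are only illustrative.

```python
from lxml import etree

# The small document from the introduction, with <p> closed properly.
doc = etree.fromstring(
    '<html><body><div>'
    '<p>Hello world</p>'
    '<a href="/home">Click here</a>'
    '</div></body></html>'
)

p = doc.find('.//p')
a = doc.find('.//a')

# Parent and child: <div> is the parent of <p> and <a>.
parent_tag = p.getparent().tag
children = [child.tag for child in p.getparent()]

# Siblings: <a> is the next sibling of <p>.
sibling_tag = p.getnext().tag

# Ancestors of <a>, nearest first, and descendants of <html> in document order.
ancestors = [node.tag for node in a.iterancestors()]
descendants = [node.tag for node in doc.iterdescendants()]

print(parent_tag, children, sibling_tag)
print(ancestors)
print(descendants)
```

Here etree.fromstring() is used because the snippet is well-formed XML; for real web pages the more forgiving HTML parser is usually the better choice.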
II. XPath usage
Before explaining how to use XPath, let's construct a simple document for testing.
In a typical crawler program, this document would be the source code of a web page that has been fetched.
sc = '''
<html>
  <head>
    <meta charset="UTF-8"/>
    <link rel="stylesheet" href="style/base.css"/>
    <title>Example website</title>
  </head>
  <body>
    <div id="images" class="content">
      <a href="image1.html">Image1<img src="image1.jpg"/></a>
      <a href="image2.html">Image2<img src="image2.jpg"/></a>
      <a href="image3.html">Image3<img src="image3.jpg"/></a>
    </div>
  </body>
</html>
'''
1. Import module
from lxml import etree
2. Constructing Objects
html = etree.HTML(sc)  # construct an lxml.etree._Element object
# lxml.etree._Element objects also work with code completion.
# If the document is not well formed, the parser automatically
# fills in the missing closing tags.
# The tostring() function serializes the object to a bytes string;
# decode('utf-8') then converts the bytes string to a str.
print(etree.tostring(html).decode('utf-8'))
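To see the automatic repair in action, here is a small sketch that feeds the parser a deliberately broken fragment. The fragment is made up for illustration; passing encoding='unicode' to tostring() is an alternative to calling decode() afterwards.

```python
from lxml import etree

# A deliberately broken fragment: <p> and <div> are never closed,
# and there is no <html> or <body> wrapper.
broken = '<div><p>Hello world<a href="/home">Click here</a>'

root = etree.HTML(broken)

# By default tostring() returns bytes.
as_bytes = etree.tostring(root)

# With encoding='unicode' it returns a str directly.
repaired = etree.tostring(root, encoding='unicode')

print(root.tag)
print(repaired)
```

The printed markup should contain the closing tags and the html/body wrapper that the input lacked, which is why XPath expressions such as /html/body/... work even on partial pages.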
3. Matching data
We can use the xpath() method for matching
(1) XPath matching syntax
The xpath() method takes a string containing a valid XPath expression as its argument.
The following is a brief introduction to XPath matching syntax:
- / selects child nodes. For example, /E matches E element nodes that are children of the root node.
  test = html.xpath('/html/head/title')
- // selects descendant nodes. For example, //E matches E element nodes anywhere among the descendants of the root node.
  test = html.xpath('//a')
- * matches any element node. For example, E/* matches all element nodes that are children of an E element node.
  test = html.xpath('/html/*')
- text() matches text nodes. For example, E/text() matches the text nodes that are children of an E element node.
  test = html.xpath('/html/head/title/text()')
- @ATTR matches attribute nodes. For example, E/@ATTR matches the ATTR attribute of an E element node.
  test = html.xpath('//a/@href')
- Predicates, written in square brackets, filter the matched nodes by a condition.
  - Select the <a> element that is the second <a> child of its parent
    test = html.xpath('//a[2]')
  - Select the first two <a> elements under each parent
    test = html.xpath('//a[position()<=2]')
  - Select <a> elements that have an href attribute
    test = html.xpath('//a[@href]')
  - Select <a> elements whose href attribute equals image1.html
    test = html.xpath('//a[@href="image1.html"]')
  - Select <a> elements whose href attribute contains the string image
    test = html.xpath('//a[contains(@href,"image")]')
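The expressions listed above can be tried side by side. The sketch below assumes a trimmed copy of the test document (the meta and link tags are left out for brevity); hrefs() is a hypothetical helper added only to make the results easier to read.

```python
from lxml import etree

# Trimmed copy of the test document (meta and link tags omitted).
sc = '''
<html>
  <head><title>Example website</title></head>
  <body>
    <div id="images" class="content">
      <a href="image1.html">Image1<img src="image1.jpg"/></a>
      <a href="image2.html">Image2<img src="image2.jpg"/></a>
      <a href="image3.html">Image3<img src="image3.jpg"/></a>
    </div>
  </body>
</html>
'''
html = etree.HTML(sc)


def hrefs(expr):
    """Hypothetical helper: href values of the elements matched by expr."""
    return [el.get('href') for el in html.xpath(expr)]


# Path expressions
top_level = [el.tag for el in html.xpath('/html/*')]
title_text = html.xpath('/html/head/title/text()')
all_hrefs = html.xpath('//a/@href')

# Predicates
second = hrefs('//a[2]')
first_two = hrefs('//a[position()<=2]')
with_href = hrefs('//a[@href]')
exact = hrefs('//a[@href="image1.html"]')
partial = hrefs('//a[contains(@href,"image")]')

print(top_level)    # ['head', 'body']
print(title_text)   # ['Example website']
print(all_hrefs)    # ['image1.html', 'image2.html', 'image3.html']
print(second)       # ['image2.html']
print(first_two)    # ['image1.html', 'image2.html']
print(exact)        # ['image1.html']
```

Note that positions in XPath start at 1, not 0, and that a positional predicate such as [2] is evaluated relative to each parent rather than over the whole document.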
(2) The _Element object
The xpath() method returns a list of matches. When the expression selects element nodes, each item is an lxml.etree._Element object; when it selects text or attribute nodes, each item is a string.
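The type of the result depends on the expression, which is easy to check. This is a small sketch on a made-up two-link document; the count() and string() calls are standard XPath 1.0 functions and are included only to show that not every result is a list.

```python
from lxml import etree

html = etree.HTML(
    '<html><head><title>Example website</title></head>'
    '<body><a href="image1.html">Image1</a>'
    '<a href="image2.html">Image2</a></body></html>'
)

elements = html.xpath('//a')           # list of _Element objects
texts = html.xpath('//a/text()')       # list of strings
attrs = html.xpath('//a/@href')        # list of strings
missing = html.xpath('//table')        # no match gives an empty list

count = html.xpath('count(//a)')       # XPath number, returned as a float
title = html.xpath('string(//title)')  # XPath string, returned as a str

print(type(elements[0]).__name__, texts, attrs, missing)
print(count, title)
```

Because a failed match yields an empty list rather than an error, it is worth checking the length of the result before indexing into it.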
The following is a brief introduction to the common attributes and methods of _Element objects.
First, we use the xpath() method to get a list of matches, test, as the test sample. Each item in test is an _Element object.
test = html.xpath('//a[@href="image1.html"]')
obj = test[0]
- tag returns the tag name
>>> obj.tag
'a'
- attrib returns a dictionary of attributes and values
>>> obj.attrib
{'href': 'image1.html'}
- get() returns the value of the specified attribute
>>> obj.get('href')
'image1.html'
- text returns the text content that appears before the element's first child
>>> obj.text
'Image1'
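The attributes and methods above can be combined in one short script. The sketch below uses a one-link fragment in place of the full test document; the relative expression ./img/@src and the 'n/a' default value are additions for illustration, not part of the original walkthrough.

```python
from lxml import etree

html = etree.HTML(
    '<div id="images">'
    '<a href="image1.html">Image1<img src="image1.jpg"/></a>'
    '</div>'
)
obj = html.xpath('//a[@href="image1.html"]')[0]

tag = obj.tag                       # 'a'
attributes = dict(obj.attrib)       # {'href': 'image1.html'}
href = obj.get('href')              # 'image1.html'
no_title = obj.get('title')         # None when the attribute is absent
fallback = obj.get('title', 'n/a')  # a default can be supplied instead
text = obj.text                     # 'Image1'

# An _Element also has its own xpath() method; a leading ./ makes the
# expression relative to this element instead of the document root.
image_sources = obj.xpath('./img/@src')

print(tag, attributes, href, no_title, fallback, text, image_sources)
```

Running XPath relative to an already matched element is a common pattern in crawlers: first select each record's container, then extract its fields with short relative expressions.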