Basic usage of xpath

Introduction to xpath

What exactly is xpath? Simply put, XPath is a language for finding information in XML documents

An XML document is a tree composed of a series of nodes. For example, here is a simple XML document:

<html>
    <body>
        <div>
            <p>Hello world<p>
            <a href="/home">Click here</a>
        </div>
    </body>
</html>

Common nodes in XML documents include:

  • Root node: html
  • Element nodes: html, body, div, p, a
  • Attribute node: href
  • Text nodes: Hello world, Click here

Common relationships among nodes in XML documents include:

  • Father and son: for example, < p > and < a > are child nodes of < div > and vice versa, also known as < div > are parent nodes of < p > and < a >.
  • Brothers: For example, < p > and < a > are called brothers
  • Ancestor / descendant: for example, < body >, < div >, < p >, < a > are descendant nodes of < HTML >, otherwise, < HTML > is ancestor node of < body >, < div >, < p >, < a >.

For web page parsing, xpath is more convenient and concise than re, so Python also provides the corresponding module - lxml.etree.

We can install it using the pip install lxml command

Ii. xpath usage

Before we formally start explaining how to use xpath, let's construct a simple XML document for testing.

In general crawler program, XML document is the source code of the web page crawled back.

>>> sc = '''
<html>
    <head>
        <meta charset="UTF-8"/>
        <link rel="stylesheet" href="style/base.css"/>
        <title>Example website</title>
    </head>
    <body>
        <div id="images" class="content">
            <a href="image1.html">Image1<img src="image1.jpg"/></a>
            <a href="image2.html">Image2<img src="image2.jpg"/></a>
            <a href="image3.html">Image3<img src="image3.jpg"/></a>
        </div>
    </body>
</html>
'''

1. Import module

from lxml import etree

2. Constructing Objects

html = etree.HTML(sc) # structure lxml.etree._Element object
# lxml.etree._Element Objects also have code completion
# If we get it XML Documents are not canonical documents, and the object will automatically fill in missing closed labels
# We can use it. tostring() Method converts objects into bytes Type string
# Reuse decode('utf-8') Methods bytes Type string conversion str Type string
print(etree.tostring(html).decode('utf-8'))

3. Matching data

We can use the xpath() method for matching

(1) xpath matching grammar

The xpath method accepts a string that satisfies the xpath matching syntax as a parameter

The following is a brief introduction to xpath matching grammar:

  • / Represents a descendant node, for example, / E represents an E element node in a child node under a matching root node

    test = html.xpath('/html/head/title')
  • // Represents descendant nodes, such as //E for E element nodes in descendant nodes under matching root nodes

     test = html.xpath('//a')
  • * Represents all nodes, e.g. E/* represents all nodes in a child node that matches an E element node

    test = html.xpath('/html/*')
  • text() represents a text node, such as E/text() represents a text node in a child node that matches an E element node.

    test = html.xpath('/html/head/title/text()')
  • @ ATTR denotes attribute nodes, such as E/@ATTR denotes ATTR attribute nodes in child nodes matching E element nodes.

    test = html.xpath('//a/@href')
  • Predicates are used to match the specified label

    • Specify the second < a > label

      test = html.xpath('//a[2]')
    • Specify the first two < a > labels

       test = html.xpath('//a[position()<=2]')
    • Specify < a > tags with href attributes

      test = html.xpath('//a[@href]')
    • Specify < a > label with href attribute and value of image1.html

      test = html.xpath('//a[@href="image1.html"]')
    • Specify < a > label with href attribute and image value

      test = html.xpath('//a[contains(@href,"image")]')

(2)_Element object

The xpath method returns a string or a list of matches, each of which is an lxml.etree._Element object.

Following is a brief introduction to the common attributes and methods of _Element objects:

First, we use xpath method to get the matching list tests as the test sample. Each item in the tests is a _Element object.

test = html.xpath('//a[@href="image1.html"]')
obj = test[0]
  • Tag returns the tag name
>>> obj.tag
'a'
  • attrib returns a dictionary of attributes and values
>>> obj.attrib
{'href': 'image1.html'}
  • get() returns the value of the specified attribute
>>> obj.get('href')
'image1.html'
  • Text returns the text value
>>> obj.text
'Image1'

Keywords: PHP xml Attribute Python pip

Added by bing_crosby on Tue, 06 Aug 2019 05:48:04 +0300