A powerful data parsing tool for crawlers: XPath

Official account: Special House
Author: Peter
Editor: Peter

Hello, I'm Peter~

Previously, when parsing crawled data, I almost always used regular expressions through the re module. Regular expressions are indeed very powerful, but the expressions are cumbersome to write, often take several attempts to get right, and are relatively slow. I will write a dedicated article on Python regular expressions in the future.

This article describes how to get started quickly with another data parsing tool: Xpath.

Introduction to Xpath

XPath (XML Path) is a language for finding information in XML documents. XPath can be used to traverse elements and attributes in XML documents.

XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.

  • XPath is a query language
  • It looks for nodes in the tree structure of XML (Extensible Markup Language) and HTML documents
  • XPath is a language that "finds a person" by "address": it locates a node by the path that leads to it

Quick start website: https://www.w3schools.com/xml/default.asp

Xpath installation

Installation on macOS is very simple:

pip install lxml

On Linux, taking Ubuntu as an example:

sudo apt-get install python3-lxml

For installation on Windows, please search online yourself. There are certainly tutorials, but the process can be more troublesome.

How do you verify that the installation succeeded? If importing lxml on the command line reports no error, the installation was successful!
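The check described above can be sketched as a tiny script. Printing the version is only an extra convenience and is not part of the original article; the exact numbers depend on your installation.

```python
# Minimal installation check: if this import fails, lxml is not installed
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml version
print("lxml version:", etree.LXML_VERSION)
```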

Xpath parsing principle

  • Instantiate an etree object and load the page source to be parsed into it
  • Call the etree object's xpath() method with an XPath expression to locate tags and extract their content

How to instantiate etree objects?

  1. Load the source of a local HTML document into an etree object: etree.parse(filePath)
  2. Load source code obtained from the Internet into the object: etree.HTML(page_text), where page_text is the source code content we obtained
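Both ways of building the object can be tried side by side. This is a minimal sketch: the page text and the temporary file demo.html are made up for illustration, and passing etree.HTMLParser() to etree.parse is an addition here, so that the file does not need to be well-formed XML (the default parser used by etree.parse(filePath) is an XML parser).

```python
import os
import tempfile

from lxml import etree

# Hypothetical page source, standing in for downloaded HTML
page_text = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# Way 2: build the tree directly from a string of page source
tree_from_text = etree.HTML(page_text)

# Way 1: build the tree from a local file (written to a temp dir for this demo)
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "demo.html")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(page_text)
    # HTMLParser tolerates markup that is not strict XML
    tree_from_file = etree.parse(file_path, etree.HTMLParser())

# Both objects expose the same xpath() method
print(tree_from_text.xpath("//p/text()"))
print(tree_from_file.xpath("//p/text()"))
```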

Xpath usage

3 special symbols

  • /: starts parsing from the root node and matches a single level, locating step by step
  • //: matches across multiple levels, so intermediate levels can be skipped; at the start of an expression, it means locating from any position
  • .: a dot refers to the current node
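The three symbols can be compared on a small document. This is only a sketch; the HTML snippet and the class names outer and inner are invented for the example.

```python
from lxml import etree

# Hypothetical document: one <p> nested inside two divs, one <p> directly under <body>
html = """
<html><body>
  <div class="outer"><div class="inner"><p>first</p></div></div>
  <p>second</p>
</body></html>
"""
tree = etree.HTML(html)

# /  : one level at a time, so only the <p> that is a direct child of <body>
step_by_step = tree.xpath("/html/body/p/text()")

# // in the middle: any depth below <body>
any_depth = tree.xpath("/html/body//p/text()")

# // at the start: from any position in the document
anywhere = tree.xpath("//p/text()")

# .  : the current node, here the outer div selected first
outer = tree.xpath('//div[@class="outer"]')[0]
relative = outer.xpath("./div/p/text()")

print(step_by_step, any_depth, anywhere, relative)
```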

Common path expressions

The following are common XPath path expressions:

Examples
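As a rough illustration of a few standard path expressions (a tag name, @ for attributes, .. for the parent node, and * as a wildcard), here is a sketch on an invented snippet. The id box and the link values are assumptions made for the example.

```python
from lxml import etree

# Hypothetical snippet with two links inside one div
html = '<html><body><div id="box"><a href="/a1" title="one">A1</a><a href="/a2">A2</a></div></body></html>'
tree = etree.HTML(html)

links = tree.xpath("//a")                     # tag name: all <a> elements
hrefs = tree.xpath("//a/@href")               # @: select an attribute
titled = tree.xpath("//a[@title]/text()")     # predicate: only <a> that has a title attribute
parents = tree.xpath("//a/..")                # ..: the parent of each <a> (duplicates are merged)
children = tree.xpath('//div[@id="box"]/*')   # *: any child element

print(len(links), hrefs, titled, parents[0].tag, len(children))
```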

XPath operators

Operators can be used directly in XPath expressions:
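A short sketch of some standard XPath 1.0 operators (comparison, and, or, !=, and arithmetic) follows. The list items and the price attribute are hypothetical and exist only to give the operators something to compare.

```python
from lxml import etree

# Hypothetical list with a made-up numeric attribute
html = """
<ul>
  <li price="10">apple</li>
  <li price="25">pear</li>
  <li price="40">plum</li>
</ul>
"""
tree = etree.HTML(html)

cheap = tree.xpath("//li[@price < 20]/text()")                      # comparison
mid_price = tree.xpath("//li[@price > 20 and @price < 30]/text()")  # and
edges = tree.xpath("//li[@price = 10 or @price = 40]/text()")       # or
not_first = tree.xpath("//li[position() != 1]/text()")              # not equal
doubled = tree.xpath("count(//li) * 2")                             # arithmetic, returns a float

print(cheap, mid_price, edges, not_first, doubled)
```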

HTML Element

Because the data parsed with XPath is basically structured data made of HTML elements, here is some very basic HTML knowledge.

An HTML element is all the code from a start tag to its end tag. Basic syntax:

  • An HTML element starts with a start tag and ends with an end tag
  • The content of the element is whatever lies between the start tag and the end tag
  • Some HTML elements have empty content
  • Empty elements are closed in the start tag (they end where the start tag ends)
  • Most HTML elements can have attributes; lowercase attribute names are recommended

About empty elements: adding a slash in the start tag, such as <br />, is the correct way to close an empty element. HTML, XHTML and XML all accept this form.

Common attributes

attribute   value              description
class       classname          Specifies the class name of the element
id          id                 Specifies the unique id of the element
style       style_definition   Specifies the inline style of the element
title       text               Specifies additional information about the element (can be displayed as a tooltip)

HTML headings

There are 6 levels of headings in HTML.

Headings are defined with the tags <h1> to <h6>.

<h1> defines the largest heading, and <h6> defines the smallest.

Original data

Before using Xpath to parse the data, we need to import the data and instantiate an etree object at the same time:

# Import library
from lxml import etree

# Instantiate the parsing object
tree = etree.parse("test.html")
tree

The following is the original data to be parsed, test.html:

 <html lang="en">

<head>
	<meta charset="utf-8" />
	<title>Ancient poets and works</title>
</head>

<body>
	<div>
		<p>Poet's name</p>
	</div>
	<div class="name">
		<p>Li Bai</p>
		<p>Bai Juyi</p>
		<p>Li Qingzhao</p>
		<p>Du Fu</p>
		<p>Wang Anshi</p>
		<a href="http://wwww.tang.com" target="_self" title="Li Shimin">
			<span> this is span </span>
			The poems written by ancient poets are really great</a>
		<a class="du" href="">The bright moon in front of the bed is suspected to be frost on the ground</a>
		<img alt="" src="http://www.baidu.com/tang.jpg" />
	</div>
	<div class="tang">
		<ul>
			<li><a href="http://www.baidu.com" title="Baidu">Farewell to the White Emperor, between the clouds, thousands of miles of Jiangling will return in one day</a></li>
			<li><a href="http://www.sougou.com" title="Sogou">During the Qingming Festival, it rains in succession, and pedestrians on the road want to break their souls</a></li>
			<li><a alt="360" href="http://www.360.com">In the Qin Dynasty, the moon was bright and the Han Dynasty was closed, and the people on the long march had not returned</a></li>
			<li><a href="http://www.sina.com" title="Bing">A gentleman gives words to others, while a commoner gives money to others</a></li>
			<li><b>Su Shi</b></li>
			<li><i>Su Xun</i></li>
			<li><a href="http://www.google.cn" id="Google">Welcome to Google browser</a></li>
		</ul>
	</div>
</body>

</html>                   

Get the content of a single tag

For example, to get the content of the title tag, "Ancient poets and works":

title = tree.xpath("/html/head/title")
title

The result above shows that every XPath query returns a list.

If you want to get the text content inside the tag, use text():

# Extract the corresponding content from the list

title = tree.xpath("/html/head/title/text()")[0]  # index 0 takes the first element of the list
title
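To make the difference between the two queries concrete, here is a minimal sketch on a cut-down copy of the page head: without text() the list holds Element objects, and with text() it holds strings.

```python
from lxml import etree

# Cut-down stand-in for test.html containing only the title
html = "<html><head><title>Ancient poets and works</title></head><body></body></html>"
tree = etree.HTML(html)

elements = tree.xpath("/html/head/title")       # list of Element objects
texts = tree.xpath("/html/head/title/text()")   # list of strings

print(type(elements), len(elements))
print(elements[0].tag, elements[0].text)        # an Element also carries its tag name and text
print(texts[0])
```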

Get the content of multiple tags

For example, we want to get the div tags. There are three pairs of div tags in the original data, so the result is a list containing three elements. The expression can be written in three ways:

1. Use a single slash /: start locating from the root node html, one level at a time

2. Use a double slash // in the middle: skip the intermediate levels and match across multiple levels

3. Use a double slash // at the beginning: start from any position
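The three ways of writing the expression can be sketched as follows, using a shortened stand-in for test.html that keeps only the three div tags. The exact expressions are an assumption about what the lost screenshots showed; each one returns the same three elements.

```python
from lxml import etree

# Shortened stand-in for test.html: three div tags under body
html = """
<html><body>
  <div><p>Poet's name</p></div>
  <div class="name"><p>Li Bai</p></div>
  <div class="tang"><ul><li><b>Su Shi</b></li></ul></div>
</body></html>
"""
tree = etree.HTML(html)

by_single_slash = tree.xpath("/html/body/div")   # 1. level by level from the root
by_middle_double = tree.xpath("/html//div")      # 2. skip the intermediate level (body)
by_leading_double = tree.xpath("//div")          # 3. start from any position

print(len(by_single_slash), len(by_middle_double), len(by_leading_double))
```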

Attribute positioning

When locating by attribute, append [@attribute="value"] directly after the tag name:

name = tree.xpath('//div[@class="name"]')  # locate the div whose class attribute is name
name

Index positioning

Indexes in XPath start from 1, unlike Python indexes, which start from 0. For example, to locate all the p tags under the div tag whose class attribute is name: there are 5 pairs of p tags, so the result should contain 5 elements.

# Get all data

index = tree.xpath('//div[@class="name"]/p')
index

If we want to get the third p tag:

# Get a single specified item: the index starts at 1

index = tree.xpath('//div[@class="name"]/p[3]')  # XPath indexes start from 1
index

Get text content

The first method: text()

1. Get the text below a specific tag:

# 1. /: single level
class_text = tree.xpath('//div[@class="tang"]/ul/li/b/text()')
class_text
# 2. //: multiple levels
class_text = tree.xpath('//div[@class="tang"]//b/text()')
class_text

2. Get multiple text items under a tag

For example, to get the text of all the p tags:

# Get all the data

p_text = tree.xpath('//div[@class="name"]/p/text()')
p_text

For example, to get the text of the third p tag:

# Get the content of the third tag

p_text = tree.xpath('//div[@class="name"]/p[3]/text()')
p_text

Alternatively, first get the text of all the p tags, which returns a list, and then pick the item with a Python index. Note that the index is then 2, because Python indexing starts at 0.
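The two indexing styles can be compared directly. This is a sketch that reuses only the five p tags from test.html.

```python
from lxml import etree

# The five p tags from the "name" div of test.html
html = """
<div class="name">
  <p>Li Bai</p><p>Bai Juyi</p><p>Li Qingzhao</p><p>Du Fu</p><p>Wang Anshi</p>
</div>
"""
tree = etree.HTML(html)

# Index inside the XPath expression: counting starts at 1
third_by_xpath = tree.xpath('//div[@class="name"]/p[3]/text()')[0]

# Index on the returned Python list: counting starts at 0
all_p = tree.xpath('//div[@class="name"]/p/text()')
third_by_python = all_p[2]

print(third_by_xpath, third_by_python)
```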

Getting content that is not directly inside the tag (text nested at deeper levels):

Getting the direct content of the tag: the result is empty, because the li tags have no direct text content

If you want to get all the contents of the li tags, you can combine the a, b and i tags below them using the vertical bar |:

# Get the text of the a, b and i tags under the li tags at once, which is equivalent to all the content of the li tags
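A minimal sketch of the two cases, on a shortened version of the tang list: /text() only sees text sitting directly inside li, while //text() collects text at every level below it. The exact expressions are an assumption about what the original screenshots showed.

```python
from lxml import etree

# Shortened stand-in for the "tang" div: every li wraps its text in another tag
html = """
<div class="tang"><ul>
<li><a href="#">poem one</a></li>
<li><b>Su Shi</b></li>
<li><i>Su Xun</i></li>
</ul></div>
"""
tree = etree.HTML(html)

# Direct content: text nodes that are immediate children of li (none here)
direct = tree.xpath('//div[@class="tang"]//li/text()')

# Non-direct content: text nodes at any depth below li
all_levels = tree.xpath('//div[@class="tang"]//li//text()')

print(direct)
print(all_levels)
```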

abi_text = tree.xpath('//div[@class="tang"]//li/a/text() | //div[@class="tang"]//li/b/text() | //div[@class="tang"]//li/i/text()')
abi_text

Understanding direct and indirect

  • Direct: get the text content of the first level under the tag
  • Indirect: get the text content of all levels below the tag

Get attribute content

To get the value of an attribute, append @ plus the attribute name to the end of the expression:

1. Get the value of a single attribute

2. Get multiple values of an attribute
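Both cases can be sketched on a trimmed copy of test.html. Which attributes the original screenshots queried is not known, so the choice of src and href here is an assumption.

```python
from lxml import etree

# Trimmed stand-in for test.html: one img and two links
html = """
<div class="name"><img alt="" src="http://www.baidu.com/tang.jpg" /></div>
<div class="tang"><ul>
<li><a href="http://www.baidu.com" title="Baidu">one</a></li>
<li><a href="http://www.sougou.com" title="Sogou">two</a></li>
</ul></div>
"""
tree = etree.HTML(html)

# 1. A single attribute value: the result is still a list, with one item
single = tree.xpath('//div[@class="name"]/img/@src')

# 2. Multiple values: one href per matching a tag
multiple = tree.xpath('//div[@class="tang"]//li/a/@href')

print(single)
print(multiple)
```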

Hands-on practice

Use XPath to obtain the names and URL addresses of all of Gu Long's novels on a novel website. A brief introduction to Gu Long:

His real name was Xiong Yaohua, and his family came from Jiangxi. He graduated from Tamkang English College in Taiwan (the predecessor of Tamkang University), where he studied as an undergraduate. He is generally thought to have been inspired by Western novelists such as Hemingway, Jack London, and Steinbeck. (Gu Long himself said, "I like to steal tricks from modern Japanese and Western novels.") As a result, he kept renewing his style, caught up from behind, and opened up a new realm of martial arts novels.

Web page analysis

The information to crawl is on this page: https://www.kanunu8.com/zj/10867.html, which lists the names of all the novels.

By looking at the source code of the web page, we find that every name and URL address sits in an a tag inside the rows of a table.

Each tr tag contains three td tags, one per novel, and each td holds that novel's address and name.

When we click on a specific novel, such as "Peerless Double Pride", we go to that novel's chapter page.
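The row and cell layout described above can be sketched offline. The table fragment below is hypothetical (made-up novel names and paths, not the site's real markup); it only mirrors the stated structure of td cells that each hold one link.

```python
from lxml import etree

# Hypothetical fragment mirroring the described layout: tr > td > a
html = """
<table><tbody>
<tr>
<td><a href="/novel-a/">Novel A</a></td>
<td><a href="/novel-b/">Novel B</a></td>
<td><a href="/novel-c/">Novel C</a></td>
</tr>
<tr>
<td><a href="/novel-d/">Novel D</a></td>
<td></td>
</tr>
</tbody></table>
"""
tree = etree.HTML(html)

novels = []
for td in tree.xpath("//tbody/tr/td"):
    links = td.xpath("./a")        # relative query: only inside this cell
    if not links:
        continue                   # skip empty cells
    link = links[0]
    novels.append((link.text, link.get("href")))

print(novels)
```

Reading the name and the address from the same a element keeps each pair aligned, which two separate list queries only guarantee when every link has both text and an href.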

Get web source code

Send a request to the web page to get its source code:

import requests
from lxml import etree
import pandas as pd

url = 'https://www.kanunu8.com/zj/10867.html'
headers = {'user-agent': 'Request header'}  # placeholder: replace with a real browser User-Agent string

response = requests.get(url=url, headers=headers)
result = response.content.decode('gbk')   # the page is GBK-encoded, so decode the bytes with gbk
result

Extract the information

1. Get the link address of each novel

tree = etree.HTML(result)

href_list = tree.xpath('//tbody/tr//a/@href')  # get the value of the href attribute
href_list[:5]

2. Get the name of each novel

name_list = tree.xpath('//tbody/tr//a/text()')  # get all the text under the a tags
name_list[:5]

3. Generate data frame

# Build a DataFrame of the names and addresses of Gu Long's novels

gulong = pd.DataFrame({
    "name":name_list,
    "url":href_list 
})

gulong

4. Complete the URL addresses

In fact, the URL address of each novel has a prefix. For example, the complete address of Peerless Double Pride is https://www.kanunu8.com/book/4573/, while the data we obtained above is only the last part. Add the prefix to each URL address:

gulong['url'] = 'https://www.kanunu8.com/book' + gulong['url']  # add the common prefix
gulong

# Export as excel file
gulong.to_excel("gulong.xlsx",index=False)

Summary

Here is a summary of the use of XPath:

  • //: matches content that is not a direct child of the tag; it can cross multiple levels
  • /: matches only the direct content of the tag; it does not cross levels
  • If the index is inside the XPath expression, it starts from 1; if you get a list from the XPath expression and then index it in Python, the index starts from 0

Keywords: Python crawler

Added by Mercenary on Tue, 08 Feb 2022 14:38:08 +0200