Python 3 Web Crawler in Practice, Part 30: PyQuery

In the last section, we introduced Beautiful Soup, a very powerful web parsing library. But did you find some of its methods awkward to use? Did you find its CSS selector support less powerful than you would like?

If you work with the web, prefer CSS selectors, or know a little jQuery, there is a parsing library that may suit you better: PyQuery.

Next, let's see what PyQuery can do.

1. Preparations

Make sure that PyQuery is installed correctly before you start. If it is not installed, you can refer to the installation process in Chapter 1.

2. Initialization

Like Beautiful Soup, PyQuery is initialized by passing in an HTML data source, which produces an object to operate on. There are several ways to initialize it, such as passing in a string, a URL, or a file name. Let's go through them in detail.

String initialization

First, let's look at an example:

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

Operation results:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

Here we first import the PyQuery class under the alias pq, then declare a long HTML string and pass it to PyQuery as a parameter, which completes the initialization. Next we call the initialized object with a CSS selector. In this example we pass in li, which selects all li nodes, and then we print the HTML text of all of them.

URL initialization

The initialization parameter does not have to be a string. You can also pass in the URL of a web page by specifying the url parameter:

from pyquery import PyQuery as pq
doc = pq(url='http://www.segmentfault.com')
print(doc('title'))

Operation results:

<title>SegmentFault Think no</title>

In this case, PyQuery first requests the URL and then initializes itself with the returned HTML, which is equivalent to passing the source code of the web page to PyQuery as a string.

It is equivalent to the following:

from pyquery import PyQuery as pq
import requests
doc = pq(requests.get('http://www.segmentfault.com').text)
print(doc('title'))

File initialization

Besides a URL, you can also pass in a local file name by specifying the filename parameter:

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

This requires a local HTML file, demo.html, containing the HTML to be parsed. PyQuery first reads the content of the local file and then initializes itself with that content as a string.

All three initialization methods work, but the most commonly used one is passing in a string.

3. Basic CSS selector

Let's start with an example of how PyQuery's CSS selector works:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))
print(type(doc('#container .list li')))

Operation results:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>

After initializing the PyQuery object, we pass in the CSS selector #container .list li, which selects all li nodes inside the node whose class is list, which is itself inside the node whose id is container. Printing the result shows that the matching nodes were selected successfully.

Then we print out its type, and you can see that its type is still PyQuery.

4. Finding Nodes

Now let's introduce some common query functions. They are used in the same way as the corresponding functions in jQuery.

Child nodes

To find child nodes, use the find() method, passing in a CSS selector. Let's use the HTML above as an example:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Operation results:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

First, we select the node whose class is list, then call the find() method with a CSS selector to select the li nodes inside it, and finally print the results. As you can see, find() selects all matching nodes, and the result is of type PyQuery.

In fact, find() searches all descendant nodes of the node. If we only want the direct child nodes, we can use the children() method:

lis = items.children()
print(type(lis))
print(lis)

Operation results:

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

If we want to filter the child nodes, for example to keep only those whose class is active, we can pass a CSS selector to the children() method:

lis = items.children('.active')
print(lis)

Operation results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

You can see that the output has been filtered, leaving the node whose class is active.

Parent node

We can use the parent() method to get the parent node of a node. Let's look at an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()
print(type(container))
print(container)

Operation results:

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Here we first select the node whose class is list with the selector .list, and then call the parent() method to get its parent node, whose type is still PyQuery.

The parent node here is the direct parent only. parent() does not go on to the parent of the parent, that is, to the ancestor nodes.

What if we want the ancestor nodes? For that we can use the parents() method:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)

Operation results:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
 <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>

Here we call the parents() method, and we can see that there are two output results. One is the node whose class is wrap and the other is the node whose id is container. That is to say, the parents() method returns all the ancestor nodes.

If we want to filter the ancestor nodes, we can pass a CSS selector to the parents() method, which then returns only the ancestor nodes that match the selector:

parent = items.parents('.wrap')
print(parent)

Operation results:

<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>

You can see that the output is one less node, leaving only the node whose class is wrap.

Sibling nodes

So far we have covered child and parent nodes. There is one more kind, the sibling node. To get sibling nodes, use the siblings() method. Let's use the HTML above as an example:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings())

Here we first select, inside the node whose class is list, the node that has both the item-0 and active classes, which is the third li node. It clearly has four sibling nodes: the first, second, fourth and fifth li nodes.

Operation results:

<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

You can see that the result of the operation is exactly the four sibling nodes we just mentioned.

If we want to filter a sibling node, we can still pass in a CSS selector to the method, so that we can select eligible nodes from all siblings:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

Here we keep only the sibling nodes whose class is active. From the earlier result we know that the only sibling with the active class is the fourth li node, so the result should contain exactly one node.

Operation results:

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

5. Traversal

As we have just seen, a PyQuery selection may contain multiple nodes or a single node. Either way its type is PyQuery, and it does not return a list as Beautiful Soup does.

For a single node, we can print it directly or convert it to a string:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(str(li))

Operation results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

For a result with multiple nodes, we need to traverse it. For example, to traverse each li node here, we call the items() method:

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))

Operation results:

<class 'generator'>
<li class="item-0">first item</li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link2.html">second item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<class 'pyquery.pyquery.PyQuery'>

Calling items() returns a generator. Traversing it yields the li nodes one by one, each of which is also of type PyQuery, so each li node can in turn call the methods described above, for example to query child nodes or look for ancestor nodes. This is very flexible.

6. Getting information

After extracting a node, our ultimate goal is to extract the information it contains. The two most important kinds of information are attributes and text.

Getting attributes

After extracting a node of type PyQuery, we can call the attr() method to get its attributes:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))

Operation results:

<a href="link3.html"><span class="bold">third item</span></a> <class 'pyquery.pyquery.PyQuery'>
link3.html

Here we first select the a node inside the li node whose classes are item-0 and active. As you can see, its type is PyQuery.

Then we call the attr() method with the attribute name to get the attribute value.

Attributes can also be read through the attr property, as follows:

print(a.attr.href)

Result:

link3.html

The result is exactly the same. Here, instead of calling a method, we access the attr property and then the attribute name on it, which also gives us the attribute value.

What happens if we select multiple elements and then call the attr() method? Let's use an example to test:

a = doc('a')
print(a, type(a))
print(a.attr('href'))
print(a.attr.href)

Operation results:

<a href="link2.html">second item</a><a href="link3.html"><span class="bold">third item</span></a><a href="link4.html">fourth item</a><a href="link5.html">fifth item</a> <class 'pyquery.pyquery.PyQuery'>
link2.html
link2.html

One might expect four results, since four a nodes are selected, but attr() returns only the first one.

So when the selection contains multiple nodes, calling attr() only returns the attribute of the first node.

In this case, to get the attribute of every a node, we need the traversal described above:

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('a')
for item in a.items():
    print(item.attr('href'))

Operation results:

link2.html
link3.html
link4.html
link5.html

Therefore, when getting attributes, check whether the selection contains one node or several. If it contains several, traverse it to get the attribute of each node in turn.

Getting text

Another major operation after getting a node is to get its internal text. We can call the text() method to get:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

Operation results:

<a href="link3.html"><span class="bold">third item</span></a>
third item

Here we first select an a node and then call the text() method to get the text inside it. text() ignores all HTML tags inside the node and returns only the plain text content.

But if we want to get HTML text inside this node, we can use the html() method:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Here we select the third li node and call the html() method, which should return all HTML text within the li node.

Operation results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>

Here's the same question. If we select multiple nodes, what will text() or html() return?

Let's take an example to see:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li')
print(li.html())
print(li.text())
print(type(li.text()))

Operation results:

<a href="link2.html">second item</a>
second item third item fourth item fifth item
<class 'str'>

The result may be somewhat unexpected. We selected all li nodes, yet html() returned the inner HTML of only the first li node, while text() returned the plain text of all li nodes joined by spaces into a single string.

So note the difference: when the selection contains multiple nodes, getting the inner HTML of each node requires traversal, whereas text() needs no traversal because it collects the text of all nodes and merges it into one string.

7. Node operation

PyQuery provides a series of methods for modifying nodes dynamically, such as adding a class to a node or removing a node. These operations can sometimes make extracting information much easier.

Because there are too many node operation methods to cover them all, here are a few typical examples.

addClass and removeClass

Let's start with an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

First we select the third li node, then we call the removeClass() method to remove the active class of the li node, and then we call the addClass() method to add the class back. Every time we perform an operation, we print out the content of the current li node.

Operation results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

The three outputs show that in the second one the active class of the li node has been removed, and in the third one it has been added back.

So the addClass() and removeClass() methods can dynamically change the class attribute of a node.

attr, text and html

Besides manipulating the class attribute, you can use the attr() method to manipulate any attribute, and the text() and html() methods to change the content of a node.

Let's look at an example:

html = '''
<ul class="list">
     <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)

Here we first select the li node and then call attr() to modify an attribute. The first parameter is the attribute name and the second is the attribute value. Then we call text() and html() to change the content inside the node. After each of the three operations we print the current li node.

Operation results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link"><span>changed item</span></li>

After calling attr(), the li node gains a name attribute that did not exist before, with the value link. After calling text(), the content of the li node is replaced by the string passed in. After calling html() with HTML text, the content of the li node is replaced by that HTML.

To summarize: if attr() is given only the attribute name, it gets the attribute value; if it is also given a second parameter, it sets the attribute value. If text() and html() are called without parameters, they get the plain text and the HTML text of the node; if they are called with a parameter, they set it.

remove

The remove() method, as its name implies, removes nodes, which can sometimes make extracting information much easier. Let's look at an example:

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())

Here is a piece of HTML. Suppose we want to extract the string Hello, World without the text inside the p node. How can we do that?

First, let's try extracting the content of the node whose class is wrap and see whether it is what we want. The result is as follows:

Hello, World This is a paragraph.

However, this result also includes the content of the inner p node, because text() extracts all plain text. To exclude the text inside the p node, we could extract the p node's text separately and then strip that substring from the whole result, but this is clearly cumbersome.

This is where the remove() method comes in. We can continue like this:

wrap.find('p').remove()
print(wrap.text())

We first select the p node and call remove() to remove it, which leaves only Hello, World inside wrap. Then we extract it with the text() method.

So the remove() method can delete redundant content to make extraction easier. Used at the right moment, it can save a lot of work.

There are many more node operation methods, such as append(), empty() and prepend(). They are used in the same way as in jQuery. For details, see the official documentation: http://pyquery.readthedocs.io...

8. Pseudo-class selector

Another reason CSS selectors are so powerful is that they support a variety of pseudo-class selectors, for example selecting the first node, the last node, nodes at odd or even positions, or nodes containing a given text. Let's look at an example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')
print(li)
li = doc('li:nth-child(2n)')
print(li)
li = doc('li:contains(second)')
print(li)

Here we use pseudo-class selectors to select, in turn, the first li node, the last li node, the second li node, the li nodes after the third one, the li nodes at even positions, and the li node containing the text second. Note that :gt() and :contains() are jQuery-style extensions supported by PyQuery rather than standard CSS3 pseudo-classes.

9. Concluding remarks

So far, the common usage of PyQuery has been introduced.

Keywords: Attribute Python JQuery less

Added by prashanth on Tue, 30 Jul 2019 12:36:35 +0300