Python network data collection methods


Generally, any of the following four methods can extract the data you want, but they differ in complexity. Choose according to the situation.

◾ Regular expressions (re)

◾ bs4 (Beautiful Soup)

◾ XPath

◾ PyQuery

① re (regular expressions) is fast, efficient and accurate;

however, compared with the other methods it can be more complex and fiddly, and there are more symbol rules to learn.

② bs4 (Beautiful Soup) is the simplest to use, but its execution efficiency is not high. It works by locating HTML tags, attributes and attribute values.

③ XPath is easy to use and efficient.

XPath is a language for locating content in XML (and HTML) documents, and it is also a common way for crawlers to parse data.

④ PyQuery: if you are familiar with jQuery and don't want to memorize Beautiful Soup's set of calls, PyQuery is a good choice.

01 Use of regular expressions (re)

We generally use several common functions when parsing data:

re.search() re.findall()

One finds the first match and returns it; the other finds all values that satisfy the pattern. The parameters are the pattern and the string text.

•re.findall()

import re

str1 = '12s3asdfa'
match1 = re.findall("[0-9]", str1)
print(match1)

Run result: ['1', '2', '3'] — every character that matches the rule is returned.

re.match() re.search()

One matches only at the beginning of the string; the other searches for a match anywhere in the string.

•re.match

import re

str1 = '123asdfa'
match1 = re.match("^[0-9]", str1)
print(match1.group())

Run result: 1

•re.search

import re

str1 = '1a2s3asdfa'
match1 = re.search("^[0-9]", str1)
print(match1.group())

Run result: 1. search() scans the string from beginning to end until a match is found; group() retrieves the matched value.

The difference between re.search() and re.match():

re.search() scans the whole string for a match, while re.match() only matches at the beginning of the string; if the beginning does not satisfy the pattern, it returns None.
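A quick illustration of this difference (the string below is made up for demonstration):

import re

str1 = 'a123'
print(re.match("[0-9]", str1))           # None, because the string does not start with a digit
print(re.search("[0-9]", str1).group())  # 1, because search() scans the whole string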

re.compile() re.finditer() re.sub()

◾ re.compile(): precompile a pattern object for reuse; re.finditer(): return an iterator of match objects

◾ re.sub(): replace

re.sub(pattern, repl, string, count=0, flags=0)

import re

str1 = '12s3asdfa'
match1 = re.sub("[0-9]", '|', str1)
print(match1)

Run result: ||s|asdfa — every character matching the pattern is replaced with the replacement character.
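re.finditer() works like findall() but returns an iterator of match objects instead of a list; a minimal sketch with a made-up string:

import re

str1 = '12s3asdfa'
for m in re.finditer("[0-9]", str1):
    print(m.group())   # prints 1, 2 and 3, one per line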

The most commonly used way to obtain data is to combine compile() and findall():


html = res.text
p = re.compile(r'<div class="movie-item-info">.*?<a href="/films/.*?title="(.*?)"'
               r'.*?<p class="star">\s+(.*?)\s+</p>.*?<p class="releasetime">(.*?)</p>', re.S)
result = re.findall(p, html)

Pay attention to .*? and (.*?):

(.*?) captures the information we want to extract, while .*? skips over the information we don't need.

For example:

html="abcd<hello world>abcd" p=re.compile('ab.*?<(.*?)>') result=re.findall(p,html)

At this point, result contains ['hello world'].

Note that result is a list;

that is, we can capture multiple pieces of data (there can be several (.*?) groups in the template), and whatever part of the html matches those slots in the template will be saved.

The basic idea of this method is to use compile() to build a template, then use findall() to compare the template against the data we crawled and pull out what we need, namely the (.*?) groups we placed in the template ourselves.
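For instance, with more than one (.*?) group in the template, findall() returns a list of tuples, one tuple per match (the string below is invented for illustration):

import re

html = '<a title="Movie One">1994</a><a title="Movie Two">1993</a>'
p = re.compile('<a title="(.*?)">(.*?)</a>')
result = re.findall(p, html)
print(result)   # [('Movie One', '1994'), ('Movie Two', '1993')]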

Supplement: building the template

Open the page you are crawling and press F12 to view the page source code;

find the location of the information you want to crawl in the source code;

copy that part of the source code, observe the information you need, and use .*? and (.*?) appropriately to extract the information you want.

Case:

# Get the page source code with requests
# Then use re to extract the information you want
import csv
import requests
import re

# Disguise the request header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Target URL
url = "https://movie.douban.com/top250"
# Send the request and get the response
resp = requests.get(url, headers=headers)
page_content = resp.text
# print(page_content)

# Parse the data (the page text "人评价" means "people rated")
obj = re.compile(r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)'
                 r'</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span '
                 r'class="rating_num" property="v:average">(?P<score>.*?)</span>.*?'
                 r'<span>(?P<num>.*?)人评价</span>', re.S)

# Start matching and save locally
result = obj.finditer(page_content)
f = open("data.csv", mode="w", encoding="utf8")
csvwriter = csv.writer(f)
for it in result:
    # print(it.group("name"))
    # print(it.group("score"))
    # print(it.group("num"))
    # print(it.group("year").strip())
    dic = it.groupdict()
    dic['year'] = dic['year'].strip()
    csvwriter.writerow(dic.values())
f.close()
print("over!")

02 Use of bs4

Use pip to install

pip install beautifulsoup4

BeautifulSoup import:

from bs4 import BeautifulSoup

◾ The find method returns the first matching element (already parsed)

◾ The find_all method returns a list of matching elements

◾ The select method returns a list of matching elements

◾ To get an attribute: get('attribute name')

◾ To get text: get_text()

◾ Note that when using bs4 you need to specify a parser; lxml is the usual choice

# Create a Beautiful Soup object
# soup = BeautifulSoup(html)
# Create an object from a local HTML file
# soup = BeautifulSoup(open('index.html'))
soup = BeautifulSoup(html_str, 'lxml')   # the common way
# Pretty-print the contents of the soup object
result = soup.prettify()

Common methods are as follows:

find_all(name, attrs, recursive, text, **kwargs)
CSS selector
(1) Find by tag selector
print(soup.select('title'))
# [<title>The Dormouse's story</title>]

(2) Find by class selector
print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Find by id selector
print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(4) Find by hierarchy selector
print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(5) Find by attribute selector
print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

(6) Get text content with get_text()
soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
    print(title.get_text())

(7) Get an attribute with get('attribute name')
soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('a')))
print(soup.select('a')[0].get('href'))
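To tie these methods together, here is a minimal self-contained sketch; the HTML snippet is invented for illustration:

from bs4 import BeautifulSoup

html_str = '''
<div class="movie-item-info">
    <a href="/films/1" title="Movie One">Movie One</a>
    <p class="star">Actor A, Actor B</p>
</div>
'''
soup = BeautifulSoup(html_str, 'lxml')       # lxml as the parser
a = soup.find('a')                           # find returns the first matching tag
print(a.get('title'))                        # Movie One
print(a.get_text())                          # Movie One
for p in soup.find_all('p', class_='star'):  # find_all returns a list
    print(p.get_text())                      # Actor A, Actor B
print(soup.select('.star')[0].get_text())    # select takes a CSS selector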

03 Use of XPath

Generally used with lxml;

XPath is the most commonly used in practice because it is very general for data parsing: it can be used not only in Python but also in other programming languages.

Installation of the lxml library

pip install lxml

lxml import:

from lxml import etree

• lxml method to convert text into a parseable object: etree.HTML(text)

• lxml method to parse data: data.xpath("//div/text()")

Note that the results lxml extracts are always lists;

If the data is complex:

first extract the large nodes, then traverse the small nodes. etree.tostring(element) converts an Element object back into a string and returns a bytes result.

The returned Element objects can continue to be queried with the xpath method. Thanks to this, in later data extraction we can group by some tag first, and then extract the data within each group.

If a result is an Element object, it can keep using xpath methods;

note that the xpath syntax then becomes relative to that element: ./ (dot slash) indicates the current path.
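A minimal sketch of this "group first, then extract" idea (the HTML snippet below is made up for illustration):

from lxml import etree

text = '''
<div><ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-2"><a href="link2.html">second item</a></li>
</ul></div>
'''
html = etree.HTML(text)                 # convert the text into an Element
items = html.xpath('//li')              # extract the large nodes first
for li in items:                        # then traverse the small nodes
    title = li.xpath('./a/text()')      # ./ is relative to the current element
    href = li.xpath('./a/@href')
    print(title, href)                  # e.g. ['first item'] ['link1.html']
print(etree.tostring(items[0]))         # bytes result of one element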

xpath node relationships: root node, child node, parent node, sibling node, descendant node
xpath key syntax to select any node: //
xpath key syntax to select a node by attribute: [@attribute='value']
xpath syntax to get a node's attribute value: @attribute
xpath syntax to get a node's text value: text()
xpath tip: if you don't get the desired result, look at the parent element and keep moving up.

Attribute matching

The @ symbol can be used for attribute filtering when matching;

For example, to match the li nodes whose class attribute is item-5:

//li[@class="item-5"]
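A small runnable sketch of this (the HTML snippet is made up for illustration):

from lxml import etree

text = '''
<ul>
    <li class="item-4"><a href="link4.html">fourth item</a></li>
    <li class="item-5"><a href="link5.html">fifth item</a></li>
</ul>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="item-5"]/a/text()')
print(result)   # ['fifth item']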

Text acquisition

There are two methods: one is to locate the node that contains the text and then get its text directly, and the other is to use //text().

The second method also picks up special characters such as newlines that come from the source formatting; the first method is recommended so the results stay clean.

# First method
from lxml import etree

html_data = html.xpath('//li[@class="item-1"]/a/text()')
print(html_data)

# Second method
html_data = html.xpath('//li[@class="item-1"]//text()')
print(html_data)

Attribute acquisition

The @ symbol acts as a filter and can directly obtain the attribute value of a node.

result = html.xpath('//li/a/@href')
print(result)
# Run result: ['https://s1.bdstatic.com/', 'https://s2.

Attribute multi value matching

Some nodes may have multiple values for an attribute:

from lxml import etree

text = '''
<li class="zxc  asd  wer"><a href="https://s2.bdstatic.com/">1 item</a></li>
<li class="ddd  asd  eee"><a href="https://s3.bdstatic.com/">2 item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "asd")]/a/text()')
print(result)
# Run result: ['1 item', '2 item']

Multi attribute matching

When the current node has multiple attributes, you need to match them at the same time:

from lxml import etree

text = '''
<li class="zxc  asd  wer" name="222"><a href="https://s2.bdstatic.com/">1 item</a></li>
<li class="ddd  zxc  eee" name="111"><a href="https://s3.bdstatic.com/">2 item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "zxc") and @name="111"]/a/text()')
print(result)
# Run result: ['2 item']

04 Use of PyQuery

PyQuery is a web page parsing tool similar to jQuery. It is modeled on jQuery, and its syntax is almost the same as jQuery's.

It lets you traverse XML/HTML documents in jQuery style, and it uses lxml under the hood to manipulate HTML/XML documents.

Compared with the parsing libraries XPath and Beautiful Soup, it is more flexible and simpler, and it adds operations such as adding classes and removing nodes, which can be very convenient when extracting information.

Please install the pyquery library before using it;

Use the following terminal commands to install

pip install pyquery

Import after installation

from pyquery import PyQuery as pq

Initialization:

Like Beautiful Soup, when initializing PyQuery you also need to pass in HTML text to initialize a PyQuery object.

There are generally three ways to pass in content: a string, a URL, or an HTML file; passing in a string is the most common.

String initialization

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))

First, import the PyQuery class, aliased as pq. Then declare a long HTML string and pass it as a parameter to the PyQuery class; this completes the initialization.

Next, pass a CSS selector as a parameter to the initialized object. In this example we pass in 'li', which selects all li nodes.

URL initialization

Pass the URL as a parameter into the initialization object;

from pyquery import PyQuery as pq  
doc = pq('https://www.baidu.com', encoding='utf-8')
print(doc)
print(type(doc))
print(doc('title'))

Run the above code and you will find that we have successfully obtained Baidu's title node and web page information.

The PyQuery object will first request the URL, and then complete the initialization with the obtained HTML content, which is actually equivalent to passing the web page source code to the initialization object in the form of string.
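In other words, the following sketch (using requests here is my own assumption, consistent with the earlier example) is roughly equivalent:

import requests
from pyquery import PyQuery as pq

# Fetch the page yourself, then initialize PyQuery with the source code string
doc = pq(requests.get('https://www.baidu.com').text)
print(doc('title'))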

File initialization

Pass the local file name and specify the parameter as filename.

from pyquery import PyQuery as pq

doc = pq(filename='baidu.html')
print(doc)
print(type(doc))
print(doc('title'))

Generally speaking, there are two types of information we need to obtain from web pages:

One is text content, the other is node attribute value.

Get attributes

After obtaining a node of PyQuery type, you can obtain the attribute through the attr() method.

from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.list .item-0.active a')
print(a.attr('href'))

This first selects the a node inside the node whose class is item-0 active, which itself sits under the node whose class is list. The variable a is then of PyQuery type; calling the attr() method with 'href' returns the attribute value.

You can also access the attr attribute to get the same result as the code above:

print(a.attr.href)

We can also get the attributes of all a nodes, for example:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('a').items()
for item in a:
    print(item.attr('href'))

There is a place to pay attention to!

If the code says this:

from pyquery import PyQuery as pq

doc = pq(html)
a = doc('a')
print(a.attr('href'))

After running, you will find that you only get the href attribute of the first a node! This needs attention!

Extract text

The logic of extracting text is the same as that of extracting attributes: first obtain a PyQuery-type node, then call the text() method to get the text.

First, get the text content of a node:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.list .item-0.active a')
print(a.text())

After running, the text content of the a node is obtained. Next, get the text content of multiple li nodes:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
items = doc('li')
print(items.text())

The code gets the text content of all li nodes at once, separated by spaces.

If you want to get them one by one, you need to iterate with the items() generator:

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('li').items()
for item in items:
    print(item.text())

Node operation

PyQuery provides a series of methods to dynamically modify nodes;

for example, adding a class to a node or removing a node, which can sometimes be very convenient when extracting information.

•add_class and remove_class

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
            <li class="item-0">first item</li>
            <li class="item-1"><a href="link2.html">second item</a></li>
            <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
            <li class="item-1 active"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a></li>
        </ul>
    </div>
</div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list .item-0.active')
print(li)
li.remove_class('active')
print(li)
li.add_class('active')
print(li)

Run results:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

There are three output sections above: the first shows the li node as first obtained, the second shows it after the active class is removed, and the third shows it after the active class is added back.

This article has summarized and compared four methods of parsing crawled data; each has its own advantages and disadvantages.

Of course, in practical problems it is not the case that the more advanced the tool or method, the better; analyze each problem on its own terms~

These are the methods of network data collection ~ did you get them? If anything is still unclear, tell me in the comments; don't let questions pile up overnight. That's all for this chapter's sharing; see you in the next one.
