Crawler: three practical techniques for parsing HTML documents with BeautifulSoup

☞ Old Ape's Python blog directory: https://blog.csdn.net/LaoYuanPython

1. Introduction to BeautifulSoup

BeautifulSoup is a class for HTML parsing provided by the third-party Python module bs4. It can be thought of as an HTML parsing toolbox with good fault tolerance for malformed tags in HTML documents. lxml is an HTML text parser; BeautifulSoup needs a parser to be specified when an object is constructed, and lxml is the recommended one.

BeautifulSoup and lxml installation commands:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple bs4
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple lxml

Load BeautifulSoup:

from bs4 import BeautifulSoup

Common ways of parsing HTML documents with BeautifulSoup:

  1. Through the BeautifulSoup object you can access the element for a given tag name, and further access the tag's name, its attributes, and the content enclosed by the tag pair.
    Case:
from bs4 import BeautifulSoup
import urllib.request

def getURLinf(url):
    # fetch the URL and return the parsed soup plus the request and response objects
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
    req = urllib.request.Request(url=url, headers=header)
    resp = urllib.request.urlopen(req, timeout=5)
    html = resp.read().decode()

    soup = BeautifulSoup(html, 'lxml')
    return (soup, req, resp)

soup, req, resp = getURLinf(r'https://blog.csdn.net/LaoYuanPython/article/details/111303395')

print(soup.p)            # first p element
print(soup.link)         # first link element
print(soup.title)        # the title element
print(soup.link.attrs)   # all attributes of the first link element
print(soup.link['rel'])  # value of the rel attribute of the first link element
  2. Through a tag's contents attribute you can access all HTML elements nested directly under it; the elements for the sub-tags are placed in the list that contents points to.
    For example: print(soup.body.contents)
  3. You can access the parent, child, sibling and ancestor tags of a given tag;
  4. The strings attribute iterates over all text content while skipping the tags themselves;
  5. The find, find_all, find_parent, find_parents and similar methods search for tags that meet specific conditions;
  6. select locates specific tags through CSS selectors.
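As a quick illustration of the navigation features listed above, here is a minimal sketch on a made-up HTML fragment (the markup and ids below are invented for the example):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p id="p1">first <b>bold</b></p>
<p id="p2">second</p>
</body></html>"""
soup = BeautifulSoup(html, 'lxml')

# contents: the direct children of a tag, as a list
print(soup.body.contents)

# parent / sibling navigation
p1 = soup.find('p', id='p1')
print(p1.parent.name)                   # body
print(p1.find_next_sibling('p')['id'])  # p2

# strings: iterate over all text content, skipping the tags
print(list(soup.body.strings))

# find_all and select
print(len(soup.find_all('p')))          # 2
print(soup.select('p#p2')[0].string)    # second
```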

For details, please refer to the free column on the Old Ape blog < Crawlers: https://blog.csdn.net/laoyuanpython/category_9103810.html > or the paid column < Getting started with Python crawlers: https://blog.csdn.net/laoyuanpython/category_10762553.html >.

2. Some parsing techniques

In HTML parsing, the easiest cases are searching or selecting by a simple tag name, by a single tag attribute (such as id or class), or by text. In some cases, however, these conditions have to be combined.

2.1. Locating or searching through a combination of tag attributes

A single attribute sometimes matches several tags, so several attributes have to be combined to locate the one you need. For example:

<div id="article_content" class="article_content clearfix">
......
</div>
<div id="article_content" class="article_view">
......
</div>
<div id="article_view" class="article_view">
......
</div>

In the HTML text above there are multiple div tags whose id is article_content. If you use:

>>> text="""
<div id="article_content" class="article_content clearfix">
......
</div>
<div id="article_content" class="article_view">
......
</div>
<div id="article_view" class="article_view">
......
</div>"""
>>> s = BeautifulSoup(text,'lxml')
>>> s.select('div#article_content')
[<div class="article_content clearfix" id="article_content">......</div>, 
<div class="article_view" id="article_content">......</div>]
>>>  

Two records are returned. In this case, you can use the following statements for multi-attribute positioning:

>>>s.select('div#article_content[class="article_content clearfix"]')
[<div class="article_content clearfix" id="article_content">......</div>]
>>>s.select('div[id="article_content"][class="article_content clearfix"]')
[<div class="article_content clearfix" id="article_content">......</div>]
>>>s.find_all("div",id="article_content",class_='article_content clearfix')
[<div class="article_content clearfix" id="article_content">......</div>]
>>>s.find_all("div",attrs={"id":"article_content","class":"article_content clearfix"})
[<div class="article_content clearfix" id="article_content">......</div>]

The four methods above are equivalent. In select, an id can be written after a # sign, while in find_all the class attribute has to be written as class_ (or put into an attrs dictionary) to distinguish it from the Python keyword class, which is why the different spellings above exist. Note that in select each attribute condition must be enclosed in its own square brackets, and there must be no space between the bracketed conditions of different attributes (a space would turn it into a descendant selector).
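These points can be checked directly in a short sketch (the sample div markup repeats the fragment from earlier in this section, with placeholder content):

```python
from bs4 import BeautifulSoup

text = """
<div id="article_content" class="article_content clearfix">a</div>
<div id="article_content" class="article_view">b</div>
<div id="article_view" class="article_view">c</div>
"""
s = BeautifulSoup(text, 'lxml')

# CSS attribute selectors: each condition in its own brackets, no space between them
r1 = s.select('div[id="article_content"][class="article_content clearfix"]')

# find_all with an attrs dict avoids the class / class_ keyword clash
r2 = s.find_all('div', attrs={'id': 'article_content',
                              'class': 'article_content clearfix'})

# the CSS dot syntax matches the individual classes, so it also works here
r3 = s.select('div#article_content.article_content.clearfix')

print(len(r1), len(r2), len(r3))  # each matches only the first div
```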

2.2. Using tag relationships to locate content

Tag relationships include parent-child, sibling, ancestor and other relationships. Sometimes the content to be located cannot be pinned down on its own, but it can be identified uniquely with the help of related tags (mainly through parent-child and ancestor relationships).
Case:
This is part of the HTML about the blogger's personal information on a CSDN blog page:

<div class="data-info d-flex item-tiling">
               <dl class="text-center" title="1055">
                   <a href="https://blog.csdn.net/LaoYuanPython" data-report-click='{"mod":"1598321000_001","spm":"1001.2101.3001.4310"}' data-report-query="t=1">
                       <dt><span class="count">1055</span></dt>
                       <dd class="font">original</dd>
                   </a>
               </dl>
               <dl class="text-center" data-report-click='{"mod":"1598321000_002","spm":"1001.2101.3001.4311"}' title="22">
                   <a href="https://blog.csdn.net/rank/writing_rank" target="_blank">
                       <dt><span class="count">22</span></dt>
                       <dd class="font">Weekly ranking</dd>
                   </a>
               </dl>
           </div>

In the HTML above, suppose you want to extract the blogger's number of original articles and the weekly ranking. Both numbers sit in span tags that are exactly the same: same tag name, same attributes, same attribute values, so the spans themselves cannot be told apart. The only distinguishing information is the text of the dd tag that is a sibling of the span's parent dt tag. The approach is therefore: first narrow the search to the ancestor tag <div class="data-info d-flex item-tiling">, locate the dd tags by their class and distinguish them by their text ('original' vs. 'Weekly ranking'), step up to the enclosing a tag through parent, and finally read the value from the span with class count under that tag.

The example code is as follows:

>>> text="""
<div class="data-info d-flex item-tiling">
               <dl class="text-center" title="1055">
                   <a href="https://blog.csdn.net/LaoYuanPython" data-report-click='{"mod":"1598321000_001","spm":"1001.2101.3001.4310"}' data-report-query="t=1">
                       <dt><span class="count">1055</span></dt>
                       <dd class="font">original</dd>
                   </a>
               </dl>
               <dl class="text-center" data-report-click='{"mod":"1598321000_002","spm":"1001.2101.3001.4311"}' title="22">
                   <a href="https://blog.csdn.net/rank/writing_rank" target="_blank">
                       <dt><span class="count">22</span></dt>
                       <dd class="font">Weekly ranking</dd>
                   </a>
               </dl>
           </div>"""
>>> s = BeautifulSoup(text,'lxml')
>>> subSoup = s.select('[class="data-info d-flex item-tiling"] [class="font"]')
>>> for item in subSoup:
...     parent = item.parent
...     if item.string == 'original':
...         originalNum = int(parent.select('.count')[0].string)
...     elif item.string == 'Weekly ranking':
...         weekRank = int(parent.select('.count')[0].string)
...
>>> print(originalNum, weekRank)
1055 22
>>> 
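A more compact variant of the same idea (a sketch not taken from the original article) locates each dd tag directly by its text with the string argument of find, then walks back to the nearest preceding span in document order:

```python
from bs4 import BeautifulSoup

text = """
<div class="data-info d-flex item-tiling">
  <dl class="text-center" title="1055">
    <a href="https://blog.csdn.net/LaoYuanPython">
      <dt><span class="count">1055</span></dt>
      <dd class="font">original</dd>
    </a>
  </dl>
  <dl class="text-center" title="22">
    <a href="https://blog.csdn.net/rank/writing_rank">
      <dt><span class="count">22</span></dt>
      <dd class="font">Weekly ranking</dd>
    </a>
  </dl>
</div>"""
s = BeautifulSoup(text, 'lxml')

# find each dd by its text, then take the nearest preceding span.count in document order
originalNum = int(s.find('dd', string='original')
                   .find_previous('span', class_='count').string)
weekRank = int(s.find('dd', string='Weekly ranking')
                .find_previous('span', class_='count').string)
print(originalNum, weekRank)  # 1055 22
```

find_previous searches backwards through the document, so it keeps working even if the parser reshuffles the invalid dt/dd nesting.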

2.3. Removing program code before parsing to avoid interference

When parsing HTML, in most cases the goal is to analyse the useful tag information, but technical blog posts usually contain code, and that code can interfere with the analysis. For example, the code in this article contains some of the HTML being analysed; if the complete HTML of this article were fetched, those fragments would also turn up outside the code sections. To eliminate the influence of the code, it can be removed from the content before the analysis.

At present, the blog editors of most technical platforms mark up code explicitly; markdown-style editors, for example, wrap code in code tags. If another editor uses a different tag, you only need to confirm that tag name and handle it in a similar way to what is described below.
The processing steps are as follows:

  1. Obtain the HTML document;
  2. Build a BeautifulSoup object soup;
  3. Remove the code portions from the soup object with soup.code.extract() or soup.code.decompose(), repeating for every code tag. The difference between the two methods is that decompose() destroys the corresponding element outright, while extract() removes the element from the tree and returns it, so the removed element can still be processed separately.
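A minimal sketch of these steps (the sample HTML below is made up; real platforms may wrap code in pre/code or other tags):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p>The selector looks like this:</p>
<pre><code>s.select('div#article_content')</code></pre>
<p>Regular prose continues here.</p>
</body></html>"""

soup = BeautifulSoup(html, 'lxml')

# Remove every code block before analysing the rest of the page.
# decompose() destroys the tag in place; extract() would instead
# return the removed tag so it could be processed separately.
for code in soup.find_all('code'):
    code.decompose()

print(soup.get_text())  # the code text no longer appears
```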

For a worked example of this part, see < n-line Python code series: a four-line program that separates out the program code in an HTML document: https://blog.csdn.net/LaoYuanPython/article/details/114729045 > for a detailed introduction.

3. Summary

This article introduced three techniques for parsing HTML with BeautifulSoup: finding or locating tags through a combination of attributes, locating tags through tag relationships, and removing code tags from the HTML so the code does not disturb the parse.

Blogging is not easy, please show your support:

If you got something out of this article, please like, comment and bookmark it. Thank you for your support!

Old Ape's paid columns

  1. The paid column < Developing graphical interface Python applications with PyQt: https://blog.csdn.net/laoyuanpython/category_9607725.html > is a basic tutorial on PyQt GUI development with Python; the corresponding article directory is < Directory of the "Developing graphical interface Python applications with PyQt" column: https://blog.csdn.net/LaoYuanPython/article/details/107580932 >;
  2. The paid column < moviepy audio and video development: https://blog.csdn.net/laoyuanpython/category_10232926.html > introduces in detail the classes and methods for moviepy audio/video clip editing and compositing, and their use in related compositing scenarios; the corresponding article directory is < Directory of the moviepy audio and video development column: https://blog.csdn.net/LaoYuanPython/article/details/107574583 >;
  3. The paid column < OpenCV-Python beginners' difficult problems: https://blog.csdn.net/laoyuanpython/category_10581071.html > is the companion to < OpenCV-Python graphics and image processing: https://blog.csdn.net/laoyuanpython/category_9979286.html >. It collects the author's personal insights into problems encountered while studying OpenCV-Python graphics and image processing; the material is essentially the result of Old Ape's repeated research, and it helps OpenCV-Python beginners understand OpenCV more deeply. The corresponding article directory is < Directory of the OpenCV-Python beginners' difficult problems column: https://blog.csdn.net/LaoYuanPython/article/details/109713407 >;
  4. The paid column < Getting started with Python crawlers: https://blog.csdn.net/laoyuanpython/category_10762553.html > introduces crawler development from the perspective of a front-end development beginner, covering basic crawler knowledge as well as practical topics such as crawling CSDN article information and blogger information, liking articles, commenting, and so on.

The first two columns are suitable for beginner readers who have some Python foundation but no relevant domain knowledge. The third column should be studied and used in combination with < OpenCV-Python graphics and image processing: https://blog.csdn.net/laoyuanpython/category_9979286.html >.

For readers who lack a Python foundation, the free Old Ape column < Python basic tutorial directory: https://blog.csdn.net/laoyuanpython/category_9831699.html > can take you from zero to learning Python.

If you are interested and willing to support Old Ape, you are welcome to purchase the paid columns.

Learn Python with Old Ape!

☞ Go to the Old Ape Python blog directory: https://blog.csdn.net/LaoYuanPython

Keywords: Python Programming crawler

Added by erikjan on Tue, 08 Mar 2022 06:16:54 +0200