A detailed guide to the Beautiful Soup library

Installation

pip install lxml
pip install beautifulsoup4

Verify installation

In [1]: from bs4 import BeautifulSoup

In [2]: soup = BeautifulSoup('<p>Hello</p>', 'lxml')

In [3]: print(soup.p.string)
Hello
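
If you would rather check from a script than from an interactive session, a minimal sketch such as the following reports whether the two packages installed above can be imported. The status dictionary is a name chosen for this illustration only.

```python
from importlib.util import find_spec

# find_spec returns None when a top-level package cannot be found,
# so this check never raises if lxml or bs4 is missing.
status = {pkg: find_spec(pkg) is not None for pkg in ('bs4', 'lxml')}

for pkg, ok in status.items():
    print(pkg, 'installed' if ok else 'missing')
```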

About Beautiful Soup

Parsers supported by Beautiful Soup

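Beautiful Soup works with several parsers, including Python's built-in html.parser, lxml, and html5lib.

By comparison, the lxml parser is usually the better choice: it is fast and tolerant of malformed markup.

To use it, pass 'lxml' as the second argument when creating the BeautifulSoup object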

from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Beautiful Soup test</title></head>
<body>
    <p class="first" name="first_p"><b>first content</b></p>
    <p class="second">second content
    <a href="http://example.com/first"></a>
    <a href="http://example.com/second">
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # Pretty-print the parsed tree with indentation
print(soup.title.string)  # Get the text content of the title node

Note: the HTML in the code above is deliberately incomplete. The html and body tags, the second p tag, and the last a tag are never closed.

Output:

<html>
 <head>
  <title>
   Beautiful Soup test
  </title>
 </head>
 <body>
  <p class="first" name="first_p">
   <b>
    first content
   </b>
  </p>
  <p class="second">
   second content
   <a href="http://example.com/first">
   </a>
   <a href="http://example.com/second">
   </a>
  </p>
 </body>
</html>
Beautiful Soup test

Beautiful Soup automatically completes the missing HTML tags
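
How the missing tags are filled in depends on the parser. The sketch below feeds the same unclosed fragment to three parsers and prints what each one produces. The helper parse_with and the sample fragment are made up for this illustration, and any parser that is not installed is reported instead of raising an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative fragment: neither the <p> nor the <a> tag is closed.
broken = '<p>unclosed paragraph<a href="http://example.com/first">'

def parse_with(parser_name):
    """Return the repaired markup, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(broken, parser_name))
    except FeatureNotFound:
        return None

results = {name: parse_with(name) for name in ('html.parser', 'lxml', 'html5lib')}

for name, repaired in results.items():
    print(name, '->', repaired if repaired is not None else 'not installed')
```

The built-in html.parser only closes the open tags, while lxml and html5lib also wrap the fragment in html and body elements, which is why the output shown above starts with an html tag.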

Node selector

from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Beautiful Soup test</title></head>
<body>
    <p class="first" name="first_p"><b>first content</b></p>
    <p class="second">second content
    <a href="http://example.com/first"></a>
    <a href="http://example.com/second">
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
# <title>Beautiful Soup test</title>

print(type(soup.title))
# <class 'bs4.element.Tag'>

print(soup.title.string)
# Beautiful Soup test

print(soup.head)
# <head><title>Beautiful Soup test</title></head>

print(soup.p)
# <p class="first" name="first_p"><b>first content</b></p>

Node name

In [3]: print(soup.title.name)
title

All attributes of a node

In [4]: print(soup.p.attrs)
{'class': ['first'], 'name': 'first_p'}

A specific attribute of a node

In [5]: print(soup.p.attrs['name'])
first_p

Shorthand for a specific attribute

In [6]: print(soup.p['name'])
first_p
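
Square-bracket access raises KeyError when the attribute does not exist. A hedged sketch of the safer alternative, Tag.get(), is shown below. The markup is a shortened stand-in for the document above, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="first big" name="first_p">text</p>', 'html.parser')
p = soup.p

name = p.get('name')              # attribute exists: returns its value
missing = p.get('id')             # attribute absent: returns None, no KeyError
fallback = p.get('id', 'no-id')   # optional default for absent attributes
classes = p['class']              # class is multi-valued, so this is a list

print(name, missing, fallback, classes)
```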

Node text content

In [7]: print(soup.p.string)
first content
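
Note that .string only returns text when the node has a single child (or a single chain of nested children, as with the b tag above). The following sketch, using an illustrative paragraph that mixes text and a link, shows .string returning None and get_text() as the fallback.

```python
from bs4 import BeautifulSoup

html = '<p class="second">second content <a href="http://example.com/first">link</a></p>'
soup = BeautifulSoup(html, 'html.parser')

single = soup.a.string      # the <a> tag has exactly one text child
mixed = soup.p.string       # the <p> tag has two children, so this is None
full = soup.p.get_text()    # concatenates every piece of text under <p>

print(repr(single), repr(mixed), repr(full))
```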

Nested selection

In [8]: print(soup.head.title)
<title>Beautiful Soup test</title>

In [9]: print(type(soup.head.title))
<class 'bs4.element.Tag'>

In [10]: print(soup.head.title.string)
Beautiful Soup test

Association selection

In [11]: print(soup.body.children)
<list_iterator object at 0x10825a6d8>

In [12]: for i, child in enumerate(soup.body.children):
    ...:     print(i, child)
    ...:
0

1 <p class="first" name="first_p"><b>first content</b></p>
2

3 <p class="second">second content
    <a href="http://example.com/first"></a>
<a href="http://example.com/second">
</a></p>
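
Attributes available for association selection:

  • children: all direct child nodes
  • descendants: all descendant nodes
  • parent: the direct parent node
  • parents: all ancestor nodes
  • next_sibling: the next sibling node
  • previous_sibling: the previous sibling node
  • next_siblings: all following sibling nodes
  • previous_siblings: all preceding sibling nodes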
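
The attributes listed above can be sketched on a small document. The markup here is invented for the example and deliberately contains no whitespace between tags, so that siblings are tags rather than newline text nodes (compare the empty entries 0 and 2 in the output above).

```python
from bs4 import BeautifulSoup

html = '<div id="box"><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(html, 'html.parser')
box = soup.div
middle = soup.find(id='b')

child_ids = [child['id'] for child in box.children]      # direct children only
descendant_count = len(list(box.descendants))            # tags and text nodes
parent_name = middle.parent.name                         # direct parent
ancestor_names = [tag.name for tag in middle.parents]    # nearest ancestor first
next_id = middle.next_sibling['id']
prev_id = middle.previous_sibling['id']
following = [s['id'] for s in middle.next_siblings]
preceding = [s['id'] for s in middle.previous_siblings]

print(child_ids, descendant_count, parent_name, ancestor_names)
print(prev_id, next_id, preceding, following)
```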

Method selector

find_all

Data preparation

In [13]: from bs4 import BeautifulSoup
    ...:
    ...: html = '''
    ...: <div class="panel">
    ...:     <div class="panel-heading">
    ...:         <h4>Hello</h4>
    ...:     </div>
    ...:     <div class="panel-body">
    ...:         <ul class="list" id="list-1">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:             <li class="element">Jay</li>
    ...:         </ul>
    ...:         <ul class="list list-small" id="list-2">
    ...:             <li class="element">Foo</li>
    ...:             <li class="element">Bar</li>
    ...:         </ul>
    ...:     </div>
    ...: </div>
    ...: '''
    ...:
    ...: soup = BeautifulSoup(html, 'lxml')
    ...:
    ...:

All ul

In [16]: soup.find_all(name='ul')
Out[16]:
[<ul class="list" id="list-1">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>, <ul class="list list-small" id="list-2">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 </ul>]

Each ul in the result is a Tag object, so find_all can be called on it again

In [17]: type(soup.find_all(name='ul')[0])
Out[17]: bs4.element.Tag

In [18]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Then iterate over each li to get its text

In [19]: for ul in soup.find_all(name='ul'):
    ...:     print(ul.find_all(name='li'))
    ...:     for li in ul.find_all(name='li'):
    ...:         print(li.string)
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

Query by attribute

In [26]: soup.find_all(attrs={'id': 'list-1'})
Out[26]:
[<ul class="list" id="list-1">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>]
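
For common attributes, find_all also accepts keyword arguments instead of an attrs dictionary. The sketch below uses a shortened copy of the data above. Because class is a reserved word in Python, the keyword is spelled class_.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_id = soup.find_all(id='list-1')         # same as attrs={'id': 'list-1'}
by_class = soup.find_all(class_='list')    # matches any tag whose classes include "list"
first_two = soup.find_all('li', limit=2)   # stop after two matches

print(len(by_id), len(by_class), len(first_two))
```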

text

Match the text content of the node

In [28]: import re

# Returns a list of all text nodes that match the regular expression
In [29]: soup.find_all(text=re.compile('ar'))
Out[29]: ['Bar', 'Bar']
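
In Beautiful Soup 4.4.0 and later this argument is named string, and text is kept as the older name. A minimal sketch with a shortened list, also showing that combining a tag name with string returns the matching tags rather than the text nodes:

```python
import re
from bs4 import BeautifulSoup

html = '<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# string alone: returns the matching text nodes
texts = soup.find_all(string=re.compile('ar'))

# tag name plus string: returns the tags whose .string matches
tags = soup.find_all('li', string=re.compile('ar'))

print(texts, tags)
```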

find

Returns the first matching element

In [30]: soup.find(text=re.compile('ar'))
Out[30]: 'Bar'

In [31]: soup.find('li')
Out[31]: <li class="element">Foo</li>

There are other find methods that search relative to the current node:

  • find_parents() and find_parent()

  • find_next_siblings() and find_next_sibling()

  • find_previous_siblings() and find_previous_sibling()

  • find_all_next() and find_next()

  • find_all_previous() and find_previous()
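
A short sketch of these relative searches, starting from the middle item of an illustrative three-item list (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
bar = soup.find('li', string='Bar')   # starting point for the relative searches

parent_id = bar.find_parent('ul')['id']                      # nearest matching ancestor
next_text = bar.find_next_sibling('li').string               # first sibling after
prev_text = bar.find_previous_sibling('li').string           # first sibling before
after = [li.string for li in bar.find_all_next('li')]        # later in the document
before = [li.string for li in bar.find_all_previous('li')]   # earlier in the document

print(parent_id, prev_text, next_text, before, after)
```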

CSS selector

Just call the select() method and pass in the corresponding CSS selector

In [32]: soup.select('.panel .panel-heading')
Out[32]:
[<div class="panel-heading">
 <h4>Hello</h4>
 </div>]

In [33]: soup.select('ul li')
Out[33]:
[<li class="element">Foo</li>,
 <li class="element">Bar</li>,
 <li class="element">Jay</li>,
 <li class="element">Foo</li>,
 <li class="element">Bar</li>]

In [34]: soup.select('#list-2 .element')
Out[34]: [<li class="element">Foo</li>, <li class="element">Bar</li>]

In [35]: soup.select('ul')[0]
Out[35]:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
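
When only the first match is needed, select_one() avoids indexing into the list and returns None instead of raising IndexError when nothing matches. A minimal sketch on shortened markup:

```python
from bs4 import BeautifulSoup

html = ('<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
        '<ul class="list" id="list-2"></ul>')
soup = BeautifulSoup(html, 'html.parser')

first_ul = soup.select_one('ul')     # same element as soup.select('ul')[0]
missing = soup.select_one('table')   # no match: None, not an exception

print(first_ul['id'], missing)
```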

Nested selection

In [36]: for ul in soup.select('ul'):
    ...:     print(ul.select('li'))
    ...:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Get attributes

In [37]: for ul in soup.select('ul'):
    ...:     print(ul['id'])
    ...:     print(ul.attrs['id'])
    ...:
list-1
list-1
list-2
list-2

Get text

In [39]: for li in soup.select('li'):
    ...:     print('Get Text:', li.get_text())
    ...:     print('String:', li.string)
    ...:
    ...:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
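
The two give the same result here because each li contains a single text node. On a node with several children they differ, and get_text() accepts separator and strip arguments to tidy the combined text. A sketch on illustrative markup:

```python
from bs4 import BeautifulSoup

html = '<div>  <h4>Hello</h4>\n  <p>Foo <b>Bar</b></p>  </div>'
soup = BeautifulSoup(html, 'html.parser')

raw = soup.div.get_text()                             # keeps all whitespace as-is
clean = soup.div.get_text(separator=' ', strip=True)  # strips each piece, joins with a space
string_value = soup.div.string                        # several children, so None

print(repr(raw))
print(repr(clean))
print(repr(string_value))
```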


Added by artied on Sun, 29 Dec 2019 17:52:44 +0200