Reptiles are the most basic

Understanding HTTP & static web page crawling

Request: request method, request URL, request header, request body (required by post method)

import requests
# Request methods include get, post, etc
headers = {
    'User-Agent':'Mozilla/5.0(Windows NT 10.0;Win64; x64) AppleWeb/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}   # Request header
r = requests.get("URL", headers = headers)

data = { 'name': 'germey', 'age': 22}  # Can it be understood as the request body?
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)

Response: response status code, response header, response body (crawling content)

import requests

headers = {
    'User-Agent':'Mozilla/5.0(Windows NT 10.0;Win64; x64) AppleWeb/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}   # Request header
r = requests.get("URL", headers = headers)

print(r.status_code)   # The status code returns 200, which is the normal response
print(r.headers)    # Response header
print(r.cookies)    # Cookies
print(r.url)        # URL
print(r.history)    # Request history
print(r.text)       # Return to web page source code

regular expression

Common matching rules

patterndescribe
\wMatch letters, numbers, and underscores
\WMatches characters that are not letters, numbers, or underscores
\sMatch any white space character, equivalent to [\ t\n\r\f]
\SMatch any non null character
\dMatch any number, equivalent to [0-9]
\DMatches any non numeric character
\AMatches the beginning of a string
\ZMatch the end of the string. If there is a newline, only the end string before the newline is matched
\zMatches the end of the string. If there is a newline, it will also match the newline character
\GMatch the position where the last match was completed
\nMatch a newline character
\tMatch a tab
^Matches the beginning of a line of string
$Matches the end of a line of string
.Matches any character except the newline character when re When the dotall tag is specified, any character including a newline character can be matched
[...]Used to represent a group of characters and list them separately. For example, [amk] matches a, m or k
[^...]Characters not in [], such as [^ abc], match characters other than a, b and c
*Match 0 or more expressions
+Match 1 or more expressions
?Matches 0 or 1 fragments defined by previous regular expressions, non greedy
{n}Exactly match n previous expressions
{n, m}Match the fragment defined by the previous regular expression n to m times, greedy
( )Matches the expression in parentheses and also represents a group

Matching method match(), search(), findall()

  • match(): search from the beginning of the string. If the beginning does not match, return None
  • search(): scan the entire string and return the first successful matching result (only one)
  • findall(): returns all the contents matching the regular expression (in the form of a list). Meanwhile, if the regular expression uses () to divide the group, only the objects in () will be returned
html = '''<div id="songs-list">
    <h2 class="title">Classic old songs</h2>
    <p class="introduction">
        List of classic songs
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">Simon Birch</li>
        <li data-view="7">
            <a href="/2.mp3" singer="Ren Xianqi">The sea laughed</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="Qi Qin">The past goes with the wind</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">Glorious years</a></li>
        <li data-view="5"><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="Deng Lijun">May we all be blessed with longevity</a>
        </li>
    </ul>
</div>'''

import re
 
result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
print(result)
print(result.group(1))
print(result.group(2))
'''
Output results:
<re.Match object; span=(153, 260), match='<li data-view="2">Simon Birch</li>\n        <li data-vi>
Ren Xianqi
 The sea laughed
'''

results = re.findall('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))

for i in results:
    print(i)
'''
Output results:
[('Ren Xianqi', 'The sea laughed'), ('Qi Qin', 'The past goes with the wind'), ('beyond', 'Glorious years'), ('Chen Huilin', 'Notepad'), ('Deng Lijun', 'May we all be blessed with longevity')]
<class 'list'>
('Ren Xianqi', 'The sea laughed')
('Qi Qin', 'The past goes with the wind')
('beyond', 'Glorious years')
('Chen Huilin', 'Notepad')
('Deng Lijun', 'May we all be blessed with longevity')
'''

Modifier

Modifier describe
re.IMake matching pairs case insensitive
re.SMake Matches all characters, including line breaks
re.LLocal aware matching
re.MMultiline matching, affecting ^ and$
re.UParses characters according to the Unicode character set. This flag affects \ W, \ W, \ B, and \ B

In web page matching, re S and re I

import re
 
content = '''Hello 1234567 World_This is 
a Regex Demo'''  # The string contains a newline character

result = re.match('^He.*?(\d+).*mo$', content)
print(result)  # Return to None

#  Use modifiers
result = re.match('^He.*?(\d+).*mo$', content, re.S)
print(result)
print(result.group())
print(result.group(1))
print(result.span())

Greed & non greed

Greedy & non greedy is the concept of universal matching.
Universal matching:* (dot star). Where. (DOT) represents matching any character (except newline character), and * (star) represents matching the preceding character infinite times
.* You can match characters of any length except the newline character

import re
content = 'Hello 1234567 World_This is a Regex Demo'

# \d + means matching at least one number, and d{n} means matching n numbers
result = re.match('^He.*(\d+).*mo$', content)
print(result.group(1))  # Return to 7, greed

result = re.match('^He.*?(\d+).*mo$', content)
print(result.group(1))  # Return 1234567, non greedy

Modify substitution character method (sub)

Parameters: the first parameter is the matching field (regular expression), the second parameter is the string to be replaced, and the third parameter is the original string

html = '''<div id="songs-list">
    <h2 class="title">Classic old songs</h2>
    <p class="introduction">
        List of classic songs
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">Simon Birch</li>
        <li data-view="7">
            <a href="/2.mp3" singer="Ren Xianqi">The sea laughed</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="Qi Qin">The past goes with the wind</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">Glorious years</a></li>
        <li data-view="5"><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="Deng Lijun">May we all be blessed with longevity</a>
        </li>
    </ul>
</div>'''
import re
html2 = re.sub('<a.*?>|</a>', '', html)  # The "|" in the middle should mean "or". Delete the node containing a (redundant string)
print(html2)

compile() method

The compile() method can compile a regular string into a regular expression object for reuse in subsequent matches. For example, the first parameter in the sub() method is a regular expression. If you need to perform the same operation on multiple strings to be processed, you can use the compile() method.

import re
 
content1 = '2020-02-28 12:00'
content2 = '2020-02-29 12:55'
content3 = '2020-03-02 13:21'

pattern = re.compile('\d{2}:\d{2}')    

result1 = re.sub(pattern, '', content1)
result2 = re.sub(pattern, '', content2)
result3 = re.sub(pattern, '', content3)  # Delete the time information in three string objects

print(result1, result2, result3)
# Return to 2020-02-28 2020-02-29 2020-03-02

Keywords: Python crawler regex

Added by phprooky on Thu, 16 Dec 2021 20:12:42 +0200