Understanding HTTP & static web page crawling
Request: request method, request URL, request header, request body (required by post method)
import requests # Request methods include get, post, etc headers = { 'User-Agent':'Mozilla/5.0(Windows NT 10.0;Win64; x64) AppleWeb/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36' } # Request header r = requests.get("URL", headers = headers) data = { 'name': 'germey', 'age': 22} # Can it be understood as the request body? r = requests.post("http://httpbin.org/post", data=data) print(r.text)
Response: response status code, response header, response body (crawling content)
import requests headers = { 'User-Agent':'Mozilla/5.0(Windows NT 10.0;Win64; x64) AppleWeb/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36' } # Request header r = requests.get("URL", headers = headers) print(r.status_code) # The status code returns 200, which is the normal response print(r.headers) # Response header print(r.cookies) # Cookies print(r.url) # URL print(r.history) # Request history print(r.text) # Return to web page source code
regular expression
Common matching rules
pattern | describe |
---|---|
\w | Match letters, numbers, and underscores |
\W | Matches characters that are not letters, numbers, or underscores |
\s | Match any white space character, equivalent to [\ t\n\r\f] |
\S | Match any non null character |
\d | Match any number, equivalent to [0-9] |
\D | Matches any non numeric character |
\A | Matches the beginning of a string |
\Z | Match the end of the string. If there is a newline, only the end string before the newline is matched |
\z | Matches the end of the string. If there is a newline, it will also match the newline character |
\G | Match the position where the last match was completed |
\n | Match a newline character |
\t | Match a tab |
^ | Matches the beginning of a line of string |
$ | Matches the end of a line of string |
. | Matches any character except the newline character when re When the dotall tag is specified, any character including a newline character can be matched |
[...] | Used to represent a group of characters and list them separately. For example, [amk] matches a, m or k |
[^...] | Characters not in [], such as [^ abc], match characters other than a, b and c |
* | Match 0 or more expressions |
+ | Match 1 or more expressions |
? | Matches 0 or 1 fragments defined by previous regular expressions, non greedy |
{n} | Exactly match n previous expressions |
{n, m} | Match the fragment defined by the previous regular expression n to m times, greedy |
( ) | Matches the expression in parentheses and also represents a group |
Matching method match(), search(), findall()
- match(): search from the beginning of the string. If the beginning does not match, return None
- search(): scan the entire string and return the first successful matching result (only one)
- findall(): returns all the contents matching the regular expression (in the form of a list). Meanwhile, if the regular expression uses () to divide the group, only the objects in () will be returned
html = '''<div id="songs-list"> <h2 class="title">Classic old songs</h2> <p class="introduction"> List of classic songs </p> <ul id="list" class="list-group"> <li data-view="2">Simon Birch</li> <li data-view="7"> <a href="/2.mp3" singer="Ren Xianqi">The sea laughed</a> </li> <li data-view="4" class="active"> <a href="/3.mp3" singer="Qi Qin">The past goes with the wind</a> </li> <li data-view="6"><a href="/4.mp3" singer="beyond">Glorious years</a></li> <li data-view="5"><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li> <li data-view="5"> <a href="/6.mp3" singer="Deng Lijun">May we all be blessed with longevity</a> </li> </ul> </div>''' import re result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S) print(result) print(result.group(1)) print(result.group(2)) ''' Output results: <re.Match object; span=(153, 260), match='<li data-view="2">Simon Birch</li>\n <li data-vi> Ren Xianqi The sea laughed ''' results = re.findall('<li.*?singer="(.*?)">(.*?)</a>', html, re.S) print(results) print(type(results)) for i in results: print(i) ''' Output results: [('Ren Xianqi', 'The sea laughed'), ('Qi Qin', 'The past goes with the wind'), ('beyond', 'Glorious years'), ('Chen Huilin', 'Notepad'), ('Deng Lijun', 'May we all be blessed with longevity')] <class 'list'> ('Ren Xianqi', 'The sea laughed') ('Qi Qin', 'The past goes with the wind') ('beyond', 'Glorious years') ('Chen Huilin', 'Notepad') ('Deng Lijun', 'May we all be blessed with longevity') '''
Modifier
Modifier | describe |
---|---|
re.I | Make matching pairs case insensitive |
re.S | Make Matches all characters, including line breaks |
re.L | Local aware matching |
re.M | Multiline matching, affecting ^ and$ |
re.U | Parses characters according to the Unicode character set. This flag affects \ W, \ W, \ B, and \ B |
In web page matching, re S and re I
import re content = '''Hello 1234567 World_This is a Regex Demo''' # The string contains a newline character result = re.match('^He.*?(\d+).*mo$', content) print(result) # Return to None # Use modifiers result = re.match('^He.*?(\d+).*mo$', content, re.S) print(result) print(result.group()) print(result.group(1)) print(result.span())
Greed & non greed
Greedy & non greedy is the concept of universal matching.
Universal matching:* (dot star). Where. (DOT) represents matching any character (except newline character), and * (star) represents matching the preceding character infinite times
.* You can match characters of any length except the newline character
import re content = 'Hello 1234567 World_This is a Regex Demo' # \d + means matching at least one number, and d{n} means matching n numbers result = re.match('^He.*(\d+).*mo$', content) print(result.group(1)) # Return to 7, greed result = re.match('^He.*?(\d+).*mo$', content) print(result.group(1)) # Return 1234567, non greedy
Modify substitution character method (sub)
Parameters: the first parameter is the matching field (regular expression), the second parameter is the string to be replaced, and the third parameter is the original string
html = '''<div id="songs-list"> <h2 class="title">Classic old songs</h2> <p class="introduction"> List of classic songs </p> <ul id="list" class="list-group"> <li data-view="2">Simon Birch</li> <li data-view="7"> <a href="/2.mp3" singer="Ren Xianqi">The sea laughed</a> </li> <li data-view="4" class="active"> <a href="/3.mp3" singer="Qi Qin">The past goes with the wind</a> </li> <li data-view="6"><a href="/4.mp3" singer="beyond">Glorious years</a></li> <li data-view="5"><a href="/5.mp3" singer="Chen Huilin">Notepad</a></li> <li data-view="5"> <a href="/6.mp3" singer="Deng Lijun">May we all be blessed with longevity</a> </li> </ul> </div>''' import re html2 = re.sub('<a.*?>|</a>', '', html) # The "|" in the middle should mean "or". Delete the node containing a (redundant string) print(html2)
compile() method
The compile() method can compile a regular string into a regular expression object for reuse in subsequent matches. For example, the first parameter in the sub() method is a regular expression. If you need to perform the same operation on multiple strings to be processed, you can use the compile() method.
import re content1 = '2020-02-28 12:00' content2 = '2020-02-29 12:55' content3 = '2020-03-02 13:21' pattern = re.compile('\d{2}:\d{2}') result1 = re.sub(pattern, '', content1) result2 = re.sub(pattern, '', content2) result3 = re.sub(pattern, '', content3) # Delete the time information in three string objects print(result1, result2, result3) # Return to 2020-02-28 2020-02-29 2020-03-02