Learning notes and summary on regular expressions
About regular expressions
I have a problem that I can solve with regular expressions. Well, now I have two problems.
Why regular expressions?
In my own learning I have come across various string lookup functions, so what makes regular expressions so magical? I think it is their fuzziness. A pattern describes the shape of the text rather than its exact content, so irrelevant parts of the input have less influence on the search and the required content is easier to find.
Take a piece of Python code as an example:
    text = "abcdef"
    res = text.find("a")  # index of the first "a", here 0
This code finds the position of the character "a" in the string. But change the problem to picking out, from a pile of phone numbers, those that start with a given prefix or end with a given segment, and the find function is no longer so easy to use. This is where the fuzziness of regular expressions gets the desired result.
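The phone-number scenario above might look like the following. This is only a minimal sketch: the sample numbers, the prefix 138 and the suffix 8888 are made up for illustration.

```python
import re

# Hypothetical sample data; these numbers are invented for the example.
numbers = ["13812345678", "13998768888", "13800008888", "15012341234"]

# 138 followed by exactly eight more digits: an 11-digit number starting with 138.
starts_with_138 = [n for n in numbers if re.fullmatch(r"138\d{8}", n)]

# Seven digits of any kind followed by 8888: an 11-digit number ending in 8888.
ends_with_8888 = [n for n in numbers if re.fullmatch(r"\d{7}8888", n)]

print(starts_with_138)  # ['13812345678', '13800008888']
print(ends_with_8888)   # ['13998768888', '13800008888']
```

re.fullmatch requires the whole string to fit the pattern, so the length check comes for free.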
Application scope of regular expressions
Languages such as Java and Python provide regular expressions through a corresponding package or module. In crawlers, a regular expression can pull the piece of information you need out of a pile of tags (although bs4 sometimes gives better results when the structure is regular and the target is not complex).
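As a rough sketch of the crawler use case, the snippet below pulls link targets out of a small piece of HTML. The HTML string and the example.com URLs are assumptions made up for the example, and a pattern this simple would not cope with every real page.

```python
import re

# Hypothetical page fragment, invented for the example.
html = '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>'

# [^"]+ means one or more characters that are not a double quote.
# With a single capture group, findall returns only the captured part.
links = re.findall(r'href="([^"]+)"', html)

print(links)  # ['https://example.com/a', 'https://example.com/b']
```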
Regular expression rules
Metacharacters
- .: matches any character except the newline character
- \w: matches a letter, digit, underscore, or Chinese character
- \s: matches any whitespace character
- \d: matches a digit
- \b: matches the beginning or end of a word. For example, "er\b" matches the "er" in "never" but not the "er" in "very"
- ^: matches the beginning of the input string
- $: matches the end of the input string
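The metacharacters listed above can be tried out with Python's re module. This is a small sketch; the sample strings are made up for illustration.

```python
import re

# Hypothetical sample text containing a newline.
text = "never say never\nroom 42"

digits = re.findall(r"\d", text)         # each single digit
words = re.findall(r"\w+", text)         # runs of word characters
spaces = re.findall(r"\s", text)         # spaces and the newline
er_endings = re.findall(r"er\b", text)   # "er" only at the end of a word

print(digits)      # ['4', '2']
print(words)       # ['never', 'say', 'never', 'room', '42']
print(spaces)      # [' ', ' ', '\n', ' ']
print(er_endings)  # ['er', 'er']

# "er" in "very" is followed by "y", so there is no word boundary after it.
print(re.search(r"er\b", "very"))  # None

# "." matches the space but not the newline.
print(re.search(r"say.never", text) is not None)   # True
print(re.search(r"never.room", text) is not None)  # False

# ^ and $ anchor to the start and end of the input string.
print(re.search(r"^never", text) is not None)  # True
print(re.search(r"42$", text) is not None)     # True

# For Python 3 str patterns, \w also covers Chinese characters.
cjk_words = re.findall(r"\w+", "正则 regex_1")
print(cjk_words)  # ['正则', 'regex_1']
```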
Repetition quantifiers
- *: repeat zero or more times
- +: repeat one or more times
- ?: repeat zero or one time
- {n}: repeat exactly n times
- {n,}: repeat n or more times
- {n,m}: repeat between n and m times
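To see the difference between the quantifiers, the sketch below checks which of a few sample strings each pattern accepts in full. The helper name matching is hypothetical and exists only for this example.

```python
import re

samples = ["", "a", "aa", "aaa", "aaaa"]

def matching(pattern):
    """Return the samples that the pattern matches completely (hypothetical helper)."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"a*"))      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(matching(r"a+"))      # ['a', 'aa', 'aaa', 'aaaa']
print(matching(r"a?"))      # ['', 'a']
print(matching(r"a{2}"))    # ['aa']
print(matching(r"a{2,}"))   # ['aa', 'aaa', 'aaaa']
print(matching(r"a{2,3}"))  # ['aa', 'aaa']
```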
Grouping characters
- (ab): treats the string "ab" as a single group. For example, "^(ab)+" matches strings that start with one or more "ab"
- [abc] or [a-c]: matches any single one of the characters a, b, c
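A short sketch of groups and character classes in Python, with sample strings made up for illustration:

```python
import re

# The group (ab) is repeated as a unit by the + that follows it.
repeated = re.match(r"^(ab)+", "ababab-c")
print(repeated.group(0))  # ababab

# The string must start with "ab", so this one does not match.
no_match = re.match(r"^(ab)+", "bab")
print(no_match)  # None

# A character class matches exactly one character out of the set.
in_class = re.findall(r"[a-c]", "abcdef")
print(in_class)  # ['a', 'b', 'c']
```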
Escape character
- \: marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, "\n" matches a newline character, and "(\(ab\))+" matches one or more literal "(ab)"
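Escaping is easiest to understand by comparing a pattern with and without the backslash. A minimal sketch with made-up sample strings:

```python
import re

# Unescaped "." matches any character; escaped "\." matches only a literal dot.
any_char = re.findall(r"a.b", "a.b axb")
literal_dot = re.findall(r"a\.b", "a.b axb")
print(any_char)     # ['a.b', 'axb']
print(literal_dot)  # ['a.b']

# \( and \) are literal parentheses; the outer ( ) is a group repeated by +.
m = re.search(r"(\(ab\))+", "f(ab)(ab) and ab")
print(m.group(0))  # (ab)(ab)
```

Writing patterns as raw strings (the r prefix) keeps Python itself from interpreting the backslashes first.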
Conditional character
- |: means "or" and matches either the alternative before it or the one after it. For example, "(12)(3|4)" matches "123" or "124"
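The "(12)(3|4)" example can be checked directly. A small sketch, where the candidate strings are chosen only for illustration:

```python
import re

pattern = r"(12)(3|4)"

candidates = ["123", "124", "125"]
accepted = [s for s in candidates if re.fullmatch(pattern, s)]
print(accepted)  # ['123', '124']

# Each pair of parentheses is also a capture group.
m = re.fullmatch(pattern, "124")
print(m.groups())  # ('12', '4')
```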
Assertions
- Positive lookahead, expression(?=pattern): matches the expression only where it is followed by pattern; pattern itself is not part of the match. For example:
    import re

    line = "<div class = \"left_box\" height = 100px>"
    pattern = ".*(?=height)"  # everything up to, but not including, "height"
    m = re.search(pattern, line)
    print(m.group(0))  # <div class = "left_box" (with a trailing space)
- Negative lookahead, expression(?!pattern): matches the expression only where it is not followed by pattern; pattern itself is not part of the match. For example:
    import re

    line = "regular regex rlief"
    pattern = r"r(\w{1})(?!g)"  # "r" plus one word character, not followed by "g"
    m = re.search(pattern, line)
    print(m.group(0))  # rl
- Positive lookbehind, (?<=pattern)expression: the same as positive lookahead, except that it matches the expression that comes right after pattern
- Negative lookbehind, (?<!pattern)expression: the same as negative lookahead, except that it matches the expression that is not preceded by pattern
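The two lookbehind forms can be sketched in the same way as the lookahead examples. The sample line and the idea of telling prices apart by a "$" sign are assumptions made up for this illustration.

```python
import re

# Hypothetical sample text.
line = "price: $42, discount: 15, fee: $7"

# Positive lookbehind: digits that come right after a "$"; the "$" is not captured.
after_dollar = re.findall(r"(?<=\$)\d+", line)
print(after_dollar)  # ['42', '7']

# Negative lookbehind: whole numbers not preceded by "$".
# \b stops the match from starting in the middle of "42".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", line)
print(not_after_dollar)  # ['15']
```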
Laziness (non-greedy matching)
What is the greed of regular expressions?
Greedy means matching as many characters as possible. For example, \w{2,9} first tries to match 9 characters; if that is not possible, it tries 8, and so on down to 2
How to be lazy
- *?: repeat zero or more times, trying zero first and matching as few as possible
- +?: repeat one or more times, trying one first
- ??: repeat zero or one time, trying zero first
- {n,m}?: repeat n to m times, trying n first
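The difference between greedy and lazy matching shows up clearly when the same text is searched both ways. A minimal sketch with made-up sample strings:

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last "</b>" it can reach.
greedy = re.findall(r"<b>.*</b>", html)
print(greedy)  # ['<b>one</b><b>two</b>']

# Lazy: .*? stops at the first "</b>" that lets the pattern succeed.
lazy = re.findall(r"<b>.*?</b>", html)
print(lazy)  # ['<b>one</b>', '<b>two</b>']

# The same applies to {n,m}: greedy takes 9 characters, lazy takes only 2.
greedy_count = re.match(r"\w{2,9}", "abcdefghijkl").group(0)
lazy_count = re.match(r"\w{2,9}?", "abcdefghijkl").group(0)
print(greedy_count)  # abcdefghi
print(lazy_count)    # ab
```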
Numbered groups
- Numbered capture group (expression), for example:
    import re

    line = "020-85653333"
    pattern = r"(0\d{2,3})-(\d{8})"
    m = re.search(pattern, line)
    print(m.group(0))  # 020-85653333 (the whole match)
    print(m.group(1))  # 020
    print(m.group(2))  # 85653333
- Named capture group (?<name>expression) and non-capturing group (?:expression), which groups without capturing (the syntax differs between languages; for example, Python requires a P between "?" and "<", giving (?P<name>expression))
    import re

    line = "020-85653333"

    # Named groups; a Python group name must be a valid identifier, so no spaces
    pattern = r"(?P<area_code>0\d{2,3})-(?P<number>\d{8})"
    m = re.search(pattern, line)
    print(m.group(0))            # 020-85653333
    print(m.group("area_code"))  # 020
    print(m.group("number"))     # 85653333

    # Non-capturing group: (?:...) takes part in the match but cannot be retrieved
    pattern = r"(?P<area_code>0\d{2,3})-(?:\d{8})"
    m = re.search(pattern, line)
    print(m.group(0))            # 020-85653333
    print(m.group("area_code"))  # 020
    m.group("number")            # raises IndexError: no such group
Reference website
Regular expression online matching website: https://c.runoob.com/front-end/854/