Matching symbols
1. re module
The re module is Python's built-in module for working with regular expressions.
fullmatch function:
fullmatch(regular expression, string) - check whether the regular expression matches the entire string. If the match fails, it returns None.
2. Regular expression syntax
Regular expression - a tool that makes complex string problems simple.
The main work in writing a regular expression is using regex symbols to describe the rule that the target strings follow.
Python regular expression: r'regular expression'
JavaScript regular expression: /regular expression/
from re import fullmatch
1) Ordinary characters (ordinary symbols)
Any symbol that has no special function or meaning in a regular expression.
An ordinary character represents itself in a regular expression.
# Match a string of exactly three characters: a, b and c
re_str = r'abc'
print(fullmatch(re_str, 'abc'))
2) . - match any character
Note: one . matches exactly one character (by default, any character except the newline \n).
# Match a string of length three: the first character is a, the last is c, and any single character sits between them
re_str = r'a.c'
print(fullmatch(re_str, 'abc'))
print(fullmatch(re_str, 'a+c'))
print(fullmatch(re_str, 'a good c'))   # None: more than one character between a and c
re_str = r'abc...'
print(fullmatch(re_str, 'abcm./'))
print(fullmatch(re_str, 'abcm\t/'))
print(fullmatch(re_str, 'abc G/'))
3) \d - match any digit character
re_str = r'a\d\dc'
print(fullmatch(re_str, 'a78c'))
4) \s - match any whitespace character
Whitespace characters include: space, newline (\n), tab (\t)
re_str = r'a\sb'
print(fullmatch(re_str, 'a b'))
print(fullmatch(re_str, 'a\nb'))
print(fullmatch(re_str, 'a\tb'))
print(fullmatch(re_str, 'a  b'))   # None: two spaces, but \s matches only one character
5) \w - match any letter, digit or underscore (use with care: with str patterns it also matches word characters from other scripts, such as Chinese characters)
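The \w item has no example of its own, so here is a minimal sketch of its behavior. It assumes Python 3 str patterns; the variable names are illustrative only.

```python
import re

# With str patterns, \w covers letters, digits and underscore...
ascii_word = re.fullmatch(r'\w\w\w', 'a_1')
# ...and also word characters from other scripts, such as Chinese.
chinese = re.fullmatch(r'\w', '中')
# Punctuation is not a word character.
punct = re.fullmatch(r'\w', '-')
# The re.ASCII flag restricts \w to [a-zA-Z0-9_].
chinese_ascii = re.fullmatch(r'\w', '中', flags=re.ASCII)

print(ascii_word, chinese, punct, chinese_ascii)
```

When only ASCII letters, digits and underscore should be accepted, an explicit set such as [a-zA-Z0-9_] states the intent more directly than \w.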
6) \D, \S - negated character classes
\D - match any non-digit character
\S - match any non-whitespace character
print(fullmatch(r'a\Sb\D', 'a>b='))
print(fullmatch(r'a\Sb\D', 'a b='))   # None
print(fullmatch(r'a\sb\D', 'a>b1'))   # None
7) [character set] - match any one character in the character set
Note: one [] matches exactly one character
[multiple ordinary characters] - for example: [abc] matches a or b or c
[character set containing special symbols beginning with \] - for example: [\dabc] matches any digit or a or b or c
[character set with a minus sign between two characters] - the minus sign denotes a range from the first character to the second (Note: the code point of the character before the minus sign must not be greater than that of the character after it)
For example:
[a-z] - match any lowercase letter
[a-d] - match any one of a, b, c, d
[A-Z] - match any uppercase letter
[1-9] - match any digit from 1 to 9
[\u4e00-\u9fa5] - match any Chinese character
[a-zA-Z], [A-Za-z] - match any letter
[a-z123] - match any lowercase letter, or 1 or 2 or 3
[a-z\d] - match any lowercase letter or any digit
re_str = r'a[xym]b'
print(fullmatch(re_str, 'axb'))
print(fullmatch(re_str, 'ayb'))
print(fullmatch(re_str, 'amb'))
print(fullmatch(re_str, 'azb'))   # None
re_str = r'a[16]b'
print(fullmatch(re_str, 'a1b'))
print(fullmatch(re_str, 'a6b'))
re_str = r'a[a\db]b'
print(fullmatch(re_str, 'a1b'))
print(fullmatch(re_str, 'aab'))
print(fullmatch(re_str, 'abb'))
print(fullmatch(r'x[a-z]y', 'xmy'))
print(fullmatch(r'x[a-zA-Z]y', 'xmy'))
print(fullmatch(r'x[a-zA-Z]y', 'xKy'))
print(fullmatch(r'x[a-zA-Z*&]y', 'x*y'))
print(fullmatch(r'x[a-zA-Z*&]y', 'xMy'))
print(fullmatch(r'x[0-9]y', 'x5y'))
print(fullmatch(r'x[-09]y', 'x-y'))
print(fullmatch(r'x[-09]y', 'x0y'))
print(fullmatch(r'x[-09]y', 'x9y'))
8) [^ character set] - matches any character that is not in the character set
[^ abc] - matches any character except a, b, c
[^ a-z] - matches any character except lowercase letters
print(fullmatch(r'a[^\u4e00-\u9fa5]c', 'a yes c')) # None print(fullmatch(r'a[^a-zA-Z]c', 'azc')) # None
Note: - and ^ in [] have special functions only when they are placed in the specified position, otherwise they are ordinary characters in [].
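The note about the positions of - and ^ can be checked with a small sketch. The patterns and sample strings below are chosen here for illustration.

```python
from re import fullmatch

# '-' at the start (or end) of the set is a literal minus sign.
literal_minus = fullmatch(r'[-az]', '-')
# '-' between two characters defines a range, so '-' itself is not in [a-z].
range_minus = fullmatch(r'[a-z]', '-')
# '^' that is not the first character is a literal caret.
literal_caret = fullmatch(r'[a^z]', '^')
# '^' as the first character negates the set: anything except a or z.
negated_hit = fullmatch(r'[^az]', '^')
negated_miss = fullmatch(r'[^az]', 'a')

print(literal_minus, range_minus, literal_caret, negated_hit, negated_miss)
```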
Detection symbols
Detection symbols (zero-width assertions) do not affect the length of the matched string. They check whether the position where they appear meets a requirement, on the premise that the rest of the pattern matches.
How to read a pattern with detection symbols: first ignore the detection symbols and see whether the match succeeds. If it fails, the whole match fails. If it succeeds, then check whether each position marked by a detection symbol meets its requirement.
1. \b - detect whether the position is a word boundary
Word boundary - a position with a word character (letter, digit or underscore) on one side and a non-word character, the beginning of the string or the end of the string on the other; whitespace and punctuation therefore create word boundaries
from re import fullmatch, findall
re_str = r'abc\b123'
print(fullmatch(re_str, 'abc123'))   # None: there is no word boundary between c and 1
re_str = r'abc,\b123'
print(fullmatch(re_str, 'abc,123'))
print(fullmatch(r'abc\s\b123', 'abc 123'))
# findall(regular expression, string) - get all substrings in the string that match the regular expression
str1 = '12ksksj78ss 34antibiotic,89 try90 56 Jiangsu Province23'
result1 = findall(r'\d\d', str1)
print(result1)   # ['12', '78', '34', '89', '90', '56', '23']
result2 = findall(r'\d\d\b', str1)
print(result2)   # ['89', '90', '56', '23']
result3 = findall(r'\b\d\d\b', str1)
print(result3)   # ['89', '56']
2. \B - detect whether the position is not a word boundary
result4 = findall(r'\d\d\B', str1)
print(result4)   # ['12', '78', '34']
3. ^ - detect whether the position is the beginning of the string
re_str = r'^\d\d'
print(fullmatch(re_str, '12'))
print(findall(r'^\d\d', str1))   # ['12']
4. $ - detect whether the position is the end of the string
re_str = r'^\d\d$'
print(fullmatch(re_str, '67'))
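The point that detection symbols never consume characters can be seen by inspecting a match. This is a small sketch using re.search, with sample strings invented here.

```python
from re import search

text = 'concat cat cats'
# \b adds no characters to the match: the matched text is still just 'cat'.
bounded = search(r'\bcat\b', text)
print(bounded.group(), bounded.span())   # cat (7, 10)

# ^ and $ pin the pattern to both ends, even when search scans the string.
whole = search(r'^\d\d$', '67')
partial = search(r'^\d\d$', '678')
print(whole, partial)
```

The 'cat' inside 'concat' and the one inside 'cats' are skipped because a word character sits right next to them, so one of the two \b checks fails.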
Matching times (quantifiers)
1. * - match 0 or more times (any number of times)
Usage: matching symbol followed by *
a* - match any number of a
\d* - match any number of digit characters
print(fullmatch(r'a*b', 'b'))
print(fullmatch(r'a*b', 'aab'))
print(fullmatch(r'a*b', 'aaaaaab'))
print(fullmatch(r'\d*b', '2345b'))
print(fullmatch(r'[abc]*x', 'aabbcccx'))
2. + - match one or more times (at least once)
print(fullmatch(r'a+b', 'b'))   # None
print(fullmatch(r'a+b', 'ab'))
print(fullmatch(r'a+b', 'aaab'))
3. ? - match 0 or 1 times
re_str = r'[-+]?[1-9]\d'
print(fullmatch(re_str, '-12'))
print(fullmatch(r'a?b', 'b'))
print(fullmatch(r'a?b', 'ab'))
print(fullmatch(r'a?b', 'aab'))   # None
4. {}
{N} - match exactly N times
{M,N} - match M to N times
{M,} - match at least M times
{,N} - match at most N times
* == {0,}    + == {1,}    ? == {0,1}
print(fullmatch(r'\d{3}', '123'))
print(fullmatch(r'\d{3,5}', '123'))
print(fullmatch(r'\d{3,5}', '1233'))
print(fullmatch(r'\d{3,5}', '13233'))
print(fullmatch(r'\d{3,5}', '132383'))   # None
print(fullmatch(r'\d{3,5}', '13'))   # None
print(fullmatch(r'\d{,5}', '13'))
print(fullmatch(r'\d{3,}', '1693'))
Note: a quantifier must be preceded by a matching symbol
# print(fullmatch(r'+{2,3}', '++'))   # raises re.error: + has nothing to repeat
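The equivalences between *, +, ? and their {} forms can be verified directly. The helper below is a hypothetical name introduced only for this sketch.

```python
from re import fullmatch

samples = ['', 'a', 'aa', 'aaa']

def accepted(pattern):
    """Return, for each sample, whether the pattern matches it entirely."""
    return [fullmatch(pattern, s) is not None for s in samples]

print(accepted(r'a*'), accepted(r'a{0,}'))    # both [True, True, True, True]
print(accepted(r'a+'), accepted(r'a{1,}'))    # both [False, True, True, True]
print(accepted(r'a?'), accepted(r'a{0,1}'))   # both [True, True, False, False]
```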
5. Greedy and non-greedy matching
When the number of repetitions is not fixed, matching is either greedy or non-greedy. The default is greedy.
Among all the repetition counts that let the match succeed, greedy takes the largest and non-greedy takes the smallest.
*, +, ?, {M,N}, {M,}, {,N} - greedy
*?, +?, ??, {M,N}?, {M,}?, {,N}? - non-greedy
from re import match
print(match(r'\d{3}', '123 Yes, yes, yes'))
print(match(r'a.*b', 'asmmdb Yes, yes, yes'))   # asmmdb
# 'asb', 'asbmmb' and 'asbmmbdb' would all succeed; greedy matching takes the longest
print(match(r'a.*b', 'asbmmbdb Yes, yes, yes'))   # asbmmbdb
print(match(r'a.*?b', 'asbmmbdb Yes, yes, yes'))   # asb
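A common place where the greedy default causes surprises is delimited text. The sketch below uses an invented tag-like string to contrast the two modes with findall.

```python
from re import findall

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string, so there is one long match.
greedy = findall(r'<.+>', text)
# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = findall(r'<.+?>', text)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```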
Grouping and branching
1. () - grouping
Function 1: treat the contents of () as a whole and operate on it as a unit, for example to repeat the whole group
Function 2: repeat what the M-th group matched earlier by writing \M, where M starts from 1
Function 3: capture (in findall, only the text matched by the groups is returned)
str1 = '78nm34ms10xp'
print(fullmatch(r'\d\d[a-z]{2}\d\d[a-z]{2}\d\d[a-z]{2}', str1))
print(fullmatch(r'(\d\d[a-z]{2}){3}', str1))
str1 = r'abababab'
print(fullmatch(r'(ab)+', str1))
print(fullmatch(r'(\d{2})abc\1', '89abc89'))
print(fullmatch(r'(\d{2})abc\1', '89abc34'))   # None
# print(fullmatch(r'\d{2}abc\1', '89abc89'))   # raises re.error: there is no group 1
print(fullmatch(r'(\d{3})([a-z]{3})-\2', '234ams-ams'))
print(fullmatch(r'(\d{3})([a-z]{3})-\1', '234ams-234'))
print(fullmatch(r'(\d{3})([a-z]{3})-\2\1', '234ams-ams234'))
print(fullmatch(r'(\d{3})([a-z]{3})-\1{2}', '234ams-234234'))
# print(fullmatch(r'(\d{3})-\2([a-z]{3})', '234ams-ams'))   # raises re.error: \2 is used before group 2 is defined
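One practical use of \M combined with capturing in findall is spotting accidentally repeated words. This is a minimal sketch; the sample sentence is invented here.

```python
from re import findall

text = 'this is is a test of of the the parser'

# ([a-z]+) captures a word, and \1 requires the same word again after one space.
# The \b checks keep the pattern from matching inside longer words such as 'this'.
doubled = findall(r'\b([a-z]+) \1\b', text)
print(doubled)   # ['is', 'of', 'the']
```

Because the pattern has exactly one group, findall returns what the group captured rather than the whole 'is is' match.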
2. | - branch
regex 1|regex 2 - try regex 1 first; if it matches, the match succeeds; if it fails, try regex 2
# Requirement: match both 'abc98' and 'abcMKP'
print(fullmatch(r'abc\d{2}|abc[A-Z]{3}', 'abcKMP'))
print(fullmatch(r'abc(\d{2}|[A-Z]{3})', 'abcMKP'))
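The two patterns above differ in how far | reaches: without a group it splits the whole regular expression. A short sketch with invented strings makes the scope visible.

```python
from re import fullmatch

# Ungrouped: the alternatives are the whole 'abc' and the whole 'def'.
whole_left = fullmatch(r'abc|def', 'abc')
no_mix = fullmatch(r'abc|def', 'abdef')
# Grouped: only c or d varies; 'ab' and 'ef' are shared by both branches.
grouped = fullmatch(r'ab(c|d)ef', 'abdef')

print(whole_left, no_mix, grouped)
print(grouped.group(1))   # d
```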
3. Escape symbol
Add \ before a special symbol to remove its special function and turn it into an ordinary symbol
print(fullmatch(r'\+\d{3}', '+234'))
print(fullmatch(r'\[\d{3}\]', '[234]'))
print(fullmatch(r'\\dabc', r'\dabc'))
A standalone symbol with a special function (one that is special on its own, without \) also loses that function when it is placed inside []
print(fullmatch(r'[+*?|()^$.]abc', '$abc'))
print(fullmatch(r'[\^abc\-z\]]123', ']123'))
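When the text to match literally comes from data rather than being typed by hand, escaping each symbol manually is error-prone. The re module provides re.escape for this; the sketch below is one way it might be used, with an invented sample string.

```python
import re

raw = '1+1=2 (approx.)'

# Used directly as a pattern, +, ( ) and . keep their special functions.
unescaped = re.fullmatch(raw, raw)
# re.escape adds the backslashes needed to make every character ordinary.
escaped = re.fullmatch(re.escape(raw), raw)

print(unescaped)   # None
print(escaped)
```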
re module
1. compile(regular expression) - compile a regular expression and return a regular expression object
re.fullmatch(regular expression, string)
regular_expression_object.fullmatch(string)
import re
re_obj = re.compile(r'\d{3}')
print(re_obj.fullmatch('234'))
print(re.fullmatch(r'\d{3}', '234'))
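A compiled object is most convenient when the same regular expression is applied to many strings. The pattern and sample data below are hypothetical, chosen only to illustrate the reuse.

```python
import re

# Hypothetical pattern for illustration: three digits, a dash, four digits.
code_pattern = re.compile(r'\d{3}-\d{4}')

candidates = ['555-1234', '5551234', '555-12345']
# The compiled object offers the same methods as the module-level functions.
valid = [s for s in candidates if code_pattern.fullmatch(s)]

print(valid)                  # ['555-1234']
print(code_pattern.pattern)   # the source text of the regular expression
```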
2. fullmatch and match
fullmatch(regular expression, string) - match the regular expression against the whole string (exact match). Returns None if the match fails, or a match object if it succeeds
match(regular expression, string) - match at the beginning of the string (check whether the string starts with text that fits the regular expression). Returns None if the match fails, or a match object if it succeeds
result = re.fullmatch(r'(\d{3})-([A-Z]+)', '345-K')
print(result)   # <re.Match object; span=(0, 5), match='345-K'>
- Get the matched string
match_object.group() / match_object.group(0) - get the text matched by the whole regular expression
match_object.group(N) - get the text matched by the N-th group
print(result.group())   # 345-K
print(result.group(1))   # 345
print(result.group(2))   # K
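Since both functions return None on failure, calling group() on the result without a check raises AttributeError. The sketch below, with an invented input string, shows the difference between match and fullmatch together with a None check.

```python
import re

text = '345-K and more'
pattern = r'(\d{3})-([A-Z]+)'

at_start = re.match(pattern, text)     # succeeds: the string starts with '345-K'
whole = re.fullmatch(pattern, text)    # None: ' and more' is left over

for label, m in (('match', at_start), ('fullmatch', whole)):
    if m is None:
        print(label, 'failed')
    else:
        print(label, m.group(), m.group(1), m.group(2))
```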
3. search(regular expression, string) - get the first substring in the string that matches the regular expression. Returns None or a match object
result = re.search(r'\d{3}', 'Try 234 ksjs,345')
print(result)   # <re.Match object; span=(4, 7), match='234'>
print(result.group())   # 234
- Get the position of the match in the original string
match_object.span() - return a tuple (start index, end index); the end index is exclusive
match_object.span(N) - return the span of the N-th group
print(result.span())   # (4, 7)
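Because the end index is exclusive, a span can be used directly as a slice of the original string. A small sketch with an invented sentence:

```python
import re

text = 'order 345-K shipped'
m = re.search(r'(\d{3})-([A-Z]+)', text)

start, end = m.span()
print(m.span(), text[start:end])   # (6, 11) 345-K
print(m.span(1), m.span(2))        # (6, 9) (10, 11)
```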
4. findall
findall(regular expression, string) - get all substrings that match the regular expression and return them as a list; the elements are the matched substrings (when the regular expression has no groups)
If the regular expression has exactly one group: the elements of the list are the text matched by that group
If the regular expression has two or more groups: the elements of the list are tuples, and each tuple holds the text matched by each group
result = re.findall(r'\d{2}', '34ssd908 On the computer 23, udh89,Try 89123')
print(result)   # ['34', '90', '23', '89', '89', '12']
result = re.findall(r'(\d{2})\D', '34ssd908 On the computer 23, udh89,Try 89123')
print(result)   # ['34', '08', '23', '89']
result = re.findall(r'((\d[a-z]){2})', '2m4m Driver 9k0o Try 3k5l--')
print(result)   # [('2m4m', '4m'), ('9k0o', '0o'), ('3k5l', '5l')]
result = re.findall(r'(\d{2})-([a-z]{3})', '23-msn The data is 98-kop Christmas delivery')
print(result)   # [('23', 'msn'), ('98', 'kop')]
5. finditer(regular expression, string) - get all substrings that match the regular expression; returns an iterator whose elements are match objects
result = re.finditer(r'(\d{2})-([a-z]{3})', '23-msn The data is 98-kop Christmas delivery')
print(result)
r1 = next(result)
print(r1, r1.group(), r1.group(1), r1.group(2))
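In practice the iterator is usually consumed with a for loop or a comprehension instead of next. A minimal sketch on a shortened, invented string, collecting both text and position for every match:

```python
import re

text = '23-msn and 98-kop'

# Each element produced by finditer is a match object, so group() and span()
# are available for every occurrence, not just the first one.
found = [(m.group(), m.group(1), m.span())
         for m in re.finditer(r'(\d{2})-([a-z]{3})', text)]

print(found)   # [('23-msn', '23', (0, 6)), ('98-kop', '98', (11, 17))]
```

Unlike findall, this keeps the whole match and its position even when the pattern contains groups.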
6. split(regular expression, string) - cut the string at every substring that matches the regular expression
re.split(regular expression, string, maxsplit=N) - cut the string at only the first N substrings that match the regular expression
result = re.split(r'\d+', "It's 9564 s Shuangsheng horizon 09 Century Oriental and 3 d Disrespect disrespect 2 try")
print(result)
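The effect of limiting the number of cuts is easier to see on a short string. This sketch uses an invented input and passes the limit through the maxsplit keyword.

```python
import re

text = 'a1b22c333d'

every_cut = re.split(r'\d+', text)
first_two = re.split(r'\d+', text, maxsplit=2)

print(every_cut)   # ['a', 'b', 'c', 'd']
print(first_two)   # ['a', 'b', 'c333d']
```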
7. sub
sub(regular expression, string 1, string 2) - replace every substring of string 2 that matches the regular expression with string 1
sub(regular expression, string 1, string 2, count=N) - replace only the first N substrings of string 2 that match the regular expression with string 1
result = re.sub(r'\d+', '*', "It's 9564 s Shuangsheng horizon 09 Century Oriental and 3 d Disrespect disrespect 2 try")
print(result)
message = "f u c k you! Fight, you TM Don't you see? SB"
# The contents of badLanguage.txt are used as the regular expression
with open('badLanguage.txt', encoding='utf-8') as f:
    re_str = f.read()
re_str = r'(?i)%s' % re_str
result = re.sub(re_str, '*', message)
print(result)
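The masking example depends on an external file. As a self-contained sketch of the same idea, the pattern can be built from a word list in code; the list below is hypothetical and stands in for whatever badLanguage.txt contains.

```python
import re

# Hypothetical word list for illustration only.
blocked_words = ['darn', 'heck']
# Join the words with | and escape each one so it is matched literally;
# (?i) at the start makes the whole pattern ignore case.
pattern = r'(?i)' + '|'.join(re.escape(w) for w in blocked_words)

message = 'Darn it, what the HECK, darn'
masked_all = re.sub(pattern, '*', message)
masked_first = re.sub(pattern, '*', message, count=1)

print(masked_all)     # * it, what the *, *
print(masked_first)   # * it, what the HECK, darn
```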
8. flags parameter
Each of the functions above has a flags parameter, which is used to set matching options
- Single-line matching: re.S
By default, . cannot match \n
With re.S (single-line mode), . can match \n
re.M (multi-line mode) does not change .; it only makes ^ and $ also match at the beginning and end of each line
flags=re.S <==> r'(?s)regular expression'
- Ignore case: re.I
flags=re.I <==> r'(?i)regular expression'
flags=re.S|re.I <==> r'(?si)regular expression'
print(re.fullmatch(r'a.b', 'a\nb', flags=re.M))   # None
print(re.fullmatch(r'a.b', 'a\nb'))   # None
print(re.fullmatch(r'a.b', 'a\nb', flags=re.S))
print(re.fullmatch(r'(?s)a.b', 'a\nb'))
print(re.fullmatch(r'abc', 'abc'))
print(re.fullmatch(r'abc', 'Abc'))   # None
print(re.fullmatch(r'abc', 'ABc', flags=re.I))
print(re.fullmatch(r'(?i)abc', 'ABc'))
print(re.fullmatch(r'a.b', 'A\nb', flags=re.S|re.I))
print(re.fullmatch(r'(?is)a.b', 'A\nb'))
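re.M is easy to confuse with re.S, because the first example above returns None with or without it. What re.M actually changes is where ^ and $ may match. A small sketch on an invented three-line string:

```python
import re

text = 'one 1\ntwo 2\nthree 3'

# Without re.M, ^ matches only at the start and $ only at the end of the string.
first_words = re.findall(r'^\w+', text)
last_digits = re.findall(r'\d$', text)
# With re.M, ^ and $ also match at the start and end of every line.
line_words = re.findall(r'^\w+', text, flags=re.M)
line_digits = re.findall(r'\d$', text, flags=re.M)

print(first_words, last_digits)   # ['one'] ['3']
print(line_words, line_digits)    # ['one', 'two', 'three'] ['1', '2', '3']
```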