Day 16: Regular expressions

Matching symbols

1. The re module

The re module is Python's built-in module for working with regular expressions.
The fullmatch function:
fullmatch(regular expression, string) - check whether the regular expression matches the entire string. It returns a match object on success and None on failure.
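As a quick illustration of that return value, here is a minimal sketch (the variable names ok and bad are only for this example):

```python
from re import fullmatch

# fullmatch returns a match object on success and None on failure,
# so the result can be tested directly.
ok = fullmatch(r'abc', 'abc')
bad = fullmatch(r'abc', 'abcd')  # one extra trailing character, so no full match

print(ok is not None)  # True
print(bad)             # None
```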

2. Regular expression syntax

A regular expression is a tool that makes complex string problems simple.
The main work in writing a regular expression is to describe the rules a string must follow, using regular expression symbols.

Regular expression in Python: r'regular expression'
Regular expression in JavaScript: /regular expression/

from re import fullmatch

1) Ordinary characters (ordinary symbols)

Any symbol that has no special function or meaning in a regular expression is an ordinary character.
An ordinary character in a regular expression represents the character itself.

# Match a string made of exactly three characters: a, b, and c, in that order
re_str = r'abc'
print(fullmatch(re_str, 'abc'))

2) . - match any single character

Note: one . matches exactly one character (by default, any character except a newline)

# Match a string of length three: the first character is a, the last is c, and any one character sits between them
re_str = r'a.c'
print(fullmatch(re_str, 'abc'))
print(fullmatch(re_str, 'a+c'))
print(fullmatch(re_str, 'a good c'))  # None: more than one character between a and c

re_str = r'abc...'
print(fullmatch(re_str, 'abcm./'))
print(fullmatch(re_str, 'abcm\t/'))
print(fullmatch(re_str, 'abc G/'))

3) \d - match any digit character

re_str = r'a\d\dc'
print(fullmatch(re_str, 'a78c'))

4) \s - match any whitespace character

Whitespace characters include the space, the newline (\n), and the tab (\t)

re_str = r'a\sb'
print(fullmatch(re_str, 'a b'))
print(fullmatch(re_str, 'a\nb'))
print(fullmatch(re_str, 'a\tb'))
print(fullmatch(re_str, 'a  b'))  # None

5) \w - match any letter, digit, or underscore (rarely precise enough: on str patterns it also matches non-ASCII word characters such as Chinese characters)

6) \D and \S
\D - match any non-digit character
\S - match any non-whitespace character

print(fullmatch(r'a\Sb\D', 'a>b='))
print(fullmatch(r'a\Sb\D', 'a b='))  # None
print(fullmatch(r'a\sb\D', 'a>b1'))  # None

7) [character set] - match any one character in the set

Note: one [] matches exactly one character

[several ordinary characters] - for example, [abc] matches a, b, or c
[set containing a special symbol that starts with \] - for example, [\dabc] matches any digit, or a, b, or c

[set with a hyphen between two characters] - the hyphen denotes a range from the first character to the second (Note: the code point of the character before the hyphen must not be greater than that of the character after it)
For example:
[a-z] - match any lowercase letter
[a-d] - match any one of a, b, c, d
[A-Z] - match any uppercase letter
[1-9] - match any digit character from 1 to 9
[\u4e00-\u9fa5] - match any Chinese character in this commonly used range
[a-zA-Z], [A-Za-z] - match any letter
[a-z123] - match any lowercase letter, or 1, 2, or 3
[a-z\d] - match any lowercase letter or any digit

re_str = r'a[xym]b'
print(fullmatch(re_str, 'axb'))
print(fullmatch(re_str, 'ayb'))
print(fullmatch(re_str, 'amb'))
print(fullmatch(re_str, 'azb'))  # None

re_str = r'a[16]b'
print(fullmatch(re_str, 'a1b'))
print(fullmatch(re_str, 'a6b'))

re_str = r'a[a\db]b'
print(fullmatch(re_str, 'a1b'))
print(fullmatch(re_str, 'aab'))
print(fullmatch(re_str, 'abb'))

print(fullmatch(r'x[a-z]y', 'xmy'))

print(fullmatch(r'x[a-zA-Z]y', 'xmy'))
print(fullmatch(r'x[a-zA-Z]y', 'xKy'))

print(fullmatch(r'x[a-zA-Z*&]y', 'x*y'))
print(fullmatch(r'x[a-zA-Z*&]y', 'xMy'))

print(fullmatch(r'x[0-9]y', 'x5y'))

print(fullmatch(r'x[-09]y', 'x-y'))
print(fullmatch(r'x[-09]y', 'x0y'))
print(fullmatch(r'x[-09]y', 'x9y'))

8) [^character set] - match any one character that is not in the set

[^abc] - match any character except a, b, c

[^a-z] - match any character except lowercase letters

print(fullmatch(r'a[^\u4e00-\u9fa5]c', 'a yes c'))  # None
print(fullmatch(r'a[^a-zA-Z]c', 'azc'))  # None

Note: - and ^ have a special function inside [] only in specific positions (^ as the first character, - between two characters); anywhere else they are ordinary characters inside [].
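The position rule in the note can be checked with a short sketch (the variable names are only for this example):

```python
from re import fullmatch

# '^' negates the set only when it is the first character inside [].
caret_literal = fullmatch(r'[a^c]', '^')    # matches: '^' is an ordinary character here
negated = fullmatch(r'[^a-c]', 'b')         # None: b is excluded by the negated set

# '-' denotes a range only when it sits between two characters.
hyphen_literal = fullmatch(r'[a-]', '-')    # matches: trailing '-' is ordinary
range_only = fullmatch(r'[a-c]', '-')       # None: the set is a, b, c

print(caret_literal, negated, hyphen_literal, range_only, sep='\n')
```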

Detection symbols (anchors)

A detection symbol matches a position, not a character, so it does not change the length of the matched string. It checks whether the position where it appears meets a requirement, given that the rest of the pattern matches.
How to read a pattern with detection symbols: first ignore them and see whether the rest matches. If it does not, the whole match fails. If it does, check whether the position of each detection symbol meets its requirement.

1. \b - check whether the position is a word boundary

Word boundary - a position where a word character (\w) meets a non-word character or the start or end of the string, for example next to whitespace or punctuation

from re import fullmatch, findall

re_str = r'abc\b123'
print(fullmatch(re_str, 'abc123'))  # None: there is no word boundary between c and 1, so this pattern can never match

re_str = r'abc,\b123'
print(fullmatch(re_str, 'abc,123'))

print(fullmatch(r'abc\s\b123', 'abc 123'))

# findall(regular expression, string) - get all substrings of the string that match the regular expression

str1 = '12ksksj78ss 34antibiotic,89 Try90 56 Jiangsu Province23'
result1 = findall(r'\d\d', str1)
print(result1)  # ['12', '78', '34', '89', '90', '56', '23']

result2 = findall(r'\d\d\b', str1)
print(result2)  # ['89', '90', '56', '23']

result3 = findall(r'\b\d\d\b', str1)
print(result3)  # ['89', '56']

2. \B - check whether the position is not a word boundary

result4 = findall(r'\d\d\B', str1)
print(result4)  # ['12', '78', '34']

3. ^ - check whether the position is the start of the string

re_str = r'^\d\d'
print(fullmatch(re_str, '12'))
print(findall(r'^\d\d', str1))  # ['12']

4. $ - check whether the position is the end of the string

re_str = r'^\d\d$'
print(fullmatch(re_str, '67'))
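With fullmatch the whole string has to match anyway, so ^ and $ add little there. A minimal sketch with findall and search, on a made-up sample string, shows where they make a difference:

```python
from re import findall, search

text = '12 apples and 34'

at_start = findall(r'^\d\d', text)   # only two digits at the very start
at_end = findall(r'\d\d$', text)     # only two digits at the very end
whole = search(r'^\d\d$', text)      # the entire string must be two digits

print(at_start)  # ['12']
print(at_end)    # ['34']
print(whole)     # None
```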

Quantifiers (number of matches)

1. * - match 0 or more times (any number of times)

Usage: matching symbol followed by *

a* - match any number of a
\d* - match any number of digit characters

print(fullmatch(r'a*b', 'b'))
print(fullmatch(r'a*b', 'aab'))
print(fullmatch(r'a*b', 'aaaaaab'))
print(fullmatch(r'\d*b', '2345b'))

print(fullmatch(r'[abc]*x', 'aabbcccx'))

2. + - match one or more times (at least once)

print(fullmatch(r'a+b', 'b'))  # None
print(fullmatch(r'a+b', 'ab'))
print(fullmatch(r'a+b', 'aaab'))

3. ? - match 0 or 1 times

re_str = r'[-+]?[1-9]\d'
print(fullmatch(re_str, '-12'))
print(fullmatch(r'a?b', 'b'))
print(fullmatch(r'a?b', 'ab'))
print(fullmatch(r'a?b', 'aab'))  # None

4. {}

{N} - match exactly N times
{M,N} - match M to N times
{M,} - match at least M times
{,N} - match at most N times

* == {0,}
+ == {1,}
? == {0,1}

print(fullmatch(r'\d{3}', '123'))
print(fullmatch(r'\d{3,5}', '123'))
print(fullmatch(r'\d{3,5}', '1233'))
print(fullmatch(r'\d{3,5}', '13233'))
print(fullmatch(r'\d{3,5}', '132383'))  # None
print(fullmatch(r'\d{3,5}', '13'))  # None

print(fullmatch(r'\d{,5}', '13'))
print(fullmatch(r'\d{3,}', '1693'))

Note: a quantifier must be preceded by a matching symbol

print(fullmatch(r'+{2,3}', '++'))  # re.error: nothing to repeat

5. Greedy and non-greedy matching

When the number of repetitions is not fixed, matching is either greedy or non-greedy. The default is greedy.
Among all the repetition counts that let the match succeed, greedy takes the largest and non-greedy takes the smallest.

*, +, ?, {M,N}, {M,}, {,N} - greedy
*?, +?, ??, {M,N}?, {M,}?, {,N}? - non-greedy

from re import match

print(match(r'\d{3}', '123 Yes, yes, yes'))

print(match(r'a.*b', 'asmmdb Yes, yes, yes'))  # asmmdb

# 'asb', 'asbmmb', and 'asbmmbdb' would all be successful matches. Greedy matching takes the longest one.
print(match(r'a.*b', 'asbmmbdb Yes, yes, yes'))  # asbmmbdb
print(match(r'a.*?b', 'asbmmbdb Yes, yes, yes'))  # asb
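The difference shows up most clearly when a delimiter appears more than once. A minimal sketch on a made-up sample string:

```python
from re import findall

text = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last '</b>' in the string.
greedy = findall(r'<b>.*</b>', text)
# Non-greedy: .*? stops at the first '</b>' that lets the match succeed.
lazy = findall(r'<b>.*?</b>', text)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
```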

Grouping and branching

1. () - grouping

Function 1: treat the contents of () as a unit and operate on the unit as a whole, for example by applying a quantifier to it
Function 2: repeat the text matched by the M-th group with \M, where M starts from 1
Function 3: capture (in findall, only the text matched by the groups is returned)

str1 = '78nm34ms10xp'
print(fullmatch(r'\d\d[a-z]{2}\d\d[a-z]{2}\d\d[a-z]{2}', str1))
print(fullmatch(r'(\d\d[a-z]{2}){3}', str1))

str1 = r'abababab'
print(fullmatch(r'(ab)+', str1))


print(fullmatch(r'(\d{2})abc\1', '89abc89'))
print(fullmatch(r'(\d{2})abc\1', '89abc34'))    # None

print(fullmatch(r'\d{2}abc\1', '89abc89'))   # re.error: \1 refers to group 1, but the pattern has no group

print(fullmatch(r'(\d{3})([a-z]{3})-\2', '234ams-ams'))
print(fullmatch(r'(\d{3})([a-z]{3})-\1', '234ams-234'))
print(fullmatch(r'(\d{3})([a-z]{3})-\2\1', '234ams-ams234'))
print(fullmatch(r'(\d{3})([a-z]{3})-\1{2}', '234ams-234234'))

print(fullmatch(r'(\d{3})-\2([a-z]{3})', '234ams-ams'))  # re.error: \2 is used before group 2 is defined
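One possible use of \1 outside fullmatch is spotting a word that is accidentally written twice. This is only a sketch on a made-up sentence; it relies on search and \b, both described elsewhere in these notes:

```python
from re import search

# (\w+) captures a word, and \1 requires the same text again after one space.
m = search(r'\b(\w+) \1\b', 'this is is a test')

print(m.group())   # is is
print(m.group(1))  # is
```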

2. | - branching

regex 1|regex 2 - try to match with regex 1 first. If that succeeds, the match succeeds; if it fails, try regex 2.

# Goal: a single pattern that accepts both 'abc98' and 'abcMKP'
print(fullmatch(r'abc\d{2}|abc[A-Z]{3}', 'abcKMP'))
print(fullmatch(r'abc(\d{2}|[A-Z]{3})', 'abcMKP'))
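The | symbol splits the whole pattern unless a group limits its scope. A minimal sketch of what goes wrong when the shared prefix is written only once without a group (the variable names are only for this example):

```python
from re import fullmatch

# Without a group, the branches are 'abc\d{2}' and '[A-Z]{3}'.
ungrouped = r'abc\d{2}|[A-Z]{3}'
# With a group, 'abc' is shared and only the tail has two branches.
grouped = r'abc(\d{2}|[A-Z]{3})'

print(fullmatch(ungrouped, 'abcKMP'))  # None
print(fullmatch(ungrouped, 'KMP'))     # matches, which is not what was intended
print(fullmatch(grouped, 'abcKMP'))    # matches
print(fullmatch(grouped, 'abc98'))     # matches
```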

3. Escape symbol

Adding \ before a symbol with a special function removes that function, so the symbol becomes an ordinary character

print(fullmatch(r'\+\d{3}', '+234'))
print(fullmatch(r'\[\d{3}\]', '[234]'))
print(fullmatch(r'\\dabc', r'\dabc'))

Most symbols that have a special function on their own lose it when placed inside [] (^, -, ] and \ still need care there)

print(fullmatch(r'[+*?|()^$.]abc', '$abc'))
print(fullmatch(r'[\^abc\-z\]]123', ']123'))
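When the text to match literally comes from a variable, escaping each symbol by hand is error-prone. The re module also provides re.escape, which is not covered in these notes; a minimal sketch with a made-up literal:

```python
import re

literal = 'a+b(c)'
pattern = re.escape(literal)  # every special symbol in the text is escaped

escaped_match = re.fullmatch(pattern, 'a+b(c)')    # matches the text literally
unescaped_match = re.fullmatch(literal, 'a+b(c)')  # None: + and () keep their functions
as_regex = re.fullmatch(literal, 'aabc')           # matches: one or more a, then b, then c

print(escaped_match, unescaped_match, as_regex, sep='\n')
```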

The re module

1. compile(regular expression) - compile a regular expression and return a regular expression object

The following two calls are equivalent:
re.fullmatch(regular expression, string)
regular_expression_object.fullmatch(string)

import re

re_obj = re.compile(r'\d{3}')
print(re_obj.fullmatch('234'))

print(re.fullmatch(r'\d{3}', '234'))
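A compiled object is convenient when the same pattern is applied to many strings. A minimal sketch with a made-up list of inputs:

```python
import re

three_digits = re.compile(r'\d{3}')
codes = ['234', '12', '5678', '901']

# Keep only the strings that consist of exactly three digits.
valid = [code for code in codes if three_digits.fullmatch(code)]

print(valid)  # ['234', '901']
```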

2. fullmatch and match
fullmatch(regular expression, string) - match the regular expression against the whole string (exact match). Returns None if the match fails and a match object if it succeeds
match(regular expression, string) - match at the beginning of the string (check whether the string starts with text that satisfies the regular expression). Returns None if the match fails and a match object if it succeeds

result = re.fullmatch(r'(\d{3})-([A-Z]+)', '345-K')
print(result)    # <re.Match object; span=(0, 5), match='345-K'>
# 1) Get the matched string
# match_object.group() / match_object.group(0) - get the result of the whole match
# match_object.group(N) - get the result matched by the N-th group
print(result.group())   # 345-K
print(result.group(1))   # 345
print(result.group(2))   # K
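Because a failed match returns None, calling group() on the result without a check raises AttributeError. A minimal sketch of a guarded version; parse_code is a hypothetical helper name used only here:

```python
import re

def parse_code(text):
    """Return (digits, letters) for strings like '345-K', or None if the text does not match."""
    result = re.fullmatch(r'(\d{3})-([A-Z]+)', text)
    if result is None:
        return None
    return result.group(1), result.group(2)

print(parse_code('345-K'))  # ('345', 'K')
print(parse_code('345-k'))  # None
```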

3. search(regular expression, string) - get the first substring of the string that matches the regular expression. Returns None or a match object

result = re.search(r'\d{3}', 'Try 234 ksjs,345')
print(result)   # <re.Match object; span=(4, 7), match='234'>
print(result.group())    # 234
# 2) Get the position of the match in the original string
# match_object.span() - return a tuple (start index, end index) for the whole match; the end index is exclusive
# match_object.span(N) - return the same kind of tuple for the N-th group
print(result.span())
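Since the end index is exclusive, the tuple from span() can be used directly as a slice. A minimal sketch on the same sample string:

```python
import re

text = 'Try 234 ksjs,345'
result = re.search(r'\d{3}', text)

start, end = result.span()
print(start, end)        # 4 7
print(text[start:end])   # 234
```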

4. findall
findall(regular expression, string) - get all substrings of the string that match the regular expression and return them as a list. With no group in the pattern, the elements of the list are the matched substrings
With exactly one group in the pattern: the elements of the list are the text matched by that group
With two or more groups in the pattern: the elements of the list are tuples, and each tuple holds the text matched by each group

result = re.findall(r'\d{2}', '34ssd908 On the computer 23, udh89,Try 89123')
print(result)   # ['34', '90', '23', '89', '89', '12']

result = re.findall(r'(\d{2})\D', '34ssd908 On the computer 23, udh89,Try 89123')
print(result)  # ['34', '08', '23', '89']

result = re.findall(r'((\d[a-z]){2})', '2m4m Driver 9k0o Try 3k5l--')
print(result)   # [('2m4m', '4m'), ('9k0o', '0o'), ('3k5l', '5l')]

result = re.findall(r'(\d{2})-([a-z]{3})', '23-msn The data is 98-kop Christmas delivery')
print(result)   # [('23', 'msn'), ('98', 'kop')]

5. finditer(regular expression, string) - get all substrings of the string that match the regular expression. Returns an iterator whose elements are match objects

result = re.finditer(r'(\d{2})-([a-z]{3})', '23-msn The data is 98-kop Christmas delivery')
print(result)
r1 = next(result)
print(r1, r1.group(), r1.group(1), r1.group(2))
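Calling next() takes one match at a time; in practice the iterator is usually consumed with a for loop. A minimal sketch on the same sample string (the list name pairs is only for this example):

```python
import re

text = '23-msn The data is 98-kop Christmas delivery'

pairs = []
for m in re.finditer(r'(\d{2})-([a-z]{3})', text):
    # Each m is a match object, so group() and span() are both available.
    pairs.append((m.group(1), m.group(2), m.span()))

print(pairs)  # [('23', 'msn', (0, 6)), ('98', 'kop', (19, 25))]
```

Unlike findall, this keeps the position of every match, not just the matched text.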

6. split(regular expression, string) - split the string, using every substring that matches the regular expression as a cut point
re.split(regular expression, string, N) - split the string, using only the first N substrings that match the regular expression as cut points

result = re.split(r'\d+', "It's 9564 s Shuangsheng horizon 09 Century Oriental and 3 d Disrespect disrespect 2 try")
print(result)
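A minimal sketch of the N form on a short made-up string. The limit is passed by its keyword name, maxsplit, to keep the call unambiguous:

```python
import re

text = 'a1b22c333d'

all_parts = re.split(r'\d+', text)
first_two = re.split(r'\d+', text, maxsplit=2)  # cut only at the first two matches

print(all_parts)  # ['a', 'b', 'c', 'd']
print(first_two)  # ['a', 'b', 'c333d']
```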

7. sub
sub(regular expression, string 1, string 2) - replace every substring of string 2 that matches the regular expression with string 1
sub(regular expression, string 1, string 2, N) - replace only the first N substrings of string 2 that match the regular expression with string 1

result = re.sub(r'\d+', '*', "It's 9564 s Shuangsheng horizon 09 Century Oriental and 3 d Disrespect disrespect 2 try")
print(result)
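A minimal sketch of the N form on a short made-up string. The limit is passed by its keyword name, count:

```python
import re

text = 'a1b22c333d'

all_replaced = re.sub(r'\d+', '*', text)
first_two = re.sub(r'\d+', '*', text, count=2)  # replace only the first two matches

print(all_replaced)  # a*b*c*d
print(first_two)     # a*b*c333d
```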

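message = "f u c    k you! Fight, you TM Don't you see? SB"
# badLanguage.txt is expected to contain a regular expression that matches the words to censor
with open('badLanguage.txt', encoding='utf-8') as f:
    re_str = f.read()
re_str = r'(?i)%s' % re_str
result = re.sub(re_str, '*', message)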
print(result)

8. The flags parameter

Each of the functions above has a flags parameter that sets matching options for the regular expression

  1. Single-line and multi-line matching: re.S, re.M
    re.S (single-line mode): . can match \n
    Without re.S (the default): . cannot match \n
    re.M (multi-line mode) does not change .; it makes ^ and $ match at the start and end of every line

flags=re.S is equivalent to r'(?s)regular expression'

  2. Ignore case: re.I
    flags=re.I is equivalent to r'(?i)regular expression'

flags=re.S|re.I is equivalent to r'(?si)regular expression'

print(re.fullmatch(r'a.b', 'a\nb', flags=re.M))     # None
print(re.fullmatch(r'a.b', 'a\nb'))                 # None
print(re.fullmatch(r'a.b', 'a\nb', flags=re.S))
print(re.fullmatch(r'(?s)a.b', 'a\nb'))

print(re.fullmatch(r'abc', 'abc'))
print(re.fullmatch(r'abc', 'Abc'))      # None
print(re.fullmatch(r'abc', 'ABc', flags=re.I))
print(re.fullmatch(r'(?i)abc', 'ABc'))

print(re.fullmatch(r'a.b', 'A\nb', flags=re.S|re.I))
print(re.fullmatch(r'(?is)a.b', 'A\nb'))
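The examples above show that re.M leaves . unchanged. What re.M does change is where ^ and $ can match, as this minimal sketch on a made-up three-line string suggests:

```python
import re

text = 'one\ntwo\nthree'

# Default: ^ matches only at the start of the string and $ only at its end,
# and \w+ cannot cross a newline, so nothing matches.
default_lines = re.findall(r'^\w+$', text)
# With re.M: ^ and $ also match at the start and end of every line.
multi_lines = re.findall(r'^\w+$', text, flags=re.M)

print(default_lines)  # []
print(multi_lines)    # ['one', 'two', 'three']
```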

Keywords: Python

Added by eagle1771 on Mon, 03 Jan 2022 09:05:59 +0200