In Python, regular expressions are used to match text against a pattern; they are provided by the re module.
The approximate matching process of a regular expression is:
1. Compare the characters in the expression with the characters in the text, one by one.
2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.
r: Backslashes need no special treatment in string literals prefixed with 'r'. Therefore, r"\n" represents a string containing two characters, '\' and 'n', while "\n" represents a string containing only one newline character.
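The difference between a raw and an ordinary string literal can be checked directly. The following is a minimal sketch; the variable names plain and raw are chosen only for illustration.

```python
# Compare an ordinary string literal with a raw string literal.
plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash followed by 'n'

print(len(plain))     # 1
print(len(raw))       # 2
print(raw == "\\n")   # True: the raw form equals the escaped-backslash form
```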
Using the re module: import re
re.match function
Syntax: re.match(pattern, string, flags=0)
pattern    The regular expression to match
string     The string to match against
flags      Flag bits that control how the regular expression matches, such as case insensitivity or multi-line matching. Common flags:
- re.I makes matching case-insensitive
- re.L makes the special sequences \w, \W, \b, \B, \s, \S depend on the current locale
- re.M enables multi-line mode, which affects ^ and $
- re.S makes . match any character including the newline character (by default . excludes the newline character)
- re.U makes the special sequences \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character property database
- re.X ignores whitespace and comments after # in the pattern, for readability
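The effect of the most commonly used flags can be seen in a short sketch. The sample text and variable names below are assumptions made for illustration only.

```python
import re

text = "Hello\nworld"

# re.I: ignore case
ignore_case = re.match(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of each line
multiline = re.findall(r"^\w+$", text, re.M)

# re.S: '.' also matches the newline character
dot_all = re.match(r"Hello.world", text, re.S)
no_dot_all = re.match(r"Hello.world", text)

# re.X: whitespace and # comments inside the pattern are ignored
verbose = re.match(r"""
    (\w+)   # first word
    \n      # a newline
    (\w+)   # second word
""", text, re.X)

print(ignore_case.group())   # Hello
print(multiline)             # ['Hello', 'world']
print(dot_all is not None)   # True
print(no_dot_all)            # None
print(verbose.groups())      # ('Hello', 'world')
```

Several flags can be combined with the | operator, for example re.I | re.M.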
re.match tries to match a pattern from the starting position of the string. If the match is not successful, match() returns None. If the match succeeds, re.match returns a match object.
If the previous step matched, the data can be extracted with the match object's group(num) or groups() method.
group() extracts the strings captured by groups, and () in the pattern creates a group. group() and group(0) return the whole match of the regular expression. group(1) returns the part matched by the first pair of brackets, group(2) the part matched by the second, and group(3) the part matched by the third. Like re.match(), re.search() returns None when nothing matches.
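The group methods described above can be sketched as follows. The date string is hypothetical sample data, not taken from the original text.

```python
import re

# Hypothetical sample data: a date with three bracketed groups.
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-01-15 release")

if m:  # match() returns None on failure, so check before calling group()
    print(m.group())      # 2024-01-15
    print(m.group(0))     # 2024-01-15
    print(m.group(1))     # 2024
    print(m.group(2, 3))  # ('01', '15')
    print(m.groups())     # ('2024', '01', '15')
```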
For example:
>>> import re
>>> result = re.match("itcast", "itcast.cn")
>>> result.group()
'itcast'
Matching starts at the head of the string, and the whole pattern can be matched. When the pattern is used up the matching ends, so the remaining ".cn" is not matched, and the successful match is returned.
Match single character
character  function                                         position
.          Matches any one character (except \n)
[ ]        Matches any one character listed in [ ]
\d         Matches a digit, i.e. 0-9                        Can be written in a character set [...]
\D         Matches a non-digit                              Can be written in a character set [...]
\s         Matches whitespace, e.g. space or tab            Can be written in a character set [...]
\S         Matches a non-whitespace character               Can be written in a character set [...]
\w         Matches a word character, i.e. a-z, A-Z, 0-9, _  Can be written in a character set [...]
\W         Matches a non-word character                     Can be written in a character set [...]
[...] Character set. The corresponding position can be any one character in the set. The characters in the set can be listed one by one or given as a range, such as [abc] or [a-c]. If the first character is ^, the set is negated: [^abc] matches any character that is not a, b, or c. All special characters lose their special meaning inside a character set. To use ], - or ^ literally inside a set, escape them with a backslash, or place ] and - as the first character and ^ anywhere but first.
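The character set rules can be tried out with a short sketch. The sample strings are assumptions for illustration, and placing - last in the final pattern is one common unambiguous position chosen for this sketch.

```python
import re

# A leading ^ negates the set: match one character that is not a digit.
not_digit = re.match(r"[^0-9]", "a1")

# Special characters are literal inside a set: '.' here matches only a dot.
literal_dot = re.findall(r"[.]", "a.b.c")

# ']', '-' and '^' escaped with a backslash so they are taken literally.
escaped = re.findall(r"[\]\-\^]", "a]b-c^d")

# The same set without escapes: ']' first, '^' not first, '-' last.
positioned = re.findall(r"[]^-]", "a]b-c^d")

print(not_digit.group())  # a
print(literal_dot)        # ['.', '.']
print(escaped)            # [']', '-', '^']
print(positioned)         # [']', '-', '^']
```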
For example:
import re

ret = re.match(".", "M")
print(ret.group())
ret = re.match("t.o", "too")
print(ret.group())
ret = re.match("t.o", "two")
print(ret.group())
# If the first character of hello is lowercase, the regular expression needs a lowercase h
ret = re.match("h", "hello Python")
print(ret.group())
# If the first character of Hello is uppercase, the regular expression needs an uppercase H
ret = re.match("H", "Hello Python")
print(ret.group())
# Accept either a lowercase or an uppercase h
ret = re.match("[hH]", "hello Python")
print(ret.group())
ret = re.match("[hH]", "Hello Python")
print(ret.group())
ret = re.match("[hH]ello Python", "Hello Python")
print(ret.group())
# Several ways of matching a digit from 0 to 9
ret = re.match("[0123456789]Hello Python", "7Hello Python")
print(ret.group())
ret = re.match("[0-9]Hello Python", "7Hello Python")
print(ret.group())
# Match 0 to 3 and 5 to 9
ret = re.match("[0-35-9]Hello Python", "7Hello Python")
print(ret.group())
ret = re.match("[0-35-9]Hello Python", "4Hello Python")
# print(ret.group())  # ret is None here, because 4 is not in the set
# \d matches a single digit
ret = re.match(r"Chang'e-\d", "Chang'e-1 was successfully launched")
print(ret.group())
ret = re.match(r"Chang'e-\d", "Chang'e-2 successfully launched")
print(ret.group())
result:
M
too
two
h
H
h
H
Hello Python
7Hello Python
7Hello Python
7Hello Python
Chang'e-1
Chang'e-2
Match multiple characters
character  function                                   position                         expression  complete matching string
*          Previous character occurs 0 or more times  Used after a character or (...)  abc*        abccc
+          Previous character occurs 1 or more times  Used after a character or (...)  abc+        abccc
?          Previous character occurs 0 or 1 times     Used after a character or (...)  abc?        ab, abc
{m}        Previous character occurs exactly m times  Used after a character or (...)  ab{2}c      abbc
{m,n}      Previous character occurs m to n times     Used after a character or (...)  ab{1,2}c    abc, abbc
For {m,n}: if m is omitted, the previous character is matched 0 to n times; if n is omitted, it is matched m to unlimited times.
For example:
import re

# Requirement: match a string whose first letter is uppercase, followed by
# lowercase letters, where the lowercase letters are optional
ret = re.match("[A-Z][a-z]*", "M")
print(ret.group())
ret = re.match("[A-Z][a-z]*", "MnnM")
print(ret.group())
ret = re.match("[A-Z][a-z]*", "Aabcdef")
print(ret.group())

# Check whether a variable name is valid
names = ["name1", "_name", "2_name", "__name__"]
for name in names:
    ret = re.match(r"[a-zA-Z_]+[\w]*", name)
    if ret:
        print("Variable name %s meets the requirements" % ret.group())
    else:
        print("Variable name %s is invalid" % name)

# Match a number between 0 and 99
ret = re.match("[1-9]?[0-9]", "7")
print(ret.group())
ret = re.match(r"[1-9]?\d", "33")
print(ret.group())
# This result is not what we want; the $ anchor can solve it
ret = re.match(r"[1-9]?\d", "09")
print(ret.group())

ret = re.match("[a-zA-Z0-9_]{6}", "12a3g45678")
print(ret.group())
# Match a password of 8 to 20 characters, made of uppercase and lowercase
# English letters, digits and underscores
ret = re.match("[a-zA-Z0-9_]{8,20}", "1ad12f23s34455ff66")
print(ret.group())
result:
M
Mnn
Aabcdef
Variable name name1 meets the requirements
Variable name _name meets the requirements
Variable name 2_name is invalid
Variable name __name__ meets the requirements
7
33
0
12a3g4
1ad12f23s34455ff66
Match beginning and end
character  function
^          Matches the beginning of the string
$          Matches the end of the string
Example: match 163.com email addresses
import re

email_list = ["xiaoWang@163.com", "xiaoWang@163.comheihei", ".com.xiaowang@qq.com"]
for email in email_list:
    # \. matches a literal dot; $ requires the address to end right after "com"
    ret = re.match(r"[\w]{4,20}@163\.com$", email)
    if ret:
        print("%s is a valid email address; the match is: %s" % (email, ret.group()))
    else:
        print("%s does not meet the requirements" % email)
result:
xiaoWang@163.com is a valid email address; the match is: xiaoWang@163.com
xiaoWang@163.comheihei does not meet the requirements
.com.xiaowang@qq.com does not meet the requirements
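The $ anchor also addresses the earlier 0 to 99 example, where the input "09" produced the partial match "0". A minimal sketch, with the list of inputs chosen for illustration:

```python
import re

results = {}
for text in ["7", "33", "09"]:
    # $ forces the pattern to consume the whole string
    ret = re.match(r"[1-9]?\d$", text)
    results[text] = ret.group() if ret else None
    print(text, "->", results[text])
```

With the anchor in place, "7" and "33" still match in full, while "09" no longer matches at all.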
Matching grouping
character   function
|           Matches either the left or the right expression
(ab)        Treats the characters in parentheses as a group
\num        Refers to the string matched by group number num
(?P<name>)  Gives the group an alias; the substring matched by the group can be retrieved by that name
(?P=name)   Refers to the string matched by the group with the alias name
Example: |
# Match numbers between 0 and 100
import re

ret = re.match(r"[1-9]?\d$|100", "8")
print(ret.group())  # 8
ret = re.match(r"[1-9]?\d$|100", "78")
print(ret.group())  # 78
ret = re.match(r"[1-9]?\d$|100", "08")
# print(ret.group())  # ret is None: "08" is not a valid number between 0 and 100
ret = re.match(r"[1-9]?\d$|100", "100")
print(ret.group())  # 100
Example: ( )
import re

# Requirement: match 163, 126 and qq mailboxes
ret = re.match(r"\w{4,20}@163\.com", "test@163.com")
print(ret.group())  # test@163.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@126.com")
print(ret.group())  # test@126.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@qq.com")
print(ret.group())  # test@qq.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@gmail.com")
if ret:
    print(ret.group())
else:
    print("Not a 163, 126 or qq mailbox")  # Not a 163, 126 or qq mailbox

# Match 11-digit mobile numbers that do not end in 4 or 7
tels = ["13100001234", "18912344321", "10086", "18800007777"]
for tel in tels:
    ret = re.match(r"1\d{9}[0-35-68-9]", tel)
    if ret:
        print(ret.group())
    else:
        print("%s is not a wanted mobile number" % tel)

# Extract the area code and the phone number
ret = re.match(r"([^-]*)-(\d+)", "010-12345678")
print(ret.group())   # 010-12345678
print(ret.group(1))  # 010
print(ret.group(2))  # 12345678
Example: \number
Matches the contents of the group with the same number. Each pair of brackets is a group, and groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it is not interpreted as a group reference but as the character with that octal value. Inside the '[' and ']' of a character set, all numeric escapes are treated as characters.
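The (.+) \1 case described above can be checked with a minimal sketch; the dictionary name outcomes is chosen only for illustration.

```python
import re

pattern = r"(.+) \1"  # a group, a space, then the same text again

outcomes = {}
for text in ["the the", "55 55", "thethe"]:
    m = re.match(pattern, text)
    outcomes[text] = m.group() if m else None
    print(repr(text), "->", outcomes[text])
```

The first two inputs match in full; 'thethe' does not match because the space required after the group is missing.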
Example 1: match <html>hh</html>
\1, ..., \9 match the content of the nth group. In this example, \1 refers to the content matched by the first group.
import re

# The right way to think about it: whatever is inside the first pair of < >
# should also be inside the closing pair of < >. Referring back to the data
# matched by the group achieves this. Note that the pattern must be a raw
# string, i.e. written in the r"" format.
ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</html>")

# In the second tag below, the data in the two pairs of < > differ, so it does not match
test_label = ["<html>hh</html>", "<html>hh</htmlbalabala>"]
for label in test_label:
    ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", label)
    if ret:
        print("%s is a correct pair of tags" % ret.group())
    else:
        print("%s is an incorrect pair of tags" % label)
result:
<html>hh</html> is a correct pair of tags
<html>hh</htmlbalabala> is an incorrect pair of tags
Example 2: match <html><h1>www.itcast.cn</h1></html>
import re

labels = ["<html><h1>www.itcast.cn</h1></html>", "<html><h1>www.itcast.cn</h2></html>"]
for label in labels:
    # \2 must repeat the inner tag name and \1 the outer tag name
    ret = re.match(r"<(\w*)><(\w*)>.*</\2></\1>", label)
    if ret:
        print("%s is a tag that meets the requirements" % ret.group())
    else:
        print("%s does not meet the requirements" % label)
result:
<html><h1>www.itcast.cn</h1></html> is a tag that meets the requirements
<html><h1>www.itcast.cn</h2></html> does not meet the requirements
Example: (?P<name>) (?P=name)
The first form names a group; the second reuses that named group within the same regular expression.
import re

ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>www.itcast.cn</h1></html>")
print(ret.group())  # <html><h1>www.itcast.cn</h1></html>
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>", "<html><h1>www.itcast.cn</h2></html>")
# print(ret.group())  # ret is None, because </h2> does not repeat name2
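Named groups can also be read back by name after a match. The following is a minimal sketch; the group names outer, inner and body are assumptions chosen for this illustration and do not appear in the original example.

```python
import re

pattern = r"<(?P<outer>\w*)><(?P<inner>\w*)>(?P<body>.*)</(?P=inner)></(?P=outer)>"
m = re.match(pattern, "<html><h1>www.itcast.cn</h1></html>")

print(m.group("outer"))  # html
print(m.group("inner"))  # h1
print(m.group("body"))   # www.itcast.cn
print(m.groupdict())     # {'outer': 'html', 'inner': 'h1', 'body': 'www.itcast.cn'}
```

Named groups are still numbered, so m.group(1) and m.group("outer") return the same string.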
re.compile function
The compile function is used to compile regular expressions and generate a regular expression (Pattern) object for use by the match() and search() functions.
prog = re.compile(pattern)
result = prog.match(string)
is equivalent to
result = re.match(pattern, string)
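A compiled pattern is convenient when the same expression is used several times. A minimal sketch, with hypothetical sample strings:

```python
import re

digits = re.compile(r"\d+")  # compile once, reuse many times

first = digits.match("123abc").group()
found = digits.search("abc456").group()
every = digits.findall("1 a 22 b 333")

print(first)  # 123
print(found)  # 456
print(every)  # ['1', '22', '333']
```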
For example:
>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.match('one12twothree34four', 3, 10)  # start matching at the position of '1'; this matches
>>> print(m)  # a Match object is returned
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)  # 0 can be omitted
'12'
>>> m.start(0)  # 0 can be omitted
3
>>> m.end(0)  # 0 can be omitted
5
>>> m.span(0)  # 0 can be omitted
(3, 5)
Above, when the match succeeds, a Match object is returned, where:
- The group([group1, ...]) method returns the strings matched by one or more groups. To get the whole matched substring, use group() or group(0);
- The start([group]) method returns the start position, within the whole string, of the substring matched by the group (the index of the first character of the substring); the default value of the parameter is 0;
- The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the last character of the substring + 1); the default value of the parameter is 0;
- The span([group]) method returns (start(group), end(group)).
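The position methods also accept a group number, which the example above does not show. A minimal sketch, using a hypothetical sample address:

```python
import re

text = "test@163.com"
m = re.match(r"(\w+)@(\w+)\.com", text)

print(m.span())                   # (0, 12)
print(m.start(1), m.end(1))       # 0 4
print(m.span(2))                  # (5, 8)
print(text[m.start(2):m.end(2)])  # 163
```

Slicing the original string with start and end gives the same text as group.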
re.search function
re.search scans the entire string and returns the first successful match. If there is no match, it returns None.
The difference between re.match and re.search: re.match only matches at the beginning of the string; if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string until a match is found.
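The difference can be seen by running both functions on the same input. A minimal sketch, with a hypothetical sample string:

```python
import re

text = "Number of reads: 9999"

by_match = re.match(r"\d+", text)         # the string does not start with a digit
by_search = re.search(r"\d+", text)       # finds the digits later in the string
anchored_search = re.search(r"^\d+", text)  # with ^, search is tied to the start too

print(by_match)           # None
print(by_search.group())  # 9999
print(anchored_search)    # None
```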
For example:
import re

ret = re.search(r"\d+", "The number of readings is 9999")
print(ret.group())
result:
9999
re.findall function
Finds all substrings matched by the regular expression in the string and returns them as a list. If no match is found, an empty list is returned. Note: match and search find one match, while findall finds all matches.
For example:
import re

ret = re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345")
print(ret)
result:
['9999', '7890', '12345']
re.finditer function
Similar to findall: all substrings matched by the regular expression are found in the string, but they are returned as an iterator of match objects.
import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())
result:
12
32
43
3
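Because finditer yields match objects rather than plain strings, each result also carries its position in the string. A minimal sketch based on the same input:

```python
import re

# Collect the matched text together with its (start, end) span
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", "12a32bc43jf3")]

for text, span in found:
    print(text, span)
# 12 (0, 2)
# 32 (3, 5)
# 43 (7, 9)
# 3 (11, 12)
```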
re.sub function
sub is short for substitute; it replaces the matched data.
Syntax: re.sub(pattern, repl, string, count=0, flags=0)
parameter  description
pattern    Required. The pattern string of the regular expression
repl       Required. The replacement: the string to substitute in, or a function
string     Required. The string to be searched and replaced
count      Optional. The maximum number of replacements; it must be a non-negative integer. If this parameter is omitted or set to 0, all matches are replaced
flags      Optional. Flag bits that control how the regular expression matches, such as case insensitivity or multi-line matching
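The optional count and flags parameters can be sketched as follows. The sample text is an assumption made for illustration; passing count and flags by keyword avoids mixing up their positions.

```python
import re

text = "python = 997, c = 997, java = 997"

all_replaced = re.sub(r"\d+", "998", text)          # count omitted: replace every match
first_only = re.sub(r"\d+", "998", text, count=1)   # replace only the first match
ignore_case = re.sub(r"PYTHON", "py", text, flags=re.I)

print(all_replaced)  # python = 998, c = 998, java = 998
print(first_only)    # python = 998, c = 997, java = 997
print(ignore_case)   # py = 997, c = 997, java = 997
```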
For example: add 1 to the matched read count
Method 1:
import re

ret = re.sub(r"\d+", '998', "python = 997")
print(ret)
Result: python = 998
Method 2:
import re

def add(temp):
    # temp is a re.Match object; int() needs a string, a bytes-like object
    # or a number, so extract the matched text with group() first
    strNum = temp.group()
    num = int(strNum) + 1
    return str(num)

ret = re.sub(r"\d+", add, "python = 997")
print(ret)
ret = re.sub(r"\d+", add, "python = 99")
print(ret)
result:
python = 998
python = 100
re.subn function
The behavior is the same as that of sub(), but returns a tuple (string, number of substitutions).
re.subn(pattern, repl, string[, count])
Returns: (sub(repl, string[, count]), number of substitutions)
import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
# \2 \1 swaps the two words captured by the groups
print(re.subn(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.subn(pattern, func, s))
### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)
re.split function
Cut the string according to the match and return a list.
Syntax: re.split(pattern, string, maxsplit=0, flags=0)
parameter  description
pattern    The regular expression to split on
string     The string to be split
maxsplit   The number of splits: maxsplit=1 splits once; the default is 0, which means no limit
For example:
import re

ret = re.split(r":| ", "info:xiaoZhang 33 shandong")
print(ret)
Result: ['info', 'xiaoZhang', '33', 'shandong']
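The effect of maxsplit can be seen on the same input. A minimal sketch:

```python
import re

text = "info:xiaoZhang 33 shandong"

unlimited = re.split(r":| ", text)             # default maxsplit=0: no limit
once = re.split(r":| ", text, maxsplit=1)      # stop after the first split

print(unlimited)  # ['info', 'xiaoZhang', '33', 'shandong']
print(once)       # ['info', 'xiaoZhang 33 shandong']
```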
Python greedy and non-greedy matching
Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible. Non-greedy quantifiers do the opposite and always try to match as few characters as possible.
For example, if the regular expression "ab*" is used on "abbbc", it finds "abbb". With the non-greedy quantifier "ab*?", it finds "a".
Note: we generally use non-greedy patterns when extracting data.
Adding ? after "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.
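The "abbbc" case above, together with the other non-greedy forms, can be checked with a minimal sketch; the variable names are chosen only for illustration.

```python
import re

text = "abbbc"

greedy_star = re.match(r"ab*", text).group()     # as many b as possible
lazy_star = re.match(r"ab*?", text).group()      # as few b as possible (zero)
lazy_plus = re.match(r"ab+?", text).group()      # at least one b, but no more
lazy_range = re.match(r"ab{1,3}?", text).group() # the minimum of the range
lazy_optional = re.match(r"ab??", text).group()  # prefer zero occurrences

print(greedy_star)    # abbb
print(lazy_star)      # a
print(lazy_plus)      # ab
print(lazy_range)     # ab
print(lazy_optional)  # a
```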
Example 1:
import re

s = "This is a number 234-235-22-423"
# Quantifiers in a regular expression are greedy: evaluated from left to right,
# each one tries to grab the longest string that still lets the pattern match.
# Here ".+" grabs as much as it can from the start of the string, including most
# of the first integer field that we want. "\d+" needs only one character to
# match, so it matches the digit "4", while ".+" matches every character from
# the start of the string up to that digit 4.
r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
# Solution: the non-greedy operator "?". It can be used after "*", "+" and "?",
# and asks the regular expression to match as few characters as possible.
r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
print(r.group(1))
result:
4-235-22-423
234-235-22-423
Example 2:
>>> re.match(r"aa(\d+)", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)", "aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd", "aa2343ddd").group(1)
'2343'
Example 3: extract a picture address
import re

test_str = "<img data-original=https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg>"
# .*? stops at the first ".jpg"; \. matches a literal dot
ret = re.search(r"https://.*?\.jpg", test_str)
print(ret.group())
result: https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg
The role of r
Like most programming languages, regular expressions use "\" as the escape character, which can cause a backslash problem. If you need to match the character "\" in text, the regular expression written in a programming language needs four backslashes "\\\\": the first two and the last two are each escaped into one backslash by the programming language, giving two backslashes, which the regular expression then escapes into a single backslash. Raw strings in Python solve this problem well: prefixing a string with r makes it a raw string.
import re

mm = "c:\\a\\b\\c"
print(mm)  # c:\a\b\c
ret = re.match("c:\\\\", mm).group()
print(ret)  # c:\
ret = re.match("c:\\\\a", mm).group()
print(ret)  # c:\a
ret = re.match(r"c:\\a", mm).group()
print(ret)  # c:\a
# In the pattern r"c:\a", \a is the bell character rather than a backslash
# followed by 'a', so match() returns None and the call to group() raises
# AttributeError: 'NoneType' object has no attribute 'group'
ret = re.match(r"c:\a", mm).group()
print(ret)