Python: regular expressions (re module), a detailed explanation

When Python needs to match strings against a regular expression, it can use a module named re.

The matching process of a regular expression is roughly:
1. Compare the characters in the expression and in the text one by one.
2. If every character matches, the match succeeds; as soon as one character fails to match, the match fails.
3. If the expression contains quantifiers or boundaries, the process is slightly different.

r: Backslashes need no special treatment in string literals prefixed with 'r'. Therefore, r"\n" is a string containing the two characters '\' and 'n', while "\n" is a string containing a single newline character.
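A minimal sketch of the difference, using only built-in string behaviour. The variable names raw and cooked are illustrative, not part of any API:

```python
raw = r"\n"     # raw string: a backslash followed by the letter n
cooked = "\n"   # ordinary string: one newline character

print(len(raw))       # 2
print(len(cooked))    # 1
print(raw == "\\n")   # True: the raw form equals the doubled-backslash form
```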

Use of re module: import re

re.match function

Syntax: re.match(pattern, string, flags=0)

pattern   The regular expression to match
string    The string to match against
flags     Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching

re.I   Ignore case
re.L   Make the special character classes \w, \W, \b, \B, \s, \S depend on the current locale
re.M   Multi-line mode
re.S   Make . match any character, including a newline (without this flag, . does not match a newline)
re.U   Make the special character classes \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character property database
re.X   Ignore whitespace and anything after # in the pattern, for readability
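The flags listed above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration; the flags themselves are the standard ones from the re module:

```python
import re

# re.I: ignore case
m_case = re.match(r"python", "PYTHON rocks", re.I)
print(m_case.group())            # PYTHON

# re.S: let "." match a newline as well
m_plain = re.match(r"a.b", "a\nb")
m_dotall = re.match(r"a.b", "a\nb", re.S)
print(m_plain)                   # None
print(repr(m_dotall.group()))    # 'a\nb'

# re.M: let ^ match at the start of every line
lines = re.findall(r"^\w+", "one 1\ntwo 2", re.M)
print(lines)                     # ['one', 'two']

# re.X: whitespace and # comments inside the pattern are ignored
m_verbose = re.match(r"""
    \d{3}   # area code
    -       # separator
    \d{4}   # number
""", "010-1234", re.X)
print(m_verbose.group())         # 010-1234
```

Several flags can be combined with the | operator, for example re.I | re.M.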

re.match tries to match the pattern from the starting position of the string. If the match does not succeed, match() returns None. If it succeeds, re.match returns a match object.

If the previous step matched, the group method can be used to extract the data. Use the match object's group(num) or groups() method to get what the expression matched.

group() extracts the strings captured by grouping, where () marks a group. group() and group(0) both return the whole match of the regular expression. group(1) returns the part matched by the first pair of parentheses, group(2) the part matched by the second pair, and group(3) the part matched by the third pair. If there is no match, re.match() (like re.search()) returns None.

Example:

>>> import re
>>> result = re.match("itcast","itcast.cn")
>>> result.group()
'itcast'

Matching starts at the beginning of the string, and the whole pattern can be matched. Once the pattern is exhausted the matching stops, so the trailing .cn is not matched, and the successful match is returned.
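The group(num) and groups() methods described earlier can be sketched as follows. The pattern and the sample address are assumptions chosen for illustration:

```python
import re

m = re.match(r"(\w+)@(\w+)\.com", "xiaoWang@163.com")
print(m.group())     # xiaoWang@163.com  (same as m.group(0))
print(m.group(1))    # xiaoWang
print(m.group(2))    # 163
print(m.groups())    # ('xiaoWang', '163')

# When nothing matches, the result is None, so test it before calling group()
no_match = re.match(r"(\w+)@(\w+)\.com", "no at sign")
print(no_match)      # None
```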

Match single character

Character  Function                                              Position
.          Matches any single character (except \n)
[ ]        Matches any one of the characters listed in [ ]
\d         Matches a digit, i.e. 0-9                             Can be written inside a character set [...]
\D         Matches a non-digit, i.e. anything that is not 0-9    Can be written inside a character set [...]
\s         Matches whitespace, e.g. space, tab                   Can be written inside a character set [...]
\S         Matches a non-whitespace character                    Can be written inside a character set [...]
\w         Matches a word character, i.e. a-z, A-Z, 0-9, _       Can be written inside a character set [...]
\W         Matches a non-word character                          Can be written inside a character set [...]

[...] Character set. The corresponding position can be any one character in the set. The characters can be listed one by one or given as a range, such as [abc] or [a-c]. If the first character is ^, the set is negated: [^abc] matches any character other than a, b and c. Most special characters lose their special meaning inside a character set. To use ], - or ^ literally in a set, escape them with a backslash, or put ] or - in the first position and ^ in any position but the first.

Example:

import re
ret = re.match(".","M")
print(ret.group())
ret = re.match("t.o","too")
print(ret.group())
ret = re.match("t.o","two")
print(ret.group())
# If hello starts with a lowercase h, the regular expression needs a lowercase h
ret = re.match("h","hello Python")
print(ret.group())
# If Hello starts with an uppercase H, the regular expression needs an uppercase H
ret = re.match("H","Hello Python")
print(ret.group())
# Either an uppercase or a lowercase h is accepted
ret = re.match("[hH]","hello Python")
print(ret.group())
ret = re.match("[hH]","Hello Python")
print(ret.group())
ret = re.match("[hH]ello Python","Hello Python")
print(ret.group())
# Several ways to match a digit from 0 to 9
ret = re.match("[0123456789]Hello Python","7Hello Python")
print(ret.group())
ret = re.match("[0-9]Hello Python","7Hello Python")
print(ret.group())
# Match 0 to 3 and 5 to 9
ret = re.match("[0-35-9]Hello Python","7Hello Python")
print(ret.group())
ret = re.match("[0-35-9]Hello Python","4Hello Python")
# print(ret.group())  # 4 is not in [0-35-9], so ret is None and group() would fail
ret = re.match(r"Chang'e-\d","Chang'e-1 was successfully launched")
print(ret.group())
ret = re.match(r"Chang'e-\d","Chang'e-2 successfully launched")
print(ret.group())

result:

M
too
two
h
H
h
H
Hello Python
7Hello Python
7Hello Python
7Hello Python
Chang'e-1
Chang'e-2
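The character-set rules described before this example (negation with ^, and literal use of special characters) can be checked with a small sketch. The patterns and test strings are illustrative assumptions:

```python
import re

# ^ in first position negates the set
negated_hit = re.match(r"[^abc]", "d")
negated_miss = re.match(r"[^abc]", "a")
print(negated_hit.group())    # d
print(negated_miss)           # None

# Characters such as . * + are literal inside a set
literal = re.match(r"[.*+]", "+")
print(literal.group())        # +

# ], - and ^ used literally: escaped with backslashes ...
escaped = re.match(r"[\]\-^]", "-")
print(escaped.group())        # -

# ... or placed where they cannot be special: ] first, ^ not first, - last
positional = re.match(r"[]^-]+", "]-^")
print(positional.group())     # ]-^
```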

Match multiple characters

Character  Function                                                  Position                         Expression instance  Complete matching string
*          Matches the preceding character 0 or more times           Used after a character or (...)  abc*                 abccc
+          Matches the preceding character 1 or more times           Used after a character or (...)  abc+                 abccc
?          Matches the preceding character 0 or 1 times (optional)   Used after a character or (...)  abc?                 ab, abc
{m}        Matches the preceding character exactly m times           Used after a character or (...)  ab{2}c               abbc
{m,n}      Matches the preceding character m to n times (see note)   Used after a character or (...)  ab{1,2}c             abc, abbc

Note on {m,n}: if m is omitted, the character is matched 0 to n times; if n is omitted, it is matched m or more times.

Example:

import re
# Match a string whose first character is an uppercase letter followed only by lowercase letters; the lowercase letters are optional
ret = re.match("[A-Z][a-z]*","M")
print(ret.group())
ret = re.match("[A-Z][a-z]*","MnnM")
print(ret.group())
ret = re.match("[A-Z][a-z]*","Aabcdef")
print(ret.group())
# Check whether a variable name is valid
names = ["name1", "_name", "2_name", "__name__"]
for name in names:
    ret = re.match(r"[a-zA-Z_]+[\w]*",name)
    if ret:
        print("Variable name %s meets the requirements" % ret.group())
    else:
        print("Variable name %s is illegal" % name)
# Match a number between 0 and 99
ret = re.match("[1-9]?[0-9]","7")
print(ret.group())
ret = re.match(r"[1-9]?\d","33")
print(ret.group())
# This result is not what we want; $ can solve it
ret = re.match(r"[1-9]?\d","09")
print(ret.group())
ret = re.match("[a-zA-Z0-9_]{6}","12a3g45678")
print(ret.group())
# Match a password of 8 to 20 characters made of uppercase or lowercase letters, digits and underscores
ret = re.match("[a-zA-Z0-9_]{8,20}","1ad12f23s34455ff66")
print(ret.group())

result:
M
Mnn
Aabcdef
Variable name name1 meets the requirements
Variable name _name meets the requirements
Variable name 2_name is illegal
Variable name __name__ meets the requirements
7
33
0
12a3g4
1ad12f23s34455ff66

Match beginning and end

Character  Function
^          Matches the beginning of the string
$          Matches the end of the string

Example: match email addresses at 163.com

import re
email_list = ["xiaoWang@163.com", "xiaoWang@163.comheihei", ".com.xiaowang@qq.com"]
for email in email_list:
    ret = re.match(r"[\w]{4,20}@163\.com$", email)
    if ret:
        print("%s is a qualified email address, and the matching result is: %s" % (email, ret.group()))
    else:
        print("%s does not meet the requirements" % email)

result:

xiaoWang@163.com is a qualified email address, and the matching result is: xiaoWang@163.com
xiaoWang@163.comheihei does not meet the requirements
.com.xiaowang@qq.com does not meet the requirements
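The $ anchor also fixes the 0 to 99 pattern from the quantifier section, where "09" was partly matched as "0". A short sketch, with illustrative variable names:

```python
import re

# Without $, the match may stop before the end of the string
loose = re.match(r"[1-9]?\d", "09")
print(loose.group())     # 0

# With $, the pattern must reach the end of the string, so "09" is rejected
strict_bad = re.match(r"[1-9]?\d$", "09")
strict_ok = re.match(r"[1-9]?\d$", "33")
print(strict_bad)        # None
print(strict_ok.group()) # 33
```

Because re.match already starts at the beginning of the string, adding ^ to these patterns would be redundant; with re.search it would matter.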

Matching grouping

Character    Function
|            Matches either the left or the right expression
(ab)         Treats the characters in parentheses as a group
\num         Refers to the string matched by group number num
(?P<name>)   Gives the group an alias, so the matched substring can be retrieved by that name
(?P=name)    Refers to the string matched by the group with the alias name

Example: |

# Match numbers between 0 and 100
import re
ret = re.match(r"[1-9]?\d$|100","8")
print(ret.group()) # 8
ret = re.match(r"[1-9]?\d$|100","78")
print(ret.group()) # 78
ret = re.match(r"[1-9]?\d$|100","08")
# print(ret.group()) # ret is None: 08 is not a number between 0 and 100
ret = re.match(r"[1-9]?\d$|100","100")
print(ret.group()) # 100

Example: ()

import re
# Requirement: match 163, 126 and qq mailboxes
ret = re.match(r"\w{4,20}@163\.com", "test@163.com")
print(ret.group()) # test@163.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@126.com")
print(ret.group()) # test@126.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@qq.com")
print(ret.group()) # test@qq.com
ret = re.match(r"\w{4,20}@(163|126|qq)\.com", "test@gmail.com")
if ret:
    print(ret.group())
else:
    print("Not a 163, 126 or qq mailbox") # Not a 163, 126 or qq mailbox
# Match 11-digit mobile phone numbers that do not end in 4 or 7
tels = ["13100001234", "18912344321", "10086", "18800007777"]
for tel in tels:
    ret = re.match(r"1\d{9}[0-35-68-9]", tel)
    if ret:
        print(ret.group())
    else:
        print("%s is not a wanted mobile phone number" % tel)
# Extract the area code and the phone number
ret = re.match(r"([^-]*)-(\d+)","010-12345678")
print(ret.group())
print(ret.group(1))
print(ret.group(2))

Example: \number

Matches the contents of the group with the given number. Each pair of parentheses is a group, and groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it is not interpreted as a group reference but as the character with that octal value. Inside the '[' and ']' of a character set, all numeric escapes are treated as characters.
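The (.+) \1 case mentioned above can be verified with a short sketch. The variable names are illustrative:

```python
import re

pattern = r"(.+) \1"   # raw string, so \1 reaches the regex engine as a back-reference

same_words = re.match(pattern, "the the")
same_digits = re.match(pattern, "55 55")
no_space = re.match(pattern, "thethe")

print(same_words.group())    # the the
print(same_words.group(1))   # the
print(same_digits.group())   # 55 55
print(no_space)              # None: there is no space between the two halves
```

Without the r prefix, "\1" in an ordinary string literal would be read by Python as the character with octal value 1, and the back-reference would be lost.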

Example 1: match <html>hh</html>

\1, ..., \9 match the content of the nth group. In this example, \1 refers to the content matched by the first group.

import re
# The right idea: whatever appears in the first < > should, logically, appear again in the closing < >. A back-reference to the data matched by the group does this; note that the pattern must be a raw string, i.e. written as r"...".
ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", "<html>hh</html>")
# In the second label the two < > pairs hold different data, so it does not match
test_label = ["<html>hh</html>","<html>hh</htmlbalabala>"]
for label in test_label:
    ret = re.match(r"<([a-zA-Z]*)>\w*</\1>", label)
    if ret:
        print("%s This is a pair of correct labels" % ret.group())
    else:
        print("%s This is an incorrect label" % label)

result:

<html>hh</html> This is a pair of correct labels
<html>hh</htmlbalabala> This is an incorrect label

Example 2: match <html><h1>www.itcast.cn</h1></html>

import re
labels = ["<html><h1>www.itcast.cn</h1></html>", "<html><h1>www.itcast.cn</h2></html>"]
for label in labels:
    ret = re.match(r"<(\w*)><(\w*)>.*</\2></\1>", label)
    if ret:
        print("%s is a label that meets the requirements" % ret.group())
    else:
        print("%s does not meet the requirements" % label)

result:

<html><h1>www.itcast.cn</h1></html> is a label that meets the requirements
<html><h1>www.itcast.cn</h2></html> does not meet the requirements

Example: (?P<name>) and (?P=name)

The first form names a group; the second reuses that named group later in the same regular expression.

import re
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>","<html><h1>www.itcast.cn</h1></html>")
ret.group()
ret = re.match(r"<(?P<name1>\w*)><(?P<name2>\w*)>.*</(?P=name2)></(?P=name1)>","<html><h1>www.itcast.cn</h2></html>")
# ret.group()  # ret is None here, because </h2> does not close <h1>

re.compile function

The compile function is used to compile regular expressions and generate a regular expression (Pattern) object for use by the match() and search() functions.

prog = re.compile(pattern)
result = prog.match(string)

Equivalent to

result = re.match(pattern, string)
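A small sketch of that equivalence, and of the usual reason to compile: reusing one pattern object for many strings. The names digits and samples are assumptions for illustration:

```python
import re

digits = re.compile(r"\d+")

# Both forms give the same result
via_object = digits.match("123abc").group()
via_module = re.match(r"\d+", "123abc").group()
print(via_object, via_module)   # 123 123

# The compiled object can be reused without repeating the pattern
samples = ["a1", "b22", "c333"]
found = [digits.search(s).group() for s in samples]
print(found)                    # ['1', '22', '333']
```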

Example:

>>> import re
>>> pattern = re.compile(r'\d+')
>>> m = pattern.match('one12twothree34four', 3, 10) # Start matching at the position of '1'; this matches
>>> print(m)                                        # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # 0 can be omitted
'12'
>>> m.start(0)   # 0 can be omitted
3
>>> m.end(0)     # 0 can be omitted
5
>>> m.span(0)    # 0 can be omitted
(3, 5)

When the match above succeeds, a Match object is returned, where:

The group([group1, ...]) method returns the string matched by one or more groups; to get the whole matched substring, use group() or group(0);
The start([group]) method returns the start position, within the whole string, of the substring matched by the group (the index of the substring's first character); the parameter defaults to 0;
The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the substring's last character + 1); the parameter defaults to 0;
The span([group]) method returns (start(group), end(group)).
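These methods also accept a group number other than 0. A hedged sketch with a made-up input string:

```python
import re

m = re.search(r"(\d+)-(\d+)", "tel: 010-12345678")

print(m.group(1, 2))   # ('010', '12345678'): several groups at once, as a tuple
print(m.start())       # 5, where the whole match begins
print(m.end())         # 17, one past the last matched character
print(m.span(1))       # (5, 8), the position of the first group
print(m.span(2))       # (9, 17), the position of the second group
```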

re.search function

re.search scans the entire string and returns the first successful match. If there is no match, it returns None.

The difference between re.match and re.search: re.match only matches at the beginning of the string; if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.

Example:

import re
ret = re.search(r"\d+", "The number of readings is 9999")
print(ret.group())

result:

9999
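To see the contrast with re.match directly, the same pattern can be run through both functions. The sample text is an assumption:

```python
import re

text = "price: 42 yuan"

at_start = re.match(r"\d+", text)    # the string starts with "p", not a digit
anywhere = re.search(r"\d+", text)   # search keeps scanning until it finds digits

print(at_start)           # None
print(anywhere.group())   # 42
```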

re.findall function

Finds all substrings in the string that the regular expression matches and returns them as a list. If no match is found, it returns an empty list. Note: match and search return a single match, while findall returns all matches.

Example:

import re
ret = re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345")
print(ret)

result:

['9999', '7890', '12345']

re.finditer function

Similar to findall: it finds all substrings in the string that the regular expression matches, but returns them as an iterator of match objects.

import re
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

result:

12
32
43
3

re.sub function

sub is short for substitute: it replaces the matched data.

Syntax: re.sub(pattern, repl, string, count=0, flags=0)

Parameter  Description
pattern    Required. The pattern string of the regular expression
repl       Required. The replacement; it can be a string or a function
string     Required. The original string in which the replacements are made
count      Optional. The maximum number of replacements; it must be a non-negative integer. If omitted or set to 0, all matches are replaced
flags      Optional. Flag bits that control how the regular expression is matched, such as case-insensitive or multi-line matching

Example: add 1 to the matched read count

Method 1:

import re
ret = re.sub(r"\d+", '998', "python = 997")
print(ret)

Result: python = 998

Method 2:

import re
def add(temp):
    # temp is a re.Match object; int() needs a string, a bytes-like object or a number, so take the matched text with group() first
    strNum = temp.group()
    num = int(strNum) + 1
    return str(num)
ret = re.sub(r"\d+", add, "python = 997")
print(ret)
ret = re.sub(r"\d+", add, "python = 99")
print(ret)

result:

python = 998
python = 100
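The remaining re.sub parameters, count and flags, and group references inside a string repl, can be sketched as follows. The sample text is invented for illustration:

```python
import re

text = "a1 b22 c333"

replace_all = re.sub(r"\d+", "#", text)             # count omitted: replace every match
replace_one = re.sub(r"\d+", "#", text, count=1)    # replace only the first match
swapped = re.sub(r"(\w)(\d+)", r"\2\1", text)       # \1 and \2 refer to the groups
no_case = re.sub(r"python", "Py", "PYTHON python", flags=re.I)

print(replace_all)   # a# b# c#
print(replace_one)   # a# b22 c333
print(swapped)       # 1a 22b 333c
print(no_case)       # Py Py
```

Passing count and flags by keyword avoids mixing them up, since both are integers.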

re.subn function

The behavior is the same as that of sub(), but returns a tuple (string, number of substitutions).

re.subn(pattern, repl, string[, count])

Return: (sub(repl, string[, count]), number of replacements)

import re
pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
print(re.subn(pattern, r'\2 \1', s))
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
print(re.subn(pattern, func, s))
### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

re.split function

Cut the string according to the match and return a list.

re.split(pattern, string, maxsplit=0, flags=0)

Parameter  Description
pattern    The regular expression to match
string     The string to split
maxsplit   The maximum number of splits; maxsplit=1 splits once. The default is 0, which means no limit

Example:

import re
ret = re.split(r":| ","info:xiaoZhang 33 shandong")
print(ret)

result: ['info', 'xiaoZhang', '33', 'shandong']
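The effect of maxsplit can be seen by splitting the same string with and without it. A minimal sketch:

```python
import re

text = "info:xiaoZhang 33 shandong"

unlimited = re.split(r":| ", text)               # split at every match
split_once = re.split(r":| ", text, maxsplit=1)  # stop after the first split

print(unlimited)    # ['info', 'xiaoZhang', '33', 'shandong']
print(split_once)   # ['info', 'xiaoZhang 33 shandong']
```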

Python greedy and non-greedy matching

Quantifiers in Python are greedy by default (in a few languages they may be non-greedy by default): they always try to match as many characters as possible. Non-greedy matching does the opposite and always tries to match as few characters as possible.

For example, if the regular expression "ab*" is used to search "abbbc", it finds "abbb". With the non-greedy quantifier "ab*?", it finds "a".

Note: we generally use non-greedy patterns for extraction.

Adding ? after "*", "?", "+" or "{m,n}" turns a greedy quantifier into a non-greedy one.
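The "abbbc" case above, plus the other quantifiers with a trailing ?, can be checked with a short sketch. The variable names are illustrative:

```python
import re

s = "abbbc"

greedy_star = re.search(r"ab*", s).group()       # as many b as possible
lazy_star = re.search(r"ab*?", s).group()        # as few b as possible, i.e. none
lazy_plus = re.search(r"ab+?", s).group()        # at least one b is still required
lazy_range = re.search(r"ab{2,3}?", s).group()   # the lower bound of 2 is enough
lazy_optional = re.search(r"ab??", s).group()    # prefers zero occurrences

print(greedy_star)     # abbb
print(lazy_star)       # a
print(lazy_plus)       # ab
print(lazy_range)      # abb
print(lazy_optional)   # a
```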

Example 1:

import re
s="This is a number 234-235-22-423"
# Quantifiers in a regular expression are greedy: evaluated from left to right, they try to
# grab the longest string that still lets the whole pattern match. In the example above, ".+"
# grabs as many characters as it can from the start of the string, including most of the first
# integer field that we wanted. "\d+" needs only one digit to match, so it matches the digit "4",
# while ".+" matches everything from the start of the string up to that 4.
r=re.match(r".+(\d+-\d+-\d+-\d+)",s)
print(r.group(1))
# The solution is the non-greedy operator "?". It can follow "*", "+" or "?" and asks the regular expression to match as few characters as possible.
r=re.match(r".+?(\d+-\d+-\d+-\d+)",s)
print(r.group(1))

result:

4-235-22-423
234-235-22-423

Example 2:

>>> re.match(r"aa(\d+)","aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)","aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd","aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd","aa2343ddd").group(1)
'2343'

Example 3: extract picture address

import re
test_str="<img data-original=https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg>"
ret = re.search(r"https://.*?\.jpg", test_str)
print(ret.group())

result: https://rpic.douyucdn.cn/appCovers/2016/11/13/1213973.jpg

The role of r

Like most programming languages, regular expressions use "\" as the escape character, which can lead to a pile-up of backslashes. To match a literal "\" in the text, a regular expression written as an ordinary string literal needs four backslashes, "\\\\": the programming language first turns each pair into a single backslash, leaving two backslashes, and the regular expression engine then turns those two into one literal backslash. Python's raw strings solve this problem nicely: prefixing a string with r makes it a raw string, so the same pattern can be written as r"\\".

import re
mm = "c:\\a\\b\\c"
print(mm)#c:\a\b\c
ret = re.match("c:\\\\",mm).group()
print(ret)#c:\

ret = re.match("c:\\\\a",mm).group()
print(ret)#c:\a
ret = re.match(r"c:\\a",mm).group()
print(ret)#c:\a
ret = re.match(r"c:\a",mm).group()
print(ret)#AttributeError: 'NoneType' object has no attribute 'group'


Added by gilreilly on Tue, 08 Mar 2022 07:29:00 +0200