Pattern matching and regular expressions in Python

1. Brief introduction         

        A regular expression, or regex for short, is a way of describing a text pattern. For example, \d is a regular expression that stands for a single digit character, that is, any one numeral from 0 to 9.

        All regular expression functions in Python are in the re module. Enter the following code in the interactive environment to import the module:

import re

2. Find text patterns with regular expressions

2.1 creating regular expressions and matching Regex objects

        Passing a string that represents a regular expression to re.compile() returns a Regex pattern object. The Regex object's search() method scans the string passed to it for the first place where the regular expression matches. If the pattern is not found in the string, search() returns None. If it is found, search() returns a Match object, whose group() method returns the text that actually matched in the searched string. Example:

import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-596-4565')
print(f"Phone number found: {mo.group()}")

         Output results:

Phone number found: 415-596-4565

2.2 summary of key points of regular expression matching

        There are several steps to using regular expressions in Python:

        1. Import the regular expression module with import re.

        2. Create a Regex object with the re.compile() function (remember to use a raw string, so that backslashes such as the one in \d are not treated as Python escape sequences).

        3. Pass the string you want to search into the Regex object's search() method. It returns a Match object, or None if nothing matches.

        4. Call the Match object's group() method to get the string of the actual matched text.
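
        The four steps above can be sketched as one small helper. The function name find_phone_number and the sample strings are hypothetical, chosen only for illustration. The check for None is there because search() returns None when nothing matches, and calling group() on None would raise an AttributeError.

```python
import re  # Step 1: import the module

# Step 2: compile the pattern once, using a raw string.
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none.

    find_phone_number is a hypothetical helper, not part of the re module.
    """
    mo = phoneNumRegex.search(text)   # Step 3: search the string
    if mo is None:                    # search() found nothing
        return None
    return mo.group()                 # Step 4: the text that matched

print(find_phone_number('Call 415-596-4565 today'))  # 415-596-4565
print(find_phone_number('No number here'))           # None
```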

3. Match more patterns with regular expressions

3.1 grouping with parentheses

        Suppose you want to separate the area code from the rest of the telephone number. Adding parentheses creates "groups" in the regular expression: (\d\d\d)-(\d\d\d-\d\d\d\d). You can then use the Match object's group() method to get the matching text from just one group.

        The first pair of parentheses in the regular expression string is group 1. The second pair is group 2. Passing 0 or no argument to group() returns the entire matched text. If you want to get all the groups at once, use the groups() method, which returns them as a tuple.

Example:

import re

phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-596-4565')
print(mo.group(0))
print(mo.group(1))
print(mo.group(2))
print(mo.groups())

areaCode, mainNumber = mo.groups()
print(f"{areaCode}-{mainNumber}")

Output results:

415-596-4565
415
596-4565
('415', '596-4565')
415-596-4565
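
        The same grouping idea carries over to other patterns. As a minimal sketch, the date pattern and sample string below are made up for illustration; the example also shows that group() accepts several group numbers at once and then returns a tuple.

```python
import re

# Hypothetical example: three groups for year, month and day.
dateRegex = re.compile(r'(\d\d\d\d)-(\d\d)-(\d\d)')
mo = dateRegex.search('Posted on 2021-11-21')

year, month, day = mo.groups()   # unpack all three groups at once
print(mo.group(0))               # 2021-11-21 (the whole match)
print(mo.group(2))               # 11 (the second group only)
print(mo.group(1, 3))            # ('2021', '21')
```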

3.2 matching multiple groups with pipes

        The | character is called a "pipe". You can use it anywhere you want to match one of several expressions. For example, the regular expression r'Batman|Tina Fey' will match either 'Batman' or 'Tina Fey'.

        If Batman and Tina Fey both appear in the searched string, whichever occurs first in the string is returned as the Match object.

import re

heroRegex = re.compile(r'Batman|Tina Fey')
mo1 = heroRegex.search('Batman and Tina Fey')
mo2 = heroRegex.search('Tina Fey and Batman')
print(mo1.group())
print(mo2.group())

Output results:

Batman
Tina Fey

---------------------------------------------------------------------------------------------------------------------------------

Suppose you want to match any of 'Batman', 'Batmobile', 'Batcopter' and 'Batbat'. Because all of these strings start with Bat, it is convenient to specify that prefix only once and put the alternatives inside parentheses.

import re

batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
print(mo.group(1))

Output results:

Batmobile
mobile
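
        One detail worth checking is what "first" means for a pipe. With Python's re module, the match that starts earliest in the string wins, and when several alternatives could match at that same position they are tried from left to right. The patterns below are made up to illustrate this.

```python
import re

# The earliest position in the string wins, whatever the order in the pattern.
mo1 = re.compile(r'Tina Fey|Batman').search('Batman and Tina Fey')
print(mo1.group())   # Batman

# At the same position, alternatives are tried left to right,
# so 'Bat' is accepted before 'Batman' is ever considered.
mo2 = re.compile(r'Bat|Batman').search('Batman')
print(mo2.group())   # Bat

# Listing the longer alternative first gives the longer match.
mo3 = re.compile(r'Batman|Bat').search('Batman')
print(mo3.group())   # Batman
```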

3.3 optional matching with question marks

        Sometimes part of the pattern you want to match is optional. That is, the regular expression should find a match whether or not that piece of text is present. The ? character marks the group that precedes it as an optional part of the pattern.

        Example 1:

import re

batRegex = re.compile(r'Bat(wo)?man')
mo1 = batRegex.search('The Adventures of Batman')
mo2 = batRegex.search('The Adventures of Batwoman')
print(mo1.group())
print(mo2.group())

Output results:

Batman
Batwoman

        Returning to the earlier phone number example, you can make the regular expression find phone numbers with or without an area code:

import re

phoneRegex = re.compile(r'(\d\d\d-)?(\d\d\d-\d\d\d\d)')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')
print(mo1.group())
print(mo2.group())

Output results:

415-555-4242
555-4242
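
        When the optional group is absent, the match still succeeds, but that group has no text. The sketch below, reusing the same pattern, shows that groups() then reports None for it, so it can be worth checking before using the value. The fallback to an empty string is just one possible choice.

```python
import re

phoneRegex = re.compile(r'(\d\d\d-)?(\d\d\d-\d\d\d\d)')
mo1 = phoneRegex.search('My number is 415-555-4242')
mo2 = phoneRegex.search('My number is 555-4242')

print(mo1.groups())   # ('415-', '555-4242')
print(mo2.groups())   # (None, '555-4242')

# Fall back to an empty string when the area code group did not match.
areaCode = mo2.group(1) or ''
print(repr(areaCode))  # ''
```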

3.4 match zero or more times with * sign

The * (asterisk) means "match zero or more times", that is, the group before the asterisk can appear any number of times in the text. It can be completely absent or repeated again and again.

import re

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search("The Adventures of Batman")
mo2 = batRegex.search("The Adventures of Batwoman")
mo3 = batRegex.search("The Adventures of Batwowowoman")
print(mo1.group())
print(mo2.group())
print(mo3.group())

Output results:

Batman
Batwoman
Batwowowoman
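
        It can also help to look at what group(1) holds when a group is repeated. In this sketch, which reuses the pattern above, a group that matched zero times yields None, and a group that matched several times keeps only its last repetition.

```python
import re

batRegex = re.compile(r'Bat(wo)*man')
mo1 = batRegex.search('The Adventures of Batman')
mo3 = batRegex.search('The Adventures of Batwowowoman')

print(mo1.group(1))   # None: the group matched zero times
print(mo3.group(1))   # wo: only the last repetition is kept
print(mo3.group())    # Batwowowoman: the whole match is still complete
```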

3.5 match one or more times with a plus sign

        The * (asterisk) means "match zero or more times", while the + (plus sign) means "match one or more times". The asterisk does not require its group to appear in the matched string, but the plus sign does: the group before a plus sign must appear at least once, so it is not optional.

import re

batRegex = re.compile(r'Bat(wo)+man')
mo1 = batRegex.search('The Adventures of Batwoman')
mo2 = batRegex.search('The Adventures of Batwowowoman')
print(mo1.group())
print(mo2.group())

Output results:

Batwoman
Batwowowoman

The regular expression Bat(wo)+man will not match the string 'The Adventures of Batman', because the plus sign requires wo to appear at least once.
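
A short check of that difference, comparing the plus and asterisk versions on the same string (the variable names are chosen here only for illustration):

```python
import re

plusRegex = re.compile(r'Bat(wo)+man')
starRegex = re.compile(r'Bat(wo)*man')
text = 'The Adventures of Batman'

moPlus = plusRegex.search(text)
moStar = starRegex.search(text)
print(moPlus is None)    # True: + needs at least one 'wo'
print(moStar.group())    # Batman: * accepts zero repetitions
```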

3.6 matching specific repetitions with curly braces

        If you want a group to repeat a specific number of times, follow the group in the regular expression with that number in curly braces. For example, the regular expression (Ha){3} will match the string 'HaHaHa', but it will not match 'HaHa', which has only two repetitions.

        Instead of a single number, you can also specify a range by writing a minimum value, a comma and a maximum value inside the curly braces. For example, the regular expression (Ha){3,5} will match the strings 'HaHaHa', 'HaHaHaHa' and 'HaHaHaHaHa'.

        You can also leave out the first or the second number in the curly braces to leave the minimum or maximum unbounded. For example, the regular expression (Ha){3,} will match 3 or more instances, and the regular expression (Ha){,5} will match 0 to 5 instances.
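
        The repetition counts can be checked with a short sketch. It uses the Regex object's fullmatch() method, which is not covered elsewhere in this article: it succeeds only when the whole string fits the pattern, which makes it easy to see exactly which counts are accepted. The variable names are made up for illustration.

```python
import re

exactRegex = re.compile(r'(Ha){3}')
print(exactRegex.search('HaHaHa').group())   # HaHaHa
print(exactRegex.search('HaHa'))             # None: only two repetitions

# Try 2 to 6 repetitions against the range {3,5}.
rangeRegex = re.compile(r'(Ha){3,5}')
results = {}
for count in range(2, 7):
    results[count] = rangeRegex.fullmatch('Ha' * count) is not None
print(results)   # {2: False, 3: True, 4: True, 5: True, 6: False}

# Open-ended ranges: no maximum, or no minimum (which means zero).
atLeastRegex = re.compile(r'(Ha){3,}')
atMostRegex = re.compile(r'(Ha){,5}')
print(atLeastRegex.fullmatch('Ha' * 10) is not None)  # True
print(atMostRegex.fullmatch('') is not None)          # True
print(atMostRegex.fullmatch('Ha' * 6) is not None)    # False
```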

3.7 greedy and non-greedy matching

        Because (Ha){3,5} can match 3, 4 or 5 instances of Ha in the string 'HaHaHaHaHa', you may wonder why calling group() on the Match object returns 'HaHaHaHaHa' instead of one of the shorter possibilities. After all, 'HaHaHa' and 'HaHaHaHa' are also valid matches for the regular expression (Ha){3,5}.

        Python's regular expressions are "greedy" by default, which means that in ambiguous situations they match the longest string possible. The "non-greedy" version of the curly braces, which matches the shortest string possible, is written by putting a question mark after the closing curly brace.

import re

greedyHaRegex = re.compile(r'(Ha){3,5}')
nongreedyHaRegex = re.compile(r'(Ha){3,5}?')
mo1 = greedyHaRegex.search("HaHaHaHaHa")
mo2 = nongreedyHaRegex.search("HaHaHaHaHa")
print(mo1.group())
print(mo2.group())

Output results:

HaHaHaHaHa
HaHaHa

        Note that the question mark has two meanings in regular expressions: declaring a non-greedy match or marking an optional group. The two meanings are completely unrelated.
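
        The non-greedy question mark is not limited to curly braces; in Python's re module it can follow +, * and ? as well. The sketch below uses made-up strings to show this, and its second half puts both meanings of the question mark side by side.

```python
import re

greedyRegex = re.compile(r'\d+')
nongreedyRegex = re.compile(r'\d+?')
mo1 = greedyRegex.search('Order 12345')
mo2 = nongreedyRegex.search('Order 12345')
print(mo1.group())   # 12345: as many digits as possible
print(mo2.group())   # 1: as few digits as possible, but at least one

# One ? makes (wo) optional; a second ? makes that optional match
# non-greedy, so zero repetitions are preferred when that still matches.
mo3 = re.compile(r'Bat(wo)?').search('Batwoman')
mo4 = re.compile(r'Bat(wo)??').search('Batwoman')
print(mo3.group())   # Batwo
print(mo4.group())   # Bat
```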

4. findall() method

        The search() method returns a single Match object for the first match in the searched string, while the findall() method returns a list containing every match in the searched string.

        In addition, findall() does not return Match objects. As long as there are no groups in the regular expression, it returns a list of strings. If the regular expression contains two or more groups, findall() returns a list of tuples, one tuple per match with one string per group. With exactly one group, it returns a list of that group's strings.

import re

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo1 = phoneRegex.search('My number is 415-555-4242 and 212-425-3666')
mo2 = phoneRegex.findall('My number is 415-555-4242 and 212-425-3666')
phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
mo3 = phoneRegex.findall('My number is 415-555-4242 and 212-425-3666')
print(mo1.group())
print(mo2)
print(mo3)

Output results:

415-555-4242
['415-555-4242', '212-425-3666']
[('415', '555', '4242'), ('212', '425', '3666')]
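
        Two edge cases of findall() are worth a quick check, sketched below with a made-up variation of the pattern: a regular expression with exactly one group gives a list of that group's strings rather than tuples, and a string with no matches gives an empty list rather than None.

```python
import re

text = 'My number is 415-555-4242 and 212-425-3666'

# Exactly one group: each list item is that group's text.
oneGroupRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
areaCodes = oneGroupRegex.findall(text)
print(areaCodes)   # ['415', '212']

# No matches: an empty list, so it is safe to loop over the result.
nothing = oneGroupRegex.findall('no numbers here')
print(nothing)     # []
```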

Keywords: Python regex

Added by neox_blueline on Sun, 21 Nov 2021 01:45:09 +0200