Python regular expressions

Special symbols and characters

0x01 Match one of several patterns: |

The pipe symbol | selects one of several alternative patterns, for example the pattern at|home.

>>> import re
>>> m = re.match("at|home", "home")
>>> m.group()
'home'
>>> m = re.match("at|home", "at qaq")
>>> m.group()
'at'
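
The two calls above succeed because an alternative fits at the very start of the string. As a small sketch (the sample string "go home" is an illustrative assumption, not from the original notes), the following shows what happens when no alternative fits at the start, and how re.search differs:

```python
import re

pattern = re.compile(r"at|home")

# re.match only tries the pattern at the start of the string.
results = {}
for text in ["home", "at qaq", "go home"]:
    m = pattern.match(text)
    # match() returns None on failure, so test the result before calling group().
    results[text] = m.group() if m else None
print(results)

# re.search scans forward and reports the first position where an alternative fits.
found = pattern.search("go home")
print(found.group())
```

Calling .group() directly on a failed match raises AttributeError, which is why the sketch checks for None first.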

0x02 Match any single character: .

The period (dot) matches any single character except the newline character \n. (Python regular expressions have a compilation flag, S or DOTALL, that lifts this restriction so that the dot also matches newline.) Letters, digits, whitespace (other than \n), printable characters, non-printable characters, and symbols are all matched by a dot.

>>> m = re.match(".", "a")
>>> m.group()
'a'
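
To make the DOTALL remark concrete, here is a minimal sketch using the standard re.DOTALL flag and its short alias re.S; the test strings are illustrative:

```python
import re

# By default the dot refuses to match a newline.
default = re.match(r".", "\n")
print(default)

# With re.DOTALL (alias re.S) the dot matches a newline as well.
dotall = re.match(r".", "\n", re.DOTALL)
spanning = re.match(r"a.b", "a\nb", re.S)
print(repr(dotall.group()))
print(repr(spanning.group()))
```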

Below is a summary of the common line-break and whitespace characters. In testing, the dot matches all of them except \n.

  • Line breaks: \v (vertical tab), \n (newline), \f (form feed)
>>> print("123\v456")
123
456
>>> print("123\n456")
123
456
>>> print("123\f456")
123
456
  • \t is the tab character, the same as pressing the Tab key; the terminal advances to the next tab stop rather than printing a fixed four spaces
>>> print("123\t456")
123     456
  • \r (carriage return) moves the cursor back to the beginning of the line
    This can be used to implement a countdown on the command line
import time
for i in range(10):
    # \r rewinds to the start of the line so each message overwrites the last
    print("\rThe program will exit in %s seconds" % (9 - i), end="", flush=True)
    time.sleep(1)
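
The claim that the dot matches every one of these characters except \n can be checked with a short loop. This is only a sketch; the candidates list is an illustrative selection:

```python
import re

# Whitespace and control characters discussed above, plus a plain space.
candidates = ["\v", "\f", "\t", "\r", " ", "\n"]

# Map each character (shown via repr) to whether a single dot matches it.
results = {repr(ch): re.match(r".", ch) is not None for ch in candidates}
for name, matched in results.items():
    print(name, matched)
```

Only the entry for '\n' comes out False; every other character is matched by the dot.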

0x03 Match at the start, the end, and word boundaries

  • Match the start position: caret ^, or the special character \A
  • Match the end position: dollar sign $, or the special character \Z

The \A and \Z forms are mainly useful on keyboards that have no caret key, such as some international keyboards.

Match a string that starts with "from":

>>> m = re.match("^from.*", "from home")
>>> m.group()
'from home'

Match a string that ends with "end":

>>> m = re.match(".*end$", "123end")
>>> m.group()
'123end'
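
Apart from keyboard convenience, ^ and $ are not always interchangeable with \A and \Z in Python. The sketch below, using made-up sample strings, shows two cases where they differ: the re.MULTILINE flag, and a trailing newline.

```python
import re

text = "first line\nfrom home"

# In MULTILINE mode ^ also matches after each newline, while \A still
# matches only at the very start of the string.
caret_hits = re.findall(r"^from", text, re.MULTILINE)
slash_a_hits = re.findall(r"\Afrom", text, re.MULTILINE)
print(caret_hits, slash_a_hits)

# $ also matches just before a newline at the end of the string;
# \Z matches only at the true end.
dollar = re.search(r"end$", "123end\n")
slash_z = re.search(r"end\Z", "123end\n")
print(dollar is not None, slash_z is not None)
```
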
  • \b matches a word boundary
  • \B matches a position that is not a word boundary

Any word that starts with "the":

\bthe

Only the standalone word "the":

\bthe\b

Any word that contains "the" somewhere other than at its start:

\Bthe
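
The three boundary patterns can be compared side by side. This is a minimal sketch; the word list is an illustrative assumption chosen so that each pattern selects a different subset:

```python
import re

words = ["the", "then", "other", "bathe"]

def matching(pattern):
    # Keep the words in which the pattern is found anywhere.
    return [w for w in words if re.search(pattern, w)]

starts_with_the = matching(r"\bthe")    # "the" right at a word boundary
only_the = matching(r"\bthe\b")         # "the" as a whole word
inside_the = matching(r"\Bthe")         # "the" not at the start of the word

print(starts_with_the)
print(only_the)
print(inside_the)
```

Raw strings (the r prefix) matter here: in an ordinary Python string literal, "\b" is the backspace character rather than a word-boundary assertion.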

0x04 Character ranges and negation: [] and [^]

Two characters inside square brackets joined by a hyphen specify a range of characters, such as A-Z, a-z, or 0-9. If a caret immediately follows the left square bracket, the set is negated: it matches any character that is not listed.

Match the letter z, followed by any character, followed by a digit:

>>> re.match("z.[0-9]","z=3").group()
'z=3'
>>> re.match("[a-b][deh-j][y-z]","ahy").group()
'ahy'

Match a run of non-vowel characters:

>>> re.match("[^aeiou]*","pygb").group()
'pygb'
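
A few more cases help show where these character sets stop matching. The inputs below are illustrative assumptions, not from the original notes:

```python
import re

three_sets = re.compile(r"[a-b][deh-j][y-z]")

# Each bracket matches exactly one character from its set.
accepted = three_sets.match("ahy") is not None
rejected = three_sets.match("cdy") is not None   # "c" is outside [a-b]
print(accepted, rejected)

# The negated set stops at the first vowel it meets.
leading = re.match(r"[^aeiou]*", "python").group()
print(leading)

# Because * allows zero repetitions, the match still succeeds (with an
# empty string) when the text starts with a vowel.
empty = re.match(r"[^aeiou]*", "apple").group()
print(repr(empty))
```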

0x04 Use closure operators to match existence and repetition

  • * matches the preceding expression zero or more times.
  • + matches the preceding expression one or more times.
  • ? matches the preceding expression zero or one time.
  • {N} matches the preceding expression exactly N times; {M,N} matches it between M and N times.

Match 15 or 16 digits:

[0-9]{15,16}
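
The sketch below exercises the repetition operators. The helper name is_15_or_16_digits is hypothetical, introduced only for illustration; it uses re.fullmatch because re.match alone would accept a longer digit string by matching just its first 16 digits.

```python
import re

digits = re.compile(r"[0-9]{15,16}")

def is_15_or_16_digits(s):
    # fullmatch requires the whole string to fit the pattern.
    return digits.fullmatch(s) is not None

lengths_ok = {n: is_15_or_16_digits("1" * n) for n in (14, 15, 16, 17)}
print(lengths_ok)

# match() is anchored only at the start, so it takes the first 16 digits.
prefix_len = len(digits.match("1" * 17).group())
print(prefix_len)

# *, + and ? on a single preceding character.
star = re.fullmatch(r"ab*", "a") is not None       # zero b's is fine
plus = re.fullmatch(r"ab+", "a") is not None       # at least one b required
optional = [w for w in ("color", "colour") if re.fullmatch(r"colou?r", w)]
print(star, plus, optional)
```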

0x05 Special characters that represent character sets

  • \d matches any decimal digit; for ASCII text this is [0-9]
  • \w matches any alphanumeric character or underscore; for ASCII text this is [A-Za-z0-9_]
  • \s matches any whitespace character
  • The upper-case versions match the opposite. For example, \D matches any character that is not a decimal digit

Match the format of a US phone number, for example 800-555-1212:

\d{3}-\d{3}-\d{4}

Match a QQ mailbox address (the dot is escaped so that it matches only a literal period):

\d{5,10}@qq\.com
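
Two details are easy to miss with these shorthand classes, and the following sketch illustrates both with made-up inputs. First, for str patterns in Python 3, \d also matches non-ASCII decimal digits unless the re.ASCII flag is given. Second, an unescaped dot in a pattern such as qq.com matches any character, not just a period.

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")
phone_ok = phone.fullmatch("800-555-1212") is not None
phone_bad = phone.fullmatch("800-5555-121") is not None
print(phone_ok, phone_bad)

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit outside ASCII.
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
print(unicode_digit, ascii_digit)

# Unescaped dot versus escaped dot in the mailbox pattern.
loose = re.fullmatch(r"\d{5,10}@qq.com", "12345@qqXcom") is not None
strict = re.fullmatch(r"\d{5,10}@qq\.com", "12345@qqXcom") is not None
print(loose, strict)
```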

0x06 Use parentheses to specify groups

Match an optional title followed by a name:

>>> m = re.match(r"(Mr?s?\.)?([A-Za-z]*[A-Za-z-]+)", "Mr.chen")
>>> m.group(0)  # the whole match
'Mr.chen'
>>> m.group(1)  # the optional title
'Mr.'
>>> m.group(2)  # the name
'chen'
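
All groups can also be fetched at once with groups(). The sketch below reuses the same pattern on a few illustrative names (the inputs are assumptions) and shows that an optional group that did not take part in the match is reported as None:

```python
import re

name = re.compile(r"(Mr?s?\.)?([A-Za-z]*[A-Za-z-]+)")

# groups() returns every captured group as a tuple.
with_title = name.match("Mr.chen").groups()
hyphenated = name.match("Ms.Lee-Smith").groups()

# Without a title, group 1 did not participate and is None.
no_title = name.match("chen").groups()

print(with_title)
print(hyphenated)
print(no_title)
```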

0x07 Extended notation

I did not understand much of this part.

Reference: "Learning notes of Python core programming (I): regular expression extended notation"

(?:\w+\.)* matches strings that end with a period, such as "google.", "twitter.", or "facebook."; however, these matches are not saved for later use or retrieval

(?#comment) matches nothing; it serves only as a comment

(?=.com) matches only if the string is followed by ".com"; it does not consume any of the target string

(?!.net) matches only if the string is not followed by ".net"

(?<=800-) matches only if the string is preceded by "800-", presumably a telephone number; again, it does not consume any of the input string

(?<!192\.168\.) matches only if the string is not preceded by "192.168.", presumably to filter out a group of class C IP addresses

(?(1)y|x) matches y if group 1 has matched; otherwise it matches x
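
A short sketch of the non-capturing group, the inline comment, and the conditional form may help. The sample strings and the parenthesised-number pattern are illustrative assumptions, not taken from the referenced notes:

```python
import re

# (?:...) groups for repetition without creating a numbered group,
# so the only captured group is the final (\w+).
m = re.match(r"(?:\w+\.)*(\w+)", "www.google.com")
whole, groups = m.group(0), m.groups()
print(whole, groups)

# (?#...) is ignored by the matcher.
commented = re.match(r"\d+(?#one or more digits)", "123").group()
print(commented)

# (?(1)\)) requires a closing parenthesis only if group 1, the opening
# parenthesis, took part in the match.
paren = re.compile(r"(\()?\d+(?(1)\))")
balanced = [s for s in ["123", "(123)", "(123", "123)"] if paren.fullmatch(s)]
print(balanced)
```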

In summary, there are four assertions:

Positive lookahead (?=...): the position must be followed by the given pattern

Negative lookahead (?!...): the position must not be followed by the given pattern

Positive lookbehind (?<=...): the position must be preceded by the given pattern

Negative lookbehind (?<!...): the position must not be preceded by the given pattern

Lookahead and lookbehind simply mean checking the text ahead of and behind the current position.
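
The four assertions can be tried out as follows. This is a sketch under stated assumptions: the domain names, phone numbers, and IP addresses are made-up sample data, and the dots are escaped so that they match literal periods.

```python
import re

sites = "google.com twitter.net facebook.com"

# Positive lookahead: a name followed by ".com"; the ".com" itself is
# checked but not included in the result.
dot_com = re.findall(r"\w+(?=\.com)", sites)

# Negative lookahead: a dotted name whose suffix is not "net".
not_net = re.findall(r"\w+\.(?!net)\w+", sites)

# Positive lookbehind: a local number preceded by "800-".
toll_free = re.findall(r"(?<=800-)\d{3}-\d{4}", "800-555-1212 900-555-3434")

# Negative lookbehind: the last two octets, unless preceded by "192.168.".
hosts = ["192.168.1.5", "10.0.1.5"]
outside = [h for h in hosts if re.search(r"(?<!192\.168\.)\d+\.\d+$", h)]

print(dot_com)
print(not_net)
print(toll_free)
print(outside)
```

Note that Python's re module requires lookbehind patterns to have a fixed width, which both lookbehind examples here satisfy.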

Added by cLFlaVA on Wed, 09 Feb 2022 22:32:23 +0200