Dynamic regular expressions in MATLAB

Contents

Dynamic regular expressions

Introduction

Dynamic match expressions - (??expr)

Commands that modify the match expression - (??@cmd)

Commands that serve a functional purpose - (?@cmd)

Commands in replacement expressions - ${cmd}

Dynamic regular expressions

Introduction

In a dynamic expression, you can make the pattern that regexp matches depend on the content of the input text. This lets you match varying input patterns in the parsed text more closely. You can also use dynamic expressions in the replacement terms passed to the regexprep function, which lets you adapt the replacement text to the parsed input.

You can include any number of dynamic expressions in the match_expr or replace_expr arguments of the following commands:

regexp(text, match_expr)
regexpi(text, match_expr)
regexprep(text, match_expr, replace_expr)

As an example of why a dynamic expression is useful, the following regexprep command correctly replaces the term internationalization with its abbreviation i18n. However, to use the command with another term, such as globalization, you must use a different replacement expression:

match_expr = '(^\w)(\w*)(\w$)';

replace_expr1 = '$118$3';
regexprep('internationalization', match_expr, replace_expr1)
ans =

    'i18n'
replace_expr2 = '$111$3';
regexprep('globalization', match_expr, replace_expr2)
ans =

    'g11n'

With the dynamic expression ${num2str(length($2))}, you can base the replacement expression on the input text so that you do not have to change the expression each time. This example uses the dynamic replacement syntax ${cmd}.

match_expr = '(^\w)(\w*)(\w$)';
replace_expr = '$1${num2str(length($2))}$3';

regexprep('internationalization', match_expr, replace_expr)
ans =

    'i18n'
regexprep('globalization', match_expr, replace_expr)
ans =

    'g11n'
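Because regexprep also accepts a cell array of character vectors, the same dynamic replacement can abbreviate several terms in one call. The following is a minimal sketch; the variable names terms and abbrevs and the extra input words are illustrative assumptions, not part of the original example.

```matlab
% Abbreviate several terms at once with a single dynamic replacement.
match_expr   = '(^\w)(\w*)(\w$)';
replace_expr = '$1${num2str(length($2))}$3';

% Illustrative inputs (assumed for this sketch).
terms = {'internationalization', 'localization', 'accessibility'};

% regexprep applies both expressions to every element of the cell array.
abbrevs = regexprep(terms, match_expr, replace_expr);
disp(abbrevs)
```

The replacement expression never changes; only the length of the second token does.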

Once parsed, a dynamic expression must correspond to a complete, valid regular expression. In addition, dynamic match expressions that use the backslash escape character (\) require two backslashes: one for the initial parsing of the expression and one for the complete match. The parentheses that enclose dynamic expressions do not create capture groups.

There are three forms of dynamic expressions that you can use in match expressions and one form that you can use in replacement expressions. They are described below.

Dynamic match expressions - (??expr)

The (??expr) operator parses the expression expr and inserts the result back into the match expression. MATLAB® then evaluates the modified match expression.

The following is an example of the type of expression that uses this operator:

chr = {'5XXXXX', '8XXXXXXXX', '1X'};
regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once');

The purpose of this particular command is to locate a series of consecutive X characters in each character vector stored in the input cell array. Note, however, that the number of X characters varies from one character vector to the next. If the count did not vary, you could use the expression X{n} to match n of these characters. But a constant value of n does not work in this case.

The solution used here is to capture the leading count in a token (for example, the 5 in the first character vector of the cell array), and then use that count in a dynamic expression. The dynamic expression in this example is (??X{$1}), where $1 is the value captured by the token \d+. The operator {$1} turns that token value into a quantifier. Because the expression is dynamic, the same pattern works on all three input vectors in the cell array. For the first input character vector, regexp looks for five X characters; for the second, it looks for eight; and for the third, it looks for just one:

regexp(chr, '^(\d+)(??X{$1})$', 'match', 'once')
ans =

  1×3 cell array

    {'5XXXXX'}    {'8XXXXXXXX'}    {'1X'}
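The same technique can parse length-prefixed fields, where a leading digit states how many characters follow. This is a minimal sketch under that assumption; the input text and the variable names txt, fields, and noMatch are hypothetical.

```matlab
% Each field is one digit followed by that many characters.
txt = '3abc2de';

% (\d) captures the count; (??.{$1}) expands to .{3} and later to .{2}.
fields = regexp(txt, '(\d)(??.{$1})', 'match');

% A count that disagrees with the text yields no match at all.
noMatch = regexp('5XXX', '^(\d+)(??X{$1})$', 'match', 'once');

disp(fields)
disp(isempty(noMatch))
```

The second call shows that the count is enforced: five X characters are required, so an input with only three fails.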

Commands that modify the match expression - (??@cmd)

MATLAB uses the (??@cmd) operator to include the results of a MATLAB command in the match expression. The command must return a term that can be used within the match expression.

For example, use the dynamic expression (??@fliplr($1)) to locate the palindrome "Never Odd or Even" embedded in a larger character vector.

First, create the input string. Make sure that all letters are lowercase, and remove all nonword characters.

chr = lower(...
  'Find the palindrome Never Odd or Even in this string');

chr = regexprep(chr, '\W*', '')
chr =

    'findthepalindromeneveroddoreveninthisstring'

Locate the palindrome within the character vector using the following dynamic expression:

palindrome = regexp(chr, '(.{3,}).?(??@fliplr($1))', 'match')
palindrome =

  1×1 cell array

    {'neveroddoreven'}

The dynamic expression reverses the order of the letters captured in the token, and then attempts to match as much of the reversed character vector as possible. This requires a dynamic expression because the value of $1 depends on the value of the token (.{3,}).

Dynamic expressions in MATLAB have access to the currently active workspace. This means that you can change any function or variable used in a dynamic expression simply by changing variables in the workspace. Repeat the last command of the example above, but this time define the function to be called within the expression using a function handle stored in the base workspace:

fun = @fliplr;

palindrome = regexp(chr, '(.{3,}).?(??@fun($1))', 'match')
palindrome =

  1×1 cell array

    {'neveroddoreven'}
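Any command that returns text usable as a pattern can take the place of fliplr. As a minimal sketch, the following uses the built-in upper function to find a lowercase word immediately followed by its uppercase form; the input text and the variable names txt and pairs are hypothetical.

```matlab
txt = 'abc ABC abd';

% ([a-z]+) captures a lowercase word; (??@upper($1)) inserts the
% uppercase form of that word into the pattern as text to match.
pairs = regexp(txt, '([a-z]+) (??@upper($1))', 'match');
disp(pairs)
```

Note that the returned text is parsed as a regular expression, so this sketch is only safe for tokens without special characters.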

Commands that serve a functional purpose - (?@cmd)

The (?@cmd) operator specifies a MATLAB command that regexp or regexprep runs while parsing the overall match expression. Unlike the other dynamic expressions in MATLAB, this operator does not alter the contents of the expression in which it is used. Instead, you can use it to make MATLAB report the steps it takes as it parses one of your regular expressions. This can be useful for diagnosing regular expressions.

The following example parses a word consisting of zero or more characters, followed by two identical characters, followed again by zero or more characters:

regexp('mississippi', '\w*(\w)\1\w*', 'match')
ans =

  1×1 cell array

    {'mississippi'}

To track the exact steps that MATLAB takes in determining the match, this example inserts a short script, (?@disp($1)), into the expression to display the characters that finally make up the match. Because the example uses greedy quantifiers, MATLAB tries to match as much of the character vector as possible. So even though MATLAB finds a match toward the beginning of the string, it continues to look for more matches until it reaches the very end of the string. From there, it backs up through the letter i, then p, then the next p, and stops there because the match is finally satisfied:

regexp('mississippi', '\w*(\w)(?@disp($1))\1\w*', 'match')
i
p
p

ans =

  1×1 cell array

    {'mississippi'}

Now try the same example again, this time making the first quantifier lazy (*?). Again, MATLAB produces the same match:

regexp('mississippi', '\w*?(\w)\1\w*', 'match')
ans =

  1×1 cell array

    {'mississippi'}

However, by inserting a dynamic script, you can see that this time MATLAB has matched the text quite differently. In this case, MATLAB uses the very first match it can find and does not even consider the rest of the text:

regexp('mississippi', '\w*?(\w)(?@disp($1))\1\w*', 'match')
m
i
s

ans =

  1×1 cell array

    {'mississippi'}
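Instead of printing each candidate with disp, the command can append it to a workspace variable so that the trace can be inspected afterwards. This is a minimal sketch that assumes a (?@cmd) command can assign to a variable in the active workspace; the variable names greedyTrace and lazyTrace are hypothetical.

```matlab
greedyTrace = {};
lazyTrace = {};

% Record every character tried as the doubled letter, greedy first.
regexp('mississippi', '\w*(\w)(?@greedyTrace{end+1}=$1;)\1\w*', 'match');

% Same pattern with a lazy first quantifier.
regexp('mississippi', '\w*?(\w)(?@lazyTrace{end+1}=$1;)\1\w*', 'match');

disp(greedyTrace)
disp(lazyTrace)
```

If the engine takes the same steps as in the disp examples, greedyTrace should hold i, p, p and lazyTrace should hold m, i, s.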

To demonstrate how versatile this type of dynamic expression can be, consider the next example, which progressively assembles a cell array as MATLAB iteratively parses the input text. The (?!) operator at the end of the expression is an empty negative lookahead that forces a failure at each iteration. This forced failure is necessary if you want to trace the steps that MATLAB takes to resolve the expression.

MATLAB makes a number of passes through the input text, each time trying another combination of letters to see whether it can find a better fit than the last match. On any pass in which no match is found, the test yields an empty character vector. The if(~isempty($&)) test in the dynamic script omits these empty character vectors from the matches cell array:

matches = {};
expr = ['(Euler\s)?(Cauchy\s)?(Boole)?(?@if(~isempty($&)),' ...
   'matches{end+1}=$&;end)(?!)'];

regexp('Euler Cauchy Boole', expr);

matches
matches =

  1×6 cell array

    {'Euler Cauchy Bo...'}    {'Euler Cauchy '}    {'Euler '}    {'Cauchy Boole'}    {'Cauchy '}    {'Boole'}

The operators $& (or the equivalent $0), $`, and $' refer to the part of the input text that is currently a match, all characters that precede the current match, and all characters that follow the current match, respectively. These operators are sometimes useful when working with dynamic expressions, particularly those that use the (?@cmd) operator.

The following example parses the input text looking for the letter g. At each iteration through the text, regexp compares the current character with g and, if it does not match, advances to the next character. The example tracks the progress of the scan by marking the current parsing position with a ^ character.

(The $` and $' operators capture the parts of the text that precede and follow the current parsing position. When the sequence $' appears inside a quoted character vector, you need two single quotation marks ($'') to express it.)

chr = 'abcdefghij';
expr = '(?@disp(sprintf(''starting match: [%s^%s]'',$`,$'')))g';

regexp(chr, expr, 'once');
starting match: [^abcdefghij]
starting match: [a^bcdefghij]
starting match: [ab^cdefghij]
starting match: [abc^defghij]
starting match: [abcd^efghij]
starting match: [abcde^fghij]
starting match: [abcdef^ghij]

Commands in replacement expressions - ${cmd}

The ${cmd} operator modifies the contents of a regular expression replacement pattern, making the pattern adaptable to parameters in the input text that might vary from one use to the next. As with the other dynamic expressions used in MATLAB, you can include any number of these expressions within the overall replacement expression.

In the regexprep call shown here, the replacement pattern is '${convertMe($1,$2)}'. In this case, the entire replacement pattern is a dynamic expression:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}');

This dynamic expression tells MATLAB to execute a function named convertMe, using the two tokens (\d+\.?\d*) and (\w+), derived from the text being matched, as input arguments in the call to convertMe. The replacement pattern requires a dynamic expression because the values of $1 and $2 are generated at run time.

The following example defines a function named convertMe that converts measurements from imperial units to metric units.

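function valout = convertMe(valin, units)
% valin: quantity as text (token $1); units: imperial unit name (token $2).
% Select the conversion function and the metric unit name.
switch(units)
    case 'inches'
        fun = @(in)in .* 2.54;
        uout = 'centimeters';
    case 'miles'
        fun = @(mi)mi .* 1.6093;
        uout = 'kilometers';
    case 'pounds'
        fun = @(lb)lb .* 0.4536;
        uout = 'kilograms';
    case 'pints'
        fun = @(pt)pt .* 0.4731;
        uout = 'litres';
    case 'ounces'
        fun = @(oz)oz .* 28.35;
        uout = 'grams';
end
% Convert the text to a number, apply the conversion, and rebuild the text.
val = fun(str2num(valin));
valout = [num2str(val) ' ' uout];
end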

At the command line, call the convertMe function from regexprep, passing in the quantity to be converted and the name of the imperial unit:

regexprep('This highway is 125 miles long', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =

    'This highway is 201.1625 kilometers long'
regexprep('This pitcher holds 2.5 pints of water', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =

    'This pitcher holds 1.1828 litres of water'
regexprep('This stone weighs about 10 pounds', ...
          '(\d+\.?\d*)\W(\w+)', '${convertMe($1,$2)}')
ans =

    'This stone weighs about 4.536 kilograms'
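A replacement command does not have to be a user-defined function. As a minimal sketch, the following uses only built-in functions inside ${cmd}; the input texts and the variable names titled and doubled are hypothetical.

```matlab
% Capitalize the first letter of every word.
titled = regexprep('dynamic regular expressions', '(\<\w)', '${upper($1)}');

% Double every integer found in the text.
doubled = regexprep('a1b22', '(\d+)', '${num2str(2*str2double($1))}');

disp(titled)
disp(doubled)
```

In both calls the token arrives as text, so numeric work needs an explicit conversion to a number and back.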

Like the (??@cmd) operator discussed in an earlier section, the ${cmd} operator has access to variables in the currently active workspace. The following regexprep command uses the array A defined in the base workspace:

A = magic(3)

A =

     8     1     6
     3     5     7
     4     9     2
regexprep('The columns of matrix _nam are _val', ...
          {'_nam', '_val'}, ...
          {'A', '${sprintf(''%d%d%d '', A)}'})
ans =

    'The columns of matrix A are 834 159 672'

Keywords: Front-end MATLAB regex

Added by shan169 on Mon, 14 Feb 2022 03:07:26 +0200