[algorithm analysis] 7 String search

catalogue

String matching problem:

Analysis of simple matching algorithm:

7.2 Rabin Karp algorithm

Fingerprint based algorithm:

Using the Hash function

Preprocessing: find fingerprint

Step: the key is the calculation of each shift

Rabin Karp algorithm:

analysis:

Application:

KMP algorithm

Automata search

Prefix function:

Prefix table:

The prefix table of size m is pre calculated to store the value of π [q] (0 < = q

KMP analysis

BMH algorithm:

Inverse simple algorithm:

Heuristic method:

Offset table:

Pseudo code:

Algorithm implementation:

BMH analysis:

String matching problem:

Input:

Text T = "at the thought of"

n=length(T) =17

Mode P="the"

m=length(P) = 3

Output: move to S (S is the subscript in the text)

S is the minimum integer satisfying t [s.. S + M-1] = P [0.. M-1] (0 < = s < = n-m)

If no such s exists, - 1 is returned

Simple matching algorithm:

Violent search: check all values from 0 to n-m

Pseudo code

Native-Search (T,P)
for s <- 0 to n-m
	j <- 0
	//check if  T[s..s+m-1] = P[0..m-1]
	while T[s+j]= P[j] do
		j <- j+1
		if j = m return s
return -1 

How many comparisons are needed to make t = "at the thought of", P = "though"?

Analysis of simple matching algorithm:

Worst case scenario:

  • Outer loop: n-m [how many times does the first letter of P align with the letter of T]
  • Inner circulation: m []
  • Total (n-m)*m = O(nm)
  • What inputs produce the worst case? [each time P is matched, the mismatch is found only after the last one]

Best case: n-m [mismatch found at the beginning of each time]

Completely random text and patterns: O(n-m)

Idea: if the text is completely random and the alphabet size is k, then any two letters are 1/k each. Comparing the first letter of P and T, the probability of their matching is 1/k and the probability of mismatch is 1-1/k. The probability of exactly two comparisons is (1/k) * (1-1/k)

7.2 Rabin Karp algorithm

Fingerprint idea

Assumptions:

  • We can calculate the fingerprint f(P) of a P in O(m) time
  • If the fingerprints are not equal, f(P) ≠ f (T[s..s+m-1]), then the substrings are not equal
  • We can compare fingerprints at O(1) time
  • We can calculate the current fingerprint of F '(t [S + 1.. S + M]) and the fingerprint after moving one bit from F (t [s.. S + M-1]) at O(1)

Fingerprint based algorithm:

Let the alphabet be Σ = {0,1,2,3,4,5,6,7,8,9}

Make the fingerprint a decimal number, that is, f("1045") =1*10**3 + 0*10**2 +4*10**1 + 5 =1045

Fingerprint-Search (T,P)
	fp <- compute f(P)  //Pattern or reference substring
	f <- compute f(T[0..m-1])

	for s <- 0 to n-m do
		if fp = f return s
		f <- (f - T[s]*10**(m-1) )*10 + T[s+m]
	
	return -1

The running time is 2 O (m) + O (n-m) = O (n)

Start counting two fingerprints 2 O(m) + time for sequential comparison n-m rounds, each round is O(1)

Using the Hash function

Problem: in fact, we can't assume that we can perform arithmetic operation on m digits in O(1) time.

Solution: use the hash function h = f mod q

For example, if q = 7, H ("52") = 52 mod 7 = 3

h(S1) ≠ h(S2) => S1 ≠ S2

But h(S1) = h(S2) does not mean S1 = S2

For example, if q = 7, H ("73") = 3 but "73" ≠ "52"

But mod arithmetic operation - the operand is limited to Q. if q is small enough / constant, it can be in O(1)

Arithmetic operation in time

(a+b)mod q= (a mod q +b mod q)mod q

(a*b)mod q=(a mod q)*(b mod q)mod q 

Preprocessing: find fingerprint

Disassemble (Qin Jiushao)

fp=P[m-1] + 10* ( P[m-2] + 10*(P[m-3] +...+10*(P[1] + 10* (P[0]) ) )   mod q

ft can also be calculated from T[0...m-1]

For example, P = "2531", q = 7, FP =?

Step: the key is the calculation of each shift

ft =  ( ft-T[s]*10**(m-1) mod q) *10 + T[s+m] ) mod q

//T[s+m] is less than Q, and mod q is not required. The formula T[s]*10**(m-1) may be greater than Q, and mod q is required, while T[s] is also less than Q, so mod q is required for 10**(m-1)

10**(m-1) mod q is calculated once in the pretreatment

Rabin Karp algorithm:

Karbin-Karp-Search (T,P)
	q <- a prime larger than m//Select a prime number q greater than m to 
	c <- 10**(m-1) mod q //run a loop multiplying by 10 mod q
	fp <- 0 ; ft <- 0
	for i <- 0 to m-1 //preprocessing
	 	fp <- (10*dp +P[i] ) mod q
	 	ft <- (10*ft +T[i] ) mod q
	for s <- 0 to n-m //Matching matching
		if fp= ft then //Fingerprint equality does not mean string matching
			if P[0..m-1] = T[s..s+m-1] return s//You also need to match each character
		ft <- (  (ft - T[s]*c ) *10 + T[s+m ] ) mod q //displacement 
	return -1  

How many times?

analysis:

If q is a prime number, the hash function will make the m-bit string evenly distributed among Q values.

Therefore, only every q times in s rotation needs to match the fingerprint (matching is slow, and O(m) times need to be compared). Every q time you need a fingerprint, others need a comparison.

Expected running time (if Q > m)

Preprocessing: O(m) / / each Pattern character needs to be calculated once

External circulation: O(n-m) / / ratio from head to tail

All internal loops: (n-m)/q * m = O(n-m) / / specific fingerprint comparison

Total time: O(n-m) / / linear

Worst run time: O(nm) / / and violent search are matched every time, but it is found that the last letter is different every time

Application:

KMP algorithm

Matching of N comparisons

Target: each character in the text matches only once

Simple algorithm problem:

The knowledge in the existing partial matching is not utilized

        T="Tweedledee and Tweedledum"  P="Tweedledum"

        T="pappar"  P="pappappappar"

Automata search

Algorithm:

Pretreatment:

For each Q (1 < = q < = m-1) and each α ∈∑ calculate a new value of Q in advance and record it as σ (q, α)

Fill in a table with the size of m * ∑

Scan text:

When a mismatch is found, (P[q] ≠ T[s+q]): set s=s+q - σ (q, α)+ 1, and q = σ (q, α)

Analysis:

Match phase O(n)

Disadvantages: too much memory, O (m |∑|), too many preprocessing o (m |∑|)

Prefix function:

Prefix table:

Pre calculate the prefix table with size m to store the value of π [q] (0 < = q < m, m= length(P))

 

KMP-Search(T,P)
{
	π<- Compute-Prefix-Table(P)
	q <- 0 //Number of matching characters 
	for i <- 0 to n-1 //Scan text from left to right 
		while q >0 and P[q] ≠T[i] do
			q <- π[q] 
		if P[q] = T[i] then q <- q+1 
		if q= m then return i-m+1
	return -1
} 
//Compute prefix table is the essence of executing KMP algorithm on P 

KMP analysis

Worst case running time: O(n+m)

  • Main algorithm: O(n)
  • Compute-Prefix-Table : O(m)

Space: O (m)

Main algorithm: match twice at most

BMH algorithm:

Inverse simple algorithm:

Start searching after P.

Boyer and Moore

Reverse-Naive-Search (T,P)
	for s <- 0 to n-m
		j <- m-1 //Start at the end
		//check if T[s..s+m-1] = P[0..m-1]
		while T[s+j] = P[j] do
			j <- j-1
			if j< 0 return s
	return -1 

The running time is the same as that of the simple algorithm.

Heuristic method:

Match P and T, from back to front,

et a and e find a mismatch. At this time, only look at which of the last letter E and P can match (except the last E has been compared),

No, at this time, you can string P after e, which is equivalent to string "date" to cative,

From the back to the front, e is different from v. match T[s+n-1] with this v into the pattern, remove the last bit, and appear on the right except the one that has been compared. We know that there is no v in front, so we can string the date four digits later,

At this time, align d to the position of e behind v, and then find that E and a don't match,

Then look for a in the prefix of P. if you find an a, align the positions of the two a, and then e t a d will be found.

Offset table:

If the last letter appears in P In the prefix of, calculate P How many positions can you move back to align with him? If it doesn't appear, skip.

In the preprocessing, the offset table with the size of ∑ is calculated.

Example: P="kettle" remove the prefix "kettl" from the last letter. If the last letter is "e", shift [e] = 4, shift [l] = 1, shift [t] = 2, shift [k] = 5

Pseudo code:

BMH-Search (T,P)
    n <- the length of T
    m <- the length of P
    |∑| <- the length of OffsetTable

	//Calculate the offset table of P 
	for c <- 0 to |∑| -1 
		shift[c] =m //Default value

	for k <- 0 to m-2
		shift[P[k]] = m-1 -k 

	//search
	s <- 0
	while s <= n - m do 
		j <- m-1 //Start at the end
        // check if T[s..s+m-1] = P[0..m-1]
		while T[s+j] = P[j] do
			j <- j-1 
			if j < 0 return s
		s <- s + shift[ T[s+m-1] ] //shift by last letter
	return -1 

Algorithm implementation:

#include<bits/stdc++.h>
using namespace std;

const int maxnum=1005;
int shift[maxnum];

int BMH(string &T,string &P)
{
	int n=T.length();
	int m=P.length();
	
	//Offset table
	for(int i=0;i<maxnum;i++)
	{
		shift[i]=m;//Default value 
	}
	//The last subscript of each letter in the pattern string P, (except the last letter)
	//The main string never matches the last character and the number of bits to be shifted left 
	for(int i=0;i<=m-2;i++)
	{
		shift[P[i]]=m-1-i;//Character numeric 
	}
	
	int s=0;// Where is the starting position of the mode string in the main string
	
	while(s <= n-m)
	{
		int j=m-1;// Match from the end of the pattern string
		while(T[s+j]==P[j])
		{
			j=j-1;
			if(j<0)return s;// Match successful
		} 
		// Bad character found (the last character that currently matches the pattern string)
        // The last position in the pattern string (except the last bit)
        // The number of steps required to move from the end of the pattern string to this position
		s=s+shift[ T[s+m-1] ];
	}
	return -1;
}

int main()
{
	string T,P;
	while(true)
	{
		getline (cin,T);
		getline(cin,P);
		//
		int res=BMH(T,P);
		if(res==-1 )
			cout<<"Mismatch"<<endl;
		else 
			cout<<"Pattern string P The position of the main string is:"<<res<<endl;
	}
	return 0;
} 

test

at the thought of
though
 Pattern string P The position in the main string is: 7
aaaaaaab
aaa
 Pattern string P The position in the main string is: 0

BMH analysis:

Worst case run time:

  • Pretreatment: O(124∑| + m)
  • Search: O(nm)---
  • Total: O(nm)

Space: O(∑)

  • And m independent

Very fast on real data sets.

Keywords: Algorithm

Added by fpbaum on Wed, 26 Jan 2022 22:32:54 +0200