Principle and C + + implementation of KMP algorithm

preface

Recently, Feynman learning method has been practiced and applied to learning. KMP algorithm is a difficult knowledge point in data structure and algorithm. It took several days to understand the principle of this algorithm and implement it in C + +. Write this blog to sort out the principle of KMP algorithm and my C + + implementation algorithm.

brief introduction

KMP algorithm is an optimization algorithm based on BF algorithm. BF algorithm decomposes the main string into (main string length - mode string length + 1) substrings, and then compares these substrings with mode strings. Each comparison must trace back the pointer of the main string. KMP algorithm is optimized in the backtracking part. The pointer of the main string does not trace back, but traces back the pointer of the mode string to a specific position to continue matching, The key part of KMP algorithm is to find the specific location of pattern string backtracking. This specific location of backtracking is stored in an array. In fact, this location is only related to pattern string. In order to explain why to backtrack to this location, it is necessary to introduce the concept of common pre suffix.

int search_kmp(string& M,string& P,int pos)
{
	int next[P.length()];//The next array stores the backtracking position of the pattern string after the matching fails
	get_next(P,next);//Gets the position of the mode string pointer backtracking, which is related to the mode string itself
	int i=pos,j=0;//i is the pointer to the main string and j is the pointer to the mode string
	while(j<P.length()&&i<M.length()) 
	{
		if(j==0||M[i]==P[j])
		{
			++i;
			++j;
		}
		else//Matching failed, the pattern string pointer is traced back to a specific location
		{
			j=next[j-1];
		}
	}
	if(j==P.length())//If the matching is successful, the j pointer points to the last bit after the last element of the text string (main string) 
		return i-P.length();//Returns the starting position when the pattern string is matched in the main string
	else//Matching failed. An error message is returned indicating that it was not found 
		return -1;
}

Related concepts

Prefix: all substrings of the string excluding the first element. For example, the prefix of string atabat includes t, ta, tab, tatb and tabat. The first element of this string is a, and the prefix does not contain the first element a

Suffix: all substrings of the string excluding the tail element. For example, the suffixes of the string atabat include a, at, ata, atab, ataba. The tail element of this string is t, and the suffix does not contain the tail element t

Common prefix: the same prefix becomes a common prefix. For example, the common prefix of the string atabat has at.

In fact, the value stored in the next array is the length of the common prefix, that is, the position to which the pattern string pointer is traced in case of mismatch.

Backtracking principle of KMP algorithm

give an example:

Why can we use a common prefix for backtracking? This problem bothered me for a long time.

For example:

There is A main string m, which can be decomposed into several substrings, represented by ABAC (A, B and C are substrings of M)

There is A pattern string P, and the decomposable component is composed of several substrings, represented by ABAD (A, B and D are substrings of P)

ABAC

ABAD, if the two strings are found to be inconsistent when they are matched to C and D, the mode string pointer is traced back from D to the second A

Equivalent to ABAC

ABAD

The next step is to match the substring C of the main string and the substring BCD of the pattern string, which is equivalent to completing part of the matching in advance without the main string pointer for backtracking, saving the main string backtracking steps in the BF algorithm and optimizing the time complexity.

The above uses an intuitive way for string matching. In the code, this effect is achieved by moving the pattern string pointer. It is equivalent to moving the mode string pointer to the first substring A. since the mode string array starts from 0, this position is exactly the length of the common prefix a.

In fact, when there is a mismatch, the common prefix length should be 0, using the common prefix sum of the previous element of the pattern string pointer value.

Find the value of the common prefix length

In the code, the value of the common prefix length is stored in the next array

void get_next(string& P,int* next)
{
	int j=0;//J is the tail pointer of prefix string and j is the tail pointer of suffix string
	next[0]=0;//Define the first element value of the next array as 0
	for(int i=1;i<P.length();i++)//The trailing pointer of the suffix starts with 1 
	{
		//Handle the mismatch between the front and back affixes, and the prefix pointer of the mismatch between the front and back affixes is fallback 
		while(j>0&&P[j]!=P[i])
		{
			//In case of mismatch, j looks for a partial match. If it exists, it must be the longest common prefix of P[0,j-1] 
			j=next[j-1];
		}
		//Suffix matching before processing
		if(P[j]==P[i])
		{
			j++;//Or next[i]=j,j is also the longest common prefix length at this time 
		} 
		next[i]=j;
	} 
}

Four situations need to be handled to get the next array

1. Initialization

2. Same prefix and suffix

3. Different affixes

4. Update the value of the next array

initialization

next [0] = 0. The for loop starts from 1, so you need to deal with the case where the prefix tail pointer is 0 first

Same prefix and suffix

That is, the tail pointer of the advance prefix, j++

Different affixes

Because the suffix pointer i is pushed back one bit, the suffix is changed. Take the former suffixes as a pattern string and find the substring matching the former part and the latter part of the pattern string. After the pointer is backtracked, the next element is compared until it is backtracked to 0 or matched.

This part of the brief book is well explained in this blog:

[algorithm] how to calculate the next array in KMP algorithm - Jianshu (jianshu.com)

Update the value of next

We can find that the value of j, that is, the tail pointer of the prefix, is exactly equal to the prefix length (also equal to the suffix length), so we can use the value of j to update the next array

Complete code

#include<iostream>
#include<string>
using namespace std;
void get_next(string& P,int* next);//Modify the value of the next array 
int search_kmp(string& M,string& P,int pos=0);
void check_next(string& P,int* next);
int main()
{
	string m,pattern;
	cout<<"Enter text string:\n";
	cin>>m;
	cout<<"Input mode string:\n";
	cin>>pattern;
	cout<<"The starting position of the mode string in the text string:\n";
	cout<<search_kmp(m,pattern,0);
	return 0;
}
void get_next(string& P,int* next)
{
	int j=0;//J is the tail pointer of prefix string and j is the tail pointer of suffix string
	next[0]=0;//Define the first element value of the next array as 0
	for(int i=1;i<P.length();i++)//The trailing pointer of the suffix starts with 1 
	{
		//Handle the mismatch between the front and back affixes, and the prefix pointer of the mismatch between the front and back affixes is fallback 
		while(j>0&&P[j]!=P[i])
		{
			//In case of mismatch, j looks for a partial match. If it exists, it must be the longest common prefix of P[0,j-1] 
			j=next[j-1];
		}
		//Suffix matching before processing
		if(P[j]==P[i])
		{
			j++;//Or next[i]=j,j is also the longest common prefix length at this time 
		} 
		next[i]=j;
	} 
} 
int search_kmp(string& M,string& P,int pos=0)
{
	int next[P.length()];
	get_next(P,next);
	int i=pos,j=0;//i is the pointer to the main string and j is the pointer to the mode string
	while(j<P.length()&&i<M.length()) 
	{
		if(j==0||M[i]==P[j])
		{
			++i;
			++j;
		}
		else
		{
			j=next[j-1];
		}
	}
	if(j==P.length())//If the matching is successful, the j pointer points to the last bit after the last element of the text string (main string) 
		return i-P.length();
	else//Matching failed. An error message is returned indicating that it was not found 
		return -1;
}

Keywords: C++ Algorithm data structure

Added by phpbeginner0120 on Sat, 02 Oct 2021 05:28:19 +0300

Programming VIP