Chapter IV string
4.1 string
-
A String is a finite sequence of zero or more characters
For example: S = "Hello World!"; Double quotation marks (Java, C) are used in some places; Use single quotation marks where necessary (Python)
-
Substring - composed of any consecutive characters; Main string - a string containing substrings
Empty string - no characters; Space string - the character is a space, and each space character occupies 1 B
Character position in the main string: the sequence number of the character in the string; Position of substring in the main string: the sequence number of the first character of the substring in the string
(Note: the serial number starts from 1, not 0)
-
String VS linear table:
String is a special linear table, and there is a linear relationship between data elements
String data object: character set (such as Chinese characters, English characters, numeric characters, punctuation characters, etc.)
Basic operation of string: usually take substring as the operation object, such as adding, deleting, modifying and querying; Index(S,T) - positioning operation; StrCompare (S,T) -- compare operation
-
String comparison: the string with larger symbol appears first is larger (ASCII code: lower case letter 97 > upper case letter 65); The long string prefix is the same as the short string, and the long string is larger
-
Expansion - garbled code problem: in the document, a set of coding rules was originally adopted; When the file is opened, the software thinks that another set of coding rules is adopted
4.1.2 string storage structure
*Sequential storage of strings
-
Static array implementation - fixed length sequential storage (the system will automatically recycle after the function is executed)
Dynamic array implementation - heap allocation storage (manual free is required when running out)
#define MAXLEN 255 typedef struct{ char ch[MAXLEN]; int length; }SString; typedef struct{ char *ch; int length; }HString;
-
Save string length data
-
Scheme 1: put the variable length at the end of the array
-
Scheme 2: put the variable length at the beginning of the array ch[0]
Advantages: the bit order of characters is the same as that of array subscripts; Disadvantages: ch[0] is of char type, only 1B=8bit can be put down, indicating the range of 0-255
-
Scheme 3: there is no length variable, and the symbol '\ 0' indicates the end (corresponding to 0 of ASCII code)
Method: traverse to '\ 0' to calculate the length, which is suitable for infrequently accessing the length of the string
-
Scheme 4: ch[0] is discarded and the variable length is declared at the end (combined with schemes 1 and 2)
-
*Chain storage of strings
- Low storage density: in a 32-bit computer, a char occupies 1b, but a pointer occupies 4B, that is, 32 byte bits
- Let each node store multiple bytes to increase the storage density; Bytes with insufficient memory can be # filled
typedef strut StringNode{ char ch; struct StringNode *next; }StringNode,*String; typedef strut StringNode{ char ch[4]; //Each node stores multiple characters struct StringNode *next; }StringNode,*String;
*Sequential string - substring
- (storage method of scheme 4) substring: use Sub to return the substring with a length of len from the pos character of string S
#define MAXLEN 255 typedef struct{ char ch[MAXLEN]; int length; }SString; bool SubString(SString &Sub,SString S,int pos,in len){ if(pos+len-1>S.length) return false; for(int i=pos;i<pos+len;i++) Sub.ch[i-pos+1]=S.ch[i]; //At this time, the position of the substring is i-pos+1 Sub.length=len; return true; }
*Sequence string - compare operation
- Comparison operation: if s > t, the return value is > 0; If S=T, the return value is 0; If s < T, the return value is < 0,
int StrCompare(SString S,SString T){ for(i=1;i<=SString S&&i<=SString T;i++){ //The end length here is the shorter of the two strings if(S.ch[i]!=T.ch[i]) return S.ch[i]-T.ch[i]; //During scanning, return is to subtract the value directly } return S.length-T.length; //At the end of scanning, the longer length is larger }
*Sequence string - positioning operation
- Positioning operation: suppose there is a substring T in the main string S; Returns the position of the first occurrence of a string, otherwise 0
int Index(SString S,SString T){ int i=1,n=StringLength(S),m=StringLength(T); SString sub; while(i<=n-m+1){ //i is initially defined as 1, so + 1 SubString(Sub,S,i,m); //The initial value of i must be 1. Take a substring with the same length in the main string if(StrCompare(sub,T)!=0) //Compare whether the two substrings are the same i++; //If it's different, move back and continue the comparison else return i; //Returns the substring position } return 0; //Compare all, no equal substring }
4.2 pattern matching of strings
4.2.1 naive pattern matching algorithm of string
-
Substring: it must exist in the main string to be called "substring"; Pattern string: the string you want to try to find in the main string may not exist
-
String pattern matching: find the same substring as the pattern string in the main string and return its location
*Naive pattern matching algorithm
-
Similar to the positioning operation, you can stop checking the current substring as long as one character is different
-
The time complexity is O(nm); n is much greater than m
int Index(SSring S,SString T){ int k=1; //k is to check the position of the substring in the main string int i=k;j=1; //i is the position of the check character of the check substring, and j is the position of the check character of the substring while(i<=S.length&&i<=T.length){ if(S.ch[i]==T.ch[j]){ i++; j++; }else{ k++; i=k; j=1; } } if(j>T.length) //The loop ends because the substring matches successfully return k; else //The loop ends because the substring does not match at the end return 0; } //Method 2: do not set variable k int Index(SSring S,SString T){ int i=1;j=1; while(i<=S.length&&i<=T.length){ if(S.ch[i]==T.ch[j]){ i++; j++; }else{ i=i-j+2; //i-(j-1)+1; If the J-1 substring has been scanned, + 1 moves back one bit j=1; } } if(j>T.length) return i-T.length; //Returns the position of the first character else return 0; }
4.2.2 KMP algorithm (I)
-
Mode string pointer backtracking position - int next[7]
-
For example: g o g l e
If j=k and character mismatch is found, let j=next[k]; If the current two characters match, i + +, j + +;
If a mismatch occurs when j=1, let J return to 0, i + +, j + + (let i + +, J still be 1); If a mismatch occurs when j=2, let J return to 1;
If a mismatch occurs when j=3, let j return to 1; If a mismatch occurs when j=4, let j return to 1;
If a mismatch occurs when j=5, let j return to 2; If a mismatch occurs when j=6, let j return to 1
*KMP algorithm code
int Index_KMP(SString S,SString T,int next[]){ int i=1,j=1; while(i<=S.length||j<=T.length){ if(j==0||S.ch[i]==T.ch[j]){ //i-1, j-1 if starting from 0 i++; j++; //Initial or matching, continue to compare later }else{ j=next[j]; //Mismatch, pattern string moves to the right } if(j>T.length) return i-T.length; else return 0; } }
4.2.3 KMP algorithm (Part 2)
*Find next array
-
Prefix of string: a substring that contains the first character and does not contain the last character
Suffix of string: a substring that contains the last character and does not contain the first character
-
When the j-th character fails to match, the string composed of the first 1~j-1 characters is marked as s, then next [J-1] = the longest phase of S and the length of the prefix + 1
In particular, next[1]=0; And next[2]=1, the prefix and suffix are not equal. next[x]=1 (0 + 1)
-
The time complexity is O(m+n)
void get_next(SString T,int next[]){ int i=1,j=0; next[1]=0; while(i<T.length){ if(j==0||T.ch[i]==T.ch[j]){ i++; j++; next[i]=j; }else j=next[j]; } } int Index_KMP(SString S,SString T){ int i=1,j=1; int next[T.length+1]; //Only when next[0] is empty can it be mapped get_next(T,next); //The time complexity is O(n) while(i<=S.length||j<=T.length){ //Time complexity is O(m) if(j==0||S.ch[i]==T.ch[j]){ i++; j++; }else{ j=next[j]; } if(j>T.length) return i-T.length; else return 0; } }
4.2.4 optimization of KMP algorithm
-
Optimization of next[j]: when substring and pattern string do not match, j=nextval[j]
Idea: if the character of the next backtrace is the same as this character, you can give the next of the backtrace character to this character:
*Find nextval array
//First calculate the next array, shilling nextval[1]=0 for(int j=2;j<=T.length;j++){ if(T.ch[next[j]]==T.ch[j]) nextval[j]=nextval[next[j]]; else nextval[j]=next[j]; }