Note: data structure - Chapter 4

Chapter IV string

4.1 string

  • A String is a finite sequence of zero or more characters

    For example: S = "Hello World!"; Double quotation marks (Java, C) are used in some places; Use single quotation marks where necessary (Python)

  • Substring - composed of any consecutive characters; Main string - a string containing substrings

    Empty string - no characters; Space string - the character is a space, and each space character occupies 1 B

    Character position in the main string: the sequence number of the character in the string; Position of substring in the main string: the sequence number of the first character of the substring in the string

    (Note: the serial number starts from 1, not 0)

  • String VS linear table:

    String is a special linear table, and there is a linear relationship between data elements

    String data object: character set (such as Chinese characters, English characters, numeric characters, punctuation characters, etc.)

    Basic operation of string: usually take substring as the operation object, such as adding, deleting, modifying and querying; Index(S,T) - positioning operation; StrCompare (S,T) -- compare operation

  • String comparison: the string with larger symbol appears first is larger (ASCII code: lower case letter 97 > upper case letter 65); The long string prefix is the same as the short string, and the long string is larger

  • Expansion - garbled code problem: in the document, a set of coding rules was originally adopted; When the file is opened, the software thinks that another set of coding rules is adopted

4.1.2 string storage structure

*Sequential storage of strings

  • Static array implementation - fixed length sequential storage (the system will automatically recycle after the function is executed)

    Dynamic array implementation - heap allocation storage (manual free is required when running out)

#define MAXLEN 255
typedef struct{
    char ch[MAXLEN];
    int length;
}SString;

typedef struct{
    char *ch;
    int length;
}HString;
  • Save string length data

    1. Scheme 1: put the variable length at the end of the array

    2. Scheme 2: put the variable length at the beginning of the array ch[0]

      Advantages: the bit order of characters is the same as that of array subscripts; Disadvantages: ch[0] is of char type, only 1B=8bit can be put down, indicating the range of 0-255

    3. Scheme 3: there is no length variable, and the symbol '\ 0' indicates the end (corresponding to 0 of ASCII code)

      Method: traverse to '\ 0' to calculate the length, which is suitable for infrequently accessing the length of the string

    4. Scheme 4: ch[0] is discarded and the variable length is declared at the end (combined with schemes 1 and 2)

*Chain storage of strings

  • Low storage density: in a 32-bit computer, a char occupies 1b, but a pointer occupies 4B, that is, 32 byte bits
  • Let each node store multiple bytes to increase the storage density; Bytes with insufficient memory can be # filled
typedef strut StringNode{
    char ch;
    struct StringNode *next;
}StringNode,*String;

typedef strut StringNode{
    char ch[4];						//Each node stores multiple characters
    struct StringNode *next;
}StringNode,*String;

*Sequential string - substring

  • (storage method of scheme 4) substring: use Sub to return the substring with a length of len from the pos character of string S
#define MAXLEN 255
typedef struct{
    char ch[MAXLEN];
    int length;
}SString;

bool SubString(SString &Sub,SString S,int pos,in len){
    if(pos+len-1>S.length)
        return false;
    for(int i=pos;i<pos+len;i++)
        Sub.ch[i-pos+1]=S.ch[i];	//At this time, the position of the substring is i-pos+1
    Sub.length=len;
    return true;
}

*Sequence string - compare operation

  • Comparison operation: if s > t, the return value is > 0; If S=T, the return value is 0; If s < T, the return value is < 0,
int StrCompare(SString S,SString T){
    for(i=1;i<=SString S&&i<=SString T;i++){	//The end length here is the shorter of the two strings
        if(S.ch[i]!=T.ch[i])
            return S.ch[i]-T.ch[i];		//During scanning, return is to subtract the value directly
    }
    return S.length-T.length;			//At the end of scanning, the longer length is larger
}

*Sequence string - positioning operation

  • Positioning operation: suppose there is a substring T in the main string S; Returns the position of the first occurrence of a string, otherwise 0
int Index(SString S,SString T){
    int i=1,n=StringLength(S),m=StringLength(T);
    SString sub;
    while(i<=n-m+1){					//i is initially defined as 1, so + 1
        SubString(Sub,S,i,m);			//The initial value of i must be 1. Take a substring with the same length in the main string
        if(StrCompare(sub,T)!=0)		//Compare whether the two substrings are the same
            i++;						//If it's different, move back and continue the comparison
        else
            return i;					//Returns the substring position
    }
    return 0;							//Compare all, no equal substring
}

4.2 pattern matching of strings

4.2.1 naive pattern matching algorithm of string

  • Substring: it must exist in the main string to be called "substring"; Pattern string: the string you want to try to find in the main string may not exist

  • String pattern matching: find the same substring as the pattern string in the main string and return its location

*Naive pattern matching algorithm

  • Similar to the positioning operation, you can stop checking the current substring as long as one character is different

  • The time complexity is O(nm); n is much greater than m

int Index(SSring S,SString T){
    int k=1;							//k is to check the position of the substring in the main string
    int i=k;j=1;						//i is the position of the check character of the check substring, and j is the position of the check character of the substring
    while(i<=S.length&&i<=T.length){
        if(S.ch[i]==T.ch[j]){
            i++;
            j++;
        }else{
            k++;
            i=k;
            j=1;
        }
    }
    if(j>T.length)						//The loop ends because the substring matches successfully
        return k;
    else								//The loop ends because the substring does not match at the end
        return 0;
}

//Method 2: do not set variable k
int Index(SSring S,SString T){						
    int i=1;j=1;						
    while(i<=S.length&&i<=T.length){
        if(S.ch[i]==T.ch[j]){
            i++;
            j++;
        }else{
            i=i-j+2;					//i-(j-1)+1;  If the J-1 substring has been scanned, + 1 moves back one bit
            j=1;
        }
    }
    if(j>T.length)						
        return i-T.length;				//Returns the position of the first character
    else
        return 0;
}

4.2.2 KMP algorithm (I)

  • Mode string pointer backtracking position - int next[7]

  • For example: g o g l e

    If j=k and character mismatch is found, let j=next[k]; If the current two characters match, i + +, j + +;

    If a mismatch occurs when j=1, let J return to 0, i + +, j + + (let i + +, J still be 1); If a mismatch occurs when j=2, let J return to 1;

    If a mismatch occurs when j=3, let j return to 1; If a mismatch occurs when j=4, let j return to 1;

    If a mismatch occurs when j=5, let j return to 2; If a mismatch occurs when j=6, let j return to 1

*KMP algorithm code

int Index_KMP(SString S,SString T,int next[]){
    int i=1,j=1;
    while(i<=S.length||j<=T.length){
        if(j==0||S.ch[i]==T.ch[j]){		//i-1, j-1 if starting from 0
            i++;
            j++;						//Initial or matching, continue to compare later
        }else{
            j=next[j];					//Mismatch, pattern string moves to the right
        }
        if(j>T.length)
            return i-T.length;
        else
            return 0;
    }
}

4.2.3 KMP algorithm (Part 2)

*Find next array

  • Prefix of string: a substring that contains the first character and does not contain the last character

    Suffix of string: a substring that contains the last character and does not contain the first character

  • When the j-th character fails to match, the string composed of the first 1~j-1 characters is marked as s, then next [J-1] = the longest phase of S and the length of the prefix + 1

    In particular, next[1]=0; And next[2]=1, the prefix and suffix are not equal. next[x]=1 (0 + 1)

  • The time complexity is O(m+n)

void get_next(SString T,int next[]){
    int i=1,j=0;
    next[1]=0;
    while(i<T.length){
        if(j==0||T.ch[i]==T.ch[j]){
            i++;
            j++;
            next[i]=j;
        }else
            j=next[j];
    }
}

int Index_KMP(SString S,SString T){
    int i=1,j=1;
    int next[T.length+1];				//Only when next[0] is empty can it be mapped
    get_next(T,next);					//The time complexity is O(n)
    while(i<=S.length||j<=T.length){	//Time complexity is O(m)
        if(j==0||S.ch[i]==T.ch[j]){
            i++;
            j++;						
        }else{
            j=next[j];					
        }
        if(j>T.length)
            return i-T.length;
        else
            return 0;
    }
}

4.2.4 optimization of KMP algorithm

  • Optimization of next[j]: when substring and pattern string do not match, j=nextval[j]

    Idea: if the character of the next backtrace is the same as this character, you can give the next of the backtrace character to this character:

*Find nextval array

//First calculate the next array, shilling nextval[1]=0
for(int j=2;j<=T.length;j++){
    if(T.ch[next[j]]==T.ch[j])
        nextval[j]=nextval[next[j]];
    else
        nextval[j]=next[j];
}

Keywords: data structure string

Added by Michan on Wed, 10 Nov 2021 19:02:20 +0200