Analysis and implementation of C language source code -- implementation of strtok() series functions

Source code analysis and Practice

strtok() source code implementation

Because it is a global variable storage address implemented by static, you only need to pass it once and then pass NULL to operate the previous string.

  • But it also causes a problem: if two threads use this function, they will share this variable because it is a global variable. If two threads process two different strings and their address storage uses the same variable, This will cause the string that should have been processed by this thread to be processed by another thread.
    This is called thread insecurity.
#include<string.h>

static char *_strtok = NULL; //Cache the last result with global variables

char *
strtok(char *s, char const *ct) {
    char *sbegin, *send;

    sbegin = s ? s : _strtok;
    if (!sbegin) {
        return NULL;
    }
    sbegin += strspn(sbegin, ct);
    if (*sbegin == '\0') {
        _strtok = NULL;
        return (NULL);
    }
    send = strpbrk(sbegin, ct);
    if (send && *send != '\0')
        *send++ = '\0';
    _strtok = send;
    return (sbegin);
}

The other functions used in the source code are strspn(), strpbrk(), and strcspn()

strspn()

Use of strspn

The second parameter forms a set and returns the subscript position of the first character in the first string parameter that is not in the set.

/* strspn example */
#include <stdio.h>
#include <string.h>

int main ()
{
  int i;
  char strtext[] = "129th";
  char cset[] = "1234567890";

  i = strspn (strtext,cset);
  printf ("The initial number has %d digits.\n",i);
  return 0;
}

strspn implementation (personal implementation version)

int strspn(const char* s1,const char* s2){
    if(s1==NULL || s2==NULL)//Handling special situations
        return 0;
    bool cset[256]={0};
    while((*s2)!='\0'){	//Create set
        cset[*s2] = 1;
        ++s2;
    }
    int idx = 0;
    while(s1[idx] != '\0'){//Get the correct idx position
        if(cset[s1[idx]])
        	idx++;
        else
            break;
    }
    return idx;
}

strcspn()

Prototype:

size_t strcspn(const char *str1, const char *str2)

The function is similar to that of strspn, except that there is an additional c, which means constant and continuous.

So it is to find out the continuous length that is not in the str2 character set from str1.

Code implementation (personal function implementation)

int strspn(const char* s1,const char* s2){
    if(s1==NULL || s2==NULL)//Handling special situations
        return 0;
    bool cset[256]={0};
    while((*s2)!='\0'){	//Create set
        cset[*s2] = 1;
        ++s2;
    }
    int len = 0;
    while(s1[idx] != '\0'){//Get the correct idx position
        if(!cset[s1[len]])
        	len++;
        else
            break;
    }
    return len;
}

strpbrk()

Use of strpbrk

Similarly, first establish a character set and return the address (pointer) of the first position of the character in the first string in the set.

/* strpbrk example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] = "This is a sample string";
  char key[] = "aeiou";
  char * pch;
  printf ("Vowels in '%s': ",str);
  pch = strpbrk (str, key);
  while (pch != NULL)
  {
    printf ("%c " , *pch);
    pch = strpbrk (pch+1,key);
  }
  printf ("\n");
  return 0;
}

Implementation of strpbrk (personal implementation version space saving version)

In order to save space, I use every bit here. If you can't understand it, you can save between 256 bool (one byte) variables.

typedef unsigned char uchar;

inline int get_pos(uchar x) {
    return x % 32;
}

char *strpbrk(const char *s1, const char *s2) {
    if (s1 == NULL || s2 == NULL)
        return NULL;
    uchar cset[32] = {0};                        //Using 32 unsigned char s to bool each bit can save more memory
    while ((*s2) != '\0') {                      //When creating a set,%32 determines which box to put in, and / 32 determines which bit of 8 bits to put in
        uchar t = (uchar) *s2++;
        cset[get_pos(t)] |= 1 << (t / 32);       //Since the maximum value is 255, t/32 is 0 ~ 31
    }
    while ((*s1) != '\0') {
        uchar t = (uchar) *s1;
        if (cset[get_pos(t)] & (1 << (t / 32))) {//How to find bitmap
            return (char *) s1;
        } else {
            ++s1;
        }
    }
    return NULL;
}

strtok_r() source code implementation

Function prototype:

char * strtok_r (char *s, const char *delim, char **save_ptr);

Since the static global variable is not used, a pointer variable needs to be passed in from the outside to store the string of the current operation.

Because external variables are used for caching, there will be no thread safety problem!

The following are usage tests:

#include <stdio.h>
#include <string.h>

void func() {
    char chSrc[] = "Can I help you";//Note that constants cannot be used directly and memory is required
    char *save = NULL;          //For caching
    char *pchSrc = chSrc;
    while (NULL != (pchSrc = strtok_s(pchSrc, " ", &save))) {
        printf("\npchSrc: %s\nsave: %s\n", pchSrc, save);
        pchSrc = NULL;
    }
}

int main() {
    func();
    return 0;
}

strtok_ (R) source code:

char *
strtok_r(char *s, const char *delim, char **save_ptr) {
    char *end;
    if (s == NULL)
        s = *save_ptr;
    if (*s == '\0') {
        *save_ptr = s;
        return NULL;
    }
    /* Scan leading delimiters.  */
    s += strspn(s, delim);
    if (*s == '\0') {
        *save_ptr = s;
        return NULL;
    }
    /* Find the end of the token.  */
    end = s + strcspn(s, delim);
    if (*end == '\0') {
        *save_ptr = end;
        return s;
    }
    /* Terminate the token and make *SAVE_PTR point past it.  */
    *end = '\0';
    *save_ptr = end + 1;
    return s;
}

(independent of other function libraries) implement strtok() entirely by yourself

design scheme

Design scheme:

  1. Function: split the string according to the separator
  2. Principle: global variable state cache
  3. Key: due to the lack of strspn() and strpbrk(), after implementing strspn(), strcspn() and strpbrk(), it is found that it is the process of establishing a collection to judge the movement of the pointer.
  • In order to save space, I use every bit here. If you can't understand it, you can save between 256 bool (one byte) variables.

I'm afraid someone can't understand it. Why do I need 256 locations for mapping and why do I need uchar? Here is the answer: because a character is 1 byte, if the symbol is not considered, it can represent 0 ~ 255 at most. Because we use the array subscript to map the value of the character, it needs to be a non negative number, so it is converted to uchar, and my typedef is uchar. I just feel that the unsigned char is too long 🤣

code implementation

typedef unsigned char uchar;

inline int get_pos(uchar x) {
    return x % 32;
}

char *strtok(char *src, const char *delimiters) {
    char *sbegin, *send;
    static char *ssave = NULL;
    sbegin = src ? src : ssave;           //If src is NULL, the last cache continues
    uchar cset[32] = {0};                 //Using 32 unsigned char s to bool each bit can save more memory
    while ((*delimiters) != '\0') {       //Update set
        uchar t = (uchar) *delimiters++;
        cset[get_pos(t)] |= 1 << (t / 32);//Since the maximum value is 255, t/32 is 0 ~ 7
    }
    //Let sbegin point to a location that is not in the set
    while (*sbegin != '\0' && (cset[get_pos(*sbegin)] & (1 << (((uchar) *sbegin) / 32)))) {
        ++sbegin;
    }
    if (*sbegin == '\0') {
        ssave = NULL;
        return NULL;
    }
    int idx = 0;
    //Equivalent to the function of strcspn
    while (sbegin[idx] != '\0' && !(cset[get_pos(sbegin[idx])] & (1 << ((uchar) sbegin[idx] / 32)))) {
        ++idx;
    }
    send = sbegin + idx;
    if (*send != '\0')
        *send++ = '\0';                   //Draw Terminator
    ssave = send;                         //Update cache location for next processing
    return sbegin;
}

performance analysis

The problem scale N is the length of the string. Because strtok is designed in the way of collection storage rather than violent enumeration, the time complexity is O(n)! The array used to store set elements also realizes space optimization through bit operation. The space does not change with the scale of the problem, so the space complexity of O(1) is very high.

Time complexity: O ( N ) O(N) O(N)

Space complexity: O ( 1 ) O(1) O(1)

test case

main function of the test

Test Description: split the whole string with ',', '.' - 'as separator.

int main() {
    char str[] = "- This 3234d s3242- -d-, a sample string.";
    char *pch;
    printf("Splitting string \"%s\" into tokens:\n", str);
    pch = strtok(str, " ,.-");
    while (pch != NULL) {
        printf("%s\n", pch);
        pch = strtok(NULL, " ,.-");
    }
    return 0;
}

Test results:

Very good, very perfect!!!

Keywords: C Back-end

Added by gladiator83x on Mon, 08 Nov 2021 21:02:23 +0200