Source code analysis and Practice
strtok() source code implementation
Because it is a global variable storage address implemented by static, you only need to pass it once and then pass NULL to operate the previous string.
- But it also causes a problem: if two threads use this function, they will share this variable because it is a global variable. If two threads process two different strings and their address storage uses the same variable, This will cause the string that should have been processed by this thread to be processed by another thread.
This is called thread insecurity.
#include<string.h> static char *_strtok = NULL; //Cache the last result with global variables char * strtok(char *s, char const *ct) { char *sbegin, *send; sbegin = s ? s : _strtok; if (!sbegin) { return NULL; } sbegin += strspn(sbegin, ct); if (*sbegin == '\0') { _strtok = NULL; return (NULL); } send = strpbrk(sbegin, ct); if (send && *send != '\0') *send++ = '\0'; _strtok = send; return (sbegin); }
The other functions used in the source code are strspn(), strpbrk(), and strcspn()
strspn()
Use of strspn
The second parameter forms a set and returns the subscript position of the first character in the first string parameter that is not in the set.
/* strspn example */ #include <stdio.h> #include <string.h> int main () { int i; char strtext[] = "129th"; char cset[] = "1234567890"; i = strspn (strtext,cset); printf ("The initial number has %d digits.\n",i); return 0; }
strspn implementation (personal implementation version)
int strspn(const char* s1,const char* s2){ if(s1==NULL || s2==NULL)//Handling special situations return 0; bool cset[256]={0}; while((*s2)!='\0'){ //Create set cset[*s2] = 1; ++s2; } int idx = 0; while(s1[idx] != '\0'){//Get the correct idx position if(cset[s1[idx]]) idx++; else break; } return idx; }
strcspn()
Prototype:
size_t strcspn(const char *str1, const char *str2)
The function is similar to that of strspn, except that there is an additional c, which means constant and continuous.
So it is to find out the continuous length that is not in the str2 character set from str1.
Code implementation (personal function implementation)
int strspn(const char* s1,const char* s2){ if(s1==NULL || s2==NULL)//Handling special situations return 0; bool cset[256]={0}; while((*s2)!='\0'){ //Create set cset[*s2] = 1; ++s2; } int len = 0; while(s1[idx] != '\0'){//Get the correct idx position if(!cset[s1[len]]) len++; else break; } return len; }
strpbrk()
Use of strpbrk
Similarly, first establish a character set and return the address (pointer) of the first position of the character in the first string in the set.
/* strpbrk example */ #include <stdio.h> #include <string.h> int main () { char str[] = "This is a sample string"; char key[] = "aeiou"; char * pch; printf ("Vowels in '%s': ",str); pch = strpbrk (str, key); while (pch != NULL) { printf ("%c " , *pch); pch = strpbrk (pch+1,key); } printf ("\n"); return 0; }
Implementation of strpbrk (personal implementation version space saving version)
In order to save space, I use every bit here. If you can't understand it, you can save between 256 bool (one byte) variables.
typedef unsigned char uchar; inline int get_pos(uchar x) { return x % 32; } char *strpbrk(const char *s1, const char *s2) { if (s1 == NULL || s2 == NULL) return NULL; uchar cset[32] = {0}; //Using 32 unsigned char s to bool each bit can save more memory while ((*s2) != '\0') { //When creating a set,%32 determines which box to put in, and / 32 determines which bit of 8 bits to put in uchar t = (uchar) *s2++; cset[get_pos(t)] |= 1 << (t / 32); //Since the maximum value is 255, t/32 is 0 ~ 31 } while ((*s1) != '\0') { uchar t = (uchar) *s1; if (cset[get_pos(t)] & (1 << (t / 32))) {//How to find bitmap return (char *) s1; } else { ++s1; } } return NULL; }
strtok_r() source code implementation
Function prototype:
char * strtok_r (char *s, const char *delim, char **save_ptr);
Since the static global variable is not used, a pointer variable needs to be passed in from the outside to store the string of the current operation.
Because external variables are used for caching, there will be no thread safety problem!
The following are usage tests:
#include <stdio.h> #include <string.h> void func() { char chSrc[] = "Can I help you";//Note that constants cannot be used directly and memory is required char *save = NULL; //For caching char *pchSrc = chSrc; while (NULL != (pchSrc = strtok_s(pchSrc, " ", &save))) { printf("\npchSrc: %s\nsave: %s\n", pchSrc, save); pchSrc = NULL; } } int main() { func(); return 0; }
strtok_ (R) source code:
char * strtok_r(char *s, const char *delim, char **save_ptr) { char *end; if (s == NULL) s = *save_ptr; if (*s == '\0') { *save_ptr = s; return NULL; } /* Scan leading delimiters. */ s += strspn(s, delim); if (*s == '\0') { *save_ptr = s; return NULL; } /* Find the end of the token. */ end = s + strcspn(s, delim); if (*end == '\0') { *save_ptr = end; return s; } /* Terminate the token and make *SAVE_PTR point past it. */ *end = '\0'; *save_ptr = end + 1; return s; }
(independent of other function libraries) implement strtok() entirely by yourself
design scheme
Design scheme:
- Function: split the string according to the separator
- Principle: global variable state cache
- Key: due to the lack of strspn() and strpbrk(), after implementing strspn(), strcspn() and strpbrk(), it is found that it is the process of establishing a collection to judge the movement of the pointer.
- In order to save space, I use every bit here. If you can't understand it, you can save between 256 bool (one byte) variables.
I'm afraid someone can't understand it. Why do I need 256 locations for mapping and why do I need uchar? Here is the answer: because a character is 1 byte, if the symbol is not considered, it can represent 0 ~ 255 at most. Because we use the array subscript to map the value of the character, it needs to be a non negative number, so it is converted to uchar, and my typedef is uchar. I just feel that the unsigned char is too long 🤣
code implementation
typedef unsigned char uchar; inline int get_pos(uchar x) { return x % 32; } char *strtok(char *src, const char *delimiters) { char *sbegin, *send; static char *ssave = NULL; sbegin = src ? src : ssave; //If src is NULL, the last cache continues uchar cset[32] = {0}; //Using 32 unsigned char s to bool each bit can save more memory while ((*delimiters) != '\0') { //Update set uchar t = (uchar) *delimiters++; cset[get_pos(t)] |= 1 << (t / 32);//Since the maximum value is 255, t/32 is 0 ~ 7 } //Let sbegin point to a location that is not in the set while (*sbegin != '\0' && (cset[get_pos(*sbegin)] & (1 << (((uchar) *sbegin) / 32)))) { ++sbegin; } if (*sbegin == '\0') { ssave = NULL; return NULL; } int idx = 0; //Equivalent to the function of strcspn while (sbegin[idx] != '\0' && !(cset[get_pos(sbegin[idx])] & (1 << ((uchar) sbegin[idx] / 32)))) { ++idx; } send = sbegin + idx; if (*send != '\0') *send++ = '\0'; //Draw Terminator ssave = send; //Update cache location for next processing return sbegin; }
performance analysis
The problem scale N is the length of the string. Because strtok is designed in the way of collection storage rather than violent enumeration, the time complexity is O(n)! The array used to store set elements also realizes space optimization through bit operation. The space does not change with the scale of the problem, so the space complexity of O(1) is very high.
Time complexity: O ( N ) O(N) O(N)
Space complexity: O ( 1 ) O(1) O(1)
test case
main function of the test
Test Description: split the whole string with ',', '.' - 'as separator.
int main() { char str[] = "- This 3234d s3242- -d-, a sample string."; char *pch; printf("Splitting string \"%s\" into tokens:\n", str); pch = strtok(str, " ,.-"); while (pch != NULL) { printf("%s\n", pch); pch = strtok(NULL, " ,.-"); } return 0; }
Test results:
Very good, very perfect!!!