Main content of the stringr package:
1. String splitting: str_split
2. String replacement: str_replace and str_replace_all
3. String extraction: str_extract, str_extract_all, and str_match_all
4. Substring access: str_sub
The four most common operations in string processing are splitting, replacing, extracting, and taking substrings. The stringr package is highly recommended for all four. I find it much easier to use than base R functions such as grep, regexpr, strsplit, and sub.
Sharp tool 1: disassemble: str_split
str_split(string, pattern, n = Inf, simplify = FALSE)

string: the character vector to process
pattern: the separator, which can be a regular expression
n: the maximum number of pieces to return; with the default (Inf), the string is split at every match
simplify: whether to return a character matrix; by default (FALSE) the result is a list
> str_split(c('lsxxx2011@163.com','0511-87208801'), '[@-]')
[[1]]
[1] "lsxxx2011" "163.com"

[[2]]
[1] "0511"     "87208801"
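The n and simplify arguments are easiest to understand by trying them. The following is a minimal sketch; the second input string has been extended with an extra '-001' suffix purely for illustration, so that n = 2 visibly leaves part of the string unsplit.

```r
library(stringr)

# Illustrative input: the second string contains two separators
x <- c('lsxxx2011@163.com', '0511-87208801-001')

# n = 2 splits only at the first separator; the remainder stays in one piece
str_split(x, '[@-]', n = 2)

# simplify = TRUE returns a character matrix (one row per input string)
parts <- str_split(x, '[@-]', n = 2, simplify = TRUE)
parts[, 1]  # pieces before the first separator
parts[, 2]  # everything after the first separator
```

The matrix form is convenient when every string yields the same number of pieces, because its columns can be used directly as new data frame columns.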
```r
library(stringr)

# Example: a data table has a column of email addresses.
# How do we split the address and the domain name into two new columns?
email <- c('lsxxx2011@163.com', '1029776077@qq.com', 'qazwsx@gmail.com', 'abc123edc@126.com')

# Split once at '@', then use sapply to pick the piece before and after the separator
parts <- str_split(email, '@')
add <- sapply(parts, '[', 1)
domain <- sapply(parts, '[', 2)

df <- data.frame(email, add, domain)
df
#               email        add    domain
# 1 lsxxx2011@163.com  lsxxx2011   163.com
# 2 1029776077@qq.com 1029776077    qq.com
# 3  qazwsx@gmail.com     qazwsx gmail.com
# 4 abc123edc@126.com  abc123edc   126.com
```
Sharp tool 2: replace: str_replace and str_replace_all
str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)

string: the character vector to process
pattern: the substring to be replaced, which can be a regular expression
replacement: the string to substitute for each match
The difference between the two functions is that str_replace replaces only the first match in each string, while str_replace_all replaces every match.
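A short sketch makes the difference concrete. The phone-style string below is made up for illustration; it simply contains two hyphens so that the two functions give different results.

```r
library(stringr)

# Illustrative input with two '-' characters
phone <- '0511-8720-8801'

# str_replace touches only the first match
first_only <- str_replace(phone, '-', '')

# str_replace_all touches every match
all_matches <- str_replace_all(phone, '-', '')

first_only
all_matches
```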
Example:

# Convert values that contain thousands separators or percent signs to numeric data
commadata <- c('123,456', '780,123,433', '45,234')
percentdata <- c('23.4%', '34.56%', '44.12%')
commadatanew <- as.numeric(str_replace_all(commadata, ',', ''))
percentdatanew <- as.numeric(str_replace_all(percentdata, '%', '')) / 100
commadatanew
percentdatanew
Sharp tool 3: extract: str_extract, str_extract_all, and str_match_all
str_extract(string, pattern)
str_extract_all(string, pattern, simplify = FALSE)

string: the character vector to process
pattern: the pattern to extract, usually a regular expression
simplify: whether to return a character matrix; by default (FALSE) the result is a list

The difference between the two functions is that str_extract extracts only the first match in each string, while str_extract_all extracts every match. When nothing matches, str_extract returns NA and str_extract_all returns character(0).

str_match(string, pattern)
str_match_all(string, pattern)

The parameters have the same meaning as in str_extract. Unlike str_extract, these functions also return the capture groups (the parenthesized parts of the pattern) alongside the full match.

Example:

# Extract the date and traffic (pv) values from each string
s <- c('date:2017-04-14,pv:223453', 'date:2017-04-15,pv:228115', 'date:2017-04-16,pv:201233', 'date:2017-04-17,pv:324123')
date <- str_extract_all(s, '[0-9]{4}-[0-9]{2}-[0-9]{2}')
pv <- str_extract_all(s, 'pv:([0-9]*)')
unlist(date)
unlist(pv)

# The pv values in the result still contain the 'pv:' prefix, so use str_match_all instead.
# Each list element is a matrix; its second entry is the captured group.
pv <- str_match_all(s, 'pv:([0-9]*)')
pv <- sapply(pv, '[', 2)
pv
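The no-match behavior and the matrix returned by str_match can be checked with a small sketch. The second input string below is invented for illustration so that neither pattern matches it.

```r
library(stringr)

# Illustrative input: the second string matches neither pattern
s2 <- c('date:2017-04-14,pv:223453', 'no match here')

# str_extract returns NA where nothing matches
first_date <- str_extract(s2, '[0-9]{4}-[0-9]{2}-[0-9]{2}')

# str_extract_all returns character(0) where nothing matches
all_dates <- str_extract_all(s2, '[0-9]{4}-[0-9]{2}-[0-9]{2}')

# str_match returns a matrix: column 1 is the full match,
# column 2 is the first capture group
m <- str_match(s2, 'pv:([0-9]+)')
pv_num <- as.numeric(m[, 2])

first_date
all_dates
pv_num
```

When each string contains at most one value of interest, str_match with a column index is often simpler than str_match_all followed by sapply.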
Sharp tool 4: take: str_sub
str_sub(string, start = 1L, end = -1L)

string: the character vector to process
start: the position where the substring starts
end: the position where the substring ends

Note: if start or end is a negative integer, the position is counted backward from the last character of the string.

Example:

# Get the last 4 digits of each mobile number (negative start position)
s <- c('13611235678', '13912343344', '17888886666')
(tail4 <- str_sub(s, -4))
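Positive and negative positions can also be combined, and str_sub has an assignment form for replacing a range of characters. The sketch below reuses two of the mobile numbers from the example; masking the middle digits is an illustrative use case that is not part of the original example.

```r
library(stringr)

nums <- c('13611235678', '13912343344')

# First three characters
prefix <- str_sub(nums, 1, 3)

# From the 4th character up to the 5th character from the end
middle <- str_sub(nums, 4, -5)

# Assignment form: replace characters 4 to 7 with a mask
masked <- nums
str_sub(masked, 4, 7) <- '****'

prefix
middle
masked
```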