1. separate and unite
The spread() and gather() functions help when variables are in the wrong place in a data set. separate() and unite() address a different pair of problems: several variables crammed into one column, or a single variable scattered across several columns.
For example, the following data set records responses to a drug treatment. It contains three variables (time, treatment and response value), but time and treatment are recorded together in a single column:
# tibble::tibble() replaces the deprecated dplyr::data_frame()
trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"))),
  val = c(1, 4, 2, 10, 5, 11)
)
trt
The separate() function makes it easy to split such a combined column into several variables. It takes four main arguments:
- data: the data frame to modify
- col: the name of the column to split
- into: a character vector giving the names of the new variables
- sep: how to split the original column. It can be a regular expression, such as "_" to split on underscores or "[^a-z]" to split on any character that is not a lowercase letter, or an integer giving the position at which to split
In this example, we split on the underscore character:
separate(trt, var, c("time", "treatment"), "_")
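The other two forms of sep described above can be tried on the same data. The following is a minimal sketch: it rebuilds trt so that it runs on its own, and the column name rest is only illustrative.

```r
library(tidyr)

# Same toy data as above, rebuilt so this block is self-contained
trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", c("a", "b", "c")),
  val = c(1, 4, 2, 10, 5, 11)
)

# sep as a regular expression: split on any character that is not a
# lowercase letter (here, the underscore)
by_regex <- separate(trt, var, c("time", "treatment"), sep = "[^a-z]")

# sep as an integer: split after the third character, so the underscore
# stays at the start of the second piece ("rest" is an illustrative name)
by_position <- separate(trt, var, c("time", "rest"), sep = 3)

by_regex
by_position
```

Splitting by position only works here because every time label happens to be three characters long, which is why a regular expression is usually the safer choice.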
(If the variables are combined in a more complex way, try the extract() function, which is used in the case study below. Alternatively, you may need to create the columns yourself from other calculations; mutate() is a useful tool for that.)
The unite() function is the inverse of separate(): it merges multiple columns into one. It is not needed often, but it is useful to know that separate() has an inverse.
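As a rough illustration of those two alternatives, the sketch below applies tidyr's extract() and dplyr's mutate() to the same toy data. The regular expressions are assumptions chosen for this example, not patterns taken from the original text.

```r
library(tidyr)

# Same toy data as above, rebuilt so this block is self-contained
trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", c("a", "b", "c")),
  val = c(1, 4, 2, 10, 5, 11)
)

# extract(): each parenthesised group in the regex becomes one new column
via_extract <- extract(trt, var, c("time", "treatment"),
                       regex = "([a-z]+)_([a-z]+)")

# mutate(): build the columns by hand with base R string functions
via_mutate <- dplyr::mutate(
  trt,
  time      = sub("_.*$", "", var),  # drop everything from the underscore on
  treatment = sub("^.*_", "", var)   # drop everything up to the underscore
)

via_extract
via_mutate
```

Unlike extract(), the mutate() version keeps the original var column, so it has to be dropped separately if it is no longer wanted.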
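A short round trip shows the relationship between the two functions. This is only a sketch on the toy data from this section, rebuilt here so that the block runs on its own.

```r
library(tidyr)

# Same toy data as above, rebuilt so this block is self-contained
trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", c("a", "b", "c")),
  val = c(1, 4, 2, 10, 5, 11)
)

# Split var into two columns, then glue them back together
trt_sep  <- separate(trt, var, c("time", "treatment"), sep = "_")
trt_back <- unite(trt_sep, var, time, treatment, sep = "_")

# The rebuilt column matches the original one
identical(trt_back$var, trt$var)
```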
2. Case study
For most real data sets you will need more than one tidying function. There may be several ways to get there, but as long as each step makes the data tidier, you will eventually reach a tidy data set. That said, the functions are usually applied in the same order: gather(), separate(), and spread() (although you may not need all three).
1. Blood pressure
The first step in tidying data is to identify the variables in the data set. Consider the following simulated medical data. There are six variables: name, age, start date, week, systolic blood pressure and diastolic blood pressure. In what form are they stored?
# Modified from Barry Rowlingson's example
# http://barryrowlingson.github.io/hadleyverse/
bpd <- readr::read_table(
"name age start      week1  week2  week3
Anne 35  2014-03-27 100/80 100/75 120/90
Ben  41  2014-03-09 110/65 100/65 135/70
Carl 33  2014-04-02 125/80 <NA>   <NA>
", na = "<NA>")
First, convert the data from wide to long form:
bpd_1 <- gather(bpd, week, bp, week1:week3)
bpd_1
The data looks tidier, but two variables (systolic and diastolic blood pressure) are still recorded together in the bp column. Blood pressure is often written this way, but the two values are easier to analyze separately:
bpd_2 <- separate(bpd_1, bp, c("sys", "dia"), "/")
bpd_2
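Because separate() splits strings, the new sys and dia columns are character vectors. If numeric values are wanted for analysis, one option is the convert argument of separate(). The sketch below uses a hand-typed stand-in for bpd_1 (only the name, week and bp columns) so that it runs on its own.

```r
library(tidyr)

# Hand-typed stand-in for bpd_1, in the row order that gather() produces
bpd_1 <- tibble::tibble(
  name = rep(c("Anne", "Ben", "Carl"), times = 3),
  week = rep(c("week1", "week2", "week3"), each = 3),
  bp   = c("100/80", "110/65", "125/80",
           "100/75", "100/65", NA,
           "120/90", "135/70", NA)
)

# convert = TRUE runs type conversion on the new columns,
# so "100" becomes the integer 100 and missing values stay NA
bpd_num <- separate(bpd_1, bp, c("sys", "dia"), sep = "/", convert = TRUE)
bpd_num
```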
The data set is now tidy, but we can go further and make it easier to use. The following code uses extract() to pull the week number out of the week column and store it as the column's value (convert = TRUE turns it into an integer). It also uses arrange() to sort the rows so that each person's records sit together:
bpd_3 <- extract(bpd_2, week, "week", "(\\d)", convert = TRUE)
bpd_4 <- dplyr::arrange(bpd_3, name, week)
bpd_4
You may notice some duplication in this data set: once you know the name, you also know the age and start date. This reflects a third condition of tidiness: each data frame should contain one and only one data set. Here there are really two data sets: personal information that does not change over time, and the weekly blood pressure measurements.
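One way to act on this observation is to split the data into two tables that share the name column. The following is a sketch rather than part of the original example: bp_long is a hand-typed stand-in for bpd_4, and the table names people and measurements are illustrative.

```r
library(dplyr)

# Hand-typed stand-in for the tidied data (bpd_4) so this block runs alone
bp_long <- tibble::tibble(
  name  = rep(c("Anne", "Ben", "Carl"), each = 3),
  age   = rep(c(35, 41, 33), each = 3),
  start = as.Date(rep(c("2014-03-27", "2014-03-09", "2014-04-02"), each = 3)),
  week  = rep(1:3, times = 3),
  sys   = c("100", "100", "120", "110", "100", "135", "125", NA, NA),
  dia   = c("80", "75", "90", "65", "65", "70", "80", NA, NA)
)

# One row per person: information that does not change over time
people <- distinct(bp_long, name, age, start)

# One row per person and week: the repeated measurements, keyed by name
measurements <- select(bp_long, name, week, sys, dia)

# Joining on name recovers the combined table when it is needed
rejoined <- left_join(measurements, people, by = "name")

people
measurements
```

Using name as the key assumes names are unique, which holds in this small example but would normally call for a dedicated identifier.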
2. Examination results
# Modified from https://stackoverflow.com/questions/29775461
set.seed(127)
# tibble::tibble() replaces the deprecated dplyr::data_frame()
scores <- tibble::tibble(
  person = rep(c("Greg", "Sally", "Sue"), each = 2),
  time   = rep(c("pre", "post"), 3),
  test1  = round(rnorm(6, mean = 80, sd = 4), 0),
  test2  = round(jitter(test1, 15), 0)
)
scores
The variables are person, test, pre-intervention score and post-intervention score. As before, first convert the wide columns (test1 and test2) to long form (test and score):
scores_1 <- gather(scores, test, score, test1:test2)
scores_1
Next we need to do the opposite: pre (before) and post (after) should be variables, not values, so we spread the time and score columns:
scores_2 <- spread(scores_1, time, score)
scores_2
A good sign that the data set is now tidy is that the statistic of interest, the difference in score before and after the intervention, is easy to compute:
scores_3 <- dplyr::mutate(scores_2, diff = post - pre)
scores_3
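In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The same three steps can be sketched as a single pipeline with those functions. The scores below are fixed, made-up values that replace the random draws so the result is reproducible.

```r
library(tidyr)
library(dplyr)

# Made-up fixed scores standing in for the randomly generated ones
scores <- tibble::tibble(
  person = rep(c("Greg", "Sally", "Sue"), each = 2),
  time   = rep(c("pre", "post"), 3),
  test1  = c(84, 76, 80, 78, 83, 76),
  test2  = c(83, 75, 78, 80, 80, 75)
)

scores_diff <- scores %>%
  # wide to long: test1/test2 become values of a test column
  pivot_longer(test1:test2, names_to = "test", values_to = "score") %>%
  # long to wide: pre/post become columns holding the scores
  pivot_wider(names_from = time, values_from = score) %>%
  # difference between the post- and pre-intervention scores
  mutate(diff = post - pre)

scores_diff
```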
The result is then easy to plot:
ggplot(scores_3, aes(person, diff, color = test)) +
  geom_hline(size = 2, color = "white", yintercept = 0) +
  geom_point() +
  geom_path(aes(group = person), color = "grey50",
    arrow = arrow(length = unit(0.25, "cm")))
Learn more
Data tidying is a big topic, and this chapter only scratches the surface. For a deeper discussion, the following references are recommended:
- Documentation for tidyr package
- Tidy data
- data wrangling cheatsheet