ggplot2 - data tidying 2

1. separate and unite

The spread() and gather() functions fix variables that are stored in the wrong place in a data set. separate() and unite() address a different problem: several variables crammed into one column, or a single variable scattered across several columns.

For example, the following data set records the response to a drug treatment. It contains three variables (time, treatment and response value), but time and treatment are recorded together in a single column:

library(tidyr)
# time and treatment are fused into the single column var
trt <- tibble::tibble(var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"), 2)), val = c(1, 4, 2, 10, 5, 11))
trt

The separate() function splits such a compound column into several variables. It takes four main arguments:

data the data frame to modify
col the name of the column to split
into the names of the new columns created by the split, given as a character vector
sep how to split the original column. It can be a regular expression, such as _ to split on underscores or [^a-z] to split on any non-alphabetic character, or an integer giving the position at which to split

In this example, split on the underscore character:

separate(trt, var, c("time", "treatment"), "_")
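As a rough sketch of the other forms sep can take, the calls below split the same column once with a regular expression and once with an integer position. The data frame is rebuilt so the snippet runs on its own, and the names by_regex and by_position are illustrative only.

```r
library(tidyr)

# Rebuild the example data so this snippet is self-contained
trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"), 2)),
  val = c(1, 4, 2, 10, 5, 11)
)

# Regular expression: split on any character that is not a lowercase letter
by_regex <- separate(trt, var, c("time", "treatment"), sep = "[^a-z]")

# Integer position: split after the third character,
# so the underscore stays attached to the second piece
by_position <- separate(trt, var, c("time", "treatment"), sep = 3)

by_regex
by_position
```

With the positional split the treatment column keeps the leading underscore, which is why a character or regular-expression separator is the better fit for this data.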

(If the variables are combined in a more complex form, have a look at the extract() function. Alternatively, if you need to build the new columns individually through some other calculation, mutate() is a useful tool.)
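A minimal sketch of both options on the same data, assuming every value follows the pattern "letters, underscore, letters". With extract(), each parenthesised group in the regular expression becomes one new column; the mutate() version builds the same columns by hand with base R's sub(). The names by_groups and by_hand are illustrative.

```r
library(tidyr)

trt <- tibble::tibble(
  var = c("beg_a", "beg_b", "beg_c", "end_a", "end_b", "end_c"),
  val = c(1, 4, 2, 10, 5, 11)
)

# Each capture group "( ... )" becomes one output column
by_groups <- extract(trt, var, c("time", "treatment"),
                     regex = "([a-z]+)_([a-z]+)")

# The same columns built by hand; var is kept alongside them here
by_hand <- dplyr::mutate(
  trt,
  time = sub("_.*$", "", var),      # drop everything from the underscore on
  treatment = sub("^.*_", "", var)  # drop everything up to the underscore
)

by_groups
by_hand
```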

The unite() function is the inverse of separate(): it merges multiple columns into one. It is needed less often, but it is worth knowing as the counterpart of separate().
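The round trip can be sketched as follows: split the column with separate(), then glue the pieces back together with unite(). The intermediate names trt_split and trt_united are made up for this illustration.

```r
library(tidyr)

trt <- tibble::tibble(
  var = paste0(rep(c("beg", "end"), each = 3), "_", rep(c("a", "b", "c"), 2)),
  val = c(1, 4, 2, 10, 5, 11)
)

# Split var into two columns ...
trt_split <- separate(trt, var, c("time", "treatment"), "_")

# ... and merge them back into a single column named var
trt_united <- unite(trt_split, var, time, treatment, sep = "_")

# The united column matches the original one
identical(trt_united$var, trt$var)
```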

2. Case study

Most real data sets need more than one tidying verb. There are many possible paths, but as long as each step makes the data tidier, you will eventually arrive at a tidy data set. In practice these functions are usually applied in the same order: gather(), separate(), and spread() (although you may not need all of them).

2.1 Blood pressure

The first step in tidying data is to determine the variables in the data set. Look at the following simulated medical data. There are six variables: name, age, start date, week, systolic blood pressure and diastolic blood pressure. In what form are they stored?

# Modified from Barry Rowlingson's example
# http://barryrowlingson.github.io/hadleyverse/
# Columns are separated by spaces; "<NA>" marks a missing reading
bpd <- readr::read_table(
"name age start      week1  week2  week3
Anne 35  2014-03-27 100/80 100/75 120/90
Ben  41  2014-03-09 110/65 100/65 135/70
Carl 33  2014-04-02 125/80 <NA>   <NA>
", na = "<NA>"
)

First, convert the data from wide to long format:

bpd_1 <- gather(bpd, week, bp, week1:week3)
bpd_1
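In tidyr 1.0.0 and later, gather() is marked as superseded by pivot_longer(). The sketch below assumes such a version is installed and uses a small stand-in data frame (bpd_small, a name invented here) so that it runs on its own.

```r
library(tidyr)

# Stand-in for bpd with only the columns needed for the comparison
bpd_small <- tibble::tibble(
  name  = c("Anne", "Ben", "Carl"),
  week1 = c("100/80", "110/65", "125/80"),
  week2 = c("100/75", "100/65", NA),
  week3 = c("120/90", "135/70", NA)
)

# gather(): key column "week", value column "bp"
long_gather <- gather(bpd_small, week, bp, week1:week3)

# pivot_longer(): the same reshaping with named arguments
long_pivot <- pivot_longer(bpd_small, week1:week3,
                           names_to = "week", values_to = "bp")

long_gather
long_pivot
```

Both results hold the same nine rows; only the row order differs, because gather() stacks the data column by column while pivot_longer() works through it row by row.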

The data looks tidier, but two variables (systolic and diastolic blood pressure) are still recorded together in the bp column. Although blood pressure is often written this way, the two values are easier to analyze separately:

bpd_2 <- separate(bpd_1, bp, c("sys", "dia"), "/")
bpd_2
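Note that separate() returns character columns by default. As a small sketch on stand-in data (bp_small and bp_typed are names made up here), the convert = TRUE argument asks separate() to convert the new columns to a more suitable type:

```r
library(tidyr)

# Stand-in data with the two readings fused into one column
bp_small <- tibble::tibble(
  name = c("Anne", "Ben"),
  bp   = c("100/80", "110/65")
)

# convert = TRUE turns the new columns into integers
bp_typed <- separate(bp_small, bp, c("sys", "dia"), "/", convert = TRUE)

str(bp_typed)
```

Numeric columns make later calculations, such as the difference between systolic and diastolic pressure, possible without an extra conversion step.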

The data set is now tidy, but a little more work makes it easier to use. The following code uses extract() to pull the digit out of the week column and store it as a number. It also uses arrange() to sort the rows so that each person's records appear together:

bpd_3 <- extract(bpd_2, week, "week", "(\\d)", convert = TRUE)
bpd_4 <- dplyr::arrange(bpd_3, name, week)
bpd_4

You may notice some duplication in this data set: if you know the name, you also know the age and start date. This reflects a third principle of tidy data: each data frame should contain one and only one data set. Here there are really two: personal information that does not change over time, and the weekly blood pressure measurements.
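One possible way to separate the two data sets is sketched below with dplyr's distinct() and select(). The data frame is a hand-built stand-in for bpd_4, and the names people and readings are invented for this illustration; the source does not prescribe this step.

```r
library(dplyr)

# Hand-built stand-in for bpd_4 so the snippet runs on its own
bpd_4 <- tibble::tibble(
  name  = rep(c("Anne", "Ben", "Carl"), each = 3),
  age   = rep(c(35, 41, 33), each = 3),
  start = rep(as.Date(c("2014-03-27", "2014-03-09", "2014-04-02")), each = 3),
  week  = rep(1:3, times = 3),
  sys   = c("100", "100", "120", "110", "100", "135", "125", NA, NA),
  dia   = c("80", "75", "90", "65", "65", "70", "80", NA, NA)
)

# One row per person: information that does not change over time
people <- distinct(bpd_4, name, age, start)

# One row per person and week: the measurements, keyed by name
readings <- select(bpd_4, name, week, sys, dia)

people
readings
```

The two tables can be recombined at any time with dplyr::left_join(readings, people, by = "name").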

2.2 Examination results

# Modified from https://stackoverflow.com/questions/29775461
set.seed(127)
scores <- tibble::tibble(
	person = rep(c("Greg", "Sally", "Sue"), each = 2),
	time = rep(c("pre", "post"), 3),
	test1 = round(rnorm(6, mean = 80, sd = 4), 0),
	test2 = round(jitter(test1, 15), 0)
)
scores

The variables are person, test, pre-intervention score and post-intervention score. As before, start by converting the wide data (test1 and test2) to long data (test and score):

scores_1 <- gather(scores, test, score, test1:test2)
scores_1

Next, we need to do the opposite: pre (before) and post (after) should be variables rather than values, so we spread the time and score columns:

scores_2 <- spread(scores_1, time, score)
scores_2
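For readers on tidyr 1.0.0 or later, spread() has a newer counterpart, pivot_wider(). The sketch below compares the two on a small stand-in table; the name scores_long and the score values are made up for illustration and are not the simulated scores above.

```r
library(tidyr)

# Made-up long-format scores: one row per person, test and time
scores_long <- tibble::tibble(
  person = rep(c("Greg", "Sally", "Sue"), each = 4),
  time   = rep(c("pre", "post"), times = 6),
  test   = rep(rep(c("test1", "test2"), each = 2), times = 3),
  score  = c(84, 76, 83, 75, 80, 78, 80, 77, 83, 76, 84, 75)
)

# spread(): key column time, value column score
wide_spread <- spread(scores_long, time, score)

# pivot_wider(): the same reshaping with named arguments
wide_pivot <- pivot_wider(scores_long, names_from = time, values_from = score)

wide_spread
wide_pivot
```

Both calls produce one row per person and test with pre and post columns; they differ only in column order, since spread() sorts the new columns alphabetically and pivot_wider() keeps them in order of first appearance.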

A good sign that the data set is now tidy is that the statistic of interest, the change in score from before to after the intervention, is easy to compute:

scores_3 <- dplyr::mutate(scores_2, diff = post - pre)
scores_3

It is then easy to plot:

library(ggplot2)
# linewidth requires ggplot2 >= 3.4.0; use size = 2 on older versions
ggplot(scores_3, aes(person, diff, color = test)) +
	geom_hline(linewidth = 2, color = "white", yintercept = 0) +
	geom_point() +
	geom_path(aes(group = person), color = "grey50",
		arrow = arrow(length = unit(0.25, "cm")))

Learn more

Data tidying is a big topic, and this chapter only scratches the surface. For a deeper discussion, the following references are recommended:

Documentation for tidyr package
Tidy data
data wrangling cheatsheet

Keywords: ggplot2

Added by lore_lanu on Thu, 30 Dec 2021 11:01:32 +0200