「R」 dplyr row-wise operations

"Original text from: the dplyr documentation"

Previous: 「R」 dplyr column-wise operations

dplyr, and R in general, are well suited to operations on columns, while operations on rows are more awkward. In this article we will learn dplyr's approach to row-wise work, built around the row-wise data frame created by rowwise().

This article will discuss three common use cases:

Aggregate by row (for example, calculate the mean of x, y, z).
Call the same function multiple times with different parameters.
Process list columns.

These problems can usually be solved with a for loop, but it is nicer to have a solution that fits naturally into a pipeline.

❝Of course, someone has to write loops. It doesn't have to be you. — Jenny Bryan❞

Load package

library(dplyr, warn.conflicts = FALSE)

Creating

Row-wise operations require a special type of grouping in which each group consists of a single row. You create it with rowwise():

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
df %>% rowwise()
#> # A tibble: 2 x 3
#> # Rowwise: 
#>       x     y     z
#>   <int> <int> <int>
#> 1     1     3     5
#> 2     2     4     6

Like group_by(), rowwise() does nothing by itself. It only changes how the other verbs work. For example, compare the results of the following mutate() calls:

df %>% mutate(m = mean(c(x, y, z)))
#> # A tibble: 2 x 4
#>       x     y     z     m
#>   <int> <int> <int> <dbl>
#> 1     1     3     5   3.5
#> 2     2     4     6   3.5
df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
#> # A tibble: 2 x 4
#> # Rowwise: 
#>       x     y     z     m
#>   <int> <int> <int> <dbl>
#> 1     1     3     5     3
#> 2     2     4     6     4

If you use mutate() on a regular data frame, it computes the mean of x, y and z across all rows. If you apply it to a row-wise data frame, it computes the mean for each row.
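
For comparison, the row-wise mutate() above does the same work as an explicit loop over the rows. The following is only an illustrative sketch in base R; the vector name m_loop is made up for this example and is not part of dplyr:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

# Hypothetical helper vector: one mean per row, filled in by hand
m_loop <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  # Take the i-th value of each column and average them
  m_loop[i] <- mean(c(df$x[i], df$y[i], df$z[i]))
}

# The same computation expressed with rowwise()
m_rowwise <- df %>% rowwise() %>% mutate(m = mean(c(x, y, z))) %>% pull(m)

all.equal(m_loop, m_rowwise)
```

The loop works, but you have to manage the index and the output vector yourself, which is exactly the bookkeeping rowwise() takes over.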

You can optionally supply "identifier" variables in rowwise(). These variables are preserved when you call summarise(), so they behave much like the grouping variables passed to group_by():

df <- tibble(name = c("Mara", "Hadley"), x = 1:2, y = 3:4, z = 5:6)

df %>% 
  rowwise() %>% 
  summarise(m = mean(c(x, y, z)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 1
#>       m
#>   <dbl>
#> 1     3
#> 2     4

df %>% 
  rowwise(name) %>% 
  summarise(m = mean(c(x, y, z)))
#> `summarise()` regrouping output by 'name' (override with `.groups` argument)
#> # A tibble: 2 x 2
#> # Groups:   name [2]
#>   name       m
#>   <chr>  <dbl>
#> 1 Mara       3
#> 2 Hadley     4

rowwise() is only a special form of grouping, so if you want to remove it from the data frame, you can call ungroup().
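
As a small check of this, the sketch below tests for the "rowwise_df" class before and after ungroup(). The variable names are chosen only for this illustration:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)
rf <- df %>% rowwise()

# A row-wise data frame carries the "rowwise_df" class
is_rowwise_before <- inherits(rf, "rowwise_df")

# ungroup() drops the row-wise grouping and returns a plain tibble
is_rowwise_after <- inherits(ungroup(rf), "rowwise_df")

c(before = is_rowwise_before, after = is_rowwise_after)
```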

Per-row summary statistics

dplyr::summarise() makes it very easy to summarise values across rows within one column. Combined with rowwise(), it also makes it easy to summarise values across columns within one row. To see how it works, let's start by creating a small data frame:

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
#> # A tibble: 6 x 5
#>      id     w     x     y     z
#>   <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40
#> 2     2    11    21    31    41
#> 3     3    12    22    32    42
#> 4     4    13    23    33    43
#> # ... with 2 more rows

Suppose we want to compute the sum of w, x, y and z for each row. We start by creating a row-wise data frame:

rf <- df %>% rowwise(id)

We then use mutate() to add a new column, or use summarize() to return only one summary column:

rf %>% mutate(total = sum(c(w, x, y, z)))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # ... with 2 more rows
rf %>% summarise(total = sum(c(w, x, y, z)))
#> `summarise()` regrouping output by 'id' (override with `.groups` argument)
#> # A tibble: 6 x 2
#> # Groups:   id [6]
#>      id total
#>   <int> <int>
#> 1     1   100
#> 2     2   104
#> 3     3   108
#> 4     4   112
#> # ... with 2 more rows

Of course, if you have many variables, typing every variable name is tedious. Instead, you can use c_across(), which supports tidy selection syntax, so you can select many variables at once:

rf %>% mutate(total = sum(c_across(w:z)))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # ... with 2 more rows
rf %>% mutate(total = sum(c_across(where(is.numeric))))
#> # A tibble: 6 x 6
#> # Rowwise:  id
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <int>
#> 1     1    10    20    30    40   100
#> 2     2    11    21    31    41   104
#> 3     3    12    22    32    42   108
#> 4     4    13    23    33    43   112
#> # ... with 2 more rows

You can combine this with column-wise operations (see the previous article) to compute each value's proportion of its row total:

rf %>% 
  mutate(total = sum(c_across(w:z))) %>% 
  ungroup() %>% 
  mutate(across(w:z, ~ . / total))
#> # A tibble: 6 x 6
#>      id     w     x     y     z total
#>   <int> <dbl> <dbl> <dbl> <dbl> <int>
#> 1     1 0.1   0.2   0.3   0.4     100
#> 2     2 0.106 0.202 0.298 0.394   104
#> 3     3 0.111 0.204 0.296 0.389   108
#> 4     4 0.116 0.205 0.295 0.384   112
#> # ... with 2 more rows

Row-wise summary functions

The rowwise() approach works with any summary function. But if speed matters, it is worth looking for a built-in row-wise summary function that does the job. These are more efficient because they do not split the data into rows, compute the statistic, and then stitch the results back together. They operate on the data frame as a whole.

df %>% mutate(total = rowSums(across(where(is.numeric))))
#> # A tibble: 6 x 6
#>      id     w     x     y     z total
#>   <int> <int> <int> <int> <int> <dbl>
#> 1     1    10    20    30    40   101
#> 2     2    11    21    31    41   106
#> 3     3    12    22    32    42   111
#> 4     4    13    23    33    43   116
#> # ... with 2 more rows
df %>% mutate(mean = rowMeans(across(where(is.numeric))))
#> # A tibble: 6 x 6
#>      id     w     x     y     z  mean
#>   <int> <int> <int> <int> <int> <dbl>
#> 1     1    10    20    30    40  20.2
#> 2     2    11    21    31    41  21.2
#> 3     3    12    22    32    42  22.2
#> 4     4    13    23    33    43  23.2
#> # ... with 2 more rows
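
Note that in a regular data frame where(is.numeric) also selects id, which is why the totals above are 101, 106, ... rather than 100, 104, .... A minimal sketch, assuming you want to leave id out, is to select w:z explicitly. The names vectorised and by_row are only for this illustration:

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)

# Vectorised: rowSums() receives the selected columns as one data frame
vectorised <- df %>% mutate(total = rowSums(across(w:z)))

# Row-wise: sum() is called once per row
by_row <- df %>%
  rowwise(id) %>%
  mutate(total = sum(c_across(w:z))) %>%
  ungroup()

# Both give the same totals; rowSums() returns doubles, sum() keeps integers
all.equal(vectorised$total, as.numeric(by_row$total))
```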

List-columns

rowwise() operations are a natural pairing when you have list-columns. They let you avoid explicit loops as well as functions from the apply() or purrr::map() families.

Motivation

Imagine that you have the following data frame. You want to calculate the length of each element:

df <- tibble(
  x = list(1, 2:3, 4:6)
)

You might try length():

df %>% mutate(l = length(x))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     3
#> 2 <int [2]>     3
#> 3 <int [3]>     3

But that returns the length of the column, not the length of the individual values. If you are fond of reading the R documentation, you may know that base R already has a function for exactly this purpose:

df %>% mutate(l = lengths(x))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3

Or, if you are an experienced R programmer, you may know how to apply a function to each element of a list using sapply() or one of the purrr map() functions:

df %>% mutate(l = sapply(x, length))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3
df %>% mutate(l = purrr::map_int(x, length))
#> # A tibble: 3 x 2
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3

But wouldn't it be nice if you could just write length(x) and dplyr would figure out that you want the length of the elements inside x? Since you're here, you may have guessed the answer: this is just another application of the row-wise pattern.

df %>% 
  rowwise() %>% 
  mutate(l = length(x))
#> # A tibble: 3 x 2
#> # Rowwise: 
#>   x             l
#>   <list>    <int>
#> 1 <dbl [1]>     1
#> 2 <int [2]>     2
#> 3 <int [3]>     3

Subsetting

Before we move on, I'd like to briefly mention the magic that makes this work. It is not something you usually need to think about (it will just work), but it is useful to know when something goes wrong.

There is an important difference between a grouped data frame in which each group happens to have one row, and a row-wise data frame in which every group always has one row. Take these two data frames as an example:

df <- tibble(g = 1:2, y = list(1:3, "a"))
gf <- df %>% group_by(g)
rf <- df %>% rowwise(g)

If we calculate some properties of y, we will find that the results are different:

gf %>% mutate(type = typeof(y), length = length(y))
#> # A tibble: 2 x 4
#> # Groups:   g [2]
#>       g y         type  length
#>   <int> <list>    <chr>  <int>
#> 1     1 <int [3]> list       1
#> 2     2 <chr [1]> list       1
rf %>% mutate(type = typeof(y), length = length(y))
#> # A tibble: 2 x 4
#> # Rowwise:  g
#>       g y         type      length
#>   <int> <list>    <chr>      <int>
#> 1     1 <int [3]> integer        3
#> 2     2 <chr [1]> character      1

The key difference is that when mutate() slices up the columns to pass to length(y), the grouped mutate uses [ while the row-wise mutate uses [[. The following code shows this difference with a for loop:

# grouped
out1 <- integer(2)
for (i in 1:2) {
  out1[[i]] <- length(df$y[i])
}
out1
#> [1] 1 1

# rowwise
out2 <- integer(2)
for (i in 1:2) {
  out2[[i]] <- length(df$y[[i]])
}
out2
#> [1] 3 1

Note that this magic only applies when you refer to existing columns, not when you create new rows. This is potentially confusing, but we are fairly confident it is the least bad solution, particularly given the hint in the error message.

gf %>% mutate(y2 = y)
#> # A tibble: 2 x 3
#> # Groups:   g [2]
#>       g y         y2       
#>   <int> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <chr [1]> <chr [1]>
rf %>% mutate(y2 = y)
#> Error: Problem with `mutate()` input `y2`.
#> x Input `y2` can't be recycled to size 1.
#> ℹ Input `y2` is `y`.
#> ℹ Input `y2` must be size 1, not 3.
#> ℹ Did you mean: `y2 = list(y)` ?
#> ℹ The error occurred in row 1.
rf %>% mutate(y2 = list(y))
#> # A tibble: 2 x 3
#> # Rowwise:  g
#>       g y         y2       
#>   <int> <list>    <list>   
#> 1     1 <int [3]> <int [3]>
#> 2     2 <chr [1]> <chr [1]>

❝ Translator's note: in the second example, y has already been unwrapped (extracted with [[), so it needs to be wrapped in list() again. ❞

Modeling

The rowwise() data frame allows us to solve many modeling problems in a particularly elegant way. Let's start by creating a nested data frame:

by_cyl <- mtcars %>% nest_by(cyl)
#> `summarise()` ungrouping output (override with `.groups` argument)
by_cyl
#> # A tibble: 3 x 2
#> # Rowwise:  cyl
#>     cyl data              
#>   <dbl> <list>            
#> 1     4 <tibble [11 × 12]>
#> 2     6 <tibble [7 × 12]> 
#> 3     8 <tibble [14 × 12]>

This is a little different from the usual group_by() output: we have visibly changed the structure of the data. Now we have three rows (one for each group) and a list-column, data, that stores the data for that group. Also note that the output is rowwise(); this is important because it makes working with the list of data frames much easier.

Once we have one data frame per row, it is straightforward to fit one model per row:

mods <- by_cyl %>% mutate(mod = list(lm(mpg ~ wt, data = data)))
mods
#> # A tibble: 3 x 3
#> # Rowwise:  cyl
#>     cyl data               mod   
#>   <dbl> <list>             <list>
#> 1     4 <tibble [11 × 12]> <lm>  
#> 2     6 <tibble [7 × 12]>  <lm>  
#> 3     8 <tibble [14 × 12]> <lm>

And supplement that with one set of predictions per row:

mods <- mods %>% mutate(pred = list(predict(mod, data)))
mods
#> # A tibble: 3 x 4
#> # Rowwise:  cyl
#>     cyl data               mod    pred      
#>   <dbl> <list>             <list> <list>    
#> 1     4 <tibble [11 × 12]> <lm>   <dbl [11]>
#> 2     6 <tibble [7 × 12]>  <lm>   <dbl [7]> 
#> 3     8 <tibble [14 × 12]> <lm>   <dbl [14]>

Then you can summarize the model in many ways:

mods %>% summarise(rmse = sqrt(mean((pred - data$mpg) ^ 2)))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl  rmse
#>   <dbl> <dbl>
#> 1     4 3.01 
#> 2     6 0.985
#> 3     8 1.87
mods %>% summarise(rsq = summary(mod)$r.squared)
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 2
#> # Groups:   cyl [3]
#>     cyl   rsq
#>   <dbl> <dbl>
#> 1     4 0.509
#> 2     6 0.465
#> 3     8 0.423
mods %>% summarise(broom::glance(mod))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 3 x 13
#> # Groups:   cyl [3]
#>     cyl r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#>   <dbl>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     4     0.509         0.454  3.33      9.32  0.0137     1 -27.7   61.5  62.7
#> 2     6     0.465         0.357  1.17      4.34  0.0918     1  -9.83  25.7  25.5
#> 3     8     0.423         0.375  2.02      8.80  0.0118     1 -28.7   63.3  65.2
#> # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Or easily access the parameters of each model:

mods %>% summarise(broom::tidy(mod))
#> `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
#> # A tibble: 6 x 6
#> # Groups:   cyl [3]
#>     cyl term        estimate std.error statistic    p.value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1     4 (Intercept)    39.6       4.35      9.10 0.00000777
#> 2     4 wt             -5.65      1.85     -3.05 0.0137    
#> 3     6 (Intercept)    28.4       4.18      6.79 0.00105   
#> 4     6 wt             -2.78      1.33     -2.08 0.0918    
#> # ... with 2 more rows

Repeated function calls

rowwise() does not only work with functions that return a vector of length 1 (also known as summary functions); it can work with any function if the result is wrapped in a list. This means that rowwise() and mutate() provide an elegant way to call a function many times with different arguments and store the outputs alongside the inputs.

Simulations

I think this is a particularly elegant way to perform simulations, because it lets you store the simulated values together with the parameters that generated them. For example, suppose you have the following data frame that describes the properties of three samples from the uniform distribution:

df <- tribble(
  ~ n, ~ min, ~ max,
    1,     0,     1,
    2,    10,   100,
    3,   100,  1000,
)

You can use rowwise() and mutate() to provide these parameters to runif():

df %>% 
  rowwise() %>% 
  mutate(data = list(runif(n, min, max)))
#> # A tibble: 3 x 4
#> # Rowwise: 
#>       n   min   max data     
#>   <dbl> <dbl> <dbl> <list>   
#> 1     1     0     1 <dbl [1]>
#> 2     2    10   100 <dbl [2]>
#> 3     3   100  1000 <dbl [3]>

Note the use of list() here: runif() returns multiple values, while a mutate() expression must return something of length 1. list() means that we get a list-column in which each row is a list containing multiple values. If you forget to use list(), dplyr will give you a hint:

df %>% 
  rowwise() %>% 
  mutate(data = runif(n, min, max))
#> Error: Problem with `mutate()` input `data`.
#> x Input `data` can't be recycled to size 1.
#> ℹ Input `data` is `runif(n, min, max)`.
#> ℹ Input `data` must be size 1, not 2.
#> ℹ Did you mean: `data = list(runif(n, min, max))` ?
#> ℹ The error occurred in row 2.

Multiple combinations

What if you want to call a function for every combination of inputs? You can use expand.grid() or tidyr::expand_grid() to generate the data frame, and then repeat the pattern above:

df <- expand.grid(mean = c(-1, 0, 1), sd = c(1, 10, 100))

df %>% 
  rowwise() %>% 
  mutate(data = list(rnorm(10, mean, sd)))
#> # A tibble: 9 x 3
#> # Rowwise: 
#>    mean    sd data      
#>   <dbl> <dbl> <list>    
#> 1    -1     1 <dbl [10]>
#> 2     0     1 <dbl [10]>
#> 3     1     1 <dbl [10]>
#> 4    -1    10 <dbl [10]>
#> # ... with 5 more rows
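
Because the result is still row-wise, you can go on to summarise each simulated sample in a further mutate(). The sketch below is one way to do that. The column names mu, sigma, sample_mean and sample_sd are assumptions made for this illustration (mu and sigma are used instead of mean and sd so that the columns do not share names with the functions being called), and set.seed() is added only to make the run reproducible:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)

sims <- expand.grid(mu = c(-1, 0, 1), sigma = c(1, 10, 100)) %>%
  rowwise() %>%
  # One simulated sample of size 10 per parameter combination
  mutate(data = list(rnorm(10, mu, sigma))) %>%
  # Inside a row-wise mutate(), `data` is the numeric vector for that row
  mutate(sample_mean = mean(data), sample_sd = sd(data)) %>%
  ungroup()

sims
```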

Varying functions

In more complicated problems, you may also want to vary the function being called. This is a bit more awkward with this approach, because the columns in the input tibble are less regular. But it is still possible, and it is a natural place to use do.call():

df <- tribble(
   ~rng,     ~params,
   "runif",  list(n = 10), 
   "rnorm",  list(n = 20),
   "rpois",  list(n = 10, lambda = 5),
) %>%
  rowwise()

df %>% 
  mutate(data = list(do.call(rng, params)))
#> # A tibble: 3 x 3
#> # Rowwise: 
#>   rng   params           data      
#>   <chr> <list>           <list>    
#> 1 runif <named list [1]> <dbl [10]>
#> 2 rnorm <named list [1]> <dbl [20]>
#> 3 rpois <named list [2]> <int [10]>
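
If do.call() is unfamiliar: it calls a function, given either as a function object or by name, with its arguments supplied as a list. A minimal base R sketch follows; the variable names and the seed value are chosen only for this illustration:

```r
# Calling rpois() through do.call(), with the arguments held in a list
set.seed(42)
via_do_call <- do.call("rpois", list(n = 10, lambda = 5))

# The equivalent direct call, using the same seed
set.seed(42)
direct <- rpois(n = 10, lambda = 5)

identical(via_do_call, direct)
```

This is why the rng column can hold function names as strings and the params column can hold named lists of arguments.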

Previously

rowwise()

rowwise() was also questioned for a long time, partly because I did not appreciate how many people needed a native way to compute summaries across multiple variables for each row. As an alternative, we recommended performing row-wise operations with the purrr map() functions. However, this was challenging because you had to pick a map function based on the number of varying arguments and the type of result, which required considerable knowledge of the purrr functions.

I also resisted rowwise() because I felt that automatically switching between [ and [[ was too magical, in the same way that automatically list()-ing results made do() too magical. I have now persuaded myself that the row-wise magic is good magic, partly because most people find the distinction between [ and [[ mystifying, and rowwise() means that you don't need to think about it.

Since rowwise() is clearly useful, it is no longer questioned, and we expect it to be around for the long term.

do()

We questioned the need for do() for a long time, because it never felt very similar to the other dplyr verbs. It had two main modes of operation:

• Without argument names: you could call functions that take and return data frames, using . to refer to the current group. For example, the following code gets the first row of each group:

mtcars %>% 
  group_by(cyl) %>% 
  do(head(., 1))
#> # A tibble: 3 x 13
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1     8    16
#> 2  21       6   160   110  3.9   2.62  16.5     0     1     4     4    12    24
#> 3  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2    16    32

This has been superseded by cur_data() plus the more permissive summarise(), which can now create multiple columns and multiple rows.

mtcars %>% 
  group_by(cyl) %>% 
  summarise(head(cur_data(), 1))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 13
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8   108    93  3.85  2.32  18.6     1     1     4     1     8    16
#> 2     6  21     160   110  3.9   2.62  16.5     0     1     4     4    12    24
#> 3     8  18.7   360   175  3.15  3.44  17.0     0     0     3     2    16    32

• With arguments: it worked like mutate() but automatically wrapped every element in a list:

mtcars %>% 
  group_by(cyl) %>% 
  do(nrows = nrow(.))
#> # A tibble: 3 x 2
#> # Rowwise: 
#>     cyl nrows    
#>   <dbl> <list>   
#> 1     4 <int [1]>
#> 2     6 <int [1]>
#> 3     8 <int [1]>

I now think this behaviour is both too magical and not very useful. It can be replaced by summarise() and cur_data().

mtcars %>% 
  group_by(cyl) %>% 
  summarise(nrows = nrow(cur_data()))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 3 x 2
#>     cyl nrows
#>   <dbl> <int>
#> 1     4    11
#> 2     6     7
#> 3     8    14

If necessary (unlike here), you can wrap the results in a list yourself.
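
As a sketch of that manual wrapping: the example below stores the mpg values of each group in a list-column. The column name mpg_values is made up for this illustration, and .groups = "drop" is used only to silence the regrouping message:

```r
library(dplyr, warn.conflicts = FALSE)

out <- mtcars %>%
  group_by(cyl) %>%
  # list() turns each group's vector into a single list element
  summarise(mpg_values = list(mpg), .groups = "drop")

# One row per group; each element holds that group's mpg values
lengths(out$mpg_values)
```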

The addition of cur_data() / across() and the increased scope of summarise() mean that do() is no longer needed, so it is now superseded.

Added by oaskedal on Fri, 21 Jan 2022 15:25:46 +0200

Programming VIP