Iteration 1
In this session we will cover another way to reduce code duplication: iteration.
At the end of this section you will be able to:
Iteration means repeating steps multiple times until a condition is met
In other languages, iteration is performed with loops: for, while
Iteration is different in R
You can use loops… but you often don’t need to
Iteration is an inherent part of the language. For example, if
nums <- c(3, 1, 6, 4)
then
2 * nums
is
6 2 12 8
that is, every element is doubled without writing a loop.
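A minimal sketch of this built-in iteration, using only base R. The multiplication is applied to every element, and the explicit for loop, written here only for comparison, gives the same answer.

```r
nums <- c(3, 1, 6, 4)

# Vectorised: the multiplication is applied to each element in turn
doubled <- 2 * nums
doubled
#> [1]  6  2 12  8

# The equivalent explicit loop, as you might write it in another language
doubled_loop <- numeric(length(nums))
for (i in seq_along(nums)) {
  doubled_loop[i] <- 2 * nums[i]
}

identical(doubled, doubled_loop)
#> [1] TRUE
```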
We have:
group_by() with summarize()
across()
the {purrr} package
the apply() family
In other languages, a for loop would be taught right after hello world
This style is called “functional programming” because functions take other functions as input
modifying multiple columns {dplyr}
reading multiple files {purrr}
saving multiple outputs {purrr}
.R
usethis::use_r("iteration-01")
🎬 Load packages:
penguins
🎬 Load penguins data set
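A sketch of these two steps. It assumes the data come from the {palmerpenguins} package, which is not named in the text but whose penguins data has the column names seen in this section, and that {dplyr} supplies the verbs used in this session.

```r
# Assumption: penguins is the data set shipped with {palmerpenguins}
library(dplyr)
library(palmerpenguins)

# Compact overview of the data: one line per column
glimpse(penguins)
```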
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Recall our standard error function from this morning:
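The definition itself is not reproduced here. A plausible reconstruction divides the standard deviation by the square root of the number of non-missing values; treat the body as an assumption rather than the exact code from the morning session.

```r
# Assumed definition: standard error of the mean, ignoring missing values
sd_error <- function(x) {
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}

# Small check: sd(c(2, 4, 6)) is 2 and there are 3 non-missing values
sd_error(c(2, 4, 6, NA))
#> [1] 1.154701
```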
Which we might use as:
penguins |>
  summarise(se_bill_len = sd_error(bill_length_mm),
            se_bill_dep = sd_error(bill_depth_mm),
            se_flip_len = sd_error(flipper_length_mm),
            se_body_mas = sd_error(body_mass_g))
# A tibble: 1 × 4
se_bill_len se_bill_dep se_flip_len se_body_mas
<dbl> <dbl> <dbl> <dbl>
1 0.295 0.107 0.760 43.4
⚠️ Code repetition!
How can we iterate over columns?
across()
across() arguments
across(.cols, .fns, .names)
3 important arguments:
which columns to iterate over: .cols = bill_length_mm:body_mass_g
what you want to do to each column: .fns = sd_error
.names to control output
.cols
bill_length_mm:body_mass_g, because the columns are adjacent
but .cols uses the same specification as select(): starts_with(), ends_with(), contains(), matches()
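As a sketch, the four repeated lines in the earlier summarise() call can be collapsed into a single across(). This assumes {dplyr} and the {palmerpenguins} data are available and uses an assumed definition of sd_error().

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Assumed definition of the standard error helper
sd_error <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# .cols given as a range of adjacent columns
by_range <- penguins |>
  summarise(across(bill_length_mm:body_mass_g, sd_error))

# .cols given with a select() helper: only the columns ending in "mm"
by_helper <- penguins |>
  summarise(across(ends_with("mm"), sd_error))

names(by_range)
names(by_helper)
```

The range form depends on column order, so a helper such as ends_with() is usually the more robust choice when the columns share a naming pattern.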
.cols
everything(): all non-grouping columns
penguins |>
  group_by(species, island, sex) |>
  summarise(across(everything(), sd_error))
# A tibble: 13 × 8
# Groups: species, island [5]
species island sex bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Biscoe female 0.376 0.233 1.44
2 Adelie Biscoe male 0.428 0.188 1.38
3 Adelie Dream female 0.402 0.173 1.06
4 Adelie Dream male 0.330 0.195 1.29
5 Adelie Dream <NA> NA NA NA
6 Adelie Torgersen female 0.451 0.180 0.947
7 Adelie Torgersen male 0.631 0.226 1.23
8 Adelie Torgersen <NA> 1.61 0.709 2.81
9 Chinstrap Dream female 0.533 0.134 0.987
10 Chinstrap Dream male 0.268 0.131 1.02
11 Gentoo Biscoe female 0.269 0.0709 0.512
12 Gentoo Biscoe male 0.348 0.0949 0.726
13 Gentoo Biscoe <NA> 0.687 0.405 0.629
# ℹ 2 more variables: body_mass_g <dbl>, year <dbl>
In the code above, the variables in group_by() are excluded, so everything() covers all of bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and year.
.cols
everything(): all non-grouping columns, without year
penguins |>
  select(-year) |>
  group_by(species, island, sex) |>
  summarise(across(everything(), sd_error))
# A tibble: 13 × 7
# Groups: species, island [5]
species island sex bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Biscoe female 0.376 0.233 1.44
2 Adelie Biscoe male 0.428 0.188 1.38
3 Adelie Dream female 0.402 0.173 1.06
4 Adelie Dream male 0.330 0.195 1.29
5 Adelie Dream <NA> NA NA NA
6 Adelie Torgersen female 0.451 0.180 0.947
7 Adelie Torgersen male 0.631 0.226 1.23
8 Adelie Torgersen <NA> 1.61 0.709 2.81
9 Chinstrap Dream female 0.533 0.134 0.987
10 Chinstrap Dream male 0.268 0.131 1.02
11 Gentoo Biscoe female 0.269 0.0709 0.512
12 Gentoo Biscoe male 0.348 0.0949 0.726
13 Gentoo Biscoe <NA> 0.687 0.405 0.629
# ℹ 1 more variable: body_mass_g <dbl>
.fns: calling one function
📢 Pass the function itself, without parentheses: sd_error, not sd_error(). Otherwise:
Error in `summarise()`:
ℹ In argument: `across(where(is.numeric), sd_error())`.
Caused by error in `sd_error()`:
! argument "x" is missing, with no default
This error is easy to make!
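A small sketch of the difference, using an assumed sd_error() and the {palmerpenguins} data. Passing sd_error hands the function to across(), while sd_error() tries to call it immediately with no argument. tryCatch() is used here only so that the example runs to the end.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Assumed definition of the standard error helper
sd_error <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

# Correct: the function itself is passed, with no parentheses
ok <- penguins |>
  summarise(across(where(is.numeric), sd_error))

# Incorrect: sd_error() is called straight away, without its argument x
failed <- tryCatch(
  penguins |> summarise(across(where(is.numeric), sd_error())),
  error = function(e) "error"
)

failed
#> [1] "error"
```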
# A tibble: 1 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <dbl>
1 NA NA NA
We get the NA because we have missing values.
mean() has an na.rm argument.
How can we pass on na.rm = TRUE?
The solution is to create a new function that calls mean() with na.rm = TRUE
mean is replaced by a function definition
This is called an anonymous or lambda function.
It is anonymous because we do not give it a name with <-
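A sketch of that idea with {dplyr} and the {palmerpenguins} data (both are assumptions about the setup). Passing mean on its own returns NA for every column, while an anonymous function that fixes na.rm = TRUE returns the means.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# mean is passed as-is, so na.rm keeps its default of FALSE
with_na <- penguins |>
  summarise(across(ends_with("mm"), mean))

# mean is replaced by a function definition that sets na.rm = TRUE
without_na <- penguins |>
  summarise(across(ends_with("mm"), function(x) mean(x, na.rm = TRUE)))

with_na
without_na
```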
Shorthand
Note: you might also see:
# A tibble: 1 × 3
bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <dbl> <dbl>
1 43.9 17.2 201.
\(x) is base R syntax, new in R 4.1.0. Recommended.
~ .x is fine but only works in tidyverse functions
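The three spellings can be compared side by side. This sketch assumes {dplyr} and the {palmerpenguins} data; the object names are made up for the example.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Long form: function(x)
long_form <- penguins |>
  summarise(across(ends_with("mm"), function(x) mean(x, na.rm = TRUE)))

# Base R shorthand (R >= 4.1.0): \(x)
short_form <- penguins |>
  summarise(across(ends_with("mm"), \(x) mean(x, na.rm = TRUE)))

# Formula shorthand understood by tidyverse functions: ~ .x
formula_form <- penguins |>
  summarise(across(ends_with("mm"), ~ mean(.x, na.rm = TRUE)))

# Both comparisons should report no differences
all.equal(long_form, short_form)
all.equal(long_form, formula_form)
```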
.fns: calling more than one function
How can we use more than one function across the columns? By using a list:
penguins |>
  summarise(across(ends_with("mm"), list(
    \(x) mean(x, na.rm = TRUE),
    \(x) sd(x, na.rm = TRUE))))
# A tibble: 1 × 6
bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2
<dbl> <dbl> <dbl> <dbl>
1 43.9 5.46 17.2 1.97
# ℹ 2 more variables: flipper_length_mm_1 <dbl>, flipper_length_mm_2 <dbl>
Problem: the suffixes _1 and _2 for functions are not very useful.
We can improve by naming the elements in the list:
penguins |>
  summarise(across(ends_with("mm"), list(
    mean = \(x) mean(x, na.rm = TRUE),
    sdev = \(x) sd(x, na.rm = TRUE))))
# A tibble: 1 × 6
bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean bill_depth_mm_sdev
<dbl> <dbl> <dbl> <dbl>
1 43.9 5.46 17.2 1.97
# ℹ 2 more variables: flipper_length_mm_mean <dbl>,
# flipper_length_mm_sdev <dbl>
The column name is {.col}_{.fn}: bill_length_mm_mean
.col is the column name and .fn is the function name given in the list
We can change this using the .names argument
.names to control output
penguins |>
  summarise(across(ends_with("mm"),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sdev = \(x) sd(x, na.rm = TRUE)),
                   .names = "{.fn}_of_{.col}"))
# A tibble: 1 × 6
mean_of_bill_length_mm sdev_of_bill_length_mm mean_of_bill_depth_mm
<dbl> <dbl> <dbl>
1 43.9 5.46 17.2
# ℹ 3 more variables: sdev_of_bill_depth_mm <dbl>,
# mean_of_flipper_length_mm <dbl>, sdev_of_flipper_length_mm <dbl>
This is especially important for mutate().
Recall our to_z() function, which we used in mutate() like this:
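The definition of to_z() is not shown here. A sketch, assuming it centres on the mean and scales by the standard deviation with missing values ignored:

```r
# Assumed definition: z-score with missing values ignored
to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Small check: the mean of 1, 2, 3 is 2 and the standard deviation is 1
to_z(c(1, 2, 3, NA))
#> [1] -1  0  1 NA
```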
penguins |>
  mutate(
    z_bill_length_mm = to_z(bill_length_mm),
    z_bill_depth_mm = to_z(bill_depth_mm),
    z_flipper_length_mm = to_z(flipper_length_mm)
  ) |>
  glimpse()
Rows: 344
Columns: 11
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torger…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 1…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475…
$ sex <fct> male, female, female, NA, female, male, female, ma…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ z_bill_length_mm <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986…
$ z_bill_depth_mm <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936…
$ z_flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905…
It makes sense to use across() to apply the transformation to all three variables:
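A sketch of that call, with an assumed definition of to_z() and the {palmerpenguins} data. No .names is given, so each result keeps the name of its input column.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Assumed definition: z-score with missing values ignored
to_z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# One across() replaces the three separate to_z() calls
z_overwrite <- penguins |>
  mutate(across(ends_with("mm"), to_z))

glimpse(z_overwrite)
```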
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986, …
$ bill_depth_mm <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936, …
$ flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905, …
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
😮 Results go into existing columns!
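To keep the original columns, .names can give the results new names. A sketch, again with an assumed to_z() and the {palmerpenguins} data; the specification "z_{.col}" prefixes each input column name with z_.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Assumed definition: z-score with missing values ignored
to_z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# .names creates new columns instead of overwriting the inputs
z_added <- penguins |>
  mutate(across(ends_with("mm"), to_z, .names = "z_{.col}"))

glimpse(z_added)
```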
Rows: 344
Columns: 11
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torger…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 1…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475…
$ sex <fct> male, female, female, NA, female, male, female, ma…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ z_bill_length_mm <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986…
$ z_bill_depth_mm <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936…
$ z_flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905…
Time to bring together functions and iteration!
🎬 Write a function that summarises multiple specified columns of a data frame
my_summary(penguins, ends_with("mm"))
# A tibble: 3 × 7
species bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean
<fct> <dbl> <dbl> <dbl>
1 Adelie 38.8 2.66 18.3
2 Chinstrap 48.8 3.34 18.4
3 Gentoo 47.5 3.08 15.0
# ℹ 3 more variables: bill_depth_mm_sdev <dbl>, flipper_length_mm_mean <dbl>,
# flipper_length_mm_sdev <dbl>
Include a default.
penguins |>
select(-year) |>
my_summary()
# A tibble: 1 × 8
bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean bill_depth_mm_sdev
<dbl> <dbl> <dbl> <dbl>
1 43.9 5.46 17.2 1.97
# ℹ 4 more variables: flipper_length_mm_mean <dbl>,
# flipper_length_mm_sdev <dbl>, body_mass_g_mean <dbl>,
# body_mass_g_sdev <dbl>
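One possible solution, offered as a sketch rather than the model answer. The argument names df and cols, the embracing with {{ cols }}, and the where(is.numeric) default are choices made for this example. The grouped call is an assumption about how a per-species table could be produced.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data

# Hypothetical solution: summarise the selected columns with mean and sd
my_summary <- function(df, cols = where(is.numeric)) {
  df |>
    summarise(
      across(
        {{ cols }},  # embrace so callers can pass select() helpers
        list(
          mean = \(x) mean(x, na.rm = TRUE),
          sdev = \(x) sd(x, na.rm = TRUE)
        )
      ),
      .groups = "drop"
    )
}

# Grouping first gives one row per species
by_species <- penguins |>
  group_by(species) |>
  my_summary(ends_with("mm"))

# With no cols argument, the default picks every numeric column
overall <- penguins |>
  select(-year) |>
  my_summary()
```

Embracing lets the caller pass any selection that select() would accept, which is what makes across() convenient inside functions.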
You already knew some iteration: group_by(), facet_wrap()
across() iterates over columns
.cols uses the select() specification
You can use across() in functions!