useR to programmeR

Iteration 1

Emma Rand and Ian Lyttle

Overview

In this session we will cover another way to reduce code duplication: iteration.

Learning Objectives

At the end of this section you will be able to:

  • recognise that much iteration comes free with R

  • iterate across columns using across()

    • use selection functions to select columns for iteration
    • use anonymous functions to pass arguments
    • give more than one function for iteration
    • use .names to control the output
  • use across() in functions

What is iteration?

  • Iteration means repeating steps multiple times until a condition is met

  • In other languages, iteration is performed with loops: for, while

  • Iteration is different in R

  • You can use loops… but you often don’t need to

Iteration in R

Iteration is an inherent part of the language. For example, if

nums <- c(3, 1, 6, 4)

Then

2 * nums

is

[1]  6  2 12  8

and NOT

[1]  6  2 12  8  6  2 12  8
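To make the comparison with loop-based languages concrete, here is a minimal sketch that computes the same result twice: once with an explicit for loop and once with R's built-in vectorisation. The names doubled_loop and doubled_vec are illustrative only.

```r
nums <- c(3, 1, 6, 4)

# Explicit loop: visit each element and store the doubled value
doubled_loop <- numeric(length(nums))
for (i in seq_along(nums)) {
  doubled_loop[i] <- 2 * nums[i]
}

# Vectorised: R applies the multiplication to every element for us
doubled_vec <- 2 * nums

# Both approaches give the same vector
identical(doubled_loop, doubled_vec)
```

The vectorised form is shorter and leaves no index bookkeeping to get wrong.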

Iteration in R

We have:

  • the apply() family

In other languages, a for loop would be introduced right after “hello world”.

Functional programming

This approach is called “functional programming” because functions take other functions as input. Common tasks include:

  • modifying multiple columns {dplyr}

  • reading multiple files {purrr}

  • saving multiple outputs {purrr}
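As a small illustration of "functions take other functions as input", the sketch below uses base R only. summarise_with() is a hypothetical helper written for this example; it is not part of dplyr or purrr.

```r
# A function whose second argument is itself a function (hypothetical helper)
summarise_with <- function(x, f) {
  f(x)
}

vals <- c(2, 4, 9)  # made-up values

# We pass the function itself: mean, not mean()
summarise_with(vals, mean)
summarise_with(vals, max)

# sapply() follows the same idea: it calls sum on each element of the list
totals <- sapply(list(a = 1:3, b = 4:6), sum)
totals
```

across() and the purrr functions follow the same pattern: you hand them a function and they decide when to call it.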

Set up

Create a .R file:

usethis::use_r("iteration-01")

Packages

🎬 Load packages:
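A minimal sketch of this step, assuming the session needs dplyr (for glimpse(), summarise() and across()) and palmerpenguins (for a penguins data set with the column names shown in this section). Both package choices are assumptions inferred from the code that follows; loading tidyverse instead of dplyr would work equally well.

```r
# Assumed packages: adjust to match your own set-up
library(dplyr)           # glimpse(), summarise(), mutate(), across()
library(palmerpenguins)  # penguins data set
```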

Load penguins

🎬 Load penguins data set

data(penguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Modifying multiple columns

Scenario

Recall our standard error function from this morning:

sd_error <- function(x){
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}
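As a quick check of what sd_error() computes, the sketch below compares it with the standard error worked out by hand: the standard deviation of the non-missing values divided by the square root of how many there are. The vector x is made up for illustration.

```r
sd_error <- function(x){
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)  # made-up values, one missing

n <- sum(!is.na(x))                    # number of non-missing observations
by_hand <- sd(x[!is.na(x)]) / sqrt(n)  # standard error computed step by step

all.equal(sd_error(x), by_hand)
```

Counting with sum(!is.na(x)) rather than length(x) matters here: the missing value must not inflate the sample size.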

Scenario

Which we might use as:

penguins |> 
  summarise(se_bill_len = sd_error(bill_length_mm),
            se_bill_dep = sd_error(bill_depth_mm),
            se_flip_len = sd_error(flipper_length_mm),
            se_body_mas = sd_error(body_mass_g))
# A tibble: 1 × 4
  se_bill_len se_bill_dep se_flip_len se_body_mas
        <dbl>       <dbl>       <dbl>       <dbl>
1       0.295       0.107       0.760        43.4

⚠️ Code repetition!

How can we iterate over columns?

Solution: across()

penguins |> 
  summarise(across(bill_length_mm:body_mass_g, sd_error))
# A tibble: 1 × 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <dbl>       <dbl>
1          0.295         0.107             0.760        43.4

across() Arguments

across(.cols, .fns, .names)

3 important arguments

across() Arguments

  • which columns you want to iterate over: .cols = bill_length_mm:body_mass_g
  • what you want to do to each column: .fns = sd_error

    • single function, include arguments, more than one function
  • .names to control output
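The three arguments can be seen together in one call. This is a minimal sketch on a small made-up data frame (toy and its columns are hypothetical, not the penguins data), with each argument named explicitly.

```r
library(dplyr)

# Small made-up data frame for illustration
toy <- tibble(
  id = c("a", "b", "c"),
  len_mm = c(10, 20, NA),
  dep_mm = c(1, 3, 5)
)

res <- toy |>
  summarise(across(
    .cols = ends_with("mm"),            # which columns to iterate over
    .fns = \(x) mean(x, na.rm = TRUE),  # what to do to each column
    .names = "mean_{.col}"              # what to call the results
  ))

res
```

In everyday use .cols and .fns are usually passed by position, as in the rest of this section.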

selecting columns with .cols

  • we could use colon notation, bill_length_mm:body_mass_g, because the columns are adjacent

but .cols also accepts the selection helpers used with select(), such as ends_with()

selecting columns with .cols

penguins |> 
  summarise(across(ends_with("mm"), sd_error))
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <dbl>
1          0.295         0.107             0.760

selecting columns with .cols

penguins |> 
  group_by(species, island, sex) |> 
  summarise(across(everything(), sd_error))
# A tibble: 13 × 8
# Groups:   species, island [5]
   species   island    sex    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>     <fct>     <fct>           <dbl>         <dbl>             <dbl>
 1 Adelie    Biscoe    female          0.376        0.233              1.44 
 2 Adelie    Biscoe    male            0.428        0.188              1.38 
 3 Adelie    Dream     female          0.402        0.173              1.06 
 4 Adelie    Dream     male            0.330        0.195              1.29 
 5 Adelie    Dream     <NA>           NA           NA                 NA    
 6 Adelie    Torgersen female          0.451        0.180              0.947
 7 Adelie    Torgersen male            0.631        0.226              1.23 
 8 Adelie    Torgersen <NA>            1.61         0.709              2.81 
 9 Chinstrap Dream     female          0.533        0.134              0.987
10 Chinstrap Dream     male            0.268        0.131              1.02 
11 Gentoo    Biscoe    female          0.269        0.0709             0.512
12 Gentoo    Biscoe    male            0.348        0.0949             0.726
13 Gentoo    Biscoe    <NA>            0.687        0.405              0.629
# ℹ 2 more variables: body_mass_g <dbl>, year <dbl>

selecting columns with .cols

penguins |> 
  group_by(species, island, sex) |> 
  summarise(across(everything(), sd_error))

  • everything() selects all columns except those named in group_by()

  • here that means all of bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g and year

selecting columns with .cols

penguins |> 
  select(-year) |>
  group_by(species, island, sex) |> 
  summarise(across(everything(), sd_error))
# A tibble: 13 × 7
# Groups:   species, island [5]
   species   island    sex    bill_length_mm bill_depth_mm flipper_length_mm
   <fct>     <fct>     <fct>           <dbl>         <dbl>             <dbl>
 1 Adelie    Biscoe    female          0.376        0.233              1.44 
 2 Adelie    Biscoe    male            0.428        0.188              1.38 
 3 Adelie    Dream     female          0.402        0.173              1.06 
 4 Adelie    Dream     male            0.330        0.195              1.29 
 5 Adelie    Dream     <NA>           NA           NA                 NA    
 6 Adelie    Torgersen female          0.451        0.180              0.947
 7 Adelie    Torgersen male            0.631        0.226              1.23 
 8 Adelie    Torgersen <NA>            1.61         0.709              2.81 
 9 Chinstrap Dream     female          0.533        0.134              0.987
10 Chinstrap Dream     male            0.268        0.131              1.02 
11 Gentoo    Biscoe    female          0.269        0.0709             0.512
12 Gentoo    Biscoe    male            0.348        0.0949             0.726
13 Gentoo    Biscoe    <NA>            0.687        0.405              0.629
# ℹ 1 more variable: body_mass_g <dbl>

selecting columns with .cols

  • My columns have very different names and I don’t want to group!
penguins |> 
  select(-year) |>
  summarise(across(where(is.numeric), sd_error))
# A tibble: 1 × 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <dbl>       <dbl>
1          0.295         0.107             0.760        43.4

.fns: calling one function

  • we can pass a function, sd_error, to across() because R is a functional programming language

  • note, we are not calling sd_error()

  • instead we pass sd_error so across() can call it

  • thus function name is not followed by ()

function name is not followed by ()

📢

penguins |> 
  select(-year) |>
  summarise(across(where(is.numeric), sd_error()))
Error in `summarise()`:
ℹ In argument: `across(where(is.numeric), sd_error())`.
Caused by error in `sd_error()`:
! argument "x" is missing, with no default

This error is easy to make!

Include arguments

penguins |> 
  summarise(across(ends_with("mm"), mean))
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <dbl>
1             NA            NA                NA

We get NA because the columns contain missing values.

Include arguments

mean() has an na.rm argument.

How can we pass on na.rm = TRUE?

We might try:

penguins |> 
  summarise(across(ends_with("mm"), mean(na.rm = TRUE)))
Error in `summarise()`:
ℹ In argument: `across(ends_with("mm"), mean(na.rm = TRUE))`.
Caused by error in `mean.default()`:
! argument "x" is missing, with no default

Include arguments

The solution is to create a new function that calls mean() with na.rm = TRUE

penguins |> 
  summarise(across(ends_with("mm"), 
                   function(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <dbl>
1           43.9          17.2              201.

mean is replaced by a function definition

Anonymous functions

penguins |> 
  summarise(across(ends_with("mm"), 
                   function(x) mean(x, na.rm = TRUE)))
  • This is called an anonymous or lambda function.

  • It is anonymous because we do not give it a name with <-

Anonymous functions

Shorthand

Instead of writing function we can use \

penguins |> 
  summarise(across(ends_with("mm"), 
                   \(x) mean(x, na.rm = TRUE)))
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <dbl>
1           43.9          17.2              201.

Anonymous functions

Note: you might also see:

penguins |> 
  summarise(across(ends_with("mm"), 
                   ~ mean(.x, na.rm = TRUE)))
# A tibble: 1 × 3
  bill_length_mm bill_depth_mm flipper_length_mm
           <dbl>         <dbl>             <dbl>
1           43.9          17.2              201.
  • \(x) is base R syntax, new in R 4.1.0. Recommended.

  • ~ .x is fine, but it only works in tidyverse functions
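The three ways of writing the anonymous function give the same result inside across(). The sketch below checks this on a small made-up data frame (toy is hypothetical), then shows the \(x) shorthand working in a base R function. It assumes R 4.1.0 or later.

```r
library(dplyr)

# Small made-up data frame with one missing value per column
toy <- tibble(a_mm = c(1, NA, 3), b_mm = c(2, 4, NA))

full    <- toy |> summarise(across(everything(), function(x) mean(x, na.rm = TRUE)))
short   <- toy |> summarise(across(everything(), \(x) mean(x, na.rm = TRUE)))
formula <- toy |> summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# All three spellings produce the same one-row summary
all.equal(full, short)
all.equal(short, formula)

# The \(x) shorthand is ordinary R, so it also works outside the tidyverse
sums <- sapply(list(1:3, 4:6), \(x) sum(x))
sums
```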

.fns: calling more than one function

How can we use more than one function across the columns?

penguins |> 
  summarise(across(ends_with("mm"), _MORE THAN ONE FUNCTION_))

by using a list

.fns: calling more than one function

Using a list:

penguins |> 
  summarise(across(where(is.numeric), list(
    sd_error, 
    length)))

Or, with anonymous functions:

penguins |> 
  summarise(across(ends_with("mm"), list(
    \(x) mean(x, na.rm = TRUE),
    \(x) sd(x, na.rm = TRUE))))

.fns: calling more than one function

penguins |> 
  summarise(across(ends_with("mm"), list(
    \(x) mean(x, na.rm = TRUE),
    \(x) sd(x, na.rm = TRUE))))
# A tibble: 1 × 6
  bill_length_mm_1 bill_length_mm_2 bill_depth_mm_1 bill_depth_mm_2
             <dbl>            <dbl>           <dbl>           <dbl>
1             43.9             5.46            17.2            1.97
# ℹ 2 more variables: flipper_length_mm_1 <dbl>, flipper_length_mm_2 <dbl>

Problem: the suffixes _1 and _2 for functions are not very useful.

.fns: calling more than one function

We can improve by naming the elements in the list

penguins |> 
  summarise(across(ends_with("mm"), list(
    mean = \(x) mean(x, na.rm = TRUE),
    sdev = \(x) sd(x, na.rm = TRUE))))
# A tibble: 1 × 6
  bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean bill_depth_mm_sdev
                <dbl>               <dbl>              <dbl>              <dbl>
1                43.9                5.46               17.2               1.97
# ℹ 2 more variables: flipper_length_mm_mean <dbl>,
#   flipper_length_mm_sdev <dbl>

The default column name is {.col}_{.fn}, for example bill_length_mm_mean

.col: the column name; .fn: the function name (the name given in the list)

We can change this using the .names argument

.names to control output

penguins |> 
  summarise(across(ends_with("mm"),
                   list(mean = \(x) mean(x, na.rm = TRUE),
                        sdev = \(x) sd(x, na.rm = TRUE)),
                   .names = "{.fn}_of_{.col}"))
# A tibble: 1 × 6
  mean_of_bill_length_mm sdev_of_bill_length_mm mean_of_bill_depth_mm
                   <dbl>                  <dbl>                 <dbl>
1                   43.9                   5.46                  17.2
# ℹ 3 more variables: sdev_of_bill_depth_mm <dbl>,
#   mean_of_flipper_length_mm <dbl>, sdev_of_flipper_length_mm <dbl>

.names to control output

Especially important for mutate().

Recall our to_z() function

to_z <- function(x, middle = 1) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}
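As a reminder of what to_z() returns, here is a minimal sketch on a made-up vector. With the default middle = 1 nothing is trimmed, so the result should have mean 0 and standard deviation 1, and the missing value should stay missing.

```r
to_z <- function(x, middle = 1) {
  trim <- (1 - middle) / 2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

x <- c(1, 2, 3, NA)  # made-up values, one missing
z <- to_z(x)

z                      # one value per input element
mean(z, na.rm = TRUE)  # centred on zero
sd(z, na.rm = TRUE)    # unit standard deviation
```

Because to_z() returns one value per input value, it belongs in mutate() rather than summarise().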

to_z() function in mutate()

which we used like this

penguins |>
  mutate(
    z_bill_length_mm = to_z(bill_length_mm),
    z_bill_depth_mm = to_z(bill_depth_mm),
    z_flipper_length_mm = to_z(flipper_length_mm)
  ) |> 
  glimpse()
Rows: 344
Columns: 11
$ species             <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island              <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torger…
$ bill_length_mm      <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm       <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm   <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 1…
$ body_mass_g         <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475…
$ sex                 <fct> male, female, female, NA, female, male, female, ma…
$ year                <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ z_bill_length_mm    <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986…
$ z_bill_depth_mm     <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936…
$ z_flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905…

.names to control output

It makes sense to use across() to apply the transformation to all three variables

penguins |>
  mutate(across(ends_with("mm"),
                to_z)
  ) |> 
  glimpse()
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986, …
$ bill_depth_mm     <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936, …
$ flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905, …
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

😮 Results go into existing columns!

penguins |>
  mutate(across(ends_with("mm"),
                to_z,
                .names = "z_{.col}")
  ) |> 
  glimpse()
Rows: 344
Columns: 11
$ species             <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island              <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torger…
$ bill_length_mm      <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1…
$ bill_depth_mm       <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1…
$ flipper_length_mm   <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 1…
$ body_mass_g         <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475…
$ sex                 <fct> male, female, female, NA, female, male, female, ma…
$ year                <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 20…
$ z_bill_length_mm    <dbl> -0.8832047, -0.8099390, -0.6634077, NA, -1.3227986…
$ z_bill_depth_mm     <dbl> 0.78430007, 0.12600328, 0.42983257, NA, 1.08812936…
$ z_flipper_length_mm <dbl> -1.4162715, -1.0606961, -0.4206603, NA, -0.5628905…

Your turn

Time to bring together functions and iteration!

🎬 Write a function that summarises multiple specified columns of a data frame

my_summary <- function(df, cols) {

   . . . .

}
my_summary(penguins, ends_with("mm"))

A solution

my_summary <- function(df, cols) {
  df |> 
    summarise(across({{ cols }},
                     list(mean = \(x) mean(x, na.rm = TRUE),
                          sdev = \(x) sd(x, na.rm = TRUE))),
              .groups = "drop")
}
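The embracing operator, {{ cols }}, is what lets the caller's column selection reach across() unevaluated. The sketch below shows the same idea in a stripped-down form; col_means() and the toy data frame are hypothetical names used only for this example.

```r
library(dplyr)

# Small made-up data frame
toy <- tibble(
  g = c("x", "x", "y"),
  len_mm = c(1, 3, 5),
  wt_g = c(10, 20, 60)
)

# Embracing forwards whatever selection the caller wrote into across()
col_means <- function(df, cols) {
  df |>
    summarise(across({{ cols }}, \(x) mean(x, na.rm = TRUE)))
}

by_helper <- col_means(toy, ends_with("mm"))    # a selection helper
by_name   <- col_means(toy, c(len_mm, wt_g))    # bare column names
by_group  <- toy |> group_by(g) |> col_means(where(is.numeric))  # grouped input

by_helper
by_name
by_group
```

Any selection that works in select() can be passed this way, and grouping applied before the call is respected.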

Try it out

penguins |> 
  group_by(species) |> 
  my_summary(ends_with("mm"))
# A tibble: 3 × 7
  species   bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean
  <fct>                   <dbl>               <dbl>              <dbl>
1 Adelie                   38.8                2.66               18.3
2 Chinstrap                48.8                3.34               18.4
3 Gentoo                   47.5                3.08               15.0
# ℹ 3 more variables: bill_depth_mm_sdev <dbl>, flipper_length_mm_mean <dbl>,
#   flipper_length_mm_sdev <dbl>

An improved solution

Include a default.

my_summary <- function(df, cols = where(is.numeric)) {
  df |> 
    summarise(across({{cols}},
                     list(mean = \(x) mean(x, na.rm = TRUE),
                          sdev = \(x) sd(x, na.rm = TRUE))),
              .groups = "drop")
}

Try it out

penguins |> 
  select(-year) |>
  my_summary()
# A tibble: 1 × 8
  bill_length_mm_mean bill_length_mm_sdev bill_depth_mm_mean bill_depth_mm_sdev
                <dbl>               <dbl>              <dbl>              <dbl>
1                43.9                5.46               17.2               1.97
# ℹ 4 more variables: flipper_length_mm_mean <dbl>,
#   flipper_length_mm_sdev <dbl>, body_mass_g_mean <dbl>,
#   body_mass_g_sdev <dbl>

Summary

  • you already knew some iteration: group_by(), facet_wrap()

  • across() iterates over columns

    • choose columns with familiar select() spec
    • pass functions without their ()
    • use anonymous functions to add arguments
    • use a list to use multiple functions
    • specify the names
  • You can use across() in functions!