useR to programmeR

๐Ÿ‘‹ & Functions 1

Emma Rand and Ian Lyttle
WiFi: Posit Conf 2024
Password: conf2024

๐Ÿ‘‹ Welcome

Introductions

This is a two-day, hands-on workshop for those who have embraced the tidyverse and want to improve their R programming skills and, especially, reduce the amount of duplication in their code.

  • do you have experience equivalent to an introductory data science course using tidyverse?
  • are you comfortable with the Whole game chapter of R for Data Science (2nd Edition) by by Hadley Wickham, Mine ร‡etinkaya-Rundel, and Garrett Grolemund.


Material

The team

Emma Rand

๐Ÿ˜@3mma@mastodon.social

Lionel Henry

Ian Lyttle

๐Ÿ˜@ijlyttle@vis.social

Jonathan McPherson

 

Standing on the shoulders of

WiFi

  • Network: Posit Conf 2024
  • Password: conf2024

Introductions

To each other! With help from Yorkshire!

. . .Posit Conf 2024

Ingredients: Sugar, Glucose syrup, Cocoa mass, Vegetable fats (Palm, Rapeseed, Sunflower, Coconut,Mango kernel/ Sal/ Shea), Sweetened condensed skimmed milk (Skimmed milk, Sugar), Cocoa butter, Dried whole milk, Glucose-fructose syrup, Coconut, Lactose and proteins from whey (from Milk), Whey powder (from Milk), Hazelnuts, Skimmed milk powder, Butter (from Milk), Emulsifiers (Sunflower lecithin, E471), Flavourings, Butterfat (from Milk), Fat-reduced cocoa powder, Salt, Lactic acid.

Ingredients: Sugar, Semi-Sweet Chocolate (Sugar, Chocolate, Cocoa Butter, Milkfat, Soy and Sunflower Lecithin, Natural Vanilla Flavor), Glucose Syrup, Peppermint Oil, Citric Acid, Invertase.

Ingredients: Glucose Syrup, Sugar, Starch, Acid: Citric Acid, Flavouring, Fruit and Plant Concentrates: Aronia, Blackcurrant, Elderberry, Grape, Lemon, Orange, Safflower Spirulina, Caramelised Sugar Syrup, Glazing Agents: Beeswax, Carnauba Wax, Elderberry Extract.

Code of Conduct

Code of Conduct. Please Review

  • ๐Ÿ’™ Treat everyone with respect
  • ๐Ÿงก Everyone should feel welcome and safe
  • Red lanyard = โŒ๐Ÿ“ท

Reporting:

  • ๐Ÿ—ฃ๏ธ any posit::conf staff member (t-shirt) or Info desk
  • ๐Ÿ“ง codeofconduct@posit.com
  • โ˜Ž๏ธ 844-448-1212

Housekeeping

  • There are gender-neutral bathrooms on levels 3, 4, 5, 6 & 7
  • Meditation/prayer rooms: in 503. Open Mon/Tues 0700 - 1900, Wed 0700 - 1700
  • Lactation room: in 509 Open Mon/Tues 0700 - 1900, Wed 0700 - 1700

๐Ÿ™ to

  • Lionel and Jonathan

  • colleagues, friends and learners at Schneider Electric, University of York and RForwards!

  • Posit team and especially Mine ร‡etinkaya-Rundel

  • Ian!
  • Experience ๐Ÿฑ ๐Ÿฅ— ๐ŸŒฎ ๐Ÿด ๐Ÿ•

Prerequisites

We built this course using the most-recent versions of R (4.4) and RStudio (2024.04). However, things should work with at least R 4.2 and RStudio 2023.03. You will need packages:

  • {tidyverse}
  • {palmerpenguins}
  • {devtools}
  • {here}

๐ŸŽฌ Detailed instructions for installing these were covered in Prerequisites

Schedule

Time Activity
09:00 - 10:30 Functions 1 Introduction, vector and dataframe functions, embracing
10:30 - 11:00 โ˜• Coffee break
11:00 - 12:30 Functions 2 Plot functions, style and side effects
12:30 - 13:30 ๐Ÿฑ ๐Ÿฅ— ๐ŸŒฎ ๐Ÿด Lunch break
13:30 - 15:00 Iteration 1 Introduction and modifying multiple columns
15:00 - 15:30 โ˜• Coffee break
15:30 - 17:00 Iteration 2 Reading and writing multiple files

How we will work

  • stickies (TODO, update with current colors)

    • ๐ŸŸฆ Iโ€™m all good, Iโ€™m done

    • ๐ŸŸช I could do with some help

  • Discord

  • no stupid questions

  • ๐ŸŽฌ Action!

Learning Objectives

At the end of this section you will be able to:

  • explain the rationale for writing functions
  • write vector functions
    • that take one or more vectors as input and output a vector
    • that take one or more vectors as input and output a single value
  • specify defaults for function argument
  • write functions that take dataframes as input and output a dataframe
  • using embracing to allow data masking and tidy selection within functions

Set up

Project

https://github.com/posit-conf-2024/programming-r-exercises

๐ŸŽฌ Create a Project:

usethis::use_course("posit-conf-2024/programming-r-exercises")

> usethis::use_course("posit-conf-2024/programming-r-exercises")
โœ” Downloading from 'https://github.com/posit-conf-2024/programming-r-exercises/zipball/HEAD'
Downloaded: 0.26 MB  
โœ” Download stored in 'C:/Users/er13/OneDrive - University of York/Desktop/Desktop/posit-conf-2024-programming-r-exercises-978baff.zip'
โœ” Unpacking ZIP file into 'posit-conf-2024-programming-r-exercises-978baff/' (45 files extracted)
Shall we delete the ZIP file ('posit-conf-2024-programming-r-exercises-978baff.zip')?

1: Not now
2: Yeah
3: Nope

๐ŸŽฌ Choose the option that means yes!

โœ” Deleting 'posit-conf-2024-programming-r-exercises-978baff.zip'
โœ” Opening project in RStudio

RStudio will restart

Create a .R

usethis::use_r("functions-01")

Packages

๐ŸŽฌ Load packages:

โ”€โ”€ Attaching core tidyverse packages โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ tidyverse 2.0.0 โ”€โ”€
โœ” dplyr     1.1.2     โœ” readr     2.1.4
โœ” forcats   1.0.0     โœ” stringr   1.5.0
โœ” ggplot2   3.4.2     โœ” tibble    3.2.1
โœ” lubridate 1.9.2     โœ” tidyr     1.3.0
โœ” purrr     1.0.1     โ”€โ”€ Conflicts โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ tidyverse_conflicts() โ”€โ”€
โœ– dplyr::filter() masks stats::filter()
โœ– dplyr::lag()    masks stats::lag()
โ„น Use the conflicted package to force all conflicts to become errors'

Load penguins

๐ŸŽฌ Load penguins data set

data(penguins)
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelโ€ฆ
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerseโ€ฆ
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, โ€ฆ
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, โ€ฆ
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186โ€ฆ
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, โ€ฆ
$ sex               <fct> male, female, female, NA, female, male, female, maleโ€ฆ
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007โ€ฆ

Why write functions?

Rationale

  • impact from code: reach and clarity
  • efficiency: update code in one place, decrease error rate, improve ability to reuse

Example

We have several measurements:

  • bill_length_mm
  • bill_depth_mm
  • flipper_length_mm
  • body_mass_g

These are on very different scales

Example

  • difficult to plot on same axis or determine what value is large for that variable

  • A common solution is to apply a \(z\) score transformation to each variable.

  • Normalises the values to have a mean of 0 and a standard deviation of 1

\[z = \frac{x - \bar{x}}{s.d.}\]

Apply transformation

We can apply the same transformation to each variable:

penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )

Long, unclear

(bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE)

  • Quite a lot of code
  • Difficult to determine what the transformation is

How to shorten and make more clear?

Coping and pasting

  • is error prone

How to make fewer mistakes?

Writing a function:

  • can be named to make transformation transparent
  • will make code shorter
  • can be reused

๐Ÿ”‘๏ธ You may think you have to write complex functions - you donโ€™t! Start with the simple things.

Types of function

Types of function

We will cover two types of function

  1. vector functions: one of more vectors as input, one vector as output
  1. data frame functions: df as input and df as output

Types of function

We will cover two types of function

  1. vector functions: one of more vectors as input, one vector as output

    1. output same length as input. โ€œmutateโ€ functions will work well in mutate() and filter(). Principles of writing functions

    2. summary functions: input is vector, output is a single value

  2. data frame functions: df as input and df as output

Vector functions

Output same length as input

  • output same length as input
  • work well in mutate()
  • appropriate for the z-transformation example

General

To turn your code into a function you need:

  • a name
  • the arguments - which represent the bits that vary
  • the code body for the function
name <- function(arguments) {
  code body
}

Function name

Use a verb - The tidyverse style guide (Wickham, n.d.) but good advice regardless

Difficulty in naming? Should this be two or three functions?

What should we call the function we write to do a \(z\) score transformation?

Arguments

  • the input vector

  • additional arguments

Naming conventions

  • x for the vector input
name <- function(x) {
  body does things with x
}

Example

\[z = \frac{x - \bar{x}}{s.d.}\]

penguins <- penguins |>
  mutate(
    z_bill_length_mm = (bill_length_mm - mean(bill_length_mm, na.rm = TRUE)) / sd(bill_length_mm, na.rm = TRUE),
    z_bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm, na.rm = TRUE)) / sd(bill_depth_mm, na.rm = TRUE),
    z_flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE),
    z_body_mass_g = (body_mass_g - mean(body_mass_g, na.rm = TRUE)) / sd(body_mass_g, na.rm = TRUE)
  )

Example

Identify the arguments: the things that vary across calls

(bill_length_mm    - mean(bill_length_mm,    na.rm = TRUE)) / sd(bill_length_mm,    na.rm = TRUE)
(bill_depth_mm     - mean(bill_depth_mm,     na.rm = TRUE)) / sd(bill_depth_mm,     na.rm = TRUE)
(flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) / sd(flipper_length_mm, na.rm = TRUE)
(body_mass_g       - mean(body_mass_g,       na.rm = TRUE)) / sd(body_mass_g,       na.rm = TRUE)


(๐ŸŸง - mean(๐ŸŸง, na.rm = TRUE)) / sd(๐ŸŸง, na.rm = TRUE)
(๐ŸŸง - mean(๐ŸŸง, na.rm = TRUE)) / sd(๐ŸŸง, na.rm = TRUE)
(๐ŸŸง - mean(๐ŸŸง, na.rm = TRUE)) / sd(๐ŸŸง, na.rm = TRUE)
(๐ŸŸง - mean(๐ŸŸง, na.rm = TRUE)) / sd(๐ŸŸง, na.rm = TRUE)

๐ŸŸง is x

Example

Put into the template

name <- function(x) {
  body does things with x
}


to_z <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Apply

Rewrite the call to mutate() as:

penguins <- penguins |>
  mutate(
    z_bill_length_mm = to_z(bill_length_mm),
    z_bill_depth_mm = to_z(bill_depth_mm),
    z_flipper_length_mm = to_z(flipper_length_mm),
    z_body_mass_g = to_z(body_mass_g)
  )

Much shorter, much more clear.

A modification

mean() has a trim argument: mean(x, trim = 0, na.rm = FALSE, ...)

the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.

Suppose we want to specify the middle proportion left rather than the proportion trimmed from each end [^1]

A modification

  • A value of 0.1 for trim trims 0.1 from each end leaving 0.8 in the middle

  • trim = (1 - middle)/2

schematic of trim and middle demonstrating that trim = (1 - middle)/2

Trim is the proportion trimmed off each end; middle is whatโ€™s left

Add an argument

to_z <- function(x, middle) {
  trim = (1 - middle)/2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

Try it out

to_z(penguins$bill_length_mm, middle = 0.2)
  [1] -0.92838057 -0.85511491 -0.70858359          NA -1.36797452 -0.89174774
  [7] -0.96501340 -0.91006415 -1.84420131 -0.39720454 -1.16649396 -1.16649396
 [13] -0.56205227 -1.01996264 -1.75261923 -1.38629094 -1.00164623 -0.30562246
 [19] -1.78925206  0.33545205 -1.16649396 -1.18481038 -1.51450584 -1.09322830
 [25] -0.98332981 -1.62440433 -0.65363435 -0.67195076 -1.14817755 -0.67195076
 [31] -0.85511491 -1.27639245 -0.85511491 -0.59868510 -1.42292377 -0.91006415
 [37] -0.98332981 -0.36057171 -1.20312679 -0.80016566 -1.40460735 -0.61700152
 [43] -1.49618943 -0.01255983 -1.31302528 -0.83679849 -0.56205227 -1.22144320
 [49] -1.49618943 -0.34225529 -0.83679849 -0.74521642 -1.67935358 -0.39720454
 [55] -1.77093565 -0.50710303 -0.94669698 -0.65363435 -1.40460735 -1.20312679
 [61] -1.55113867 -0.52541944 -1.20312679 -0.56205227 -1.42292377 -0.47047020
 [67] -1.58777150 -0.56205227 -1.51450584 -0.43383737 -1.95409980 -0.81848208
 [73] -0.83679849  0.29881922 -1.58777150 -0.25067322 -0.59868510 -1.27639245
 [79] -1.45955660 -0.37888812 -1.75261923 -0.23235681 -1.36797452 -1.66103716
 [85] -1.25807603 -0.52541944 -1.44124018 -1.33134169 -1.07491189 -0.96501340
 [91] -1.55113867 -0.56205227 -1.86251772 -0.83679849 -1.45955660 -0.61700152
 [97] -1.11154472 -0.70858359 -2.02736546 -0.17740756 -1.67935358 -0.58036869
[103] -1.18481038 -1.16649396 -1.14817755 -0.81848208 -1.01996264 -1.09322830
[109] -1.11154472 -0.17740756 -1.11154472  0.26218639 -0.81848208 -0.36057171
[115] -0.83679849 -0.26898963 -1.01996264 -1.25807603 -1.55113867 -0.56205227
[121] -1.45955660 -1.18481038 -0.72690000 -0.50710303 -1.64272075 -0.65363435
[127] -0.98332981 -0.48878661 -0.94669698 -0.01255983 -1.03827906 -0.19572398
[133] -1.34965811 -1.22144320 -1.11154472 -0.56205227 -1.56945509 -0.72690000
[139] -1.31302528 -0.81848208 -0.72690000 -0.65363435 -2.21052960 -0.63531793
[145] -1.25807603 -0.94669698 -0.91006415 -1.38629094 -1.49618943 -1.16649396
[151] -1.49618943 -0.48878661  0.35376847  1.06810865  0.82999525  1.06810865
[157]  0.62851469  0.42703413  0.22555357  0.46366696 -0.15909115  0.48198337
[163] -0.59868510  0.88494450  0.24386998  0.77504601  0.29881922  0.93989374
[169] -0.39720454  0.92157733  0.37208488  0.82999525  1.10474148  0.17060432
[175]  0.42703413  0.39040130 -0.23235681  0.35376847  0.06070583  0.66514752
[181]  0.73841318  1.06810865  0.57356545 -0.25067322  0.17060432  2.82648447
[187]  0.90326091  0.77504601 -0.28730605  0.04238942 -0.03087624  0.82999525
[193] -0.26898963  0.99484299  0.20723715  0.99484299  1.15969072 -0.10414190
[199]  0.24386998  1.15969072  0.13397149  0.18892074  0.44535054  0.79336242
[205]  0.17060432  1.08642506  0.42703413  0.15228791 -0.06750907  0.24386998
[211] -0.17740756  1.14137431  0.20723715  0.37208488  0.28050281  1.85571448
[217]  0.29881922  1.03147582  0.37208488  0.97652657 -0.12245832  1.19632355
[223]  0.64683111  0.40871771  0.73841318  0.42703413  0.40871771  0.81167884
[229]  0.61019828  1.26958921  0.18892074  0.18892074  0.90326091  1.52601902
[235]  0.59188186  1.06810865  0.13397149  1.21463997 -0.14077473  1.30622204
[241]  0.61019828  1.45275336  0.61019828  1.47106977  0.24386998  0.97652657
[247]  0.06070583  1.21463997  0.95821016  0.50029979  0.77504601  1.26958921
[253]  0.79336242  2.14877712  0.55524903  0.90326091  0.57356545  0.48198337
[259] -0.45215378  1.69086675 -0.15909115  0.72009677  1.15969072  1.03147582
[265] -0.12245832  1.34285487  0.37208488  2.00224580  0.06070583  0.84831167
[271]  0.55524903          NA  0.48198337  1.14137431  0.18892074  1.04979223
[277]  0.42703413  1.06810865  1.30622204  0.22555357  1.56265185  0.18892074
[283]  0.35376847  1.30622204  0.33545205  1.30622204  0.44535054  1.37948770
[289]  0.51861620  1.43443694  0.31713564  1.15969072  1.12305789  2.53342183
[295]  0.40871771  0.92157733 -0.32393888  0.79336242 -0.17740756  1.17800714
[301]  0.46366696  1.43443694  1.15969072  0.97652657  0.40871771  1.58096826
[307] -0.59868510  1.83739807 -0.30562246  1.25127279  1.01315940  0.61019828
[313]  0.62851469  1.43443694  0.50029979  1.70918316  0.88494450  0.37208488
[319]  1.23295638  0.24386998  1.23295638  1.21463997  1.08642506  0.88494450
[325]  1.34285487  1.03147582  0.72009677  1.32453845  0.28050281  1.19632355
[331] -0.30562246  1.47106977  0.18892074  0.93989374  1.10474148  0.26218639
[337]  1.41612053  0.48198337  0.28050281  2.13046071 -0.12245832  0.99484299
[343]  1.21463997  1.10474148

But what if we forget?

to_z(penguins$bill_length_mm)
Error in to_z(penguins$bill_length_mm): argument "middle" is missing, with no default

Give a default

Give defaults whenever possible:

to_z <- function(x, middle = 1) {
  trim = (1 - middle)/2
  (x - mean(x, na.rm = TRUE, trim = trim)) / sd(x, na.rm = TRUE)
}

Try it out

to_z(penguins$bill_length_mm)
  [1] -0.88320467 -0.80993901 -0.66340769          NA -1.32279862 -0.84657184
  [7] -0.91983750 -0.86488825 -1.79902541 -0.35202864 -1.12131806 -1.12131806
 [13] -0.51687637 -0.97478674 -1.70744334 -1.34111504 -0.95647033 -0.26044656
 [19] -1.74407616  0.38062795 -1.12131806 -1.13963448 -1.46932994 -1.04805240
 [25] -0.93815391 -1.57922843 -0.60845845 -0.62677486 -1.10300165 -0.62677486
 [31] -0.80993901 -1.23121655 -0.80993901 -0.55350920 -1.37774787 -0.86488825
 [37] -0.93815391 -0.31539581 -1.15795089 -0.75498976 -1.35943145 -0.57182562
 [43] -1.45101353  0.03261607 -1.26784938 -0.79162259 -0.51687637 -1.17626731
 [49] -1.45101353 -0.29707939 -0.79162259 -0.70004052 -1.63417768 -0.35202864
 [55] -1.72575975 -0.46192713 -0.90152108 -0.60845845 -1.35943145 -1.15795089
 [61] -1.50596277 -0.48024354 -1.15795089 -0.51687637 -1.37774787 -0.42529430
 [67] -1.54259560 -0.51687637 -1.46932994 -0.38866147 -1.90892390 -0.77330618
 [73] -0.79162259  0.34399512 -1.54259560 -0.20549732 -0.55350920 -1.23121655
 [79] -1.41438070 -0.33371222 -1.70744334 -0.18718091 -1.32279862 -1.61586126
 [85] -1.21290014 -0.48024354 -1.39606428 -1.28616579 -1.02973599 -0.91983750
 [91] -1.50596277 -0.51687637 -1.81734182 -0.79162259 -1.41438070 -0.57182562
 [97] -1.06636882 -0.66340769 -1.98218956 -0.13223166 -1.63417768 -0.53519279
[103] -1.13963448 -1.12131806 -1.10300165 -0.77330618 -0.97478674 -1.04805240
[109] -1.06636882 -0.13223166 -1.06636882  0.30736229 -0.77330618 -0.31539581
[115] -0.79162259 -0.22381374 -0.97478674 -1.21290014 -1.50596277 -0.51687637
[121] -1.41438070 -1.13963448 -0.68172411 -0.46192713 -1.59754485 -0.60845845
[127] -0.93815391 -0.44361071 -0.90152108  0.03261607 -0.99310316 -0.15054808
[133] -1.30448221 -1.17626731 -1.06636882 -0.51687637 -1.52427919 -0.68172411
[139] -1.26784938 -0.77330618 -0.68172411 -0.60845845 -2.16535371 -0.59014203
[145] -1.21290014 -0.90152108 -0.86488825 -1.34111504 -1.45101353 -1.12131806
[151] -1.45101353 -0.44361071  0.39894437  1.11328455  0.87517115  1.11328455
[157]  0.67369059  0.47221003  0.27072946  0.50884286 -0.11391525  0.52715927
[163] -0.55350920  0.93012040  0.28904588  0.82022191  0.34399512  0.98506964
[169] -0.35202864  0.96675323  0.41726078  0.87517115  1.14991738  0.21578022
[175]  0.47221003  0.43557720 -0.18718091  0.39894437  0.10588173  0.71032342
[181]  0.78358908  1.11328455  0.61874135 -0.20549732  0.21578022  2.87166037
[187]  0.94843681  0.82022191 -0.24213015  0.08756532  0.01429966  0.87517115
[193] -0.22381374  1.04001889  0.25241305  1.04001889  1.20486662 -0.05896600
[199]  0.28904588  1.20486662  0.17914739  0.23409663  0.49052644  0.83853832
[205]  0.21578022  1.13160096  0.47221003  0.19746381 -0.02233317  0.28904588
[211] -0.13223166  1.18655021  0.25241305  0.41726078  0.32567871  1.90089038
[217]  0.34399512  1.07665172  0.41726078  1.02170247 -0.07728242  1.24149945
[223]  0.69200701  0.45389361  0.78358908  0.47221003  0.45389361  0.85685474
[229]  0.65537418  1.31476511  0.23409663  0.23409663  0.94843681  1.57119492
[235]  0.63705776  1.11328455  0.17914739  1.25981586 -0.09559883  1.35139794
[241]  0.65537418  1.49792926  0.65537418  1.51624567  0.28904588  1.02170247
[247]  0.10588173  1.25981586  1.00338606  0.54547569  0.82022191  1.31476511
[253]  0.83853832  2.19395302  0.60042493  0.94843681  0.61874135  0.52715927
[259] -0.40697788  1.73604265 -0.11391525  0.76527266  1.20486662  1.07665172
[265] -0.07728242  1.38803077  0.41726078  2.04742170  0.10588173  0.89348757
[271]  0.60042493          NA  0.52715927  1.18655021  0.23409663  1.09496813
[277]  0.47221003  1.11328455  1.35139794  0.27072946  1.60782775  0.23409663
[283]  0.39894437  1.35139794  0.38062795  1.35139794  0.49052644  1.42466360
[289]  0.56379210  1.47961284  0.36231154  1.20486662  1.16823379  2.57859773
[295]  0.45389361  0.96675323 -0.27876298  0.83853832 -0.13223166  1.22318303
[301]  0.50884286  1.47961284  1.20486662  1.02170247  0.45389361  1.62614416
[307] -0.55350920  1.88257397 -0.26044656  1.29644869  1.05833530  0.65537418
[313]  0.67369059  1.47961284  0.54547569  1.75435906  0.93012040  0.41726078
[319]  1.27813228  0.28904588  1.27813228  1.25981586  1.13160096  0.93012040
[325]  1.38803077  1.07665172  0.76527266  1.36971435  0.32567871  1.24149945
[331] -0.26044656  1.51624567  0.23409663  0.98506964  1.14991738  0.30736229
[337]  1.46129643  0.52715927  0.32567871  2.17563660 -0.07728242  1.04001889
[343]  1.25981586  1.14991738

Your turn

๐ŸŽฌ Write a function that performs the Box-Cox power transformation using the value of (non-zero) lambda (\(\lambda\)) supplied.

Your turn: Box-Cox

\[bc = \frac{x^{\lambda} - 1}{\lambda} \text{ for }\lambda \ne 0\]

  • Set the default \(\lambda = 1\)
  • Still have time? Check \(\lambda \ne 0\)
  • Still have time? Check and amend for:

\[ bc = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \text{for }\lambda \ne 0\\ log(x) & \text{for }\lambda = 0 \end{cases} \]

A solution 1

to_box_cox <- function(x, lambda = 1) {
  (x^lambda - 1) / lambda
}

A solution 1 - test

vals <- rexp(10000, 10) 
vals |> hist()

A solution 1 - test

to_box_cox(vals, 0.3) |> 
  hist()

A solution 2

Check \(\lambda \ne 0\)

A solution 3

Check and amend for:

\[ bc = \begin{cases} \frac{x^{\lambda} - 1}{\lambda} & \text{for }\lambda \ne 0\\ log(x) & \text{for }\lambda = 0 \end{cases} \]

Types of function

We will cover two types of function

  1. vector functions: one of more vectors as input, one vector as output

    1. โœ”๏ธ output same length as input.

    ii. โžก๏ธ summary functions: input is vector, output is a single value

  2. data frame functions: df as input and df as output

Summary functions

  • input is vector
  • output is a single value
  • could be used in summarise()

Example

Write a function to compute the standard error of a sample.

\[s.e. = \frac{s.d.}{\sqrt{n}}\]

Example

sd_error <- function(x){
  sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
}

Note: sum(TRUE) = 1 and sum(FALSE) = 0 Thus,sum(!is.na(x)) gives you the number of TRUE (i.e., the number of non-NA values) and is a bit shorter than length(x[!is.na(x)])

Try it out

๐ŸŽฌ Call the function on penguins$bill_length_mm

sd_error(penguins$bill_length_mm)
[1] 0.2952205

Or in a pipeline

penguins |> 
  summarise(se = sd_error(bill_length_mm))
# A tibble: 1 ร— 1
     se
  <dbl>
1 0.295

Your turn

๐ŸŽฌ Write a function to compute the sums of squares (sum of the squared deviations from the mean)

\[SS(x) = \sum{(x - \bar{x})^2}\]

or

\[SS(x) = s^2 * (n-1)\]

A solution - 1

sum_sq <- function(x){
 sum((x[!is.na(x)] - mean(x[!is.na(x)]))^2)
}

๐ŸŽฌ Try it out

sum_sq(penguins$bill_length_mm)
[1] 10164.21

Types of function

We will cover two types of function

  1. vector functions: one of more vectors as input, one vector as output

    1. โœ”๏ธ output same length as input.

    2. โœ”๏ธ summary functions: input is vector, output is a single value

2. โžก๏ธ data frame functions: df as input and df as output

Dataframe functions

Dataframe functions

Dataframe as input and Dataframe as output

For example, we might summarise one of our columns like this:

penguins |> 
  summarise(mean = mean(bill_length_mm, na.rm = TRUE),
            n = sum(!is.na(bill_length_mm)),
            sd = sd(bill_length_mm, na.rm = TRUE),
            se = sd_error(bill_length_mm))
# A tibble: 1 ร— 4
   mean     n    sd    se
  <dbl> <int> <dbl> <dbl>
1  43.9   342  5.46 0.295

Output is a dataframe

Dataframe functions

and summarise several dataframes in the same way

Good candidate for a function to avoid repetitive code: my_summary()

Define my_summary() function

my_summary <- function(df, column){
  df |> 
  summarise(mean = mean(column, na.rm = TRUE),
            n = sum(!is.na(column)),
            sd = sd(column, na.rm = TRUE),
            se = sd_error(column))
}

Use function

my_summary(penguins, bill_length_mm)
Error in `summarise()`:
โ„น In argument: `mean = mean(column, na.rm = TRUE)`.
Caused by error:
! object 'bill_length_mm' not found

๐Ÿ˜•

Tidy evaluation

tidyverse functions like dplyr::summarise() use โ€œtidy evaluationโ€ so you can refer to the names of variables inside dataframes. For example, you can use:

either

penguins |> summarise(mean = mean(bill_depth_mm))

Or

summarise(penguins, mean = mean(bill_depth_mm))

rather than $ notation

summarise(penguins, mean = mean(penguins$bill_depth_mm))

This is known as data-masking: the dataframe environment masks the user environment by giving priority to the dataframe.

Data masking is greatโ€ฆ.

and makes life easier when working interactively

But not so useful in functions

Because of data-masking, summarise() in my_summary() is looking for a column literally called column in the dataframe that has been passed in. It is not looking in the variable column for the name of column you want to give it.

Fix my_summary() function

The solution is to use embracing: { var }

my_summary <- function(df, column){
  df |> 
  summarise(mean = mean({{ column }}, na.rm = TRUE),
            n = sum(!is.na({{ column }})),
            sd = sd({{ column }}, na.rm = TRUE),
            se = sd_error({{ column }}),
            .groups = "drop")
}
  • look inside column variable
  • style with spaces
  • .groups = "drop" to avoid message and leave the data in an ungrouped state

Use function

my_summary(penguins, bill_length_mm)
# A tibble: 1 ร— 4
   mean     n    sd    se
  <dbl> <int> <dbl> <dbl>
1  43.9   342  5.46 0.295

๐ŸŽ‰

When to embrace?

When tidy evaluation is used

Your turn

๐ŸŽฌ Write a function to calculate the median, maximum and minimum values of a variable grouped by another variable.

A solution - 1

my_summary <- function(df, summary_var, group_var){
  df |> 
    group_by({{ group_var }}) |> 
  summarise(median = median({{summary_var  }}, na.rm = TRUE),
            minimum = min({{summary_var  }}, na.rm = TRUE),
            maximum = max({{summary_var  }}, na.rm = TRUE),
            .groups = "drop")
}

Your turn

๐ŸŽฌ Try it out

my_summary(penguins, bill_length_mm, species)
# A tibble: 3 ร— 4
  species   median minimum maximum
  <fct>      <dbl>   <dbl>   <dbl>
1 Adelie      38.8    32.1    46  
2 Chinstrap   49.6    40.9    58  
3 Gentoo      47.3    40.9    59.6

A solution - 2

Improvement: Have a default of NULL for the grouping variable

my_summary <- function(df, summary_var, group_var = NULL){
  df |> 
    group_by({{ group_var }}) |> 
    summarise(median = median({{summary_var  }}, na.rm = TRUE),
            minimum = min({{summary_var  }}, na.rm = TRUE),
            maximum = max({{summary_var  }}, na.rm = TRUE),
            .groups = "drop")
}

Your turn

๐ŸŽฌ Try it out

my_summary(penguins, bill_length_mm)
# A tibble: 1 ร— 3
  median minimum maximum
   <dbl>   <dbl>   <dbl>
1   44.4    32.1    59.6

Your turn

๐ŸŽฌ Try it out with more than one group

my_summary(penguins, bill_length_mm, c(species, island),)
Error in my_summary(penguins, bill_length_mm, c(species, island), ): unused argument (alist())

๐Ÿ˜•

A solution - 3

Use pick() which allows you to select a subset of columns inside a data masking function:

my_summary <- function(df, summary_var, group_var = NULL){
  df |> 
    group_by(pick({{ group_var }})) |> 
    summarise(median = median({{summary_var  }}, na.rm = TRUE),
            minimum = min({{summary_var  }}, na.rm = TRUE),
            maximum = max({{summary_var  }}, na.rm = TRUE),
            .groups = "drop")
}

๐ŸŽฌ Try it out with more than one group

my_summary(penguins, bill_length_mm, c(species, island))
# A tibble: 5 ร— 5
  species   island    median minimum maximum
  <fct>     <fct>      <dbl>   <dbl>   <dbl>
1 Adelie    Biscoe      38.7    34.5    45.6
2 Adelie    Dream       38.6    32.1    44.1
3 Adelie    Torgersen   38.9    33.5    46  
4 Chinstrap Dream       49.6    40.9    58  
5 Gentoo    Biscoe      47.3    40.9    59.6

Extras

  • Short cuts:

    • put cursor on a function call and press F2 to find its definition
    • Ctrl+. opens section/file search

Summary โ˜•

  • Writing functions can make you more efficient and make your code more readable. This can be just for your benefit.

  • Vector functions take one of more vectors as input; output can be a vector (useful in mutate() and filter()) or a single value (useful in summarise())

  • Dataframe functions take a dataframe as input and output a dataframe

  • Give arguments a default where possible

  • We use { var } embracing to manage data masking

  • We use pick() to select more than one variable

References

Wickham, Hadley. n.d. Welcome | The Tidyverse Style Guide. https://style.tidyverse.org/index.html.
Wickham, Hadley, Mine ร‡etinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. https://r4ds.hadley.nz/.
Wickham, Hadley, Romain Franรงois, Lionel Henry, and Kirill Mรผller. 2022. โ€œDplyr: A Grammar of Data Manipulation.โ€ https://CRAN.R-project.org/package=dplyr.