useR to programmeR

Iteration 2

Emma Rand and Ian Lyttle

Learning objectives

In this session, we will discuss:

using purrr::map() to read a bunch of files
using purrr::walk() to write a bunch of files
functional programming, more generally

For coding, we will use r-programming-exercises:

R/iteration-02-01-reading-files.R, etc.
restart R

Reading multiple files

Using {purrr} to iterate can help you perform many tasks repeatably and reproducibly.

Example

Read Excel files from a directory, then combine into a single data-frame.

Aside: {here} package

When you first call here::here(), (simplified):

climbs your local directory until it finds a .RProj file
sets directory containing .RProj as reference-path
here::here() prepends reference-path to argument

If project in /Users/ian/important-project/:

here("data/file.csv")

"/Users/ian/important-project/data/file.csv"

Our turn

In the programming-r-exercises repository:

open iteration-02-01-reading-files.R
restart R

Our turn: reading data manually

Here’s our starting code:

data1952 <- read_excel(here("data/gapminder/1952.xlsx"))
data1957 <- read_excel(here("data/gapminder/1957.xlsx"))
data1962 <- read_excel(here("data/gapminder/1952.xlsx"))
data1967 <- read_excel(here("data/gapminder/1967.xlsx"))

data_manual <- bind_rows(data1952, data1957, data1962, data1967)

What problems do you see?

(I see two real problems, and one philosophical problem)

Run this example code, discuss with your neighbor.

Our turn: make list of paths

I see this as a two step problem:

make a named list of paths, name is year
use list of paths to read data frames, combine

Let’s work together to improve this code to get paths:

paths <-
  # get the filepaths from the directory
  fs::dir_ls(here("data/gapminder")) |>
  # convert to list
  # extract the year as names
  print()

Our turn: read data

Let’s work together to improve this code to read data:

data <-
  paths |>
  # read each file from excel, into data frame
  # keep only non-null elements
  # set list-names as column `year`
  # bind into single data-frame
  # convert year to number
  print()

Handling failures

If we have a failure, we may not want to stop everything.

library("readr")
read_csv("not/a/file.csv")

Error: 'not/a/file.csv' does not exist in current working directory ('/home/runner/work/programming-r/programming-r').

Function operators

Function operators:

take a function
return a modified function

library("purrr")

poss_read_csv <- possibly(read_csv, otherwise = NULL, quiet = FALSE)

poss_read_csv("not/a/file.csv")

Error: 'not/a/file.csv' does not exist in current working directory ('/home/runner/work/programming-r/programming-r').

NULL

poss_read_csv(I("a, b\n 1, 2"), col_types = "dd")

# A tibble: 1 × 2
      a     b
  <dbl> <dbl>
1     1     2

Our turn: handle failure

In the programming-r-exercises repository:

look at data/gapminder_party/
try running your script using this directory

Create a new function:

possibly_read_excel <- possibly() # we do the rest

Use this function in your script.

If we have time

Functional programming has three fundamental operations:

filter() - like spaghetti, not coffee: purrr::keep()
map() - do this to each element: purrr::map()
reduce() - combine into new thing: purrr::reduce()

Functional sandwiches

Shows ingredients of a sandwich: onions and pickles *filtered* out, remaining ingredients *mapped* to a slicer-function, then *reduced* to a sandwich

Anjana Vakil’s Functional Sandwiches

Horrible example

Implement list_rbind() using functional-programming techniques:

list_rbind2 <- function(df, names_to) {
  df |>
    purrr::keep(\(x) !is.null(x)) |>
    purrr::imap(\(d, name) dplyr::mutate(d, "{names_to}" := name)) |>
    purrr::reduce(rbind)
}

filters in not-NULL values, purrr::keep()
maps name of element to data-column, purrr::imap()
reduces list to single data-frame, purrr::reduce()

Our turn: saving multiple outputs

Goal: write out a csv file for each value of clarity within ggplot2’s diamonds dataset.

When we read “for each”, we might think of using map():

Writing out a file is a side effect.
We aren’t interested in the return value.

{purrr} has a function for that: walk() (and friends).

Our turn - starter code

iteration-02-02-writing-files.R

# ?dplyr::group_nest(), ?stringr::str_glue()
# from diamonds, create tibble with columns: clarity, data, filename
by_clarity_csv <-
  diamonds |>
  # nest by clarity
  # create column for filename
  print()

# ?readr::write_csv()
# using the data and filename, write out csv files
walk2(
  by_clarity_csv$data,
  by_clarity_csv$filename,
  \(data, filename) NULL # replace with actual code
)

Our turn: writing multiple plots

Goal: Save histogram for carat for each value of clarity within diamonds dataset.

Create a plot column, where each element is a ggplot. This will be a list-column.

You can use map():

within mutate(), with all the tidy-eval goodness!
with additional arguments (after the function), e.g.:

mutate(
  plot = map(data, histogram, carat)
)

equivalent to

plot[[1]] = histogram(data[[1]], carat)
plot[[2]] = histogram(data[[2]], carat)
...

Our turn: starter-code

# from diamonds, create tibble with columns: clarity, data, plot, filename
by_clarity_plot <-
  diamonds |>
  # nest by clarity
  group_nest(clarity) |>
  # create columns for plot, filename
  mutate(
    filename = str_glue("clarity-{clarity}.png")#,
    #plot = map(),
  ) |>
  print()

Our turn: more starter-code

# ?ggplot2::ggsave()
ggsave_local <- function(filename, plot) {

}

# using filename and plot, write out plots to png files
walk2(
  by_clarity_plot$filename,
  by_clarity_plot$plot,
  # write plot file to data/clarity directory
  ggsave_local
)

Functions as arguments

library("tidyverse")
library("palmerpenguins")

ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  scale_color_discrete(labels = tolower) # tolower is a function

If we have time (2)

Three fundamental operations in functional programming

Given a list and a function:

filter(): make a new list, subset of old list
map(): make a new list, operating on each element
reduce(): make a new “thing”

dplyr using purrr?

We can use map(), filter(), reduce() to “implement”, using purrr:

I claim it’s possible, I don’t claim it’s a good idea.

Tabular data: two perspectives

column-based: named list of column vectors

{
  mpg: [21.0, 22.8, ...],
  cyl: [6, 4, ...],
  ...
}

row-based: collection of rows, each a named list

[
  {mpg: 21.0, cyl: 6, ...}, 
  {mpg: 22.8, cyl: 4, ...}, 
  ...
]

`dpurrr_filter()`

dpurrr_filter <- function(df, predicate) {
  df |>
    as.list() |>
    purrr::list_transpose(simplify = FALSE) |>
    purrr::keep(predicate) |>
    purrr::list_transpose() |>
    as.data.frame() 
}

dpurrr_filter(mtcars, \(d) d$gear == 3) |> head()

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
2 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
3 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
4 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
5 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
6 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3

`dpurrr_mutate()`

dpurrr_mutate <- function(df, mapper) {
  df |>
    as.list() |>
    purrr::list_transpose(simplify = FALSE) |>
    purrr::map(\(d) c(d, mapper(d))) |>
    purrr::list_transpose() |>
    as.data.frame() 
}

mtcars |> 
  dpurrr_mutate(\(d) list(wt_kg = d$wt * 1000 / 2.2)) |> 
  head()

   mpg cyl disp  hp drat    wt  qsec vs am gear carb    wt_kg
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1190.909
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1306.818
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1054.545
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1461.364
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1563.636
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1572.727

`dpurrr_summarise()`

dpurrr_summarise <- function(df, reducer, .init) {
  df |>
    as.list() |>
    purrr::list_transpose(simplify = FALSE) |>
    purrr::reduce(reducer, .init = .init) |>
    as.data.frame()
}

mtcars |> 
  dpurrr_summarise(
    reducer = \(acc, val) list(
      wt_min = min(acc$wt_min, val$wt), 
      wt_max = max(acc$wt_max, val$wt)
    ),
    .init = list(wt_min = Inf, wt_max = -Inf)
  )

  wt_min wt_max
1  1.513  5.424

With grouping

First, a little prep work:

ireduce <- function(x, reducer, .init) {
  purrr::reduce2(x, names(x), reducer, .init = .init)
}

summariser <- purrr::partial(
  dpurrr_summarise,
  reducer = \(acc, val) list(
    wt_min = min(acc$wt_min, val$wt), 
    wt_max = max(acc$wt_max, val$wt)
  ),
  .init = list(wt_min = Inf, wt_max = -Inf)
)

Et voilà

mtcars |> 
  split(mtcars$gear) |>
  purrr::map(summariser) |> 
  ireduce( 
    reducer = \(acc, x, y) rbind(acc, c(list(gear = y), x)),
    .init = data.frame()
  )

  gear wt_min wt_max
1    3  2.465  5.424
2    4  1.615  3.440
3    5  1.513  3.570

We can agree this presents no danger to dplyr.

In JavaScript, data frames are often arrays of objects (lists), so you’ll see formulations like this (e.g. tidyjs).

Summary

you can use purrr::map() to read a bunch of files
you can use purrr::walk() to write a bunch of files
functional programming has three foundational operations:
- filter (purrr::keep())
- map
- reduce

Functional programming comes up a lot in JavaScript

Wrap-up

Please go to pos.it/conf-workshop-survey.

Your feedback is crucial!

Data from the survey informs curriculum and format decisions for future conf workshops, and we really appreciate you taking the time to provide it.

Thank you!

Emma
Lionel and Jonathan
Mine Çetinkaya-Rundel, Posit
You 🤗