Have you used or experimented with Arrow before today?
Vote using emojis on the #workshop-arrow Discord channel!
1️⃣ Not yet
2️⃣ Not yet, but I have read about it!
3️⃣ A little
4️⃣ A lot
arrow::open_dataset()
1.15 billion rows 🤯
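A minimal sketch of how `open_dataset()` points at a directory of files without reading them into memory. The toy data, the `toy_path` location, and the `toy_taxi` name are assumptions made up for illustration; they are not the workshop's NYC taxi files.

```r
library(arrow)

# Toy stand-in for the taxi data (hypothetical values, not the real dataset)
toy_trips <- data.frame(
  year = c(2019L, 2019L, 2020L, 2020L),
  passenger_count = c(1L, 3L, 2L, 1L)
)

# Write one directory of Parquet files per year into a temporary folder
toy_path <- file.path(tempdir(), "toy_taxi")
write_dataset(toy_trips, toy_path, partitioning = "year")

# Open the directory as a Dataset; the rows stay on disk until collected
toy_taxi <- open_dataset(toy_path)

# Row count across all files in the dataset
nrow(toy_taxi)
```

The same call pattern applies to a much larger directory; only the path changes.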
What percentage of taxi rides each year had more than 1 passenger?
library(dplyr)

nyc_taxi |>
  group_by(year) |>
  summarise(
    all_trips = n(),
    shared_trips = sum(passenger_count > 1, na.rm = TRUE)
  ) |>
  mutate(pct_shared = shared_trips / all_trips * 100) |>
  collect()
# A tibble: 10 × 4
    year all_trips shared_trips pct_shared
   <int>     <int>        <int>      <dbl>
 1  2012 178544324     53313752       29.9
 2  2013 173179759     51215013       29.6
 3  2014 165114361     48816505       29.6
 4  2015 146112989     43081091       29.5
 5  2016 131165043     38163870       29.1
 6  2017 113495512     32296166       28.5
 7  2018 102797401     28796633       28.0
 8  2019  84393604     23515989       27.9
 9  2020  24647055      5837960       23.7
10  2021  30902618      7221844       23.4
6.077 sec elapsed
Calculate the longest trip distance for every month in 2019
How long did this query take to run?
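One possible shape for this exercise, sketched on a tiny in-memory Arrow table so it runs anywhere. The column names `month` and `trip_distance`, the toy values, and the `toy_taxi` name are assumptions; swap in `nyc_taxi` and its actual column names when running against the real data. Timing here uses base R's `system.time()` rather than whatever timer produced the figure on the earlier slide.

```r
library(arrow)
library(dplyr)

# Hypothetical toy data standing in for nyc_taxi
toy_taxi <- arrow_table(
  year = c(2019L, 2019L, 2019L, 2020L),
  month = c(1L, 1L, 2L, 1L),
  trip_distance = c(2.5, 10.0, 4.2, 99.0)
)

# Time the whole pipeline, including collect(), which triggers execution
timing <- system.time(
  longest <- toy_taxi |>
    filter(year == 2019) |>
    group_by(month) |>
    summarise(longest_trip = max(trip_distance, na.rm = TRUE)) |>
    arrange(month) |>
    collect()
)

longest
timing[["elapsed"]]
```

Filtering before grouping keeps the work limited to the 2019 rows, and nothing is computed until `collect()` is reached.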
A multi-language toolbox for accelerated data interchange and in-memory processing
Arrow is designed to improve both the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another.
In-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like data sets in-memory.
Arrow’s Columnar Format is Fast