Welcome to

Using Databricks
with R

Workshop
August 12

Housekeeping items




Schedule

9am - 10:30am

Break (30mins)

11am - 12:30pm

Lunch (1hr)

1:30pm - 3pm

Break (30mins)

3:30pm - 5pm

The team

Edgar Ruiz

Instructor

James Blair

TA

Materials

  • Posit Workbench Server
  • Databricks Cluster
  • Deck
  • Exercise book

Posit + Databricks

Special work we have done as part of the new partnership



Let’s get started…

Log into Posit Workbench

Step 1 - Landing Page

Navigate to conf.posit.team, click on Posit Workbench

Step 2 - OpenID Page

Click on Sign in with OpenID

Step 3 - KeyCloak Page

Click on the GitHub icon

Step 4 - GitHub

Log into GitHub, and/or approve the RStudio org

Step 5 - Posit Workbench homepage

Click on the New Session button

Step 6 - Setup new session

Confirm that the image matches what’s on the screenshot (1), then click on Start Session (2).

Step 7 - Run setup command

In the R console type

workshop:::begin()

ℹ Adding DATABRICKS_HOST to .Renviron file
ℹ Adding DATABRICKS_TOKEN to .Renviron file
- Enter password: 




“Embrace the remoteness”

Martin Grund, Databricks

Key concept


🔑 Data & processing need to be as physically close as possible

Working with Databricks

“Default” approach

“Default” approach

“Default” approach

Better approach!

Better approach!

Better approach!

Takeaways

  • Download data if doing something only R can do
  • 🔑 Data & processing need to be as physically close as possible
  • Move most of the processing to Databricks
  • But how?…

Unit 1

Accessing
Databricks



Using RStudio

In the “Connections” pane, select “New Connection”

Using RStudio

Select ‘warehouse’

Using RStudio

Click ‘OK’

RStudio DB Navigator

Explore the catalogs, schemas, and tables

Exercises 1.1 - 1.2

How everything connects

Connecting - Good

Easily connect with dbConnect() & odbc()

con <- dbConnect(odbc::odbc(),
  Driver = ".../lib/libsparkodbc_sb64-universal.dylib",
  Host = Sys.getenv("DATABRICKS_HOST"),
  PWD = Sys.getenv("DATABRICKS_TOKEN"),
  HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e",
  Port = 443,
  AuthMech = 3,
  Protocol = "https",
  ThriftTransport = 2,
  SSL = 1,
  UID = "token")

Connecting - Better

  • odbc::databricks() simplifies the setup
  • Automatically detects credentials, driver, and sets default arguments
  • HTTPPath is the only argument that you will need


con <- dbConnect(
  odbc::databricks(),
  HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e"
)

Exercises 1.3 - 1.4

Using DBI

  • dbConnect() - Connects
  • dbDisconnect() - Disconnects
  • dbListTables() - Lists the tables
  • dbListFields() - Lists the fields of a table
  • dbGetQuery() - Runs query & collects results
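
A minimal sketch of these calls in sequence, assuming the con connection created earlier and the diamonds table used in the examples:

library(DBI)

# List the tables available through the connection
dbListTables(con)

# List the fields of a specific table
dbListFields(con, "diamonds")

# Run a query and collect the results into an R data frame
dbGetQuery(con, "SELECT COUNT(*) AS n FROM diamonds")

# Disconnect when done
dbDisconnect(con)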

Exercise 1.5

How everything connects (Revisited)

Other RStudio interfaces

knitr SQL engine

RStudio SQL script
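
A minimal sketch of the knitr SQL engine, assuming con is the connection created earlier; the optional output.var chunk option stores the result as a data frame:

```{sql, connection=con, output.var="avg_price_by_cut"}
SELECT cut, AVG(price) AS avg_price
FROM diamonds
GROUP BY cut
```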

Exercise 1.6 - 1.7

Unit 2

Remote
processing

Approaches

Preferred

“Use in case of emergencies”

SQL not practical

  • The SQL language was not meant for exploratory data analysis
  • “Cognitive cost” of switching between SQL and R
  • It is not as easy to plot, model, or present results in SQL as it is with R
select `cut`,
  avg(`price`) as `avg_price`
from `diamonds`
group by `cut`
order by `avg_price` desc

Use dplyr & dbplyr

  • Translates R code to SQL!
  • ‘Modularity’ of piped code
  • Guardrails against pulling all of the data
  • All your code is in R


tbl(con, "diamonds") |> 
  group_by(cut) |> 
  summarise(
    avg_price = mean(price, na.rm = TRUE)
  ) |> 
  arrange(desc(avg_price))

🔑 How does it work?

  • Creates a “virtual” table
  • Behaves like a regular data frame
  • It contains no data
  • It’s a pointer to the database table


tbl_diamonds <- tbl(con, "diamonds")

tbl_diamonds |>
  count()

Source: SQL [1 x 1]
Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
   n
<int64>
1 53940

🔑 How does it work?


  • Create a variable in R without importing data
  • SQL is sent only when data is requested
  • Behind the scenes, translates code to SQL
tbl_diamonds |>
  count() |>
  show_query()

<SQL>
SELECT COUNT(*) AS `n`
FROM `diamonds`

Exercises 2.1 - 2.2

PSA

Please, please, please…don’t pass a full query to tbl()


tbl(con, "Select * from `table` where x = 1")




PSA

Please, please, please…don’t pass a full query to tbl()

tbl(con, "Select * from `table` where x = 1")

tbl_table <- tbl(con, "table")
x_1 <- tbl_table |>
  filter(x == 1)

Data collection

There are 3 commands that download data:

  • print() - Downloads the first 1K rows by default (Guardrails)
  • collect() - Downloads all of the data
  • pull() - Downloads all of the data of a single field

tbl_cars <- tbl(con, "cars")
print(tbl_cars)
tbl_cars # Print is implied
collect(tbl_cars)
pull(tbl_cars, mpg)

Exercise 2.3

Databricks Unity Catalog (UC)

Centralizes access control, auditing, lineage, and data discovery across workspaces.

Accessing the UC

“diamonds” can be accessed without specifying catalog and schema

# Why do you work?!
tbl(con, "diamonds")

Accessing the UC

“hive_metastore” & “default” are the default catalog & schema respectively

# Oooh, that’s why!
tbl(con, "diamonds")

Non-default catalog

How do I access tables not in the default catalog?

tbl(con, "customer")
x (31740) Table or view not found: ..customer

Non-default catalog

How do I access tables not in the default catalog?

tbl(con, "customer")
x (31740) Table or view not found: ..customer

tbl(
  con,
  "workshops.tpch.customer"
)
x (31740) Table or view not found: ..workshops.tpch.customer

Non-default catalog

Use I() to create the correct table reference

tbl(
  con,
  I("workshops.tpch.customer")
)

Exercise 2.4

Unit 3

Preparing
and exploring
data

Variable selection

Using select() with these functions makes variable selection easy:

  • starts_with() - Starts with an exact prefix
  • ends_with() - Ends with an exact suffix
  • contains() - Contains a literal string
  • everything() - Selects everything else. Use with other selection functions.
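
A minimal sketch using the diamonds table from the earlier examples (its column names are assumed):

tbl_diamonds <- tbl(con, "diamonds")

# Columns whose names start with "c": carat, cut, color, clarity
tbl_diamonds |>
  select(starts_with("c"))

# Put price first, then keep everything else in its original order
tbl_diamonds |>
  select(price, everything())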

Exercise 3.1

Joining tables

left_join()

right_join()

full_join()
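
A minimal sketch of the three joins, assuming the TPC-H customer table from Unit 2 and an orders table in the same schema:

orders    <- tbl(con, I("workshops.tpch.orders"))
customers <- tbl(con, I("workshops.tpch.customer"))

# Keep every order, add customer columns where there is a match
orders |> left_join(customers, by = c("o_custkey" = "c_custkey"))

# Keep every customer, add order columns where there is a match
orders |> right_join(customers, by = c("o_custkey" = "c_custkey"))

# Keep all rows from both tables
orders |> full_join(customers, by = c("o_custkey" = "c_custkey"))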

Exercise 3.2

Preparing data

Load prepared data to a variable, without downloading it

prep_orders <- orders |> 
  left_join(customers, by = c("o_custkey" = "c_custkey")) |> 
  left_join(nation, by = c("c_nationkey" = "n_nationkey")) |> 
  mutate(
    order_year = year(o_orderdate), 
    order_month = month(o_orderdate)
  ) |> 
  rename(customer = o_custkey) |> 
  select(-ends_with("comment"),  -ends_with("key"))

Exercise 3.3

Exploring data

Use the base query for data exploration

prep_orders |> 
  group_by(n_name) |> 
  summarise(total_price = sum(o_totalprice, na.rm = TRUE)) |> 
  arrange(desc(total_price)) |> 
  head(5)
#>   n_name     total_price
#>   <chr>            <dbl>
#> 1 IRAQ      45968155784.
#> 2 INDONESIA 45921526909.
#> 3 GERMANY   45725194460.
#> 4 FRANCE    45713656960.
#> 5 IRAN      45708446851.

Exercise 3.4

Unit 4

Visualizations

Plotting local data

Plotting remote data

What that looks like in R

Pipe the prepared data into ggplot2

prep_orders |> 
  group_by(order_year) |> 
  summarise(
    total_price = sum(o_totalprice, na.rm = TRUE)
  ) |> 
  arrange(order_year) |> 
  ggplot() +
  geom_col(aes(order_year, total_price)) 

What that looks like in R

Pipe the prepared data into ggplot2

prep_orders |> 
  group_by(order_year) |> 
  summarise(
   total_price = sum(o_totalprice, na.rm = TRUE)
  ) |> 
  arrange(order_year) |> 
  ggplot() +
  geom_col(
    aes(order_year, total_price)
    ) 

What that looks like in R

Pipe the prepared data into ggplot2

prep_orders |> 
  group_by(order_year) |> 
  summarise(
    total_price = sum(o_totalprice, na.rm = TRUE)
  ) |> 
  collect() |> # Why are you missing?
  arrange(order_year) |> 
  ggplot() +
  geom_col(aes(order_year, total_price)) 

ggplot2 “auto-collects”

Be careful! It downloads all of the data!

nation <- tbl(
  con, 
  I("workshops.tpch.nation")
  )

nation |> 
  ggplot() +
  geom_col(aes(n_name, n_regionkey))

Exercise 4.1

The truth about ggplot2

A plot gets refined iteratively, so the data must be local.

  • ‘Improve scales’
  • ‘Add labels to the data’
  • ‘Add title and subtitle’
  • ‘Improve colors’
  • ‘Change the theme’

What that looks like in R

Collect (download) the data into an R variable

sales_by_year <- prep_orders |> 
  group_by(order_year) |> 
  summarise(
    total_price = sum(o_totalprice, na.rm = TRUE)
  ) |> 
  collect()

What that looks like in R

Iterate on the plot using the local data

sales_by_year |> 
  ggplot() +
  geom_col(aes(order_year, total_price)) 

What that looks like in R

Iterate on the plot using the local data

breaks <- as.double(quantile(c(0, max(sales_by_year$total_price))))
breaks_labels <- paste(round(breaks / 1000000000, 1), "B")
sales_by_year |> 
  ggplot() +
  geom_col(aes(order_year, total_price)) +
  scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
  scale_y_continuous(breaks = breaks, labels = breaks_labels) +
  xlab("Year") +
  ylab("Total Sales") +
  labs(title = "Sales by year") +
  theme_light()

Exercises 4.2 - 4.4

Unit 5

Databricks
Connect

Databricks Connect

  • Spark Connect offers true remote connectivity
  • Uses gRPC as the communication interface
  • Databricks Connect ‘v2’ is based on Spark Connect (DBR 13+)

Databricks Connect

databricks-connect integrates with gRPC via pyspark

Databricks Connect

sparklyr integrates with databricks-connect via reticulate

Why not just use ‘reticulate’?

sparklyr extends functionality and user experience

  • dplyr back-end
  • DBI back-end
  • R UDFs
  • RStudio and Positron Connections Pane integration

Getting started

  • Python 3.10+
  • A Python environment with databricks-connect and its dependencies
  • pysparklyr extension


install.packages("pysparklyr")

library(sparklyr) 

sc <- spark_connect(
 cluster_id = "[cluster's id]",
 method = "databricks_connect"
 )

Getting started

Automatically checks for, and installs, the needed Python packages

install.packages("pysparklyr")
library(sparklyr) 
sc <- spark_connect(
  cluster_id = "[cluster's id]", 
  method = "databricks_connect"
  )
#> ! Retrieving version from cluster '1026-175310-7cpsh3g8'
#> Cluster version: '14.1' 
#> ! No viable Python Environment was identified for 
#> Databricks Connect version 14.1 
#> Do you wish to install Databricks Connect version 14.1? 
#> 1: Yes
#> 2: No
#> 3: Cancel

Exercise 5.1



Understanding Spark’s data caching

Spark’s data capabilities

  • Spark has the ability to cache large amounts of data
  • Amount of data is limited by the size of the cluster
  • Data in cache is the fastest way to access data in Spark

Default approach

Data is read and processed. Results go to R.

About this approach

  • Well suited for exploring the entirety of the data, usually to find relevant variables
  • Not efficient when accessing the same fields and rows over and over

Uploading data from R

copy_to() to upload data to Spark
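
A minimal sketch, assuming sc is the Databricks Connect session from the previous exercise:

# Upload a local R data frame to the Spark session
tbl_mtcars <- copy_to(sc, mtcars, overwrite = TRUE)

# The result is a remote table that works with dplyr verbs
tbl_mtcars |>
  count()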

Exercise 5.2

Caching data

A two-step process. First, cache all or some of the data in memory

Caching data

Second, read and process from memory, which is much faster

About this approach

  • Well suited when accessing the same fields and rows over and over
  • Best when running models and UDFs (More about this later)
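
A minimal sketch of the two steps using compute(), which runs the query and caches the result as a temporary Spark table (the table reference is an assumption; use the table from your exercise):

orders <- tbl(sc, I("workshops.tpch.orders"))

# Step 1: cache only the columns of interest in memory
cached_orders <- orders |>
  select(o_orderdate, o_totalprice) |>
  compute()

# Step 2: later queries read from the cached copy
cached_orders |>
  summarise(total = sum(o_totalprice, na.rm = TRUE))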

Exercise 5.3

Reading files

By default, files are read and their contents cached in memory

Reading files

Afterwards, the data is read from memory for processing

About this approach

  • Read files using the spark_read... family of functions
  • The file path needs to be relative to your Databricks environment
  • Databricks Volumes are ideal for this approach
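
A minimal sketch, assuming a CSV file in a Unity Catalog Volume (the path is illustrative):

# Reads the file and caches its contents in the cluster's memory
tbl_sales <- spark_read_csv(
  sc,
  name = "sales",
  path = "/Volumes/workshops/my_schema/my_volume/sales.csv"
)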

Exercise 5.4

“Mapping” files

The files can be mapped without importing them into memory

“Mapping” files

Data is read and processed. Results sent to R.
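
A minimal sketch: the same call with memory = FALSE maps the file instead of importing it (path illustrative):

tbl_sales <- spark_read_csv(
  sc,
  name = "sales",
  path = "/Volumes/workshops/my_schema/my_volume/sales.csv",
  memory = FALSE  # map only; rows are read as queries request them
)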

Exercise 5.5

Partial cache

Alternatively, you can cache specific data from the files

Partial cache

Afterwards, the data is read from memory for processing
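
A minimal sketch: map the file, then cache only the slice you need with compute() (path and column names are illustrative):

tbl_sales <- spark_read_csv(
  sc,
  name = "sales",
  path = "/Volumes/workshops/my_schema/my_volume/sales.csv",
  memory = FALSE
)

# Cache only the rows and columns of interest
tbl_recent <- tbl_sales |>
  filter(order_year == 2024) |>
  select(order_year, total_price) |>
  compute()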

Exercise 5.6

Very large files, read or map

  • Reading, you “pay” in time at the beginning
  • Mapping, you “pay” in time as you access the data
  • For extended EDA, reading would be better
  • For EDA of targeted data, partial caching would be better
  • For jobs that pull a predetermined set of data, mapping would be better

End game

Combine the data from any approach. Cache the resulting table
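
A minimal sketch, assuming tbl_sales is a mapped file and tbl_regions is a small lookup table uploaded with copy_to() (names and join key are illustrative):

# Combine data obtained through different approaches,
# then cache the combined result for repeated use
tbl_combined <- tbl_sales |>
  left_join(tbl_regions, by = "region") |>
  compute()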

Exercise 5.7

Unit 6

Intro to
R UDFs



What is a UDF?

  • Stands for “User Defined Function”
  • Enables operations not built into Spark
  • Can be written in Scala, Python, or R

Ok, but what does that mean?!

Ok, but what does that mean?!



We can run R code inside Spark! 🎉 🎉

But first… parallel processing

Spark partitions the data logically

Parallel processing

Each node gets one, or several partitions

Parallel processing

The cluster runs jobs that process each partition in parallel

Parallel processing

Your R function runs on each partition independently

Parallel processing

Results are appended and returned to the user as a single table

Accessing in R

  • spark_apply() enables access to the R runtime installed in the cluster
  • The R function will run over each individual partition
tbl_mtcars <- copy_to(sc, 
                      mtcars)

tbl_mtcars |> 
  spark_apply(nrow)
#>       x
#>   <dbl>
#> 1     8
#> 2     8
#> 3     8
#> 4     8

Exercise 6.1

Group by variable

Use group_by to override the partitions, and divide data by a column

tbl_mtcars |> 
  spark_apply(nrow, group_by = "am")
#> # Source:   table<`sparklyr_tmp_table`> [2 x 2]
#> # Database: spark_connection
#>      am     x
#>   <dbl> <dbl>
#> 1     0    19
#> 2     1    13

Exercise 6.2

Custom functions

The R function should expect a data frame and return a data frame
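
A minimal sketch: the function receives each partition as an R data frame and returns a data frame.

tbl_mtcars |>
  spark_apply(
    function(x) {
      # x is one partition; return one summary row per partition
      data.frame(rows = nrow(x), avg_mpg = mean(x$mpg))
    }
  )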

Exercise 6.3

R packages

R packages have to be referenced from within your function’s code using library() or ::

tbl_mtcars |> 
  spark_apply(function(x) {
      library(broom)
      model <- lm(mpg ~ ., x)
      tidy(model)[1,]
      }, 
      group_by = "am"
      )
#> # Source:   table<`sparklyr_tmp`> [2 x 6]
#> # Database: spark_connection
#>      am term        estimate std_error statistic p_value
#>   <dbl> <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1     0 (Intercept)     8.64      21.5     0.402   0.697
#> 2     1 (Intercept)  -138.        69.1    -1.99    0.140

Exercise 6.4

Dependencies in Databricks

Spark clusters are tied to a Databricks Runtime version. Each version of DBR contains specific versions of:

  • Spark
  • Java
  • Scala
  • Python
  • R
  • Set of Python and R packages

Missing packages

  • Additional packages can be installed
  • In the Databricks portal, use the Libraries tab of the cluster
  • Python packages will install from PyPI
  • R packages will install from CRAN

Key considerations

  • You are dealing with two different environments
  • Your Python version must match the cluster
  • The R version can be different, but that could be a source of errors
  • The same goes for Python and R packages: a mismatch is OK, but it could cause errors

Key considerations

An example of different R versions

my_func <- function(x) R.Version()$minor
# Local output ------------------------
my_func(mtcars)
#> [1] "4.0"
# Remote output ----------------------
tbl_mtcars |> 
  spark_apply(my_func) |> 
  head(1)
#>   x    
#>   <chr>
#> 1 3.1

Exercise 6.5



Solving for R UDFs



Solving for R UDFs

It wasn’t easy…

Problem


Spark Connect does not support R UDFs natively

Problem


Spark Connect does not support R UDFs natively

(Databricks Connect v2, is based on Spark Connect)

In Python

Code is sent through gRPC and runs in the remote Python environment through Spark

R UDFs

The R code is inserted into a Python script, which rpy2 executes

Wrapped Python functions

R:

spark_apply(tbl, my_fun) 

Python:

tbl.mapInPandas(r_objects.r(my_fun))

R:

spark_apply(tbl, my_fun, group_by = "col1") 

Python:

tbl.groupBy(col1).applyInPandas(r_objects.r(my_fun))

Unit 7

Modeling
with R UDFs



Run R models in Spark

1. Sample - Use dplyr

  • Use dplyr’s slice_sample() function
  • Download enough data to be representative



local_df <- remote_df |>
 slice_sample(n = 100) |>
 collect()

Exercise 7.1

2. Train - Ideally, tidymodels

Fit the model locally in R

library(tidymodels)
my_rec <- recipe(x ~ ., data = local_df) |> 
  step_normalize(all_numeric_predictors()) 
  
my_reg <- linear_reg()

my_wkf <- workflow() |> 
  add_model(my_reg) |> 
  add_recipe(my_rec)

model <- fit(my_wkf, local_df)

Exercise 7.2

3. Publish - Package with vetiver

  • vetiver packages tidymodels workflows and models
  • Model is automatically reduced in size
  • Output is easy to load into another R session


library(vetiver)
m <- vetiver_model(
 model = model,
 model_name = "model"
 )
saveRDS(m, "model.rds")

Exercise 7.3

3. Publish - Transfer model file

  • Copy the R model to a location Spark can access
  • The location has to be secure
  • Options: Databricks UC Volumes, S3 buckets, Posit Connect, and others

3. Publish - Databricks UC Volume

4. Predict - Create the function

It needs to read the model file, run the prediction, and process the results

my_function <- function(x) {
  library(workflows)
  library(vetiver)
  model <- readRDS("model.rds")
  preds <- predict(model, x)
  x$pred <- preds[,1][[1]]
  x
}

4. Predict - Test the function locally

Good practice to test the function with local data


my_function(local_df)

Exercise 7.4

4. Predict - Switch to published

Update the function to use the copy of the model in Databricks

my_function <- function(x) {
  library(workflows)
  library(vetiver)
  if(file.exists("model.rds")) {
    model <- readRDS("model.rds")
  } else {
    model <- readRDS("/Volumes/[catalog]/[schema]/[volume]/model.rds")
  }
  preds <- predict(model, x)
  x$pred <- preds[,1][[1]]
  x
}

4. Predict - Test remotely

Use a sample or top rows to test the function in Spark


remote_df |> 
  head() |> 
  spark_apply(my_function)

4. Predict - Run predictions

Run the full set, and make sure the results are saved


pred_df <- remote_df |> 
  spark_apply(my_function) |> 
  compute()

Exercise 7.5

4. Predict

  1. Create the function
  2. Test the function locally
  3. Switch to published model
  4. Test function remotely
  5. Run predictions

Run R models in Spark

Unit 8

Working with
LLMs in
Databricks





Enhance SQL operations using LLMs

Databricks LLMs via SQL

  • Databricks allows us to access LLM functionality right from SQL
  • Accessible by calling the ai_ family of functions
  • A new way of leveraging LLMs: batch scoring


> SELECT ai_analyze_sentiment(
'I am happy');
  positive


> SELECT ai_analyze_sentiment(
'I am sad');
negative

Call ai functions in dplyr

  • Call the ai functions via dplyr verbs
  • dbplyr passes unrecognized functions ‘as-is’ in the query
orders |>
  head() |> 
  select(o_comment) |> 
  mutate(
    sentiment = ai_analyze_sentiment(o_comment)
    )
#> # Source:   SQL [6 x 2]
#>   o_comment                   sentiment
#>   <chr>                        <chr>    
#> 1 ", pending theodolites …    neutral  
#> 2 "uriously special foxes …   neutral  
#> 3 "sleep. courts after the …  neutral  
#> 4 "ess foxes may sleep …      neutral  
#> 5 "ts wake blithely unusual … mixed    
#> 6 "hins sleep. fluffily …     neutral

Available LLM functions

  • ai_analyze_sentiment - Text is positive, negative, neutral, etc.
  • ai_classify - Given a set of choices, the LLM picks the one that best matches the text
  • ai_extract - Have the LLM extract specific elements from the text
  • ai_fix_grammar - Makes grammar more better
  • ai_mask - Replaces specified elements with ‘[Masked]’
  • ai_similarity - Compares 2 texts, returns a score indicating similarity
  • ai_summarize - Summarizes text; you can also specify the length of the output
  • ai_translate - Tells the LLM to translate to a specific language
  • ai_forecast - Time series prediction, not text

Considerations in Databricks

  • Requires a region that supports pay-per-token APIs
  • Functions not available on Databricks SQL Classic
  • In preview, functions have performance restrictions
  • Functions are tuned for English, but model can handle several languages
  • Mixtral-8x7B Instruct is the underlying model that powers most of these AI functions

Exercise 8.1

Specify array

  • Most ai functions require an argument of ARRAY type
  • SQL’s array() function can be called directly inside the ai function

SQL

> SELECT
     description,
     ai_classify(
     description, array('clothing', 'furniture')) AS category
   FROM
     products

R

tbl_products |> 
  mutate(
    category = ai_classify(
      description, array("clothing", "furniture"))
  ) |> 
  select(description, category)

Exercise 8.2

Process complex output

  • Some ai functions return a STRUCT
  • It is a more complex variable type
  • It can only be coerced to a string
  • as.character() does the trick
tbl_reviews |> 
  mutate(
    product = as.character(
      ai_extract(review, array("product"))
      )
    ) |> 
  select(product, review)
#> # Source:   SQL [4 x 2]
#>   product           review                                                      
#>   <chr>             <chr>                                                       
#> 1 {toaster}         This is the best toaster I have ever bought                 
#> 2 {Toaster}         Toaster arrived broken, waiting for replancement            
#> 3 {washing machine} The washing machine is as advertised, cant wait to use it   
#> 4 {television}      Not sure how to feel about this tevelision, nice brightness…

Exercise 8.3




Chat with LLMs from R

Introducing chattr

  • chattr is an interface to LLMs (Large Language Models)
  • It enables interaction with the model directly from RStudio
  • Submit a prompt to the LLM from your script, or by using the provided Shiny Gadget.



library(chattr)

chattr("my programming question")

chattr_app()

chattr supports Databricks’ LLM

chattr supports Databricks’ LLM


Big thank you to Zack Davies at Databricks!! 🎉🎉🎉

chattr supports Databricks’ LLM

Automatically makes the options available if it detects your Databricks token.

> chattr_app()
── chattr - Available models 
Select the number of the model you would like to use:

1: Databricks - databricks-dbrx-instruct (databricks-dbrx) 

2: Databricks - databricks-meta-llama-3-70b-instruct (databricks-meta-llama3-70b) 

3: Databricks - databricks-mixtral-8x7b-instruct (databricks-mixtral8x7b) 

Selection: 1

Exercise 8.4

Unit 9

Dashboards



Why dashboards?


🔑 Two fundamental points:

  • App load time matters! The faster the better!
  • Download raw data only if there is no other option

Important features

  • Data driven dropdowns

Important features

  • Data driven dropdowns
  • Interactive plots (hover over)

Important features

  • Data driven dropdowns
  • Interactive plots (hover over)
  • Drill down when interacting with the plot

Exercise 9.1



Data driven dropdowns

When do we need them?

When options in the data change regularly
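
A minimal sketch, assuming the prep_orders table from Unit 3:

# Pull the distinct country names from the database
countries <- prep_orders |>
  distinct(n_name) |>
  pull(n_name)

# Use them to populate the dropdown
selectInput("country", "Country:", choices = countries, selected = "FRANCE")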

Exercise 9.2

Adding the plot

Move the code from the previous unit to the app

sales_by_year <- prep_orders |>
  filter(n_name == country) |>
  group_by(order_year) |>
  summarise(
    total_price = sum(o_totalprice, na.rm = TRUE)
  ) |>
  collect()
  
breaks <- as.double(quantile(c(0, max(sales_by_year$total_price))))
breaks_labels <- paste(round(breaks / 1000000000, 1), "B")

sales_by_year |>
  ggplot() +
  geom_col(aes(order_year, total_price)) +
  scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
  scale_y_continuous(breaks = breaks, labels = breaks_labels) +
  xlab("Year") +
  ylab("Total Sales") +
  labs(title = "Sales by year", subtitle = country) +
  theme_light()
shinyApp(
  ui = fluidPage(
    selectInput("country", "Country:", choices = countries, selected = "FRANCE"),
    plotOutput("sales_plot")
  ),
  server = function(input, output) {
    output$sales_plot <- renderPlot({
      # Your code here
    })
  },
  options = list(height = 200)
)

Adding the plot

Add the input$ prefix to both instances of country

sales_by_year <- prep_orders |>
  filter(n_name == country) |>
  group_by(order_year) |>
  summarise(total_price = sum(o_totalprice, na.rm = TRUE)) |>
  collect()

sales_by_year |>
  ggplot() +
  geom_col(aes(order_year, total_price)) +
  scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
  scale_y_continuous(breaks = breaks, labels = breaks_labels) +
  xlab("Year") +
  ylab("Total Sales") +
  labs(title = "Sales by year", subtitle = country) +
  theme_light()

Adding the plot

Add the input$ prefix to both instances of country

sales_by_year <- prep_orders |>
  filter(n_name == input$country) |>
  group_by(order_year) |>
  summarise(total_price = sum(o_totalprice, na.rm = TRUE)) |>
  collect()

sales_by_year |>
  ggplot() +
  geom_col(aes(order_year, total_price)) +
  scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
  scale_y_continuous(breaks = breaks, labels = breaks_labels) +
  xlab("Year") +
  ylab("Total Sales") +
  labs(title = "Sales by year", subtitle = input$country) +
  theme_light()

Exercise 9.3



Interactive Plots

ggiraph

  • Allows you to create dynamic ggplot graphs.
  • Adds tooltips, hover effects, and JavaScript actions
  • Enables selecting elements of a plot inside a shiny app
  • Adds interactivity to ggplot geometries, legends and theme elements

Interactive Plots

Replace with ggiraph functions

  ui = fluidPage(
    selectInput("country", "Country:", choices = countries, selected = "FRANCE"),
    plotOutput("sales_plot")),
  server = function(input, output) {
    output$sales_plot <- renderPlot({
      sales_by_year <- prep_orders |>
        filter(n_name == !!input$country) |>
        group_by(order_year) |>
        summarise(total_price = sum(o_totalprice, na.rm = TRUE)) |>
        collect()
      sales_by_year |>
        ggplot() +
        geom_col(aes(order_year, total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +

Interactive Plots

Replace with ggiraph functions

  ui = fluidPage(
    selectInput("country", "Country:", choices = countries, selected = "FRANCE"),
    girafeOutput("sales_plot")),
  server = function(input, output) {
    output$sales_plot <- renderGirafe({
      sales_by_year <- prep_orders |>
        filter(n_name == !!input$country) |>
        group_by(order_year) |>
        summarise(total_price = sum(o_totalprice, na.rm = TRUE)) |>
        collect()
      sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +

Interactive Plots

Add girafe() output step

      sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

Interactive Plots

Add girafe() output step

      g <- sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

      girafe(ggobj = g)

Exercise 9.4



Plot drill-down

Plot drill down

Add a showModal() when a column is clicked

      g <- sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price, data_id = order_year, tooltip = total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

      girafe(ggobj = g)
    })
  },
  options = list(height = 200)

Plot drill down

Add a showModal() when a column is clicked

      g <- sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price, data_id = order_year, tooltip = total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

      girafe(ggobj = g)
    })
    
    observeEvent(input$sales_plot_selected, {
      showModal(modalDialog())
      })
  },
  options = list(height = 200)

Plot drill down

Allow only one column to be selected at a time

      g <- sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price, data_id = order_year, tooltip = total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

      girafe(ggobj = g)
    })
        
    observeEvent(input$sales_plot_selected, {
      showModal(modalDialog())
      })        
  },
  options = list(height = 200)

Plot drill down

Allow only one column to be selected at a time

      g <- sales_by_year |>
        ggplot() +
        geom_col_interactive(aes(order_year, total_price, data_id = order_year, tooltip = total_price)) +
        scale_x_continuous(breaks = unique(sales_by_year$order_year)) +
        scale_y_continuous(breaks = breaks, labels = breaks_labels) +
        xlab("Year") +
        ylab("Total Sales") +
        labs(title = "Sales by year", subtitle = input$country) +
        theme_light()

      girafe(ggobj = g, options = list(opts_selection(type = "single")))
    })
    
    observeEvent(input$sales_plot_selected, {
      showModal(modalDialog())
      })
  },
  options = list(height = 200)

Plot drill down

Add a dynamic title using the country and year selected

  observeEvent(input$sales_plot_selected, {
        showModal(modalDialog())
        })

Plot drill down

Add a dynamic title using the country and year selected

observeEvent(input$sales_plot_selected, {
 showModal(
  modalDialog(
   title = paste0(input$country, " - ", input$sales_plot_selected)
  )
 )

Plot drill down

Add a new renderPlot() function with the code that plots the data by month

  output$monthly_sales <- renderPlot({
    sales_by_month <- prep_orders |>
      filter(
        n_name == !!input$country, 
        order_year == !!input$sales_plot_selected
        ) |>
      group_by(order_month) |>
      summarise(
        total_price = sum(o_totalprice, na.rm = TRUE)
      ) |>
      collect()

    breaks <- as.double(quantile(c(0, max(sales_by_month$total_price))))
    breaks_labels <- paste(round(breaks / 1000000000, 1), "B")

    sales_by_month |>
      ggplot() +
      geom_col(aes(order_month, total_price)) +
      scale_x_continuous(breaks = unique(sales_by_month$order_month)) +
      scale_y_continuous(breaks = breaks, labels = breaks_labels) +
      xlab("Year") +
      ylab("Total Sales") +
      labs(
        title = "Sales by month", 
        subtitle = paste0(input$country, " - ", input$sales_plot_selected)
        ) +
      theme_light()
    })

Plot drill down

Add plotOutput("monthly_sales") to the modalDialog() call

observeEvent(input$sales_plot_selected, {
 showModal(
  modalDialog(
   title = paste0(input$country, " - ", input$sales_plot_selected)
  )
 )

Plot drill down

Add plotOutput("monthly_sales") to the modalDialog() call

observeEvent(input$sales_plot_selected, {
 showModal(
  modalDialog(
   title = paste0(input$country, " - ", input$sales_plot_selected),
   plotOutput("monthly_sales")
  )
 )




Additional resources

Official Site

spark.posit.co

Cheatsheet

spark.posit.co/images/homepage/sparklyr.pdf

Ask questions

forum.posit.co

Let us know what you thought


Please take our survey:
pos.it/conf-workshop-survey