2  Remote processing

2.1 Create a table variable

The basics of pointing a variable in R to a table or view inside the database

  1. Load the dplyr, dbplyr, and DBI libraries
library(dplyr)
library(dbplyr)
library(DBI)
  2. (Optional) Open a connection to the database if it’s currently closed
con <- dbConnect(
  odbc::databricks(),
  HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e"
)
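If you are unsure whether a connection is already open, one way to make this step conditional is DBI’s dbIsValid(); a small sketch:
# Re-connect only when no live connection exists
if (!exists("con") || !dbIsValid(con)) {
  con <- dbConnect(
    odbc::databricks(),
    HTTPPath = "/sql/1.0/warehouses/300bd24ba12adf8e"
  )
}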
  3. Use dbGetQuery() to run a query that pulls the diamonds table
dbGetQuery(con, "select * from diamonds")
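Note that dbGetQuery() downloads the full result set into R, so for a quick look it can help to cap the rows, for example with a LIMIT clause (a sketch):
dbGetQuery(con, "select * from diamonds limit 10")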
  4. Use tbl() to perform the same operation
tbl(con, "diamonds")
#> # Source:   table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    `_c0` carat cut       color clarity depth table price     x     y     z
#>    <int> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ more rows
  5. Load the reference, not the table data, into a variable
tbl_diamonds <- tbl(con, "diamonds")
  6. Call the variable to preview the data in the table
tbl_diamonds
#> # Source:   table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    `_c0` carat cut       color clarity depth table price     x     y     z
#>    <int> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ more rows
  7. Add count() to easily get the number of rows
tbl_diamonds |> 
  count()
#> # Source:   SQL [1 x 1]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>         n
#>   <int64>
#> 1   53940
  8. Add cut as an argument to count() to see the count by that field
tbl_diamonds |> 
  count(cut)
#> # Source:   SQL [5 x 2]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>   cut             n
#>   <chr>     <int64>
#> 1 Ideal       21551
#> 2 Premium     13791
#> 3 Very Good   12082
#> 4 Good         4906
#> 5 Fair         1610
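count() also accepts a sort argument, which dbplyr should translate into an ORDER BY on n; a quick sketch:
tbl_diamonds |> 
  count(cut, sort = TRUE)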
  9. Add show_query() to see how dplyr translates your code to SQL
tbl_diamonds |> 
  count(cut) |> 
  show_query()
#> <SQL>
#> SELECT `cut`, COUNT(*) AS `n`
#> FROM `diamonds`
#> GROUP BY `cut`
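If you want the generated SQL as an object rather than just printed, dbplyr can render it directly; a sketch using dbplyr::remote_query():
sql_text <- tbl_diamonds |> 
  count(cut) |> 
  dbplyr::remote_query()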

2.2 Easily aggregate data

An example of how we can use the same code against a local R data frame and a remote table

  1. Using dplyr, get the average price for each cut in the diamonds data set from the ggplot2 package, sorted by the average price
ggplot2::diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price, na.rm = TRUE)) |> 
  arrange(desc(avg_price))
#> # A tibble: 5 × 2
#>   cut       avg_price
#>   <ord>         <dbl>
#> 1 Premium       4584.
#> 2 Fair          4359.
#> 3 Very Good     3982.
#> 4 Good          3929.
#> 5 Ideal         3458.
  2. Use tbl_diamonds to perform the exact same operation
tbl_diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price, na.rm = TRUE)) |> 
  arrange(desc(avg_price))
#> # Source:     SQL [5 x 2]
#> # Database:   Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> # Ordered by: desc(avg_price)
#>   cut       avg_price
#>   <chr>         <dbl>
#> 1 Premium       4584.
#> 2 Fair          4359.
#> 3 Very Good     3982.
#> 4 Good          3929.
#> 5 Ideal         3458.
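As before, you can append show_query() to confirm the translation; mean() should become AVG() and arrange() an ORDER BY clause:
tbl_diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price, na.rm = TRUE)) |> 
  arrange(desc(avg_price)) |> 
  show_query()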
  3. Load the same code into a variable named price_by_cut
price_by_cut <- tbl_diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price, na.rm = TRUE)) |> 
  arrange(desc(avg_price))
  4. Call price_by_cut
price_by_cut
#> # Source:     SQL [5 x 2]
#> # Database:   Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> # Ordered by: desc(avg_price)
#>   cut       avg_price
#>   <chr>         <dbl>
#> 1 Premium       4584.
#> 2 Fair          4359.
#> 3 Very Good     3982.
#> 4 Good          3929.
#> 5 Ideal         3458.
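Because price_by_cut is still a lazy query, it slots into regular R workflows; for example, a sketch that collects the five summary rows and then plots them (assuming ggplot2 is installed):
library(ggplot2)
price_by_cut |> 
  collect() |> 
  ggplot(aes(x = cut, y = avg_price)) +
  geom_col()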

2.3 Data collection

Understand the difference between printing and collecting

  1. Call the tbl_diamonds variable directly
tbl_diamonds
#> # Source:   table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    `_c0` carat cut       color clarity depth table price     x     y     z
#>    <int> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ more rows
  2. Call tbl_diamonds via print()
print(tbl_diamonds)
#> # Source:   table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    `_c0` carat cut       color clarity depth table price     x     y     z
#>    <int> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ more rows
  3. Call tbl_diamonds via collect(). Notice that it downloads the entire table into your R session
collect(tbl_diamonds)
#> # A tibble: 53,940 × 11
#>    `_c0` carat cut       color clarity depth table price     x     y     z
#>    <int> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#>  1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
#>  2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
#>  3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
#>  4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
#>  5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
#>  6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
#>  7     7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
#>  8     8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
#>  9     9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
#> 10    10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
#> # ℹ 53,930 more rows
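Since collect() brings every row over the network, it is usually better to filter or aggregate remotely first and collect only what you need; a sketch:
ideal_diamonds <- tbl_diamonds |> 
  filter(cut == "Ideal") |> 
  collect()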
  4. Load the result of the previous operation into a variable named local_diamonds
local_diamonds <- collect(tbl_diamonds)
  5. Use pull() to extract only the values of price
tbl_diamonds |> 
  pull(price)
  6. Load the previous operation into a variable named price. Notice that this time the variable is a vector, not a data frame
price <- tbl_diamonds |> 
  pull(price)
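Because price is now a plain numeric vector in local memory, base R functions apply to it directly:
mean(price)
quantile(price, probs = c(0.25, 0.5, 0.75))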

2.4 Referring to non-default catalog tables

Using I() to create non-default table references

  1. Create a reference to the customer table, which is under the tpch schema, inside the workshops catalog
tbl(con, I("workshops.tpch.customer"))
#> # Source:   table<workshops.tpch.customer> [?? x 8]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    c_custkey c_name         c_address c_nationkey c_phone c_acctbal c_mktsegment
#>        <dbl> <chr>          <chr>           <dbl> <chr>       <dbl> <chr>       
#>  1         1 Customer#0000… IVhzIApe…          15 25-989…      712. BUILDING    
#>  2         2 Customer#0000… XSTf4,NC…          13 23-768…      122. AUTOMOBILE  
#>  3         3 Customer#0000… MG9kdTD2…           1 11-719…     7498. AUTOMOBILE  
#>  4         4 Customer#0000… XxVSJsLA…           4 14-128…     2867. MACHINERY   
#>  5         5 Customer#0000… KvpyuHCp…           3 13-750…      794. HOUSEHOLD   
#>  6         6 Customer#0000… sKZz0Csn…          20 30-114…     7639. AUTOMOBILE  
#>  7         7 Customer#0000… TcGe5gaZ…          18 28-190…     9562. AUTOMOBILE  
#>  8         8 Customer#0000… I0B10bB0…          17 27-147…     6820. BUILDING    
#>  9         9 Customer#0000… xKiAFTjU…           8 18-338…     8324. FURNITURE   
#> 10        10 Customer#0000… 6LrEaV6K…           5 15-741…     2754. HOUSEHOLD   
#> # ℹ more rows
#> # ℹ 1 more variable: c_comment <chr>
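Recent versions of dbplyr also provide in_catalog() as an alternative to I() for three-level names; a sketch that should be equivalent:
tbl(con, dbplyr::in_catalog("workshops", "tpch", "customer"))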
  2. Create a reference to the sales_order_in_la table, under the ferit schema
tbl(con, I("ferit.sales_order_in_la"))
#> # Source:   table<ferit.sales_order_in_la> [?? x 7]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#>    order_date city        customer_id customer_name      currency sales quantity
#>    <date>     <chr>       <chr>       <chr>              <chr>    <int>  <int64>
#>  1 2019-09-27 Los Angeles 14120673    DEL MONICO,  DOMI… USD       1166        2
#>  2 2019-08-30 Los Angeles 12593702    SAUTKUS,  LAURA    USD       1250        8
#>  3 2019-11-12 Los Angeles 14159223    ROBERTS,  ALAN E   USD         98        2
#>  4 2019-10-15 Los Angeles 13005776    SAUTKUS,  LAURA    USD        731        4
#>  5 2019-08-23 Los Angeles 14501552    osborn paper co    USD       2180        3
#>  6 2019-09-03 Los Angeles 13265811    RYCHTANEK,  NICOLE USD         72        3
#>  7 2019-10-16 Los Angeles 12947406    epikos church      USD       2529        3
#>  8 2019-08-08 Los Angeles 15171127    GORZEN,  WALDEMAR… USD       1207        6
#>  9 2019-08-16 Los Angeles 14065091    VARGAS,  DAVID     USD        349        2
#> 10 2019-10-08 Los Angeles 13999146    SEGOVIA,  VICTOR M USD       2092        5
#> # ℹ more rows
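These references behave just like tbl_diamonds, so they can be stored in a variable and queried lazily; for example (tbl_sales_la is a name chosen here for illustration):
tbl_sales_la <- tbl(con, I("ferit.sales_order_in_la"))
tbl_sales_la |> 
  summarise(total_sales = sum(sales, na.rm = TRUE))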