library(dplyr)
library(dbplyr)
library(DBI)
2 Remote processing
2.1 Create a table variable
The basics of how to point a variable in R to a table or view inside the database
- Load the `dplyr`, `DBI` and `dbplyr` libraries
- (Optional) Open a connection to the database if it’s currently closed
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/300bd24ba12adf8e"
)
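- Note: `odbc::databricks()` can pick up your credentials automatically. A minimal sketch, assuming token-based authentication via the environment variables the odbc package documents (`DATABRICKS_HOST`, `DATABRICKS_TOKEN`); the values below are placeholders:

Sys.setenv(DATABRICKS_HOST = "xxxxx.cloud.databricks.com") # placeholder workspace URL
Sys.setenv(DATABRICKS_TOKEN = "dapi0000000000")            # placeholder personal access token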
- Using `dbGetQuery()`, create a query to pull the `diamonds` table
dbGetQuery(con, "select * from diamonds")
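- Keep in mind that `dbGetQuery()` downloads the full result set into an R data frame, so it pays to limit the query while exploring. A sketch using a standard SQL `LIMIT` clause:

dbGetQuery(con, "select * from diamonds limit 10")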
- Use `tbl()` to perform the same operation
tbl(con, "diamonds")
#> # Source: table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> `_c0` carat cut color clarity depth table price x y z
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # ℹ more rows
- Load the reference, not the table data, into a variable
tbl_diamonds <- tbl(con, "diamonds")
- Call the variable to preview the data in the table

tbl_diamonds
#> # Source: table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> `_c0` carat cut color clarity depth table price x y z
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # ℹ more rows
- Add `count()` to easily get the number of rows

tbl_diamonds |>
  count()
#> # Source: SQL [1 x 1]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> n
#> <int64>
#> 1 53940
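- `count()` is dplyr shorthand; the same total can be written with `summarise()` and `n()`, which `dbplyr` translates to the same kind of SQL aggregate. A sketch, reusing the same table reference:

tbl_diamonds |>
  summarise(n = n())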
- Add `cut` as an argument to `count()` to see the count by that field

tbl_diamonds |>
  count(cut)
#> # Source: SQL [5 x 2]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> cut n
#> <chr> <int64>
#> 1 Ideal 21551
#> 2 Premium 13791
#> 3 Very Good 12082
#> 4 Good 4906
#> 5 Fair 1610
- Add `show_query()` to see how `dplyr` translates your code to SQL

tbl_diamonds |>
  count(cut) |>
  show_query()
#> <SQL>
#> SELECT `cut`, COUNT(*) AS `n`
#> FROM `diamonds`
#> GROUP BY `cut`
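- The same inspection works for any verb chain. A sketch showing how a `filter()` step would surface as a `WHERE` clause (output omitted; the exact SQL varies by backend):

tbl_diamonds |>
  filter(price > 1000) |>
  show_query()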
2.2 Easily aggregate data
An example of how we can use the same code against a local R data frame and a remote table
- Using `dplyr`, get the average price for each `cut`, sorted by the average, for `diamonds` from the `ggplot2` package

ggplot2::diamonds |>
  group_by(cut) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))
#> # A tibble: 5 × 2
#> cut avg_price
#> <ord> <dbl>
#> 1 Premium 4584.
#> 2 Fair 4359.
#> 3 Very Good 3982.
#> 4 Good 3929.
#> 5 Ideal 3458.
- Use `tbl_diamonds` to perform the exact same operation

tbl_diamonds |>
  group_by(cut) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))
#> # Source: SQL [5 x 2]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> # Ordered by: desc(avg_price)
#> cut avg_price
#> <chr> <dbl>
#> 1 Premium 4584.
#> 2 Fair 4359.
#> 3 Very Good 3982.
#> 4 Good 3929.
#> 5 Ideal 3458.
- Load the code into a variable named `price_by_cut`

price_by_cut <- tbl_diamonds |>
  group_by(cut) |>
  summarise(avg_price = mean(price, na.rm = TRUE)) |>
  arrange(desc(avg_price))
- Call `price_by_cut`

price_by_cut
#> # Source: SQL [5 x 2]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> # Ordered by: desc(avg_price)
#> cut avg_price
#> <chr> <dbl>
#> 1 Premium 4584.
#> 2 Fair 4359.
#> 3 Very Good 3982.
#> 4 Good 3929.
#> 5 Ideal 3458.
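- Because `price_by_cut` holds a lazy query rather than data, it can also be inspected with `show_query()` before anything is pulled down. A sketch; the generated SQL will mirror the grouped aggregate above:

price_by_cut |>
  show_query()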
2.3 Data collection
Understand the difference between printing and collecting
- Call the `tbl_diamonds` variable directly

tbl_diamonds
#> # Source: table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> `_c0` carat cut color clarity depth table price x y z
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # ℹ more rows
- Call `tbl_diamonds` via `print()`

print(tbl_diamonds)
#> # Source: table<`diamonds`> [?? x 11]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> `_c0` carat cut color clarity depth table price x y z
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # ℹ more rows
- Call `tbl_diamonds` via `collect()`. Notice that it downloads the entire table into R
collect(tbl_diamonds)
#> # A tibble: 53,940 × 11
#> `_c0` carat cut color clarity depth table price x y z
#> <int> <dbl> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
#> 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
#> 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
#> 4 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
#> 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
#> 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
#> 7 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
#> 8 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
#> 9 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
#> 10 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
#> # ℹ 53,930 more rows
- Load the previous operation to a variable named `local_diamonds`

local_diamonds <- collect(tbl_diamonds)
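- A quick way to confirm the difference between the two objects is base R's `class()`. A sketch; the exact remote class names depend on your `dbplyr` version:

class(tbl_diamonds)   # a lazy remote table, e.g. "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
class(local_diamonds) # a local tibble: "tbl_df" "tbl" "data.frame"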
- Use `pull()` to extract the values from `price` only

tbl_diamonds |>
  pull(price)
- Load the previous operation to a variable named `price`. Notice that this time, the variable is a vector, not a data frame.

price <- tbl_diamonds |>
  pull(price)
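- In practice, these ideas combine well: aggregate in the database, then `collect()` only the small summarized result instead of the full table. A sketch reusing `price_by_cut` from above; `local_price_by_cut` is a hypothetical name:

local_price_by_cut <- price_by_cut |> # only 5 summarized rows travel to R
  collect()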
2.4 Referring to non-default catalog tables
Using `I()` to create references to tables outside the default catalog and schema
- Create a reference to the `customer` table, which is under the `tpch` schema, inside the `workshops` catalog
tbl(con, I("workshops.tpch.customer"))
#> # Source: table<workshops.tpch.customer> [?? x 8]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> c_custkey c_name c_address c_nationkey c_phone c_acctbal c_mktsegment
#> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr>
#> 1 1 Customer#0000… IVhzIApe… 15 25-989… 712. BUILDING
#> 2 2 Customer#0000… XSTf4,NC… 13 23-768… 122. AUTOMOBILE
#> 3 3 Customer#0000… MG9kdTD2… 1 11-719… 7498. AUTOMOBILE
#> 4 4 Customer#0000… XxVSJsLA… 4 14-128… 2867. MACHINERY
#> 5 5 Customer#0000… KvpyuHCp… 3 13-750… 794. HOUSEHOLD
#> 6 6 Customer#0000… sKZz0Csn… 20 30-114… 7639. AUTOMOBILE
#> 7 7 Customer#0000… TcGe5gaZ… 18 28-190… 9562. AUTOMOBILE
#> 8 8 Customer#0000… I0B10bB0… 17 27-147… 6820. BUILDING
#> 9 9 Customer#0000… xKiAFTjU… 8 18-338… 8324. FURNITURE
#> 10 10 Customer#0000… 6LrEaV6K… 5 15-741… 2754. HOUSEHOLD
#> # ℹ more rows
#> # ℹ 1 more variable: c_comment <chr>
- Create a reference to the `sales_order_in_la` table, under the `ferit` schema
tbl(con, I("ferit.sales_order_in_la"))
#> # Source: table<ferit.sales_order_in_la> [?? x 7]
#> # Database: Spark SQL 3.1.1[token@Spark SQL/hive_metastore]
#> order_date city customer_id customer_name currency sales quantity
#> <date> <chr> <chr> <chr> <chr> <int> <int64>
#> 1 2019-09-27 Los Angeles 14120673 DEL MONICO, DOMI… USD 1166 2
#> 2 2019-08-30 Los Angeles 12593702 SAUTKUS, LAURA USD 1250 8
#> 3 2019-11-12 Los Angeles 14159223 ROBERTS, ALAN E USD 98 2
#> 4 2019-10-15 Los Angeles 13005776 SAUTKUS, LAURA USD 731 4
#> 5 2019-08-23 Los Angeles 14501552 osborn paper co USD 2180 3
#> 6 2019-09-03 Los Angeles 13265811 RYCHTANEK, NICOLE USD 72 3
#> 7 2019-10-16 Los Angeles 12947406 epikos church USD 2529 3
#> 8 2019-08-08 Los Angeles 15171127 GORZEN, WALDEMAR… USD 1207 6
#> 9 2019-08-16 Los Angeles 14065091 VARGAS, DAVID USD 349 2
#> 10 2019-10-08 Los Angeles 13999146 SEGOVIA, VICTOR M USD 2092 5
#> # ℹ more rows
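- As an alternative to `I()`, `dbplyr` also provides `in_catalog()` and `in_schema()` for building qualified table references; recent `dbplyr` versions favor the `I()` spelling, so treat this as an equivalent sketch of the two references above:

tbl(con, dbplyr::in_catalog("workshops", "tpch", "customer"))
tbl(con, dbplyr::in_schema("ferit", "sales_order_in_la"))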