1 - Introduction

Intro to MLOps with vetiver

Welcome!

Wi-Fi network name

Posit Conf 2024

Wi-Fi password

conf2024

Welcome!

  • There are gender-neutral bathrooms located on levels 3, 4, 5, 6, & 7
  • A meditation/prayer room is available in room 503
    • Open Monday & Tuesday 7am - 7pm, Wednesday 7am - 5pm
  • A lactation room is available in room 509
    • Open Monday & Tuesday 7am - 7pm, Wednesday 7am - 5pm
  • Participants who do not wish to be photographed have red lanyards; please note everyone’s lanyard colors before taking a photo and respect their choices
  • The Code of Conduct can be found at https://posit.co/code-of-conduct/
    • Please review them carefully! ❤️
    • You can report Code of Conduct violations in person, by email, or by phone; see the policy linked above for contact information

Who are you?

  • You have intermediate R or Python knowledge

  • You can read data from CSV and other flat files, transform and reshape data, and make a wide variety of graphs

  • You can fit a model to data with your modeling framework of choice wide variety of graphs

  • You have exposure to basic modeling and machine learning practice

  • You do not need expert familiarity with advanced ML or MLOps topics

Who are we?

Asking for help

🧡 “I’m stuck and need help!”

💙 “I finished the exercise”

If you prefer, post on GitHub Discussions for help:

https://github.com/posit-conf-2024/vetiver/discussions

Plan for this workshop

  • Versioning
    • Managing change in models ✅
  • Deploying
    • Putting models in REST APIs 🎯
  • Monitoring
    • Tracking model performance 👀

Introduce yourself to your neighbors 👋

Optional

Post an introduction on GitHub Discussions: https://github.com/posit-conf-2024/vetiver/discussions

What is machine learning?

What is machine learning?

MLOps is…

MLOps is…

a set of practices to deploy and maintain machine learning models in production reliably and efficiently

MLOps with vetiver

Vetiver, the oil of tranquility, is used as a stabilizing ingredient in perfumery to preserve more volatile fragrances.

If you develop a model…

you can operationalize that model!

If you develop a model…

you likely should be the one to operationalize that model!

Your turn 🏺

Activity

What language does your team use for machine learning?

What kinds of models do you commonly use?

Have you ever deployed a model?

03:00

Workshop infrastructure

  • Log in at https://vetiver.posit.team
  • You’ll use your GitHub account to get access
  • Even if you plan to work locally, set this up with us so you can use Posit Connect as a deployment target
  • For Posit Workbench, use RStudio for R or VS Code for Python
  • Open the folder class-work in the vetiver directory

Your turn 🏺

Activity

Log in to Posit Workbench at https://vetiver.posit.team

Start a new session, either RStudio or VS Code.

We recommend that you open the vetiver directory as a project (RStudio) or workspace (VS Code).

In your new session, open the folder class-work in the vetiver directory, and choose the first Quarto file!

05:00

Seattle housing data

  • Home sale prices for King County, including Seattle, between May 2014 and May 2015
  • Can certain measurements be used to predict the sale price?
  • Data from Kaggle by way of mlr3data::kc_housing

Seattle housing data

  • N = 14633
  • A numeric outcome, price
  • Other variables to use for prediction:
    • bedrooms, bathrooms, sqft_living, and yr_built are numeric predictors
    • waterfront could be a logical (or maybe nominal) predictor
    • date could be a date predictor

R

library(arrow)
path <- here::here("data", "housing.parquet")
housing <- read_parquet(path)

Python

import pandas as pd
housing = pd.read_parquet('../data/housing.parquet')

Home prices in Seattle

price date bedrooms bathrooms sqft_living yr_built waterfront lat long
350000 2014-09-11 2 1.50 1070 2003 FALSE 47.6761 -122.300
250275 2014-06-17 2 1.00 790 1942 FALSE 47.4413 -122.349
712198 2014-05-04 4 2.50 2450 2013 FALSE 47.7048 -122.113
283200 2014-05-26 4 2.50 1982 2004 FALSE 47.3636 -122.192
435000 2014-11-16 5 1.00 2170 1930 FALSE 47.7555 -122.204
299950 2014-11-17 3 2.50 1570 2005 FALSE 47.7456 -121.984
368500 2014-12-10 5 2.75 2530 1992 FALSE 47.4683 -122.263
540000 2014-10-06 5 1.50 1940 1940 FALSE 47.7213 -122.310
299950 2014-10-27 2 1.75 1460 1983 FALSE 47.4048 -122.178
299880 2014-07-08 3 2.50 1460 2000 FALSE 47.5440 -122.296
545000 2014-07-09 2 2.00 2930 1980 FALSE 47.4025 -122.463
615000 2014-07-21 3 3.25 1470 2003 FALSE 47.6516 -122.337
680000 2014-06-09 3 1.75 1760 1960 FALSE 47.5355 -122.390
512000 2014-08-07 4 2.50 2550 1996 FALSE 47.4836 -122.136
452000 2014-06-05 2 1.75 1740 1946 FALSE 47.6971 -122.282

Your turn 🏺

Activity

Explore the housing data on your own!

  • What’s the distribution of the outcome price?
  • What’s the distribution of the numeric variable sqft_living?
  • How do results differ across the waterfront category?

Share something you noticed with your neighbor.

08:00

library(tidyverse)
housing |>
  group_by(date = floor_date(date, unit = "week")) |>
  summarise(price = mean(price)) |>
  ggplot(aes(date, price)) +
  geom_line(alpha = 0.8, linewidth = 1.5) +
  scale_y_continuous(labels = scales::dollar) +
  labs(y = "Mean price", x = NULL)

from plotnine import ggplot, aes, geom_boxplot, coord_flip, scale_y_log10
(ggplot(housing, aes('waterfront', 'price', fill = 'waterfront')) 
  + geom_boxplot(alpha = 0.5, show_legend = False)
  + coord_flip()
  + scale_y_log10()
)
#> <Figure Size: (640 x 480)>

housing |>
  ggplot(aes(long, lat, z = log10(price))) +
  stat_summary_hex(alpha = 0.7) +
  scale_fill_viridis_c() +
  labs(fill = "Mean price\n(log10)", x = NULL, y = NULL)

Time for building a model!

Spend your data budget

R

library(tidymodels)
set.seed(123)
housing_split <- housing |>
  mutate(price = log10(price)) |> 
  initial_split(prop = 0.8)
housing_train <- training(housing_split)
housing_test <- testing(housing_split)

Python

from sklearn import model_selection
import numpy as np
np.random.seed(123)
X, y = housing[["bedrooms", "bathrooms", "sqft_living", "yr_built"]], np.log10(housing["price"])
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y,
    test_size = 0.2
)

Fit a linear regression model 🚀

Or your model of choice!

housing_fit <-
  workflow(
    price ~ bedrooms + bathrooms + sqft_living + yr_built, 
    linear_reg()
    ) |> 
  fit(data = housing_train)
from sklearn import linear_model
housing_fit = linear_model.LinearRegression().fit(X_train, y_train)

Your turn 🏺

Activity

Split your data in training and testing.

Fit a model to your training data.

05:00

Create a deployable bundle

Create a deployable model object

R

library(vetiver)
v <- vetiver_model(housing_fit, "seattle-housing-rstats")
v
#> 
#> ── seattle-housing-rstats ─ <bundled_workflow> model for deployment 
#> A lm regression modeling workflow using 4 features

Python

from vetiver import VetiverModel
v = VetiverModel(housing_fit, "seattle-housing-python", prototype_data = X_train)
v.description
#> 'A scikit-learn LinearRegression model'

Deploy preprocessors and models together

What is wrong with this?

Your turn 🏺

Activity

Create your vetiver model object.

Check out the default description that is created, and try out using a custom description.

Show your custom description to your neighbor.

05:00

Version your model

pins 📌

The pins package publishes data, models, and other R and Python objects, making it easy to share them across projects and with your colleagues.

You can pin objects to a variety of pin boards, including:

  • a local folder (like a network drive or even a temporary directory)
  • Posit Connect
  • Amazon S3
  • Azure Storage
  • Google Cloud

Version your model

Learn about the pins package for Python and for R

from pins import board_temp
from vetiver import vetiver_pin_write

board = board_temp(allow_pickle_read = True)
vetiver_pin_write(board, v)
#> Model Cards provide a framework for transparent, responsible reporting. 
#>  Use the vetiver `.qmd` Quarto template as a place to start, 
#>  with vetiver.model_card()
#> Writing pin:
#> Name: 'seattle-housing-python'
#> Version: 20240812T004027Z-258d7
library(pins)

board <- board_temp()
board |> vetiver_pin_write(v)
#> Creating new version '20240812T004027Z-f27f6'
#> Writing to pin 'seattle-housing-rstats'
#> 
#> Create a Model Card for your published model
#> • Model Cards provide a framework for transparent, responsible reporting
#> • Use the vetiver `.Rmd` template as a place to start

Your turn 🏺

Activity

Pin your vetiver model object to a temporary board.

Retrieve the model metadata with pin_meta().

05:00

Posit Connect

Version your model

R

library(pins)
library(vetiver)
board <- board_temp()
v <- vetiver_model(housing_fit, "seattle-housing-rstats")
board |> vetiver_pin_write(v)

Python

from pins import board_temp
from vetiver import VetiverModel, vetiver_pin_write

board = board_temp(allow_pickle_read = True)
v = VetiverModel(housing_fit, "seattle-housing-python", prototype_data = X_train)
vetiver_pin_write(board, v)

Version your model

R

library(pins)
library(vetiver)

board <- board_connect()
v <- vetiver_model(housing_fit, "julia.silge/seattle-housing-rstats")
board |> vetiver_pin_write(v)

Python

from pins import board_connect
from vetiver import VetiverModel, vetiver_pin_write
from dotenv import load_dotenv
load_dotenv()

board = board_connect(allow_pickle_read = True)
v = VetiverModel(housing_fit, "isabel.zimmerman/seattle-housing-python", prototype_data = X_train)
vetiver_pin_write(board, v)

Your turn 🏺

Activity

Either:

  • Set up Connect publishing from RStudio.
  • Create an API key for your Posit Connect server, and save it on Workbench in your working directory (in .Renviron for R or .env for Python).

Create a new vetiver model object that includes your username, and pin this vetiver model to your Connect instance.

Visit your pin’s homepage on Connect.

Train your model again, using a different ML algorithm (decision tree or random forest are good options).

Write this new version of your model to the same pin, and see what versions you have with pin_versions.

10:00