4 - Monitor your model

Intro to MLOps with vetiver

Plan for this workshop

  • Versioning
    • Managing change in models ✅
  • Deploying
    • Putting models in REST APIs 🎯
  • Monitoring
    • Tracking model performance 👀

Data for model development

Data that you use for training and testing while building a model

R

library(arrow)
path <- here::here("data", "housing.parquet")
housing <- read_parquet(path)

Python

import pandas as pd

housing = pd.read_parquet("../data/housing.parquet")

Data for model monitoring

New data that you predict on after your model is deployed

R

library(arrow)
path <- here::here("data", "housing_monitoring.parquet")
housing_new <- read_parquet(path)

Python

import pandas as pd

housing_new = pd.read_parquet("../data/housing_monitoring.parquet")

My model is performing well!

👩🏼‍🔧 My model returns predictions quickly, doesn’t use too much memory or processing power, and doesn’t have outages.

Metrics

  • latency
  • memory and CPU usage
  • uptime
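
These operational metrics can be checked without knowing anything about the model's statistics. Below is a minimal sketch of timing requests against the deployed Python endpoint used later in these slides; the payload shape and the five-request loop are illustrative assumptions, not part of the workshop code.

import time
import requests

# Deployed prediction endpoint (the Python API used later in these slides)
url = "https://pub.demo.posit.team/public/seattle-housing-python/predict"

# Illustrative single-row payload with the model's input features
payload = [{"bedrooms": 3, "bathrooms": 2, "sqft_living": 1800, "yr_built": 1975}]

# Time a handful of requests and report the average latency
latencies = []
for _ in range(5):
    start = time.perf_counter()
    requests.post(url, json=payload, timeout=10)
    latencies.append(time.perf_counter() - start)

print(f"mean latency: {sum(latencies) / len(latencies):.3f} s")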

My model is performing well!

👩🏽‍🔬 My model returns predictions that are close to the true values for the predicted quantity.

Metrics

  • accuracy
  • ROC AUC
  • F1 score
  • RMSE
  • log loss

Model drift 📉

DATA drift

Model drift 📉

CONCEPT drift

When should you retrain your model? 🧐

Your turn 🏺

Activity

Using our data, what could be an example of data drift? Concept drift?

05:00

Monitor your model’s inputs

Typically it is most useful to compare the new monitoring data to your model development data.
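
One lightweight way to start is to put summary statistics of the same feature side by side for the development and monitoring data. A minimal sketch, assuming the parquet paths from earlier in these slides and the sqft_living feature; any other input column works the same way:

import pandas as pd

# Development data and new monitoring data, loaded as earlier in these slides
housing = pd.read_parquet("../data/housing.parquet")
housing_new = pd.read_parquet("../data/housing_monitoring.parquet")

# Side-by-side summary statistics for one input feature;
# a large shift in these summaries suggests data drift
comparison = pd.DataFrame({
    "development": housing["sqft_living"].describe(),
    "monitoring": housing_new["sqft_living"].describe(),
})
print(comparison)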

Your turn 🏺

Activity

Create a plot or table comparing the development vs. monitoring distributions of a model input/feature.

How might you make this comparison if you didn’t have all the model development data available when monitoring?

What summary statistics might you record during model development, to prepare for monitoring?

07:00

Monitor your model’s outputs

  • If a realtor used a model like this one before putting a house on the market, they would get:
    • A predicted price from their model
    • A real price result after the home was sold
  • In this case, we can monitor our model’s statistical performance
  • If you don’t ever get a “real” result, you can still monitor the distribution of your outputs
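
When no ground truth ever arrives, a lightweight option is to track how the distribution of the predictions themselves shifts over time. A minimal sketch, assuming housing_new already has the pred column created by the scoring code on the next slide, plus its date column:

import pandas as pd

# Summarize predictions by week; a sudden shift in these summaries can
# flag a problem even without true prices to compare against
pred_summary = (
    housing_new
    .assign(week=lambda df: pd.to_datetime(df["date"]).dt.to_period("W"))
    .groupby("week")["pred"]
    .describe()
)
print(pred_summary)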

Monitor your model’s outputs

Python

from vetiver import vetiver_endpoint, predict, compute_metrics, plot_metrics
from sklearn.metrics import root_mean_squared_error, r2_score, mean_absolute_error
from datetime import timedelta
import numpy as np

url = "https://pub.demo.posit.team/public/seattle-housing-python/predict"
endpoint = vetiver_endpoint(url)
housing_new["pred"] = predict(endpoint = endpoint,
    data = housing_new[["bedrooms", "bathrooms", "sqft_living", "yr_built"]])
housing_new["price"] = np.log10(housing_new["price"])

td = timedelta(weeks = 2)
metric_set = [root_mean_squared_error, r2_score, mean_absolute_error]

m = compute_metrics(
    data = housing_new,
    date_var = "date", 
    period = td,
    metric_set = metric_set,
    truth = "price",
    estimate = "pred")

metrics_plot = plot_metrics(m).update_yaxes(matches = None)

Monitor your model’s outputs

R

library(vetiver)
library(tidymodels)
url <- "https://pub.demo.posit.team/public/seattle-housing-rstats/predict"
endpoint <- vetiver_endpoint(url)

augment(endpoint, new_data = housing_new) |>
    mutate(price = log10(price)) |>
    vetiver_compute_metrics(
        date,
        "week",
        price,
        .pred,
        metric_set = metric_set(rmse, rsq, mae)
    ) |>
    vetiver_plot_metrics()

Your turn 🏺

Activity

Use the functions for metrics monitoring from vetiver to create a monitoring visualization.

Choose a different set of metrics or time aggregation.

Note that there are also functions for using pins to version and update monitoring results (see the sketch after this activity)!

05:00
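
As the note above mentions, monitoring results can themselves be versioned. A minimal sketch using the pins package directly, with an assumed local folder board and the metrics data frame m computed earlier; vetiver also ships helpers for pinned metrics, which are worth looking up in its documentation:

import pins

# A local folder board for illustration; in practice this could be
# a Posit Connect or cloud storage board
board = pins.board_folder("monitoring", versioned=True)

# Write the metrics from compute_metrics() as a versioned pin
board.pin_write(m, "housing-monitoring-metrics", type="csv")

# Later, read the latest metrics back to extend or plot them
metrics_history = board.pin_read("housing-monitoring-metrics")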

Feedback loops 🔁

Deployment of an ML model may alter the training data

  • Movie recommendation systems on Netflix, Disney+, Hulu, etc.
  • Identifying fraudulent credit card transactions at Stripe
  • Recidivism models

Feedback loops can have unexpected consequences

Feedback loops 🔁

  • Users take some action as a result of a prediction
  • Users rate or correct the quality of a prediction
  • Produce annotations (crowdsource or expert)
  • Produce feedback automatically

Your turn 🏺

Activity

What is a possible feedback loop for the Seattle housing data?

Do you think your example would be harmful or helpful? To whom?

05:00

ML metrics ➡️ organizational outcomes

  • Are machine learning metrics like F1 score or RMSE what matter to your organization?
  • Consider how ML metrics are related to important outcomes or KPIs for your business or org
  • There isn’t always a 1-to-1 mapping 😔
  • You can partner with stakeholders to monitor what’s truly important about your model

Your turn 🏺

Activity

Let’s say that the most important organizational outcome for a Seattle realtor is how accurate a pricing model is in percentage terms rather than in absolute dollars. (Think about being 20% wrong vs. $20,000 wrong.)

We can measure this with the mean absolute percentage error (see the formula after this activity).

Compute this quantity with the monitoring data, and aggregate by week/month, number of bedrooms/bathrooms, or waterfront status.

For extra credit, make a visualization showing your results.

07:00
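
For reference, mean absolute percentage error averages each error as a fraction of the true value, where $y_i$ is the true sale price in dollars and $\hat{y}_i$ is the predicted price on the same scale:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$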

ML metrics ➡️ organizational outcomes

R

augment(endpoint, housing_new) |>
    mutate(.pred = 10 ^ .pred) |>
    group_by(waterfront) |>
    mape(price, .pred)
#> # A tibble: 2 × 4
#>   waterfront .metric .estimator .estimate
#>   <lgl>      <chr>   <chr>          <dbl>
#> 1 FALSE      mape    standard        29.5
#> 2 TRUE       mape    standard        52.0

Python

from sklearn.metrics import mean_absolute_percentage_error

housing_new \
    .groupby("waterfront") \
    .apply(lambda x: mean_absolute_percentage_error(
        y_pred=10 ** x["pred"], y_true=10 ** x["price"]), include_groups=False)
#> waterfront
#> False    0.304799
#> True     0.531649
#> dtype: float64

Possible model monitoring artifacts

  • Ad hoc analysis that you post in Slack
  • Report that you share in Google Drive
  • Fully automated dashboard published to Posit Connect

Your turn 🏺

Activity

Create a Quarto report or R Markdown dashboard for model monitoring.

Publish your document to Connect.

15:00

We made it! 🎉

Your turn 🏺

Activity

What is one thing you learned that surprised you?

What is one thing you learned that you plan to use?

05:00

Resources to keep learning

Follow Posit and/or us on your preferred social media for updates!

Submit feedback before you leave 🗳️

pos.it/conf-workshop-survey

Your feedback is crucial! Data from the survey informs curriculum and format decisions for future conf workshops, and we really appreciate you taking the time to provide it.