Processing math: 100%

Tidymodels with R

Brian N. White, STOR Ph.D. Student, UNC Chapel Hill

Acknowldegments

These slides are adapted, with permission, from a lecture given by Dr. Lucy D’Agostino McGowan (@LucyStats).

Preliminaries

If you have not done so already please install and load the following packages:

tidyverse
tidymodels
ISLR

library(tidyverse)
library(tidymodels)
library(ISLR)

Tidyverse

This presentation assumes a basic familiarity with the core tidyverse packages:

Tidymodels

Tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
All packages contain an underlying philosophy and common grammar
Developed by Max Kuhn (who also developed caret, short for C-lassification A-nd RE-gression T-raining: a single package with 100s of ML algorithms)

Tidymodels

Core Packages

Data Resampling and Feature Engineering:

rsample
recipes

Model Fitting and Tuning:

parsnip
tune
dials

Model Evaluation:

yardstick

Tidymodels Workflow

A First Example

data(Auto)
head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Data Resampling & Feature Engineering

Test/Train Split

set.seed(100) # for reproducibility

# partition data into training and testing set
auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg)
auto_split

## <Analysis/Assess/Total>
## <292/100/392>

auto_train <- auto_split %>% training() # extract training data
auto_test <- auto_split %>% testing() # extract testing data

Data Resampling & Feature Engineering

K-Fold CV

# shuffle and partition the training data into 5 folds
auto_folds <- auto_train %>% vfold_cv(v = 5, strata = mpg)

Model Fitting & Tuning

Step 1: Specify Your Model

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('lm') %>% # specify function used to fit model
  set_mode('regression') # learning task

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('glmnet') %>% # specify function used to fit model
  set_mode('regression') # learning task

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('spark') %>% # specify function used to fit model
  set_mode('regression') # learning task

All available models (as of 2/29/2022): https://www.tidymodels.org/find/

Exercise

Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels.

Save it as lm_spec and look at the object. What does it return?

Exercise

Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels.

Save it as lm_spec and look at the object. What does it return?

lm_spec <- linear_reg() %>%
            set_engine('lm') %>%
            set_mode('regression')

lm_spec

## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Model Fitting & Tuning

Step 2: Train Your Model

To train your model simply pipe your model specification into the fit function.

# pass linear_mod to fit function
lm_fit <- lm_spec %>%
  fit(mpg ~ horsepower, data = auto_train)

Exercise

Does the previous model give the same results as:

lm(mpg ~ horsepower, data = auto_train)

Exercise

Does the previous model give the same results as:

lm(mpg ~ horsepower, data = auto_train)

## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto_train)
## 
## Coefficients:
## (Intercept)   horsepower  
##     40.9868      -0.1676

Exercise

Yes!

lm_fit

## parsnip model object
## 
## 
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
## 
## Coefficients:
## (Intercept)   horsepower  
##     40.9868      -0.1676

Model Evaluation

Step 1: Generate Predictions

# make predictions on test set
lm_pred <- lm_fit %>%
  predict(new_data = auto_test) %>%
  bind_cols(auto_test)

Model Evaluation

Step 1: Generate Predictions

lm_pred %>%
  select(.pred, mpg) %>%
  head()

## # A tibble: 6 × 2
##   .pred   mpg
##   <dbl> <dbl>
## 1  4.96    14
## 2  9.15    15
## 3 15.9     15
## 4 25.1     25
## 5 25.9     21
## 6  4.96    10

Model Evaluation

Step 2: Assess Model

# specify performance metrics of interest
auto_metrics <- metric_set(
  yardstick::rmse,
  yardstick::rsq
)

Model Evaluation

Step 2: Assess Model

# specify performance metrics of interest
auto_metrics <- metric_set(
  yardstick::rmse,
  yardstick::rsq
)

# compute metrics
lm_pred %>%
  auto_metrics(truth = mpg, estimate = .pred)

## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.70 
## 2 rsq     standard       0.597

Exercise

Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE

set.seed(100)
auto_split  <- ________
auto_train  <- ________
auto_test   <- ________

lm_fit      <-  lm_spec %>% fit(mpg ~ horsepower, data = ________)

lm_pred  <- ________ %>% 
  predict(new_data = ________) %>% 
  bind_cols(________)

lm_pred %>% auto_metrics(truth = ________, estimate = ________)

Exercise

Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE

set.seed(100)
auto_split  <- initial_split(Auto, strata = mpg)
auto_train  <- auto_split %>% training()
auto_test   <- auto_split %>% testing()

lm_fit      <-  lm_spec %>% fit(mpg ~ horsepower, data = auto_train)

lm_pred  <- lm_fit %>% 
  predict(new_data = auto_test) %>% 
  bind_cols(auto_test)

lm_pred %>% auto_metrics(truth = mpg, estimate = .pred)

A Faster Way: last_fit()

You can use last_fit() and specify the split
This will automatically train the data on the train data from the split
Instead of specifying which metric to calculate you can just use collect_metrics() and it will automatically calculate the metrics on the test data from the split

A Faster Way: last_fit()

set.seed(100)
auto_split <- initial_split(Auto, prop = 0.75, strata = mpg)

lm_fit <- lm_spec %>% last_fit(mpg ~ horsepower, 
                               split = auto_split)

lm_fit %>% collect_metrics()

## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       4.70  Preprocessor1_Model1
## 2 rsq     standard       0.597 Preprocessor1_Model1

What About Cross-Validation?

# shuffle and partition the training data into 5 folds
auto_cv <- auto_train %>% vfold_cv(v = 5, strata = mpg)

# extract and the fist fold
fold1 <- as.data.frame(auto_cv$splits[[1]])
fold1 %>%
  select(mpg, horsepower) %>%
  head()

##    mpg horsepower
## 2   15        165
## 4   16        150
## 5   17        140
## 7   14        220
## 11  15        170
## 14  14        225

What About Cross-Validation?

# perform 5-fold cross-validation
results <- lm_spec %>% fit_resamples(mpg ~ horsepower,
                                       resamples = auto_cv)

What About Cross-Validation?

How do we get the metrics out?

With collect_metrics() again!

# compute performance metrics
results %>% collect_metrics()

## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.99      5  0.174  Preprocessor1_Model1
## 2 rsq     standard   0.619     5  0.0263 Preprocessor1_Model1

Exercise

Edit the code below to get the 5-fold cross validation error rate for the following model:

$mpg=\beta_0 + \beta_1horsepower + \beta_2horsepower^2 + \epsilon$

auto_cv <- vfold_cv(Auto, v = 5)
results <- lm_spec %>% fit_resamples(----,
                                      resamples = ---)
results %>%
  collect_metrics()

Exercise

Edit the code below to get the 5-fold cross validation error rate for the following model:

$mpg=\beta_0 + \beta_1horsepower + \beta_2horsepower^2 + \epsilon$

auto_cv <- vfold_cv(Auto, v = 5)
results <- lm_spec %>% fit_resamples(mpg ~ horsepower + horsepower^2,
                                      resamples = auto_cv)
results %>%
  collect_metrics()

## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.93      5  0.134  Preprocessor1_Model1
## 2 rsq     standard   0.609     5  0.0244 Preprocessor1_Model1

Tidymodels Workflow

What have we covered so far?

What if we wanted to do some pre-processing?

recipe()!
Using the recipe() function along with step_*() functions, we can specify preprocessing steps and R will apply them to each fold appropriately.

rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower) %>%
  step_center(horsepower)

You can find all of the potential preprocessing steps here: https://tidymodels.github.io/recipes/reference/index.html

What if we want to predict mpg with more variables?

Now we still want to add a step to scale predictors
We could either write out all predictors individually to scale them
OR we could use the all_predictors() short hand.

rec <- recipe(mpg ~ horsepower + displacement + weight, 
              data = Auto) %>%
  step_scale(all_predictors())

How to incorporate a recipe?

workflow()!

# specify model
lm_spec <- linear_reg() %>%
              set_engine('lm') %>%
              set_mode('regression')

# create recipe
rec <- recipe(mpg ~ horsepower + displacement + weight, 
              data = Auto) %>%
  step_scale(all_predictors())

# combine into a single workflow object
lm_workflow <- workflow() %>%
                add_model(lm_spec) %>%
                add_recipe(rec)

How to incorporate a recipe?

## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

lm_workflow %>% fit(data = auto_train)

How to incorporate a recipe?

results <- lm_workflow %>% 
            fit_resamples(resamples = auto_cv)

results %>% collect_metrics()

## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.28      5  0.216  Preprocessor1_Model1
## 2 rsq     standard   0.712     5  0.0265 Preprocessor1_Model1

What about hyper-parameter tuning?

Let’s extend our model to elastic net.
Recall that the objective function to be minimized for elastic net is:

$RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)$

What about hyper-parameter tuning?

First specify the engine. We’ll use glmnet.
The linear_reg() function has two additional parameters, penalty and mixture
penalty is $\lambda$
mixture is $\alpha$

lm_spec <- linear_reg() %>%
  set_engine("glmnet")

Exercise

$RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)$

What would we set mixture to in order to perform Ridge regression?

ridge_spec <- linear_reg(penalty = 100, mixture = ___) %>%
  set_engine("glmnet")

What about LASSO?

lasso_spec <- linear_reg(penalty = 100, mixture = ___) %>%
  set_engine("glmnet")

Exercise

$RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)$

What would we set mixture to in order to perform Ridge regression?

ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>%
  set_engine("glmnet")

What about LASSO?

lasso_spec <- linear_reg(penalty = 100, mixture = 1) %>%
  set_engine("glmnet")

Putting it all together

Tuning $\lambda$ and $\alpha$ is straightforward.

Generate folds on training data.
Specify your model/recipe. Insert tune() into the hyper-parameter arguments in question.
Create the grid you’d like to tune over.
Put it all together with tune_grid().
Evaluate the tuned model on the test data.

Putting it all together

set.seed(100)

# shuffle and partition the data into training and testing sets
auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg)
auto_train <- auto_split %>% training()
auto_test <- auto_split %>% testing()


# generate folds for 5-fold cv on training set
auto_cv <- vfold_cv(auto_train, v = 5, strata = mpg)

Putting it all together

# specify your model, insert tune() into hyper-parameter arguments
elastic_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
                  set_engine('glmnet') %>%
                  set_mode('regression')

# create recipe to scale predictors
elastic_rec <- recipe(mpg ~ horsepower + weight + acceleration, 
                      data = auto_train) %>%
               step_scale(all_predictors())

# combine model spec and recipe into a workflow
elastic_workflow <- workflow() %>%
                      add_model(elastic_spec) %>%
                      add_recipe(elastic_rec)

# generate grid of values to tune hyper-parameters over
elastic_grid <- expand_grid(penalty = seq(0, 100, by = 10),
                            mixture = seq(0, 1, by = 0.2))

Putting it all together

# tune
results <- elastic_workflow %>% tune_grid(grid = elastic_grid,
                                          resamples = auto_cv)

Putting it all together

# examine the results
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  dplyr::select(penalty, mixture, .metric, mean) %>%
  slice(1:4)

## # A tibble: 4 × 4
##   penalty mixture .metric  mean
##     <dbl>   <dbl> <chr>   <dbl>
## 1       0     0.2 rmse     4.38
## 2       0     0.4 rmse     4.38
## 3       0     0.6 rmse     4.38
## 4       0     0.8 rmse     4.38

Putting it all together

results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(penalty, mean, color = factor(mixture),
             group = factor(mixture))) +
  geom_line() +
  geom_point() + 
  labs(y = "RMSE")

Putting it all together

final_spec <- linear_reg(penalty = 0, mixture = 0) %>%
  set_engine("glmnet")

fit <- last_fit(final_spec,
                elastic_rec,
                split = auto_split)
fit %>%
  collect_metrics()

## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       3.90  Preprocessor1_Model1
## 2 rsq     standard       0.710 Preprocessor1_Model1

Closing Remarks

PROS

Streamline the modeling process.
Improves readability and communicative power through consistent, suggestive synatx.
Active community!

CONS

Not all methods implemented (though you have the ability to contribute!)
Like any modeling framework, one can use it blindly. Tidymodels (and other modeling ecosystems) should not be treated as a black-box.

Learn More!

The primary references for tidyverse and tidymodels are:

For convenience, here are the parsnip model and recipe step lists:

Exercise: The Rest of Class

Use tidymodels to fit KNN on the two simulated data sets from the second part of the last computational homework.

Hint:

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
              set_engine('kknn') %>%
              set_mode('classification')

Work on your group project