Acknowledgments

These slides are adapted, with permission, from a lecture given by Dr. Lucy D’Agostino McGowan (@LucyStats).

Preliminaries

If you have not done so already, please install and load the following packages:

  • tidyverse
  • tidymodels
  • ISLR

library(tidyverse)
library(tidymodels)
library(ISLR)

Tidyverse

This presentation assumes a basic familiarity with the core tidyverse packages (ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats).

Tidymodels

  • Tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.

  • All packages share an underlying design philosophy and a common grammar.

  • Developed by Max Kuhn (who also developed caret, short for Classification And REgression Training: a single package providing a consistent interface to hundreds of ML algorithms).

Tidymodels

Core Packages

Data Resampling and Feature Engineering:

  • rsample
  • recipes

Model Fitting and Tuning:

  • parsnip
  • tune
  • dials

Model Evaluation:

  • yardstick
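
As a rough orientation (not an exhaustive list), the main functions used later in these slides map onto the core packages like this:

# rsample:   initial_split(), training(), testing(), vfold_cv()
# recipes:   recipe(), step_*(), all_predictors()
# parsnip:   linear_reg(), set_engine(), set_mode(), fit()
# tune:      tune(), fit_resamples(), tune_grid(), last_fit(), collect_metrics()
# dials:     penalty(), mixture(), grid_regular()  (tuning-parameter objects and grids)
# yardstick: metric_set(), rmse(), rsq()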

Tidymodels Workflow

A First Example

data(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

Data Resampling & Feature Engineering

Test/Train Split

set.seed(100) # for reproducibility

# partition data into training and testing set
auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg)
auto_split
## <Analysis/Assess/Total>
## <292/100/392>
auto_train <- auto_split %>% training() # extract training data
auto_test <- auto_split %>% testing() # extract testing data

Data Resampling & Feature Engineering

K-Fold CV

# shuffle and partition the training data into 5 folds
auto_folds <- auto_train %>% vfold_cv(v = 5, strata = mpg)

Model Fitting & Tuning

Step 1: Specify Your Model

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('lm') %>% # specify function used to fit model
  set_mode('regression') # learning task

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('glmnet') %>% # specify function used to fit model
  set_mode('regression') # learning task

Model Fitting & Tuning

Step 1: Specify Your Model

linear_reg() %>% # select model
  set_engine('spark') %>% # specify function used to fit model
  set_mode('regression') # learning task

Exercise

Write a pipe that creates a model specification that uses lm() to fit a linear regression with tidymodels.

Save it as lm_spec and look at the object. What does it return?

Exercise

Write a pipe that creates a model specification that uses lm() to fit a linear regression with tidymodels.

Save it as lm_spec and look at the object. What does it return?

lm_spec <- linear_reg() %>%
            set_engine('lm') %>%
            set_mode('regression')

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Model Fitting & Tuning

Step 2: Train Your Model

To train your model, simply pipe your model specification into the fit() function.

# pass linear_mod to fit function
lm_fit <- lm_spec %>%
  fit(mpg ~ horsepower, data = auto_train)

Exercise

Does the previous model give the same results as:

lm(mpg ~ horsepower, data = auto_train)

Exercise

Does the previous model give the same results as:

lm(mpg ~ horsepower, data = auto_train)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto_train)
## 
## Coefficients:
## (Intercept)   horsepower  
##     40.9868      -0.1676

Exercise

Yes!

lm_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = mpg ~ horsepower, data = data)
## 
## Coefficients:
## (Intercept)   horsepower  
##     40.9868      -0.1676
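
Parsnip fit objects also work with broom (loaded with tidymodels), so the coefficients can be pulled out as a tidy tibble:

# coefficient estimates as a tibble, one row per term
lm_fit %>% tidy()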

Model Evaluation

Step 1: Generate Predictions

# make predictions on test set
lm_pred <- lm_fit %>%
  predict(new_data = auto_test) %>%
  bind_cols(auto_test)

Model Evaluation

Step 1: Generate Predictions

lm_pred %>%
  select(.pred, mpg) %>%
  head()
## # A tibble: 6 × 2
##   .pred   mpg
##   <dbl> <dbl>
## 1  4.96    14
## 2  9.15    15
## 3 15.9     15
## 4 25.1     25
## 5 25.9     21
## 6  4.96    10

Model Evaluation

Step 2: Assess Model

# specify performance metrics of interest
auto_metrics <- metric_set(
  yardstick::rmse,
  yardstick::rsq
)

Model Evaluation

Step 2: Assess Model

# specify performance metrics of interest
auto_metrics <- metric_set(
  yardstick::rmse,
  yardstick::rsq
)
# compute metrics
lm_pred %>%
  auto_metrics(truth = mpg, estimate = .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.70 
## 2 rsq     standard       0.597

Exercise

Copy the code below and fill in the blanks to fit a model on the training data, then calculate the test RMSE.

set.seed(100)
auto_split  <- ________
auto_train  <- ________
auto_test   <- ________

lm_fit      <-  lm_spec %>% fit(mpg ~ horsepower, data = ________)

lm_pred  <- ________ %>% 
  predict(new_data = ________) %>% 
  bind_cols(________)

lm_pred %>% auto_metrics(truth = ________, estimate = ________)

Exercise

Copy the code below and fill in the blanks to fit a model on the training data, then calculate the test RMSE.

set.seed(100)
auto_split  <- initial_split(Auto, strata = mpg)
auto_train  <- auto_split %>% training()
auto_test   <- auto_split %>% testing()

lm_fit      <-  lm_spec %>% fit(mpg ~ horsepower, data = auto_train)

lm_pred  <- lm_fit %>% 
  predict(new_data = auto_test) %>% 
  bind_cols(auto_test)

lm_pred %>% auto_metrics(truth = mpg, estimate = .pred)

A Faster Way: last_fit()

  • You can use last_fit() and specify the split object.
  • This will automatically train the model on the training portion of the split.
  • Instead of specifying which metrics to calculate, you can just call collect_metrics(), which automatically calculates the metrics on the test portion of the split.

A Faster Way: last_fit()

set.seed(100)
auto_split <- initial_split(Auto, prop = 0.75, strata = mpg)

lm_fit <- lm_spec %>% last_fit(mpg ~ horsepower, 
                               split = auto_split)

lm_fit %>% collect_metrics()
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       4.70  Preprocessor1_Model1
## 2 rsq     standard       0.597 Preprocessor1_Model1
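
last_fit() also stores the test-set predictions; they can be retrieved with collect_predictions():

# test-set predictions saved by last_fit()
lm_fit %>% collect_predictions() %>% head()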

What About Cross-Validation?

# shuffle and partition the training data into 5 folds
auto_cv <- auto_train %>% vfold_cv(v = 5, strata = mpg)

# extract the analysis (training) portion of the first fold
fold1 <- as.data.frame(auto_cv$splits[[1]])
fold1 %>%
  select(mpg, horsepower) %>%
  head()
##    mpg horsepower
## 2   15        165
## 4   16        150
## 5   17        140
## 7   14        220
## 11  15        170
## 14  14        225

What About Cross-Validation?

# perform 5-fold cross-validation
results <- lm_spec %>% fit_resamples(mpg ~ horsepower,
                                       resamples = auto_cv)

What About Cross-Validation?

How do we get the metrics out?

  • With collect_metrics() again!

# compute performance metrics
results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.99      5  0.174  Preprocessor1_Model1
## 2 rsq     standard   0.619     5  0.0263 Preprocessor1_Model1
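
By default collect_metrics() averages over the folds; the individual fold estimates are available too:

# per-fold metrics instead of across-fold averages
results %>% collect_metrics(summarize = FALSE)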

Exercise

Edit the code below to get the 5-fold cross-validation error for the following model:

\[mpg = \beta_0 + \beta_1 \cdot horsepower + \beta_2 \cdot horsepower^2 + \epsilon\]

auto_cv <- vfold_cv(Auto, v = 5)
results <- lm_spec %>% fit_resamples(----,
                                      resamples = ---)
results %>%
  collect_metrics()

Exercise

Edit the code below to get the 5-fold cross-validation error for the following model:

\[mpg = \beta_0 + \beta_1 \cdot horsepower + \beta_2 \cdot horsepower^2 + \epsilon\]

auto_cv <- vfold_cv(Auto, v = 5)
results <- lm_spec %>% fit_resamples(mpg ~ horsepower + I(horsepower^2),
                                      resamples = auto_cv)
results %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.93      5  0.134  Preprocessor1_Model1
## 2 rsq     standard   0.609     5  0.0244 Preprocessor1_Model1

Tidymodels Workflow

What have we covered so far?

What if we wanted to do some pre-processing?

  • recipe()!
  • Using the recipe() function along with step_*() functions, we can specify preprocessing steps, and tidymodels will apply them to each fold appropriately.

rec <- recipe(mpg ~ horsepower, data = Auto) %>%
  step_scale(horsepower) %>%
  step_center(horsepower)
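
A recipe is only a specification: nothing is computed until it is estimated. A minimal sketch of what happens under the hood (workflows and fit_resamples() take care of this for you):

rec %>%
  prep(training = auto_train) %>% # estimate the scaling/centering statistics from the training data
  bake(new_data = auto_test) %>%  # apply them to new data
  head()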

What if we want to predict mpg with more variables?

  • We still want to add a step to scale the predictors.
  • We could either write out all of the predictors individually to scale them,
  • OR we could use the all_predictors() shorthand.

rec <- recipe(mpg ~ horsepower + displacement + weight,
              data = Auto) %>%
  step_scale(all_predictors())

How to incorporate a recipe?

  • workflow()!

# specify model
lm_spec <- linear_reg() %>%
              set_engine('lm') %>%
              set_mode('regression')

# create recipe
rec <- recipe(mpg ~ horsepower + displacement + weight, 
              data = Auto) %>%
  step_scale(all_predictors())

# combine into a single workflow object
lm_workflow <- workflow() %>%
                add_model(lm_spec) %>%
                add_recipe(rec)

How to incorporate a recipe?

lm_workflow
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 1 Recipe Step
## 
## • step_scale()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

# fit the workflow on the training data
lm_workflow %>% fit(data = auto_train)

How to incorporate a recipe?

results <- lm_workflow %>% 
            fit_resamples(resamples = auto_cv)

results %>% collect_metrics()
## # A tibble: 2 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 rmse    standard   4.28      5  0.216  Preprocessor1_Model1
## 2 rsq     standard   0.712     5  0.0265 Preprocessor1_Model1

What about hyper-parameter tuning?

  • Let’s extend our model to elastic net.

  • Recall that the objective function to be minimized for elastic net is:

\[RSS + \lambda\left((1 -\alpha)\Vert\beta\Vert_{2}^{2} + \alpha\Vert\beta\Vert_1\right)\]

What about hyper-parameter tuning?

  • First specify the engine. We’ll use glmnet.

  • The linear_reg() function has two additional arguments: penalty and mixture.

  • penalty is \(\lambda\)

  • mixture is \(\alpha\)

lm_spec <- linear_reg() %>%
  set_engine("glmnet")

Exercise

\[RSS + \lambda\left((1 -\alpha)\Vert\beta\Vert_{2}^{2} + \alpha\Vert\beta\Vert_1\right)\]

What would we set mixture to in order to perform Ridge regression?

ridge_spec <- linear_reg(penalty = 100, mixture = ___) %>%
  set_engine("glmnet")

What about LASSO?

lasso_spec <- linear_reg(penalty = 100, mixture = ___) %>%
  set_engine("glmnet")

Exercise

\[RSS + \lambda\left((1 -\alpha)\Vert\beta\Vert_{2}^{2} + \alpha\Vert\beta\Vert_1\right)\]

What would we set mixture to in order to perform Ridge regression?

ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>%
  set_engine("glmnet")

What about LASSO?

lasso_spec <- linear_reg(penalty = 100, mixture = 1) %>%
  set_engine("glmnet")

Putting it all together

  • Tuning \(\lambda\) and \(\alpha\) is straightforward.
  1. Generate folds on training data.
  2. Specify your model/recipe. Insert tune() into the hyper-parameter arguments in question.
  3. Create the grid you’d like to tune over.
  4. Put it all together with tune_grid().
  5. Evaluate the tuned model on the test data.

Putting it all together

set.seed(100)

# shuffle and partition the data into training and testing sets
auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg)
auto_train <- auto_split %>% training()
auto_test <- auto_split %>% testing()


# generate folds for 5-fold cv on training set
auto_cv <- vfold_cv(auto_train, v = 5, strata = mpg)

Putting it all together

# specify your model, insert tune() into hyper-parameter arguments
elastic_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
                  set_engine('glmnet') %>%
                  set_mode('regression')

# create recipe to scale predictors
elastic_rec <- recipe(mpg ~ horsepower + weight + acceleration, 
                      data = auto_train) %>%
               step_scale(all_predictors())

# combine model spec and recipe into a workflow
elastic_workflow <- workflow() %>%
                      add_model(elastic_spec) %>%
                      add_recipe(elastic_rec)

# generate grid of values to tune hyper-parameters over
elastic_grid <- expand_grid(penalty = seq(0, 100, by = 10),
                            mixture = seq(0, 1, by = 0.2))
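
Building the grid by hand with expand_grid() works fine; as an alternative sketch, the dials package (from the core-package list) can construct a regular grid with sensible default ranges:

# let dials build the grid; note its default penalty values are on a log10 scale,
# unlike the hand-built grid above
elastic_grid_dials <- grid_regular(penalty(), mixture(), levels = 6)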

Putting it all together

# tune
results <- elastic_workflow %>% tune_grid(grid = elastic_grid,
                                          resamples = auto_cv)

Putting it all together

# examine the results
results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  arrange(mean) %>%
  dplyr::select(penalty, mixture, .metric, mean) %>%
  slice(1:4)
## # A tibble: 4 × 4
##   penalty mixture .metric  mean
##     <dbl>   <dbl> <chr>   <dbl>
## 1       0     0.2 rmse     4.38
## 2       0     0.4 rmse     4.38
## 3       0     0.6 rmse     4.38
## 4       0     0.8 rmse     4.38
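
Rather than reading the winning combination off the table, tune also provides helpers to extract it and plug it back into the workflow; the finalized workflow can then be passed straight to last_fit():

# pick the best penalty/mixture by cross-validated RMSE
best_params <- results %>% select_best(metric = "rmse")

# fill the winning values into the workflow
final_workflow <- elastic_workflow %>% finalize_workflow(best_params)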

Putting it all together

results %>%
  collect_metrics() %>%
  filter(.metric == "rmse") %>%
  ggplot(aes(penalty, mean, color = factor(mixture),
             group = factor(mixture))) +
  geom_line() +
  geom_point() + 
  labs(y = "RMSE")

Putting it all together

[Plot: cross-validated RMSE against penalty, one line per mixture value.]
Putting it all together

final_spec <- linear_reg(penalty = 0, mixture = 0) %>%
  set_engine("glmnet")

fit <- last_fit(final_spec,
                elastic_rec,
                split = auto_split)
fit %>%
  collect_metrics()
## # A tibble: 2 × 4
##   .metric .estimator .estimate .config             
##   <chr>   <chr>          <dbl> <chr>               
## 1 rmse    standard       3.90  Preprocessor1_Model1
## 2 rsq     standard       0.710 Preprocessor1_Model1
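
If the final model is needed afterwards (e.g. to predict on new data), the fitted workflow can be pulled out of the last_fit() object; a small sketch using tune's extract_workflow():

# the workflow trained on the full training set
final_model <- fit %>% extract_workflow()

# predict with the usual interface
final_model %>% predict(new_data = head(auto_test))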

Closing Remarks

PROS

  • Streamlines the modeling process.
  • Improves readability and communicative power through consistent, suggestive syntax.
  • Active community!

CONS

  • Not all methods are implemented (though you have the ability to contribute!).
  • Like any modeling framework, one can use it blindly. Tidymodels (and other modeling ecosystems) should not be treated as a black box.

Learn More!

Exercise: The Rest of Class

  • Use tidymodels to fit KNN on the two simulated data sets from the second part of the last computational homework.

Hint:

knn_spec <- nearest_neighbor(neighbors = tune()) %>%
              set_engine('kknn') %>%
              set_mode('classification')

  • Work on your group project.