These slides are adapted, with permission, from a lecture given by Dr. Lucy D’Agostino McGowan (@LucyStats).
These slides are adapted, with permission, from a lecture given by Dr. Lucy D’Agostino McGowan (@LucyStats).
If you have not done so already please install and load the following packages:
library(tidyverse) library(tidymodels) library(ISLR)
This presentation assumes a basic familiarity with the core tidyverse packages:
Tidymodels is an opinionated collection of R packages designed for modeling and statistical analysis.
All packages contain an underlying philosophy and common grammar
Developed by Max Kuhn (who also developed caret, short for C-lassification A-nd RE-gression T-raining: a single package with 100s of ML algorithms)
Data Resampling and Feature Engineering:
Model Fitting and Tuning:
Model Evaluation:
data(Auto) head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin ## 1 18 8 307 130 3504 12.0 70 1 ## 2 15 8 350 165 3693 11.5 70 1 ## 3 18 8 318 150 3436 11.0 70 1 ## 4 16 8 304 150 3433 12.0 70 1 ## 5 17 8 302 140 3449 10.5 70 1 ## 6 15 8 429 198 4341 10.0 70 1 ## name ## 1 chevrolet chevelle malibu ## 2 buick skylark 320 ## 3 plymouth satellite ## 4 amc rebel sst ## 5 ford torino ## 6 ford galaxie 500
set.seed(100) # for reproducibility # partition data into training and testing set auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg) auto_split
## <Analysis/Assess/Total> ## <292/100/392>
auto_train <- auto_split %>% training() # extract training data auto_test <- auto_split %>% testing() # extract testing data
# shuffle and partition the training data into 5 folds auto_folds <- auto_train %>% vfold_cv(v = 5, strata = mpg)
linear_reg() %>% # select model set_engine('lm') %>% # specify function used to fit model set_mode('regression') # learning task
linear_reg() %>% # select model set_engine('glmnet') %>% # specify function used to fit model set_mode('regression') # learning task
linear_reg() %>% # select model set_engine('spark') %>% # specify function used to fit model set_mode('regression') # learning task
Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels.
Save it as lm_spec and look at the object. What does it return?
Write a pipe that creates a model that uses lm() to fit a linear regression using tidymodels.
Save it as lm_spec and look at the object. What does it return?
lm_spec <- linear_reg() %>% set_engine('lm') %>% set_mode('regression') lm_spec
## Linear Regression Model Specification (regression) ## ## Computational engine: lm
To train your model simply pipe your model specification into the fit function.
# pass linear_mod to fit function lm_fit <- lm_spec %>% fit(mpg ~ horsepower, data = auto_train)
Does the previous model give the same results as:
lm(mpg ~ horsepower, data = auto_train)
Does the previous model give the same results as:
lm(mpg ~ horsepower, data = auto_train)
## ## Call: ## lm(formula = mpg ~ horsepower, data = auto_train) ## ## Coefficients: ## (Intercept) horsepower ## 40.9868 -0.1676
Yes!
lm_fit
## parsnip model object ## ## ## Call: ## stats::lm(formula = mpg ~ horsepower, data = data) ## ## Coefficients: ## (Intercept) horsepower ## 40.9868 -0.1676
# make predictions on test set lm_pred <- lm_fit %>% predict(new_data = auto_test) %>% bind_cols(auto_test)
lm_pred %>% select(.pred, mpg) %>% head()
## # A tibble: 6 × 2 ## .pred mpg ## <dbl> <dbl> ## 1 4.96 14 ## 2 9.15 15 ## 3 15.9 15 ## 4 25.1 25 ## 5 25.9 21 ## 6 4.96 10
# specify performance metrics of interest auto_metrics <- metric_set( yardstick::rmse, yardstick::rsq )
# specify performance metrics of interest auto_metrics <- metric_set( yardstick::rmse, yardstick::rsq )
# compute metrics lm_pred %>% auto_metrics(truth = mpg, estimate = .pred)
## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 4.70 ## 2 rsq standard 0.597
Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE
set.seed(100) auto_split <- ________ auto_train <- ________ auto_test <- ________ lm_fit <- lm_spec %>% fit(mpg ~ horsepower, data = ________) lm_pred <- ________ %>% predict(new_data = ________) %>% bind_cols(________) lm_pred %>% auto_metrics(truth = ________, estimate = ________)
Copy the code below, fill in the blanks to fit a model on the training data then calculate the test RMSE
set.seed(100) auto_split <- initial_split(Auto, strata = mpg) auto_train <- auto_split %>% training() auto_test <- auto_split %>% testing() lm_fit <- lm_spec %>% fit(mpg ~ horsepower, data = auto_train) lm_pred <- lm_fit %>% predict(new_data = auto_test) %>% bind_cols(auto_test) lm_pred %>% auto_metrics(truth = mpg, estimate = .pred)
set.seed(100) auto_split <- initial_split(Auto, prop = 0.75, strata = mpg) lm_fit <- lm_spec %>% last_fit(mpg ~ horsepower, split = auto_split) lm_fit %>% collect_metrics()
## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 rmse standard 4.70 Preprocessor1_Model1 ## 2 rsq standard 0.597 Preprocessor1_Model1
# shuffle and partition the training data into 5 folds auto_cv <- auto_train %>% vfold_cv(v = 5, strata = mpg) # extract and the fist fold fold1 <- as.data.frame(auto_cv$splits[[1]]) fold1 %>% select(mpg, horsepower) %>% head()
## mpg horsepower ## 2 15 165 ## 4 16 150 ## 5 17 140 ## 7 14 220 ## 11 15 170 ## 14 14 225
# perform 5-fold cross-validation results <- lm_spec %>% fit_resamples(mpg ~ horsepower, resamples = auto_cv)
How do we get the metrics out?
# compute performance metrics results %>% collect_metrics()
## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 rmse standard 4.99 5 0.174 Preprocessor1_Model1 ## 2 rsq standard 0.619 5 0.0263 Preprocessor1_Model1
Edit the code below to get the 5-fold cross validation error rate for the following model:
\[mpg=\beta_0 + \beta_1horsepower + \beta_2horsepower^2 + \epsilon\]
auto_cv <- vfold_cv(Auto, v = 5) results <- lm_spec %>% fit_resamples(----, resamples = ---) results %>% collect_metrics()
Edit the code below to get the 5-fold cross validation error rate for the following model:
\[mpg=\beta_0 + \beta_1horsepower + \beta_2horsepower^2 + \epsilon\]
auto_cv <- vfold_cv(Auto, v = 5) results <- lm_spec %>% fit_resamples(mpg ~ horsepower + horsepower^2, resamples = auto_cv) results %>% collect_metrics()
## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 rmse standard 4.93 5 0.134 Preprocessor1_Model1 ## 2 rsq standard 0.609 5 0.0244 Preprocessor1_Model1
rec <- recipe(mpg ~ horsepower, data = Auto) %>% step_scale(horsepower) %>% step_center(horsepower)
rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>% step_scale(all_predictors())
# specify model lm_spec <- linear_reg() %>% set_engine('lm') %>% set_mode('regression') # create recipe rec <- recipe(mpg ~ horsepower + displacement + weight, data = Auto) %>% step_scale(all_predictors()) # combine into a single workflow object lm_workflow <- workflow() %>% add_model(lm_spec) %>% add_recipe(rec)
## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 1 Recipe Step ## ## • step_scale() ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Linear Regression Model Specification (regression) ## ## Computational engine: lm
lm_workflow %>% fit(data = auto_train)
results <- lm_workflow %>% fit_resamples(resamples = auto_cv) results %>% collect_metrics()
## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 rmse standard 4.28 5 0.216 Preprocessor1_Model1 ## 2 rsq standard 0.712 5 0.0265 Preprocessor1_Model1
Let’s extend our model to elastic net.
Recall that the objective function to be minimized for elastic net is:
\[RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)\]
First specify the engine. We’ll use glmnet.
The linear_reg() function has two additional parameters, penalty and mixture
penalty is \(\lambda\)
mixture is \(\alpha\)
lm_spec <- linear_reg() %>% set_engine("glmnet")
\[RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)\]
What would we set mixture to in order to perform Ridge regression?
ridge_spec <- linear_reg(penalty = 100, mixture = ___) %>% set_engine("glmnet")
What about LASSO?
lasso_spec <- linear_reg(penalty = 100, mixture = ___) %>% set_engine("glmnet")
\[RSS + \lambda((1 -\alpha)\vert\vert\beta\vert\vert_{2} + \alpha\vert\vert\beta\vert\vert_1)\]
What would we set mixture to in order to perform Ridge regression?
ridge_spec <- linear_reg(penalty = 100, mixture = 0) %>% set_engine("glmnet")
What about LASSO?
lasso_spec <- linear_reg(penalty = 100, mixture = 1) %>% set_engine("glmnet")
set.seed(100) # shuffle and partition the data into training and testing sets auto_split <- Auto %>% initial_split(prop = 0.75, strata = mpg) auto_train <- auto_split %>% training() auto_test <- auto_split %>% testing() # generate folds for 5-fold cv on training set auto_cv <- vfold_cv(auto_train, v = 5, strata = mpg)
# specify your model, insert tune() into hyper-parameter arguments elastic_spec <- linear_reg(penalty = tune(), mixture = tune()) %>% set_engine('glmnet') %>% set_mode('regression') # create recipe to scale predictors elastic_rec <- recipe(mpg ~ horsepower + weight + acceleration, data = auto_train) %>% step_scale(all_predictors()) # combine model spec and recipe into a workflow elastic_workflow <- workflow() %>% add_model(elastic_spec) %>% add_recipe(elastic_rec) # generate grid of values to tune hyper-parameters over elastic_grid <- expand_grid(penalty = seq(0, 100, by = 10), mixture = seq(0, 1, by = 0.2))
# tune results <- elastic_workflow %>% tune_grid(grid = elastic_grid, resamples = auto_cv)
# examine the results results %>% collect_metrics() %>% filter(.metric == "rmse") %>% arrange(mean) %>% dplyr::select(penalty, mixture, .metric, mean) %>% slice(1:4)
## # A tibble: 4 × 4 ## penalty mixture .metric mean ## <dbl> <dbl> <chr> <dbl> ## 1 0 0.2 rmse 4.38 ## 2 0 0.4 rmse 4.38 ## 3 0 0.6 rmse 4.38 ## 4 0 0.8 rmse 4.38
results %>% collect_metrics() %>% filter(.metric == "rmse") %>% ggplot(aes(penalty, mean, color = factor(mixture), group = factor(mixture))) + geom_line() + geom_point() + labs(y = "RMSE")
final_spec <- linear_reg(penalty = 0, mixture = 0) %>% set_engine("glmnet") fit <- last_fit(final_spec, elastic_rec, split = auto_split) fit %>% collect_metrics()
## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 rmse standard 3.90 Preprocessor1_Model1 ## 2 rsq standard 0.710 Preprocessor1_Model1
PROS
CONS
The primary references for tidyverse and tidymodels are:
For convenience, here are the parsnip model and recipe step lists:
Hint:
knn_spec <- nearest_neighbor(neighbors = tune()) %>% set_engine('kknn') %>% set_mode('classification')