Ridge Regression
Ridge regression is a regularization technique that penalizes predictors with large coefficient estimates. By adding a tuning parameter $\lambda$ to the objective function, ridge regression imposes a penalty on large coefficients, shrinking them toward zero; the larger $\lambda$ is, the stronger the shrinkage.
To motivate ridge regression, let's reconstruct the loss function (objective function) for regression by adding a penalty on the weights. Mathematically, we can think of it this way:
$$ RSS(W) + \lambda||W||_2^2 $$
where $RSS$ is the residual sum of squares, $\lambda$ is the ridge hyperparameter, and $||W||_2^2$ is the squared L2 norm of the coefficient vector.
What happens when:
$\lambda = 0$: the total cost is just $RSS(W)$, which is the original least squares formulation of linear regression.
$\lambda = \infty$: the penalty term dominates, so the only minimizing solution is $\hat W = 0$ (any nonzero coefficient would make the penalty infinite).
For any $\lambda$ in between, the ridge solution satisfies $0 \le ||\hat W_{ridge}||_2^2 \le ||\hat W_{LS}||_2^2$, where $\hat W_{LS}$ is the ordinary least squares solution.
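For reference, we can write the full objective out explicitly (the intercept $w_0$ is conventionally left unpenalized):
$$ \hat W^{ridge} = \underset{W}{\arg\min} \sum_{i=1}^{n} \Big( y_i - w_0 - \sum_{j=1}^{p} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{p} w_j^2 $$
When the predictors are centered so that the intercept can be handled separately, the penalized coefficients also admit a closed-form solution:
$$ \hat W^{ridge} = (X^\top X + \lambda I)^{-1} X^\top y $$
where $X$ is the matrix of predictors and $I$ is the identity matrix.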
Now that we have explored the idea conceptually, we can move on to the implementation.
Implementation of Ridge Regression with the Hitters Dataset
In the note below, we implement ridge regression using tidymodels. Specifically, we will need the ISLR2 package for the Hitters data and the tidyverse for data manipulation.
library(ISLR2)
library(tidyverse, quietly = TRUE)
# Convert Hitters to a tibble and drop players with a missing Salary
hitters <- as_tibble(Hitters) %>%
  filter( !is.na(Salary) )
dim(hitters)
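If you want a quick look at the variables before modeling, dplyr's glimpse() prints each column alongside its type. This check is optional and not part of the workflow that follows.
glimpse(hitters)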
Defining the Ridge Regression Model
As we have done several times in the lab, we begin by defining the model specification: linear_reg() with mixture = 0 for a pure ridge (L2) penalty and the glmnet engine.
# install.packages("glmnet")
library(tidymodels, quietly = TRUE)
ridge_regression <- linear_reg( mixture = 0, penalty = 0 ) %>% # setting a ridge penalty of zero
set_mode("regression") %>%
set_engine("glmnet")
ridge_regression
We have now defined the specification for the ridge regression model, so we can go ahead and fit it to the data. For this first pass, we skip the train/test split, which we assume you are already familiar with, and focus instead on the implementation details.
ridge_regression_fit <- fit( ridge_regression, Salary ~ ., data = hitters )
ridge_regression_fit %>% tidy()
The output above shows the coefficient estimates when the ridge penalty is set to zero (no penalty). We can also view the results when the ridge penalty is increased. Note that we may need a sufficiently large ridge penalty to see noticeable changes in the estimated coefficients. In the example below, we use 1000.
tidy(ridge_regression_fit, penalty = 1000)
Try a few penalty values for yourself now. What do you see as the result?
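As one possible sketch, the snippet below follows a single coefficient (AtBat) across a handful of arbitrarily chosen penalty values; the values 0, 100, and 100000 are illustrative only.
# Track how the AtBat coefficient shrinks as the penalty grows
purrr::map_dfr(
  c(0, 100, 100000),
  ~ tidy(ridge_regression_fit, penalty = .x)
) %>%
  filter( term == "AtBat" )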
Visualizing Impact of Penalty
Using the underlying glmnet fit object, we can plot how the coefficient estimates change as the penalty increases.
par(mar = c(5, 5, 5, 1))               # widen the plot margins
ridge_regression_fit %>%
  parsnip::extract_fit_engine() %>%    # pull out the underlying glmnet object
  plot(xvar = "lambda")                # glmnet's built-in coefficient path plot
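The plot above uses glmnet's built-in base-graphics method. If you prefer to stay in the tidyverse, a rough alternative (sketched below with an arbitrary grid of penalties) is to collect the coefficients over a range of penalty values and draw the paths with ggplot2.
# Gather coefficient estimates over a grid of penalties and plot the paths
purrr::map_dfr(
  10^seq(-2, 5, length.out = 50),
  ~ tidy(ridge_regression_fit, penalty = .x)
) %>%
  filter( term != "(Intercept)" ) %>%
  ggplot( aes(x = penalty, y = estimate, colour = term) ) +
  geom_line() +
  scale_x_log10() +
  labs( x = "Penalty (log10 scale)", y = "Coefficient estimate" ) +
  theme( legend.position = "none" )   # legend omitted: one line per predictor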
Tuning the Ridge L2 Penalty
We have seen that we can apply various penalty values to the same fitted model. This is fine for demonstration, but in practice we need a way to optimize the L2 penalty. In the next section, we demonstrate how to build a pipeline that finds the best penalty value for us.
We begin by splitting the data into training and test sets and then creating cross-validation folds.
set.seed(1234)
hitters <- as_tibble(ISLR2::Hitters) %>% drop_na()
names(hitters)
data_split <- initial_split(hitters, strata = "Salary")
train_data <- training(data_split)
test_data <- testing(data_split)
hitters_fold <- vfold_cv(train_data, v = 10)
hitters_fold
Tune Grid for Parameter Search
We can now use the tune_grid() function to perform hyperparameter tuning via grid search. To do this, we need a workflow, resamples, and a grid of candidate penalty values. Before that, we need to preprocess the data: ridge regression is sensitive to the scale of the predictors, so we normalize them.
ridge_recipe <- recipe( formula = Salary ~ ., data = train_data ) %>%
step_novel( all_nominal_predictors() ) %>% # Handle new (unseen) factor levels in categorical predictors
step_dummy( all_nominal_predictors() ) %>% # Convert all categorical predictors to dummy/indicator variables
step_zv( all_predictors() ) %>% # Remove predictors with zero variance (constant variables)
step_normalize( all_predictors() ) # Normalize (center and scale) all predictors to have a mean
# of 0 and standard deviation of 1
ridge_recipe
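To confirm what the recipe does to the predictors, you can prep() it and bake() the training data. This is an optional sanity check and not required for the tuning pipeline below.
# Inspect the processed training data: dummy variables created, predictors centered and scaled
ridge_recipe %>%
  prep() %>%
  bake( new_data = NULL ) %>%
  slice_head( n = 5 )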
With the preprocessing above complete, we can now create the model specification. This time, we set the penalty to tune() so that its value can be chosen by the grid search.
ridge_spec <- linear_reg( penalty = tune(), mixture = 0 ) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_spec
ridge_workflow <- workflow() %>%
add_recipe( ridge_recipe ) %>%
add_model( ridge_spec )
ridge_workflow
Finally, the last thing we need is the set of penalty values to pass to the grid search. We can
provide this range of values in several forms. One option is dials::grid_regular(), which creates an evenly spaced grid over the penalty range.
penalty_grid <- dials::grid_regular(
x = dials::penalty(
# Define the range of the penalty parameter.
range = c(-5, 5),
# Apply a logarithmic transformation (base 10) to the penalty values.
trans = scales::log10_trans()
),
# Create 50 evenly spaced levels within the specified penalty range.
levels = 50
)
penalty_grid %>% slice_head( n = 10)
Note that the penalty values are generated on a log10 scale, spanning from $10^{-5}$ to $10^{5}$.
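A quick way to confirm the span of the grid (the penalty column holds the actual penalty values, not their logarithms):
range(penalty_grid$penalty)   # runs from 1e-05 to 1e+05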
Tuning the Parameter
Now we have everything we need. We can go ahead and implement the grid search.
tune_res <- tune::tune_grid(
object = ridge_workflow, # object with ridge model
resamples = hitters_fold, # define cross-validation fold
grid = penalty_grid # testing penalty values
)
tune_res
We see that the cross-validation is complete. We can immediately look at the autoplot() output to gauge the impact of regularization on the metrics.
autoplot(tune_res)
From the plot, we see that small amounts of regularization have limited impact on rsq and rmse. As the penalty increases, however, both metrics begin to change noticeably.
Overall Metrics
We can collect the metrics from the regularization grid search by passing the tuning results to collect_metrics().
collect_metrics(tune_res)
To get the best penalty, we can simply use select_best(), which takes the tuning results and a metric
to optimize. The example below uses the rsq metric.
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
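If you would rather compare several strong candidates instead of a single winner, show_best() lists the top rows for a given metric; this is an optional check.
show_best(tune_res, metric = "rsq", n = 5)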
Finally, we can finalize the workflow with the best penalty and fit the model on the training data.
ridge_final <- finalize_workflow( ridge_workflow, best_penalty )
ridge_final_fit <- fit(ridge_final, data = train_data)
Now, we apply this to the testing data.
augment(ridge_final_fit, new_data = test_data) %>%
rsq(truth = Salary, estimate = .pred)
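As an optional follow-up, you can report rmse alongside rsq on the test set with a yardstick metric_set(), and pull the final coefficient estimates out of the fitted workflow. The snippet below is a sketch of these two checks.
# Evaluate both rmse and rsq on the test data
test_metrics <- metric_set(rmse, rsq)
augment(ridge_final_fit, new_data = test_data) %>%
  test_metrics(truth = Salary, estimate = .pred)
# Inspect the coefficients of the finalized model
ridge_final_fit %>%
  extract_fit_parsnip() %>%
  tidy()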