Lasso and Ridge Linear Regression Regularization

This post is part 2 of the Linear Regression and Regularization series. Please check out part 1, Machine Learning Linear Regression And Regularization.

In [2]:
library(h2o)
h2o.init()

Let's import our data file, student-mat.csv.

In [2]:
st_mat <- h2o.importFile('student-mat.csv')
  |======================================================================| 100%
In [3]:
students.splits <- h2o.splitFrame(data = st_mat, ratios = .8)   # 80/20 train/validation split
train <- students.splits[[1]]
valid <- students.splits[[2]]
y <- 28                                         # index of the response column
x <- -match(c("Walc", "Dalc"), names(st_mat))   # drop the alcohol columns Walc and Dalc from the predictors
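Here y is just a column index. As a quick optional sanity check, we can print the corresponding column name to confirm which variable we are actually modelling:

names(st_mat)[y]    # prints the name of the response column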

H2O Generalized Linear Model (GLM)

In [4]:
students.glm <- h2o.glm(x = x, y = y, training_frame = train,
                        validation_frame = valid, remove_collinear_columns = TRUE)
  |======================================================================| 100%

If we look at students.glm@model$model_summary, we can see which model type h2o has run by default.

In [5]:
students.glm@model$model_summary
A H2OTable: 1 × 7
family    link      regularization                              number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
<chr>     <chr>     <chr>                                       <int>                        <int>                        <int>                 <chr>
gaussian  identity  Elastic Net (alpha = 0.5, lambda = 0.101)   57                           10                           1                     RTMP_sid_88ca_2

The table above shows that the family is "gaussian" with the identity link, i.e. ordinary linear regression. It also shows the regularization type, which is Elastic Net, h2o's default.

Regularization

H2O's GLM fits linear regression by maximizing the log-likelihood. We can use regularization to get a model that generalizes better. With regularization, H2O maximizes the difference between the GLM log-likelihood and the regularization penalty (equivalently, it minimizes the penalized negative log-likelihood).

There are 3 types of regularization techniques.

  1. Lasso Regression (L1)
  2. Ridge Regression (L2)
  3. Elastic Net (Weighted sum of (L1 + L2))

Regularization depends upon the tuning hyperparameters alpha and lambda. For lambda > 0: if alpha = 1, we get Lasso regression; if alpha = 0, we get Ridge regression; and for alpha between 0 and 1, we get Elastic Net regression.
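The penalized objective can be sketched as follows (this is the standard elastic-net form of the penalty; the exact scaling constants H2O uses internally may differ slightly):

\[
\max_{\beta}\; \log L(\beta)\;-\;\lambda\left[\alpha\,\lVert\beta\rVert_{1}\;+\;\frac{1-\alpha}{2}\,\lVert\beta\rVert_{2}^{2}\right]
\]

Here log L(beta) is the GLM log-likelihood, the L1 term is what drives coefficients exactly to zero (Lasso), the L2 term shrinks them smoothly towards zero (Ridge), alpha mixes the two penalties, and lambda controls the overall strength of the regularization.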

Let us check what the optimal value of alpha is for our dataset by running a grid search. We give it a list of values to choose alpha from.

In [6]:
hyper_params <- list( alpha = c(0, .25, .5, .75, .1, 1) )
In [7]:
h2o.grid(x = x, y = y, training_frame = train,
         validation_frame = valid, hyper_params = hyper_params,
         search_criteria = list(strategy = "Cartesian"), algorithm = "glm",
         grid_id = "student_grid")
  |======================================================================| 100%
H2O Grid Details
================

Grid ID: student_grid 
Used hyper parameters: 
  -  alpha 
Number of models: 12 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by increasing residual_deviance
    alpha             model_ids  residual_deviance
1   [0.0]  student_grid_model_7  79.50790677500659
2   [1.0] student_grid_model_12   91.2447911418529
3  [0.75] student_grid_model_10  91.55635741162314
4   [0.5]  student_grid_model_9  92.18487887050757
5  [0.25]  student_grid_model_8  94.04144279433028
6   [0.1] student_grid_model_11  98.87271830795697
8   [0.5]  student_grid_model_3 106.02649678592279
9  [0.75]  student_grid_model_4   106.323804549756
10 [0.25]  student_grid_model_2 106.33857113059179
11  [0.1]  student_grid_model_5  108.2715773332973
12  [0.0]  student_grid_model_1 109.03048641410442

As we see above, the grid models are ordered by increasing residual deviance, and the best model on this split has alpha = 0, i.e. Ridge regression, while pure Lasso (alpha = 1) and the intermediate Elastic Net values do not do quite as well.
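Instead of reading the table by eye, we can also pull the best model out of the grid programmatically. A minimal sketch, assuming the grid id student_grid from the cell above (sorted_grid and best_glm are just illustrative variable names):

sorted_grid <- h2o.getGrid("student_grid", sort_by = "residual_deviance", decreasing = FALSE)
best_glm <- h2o.getModel(sorted_grid@model_ids[[1]])   # model with the lowest residual deviance
best_glm@parameters$alpha                              # alpha value used by the winning model
h2o.mse(best_glm, valid = TRUE)                        # its MSE on the validation frame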

Lasso Regression

Lasso regression uses the L1 penalty. Lasso is also sometimes called a variable selection technique, because it can shrink coefficients exactly to zero. Lasso depends upon the tuning parameter lambda: as lambda grows, more and more coefficients are driven to zero.
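One way to see this shrinkage effect is to refit the Lasso at a few fixed lambda values and count how many coefficients survive. This is only a sketch; the lambda values below are illustrative, not tuned:

for (lam in c(0.001, 0.01, 0.1, 1)) {
  m <- h2o.glm(x = x, y = y, training_frame = train, alpha = 1, lambda = lam)
  coefs <- h2o.coef(m)                                  # named vector of fitted coefficients
  cat("lambda =", lam, "-> non-zero coefficients:", sum(coefs[-1] != 0), "\n")
}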

To apply Lasso regularization, we set alpha = 1.

In [8]:
students.glm <- h2o.glm(x = x, y = y, training_frame = train,
                        validation_frame = valid, remove_collinear_columns = TRUE, alpha = 1)
  |======================================================================| 100%

Let us look at the Lasso model summary.

In [9]:
students.glm@model$model_summary
A H2OTable: 1 × 7
family    link      regularization             number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
<chr>     <chr>     <chr>                      <int>                        <int>                        <int>                 <chr>
gaussian  identity  Lasso (lambda = 0.05048)   57                           10                           1                     RTMP_sid_88ca_2

As we see above, the regularization is now Lasso, with lambda = 0.05048.
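The lambda above was chosen by H2O's default heuristic. If we would rather let H2O search over a whole path of lambda values and keep the best one, we can turn on lambda search. A minimal sketch (lasso_path is an illustrative variable name, and nlambdas = 30 is an arbitrary choice):

lasso_path <- h2o.glm(x = x, y = y, training_frame = train, validation_frame = valid,
                      alpha = 1, lambda_search = TRUE, nlambdas = 30)
lasso_path@model$model_summary    # the summary reports the lambda value that was finally used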

As mentioned, Lasso is a predictor selection technique. We can simply filter our predictors on coefficient values greater than zero, as shown below.

In [10]:
students.glm@model$coefficients_table[students.glm@model$coefficients_table$coefficients > 0,]
A H2OTable: 6 × 3
     names       coefficients  standardized_coefficients
     <chr>       <dbl>         <dbl>
1    Intercept   2.17423466    2.59851126
48   traveltime  0.16625075    0.12113867
50   failures    0.04568047    0.03478202
53   goout       0.41970504    0.47231209
54   health      0.06863053    0.09553533
55   absences    0.01545513    0.11203287
In [11]:
print(h2o.mse(students.glm, valid=TRUE))
[1] 1.1232

Ridge Regression

In Ridge regression, we set alpha = 0, as shown below. Ridge uses the L2 penalty, which shrinks coefficients towards zero but, unlike Lasso, does not set them exactly to zero.

In [12]:
students.glm <- h2o.glm(x = x, y = y, training_frame = train,
                        validation_frame = valid, remove_collinear_columns = TRUE, alpha = 0)
  |======================================================================| 100%

Let us print the MSE.

In [13]:
print(h2o.mse(students.glm, valid=TRUE))
[1] 0.9985721
In [14]:
students.glm@model$model_summary
A H2OTable: 1 × 7
family    link      regularization             number_of_predictors_total  number_of_active_predictors  number_of_iterations  training_frame
<chr>     <chr>     <chr>                      <int>                        <int>                        <int>                 <chr>
gaussian  identity  Ridge (lambda = 0.05048)   57                           40                           1                     RTMP_sid_88ca_2

From the model summary, we can see that the number of active predictors in Ridge regression is 40, which is far more than in Lasso regression, where only 10 predictors stayed active (and only 6 of them had positive coefficients in the filter above).
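To make the comparison explicit, here is a small side-by-side sketch (reusing x, y, train and valid from above; lasso and ridge are just illustrative variable names) that counts the non-zero coefficients and prints the validation MSE for both models:

lasso <- h2o.glm(x = x, y = y, training_frame = train, validation_frame = valid, alpha = 1)
ridge <- h2o.glm(x = x, y = y, training_frame = train, validation_frame = valid, alpha = 0)

cat("Lasso non-zero coefficients:", sum(h2o.coef(lasso)[-1] != 0), "\n")
cat("Ridge non-zero coefficients:", sum(h2o.coef(ridge)[-1] != 0), "\n")
cat("Lasso validation MSE:", h2o.mse(lasso, valid = TRUE), "\n")
cat("Ridge validation MSE:", h2o.mse(ridge, valid = TRUE), "\n")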