Machine Learning Linear Regression And Regularization
Linear regression is a model that predicts a dependent variable from one or more independent variables. The model assumes a linear relationship between the dependent and independent variables. The equation below is a linear regression with two independent variables.
y = a + c1x1 + c2x2
In the above equation, y is the dependent variable and x1 and x2 are the independent variables; a is the intercept, and c1 and c2 are the coefficients. We are trying to predict y from x1 and x2.
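To make the equation concrete, here is a minimal base R sketch. The values of a, c1, c2, x1 and x2 are made up for illustration; the point is only that fitting a linear model to data generated from the equation recovers the intercept and coefficients.

```r
# Hypothetical intercept and coefficients, chosen only for illustration.
a  <- 1.5
c1 <- 2.0
c2 <- -0.5

# Two made-up independent variables (not collinear with each other).
x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 6)

# Dependent variable generated exactly from y = a + c1*x1 + c2*x2.
y <- a + c1 * x1 + c2 * x2

# Fit an ordinary least squares model and read back the estimates.
fit <- lm(y ~ x1 + x2)
print(coef(fit))
```

Because the data contain no noise, the estimates match a, c1 and c2 up to floating point error. Real data will only approximate the underlying relationship.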
In this post, I will work through an example of linear regression and regularization using the machine learning package H2O. H2O is a great library and offers a lot of techniques right out of the box.
I will use the student alcohol consumption data, which I downloaded from the following UCI website...
Before we delve into our data analysis, make sure you have the following installed and working...
In your R REPL, let's import the H2O package.
Let's import our data file student-mat.csv.
st_mat <- h2o.importFile('student-mat.csv')
Let's look at the first two rows using the head method.
Let's look at the column names as well.
To check the number of rows, we can use h2o.nrow.
For linear regression, we should also check how many columns there are. We can do that with h2o.ncol.
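The import and inspection steps above can be tried end to end on a tiny made-up frame. This is only a sketch: it assumes the h2o R package and Java are installed, and the toy_df data frame and its values are invented for illustration (they are not rows from student-mat.csv).

```r
library(h2o)
h2o.init()  # start (or connect to) a local H2O cluster

# Made-up rows for illustration only; not taken from student-mat.csv.
toy_df <- data.frame(age   = c(15, 16, 17, 18),
                     goout = c(2, 3, 4, 5),
                     Walc  = c(1, 2, 3, 5))

# Upload the R data frame to the cluster as an H2OFrame.
toy <- as.h2o(toy_df)

print(head(toy, 2))    # first two rows
print(colnames(toy))   # column names
print(h2o.nrow(toy))   # number of rows
print(h2o.ncol(toy))   # number of columns
```

With the real data, the same calls would be applied to st_mat instead of toy.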
One of the most important things about linear regression is choosing the right set of independent variables for our dependent variable.
For our dependent variable, which is the variable we want to predict, let us pick "Walc", which is column number 28.
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
Basically, we are trying to predict weekend alcohol consumption. Let's see which of the variables help us do that.
To train our linear regression model, let us split our data in an 80% to 20% ratio using h2o.splitFrame.
students.splits <- h2o.splitFrame(data = st_mat, ratios = .8)
train <- students.splits[[1]]
valid <- students.splits[[2]]
OK, now we have our training and validation sets separated.
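The idea behind an 80/20 split can be sketched in base R on a made-up data frame. This is an illustration of the concept, not of how h2o.splitFrame works internally; as far as I know, H2O assigns rows probabilistically, so its split sizes are approximate rather than exact. The names toy, toy_train and toy_valid are hypothetical.

```r
set.seed(42)  # make the random split reproducible

# Made-up data frame with 100 rows, for illustration only.
toy <- data.frame(id = 1:100, value = rnorm(100))

# Randomly pick 80% of the row indices for training.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))

toy_train <- toy[train_idx, ]   # rows used to fit the model
toy_valid <- toy[-train_idx, ]  # held-out rows used to check the fit

print(c(train = nrow(toy_train), valid = nrow(toy_valid)))
```

The model never sees the validation rows during training, so its error on them gives a more honest estimate of how it will behave on new data.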
y = 28
Let's take out Walc and Dalc (workday alcohol consumption) from our independent variables.
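One way to build the predictor list is to drop the two alcohol columns by name. The sketch below uses a short, illustrative subset of column names rather than the full data set, and the variable names cols, response and predictors are hypothetical. With the real frame, colnames(st_mat) would supply the names, and h2o.glm accepts either names or column indices for x and y.

```r
# Illustrative subset of column names; the real frame has more columns.
cols <- c("age", "goout", "Dalc", "Walc", "health")

# The response, referred to by name instead of by column index.
response <- "Walc"

# Keep every column except the response and Dalc as predictors.
predictors <- setdiff(cols, c("Walc", "Dalc"))

print(predictors)
```

Selecting by name rather than by position keeps the code correct even if the column order changes.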
OK, now let us run our linear regression model. For that we can use the h2o.glm function. GLM stands for generalized linear model.
H2O Generalized Linear Model (GLM)
students.glm <- h2o.glm(x = x, y = y, training_frame = train, validation_frame = valid, remove_collinear_columns = TRUE)
Since it is a small data set, the model ran almost instantly.
Now we can plot the standardized GLM coefficients using h2o.std_coef_plot.
From the above graph we can see which parameters are positive and which are negative. Let's print the model coefficients to see their actual magnitudes.
Let's check which parameters affect alcohol consumption positively.
We can use model$coefficients to access the coefficients of the variables in our linear regression.
coeff_vector = students.glm@model$coefficients
print(coeff_vector[coeff_vector > 0])
 Intercept        age   failures      goout     health   absences         G2
0.43908352 0.11540452 0.05622664 0.40241119 0.12427294 0.01856066 0.05650706
As we see above, besides the intercept, the variables age, failures, goout, health, absences and G2 (second period grade) all have positive coefficients.
Let's see whether any parameters affect alcohol consumption negatively.
print(coeff_vector[coeff_vector < 0])
       sex.F    studytime       famrel     freetime           G1
-0.611686028 -0.225279062 -0.228980650 -0.008235832 -0.074973142
Being female (sex.F), studytime, famrel (quality of family relationships), freetime and G1 (first period grade) all have negative coefficients, meaning they are associated with lower weekend alcohol consumption.
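To compare magnitudes in one place, the coefficients printed above can be ranked by absolute value. The sketch below copies the printed numbers into a plain named vector (called coefs here, a hypothetical name) so it runs without H2O. Keep in mind that raw coefficients depend on each variable's units, so this ranking is only a rough guide.

```r
# Coefficient values copied from the printed output above.
coefs <- c(Intercept = 0.43908352, age = 0.11540452, failures = 0.05622664,
           goout = 0.40241119, health = 0.12427294, absences = 0.01856066,
           G2 = 0.05650706, sex.F = -0.611686028, studytime = -0.225279062,
           famrel = -0.228980650, freetime = -0.008235832, G1 = -0.074973142)

# Drop the intercept; it is not a predictor.
slopes <- coefs[names(coefs) != "Intercept"]

# Order predictors from largest to smallest absolute coefficient.
ranked <- slopes[order(abs(slopes), decreasing = TRUE)]

print(round(ranked, 3))
```

In this ranking sex.F and goout have the largest absolute coefficients, while freetime and absences are close to zero.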
If we look at model$model_summary, we can see which model type H2O ran by default.
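| family | link | regularization | number_of_predictors_total | number_of_active_predictors | number_of_iterations | training_frame |
| gaussian | identity | Elastic Net (alpha = 0.5, lambda = 0.1043 ) | 57 | 11 | 1 | RTMP_sid_85ff_8 |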
The above table shows that the regression family is "gaussian". It also shows the regularization type, which is Elastic Net.
H2O's GLM fits linear regression by maximizing the log-likelihood. We can add regularization to reduce overfitting. With regularization, H2O maximizes the difference between the GLM log-likelihood and a regularization penalty.
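Written out, the penalized objective looks roughly like the following. This is my reading of the elastic net form described in H2O's documentation, so treat the exact scaling as an assumption. Here \ell is the GLM log-likelihood, \lambda \ge 0 sets the overall penalty strength, and \alpha \in [0, 1] mixes the L1 and L2 terms.

```latex
\[
\max_{\beta,\,\beta_0} \;\; \ell(\beta, \beta_0)
  \;-\; \lambda \left( \alpha \,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2} \,\lVert \beta \rVert_2^2 \right)
\]
```

Setting \alpha = 1 leaves only the L1 term, \alpha = 0 leaves only the L2 term, and the model summary above reports \alpha = 0.5 with \lambda = 0.1043.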
There are three types of regularization techniques.
- Lasso Regression (L1)
- Ridge Regression (L2)
- Elastic Net (weighted sum of L1 and L2)
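The three penalties in the list above can be computed with one small function. This is a sketch that assumes the common elastic net form lambda * (alpha * L1 + (1 - alpha) / 2 * L2); the function name elastic_net_penalty and the beta values are made up for illustration.

```r
# Elastic net penalty for a coefficient vector beta.
# alpha = 1 gives the lasso (L1) penalty, alpha = 0 gives the ridge (L2) penalty.
elastic_net_penalty <- function(beta, lambda, alpha) {
  l1 <- sum(abs(beta))  # L1 norm: sum of absolute values
  l2 <- sum(beta^2)     # squared L2 norm: sum of squares
  lambda * (alpha * l1 + (1 - alpha) / 2 * l2)
}

# Hypothetical coefficients, for illustration only.
beta <- c(0.5, -1.0, 0.0, 2.0)

lasso <- elastic_net_penalty(beta, lambda = 0.1, alpha = 1)
ridge <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0)
mixed <- elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)

print(c(lasso = lasso, ridge = ridge, elastic_net = mixed))
```

The elastic net value always lies between the pure lasso and pure ridge values, because it is a weighted mix of the two terms.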
Lasso regression uses the L1 penalty. Lasso is also sometimes called a variable selection technique, because it can shrink some coefficients exactly to zero. Lasso depends on the tuning parameter lambda: as lambda becomes large, more and more coefficients become zero.
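The shrink-to-zero behavior can be seen with the soft-thresholding rule, which is the closed-form lasso solution in the special case of standardized, uncorrelated predictors. This sketch only illustrates that special case; it is not how H2O solves the problem, and the starting coefficients and lambda values are hypothetical.

```r
# Soft-thresholding: shrink each coefficient toward zero by lambda,
# and set it to exactly zero once its absolute value falls below lambda.
soft_threshold <- function(b, lambda) {
  sign(b) * pmax(abs(b) - lambda, 0)
}

# Hypothetical unpenalized coefficients, for illustration only.
ols <- c(goout = 0.40, age = 0.12, absences = 0.02)

# Increasing penalty strengths to compare.
lambdas <- c(0, 0.05, 0.2, 0.5)

# One column per lambda, one row per coefficient.
path <- sapply(lambdas, function(l) soft_threshold(ols, l))
colnames(path) <- paste0("lambda=", lambdas)

print(path)
```

The smallest coefficient is dropped first, and by the largest lambda every coefficient is zero, which is why lasso doubles as a variable selection method.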