Linear regression with regularization

by Xiaoqi Zheng, 03/18/2020

In [88]:
set.seed(123)    # seed for reproducibility
library(glmnet)  # for ridge regression
In [89]:
## load data
data("mtcars")

X <- as.matrix(mtcars[, which(names(mtcars) != "mpg")])
# y <- as.matrix(scale(mtcars[,"mpg"], center = TRUE, scale = TRUE)) # optionally scale y; not used below
y <- mtcars[, "mpg"]
In [90]:
## select training and test samples
n <- nrow(X)
train_rows <- sample(1:n, floor(0.66 * n)) ## ~66% training, ~34% test

X.train <- X[train_rows, ]
X.test <- X[-train_rows, ]

y.train <- y[train_rows]
y.test <- y[-train_rows]

1. Lasso

Steps

  1. Select the best lambda by cross-validation;
  2. Train a model with the selected lambda (the penalized objective is sketched just below);
  3. Predict on the test set.
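
For reference (a sketch of the standard elastic-net objective, not something computed in this notebook), for a Gaussian response glmnet minimizes

    (1/(2n)) * sum_i (y_i - b0 - x_i' * b)^2  +  lambda * [ (1 - alpha)/2 * ||b||_2^2  +  alpha * ||b||_1 ]

so alpha = 1 gives the lasso (L1 penalty, which can set coefficients exactly to zero) and alpha = 0 gives ridge (L2 penalty); a larger lambda means stronger shrinkage.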
In [91]:
# See how increasing lambda shrinks the coefficients --------------------------
# Setting alpha = 1 implements lasso regression
lasso.model <- glmnet(X.train, y.train, alpha = 1, standardize = TRUE)
plot(lasso.model, xvar = "lambda")
legend("bottomright", lwd = 1, col = 1:ncol(X.train), legend = colnames(X), cex = 1)
In [92]:
# Perform 10-fold cross-validation to select lambda
lasso_cv <- cv.glmnet(X.train, y.train, alpha = 1, type.measure = "mse",
                      standardize = TRUE, nfolds = 10, family = "gaussian")
# Plot cross-validation results
plot(lasso_cv)
Warning message:
“Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold”
In [101]:
# Select best cross-validated lambda
lambda_cv <- lasso_cv$lambda.min
# Fit final model, get its sum of squared residuals and multiple R-squared
## Note: some people skip the cross-validation step above and simply fix lambda at a value, e.g. lambda = 0.001
lasso.model <- glmnet(X.train, y.train, lambda = lambda_cv, alpha = 1, standardize = TRUE)
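
As an optional check (an addition, not part of the original notebook), coef() on the fitted glmnet object shows which coefficients the lasso shrank exactly to zero at the selected lambda:

coef(lasso.model)  # sparse coefficient matrix; "." entries are predictors dropped by the lasso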
In [94]:
# Make prediction
y.test_hat <- predict(lasso.model, X.test)

Rsquare <- cor(y.test,y.test_hat)^2
MSE <- mean((y.test - y.test_hat)^2)

plot(y.test, y.test_hat, col = "#00000050", pch = 19, main = "Test result for lasso")
abline(lm(y.test_hat ~ y.test), lty = 2, lwd = 2, col = "gray")
text(x = 12, y = 20, labels = paste0("MSE = ", round(MSE,4), "\n", "R-squared = ", round(Rsquare,4)))

2. Ridge

In [95]:
# See how increasing lambda shrinks the coefficients --------------------------
# Setting alpha = 0 implements ridge regression
ridge.model <- glmnet(X.train, y.train, alpha = 0, standardize = TRUE)
plot(ridge.model, xvar = "lambda",main = "Ridge")
legend("bottomright", lwd = 1, col = 1:ncol(X.train), legend = colnames(X), cex = 1)
In [96]:
# Perform 10-fold cross-validation to select lambda
ridge_cv <- cv.glmnet(X.train, y.train, alpha = 0, type.measure = "mse",
                      standardize = TRUE, nfolds = 10, family = "gaussian")
# Plot cross-validation results
plot(ridge_cv,main = "Ridge")
Warning message:
“Option grouped=FALSE enforced in cv.glmnet, since < 3 observations per fold”
In [97]:
# Select best cross-validated lambda
lambda_cv <- ridge_cv$lambda.min
# Fit final model, get its sum of squared residuals and multiple R-squared
## Note: some people skip the cross-validation step above and simply fix lambda at a value, e.g. lambda = 0.001
ridge.model <- glmnet(X.train, y.train, lambda = lambda_cv, alpha = 0, standardize = TRUE)
In [45]:
# Make prediction
y.test_hat <- predict(ridge.model, X.test)

Rsquare <- cor(y.test,y.test_hat)^2
MSE <- mean((y.test - y.test_hat)^2)

plot(y.test, y.test_hat, col = "#00000050", pch = 19, main = "Test result for ridge")
abline(lm(y.test_hat ~ y.test), lty = 2, lwd = 2, col = "gray")
text(x = 12, y = 20, labels = paste0("MSE = ", round(MSE,4), "\n", "R-squared = ", round(Rsquare,4)))

3. Elastic Net

There are two parameters to tune: λ and α. The glmnet package lets us tune λ via cross-validation for a fixed α, but it does not tune α itself, so we turn to caret for that job.
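
Before switching to caret, here is a minimal sketch (not from the original notebook) of tuning α by hand: run cv.glmnet over a small grid of alpha values and keep the one with the lowest cross-validated MSE. The names alpha_grid, cv_errors and best_alpha are just illustrative.

alpha_grid <- seq(0, 1, by = 0.1)                   # candidate alpha values
cv_errors <- sapply(alpha_grid, function(a) {
  cv_fit <- cv.glmnet(X.train, y.train, alpha = a,  # CV over lambda for this alpha
                      type.measure = "mse", nfolds = 10)
  min(cv_fit$cvm)                                   # best CV error for this alpha
})
best_alpha <- alpha_grid[which.min(cv_errors)]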

In [98]:
dat <- cbind(y.train, X.train)
head(dat)
A matrix: 6 × 11 of type dbl
                    y.train cyl  disp  hp drat    wt  qsec vs am gear carb
Maserati Bora          15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Cadillac Fleetwood     10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Honda Civic            30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Merc 450SLC            15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Datsun 710             22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 280               19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
In [99]:
library(caret)

# Set training control
train_control <- trainControl(method = "repeatedcv",
                              number = 5,
                              repeats = 5,
                              search = "random",
                              verboseIter = F)

# Train the model
elastic_net_model <- train(y.train ~ .,
                           data = dat,
                           method = "glmnet",
                           preProcess = c("center", "scale"),
                           tuneLength = 25,
                           trControl = train_control)
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”
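
To see which α and λ the random search settled on (an added inspection step, not in the original notebook), caret stores the winning combination in the bestTune element of the fitted object:

elastic_net_model$bestTune  # chosen (alpha, lambda) pair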
In [100]:
# Check results
y.test_hat <- predict(elastic_net_model, X.test)
Rsquare <- cor(y.test, y.test_hat)^2
MSE <- mean((y.test - y.test_hat)^2)

plot(y.test, y.test_hat, col = "#00000050", pch = 19, main = "Test result for ENet")
abline(lm(y.test_hat ~ y.test), lty = 2, lwd = 2, col = "gray")
text(x = 12, y = 20, labels = paste0("MSE = ", round(MSE,4), "\n", "R-squared = ", round(Rsquare,4)))