Boosting methods in R

by Xiaoqi Zheng, 07/24/2020

In [76]:
library(rsample)      # data splitting 
library(gbm)          # basic implementation
library(xgboost)      # a faster implementation of gbm
library(caret)        # an aggregator package for performing many machine learning models
In [77]:
## load data
library(tidyverse)
library(ISLR)

ml_data <- College
ml_data[1:5,1:5]
dim(ml_data)
A data.frame: 5 × 5
                             Private Apps Accept Enroll Top10perc
Abilene Christian University Yes     1660   1232    721        23
Adelphi University           Yes     2186   1924    512        16
Adrian College               Yes     1428   1097    336        22
Agnes Scott College          Yes      417    349    137        60
Alaska Pacific University    Yes      193    146     55        16

777 18
In [78]:
# Partition into training and test data
set.seed(42)
index <- createDataPartition(ml_data$Private, p = 0.7, list = FALSE)
train_data <- ml_data[index, ]
test_data  <- ml_data[-index, ]
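
Since createDataPartition() stratifies on the outcome, the Yes/No split of Private should be nearly identical in the training and test sets. A quick sanity check along these lines can confirm it:

# Sanity check: createDataPartition() stratifies on Private, so the
# class proportions should be close to the full data in both splits.
round(rbind(full  = prop.table(table(ml_data$Private)),
            train = prop.table(table(train_data$Private)),
            test  = prop.table(table(test_data$Private))), 3)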

1. Gradient Boosting Machines (GBM)

In [79]:
# Train model with 5-fold cross-validation repeated 3 times
model_gbm <- caret::train(Private ~ .,
                          data = train_data,
                          method = "gbm",
                          trControl = trainControl(method = "repeatedcv",
                                                   number = 5,
                                                   repeats = 3,
                                                   verboseIter = FALSE),
                          verbose = 0)
model_gbm
Stochastic Gradient Boosting 

545 samples
 17 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold, repeated 3 times) 
Summary of sample sizes: 437, 436, 436, 435, 436, 436, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa    
  1                   50      0.9369738  0.8366455
  1                  100      0.9388143  0.8424288
  1                  150      0.9443022  0.8584537
  2                   50      0.9430959  0.8521284
  2                  100      0.9449251  0.8583012
  2                  150      0.9442967  0.8579690
  3                   50      0.9381859  0.8399493
  3                  100      0.9412385  0.8490957
  3                  150      0.9485780  0.8686399

Tuning parameter 'shrinkage' was held constant at a value of 0.1

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth =
 3, shrinkage = 0.1 and n.minobsinnode = 10.
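caret keeps the resampled accuracy for every point of the tuning grid inside the fitted object, so the interplay of n.trees and interaction.depth is easy to inspect visually. A quick sketch using caret's built-in plot method for train objects:

# Resampled accuracy across the tuning grid (n.trees x interaction.depth)
plot(model_gbm)

# The same results are available as a data frame for direct inspection
head(model_gbm$results)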
In [80]:
## test 
caret::confusionMatrix(data = predict(model_gbm, test_data),
                       reference = test_data$Private)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No   56   6
       Yes   7 163
                                          
               Accuracy : 0.944           
                 95% CI : (0.9061, 0.9698)
    No Information Rate : 0.7284          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.8577          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8889          
            Specificity : 0.9645          
         Pos Pred Value : 0.9032          
         Neg Pred Value : 0.9588          
             Prevalence : 0.2716          
         Detection Rate : 0.2414          
   Detection Prevalence : 0.2672          
      Balanced Accuracy : 0.9267          
                                          
       'Positive' Class : No              
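
Beyond overall accuracy, it is often useful to see which predictors the boosted trees rely on. caret's varImp() works on the fitted gbm model; a minimal sketch:

# Relative influence of the predictors in the boosted model
imp_gbm <- varImp(model_gbm)
imp_gbm
plot(imp_gbm, top = 10)   # show the 10 most influential predictors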
                                          

2. eXtreme Gradient Boosting (XGBoost)

In [105]:
trctrl <- trainControl(method = "cv", number = 5)

# Custom tuning grid: only nrounds is varied here; the remaining
# hyperparameters are each held fixed at a single value.
tune_grid <- expand.grid(nrounds = 140:150,
                         max_depth = 5,
                         eta = 0.05,
                         gamma = 0.01,
                         colsample_bytree = 0.75,
                         min_child_weight = 0,
                         subsample = 0.5)

xgb.model <- train(Private ~ .,
                   data = train_data,
                   method = "xgbTree",
                   trControl = trctrl,
                   tuneGrid = tune_grid,   # tuneLength is ignored when a tuneGrid is supplied
                   tuneLength = 10)
In [106]:
# have a look at the model 
xgb.model
eXtreme Gradient Boosting 

545 samples
 17 predictor
  2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 437, 435, 436, 436, 436 
Resampling results across tuning parameters:

  nrounds  Accuracy   Kappa    
  140      0.9413156  0.8485455
  141      0.9413156  0.8485455
  142      0.9413156  0.8485455
  143      0.9413156  0.8485455
  144      0.9413156  0.8485455
  145      0.9413156  0.8485455
  146      0.9413156  0.8485455
  147      0.9394807  0.8443698
  148      0.9413156  0.8485455
  149      0.9449686  0.8577046
  150      0.9449686  0.8577046

Tuning parameter 'max_depth' was held constant at a value of 5
Tuning parameter 'eta' was held constant at a value of 0.05
Tuning parameter 'gamma' was held constant at a value of 0.01
Tuning parameter 'colsample_bytree' was held constant at a value of 0.75
Tuning parameter 'min_child_weight' was held constant at a value of 0
Tuning parameter 'subsample' was held constant at a value of 0.5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 149, max_depth = 5, eta
 = 0.05, gamma = 0.01, colsample_bytree = 0.75, min_child_weight = 0
 and subsample = 0.5.
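The resampled accuracy is almost flat over nrounds 140-150, which suggests the number of boosting rounds is not the limiting factor here. If further tuning were desired, the same expand.grid() call can vary several hyperparameters at once; the values below are arbitrary, and since every row of the grid is evaluated with the same 5-fold CV, the search takes correspondingly longer:

# A broader (hypothetical) grid that also searches tree depth and learning rate
wide_grid <- expand.grid(nrounds = c(100, 150, 200),
                         max_depth = c(3, 5, 7),
                         eta = c(0.05, 0.1),
                         gamma = 0.01,
                         colsample_bytree = 0.75,
                         min_child_weight = 0,
                         subsample = 0.5)

xgb.model.wide <- train(Private ~ .,
                        data = train_data,
                        method = "xgbTree",
                        trControl = trctrl,
                        tuneGrid = wide_grid)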
In [107]:
# Testing
test_predict <- predict(xgb.model, test_data)
In [108]:
caret::confusionMatrix(data = test_predict,
                       reference = test_data$Private)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No   55   9
       Yes   8 160
                                          
               Accuracy : 0.9267          
                 95% CI : (0.8853, 0.9567)
    No Information Rate : 0.7284          
    P-Value [Acc > NIR] : 1.967e-14       
                                          
                  Kappa : 0.8157          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.8730          
            Specificity : 0.9467          
         Pos Pred Value : 0.8594          
         Neg Pred Value : 0.9524          
             Prevalence : 0.2716          
         Detection Rate : 0.2371          
   Detection Prevalence : 0.2759          
      Balanced Accuracy : 0.9099          
                                          
       'Positive' Class : No              
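
To put the two results next to each other, the held-out accuracy of both caret models can be computed in one pass; a small sketch, using the same test split as above:

# Test-set accuracy of the two boosted models
sapply(list(GBM = model_gbm, XGBoost = xgb.model),
       function(m) mean(predict(m, test_data) == test_data$Private))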
                                          