Decision Tree and Random Forest

By Xiaoqi Zheng, 04/14/2020

In [1]:
# Load the Car Evaluation dataset from the UCI repository
# Note: car.data has no header row, so header = TRUE consumes the first
# record as column names, leaving 1727 (rather than 1728) observations
data1 <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", header = TRUE)

head(data1)
str(data1)
summary(data1)
A data.frame: 6 × 7
  vhigh vhigh.1 X2 X2.1 small low  unacc
1 vhigh vhigh   2  2    small med  unacc
2 vhigh vhigh   2  2    small high unacc
3 vhigh vhigh   2  2    med   low  unacc
4 vhigh vhigh   2  2    med   med  unacc
5 vhigh vhigh   2  2    med   high unacc
6 vhigh vhigh   2  2    big   low  unacc
'data.frame':	1727 obs. of  7 variables:
 $ vhigh  : Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ vhigh.1: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ X2     : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
 $ X2.1   : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 2 2 ...
 $ small  : Factor w/ 3 levels "big","med","small": 3 3 2 2 2 1 1 1 3 3 ...
 $ low    : Factor w/ 3 levels "high","low","med": 3 1 2 3 1 2 3 1 2 3 ...
 $ unacc  : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
   vhigh      vhigh.1        X2        X2.1       small       low     
 high :432   high :432   2    :431   2   :575   big  :576   high:576  
 low  :432   low  :432   3    :432   4   :576   med  :576   low :575  
 med  :432   med  :432   4    :432   more:576   small:575   med :576  
 vhigh:431   vhigh:431   5more:432                                    
   unacc     
 acc  : 384  
 good :  69  
 unacc:1209  
 vgood:  65  
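The column names above (vhigh, X2, unacc, ...) are a side effect of reading a headerless file with header = TRUE: the first data record was consumed as the header. The rest of the notebook keeps these auto-generated names so the outputs match, but the file can also be re-read with the documented UCI attribute names; a minimal sketch (data_named is an illustrative name, not used below):

# Optional: re-read with the documented UCI attribute names
data_named <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
                       header = FALSE,
                       col.names = c("buying", "maint", "doors", "persons",
                                     "lug_boot", "safety", "class"))
str(data_named)  # 1728 obs. of 7 variables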
In [2]:
# Split into train and test sets
# Training set : test set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), floor(0.7 * nrow(data1)), replace = FALSE)
TrainSet <- data1[train, ]
TestSet  <- data1[-train, ]
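A quick optional sanity check, not part of the original run: the random split should roughly preserve the class distribution in both subsets.

# Class proportions in each split (should be similar)
prop.table(table(TrainSet$unacc))
prop.table(table(TestSet$unacc))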

1. Decision Tree

In [5]:
#install.packages("rpart")
#install.packages("caret")
#install.packages("e1071")
 
library(rpart)
library(caret)
library(e1071)
 
# Fit a CART decision tree via caret (rpart backend)
model_dt = train(unacc ~ ., data = TrainSet, method = "rpart")

# Training-set predictions and confusion table
# (note: predict.train takes newdata =, not data =)
out_train = predict(model_dt, newdata = TrainSet)
table(out_train, TrainSet$unacc)

# Training accuracy
mean(out_train == TrainSet$unacc)
         
out_train acc good unacc vgood
    acc   255   50   132    40
    good    0    0     0     0
    unacc  22    0   709     0
    vgood   0    0     0     0
0.798013245033113
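caret tunes rpart's complexity parameter cp by resampling; printing the fitted object shows the candidate grid and the value it selected. A minimal sketch:

# Inspect the resampling results and the selected cp
print(model_dt)
model_dt$bestTune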
In [6]:
# Running on Test Set
out_test = predict(model_dt, newdata = TestSet)
table(out_test, TestSet$unacc)
 
mean(out_test == TestSet$unacc)
        
out_test acc good unacc vgood
   acc    93   19    58    25
   good    0    0     0     0
   unacc  14    0   310     0
   vgood   0    0     0     0
0.776493256262042
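The single tree never predicts the minority classes good and vgood, which raw accuracy hides. caret's confusionMatrix() (already loaded above) reports per-class sensitivity and specificity; a minimal sketch:

# Per-class metrics for the decision tree on the test set
confusionMatrix(out_test, TestSet$unacc)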

2. Random Forest

In [7]:
#install.packages("randomForest")
library(randomForest)
randomForest 4.6-14

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:ggplot2’:

    margin


In [8]:
# Create a Random Forest model with default parameters
model1 <- randomForest(unacc ~ ., data = TrainSet, importance = TRUE)
model1
Call:
 randomForest(formula = unacc ~ ., data = TrainSet, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 3.81%
Confusion matrix:
      acc good unacc vgood class.error
acc   268    6     2     1  0.03249097
good    9   39     0     2  0.22000000
unacc  15    1   825     0  0.01902497
vgood   7    3     0    30  0.25000000
In [20]:
# Fine-tune the Random Forest: fewer trees, and mtry = 6 considers all
# six predictors at each split (equivalent to bagged trees)
model2 <- randomForest(unacc ~ ., data = TrainSet, ntree = 100, mtry = 6, importance = TRUE)
model2
Call:
 randomForest(formula = unacc ~ ., data = TrainSet, ntree = 100,      mtry = 6, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 6

        OOB estimate of  error rate: 2.32%
Confusion matrix:
      acc good unacc vgood class.error
acc   268    4     5     0  0.03249097
good    2   47     0     1  0.06000000
unacc  11    3   827     0  0.01664685
vgood   1    1     0    38  0.05000000
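Rather than fixing mtry by hand, one can scan the candidate values and compare test accuracy. A minimal sketch (acc and m are illustrative names, not from the original notebook):

# Try mtry = 2..6 and record test-set accuracy for each
acc <- numeric(5)
for (m in 2:6) {
  rf <- randomForest(unacc ~ ., data = TrainSet, ntree = 100, mtry = m)
  acc[m - 1] <- mean(predict(rf, TestSet) == TestSet$unacc)
}
names(acc) <- 2:6
acc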
In [21]:
# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")
# Checking classification accuracy
table(predTrain, TrainSet$unacc)  

mean(predTrain == TrainSet$unacc)
         
predTrain acc good unacc vgood
    acc   277    0     0     0
    good    0   50     0     0
    unacc   0    0   841     0
    vgood   0    0     0    40
1
In [12]:
# Predicting on Test set
predValid <- predict(model2, TestSet, type = "class")
# Checking classification accuracy
table(predValid, TestSet$unacc)

mean(predValid == TestSet$unacc)
         
predValid acc good unacc vgood
    acc   106    0     3     3
    good    1   19     0     0
    unacc   0    0   365     0
    vgood   0    0     0    22
0.986512524084778

Note: Random Forest is a bagging ensemble method built on bootstrap sampling. In both theory and practice, roughly 1/3 of the observations do not appear in a given bootstrap sample and therefore take no part in building that tree; these are called out-of-bag (OOB) data. Because Random Forest evaluates each tree on its OOB data during training, no separate cross-validation is needed, and the OOB error estimate is close to the test error.
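To see this directly, the fitted randomForest object stores the running OOB error in its err.rate matrix (first column "OOB", then one column per class); a minimal sketch comparing the final OOB estimate with the test error observed above:

# Final OOB error vs. observed test error
oob_final <- model2$err.rate[model2$ntree, "OOB"]
test_err  <- mean(predValid != TestSet$unacc)
c(oob = oob_final, test = test_err)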

In [13]:
# To check important variables
importance(model2)        
varImpPlot(model2)
A matrix: 6 × 6 of type dbl
              acc     good     unacc    vgood MeanDecreaseAccuracy MeanDecreaseGini
vhigh   147.77360 79.33809 101.97611 68.85989            187.13598         84.97387
vhigh.1 137.50726 82.20760 103.11177 42.37166            184.22476         82.84334
X2       25.48034 17.39170  39.79920 12.03426             48.03208         32.92085
X2.1    146.29719 57.93151 181.19561 51.67678            217.36778        119.22094
small    86.02226 54.51063  70.53999 50.83151            128.01673         73.62060
low     165.46418 86.93651 185.93718 96.77266            247.98836        161.52586
In [14]:
# Plot OOB and per-class error rates against the number of trees
plot(model2)
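plot(model2) draws the OOB and per-class error-rate curves but adds no legend mapping lines to classes; one can attach one from the err.rate column names. A small optional sketch, not in the original output:

# Error-rate curves with a legend for the OOB and per-class lines
plot(model2)
legend("topright", legend = colnames(model2$err.rate),
       col = 1:ncol(model2$err.rate), lty = 1:ncol(model2$err.rate))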