Basic classification algorithms

Xiaoqi Zheng, 03/30/2020

In [31]:
## Load data
data(iris)
set.seed(123)
## separate into training and test samples
idxs <- sample(1:nrow(iris), as.integer(0.7 * nrow(iris)))
trainIris <- iris[idxs,]
testIris <- iris[-idxs,]
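A simple random sample like the one above is not guaranteed to keep the three species in balance between the two sets, so it is worth a quick check. A minimal sketch using only base R (it reproduces the split so it runs on its own):

```r
## Reproduce the split above and inspect the class balance
data(iris)
set.seed(123)
idxs <- sample(1:nrow(iris), as.integer(0.7 * nrow(iris)))
trainIris <- iris[idxs, ]
testIris  <- iris[-idxs, ]

## Counts per species in each set
print(table(trainIris$Species))
print(table(testIris$Species))
```

If any class were badly under-represented in the training set, a stratified split (e.g. sampling within each species) would be the usual fix.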

1. Naive Bayes

In [32]:
library(e1071)
In [33]:
nb_default <- naiveBayes(Species~., data=trainIris)
default_pred <- predict(nb_default, testIris, type="class")

table(default_pred, testIris$Species, dnn = c("Prediction", "Actual"))
            Actual
Prediction   setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         18         0
  virginica       0          0        13
In [34]:
nb_laplace1 <- naiveBayes(Species~., data=trainIris, laplace=1)
laplace1_pred <- predict(nb_laplace1, testIris, type="class")

table(laplace1_pred, testIris$Species, dnn = c("Prediction", "Actual"))
            Actual
Prediction   setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         18         0
  virginica       0          0        13
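Both Naive Bayes fits classify every test flower correctly here, with and without Laplace smoothing. As a sketch, the overall accuracy can be read off any of these confusion matrices in base R (the helper name `accuracy` is mine; the matrix below rebuilds the table printed above):

```r
## Accuracy = correctly classified / total, i.e. the trace of the
## confusion matrix divided by its sum
accuracy <- function(confusion) sum(diag(confusion)) / sum(confusion)

## The Naive Bayes confusion matrix above, rebuilt by hand
tab <- matrix(c(14,  0,  0,
                 0, 18,  0,
                 0,  0, 13), nrow = 3, byrow = TRUE)
accuracy(tab)  # 1: all 45 test samples correct
```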

2. Linear Discriminant Analysis (LDA)

In [35]:
library(MASS)
In [36]:
lda_model <- lda(Species ~ ., # training model
          data = trainIris,
          prior = c(1,1,1)/3) # equal prior for each of the three classes
plda <- predict(lda_model, newdata = testIris) # predictions

## The resulting confusion matrix
table(testIris[, 'Species'], plda$class, dnn = c("Actual", "Prediction"))
            Prediction
Actual       setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         17         1
  virginica       0          0        13
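Besides class labels, LDA yields a projection of the data onto discriminant axes, which is handy for visualization: with 3 classes and 4 predictors there are at most 2 such directions. A self-contained sketch, refitting on the full iris data and assuming only the MASS package:

```r
library(MASS)

## Fit LDA on all of iris with equal priors
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)

## Discriminant scores: a 150 x 2 matrix (LD1, LD2)
proj <- predict(fit)$x

## Proportion of between-class variance captured per discriminant;
## LD1 carries almost all of the separation
fit$svd^2 / sum(fit$svd^2)

## Plot the projected data colored by species
plot(proj, col = as.integer(iris$Species), pch = 19,
     main = "iris projected onto the LDA discriminants")
```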

3. K-Nearest Neighbors (KNN)

In [37]:
library(caret) ## alternatives: the 'knn' function in package 'class', or the 'train' function in package 'caret' (used here)
In [38]:
## train a model over a grid of k values; this may take a while ...
knn_fit <- train(Species ~ ., data = trainIris, method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10)
In [39]:
knn_fit
k-Nearest Neighbors 

105 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

Pre-processing: centered (4), scaled (4) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 105, 105, 105, 105, 105, 105, ... 
Resampling results across tuning parameters:

  k   Accuracy   Kappa    
   5  0.9244023  0.8854666
   7  0.9207687  0.8798835
   9  0.9125272  0.8674484
  11  0.9130132  0.8679116
  13  0.9101934  0.8638234
  15  0.9151561  0.8716461
  17  0.9111116  0.8655509
  19  0.9065056  0.8585699
  21  0.9162736  0.8728821
  23  0.8979103  0.8460394

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
In [40]:
plot(knn_fit)
In [41]:
test_pred <- predict(knn_fit, newdata = testIris)

## The resulting confusion matrix
table(testIris[, 'Species'], test_pred, dnn = c("Actual", "Prediction"))
            Prediction
Actual       setosa versicolor virginica
  setosa         14          0         0
  versicolor      0         17         1
  virginica       0          0        13
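As noted above, the same classifier is also available directly as the 'knn' function in package 'class', which predicts in one call without caret's resampled tuning. A minimal sketch using k = 5 (the value caret selected) and hand-rolled centering/scaling to mirror preProcess = c("center", "scale"); it reproduces the split so it runs on its own:

```r
library(class)

data(iris)
set.seed(123)
idxs <- sample(1:nrow(iris), as.integer(0.7 * nrow(iris)))

## knn expects numeric feature matrices; scale the test set with the
## training mean/sd so no test information leaks into preprocessing
trainX <- scale(iris[idxs, 1:4])
testX  <- scale(iris[-idxs, 1:4],
                center = attr(trainX, "scaled:center"),
                scale  = attr(trainX, "scaled:scale"))

pred <- knn(train = trainX, test = testX, cl = iris$Species[idxs], k = 5)
table(iris$Species[-idxs], pred)
```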