Support Vector Machines (SVMs) can also be applied to multiclass classification tasks using techniques such as one-vs-one or one-vs-all. The one-vs-one strategy trains a binary classifier for every pair of classes (k(k-1)/2 classifiers for k classes), each distinguishing between its two classes. The one-vs-all strategy trains a single classifier per class, distinguishing that class from all other classes.
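A minimal sketch of the bookkeeping behind the two strategies (base R only; the class count of five matches the education levels used below):

```r
# One-vs-one trains one binary SVM per pair of classes;
# one-vs-all trains one binary SVM per class.
k <- 5            # number of classes (e.g., five education levels)
combn(k, 2)       # the class pairs, one binary classifier each
choose(k, 2)      # 10 classifiers for one-vs-one
k                 # 5 classifiers for one-vs-all
```

Note that e1071 (the SVM backend used by mlr3 below) handles the multiclass decomposition internally, so none of this bookkeeping is needed in practice.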
We use a version of the bfi dataset from class to predict the level of education from the Big-5 personality traits. For the data, a subset of observations with balanced educational levels was chosen from the original dataset, because classifiers often struggle with imbalanced classes (e.g., the majority of education values being 3 in the original data).
For simplicity, we treat education as a categorical
variable here, although it is actually an ordinal variable (i.e., 1 <
2 < 3 < 4 < 5).
Type ?psych::bfi into your console for more information on the
dataset. Note that the Big-5 traits agree,
conscientious, extra, neuro, and
open were created by averaging each participant’s responses
to the five survey items per trait (e.g.,
A1-A5).
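A hedged sketch of how such a trait score could be computed from the raw items. The column names A1-A5 follow psych::bfi, but this is not necessarily the exact preprocessing used for the course dataset (e.g., reverse-keyed items are assumed to be recoded already):

```r
# Sketch: average the five agreeableness items into one trait score.
# Toy data frame standing in for the raw psych::bfi items.
bfi <- data.frame(A1 = c(2, 4), A2 = c(5, 3), A3 = c(4, 4),
                  A4 = c(3, 5), A5 = c(4, 2))
bfi$agree <- rowMeans(bfi[, paste0("A", 1:5)], na.rm = TRUE)
bfi$agree
```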
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dat <- read.csv('module2-bfi.csv', header = TRUE)
Create an mlr3 classification task called “tsk” with
education as target and agree and
conscientious as features.
dat$education <- factor(dat$education)
library(mlr3verse)
## Loading required package: mlr3
tsk <- as_task_classif(education ~ agree + conscientious, data = dat)
set.seed(42)
row_ids <- partition(tsk, ratio = 0.8)
row_ids
## $train
## [1] 2 3 4 5 6 8 9 10 12 14 15 16 17 18 20 21 22 23 24
## [20] 25 26 27 28 29 30 32 33 34 35 36 37 38 39 40 41 42 43 44
## [39] 45 47 48 49 50 51 52 54 55 56 57 58 61 63 65 66 67 68 69
## [58] 70 71 72 74 76 78 79 80 81 83 84 87 88 89 91 92 93 94 95
## [77] 96 97 99 100
##
## $test
## [1] 1 7 11 13 19 31 46 53 59 60 62 64 73 75 77 82 85 86 90 98
##
## $validation
## integer(0)
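For classification tasks, partition() stratifies the split by the target by default, so both subsets retain the (balanced) class frequencies. A self-contained base-R sketch of the same idea on toy data:

```r
# Sketch: with a stratified split, class frequencies carry over
# to both subsets (toy data, not the bfi sample).
set.seed(1)
y <- factor(rep(1:5, each = 20))               # 5 balanced classes
idx <- unlist(lapply(split(seq_along(y), y),
                     function(i) sample(i, 16)))  # 80% per class
prop.table(table(y[idx]))                      # each class exactly 20%
prop.table(table(y[-idx]))                     # same in the held-out 20%
```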
Train an SVM learner to predict education with agree and
conscientious as features.
mdl = lrn("classif.svm")
mdl$train(tsk, row_ids = row_ids$train)
summary(mdl$model)
##
## Call:
## svm.default(x = data, y = task$truth(), probability = (self$predict_type ==
## "prob"))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 80
##
## ( 15 19 16 15 15 )
##
##
## Number of Classes: 5
##
## Levels:
## 1 2 3 4 5
autoplot(mdl, task = tsk)
Now create a task with education as target and the full set of Big-5
traits as features.
tsk <- as_task_classif(education ~ agree + conscientious + extra + neuro + open, data = dat)
mdl$train(tsk, row_ids = row_ids$train)
summary(mdl$model)
##
## Call:
## svm.default(x = data, y = task$truth(), probability = (self$predict_type ==
## "prob"))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 80
##
## ( 15 19 16 15 15 )
##
##
## Number of Classes: 5
##
## Levels:
## 1 2 3 4 5
mes <- msrs("classif.ce")
# In-sample performance:
pred <- mdl$predict(tsk, row_ids = row_ids$train)
pred$confusion
## truth
## response 1 2 3 4 5
## 1 9 0 0 1 0
## 2 1 15 3 2 3
## 3 2 2 10 1 2
## 4 2 1 2 10 1
## 5 1 1 1 1 9
pred$score(mes)
## classif.ce
## 0.3375
# Out-of-sample performance:
pred <- mdl$predict(tsk, row_ids = row_ids$test)
pred$confusion
## truth
## response 1 2 3 4 5
## 1 1 0 0 0 0
## 2 3 0 1 1 0
## 3 0 0 1 3 4
## 4 1 0 1 1 0
## 5 0 1 1 0 1
pred$score(mes)
## classif.ce
## 0.8
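The classification error is simply the off-diagonal share of the confusion matrix. Recomputing it by hand from the test-set confusion matrix above (rows = response, columns = truth):

```r
# The test-set confusion matrix from above, entered manually.
cm <- matrix(c(1, 0, 0, 0, 0,
               3, 0, 1, 1, 0,
               0, 0, 1, 3, 4,
               1, 0, 1, 1, 0,
               0, 1, 1, 0, 1), nrow = 5, byrow = TRUE)
1 - sum(diag(cm)) / sum(cm)   # 0.8, matching classif.ce above
```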
The in-sample training classification error is typically (much) smaller than the out-of-sample testing classification error due to overfitting the training data. Cross-validation (CV) helps to address this issue by partitioning the data into multiple subsets, allowing the model to be trained and evaluated on different combinations of training and validation sets, providing a more robust estimate of its performance on unseen data.
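To make the mechanics of CV concrete, here is a minimal base-R sketch of a k-fold loop on toy data with a deliberately trivial "classifier" (predict the majority class of the training folds); mlr3's resample() does all of this for real learners:

```r
# Sketch of 10-fold CV by hand (toy data, majority-class "classifier").
set.seed(42)
n <- 100; k <- 10
x <- rnorm(n)
y <- factor(rbinom(n, 1, plogis(x)))
folds <- sample(rep(1:k, length.out = n))   # assign each row to a fold
err <- sapply(1:k, function(f) {
  train <- folds != f                       # train on k-1 folds ...
  pred <- names(which.max(table(y[train]))) # ... fit the "model" ...
  mean(y[!train] != pred)                   # ... evaluate on the held-out fold
})
mean(err)                                   # CV estimate of the error
```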
# 10-fold CV:
set.seed(42)
tsk_cv <- as_task_classif(education ~ agree + conscientious + extra + neuro + open, data = dat[row_ids$train,])
cv <- rsmp("cv", folds = 10)
mdl_cv <- resample(learner = mdl, task = tsk_cv, resampling = cv)
## INFO [15:48:23.739] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 1/10)
## INFO [15:48:23.874] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 2/10)
## INFO [15:48:23.950] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 3/10)
## INFO [15:48:23.994] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 4/10)
## INFO [15:48:24.030] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 5/10)
## INFO [15:48:24.052] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 6/10)
## INFO [15:48:24.076] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 7/10)
## INFO [15:48:24.097] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 8/10)
## INFO [15:48:24.122] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 9/10)
## INFO [15:48:24.147] [mlr3] Applying learner 'classif.svm' on task 'dat[row_ids$train, ]' (iter 10/10)
mdl_cv$aggregate(mes)
## classif.ce
## 0.7875
The classification error derived from cross-validation should be much closer to the actual out-of-sample classification error (e.g., as observed in task 7).
Use 10-fold cross-validation to tune the SVM cost parameter (cost)
from the set (1, 10, 50, 100) based on the full dataset.
Also investigate the final (best) SVM by printing the summary of the
model. (Hint: Set the seed to ensure reproducibility of your
results.)
set.seed(42)
# Define set of cost parameter values to be tested
C_cv <- c(1, 10, 50, 100)
# Set up the conditions for the hyperparameter tuning
mdl_cv = auto_tuner(
learner = lrn("classif.svm", type = 'C-classification', cost = to_tune(levels = C_cv)),
resampling = rsmp("cv", folds = 10),
measure = msr("classif.ce"),
tuner = tnr("grid_search"),
terminator = trm("none")
)
# Actually tune the hyperparameter (i.e., cost) and fit the final model
invisible({capture.output({ #remove console output from html document
mdl_cv$train(tsk)
})})
# Print the output of the tuning
mdl_cv$archive %>%
as.data.table() %>%
select(cost, classif.ce) %>%
arrange(as.numeric(cost))
mdl_cv$tuning_result
# Final model:
summary(mdl_cv$learner$model)
##
## Call:
## svm.default(x = data, y = task$truth(), type = "C-classification",
## cost = 50, probability = (self$predict_type == "prob"))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 50
##
## Number of Support Vectors: 97
##
## ( 20 20 20 17 20 )
##
##
## Number of Classes: 5
##
## Levels:
## 1 2 3 4 5
Note that it is not possible (in mlr3; and quite complex
in general) to plot classifiers using more than two features. Therefore,
we cannot plot the classification surface of the final (best) SVM model
here.
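With exactly two features, however, a decision surface can be drawn by predicting on a grid. A hedged sketch using e1071::svm directly on toy data (this is roughly what autoplot() did above, not the mlr3 implementation itself):

```r
# Sketch: decision surface of a 2-feature SVM via grid prediction.
library(e1071)
set.seed(42)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(60, sd = 0.5) > 0, "a", "b"))
fit <- svm(y ~ x1 + x2, data = d)
grid <- expand.grid(x1 = seq(-3, 3, length.out = 100),
                    x2 = seq(-3, 3, length.out = 100))
grid$pred <- predict(fit, grid)           # class label at every grid point
plot(grid$x1, grid$x2, col = grid$pred, pch = 15, cex = 0.3)
points(d$x1, d$x2, col = d$y, pch = 19)   # overlay the training points
```

With more than two features, one would have to fix all remaining features at some value to get a 2D slice, which is why no such plot is shown for the full Big-5 model.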