Support Vector Machines (SVMs) can also be applied to multiclass classification tasks through techniques such as one-vs-one or one-vs-all (also called one-vs-rest). In the one-vs-one strategy, a separate binary classifier is trained for each pair of classes. In the one-vs-all strategy, a single classifier is trained per class to distinguish that class from all other classes.
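The two strategies differ in how many binary classifiers they train. A quick arithmetic sketch for K classes (here K = 5, matching the five education levels used below):

```r
# Number of binary classifiers each multiclass strategy trains for K classes
K <- 5                # e.g., the five education levels used below
choose(K, 2)          # one-vs-one: K * (K - 1) / 2 = 10 pairwise classifiers
K                     # one-vs-all: one classifier per class = 5
```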
We use a version of the bfi dataset from class to predict the level of education from the Big-5 personality traits. A subset of observations was chosen from the original dataset so that the educational levels are balanced, because classifiers often struggle with imbalanced classes (e.g., a majority of participants having education level 3).
For simplicity, we treat education as a categorical
variable here, although it is actually an ordinal variable (i.e., 1 <
2 < 3 < 4 < 5).
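For illustration (with a made-up vector of education codes), the difference between the two treatments in R is simply whether the factor records its level ordering:

```r
edu <- c(1, 2, 3, 4, 5)                # hypothetical education codes
f_cat <- factor(edu)                   # categorical: levels without ordering
f_ord <- factor(edu, ordered = TRUE)   # ordinal: 1 < 2 < 3 < 4 < 5
is.ordered(f_cat)                      # FALSE
is.ordered(f_ord)                      # TRUE
```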
Type ?psych::bfi into your console for more information on the
dataset. Note that the Big-5 traits agree, conscientious, extra, neuro, and open were created by averaging each participant’s responses to the five survey items per trait (e.g., A1-A5).
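As a sketch of how such trait scores can be computed, using a made-up data frame with the five agreeableness item columns A1-A5 (reverse-keyed items are ignored here for brevity):

```r
# Toy data: two participants' responses to the five agreeableness items
items <- data.frame(A1 = c(2, 4), A2 = c(5, 3), A3 = c(4, 4),
                    A4 = c(3, 5), A5 = c(4, 3))
# Average the five items per participant to obtain the trait score
items$agree <- rowMeans(items[, paste0("A", 1:5)], na.rm = TRUE)
items$agree   # 3.6 3.8
```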
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (http://conflicted.r-lib.org/) to force all conflicts to become errors
dat <- read.csv('module2-bfi.csv', header = TRUE)
Create an mlr3 classification task called “tsk” with education as target and agree and conscientious as features.
dat$education <- factor(dat$education)
library(mlr3verse)
Loading required package: mlr3
Registered S3 method overwritten by 'data.table':
method from
print.data.table
tsk <- as_task_classif(education ~ agree + conscientious, data = dat)
set.seed(42)
row_ids <- partition(tsk, ratio = 0.8)
row_ids
$train
[1] 1 2 4 5 6 7 8 9 10 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29 30 31
[27] 33 34 35 37 38 40 41 42 43 44 45 46 48 49 51 52 53 54 55 56 57 58 61 62 63 64
[53] 65 66 67 68 71 73 74 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[79] 97 98
$test
[1] 3 11 12 13 26 32 36 39 47 50 59 60 69 70 72 75 95 96 99 100
Fit an SVM classifier predicting education with agree and conscientious as features.
mdl <- lrn("classif.svm")
mdl$train(tsk, row_ids = row_ids$train)
summary(mdl$model)
Call:
svm.default(x = data, y = task$truth(), probability = (self$predict_type ==
"prob"))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 80
( 16 16 16 16 16 )
Number of Classes: 5
Levels:
1 2 3 4 5
autoplot(mdl, task = tsk)
Create a new task with education as target and all Big-5 traits as features.
tsk <- as_task_classif(education ~ agree + conscientious + extra + neuro + open, data = dat)
mdl$train(tsk, row_ids = row_ids$train)
summary(mdl$model)
Call:
svm.default(x = data, y = task$truth(), probability = (self$predict_type ==
"prob"))
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 80
( 16 16 16 16 16 )
Number of Classes: 5
Levels:
1 2 3 4 5
mes <- msrs("classif.ce")
# In-sample performance:
pred <- mdl$predict(tsk, row_ids = row_ids$train)
pred$confusion
truth
response 1 2 3 4 5
1 9 0 2 0 0
2 0 11 1 0 2
3 2 1 12 2 6
4 4 3 1 11 1
5 1 1 0 3 7
pred$score(mes)
classif.ce
0.375
# Out-of-sample performance:
pred <- mdl$predict(tsk, row_ids = row_ids$test)
pred$confusion
truth
response 1 2 3 4 5
1 2 0 1 1 0
2 1 1 1 0 0
3 0 2 1 1 3
4 0 0 1 1 1
5 1 1 0 1 0
pred$score(mes)
classif.ce
0.75
The in-sample training classification error is likely (much) smaller than the out-of-sample testing classification error due to overfitting the training data. Cross-validation (CV) helps to address this issue by partitioning the data into multiple subsets, allowing the model to be trained and evaluated on different combinations of training and validation sets, providing a more robust estimate of its performance on unseen data.
# 10-fold CV:
set.seed(42)
cv <- rsmp("cv", folds = 10)
mdl_cv <- resample(learner = mdl, task = tsk, resampling = cv)
INFO [13:43:50.770] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 1/10)
INFO [13:43:50.866] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 2/10)
INFO [13:43:50.914] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 3/10)
INFO [13:43:51.035] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 4/10)
INFO [13:43:51.068] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 5/10)
INFO [13:43:51.108] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 6/10)
INFO [13:43:51.145] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 7/10)
INFO [13:43:51.188] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 8/10)
INFO [13:43:51.228] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 9/10)
INFO [13:43:51.267] [mlr3] Applying learner 'classif.svm' on task 'dat' (iter 10/10)
mdl_cv$aggregate(mes)
classif.ce
0.8
The classification error derived from cross-validation is much closer to the out-of-sample classification error observed in task 7.
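As a side note, with five balanced classes the chance-level (random-guessing) error rate is exactly the value observed:

```r
1 - 1/5   # 0.8: error rate of guessing one of five balanced classes at random
```

This suggests the SVM with default settings does not appear to beat random guessing on held-out data here.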
Tune the SVM cost hyperparameter (cost) from the set (1, 10, 50, 100). (Hint: set the seed to ensure reproducibility of your results.) Note that it is not possible (in mlr3; and quite complex in general) to plot classifiers using more than two features. Therefore, we cannot plot the classification regions of the final (best) model.
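One way to run the requested grid search over cost is a minimal sketch using only functions already used above, continuing the session (so mlr3verse and the task tsk are assumed to be available). Note that type = "C-classification" is set explicitly, since in mlr3 the cost parameter of classif.svm depends on that SVM type:

```r
set.seed(42)  # reproducible fold assignment
costs <- c(1, 10, 50, 100)
cv <- rsmp("cv", folds = 10)
# 10-fold CV error for each candidate cost value
cv_err <- sapply(costs, function(co) {
  lr <- lrn("classif.svm", type = "C-classification", cost = co)
  resample(learner = lr, task = tsk, resampling = cv)$aggregate(msr("classif.ce"))
})
names(cv_err) <- costs
cv_err                      # CV error per candidate cost
costs[which.min(cv_err)]    # cost value with the lowest CV error
```

Alternatively, mlr3tuning can automate this search; the manual loop keeps the sketch to functions already introduced in this module.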