Support Vector Machines

Support Vector Machines (SVMs) can also be applied to multiclass classification tasks using strategies such as one-vs-one or one-vs-all. In the one-vs-one strategy, one binary classifier is trained for every pair of classes, each distinguishing only between the two classes of its pair. In the one-vs-all strategy, one binary classifier is trained per class, each distinguishing that class from all other classes combined.
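To make the difference between the two strategies concrete, the number of binary classifiers each one needs for \(K\) classes can be written out. The symbols \(K\), \(N_{\text{one-vs-one}}\), and \(N_{\text{one-vs-all}}\) are introduced here for illustration only:

\[
N_{\text{one-vs-one}} = \binom{K}{2} = \frac{K(K-1)}{2}, \qquad N_{\text{one-vs-all}} = K
\]

With the five education levels used in this exercise (\(K = 5\)), one-vs-one would train \(5 \cdot 4 / 2 = 10\) binary classifiers, whereas one-vs-all would train 5.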

Description of data set

We use a version of the bfi dataset from class to predict the level of education from the Big-5 personality traits. The data are a subset of observations from the original dataset, chosen so that the educational levels are balanced. The reason is that classifiers often struggle with imbalanced classes (e.g., the majority of observations having education level 3 in the original data).

For simplicity, we treat education as a categorical variable here, although it is actually an ordinal variable (i.e., 1 < 2 < 3 < 4 < 5).

Type ?psych::bfi into your console for more information on the dataset. Note that the Big-5 traits agree, conscientious, extra, neuro, and open were created by averaging each participant’s responses to the five survey items per trait (e.g., A1-A5).

Tasks

  1. Read the data file module2-bfi.csv into R (assign it to a variable called “dat”).

  2. Convert the education variable to a factor and assign the data set “dat” to an mlr3 classification task called “tsk” with education as target and agree and conscientious as features.

  3. Randomly separate the dataset into 80% training and 20% testing data (Hint: Set the seed to ensure reproducibility of your results).

  4. Use the training sample to build an SVM (with default settings) to predict the target education with agree and conscientious as features.

  5. Visualize the classifier for agreeableness on the x-axis and conscientiousness on the y-axis.

  6. Now use the training sample to fit another SVM (with default settings) for education as target and the full set of Big-5 traits as features.

  7. Predict the educational levels of the observations in the training sample as well as in the held-out test sample. Also calculate the in-sample training classification error and compare it to the out-of-sample testing classification error. Why is the former likely (much) smaller than the latter?

  8. Assess the out-of-sample performance of your SVM from task 6 using 10-fold cross-validation (CV). Does CV give a better estimate of your model’s out-of-sample classification performance than the test error observed in task 7? (Hint: Set the seed to ensure reproducibility of your results)

  9. Bonus: Using 10-fold cross-validation, choose a value for the tuning parameter \(C\) (cost) from the set (1, 10, 50, 100) based on the full dataset. Also investigate the final (best) SVM by printing the summary of the model. (Hint: Set the seed to ensure reproducibility of your results)
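
As a starting point for tasks 1-4 and 6-8, the workflow might look roughly like the sketch below. It is one possible approach, not a reference solution. The helper name svm_errors and its arguments are hypothetical and not part of the course material. The sketch assumes that the mlr3, mlr3learners, and e1071 packages are installed, that module2-bfi.csv sits in the working directory, and that its columns are named education, agree, conscientious, extra, neuro, and open. Running the cross-validation on the full task is also an assumption; task 8 could equally be read as asking for CV on the training sample only.

```r
library(mlr3)          # tasks, resampling, measures
library(mlr3learners)  # provides lrn("classif.svm"), which wraps e1071::svm

# Hypothetical helper (not part of the course material): fits a default SVM
# on an 80/20 split and returns the training, test, and 10-fold CV errors.
svm_errors <- function(dat, features, seed = 1) {
  dat$education <- factor(dat$education)    # treat education as categorical
  tsk <- as_task_classif(dat[, c("education", features)],
                         target = "education", id = "bfi")

  set.seed(seed)                            # reproducible split and CV folds
  split <- partition(tsk, ratio = 0.8)      # row ids in $train and $test

  learner <- lrn("classif.svm")             # default settings
  learner$train(tsk, row_ids = split$train)

  ce <- msr("classif.ce")                   # share of misclassified observations
  train_ce <- learner$predict(tsk, row_ids = split$train)$score(ce)
  test_ce  <- learner$predict(tsk, row_ids = split$test)$score(ce)

  # Assumption: 10-fold CV is run on the full task, with a fresh learner.
  rr <- resample(tsk, lrn("classif.svm"), rsmp("cv", folds = 10))

  c(train = unname(train_ce),
    test  = unname(test_ce),
    cv    = unname(rr$aggregate(ce)))
}

# Usage with the course data, only if the file is present.
if (file.exists("module2-bfi.csv")) {
  dat <- read.csv("module2-bfi.csv")
  print(svm_errors(dat, c("agree", "conscientious", "extra", "neuro", "open")))
}
```

Passing c("agree", "conscientious") as features instead gives the two-feature model from task 4. The visualization in task 5 and the tuning of \(C\) in task 9 are not covered by this sketch.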