Support Vector Machines

Support Vector Machines (SVMs) can also be applied to multiclass classification tasks through strategies such as one-vs-one or one-vs-all. In the one-vs-one strategy, a separate binary classifier is trained for each pair of classes. In the one-vs-all strategy, one binary classifier is trained for each class to distinguish that class from all other classes.
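To make the difference between the two strategies concrete, the following base R sketch counts how many binary classifiers each one needs for the five education levels used in this tutorial. The variable names (`classes`, `ovo_pairs`, `n_ovo`, `n_ova`) are illustrative only and are not part of the tutorial material.

```r
# Five education levels, as in the tutorial data
classes <- 1:5

# One-vs-one: one binary classifier per unordered pair of classes
ovo_pairs <- combn(classes, 2)   # each column is one pair of classes
n_ovo <- ncol(ovo_pairs)         # choose(5, 2) pairs

# One-vs-all: one binary classifier per class (that class vs. the rest)
n_ova <- length(classes)

cat("one-vs-one classifiers:", n_ovo, "\n")
cat("one-vs-all classifiers:", n_ova, "\n")
```

With five classes, one-vs-one needs 10 classifiers and one-vs-all needs 5. Each one-vs-one classifier is trained only on the observations of its two classes.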

Description of data set

We use a version of the bfi dataset from class to predict the level of education from the Big-5 personality traits. The data are a subset of observations from the original dataset, chosen so that the educational levels are balanced. The reason is that classifiers often struggle with imbalanced classes (e.g., when the majority of observations have education equal to 3).
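The text does not say how the balanced subset was drawn. One common approach is random downsampling to the size of the smallest class, sketched below on simulated data. The data frame `dat_full`, the class proportions, and the seed are all assumptions made for illustration, not the procedure actually used for the tutorial file.

```r
set.seed(1)  # arbitrary seed, for reproducibility of the sketch

# Simulated, imbalanced education variable (level 3 is the majority)
education <- sample(1:5, size = 500, replace = TRUE,
                    prob = c(0.10, 0.10, 0.50, 0.15, 0.15))
dat_full <- data.frame(id = seq_along(education), education = education)

# Size of the smallest class
n_min <- min(table(dat_full$education))

# Draw n_min rows at random from each education level
rows_by_level <- split(seq_len(nrow(dat_full)), dat_full$education)
keep <- unlist(lapply(rows_by_level, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))
dat_bal <- dat_full[keep, ]

table(dat_bal$education)  # every level now has n_min observations
```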

For simplicity, we treat education as a categorical variable here, although it is actually an ordinal variable (i.e., 1 < 2 < 3 < 4 < 5).
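The distinction can be seen directly in base R, where a plain factor is categorical and an ordered factor keeps the ranking of its levels. The values below are made up for illustration.

```r
# Made-up education values for six participants
education <- c(3, 1, 5, 2, 4, 3)

# Categorical (nominal) coding, as used in this tutorial
edu_nominal <- factor(education, levels = 1:5)

# Ordinal coding, which keeps the ordering 1 < 2 < 3 < 4 < 5
edu_ordinal <- factor(education, levels = 1:5, ordered = TRUE)

is.ordered(edu_nominal)           # FALSE
is.ordered(edu_ordinal)           # TRUE
edu_ordinal[1] < edu_ordinal[3]   # TRUE, because level 3 ranks below level 5
```

Comparisons such as `<` are only meaningful for the ordered version. A classifier that receives the plain factor treats the five levels as unrelated labels.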

Type ?psych::bfi into your console for more information on the dataset. Note that the Big-5 traits agree, conscientious, extra, neuro, and open were created by averaging each participant’s responses to the five survey items per trait (e.g., A1-A5).
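As a minimal sketch of this averaging step, the code below computes an agree score as the row mean of items A1 to A5. The response values are made up, and the sketch does no item recoding or missing-value handling, since the text does not describe either.

```r
# Made-up responses of three participants to the five agreeableness items
items <- data.frame(
  A1 = c(2, 4, 6),
  A2 = c(3, 4, 5),
  A3 = c(4, 4, 6),
  A4 = c(3, 4, 5),
  A5 = c(3, 4, 3)
)

# Trait score = mean of the five item responses per participant
items$agree <- rowMeans(items[, paste0("A", 1:5)])

items$agree
```

The same pattern would apply to the other traits by swapping in their item columns.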

Tasks

1. Read the data file modeul2-bfi.csv into R (assign it to a variable called “dat”).

2. Transform the education variable to a factor and assign the data set “dat” to an mlr3 classification task called “tsk” with education as target and agree and conscientious as features.

3. Randomly split the dataset into 80% training and 20% testing data (Hint: Set the seed to ensure reproducibility of your results).

4. Use the training sample to build an SVM (with default settings) to predict the target education with agree and conscientious as features.

5. Visualize the classifier with agreeableness on the x-axis and conscientiousness on the y-axis.

6. Now use the training sample to build an SVM (with default settings) for education as target and all Big-5 traits as features.

7. Predict the educational levels of the observations in the training sample as well as in the held-out test sample. Also calculate the in-sample training classification error and compare it to the out-of-sample testing classification error. Why is the former likely (much) smaller than the latter?

8. Assess the expected out-of-sample performance of your learner from task 6 using 10-fold cross-validation (CV). Does CV give a better estimate of your model’s out-of-sample classification performance? (Hint: Set the seed to ensure reproducibility of your results)

9. Bonus: Using 10-fold cross-validation, choose a value for the tuning parameter \(C\) (cost) from the set (1, 10, 50, 100). (Hint: Set the seed to ensure reproducibility of your results)


Module 2: Tutorial: Support Vector Machines
