Tree-Based Classifiers
Like Support Vector Machines (SVMs), trees are well suited to
multiclass classification. Importantly, however, the majority voting
procedure that assigns classes to terminal nodes means that there is no
need for techniques such as one-vs-one (OvO) or one-vs-all (OvA)
strategies.
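To make the majority-vote idea concrete, here is a minimal sketch. It is not part of the tutorial data: it assumes the rpart package and uses R's built-in three-class iris data as a stand-in. A single tree handles all three classes at once, and the predicted label of each case is the most frequent class in its terminal node.

```r
library(rpart)

# Fit one classification tree to a three-class problem (built-in iris data,
# used here only as an illustrative stand-in for the tutorial data).
fit <- rpart(Species ~ ., data = iris, method = "class")

# Class proportions of the terminal node each observation falls into
leaf_probs <- predict(fit, newdata = iris, type = "prob")

# Class label predicted by the tree for each observation
leaf_class <- predict(fit, newdata = iris, type = "class")

# The most frequent class in the node is the predicted label,
# so no OvO or OvA decomposition is involved.
majority <- colnames(leaf_probs)[max.col(leaf_probs, ties.method = "first")]
all(majority == as.character(leaf_class))
```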
Description of data set
In contrast to the SVM tutorial, we use the bfi dataset
to predict level of education from the Big-5 personality traits. We do not
select a subset of observations with balanced educational levels,
because trees are much better at handling unbalanced data.
For simplicity, we treat education as a nominal (unordered categorical)
variable here, although it is actually an ordinal variable (i.e., 1 <
2 < 3 < 4 < 5).
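The difference between the two encodings can be sketched in a few lines of R. The education codes below are made-up values, and the object names edu_nominal and edu_ordinal are purely illustrative.

```r
# Hypothetical education codes (1 = lowest level, 5 = highest level)
education <- c(3, 5, 1, 3, 2, 4)

# Unordered factor: the levels are just labels without any order
edu_nominal <- factor(education, levels = 1:5)

# Ordered factor: the levels carry the order 1 < 2 < 3 < 4 < 5
edu_ordinal <- factor(education, levels = 1:5, ordered = TRUE)

is.ordered(edu_nominal)           # FALSE
is.ordered(edu_ordinal)           # TRUE
edu_ordinal[1] < edu_ordinal[2]   # TRUE, because level 3 is below level 5
```

The tutorial works with the unordered version, so the tree treats the five levels as interchangeable class labels.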
Type ?psych::bfi into your console for more information on the
dataset. Note that the Big-5 traits agree,
conscientious, extra, neuro, and
open were created by averaging each participant’s responses
to the five survey items per trait (e.g.,
A1-A5).
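As an illustration of how such a trait score can be computed, the sketch below averages five item responses per participant. The responses are toy values for three hypothetical participants; the use of na.rm = TRUE is an assumption, and the sketch does not address whether any items were reverse-scored before averaging, which the tutorial does not state.

```r
# Toy responses of three hypothetical participants to the items A1-A5
items <- data.frame(
  A1 = c(2, 4, 6),
  A2 = c(3, 4, 5),
  A3 = c(4, 4, NA),
  A4 = c(3, 5, 6),
  A5 = c(3, 3, 3)
)

# Average each participant's responses across the five items of the trait.
# na.rm = TRUE (an assumption) averages over the items that were answered.
items$agree <- rowMeans(items[, paste0("A", 1:5)], na.rm = TRUE)
items$agree
```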
Tasks
1. Read the data file modeul2-bfi-imbalanced.csv into R (assign it to a
variable called “dat”).
2. Transform all discrete variables to factors so that the tree algorithm
works as intended.
3. Build a tree model to predict the target “education” from all features
except the identifier “CASE”. (Hint: Set the seed to ensure
reproducibility of your results, e.g., if your model has to randomly
break ties.)
4. Visualize your result from task 3 as a tree.
5. Prune your tree from task 3 by means of 10-fold cross-validation.
That is, choose the complexity penalty parameter cp
(between 0 and 0.05 in steps of 0.01) to potentially remove unnecessary
terminal nodes and reduce overfitting. Visualize your final result
(i.e., best model) as a tree. Would your pruned tree be able to predict
all available class labels? In other words, are there any educational
levels for which no combination of features would result in the tree
making a corresponding prediction? (Hint: Set the seed to ensure
reproducibility of your results.)
6. Because of the instability of a single tree, build an ensemble of
trees using the random forest approach and default tuning parameter
settings. To proceed later with task 7, you must set the
importance argument of the learner equal to “permutation”.
(Hint: Set the seed to ensure reproducibility of your results.)
7. Plot the feature importance of all features used in your random
forest from task 6.
8. Build a random forest and tune the hyperparameters
num.trees from 500 to 1500 in steps of 500 and
mtry from 2 to 5 in steps of 1. To proceed later with task
9, you must again set the importance argument of the
learner equal to “permutation”. (Hint: Set the seed to ensure
reproducibility of your results.)
9. Plot the feature importance of the tuned random forest and compare
the ranking to the feature importance plot of the random forest that was
fit with default tuning parameter settings in task 6. Are there
substantial differences between the two plots?
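The workflow behind these tasks can be sketched as follows. This is one possible outline under stated assumptions, not the tutorial's reference solution: it calls the rpart and ranger packages directly rather than through a learner framework, it replaces the CSV file with synthetic data that only borrows the column names mentioned above, and it tunes the forest on ranger's out-of-bag error instead of cross-validation. All object names (train, tree_full, cv_error, rf_default, and so on) are illustrative.

```r
library(rpart)   # single classification trees
library(ranger)  # random forests

set.seed(123)

# Synthetic stand-in for the tutorial data (all values are made up)
n <- 400
dat <- data.frame(CASE = seq_len(n),
                  agree = runif(n, 1, 6), conscientious = runif(n, 1, 6),
                  extra = runif(n, 1, 6), neuro = runif(n, 1, 6),
                  open = runif(n, 1, 6))
latent <- dat$open + 0.5 * dat$conscientious + rnorm(n)
dat$education <- cut(latent,
                     breaks = quantile(latent, c(0, 0.1, 0.2, 0.7, 0.85, 1)),
                     labels = as.character(1:5), include.lowest = TRUE)

# Tasks 2-3: the target is a factor; drop the identifier before modelling
train <- dat[, setdiff(names(dat), "CASE")]
tree_full <- rpart(education ~ ., data = train, method = "class",
                   control = rpart.control(cp = 0))

# Task 5: manual 10-fold cross-validation over the cp grid
cp_grid <- seq(0, 0.05, by = 0.01)
folds <- sample(rep(1:10, length.out = n))
cv_error <- sapply(cp_grid, function(cp) {
  mean(sapply(1:10, function(k) {
    fit <- rpart(education ~ ., data = train[folds != k, ], method = "class",
                 control = rpart.control(cp = cp))
    pred <- predict(fit, newdata = train[folds == k, ], type = "class")
    mean(as.character(pred) != as.character(train$education[folds == k]))
  }))
})
best_cp <- cp_grid[which.min(cv_error)]
tree_pruned <- prune(tree_full, cp = best_cp)

# Tasks 4-5: draw the tree (plot.rpart fails on a root-only tree, hence the guard)
if (nrow(tree_pruned$frame) > 1) {
  plot(tree_pruned, margin = 0.1)
  text(tree_pruned, use.n = TRUE, cex = 0.7)
}

# Task 5: class labels that no terminal node of the pruned tree predicts
predictable <- unique(as.character(predict(tree_pruned, type = "class")))
never_predicted <- setdiff(levels(train$education), predictable)

# Tasks 6-7: random forest with default settings and permutation importance
rf_default <- ranger(education ~ ., data = train,
                     importance = "permutation", seed = 1)
barplot(sort(rf_default$variable.importance), horiz = TRUE, las = 1)

# Task 8: grid search, scored by out-of-bag error (an assumption, see above)
grid <- expand.grid(num.trees = seq(500, 1500, by = 500), mtry = 2:5)
grid$oob_error <- mapply(function(nt, m) {
  ranger(education ~ ., data = train, num.trees = nt, mtry = m,
         seed = 1)$prediction.error
}, grid$num.trees, grid$mtry)
best <- grid[which.min(grid$oob_error), ]
rf_tuned <- ranger(education ~ ., data = train, num.trees = best$num.trees,
                   mtry = best$mtry, importance = "permutation", seed = 1)

# Task 9: compare the importance rankings of both forests
barplot(sort(rf_tuned$variable.importance), horiz = TRUE, las = 1)
rank_default <- names(sort(rf_default$variable.importance, decreasing = TRUE))
rank_tuned <- names(sort(rf_tuned$variable.importance, decreasing = TRUE))
rbind(default = rank_default, tuned = rank_tuned)
```

Any level listed in never_predicted is an educational level the pruned tree cannot output, because every terminal node contains training cases and therefore appears among the fitted predictions. With the real data, the first step would read the CSV file into dat instead of simulating it.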