Similar to Support Vector Machines (SVMs), trees handle multiclass classification well. Unlike SVMs, however, trees require no special techniques such as one-vs-one or one-vs-all: the majority voting procedure used to assign classes to terminal nodes accommodates any number of classes by default.
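As a minimal sketch of how majority voting assigns a class label at a terminal node (the node's observations here are hypothetical, not from the bfi data):

```r
# Hypothetical class labels of the observations falling into one terminal node
node_labels <- c("3", "3", "4", "3", "5", "3")

# The node predicts the most frequent class, regardless of how many classes exist
predicted <- names(which.max(table(node_labels)))
predicted
## [1] "3"
```

Because the vote is taken over all classes at once, no pairwise or one-vs-rest decomposition is needed.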
As in the SVM tutorial, we will use the bfi dataset to
predict level of education from the Big-5 personality traits. However,
here we do not select a subset of observations with balanced
educational levels, because trees are much better at handling
unbalanced data, as we will see below.
For simplicity, we treat education as a categorical
variable here, although it is actually an ordinal variable (i.e., 1 <
2 < 3 < 4 < 5).
Type ?psych::bfi into your console for more information on the
dataset. Note that the Big-5 trait scores agree,
conscientious, extra, neuro, and
open were created by averaging each participant’s responses
to the five survey items per trait (e.g.,
A1-A5).
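As a sketch of how such trait scores are formed, assuming hypothetical responses to the five agreeableness items (and assuming any reverse-keyed items have already been recoded):

```r
# Hypothetical responses of two participants to items A1-A5
items <- data.frame(A1 = c(2, 5), A2 = c(4, 4), A3 = c(3, 5),
                    A4 = c(5, 4), A5 = c(4, 5))

# Each trait score is the participant-wise mean across its five items
agree <- rowMeans(items)
agree
##   1   2
## 3.6 4.6
```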
dat <- read.csv('module2-bfi-imbalanced.csv', header = TRUE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dat <- dat %>% mutate(across(c(education, gender), factor))
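Before fitting, it can be instructive to check just how unbalanced the target is; a minimal sketch, using a hypothetical vector in place of dat$education:

```r
# Hypothetical education vector standing in for dat$education
edu <- factor(c(1, 3, 3, 4, 4, 4, 5, 5, 5, 5))

# Absolute and relative class frequencies reveal the imbalance
table(edu)
round(prop.table(table(edu)), 2)
```

Applied to the real data as table(dat$education), this shows how skewed the class distribution is that the tree has to cope with.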
Fit a classification tree predicting education by
all features. Make sure to set the learner’s keep_model
argument to TRUE, which is needed for task 4. (Hint: Avoid including the
identifier CASE in the feature set; Hint: Set the seed to
ensure reproducibility of your results, e.g., in case your model has to
randomly break ties)
library(mlr3verse)
## Loading required package: mlr3
set.seed(42)
tsk = as_task_classif(education ~ ., data = dat %>% select(-CASE))
mdl = lrn("classif.rpart", keep_model = TRUE)
mdl$train(tsk)
autoplot(mdl, type = "ggparty")
Use cross-validation to tune the complexity parameter cp
(between 0 and 0.05 in steps of 0.01) to remove unnecessary terminal
nodes and reduce overfitting. (Hint: Set the seed to ensure
reproducibility of your results)
set.seed(2)
# Define set of complexity parameter values to be tested
cp_cv <- seq(0, 0.05, 0.01)
# Set up the conditions for the hyperparameter tuning
mdl_cv = auto_tuner(
learner = lrn("classif.rpart", keep_model = TRUE, cp = to_tune(levels = cp_cv)),
resampling = rsmp("cv", folds = 10),
measure = msr("classif.ce"),
tuner = tnr("grid_search"),
terminator = trm("none")
)
# Actually tune the hyperparameter (i.e., cp) and fit the final model
invisible({capture.output({ #remove console output from html document
mdl_cv$train(tsk)
})})
# Print the output of the tuning
mdl_cv$archive %>%
as.data.table() %>%
select(cp, classif.ce) %>%
arrange(as.numeric(cp))
mdl_cv$tuning_result
# Plot the final model
autoplot(mdl_cv$learner, type = "ggparty")
# Extract predicted labels
pred_labs <- mdl_cv$learner$model$frame |>
filter(var == '<leaf>') |>
select(var, n, yval)
pred_labs
# Check for uniqueness and compare to actual target levels
unique(pred_labs$yval)
## [1] 3 4 5
levels(dat$education)
## [1] "1" "2" "3" "4" "5"
For this specific tree, multiple education levels would never be predicted. However, trees are rather unstable, and even small changes in the data can yield a completely different result. For instance, changing the seed to a different value produces a much more complex tree here, yet still not all education levels are predicted:
set.seed(42)
mdl_cv2 = auto_tuner(
learner = lrn("classif.rpart", keep_model = TRUE, cp = to_tune(levels = cp_cv)),
resampling = rsmp("cv", folds = 10),
measure = msr("classif.ce"),
tuner = tnr("grid_search"),
terminator = trm("none")
)
invisible({capture.output({ #remove console output from html document
mdl_cv2$train(tsk)
})})
mdl_cv2$archive %>%
as.data.table() %>%
select(cp, classif.ce) %>%
arrange(as.numeric(cp))
mdl_cv2$tuning_result
autoplot(mdl_cv2$learner, type = "ggparty")
Fit a random forest predicting education by all features. Make sure to set the learner’s importance
argument to “permutation”, which is needed for task 8. (Hint: Set the
seed to ensure reproducibility of your results)
set.seed(42)
mdl = lrn("classif.ranger", importance = 'permutation')
mdl$train(tsk)
mdl$model
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), probability = self$predict_type == "prob", case.weights = task$weights$weight, importance = "permutation", num.threads = 1L)
##
## Type: Classification
## Number of trees: 500
## Sample size: 100
## Number of independent variables: 7
## Mtry: 2
## Target node size: 1
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error: 61.00 %
barplot(mdl$importance(), horiz = TRUE, las = 2)
Tune the hyperparameters num.trees from 500 to 1500 in steps of 500 and
mtry from 2 to 5 in steps of 1. Again, make sure to set the
learner’s importance argument to “permutation”, which is
needed for task 10. (Hint: Set the seed to ensure reproducibility of
your results)
set.seed(42)
mtry_cv <- seq(2, 5)
num.trees_cv <- c(500, 1000, 1500)
mdl_cv = auto_tuner(
learner = lrn("classif.ranger", importance = 'permutation',
mtry = to_tune(levels = mtry_cv),
num.trees = to_tune(levels = num.trees_cv)),
resampling = rsmp("cv", folds = 5),
measure = msr("classif.ce"),
tuner = tnr("grid_search"),
terminator = trm("none")
)
invisible({capture.output({ #remove console output from html document
mdl_cv$train(tsk)
})})
mdl_cv$archive %>%
as.data.table() %>%
select(mtry, num.trees, classif.ce) %>%
arrange(as.numeric(mtry), as.numeric(num.trees))
mdl_cv$tuning_result
mdl_cv$learner$model
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), probability = self$predict_type == "prob", case.weights = task$weights$weight, importance = "permutation", mtry = 2L, num.threads = 1L, num.trees = 1000L)
##
## Type: Classification
## Number of trees: 1000
## Sample size: 100
## Number of independent variables: 7
## Mtry: 2
## Target node size: 1
## Variable importance mode: permutation
## Splitrule: gini
## OOB prediction error: 61.00 %
par(mfrow = c(1, 2))
barplot(mdl$importance(), horiz = TRUE, las = 2, main = "Default RF")
barplot(mdl_cv$importance(), horiz = TRUE, las = 2, main = "Tuned RF")
There are rather substantial differences between the two feature importance plots in terms of the relative rankings of the features. Because the tuned RF is the better model (its hyperparameters were chosen via cross-validation), its ranking is the more reliable one. Note, however, that RFs are generally less sensitive to hyperparameter tuning than other ML methods, such as SVMs.
Note: Feature importance scores are typically calculated from metrics such as the mean decrease in node impurity (e.g., Gini) or, as here, the permutation-based drop in accuracy. These scores therefore provide a relative measure of each feature’s importance within a model; comparing absolute importance values across different models is not very informative.
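One way to make such a comparison fairer is to compare rank orders or normalized scores rather than raw values; a sketch with hypothetical importance vectors for two models:

```r
# Hypothetical importance scores from two different models (made-up values)
imp_default <- c(agree = 0.040, open = 0.100, neuro = 0.020)
imp_tuned   <- c(agree = 0.012, open = 0.030, neuro = 0.008)

# Normalize to proportions so only relative importance is compared
norm_default <- imp_default / sum(imp_default)
norm_tuned   <- imp_tuned / sum(imp_tuned)

# Compare rank orders rather than absolute magnitudes
rank(-norm_default)
rank(-norm_tuned)
```

If the rank orders agree, the two models tell the same qualitative story even when their raw importance values differ by an order of magnitude.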