---
title: "Module 2: Tutorial: Trees and Forests"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

# Tree-Based Classifiers

Like Support Vector Machines (SVMs), trees are well suited to multiclass classification. Unlike SVMs, however, trees handle more than two classes natively: each terminal node is assigned a class by majority vote among its training observations, so there is no need for techniques such as one-vs-one (OvO) or one-vs-all (OvA) strategies.
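
To make the majority-vote rule concrete, the following sketch fits a small tree on R's built-in `iris` data (used purely for illustration; it is not part of this tutorial's data) and checks that the class assigned to each terminal node is the most frequent class among the training observations in that node. It assumes `rpart`'s default settings (priors proportional to the class frequencies and no loss matrix); the object names are hypothetical.

```{r}
library(rpart)

# Illustration only: three-class tree on the built-in iris data
fit_iris <- rpart(Species ~ ., data = iris, method = "class")

# fit_iris$where holds, for every training row, the row of fit_iris$frame
# that describes the terminal node the observation falls into
leaf_row <- fit_iris$where

# Most frequent class among the training observations in each terminal node
majority <- tapply(iris$Species, leaf_row,
                   function(y) names(which.max(table(y))))

# Class label that the tree assigns to each of those terminal nodes
assigned <- attr(fit_iris, "ylevels")[
  fit_iris$frame$yval[as.integer(names(majority))]
]

# A single tree handles all classes; no OvO or OvA decomposition is involved
data.frame(leaf = names(majority),
           majority = as.vector(majority),
           assigned = assigned)
```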

## Description of data set

In contrast to the SVM tutorial, we use the `bfi` dataset to predict level of education from the Big-5 personality traits. We do not select a subset of observations with balanced educational levels, because trees are generally better at handling unbalanced data.
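
Before fitting, it can help to check how unbalanced the target actually is. A minimal sketch with a made-up vector of education levels (the counts are hypothetical and not taken from the tutorial's data file):

```{r}
# Hypothetical education levels for 20 participants (illustration only)
edu_toy <- factor(c(rep(1, 2), rep(2, 3), rep(3, 10), rep(4, 3), rep(5, 2)))

# Absolute and relative class frequencies
class_counts <- table(edu_toy)
class_share <- prop.table(class_counts)
class_counts
round(class_share, 2)

# Error rate of a baseline that always predicts the most frequent class;
# a useful reference point when judging a classifier on unbalanced data
baseline_error <- 1 - max(class_share)
baseline_error
```

With the tutorial data, the same calls can be applied to `dat$education` once the file has been read in.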

For simplicity, we treat `education` as a categorical variable here, although it is actually an ordinal variable (i.e., 1 \< 2 \< 3 \< 4 \< 5).
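
The difference between the two encodings can be seen in base R. This sketch uses a short made-up vector rather than the tutorial data:

```{r}
# Made-up education codes (illustration only)
edu_codes <- c(1, 3, 2, 5, 4, 3)

# Nominal encoding, as used in this tutorial: the levels have no order
edu_nominal <- factor(edu_codes, levels = 1:5)

# Ordinal encoding: the levels carry the order 1 < 2 < 3 < 4 < 5
edu_ordinal <- factor(edu_codes, levels = 1:5, ordered = TRUE)

is.ordered(edu_nominal)   # FALSE
is.ordered(edu_ordinal)   # TRUE

# Order comparisons are only defined for the ordered factor
edu_ordinal[1] < edu_ordinal[2]   # TRUE, because level 1 < level 3
```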

Type `?psych::bfi` into your console for more information on the dataset. Note that the Big-5 traits `agree`, `conscientious`, `extra`, `neuro`, and `open` were created by averaging each participant's responses to the five survey items per trait (e.g., `A1`-`A5`).
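
As a sketch of how such trait scores can be computed, the following uses a small made-up data frame with item columns `A1` to `A5`; the values and the object names `items_toy` and `agree_toy` are hypothetical. It takes plain row means, as described above, and does not deal with details such as reverse-keyed items.

```{r}
# Made-up responses of three participants to five items (illustration only)
items_toy <- data.frame(
  A1 = c(2, 4, 6),
  A2 = c(3, 4, 6),
  A3 = c(1, 4, 6),
  A4 = c(2, 4, 6),
  A5 = c(2, 4, 6)
)

# Trait score = mean of the five item responses per participant;
# na.rm = TRUE averages over the answered items if some are missing
agree_toy <- rowMeans(items_toy[, paste0("A", 1:5)], na.rm = TRUE)
agree_toy
```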

## Tasks

1.  Read the data file `module2-bfi-imbalanced.csv` into R (assign it to a variable called "dat").

```{r}
dat <- read.csv('module2-bfi-imbalanced.csv', header = TRUE)
```

2.  Transform all discrete variables to factors for the tree algorithm to work as intended.

```{r}
library(tidyverse)

# Convert the discrete variables to factors so that the tree treats them as categorical
dat <- dat %>% mutate(across(c(education, gender), factor))
```

3.  Build a tree model to predict the target "education" by all features except for the identifier "CASE". (Hint: Set the seed to ensure reproducibility of your results, e.g., if your model has to randomly break ties)

```{r}
library(mlr3verse)

set.seed(42)
tsk = as_task_classif(education ~ ., data = dat %>% select(-CASE))
mdl = lrn("classif.rpart", keep_model = TRUE)
mdl$train(tsk)
```
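
The model above is trained on all rows, so it has not yet been evaluated on unseen data. As a hedged sketch of the general mlr3 workflow (task, learner, train, predict, score), the following uses the `iris` example task that ships with mlr3 instead of the tutorial data. The 2/3 training split and the object names are arbitrary choices made for this illustration.

```{r}
library(mlr3)

set.seed(42)

# Example task bundled with mlr3 (illustration only, not the bfi data)
task_demo <- tsk("iris")
learner_demo <- lrn("classif.rpart")

# Hold out one third of the rows for evaluation
split_demo <- partition(task_demo, ratio = 2/3)

# Fit on the training rows, then predict the held-out rows
learner_demo$train(task_demo, row_ids = split_demo$train)
pred_demo <- learner_demo$predict(task_demo, row_ids = split_demo$test)

# Share of misclassified held-out observations
holdout_error <- pred_demo$score(msr("classif.ce"))
holdout_error
```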

4.  Visualize your result from task 3 as a tree.

```{r}
autoplot(mdl, type = "ggparty")
```

5.  Prune your tree from task 3 by means of 10-fold cross-validation. That is, choose the complexity penalty parameter `cp` (between 0 and 0.05 in steps of 0.01) to potentially remove unnecessary terminal nodes and reduce overfitting. Visualize your final result (i.e., best model) as a tree. Would your pruned tree be able to predict all available class labels? In other words, are there any educational levels for which no combination of features would result in the tree making a corresponding prediction? (Hint: Set the seed to ensure reproducibility of your results)

```{r}
set.seed(42)

# Define set of complexity parameter values to be tested
cp_cv <- seq(0, 0.05, 0.01)

# Set up the conditions for the hyperparameter tuning
mdl_cv = auto_tuner(
  learner = lrn("classif.rpart", keep_model = TRUE, cp = to_tune(levels = cp_cv)),
  resampling = rsmp("cv", folds = 10),
  measure = msr("classif.ce"),
  tuner = tnr("grid_search"),
  terminator = trm("none")
)

# Actually tune the hyperparameter (i.e., cp) and fit the final model
invisible({capture.output({ #remove console output from html document
  mdl_cv$train(tsk)
})})

# Print the output of the tuning
mdl_cv$archive %>% 
  as.data.table() %>% 
  select(cp, classif.ce) %>% 
  arrange(as.numeric(cp))
mdl_cv$tuning_result

# Plot the final model
autoplot(mdl_cv$learner, type = "ggparty")
```
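
Conceptually, the tuning in the chunk above fits a tree for every candidate `cp` on each set of training folds and averages the misclassification rate over the held-out folds. The following base-R sketch spells this out with `rpart` on the built-in `iris` data. It illustrates the idea only and does not reproduce what `auto_tuner()` does internally (for example, the fold assignment differs); all object names are hypothetical.

```{r}
library(rpart)

set.seed(42)

cp_grid_demo <- seq(0, 0.05, 0.01)   # candidate complexity penalties
k_demo <- 10                         # number of folds

# Randomly assign every row to one of the k folds
fold_id <- sample(rep(seq_len(k_demo), length.out = nrow(iris)))

# For each cp: train on k - 1 folds, test on the remaining fold, average
cv_error_demo <- sapply(cp_grid_demo, function(cp) {
  mean(sapply(seq_len(k_demo), function(i) {
    train <- iris[fold_id != i, ]
    test <- iris[fold_id == i, ]
    fit <- rpart(Species ~ ., data = train, method = "class",
                 control = rpart.control(cp = cp, xval = 0))
    mean(predict(fit, test, type = "class") != test$Species)
  }))
})
names(cv_error_demo) <- cp_grid_demo
cv_error_demo

# cp value with the lowest cross-validated classification error
best_cp_demo <- cp_grid_demo[which.min(cv_error_demo)]
best_cp_demo
```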

For the tree tuned with seed 42, multiple education levels would never be predicted. However, trees are rather unstable: even small changes in the data, or in the cross-validation folds that result from a different seed, can yield a completely different tree. In my case, a different seed produces a much more complex tree, but still not all education levels will be predicted:

```{r}
set.seed(1)

mdl_cv2 = auto_tuner(
  learner = lrn("classif.rpart", keep_model = TRUE, cp = to_tune(levels = cp_cv)),
  resampling = rsmp("cv", folds = 10),
  measure = msr("classif.ce"),
  tuner = tnr("grid_search"),
  terminator = trm("none")
)

invisible({capture.output({ #remove console output from html document
  mdl_cv2$train(tsk)
})})

mdl_cv2$archive %>% 
  as.data.table() %>% 
  select(cp, classif.ce) %>% 
  arrange(as.numeric(cp))
mdl_cv2$tuning_result

autoplot(mdl_cv2$learner, type = "ggparty")
```
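
Whether a class can ever be predicted can also be checked programmatically instead of reading it off the plot: collect the class labels of all terminal nodes and compare them with the levels of the target. The sketch below does this for an `rpart` tree fit on a deliberately unbalanced subset of `iris` (an illustrative stand-in for the tutorial data; the object names are hypothetical). For the tuned models above, the same steps could be applied to the fitted `rpart` object stored in the learner's `$model` field (e.g., `mdl_cv2$learner$model`).

```{r}
library(rpart)

# Unbalanced illustration data: 50 setosa, 50 versicolor, only 5 virginica
iris_unbal <- iris[c(1:100, 101:105), ]

fit_unbal <- rpart(Species ~ ., data = iris_unbal, method = "class")

# Rows of the frame with var == "<leaf>" are the terminal nodes;
# yval is the index of the class label assigned to each node
is_leaf <- fit_unbal$frame$var == "<leaf>"
all_classes <- attr(fit_unbal, "ylevels")
predictable <- unique(all_classes[fit_unbal$frame$yval[is_leaf]])

# Classes that no terminal node predicts, whatever the feature values
never_predicted <- setdiff(all_classes, predictable)
never_predicted
```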

6.  Because of the instability of a single tree, build an ensemble of trees using the random forest approach and default tuning parameter settings. To proceed later with task 7, you must set the `importance` argument of the learner equal to "permutation". (Hint: Set the seed to ensure reproducibility of your results)

```{r}
set.seed(42)
mdl = lrn("classif.ranger", importance = 'permutation')
mdl$train(tsk)
mdl$model
```
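
The idea behind the ensemble can be sketched by hand. The following code implements plain bagging with `rpart` on the built-in `iris` data: each tree is fit to a bootstrap sample and the ensemble predicts by majority vote. This is a simplification and not what `ranger` does internally; a random forest additionally considers only a random subset of `mtry` features at each split. The number of trees (25) and the object names are arbitrary choices for this illustration.

```{r}
library(rpart)

set.seed(42)

n_trees_demo <- 25
n_obs <- nrow(iris)

# Fit each tree on a bootstrap sample (rows drawn with replacement)
trees_demo <- lapply(seq_len(n_trees_demo), function(b) {
  idx <- sample(n_obs, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# One column of predicted class labels per tree
votes <- sapply(trees_demo, function(tr) {
  as.character(predict(tr, iris, type = "class"))
})

# Ensemble prediction: the most frequent label per observation
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Error on the training data (optimistic, shown only to complete the example)
mean(bagged != iris$Species)
```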

7.  Plot the feature importance of all features used in your random forest from task 6.

```{r}
barplot(mdl$importance(), horiz = TRUE, las = 2)
```

8.  Build a random forest and tune the hyperparameters `num.trees` from 500 to 1500 in steps of 500 and `mtry` from 2 to 5 in steps of 1. To proceed later with task 9, you must again set the `importance` argument of the learner equal to "permutation". (Hint: Set the seed to ensure reproducibility of your results)

```{r}
set.seed(42)

mtry_cv <- seq(2, 5)
num.trees_cv <- c(500, 1000, 1500)

mdl_cv = auto_tuner(
  learner = lrn("classif.ranger", importance = 'permutation',
                mtry = to_tune(levels = mtry_cv), 
                num.trees = to_tune(levels = num.trees_cv)),
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.ce"),
  tuner = tnr("grid_search"),
  terminator = trm("none")
)


invisible({capture.output({ #remove console output from html document
  mdl_cv$train(tsk)
})})

mdl_cv$archive %>% 
  as.data.table() %>% 
  select(mtry, num.trees, classif.ce) %>% 
  arrange(as.numeric(mtry), as.numeric(num.trees))

mdl_cv$tuning_result

mdl_cv$learner$model
```
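
To get a sense of the cost of this grid search, the number of model fits can be counted directly. The sketch assumes, as in the chunk above, a full grid over both hyperparameters and 5-fold cross-validation; the object names are hypothetical.

```{r}
# All combinations of the candidate values
grid_demo <- expand.grid(mtry = 2:5, num.trees = c(500, 1000, 1500))
n_configs <- nrow(grid_demo)    # 4 * 3 = 12 configurations

# Each configuration is fit once per fold
n_folds <- 5
n_fits <- n_configs * n_folds   # 60 fits, plus one final fit on all data

c(configurations = n_configs, fits = n_fits)
```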

9.  Plot the feature importance of the tuned random forest and compare the ranking to the feature importance plot of the random forest that was fit with default tuning parameter settings in task 6. Are there substantial differences between the two plots?

```{r}
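# Show both importance plots side by side, labelled by model
par(mfrow = c(1, 2))
barplot(mdl$importance(), horiz = TRUE, las = 2, main = "Default forest")
barplot(mdl_cv$importance(), horiz = TRUE, las = 2, main = "Tuned forest")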
```

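There are rather substantial differences between the feature importance plots in terms of relative rankings of the features. For reference, the rendered notebook reports an out-of-bag (OOB) prediction error of 61.00 % for the default forest (500 trees, `mtry = 2`) and 64.00 % for the tuned forest (1000 trees, `mtry = 2`), each fit on 100 observations with 7 features.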

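Note: Feature importance scores can be calculated in different ways, for example as the mean decrease in node impurity (e.g., Gini impurity) or, as in this tutorial (`importance = "permutation"`), as the drop in out-of-bag prediction accuracy after randomly permuting a feature's values. In either case, the scores provide a relative measure of the importance of each feature within a model. Comparing the absolute values of feature importance scores across different models is thus not very informative.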
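
Permutation importance, as requested via `importance = "permutation"`, can be illustrated by hand: shuffle one feature at a time and record how much the classification error increases. The sketch below does this for a single `rpart` tree on the built-in `iris` data and, for brevity, on the training data. `ranger` instead works with the out-of-bag observations of each tree, so the numbers are not comparable to the plots above; the object names are hypothetical.

```{r}
library(rpart)

set.seed(42)

fit_imp <- rpart(Species ~ ., data = iris, method = "class")

# Classification error with the original, unshuffled features
base_error <- mean(predict(fit_imp, iris, type = "class") != iris$Species)

feature_names <- setdiff(names(iris), "Species")

# Increase in error after shuffling one feature at a time
perm_imp <- sapply(feature_names, function(f) {
  shuffled <- iris
  shuffled[[f]] <- sample(shuffled[[f]])
  mean(predict(fit_imp, shuffled, type = "class") != shuffled$Species) - base_error
})

# Larger values mean the model relies more heavily on that feature
sort(perm_imp, decreasing = TRUE)
```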
