In this tutorial, we will use the dataset from Wulff et al. (2023),
which is part of the mousetrap package. The dataset, as
prepared in the following code chunk, contains the mouse movement
trajectories of participants in a two-alternative forced-choice paradigm.
The trajectories are normalized using the
mt_length_normalize() function from the
mousetrap package so that all trajectories consist of 50
points (the default is 20) in a 2D space.
library(mousetrap)
## Warning: package 'mousetrap' was built under R version 4.4.2
## Welcome to mousetrap 3.2.3!
## Summary of recent changes: http://pascalkieslich.github.io/mousetrap/news/
## Forum for questions: https://forum.cogsci.nl/index.php?p=/categories/mousetrap
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the KH2017 dataset shipped with mousetrap
data(KH2017)
# Preprocess trajectory data: length-normalize each trajectory to 50 points
dat <- KH2017 %>% mt_length_normalize(n_points = 50)
dat <- dat$ln_trajectories
dat[1:5, 1:5, 'xpos']; dat[1:5, 1:5, 'ypos'] # examples of x- and y-coordinates
## [,1] [,2] [,3] [,4] [,5]
## id0001 0 -18.06069 -38.967198 -57.753756 -76.540313
## id0002 0 -15.60052 -32.227659 -48.262305 -64.296952
## id0003 0 -10.02617 -17.001541 -21.124366 -23.588654
## id0004 0 -20.06305 -36.929651 -52.368759 -65.752596
## id0005 0 0.00000 1.080633 3.535698 5.587012
## [,1] [,2] [,3] [,4] [,5]
## id0001 0 7.695611 13.135988 23.98456 34.83314
## id0002 0 11.244849 24.194179 37.87079 51.54740
## id0003 0 16.565420 39.008474 62.18148 85.59222
## id0004 0 -2.531525 6.049725 20.71095 37.44075
## id0005 0 40.034589 80.048229 119.99954 159.97921
dat2 <- data.frame(cbind(dat[,,'xpos'], dat[,,'ypos']))
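To make this flattening step concrete, here is a minimal base-R sketch using a tiny synthetic array (not the KH2017 data) in the same trajectories × points × dimensions layout:

```r
# Tiny synthetic array: 3 trajectories, 4 time points, 2 dimensions
arr <- array(1:24, dim = c(3, 4, 2),
             dimnames = list(NULL, NULL, c('xpos', 'ypos')))

# Same flattening as above: x-coordinates first, then y-coordinates,
# yielding one row of 2 * 4 = 8 features per trajectory
wide <- data.frame(cbind(arr[,,'xpos'], arr[,,'ypos']))
dim(wide)  # 3 rows, 8 columns
```

Each trajectory thus becomes a single observation with its x- and y-coordinates side by side, which is the wide format the clustering learners below expect.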
We can use the mt_heatmap() function from the
mousetrap package to visualize the trajectories. The
resulting plot contains the 1064 mouse movement trajectories of the
participants. In this tutorial, we want to cluster these trajectories to
make sense of this kind of data (i.e., shed light on the processes of
information integration and preference formation; Wulff et al.,
2023).
mt_heatmap(dat, colors = c('white', 'black'), verbose = FALSE)
Next, we cluster dat2 (i.e., treating the
x- and y-coordinates as features) into five clusters by means of
agglomerative hierarchical clustering, using the agnes
algorithm via mlr3.
k = 5
library(mlr3verse)
## Loading required package: mlr3
tsk = as_task_clust(dat2)
set.seed(42)
mdl = lrn("clust.agnes", stand = TRUE, k = k)
mdl$train(tsk)
pred <- mdl$predict(tsk)
## Warning: Learner 'clust.agnes' doesn't predict on new data and predictions may
## not make sense on new data.
round(table(pred$partition)/length(pred$partition) * 100, 2)
##
## 1 2 3 4 5
## 95.68 0.38 3.10 0.75 0.09
Now change the linkage method (i.e., set the
method argument to "ward"). Also predict the
clusters using the new model. Did the relative frequencies improve (in
terms of a more balanced clustering)?
# Clustering
set.seed(42)
mdl = lrn("clust.agnes", stand = TRUE, k = k, method = "ward")
mdl$train(tsk)
# Predictions
pred <- mdl$predict(tsk)
## Warning: Learner 'clust.agnes' doesn't predict on new data and predictions may
## not make sense on new data.
round(table(pred$partition)/length(pred$partition) * 100, 2)
##
## 1 2 3 4 5
## 26.88 21.33 30.26 17.95 3.57
The resulting clustering is more balanced and more closely resembles the frequency proportions reported in Wulff et al. (2023, Figure 8; see also the slides for module 3). However, to verify that our clustering distinguishes between the same movement patterns, we also have to visualize the clusters (see next task).
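Beyond eyeballing the frequency tables, one simple way to quantify balance is the normalized Shannon entropy of the cluster proportions (1 = all clusters equally large, values near 0 = one dominant cluster). The helper below is our own illustration, not part of mousetrap or mlr3; the cluster sizes are reconstructed from the two percentage tables above:

```r
# Normalized Shannon entropy of a partition's cluster proportions:
# 1 = perfectly balanced, values near 0 = one dominant cluster
partition_balance <- function(partition) {
  p <- as.numeric(table(partition)) / length(partition)
  -sum(p * log(p)) / log(length(p))
}

# Cluster sizes roughly matching the two frequency tables above (n = 1064)
default_link <- rep(1:5, times = c(1018, 4, 33, 8, 1))  # ~95.7/0.4/3.1/0.8/0.1 %
ward_link    <- rep(1:5, times = c(286, 227, 322, 191, 38))
partition_balance(default_link)  # low: one cluster dominates
partition_balance(ward_link)     # much closer to 1
```

By this measure, the Ward-linkage solution is clearly the more balanced of the two.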
Use a for-loop and the mt_heatmap()
function from above to produce a separate heatmap for each cluster of
movement trajectories.
par(mfrow = c(1,5))
for (j in 1:k) {
cat(mt_heatmap(dat[pred$partition == j,,], colors = c('white', 'black'), verbose = FALSE))
}
It appears that our clustering distinguishes similar choice strategies as found in Wulff et al. (2023, Figure 8). However, not all five prototypes (straight, curved, cCoM = continuous change of mind, dCoM = discrete change of mind, and dCoM2 = discrete double change of mind) are unambiguously recovered.
Now repeat the exercise using partitional \(k\)-means clustering (the clust.kmeans learner with centers = k) and again inspect the relative cluster frequencies and the per-cluster heatmaps.
# Clustering
set.seed(42)
mdl = lrn("clust.kmeans", centers = k)
mdl$train(tsk)
# Predictions
pred <- mdl$predict(tsk)
round(table(pred$partition)/length(pred$partition) * 100, 2)
##
## 1 2 3 4 5
## 4.04 39.38 12.88 10.81 32.89
par(mfrow = c(1,5))
for (j in 1:k) {
cat(mt_heatmap(dat[pred$partition == j,,], colors = c('white', 'black'), verbose = FALSE))
}
Partitional clustering seems to perform slightly better in discriminating between all five prototype strategies as originally reported in Wulff et al. (2023; i.e., straight, curved, cCoM = continuous change of mind, dCoM = discrete change of mind, and dCoM2 = discrete double change of mind). This can also be confirmed by plotting the prototypes of each cluster (see next task).
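To compare the agnes and \(k\)-means partitions beyond visual inspection, one option is the adjusted Rand index (ARI), which measures chance-corrected agreement between two partitions (1 = identical up to relabeling, around 0 = chance level). The same quantity is available, e.g., via mclust::adjustedRandIndex; below is a self-contained base-R sketch, where the two partition vectors from above would be plugged in as arguments:

```r
# Adjusted Rand index: chance-corrected agreement between two partitions
# of the same observations (1 = identical up to label permutation)
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) x * (x - 1) / 2   # number of pairs, "x choose 2"
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(sum(tab))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Identical partitions up to relabeling give 1
adjusted_rand(c(1, 1, 2, 2, 3), c(2, 2, 3, 3, 1))  # 1
```

A high ARI between the two solutions would indicate that both methods recover largely the same grouping of trajectories.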
# Auxiliary array to pass each prototype to mt_heatmap() in the
# (trajectories x points x dimensions) format it expects
df <- array(NA, c(2, 50, 2), dimnames = list(NULL, NULL, c('xpos', 'ypos')))
par(mfrow = c(1,5))
for (j in 1:k) {
df[1,,'xpos'] <- mdl$model$centers[j,paste0('X',1:50)]
df[1,,'ypos'] <- mdl$model$centers[j,paste0('X',51:100)]
mt_heatmap(df, colors = c('white', 'blue'), verbose = FALSE)
}
The prototypes are indeed very similar to the original prototypes reported in Wulff et al. (2023).
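Before building the PCA pipeline in the next task, it helps to see why a handful of PCs can suffice: neighboring trajectory points are highly correlated, so most variance concentrates in the first few components. A base-R sketch with prcomp() on synthetic trajectory-like data (purely illustrative, not the KH2017 data):

```r
set.seed(1)
tp <- seq(0, 1, length.out = 50)  # 50 normalized time points
# 200 synthetic "trajectories": two latent movement patterns plus noise,
# so the 100 features (50 x + 50 y) are driven by few underlying dimensions
X <- t(sapply(1:200, function(i) {
  a <- rnorm(1); b <- rnorm(1)
  c(a * tp + b * sin(pi * tp),      # x-coordinates
    a * tp^2 + b * cos(pi * tp)) +  # y-coordinates
    rnorm(100, sd = 0.05)
}))
pca <- prcomp(X)
var_explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
which(var_explained > 0.9)[1]  # only a few PCs cover 90% of the variance
```

With real trajectory data the structure is less clean, but the same logic motivates filtering the pipeline down to the highest-variance PCs.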
Now add a filter pipeline operation after the
pca pipeline operation. Filter for "variance" using the
flt() function and set a corresponding fraction, so that the
clustering only uses the PCs that explain the highest amount of
variance in the data.
Agnes clustering:
# Clustering
set.seed(42)
mdl <- as_learner(po("pca") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.05) %>>%
  lrn('clust.agnes', stand = TRUE, k = k, method = "ward"))
mdl$train(tsk)
# Which PCs were actually used as features?
mdl$model$variance$features
## [1] "PC1" "PC2" "PC3" "PC4" "PC5"
# Predictions
pred <- mdl$predict(tsk)
## Warning: Learner 'clust.agnes' doesn't predict on new data and predictions may not make sense on new data.
## This happened PipeOp clust.agnes's $predict()
round(table(pred$partition)/length(pred$partition) * 100, 2)
##
## 1 2 3 4 5
## 54.51 24.44 15.04 5.26 0.75
# Plotting
par(mfrow = c(1,5))
for (j in 1:k) {
cat(mt_heatmap(dat[pred$partition == j,,], colors = c('white', 'black'), verbose = FALSE))
}
\(k\)-means clustering:
# Clustering
set.seed(42)
mdl <- as_learner(po("pca") %>>%
  po("filter", filter = flt("variance"), filter.frac = 0.05) %>>%
  lrn('clust.kmeans', centers = k))
mdl$train(tsk)
# Which PCs were actually used as features?
mdl$model$variance$features
## [1] "PC1" "PC2" "PC3" "PC4" "PC5"
# Predictions
pred <- mdl$predict(tsk)
round(table(pred$partition)/length(pred$partition) * 100, 2)
##
## 1 2 3 4 5
## 39.29 32.99 10.81 12.88 4.04
# Plotting
par(mfrow = c(1,5))
for (j in 1:k) {
cat(mt_heatmap(dat[pred$partition == j,,], colors = c('white', 'black'), verbose = FALSE))
}
Hierarchical clustering performs worse with PCA compared to using the information contained in the original/full dataset. In contrast, for partitional \(k\)-means clustering, it seems that all the necessary information to discriminate between the five original prototype strategies is contained in the first five PCs. In other words, the partitional clustering works equally well with reduced information.