In this tutorial, we will use the dataset from Wulff et al. (2023),
which is part of the mousetrap package. The dataset, as
prepared in the following code chunk, contains the mouse movement
trajectories of participants in a two-alternative forced-choice paradigm.
The trajectories are normalized using the
mt_length_normalize() function from the
mousetrap package so that all trajectories consist of 50
points (default is 20) in a 2D space.
library(mousetrap)
## Warning: package 'mousetrap' was built under R version 4.4.2
## Welcome to mousetrap 3.2.3!
## Summary of recent changes: http://pascalkieslich.github.io/mousetrap/news/
## Forum for questions: https://forum.cogsci.nl/index.php?p=/categories/mousetrap
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(KH2017)
# Preprocess trajectory data
dat <- KH2017 %>% mt_length_normalize(n_points = 50)
dat <- dat$ln_trajectories
dat[1:5, 1:5, 'xpos']; dat[1:5, 1:5, 'ypos'] # examples
## [,1] [,2] [,3] [,4] [,5]
## id0001 0 -18.06069 -38.967198 -57.753756 -76.540313
## id0002 0 -15.60052 -32.227659 -48.262305 -64.296952
## id0003 0 -10.02617 -17.001541 -21.124366 -23.588654
## id0004 0 -20.06305 -36.929651 -52.368759 -65.752596
## id0005 0 0.00000 1.080633 3.535698 5.587012
## [,1] [,2] [,3] [,4] [,5]
## id0001 0 7.695611 13.135988 23.98456 34.83314
## id0002 0 11.244849 24.194179 37.87079 51.54740
## id0003 0 16.565420 39.008474 62.18148 85.59222
## id0004 0 -2.531525 6.049725 20.71095 37.44075
## id0005 0 40.034589 80.048229 119.99954 159.97921
dat2 <- data.frame(cbind(dat[,,'xpos'], dat[,,'ypos']))
We can use the mt_heatmap() function from the
mousetrap package to visualize the trajectories. The
resulting plot contains the 1064 mouse movement trajectories of the
participants. In this tutorial, we want to cluster these trajectories to
make sense of this kind of data (i.e., shed light on the processes of
information integration and preference formation; Wulff et al.,
2023).
mt_heatmap(dat, colors = c('white', 'black'), verbose = FALSE)
Cluster the trajectories from dat2 (i.e., treating
the x- and y-coordinates as features) into 5 clusters by means of
agglomerative hierarchical clustering using the agnes
algorithm via mlr3.
Predict the cluster of each individual trajectory using the model from task 1 and use the predictions to calculate a frequency table for the relative proportions of instances in each cluster.
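Assuming a trained learner as in task 1 (the object names are ours), the predictions and relative frequencies could be obtained like this:

```r
# Assign each trajectory to a cluster
pred <- learner$predict(task)

# Relative proportion of trajectories in each cluster
prop.table(table(pred$partition))
```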
Redo task 1 and cluster the trajectories into 5 clusters using
agglomerative hierarchical clustering and Ward’s method (i.e., set the
learner’s method argument to “ward”). Also predict the
clusters using the new model. Did the relative frequencies improve (in
terms of a more balanced clustering)?
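A sketch of the Ward variant (again assuming the task object from the previous tasks):

```r
# Same learner, but with Ward's method for merging clusters
learner_ward <- lrn("clust.agnes", k = 5, method = "ward")
learner_ward$train(task)

pred_ward <- learner_ward$predict(task)
prop.table(table(pred_ward$partition))
```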
Plot the trajectories according to the clustering from task 3.
For instance, use a for-loop and the
mt_heatmap() function from above to produce a separate
heatmap for each cluster of movement trajectories.
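The loop could be sketched as follows (assuming pred_ward holds the predictions from the Ward clustering; dat is the trajectory array from above):

```r
for (i in 1:5) {
  idx <- pred_ward$partition == i
  # drop = FALSE keeps the 3D array structure even for small clusters
  mt_heatmap(dat[idx, , , drop = FALSE],
             colors = c("white", "black"), verbose = FALSE)
}
```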
Redo task 1 and cluster the trajectories into 5 clusters using partitional (i.e., \(k\)-means) clustering. Then, redo task 2 and predict the clusters using the new model.
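For the partitional clustering, mlr3cluster wraps stats::kmeans; a sketch (object names are ours):

```r
learner_km <- lrn("clust.kmeans", centers = 5)
learner_km$train(task)

pred_km <- learner_km$predict(task)
prop.table(table(pred_km$partition))
```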
Compare the results from task 5 to the results with the hierarchical clustering by redoing task 4, that is, visualize the trajectories of the partitional clustering. Do you spot any major (performance) differences?
Bonus: Also plot the prototypes of your \(k\)-means clustering solution.
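Since the first 50 columns of dat2 hold the x- and the last 50 the y-coordinates, the \(k\)-means prototypes can be recovered from the fitted model's centers (a sketch, assuming learner_km from the previous task):

```r
# Cluster centers: one row per prototype, 100 columns (50 x, then 50 y)
centers <- learner_km$model$centers

# Plot each prototype trajectory as a line
matplot(t(centers[, 1:50]), t(centers[, 51:100]),
        type = "l", lty = 1, xlab = "xpos", ylab = "ypos")
```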
Bonus: Prior to performing the clustering, do a principal
component analysis (PCA). Then, redo task 6 (i.e., compare agglomerative
hierarchical clustering using Ward’s method and \(k\)-means partitional clustering), with both
clusterings specified using only the 5 principal components (PCs)
that explain the highest amount of variance in the data as features. (Hint:
You can select a subset of PCs that should be used for the modeling by
adding a filter pipeline operation after the
pca pipeline operation. Filter for “variance” using the
flt() function and set a corresponding fraction so that the
clustering uses the PCs that explain the highest amount of
variance in the data.)
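A sketch of such a pipeline using mlr3pipelines and mlr3filters (assuming these packages are installed; filter.frac is set so that 5 of the 100 PCs are kept):

```r
library(mlr3pipelines)
library(mlr3filters)

# PCA, then keep the highest-variance PCs, then cluster
graph <- po("pca") %>>%
  po("filter", filter = flt("variance"), filter.frac = 5 / ncol(dat2)) %>>%
  lrn("clust.agnes", k = 5, method = "ward")

glearner <- as_learner(graph)
glearner$train(task)
```

Replacing the final learner with lrn("clust.kmeans", centers = 5) gives the corresponding partitional pipeline.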
Agnes clustering:
\(k\)-means clustering: