Clustering

Description of data set

In this tutorial, we will use the dataset from Wulff et al. (2023), which is part of the mousetrap package. The dataset, as prepared in the following code chunk, contains the mouse movement trajectories of participants in a two-option forced-choice paradigm. The trajectories are normalized using the mt_length_normalize() function from the mousetrap package so that all trajectories consist of 50 points (default is 20) in a 2D space.

library(mousetrap)
## Warning: package 'mousetrap' was built under R version 4.4.2
## Welcome to mousetrap 3.2.3!
## Summary of recent changes: http://pascalkieslich.github.io/mousetrap/news/
## Forum for questions: https://forum.cogsci.nl/index.php?p=/categories/mousetrap
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(KH2017)

# Preprocess trajectory data
dat <- KH2017 %>% mt_length_normalize(n_points = 50)
dat <- dat$ln_trajectories
dat[1:5,1:5,'xpos']; dat[1:5,1:5,'ypos'] # examples
##        [,1]      [,2]       [,3]       [,4]       [,5]
## id0001    0 -18.06069 -38.967198 -57.753756 -76.540313
## id0002    0 -15.60052 -32.227659 -48.262305 -64.296952
## id0003    0 -10.02617 -17.001541 -21.124366 -23.588654
## id0004    0 -20.06305 -36.929651 -52.368759 -65.752596
## id0005    0   0.00000   1.080633   3.535698   5.587012
##        [,1]      [,2]      [,3]      [,4]      [,5]
## id0001    0  7.695611 13.135988  23.98456  34.83314
## id0002    0 11.244849 24.194179  37.87079  51.54740
## id0003    0 16.565420 39.008474  62.18148  85.59222
## id0004    0 -2.531525  6.049725  20.71095  37.44075
## id0005    0 40.034589 80.048229 119.99954 159.97921
dat2 <- data.frame(cbind(dat[,,'xpos'], dat[,,'ypos']))

We can use the mt_heatmap() function from the mousetrap package to visualize the trajectories. The resulting plot contains the 1064 mouse movement trajectories of the participants. In this tutorial, we want to cluster these trajectories to make sense of this kind of data (i.e., shed light on the processes of information integration and preference formation; Wulff et al., 2023).

mt_heatmap(dat, colors = c('white', 'black'), verbose = FALSE)

Tasks

  1. Cluster the trajectories from dat2 (i.e., treating the x- and y-coordinates as features) into 5 clusters by means of agglomerative hierarchical clustering using the agnes algorithm via mlr3.

  2. Predict the cluster of each individual trajectory using the model from task 1 and use the predictions to calculate a frequency table for the relative proportions of instances in each cluster.

  3. Redo task 1 and cluster the trajectories into 5 clusters using agglomerative hierarchical clustering and Ward’s method (i.e., set the learner’s method argument to “ward”). Also predict the clusters using the new model. Did the relative frequencies improve (in terms of a more balanced clustering)?

  4. Plot the trajectories according to the clustering from task 3. For instance, use a for-loop and the mt_heatmap() function from above to produce a separate heatmap for each cluster of movement trajectories.

  5. Redo task 1 and cluster the trajectories into 5 clusters using partitional (i.e., \(k\)-means) clustering. Then, redo task 2 and predict the clusters using the new model.

  6. Compare the results from task 5 to the results with the hierarchical clustering by redoing task 4, that is, visualize the trajectories of the partitional clustering. Do you spot any major (performance) differences?

  7. Bonus: Also plot the prototypes of your \(k\)-means clustering solution.

  8. Bonus: Prior to performing the clustering, do a principal component analysis (PCA). Then, redo task 6 (i.e., compare agglomerative hierarchical clustering using Ward’s method and \(k\)-means partitional clustering) with both clusterings specified using only the 5 principal components (PCs) that explain the highest amount of variance in the data as features. (Hint: You can select the subset of PCs to be used for the modeling by adding a filter pipeline operation after the pca pipeline operation. Filter for “variance” using the flt() function and set a corresponding fraction so that only the PCs explaining the highest amount of variance are kept for the clustering.)
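
As a starting point for tasks 1 to 3, the following is a minimal sketch and not an official solution. It assumes that the mlr3 and mlr3cluster packages are installed, rebuilds dat2 as in the chunk above (under the name traj for the trajectory array), and uses object names such as agnes_avg and prop_ward that are chosen here purely for illustration.

```r
library(mousetrap)
library(mlr3)
library(mlr3cluster)

# Rebuild the feature table from the tutorial:
# 50 x-coordinates followed by 50 y-coordinates per trajectory
data(KH2017, package = "mousetrap")
traj <- mt_length_normalize(KH2017, n_points = 50)$ln_trajectories
dat2 <- data.frame(cbind(traj[, , "xpos"], traj[, , "ypos"]))

# Clustering task: every column of dat2 is used as a feature
task <- as_task_clust(dat2)

# Task 1: agnes with its default linkage, cut into 5 clusters
agnes_avg <- lrn("clust.agnes", k = 5)
agnes_avg$train(task)

# Task 2: cluster assignments and their relative frequencies
pred_avg <- agnes_avg$predict(task)
prop_avg <- prop.table(table(pred_avg$partition))

# Task 3: same setup, but with Ward's method
agnes_ward <- lrn("clust.agnes", k = 5, method = "ward")
agnes_ward$train(task)
pred_ward <- agnes_ward$predict(task)
prop_ward <- prop.table(table(pred_ward$partition))

# Compare how evenly the trajectories are spread over the clusters
round(prop_avg, 3)
round(prop_ward, 3)
```

Hierarchical learners do not predict genuinely new data, so mlr3cluster may print a warning when predict() is called; here the prediction is only used to read off the assignments of the training trajectories.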

Agnes clustering:
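
One possible way to produce a separate heatmap for each agnes cluster (task 4) is sketched below. It is an illustration under assumptions rather than the original solution: it assumes mlr3 and mlr3cluster are installed, refits the Ward model so that the block runs on its own, and uses the illustrative names traj, clusters, and subset_k.

```r
library(mousetrap)
library(mlr3)
library(mlr3cluster)

# Rebuild the trajectory array and the feature table from the tutorial
data(KH2017, package = "mousetrap")
traj <- mt_length_normalize(KH2017, n_points = 50)$ln_trajectories
dat2 <- data.frame(cbind(traj[, , "xpos"], traj[, , "ypos"]))
task <- as_task_clust(dat2)

# Agglomerative clustering with Ward's method, cut into 5 clusters
learner <- lrn("clust.agnes", k = 5, method = "ward")
learner$train(task)
clusters <- learner$predict(task)$partition

# One heatmap per cluster; the rows of traj and dat2 are in the same order,
# so the cluster labels can be used to subset the trajectory array directly
for (k in sort(unique(clusters))) {
  subset_k <- traj[clusters == k, , , drop = FALSE]
  mt_heatmap(subset_k, colors = c("white", "black"), verbose = FALSE)
}
```

Each heatmap is drawn from the subset it receives, so the plots need not share the same axis range; keep this in mind when comparing clusters visually.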

\(k\)-means clustering:
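
For the \(k\)-means tasks (5, 7, and the PCA variant in task 8), a rough sketch could look as follows. It assumes that mlr3, mlr3cluster, mlr3pipelines, and mlr3filters are installed; the seed, the use of filter.nfeat = 5 instead of a fraction, and all object names (km, xs, ys, km_pca) are choices made here for illustration, not part of the original tutorial.

```r
library(mousetrap)
library(mlr3)
library(mlr3cluster)
library(mlr3pipelines)
library(mlr3filters)

# Rebuild the feature table from the tutorial (columns X1..X50 hold the
# x-coordinates, X51..X100 the y-coordinates)
data(KH2017, package = "mousetrap")
traj <- mt_length_normalize(KH2017, n_points = 50)$ln_trajectories
dat2 <- data.frame(cbind(traj[, , "xpos"], traj[, , "ypos"]))
task <- as_task_clust(dat2)

# Task 5: k-means with 5 centers; the seed is an arbitrary choice because
# k-means starts from random centers
set.seed(1)
km <- lrn("clust.kmeans", centers = 5)
km$train(task)
prop_km <- prop.table(table(km$predict(task)$partition))
round(prop_km, 3)

# Task 7: each prototype (cluster center) is itself a 50-point trajectory
centers <- km$model$centers
xs <- centers[, paste0("X", 1:50), drop = FALSE]
ys <- centers[, paste0("X", 51:100), drop = FALSE]
plot(NULL, xlim = range(xs), ylim = range(ys), xlab = "xpos", ylab = "ypos")
for (i in seq_len(nrow(centers))) {
  lines(xs[i, ], ys[i, ], col = i, lwd = 2)
}

# Task 8 (k-means part): PCA, keep the 5 PCs with the highest variance,
# then cluster on those PCs only
graph <- po("pca") %>>%
  po("filter", filter = flt("variance"), filter.nfeat = 5) %>>%
  po("learner", lrn("clust.kmeans", centers = 5))
km_pca <- as_learner(graph)
km_pca$train(task)
prop_km_pca <- prop.table(table(km_pca$predict(task)$partition))
round(prop_km_pca, 3)
```

With 100 features, filter.nfeat = 5 corresponds to a fraction of 0.05 if the fraction-based setting from the hint is preferred. Swapping the last pipeline step for lrn("clust.agnes", k = 5, method = "ward") should give the hierarchical counterpart for the comparison.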