In this study, participants were aurally presented with character strings at varying volumes and were asked to identify the string.
The file module1-auditory_strings.csv contains the following data:
stimulus: character string that was aurally presented
condition: the volume at which it was presented (1: very quiet to 100: very loud)
response_correct: whether the response given by the participant was correct or incorrect
response_time: response time in seconds
dat <- read.csv("module1-auditory_strings.csv")
head(dat)
Create three new variables in dat:
“volume” that contains the volume from “condition” as a numeric vector (e.g., 63 for the “condition” value volume_63) (Hint: You can use the function str_split_fixed() from the stringr package)
“stimulus_length” that contains the length of the “stimulus” variable (Hint: You can use the function str_length() from the stringr package)
“response_correct” that contains the value 1 when the response was correct and 0 otherwise
library(stringr)
# extract the part after the underscore ("volume_63" -> "63") and make it numeric
dat$volume <- str_split_fixed(dat$condition, "_", 2)[, 2]
dat$volume <- as.numeric(dat$volume)
# number of characters of each stimulus string
dat$stimulus_length <- str_length(dat$stimulus)
# recode to 1 (correct) / 0 (otherwise)
dat$response_correct <- ifelse(dat$response_correct == 'correct', 1, 0)
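As a quick sanity check of the extraction step, the same split can be reproduced with base R on a couple of made-up “condition” values:

```r
# Base-R cross-check of the stringr extraction: strip everything up to and
# including the underscore, then convert to numeric (values are made up)
cond <- c("volume_63", "volume_7")
as.numeric(sub("^[^_]*_", "", cond))  # 63 7
```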
library(mlr3verse)
## Loading required package: mlr3
tsk = as_task_classif(response_correct ~ volume + stimulus_length + response_time, data = dat, positive = "1")
mdl = lrn("classif.log_reg", predict_type = "prob")
mdl$train(tsk)
summary(mdl$model)
##
## Call:
## stats::glm(formula = task$formula(), family = "binomial", data = data,
## model = FALSE)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.501333 1.414371 0.354 0.72300
## response_time 0.093562 0.464206 0.202 0.84027
## stimulus_length -0.122235 0.064224 -1.903 0.05701 .
## volume -0.008218 0.003155 -2.605 0.00919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 693.15 on 499 degrees of freedom
## Residual deviance: 682.31 on 496 degrees of freedom
## AIC: 690.31
##
## Number of Fisher Scoring iterations: 4
# estimated odds ratios
exp(coefficients(mdl$model))
## (Intercept) response_time stimulus_length volume
## 1.6509210 1.0980789 0.8849406 0.9918159
The exponentiated coefficients of a logistic model can be interpreted as odds ratios. Accordingly, negative coefficients on the original scale (as for volume and stimulus_length) indicate that the odds of observing a 1 for the target decrease with the respective feature (and vice versa for positive coefficients, such as response_time). That is, higher volume and stimulus_length decrease, whereas, descriptively, higher response_time increases the odds of a correct response here. Moreover, the intercept not being significantly different from zero means that the odds of a correct versus an incorrect response do not differ from 1 when all features are zero (although this is not very informative without, for instance, centering the features first).
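To make the volume effect more tangible, the coefficient can be scaled before exponentiating; a small sketch using the fitted value from the summary above:

```r
# Multiplicative change in the odds of a correct response for a 10-unit
# increase in volume (coefficient taken from the model summary above)
b_volume <- -0.008218
exp(b_volume * 10)  # ~0.921: the odds shrink by about 8% per +10 volume
```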
pred <- mdl$predict(tsk)
# predicted probability
dat$prob_correct_pred <- pred$prob[,"1"]
# exploratory plotting
library(ggplot2)
ggplot(dat, aes(x = volume, y = prob_correct_pred)) +
geom_point()
When we plot the predicted probabilities as a function of a feature, the fitted curve appears almost linear: the logistic model actually fits a linear function to the auxiliary target (the log-odds), and within the observed feature range the back-transformed probabilities are still close to the linear middle part of the sigmoid.
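That this near-linear appearance is only local can be sketched with the fitted coefficients from the summary above: over a much wider (purely hypothetical) volume range, the back-transformed curve is clearly sigmoid. The fixed values for response_time and stimulus_length below are illustrative choices, not taken from the data.

```r
# Predicted probability over a hypothetical, much wider volume range,
# using the coefficients from the summary above; other features held fixed
b0 <- 0.501333; b_rt <- 0.093562; b_len <- -0.122235; b_vol <- -0.008218
vol <- seq(-1000, 2000, by = 10)
p <- plogis(b0 + b_rt * 1 + b_len * 5 + b_vol * vol)  # response_time = 1, stimulus_length = 5
plot(vol, p, type = "l", xlab = "volume", ylab = "predicted P(correct)")
```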
Which cutoff is used by mlr3’s predict() method?
# predicted response (with cutoff of 0.5)
dat$response_correct_pred <- ifelse(dat$prob_correct_pred >= 0.5, 1, 0)
# comparison to mlr3's default prediction
sum(dat$response_correct_pred == pred$response) / nrow(dat)
## [1] 1
By default, mlr3’s predict() method uses a
cutoff of 0.5 to classify observations according to the predicted
probabilities of class membership.
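The cutoff is not fixed, though. With made-up probabilities, a stricter cutoff than the default 0.5 changes the predicted classes as follows (in mlr3, the same can be achieved on an existing prediction object via its set_threshold() method):

```r
# Made-up predicted probabilities; compare the default 0.5 cutoff with a
# stricter 0.7 cutoff (e.g., when false "correct" predictions are costly)
probs <- c(0.10, 0.45, 0.55, 0.80)
cbind(default = as.integer(probs >= 0.5),
      strict  = as.integer(probs >= 0.7))
```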
# contingency table
## mlr3verse:
pred$confusion
## truth
## response 1 0
## 1 138 111
## 0 112 139
## manual:
#cont_table <- with(dat, table(response_correct_pred, response_correct))
#cont_table
# prediction accuracy
## mlr3verse:
mes <- msrs(c("classif.acc"))
pred$score(measures = mes)
## classif.acc
## 0.554
## manual:
#acc <- (cont_table[1,1] + cont_table[2,2]) / sum(cont_table)
#acc
The prediction accuracy of our logistic regression model is only slightly above the chance level of 0.5.
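The reported accuracy can be reproduced directly from the confusion matrix shown above: the diagonal holds the correctly classified observations.

```r
# Accuracy recomputed from the confusion matrix printed above
conf <- matrix(c(138, 112, 111, 139), nrow = 2,
               dimnames = list(response = c("1", "0"), truth = c("1", "0")))
sum(diag(conf)) / sum(conf)  # (138 + 139) / 500 = 0.554
```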
dat_new <- data.frame('volume' = mean(dat$volume)
, 'stimulus_length' = 3
, 'response_time' = quantile(dat$response_time)[c('25%','75%')]
)
pred_new <- mdl$predict_newdata(newdata = dat_new)
#pred_old <- mdl$predict_newdata(newdata = dat)
dat_new$prob_correct_pred <- pred_new$prob[,"1"]
dat_new
For the first quartile of “response_time”, the predicted probability of a correct response lies below the cutoff of 0.5, so the model would predict an incorrect response; for the third quartile, it would predict a correct response. However, the difference is not very substantial because the coefficient of “response_time” is small (and, relatedly, not statistically significant).
task_lin = as_task_regr(response_correct ~ volume + stimulus_length + response_time, data = dat)
mdl_lin = lrn("regr.lm")
mdl_lin$train(task_lin)
summary(mdl_lin$model)
##
## Call:
## stats::lm(formula = task$formula(), data = task$data())
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65500 -0.48392 0.00504 0.49281 0.65963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6230940 0.3471279 1.795 0.07326 .
## response_time 0.0230195 0.1139398 0.202 0.83997
## stimulus_length -0.0299921 0.0157301 -1.907 0.05714 .
## volume -0.0020219 0.0007707 -2.623 0.00898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4966 on 496 degrees of freedom
## Multiple R-squared: 0.02149, Adjusted R-squared: 0.01558
## F-statistic: 3.632 on 3 and 496 DF, p-value: 0.01293
Note that the statistical significance of the individual features’ coefficients is comparable to the results of the logistic regression model from task 3.
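The resemblance extends to the coefficient sizes: by the common “divide by 4” rule of thumb, a logistic coefficient divided by 4 approximates the maximum change in probability per unit of the feature, which here lands close to the linear model’s estimates.

```r
# "Divide by 4" rule of thumb applied to the logistic coefficients above;
# compare with the linear model's estimates (0.0230, -0.0300, -0.0020)
logit_coefs <- c(response_time = 0.093562,
                 stimulus_length = -0.122235,
                 volume = -0.008218)
logit_coefs / 4
```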
dat_new2 <- data.frame('stimulus_length' = 50
, 'response_time' = 5*max(dat$response_time)
, 'volume' = mean(dat$volume)
)
# prediction of linear model for extreme data
mdl_lin$predict_newdata(newdata = dat_new2)$response
## [1] -0.5548653
# prediction of logistic model for extreme data
mdl$predict_newdata(newdata = dat_new2)$prob[,"1"]
## 1
## 0.01334644
The linear model is conceptually wrong for binary targets because its predictions are unbounded: it can produce “probabilities” below 0 or above 1. The logistic model, in contrast, is well-defined, as can be seen from the exemplary predictions of both models for the new (extreme) data set “dat_new2”.
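The boundedness argument can also be illustrated in isolation: for a sequence of hypothetical linear predictors (with made-up coefficients), the linear “probabilities” leave [0, 1], while plogis(), the inverse logit used by the logistic model, always stays inside (0, 1).

```r
# Hypothetical linear predictors; the linear model's "probabilities" are
# unbounded, whereas the logistic transformation keeps them in (0, 1)
eta <- seq(-5, 5, by = 1)
lin_p <- 0.5 + 0.2 * eta   # linear probability model (made-up coefficients)
log_p <- plogis(eta)       # logistic model
range(lin_p)  # -0.5 1.5
range(log_p)  # ~0.0067 ~0.9933
```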