In this study, participants were aurally presented with string combinations at varying volumes and were asked to identify the string.
The file module1-auditory_strings.csv contains the following variables:
- stimulus: the character string that was aurally presented
- condition: the volume at which it was presented (1: very quiet to 100: very loud)
- response_correct: whether the participant's response was correct or incorrect
- response_time: response time in seconds
dat <- read.csv("module1-auditory_strings.csv")
head(dat)
Create three new variables in dat:
- “volume”: the volume from “condition” as a numeric vector (e.g., 63 for the “condition” value volume_63; hint: you can use the function str_split_fixed() from the stringr package)
- “stimulus_length”: the length of the “stimulus” string (hint: you can use the function str_length() from the stringr package)
- “response_correct”: 1 when the response was correct and 0 otherwise
library(stringr)
# extract the numeric volume from "condition" (e.g., "volume_63" -> 63)
dat$volume <- str_split_fixed(dat$condition, "_", 2)[, 2]
dat$volume <- as.numeric(dat$volume)
# number of characters in the presented string
dat$stimulus_length <- str_length(dat$stimulus)
# recode response_correct to 1 (correct) / 0 (incorrect)
dat$response_correct <- ifelse(dat$response_correct == 'correct', 1, 0)
library(mlr3verse)
tsk = as_task_classif(response_correct ~ volume + stimulus_length + response_time, data = dat, positive = "1")
mdl = lrn("classif.log_reg", predict_type = "prob")
mdl$train(tsk)
summary(mdl$model)
Call:
stats::glm(formula = task$formula(), family = "binomial", data = data,
model = FALSE)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.501333 1.414371 0.354 0.72300
response_time 0.093562 0.464206 0.202 0.84027
stimulus_length -0.122235 0.064224 -1.903 0.05701 .
volume -0.008218 0.003155 -2.605 0.00919 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 693.15 on 499 degrees of freedom
Residual deviance: 682.31 on 496 degrees of freedom
AIC: 690.31
Number of Fisher Scoring iterations: 4
# estimated odds ratios
exp(coefficients(mdl$model))
(Intercept) response_time stimulus_length volume
1.6509210 1.0980789 0.8849406 0.9918159
The exponentiated coefficients of a logistic model can be interpreted as odds ratios. Accordingly, negative raw coefficients (as for volume and stimulus_length) indicate that the odds of observing a 1 for the target decrease as the respective feature increases (and vice versa for positive coefficients, e.g., response_time). That is, higher volume and stimulus_length decrease, whereas descriptively higher response_time increases, the odds of a correct response here. Moreover, the intercept not being significantly different from zero means that the odds of a correct versus an incorrect response do not differ significantly from 1 when all features are zero (although this is not very informative without, for instance, centering the features first).
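To make the odds-ratio reading concrete, here is a minimal, self-contained sketch (with illustrative coefficient values of the same magnitude as the fit above, not the fitted ones): raising a feature by one unit multiplies the odds by the exponentiated coefficient.

```r
# sketch: why exp(coef) is an odds ratio (illustrative coefficients, not the fitted ones)
b <- c(intercept = 0.50, volume = -0.0082)

# odds of "correct" as a function of volume (all other features ignored here)
odds <- function(volume) exp(b["intercept"] + b["volume"] * volume)

# increasing volume by one unit multiplies the odds by exp(b["volume"]):
all.equal(unname(odds(51) / odds(50)), unname(exp(b["volume"])))
# [1] TRUE
```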
pred <- mdl$predict(tsk)
# predicted probability
dat$prob_correct_pred <- pred$prob[,"1"]
# exploratory plotting
library(ggplot2)
ggplot(dat, aes(x = volume, y = prob_correct_pred)) +
  geom_point()
When we plot the predicted probabilities as a function of a feature, we see that the logistic model actually fitted a linear function to an auxiliary target: the model is linear on the log-odds (logit) scale, and the predicted probabilities are obtained by passing this linear predictor through the logistic function.
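This can be checked directly. The following sketch (assuming the objects mdl and dat from above) applies the logit function qlogis() to the predicted probabilities, which should recover the model's linear predictor.

```r
# sketch: the model is linear on the log-odds scale (assumes mdl and dat from above)
cf <- coefficients(mdl$model)
eta <- cf["(Intercept)"] +
  cf["response_time"] * dat$response_time +
  cf["stimulus_length"] * dat$stimulus_length +
  cf["volume"] * dat$volume

# qlogis() inverts the logistic function, so this should hold up to rounding
all.equal(unname(qlogis(dat$prob_correct_pred)), unname(eta))
```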
What cutoff does mlr3’s predict() method use?
# predicted response (with cutoff of 0.5)
dat$response_correct_pred <- ifelse(dat$prob_correct_pred >= 0.5, 1, 0)
# comparison to mlr3's default prediction
sum(dat$response_correct_pred == pred$response) / nrow(dat)
[1] 1
By default, mlr3’s predict() method uses a
cutoff of 0.5 to classify observations according to the predicted
probabilities of class membership.
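If a different cutoff is preferred, it can be set on the prediction object via set_threshold(); a sketch (assuming the prediction object pred from above) with a stricter cutoff:

```r
# sketch: reclassify with a stricter cutoff (assumes pred from above)
pred$set_threshold(0.6)  # class "1" now requires a predicted probability >= 0.6
pred$confusion           # confusion matrix under the new cutoff
pred$set_threshold(0.5)  # restore the default before further analyses
```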
# contingency table
## mlr3verse:
pred$confusion
truth
response 1 0
1 138 111
0 112 139
## manual:
#cont_table <- with(dat, table(response_correct_pred, response_correct))
#cont_table
# prediction accuracy
## mlr3verse:
mes <- msrs(c("classif.acc"))
pred$score(measures = mes)
classif.acc
0.554
## manual:
#acc <- (cont_table[1,1] + cont_table[2,2]) / sum(cont_table)
#acc
The prediction accuracy is only slightly above chance for our logistic regression model.
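mlr3 provides further measures beyond plain accuracy; for example, the following sketch (assuming pred from above) additionally reports sensitivity and specificity for the positive class "1".

```r
# sketch: additional measures for the same prediction (assumes pred from above)
pred$score(msrs(c("classif.acc", "classif.sensitivity", "classif.specificity")))
```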
dat_new <- data.frame('volume' = mean(dat$volume)
, 'stimulus_length' = 3
, 'response_time' = quantile(dat$response_time)[c('25%','75%')]
)
pred_new <- mdl$predict_newdata(newdata = dat_new)
#pred_old <- mdl$predict_newdata(newdata = dat)
dat_new$prob_correct_pred <- pred_new$prob[,"1"]
dat_new
Because the predicted probability of a correct response is below the cutoff for the first quartile of “response_time”, the model would predict an incorrect response there, whereas it would predict a correct response for the third quartile. Importantly, however, the difference is not very substantial, because the coefficient of “response_time” is not significantly different from zero.
task_lin = as_task_regr(response_correct ~ volume + stimulus_length + response_time, data = dat)
mdl_lin = lrn("regr.lm")
mdl_lin$train(task_lin)
summary(mdl_lin$model)
Call:
stats::lm(formula = task$formula(), data = task$data())
Residuals:
Min 1Q Median 3Q Max
-0.65500 -0.48392 0.00504 0.49281 0.65963
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.6230940 0.3471279 1.795 0.07326 .
response_time 0.0230195 0.1139398 0.202 0.83997
stimulus_length -0.0299921 0.0157301 -1.907 0.05714 .
volume -0.0020219 0.0007707 -2.623 0.00898 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4966 on 496 degrees of freedom
Multiple R-squared: 0.02149, Adjusted R-squared: 0.01558
F-statistic: 3.632 on 3 and 496 DF, p-value: 0.01293
Note that the significance of the individual features is comparable to the results of the significance testing in the logistic regression model from task 3.
dat_new2 <- data.frame('stimulus_length' = 50
, 'response_time' = 5*max(dat$response_time)
, 'volume' = mean(dat$volume)
)
# prediction of linear model for extreme data
mdl_lin$predict_newdata(newdata = dat_new2)$response
[1] -0.5548653
# prediction of logistic model for extreme data
mdl$predict_newdata(newdata = dat_new2)$prob[,"1"]
1
0.01334644
The linear model is conceptually wrong for binary targets because it can produce predicted "probabilities" below 0 or above 1, whereas the logistic model is bounded to the (0, 1) interval by construction; this can be seen, for instance, in the predictions of the two models for the new (extreme) data above.
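The boundedness argument can be illustrated without any fitted model: the logistic function plogis() maps every linear-predictor value into (0, 1), while the identity link of the linear model leaves predictions unbounded.

```r
# sketch: the logistic link bounds predictions, the identity link does not
eta <- c(-10, -1.5, 0, 1.5, 10)  # arbitrary linear-predictor values

plogis(eta)  # logistic model: all values strictly between 0 and 1
eta          # linear model: the (unbounded) prediction itself

all(plogis(eta) > 0 & plogis(eta) < 1)
# [1] TRUE
```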