In this study, participants were aurally presented with character strings at varying volumes and were asked to identify the string.
The file module1-auditory_strings.csv contains the following data:
stimulus: character string that was aurally presented
condition: the volume at which it was presented (1: very quiet to 100: very loud)
response_correct: whether the response given by the participant was correct or incorrect
response_time: response time in seconds
dat <- read.csv("module1-auditory_strings.csv")
head(dat)
Create three new variables in dat:
“volume” that contains the volume from “condition” as a numeric vector (e.g., 63 for the “condition” value volume_63) (Hint: You can use the function str_split_fixed() from the stringr package)
“stimulus_length” that contains the length of the “stimulus” variable (Hint: You can use the function str_length() from the stringr package)
“response_correct” that contains the value 1 when the response was correct and 0 otherwise
library(stringr)
# extract the part after the underscore ("volume_63" -> "63") and make it numeric
dat$volume <- str_split_fixed(dat$condition, "_", 2)[, 2]
dat$volume <- as.numeric(dat$volume)
# number of characters of each stimulus string
dat$stimulus_length <- str_length(dat$stimulus)
# recode to 1 (correct) / 0 (otherwise)
dat$response_correct <- ifelse(dat$response_correct == 'correct', 1, 0)
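As a quick sanity check of the extraction step, the same split can be reproduced with base R on a couple of made-up “condition” values:

```r
# Base-R cross-check of the stringr extraction: strip everything up to and
# including the underscore, then convert to numeric (values are made up)
cond <- c("volume_63", "volume_7")
as.numeric(sub("^[^_]*_", "", cond))  # 63 7
```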
library(mlr3verse)
## Loading required package: mlr3
tsk = as_task_classif(response_correct ~ volume + stimulus_length + response_time, data = dat, positive = "1")
mdl = lrn("classif.log_reg", predict_type = "prob")
mdl$train(tsk)
summary(mdl$model)
##
## Call:
## stats::glm(formula = task$formula(), family = "binomial", data = data,
## model = FALSE)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.501333 1.414371 0.354 0.72300
## response_time 0.093562 0.464206 0.202 0.84027
## stimulus_length -0.122235 0.064224 -1.903 0.05701 .
## volume -0.008218 0.003155 -2.605 0.00919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 693.15 on 499 degrees of freedom
## Residual deviance: 682.31 on 496 degrees of freedom
## AIC: 690.31
##
## Number of Fisher Scoring iterations: 4
# estimated odds ratios
exp(coefficients(mdl$model))
## (Intercept) response_time stimulus_length volume
## 1.6509210 1.0980789 0.8849406 0.9918159
The exponentiated coefficients of a logistic model can be interpreted as odds ratios. Accordingly, negative coefficients on the original scale (as for volume and stimulus_length) indicate that the odds of observing a 1 for the target decrease with the respective feature (and vice versa for positive coefficients, such as response_time). That is, higher volume and stimulus_length decrease, whereas, descriptively, higher response_time increases the odds of a correct response here. Moreover, the intercept not being significantly different from zero means that the odds of a correct versus an incorrect response do not differ from 1 when all features are zero (although this is not very informative without, for instance, centering the features first).
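To make the volume effect more tangible, the coefficient can be scaled before exponentiating; a small sketch using the fitted value from the summary above:

```r
# Multiplicative change in the odds of a correct response for a 10-unit
# increase in volume (coefficient taken from the model summary above)
b_volume <- -0.008218
exp(b_volume * 10)  # ~0.921: the odds shrink by about 8% per +10 volume
```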
pred <- mdl$predict(tsk)
# predicted probability
dat$prob_correct_pred <- pred$prob[,"1"]
# exploratory plotting
library(ggplot2)
ggplot(dat, aes(x = volume, y = prob_correct_pred)) +
geom_point()
When we plot the predicted probabilities as a function of a feature, the fitted curve appears almost linear: the logistic model actually fits a linear function to the auxiliary target (the log-odds), and within the observed feature range the back-transformed probabilities are still close to the linear middle part of the sigmoid.
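That this near-linear appearance is only local can be sketched with the fitted coefficients from the summary above: over a much wider (purely hypothetical) volume range, the back-transformed curve is clearly sigmoid. The fixed values for response_time and stimulus_length below are illustrative choices, not taken from the data.

```r
# Predicted probability over a hypothetical, much wider volume range,
# using the coefficients from the summary above; other features held fixed
b0 <- 0.501333; b_rt <- 0.093562; b_len <- -0.122235; b_vol <- -0.008218
vol <- seq(-1000, 2000, by = 10)
p <- plogis(b0 + b_rt * 1 + b_len * 5 + b_vol * vol)  # response_time = 1, stimulus_length = 5
plot(vol, p, type = "l", xlab = "volume", ylab = "predicted P(correct)")
```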
Which cutoff is used by mlr3’s predict() method?
# predicted response (with cutoff of 0.5)
dat$response_correct_pred <- ifelse(dat$prob_correct_pred >= 0.5, 1, 0)
# comparison to mlr3's default prediction
sum(dat$response_correct_pred == pred$response) / nrow(dat)
## [1] 1
By default, mlr3’s predict() method uses a
cutoff of 0.5 to classify observations according to the predicted
probabilities of class membership.
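The cutoff is not fixed, though. With made-up probabilities, a stricter cutoff than the default 0.5 changes the predicted classes as follows (in mlr3, the same can be achieved on an existing prediction object via its set_threshold() method):

```r
# Made-up predicted probabilities; compare the default 0.5 cutoff with a
# stricter 0.7 cutoff (e.g., when false "correct" predictions are costly)
probs <- c(0.10, 0.45, 0.55, 0.80)
cbind(default = as.integer(probs >= 0.5),
      strict  = as.integer(probs >= 0.7))
```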
# contingency table
## mlr3verse:
pred$confusion
## truth
## response 1 0
## 1 138 111
## 0 112 139
## manual:
#cont_table <- with(dat, table(response_correct_pred, response_correct))
#cont_table
# prediction accuracy
## mlr3verse:
mes <- msrs(c("classif.acc"))
pred$score(measures = mes)
## classif.acc
## 0.554
## manual:
#acc <- (cont_table[1,1] + cont_table[2,2]) / sum(cont_table)
#acc
The prediction accuracy of our logistic regression model is only slightly above the chance level of 0.5.
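The reported accuracy can be reproduced directly from the confusion matrix shown above: the diagonal holds the correctly classified observations.

```r
# Accuracy recomputed from the confusion matrix printed above
conf <- matrix(c(138, 112, 111, 139), nrow = 2,
               dimnames = list(response = c("1", "0"), truth = c("1", "0")))
sum(diag(conf)) / sum(conf)  # (138 + 139) / 500 = 0.554
```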
dat_new <- data.frame('volume' = mean(dat$volume)
, 'stimulus_length' = 3
, 'response_time' = quantile(dat$response_time)[c('25%','75%')]
)
pred_new <- mdl$predict_newdata(newdata = dat_new)
#pred_old <- mdl$predict_newdata(newdata = dat)
dat_new$prob_correct_pred <- pred_new$prob[,"1"]
dat_new
For the first quartile of “response_time”, the predicted probability of a correct response lies below the cutoff of 0.5, so the model would predict an incorrect response; for the third quartile, it would predict a correct response. However, the difference is not very substantial because the coefficient of “response_time” is small (and, relatedly, not statistically significant).
task_lin = as_task_regr(response_correct ~ volume + stimulus_length + response_time, data = dat)
mdl_lin = lrn("regr.lm")
mdl_lin$train(task_lin)
summary(mdl_lin$model)
##
## Call:
## stats::lm(formula = task$formula(), data = task$data())
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65500 -0.48392 0.00504 0.49281 0.65963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6230940 0.3471279 1.795 0.07326 .
## response_time 0.0230195 0.1139398 0.202 0.83997
## stimulus_length -0.0299921 0.0157301 -1.907 0.05714 .
## volume -0.0020219 0.0007707 -2.623 0.00898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4966 on 496 degrees of freedom
## Multiple R-squared: 0.02149, Adjusted R-squared: 0.01558
## F-statistic: 3.632 on 3 and 496 DF, p-value: 0.01293
Note that the statistical significance of the individual features’ coefficients is comparable to the results of the logistic regression model from task 3.
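The resemblance extends to the coefficient sizes: by the common “divide by 4” rule of thumb, a logistic coefficient divided by 4 approximates the maximum change in probability per unit of the feature, which here lands close to the linear model’s estimates.

```r
# "Divide by 4" rule of thumb applied to the logistic coefficients above;
# compare with the linear model's estimates (0.0230, -0.0300, -0.0020)
logit_coefs <- c(response_time = 0.093562,
                 stimulus_length = -0.122235,
                 volume = -0.008218)
logit_coefs / 4
```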
dat_new2 <- data.frame('stimulus_length' = 50
, 'response_time' = 5*max(dat$response_time)
, 'volume' = mean(dat$volume)
)
# prediction of linear model for extreme data
mdl_lin$predict_newdata(newdata = dat_new2)$response
## [1] -0.5548653
# prediction of logistic model for extreme data
mdl$predict_newdata(newdata = dat_new2)$prob[,"1"]
## 1
## 0.01334644
The linear model is conceptually wrong for binary targets because its predictions are unbounded: it can produce “probabilities” below 0 or above 1. The logistic model, in contrast, is well-defined, as can be seen from the exemplary predictions of both models for the new (extreme) data set “dat_new2”.
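The boundedness argument can also be illustrated in isolation: for a sequence of hypothetical linear predictors (with made-up coefficients), the linear “probabilities” leave [0, 1], while plogis(), the inverse logit used by the logistic model, always stays inside (0, 1).

```r
# Hypothetical linear predictors; the linear model's "probabilities" are
# unbounded, whereas the logistic transformation keeps them in (0, 1)
eta <- seq(-5, 5, by = 1)
lin_p <- 0.5 + 0.2 * eta   # linear probability model (made-up coefficients)
log_p <- plogis(eta)       # logistic model
range(lin_p)  # -0.5 1.5
range(log_p)  # ~0.0067 ~0.9933
```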