---
title: "Module 1: Tutorial: Regression"
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

# Logistic Regression

## Description of data set

Within this study, the participants were aurally presented string combinations at varying audio volumes and were asked to identify the string.

The file module1-auditory_strings.csv contains the following data:

-   stimulus: character string that was aurally presented

-   condition: the volume at which it was presented (1: very quiet to 100: very loud)

-   response_correct: whether the response given by the participant was correct or incorrect

-   response_time: response time in seconds

## Tasks

1.  Read the data file module1-auditory_strings.csv into R (assign it to a variable called "dat").

```{r}
dat <- read.csv("module1-auditory_strings.csv")
head(dat)
```

2.  Create three new variables in dat:

    -   "volume" that contains the volume from "condition" as a numeric vector (e.g., 63 for the "condition" volume_63; Hint: You can use the function `str_split_fixed()` from the `stringr` package)

    -   "stimulus_length" that contains the length of the "stimulus" variable (Hint: You can use the function `str_length` from the `stringr` package)

    -   "response_correct" that contains the value 1 when the response was correct and 0 otherwise

```{r}
library(stringr)
dat$volume <- str_split_fixed(dat$condition, "_", 2)[, 2]
dat$volume <- as.numeric(dat$volume)

dat$stimulus_length <- str_length(dat$stimulus)

dat$response_correct <- ifelse(dat$response_correct == 'correct', 1, 0)
```

3.  Estimate a logistic regression model for "response_correct" as target and "volume", "stimulus_length", and "response_time" as features. How can the coefficients of this model be interpreted? In other words, what is the effect of each feature on the target when expressed as an odds ratio?

```{r}
library(mlr3verse)
tsk = as_task_classif(response_correct ~ volume + stimulus_length + response_time, data = dat, positive = "1")
mdl = lrn("classif.log_reg", predict_type = "prob")
mdl$train(tsk)
summary(mdl$model)

# estimated odds ratios
exp(coefficients(mdl$model))
```

The exponentiated coefficients of a logistic model can be interpreted as odds ratios. Accordingly, negative raw coefficients (as for volume and stimulus_length) correspond to odds ratios below 1 and indicate that the odds of observing a 1 for the target decrease as the respective feature increases (and vice versa for positive coefficients, e.g., response_time). That is, higher volume and stimulus_length decrease, whereas (descriptively) higher response_time increases, the odds of a correct response here. Moreover, the intercept not differing significantly from zero means that the odds of a correct versus an incorrect response do not differ when all features are zero (although this is not very informative without, for instance, centering the features first).
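
As a sanity check, the odds-ratio interpretation can be verified numerically: increasing a feature by one unit multiplies the predicted odds by the exponentiated coefficient. A minimal sketch, assuming the trained `mdl` from the chunk above (the reference values in `x0` are arbitrary):

```{r}
b <- coefficients(mdl$model)

# odds for an arbitrary reference observation
x0 <- c(volume = 50, stimulus_length = 5, response_time = 1)
odds0 <- exp(b["(Intercept)"] + sum(b[names(x0)] * x0))

# odds for the same observation with volume increased by one unit
x1 <- x0
x1["volume"] <- x1["volume"] + 1
odds1 <- exp(b["(Intercept)"] + sum(b[names(x1)] * x1))

# the ratio of the two odds equals the odds ratio exp(b["volume"])
c(odds1 / odds0, exp(b["volume"]))
```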

4.  Using the model from task 3, calculate the predicted probability for a correct response for each observation and save it as "prob_correct_pred" in dat.

```{r}
library(ggplot2)

pred <- mdl$predict(tsk)

# predicted probability of the positive class "1" (correct response)
dat$prob_correct_pred <- pred$prob[, "1"]

# exploratory plotting
ggplot(dat, aes(x = volume, y = prob_correct_pred)) +
  geom_point()
```

When we plot the predicted probabilities as a function of a feature, we see that the logistic model actually fits a linear function to an auxiliary target, the log-odds: the predicted probabilities are obtained by passing this linear predictor through the logistic function.
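
Concretely, mapping the fitted probabilities back through `qlogis()` (the inverse of the logistic function `plogis()`) recovers exactly that linear predictor. A sketch, assuming `mdl`, `dat`, and `prob_correct_pred` from the chunks above:

```{r}
b <- coefficients(mdl$model)

# linear predictor: intercept plus weighted sum of the features
lp <- b["(Intercept)"] +
  b["volume"] * dat$volume +
  b["stimulus_length"] * dat$stimulus_length +
  b["response_time"] * dat$response_time

# on the log-odds scale, the fitted values are exactly this linear function
all.equal(unname(qlogis(dat$prob_correct_pred)), unname(lp))
```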

5.  Manually calculate the predicted value of "response_correct" using a cutoff value of 0.5 for the probabilities calculated in task 4 and save it as "response_correct_pred" in dat. Is the result equivalent to the default prediction of `mlr3`'s `predict()` method?

```{r}
# predicted response (with cutoff of 0.5) 
dat$response_correct_pred <- ifelse(dat$prob_correct_pred >= 0.5, 1, 0)

# comparison to mlr3's default prediction
sum(dat$response_correct_pred == pred$response) / nrow(dat)
```

By default, `mlr3`'s `predict()` method uses a cutoff of 0.5 to classify observations according to the predicted probabilities of class membership.
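
The 0.5 cutoff is only a default: since the prediction object stores the class probabilities, `mlr3` can re-derive the response under a different threshold via `$set_threshold()`. A sketch, assuming `pred` from the chunk above (the value 0.7 is an arbitrary illustration):

```{r}
# predict "1" only for higher predicted probabilities of a correct response
pred$set_threshold(0.7)
table(pred$response)

# restore the default cutoff
pred$set_threshold(0.5)
```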

6.  Assess the prediction performance of the model by comparing the actual "response_correct" to the predicted "response_correct_pred" from task 5 using a contingency table. What's the prediction accuracy of the model?

```{r}
# contingency table
## mlr3verse:
pred$confusion

## manual:
#cont_table <- with(dat, table(response_correct_pred, response_correct))
#cont_table

# prediction accuracy
## mlr3verse:
mes <- msrs(c("classif.acc"))
pred$score(measures = mes)

## manual:
#acc <- (cont_table[1,1] + cont_table[2,2]) / sum(cont_table)
#acc
```

The prediction accuracy is only slightly above chance for our logistic regression model.
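
"Chance" can be made concrete as the majority-class baseline, i.e., the accuracy of a trivial classifier that always predicts the more frequent class (a sketch, assuming `dat` from the chunks above):

```{r}
# accuracy of always predicting the majority class of the target
max(table(dat$response_correct)) / nrow(dat)
```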

7.  Calculate the predicted probability for a correct response for a "stimulus_length" of 3, the mean "volume", and the first and third quartiles of "response_time". Do the logistic model's predictions for "response_correct" differ between the first and third quartiles of "response_time" using a cutoff value of 0.5?

```{r}
dat_new <- data.frame('volume' = mean(dat$volume)
                      , 'stimulus_length' = 3
                      , 'response_time' = quantile(dat$response_time)[c('25%','75%')]
                      )
pred_new <- mdl$predict_newdata(newdata = dat_new)
#pred_old <- mdl$predict_newdata(newdata = dat)

dat_new$prob_correct_pred <- pred_new$prob[,"1"]
dat_new
```

As the predicted probability of a correct response falls below the cutoff for the first quartile of "response_time", the model predicts an incorrect response there, whereas it predicts a correct response for the third quartile. Importantly, however, the difference is small, consistent with the coefficient of "response_time" not being significantly different from zero.
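
Because `mdl$model` is a plain `stats::glm` object, the same probabilities can also be obtained directly with base R's `predict()` using `type = "response"` (a sketch, assuming `mdl` and `dat_new` from the chunk above):

```{r}
# the wrapped glm gives the same predicted probabilities as mlr3
predict(mdl$model, newdata = dat_new, type = "response")
```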

8.  Bonus: Also estimate a linear regression model for the model as specified in task 3, that is, for "response_correct" as target and "volume", "stimulus_length", and "response_time" as features.

```{r}
task_lin = as_task_regr(response_correct ~ volume + stimulus_length + response_time, data = dat)
mdl_lin = lrn("regr.lm")
mdl_lin$train(task_lin)
summary(mdl_lin$model)
```

Note that the significance of the individual features is comparable to the results of the significance tests in the logistic regression model from task 3.

9.  Bonus: Why is the linear model estimated in task 8 fundamentally wrong? (Hint: Use both models, i.e., linear and logistic regression, to predict the probability for a correct response for the following new data set "dat_new2" with extremely high "stimulus_length", and compare the results)

```{r}
dat_new2 <- data.frame('stimulus_length' = 50
                       , 'response_time' = 5*max(dat$response_time)
                       , 'volume' = mean(dat$volume)
                       )
```

```{r}
# prediction of linear model for extreme data
mdl_lin$predict_newdata(newdata = dat_new2)$response

# prediction of logistic model for extreme data
mdl$predict_newdata(newdata = dat_new2)$prob[,"1"]
```

The linear model is fundamentally wrong for a binary target because its predictions are unbounded: it can produce "probabilities" below 0 (as for the extreme data here) or above 1. The logistic model, in contrast, maps every linear predictor into the interval (0, 1), as the predicted values of the two models for the new (extreme) data illustrate.
