NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Logistic Regression:


NOTE: There is a companion post on ROC curves and AUC here.


We want to estimate the probability of success / failure of the dependent variable \((\Pr(Y=1))\), given the explanatory variables: \(\Pr(Y=1\mid X_1,X_2,\cdots, X_n)\). The explanatory variables can be categorical or continuous - it doesn’t matter.

Logistic regression addresses the following issues:

  1. Since we are estimating a probability, \(\Pr(Y=1)\) must be bounded between \(0\) and \(1\), but a line is not naturally bounded. So we transform the probability into odds:

\[ \text{odds}=\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\tag 1\]

In addition, this also helps turn the sigmoid curve of a typical cumulative probability distribution (e.g. normal) into a linear relation.

  2. Now, when the probability goes down to zero, the odds also tend to zero; yet a line has no floor restriction. This problem is addressed by expressing the linear relation on the log scale: \(\log(\text{odds})\), which tends to \(-\infty\) as the odds tend to zero. This is the logit or log-odds function.

The logistic regression model is therefore:

\[\text{log odds of disease}=\color{magenta}{\text{logit}}\left(P(Y=1)\right)=\color{red}{\log} \left[\color{blue}{\text{odds}(P(Y=1))}\right]=\color{red}{\log}\left(\frac{ P\,(Y=1)}{1- P\,(Y=1)}\right) = X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\cdots+ \beta_k x_k\]

Consequently,

\[\color{blue}{\text{odds}(Y=1)} = \frac{\Pr\,(Y=1)}{1\,-\,\Pr\,(Y=1)} = e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k} = e^{\,\beta_0}\,e^{\,\beta_1x_1}\,e^{\,\beta_2x_2}\,\cdots\,e^{\,\beta_kx_k}\tag 2\]

Therefore a unit increase in \(x_1\) increases the odds \(\large e^{\,\beta_1}\) times.

How to interpret an increase of \(1\) in a covariate as an odds ratio:

Fixing the covariate \(x_2 =k\) (any given value):

\[\begin{align} \text{log odds of disease } &=\alpha + \beta_1 x_1 + \beta_2 k \\[2ex] \text{odds of disease } &= e^{\alpha + \beta_1 x_1 + \beta_2 k} \end{align}\]

Now for \(x_2=k + 1:\)

\[\begin{align} \text{log odds of disease } &=\alpha + \beta_1 x_1 + \beta_2 (k+1) = \alpha + \beta_1 x_1 + \beta_2 k+\beta_2 \\[2ex] \text{odds of disease } &= e^{\alpha + \beta_1 x_1 + \beta_2 k + \beta_2} \end{align}\]

The odds ratio going from \(k\) to \(k+1\) is

\[\begin{align}\text{Odds Ratio (OR)}&=\frac{\text{odds when }x_2=k+1}{\text{odds when }x_2=k}=\frac{e^{\alpha + \beta_1 x_1 + \beta_2 k + \beta_2}}{e^{\alpha + \beta_1 x_1 + \beta_2 k}}=e^{\beta_2}\\[2ex] \log\left(\text{Odds Ratio (OR)}\right)&= \beta_2 \end{align}\]

Therefore \(e^{\beta_2}\) is the relative increase in the odds of disease going from \(x_2=k\) to \(x_2=k+1\), holding \(x_1\) fixed (or adjusting for \(x_1\)). For every increase of \(1\) in \(x_2\), the odds of disease increase by a factor of \(e^{\beta_2}.\) Generally, if you increase \(x_2\) from \(k\) to \(k + \Delta\), then

\[\text{OR}= \frac{\text{odds when }x_2 = k +\Delta}{\text{odds when }x_2=k}=e^{\beta_2\Delta}=\left(e^{\beta_2}\right)^\Delta\]
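
As a quick numeric check of the two identities above, here is a minimal sketch in R using the built-in mtcars data (the model am ~ mpg is purely illustrative and not part of the derivation):

m <- glm(am ~ mpg, family = binomial, data = mtcars)
OR1 <- exp(coef(m)["mpg"])              # odds ratio for a 1-unit increase in mpg
c(OR1, exp(5 * coef(m)["mpg"]), OR1^5)  # the last two agree: exp(beta * Delta) = (exp(beta))^Delta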


Translation into probabilities:


From equations \((1)\) and \((2)\):

\[\begin{align} \Pr(Y = 1) &= (1-\Pr(Y=1))\times\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}\\[2ex] e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}&= \Pr(Y = 1)\,\times (1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}) \\[2ex] \color{blue}{\Pr(Y = 1)} = \frac{\color{blue}{\text{odds(Y=1)}}}{1\,+\,\color{blue}{\text{odds(Y=1)}}}&=\frac{e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}=\frac{1}{1 \,+\,e^{-(\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} \end{align}\]


For CATEGORICAL VARIABLES, the ODDS RATIO denotes by how much the presence (versus the absence) of a factor level multiplies the odds of a positive dependent variable.

For CONTINUOUS VARIABLES, the odds ratio tells us how much the odds change multiplicatively with a one-unit increase in the independent variable. To calculate the change in odds for a larger difference, raise the OR to the power of the difference between the observations.

To get the exponentiated coefficients and their confidence intervals in R, we use cbind to bind the coefficients and confidence intervals column-wise:

exp(cbind(OR = coef(mylogit), confint(mylogit)))

See here.


OUTPUT IN R:

f = glm(am ~ mpg, 'binomial', mtcars)
# The linear part applied to values of i.v.'s:
summary(predict(f))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.4104 -1.8676 -0.7086 -0.4351  0.3967  3.8047
# Pr(Y = 1 | x1, x2,...)
summary(fitted.values(f))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03197 0.13388 0.32991 0.40625 0.59790 0.97822
# Pr(Y = 1 | x1, x2,...) = exp(predict(f)) / (1 + exp(predict(f))) = 1 / (1 + exp(-predict(f)))
all.equal(exp(predict(f)) / (1 + exp(predict(f))), fitted.values(f))
## [1] TRUE

Interpretation of Logistic Regression Results:


Nothing beats simulating data with known coefficients and then recovering them with a logistic model (from here):

set.seed(1)
n <- 100

gender <- sample(c(0,1), size = n, replace = TRUE)
age <- round(runif(n, 18, 80))

beta0 <- - 10
beta1 <-   2.5
beta2 <-   0.4
linear <- beta0 + beta1 * gender + beta2 * age

p <- exp(linear) / (1 + exp(linear))

y <- rbinom(n, size = 1, prob = p)
          
mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)
## 
## Call:
## glm(formula = y ~ gender + age, family = "binomial")
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.82870   0.00040   0.00431   0.05913   1.37965  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -9.9962     4.5019  -2.220   0.0264 *
## gender        2.5084     1.4183   1.769   0.0770 .
## age           0.3697     0.1552   2.382   0.0172 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 50.728  on 99  degrees of freedom
## Residual deviance: 22.084  on 97  degrees of freedom
## AIC: 28.084
## 
## Number of Fisher Scoring iterations: 9
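
Since we simulated the data, we know the true coefficients (beta0 = -10, beta1 = 2.5, beta2 = 0.4), so we can compare them with the estimates and look at the exponentiated (odds-ratio) form; a minimal sketch:

# Compare the true and estimated coefficients, and report odds ratios with profile CIs
cbind(true = c(beta0, beta1, beta2), estimate = coef(mod))
exp(cbind(OR = coef(mod), confint(mod)))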


QUESTION:

(Initially posted here)

I ran a logistic regression of acceptance into college against SAT scores and family / ethnic background. The data are fictional. The question focuses on obtaining and interpreting odds ratios, leaving the SAT scores aside for simplicity.

The variables are Accepted (0 or 1) and Background (“red” or “blue”). I set up the data so that people of “red” background were more likely to get in:

fit <- glm(Accepted ~ Background, data=dat, family="binomial")
exp(cbind(Odds_Ratio_RedvBlue = coef(fit), confint(fit)))

                        Odds_Ratio_RedvBlue             2.5 %       97.5 %
(Intercept)             0.7088608                     0.5553459   0.9017961
Backgroundred           2.4480042                     1.7397640   3.4595454

Questions:

  1. Is 0.7 the odds ratio of a person of “blue” background being accepted? I’m asking this because I also get 0.7 for “Backgroundblue” if instead I run the following code:

     fit <- glm(Accepted~Background - 1, data=dat, family="binomial")
     exp(cbind(OR=coef(fit), confint(fit)))
  2. Shouldn’t the odds ratio of “red” being accepted (\(\rm Accepted/Red:Accepted/Blue\)) be just the reciprocal (\(\rm OddsBlue = 1 / OddsRed\))?

ANSWER:

I’ve been working on answering my question by calculating manually the odds and odds ratios:

Acceptance   blue            red            Grand Total
0              158           102                260
1              112           177                289
Total          270           279                549

So the Odds Ratio of getting into the school of Red over Blue is:

\[ \frac{\rm Odds\ Accept\ If\ Red}{\rm Odds\ Accept\ If\ Blue} = \frac{^{177}/_{102}}{^{112}/_{158}} = \frac {1.7353}{0.7089} = 2.448 \]

And this is the Backgroundred return of:

fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(Odds_and_OR=coef(fit), confint(fit)))

                      Odds_and_OR                         2.5 %      97.5 %
(Intercept)             0.7088608                     0.5553459   0.9017961
Backgroundred           2.4480042                     1.7397640   3.4595454

At the same time, the (Intercept) corresponds to the denominator of that odds ratio, which is exactly the odds of getting in for someone of ‘blue’ family background: \(112/158 = 0.7089\).

If instead, I run:

fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(Odds=coef(fit2), confint(fit2)))

                        Odds            2.5 %      97.5 %
Backgroundblue     0.7088608        0.5553459   0.9017961
Backgroundred      1.7352941        1.3632702   2.2206569

The returns are precisely the odds of getting in for ‘blue’: Backgroundblue (0.7089), and the odds of getting in for ‘red’: Backgroundred (1.7353). There is no odds ratio there, so the two returned values are not expected to be reciprocals.
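
The same numbers can be reproduced directly from the contingency table above (a minimal sketch; the counts are hard-coded from that table rather than taken from dat, which is not shown here):

# Rebuild the 2 x 2 table and compute the odds and the odds ratio
tab <- matrix(c(158, 112, 102, 177), nrow = 2,
              dimnames = list(Accepted = c("0", "1"), Background = c("blue", "red")))
odds_blue <- tab["1", "blue"] / tab["0", "blue"]   # 112/158 = 0.7089
odds_red  <- tab["1", "red"]  / tab["0", "red"]    # 177/102 = 1.7353
c(odds_blue = odds_blue, odds_red = odds_red, OR_red_v_blue = odds_red / odds_blue)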

Finally, how do we read the results if the categorical regressor has 3 levels?

Same manual versus [R] calculation:

I created a different fictitious data set with the same premise, but this time there were three ethnic backgrounds: “red”, “blue” and “orange”, and ran the same sequence:

First, the contingency table:

Acceptance  blue    orange  red   Total
0             86        65  130     281
1             64        42  162     268
Total        150       107  292     549

And calculated the Odds of getting in for each ethnic group:

  • Odds Accept If Red = 1.246154;
  • Odds Accept If Blue = 0.744186;
  • Odds Accept If Orange = 0.646154

As well as the different Odds Ratios:

  • OR red v blue = 1.674519;
  • OR red v orange = 1.928571;
  • OR blue v red = 0.597186;
  • OR blue v orange = 1.151717;
  • OR orange v red = 0.518519; and
  • OR orange v blue = 0.868269

And proceeded with the now routine logistic regression followed by exponentiation of coefficients:

fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit), confint(fit)))

                      ODDS     2.5 %   97.5 %
(Intercept)      0.7441860 0.5367042 1.026588
Backgroundorange 0.8682692 0.5223358 1.437108
Backgroundred    1.6745192 1.1271430 2.497853

Yielding the odds of getting in for “blues” as the (Intercept), the odds ratio of Orange versus Blue in Backgroundorange, and the OR of Red versus Blue in Backgroundred.

On the other hand, the regression without intercept predictably returned just the three independent odds:

fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit2), confint(fit2)))

                      ODDS     2.5 %    97.5 %
Backgroundblue   0.7441860 0.5367042 1.0265875
Backgroundorange 0.6461538 0.4354366 0.9484999
Backgroundred    1.2461538 0.9900426 1.5715814
END OF THE STACK OVERFLOW POST
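
A quick check on the three-level case above: the odds ratio between any two non-reference levels is just the ratio of their odds (equivalently, the ratio of their exponentiated coefficients). A minimal sketch with the odds hard-coded from the no-intercept output:

odds <- c(blue = 0.7441860, orange = 0.6461538, red = 1.2461538)
odds["red"] / odds["orange"]    # OR red v orange  = 1.928571
odds["red"] / odds["blue"]      # OR red v blue    = 1.674519 (Backgroundred above)
odds["orange"] / odds["blue"]   # OR orange v blue = 0.868269 (Backgroundorange above)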

Categorical Independent Variables:


Here is an example extracted from this site:

# Define an explanatory variable with two levels:
parentsmoke <- as.factor(c(1,0))
# create response vector:
response <- cbind(yes=c(816,188), no=c(3203,1168))
response
##      yes   no
## [1,] 816 3203
## [2,] 188 1168
smoke.logistic <- glm(response~parentsmoke, family=binomial(link=logit))
smoke.logistic
## 
## Call:  glm(formula = response ~ parentsmoke, family = binomial(link = logit))
## 
## Coefficients:
##  (Intercept)  parentsmoke1  
##      -1.8266        0.4592  
## 
## Degrees of Freedom: 1 Total (i.e. Null);  0 Residual
## Null Deviance:       29.12 
## Residual Deviance: -3.717e-13    AIC: 19.24
summary(smoke.logistic)
## 
## Call:
## glm(formula = response ~ parentsmoke, family = binomial(link = logit))
## 
## Deviance Residuals: 
## [1]  0  0
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.82661    0.07858 -23.244  < 2e-16 ***
## parentsmoke1  0.45918    0.08782   5.228 1.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  2.9121e+01  on 1  degrees of freedom
## Residual deviance: -3.7170e-13  on 0  degrees of freedom
## AIC: 19.242
## 
## Number of Fisher Scoring iterations: 2

This information gives us the fitted model:

\[\text{logit}(\hat\pi_i)=\log\frac{\hat\pi_i}{1-\hat\pi_i}=\hat\beta_0+\hat\beta_1X_i=-1.827+0.459\,(\text{parents smoke = 1})\] where parentsmoke is a dummy (design) variable that takes the value \(1\) if at least one parent smokes and \(0\) if neither smokes.

Parent smoking = 0 is the baseline. We are modeling here the probability of “children smoking”. The estimated \(\hat\beta_0=-1.827\), with a standard error of \(0.078\), says that the log-odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is \(-1.827\), and it is statistically significant.

The estimated \(\hat\beta_1=0.459\), with a standard error of \(0.088\), says that the log-odds-ratio of a child smoking versus not smoking, comparing at least one parent smoking with neither parent smoking (the baseline level), is \(0.459\), and it is statistically significant. \(\exp(0.459)=1.58\) is the estimated odds ratio.
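
That odds ratio can be recovered directly from the counts in response (a minimal sketch):

# Odds of a child smoking within each parental-smoking group, and their ratio
odds_parents_smoke    <- 816 / 3203   # at least one parent smokes
odds_parents_no_smoke <- 188 / 1168   # neither parent smokes
odds_parents_smoke / odds_parents_no_smoke   # ~ 1.58 = exp(0.45918)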

Why not just run a chi-square test?

prop.test(response, correct = F)
## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  response
## X-squared = 27.677, df = 1, p-value = 1.434e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.04218944 0.08659558
## sample estimates:
##    prop 1    prop 2 
## 0.2030356 0.1386431

It does detect differences in proportions but it doesn’t address:

  • Modeling the “risk” of a student smoking as a function of the parents’ smoking behavior.
  • Describing the differences between student smokers and nonsmokers as a function of the parents’ smoking behavior via descriptive discriminant analyses.
  • Predicting the probabilities that individuals fall into either category of the binary response as a function of explanatory variables, e.g. what is the probability that a student is a smoker given that neither of his/her parents smokes.
  • Classifying new students into the “smoking” or “nonsmoking” group depending on the parents’ smoking behavior.

These are just some of the possibilities of logistic regression, which cannot be handled within a framework of goodness-of-fit only.

Here is the same model splitting mothers and fathers as predictors:

kids.smoke <- matrix(c(188,599,816-205,1168,1456,3203-1456), nrow = 3)
dimnames(kids.smoke) = list(Parents = c("None", "Mother", "Father"), Kids = c("Smokers", "Non-smokers"))
kids.smoke
##         Kids
## Parents  Smokers Non-smokers
##   None       188        1168
##   Mother     599        1456
##   Father     611        1747
bad.example <- as.factor(c(0,"1Mother","2Father"))
smoke.logistic <- glm(kids.smoke~bad.example, family=binomial(link=logit))
smoke.logistic
## 
## Call:  glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
## 
## Coefficients:
##        (Intercept)  bad.example1Mother  bad.example2Father  
##            -1.8266              0.9384              0.7760  
## 
## Degrees of Freedom: 2 Total (i.e. Null);  0 Residual
## Null Deviance:       119.2 
## Residual Deviance: -4.341e-13    AIC: 28.77
summary(smoke.logistic)
## 
## Call:
## glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
## 
## Deviance Residuals: 
## [1]  0  0  0
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.82661    0.07858 -23.244   <2e-16 ***
## bad.example1Mother  0.93842    0.09237  10.160   <2e-16 ***
## bad.example2Father  0.77605    0.09157   8.475   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  1.1915e+02  on 2  degrees of freedom
## Residual deviance: -4.3410e-13  on 0  degrees of freedom
## AIC: 28.768
## 
## Number of Fisher Scoring iterations: 2

An example with credit defaults with some calculations extracted from here:

set.seed(123)

gender <- factor(round(runif(100,0,1)),
                labels=c("female", "male"))
age <- round(runif(100,20,80))
profession <- factor(round(runif(100,1,4)),
             labels=c("student", "worker", "teacher", "self-employed"))


# Let's make males, teachers and self-employed more likely to default:

credit.default <-  round(((as.numeric(gender)/2 + as.numeric(profession)/4)/2) * runif(100,0.5,1))

df <- data.frame(credit.default,gender,age,profession)
head(df)
##   credit.default gender age    profession
## 1              0 female  56        worker
## 2              1   male  40 self-employed
## 3              1 female  49       teacher
## 4              1   male  77       teacher
## 5              1   male  49        worker
## 6              1 female  73 self-employed
catlogis <- glm(credit.default ~ gender+age+profession, df, family=binomial(link="logit"))
summary(catlogis)
## 
## Call:
## glm(formula = credit.default ~ gender + age + profession, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2580  -0.9295  -0.2042   0.5361   1.5200  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -3.860e+00  1.327e+00  -2.910 0.003616 ** 
## gendermale               3.084e+00  7.844e-01   3.931 8.46e-05 ***
## age                      9.465e-07  1.692e-02   0.000 0.999955    
## professionworker         6.325e-01  8.211e-01   0.770 0.441075    
## professionteacher        3.245e+00  1.003e+00   3.236 0.001214 ** 
## professionself-employed  4.464e+00  1.193e+00   3.740 0.000184 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.663  on 99  degrees of freedom
## Residual deviance:  88.237  on 94  degrees of freedom
## AIC: 100.24
## 
## Number of Fisher Scoring iterations: 5
require(pROC)
auc(credit.default, predict(catlogis, type="response"), quiet=T)
## Area under the curve: 0.8517

We can generate an analysis-of-deviance table, paying attention to the difference between the null deviance and the residual deviance to see which variables reduce the residual deviance relative to the intercept-only model. The p-values also help select variables. We want a drop in both the deviance and the AIC.
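
One way to see those deviance drops term by term is the sequential analysis-of-deviance table (a sketch on the full model fitted above; anova() adds the terms in the order they appear in the formula):

anova(catlogis, test = "Chisq")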

catlogis <- glm(credit.default ~ gender+profession, df, family=binomial(link="logit"))
summary(catlogis)
## 
## Call:
## glm(formula = credit.default ~ gender + profession, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2580  -0.9295  -0.2042   0.5361   1.5200  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -3.8604     0.9996  -3.862 0.000113 ***
## gendermale                3.0836     0.7841   3.933 8.39e-05 ***
## professionworker          0.6325     0.8202   0.771 0.440570    
## professionteacher         3.2448     1.0028   3.236 0.001213 ** 
## professionself-employed   4.4636     1.1873   3.759 0.000170 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.663  on 99  degrees of freedom
## Residual deviance:  88.237  on 95  degrees of freedom
## AIC: 98.237
## 
## Number of Fisher Scoring iterations: 5
auc(df$credit.default, predict(catlogis, type="response"), quiet=T)
## Area under the curve: 0.8527
genlogis <- glm(credit.default ~ gender, df, family=binomial(link="logit"))
summary(genlogis)
## 
## Call:
## glm(formula = credit.default ~ gender, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4261  -0.7502  -0.7502   0.9476   1.6765  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1239     0.3193  -3.520 0.000431 ***
## gendermale    1.6919     0.4405   3.841 0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 120.56  on 98  degrees of freedom
## AIC: 124.56
## 
## Number of Fisher Scoring iterations: 4
auc(df$credit.default, predict(genlogis, type="response"), quiet=T)
## Area under the curve: 0.6997
fitted.results <- predict(genlogis, type="response")
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != df$credit.default)
print(paste('Accuracy', 1-misClasificError))
## [1] "Accuracy 0.7"
library(ROCR)
pr <- prediction(predict(genlogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main= 'ROC curve',
     colorize=TRUE,
     lwd= 3)
plot(prf,
     lty=3,
     col="grey78",
     add=TRUE)

auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])
## [1] 0.6997144
proflogis <- glm(credit.default ~ profession, df, family=binomial(link="logit"))
summary(proflogis)
## 
## Call:
## glm(formula = credit.default ~ profession, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7941  -0.7710  -0.6231   1.0508   1.8626  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)   
## (Intercept)              -1.5404     0.6362  -2.421  0.01547 * 
## professionworker          0.4796     0.7445   0.644  0.51950   
## professionteacher         1.8458     0.7272   2.538  0.01114 * 
## professionself-employed   2.9267     0.9063   3.229  0.00124 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.66  on 99  degrees of freedom
## Residual deviance: 115.75  on 96  degrees of freedom
## AIC: 123.75
## 
## Number of Fisher Scoring iterations: 4
auc(df$credit.default, predict(proflogis, type="response"), quiet=T)
## Area under the curve: 0.7438
pr <- prediction(predict(proflogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")

plot(prf,
     colorize=TRUE,
     lwd= 3,
     main= "ROC curve")
plot(prf,
     lty=3,
     col="grey78",
     add=TRUE)

auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])
## [1] 0.743778

Another example from this video:

set.seed(420)
n.samples <- 100
weight <- sort(rnorm(n.samples,172,29))
obese <- ifelse(runif(n.samples) < (rank(weight)/100), yes=1, no=0)
plot(weight,obese, pch=19, col='tan4', cex=.7)
glm.fit <- glm(obese ~ weight, family="binomial")
lines(weight, glm.fit$fitted.values)

par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T)

## 
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values,     plot = T)
## 
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 0.8291
roc.info <- roc(obese, glm.fit$fitted.values)
roc.df <- data.frame(ttp=roc.info$sensitivities*100, fpp=(1-roc.info$specificities) *100, thresholds=roc.info$thresholds)
head(roc.df)
##   ttp       fpp thresholds
## 1 100 100.00000       -Inf
## 2 100  97.77778 0.01349011
## 3 100  95.55556 0.03245008
## 4 100  93.33333 0.05250145
## 5 100  91.11111 0.07017225
## 6 100  88.88889 0.08798755
tail(roc.df)
##          ttp fpp thresholds
## 96  9.090909   0  0.9275222
## 97  7.272727   0  0.9371857
## 98  5.454545   0  0.9480358
## 99  3.636364   0  0.9648800
## 100 1.818182   0  0.9735257
## 101 0.000000   0        Inf

Notice that in the first row of head(roc.df), with a threshold of \(-\infty\) for labeling “obese”, the ttp (true positive percentage) is \(100\%\) and so is the fpp (false positive percentage); this corresponds to the upper-right-hand part of the ROC curve. The bottom-left part of the curve corresponds to the last row of tail(roc.df).
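
One common (though not the only) way to pick a single threshold from such a table is to maximize Youden's J, i.e. sensitivity + specificity - 1, which in the percentages above is ttp - fpp; a minimal sketch:

# Row of roc.df whose threshold maximizes Youden's J = ttp - fpp
roc.df[which.max(roc.df$ttp - roc.df$fpp), ]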

We can isolate part of the data, for example with ttp between \(60\) and \(80\), to select an optimal threshold value:

roc.df[roc.df$ttp > 60 & roc.df$ttp < 80,]
##         ttp      fpp thresholds
## 42 78.18182 35.55556  0.5049310
## 43 78.18182 33.33333  0.5067116
## 44 78.18182 31.11111  0.5166680
## 45 76.36364 31.11111  0.5287933
## 46 76.36364 28.88889  0.5429351
## 47 76.36364 26.66667  0.5589494
## 48 74.54545 26.66667  0.5676342
## 49 74.54545 24.44444  0.5776086
## 50 74.54545 22.22222  0.5946054
## 51 72.72727 22.22222  0.6227449
## 52 70.90909 22.22222  0.6398136
## 53 69.09091 22.22222  0.6441654
## 54 67.27273 22.22222  0.6556705
## 55 67.27273 20.00000  0.6683618
## 56 67.27273 17.77778  0.6767661
## 57 65.45455 17.77778  0.6802060
## 58 65.45455 15.55556  0.6831936
## 59 65.45455 13.33333  0.6917225
## 60 63.63636 13.33333  0.6975300
## 61 61.81818 13.33333  0.6982807

We can plot the ROC differently:

par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
    xlab='False positive percentage', ylab='True positive percentage',
    col="darkslateblue",lwd=4, print.auc=T)

## 
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values,     percent = T, plot = T, legacy.axes = T, xlab = "False positive percentage",     ylab = "True positive percentage", col = "darkslateblue",     lwd = 4, print.auc = T)
## 
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 82.91%

To calculate the partial AUC over the high-specificity (left-hand) portion of the curve, where only a small proportion of false positives is allowed…

par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
    xlab='False positive percentage', ylab='True positive percentage',
    col='darkslateblue',lwd=4, print.auc=T, print.auc.x=45,         partial.auc=c(100,90),
    auc.polygon=T, auc.polygon.col='midnightblue', quiet=T)

## 
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values,     percent = T, quiet = T, plot = T, legacy.axes = T, xlab = "False positive percentage",     ylab = "True positive percentage", col = "darkslateblue",     lwd = 4, print.auc = T, print.auc.x = 45, partial.auc = c(100,         90), auc.polygon = T, auc.polygon.col = "midnightblue")
## 
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Partial area under the curve (specificity 100%-90%): 4.727%

Can we bootstrap?

ci.auc(roc(obese, glm.fit$fitted.values), conf.level=0.95, method=
"bootstrap", boot.n = 2000, boot.stratified = TRUE, reuse.auc=TRUE,
progress = getOption("pROCProgress")$name, parallel=FALSE)
## 95% CI: 0.7438-0.8954 (2000 stratified bootstrap replicates)

Testing a Logistic Regression Model:


  1. GOODNESS-OF-FIT: Model calibration

The Hosmer-Lemeshow goodness-of-fit test divides the predicted probabilities into bins (in R these are the fitted values, fitted(), as opposed to the linear predictor from predict()), and runs a chi-square test comparing, within each bin, the percentage of cases that have \(Y=1\) to the average predicted probability. See here:

The Hosmer-Lemeshow goodness of fit test is based on dividing the sample up according to their predicted probabilities, or risks. Specifically, based on the estimated parameter values \(\hat{\beta}_{0},\hat{\beta}_{1},..,\hat{\beta}_{p},\) for each observation in the sample the probability that \(Y=1\) is calculated, based on each observation’s covariate values:

\[\hat{\pi} = \frac{\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}{1+\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}\]

The observations in the sample are then split into \(g\) groups (we come back to choice of g later) according to their predicted probabilities. Suppose (as is commonly done) that \(g=10.\) Then the first group consists of the observations with the lowest \(10\%\) predicted probabilities. The second group consists of the \(10\%\) of the sample whose predicted probabilities are next smallest, etc etc.

Suppose for the moment that all of the observations in the first group had a predicted probability of \(0.1.\) Then, if our model is correctly specified, we would expect the proportion of these observations who have \(Y=1\) to be \(10\%.\) Of course, even if the model is correctly specified, the observed proportion will deviate to some extent from 10%, but not by too much. If the proportion of observations with \(Y=1\) in the group were instead \(90\%,\) this is suggestive that our model is not accurately predicting probability (risk), i.e. an indication that our model is not fitting the data well.

In practice, as soon as some of our model covariates are continuous, each observation will have a different predicted probability, and so the predicted probabilities will vary in each of the groups we have formed. To calculate how many Y=1 observations we would expect, the Hosmer-Lemeshow test takes the average of the predicted probabilities in the group, and multiplies this by the number of observations in the group. The test also performs the same calculation for \(Y=0,\) and then calculates a Pearson goodness of fit statistic

\[\sum^{1}_{k=0} \sum^{g}_{l=1} \frac{(o_{kl}-e_{kl})^{2}}{e_{kl}}\]

where \(o_{0l}\) denotes the number of observed \(Y=0\) observations in the \(l\)-th group, \(o_{1l}\) the number of observed \(Y=1\) observations in the \(l\)-th group, and \(e_{0l}\) and \(e_{1l}\) similarly denote the expected numbers of zeros and ones.

In a 1980 paper, Hosmer and Lemeshow showed by simulation that (provided \(p+1<g\)) their test statistic approximately follows a chi-squared distribution on \(g-2\) degrees of freedom when the model is correctly specified. This means that given our fitted model, the p-value can be calculated as the right-hand tail probability of the corresponding chi-squared distribution using the calculated test statistic. If the p-value is small, this is indicative of poor fit.

It should be emphasized that a large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favor of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.
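
In R, one implementation of the test is hoslem.test() from the ResourceSelection package (a sketch, assuming that package is installed; shown here on the simulated model mod fitted earlier, with the default g = 10 groups):

library(ResourceSelection)
hoslem.test(mod$y, fitted(mod), g = 10)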

  2. C-STATISTIC or AUC (area under the ROC curve): Discriminatory power of the model

It is based on sweeping a classification cut-off from \(0\) to \(1\), calculating the sensitivity and specificity at each cut-off to trace an ROC curve, and then measuring the AUC. See here:

Our model or prediction rule is perfect at classifying observations if it has 100% sensitivity and 100% specificity. Unfortunately in practice this is (usually) not attainable. So how can we summarize the discrimination ability of our logistic regression model? For each observation, our fitted model can be used to calculate the fitted probabilities \(P(Y=1|X_{1},..,X_{p}).\) On their own, these don’t tell us how to classify observations as positive or negative.

One way to create such a classification rule is to choose a cut-point \(c,\) and classify those observations with a fitted probability above \(c\) as positive and those at or below it as negative. For this particular cut-off, we can estimate the sensitivity by the proportion of observations with \(Y=1\) which have a predicted probability above \(c,\) and similarly we can estimate specificity by the proportion of \(Y=0\) observations with a predicted probability at or below \(c.\) If we increase the cut-point \(c,\) fewer observations will be predicted as positive. This will mean that fewer of the \(Y=1\) observations will be predicted as positive (reduced sensitivity), but more of the \(Y=0\) observations will be predicted as negative (increased specificity). In picking the cut-point, there is thus an intrinsic trade off between sensitivity and specificity.

Now we come to the ROC curve, which is simply a plot of the values of sensitivity against one minus specificity, as the value of the cut-point \(c\) is increased from \(0\) through to \(1.\) Example:

set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y~x, family="binomial")
predpr <- predict(mod,type=c("response"))
roccurve <- roc(y ~ predpr)
plot(roccurve)

A popular way of summarizing the discrimination ability of a model is to report the area under the ROC curve. A model with good discrimination ability has an ROC curve which goes closer to the top left hand corner of the plot, whereas a model with no discrimination ability has an ROC curve close to the \(45\) degree line. Thus the area under the curve ranges from \(1,\) corresponding to perfect discrimination, down to \(0.5,\) corresponding to a model with no discrimination ability. The area under the ROC curve is also sometimes referred to as the c-statistic (c for concordance).

auc(roccurve)
## Area under the curve: 0.7463

The ROC curve (AUC) has a somewhat appealing interpretation. It turns out that the AUC is the probability that if you were to take a random pair of observations, one with \(Y=1\) and one with \(Y=0,\) the observation with \(Y=1\) has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations.
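
We can check that rank-based interpretation by hand, using the simulated y and predpr from the example above (a minimal sketch based on the Mann-Whitney U relationship):

# AUC as the probability that a random Y = 1 case outranks a random Y = 0 case
n1 <- sum(y == 1); n0 <- sum(y == 0)
U  <- sum(rank(predpr)[y == 1]) - n1 * (n1 + 1) / 2
U / (n1 * n0)   # matches auc(roccurve)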

In the biomedical context of risk prediction modelling, the AUC has been criticized by some. In the risk prediction context, individuals have their risk of developing (for example) coronary heart disease over the next 10 years predicted. Thus a measure of discrimination which examines the predicted probability of pairs of individuals, one with \(Y=1\) and one with \(Y=0,\) does not really match the prospective risk prediction setting, where we do not have such pairs.


From here:


How do I interpret odds ratios in logistic regression?


When a binary outcome variable is modeled using logistic regression, it is assumed that the logit transformation of the outcome variable has a linear relationship with the predictor variables. This makes the interpretation of the regression coefficients somewhat tricky. In this page, we will walk through the concept of odds ratio and try to interpret the logistic regression results using the concept of odds ratio in a couple of examples.

From probability to odds to log of odds:

Everything starts with the concept of probability. Let’s say that the probability of success of some event is .8. Then the probability of failure is 1 – .8 = .2. The odds of success are defined as the ratio of the probability of success over the probability of failure. In our example, the odds of success are .8/.2 = 4. That is to say that the odds of success are 4 to 1. If the probability of success is .5, i.e., 50-50 percent chance, then the odds of success is 1 to 1.

The transformation from probability to odds is a monotonic transformation, meaning the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1. Odds range from 0 to positive infinity.

Below is a table of the transformation from probability to odds.

    p       odds  
  .001    .001001
   .01    .010101
   .15   .1764706
    .2        .25
   .25   .3333333
    .3   .4285714
   .35   .5384616
    .4   .6666667
   .45   .8181818
    .5          1
   .55   1.222222
    .6        1.5
   .65   1.857143
    .7   2.333333
   .75          3
    .8          4
   .85   5.666667
    .9          9
  .999        999
 .9999       9999

The transformation from odds to log of odds is the log transformation. Again this is a monotonic transformation. That is to say, the greater the odds, the greater the log of odds and vice versa. The table below shows the relationship among the probability, odds and log of odds.

    p       odds     logodds  
  .001    .001001  -6.906755
   .01    .010101   -4.59512
   .15   .1764706  -1.734601
    .2        .25  -1.386294
   .25   .3333333  -1.098612
    .3   .4285714  -.8472978
   .35   .5384616  -.6190392
    .4   .6666667  -.4054651
   .45   .8181818  -.2006707
    .5          1          0
   .55   1.222222   .2006707
    .6        1.5   .4054651
   .65   1.857143   .6190392
    .7   2.333333   .8472978
   .75          3   1.098612
    .8          4   1.386294
   .85   5.666667   1.734601
    .9          9   2.197225
  .999        999   6.906755
 .9999       9999    9.21024

Why do we take all the trouble doing the transformation from probability to log odds? One reason is that it is usually difficult to model a variable which has restricted range, such as probability. This transformation is an attempt to get around the restricted range problem. It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. Another reason is that among all of the infinitely many choices of transformation, the log of odds is one of the easiest to understand and interpret. This transformation is called logit transformation. The other common choice is the probit transformation, which will not be covered here.

A logistic regression model allows us to establish a relationship between a binary outcome variable and a group of predictor variables. It models the logit-transformed probability as a linear relationship with the predictor variables. More formally, let \(Y\) be the binary outcome variable indicating failure/success with \(\{0,1\}\) and let \(p\) be the probability of \(Y\) being 1. Let \(x_1,x_2,\dots\) be a set of predictor variables. Then the logistic regression of \(Y\) on \(x_1,x_2,\dots\) estimates the parameter values via the maximum likelihood method in the following equation:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.\]

Exponentiate and take the multiplicative inverse of both sides,

\[\frac{1-p}{p} = \frac{1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]

Partial out the fraction on the left-hand side of the equation and add one to both sides,

\[\frac{1}{p} = 1 + \frac{1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]

Change 1 to a common denominator,

\[\frac{1}{p} = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)+1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]

Finally, take the multiplicative inverse again to obtain the formula for the probability,

\[{p} = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1+\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]

We are now ready for a few examples of logistic regressions. We will use this dataset. The data set has 200 observations and the outcome variable used will be hon, indicating whether a student is in an honors class or not. So our p = prob(hon=1). We will purposely ignore all the significance tests and focus on the meaning of the regression coefficients.


Logistic regression with no predictor variables


Let’s start with the simplest logistic regression, a model without any predictor variables. In an equation, we are modeling

\[\text{logit}(p)= β_0\]

require(RCurl)
url="https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/Honors%20math%20write%20gender"
d <- read.csv(url)
head(d)
##   female read write math hon femalexmath
## 1      0   57    52   41   0           0
## 2      1   68    59   53   0          53
## 3      0   44    33   54   0           0
## 4      0   63    44   47   0           0
## 5      0   47    52   57   0           0
## 6      0   44    52   51   0           0
fit <- glm(hon~1,d,family='binomial')
summary(fit)
## 
## Call:
## glm(formula = hon ~ 1, family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7497  -0.7497  -0.7497  -0.7497   1.6772  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.1255     0.1644  -6.845 7.62e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.71  on 199  degrees of freedom
## Residual deviance: 222.71  on 199  degrees of freedom
## AIC: 224.71
## 
## Number of Fisher Scoring iterations: 4

This means \(\log(p/(1-p)) = -1.1255.\) What is \(p\) here? It turns out that \(p\) is the overall probability of being in honors class (hon = 1). Let’s take a look at the frequency table for hon.

addmargins(table(t(d$hon)))
## 
##   0   1 Sum 
## 151  49 200

So \(p = 49/200 = .245.\) The odds are \(.245/(1-.245) = .3245\) and the log of the odds (logit) is \(\log(.3245) = -1.12546.\) In other words, the intercept from the model with no predictor variables is the estimated log odds of being in honors class for the whole population of interest. We can also transform the log of the odds back to a probability: \(p = \exp(-1.12546)/(1+\exp(-1.12546)) = .245.\)
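
A quick check of that arithmetic in R (fit is the intercept-only model above):

p <- mean(d$hon)                                    # 0.245
c(odds = p / (1 - p), logit = log(p / (1 - p)), intercept = coef(fit)[[1]])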


Logistic regression with a single dichotomous predictor variable:


Now let’s go one step further by adding a binary predictor variable, female, to the model. Writing it in an equation, the model describes the following linear relationship.

\[\text{logit}(p) = β_0 + β_1\times\text{female}\]

fit <- glm(hon ~ female,d,family='binomial')
summary(fit)
## 
## Call:
## glm(formula = hon ~ female, family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8337  -0.8337  -0.6431  -0.6431   1.8317  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.4709     0.2690  -5.469 4.53e-08 ***
## female        0.5928     0.3414   1.736   0.0825 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.71  on 199  degrees of freedom
## Residual deviance: 219.61  on 198  degrees of freedom
## AIC: 223.61
## 
## Number of Fisher Scoring iterations: 4

Before trying to interpret the two parameters estimated above, let’s take a look at the crosstab of the variable hon with female.

addmargins(table(honors=t(d$hon),female=d$female))
##       female
## honors   0   1 Sum
##    0    74  77 151
##    1    17  32  49
##    Sum  91 109 200

In our dataset, what are the odds of a male being in the honors class and what are the odds of a female being in the honors class? We can manually calculate these odds from the table: for males, the odds of being in the honors class are \((17/91)/(74/91) = 17/74 = .23;\) and for females, the odds of being in the honors class are \((32/109)/(77/109) = 32/77 = .42.\) The ratio of the odds for female to the odds for male is \((32/77)/(17/74) = 1.809.\) So the odds for males are \(17\) to \(74,\) the odds for females are \(32\) to \(77,\) and the odds for female are about \(81\%\) higher than the odds for males.

Now we can relate the odds for males and females and the output from the logistic regression. The intercept of \(-1.471\) is the log odds for males since male is the reference group \((female = 0).\) Using the odds we calculated above for males, we can confirm this: \(\log(.23) = -1.47.\) The coefficient for female is the log of the odds ratio between the female group and male group: \(\log(1.809) = .593.\) So we can get the odds ratio by exponentiating the coefficient for female.

exp(cbind(OR = coef(fit), confint(fit)))
##                    OR     2.5 %    97.5 %
## (Intercept) 0.2297297 0.1312460 0.3792884
## female      1.8090145 0.9362394 3.5929859

Logistic regression with a single continuous predictor variable:


Another simple example is a model with a single continuous predictor variable such as the model below. It describes the relationship between students’ math scores and the log odds of being in an honors class.

\[\text{logit}(p) = β_0 + β_1\times \text{math}\]

fit <- glm(hon ~ math,d,family='binomial')
summary(fit)
## 
## Call:
## glm(formula = hon ~ math, family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0332  -0.6785  -0.3506  -0.1565   2.6143  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.79394    1.48174  -6.610 3.85e-11 ***
## math         0.15634    0.02561   6.105 1.03e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.71  on 199  degrees of freedom
## Residual deviance: 167.07  on 198  degrees of freedom
## AIC: 171.07
## 
## Number of Fisher Scoring iterations: 5

In this case, the estimated coefficient for the intercept is the log odds of a student with a math score of zero being in an honors class. In other words, the odds of being in an honors class when the math score is zero is \(\exp(-9.793942) = .00005579.\) These odds are very low, but if we look at the distribution of the variable math, we will see that no one in the sample has math score lower than \(30.\) In fact, all the test scores in the data set were standardized around mean of \(50\) and standard deviation of \(10.\) So the intercept in this model corresponds to the log odds of being in an honors class when math is at the hypothetical value of zero.

How do we interpret the coefficient for math? The coefficient and intercept estimates give us the following equation:

\[\log(p/(1-p)) = \text{logit}(p) = -9.793942 + .1563404 \times \text{math}\]

Let’s fix math at some value. We will use \(54.\) Then the conditional logit of being in an honors class when the math score is held at \(54\) is

\[\log(p/(1-p))(\text{math}=54) = -9.793942 + .1563404 \times 54.\]

We can examine the effect of a one-unit increase in math score. When the math score is held at \(55,\) the conditional logit of being in an honors class is

\[\log(p/(1-p))(\text{math}=55) = -9.793942 + .1563404 \times 55.\]

Taking the difference of the two equations, we have the following:

\[\log(p/(1-p))(\text{math}=55) - \log(p/(1-p))(\text{math} = 54) = .1563404.\]

We can say now that the coefficient for math is the difference in the log odds. In other words, for a one-unit increase in the math score, the expected change in log odds is \(.1563404.\)

Can we translate this change in log odds to a change in odds? Indeed, we can. Recall that the logarithm converts multiplication and division to addition and subtraction. Its inverse, exponentiation, converts addition and subtraction back to multiplication and division. If we exponentiate both sides of our last equation, we have the following:

\[\begin{align} &\exp[\log(p/(1-p))(\text{math}=55) - \log(p/(1-p))(\text{math} = 54)] \\[2ex] &=\exp(\log(p/(1-p))(\text{math}=55)) / \exp(\log(p/(1-p))(\text{math} = 54))\\[2ex] &= \text{odds}(\text{math}=55)/\text{odds}(\text{math}=54)\\[2ex] &= \exp(.1563404) = 1.1692241. \end{align}\]

So we can say that for a one-unit increase in math score, we expect to see about a \(17\%\) increase in the odds of being in an honors class. This \(17\%\) increase does not depend on the value at which math is held.
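
The same odds ratio can be read straight off the fitted object (a sketch; fit is the hon ~ math model above):

exp(coef(fit)["math"])        # 1.1692, i.e. about a 17% increase per math point
exp(confint(fit))["math", ]   # confidence interval on the odds-ratio scale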


Logistic regression with multiple predictor variables and no interaction terms:


In general, we can have multiple predictor variables in a logistic regression model.

\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1\, x_1 + … + β_k\, x_k\]

Applying such a model to our example dataset, each estimated coefficient is the expected change in the log odds of being in an honors class for a unit increase in the corresponding predictor variable holding the other predictor variables constant at certain value. Each exponentiated coefficient is the ratio of two odds, or the change in odds in the multiplicative scale for a unit increase in the corresponding predictor variable holding other variables at certain value. Here is an example.

\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1 \text{math} + β_2 \text{female} + β_3 \text{read}\]

fit <- glm(hon ~ math + female + read,d,family='binomial')
summary(fit)
## 
## Call:
## glm(formula = hon ~ math + female + read, family = "binomial", 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8305  -0.6327  -0.3300  -0.1258   2.3896  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.77025    1.71068  -6.880 5.97e-12 ***
## math          0.12296    0.03128   3.931 8.44e-05 ***
## female        0.97995    0.42163   2.324   0.0201 *  
## read          0.05906    0.02655   2.224   0.0261 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.71  on 199  degrees of freedom
## Residual deviance: 156.17  on 196  degrees of freedom
## AIC: 164.17
## 
## Number of Fisher Scoring iterations: 5

This fitted model says that, holding math and reading at a fixed value, the odds of getting into an honors class for females (female = 1) over the odds of getting into an honors class for males (female = 0) is \(\exp(.979948) = 2.66.\) In terms of percent change, we can say that the odds for females are \(166\%\) higher than the odds for males. The coefficient for math says that, holding female and reading at a fixed value, we will see a \(13\%\) increase in the odds of getting into an honors class for a one-unit increase in math score, since \(\exp(.1229589) = 1.13.\)
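
These adjusted odds ratios come straight from exponentiating the coefficients of the fitted model (a minimal sketch):

exp(coef(fit))   # e.g. female ~ 2.66, math ~ 1.13 (adjusted odds ratios)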

Logistic regression with an interaction term of two predictor variables:


In all the previous examples, we have said that the regression coefficient of a variable corresponds to the change in log odds and its exponentiated form corresponds to the odds ratio. This is only true when our model does not have any interaction terms. When a model has an interaction term of two predictor variables, it attempts to describe how the effect of one predictor variable depends on the level/value of the other predictor variable. The interpretation of the regression coefficients becomes more involved.

Let’s take a simple example.

\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1 \text{female} + β_2 \text{math} + β_3 \text{female}\times\text{math}\]

fit <- glm(hon ~ female + math + female*math,d,family='binomial')
summary(fit)
## 
## Call:
## glm(formula = hon ~ female + math + female * math, family = "binomial", 
##     data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7623  -0.6725  -0.3421  -0.1450   2.6913  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.74584    2.12913  -4.108    4e-05 ***
## female      -2.89986    3.09418  -0.937 0.348657    
## math         0.12938    0.03588   3.606 0.000312 ***
## female:math  0.06700    0.05346   1.253 0.210139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 222.71  on 199  degrees of freedom
## Residual deviance: 159.77  on 196  degrees of freedom
## AIC: 167.77
## 
## Number of Fisher Scoring iterations: 5

In the presence of the interaction term of female by math, we can no longer talk about the effect of female holding all other variables at a certain value, since it does not make sense to fix math and female × math at certain values and still allow female to change from 0 to 1!

In this simple example where we examine the interaction of a binary variable and a continuous variable, we can think that we actually have two equations: one for males and one for females. For males (female=0), the equation is simply

\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_2 \text{math}.\]

For females, the equation is

\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1\text{female} + β_2\text{math} + β_3 \text{math}\times \text{female} .\]

Now we can map the logistic regression output to these two equations. So we can say that the coefficient for math is the effect of math when female = 0. More explicitly, we can say that for male students, a one-unit increase in math score yields a change in log odds of \(0.13.\) On the other hand, for the female students, a one-unit increase in math score yields a change in log odds of \((.13 + .067) = 0.197.\) In terms of odds ratios, we can say that for male students, the odds ratio is \(\exp(.13) = 1.14\) for a one-unit increase in math score and the odds ratio for female students is \(\exp(.197) = 1.22\) for a one-unit increase in math score. The ratio of these two odds ratios (female over male) turns out to be the exponentiated coefficient for the interaction term of female by math: \(1.22/1.14 = \exp(.067) = 1.07.\)
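
The same numbers can be pulled from the fitted interaction model (a sketch; fit is the model above):

# Per-unit odds ratios for math, separately for males and females, and their ratio
OR_math_male   <- exp(coef(fit)["math"])
OR_math_female <- exp(coef(fit)["math"] + coef(fit)["female:math"])
c(male = unname(OR_math_male), female = unname(OR_math_female),
  ratio = unname(OR_math_female / OR_math_male))   # ratio = exp(coef for female:math) ~ 1.07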


Odds Ratios as Effect Size Statistics:


From here:

First, it’s important to understand what effect size statistics are for and why they’re worth reporting.

This quotation by Joseph Durlak explains it nicely:

“…provide information about the magnitude and direction of the difference between two groups or the relationship between two variables.”

There are two types of effect size statistics: standardized and unstandardized.

Standardized statistics have been stripped of all units of measurement.

Correlation is a nice example. People like correlation because the strength and direction of any two correlations can be compared, regardless of the units of the variables on which the correlation was measured.

Unstandardized statistics are still measured in the original units of the variables. So a difference in two means and a regression coefficient are both effect size statistics and both are useful to report.

Most people mean standardized when they say “effect size statistic.” But both describe the magnitude and direction of the research findings.

If you’re at all familiar with logistic regression, you’re also familiar with odds ratios. Odds ratios measure how many times bigger the odds of one outcome is for one value of an IV, compared to another value.

For example, let’s say you’re doing a logistic regression for an ecology study on whether or not a wetland in a certain area has been infected with a specific invasive plant. Predictors include water temperature in degrees Celsius, altitude, and whether the wetland is a fen or a marsh.

If the odds ratio for water temperature is \(1.12,\) that means that for each one-degree Celsius increase in water temperature, the odds of the wetland having the invasive plant species is \(1.12\) times as big, after controlling for the other predictors.

That odds ratio is an unstandardized effect size statistic. It tells you the direction and the strength of the relationship between water temperature and the odds that the plant is present.

It’s unstandardized because it’s based on the units of temperature.

I realize no ecologist would do so, but if the water were measured in degrees Fahrenheit, that odds ratio would have a different value. The direction and the strength of the relationship would be the same, but the statistic would be evaluated on a different scale.


Odds Ratios as Standardized Effect Size Statistics


Surprisingly, I’ve seen odds ratios listed as standardized effect size statistics.

A little digging showed those authors were referring to one of two situations.

The first situation is when the predictor is also binary. The odds ratio for whether the wetland was a fen or a marsh can be considered standardized, not because we’ve removed any units, but because there never were any.

So if the odds ratio for fen vs. marsh is 2.3, we know the odds of the invasive plant being in a fen is 2.3 times that of a marsh.

The second is when the numerical predictor is standardized. If we standardize the temperature (aka, convert the scale to Z scores), we’ve removed the units.

We’ll get the same Z scores from degrees Fahrenheit as degrees Celsius, and the new odds ratio will be in terms of one standard deviation increases in temperature, rather than one degree.

Voila! Units removed.
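
For instance, refitting the honors model from the earlier examples with a standardized math score (a sketch; scale() centers the variable and divides by its sample standard deviation) gives an odds ratio per one-standard-deviation increase rather than per point:

fit.std <- glm(hon ~ scale(math), data = d, family = "binomial")
exp(coef(fit.std))["scale(math)"]   # odds ratio per 1 SD of math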





NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.