The essence of the problem addressed by logistic regression is the binary nature of the dependent variable. We want to estimate the probability that the dependent variable equals \(1\) (“success”), given the explanatory variables: \(p(Y=1|x_1,x_2,\cdots, x_p)\). The explanatory variables can be categorical or continuous; either is fine.

Logistic regression addresses the following issues:

  1. We need an upper bound at \(1\), since we are estimating a probability, but lines are not naturally bounded. So we transform the probability into odds:

\(\Large \text{odds}=\frac{p(Y=1)}{1-p(Y=1)}\).

In addition, this also helps turn the sigmoid curve of a typical cumulative probability distribution (e.g. the normal) into a linear relation.

  2. Now, when the probability goes down to zero, the odds also tend to zero; yet a line has no floor restriction either.

This problem is addressed by expressing the linear relation on a log scale: \(\log(\text{odds})\), which tends to \(-\infty\) as the odds tend to zero. This is the logit or log-odds function.
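A quick sanity check in base R, where qlogis is the built-in logit and plogis its inverse:

p <- c(0.001, 0.25, 0.5, 0.75, 0.999)
log(p / (1 - p))   # manual log-odds: -6.91 -1.10 0.00 1.10 6.91
qlogis(p)          # the same values via the built-in logit
plogis(qlogis(p))  # the inverse recovers the original probabilities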

The logistic regression model is therefore:

\(\Large \color{red}{\text{log}} \left[\color{blue}{\text{odds}\,(Y=1)}\right]=\color{red}{\text{log}}\left(\frac{\hat p\,(Y=1)}{1-\hat p\,(Y=1)}\right) = X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\cdots+ \beta_p x_p\)

Consequently,

\(\Large \color{blue}{\text{odds}\,(Y=1)} = \frac{p\,(Y=1)}{1\,-\,p\,(Y=1)} = e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p} = e^{\,\beta_0}\,e^{\,\beta_1x_1}\,e^{\,\beta_2x_2}\,\cdots\,e^{\,\beta_px_p}\)

Therefore a unit increase in \(x_1\) multiplies the odds by \(e^{\,\beta_1}\). The factors \(e^{\,\beta_i}\) are the ODDS RATIOs, while the coefficients \(\beta_i\) themselves are the LOG ODDS-RATIOs.



For CATEGORICAL VARIABLES, the ODDS RATIOs denote how much the presence, as opposed to the absence, of a factor level multiplies the odds of a positive value of the dependent variable.

For CONTINUOUS VARIABLES, the odds ratio tells us how much the odds are multiplied with each one-unit change in the independent variable. To calculate the change in odds across a difference of several units, raise the OR to the power of that difference, as in the sketch below.
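For instance, with a hypothetical coefficient \(\beta_1 = 0.1\) (made up purely for illustration), a five-unit increase multiplies the odds by \((e^{0.1})^5 = e^{0.5} \approx 1.65\):

beta1 <- 0.1       # hypothetical coefficient of a continuous regressor
exp(beta1)         # OR for a 1-unit change: 1.1052
exp(beta1)^5       # odds multiplier for a 5-unit change: 1.6487
exp(beta1 * 5)     # identical, since (e^b)^d = e^(b*d)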

To get the exponentiated coefficients and their confidence intervals in R, we bind them column-wise with cbind:

exp(cbind(OR = coef(mylogit), confint(mylogit)))




Translation into probabilities, solving the odds equation above for \(p(Y = 1)\):


\(\large p(Y = 1) = (1-p(Y=1))\times\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}\)


\(\large p(Y = 1)\,\times (1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}) = e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}\)


\(\Large \color{green}{\text{p(Y = 1)}} = \frac{\color{blue}{\text{odds(Y=1)}}}{1\,+\,\color{blue}{\text{odds(Y=1)}}}=\frac{e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}=\frac{1}{1 \,+\,e^{-(\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}}\)
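In R all of these forms fall out of a fitted model; a minimal sketch, where mylogit stands for any glm fit with family = "binomial":

eta <- predict(mylogit, type = "link")  # the linear predictor (log-odds)
exp(eta) / (1 + exp(eta))               # odds / (1 + odds)
plogis(eta)                             # the same, via the logistic CDF
predict(mylogit, type = "response")     # the same, computed by predict()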



TESTING A LOGISTIC REGRESSION MODEL:


  1. GOODNESS-OF-FIT: Model calibration

The Hosmer-Lemeshow goodness-of-fit test divides the predicted probabilities into groups, typically deciles (in R the predicted probabilities come from fitted, whereas predict defaults to the link scale), and runs a chi-square test comparing the predicted probabilities in each group with the observed fraction of cases that have \(Y=1\) in that group. A sketch follows.
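One way to run it is hoslem.test from the ResourceSelection package (assuming it is installed), with mylogit a fitted binomial glm as above:

library(ResourceSelection)
hoslem.test(mylogit$y, fitted(mylogit), g = 10)  # observed 0/1 outcomes vs fitted probabilities, 10 groups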

  2. C-STATISTIC or AUC (area under the ROC curve): Discriminatory power of the model

It is based on sweeping a classification threshold from \(0\) to \(1\), calculating the sensitivity and specificity at each threshold to trace the ROC curve, and then measuring the area under it, as sketched below.
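A minimal sketch with the pROC package (assuming it is installed):

library(pROC)
roc_obj <- roc(mylogit$y, fitted(mylogit))  # observed outcomes vs fitted probabilities
auc(roc_obj)                                # the C-statistic
plot(roc_obj)                               # the ROC curve itself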



INTERPRETATION OF LOGISTIC REGRESSION RESULTS:


QUESTION:


I ran a logistic regression of acceptance into college against SAT scores and family / ethnic background. The data are fictional. The question focuses on the calculation and interpretation of odds ratios, leaving the SAT scores aside for simplicity.

The variables are Accepted (0 or 1) and Background (“red” or “blue”). I set up the data so that people of “red” background were more likely to get in:

fit <- glm(Accepted ~ Background, data=dat, family="binomial")
exp(cbind(Odds_Ratio_RedvBlue = coef(fit), confint(fit)))

              Odds_Ratio_RedvBlue     2.5 %    97.5 %
(Intercept)             0.7088608 0.5553459 0.9017961
Backgroundred           2.4480042 1.7397640 3.4595454

Questions:

  1. Is 0.7 the odds ratio of a person of “blue” background being accepted? I’m asking this because I also get 0.7 for “Backgroundblue” if instead I run the following code:

    fit <- glm(Accepted~Background - 1, data=dat, family="binomial")
    exp(cbind(OR=coef(fit), confint(fit)))
  2. Shouldn’t the odds ratio of “red” being accepted (\(\rm Accepted/Red:Accepted/Blue\)) be just the reciprocal (\(\rm OddsBlue = 1 / OddsRed\))?

ANSWER:

I’ve been working on answering my question by calculating manually the odds and odds ratios:

Acceptance   blue            red            Grand Total
0              158           102                260
1              112           177                289
Total          270           279                549
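Incidentally, the data frame dat is not shown in the post, but one with exactly these counts can be rebuilt from the table, making the glm calls reproducible:

dat <- data.frame(
  Background = rep(c("blue", "blue", "red", "red"), times = c(158, 112, 102, 177)),
  Accepted   = rep(c(0, 1, 0, 1), times = c(158, 112, 102, 177))
)
table(dat$Accepted, dat$Background)  # reproduces the contingency table above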

So the Odds Ratio of getting into the school of Red over Blue is:

\[ \frac{\rm Odds\ Accept\ If\ Red}{\rm Odds\ Accept\ If\ Blue} = \frac{177/102}{112/158} = \frac{1.7353}{0.7089} = 2.448 \]

And this is the Backgroundred value returned by:

fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(Odds_and_OR=coef(fit), confint(fit)))

            Odds_and_OR     2.5 %    97.5 %
(Intercept)   0.7088608 0.5553459 0.9017961
Backgroundred 2.4480042 1.7397640 3.4595454

At the same time, the (Intercept) corresponds to the denominator of that odds ratio, which is exactly the odds of getting in for a ‘blue’ family background: \(112/158 = 0.7089\).

If instead, I run:

fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(Odds=coef(fit2), confint(fit2)))

                        Odds            2.5 %      97.5 %
Backgroundblue     0.7088608        0.5553459   0.9017961
Backgroundred      1.7352941        1.3632702   2.2206569

The returned values are precisely the odds of getting in for ‘blue’ (Backgroundblue, 0.7089) and for ‘red’ (Backgroundred, 1.7353). There is no odds ratio there, so the two returned values are not expected to be reciprocals: \(1/1.7353 = 0.5763 \neq 0.7089\).

Finally, how do we read the results when the categorical regressor has three levels?

Same manual versus [R] calculation:

I created a different fictitious data set with the same premise, but this time there were three ethnic backgrounds: “red”, “blue” and “orange”, and ran the same sequence:

First, the contingency table:

Acceptance  blue    orange  red   Total
0             86        65  130     281
1             64        42  162     268
Total        150       107  292     549
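As before, dat can be rebuilt from this table:

dat <- data.frame(
  Background = rep(c("blue", "orange", "red"), times = c(150, 107, 292)),
  Accepted   = rep(c(0, 1, 0, 1, 0, 1), times = c(86, 64, 65, 42, 130, 162))
)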

And calculated the odds of getting in for each ethnic group:

\(\rm Odds_{blue} = \frac{64}{86} = 0.7442 \qquad Odds_{orange} = \frac{42}{65} = 0.6462 \qquad Odds_{red} = \frac{162}{130} = 1.2462\)

As well as the different Odds Ratios:

\(\rm OR_{orange\,v\,blue} = \frac{0.6462}{0.7442} = 0.8683 \qquad OR_{red\,v\,blue} = \frac{1.2462}{0.7442} = 1.6745\)

And proceeded with the now routine logistic regression followed by exponentiation of coefficients:

fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit), confint(fit)))

                      ODDS     2.5 %   97.5 %
(Intercept)      0.7441860 0.5367042 1.026588
Backgroundorange 0.8682692 0.5223358 1.437108
Backgroundred    1.6745192 1.1271430 2.497853

This yields the odds of getting in for “blues” as the (Intercept), the Odds Ratio of orange versus blue in Backgroundorange, and the OR of red versus blue in Backgroundred.

On the other hand, the regression without intercept predictably returned just the three independent odds:

fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit2), confint(fit2)))

                      ODDS     2.5 %    97.5 %
Backgroundblue   0.7441860 0.5367042 1.0265875
Backgroundorange 0.6461538 0.4354366 0.9484999
Backgroundred    1.2461538 0.9900426 1.5715814
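As a final check, the Odds Ratios from the first parameterization are simply ratios of these odds:

odds <- exp(coef(fit2))
odds["Backgroundorange"] / odds["Backgroundblue"]  # 0.8683, the OR of orange v blue
odds["Backgroundred"] / odds["Backgroundblue"]     # 1.6745, the OR of red v blue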
