NOTE: There is a companion post on ROC curves and AUC here.
We want to estimate the probability of success / failure of the dependent variable \((\Pr(Y=1))\), given the explanatory variables: \(\Pr(Y=1\mid X_1,X_2,\cdots, X_n)\). The explanatory variables can be categorical or continuous - it doesn’t matter.
Logistic regression works with the odds of the outcome rather than with the probability itself:
\[ \text{odds}=\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\tag 1\]
In addition, this helps turn the sigmoid curve of a typical cumulative probability distribution (e.g. the normal) into a linear relation.
The logistic regression model is therefore:
\[\text{log odds of disease}=\color{magenta}{\text{logit}}\left(P(Y=1)\right)=\color{red}{\log} \left[\color{blue}{\text{odds(P(Y=1))}}\right]=\color{red}{\log}\left(\frac{ P\,(Y=1)}{1- P\,(Y=1)}\right) = X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 +\cdots+ \beta_k x_k\]
Consequently,
\[\color{blue}{\text{odds(Y=1)}} = \frac{\Pr\,(Y=1)}{1\,-\,\Pr\,(Y=1)} = e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k} = e^{\,\beta_0}\,e^{\,\beta_1x_1}\,e^{\,\beta_2x_2}\,\cdots\,e^{\,\beta_kx_k}\tag 2\]
Therefore a unit increase in \(x_1\) multiplies the odds by \(\large e^{\,\beta_1}\).
The factors \(\large e^{\,\beta_i}\) are the ODDS RATIOS.
On the other hand, the coefficients \(\beta_i\) are the LOG ODDS-RATIOS:
Fixing the covariate \(x_2 =k\) (any given value):
\[\begin{align} \text{log odds of disease } &=\alpha + \beta_1 x_1 + \beta_2 k \\[2ex] \text{odds of disease } &= e^{\alpha + \beta_1 x_1 + \beta_2 k} \end{align}\]
Now for \(x_2=k + 1:\)
\[\begin{align} \text{log odds of disease } &=\alpha + \beta_1 x_1 + \beta_2 (k+1) =\alpha + \beta_1 x_1 + \beta_2 k+\beta_2 \\[2ex] \text{odds of disease } &= e^{\alpha + \beta_1 x_1 + \beta_2 k + \beta_2} \end{align}\]
The odds ratio going from \(k\) to \(k+1\) is
\[\begin{align}\text{Odds Ratio (OR)}&=\frac{\text{odds when }x_2=k+1}{\text{odds when }x_2=k}=\frac{e^{\alpha + \beta_1 x_1 + \beta_2 k + \beta_2}}{e^{\alpha + \beta_1 x_1 + \beta_2 k}}=e^{\beta_2}\\[2ex] \log\left(\text{Odds Ratio (OR)}\right)&= \beta_2 \end{align}\]
Therefore \(e^{\beta_2}\) is the relative increase in the odds of disease going from \(x_2=k\) to \(x_2=k+1\), holding \(x_1\) fixed (or adjusting for \(x_1\)). For every increase of \(1\) in \(x_2\), the odds of disease increase by a factor of \(e^{\beta_2}.\) Generally, if you increase \(x_2\) from \(k\) to \(k + \Delta\), then
\[\text{OR}= \frac{\text{odds when }x_2 = k +\Delta}{\text{odds when }x_2=k}=e^{\beta_2\Delta}=\left(e^{\beta_2}\right)^\Delta\]
From equations \((1)\) and \((2)\):
\[\begin{align} \Pr(Y = 1) &= (1-\Pr(Y=1))\times\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}\\[2ex] e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}&= \Pr(Y = 1)\,\times (1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}) \\[2ex] \color{blue}{\Pr(Y = 1)} = \frac{\color{blue}{\text{odds(Y=1)}}}{1\,+\,\color{blue}{\text{odds(Y=1)}}}&=\frac{e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}=\frac{1}{1 \,+\,e^{-(\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} \end{align}\]
For CATEGORICAL VARIABLES, the ODDS RATIO tells us how much the presence (versus absence) of a factor level multiplies the odds of a positive dependent variable.
For CONTINUOUS VARIABLES, the odds ratio tells us how much the odds are multiplied for a one-unit increase in the independent variable. To calculate the change in odds for a larger difference, raise the OR to the power of the difference between the observations.
To get the exponentiated coefficients and their confidence intervals in R, we use cbind to bind the coefficients and confidence intervals column-wise:
exp(cbind(OR = coef(mylogit), confint(mylogit)))
See here.
OUTPUT IN R:
f = glm(am ~ mpg, 'binomial', mtcars)
# The linear part applied to values of i.v.'s:
summary(predict(f))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.4104 -1.8676 -0.7086 -0.4351 0.3967 3.8047
# Pr(Y = 1 | x1, x2,...)
summary(fitted.values(f))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03197 0.13388 0.32991 0.40625 0.59790 0.97822
# Pr(Y = 1 | x1, x2,...) = exp(predict(f)) / (1 + exp(predict(f))) = 1 / (1 + exp(-predict(f)))
all.equal(exp(predict(f)) / (1 + exp(predict(f))), fitted.values(f))
## [1] TRUE
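To tie this back to the odds-ratio algebra above, here is a quick sketch using the mtcars fit f; the helper odds.at is just for illustration. A one-unit increase in mpg multiplies the odds by \(e^{\,\beta_1}\), and a \(\Delta\)-unit increase multiplies them by \(\left(e^{\,\beta_1}\right)^\Delta\):
# Sketch: odds implied by the fit f at a given mpg
odds.at <- function(mpg) exp(coef(f)[1] + coef(f)[2] * mpg)
odds.at(21) / odds.at(20)     # ratio of odds for a one-unit increase in mpg ...
exp(coef(f)["mpg"])           # ... equals exp(beta_mpg)
odds.at(25) / odds.at(20)     # ratio of odds for a five-unit increase ...
exp(coef(f)["mpg"])^5         # ... equals exp(beta_mpg)^5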
There is nothing better than simulating data to see how the coefficients are recovered by a logistic model (from here):
set.seed(1)
n <- 100
gender <- sample(c(0,1), size = n, replace = TRUE)
age <- round(runif(n, 18, 80))
beta0 <- - 10
beta1 <- 2.5
beta2 <- 0.4
linear <- beta0 + beta1 * gender + beta2 * age
p <- exp(linear) / (1 + exp(linear))
y <- rbinom(n, size = 1, prob = p)
mod <- glm(y ~ gender + age, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = y ~ gender + age, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82870 0.00040 0.00431 0.05913 1.37965
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.9962 4.5019 -2.220 0.0264 *
## gender 2.5084 1.4183 1.769 0.0770 .
## age 0.3697 0.1552 2.382 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 50.728 on 99 degrees of freedom
## Residual deviance: 22.084 on 97 degrees of freedom
## AIC: 28.084
##
## Number of Fisher Scoring iterations: 9
QUESTION:
(Initially posted here)
I ran a logistic regression of acceptance into college against SAT scores and family / ethnic background. The data are fictional. The question focuses on the calculation and interpretation of odds ratios, leaving the SAT scores aside for simplicity.
The variables are Accepted (0 or 1) and Background (“red” or “blue”). I set up the data so that people of “red” background were more likely to get in:
fit <- glm(Accepted ~ Background, data=dat, family="binomial")
exp(cbind(Odds_Ratio_RedvBlue = coef(fit), confint(fit)))
Odds_Ratio_RedvBlue 2.5 % 97.5 %
(Intercept) 0.7088608 0.5553459 0.9017961
Backgroundred 2.4480042 1.7397640 3.4595454
Questions:
Is 0.7 the odds ratio of a person of “blue” background being accepted? I ask because I also get 0.7 for Backgroundblue if instead I run the following code:
fit <- glm(Accepted~Background - 1, data=dat, family="binomial")
exp(cbind(OR=coef(fit), confint(fit)))
Shouldn’t the odds ratio of “red” being accepted (\(\rm Accepted/Red:Accepted/Blue\)) just be the reciprocal (\(\rm OddsBlue = 1 / OddsRed\))?
ANSWER:
I’ve been working on answering my question by calculating manually the odds and odds ratios:
Acceptance blue red Grand Total
0 158 102 260
1 112 177 289
Total 270 279 549
So the Odds Ratio of getting into the school of Red over Blue is:
\[ \frac{\rm Odds\ Accept\ If\ Red}{\rm Odds\ Accept\ If\ Blue} = \frac{^{177}/_{102}}{^{112}/_{158}} = \frac {1.7353}{0.7089} = 2.448 \]
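These manual figures are easy to reproduce in R directly from the cell counts in the table (a quick sketch; the dat data frame itself is not needed):
# Odds of acceptance for each group, and the odds ratio, from the table counts
odds.red  <- 177 / 102
odds.blue <- 112 / 158
c(odds.blue = odds.blue, odds.red = odds.red, OR.red.v.blue = odds.red / odds.blue)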
And this is the Backgroundred value returned by:
fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(Odds_and_OR=coef(fit), confint(fit)))
Odds_and_OR 2.5 % 97.5 %
(Intercept) 0.7088608 0.5553459 0.9017961
Backgroundred 2.4480042 1.7397640 3.4595454
At the same time, the (Intercept) corresponds to the denominator of the odds ratio, which is exactly the odds of getting in for a ‘blue’ family background: \(112/158 = 0.7089\).
If instead, I run:
fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(Odds=coef(fit2), confint(fit2)))
Odds 2.5 % 97.5 %
Backgroundblue 0.7088608 0.5553459 0.9017961
Backgroundred 1.7352941 1.3632702 2.2206569
The returned values are precisely the odds of getting in for ‘blue’ (Backgroundblue, 0.7089) and the odds of getting in for ‘red’ (Backgroundred, 1.7353). No odds ratio there. Therefore the two returned values are not expected to be reciprocals of each other.
Finally, how do we read the results if the categorical regressor has three levels?
Same manual versus [R] calculation:
I created a different fictitious data set with the same premise, but this time there were three ethnic backgrounds: “red”, “blue” and “orange”, and ran the same sequence:
First, the contingency table:
Acceptance blue orange red Total
0 86 65 130 281
1 64 42 162 268
Total 150 107 292 549
And calculated the odds of getting in for each ethnic group: \(\text{odds}_\text{blue} = 64/86 = 0.744,\) \(\text{odds}_\text{orange} = 42/65 = 0.646\) and \(\text{odds}_\text{red} = 162/130 = 1.246.\)
As well as the different odds ratios against “blue”: \(\text{OR}_\text{orange v blue} = 0.646/0.744 = 0.868\) and \(\text{OR}_\text{red v blue} = 1.246/0.744 = 1.675.\)
And proceeded with the now routine logistic regression followed by exponentiation of coefficients:
fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit), confint(fit)))
ODDS 2.5 % 97.5 %
(Intercept) 0.7441860 0.5367042 1.026588
Backgroundorange 0.8682692 0.5223358 1.437108
Backgroundred 1.6745192 1.1271430 2.497853
This yields the odds of getting in for “blues” as the (Intercept), the odds ratio of Orange versus Blue in Backgroundorange, and the OR of Red versus Blue in Backgroundred.
On the other hand, the regression without intercept predictably returned just the three independent odds:
fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit2), confint(fit2)))
ODDS 2.5 % 97.5 %
Backgroundblue 0.7441860 0.5367042 1.0265875
Backgroundorange 0.6461538 0.4354366 0.9484999
Backgroundred 1.2461538 0.9900426 1.5715814
END OF THE STACK OVERFLOW POST
Here is an example extracted from this site:
# Define an explanatory variable with two levels:
parentsmoke <- as.factor(c(1,0))
# create response vector:
response <- cbind(yes=c(816,188), no=c(3203,1168))
response
## yes no
## [1,] 816 3203
## [2,] 188 1168
smoke.logistic <- glm(response~parentsmoke, family=binomial(link=logit))
smoke.logistic
##
## Call: glm(formula = response ~ parentsmoke, family = binomial(link = logit))
##
## Coefficients:
## (Intercept) parentsmoke1
## -1.8266 0.4592
##
## Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
## Null Deviance: 29.12
## Residual Deviance: -3.717e-13 AIC: 19.24
summary(smoke.logistic)
##
## Call:
## glm(formula = response ~ parentsmoke, family = binomial(link = logit))
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
## parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.9121e+01 on 1 degrees of freedom
## Residual deviance: -3.7170e-13 on 0 degrees of freedom
## AIC: 19.242
##
## Number of Fisher Scoring iterations: 2
This information gives us the fitted model:
\[\text{logit}(\hat\pi_i)=\log\frac{\hat\pi_i}{1-\hat\pi_i}=\hat\beta_0+\hat\beta_1X_i=-1.827+0.459\,(\text{parents smoke = 1})\] where parentsmoke is a dummy variable (i.e. a design variable) that takes the value \(1\) if at least one parent smokes and \(0\) if neither smokes.
parent smoking = 0 is the baseline. We are modeling the probability of “children smoking”. The estimated \(\hat\beta_0=-1.827\) with a standard error of \(0.078\) says that the log odds of a child smoking versus not smoking when neither parent smokes (the baseline level) is \(-1.827,\) and it is statistically significant.
The estimated \(\hat\beta_1=0.459\) with a standard error of \(0.088\) says that the log odds ratio of a child smoking versus not smoking, comparing at least one smoking parent to neither parent smoking (the baseline level), is \(0.459,\) and it is statistically significant. \(\exp(0.459)=1.58\) is the estimated odds ratio.
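Exponentiating the fitted coefficients gives the same numbers on the odds scale (a quick sketch using the smoke.logistic fit above):
exp(coef(smoke.logistic))
# (Intercept): 188/1168 = 0.161, the odds of a child smoking when neither parent smokes
# parentsmoke1: 1.58, the odds ratio for at least one smoking parent versus none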
Why not just run a chi-square test?
prop.test(response, correct = F)
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: response
## X-squared = 27.677, df = 1, p-value = 1.434e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.04218944 0.08659558
## sample estimates:
## prop 1 prop 2
## 0.2030356 0.1386431
It does detect a difference in proportions, but it doesn’t address the size and direction of the effect on the odds scale (the odds ratio and its confidence interval), and it doesn’t extend naturally to several categorical or continuous predictors the way the logistic model does.
Here is the same model splitting mothers and fathers as predictors:
kids.smoke <- matrix(c(188,599,816-205,1168,1456,3203-1456), nrow = 3)
dimnames(kids.smoke) = list(Parents = c("None", "Mother", "Father"), Kids = c("Smokers", "Non-smokers"))
kids.smoke
## Kids
## Parents Smokers Non-smokers
## None 188 1168
## Mother 599 1456
## Father 611 1747
bad.example <- as.factor(c(0,"1Mother","2Father"))
smoke.logistic <- glm(kids.smoke~bad.example, family=binomial(link=logit))
smoke.logistic
##
## Call: glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
##
## Coefficients:
## (Intercept) bad.example1Mother bad.example2Father
## -1.8266 0.9384 0.7760
##
## Degrees of Freedom: 2 Total (i.e. Null); 0 Residual
## Null Deviance: 119.2
## Residual Deviance: -4.341e-13 AIC: 28.77
summary(smoke.logistic)
##
## Call:
## glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
##
## Deviance Residuals:
## [1] 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82661 0.07858 -23.244 <2e-16 ***
## bad.example1Mother 0.93842 0.09237 10.160 <2e-16 ***
## bad.example2Father 0.77605 0.09157 8.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.1915e+02 on 2 degrees of freedom
## Residual deviance: -4.3410e-13 on 0 degrees of freedom
## AIC: 28.768
##
## Number of Fisher Scoring iterations: 2
An example with credit defaults with some calculations extracted from here:
set.seed(123)
gender <- factor(round(runif(100,0,1)),
labels=c("female", "male"))
age <- round(runif(100,20,80))
profession <- factor(round(runif(100,1,4)),
labels=c("student", "worker", "teacher", "self-employed"))
# Let's make males, teachers and self-employed more likely to default:
credit.default <- round(((as.numeric(gender)/2 + as.numeric(profession)/4)/2) * runif(100,0.5,1))
df <- data.frame(credit.default,gender,age,profession)
head(df)
## credit.default gender age profession
## 1 0 female 56 worker
## 2 1 male 40 self-employed
## 3 1 female 49 teacher
## 4 1 male 77 teacher
## 5 1 male 49 worker
## 6 1 female 73 self-employed
catlogis <- glm(credit.default ~ gender+age+profession, df, family=binomial(link="logit"))
summary(catlogis)
##
## Call:
## glm(formula = credit.default ~ gender + age + profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2580 -0.9295 -0.2042 0.5361 1.5200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.860e+00 1.327e+00 -2.910 0.003616 **
## gendermale 3.084e+00 7.844e-01 3.931 8.46e-05 ***
## age 9.465e-07 1.692e-02 0.000 0.999955
## professionworker 6.325e-01 8.211e-01 0.770 0.441075
## professionteacher 3.245e+00 1.003e+00 3.236 0.001214 **
## professionself-employed 4.464e+00 1.193e+00 3.740 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.663 on 99 degrees of freedom
## Residual deviance: 88.237 on 94 degrees of freedom
## AIC: 100.24
##
## Number of Fisher Scoring iterations: 5
require(pROC)
auc(credit.default, predict(catlogis, type="response"), quiet=T)
## Area under the curve: 0.8517
We can generate an ANOVA (analysis of deviance) table, paying attention to the difference between the Null deviance and the Resid. Deviance to see which variables reduce the residual deviance in comparison to the model with only an intercept. The p-values also help to select variables. We want a drop in the deviance and in the AIC.
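A quick sketch of that comparison with the sequential analysis-of-deviance table for the fit above, adding a chi-square test on each drop in deviance:
# Sequential drop in residual deviance as each predictor is added to catlogis
anova(catlogis, test = "Chisq")
The refit below then drops age, whose coefficient above was essentially zero.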
catlogis <- glm(credit.default ~ gender+profession, df, family=binomial(link="logit"))
summary(catlogis)
##
## Call:
## glm(formula = credit.default ~ gender + profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2580 -0.9295 -0.2042 0.5361 1.5200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8604 0.9996 -3.862 0.000113 ***
## gendermale 3.0836 0.7841 3.933 8.39e-05 ***
## professionworker 0.6325 0.8202 0.771 0.440570
## professionteacher 3.2448 1.0028 3.236 0.001213 **
## professionself-employed 4.4636 1.1873 3.759 0.000170 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.663 on 99 degrees of freedom
## Residual deviance: 88.237 on 95 degrees of freedom
## AIC: 98.237
##
## Number of Fisher Scoring iterations: 5
auc(df$credit.default, predict(catlogis, type="response"), quiet=T)
## Area under the curve: 0.8527
genlogis <- glm(credit.default ~ gender, df, family=binomial(link="logit"))
summary(genlogis)
##
## Call:
## glm(formula = credit.default ~ gender, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4261 -0.7502 -0.7502 0.9476 1.6765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1239 0.3193 -3.520 0.000431 ***
## gendermale 1.6919 0.4405 3.841 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 120.56 on 98 degrees of freedom
## AIC: 124.56
##
## Number of Fisher Scoring iterations: 4
auc(df$credit.default, predict(genlogis, type="response"), quiet=T)
## Area under the curve: 0.6997
fitted.results <- predict(genlogis, type="response")
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != df$credit.default)
print(paste('Accuracy', 1-misClasificError))
## [1] "Accuracy 0.7"
library(ROCR)
pr <- prediction(predict(genlogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, main= 'ROC curve',
colorize=TRUE,
lwd= 3)
plot(prf,
lty=3,
col="grey78",
add=TRUE)
auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])
## [1] 0.6997144
proflogis <- glm(credit.default ~ profession, df, family=binomial(link="logit"))
summary(proflogis)
##
## Call:
## glm(formula = credit.default ~ profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7941 -0.7710 -0.6231 1.0508 1.8626
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5404 0.6362 -2.421 0.01547 *
## professionworker 0.4796 0.7445 0.644 0.51950
## professionteacher 1.8458 0.7272 2.538 0.01114 *
## professionself-employed 2.9267 0.9063 3.229 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 115.75 on 96 degrees of freedom
## AIC: 123.75
##
## Number of Fisher Scoring iterations: 4
auc(df$credit.default, predict(proflogis, type="response"), quiet=T)
## Area under the curve: 0.7438
pr <- prediction(predict(proflogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf,
colorize=TRUE,
lwd= 3,
main= "ROC curve")
plot(prf,
lty=3,
col="grey78",
add=TRUE)
auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])
## [1] 0.743778
Another example from this video:
set.seed(420)
n.samples <- 100
weight <- sort(rnorm(n.samples,172,29))
obese <- ifelse(runif(n.samples) < (rank(weight)/100), yes=1, no=0)
plot(weight,obese, pch=19, col='tan4', cex=.7)
glm.fit <- glm(obese ~ weight, family="binomial")
lines(weight, glm.fit$fitted.values)
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T)
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, plot = T)
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 0.8291
roc.info <- roc(obese, glm.fit$fitted.values)
roc.df <- data.frame(ttp=roc.info$sensitivities*100, fpp=(1-roc.info$specificities) *100, thresholds=roc.info$thresholds)
head(roc.df)
## ttp fpp thresholds
## 1 100 100.00000 -Inf
## 2 100 97.77778 0.01349011
## 3 100 95.55556 0.03245008
## 4 100 93.33333 0.05250145
## 5 100 91.11111 0.07017225
## 6 100 88.88889 0.08798755
tail(roc.df)
## ttp fpp thresholds
## 96 9.090909 0 0.9275222
## 97 7.272727 0 0.9371857
## 98 5.454545 0 0.9480358
## 99 3.636364 0 0.9648800
## 100 1.818182 0 0.9735257
## 101 0.000000 0 Inf
Notice that in the first row of head(roc.df), with a threshold for labeling “obese” of \(-\infty,\) the ttp (true positive percentage) is \(100\%,\) and so is the fpp (false positive percentage); it corresponds to the upper right-hand part of the ROC curve. The bottom-left part of the curve corresponds to the last row of tail(roc.df).
We can isolate part of the data, for example with ttp between \(60\) and \(80\), to select an optimal threshold value:
roc.df[roc.df$ttp > 60 & roc.df$ttp < 80,]
## ttp fpp thresholds
## 42 78.18182 35.55556 0.5049310
## 43 78.18182 33.33333 0.5067116
## 44 78.18182 31.11111 0.5166680
## 45 76.36364 31.11111 0.5287933
## 46 76.36364 28.88889 0.5429351
## 47 76.36364 26.66667 0.5589494
## 48 74.54545 26.66667 0.5676342
## 49 74.54545 24.44444 0.5776086
## 50 74.54545 22.22222 0.5946054
## 51 72.72727 22.22222 0.6227449
## 52 70.90909 22.22222 0.6398136
## 53 69.09091 22.22222 0.6441654
## 54 67.27273 22.22222 0.6556705
## 55 67.27273 20.00000 0.6683618
## 56 67.27273 17.77778 0.6767661
## 57 65.45455 17.77778 0.6802060
## 58 65.45455 15.55556 0.6831936
## 59 65.45455 13.33333 0.6917225
## 60 63.63636 13.33333 0.6975300
## 61 61.81818 13.33333 0.6982807
We can plot the ROC differently:
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
xlab='False positive percentage', ylab='True positive percentage',
col="darkslateblue",lwd=4, print.auc=T)
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, percent = T, plot = T, legacy.axes = T, xlab = "False positive percentage", ylab = "True positive percentage", col = "darkslateblue", lwd = 4, print.auc = T)
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 82.91%
To calculate the partial AUC at the beginning of the curve, where only a small number of false positives is allowed:
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
xlab='False positive percentage', ylab='True positive percentage',
col='darkslateblue',lwd=4, print.auc=T, print.auc.x=45, partial.auc=c(100,90),
auc.polygon=T, auc.polygon.col='midnightblue', quiet=T)
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, percent = T, quiet = T, plot = T, legacy.axes = T, xlab = "False positive percentage", ylab = "True positive percentage", col = "darkslateblue", lwd = 4, print.auc = T, print.auc.x = 45, partial.auc = c(100, 90), auc.polygon = T, auc.polygon.col = "midnightblue")
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Partial area under the curve (specificity 100%-90%): 4.727%
Can we bootstrap?
ci.auc(roc(obese, glm.fit$fitted.values), conf.level=0.95, method=
"bootstrap", boot.n = 2000, boot.stratified = TRUE, reuse.auc=TRUE,
progress = getOption("pROCProgress")$name, parallel=FALSE)
## 95% CI: 0.7438-0.8954 (2000 stratified bootstrap replicates)
The Hosmer-Lemeshow goodness-of-fit test divides the predicted probabilities into bins (in R these probabilities are given by fitted, as opposed to predict, which defaults to the linear predictor), and runs a chi-square test comparing them to the percentage of cases that have \(Y=1\) among those with predicted probability within each interval. See here:
The Hosmer-Lemeshow goodness of fit test is based on dividing the sample up according to their predicted probabilities, or risks. Specifically, based on the estimated parameter values \(\hat{\beta}_{0},\hat{\beta}_{1},..,\hat{\beta}_{p},\) for each observation in the sample the probability that \(Y=1\) is calculated, based on each observation’s covariate values:
\[\hat{\pi} = \frac{\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}{1+\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}\]
The observations in the sample are then split into \(g\) groups (we come back to choice of g later) according to their predicted probabilities. Suppose (as is commonly done) that \(g=10.\) Then the first group consists of the observations with the lowest \(10\%\) predicted probabilities. The second group consists of the \(10\%\) of the sample whose predicted probabilities are next smallest, etc etc.
Suppose for the moment that all of the observations in the first group had a predicted probability of \(0.1.\) Then, if our model is correctly specified, we would expect the proportion of these observations who have \(Y=1\) to be \(10\%.\) Of course, even if the model is correctly specified, the observed proportion will deviate to some extent from 10%, but not by too much. If the proportion of observations with \(Y=1\) in the group were instead \(90\%,\) this is suggestive that our model is not accurately predicting probability (risk), i.e. an indication that our model is not fitting the data well.
In practice, as soon as some of our model covariates are continuous, each observation will have a different predicted probability, and so the predicted probabilities will vary in each of the groups we have formed. To calculate how many Y=1 observations we would expect, the Hosmer-Lemeshow test takes the average of the predicted probabilities in the group, and multiplies this by the number of observations in the group. The test also performs the same calculation for \(Y=0,\) and then calculates a Pearson goodness of fit statistic
\[\sum^{1}_{k=0} \sum^{g}_{l=1} \frac{(o_{kl}-e_{kl})^{2}}{e_{kl}}\]
where \(o_{0l}\) denotes the number of observed \(Y=0\) observations in the \(l\)-th group, \(o_{1l}\) denotes the number of observed \(Y=1\) observations in the \(l\)-th group, and \(e_{0l}\) and \(e_{1l}\) similarly denote the expected numbers of zeros and ones.
In a 1980 paper Hosmer-Lemeshow showed by simulation that (provided \(p+1<g\)) their test statistic approximately followed a chi-squared distribution on \(g-2\) degrees of freedom, when the model is correctly specified. This means that given our fitted model, the p-value can be calculated as the right hand tail probability of the corresponding chi-squared distribution using the calculated test statistic. If the p-value is small, this is indicative of poor fit.
It should be emphasized that a large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favor of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.
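Below is a minimal hand-rolled sketch of that statistic, applied to the obese ~ weight fit from earlier (obese and glm.fit are assumed to still be in the workspace); hl.test is a hypothetical helper written just for these notes, not a function from any package, and packaged implementations exist elsewhere.
# Hosmer-Lemeshow statistic by hand: g groups formed by quantiles of the fitted probabilities
hl.test <- function(obs, prob, g = 10) {
  breaks <- unique(quantile(prob, probs = seq(0, 1, length.out = g + 1)))
  grp    <- cut(prob, breaks = breaks, include.lowest = TRUE)
  n      <- tapply(obs, grp, length)
  o1     <- tapply(obs, grp, sum);  e1 <- tapply(prob, grp, sum)  # observed / expected Y = 1
  o0     <- n - o1;                 e0 <- n - e1                  # observed / expected Y = 0
  stat   <- sum((o1 - e1)^2 / e1 + (o0 - e0)^2 / e0)
  c(statistic = stat, df = length(n) - 2,
    p.value = pchisq(stat, df = length(n) - 2, lower.tail = FALSE))
}
hl.test(obese, glm.fit$fitted.values)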
The ROC curve is based on selecting different probability cut-offs from \(0\) to \(1\), calculating sensitivity and specificity at each cut-off, and then measuring the AUC. See here:
Our model or prediction rule is perfect at classifying observations if it has 100% sensitivity and 100% specificity. Unfortunately in practice this is (usually) not attainable. So how can we summarize the discrimination ability of our logistic regression model? For each observation, our fitted model can be used to calculate the fitted probabilities \(P(Y=1|X_{1},..,X_{p}).\) On their own, these don’t tell us how to classify observations as positive or negative. One way to create such a classification rule is to choose a cut-point \(c,\) and classify those observations with a fitted probability above \(c\) as positive and those at or below it as negative. For this particular cut-off, we can estimate the sensitivity by the proportion of observations with \(Y=1\) which have a predicted probability above \(c,\) and similarly we can estimate specificity by the proportion of \(Y=0\) observations with a predicted probability at or below \(c.\) If we increase the cut-point \(c,\) fewer observations will be predicted as positive. This will mean that fewer of the \(Y=1\) observations will be predicted as positive (reduced sensitivity), but more of the \(Y=0\) observations will be predicted as negative (increased specificity). In picking the cut-point, there is thus an intrinsic trade off between sensitivity and specificity. Now we come to the ROC curve, which is simply a plot of the values of sensitivity against one minus specificity, as the value of the cut-point \(c\) is increased from \(0\) through to \(1.\) Example:
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y~x, family="binomial")
predpr <- predict(mod,type=c("response"))
roccurve <- roc(y ~ predpr)
plot(roccurve)
A popular way of summarizing the discrimination ability of a model is to report the area under the ROC curve. We have seen that a model with discrimination ability has an ROC curve which goes closer to the top left hand corner of the plot, whereas a model with no discrimination ability has an ROC curve close to a \(45\) degree line. Thus the area under the curve ranges from \(1,\) corresponding to perfect discrimination, to \(0.5,\) corresponding to a model with no discrimination ability. The area under the ROC curve is also sometimes referred to as the c-statistic (c for concordance).
auc(roccurve)
## Area under the curve: 0.7463
The ROC curve (AUC) has a somewhat appealing interpretation. It turns out that the AUC is the probability that if you were to take a random pair of observations, one with \(Y=1\) and one with \(Y=0,\) the observation with \(Y=1\) has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations.
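A quick sketch of that pairwise (concordance) interpretation, using y and predpr from the simulation above and counting ties as one half; the result should match the AUC reported by auc(roccurve):
# Proportion of (Y=1, Y=0) pairs where the Y=1 case gets the higher predicted probability
p1 <- predpr[y == 1]
p0 <- predpr[y == 0]
mean(outer(p1, p0, ">") + 0.5 * outer(p1, p0, "=="))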
In the biomedical context of risk prediction modelling, the AUC has been criticized by some. In the risk prediction context, individuals have their risk of developing (for example) coronary heart disease over the next 10 years predicted. Thus a measure of discrimination which examines the predicted probability of pairs of individuals, one with \(Y=1\) and one with \(Y=0,\) does not really match the prospective risk prediction setting, where we do not have such pairs.
From here:
When a binary outcome variable is modeled using logistic regression, it is assumed that the logit transformation of the outcome variable has a linear relationship with the predictor variables. This makes the interpretation of the regression coefficients somewhat tricky. In this page, we will walk through the concept of odds ratio and try to interpret the logistic regression results using the concept of odds ratio in a couple of examples.
From probability to odds to log of odds:
Everything starts with the concept of probability. Let’s say that the probability of success of some event is .8. Then the probability of failure is 1 – .8 = .2. The odds of success are defined as the ratio of the probability of success over the probability of failure. In our example, the odds of success are .8/.2 = 4. That is to say that the odds of success are 4 to 1. If the probability of success is .5, i.e., 50-50 percent chance, then the odds of success is 1 to 1.
The transformation from probability to odds is a monotonic transformation, meaning the odds increase as the probability increases or vice versa. Probability ranges from 0 and 1. Odds range from 0 and positive infinity.
Below is a table of the transformation from probability to odds.
p odds
.001 .001001
.01 .010101
.15 .1764706
.2 .25
.25 .3333333
.3 .4285714
.35 .5384616
.4 .6666667
.45 .8181818
.5 1
.55 1.222222
.6 1.5
.65 1.857143
.7 2.333333
.75 3
.8 4
.85 5.666667
.9 9
.999 999
.9999 9999
The transformation from odds to log of odds is the log transformation. Again this is a monotonic transformation. That is to say, the greater the odds, the greater the log of odds and vice versa. The table below shows the relationship among the probability, odds and log of odds.
p odds logodds
.001 .001001 -6.906755
.01 .010101 -4.59512
.15 .1764706 -1.734601
.2 .25 -1.386294
.25 .3333333 -1.098612
.3 .4285714 -.8472978
.35 .5384616 -.6190392
.4 .6666667 -.4054651
.45 .8181818 -.2006707
.5 1 0
.55 1.222222 .2006707
.6 1.5 .4054651
.65 1.857143 .6190392
.7 2.333333 .8472978
.75 3 1.098612
.8 4 1.386294
.85 5.666667 1.734601
.9 9 2.197225
.999 999 6.906755
.9999 9999 9.21024
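Both tables can be reproduced in a couple of lines of R (a quick sketch):
# Probability, odds and log odds for the values tabulated above
p <- c(.001, .01, .15, .2, .25, .3, .35, .4, .45, .5,
       .55, .6, .65, .7, .75, .8, .85, .9, .999, .9999)
data.frame(p = p, odds = p / (1 - p), logodds = log(p / (1 - p)))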
Why do we take all the trouble doing the transformation from probability to log odds? One reason is that it is usually difficult to model a variable which has restricted range, such as probability. This transformation is an attempt to get around the restricted range problem. It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. Another reason is that among all of the infinitely many choices of transformation, the log of odds is one of the easiest to understand and interpret. This transformation is called logit transformation. The other common choice is the probit transformation, which will not be covered here.
A logistic regression model allows us to establish a relationship between a binary outcome variable and a group of predictor variables. It models the logit-transformed probability as a linear relationship with the predictor variables. More formally, let \(Y\) be the binary outcome variable indicating failure/success with \(\{0,1\}\) and \(p\) be the probability of \(Y\) being 1. Let \(x_1,x_2,\dots\) be a set of predictor variables. Then the logistic regression of \(Y\) on \(x_1,x_2,\dots\) estimates the parameter values via the maximum likelihood method for the following equation
\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.\]
Exponentiate and take the multiplicative inverse of both sides,
\[\frac{1-p}{p} = \frac{1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]
Partial out the fraction on the left-hand side of the equation and add one to both sides,
\[\frac{1}{p} = 1 + \frac{1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]
Change 1 to a common denominator,
\[\frac{1}{p} = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)+1}{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]
Finally, take the multiplicative inverse again to obtain the formula for the probability,
\[{p} = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}{1+\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}.\]
We are now ready for a few examples of logistic regressions. We will use this dataset. The data set has 200 observations and the outcome variable used will be hon, indicating whether a student is in an honors class or not. So our p = prob(hon=1). We will purposely ignore all the significance tests and focus on the meaning of the regression coefficients.
Let’s start with the simplest logistic regression, a model without any predictor variables. In an equation, we are modeling
\[\text{logit}(p)= β_0\]
require(RCurl)
url="https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/Honors%20math%20write%20gender"
d <- read.csv(url)
head(d)
## female read write math hon femalexmath
## 1 0 57 52 41 0 0
## 2 1 68 59 53 0 53
## 3 0 44 33 54 0 0
## 4 0 63 44 47 0 0
## 5 0 47 52 57 0 0
## 6 0 44 52 51 0 0
fit <- glm(hon~1,d,family='binomial')
summary(fit)
##
## Call:
## glm(formula = hon ~ 1, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7497 -0.7497 -0.7497 -0.7497 1.6772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1255 0.1644 -6.845 7.62e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.71 on 199 degrees of freedom
## Residual deviance: 222.71 on 199 degrees of freedom
## AIC: 224.71
##
## Number of Fisher Scoring iterations: 4
This means \(\log(p/(1-p)) = -1.1255.\) What is \(p\) here? It turns out that \(p\) is the overall probability of being in the honors class (hon = 1). Let’s take a look at the frequency table for hon.
addmargins(table(t(d$hon)))
##
## 0 1 Sum
## 151 49 200
So \(p = 49/200 = .245.\) The odds are \(.245/(1-.245) = .3245\) and the log of the odds (logit) is \(\log(.3245) = -1.12546.\) In other words, the intercept from the model with no predictor variables is the estimated log odds of being in honors class for the whole population of interest. We can also transform the log of the odds back to a probability: \(p = \exp(-1.12546)/(1+\exp(-1.12546)) = .245.\)
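The same back-transformation in R, as a quick sketch using the intercept-only fit above (plogis is base R’s inverse logit):
plogis(coef(fit))   # exp(-1.1255) / (1 + exp(-1.1255)) = 0.245
49 / 200            # the sample proportion of hon = 1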
Now let’s go one step further by adding a binary predictor variable, female, to the model. Writing it in an equation, the model describes the following linear relationship.
\[\text{logit}(p) = β_0 + β_1\times\text{female}\]
fit <- glm(hon ~ female,d,family='binomial')
summary(fit)
##
## Call:
## glm(formula = hon ~ female, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8337 -0.8337 -0.6431 -0.6431 1.8317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4709 0.2690 -5.469 4.53e-08 ***
## female 0.5928 0.3414 1.736 0.0825 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.71 on 199 degrees of freedom
## Residual deviance: 219.61 on 198 degrees of freedom
## AIC: 223.61
##
## Number of Fisher Scoring iterations: 4
Before trying to interpret the two parameters estimated above, let’s take a look at the crosstab of the variable hon with female.
addmargins(table(honors=t(d$hon),female=d$female))
## female
## honors 0 1 Sum
## 0 74 77 151
## 1 17 32 49
## Sum 91 109 200
In our dataset, what are the odds of a male being in the honors class and what are the odds of a female being in the honors class? We can manually calculate these odds from the table: for males, the odds of being in the honors class are \((17/91)/(74/91) = 17/74 = .23;\) and for females, the odds of being in the honors class are \((32/109)/(77/109) = 32/77 = .42.\) The ratio of the odds for females to the odds for males is \((32/77)/(17/74) = 1.809.\) So the odds for males are \(17\) to \(74,\) the odds for females are \(32\) to \(77,\) and the odds for females are about \(81\%\) higher than the odds for males.
Now we can relate the odds for males and females to the output from the logistic regression. The intercept of \(-1.471\) is the log odds for males, since male is the reference group \((female = 0).\) Using the odds we calculated above for males, we can confirm this: \(\log(.23) = -1.47.\) The coefficient for female is the log of the odds ratio between the female group and the male group: \(\log(1.809) = .593.\) So we can get the odds ratio by exponentiating the coefficient for female.
exp(cbind(OR = coef(fit), confint(fit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.2297297 0.1312460 0.3792884
## female 1.8090145 0.9362394 3.5929859
Another simple example is a model with a single continuous predictor variable such as the model below. It describes the relationship between students’ math scores and the log odds of being in an honors class.
\[\text{logit}(p) = β_0 + β_1\times \text{math}\]
fit <- glm(hon ~ math,d,family='binomial')
summary(fit)
##
## Call:
## glm(formula = hon ~ math, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0332 -0.6785 -0.3506 -0.1565 2.6143
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.79394 1.48174 -6.610 3.85e-11 ***
## math 0.15634 0.02561 6.105 1.03e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.71 on 199 degrees of freedom
## Residual deviance: 167.07 on 198 degrees of freedom
## AIC: 171.07
##
## Number of Fisher Scoring iterations: 5
In this case, the estimated coefficient for the intercept is the log odds of a student with a math score of zero being in an honors class. In other words, the odds of being in an honors class when the math score is zero is \(\exp(-9.793942) = .00005579.\) These odds are very low, but if we look at the distribution of the variable math, we will see that no one in the sample has math score lower than \(30.\) In fact, all the test scores in the data set were standardized around mean of \(50\) and standard deviation of \(10.\) So the intercept in this model corresponds to the log odds of being in an honors class when math is at the hypothetical value of zero.
How do we interpret the coefficient for math? The coefficient and intercept estimates give us the following equation:
\[\log(p/(1-p)) = \text{logit}(p) = -9.793942 + .1563404 \times \text{math}\]
Let’s fix math at some value. We will use \(54.\) Then the conditional logit of being in an honors class when the math score is held at \(54\) is
\[\log(p/(1-p))(\text{math}=54) = -9.793942 + .1563404 \times 54.\]
We can examine the effect of a one-unit increase in math score. When the math score is held at \(55,\) the conditional logit of being in an honors class is
\[\log(p/(1-p))(\text{math}=55) = -9.793942 + .1563404 \times 55.\]
Taking the difference of the two equations, we have the following:
\[\log(p/(1-p))(\text{math}=55) - \log(p/(1-p))(\text{math} = 54) = .1563404.\]
We can say now that the coefficient for math is the difference in the log odds. In other words, for a one-unit increase in the math score, the expected change in log odds is \(.1563404.\)
Can we translate this change in log odds to the change in odds? Indeed, we can. Recall that logarithm converts multiplication and division to addition and subtraction. Its inverse, the exponentiation converts addition and subtraction back to multiplication and division. If we exponentiate both sides of our last equation, we have the following:
\[\begin{align} &\exp[\log(p/(1-p))(\text{math}=55) - \log(p/(1-p))(\text{math} = 54)] \\[2ex] &=\exp(\log(p/(1-p))(\text{math}=55)) / \exp(\log(p/(1-p))(\text{math} = 54))\\[2ex] &= \text{odds}(\text{math}=55)/\text{odds}(\text{math}=54)\\[2ex] &= \exp(.1563404) = 1.1692241. \end{align}\]
So we can say that for a one-unit increase in math score, we expect to see about a \(17\%\) increase in the odds of being in an honors class. This 17% increase does not depend on the value at which math is held.
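As a quick check against the hon ~ math fit above (a sketch; cf is just shorthand for the coefficient vector):
# Odds at math = 55 versus math = 54 from the fitted coefficients
cf <- coef(fit)                                       # fit is glm(hon ~ math) here
exp(cf[1] + cf[2] * 55) / exp(cf[1] + cf[2] * 54)     # ratio of the two odds ...
exp(cf["math"])                                       # ... equals exp(beta_math) = 1.1692...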
In general, we can have multiple predictor variables in a logistic regression model.
\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1\, x_1 + … + β_k\, x_k\]
Applying such a model to our example dataset, each estimated coefficient is the expected change in the log odds of being in an honors class for a unit increase in the corresponding predictor variable, holding the other predictor variables constant at a certain value. Each exponentiated coefficient is the ratio of two odds, or the change in odds on the multiplicative scale for a unit increase in the corresponding predictor variable, holding the other variables at a certain value. Here is an example.
\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1 \text{math} + β_2 \text{female} + β_3 \text{read}\]
fit <- glm(hon ~ math + female + read,d,family='binomial')
summary(fit)
##
## Call:
## glm(formula = hon ~ math + female + read, family = "binomial",
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8305 -0.6327 -0.3300 -0.1258 2.3896
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.77025 1.71068 -6.880 5.97e-12 ***
## math 0.12296 0.03128 3.931 8.44e-05 ***
## female 0.97995 0.42163 2.324 0.0201 *
## read 0.05906 0.02655 2.224 0.0261 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.71 on 199 degrees of freedom
## Residual deviance: 156.17 on 196 degrees of freedom
## AIC: 164.17
##
## Number of Fisher Scoring iterations: 5
This fitted model says that, holding math and reading at a fixed value, the odds of getting into an honors class for females (female = 1) over the odds of getting into an honors class for males (female = 0) is \(\exp(.979948) = 2.66.\) In terms of percent change, we can say that the odds for females are \(166\%\) higher than the odds for males. The coefficient for math says that, holding female and reading at a fixed value, we will see a \(13\%\) increase in the odds of getting into an honors class for a one-unit increase in math score, since \(\exp(.1229589) = 1.13.\)
Logistic regression with an interaction term of two predictor variables:
In all the previous examples, we have said that the regression coefficient of a variable corresponds to the change in log odds and its exponentiated form corresponds to the odds ratio. This is only true when our model does not have any interaction terms. When a model has interaction term(s) of two predictor variables, it attempts to describe how the effect of a predictor variable depends on the level/value of another predictor variable. The interpretation of the regression coefficients becomes more involved.
Let’s take a simple example.
\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1 \text{female} + β_2 \text{math} + β_3 \text{female}\times\text{math}\]
fit <- glm(hon ~ female + math + female*math,d,family='binomial')
summary(fit)
##
## Call:
## glm(formula = hon ~ female + math + female * math, family = "binomial",
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7623 -0.6725 -0.3421 -0.1450 2.6913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.74584 2.12913 -4.108 4e-05 ***
## female -2.89986 3.09418 -0.937 0.348657
## math 0.12938 0.03588 3.606 0.000312 ***
## female:math 0.06700 0.05346 1.253 0.210139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222.71 on 199 degrees of freedom
## Residual deviance: 159.77 on 196 degrees of freedom
## AIC: 167.77
##
## Number of Fisher Scoring iterations: 5
In the presence of the interaction term of female by math, we can no longer talk about the effect of female holding all other variables at a certain value, since it does not make sense to fix math and female x math at certain values and still allow female to change from 0 to 1!
In this simple example, where we examine the interaction of a binary variable and a continuous variable, we can think of it as actually having two equations: one for males and one for females. For males (female=0), the equation is simply
\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_2 \text{math}.\]
For females, the equation is
\[\text{logit}(p) = \log(p/(1-p))= β_0 + β_1\text{female} + β_2\text{math} + β_3 \text{math}\times \text{female} .\]
Now we can map the logistic regression output to these two equations.
So we can say that the coefficient for math is the effect of math when female = 0. More explicitly, for male students, a one-unit increase in math score yields a change in log odds of \(0.13.\) On the other hand, for female students, a one-unit increase in math score yields a change in log odds of \((.13 + .067) = 0.197.\) In terms of odds ratios, for male students the odds ratio is \(\exp(.13) = 1.14\) for a one-unit increase in math score, and for female students the odds ratio is \(\exp(.197) = 1.22\) for a one-unit increase in math score. The ratio of these two odds ratios (female over male) turns out to be the exponentiated coefficient for the interaction term of female by math: \(1.22/1.14 = \exp(.067) = 1.07.\)
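A quick sketch that extracts those numbers from the interaction fit above (cf is again just shorthand for the coefficient vector):
# Log-odds slopes for math by gender, and the corresponding odds ratios
cf <- coef(fit)                                                            # fit is the female * math model here
c(male = cf[["math"]], female = cf[["math"]] + cf[["female:math"]])        # 0.13 and 0.197
exp(c(male = cf[["math"]], female = cf[["math"]] + cf[["female:math"]]))   # 1.14 and 1.22
exp(cf[["female:math"]])                                                   # ratio of the two odds ratios: 1.07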
From here:
First, it’s important to understand what effect size statistics are for and why they’re worth reporting.
This quotation by Joseph Durlak explains it nicely:
“…provide information about the magnitude and direction of the difference between two groups or the relationship between two variables.”
There are two types of effect size statistics: standardized and unstandardized.
Standardized statistics have been stripped of all units of measurement.
Correlation is a nice example. People like correlation because the strength and direction of any two correlations can be compared, regardless of the units of the variables on which the correlation was measured.
Unstandardized statistics are still measured in the original units of the variables. So a difference in two means and a regression coefficient are both effect size statistics and both are useful to report.
Most people mean standardized when they say “effect size statistic.” But both describe the magnitude and direction of the research findings.
If you’re at all familiar with logistic regression, you’re also familiar with odds ratios. Odds ratios measure how many times bigger the odds of one outcome is for one value of an IV, compared to another value.
For example, let’s say you’re doing a logistic regression for an ecology study on whether or not a wetland in a certain area has been infected with a specific invasive plant. Predictors include water temperature in degrees Celsius, altitude, and whether the wetland is a fen or a marsh.
If the odds ratio for water temperature is \(1.12,\) that means that for each one-degree Celsius increase in water temperature, the odds of the wetland having the invasive plant species is \(1.12\) times as big, after controlling for the other predictors.
That odds ratio is an unstandardized effect size statistic. It tells you the direction and the strength of the relationship between water temperature and the odds that the plant is present.
It’s unstandardized because it’s based on the units of temperature.
I realize no ecologist would do so, but if the water were measured in degrees Fahrenheit, that odds ratio would have a different value. The direction and the strength of the relationship would be the same, but the statistic would be evaluated on a different scale.
Surprisingly, I’ve seen odds ratios listed as standardized effect size statistics.
A little digging showed those authors were referring to one of two situations.
The first situation is when the predictor is also binary. The odds ratio for whether the wetland was a fen or a marsh can be considered standardized, not because we’ve removed any units, but because there never were any.
So if the odds ratio for fen vs. marsh is 2.3, we know the odds of the invasive plant being in a fen is 2.3 times that of a marsh.
The second is when the numerical predictor is standardized. If we standardize the temperature (aka, convert the scale to Z scores), we’ve removed the units.
We’ll get the same Z scores from degrees Fahrenheit as degrees Celsius, and the new odds ratio will be in terms of one standard deviation increases in temperature, rather than one degree.
Voila! Units removed.
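As an illustration with the honors data used earlier (a sketch of my own, not from the quoted post): standardizing the continuous predictor rescales its coefficient by the predictor’s standard deviation, so the exponentiated coefficient becomes an odds ratio per one-SD increase rather than per one unit.
# Odds ratio for math per unit versus per standard deviation
fit.raw <- glm(hon ~ math, d, family = "binomial")
fit.std <- glm(hon ~ scale(math), d, family = "binomial")
exp(coef(fit.raw)["math"])                   # OR per one extra point of math
exp(coef(fit.std)["scale(math)"])            # OR per one SD of math
exp(coef(fit.raw)["math"] * sd(d$math))      # the same number, obtained by rescaling by hand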
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.