We want to estimate the probability of success / failure of the dependent variable (\(\Pr(Y=1)\)), given the explanatory variables: \(\Pr(Y=1|X_1,X_2,\cdots, X_n)\). The explanatory variables can be categorical or continuous - it doesn’t matter.

Logistic regression addresses the following issues:

- There is a need to bound at \(1\) since we are estimating a probability, but lines are not naturally bounded. So we transform probability into odds:

\[ \text{odds}=\frac{\Pr(Y=1)}{1-\Pr(Y=1)}\tag 1\]

In addition, this also helps turning a sigmoid curve of a typical cummulative probability distribution (e.g. normal) into a linear relation.

- Now, when the probability goes down to zero, the odds also tend to zero; yet there is no floor restriction in a line. This problem can be addressed by expressing the linear relation in log scale: \(\log(\text{odds})\), which will tend to \(-\infty\) as the odds tend to zero. This is the
**logit**or log-odds function.

The logistic regression model is therefore:

\(\large \color{red}{\log} \left[\color{blue}{\text{odds(p(Y=1))}}\right]=\color{red}{\text{log}}\left(\frac{\hat p\,(Y=1)}{1-\hat p\,(Y=1)}\right) = X\beta = \beta_o + \beta_1 x_1 + \beta_2 x_2 +\cdots+ \beta_p x_p\)

Consequently,

\[\large \color{blue}{\text{odds(Y=1)}} = \frac{\Pr\,(Y=1)}{1\,-\,\Pr\,(Y=1)} = e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p} = e^{\,\beta_o}\,e^{\,\beta_1x_1}\,e^{\,\beta_2x_2}\,\cdots\,e^{\,\beta_px_p}\]

Therefore a unit increase in \(x_1\) increases the odds \(\large e^{\,\beta_1}\) times. The factors \(\large e^{\,\beta_i}\) are the **ODDS RATIO**’s. On the other hand, \(\beta_i\) (the coefficients) are the **LOG ODDS-RATIO**:

For CATEGORICAL VARIABLES, the **ODDS RATIO** denote how much the presence or absence of a factor variable increases the odds of a positive dependent variable.

For CONTINOUS VARIABLES the odds ratio tell us how much the odds increase multiplicatively with a one-unit change in the independent variable. To calculate the difference in odds you raise the OR to the power of the difference between the observations.

To get the exponentiated coefficients and their confidence intervals in R, we use cbind to bind the coefficients and confidence intervals column-wise:

`exp(cbind(OR = coef(mylogit), confint(mylogit)))`

See here.

From equation \((1):\)

\[\begin{align} \Pr(Y = 1) &= (1-\Pr(Y=1))\times\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}\\[2ex] e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}&= \Pr(Y = 1)\,\times (1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}) \\[2ex] \color{green}{\Pr(Y = 1)} = \frac{\color{blue}{\text{odds(Y=1)}}}{1\,+\,\color{blue}{\text{odds(Y=1)}}}&=\frac{e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 \,+\,e^{\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}=\frac{1}{1 \,+\,e^{-(\,\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} \end{align}\]

####INTERPRETATION OF LOGISTIC REGRESSION RESULTS:

*QUESTION:*

(Initially posted here)

I ran a linear regression of acceptance into college against SAT scores and family / ethnic background. The data are fictional. The question focuses in the gathering and interpretation of odds ratios when leaving the SAT scores aside for simplicity.

The variables are `Accepted`

(0 or 1) and `Background`

(“red” or “blue”). I set up the data so that people of “red” background were more likely to get in:

```
fit <- glm(Accepted ~ Background, data=dat, family="binomial")
exp(cbind(Odds_Ratio_RedvBlue = coef(fit), confint(fit)))
Odds_Ratio_RedvBlue 2.5 % 97.5 %
(Intercept) 0.7088608 0.5553459 0.9017961
Backgroundred 2.4480042 1.7397640 3.4595454
```

Questions:

Is 0.7 the odd ratio of a person of “blue” background being accepted? I’m asking this because I also get 0.7 for “

`Backgroundblue`

” if instead I run the following code:`fit <- glm(Accepted~Background - 1, data=dat, family="binomial") exp(cbind(OR=coef(fit), confint(fit)))`

Shouldn’t the odds ratio of “red” being accepted (\(\rm Accepted/Red:Accepted/Blue\)) just the reciprocal: (\(\rm OddsBlue = 1 / OddsRed\))?

*ANSWER:*

I’ve been working on answering my question by calculating manually the odds and odds ratios:

```
Acceptance blue red Grand Total
0 158 102 260
1 112 177 289
Total 270 279 549
```

So the *Odds Ratio* of getting into the school of Red over Blue is:

\[ \frac{\rm Odds\ Accept\ If\ Red}{\rm Odds\ Acccept\ If\ Blue} = \frac{^{177}/_{102}}{^{112}/_{158}} = \frac {1.7353}{0.7089} = 2.448 \]

And this is the `Backgroundred`

return of:

```
fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(Odds_and_OR=coef(fit), confint(fit)))
Odds_and_OR 2.5 % 97.5 %
(Intercept) 0.7088608 0.5553459 0.9017961
Backgroundred 2.4480042 1.7397640 3.4595454
```

At the same time, the `(Intercept)`

corresponds to the numerator of the *odds ratio*, which is exactly the *odds* of getting in being of ‘blue’ family background: \(112/158 = 0.7089\).

If instead, I run:

```
fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(Odds=coef(fit2), confint(fit2)))
Odds 2.5 % 97.5 %
Backgroundblue 0.7088608 0.5553459 0.9017961
Backgroundred 1.7352941 1.3632702 2.2206569
```

The returns are precisely the *odds* of getting in being ‘blue’: `Backgroundblue`

(0.7089) and the *odds* of being accepted being ‘red’: `Backgroundred`

(1.7353). No *Odds Ratio* there. Therefore the two return values are not expected to be reciprocal.

Finally, How to read the results if there are 3 factors in the categorical regressor?

Same manual versus [R] calculation:

I created a different fictitious data set with the same premise, but this time there were three ethnic backgrounds: “red”, “blue” and “orange”, and ran the same sequence:

First, the contingency table:

```
Acceptance blue orange red Total
0 86 65 130 281
1 64 42 162 268
Total 150 107 292 549
```

And calculated the *Odds* of getting in for each ethnic group:

- Odds Accept If Red = 1.246154;
- Odds Accept If Blue = 0.744186;
- Odds Accept If Orange = 0.646154

As well as the different *Odds Ratios*:

- OR red v blue = 1.674519;
- OR red v orange = 1.928571;
- OR blue v red = 0.597186;
- OR blue v orange = 1.151717;
- OR orange v red = 0.518519; and
- OR orange v blue = 0.868269

And proceeded with the now routine logistic regression followed by exponentiation of coefficients:

```
fit <- glm(Accepted~Background, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit), confint(fit)))
ODDS 2.5 % 97.5 %
(Intercept) 0.7441860 0.5367042 1.026588
Backgroundorange 0.8682692 0.5223358 1.437108
Backgroundred 1.6745192 1.1271430 2.497853
```

Yielding the *odds* of getting in for “blues” as the `(Intercept)`

, and the *Odds Ratios* of Orange versus Blue in `Backgroundorange`

, and the OR of Red v Blue in `Backgroundred`

.

On the other hand, the regression without intercept predictably returned just the three independent *odds*:

```
fit2 <- glm(Accepted~Background-1, data=dat, family="binomial")
exp(cbind(ODDS=coef(fit2), confint(fit2)))
ODDS 2.5 % 97.5 %
Backgroundblue 0.7441860 0.5367042 1.0265875
Backgroundorange 0.6461538 0.4354366 0.9484999
Backgroundred 1.2461538 0.9900426 1.5715814
```

####CATEGORICAL INDEPENDENT VARIABLES:

Here is an example extracted form this site:

```
# Define an explanatory variable with two levels:
parentsmoke <- as.factor(c(1,0))
# create response vector:
response <- cbind(yes=c(816,188), no=c(3203,1168))
response
```

```
## yes no
## [1,] 816 3203
## [2,] 188 1168
```

```
smoke.logistic <- glm(response~parentsmoke, family=binomial(link=logit))
smoke.logistic
```

```
##
## Call: glm(formula = response ~ parentsmoke, family = binomial(link = logit))
##
## Coefficients:
## (Intercept) parentsmoke1
## -1.8266 0.4592
##
## Degrees of Freedom: 1 Total (i.e. Null); 0 Residual
## Null Deviance: 29.12
## Residual Deviance: 4.365e-13 AIC: 19.24
```

`summary(smoke.logistic)`

```
##
## Call:
## glm(formula = response ~ parentsmoke, family = binomial(link = logit))
##
## Deviance Residuals:
## [1] 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82661 0.07858 -23.244 < 2e-16 ***
## parentsmoke1 0.45918 0.08782 5.228 1.71e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.9121e+01 on 1 degrees of freedom
## Residual deviance: 4.3654e-13 on 0 degrees of freedom
## AIC: 19.242
##
## Number of Fisher Scoring iterations: 2
```

This information gives us the fitted model:

\[\text{logit}(\hat\pi_i)=\log\frac{\hat\pi_i}{1-\hat\pi_i}=\hat\beta_0+\hat\beta_1X_i=-1.837+0.459\,(\text{parents smoke = 1})\] where `parentsmoke`

is a dummy variable ((e.g. design variable) that takes value \(1\) if at least one parents is smoking and \(0\) if neither is smoking.

`parent smoking = 0 is the baseline`

. We are modeling here probability of “children smoking”. Estimated \(\beta_0=1.827\) with a standard error of \(0.078\) is significant and it says that log-odds of a child smoking versus not smoking if neither parents is smoking (the baseline level) is \(-1.827\) and it’s statistically significant.

Estimated \(\beta_1=0.459\) with a standard error of \(0.088\) is significant and it says that log-odds-ratio of a child smoking versus not smoking if at least one parent is smoking versus neither parents is smoking (the baseline level) is \(0.459\) and it’s statistically significant. \(\exp(0.459)=1.58\) are the estimated odds-ratios.

Why not just ran a Chi square test:

`prop.test(response, correct = F)`

```
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: response
## X-squared = 27.677, df = 1, p-value = 1.434e-07
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.04218944 0.08659558
## sample estimates:
## prop 1 prop 2
## 0.2030356 0.1386431
```

It does detect differences in proportions but it doesn’t address:

- Modeling the “risk” of student smoking as a function of parents’ smoking behavior.
- Describing the differences between student smokers and nonsmokers as a function of parents smoking behavior via descriptive discriminate analyses. -Predicting the probabilities that individuals fall into two categories of the binary response as a function of some explanatory variables, e.g. what is the probability that a student is a smoker given that neither of his/her parents smokes.
- Predicting that a student is a smoker given that neither of his/her parents smokes, i.e. probabilities that individuals fall into two categories of the binary response as a function of some explanatory variables, we want to classify new students into “smoking” or “nonsmoking” group depending on parents smoking behavior. we want to develop a social network model, adjust for “bias”, These are just some of the possibilities of logistic regression, which cannot be handled within a framework of goodness-of-fit only.

Here is the same model splitting mothers and fathers as predictors:

```
kids.smoke <- matrix(c(188,599,816-205,1168,1456,3203-1456), nrow = 3)
dimnames(kids.smoke) = list(Parents = c("None", "Mother", "Father"), Kids = c("Smokers", "Non-smokers"))
kids.smoke
```

```
## Kids
## Parents Smokers Non-smokers
## None 188 1168
## Mother 599 1456
## Father 611 1747
```

```
bad.example <- as.factor(c(0,"1Mother","2Father"))
smoke.logistic <- glm(kids.smoke~bad.example, family=binomial(link=logit))
smoke.logistic
```

```
##
## Call: glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
##
## Coefficients:
## (Intercept) bad.example1Mother bad.example2Father
## -1.8266 0.9384 0.7760
##
## Degrees of Freedom: 2 Total (i.e. Null); 0 Residual
## Null Deviance: 119.2
## Residual Deviance: -1.06e-12 AIC: 28.77
```

`summary(smoke.logistic)`

```
##
## Call:
## glm(formula = kids.smoke ~ bad.example, family = binomial(link = logit))
##
## Deviance Residuals:
## [1] 0 0 0
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.82661 0.07858 -23.244 <2e-16 ***
## bad.example1Mother 0.93842 0.09237 10.160 <2e-16 ***
## bad.example2Father 0.77605 0.09157 8.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.1915e+02 on 2 degrees of freedom
## Residual deviance: -1.0603e-12 on 0 degrees of freedom
## AIC: 28.768
##
## Number of Fisher Scoring iterations: 2
```

An example with credit defaults with some calculations extracted from here:

```
set.seed(123)
gender <- factor(round(runif(100,0,1)),
labels=c("female", "male"))
age <- round(runif(100,20,80))
profession <- factor(round(runif(100,1,4)),
labels=c("student", "worker", "teacher", "self-employed"))
# Let's make males, teachers and self-employed more likely to default:
credit.default <- round(((as.numeric(gender)/2 + as.numeric(profession)/4)/2) * runif(100,0.5,1))
df <- data.frame(credit.default,gender,age,profession)
head(df)
```

```
## credit.default gender age profession
## 1 0 female 56 worker
## 2 1 male 40 self-employed
## 3 1 female 49 teacher
## 4 1 male 77 teacher
## 5 1 male 49 worker
## 6 1 female 73 self-employed
```

```
catlogis <- glm(credit.default ~ gender+age+profession, df, family=binomial(link="logit"))
summary(catlogis)
```

```
##
## Call:
## glm(formula = credit.default ~ gender + age + profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2580 -0.9295 -0.2042 0.5361 1.5200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.860e+00 1.327e+00 -2.910 0.003616 **
## gendermale 3.084e+00 7.844e-01 3.931 8.46e-05 ***
## age 9.465e-07 1.692e-02 0.000 0.999955
## professionworker 6.325e-01 8.211e-01 0.770 0.441075
## professionteacher 3.245e+00 1.003e+00 3.236 0.001214 **
## professionself-employed 4.464e+00 1.193e+00 3.740 0.000184 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.663 on 99 degrees of freedom
## Residual deviance: 88.237 on 94 degrees of freedom
## AIC: 100.24
##
## Number of Fisher Scoring iterations: 5
```

`require(pROC)`

`## Loading required package: pROC`

`## Type 'citation("pROC")' for a citation.`

```
##
## Attaching package: 'pROC'
```

```
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
```

`auc(df$credit.default, predict(catlogis, type="response"))`

`## Area under the curve: 0.8517`

We can generate an ANOVA to analyze the table of variance, paying attention at the difference between the `NULL`

variance and the `Resid. Deviance`

to see which variable reduce this residual variance in comparison to the model with only an intecept. The p-values also help select variables. We want a drop in the deviance and AIC.

```
catlogis <- glm(credit.default ~ gender+profession, df, family=binomial(link="logit"))
summary(catlogis)
```

```
##
## Call:
## glm(formula = credit.default ~ gender + profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2580 -0.9295 -0.2042 0.5361 1.5200
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8604 0.9996 -3.862 0.000113 ***
## gendermale 3.0836 0.7841 3.933 8.39e-05 ***
## professionworker 0.6325 0.8202 0.771 0.440570
## professionteacher 3.2448 1.0028 3.236 0.001213 **
## professionself-employed 4.4636 1.1873 3.759 0.000170 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.663 on 99 degrees of freedom
## Residual deviance: 88.237 on 95 degrees of freedom
## AIC: 98.237
##
## Number of Fisher Scoring iterations: 5
```

`auc(df$credit.default, predict(catlogis, type="response"))`

`## Area under the curve: 0.8527`

```
genlogis <- glm(credit.default ~ gender, df, family=binomial(link="logit"))
summary(genlogis)
```

```
##
## Call:
## glm(formula = credit.default ~ gender, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4261 -0.7502 -0.7502 0.9476 1.6765
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1239 0.3193 -3.520 0.000431 ***
## gendermale 1.6919 0.4405 3.841 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 120.56 on 98 degrees of freedom
## AIC: 124.56
##
## Number of Fisher Scoring iterations: 4
```

`auc(df$credit.default, predict(genlogis, type="response"))`

`## Area under the curve: 0.6997`

```
fitted.results <- predict(genlogis, type="response")
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != df$credit.default)
print(paste('Accuracy', 1-misClasificError))
```

`## [1] "Accuracy 0.7"`

`library(ROCR)`

`## Loading required package: gplots`

```
##
## Attaching package: 'gplots'
```

```
## The following object is masked from 'package:stats':
##
## lowess
```

```
pr <- prediction(predict(genlogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
```

```
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

`## [1] 0.6997144`

```
proflogis <- glm(credit.default ~ profession, df, family=binomial(link="logit"))
summary(proflogis)
```

```
##
## Call:
## glm(formula = credit.default ~ profession, family = binomial(link = "logit"),
## data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7941 -0.7710 -0.6231 1.0508 1.8626
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5404 0.6362 -2.421 0.01547 *
## professionworker 0.4796 0.7445 0.644 0.51950
## professionteacher 1.8458 0.7272 2.538 0.01114 *
## professionself-employed 2.9267 0.9063 3.229 0.00124 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.66 on 99 degrees of freedom
## Residual deviance: 115.75 on 96 degrees of freedom
## AIC: 123.75
##
## Number of Fisher Scoring iterations: 4
```

`auc(df$credit.default, predict(proflogis, type="response"))`

`## Area under the curve: 0.7438`

```
pr <- prediction(predict(proflogis, type="response"), df$credit.default)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
```

```
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

`## [1] 0.743778`

Another example from this video:

```
set.seed(420)
n.samples <- 100
weight <- sort(rnorm(n.samples,172,29))
obese <- ifelse(runif(n.samples) < (rank(weight)/100), yes=1, no=0)
plot(weight,obese, pch=19, col='tan4', cex=.7)
glm.fit <- glm(obese ~ weight, family="binomial")
lines(weight, glm.fit$fitted.values)
```

```
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T)
```

```
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, plot = T)
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 0.8291
```

```
roc.info <- roc(obese, glm.fit$fitted.values)
roc.df <- data.frame(ttp=roc.info$sensitivities*100, fpp=(1-roc.info$specificities) *100, thresholds=roc.info$thresholds)
head(roc.df)
```

```
## ttp fpp thresholds
## 1 100 100.00000 -Inf
## 2 100 97.77778 0.01349011
## 3 100 95.55556 0.03245008
## 4 100 93.33333 0.05250145
## 5 100 91.11111 0.07017225
## 6 100 88.88889 0.08798755
```

`tail(roc.df)`

```
## ttp fpp thresholds
## 96 9.090909 0 0.9275222
## 97 7.272727 0 0.9371857
## 98 5.454545 0 0.9480358
## 99 3.636364 0 0.9648800
## 100 1.818182 0 0.9735257
## 101 0.000000 0 Inf
```

Notice that the first row of `head(roc.df)`

with a threshold for labeling “obese” of \(-\inf\) the ttp (true positive percentage) will be of \(100\%,\) and so will be the fpp (false positive percentage), and it corresponds to the upper-right hand part of the ROC curve. The bottom-left part of the curve corresponds to the last row of `tail(roc.df)`

.

We can isolate part of the data, for example with ttp between \(60\) and \(80\), to select an optimal threshold value:

`roc.df[roc.df$ttp > 60 & roc.df$ttp < 80,]`

```
## ttp fpp thresholds
## 42 78.18182 35.55556 0.5049310
## 43 78.18182 33.33333 0.5067116
## 44 78.18182 31.11111 0.5166680
## 45 76.36364 31.11111 0.5287933
## 46 76.36364 28.88889 0.5429351
## 47 76.36364 26.66667 0.5589494
## 48 74.54545 26.66667 0.5676342
## 49 74.54545 24.44444 0.5776086
## 50 74.54545 22.22222 0.5946054
## 51 72.72727 22.22222 0.6227449
## 52 70.90909 22.22222 0.6398136
## 53 69.09091 22.22222 0.6441654
## 54 67.27273 22.22222 0.6556705
## 55 67.27273 20.00000 0.6683618
## 56 67.27273 17.77778 0.6767661
## 57 65.45455 17.77778 0.6802060
## 58 65.45455 15.55556 0.6831936
## 59 65.45455 13.33333 0.6917225
## 60 63.63636 13.33333 0.6975300
## 61 61.81818 13.33333 0.6982807
```

We can plot the ROC differently:

```
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
xlab='False positive percentage', ylab='True positive percentage',
col="darkslateblue",lwd=4, print.auc=T)
```

```
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, percent = T, plot = T, legacy.axes = T, xlab = "False positive percentage", ylab = "True positive percentage", col = "darkslateblue", lwd = 4, print.auc = T)
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Area under the curve: 82.91%
```

To calculate the partial AUC of the beginning, where only a small number of false positives are allowed…

```
par(pty="s")
roc(obese, glm.fit$fitted.values, plot=T, legacy.axes=T, percent=T,
xlab='False positive percentage', ylab='True positive percentage',
col='darkslateblue',lwd=4, print.auc=T, print.auc.x=45, partial.auc=c(100,90),
auc.polygon=T, auc.polygon.col='midnightblue')
```

```
##
## Call:
## roc.default(response = obese, predictor = glm.fit$fitted.values, percent = T, plot = T, legacy.axes = T, xlab = "False positive percentage", ylab = "True positive percentage", col = "darkslateblue", lwd = 4, print.auc = T, print.auc.x = 45, partial.auc = c(100, 90), auc.polygon = T, auc.polygon.col = "midnightblue")
##
## Data: glm.fit$fitted.values in 45 controls (obese 0) < 55 cases (obese 1).
## Partial area under the curve (specificity 100%-90%): 4.727%
```

Can we bootstrap?

```
ci.auc(roc(obese, glm.fit$fitted.values), conf.level=0.95, method=
"bootstrap", boot.n = 2000, boot.stratified = TRUE, reuse.auc=TRUE,
progress = getOption("pROCProgress")$name, parallel=FALSE)
```

`## 95% CI: 0.7471-0.9006 (2000 stratified bootstrap replicates)`

####TESTING A LOGISTIC REGRESSION MODEL:

- GOODNESS-OF-FIT: Model calibration

The Hosmer-Lemeshow goodness of fit test divides up in boxes the predicted probabilities (in R the function is `fitted`

as opposed to `predict`

), and runs a chi-square test comparing to the percentage of cases that have \(Y=1\) among those with predicted probability within a certain interval. See here:

The Hosmer-Lemeshow goodness of fit test is based on dividing the sample up according to their predicted probabilities, or risks. Specifically, based on the estimated parameter values \(\hat{\beta}_{0},\hat{\beta}_{1},..,\hat{\beta}_{p},\) for each observation in the sample the probability that \(Y=1\) is calculated, based on each observation’s covariate values:

\[\hat{\pi} = \frac{\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}{1+\exp(\hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+..+\hat{\beta}_{p}X_{p})}\]

The observations in the sample are then split into \(g\) groups (we come back to choice of g later) according to their predicted probabilities. Suppose (as is commonly done) that \(g=10.\) Then the first group consists of the observations with the lowest \(10\%\) predicted probabilities. The second group consists of the \(10\%\) of the sample whose predicted probabilities are next smallest, etc etc.

Suppose for the moment that all of the observations in the first group had a predicted probability of \(0.1.\) Then, if our model is correctly specified, we would expect the proportion of these observations who have \(Y=1\) to be \(10\%.\) Of course, even if the model is correctly specified, the observed proportion will deviate to some extent from 10%, but not by too much. If the proportion of observations with \(Y=1\) in the group were instead \(90\%,\) this is suggestive that our model is not accurately predicting probability (risk), i.e. an indication that our model is not fitting the data well.

In practice, as soon as some of our model covariates are continuous, each observation will have a different predicted probability, and so the predicted probabilities will vary in each of the groups we have formed. To calculate how many Y=1 observations we would expect, the Hosmer-Lemeshow test takes the average of the predicted probabilities in the group, and multiplies this by the number of observations in the group. The test also performs the same calculation for \(Y=0,\) and then calculates a Pearson goodness of fit statistic

\[\sum^{1}_{k=0} \sum^{g}_{l=1} \frac{(o_{kl}-e_{kl})^{2}}{e_{kl}}\]

where \(o_{0l}\) denotes the number of observed \(Y=0\) observations in the \(l\)-th group, \(o_{1l}\) denotes the number of observed \(Y=1\) observations in the \(l\)-th group, and \(e_{0l}\) and \(e_{1l}\) similarly denote the expected number of zeros.

In a 1980 paper Hosmer-Lemeshow showed by simulation that (provided \(p+1<g\)) their test statistic approximately followed a chi-squared distribution on \(g-2\) degrees of freedom, when the model is correctly specified. This means that given our fitted model, the p-value can be calculated as the right hand tail probability of the corresponding chi-squared distribution using the calculated test statistic. If the p-value is small, this is indicative of poor fit.

It should be emphasized that a large p-value does not mean the model fits well, since lack of evidence against a null hypothesis is not equivalent to evidence in favor of the alternative hypothesis. In particular, if our sample size is small, a high p-value from the test may simply be a consequence of the test having lower power to detect mis-specification, rather than being indicative of good fit.

- C-STATISTIC or AUC (area under the curve of the ROC): Discriminatory power of the model

It is based on selecting different probabilities from \(0\) to \(1\) and calculating sensitivity and specificity in an ROC curve, and then measuring the AUC. See here:

Our model or prediction rule is perfect at classifying observations if it has 100% sensitivity and 100% specificity. Unfortunately in practice this is (usually) not attainable. So how can we summarize the discrimination ability of our logistic regression model? For each observation, our fitted model can be used to calculate the fitted probabilities \(P(Y=1|X_{1},..,X_{p}).\) On their own, these don’t tell us how to classify observations as positive or negative. One way to create such a classification rule is to choose a cut-point \(c,\) and classify those observations with a fitted probability above \(c\) as positive and those at or below it as negative. For this particular cut-off, we can estimate the sensitivity by the proportion of observations with \(Y=1\) which have a predicted probability above \(c,\) and similarly we can estimate specificity by the proportion of \(Y=0\) observations with a predicted probability at or below \(c.\) If we increase the cut-point \(c,\) fewer observations will be predicted as positive. This will mean that fewer of the \(Y=1\) observations will be predicted as positive (reduced sensitivity), but more of the \(Y=0\) observations will be predicted as negative (increased specificity). In picking the cut-point, there is thus an intrinsic trade off between sensitivity and specificity. Now we come to the ROC curve, which is simply a plot of the values of sensitivity against one minus specificity, as the value of the cut-point \(c\) is increased from \(0\) through to \(1.\) Example:

```
set.seed(63126)
n <- 1000
x <- rnorm(n)
pr <- exp(x)/(1+exp(x))
y <- 1*(runif(n) < pr)
mod <- glm(y~x, family="binomial")
predpr <- predict(mod,type=c("response"))
library(pROC)
roccurve <- roc(y ~ predpr)
plot(roccurve)
```

A popular way of summarizing the discrimination ability of a model is to report the area under the ROC curve. We have seen that a model with discrimination ability has an ROC curve which goes closer to the top left hand corner of the plot, whereas a model with no discrimination ability has an ROC curve close to a \(45\) degree line. Thus the area under the curve ranges from \(1,\) corresponding to perfect discrimination, to \(0.5,\) corresponding to a model with no discrimination ability. The area under the ROC curve is also sometimes referred to as the **c-statistic** (c for concordance).

`auc(roccurve)`

`## Area under the curve: 0.7463`

The ROC curve (AUC) has a somewhat appealing interpretation. It turns out that the AUC is the probability that if you were to take a random pair of observations, one with \(Y=1\) and one with \(Y=0,\) the observation with \(Y=1\) has a higher predicted probability than the other. The AUC thus gives the probability that the model correctly ranks such pairs of observations.

In the biomedical context of risk prediction modelling, the AUC has been criticized by some. In the risk prediction context, individuals have their risk of developing (for example) coronary heart disease over the next 10 years predicted. Thus a measure of discrimination which examines the predicted probability of pairs of individuals, one with \(Y=1\) and one with \(Y=0,\) does not really match the prospective risk prediction setting, where we do not have such pairs.