NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Conditional Logistic Regression:

From here:

Matching is a way of controlling for confounding and bias, as well as increasing precision. Subjects are paired or grouped based on pre-specified characteristics. In matched case-control studies, each patient is matched by one or more subjects without the disease, matched of covariates of interest (age, gender, ethnicity).

Matching variables are not included in the model as covariates. Including them, leads to extreme bias in the odds-ratio estimates. It will be approximately the squared value of the true odds-ratio.

The notation is (from here)

The probability that the \(i\)-th person in the \(k\)-th matched set has disease is:

\[P_{k}({\bf X}_{ik})\]

where \(k\) is the strata (matched groups \(k \in \{1,2,\dots, K\}\)), and \(i\) is the individual \(i\in \{0,1,\dots,M+1\}\) with \(i=0\) being the case, and \(i=1,\dots, M\) the matched controls in a \(1:M\) case-control study. \({\bf X}_{ik}= X_{ik_1}, X_{ik_2},\dots,X_{ik_p}\) is the vector of covariates of interest for the \(i\)-th person in the \(k\)-th matched set.

The model is

\[\text{logit}\left(P_k(X_{ik})\right)=\log\frac{P_k(X_{ik})}{1 - P_k(X_{ik})}=\alpha_k + \beta_1 \,X_{ik_1}+\dots+\beta_p \,X_{ik_p}\]

The odds that a subject with \(x = 1\) is a case equals \(\exp(\beta)\) times the odds that a subject with \(x = 0\) is a case in the simple case \(\text{logit}(P(\text{case}))=\alpha + \beta\,x.\)

Each matched set has a different intercept, summarizing the effect of the matching variables on the probability of disease.

And \(\beta_p\) is the \(\log\text{OR}\) of disease of those exposed to covariate \(p\) to those not exposed to it.

The problem with the model above is that the number of parameters increases at the same rate as the sample size with the consequence that maximum likelihood estimation is no longer viable. We can overcome this problem if we regard the parameters \(\alpha_i\) as of little interest and so are willing to forgo their estimation. If we do, we can then create a conditional likelihood function that will yield maximum likelihood estimators of the coefficients, \(β_1, \dots , β_q,\) that are consistent and asymptotically normally distributed.

The joint probability that the vector of covariates \({\bf X}_{0k}\) corresponds to the case, and \({\bf X}_{ik}\) with \(i\in \{1,\dots, M\}\) to the controls is

\[P(X_{0k}\mid Y=1 \;\cap\; X_{1,2,\dots,M} \mid Y=0) = P({\bf X}_{0k} \mid Y=1)\;\; \prod_{i=1}^M\,P({\bf X}_{ik} \mid Y=0)\]

because we assume that the covariate vectors are independent of each other.

To construct the conditional likelihood that one of the \(M+1\) subjects in the \(k\)-th matched set is the case and the remainder are the controls:

\[\begin{align} &P({\bf X}_{0k}\mid Y=1) \prod_{i=1}^M P({\bf X}_{ik}\mid Y=0) + P({\bf X}_{1k}\mid Y=1) \prod_{i\neq 1} P({\bf X}_{ik}\mid Y=0)+ P({\bf X}_{Mk}\mid Y=1) \prod_{i\neq M} P({\bf X}_{ik}\mid Y=0)\\[2ex] &=\sum_{l=0}^MP({\bf X}_{lk}\mid Y=1) \prod_{r \neq l}^M P({\bf X}_{rk}\mid Y=0) \end{align} \]

The conditional likelihood of interest is

\[L_k = \frac{P({\bf X}_{0k}\mid Y=1) \quad\displaystyle\prod_{i=1}^M P({\bf X}_{ik}\mid Y=0)}{\sum_{l=0}^MP({\bf X}_{lk}\mid Y=1)\quad \displaystyle \prod_{r \neq l}^M P({\bf X}_{rk}\mid Y=0)}\] Utilizing Bayes’ theorem, \(\small P({\bf X}_{ik}\mid Y=1)=\frac{P(Y=1\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=1)}\) and \(\small P({\bf X}_{ik}\mid Y=0)=\frac{P(Y=0\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=0)}\):

\[L_k =\displaystyle \frac{\frac{P(Y=1\mid {\bf X}_{0k})\,P({\bf X}_{0k})}{P(Y=1)} \quad\displaystyle\prod_{i=1}^M \frac{P(Y=0\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=0)}}{\displaystyle\sum_{l=0}^M\frac{P(Y=1\mid {\bf X}_{lk})\,P({\bf X}_{lk})}{P(Y=1)} \quad\displaystyle\prod_{r \neq l}^M \frac{P(Y=0\mid {\bf X}_{rk})\,P({\bf X}_{rk})}{P(Y=0)}}\]

The factor \(\frac{\prod_i^M P({\bf X}_{ik})}{P(Y=1)P(Y=0)^M}\) appears both in the numerator and denominator and will cancel out, yielding

\[L_k =\frac{P(Y=1\mid {\bf X}_{0k})\,P({\bf X}_{0k})\quad \displaystyle \prod_{i=1}^M P(Y=0\mid {\bf X}_{ik})}{\displaystyle \sum_{l=0}^M P(Y=1\mid {\bf X}_{lk})\,P({\bf X}_{lk}) \quad \displaystyle\prod_{r \neq l}^M P(Y=0\mid {\bf X}_{rk})}\tag 1\]

Now, for the \(i\)-th person in the \(k\)-th stratum,

\[\text{logit}\,P(Y_i = 1 \mid {\bf X}_{ik}) =\alpha_k + \beta_1 X_{ik_1}+\dots + \beta_p X_{ik_p}\]

or

\[P(Y_i=1\mid {\bf X}_{ik}) = \frac{e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}}{1 + e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}} \]

and

\[P(Y_i=0\mid {\bf X}_{ik}) =1 - P(Y_1 =1\mid {\bf X}_{ik})= \frac{1}{1 + e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}} \] Substituting into Eq \((1),\)

\[L_k(\beta) = \frac{e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}}{\sum_{l=0}^M e^{\alpha_k + {\bf X}_{ikl}{\bf \beta}}}\]

Now \(e^{\alpha_k}\) can be pulled out from both numerator and denominator:

\[L_k(\beta) = \frac{e^{ {\bf X}_{ik}{\bf \beta}}}{\sum_{l=0}^M e^{ {\bf X}_{ikl}{\bf \beta}}}\]

After the derivation of the log likelihood, the intercepts will cancel: Since the intercepts can’t be estimated the probability of disease per strata cannot be estimated either. However, the \(\log{\text{ODDS}}\) is independent of the intercepts and can be estimated.

The overall likelihood is the product of likelihood for each stratum:

\[L(\beta)= \prod_{k=1}^K L_k(\beta)\]


From here:

data(backpain)
head(backpain)
##   ID  status driver suburban
## 1  1    case    yes      yes
## 2  1 control    yes       no
## 3  2    case    yes      yes
## 4  2 control    yes      yes
## 5  3    case    yes       no
## 6  3 control    yes      yes
backpain_glm <- clogit(I(status == "case") ~ driver + suburban + strata(ID), data = backpain)
summary(backpain_glm)
## Call:
## coxph(formula = Surv(rep(1, 434L), I(status == "case")) ~ driver + 
##     suburban + strata(ID), data = backpain, method = "exact")
## 
##   n= 434, number of events= 217 
## 
##               coef exp(coef) se(coef)     z Pr(>|z|)  
## driveryes   0.6579    1.9307   0.2940 2.238   0.0252 *
## suburbanyes 0.2555    1.2911   0.2258 1.131   0.2580  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##             exp(coef) exp(-coef) lower .95 upper .95
## driveryes       1.931     0.5180    1.0851     3.435
## suburbanyes     1.291     0.7746    0.8293     2.010
## 
## Concordance= 0.562  (se = 0.036 )
## Likelihood ratio test= 9.55  on 2 df,   p=0.008
## Wald test            = 8.85  on 2 df,   p=0.01
## Score (logrank) test = 9.31  on 2 df,   p=0.009

The estimate of the odds ratio of a herniated disc occurring in a driver relative to a nondriver is \(1.93\) with a \(95\%\) confidence interval of \((1.09, 3.44).\) Conditional on residence we can say that the risk of a herniated disc occurring in a driver is about twice that of a nondriver. There is no evidence that where a person lives affects the risk of lower back pain.


From here:

Matching on demographic variables is commonly used in case–control studies to adjust for confounding at the design stage. There is a presumption that matched data need to be analyzed by matched methods. Conditional logistic regression has become a standard for matched case–control data to tackle the sparse data problem.

From here:

In matched case control studies each case is matched with a control subject based on variables that could affect the response but is not necessarily of interest to the research. Factors such as age, gender, race etc. are taken into consideration when matching. Because matching is subject specific each case-control pair potentially has a different probability of risk. Performing a logistic regression analysis on this would result in needing dummy variables for each pair! Doing so results in too many fixed effects to estimate with respect to the sample size and leads to biased estimates.

From here. The matching algorithm in R is here.

The Los Angeles Study of Endometrial Cancer was a matched case-control study conducted in California in the 1970’s. There are 63 cases of endometrial cancer, all women age 55 or over, each matched to four controls living in the same retirement community. The primary exposure of interest was estrogen use. The secondary exposure was gallbladder disease. The Epi library in R includes two versions of the data: the full dataset bdendo and a subset containing a single control matched to each case bdendo11.

set         Case-control set: a numeric vector
d           Case or control: a numeric vector (1=case, 0=control)
gall        Gall bladder disease
hyp         Hypertension
ob          Obesity
est         Estrogen therapy
dur         Duration of conjugated estrogen therapy: levels 0, 1, 2, 3, 4.
non         Use of non estrogen drugs: a factor with levels No Yes.
duration    Months of oestrogen therapy: a numeric vector.
cest        Conjugated oestrogen dose a factor with levels 0, 1, 2, 3

data(bdendo11)
head(bdendo11)
##   set d gall hyp   ob est dur non duration age cest agegrp  age3
## 1   1 1   No  No  Yes Yes   4 Yes       96  74    3  70-74 65-74
## 2   1 0   No  No <NA>  No   0  No        0  75    0  70-74 65-74
## 3   2 1   No  No   No Yes   4 Yes       96  67    3  65-69 65-74
## 4   2 0   No  No   No Yes   1  No        5  67    3  65-69 65-74
## 5   3 1   No Yes  Yes Yes   1 Yes        9  76    1  75-79   75+
## 6   3 0   No Yes  Yes Yes   4 Yes       96  76    2  75-79   75+
with(bdendo11, twoby2(d,est))
## 2 by 2 table analysis: 
## ------------------------------------------------------ 
## Outcome   : No 
## Comparing : 0 vs. 1 
## 
##   No Yes    P(No) 95% conf. interval
## 0 33  30   0.5238    0.4015   0.6433
## 1  7  56   0.1111    0.0539   0.2152
## 
##                                    95% conf. interval
##              Relative Risk: 4.7143    2.2559   9.8517
##          Sample Odds Ratio: 8.8000    3.4778  22.2669
## Conditional MLE Odds Ratio: 8.6344    3.2632  26.0178
##     Probability difference: 0.4127    0.2550   0.5437
## 
##              Exact P-value: 0.0000 
##         Asymptotic P-value: 0.0000 
## ------------------------------------------------------

From the package survival the function clogit():

fit <- clogit(d ~ est + strata(set), data = bdendo11)
summary(fit)
## Call:
## coxph(formula = Surv(rep(1, 126L), d) ~ est + strata(set), data = bdendo11, 
##     method = "exact")
## 
##   n= 126, number of events= 63 
## 
##          coef exp(coef) se(coef)     z Pr(>|z|)    
## estYes 2.2687    9.6667   0.6065 3.741 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##        exp(coef) exp(-coef) lower .95 upper .95
## estYes     9.667     0.1034     2.945     31.73
## 
## Concordance= 0.706  (se = 0.052 )
## Likelihood ratio test= 24.45  on 1 df,   p=8e-07
## Wald test            = 13.99  on 1 df,   p=2e-04
## Score (logrank) test = 21.12  on 1 df,   p=4e-06
extractAIC(fit)[2]
## [1] 64.88739

From the package Epi the function clogistic():

fit2 <- clogistic(d ~ est, strata = set, data = bdendo11)
fit2
## 
## Call: 
## clogistic(formula = d ~ est, strata = set, data = bdendo11)
## 
## 
## 
## 
##        coef exp(coef) se(coef)    z       p
## estYes 2.27      9.67    0.606 3.74 0.00018
## 
## Likelihood ratio test=24.4  on 1 df, p=7.63e-07, n=64

From here:

Cases (case) are women with secondary infertility (1). spontaneous and induced are abortions. The cases are matched to controls based on education, age and parity. There are two controls per case:

data("infert")
infert[infert$pooled.stratum==2,]
##     education age parity induced case spontaneous stratum pooled.stratum
## 4      0-5yrs  34      4       2    1           0       4              2
## 87     0-5yrs  34      4       0    0           1       4              2
## 169    0-5yrs  34      4       0    0           2       4              2

We could ignore the matching factors altogether:

regLG <- glm(case ~ induced + spontaneous, data=infert, family='binomial')
summary(regLG)$coef
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.7078601  0.2677095 -6.379528 1.776344e-10
## induced      0.4181294  0.2056274  2.033432 4.200891e-02
## spontaneous  1.1972050  0.2116433  5.656712 1.543004e-08

Or treat matching factors as covariates:

adjregLG <- glm(case ~ induced + spontaneous + age + education + parity, data=infert, family='binomial')
summary(adjregLG)$coef
##                    Estimate Std. Error    z value     Pr(>|z|)
## (Intercept)      -1.1492365 1.41219537 -0.8137943 4.157628e-01
## induced           1.2887574 0.30146019  4.2750499 1.910945e-05
## spontaneous       2.0459050 0.31015652  6.5963631 4.213663e-11
## age               0.0395820 0.03120253  1.2685509 2.046013e-01
## education6-11yrs -1.0442436 0.79254976 -1.3175748 1.876460e-01
## education12+ yrs -1.4032051 0.83415618 -1.6821851 9.253295e-02
## parity           -0.8282774 0.19648949 -4.2153775 2.493607e-05

However, this can lead to bias in the coefficients, as can bee seen in the different outputs above. Without adjustment, for instance, only spontaneous abortion is significant.

Using conditional logistic regression:

condLG <- clogit(case ~ induced + spontaneous + strata(pooled.stratum), data=infert)
summary(condLG)$coef
##                 coef exp(coef)  se(coef)        z     Pr(>|z|)
## induced     1.426510  4.164142 0.3502751 4.072543 4.650262e-05
## spontaneous 2.027885  7.598001 0.3460100 5.860770 4.607247e-09

The values are closer to the adjusted coefficients.


Another example from here:

data(VC1to6)
fitclogit<-clogit(case ~ smoking + rubber + alcohol + strata(matset), data=VC1to6)
summary(fitclogit)
## Call:
## coxph(formula = Surv(rep(1, 119L), case) ~ smoking + rubber + 
##     alcohol + strata(matset), data = VC1to6, method = "exact")
## 
##   n= 119, number of events= 26 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)   
## smoking  0.4398    1.5523   0.6462  0.681  0.49616   
## rubber  -0.4572    0.6331   0.6474 -0.706  0.48002   
## alcohol  1.6668    5.2951   0.5952  2.800  0.00511 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## smoking    1.5523     0.6442    0.4375     5.508
## rubber     0.6331     1.5797    0.1780     2.251
## alcohol    5.2951     0.1889    1.6490    17.003
## 
## Concordance= 0.688  (se = 0.065 )
## Likelihood ratio test= 12  on 3 df,   p=0.007
## Wald test            = 9.18  on 3 df,   p=0.03
## Score (logrank) test = 11.24  on 3 df,   p=0.01

Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.