From here:
Matching is a way of controlling for confounding and bias, as well as increasing precision. Subjects are paired or grouped based on pre-specified characteristics. In matched case-control studies, each case is matched to one or more subjects without the disease on covariates of interest (age, gender, ethnicity).
Matching variables are not included in the model as covariates. Including them (as one indicator per matched set in an ordinary logistic regression) leads to severe bias in the odds-ratio estimate: with 1:1 matching it converges to approximately the square of the true odds ratio.
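A minimal simulation sketch of this bias (hypothetical parameter values, not from the source): 1:1 matched pairs are generated from a logistic model with pair-specific intercepts, pairs with exactly one case are kept, and the data are then fitted both unconditionally with one dummy variable per pair and conditionally with clogit().
library(survival)   # clogit()
set.seed(1)
K     <- 500                    # matched pairs (strata)
alpha <- rnorm(K, 0, 2)         # pair-specific intercepts
beta  <- log(2)                 # true log odds ratio
x <- matrix(rbinom(2 * K, 1, 0.5), ncol = 2)   # exposure of the two subjects per pair
p <- plogis(alpha + beta * x)                  # disease probability
y <- matrix(rbinom(2 * K, 1, p), ncol = 2)
keep <- rowSums(y) == 1                        # keep pairs with exactly one case
dat <- data.frame(set = rep(seq_len(sum(keep)), each = 2),
                  y   = as.vector(t(y[keep, ])),
                  x   = as.vector(t(x[keep, ])))
coef(glm(y ~ x + factor(set), family = binomial, data = dat))["x"]   # roughly 2 * log(2): the log-OR is about doubled
coef(clogit(y ~ x + strata(set), data = dat))["x"]                   # roughly log(2)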
The notation is (from here)
The probability that the \(i\)-th person in the \(k\)-th matched set has disease is:
\[P_{k}({\bf X}_{ik})\]
where \(k\) indexes the strata (matched sets, \(k \in \{1,2,\dots, K\}\)) and \(i\) the individuals, \(i\in \{0,1,\dots,M\}\), with \(i=0\) being the case and \(i=1,\dots, M\) the matched controls in a \(1:M\) case-control study. \({\bf X}_{ik}= (X_{ik_1}, X_{ik_2},\dots,X_{ik_p})\) is the vector of covariates of interest for the \(i\)-th person in the \(k\)-th matched set.
The model is
\[\text{logit}\left(P_k(X_{ik})\right)=\log\frac{P_k(X_{ik})}{1 - P_k(X_{ik})}=\alpha_k + \beta_1 \,X_{ik_1}+\dots+\beta_p \,X_{ik_p}\]
The odds that a subject with \(x = 1\) is a case equals \(\exp(\beta)\) times the odds that a subject with \(x = 0\) is a case in the simple case \(\text{logit}(P(\text{case}))=\alpha + \beta\,x.\)
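This follows directly from the linear predictor:
\[\frac{\text{odds}(x=1)}{\text{odds}(x=0)}=\frac{e^{\alpha+\beta\cdot 1}}{e^{\alpha+\beta\cdot 0}}=e^{\beta}\]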
Each matched set has a different intercept, summarizing the effect of the matching variables on the probability of disease.
And each \(\beta_j\) is the \(\log\text{OR}\) of disease comparing those exposed to covariate \(j\) with those not exposed to it, holding the other covariates fixed.
The problem with the model above is that the number of parameters increases at the same rate as the sample size, with the consequence that maximum likelihood estimation is no longer viable. We can overcome this problem if we regard the parameters \(\alpha_k\) as of little interest and are willing to forgo their estimation. If we do, we can construct a conditional likelihood function that yields maximum likelihood estimators of the coefficients \(\beta_1, \dots , \beta_p\) that are consistent and asymptotically normally distributed.
The joint probability that the vector of covariates \({\bf X}_{0k}\) corresponds to the case, and \({\bf X}_{ik}\) with \(i\in \{1,\dots, M\}\) to the controls is
\[P\left({\bf X}_{0k}\mid Y=1 \;\cap\; {\bf X}_{1k},\dots,{\bf X}_{Mk} \mid Y=0\right) = P({\bf X}_{0k} \mid Y=1)\;\; \prod_{i=1}^M\,P({\bf X}_{ik} \mid Y=0)\]
because we assume that the covariate vectors are independent of each other.
To construct the conditional likelihood we also need the probability that exactly one of the \(M+1\) subjects in the \(k\)-th matched set is the case and the remaining \(M\) are controls, whichever subject that is:
\[\begin{align} &P({\bf X}_{0k}\mid Y=1) \prod_{i=1}^M P({\bf X}_{ik}\mid Y=0) + P({\bf X}_{1k}\mid Y=1) \prod_{i\neq 1} P({\bf X}_{ik}\mid Y=0)+ \dots + P({\bf X}_{Mk}\mid Y=1) \prod_{i\neq M} P({\bf X}_{ik}\mid Y=0)\\[2ex] &=\sum_{l=0}^MP({\bf X}_{lk}\mid Y=1) \prod_{r \neq l}^M P({\bf X}_{rk}\mid Y=0) \end{align} \]
The conditional likelihood of interest is
\[L_k = \frac{P({\bf X}_{0k}\mid Y=1) \quad\displaystyle\prod_{i=1}^M P({\bf X}_{ik}\mid Y=0)}{\sum_{l=0}^MP({\bf X}_{lk}\mid Y=1)\quad \displaystyle \prod_{r \neq l}^M P({\bf X}_{rk}\mid Y=0)}\] Utilizing Bayes’ theorem, \(\small P({\bf X}_{ik}\mid Y=1)=\frac{P(Y=1\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=1)}\) and \(\small P({\bf X}_{ik}\mid Y=0)=\frac{P(Y=0\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=0)}\):
\[L_k =\displaystyle \frac{\frac{P(Y=1\mid {\bf X}_{0k})\,P({\bf X}_{0k})}{P(Y=1)} \quad\displaystyle\prod_{i=1}^M \frac{P(Y=0\mid {\bf X}_{ik})\,P({\bf X}_{ik})}{P(Y=0)}}{\displaystyle\sum_{l=0}^M\frac{P(Y=1\mid {\bf X}_{lk})\,P({\bf X}_{lk})}{P(Y=1)} \quad\displaystyle\prod_{r \neq l}^M \frac{P(Y=0\mid {\bf X}_{rk})\,P({\bf X}_{rk})}{P(Y=0)}}\]
The factor \(\frac{\prod_{i=0}^M P({\bf X}_{ik})}{P(Y=1)\,P(Y=0)^M}\) appears in the numerator and in every term of the denominator and cancels out, yielding
\[L_k =\frac{P(Y=1\mid {\bf X}_{0k})\quad \displaystyle \prod_{i=1}^M P(Y=0\mid {\bf X}_{ik})}{\displaystyle \sum_{l=0}^M P(Y=1\mid {\bf X}_{lk}) \quad \displaystyle\prod_{r \neq l}^M P(Y=0\mid {\bf X}_{rk})}\tag 1\]
Now, for the \(i\)-th person in the \(k\)-th stratum,
\[\text{logit}\,P(Y_i = 1 \mid {\bf X}_{ik}) =\alpha_k + \beta_1 X_{ik_1}+\dots + \beta_p X_{ik_p}\]
or
\[P(Y_i=1\mid {\bf X}_{ik}) = \frac{e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}}{1 + e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}} \]
and
\[P(Y_i=0\mid {\bf X}_{ik}) =1 - P(Y_i =1\mid {\bf X}_{ik})= \frac{1}{1 + e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}} \] Substituting into Eq \((1)\), the common factor \(\prod_{i=0}^M \left(1 + e^{\alpha_k + {\bf X}_{ik}{\bf \beta}}\right)^{-1}\) appears in the numerator and in every term of the denominator and cancels, leaving
\[L_k(\beta) = \frac{e^{\alpha_k + {\bf X}_{0k}{\bf \beta}}}{\sum_{l=0}^M e^{\alpha_k + {\bf X}_{lk}{\bf \beta}}}\]
Now \(e^{\alpha_k}\) can be pulled out from both numerator and denominator:
\[L_k(\beta) = \frac{e^{ {\bf X}_{0k}{\bf \beta}}}{\sum_{l=0}^M e^{ {\bf X}_{lk}{\bf \beta}}}\]
The intercepts have now cancelled out of the conditional likelihood. Since they cannot be estimated, the probability of disease per stratum cannot be estimated either. However, the \(\log\text{OR}\) does not depend on the intercepts and can be estimated.
The overall likelihood is the product of the stratum-specific likelihoods:
\[L(\beta)= \prod_{k=1}^K L_k(\beta)\]
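As a check of the derivation, here is a minimal sketch (a small made-up data frame; the column names y, x and set are illustrative) that evaluates \(\sum_k \log L_k(\beta)\) directly and maximizes it numerically; the result should agree with clogit() up to numerical tolerance, since with one case per set the exact partial likelihood used by clogit() is exactly this conditional likelihood.
library(survival)
dat <- data.frame(set = rep(1:4, each = 3),            # 4 matched sets, 1 case : 2 controls
                  y   = rep(c(1, 0, 0), times = 4),    # first subject in each set is the case
                  x   = c(1, 0, 0,  1, 1, 0,  0, 0, 1,  1, 0, 1))
cond_loglik <- function(beta, y, x, set) {
  eta <- x * beta                                      # X beta; the alpha_k have cancelled
  sum(tapply(seq_along(y), set, function(idx)
    eta[idx][y[idx] == 1] - log(sum(exp(eta[idx])))))  # log L_k for each set, then summed
}
opt <- optimize(cond_loglik, c(-10, 10), maximum = TRUE,
                y = dat$y, x = dat$x, set = dat$set)
opt$maximum                                            # manual conditional MLE of beta
coef(clogit(y ~ x + strata(set), data = dat))          # should agree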
From here:
data(backpain)
head(backpain)
## ID status driver suburban
## 1 1 case yes yes
## 2 1 control yes no
## 3 2 case yes yes
## 4 2 control yes yes
## 5 3 case yes no
## 6 3 control yes yes
library(survival)   # clogit()
backpain_glm <- clogit(I(status == "case") ~ driver + suburban + strata(ID), data = backpain)
summary(backpain_glm)
## Call:
## coxph(formula = Surv(rep(1, 434L), I(status == "case")) ~ driver +
## suburban + strata(ID), data = backpain, method = "exact")
##
## n= 434, number of events= 217
##
## coef exp(coef) se(coef) z Pr(>|z|)
## driveryes 0.6579 1.9307 0.2940 2.238 0.0252 *
## suburbanyes 0.2555 1.2911 0.2258 1.131 0.2580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## driveryes 1.931 0.5180 1.0851 3.435
## suburbanyes 1.291 0.7746 0.8293 2.010
##
## Concordance= 0.562 (se = 0.036 )
## Likelihood ratio test= 9.55 on 2 df, p=0.008
## Wald test = 8.85 on 2 df, p=0.01
## Score (logrank) test = 9.31 on 2 df, p=0.009
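The reported confidence interval for driveryes can be reproduced by hand from the coefficient and its standard error (a Wald interval on the log-odds scale, then exponentiated):
exp(0.6579 + c(-1, 1) * qnorm(0.975) * 0.2940)   # approximately (1.085, 3.435), as in the summary above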
The estimate of the odds ratio of a herniated disc occurring in a driver relative to a nondriver is \(1.93\), with a \(95\%\) confidence interval of \((1.09, 3.44).\) Conditional on place of residence, the odds of a herniated disc for a driver are about twice those for a nondriver. There is no evidence that where a person lives affects the risk of lower back pain.
From here:
Matching on demographic variables is commonly used in case–control studies to adjust for confounding at the design stage. There is a presumption that matched data need to be analyzed by matched methods. Conditional logistic regression has become a standard for matched case–control data to tackle the sparse data problem.
From here:
In matched case-control studies each case is matched with a control subject based on variables that could affect the response but are not necessarily of interest to the research question. Factors such as age, gender, race, etc. are taken into consideration when matching. Because matching is subject-specific, each case-control pair potentially has a different baseline probability of disease. An ordinary logistic regression analysis would then need a dummy variable for each pair, resulting in too many fixed effects to estimate relative to the sample size and leading to biased estimates.
From here. The matching algorithm in R is here.
The Los Angeles Study of Endometrial Cancer was a matched
case-control study conducted in California in the 1970’s. There are 63
cases of endometrial cancer, all women age 55 or over, each matched to
four controls living in the same retirement community. The primary
exposure of interest was estrogen use. The secondary exposure was
gallbladder disease. The Epi library in R includes two versions of the
data: the full dataset bdendo and a subset, bdendo11, containing a single control matched to each case.
- set: case-control set (a numeric vector)
- d: case or control (1 = case, 0 = control)
- gall: gallbladder disease
- hyp: hypertension
- ob: obesity
- est: estrogen therapy
- dur: duration of conjugated estrogen therapy (levels 0, 1, 2, 3, 4)
- non: use of non-estrogen drugs (a factor with levels No, Yes)
- duration: months of estrogen therapy (a numeric vector)
- cest: conjugated estrogen dose (a factor with levels 0, 1, 2, 3)
library(Epi)   # provides bdendo11 and twoby2()
data(bdendo11)
head(bdendo11)
## set d gall hyp ob est dur non duration age cest agegrp age3
## 1 1 1 No No Yes Yes 4 Yes 96 74 3 70-74 65-74
## 2 1 0 No No <NA> No 0 No 0 75 0 70-74 65-74
## 3 2 1 No No No Yes 4 Yes 96 67 3 65-69 65-74
## 4 2 0 No No No Yes 1 No 5 67 3 65-69 65-74
## 5 3 1 No Yes Yes Yes 1 Yes 9 76 1 75-79 75+
## 6 3 0 No Yes Yes Yes 4 Yes 96 76 2 75-79 75+
with(bdendo11, twoby2(d,est))
## 2 by 2 table analysis:
## ------------------------------------------------------
## Outcome : No
## Comparing : 0 vs. 1
##
## No Yes P(No) 95% conf. interval
## 0 33 30 0.5238 0.4015 0.6433
## 1 7 56 0.1111 0.0539 0.2152
##
## 95% conf. interval
## Relative Risk: 4.7143 2.2559 9.8517
## Sample Odds Ratio: 8.8000 3.4778 22.2669
## Conditional MLE Odds Ratio: 8.6344 3.2632 26.0178
## Probability difference: 0.4127 0.2550 0.5437
##
## Exact P-value: 0.0000
## Asymptotic P-value: 0.0000
## ------------------------------------------------------
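The crude (unmatched) sample odds ratio in the output above can be reproduced directly from the 2-by-2 table:
(33 * 56) / (30 * 7)
## [1] 8.8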
From the package survival, the function clogit():
fit <- clogit(d ~ est + strata(set), data = bdendo11)
summary(fit)
## Call:
## coxph(formula = Surv(rep(1, 126L), d) ~ est + strata(set), data = bdendo11,
## method = "exact")
##
## n= 126, number of events= 63
##
## coef exp(coef) se(coef) z Pr(>|z|)
## estYes 2.2687 9.6667 0.6065 3.741 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## estYes 9.667 0.1034 2.945 31.73
##
## Concordance= 0.706 (se = 0.052 )
## Likelihood ratio test= 24.45 on 1 df, p=8e-07
## Wald test = 13.99 on 1 df, p=2e-04
## Score (logrank) test = 21.12 on 1 df, p=4e-06
extractAIC(fit)[2]
## [1] 64.88739
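For 1:1 matching with a binary exposure, the conditional MLE of the odds ratio is the ratio of the two types of discordant pairs. A sketch that tabulates the pairs directly from bdendo11 (reshaping to one row per set; the exp(coef) above should equal the count of case-exposed-only pairs divided by the count of control-exposed-only pairs):
wide <- reshape(bdendo11[, c("set", "d", "est")], idvar = "set",
                timevar = "d", direction = "wide")
with(wide, table(case = est.1, control = est.0))   # the discordant cells give the odds ratio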
From the package Epi, the function clogistic():
fit2 <- clogistic(d ~ est, strata = set, data = bdendo11)
fit2
##
## Call:
## clogistic(formula = d ~ est, strata = set, data = bdendo11)
##
##
##
##
## coef exp(coef) se(coef) z p
## estYes 2.27 9.67 0.606 3.74 0.00018
##
## Likelihood ratio test=24.4 on 1 df, p=7.63e-07, n=64
From here:
Cases (case) are women with secondary infertility (1). spontaneous and induced are the numbers of prior spontaneous and induced abortions. The cases are matched to controls based on education, age and parity. There are two controls per case:
data("infert")
infert[infert$pooled.stratum==2,]
## education age parity induced case spontaneous stratum pooled.stratum
## 4 0-5yrs 34 4 2 1 0 4 2
## 87 0-5yrs 34 4 0 0 1 4 2
## 169 0-5yrs 34 4 0 0 2 4 2
We could ignore the matching factors altogether:
regLG <- glm(case ~ induced + spontaneous, data=infert, family='binomial')
summary(regLG)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7078601 0.2677095 -6.379528 1.776344e-10
## induced 0.4181294 0.2056274 2.033432 4.200891e-02
## spontaneous 1.1972050 0.2116433 5.656712 1.543004e-08
Or treat matching factors as covariates:
adjregLG <- glm(case ~ induced + spontaneous + age + education + parity, data=infert, family='binomial')
summary(adjregLG)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1492365 1.41219537 -0.8137943 4.157628e-01
## induced 1.2887574 0.30146019 4.2750499 1.910945e-05
## spontaneous 2.0459050 0.31015652 6.5963631 4.213663e-11
## age 0.0395820 0.03120253 1.2685509 2.046013e-01
## education6-11yrs -1.0442436 0.79254976 -1.3175748 1.876460e-01
## education12+ yrs -1.4032051 0.83415618 -1.6821851 9.253295e-02
## parity -0.8282774 0.19648949 -4.2153775 2.493607e-05
However, this can lead to bias in the coefficients, as can be seen in the different outputs above. Without adjustment, for instance, the coefficient for induced abortion is much smaller and only marginally significant.
Using conditional logistic regression:
condLG <- clogit(case ~ induced + spontaneous + strata(pooled.stratum), data=infert)
summary(condLG)$coef
## coef exp(coef) se(coef) z Pr(>|z|)
## induced 1.426510 4.164142 0.3502751 4.072543 4.650262e-05
## spontaneous 2.027885 7.598001 0.3460100 5.860770 4.607247e-09
The conditional estimates are much closer to the covariate-adjusted coefficients than to the unadjusted ones.
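As a check of the earlier point about putting the matched sets into the model as dummy variables, the sketch below (results not shown) fits one indicator per pooled stratum; with 1:2 matching the coefficients can be expected to be inflated relative to the conditional estimates, and the fit may produce warnings about fitted probabilities of 0 or 1.
dummyLG <- glm(case ~ induced + spontaneous + factor(pooled.stratum),
               data = infert, family = 'binomial')
summary(dummyLG)$coef[c("induced", "spontaneous"), ]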
Another example from here:
library(epiDisplay)   # assumed source of the VC1to6 1:n matched case-control data
data(VC1to6)
fitclogit<-clogit(case ~ smoking + rubber + alcohol + strata(matset), data=VC1to6)
summary(fitclogit)
## Call:
## coxph(formula = Surv(rep(1, 119L), case) ~ smoking + rubber +
## alcohol + strata(matset), data = VC1to6, method = "exact")
##
## n= 119, number of events= 26
##
## coef exp(coef) se(coef) z Pr(>|z|)
## smoking 0.4398 1.5523 0.6462 0.681 0.49616
## rubber -0.4572 0.6331 0.6474 -0.706 0.48002
## alcohol 1.6668 5.2951 0.5952 2.800 0.00511 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## smoking 1.5523 0.6442 0.4375 5.508
## rubber 0.6331 1.5797 0.1780 2.251
## alcohol 5.2951 0.1889 1.6490 17.003
##
## Concordance= 0.688 (se = 0.065 )
## Likelihood ratio test= 12 on 3 df, p=0.007
## Wald test = 9.18 on 3 df, p=0.03
## Score (logrank) test = 11.24 on 3 df, p=0.01
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.