NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Causal Inference:


The theoretical \(\text{do}(\cdot)\) operator (Judea Pearl)

The \(do()\) operator is written mathematically as \(P(Y | do(X=x))\), which is the probability distribution of the outcome \(Y\) when the treatment \(X\) is forced to the value \(x\).

In standard conditional probability, \(P(Y | X=x)\) means “What is the probability of \(Y\) given that we observed \(X\) taking the value \(x\)?” This includes the effects of confounding. In Do-Calculus the \(do()\) operator represents intervention. It means: “What is the probability of \(Y\) if we surgically set \(X\) to the value \(x\), breaking all other confounding influences?”

In a Directed Acyclic Graph (DAG), applying the \(do()\) operator is mathematically equivalent to deleting all incoming arrows to the variable being intervened upon. This manipulation allows deriving formulas (like the backdoor criterion) that tell researchers how to adjust for confounding using only observational data.

The Backdoor Criterion provides a principled way to identify the confounders you must control for in your statistical analysis (e.g., regression, IPTW, matching). A Backdoor Path is any non-causal path between the treatment variable (\(X\)) and the outcome variable (\(Y\)) that begins with an arrow pointing into the treatment variable (\(X\)). These paths represent the spurious correlation or confounding created by common causes of \(X\) and \(Y\). Example Path: \(X \leftarrow \text{Confounder} \rightarrow Y\). Any association flowing through this path is spurious (non-causal).

The Backdoor Criterion: A set of variables \(\mathbf{Z}\) satisfies the Backdoor Criterion relative to the treatment \(X\) and outcome \(Y\) if it meets two conditions in the Directed Acyclic Graph (DAG): Condition 1 (Blockage): \(\mathbf{Z}\) blocks every backdoor path between \(X\) and \(Y\). Blocking means ensuring that no spurious correlation flows along these paths. This is achieved by conditioning on (adjusting for) the variables in \(\mathbf{Z}\). Condition 2 (No Descendants): No variable in \(\mathbf{Z}\) is a descendant of \(X\). This prevents introducing collider bias or removing a part of the true causal effect, which can happen if you control for a variable that is caused by the treatment.
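As a quick illustration (my addition, not part of the original notes), the {dagitty} R package can apply the Backdoor Criterion mechanically. The sketch below encodes a toy DAG with a hypothetical confounder Z and mediator M and asks for the minimal adjustment sets for the effect of X on Y.

library(dagitty)

# Toy DAG: Z confounds X -> Y; M lies on the causal path X -> M -> Y
toy_dag <- dagitty("dag {
  X -> Y
  X -> M -> Y
  Z -> X
  Z -> Y
}")

# Minimal sets satisfying the Backdoor Criterion for the effect of X on Y
adjustmentSets(toy_dag, exposure = "X", outcome = "Y")
# Expected: { Z }. Adjusting for Z blocks the backdoor path X <- Z -> Y,
# while M must NOT be adjusted for because it is a descendant of X.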

If a set of variables \(\mathbf{Z}\) satisfies the Backdoor Criterion, the causal effect of \(X\) on \(Y\) can be formally identified and estimated using the Backdoor Adjustment Formula:

\[P(Y|do(X=x)) = \sum_{z} P(Y|X=x, \mathbf{Z}=z)P(\mathbf{Z}=z)\]

This formula states that the probability of the outcome \(Y\) under a theoretical intervention (\(do(X)\)) can be calculated by adjusting for \(\mathbf{Z}\) across the entire population. The formula works in two steps:

  • Step 1: Conditioning and Estimation: The term \(P(Y|X=x, \mathbf{Z}=z)\) means we estimate the effect of \(X\) on \(Y\) separately within each subgroup defined by the specific value of the confounder(s) \(\mathbf{Z}=z\). Because \(\mathbf{Z}\) blocks the backdoor paths, within these subgroups, the relationship between \(X\) and \(Y\) is assumed to be unconfounded (like an RCT).

  • Step 2: Marginalization (Averaging): The term \(\sum_{z} \dots P(\mathbf{Z}=z)\) is the marginalization step. After estimating the unconfounded effect in every subgroup, we must average these effects across all possible values of \(\mathbf{Z}\) to get the overall effect for the entire population. The term \(P(\mathbf{Z}=z)\) acts as the weight for the average, ensuring that each subgroup contribution is proportional to its prevalence in the population. This final averaging step is necessary to arrive at the population-level causal effect \(P(Y|do(X=x))\), which is the goal of the \(do()\) operator.
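To make these two steps concrete, here is a minimal simulated sketch (my addition): a binary confounder \(Z\), a binary treatment \(X\) that depends on \(Z\), and a binary outcome \(Y\) that depends on both. The adjustment formula is applied by estimating \(P(Y=1 \mid X=1, Z=z)\) within each stratum of \(Z\) and averaging with weights \(P(Z=z)\).

set.seed(1)
n <- 1e5
z <- rbinom(n, 1, 0.4)                        # confounder
x <- rbinom(n, 1, plogis(-1 + 2 * z))         # treatment depends on Z
y <- rbinom(n, 1, plogis(-1 + x + 1.5 * z))   # outcome depends on X and Z

# Naive (confounded) quantity: P(Y = 1 | X = 1)
mean(y[x == 1])

# Backdoor adjustment: sum over z of P(Y = 1 | X = 1, Z = z) * P(Z = z)
sum(sapply(0:1, function(zz) mean(y[x == 1 & z == zz]) * mean(z == zz)))
# The second number approximates P(Y = 1 | do(X = 1)); the first is inflated
# because Z raises both the chance of treatment and the chance of the outcome.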

The theoretical \(do()\) operator represents the ability to perfectly intervene and set a variable's value while breaking all causal paths leading into it. A Randomized Controlled Trial (RCT) is the real-world gold standard because the act of random assignment guarantees, on average, that:

  • The treatment and control groups are perfectly exchangeable (identical in all respects, measured or unmeasured, except for the treatment).

  • Any observed difference in the outcome must be due to the treatment.

  • All confounding paths are naturally blocked.


Implementation of the \(\text{do}(\cdot)\) Operator:


The implementation for empirical observational data is carried out with methods that rely on a powerful, untestable assumption called Conditional Exchangeability (or No Unmeasured Confounding): the treatment assignment is independent of the potential outcomes, conditional on the set of measured confounders (\(X\)). This assumption means that once you adjust for all the variables you did measure (those included in your propensity model), there are no remaining unmeasured variables that affect both the treatment and the outcome.

\[\text{Outcome} \perp \text{Treatment} \mid X\]

We can never be certain that the “No Unmeasured Confounding” assumption holds. First, there may be unmeasured variables: critical confounders that influence both the treatment and the outcome but are simply not in the dataset. Further, there is possible measurement error: even if a confounder is measured, poor or noisy measurement can still leave residual confounding. The adjustment is only as good as the quality of the measured variables.

Here are some of the methods:

  1. IPTW (Inverse Probability of Treatment Weighting) is a statistical method used to emulate the effects of the theoretical \(do()\) operator from Causal Inference. Mechanism: each subject is weighted by the inverse of the probability of receiving the treatment they actually received, creating a weighted “pseudo-population” in which the measured confounders are balanced across treatment groups. Analogy to \(do()\): the weighting mimics an intervention that assigns treatment independently of the measured confounders (this method is developed in detail below).

  2. Matching (e.g., Propensity Score Matching). This involves pairing each treated unit with one or more untreated units that have a very similar propensity score (*) (or similar covariate values). It creates a subset of the data where, within each matched pair, the treatment assignment is nearly random. It physically discards observations that have no good match. Analogy to \(do()\): It constructs a highly balanced, smaller “pseudo-population” by ensuring that for every individual who received the treatment, there was a virtually identical individual who did not.

  3. G-Computation is a direct method for estimating the expected outcome under a specific intervention. It involves two steps: 1. Fit a statistical model (e.g., linear regression) to predict the outcome based on the treatment and confounders. 2. Predict: Use that model to predict the outcome for every person in the dataset, first assuming everyone received the treatment (\(do(A=1)\)), and then predicting the outcome assuming no one received the treatment (\(do(A=0)\)). Analogy to \(do()\): It uses the fitted model to mathematically simulate the outcomes under the hypothetical forced interventions (\(A=1\) and \(A=0\)). The difference in the average predicted outcomes is the estimated causal effect.

  4. Double Robust Methods (e.g., Targeted Maximum Likelihood Estimation - TMLE) These methods combine elements of both modeling the treatment (like IPTW) and modeling the outcome (like G-Computation). Mechanism: They rely on fitting both the propensity score model and the outcome model. They are called “double robust” because the resulting causal estimate will be unbiased if either the propensity score model or the outcome model is correctly specified, providing a safety net against model misspecification. Analogy to \(do()\): By cross-checking the assumptions of two different models, these methods provide a highly reliable estimate of the counterfactual outcome under the \(do()\) intervention.

(*) Propensity score: To predict a propensity score using logistic regression, you build a model where the dependent variable is the treatment or exposure status, and the independent variables are the confounding covariates. The predicted probability from this model for each individual is their propensity score, which represents the probability of receiving the treatment given their covariates. This score is then used to balance the treated and untreated groups for causal inference.


IPTW (Inverse Probability of Treatment Weighting)

The example that follows is from this YouTube presentation, with the code found here.

library(tidyverse)
library(broom)
library(rsample)
library(ggdag)
library(causaldata)
library(halfmoon)
library(propensity)
library(ggokabeito)

set.seed(1234)

In this guided exercise, we’ll attempt to answer a causal question: does quitting smoking make you gain weight? Causal modeling has a special place in the history of smoking research: the studies that demonstrated that smoking causes lung cancer were observational. Thanks to other studies, we also know that, if you’re already a smoker, quitting smoking reduces your risk of lung cancer. However, some have observed that former smokers tend to gain weight.

To answer this question, we’ll use causal inference methods to examine the relationship between quitting smoking and gaining weight. First, we’ll draw our assumptions with a causal diagram (a directed acyclic graph, or DAG), which will guide our model. Then, we’ll use a modeling approach called inverse probability weighting–one of many causal modeling techniques–to estimate the causal effect we’re interested in.

We’ll use data from NHEFS to try to estimate the causal effect of quitting smoking on weight gain. NHEFS is a longitudinal, observational study that has many of the variables we’ll need.

Getting rid of entries lost to follow up:

nhefs_complete_uc <- nhefs_complete |>
  filter(censored == 0)
head(nhefs_complete_uc)
## # A tibble: 6 × 67
##    seqn  qsmk death yrdth modth dadth   sbp   dbp sex     age race  income
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <fct>  <dbl>
## 1   233     0     0    NA    NA    NA   175    96 0        42 1         19
## 2   235     0     0    NA    NA    NA   123    80 0        36 0         18
## 3   244     0     0    NA    NA    NA   115    75 1        56 1         15
## 4   245     0     1    85     2    14   148    78 0        68 1         15
## 5   252     0     0    NA    NA    NA   118    77 0        40 0         18
## 6   257     0     0    NA    NA    NA   141    83 1        43 1         11
## # ℹ 55 more variables: marital <dbl>, school <dbl>, education <fct>, ht <dbl>,
## #   wt71 <dbl>, wt82 <dbl>, wt82_71 <dbl>, birthplace <dbl>,
## #   smokeintensity <dbl>, smkintensity82_71 <dbl>, smokeyrs <dbl>,
## #   asthma <dbl>, bronch <dbl>, tb <dbl>, hf <dbl>, hbp <dbl>,
## #   pepticulcer <dbl>, colitis <dbl>, hepatitis <dbl>, chroniccough <dbl>,
## #   hayfever <dbl>, diabetes <dbl>, polio <dbl>, tumor <dbl>,
## #   nervousbreak <dbl>, alcoholpy <dbl>, alcoholfreq <dbl>, …

Here is the distribution of the difference in weight between 1971 and 1982 for people who did and did not quit smoking:

nhefs_complete_uc |>
  ggplot(aes(wt82_71, fill = factor(qsmk))) + 
  geom_vline(xintercept = 0, color = "grey60", linewidth = 1) +
  geom_density(color = "white", alpha = .75, linewidth = .5) +
  scale_fill_okabe_ito(order = c(1, 5)) + 
  theme_minimal() +
  theme(legend.position = "bottom") + 
  labs(
    x = "change in weight (kg)",
    fill = "quit smoking (1 = yes)"
  )

Numerically,

# ~2.5 kg gained for quit vs. not quit
nhefs_complete_uc |>
  group_by(qsmk) |>
  summarize(
    mean_weight_change = mean(wt82_71), 
    sd = sd(wt82_71),
    .groups = "drop"
  )
## # A tibble: 2 × 3
##    qsmk mean_weight_change    sd
##   <dbl>              <dbl> <dbl>
## 1     0               1.98  7.45
## 2     1               4.53  8.75

Here, it looks like those who quit smoking gained, on average, about 2.5 kg more than those who did not quit. But is there something else that could explain these results? There are many factors associated with both quitting smoking and gaining weight; could one of those factors explain away the results we’re seeing here?

To truly answer this question, we need to specify a causal diagram based on domain knowledge. Sadly, in most circumstances, there is no data-driven approach that consistently identifies confounders. Only our causal assumptions can help us identify them. Causal diagrams are a visual expression of those assumptions, linked to rigorous mathematics, that allow us to understand what we need to account for in our model.

In R, we can visualize and analyze our DAGs with the {ggdag} package. {ggdag} uses {ggplot2} and {ggraph} to visualize diagrams and {dagitty} to analyze them. Let’s set up our assumptions. The dagify() function takes formulas, much like lm() and friends, to express assumptions. We have two basic causal structures: the causes of quitting smoking and the causes of gaining weight. Here, we’re assuming that the same set of variables affects both. Additionally, we’re adding qsmk as a cause of wt82_71, which is our causal question; we also identify these as our exposure and outcome. Finally, we’ll add some labels so the diagram is easier to understand. The result is a dagitty object, and we can transform it into a tidy_dagitty data set with tidy_dagitty().

smk_wt_dag <- dagify(
  # specify causes of quitting smoking and weight gain:
  qsmk ~ sex + race + age + education + 
    smokeintensity + smokeyrs + exercise + active + wt71,
  wt82_71 ~ qsmk + sex + race + age + education + 
    smokeintensity + smokeyrs + exercise + active + wt71,
  # specify causal question:
  exposure = "qsmk", 
  outcome = "wt82_71",
  coords = time_ordered_coords(),
  # set up labels:
  # here, I'll use the same variable names as the data set, but I'll label them
  # with clearer names
  labels = c(
    # causal question
    "qsmk" = "quit\nsmoking",
    "wt82_71" = "change in\nweight",
    
    # demographics
    "age" = "age",
    "sex" = "sex",
    "race" = "race",
    "education" = "education",
    
    # health
    "wt71" = "baseline\nweight",
    "active" = "daily\nactivity\nlevel",
    "exercise" = "exercise",
    
    # smoking history
    "smokeintensity" = "smoking\nintensity",
    "smokeyrs" = "yrs of\nsmoking"
  )
) |>
  tidy_dagitty()

smk_wt_dag |>
  ggdag(text = FALSE, use_labels = "label")

The function ggdag_adjustment_set() identifies and plots the set(s) of variables that satisfy the Backdoor Criterion (the adjustment set). It highlights the nodes that must be adjusted for, so you can see at a glance which variables need to be controlled for to estimate the causal effect of the exposure on the outcome.

smk_wt_dag |>
  ggdag_adjustment_set(text = FALSE, use_labels = "label") +
  theme_dag() +
  scale_color_okabe_ito(order = c(1, 5)) + 
  scale_fill_okabe_ito(order = c(1, 5))

Let’s fit a model with these variables. Note that we’ll fit all continuous variables with squared terms, as well, to allow them a bit of flexibility.

lm(
  wt82_71 ~ qsmk + sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  data = nhefs_complete_uc
) |>
  tidy(conf.int = TRUE) |>
  filter(term == "qsmk")
## # A tibble: 1 × 7
##   term  estimate std.error statistic  p.value conf.low conf.high
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 qsmk      3.46     0.438      7.90 5.36e-15     2.60      4.32

When we adjust for the variables in our DAG, we get an estimate of about 3.5 kg: people who quit smoking gained about this much more weight, on average, than those who did not. However, we are trying to answer a specific causal question: how much weight would a person gain if they quit smoking vs. if the same person did not quit smoking? Let’s use an inverse probability weighting model to try to estimate that effect at the population level (what if everyone quit smoking vs. what if no one quit smoking).

For a simple IPW model, we have two modeling steps. First, we fit a propensity score model, which predicts the probability of receiving the treatment or exposure (here, that a participant quit smoking). We use this model to calculate inverse probability weights: 1 / the probability of receiving the treatment actually received. Then, in the second step, we use these weights in the outcome model, which estimates the effect of the exposure on the outcome (here, the effect of quitting smoking on gaining weight).

For the propensity score model, we’ll use logistic regression (since quitting smoking is a binary variable). The outcome is quitting smoking, and the variables in the model are all those included in our adjustment set. Then, we’ll use augment() from {broom} (which calls predict() under the hood) to calculate our weights with propensity::wt_ate() and save them back into our data set.

propensity_model <- glm(
  qsmk ~ sex + 
    race + age + I(age^2) + education + 
    smokeintensity + I(smokeintensity^2) + 
    smokeyrs + I(smokeyrs^2) + exercise + active + 
    wt71 + I(wt71^2), 
  family = binomial(), 
  data = nhefs_complete_uc
)

nhefs_complete_uc <- propensity_model |>
  # predict each participant's probability of quitting smoking
  augment(type.predict = "response", data = nhefs_complete_uc) |>
  # calculate inverse probability of treatment weights
  mutate(
    wts = wt_ate(.fitted, qsmk),
    manual = ifelse(qsmk == 1, 1 / .fitted, 1 / (1 - .fitted))
  )

nhefs_complete_uc |>
  select(qsmk, .fitted, wts, manual)
## # A tibble: 1,566 × 4
##     qsmk .fitted        wts manual
##    <dbl>   <dbl> <psw{ate}>  <dbl>
##  1     0  0.0987   1.109530   1.11
##  2     0  0.140    1.163233   1.16
##  3     0  0.126    1.143679   1.14
##  4     0  0.400    1.666994   1.67
##  5     0  0.294    1.415590   1.42
##  6     0  0.170    1.204166   1.20
##  7     0  0.220    1.282511   1.28
##  8     0  0.345    1.527715   1.53
##  9     0  0.283    1.395508   1.40
## 10     0  0.265    1.360346   1.36
## # ℹ 1,556 more rows

What are we doing here? .fitted is the propensity score: this column is generated by the augment() function and contains the propensity score for each individual.

  • What it is: the probability of an individual receiving the treatment (qsmk = 1, quitting smoking) given their observed set of confounder variables (age, sex, baseline weight, etc.).

  • Source: it is the predicted value from propensity_model (a logistic regression model) applied to the individual’s covariate values.

  • Value: a number between 0 and 1. A value close to 1 means the person was very likely to quit smoking based on their characteristics; a value close to 0 means the person was very unlikely to quit smoking based on their characteristics. In other words, we use the glm model to predict whether each one of the subjects will quit smoking based on the values of the covariates.

From .fitted and qsmk we get wts, the IPTW weight. The wts column is created by mutate() using propensity::wt_ate(), which converts the propensity scores into Inverse Probability of Treatment Weights (IPTW) for the Average Treatment Effect (ATE): a calculated weight used to rebalance the data so that the distribution of confounders in the treated group matches the distribution in the control group.

  • Calculation: the weight assigned to an individual depends on their actual treatment status (qsmk) and their predicted probability of receiving that status (.fitted). For the treated group (\(\text{qsmk}=1\)): \[W = \frac{1}{\text{Propensity Score}} = \frac{1}{\text{.fitted}}\] For the control group (\(\text{qsmk}=0\)): \[W = \frac{1}{1 - \text{Propensity Score}} = \frac{1}{1 - \text{.fitted}}\]

  • Impact: these weights inflate the influence of individuals who are “mismatched” (e.g., someone who quit smoking despite having low health motivation) and deflate the influence of individuals who are “matched” (e.g., someone who quit and was highly motivated). The weighted dataset effectively simulates a randomized experiment.

Let’s look at the distribution of the weights.

ggplot(nhefs_complete_uc, aes(wts)) +
  geom_histogram(color = "white", fill = "#E69F00", bins = 50) + 
  #  use a log scale for the x axis
  scale_x_log10() + 
  theme_minimal(base_size = 20) + 
  xlab("Weights")

It looks a little skewed; in particular, there are some participants with much higher weights. There are a few techniques for dealing with this, such as trimming weights and stabilizing weights, but we’ll keep it simple for now and just use them as is. Extreme weights inflate the variance (the measure of uncertainty) of the causal estimate (\(\hat{\beta}_{\text{qsmk}}\)) in the final weighted regression. A few highly weighted observations end up dominating the calculation, making the result less precise and leading to wider confidence intervals.
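Even though we keep the raw weights here, the two techniques just mentioned are easy to sketch (my addition; the 99th-percentile cutoff is an arbitrary illustrative choice, and I use the plain-numeric manual column rather than the psw-classed wts column):

wts_check <- nhefs_complete_uc |>
  mutate(
    # stabilized ATE weights: P(A = a) / P(A = a | X)
    swts = ifelse(
      qsmk == 1,
      mean(qsmk == 1) / .fitted,
      mean(qsmk == 0) / (1 - .fitted)
    ),
    # truncated (trimmed) weights: cap the raw weights at the 99th percentile
    wts_trunc = pmin(manual, quantile(manual, 0.99))
  )

summary(wts_check$swts)
summary(wts_check$wts_trunc)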

The main goal here is to break the non-causal associations between quitting smoking and gaining weight: the other paths that might distort our results. In other words, if we succeed, there should be no differences in these variables between our two groups, those who quit smoking and those who didn’t. This is where randomized trials shine; you can often assume that there are no baseline differences in potential confounders between your treatment groups (of course, no study is perfect, and there’s a whole literature on dealing with this problem in randomized trials).

Standardized mean differences (SMDs) are a simple measure of differences that works across variable types. In general, the closer to 0 we are, the better job we have done eliminating the non-causal relationships we drew in our DAG. Note that low SMDs for everything we adjust for do not mean that there is not something else that might confound our study. Unmeasured confounders or misspecified DAGs can still distort our effects, even if our SMDs look great!

We’ll use the {halfmoon} package to calculate the SMDs, then visualize them.

vars <- c(
  "sex", "race", "age", "education", 
  "smokeintensity", "smokeyrs", 
  "exercise", "active", "wt71"
)

plot_df <- tidy_smd(
    nhefs_complete_uc,
    all_of(vars),
    qsmk,
    wts
)

ggplot(
    data = plot_df,
    mapping = aes(x = abs(smd), y = variable, group = method, color = method)
) +
    geom_love()

The SMD is a statistic that measures the magnitude of the difference in the mean (average) of a specific covariate (confounder) between the treated group and the control group, expressed in units of standard deviation. The SMD for a covariate \(Z\) is calculated as:

\[\text{SMD} = \frac{\bar{Z}_{\text{treated}} - \bar{Z}_{\text{control}}}{\text{Pooled Standard Deviation}}\]
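As a minimal check of the formula (my addition), the unweighted SMD for one covariate, baseline weight wt71, can be computed by hand using one common pooled-SD definition, \(\sqrt{(s^2_{\text{treated}} + s^2_{\text{control}})/2}\), and compared with the unweighted (“observed”) value reported by tidy_smd():

smd_by_hand <- nhefs_complete_uc |>
  group_by(qsmk) |>
  summarize(m = mean(wt71), v = var(wt71), .groups = "drop")

# (treated mean - control mean) / pooled SD
with(smd_by_hand, (m[2] - m[1]) / sqrt((v[2] + v[1]) / 2))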

Both groups are being upweighted so that their distributions of propensity scores are much more similar.

We could do more here to analyze our assumptions, but let’s move on to our second step: fitting the outcome model weighted by our inverse probabilities. Some researchers call these Marginal Structural Models, in part because the model is marginal; we only need to include our outcome (wt82_71) and exposure (qsmk). The other variables aren’t in the model; they are accounted for with the IPWs!

ipw_model <- lm(
  wt82_71 ~ qsmk, 
  data = nhefs_complete_uc, 
  weights = wts  # inverse probability weights
) 

head(model.matrix(ipw_model))
##   (Intercept) qsmk
## 1           1    0
## 2           1    0
## 3           1    0
## 4           1    0
## 5           1    0
## 6           1    0
ipw_estimate <- ipw_model |>
  tidy(conf.int = TRUE) |>
  filter(term == "qsmk")

ipw_estimate
## # A tibble: 1 × 7
##   term  estimate std.error statistic  p.value conf.low conf.high
##   <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 qsmk      3.44     0.408      8.43 7.47e-17     2.64      4.24

So we are doing a weighted regression: Weighted Least Squares (WLS) minimizes the Weighted Sum of Squared Residuals (WSSR). A unique weight \(w_i\) is assigned to each residual before it is squared:

\[\text{Minimize: } \text{WSSR} = \sum_{i=1}^{n} w_i(y_i - \hat{y}_i)^2\]

Effect of the weights:

  • High weight (\(w_i > 1\)): if an observation has a large weight (like the inverse probability weights given to individuals whose observed treatment status was unlikely given their covariates), the \(\text{WSSR}\) penalizes a large residual for that point much more heavily. The model must adjust the line to fit this point closely, giving it more influence on the coefficient estimates.

  • Low weight (\(w_i < 1\)): if an observation has a small weight, the \(\text{WSSR}\) barely registers its residual. The model can safely ignore this point to improve the fit for others, giving it less influence.

The most direct way to see how the weights enter the equation is through the matrix formula for the coefficient estimates, \(\hat{\beta}\). The formula for the OLS coefficient vector \(\hat{\beta}_{\text{OLS}}\) (which includes the intercept and slope(s)) is:

\[\hat{\beta}_{\text{OLS}} = (\mathbf{X}^{\text{T}}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{y}\]

The WLS estimator introduces the weight matrix \(\mathbf{W}\) into the OLS formula. \(\mathbf{W}\) is a diagonal matrix with the individual weights \(w_i\) along the main diagonal and all other entries zero:

\[\mathbf{W} = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}\]

The modified formula for the WLS coefficients \(\hat{\beta}\) is:

\[\hat{\beta}_{\text{WLS}} = (\mathbf{X}^{\text{T}}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\text{T}}\mathbf{W}\mathbf{y}\]
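To verify that lm(..., weights = wts) is solving exactly this equation, here is a small check (my addition) that rebuilds \(\hat{\beta}_{\text{WLS}}\) from the design matrix of the IPW model; it assumes no rows were dropped for missingness and uses the manual column, which holds the same values as wts as a plain numeric vector:

w <- nhefs_complete_uc$manual   # ATE weights as plain numbers
X <- model.matrix(ipw_model)    # design matrix: intercept and qsmk
y <- nhefs_complete_uc$wt82_71

# (X' W X)^{-1} X' W y, using element-wise multiplication instead of diag(w)
beta_wls <- solve(t(X) %*% (w * X)) %*% (t(X) %*% (w * y))

beta_wls
coef(ipw_model)   # should match beta_wls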

This estimate is pretty similar to what we saw before, if a little smaller. In fact, for simple causal questions, this is often the case: adjusting for confounders directly in your regression model sometimes estimates the same effect as IPW and other causal techniques. Causal techniques are special, though, in that they use counterfactual modeling, which allows you to deal with many circumstances, such as selection bias or time-dependent confounding. They also often have different variance properties.

But we have another problem that we need to address. We’re just using lm() to estimate our IPW model, and it doesn’t properly account for the weights. That means our standard error is too small, which will artificially narrow confidence intervals and artificially shrink p-values. There are many ways to address this, including robust (sandwich) variance estimators and bootstrapping.
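One common option (a sketch, not necessarily the approach used in the original presentation) is a sandwich-type robust variance estimator via the {sandwich} and {lmtest} packages; bootstrapping the whole two-step procedure (propensity model plus outcome model), e.g. with {rsample}, is another. Note that the sandwich correction below does not account for the uncertainty in estimating the weights themselves.

library(sandwich)
library(lmtest)

# robust (HC1) standard errors and confidence intervals for the weighted model
coeftest(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))
coefci(ipw_model, vcov. = vcovHC(ipw_model, type = "HC1"))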

We give low weight to cases in which the covariates imply a high propensity toward the treatment status actually observed (quitting or not quitting), because the same underlying factors (e.g., motivation) could by themselves explain the change in weight. We want to block the non-causal (backdoor) path:

\[\text{Covariates (e.g., motivation)} \rightarrow \text{Quit Smoking (X)}\] \[\text{Covariates (e.g., motivation)} \rightarrow \text{Exercise/Diet} \rightarrow \text{Weight Change (Y)}\]

G-Computation:

From here

We use the Luque-Fernandez et al. example from Epidemiology. The outcome of interest, \(Y\), is (systolic) blood pressure. The “treatment” of interest is sodium intake. Sodium intake is a continuous variable; in order to easily apply

\[\begin{align}\text{ATE} =\mathbb E[Y(1) − Y(0)] &= \mathbb E_X[\mathbb E[Y | T = 1, X] − \mathbb E[Y | T = 0, X]]\\[2px]&=\frac 1 n \sum_i \left[\mathbb E[Y | T = 1, X=x_i] − \mathbb E[Y | T = 0, X=x_i] \right] \end{align}\]

which is specified for a binary treatment, we will binarize \(T\) by letting \(T = 1\) denote daily sodium intake above 3.5 grams and \(T = 0\) denote daily sodium intake below 3.5 grams. The expression \(\mathbb E_X\) denotes the marginalization step: it means you must average the conditional effect across the entire distribution of the confounders \(X\). The overall causal effect (ATE) is therefore found by calculating the difference in outcomes between the treated and control groups within every subgroup defined by the confounders \(X\), and then averaging those differences over the distribution of \(X\).

In our data, we also have the age of the individuals and amount of protein in their urine as covariates \(X\). Because we are using data from a simulation, we know that the true ATE of sodium on blood pressure is 1.05.

Minimizing the mean-squared error (MSE) of predicting \(Y\) from (\(T\), \(X\)) pairs is equivalent to modeling this conditional expectation. Therefore, we can plug in any machine learning model for \(\mathbb E[Y | t, x],\) which gives us a model-assisted estimator. We’ll use linear regression here, which works out nicely since blood pressure is generated as a linear combination of other variables, in this simulation.

library(dplyr)
library(stats) 

# --- Data Generation Function ---
generate_data <- function(n=1000, seed=0, beta1=1.05, alpha1=0.4, alpha2=0.3, binary_treatment=TRUE, binary_cutoff=3.5) {
  set.seed(seed)
  age <- rnorm(n, mean=65, sd=5)
  sodium_raw <- age / 18 + rnorm(n)
  
  sodium <- sodium_raw
  if (binary_treatment) {
    if (is.null(binary_cutoff)) {
      binary_cutoff <- mean(sodium_raw)
    }
    # Convert to binary (0 or 1)
    sodium <- as.integer(sodium_raw > binary_cutoff)
  }
  
  # Blood pressure is the outcome, determined by sodium (treatment) and age (confounder)
  blood_pressure <- beta1 * sodium + 2 * age + rnorm(n)
  # Proteinuria is a collider, determined by both sodium and blood pressure
  proteinuria <- alpha1 * sodium + alpha2 * blood_pressure + rnorm(n)
  
  df <- data.frame(
    blood_pressure = blood_pressure, 
    sodium = sodium,
    age = age, 
    proteinuria = proteinuria
  )
  return(df)
}


# --- Causal Effect Estimation Function (Modified to Return Predictions) ---
estimate_causal_effect <- function(df, outcome_var, treatment_vars, 
                                   covariate_vars, regression_coef=FALSE, 
                                   return_predictions=FALSE) {
  
  # 1. Define the model formula
  all_vars <- c(treatment_vars, covariate_vars)
  formula_str <- paste(outcome_var, "~", paste(all_vars, collapse=" + "))
  
  # 2. Fit the Linear Regression Model
  model <- lm(as.formula(formula_str), data=df)
  
  if (regression_coef) {
    # Return the regression coefficient for the treatment variable
    return(coef(model)[treatment_vars[1]])
  } else {
    # Estimate ATE using the Adjustment Formula (G-Computation)
    
    # 3. Create Counterfactual Data (T=1)
    Xt1 <- df
    Xt1[[treatment_vars[1]]] <- 1 # Set treatment to 1 for all
    
    # 4. Create Counterfactual Data (T=0)
    Xt0 <- df
    Xt0[[treatment_vars[1]]] <- 0 # Set treatment to 0 for all
    
    # 5. Predict outcomes for counterfactuals
    y_hat1 <- predict(model, newdata = Xt1)
    y_hat0 <- predict(model, newdata = Xt0)
    
    # 6. Calculate ATE (Mean Difference)
    ate <- mean(y_hat1 - y_hat0)
    
    if (return_predictions) {
      # Return a list containing ATE and predictions
      return(list(ate = ate, y_hat1 = y_hat1, y_hat0 = y_hat0))
    } else {
      # Return only the ATE
      return(ate)
    }
  }
}


# --- Main Execution Block ---

# Generate large datasets (N reduced for practical execution speed)
n_obs <- 1000000 
binary_t_df <- generate_data(beta1=1.05, alpha1=.4, alpha2=.3, binary_treatment=TRUE, n=n_obs)
continuous_t_df <- generate_data(beta1=1.05, alpha1=.4, alpha2=.3, binary_treatment=FALSE, n=n_obs)

data_list <- list("Binary Treatment Data" = binary_t_df, 
                  "Continuous Treatment Data" = continuous_t_df)

for (name in names(data_list)) {
  df <- data_list[[name]]
  
  cat("\n###", name, "###\n\n")
  
  # --- Adjustment Formula Estimates (regression_coef=FALSE) ---
  
  # Adjust All: BP ~ sodium + age + proteinuria
  # This call saves the full prediction vectors in the 'results_all' list.
  results_all <- estimate_causal_effect(df, 
                                        outcome_var='blood_pressure', 
                                        treatment_vars='sodium', 
                                        covariate_vars=c('age', 'proteinuria'),
                                        regression_coef=FALSE,
                                        return_predictions=TRUE) 
  
  ate_est_adjust_all <- results_all$ate # Extract ATE for printing
  
  # Naive and Adjust Age estimates (using the simpler return=ATE)
  ate_est_naive <- estimate_causal_effect(df, 
                                          outcome_var='blood_pressure', 
                                          treatment_vars='sodium', 
                                          covariate_vars=NULL,
                                          regression_coef=FALSE)
  
  ate_est_adjust_age <- estimate_causal_effect(df, 
                                               outcome_var='blood_pressure', 
                                               treatment_vars='sodium', 
                                               covariate_vars='age',
                                               regression_coef=FALSE)
  
  
  cat("# Adjustment Formula Estimates #\n")
  cat("Naive ATE estimate:\t\t\t\t\t\t\t", format(ate_est_naive, digits=4), "\n")
  cat("ATE estimate adjusting for all covariates:\t", format(ate_est_adjust_all, digits=4), "\n")
  cat("ATE estimate adjusting for age:\t\t\t\t", format(ate_est_adjust_age, digits=4), "\n")
  cat("\n")
  
  # --- Example of saved predictions (The full vectors are available via results_all$y_hat1/0) ---
  cat("--- Saved Counterfactual Predictions ---\n")
  cat("Full y_hat1 vector length:", length(results_all$y_hat1), "\n")
  cat("First 5 T=1 Predictions (y_hat1):\n")
  print(head(results_all$y_hat1, 5))
  cat("First 5 T=0 Predictions (y_hat0):\n")
  print(head(results_all$y_hat0, 5))
  cat("\n")
  
  # --- Linear Regression Coefficient Estimates (regression_coef=TRUE) ---
  
  ate_est_naive_coef <- estimate_causal_effect(df, 
                                               outcome_var='blood_pressure', 
                                               treatment_vars='sodium', 
                                               covariate_vars=NULL,
                                               regression_coef=TRUE)
  
  ate_est_adjust_all_coef <- estimate_causal_effect(df, 
                                                    outcome_var='blood_pressure', 
                                                    treatment_vars='sodium', 
                                                    covariate_vars=c('age', 'proteinuria'),
                                                    regression_coef=TRUE)
  
  ate_est_adjust_age_coef <- estimate_causal_effect(df, 
                                                    outcome_var='blood_pressure', 
                                                    treatment_vars='sodium', 
                                                    covariate_vars='age',
                                                    regression_coef=TRUE)
  
  cat("# Regression Coefficient Estimates #\n")
  cat("Naive ATE estimate:\t\t\t\t\t\t\t", format(ate_est_naive_coef, digits=4), "\n")
  cat("ATE estimate adjusting for all covariates:\t", format(ate_est_adjust_all_coef, digits=4), "\n")
  cat("ATE estimate adjusting for age:\t\t\t\t", format(ate_est_adjust_age_coef, digits=4), "\n")
  cat("\n")
}
## 
## ### Binary Treatment Data ###
## 
## # Adjustment Formula Estimates #
## Naive ATE estimate:                           5.331 
## ATE estimate adjusting for all covariates:    0.8537 
## ATE estimate adjusting for age:               1.049 
## 
## --- Saved Counterfactual Predictions ---
## Full y_hat1 vector length: 1000000 
## First 5 T=1 Predictions (y_hat1):
##        1        2        3        4        5 
## 144.0520 127.2587 144.6509 144.4139 134.8669 
## First 5 T=0 Predictions (y_hat0):
##        1        2        3        4        5 
## 143.1983 126.4050 143.7972 143.5602 134.0133 
## 
## # Regression Coefficient Estimates #
## Naive ATE estimate:                           5.331 
## ATE estimate adjusting for all covariates:    0.8537 
## ATE estimate adjusting for age:               1.049 
## 
## 
## ### Continuous Treatment Data ###
## 
## # Adjustment Formula Estimates #
## Naive ATE estimate:                           3.626 
## ATE estimate adjusting for all covariates:    0.8537 
## ATE estimate adjusting for age:               1.049 
## 
## --- Saved Counterfactual Predictions ---
## Full y_hat1 vector length: 1000000 
## First 5 T=1 Predictions (y_hat1):
##        1        2        3        4        5 
## 144.8395 127.6933 145.6432 145.0535 135.4998 
## First 5 T=0 Predictions (y_hat0):
##        1        2        3        4        5 
## 143.9859 126.8397 144.7896 144.1999 134.6462 
## 
## # Regression Coefficient Estimates #
## Naive ATE estimate:                           3.626 
## ATE estimate adjusting for all covariates:    0.8537 
## ATE estimate adjusting for age:               1.049

All of the above is done using the adjustment formula with model-assisted estimation, where we first fit a model for the conditional expectation, and then we take an empirical mean over \(X\) using that model. However, because we are using a linear model with no treatment-covariate interactions, this is equivalent to just taking the coefficient in front of \(T\) in the linear regression as the ATE estimate.

Matching:

From this YouTube presentation:

Propensity score-based methods include matching, stratification and inverse probability of treatment weighting (IPTW).

Matching creates treated / untreated pairs with similar propensity scores.

# Load necessary libraries
library(MatchIt)
library(tidyverse)
# library(causaldata) # Assuming this is where nhefs_complete is loaded
# --- Setup Data (Conceptual) ---
# Assuming nhefs_complete_uc is already loaded and filtered as before
# The variables are:
# Outcome: wt82_71 (Weight Change)
# Treatment: qsmk (Quit Smoking: 1=Yes, 0=No)
# Confounders (X): age, sex, race, education, smokeintensity, smokeyrs, exercise, active, wt71
# Define the matching formula (Treatment ~ Confounders)

matching_formula <- qsmk ~ age + I(age^2) + sex + race + education + smokeintensity + I(smokeintensity^2) + smokeyrs +  I(smokeyrs^2) + exercise + active + wt71 + I(wt71^2)

# Run the matching algorithm
m.out <- matchit(
  matching_formula, 
  data = nhefs_complete_uc, 
  method = "nearest",         
  distance = "glm"
  # estimand is now implicitly "ATT", which is fine for nearest neighbor matching
)

# Optional: Examine the matching results (summary of balance)
# summary(m.out, unlist = FALSE)
# Summarize balance before and after matching
# summary(m.out)

# Note: The output will show SMDs. You'd want the 'Matched Sample' SMDs 
# to be close to zero (e.g., < 0.1) for all covariates.
# Extract the matched data (only includes matched units)
m.data <- match.data(m.out)

# Run a simple linear model on the matched data
# Note: No confounders are needed in the final model if balance is achieved,
# but sometimes they are included for robustness. Here we keep it simple.
lm.matched <- lm(wt82_71 ~ qsmk, data = m.data)

# Print the result (the coefficient for qsmk estimates the ATT, the default estimand for nearest-neighbor matching)
summary(lm.matched)
## 
## Call:
## lm(formula = wt82_71 ~ qsmk, data = m.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.549  -4.072   0.087   4.537  42.986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0468     0.3975   2.634  0.00861 ** 
## qsmk          3.4783     0.5621   6.187 9.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.98 on 804 degrees of freedom
## Multiple R-squared:  0.04545,    Adjusted R-squared:  0.04427 
## F-statistic: 38.28 on 1 and 804 DF,  p-value: 9.732e-10
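To check that matching actually balanced the covariates (the commented-out summary(m.out) above reports this numerically), here is a quick graphical sketch (my addition) that reuses the tidy_smd()/geom_love() workflow from the IPTW section on the matched sample, assuming {halfmoon} is still loaded and vars is the covariate vector defined earlier; SMDs close to zero indicate good balance:

plot_df_matched <- tidy_smd(
  m.data,
  all_of(vars),
  qsmk
)

ggplot(
  plot_df_matched,
  aes(x = abs(smd), y = variable, group = method, color = method)
) +
  geom_love()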


NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.