NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Interrupted time series regression for the evaluation of public health interventions:

from here for the theory, and here for the code.

In an ITS study, a time series of a particular outcome of interest is used to establish an underlying trend, which is ‘interrupted’ by an intervention at a known point in time.

The hypothetical scenario under which the intervention had not taken place and the trend continues unchanged (that is: the ‘expected’ trend, in the absence of the intervention, given the pre-existing trend) is referred to as the ‘counterfactual’.

The interrupted time series (ITS) study design is increasingly being used for the evaluation of public health interventions; it is particularly suited to interventions introduced at a population level over a clearly defined time period and that target population-level health outcomes.

ITS has been used for the evaluation of a wide range of public health interventions including new vaccines, cycle helmet legislation, changes to paracetamol packaging, traffic speed zones and precautions against nosocomial infections, as well as in the evaluation of health impacts of unplanned events such as the global financial crisis.

In standard ITS analyses, the following segmented regression model is used:

\[Y_t = β_0 + β_1 T + β_2 X_t + β_3 T X_t\]

where \(β_0\) represents the baseline level at \(T = 0\), \(β_1\) is the change in outcome per unit increase in time (the underlying pre-intervention trend), \(β_2\) is the level change following the intervention and \(β_3\) is the slope change following the intervention (captured by the interaction between time and intervention, \(TX_t\)).

The model above corresponds to impact model \((c)\):


models \((a)\) and \((b)\) can easily be specified by excluding the terms \(β_3TX_t\) or \(β_2X_t\), respectively. The other models require slightly more complex variable specifications.
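As a minimal sketch (simulated data, hypothetical coefficient values), model \((c)\) and its reduced forms \((a)\) and \((b)\) can be fitted with `lm`:

```r
# Sketch: segmented regression on simulated monthly data (hypothetical values).
set.seed(1)
T <- 1:60                       # elapsed time
X <- as.numeric(T > 36)         # intervention indicator: 0 before, 1 after
Y <- 100 + 0.5*T + 8*X + 0.3*T*X + rnorm(60, sd = 3)

fit_c <- lm(Y ~ T + X + T:X)    # level and slope change (model c)
fit_a <- lm(Y ~ T + X)          # level change only (model a: drops beta_3*T*X)
fit_b <- lm(Y ~ T + T:X)        # slope change only (model b: drops beta_2*X)
coef(fit_c)                     # estimates of beta_0 .. beta_3
```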


From this site:

In the dataset sicily, aces is the count of acute coronary episodes (ACEs) in Sicily per month (the outcome); smokban is the smoking ban (the intervention), coded 0 before the intervention and 1 after; time is the elapsed time (in months) since the start of the study; pop is the population of Sicily (in 10000s); and stdpop is the age-standardised population.

data <- read.csv('https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/sicily.csv')
names(data)[names(data) == 'aces'] <- 'infarcts'
head(data)
##   year month infarcts time smokban      pop   stdpop
## 1 2002     1      728    1       0 364277.4 379875.3
## 2 2002     2      659    2       0 364277.4 376495.5
## 3 2002     3      791    3       0 364277.4 377040.8
## 4 2002     4      734    4       0 364277.4 377116.4
## 5 2002     5      757    5       0 364277.4 377383.4
## 6 2002     6      726    6       0 364277.4 374113.1

Descriptive analysis:

Examining the data is an important first step. Looking at the pre-intervention trend can give an indication of how stable the trend is over time, whether a linear model is likely to be appropriate, and whether there appears to be a seasonal trend.

library(tidyverse) # for the pipe (%>%), gather(), etc.

data$infarct.rate <- with(data, infarcts/stdpop*10^5)

plot(data$infarct.rate, ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", 
      pch=19, bty="l", xaxt="n")

axis(1, at=0:4*12 + 6, labels=2002:2006)
title("Sicily, 2002-2006")

# Summary statistics:
data %>%
  gather(Variable, Value) %>%
  group_by(Variable) %>%
  summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
| Variable     | n  | Mean         | SD           | Median      | IQR        | Min         | Max         |
|--------------|---:|-------------:|-------------:|------------:|-----------:|------------:|------------:|
| infarct.rate | 59 | 2.160942e+02 |   17.3402530 |    215.4118 |   26.38489 |    175.0353 |    260.7162 |
| infarcts     | 59 | 8.290508e+02 |   72.4139022 |    831.0000 |  111.50000 |    659.0000 |   1021.0000 |
| month        | 59 | 6.406780e+00 |    3.4347044 |      6.0000 |    5.50000 |      1.0000 |     12.0000 |
| pop          | 59 | 3.641212e+05 |  481.2691260 | 364277.4000 |  588.20000 | 363350.8000 | 364700.4000 |
| smokban      | 59 | 3.898305e-01 |    0.4918981 |      0.0000 |    1.00000 |      0.0000 |      1.0000 |
| stdpop       | 59 | 3.834644e+05 | 5245.7738584 | 383428.4000 | 7898.05000 | 374113.1000 | 394005.6000 |
| time         | 59 | 3.000000e+01 |   17.1755640 |     30.0000 |   29.00000 |      1.0000 |     59.0000 |
| year         | 59 | 2.003966e+03 |    1.4138002 |   2004.0000 |    2.00000 |   2002.0000 |   2006.0000 |
# Statistics before and after smoking ban:
data %>%
  mutate(Period = case_when(smokban == 0 ~ "1. Before Ban",
                            smokban == 1 ~ "2. After Ban")) %>%
  select(Period, infarcts, infarct.rate) %>%
  gather(Variable, Value, -Period) %>%
  group_by(Period, Variable) %>%
    summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
| Period        | Variable     | n  | Mean     | SD       | Median   | IQR      | Min      | Max       |
|---------------|--------------|---:|---------:|---------:|---------:|---------:|---------:|----------:|
| 1. Before Ban | infarct.rate | 36 | 213.2719 | 16.81020 | 213.1061 | 24.62515 | 175.0353 |  244.3259 |
| 1. Before Ban | infarcts     | 36 | 810.8611 | 67.76521 | 811.0000 | 99.25000 | 659.0000 |  937.0000 |
| 2. After Ban  | infarct.rate | 23 | 220.5118 | 17.59869 | 217.8656 | 21.43912 | 190.9362 |  260.7162 |
| 2. After Ban  | infarcts     | 23 | 857.5217 | 71.62394 | 843.0000 | 83.00000 | 738.0000 | 1021.0000 |

Are the data stationary? That is, is there no seasonality, no trend, and does the variance remain constant over time?

From here, here and here.

# Seasonality, trend and variance:
library(forecast)

components <- decompose(ts(data$infarct.rate, frequency=12))
plot(components)

# Is there seasonality in the dataset?

x <- ts(data$infarct.rate, frequency=12) # 12 measurements per year
fit <- tbats(x)
!is.null(fit$seasonal)
## [1] TRUE
# Is the variance constant?

first.half <- data$infarct.rate[1: floor(0.5 * length(data$infarct.rate))]
second.half <- data$infarct.rate[ceiling(0.5 * length(data$infarct.rate)):length(data$infarct.rate)]

# The variance of the first half is about 1.3 times that of the second:

(ratio <- var(first.half) / var(second.half) )
## [1] 1.300939
# Is it statistically significant?

var.test(first.half,second.half)
## 
##  F test to compare two variances
## 
## data:  first.half and second.half
## F = 1.3009, num df = 28, denom df = 29, p-value = 0.4853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6164912 2.7589393
## sample estimates:
## ratio of variances 
##           1.300939
# Is there a trend?

summary(lm(infarct.rate ~ time, data))
## 
## Call:
## lm(formula = infarct.rate ~ time, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.866  -9.860  -3.236   9.517  35.518 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.7192     4.0603  49.681  < 2e-16 ***
## time          0.4792     0.1177   4.071 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.4 on 57 degrees of freedom
## Multiple R-squared:  0.2253, Adjusted R-squared:  0.2117 
## F-statistic: 16.57 on 1 and 57 DF,  p-value: 0.0001459

Poisson Regression Model:

We model the count data directly (rather than the rate, which doesn't follow a Poisson distribution), using the log-transformed age-standardised population as an offset variable in order to transform back to rates.
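The offset works because, with a log link, subtracting the log-population from both sides turns a model for counts into a model for rates:

\[\log E[Y_t] = \log(\mathrm{stdpop}_t) + β_0 + β_1 t + β_2 X_t
\quad\Longleftrightarrow\quad
\log\frac{E[Y_t]}{\mathrm{stdpop}_t} = β_0 + β_1 t + β_2 X_t\]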

library(Epi)

model1 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time, family=poisson, data)
summary(model1)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time, 
##     family = poisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4336  -1.4432  -0.4431   1.2735   5.0701  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -6.2431509  0.0111894 -557.955  < 2e-16 ***
## smokban     -0.1116284  0.0172248   -6.481 9.13e-11 ***
## time         0.0049450  0.0004992    9.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 197.31  on 56  degrees of freedom
## AIC: 708.03
## 
## Number of Fisher Scoring iterations: 3
round(ci.lin(model1,Exp=T),3)
##             Estimate StdErr        z P exp(Est.)  2.5% 97.5%
## (Intercept)   -6.243  0.011 -557.955 0     0.002 0.002 0.002
## smokban       -0.112  0.017   -6.481 0     0.894 0.865 0.925
## time           0.005  0.000    9.905 0     1.005 1.004 1.006

The predictions with this model are as follows:

# Predictions:

datanew <- data.frame(stdpop = mean(data$stdpop),
                      smokban = rep(c(0, 1), c(360, 240)),
                      time = 1:600/10,
                      month = rep(1:120/10, 5))

head(datanew)
##     stdpop smokban time month
## 1 383464.4       0  0.1   0.1
## 2 383464.4       0  0.2   0.2
## 3 383464.4       0  0.3   0.3
## 4 383464.4       0  0.4   0.4
## 5 383464.4       0  0.5   0.5
## 6 383464.4       0  0.6   0.6
pred1 <- predict(model1, type="response", datanew) / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
title("Sicily, 2002-2006")

# To plot the counterfactual scenario we create a data frame as if smokban (the intervention) had never been implemented.

datanew <- data.frame(stdpop=mean(data$stdpop),smokban=0,time=1:600/10,
  month=rep(1:120/10,5))

pred1b <- predict(model1, datanew, type="response") / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
lines(datanew$time, pred1b, col=2, lty=2)
title("Sicily, 2002-2006")

In the model above we have not allowed for overdispersion. To do this we can use a quasi-Poisson model, which allows the variance to be proportional to, rather than equal to, the mean.
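One way to gauge overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom (for the fit above it is about 3.56, as the quasi-Poisson summary below reports). A self-contained sketch with simulated overdispersed counts standing in for the infarct data:

```r
# Sketch: Pearson dispersion statistic for a Poisson fit.
# Negative-binomial counts simulate overdispersion (variance > mean).
set.seed(2)
t   <- 1:60
mu  <- exp(5 + 0.005 * t)
y   <- rnbinom(60, mu = mu, size = 20)
fit <- glm(y ~ t, family = poisson)

# ~1 for a well-specified Poisson model, >1 under overdispersion
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi
```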

model2 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time, 
              family=quasipoisson, data)
summary(model2)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time, 
##     family = quasipoisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4336  -1.4432  -0.4431   1.2735   5.0701  
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -6.2431509  0.0211096 -295.749  < 2e-16 ***
## smokban     -0.1116284  0.0324959   -3.435  0.00112 ** 
## time         0.0049450  0.0009418    5.250 2.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.559195)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 197.31  on 56  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3

A second assumption of standard regression models is that observations are independent. This assumption is often violated in time series data because consecutive observations tend to be more similar to one another than those further apart, a phenomenon known as autocorrelation. Fortunately, in many epidemiological data sets autocorrelation is largely explained by other variables, in particular seasonality (discussed above); after controlling for these factors, residual autocorrelation is rarely a problem. Nevertheless, autocorrelation should always be assessed by examining a plot of the residuals and the partial autocorrelation function and, where data are normally distributed, by conducting tests such as the Breusch-Godfrey test. Where residual autocorrelation remains, it should be adjusted for using methods such as Prais regression or autoregressive integrated moving average (ARIMA) models, described in more detail elsewhere. There is very little evidence of autocorrelation in the worked example, and even less after adjustment for seasonality.
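The Breusch-Godfrey test mentioned above is available as `bgtest()` in the lmtest package; for a quick check on any residual series, base R's Ljung-Box test is a simple alternative. A sketch with a stand-in residual series (in these notes it would be `residuals(model2, type = "deviance")`):

```r
# Sketch: Ljung-Box test for residual autocorrelation up to lag 12.
# White noise stands in for the model residuals here.
set.seed(3)
res <- rnorm(59)
Box.test(res, lag = 12, type = "Ljung-Box")   # H0: no autocorrelation
```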


Checking for correlation, autocorrelation and seasonality in the residuals:

# Checking autocorrelation and partial autocorrelation in the model (using residuals):
library(ggplot2)
library(cowplot) 
library(fpp2)  

res2 <- residuals(model2, type="deviance")

ggtsdisplay(res2)

pattern <- decompose(ts(res2, frequency=12))
plot(pattern)

!is.null(tbats(ts(res2, frequency=12))$seasonal)
## [1] TRUE

Clearly there is residual seasonality. There are various ways of adjusting for seasonality; here we use harmonic terms, specifying the number of sine and cosine pairs to include (in this case three) and the length of the period (12 months):
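Conceptually, the harmonic terms are just sine/cosine pairs at multiples of the base frequency. A base-R sketch of what `harmonic(month, 3, 12)` supplies (not tsModel's actual implementation):

```r
# Sketch: three sine/cosine pairs with a 12-month period, built in base R.
month <- rep(1:12, length.out = 59)
H <- do.call(cbind, lapply(1:3, function(k)
  cbind(sin(2*pi*k*month/12), cos(2*pi*k*month/12))))
dim(H)   # 59 rows, 6 columns (one sine and one cosine per harmonic)
```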

library(tsModel)

model3 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time + harmonic(month,3,12), 
              family=quasipoisson, data)
summary(model3)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time + 
##     harmonic(month, 3, 12), family = quasipoisson, data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.189  -1.055  -0.107   1.037   3.468  
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -6.2521824  0.0176090 -355.056  < 2e-16 ***
## smokban                 -0.1210185  0.0271207   -4.462 4.62e-05 ***
## time                     0.0053751  0.0007985    6.731 1.58e-08 ***
## harmonic(month, 3, 12)1  0.0368454  0.0100163    3.679 0.000574 ***
## harmonic(month, 3, 12)2 -0.0182348  0.0096298   -1.894 0.064075 .  
## harmonic(month, 3, 12)3  0.0062659  0.0095570    0.656 0.515061    
## harmonic(month, 3, 12)4  0.0386223  0.0096767    3.991 0.000215 ***
## harmonic(month, 3, 12)5  0.0149246  0.0097112    1.537 0.130636    
## harmonic(month, 3, 12)6  0.0127700  0.0097201    1.314 0.194921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.256077)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 112.39  on 50  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3
round(ci.lin(model3,Exp=T),3)
##                         Estimate StdErr        z     P exp(Est.)  2.5% 97.5%
## (Intercept)               -6.252  0.018 -355.056 0.000     0.002 0.002 0.002
## smokban                   -0.121  0.027   -4.462 0.000     0.886 0.840 0.934
## time                       0.005  0.001    6.731 0.000     1.005 1.004 1.007
## harmonic(month, 3, 12)1    0.037  0.010    3.679 0.000     1.038 1.017 1.058
## harmonic(month, 3, 12)2   -0.018  0.010   -1.894 0.058     0.982 0.964 1.001
## harmonic(month, 3, 12)3    0.006  0.010    0.656 0.512     1.006 0.988 1.025
## harmonic(month, 3, 12)4    0.039  0.010    3.991 0.000     1.039 1.020 1.059
## harmonic(month, 3, 12)5    0.015  0.010    1.537 0.124     1.015 0.996 1.035
## harmonic(month, 3, 12)6    0.013  0.010    1.314 0.189     1.013 0.994 1.032
ci.lin(model3,Exp=T)["smokban",5:7]
## exp(Est.)      2.5%     97.5% 
## 0.8860176 0.8401506 0.9343886

After adjusting for seasonality:

# After adjusting for seasonality:

# Checking residuals:

res3 <- residuals(model3,type="deviance")

ggtsdisplay(res3)

pattern <- decompose(ts(res3, frequency=12))
plot(pattern)

!is.null(tbats(ts(res3, frequency=12))$seasonal)
## [1] FALSE

After adjustment for seasonality through Fourier (harmonic) terms, the results suggest that the association is largely unaffected (RR 0.886; 95% CI 0.840-0.934; P < 0.001).

Taking seasonality into account, the ACE rate was multiplied by about 1.07 each year, a long-term temporal trend:

exp(coef(model3)["time"]*12)
##     time 
## 1.066627
# Predict and plot of the seasonally adjusted model


datanew <- data.frame(stdpop = mean(data$stdpop),
                      smokban = rep(c(0, 1), c(360, 240)),
                      time = 1:600/10,
                      month = rep(1:120/10, 5))

pred3 <- predict(model3, type="response", datanew) / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(120,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 120, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate, cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines(1:600/10, pred3, col=2)
title("Sicily, 2002-2006")



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.