See here for the theory, and here for the code.
In an ITS study, a time series of a particular outcome of interest is used to establish an underlying trend, which is ‘interrupted’ by an intervention at a known point in time.
The hypothetical scenario under which the intervention had not taken place and the trend continues unchanged (that is: the ‘expected’ trend, in the absence of the intervention, given the pre-existing trend) is referred to as the ‘counterfactual’.
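A minimal sketch of the counterfactual idea on simulated data (all names and numbers here are illustrative, not from the Sicily example): fit the trend on the pre-intervention period only, then extrapolate it over the whole series.

```r
# Simulated monthly series with a known level drop of 8 at month 37
set.seed(1)
d <- data.frame(time = 1:60)
d$y <- 100 + 0.5 * d$time - 8 * (d$time >= 37) + rnorm(60, sd = 2)

# Fit the trend on the pre-intervention period only...
fit_pre <- lm(y ~ time, data = subset(d, time < 37))

# ...and extrapolate it over the full series: the counterfactual
d$cf <- predict(fit_pre, newdata = d)

# Observed minus counterfactual in the post-period estimates the impact
mean(d$y[d$time >= 37] - d$cf[d$time >= 37])   # close to the simulated -8
```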
The interrupted time series (ITS) study design is increasingly being used for the evaluation of public health interventions; it is particularly suited to interventions introduced at a population level over a clearly defined time period and that target population-level health outcomes.
ITS has been used for the evaluation of a wide range of public health interventions including new vaccines, cycle helmet legislation, changes to paracetamol packaging, traffic speed zones and precautions against nosocomial infections, as well as in the evaluation of health impacts of unplanned events such as the global financial crisis.
In standard ITS analyses, the following segmented regression model is used:
\[Y_t=β_0 +β_1T+β_2X_t + β_3TX_t\]
where \(β_0\) represents the baseline level at \(T = 0,\) \(β_1\) is interpreted as the change in outcome associated with a time unit increase (representing the underlying pre-intervention trend), \(β_2\) is the level change following the intervention and \(β_3\) indicates the slope change following the intervention (using the interaction between time and intervention: \(TX_t\)).
The model above corresponds to impact model \((c):\)
models \((a)\) and \((b)\) can easily be specified by excluding the terms \(β_3TX_t\) or \(β_2 X_t\), respectively. The other models require slightly more complex variable specifications.
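As a sketch, the segmented regression above maps directly onto an R formula (simulated data, all numbers illustrative): \(T\) is elapsed time, \(X_t\) the post-intervention indicator, and the \(TX_t\) interaction carries the slope change.

```r
# Simulated series following model (c): level change -5, slope change +0.3 at T = 37
set.seed(42)
T <- 1:60                   # elapsed time (note: masks the TRUE shorthand T)
X <- as.integer(T >= 37)    # intervention indicator X_t
Y <- 50 + 0.4 * T - 5 * X + 0.3 * T * X + rnorm(60, sd = 1.5)

# beta_0 + beta_1*T + beta_2*X + beta_3*T*X
fit <- lm(Y ~ T + X + T:X)
coef(fit)

# Excluding T:X gives impact model (a); excluding X gives model (b)
```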
From this site:
In the dataset sicily:

- aces = count of acute coronary episodes in Sicily per month (the outcome)
- smokban = smoking ban (the intervention), coded 0 before the intervention
- time = elapsed time since the start of the study
- pop = the population of Sicily (in 10000s)
- stdpop = age-standardised population
data <- read.csv('https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/sicily.csv')
names(data)[names(data) == 'aces'] <- 'infarcts'
head(data)
## year month infarcts time smokban pop stdpop
## 1 2002 1 728 1 0 364277.4 379875.3
## 2 2002 2 659 2 0 364277.4 376495.5
## 3 2002 3 791 3 0 364277.4 377040.8
## 4 2002 4 734 4 0 364277.4 377116.4
## 5 2002 5 757 5 0 364277.4 377383.4
## 6 2002 6 726 6 0 364277.4 374113.1
Examining the data is an important first step. Looking at the pre-intervention trend can give an indication of how stable the trend is over time, whether a linear model is likely to be appropriate, and whether there appears to be a seasonal trend.
library(tidyverse) # for the pipe (%>%), gather(), group_by(), etc.
data$infarct.rate <- with(data, infarcts/stdpop*10^5)
plot(data$infarct.rate, ylim=c(0, 300), xlab="Year", ylab="Std rate x 100,000",
     pch=19, bty="l", xaxt="n")
axis(1, at=0:4*12 + 6, labels=2002:2006)
title("Sicily, 2002-2006")
# Summary statistics:
data %>%
  gather(Variable, Value) %>%
  group_by(Variable) %>%
  summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
Variable | n | Mean | SD | Median | IQR | Min | Max |
---|---|---|---|---|---|---|---|
infarct.rate | 59 | 2.160942e+02 | 17.3402530 | 215.4118 | 26.38489 | 175.0353 | 260.7162 |
infarcts | 59 | 8.290508e+02 | 72.4139022 | 831.0000 | 111.50000 | 659.0000 | 1021.0000 |
month | 59 | 6.406780e+00 | 3.4347044 | 6.0000 | 5.50000 | 1.0000 | 12.0000 |
pop | 59 | 3.641212e+05 | 481.2691260 | 364277.4000 | 588.20000 | 363350.8000 | 364700.4000 |
smokban | 59 | 3.898305e-01 | 0.4918981 | 0.0000 | 1.00000 | 0.0000 | 1.0000 |
stdpop | 59 | 3.834644e+05 | 5245.7738584 | 383428.4000 | 7898.05000 | 374113.1000 | 394005.6000 |
time | 59 | 3.000000e+01 | 17.1755640 | 30.0000 | 29.00000 | 1.0000 | 59.0000 |
year | 59 | 2.003966e+03 | 1.4138002 | 2004.0000 | 2.00000 | 2002.0000 | 2006.0000 |
# Statistics before and after smoking ban:
data %>%
  mutate(Period = case_when(smokban == 0 ~ "1. Before Ban",
                            smokban == 1 ~ "2. After Ban")) %>%
  select(Period, infarcts, infarct.rate) %>%
  gather(Variable, Value, -Period) %>%
  group_by(Period, Variable) %>%
  summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
Period | Variable | n | Mean | SD | Median | IQR | Min | Max |
---|---|---|---|---|---|---|---|---|
1. Before Ban | infarct.rate | 36 | 213.2719 | 16.81020 | 213.1061 | 24.62515 | 175.0353 | 244.3259 |
1. Before Ban | infarcts | 36 | 810.8611 | 67.76521 | 811.0000 | 99.25000 | 659.0000 | 937.0000 |
2. After Ban | infarct.rate | 23 | 220.5118 | 17.59869 | 217.8656 | 21.43912 | 190.9362 | 260.7162 |
2. After Ban | infarcts | 23 | 857.5217 | 71.62394 | 843.0000 | 83.00000 | 738.0000 | 1021.0000 |
# Seasonality, trend and variance:
library(forecast)
components <- decompose(ts(data$infarct.rate, frequency=12))
plot(components)
# Is there seasonality in the dataset?
x <- ts(data$infarct.rate, frequency=12) # 12 measurements per year
fit <- tbats(x)
!is.null(fit$seasonal)
## [1] TRUE
# Is the variance constant?
first.half <- data$infarct.rate[1: floor(0.5 * length(data$infarct.rate))]
second.half <- data$infarct.rate[ceiling(0.5 * length(data$infarct.rate)):length(data$infarct.rate)]
# The ratio is above 1, i.e. the variance is somewhat larger in the first half:
(ratio <- var(first.half) / var(second.half) )
## [1] 1.300939
# Is it statistically significant?
var.test(first.half,second.half)
##
## F test to compare two variances
##
## data: first.half and second.half
## F = 1.3009, num df = 28, denom df = 29, p-value = 0.4853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.6164912 2.7589393
## sample estimates:
## ratio of variances
## 1.300939
With p = 0.49, the difference in variances is not statistically significant, so constant variance is a reasonable working assumption.
# Is there a trend?
summary(lm(infarct.rate ~ time, data))
##
## Call:
## lm(formula = infarct.rate ~ time, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.866 -9.860 -3.236 9.517 35.518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 201.7192 4.0603 49.681 < 2e-16 ***
## time 0.4792 0.1177 4.071 0.000146 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.4 on 57 degrees of freedom
## Multiple R-squared: 0.2253, Adjusted R-squared: 0.2117
## F-statistic: 16.57 on 1 and 57 DF, p-value: 0.0001459
We model the count data directly (rather than the rate, which does not follow a Poisson distribution), using the log-transformed age-standardised population as an offset variable so that the model effectively estimates rates.
library(Epi)
model1 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time, family=poisson, data)
summary(model1)
##
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time,
## family = poisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4336 -1.4432 -0.4431 1.2735 5.0701
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.2431509 0.0111894 -557.955 < 2e-16 ***
## smokban -0.1116284 0.0172248 -6.481 9.13e-11 ***
## time 0.0049450 0.0004992 9.905 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 308.52 on 58 degrees of freedom
## Residual deviance: 197.31 on 56 degrees of freedom
## AIC: 708.03
##
## Number of Fisher Scoring iterations: 3
round(ci.lin(model1,Exp=T),3)
## Estimate StdErr z P exp(Est.) 2.5% 97.5%
## (Intercept) -6.243 0.011 -557.955 0 0.002 0.002 0.002
## smokban -0.112 0.017 -6.481 0 0.894 0.865 0.925
## time 0.005 0.000 9.905 0 1.005 1.004 1.006
The predictions with this model are as follows:
# Predictions:
datanew <- data.frame(stdpop = mean(data$stdpop),
smokban = rep(c(0, 1), c(360, 240)),
time = 1:600/10,
month = rep(1:120/10, 5))
head(datanew)
## stdpop smokban time month
## 1 383464.4 0 0.1 0.1
## 2 383464.4 0 0.2 0.2
## 3 383464.4 0 0.3 0.3
## 4 383464.4 0 0.4 0.4
## 5 383464.4 0 0.5 0.5
## 6 383464.4 0 0.6 0.6
pred1 <- predict(model1, type="response", datanew) / mean(data$stdpop) * 10^5
plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
title("Sicily, 2002-2006")
# To plot the counterfactual scenario we create a data frame as if smokban (the intervention) had never been implemented.
datanew <- data.frame(stdpop=mean(data$stdpop),smokban=0,time=1:600/10,
month=rep(1:120/10,5))
pred1b <- predict(model1, datanew, type="response") / mean(data$stdpop) * 10^5
plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
lines(datanew$time, pred1b, col=2, lty=2)
title("Sicily, 2002-2006")
In the model above we have not allowed for overdispersion. To do so, we can use a quasi-Poisson model, which allows the variance to be proportional to, rather than equal to, the mean.
model2 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time,
family=quasipoisson, data)
summary(model2)
##
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time,
## family = quasipoisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4336 -1.4432 -0.4431 1.2735 5.0701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2431509 0.0211096 -295.749 < 2e-16 ***
## smokban -0.1116284 0.0324959 -3.435 0.00112 **
## time 0.0049450 0.0009418 5.250 2.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.559195)
##
## Null deviance: 308.52 on 58 degrees of freedom
## Residual deviance: 197.31 on 56 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
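The dispersion parameter the quasi-Poisson fit reports is simply the sum of squared Pearson residuals divided by the residual degrees of freedom. A sketch of computing it by hand, on simulated overdispersed counts so the block stands alone (all names and numbers illustrative):

```r
# Simulated overdispersed counts (negative binomial; size controls overdispersion)
set.seed(7)
d <- data.frame(t = 1:60)
d$y <- rnbinom(60, mu = exp(4 + 0.01 * d$t), size = 5)

m <- glm(y ~ t, family = poisson, data = d)

# Dispersion = sum of squared Pearson residuals / residual degrees of freedom
phi <- sum(residuals(m, type = "pearson")^2) / df.residual(m)
phi   # well above 1 here, indicating overdispersion

# quasipoisson reports the same quantity as its dispersion parameter
summary(glm(y ~ t, family = quasipoisson, data = d))$dispersion
```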
A second assumption of standard regression models is that observations are independent. This assumption is often violated in time series data because consecutive observations tend to be more similar to one another than those further apart, a phenomenon known as autocorrelation. Fortunately, in many epidemiological datasets, autocorrelation is largely explained by other variables, in particular seasonality (discussed above); after controlling for these factors, residual autocorrelation is rarely a problem. Nevertheless, autocorrelation should always be assessed by examining a plot of the residuals and the partial autocorrelation function and, where data are normally distributed, by conducting tests such as the Breusch-Godfrey test. Where residual autocorrelation remains, it should be adjusted for using methods such as Prais regression or autoregressive integrated moving average (ARIMA) models, described in more detail elsewhere. There is very little evidence of autocorrelation in the worked example, and even less after adjustment for seasonality.
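The Breusch-Godfrey test mentioned above can be computed by hand in base R: regress the model residuals on the original regressors plus the lagged residual, and compare \(n R^2\) to a chi-squared distribution. A sketch on simulated data with AR(1) errors (all names and numbers illustrative):

```r
# Simulated linear-model data with AR(1) errors, so autocorrelation is present
set.seed(3)
t <- 1:60
e <- as.numeric(arima.sim(list(ar = 0.6), n = 60))
y <- 10 + 0.05 * t + e

fit <- lm(y ~ t)
r   <- residuals(fit)

# Breusch-Godfrey (order 1) by hand: regress residuals on the original
# regressors plus the lagged residual (missing initial lag set to 0);
# LM statistic = n * R^2, approximately chi-squared with 1 df
r_lag <- c(0, head(r, -1))
aux   <- lm(r ~ t + r_lag)
LM    <- length(r) * summary(aux)$r.squared
pval  <- pchisq(LM, df = 1, lower.tail = FALSE)
c(LM = LM, p.value = pval)   # a small p-value flags first-order autocorrelation
```

The same test is packaged as `bgtest()` in the lmtest package, which also handles higher orders.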
# Checking autocorrelation and partial autocorrelation in the model (using residuals):
library(ggplot2)
library(cowplot)
library(fpp2)
res2 <- residuals(model2, type="deviance")
ggtsdisplay(res2)
pattern <- decompose(ts(res2, frequency=12))
plot(pattern)
!is.null(tbats(ts(res2, frequency=12))$seasonal)
## [1] TRUE
Clearly there is residual seasonality. There are various ways of adjusting for it; here we use harmonic terms, specifying the number of sine/cosine pairs to include (in this case 3) and the length of the period (12 months):
library(tsModel)
model3 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time + harmonic(month,3,12),
family=quasipoisson, data)
summary(model3)
##
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time +
## harmonic(month, 3, 12), family = quasipoisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.189 -1.055 -0.107 1.037 3.468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2521824 0.0176090 -355.056 < 2e-16 ***
## smokban -0.1210185 0.0271207 -4.462 4.62e-05 ***
## time 0.0053751 0.0007985 6.731 1.58e-08 ***
## harmonic(month, 3, 12)1 0.0368454 0.0100163 3.679 0.000574 ***
## harmonic(month, 3, 12)2 -0.0182348 0.0096298 -1.894 0.064075 .
## harmonic(month, 3, 12)3 0.0062659 0.0095570 0.656 0.515061
## harmonic(month, 3, 12)4 0.0386223 0.0096767 3.991 0.000215 ***
## harmonic(month, 3, 12)5 0.0149246 0.0097112 1.537 0.130636
## harmonic(month, 3, 12)6 0.0127700 0.0097201 1.314 0.194921
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.256077)
##
## Null deviance: 308.52 on 58 degrees of freedom
## Residual deviance: 112.39 on 50 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 3
round(ci.lin(model3,Exp=T),3)
## Estimate StdErr z P exp(Est.) 2.5% 97.5%
## (Intercept) -6.252 0.018 -355.056 0.000 0.002 0.002 0.002
## smokban -0.121 0.027 -4.462 0.000 0.886 0.840 0.934
## time 0.005 0.001 6.731 0.000 1.005 1.004 1.007
## harmonic(month, 3, 12)1 0.037 0.010 3.679 0.000 1.038 1.017 1.058
## harmonic(month, 3, 12)2 -0.018 0.010 -1.894 0.058 0.982 0.964 1.001
## harmonic(month, 3, 12)3 0.006 0.010 0.656 0.512 1.006 0.988 1.025
## harmonic(month, 3, 12)4 0.039 0.010 3.991 0.000 1.039 1.020 1.059
## harmonic(month, 3, 12)5 0.015 0.010 1.537 0.124 1.015 0.996 1.035
## harmonic(month, 3, 12)6 0.013 0.010 1.314 0.189 1.013 0.994 1.032
ci.lin(model3,Exp=T)["smokban",5:7]
## exp(Est.) 2.5% 97.5%
## 0.8860176 0.8401506 0.9343886
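The harmonic(month, 3, 12) call from tsModel expands to explicit sine/cosine pairs; a base-R sketch of the equivalent construction (the helper name make_harmonics is hypothetical):

```r
# Build nfreq sine/cosine pairs with the given period, by hand
make_harmonics <- function(x, nfreq = 3, period = 12) {
  do.call(cbind, lapply(seq_len(nfreq), function(k)
    cbind(sin(2 * pi * k * x / period),
          cos(2 * pi * k * x / period))))
}

month <- rep(1:12, 5)
H <- make_harmonics(month)
dim(H)   # 60 rows, 6 columns: three sine/cosine pairs
```

Up to column ordering and naming, these are the seasonal regressors that harmonic(month, 3, 12) contributes to the model matrix, so they should yield the same seasonal adjustment without tsModel.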
# After adjusting for seasonality:
# Checking residuals:
res3 <- residuals(model3,type="deviance")
ggtsdisplay(res3)
pattern <- decompose(ts(res3, frequency=12))
plot(pattern)
!is.null(tbats(ts(res3, frequency=12))$seasonal)
## [1] FALSE
After adjustment for seasonality through the harmonic (Fourier) terms, the association is largely unaffected
(RR: 0.886; 95% CI 0.840-0.934; P < 0.001).
Taking seasonality into account, the long-term temporal trend multiplies the ACE rate by about 1.07 each year:
exp(coef(model3)["time"]*12)
## time
## 1.066627
# Predict and plot of the seasonally adjusted model
datanew <- data.frame(stdpop = mean(data$stdpop),
smokban = rep(c(0, 1), c(360, 240)),
time = 1:600/10,
month = rep(1:120/10, 5))
pred3 <- predict(model3, type="response", datanew) / mean(data$stdpop) * 10^5
plot(data$infarct.rate, type="n", ylim=c(120,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 120, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate, cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines(1:600/10, pred3, col=2)
title("Sicily, 2002-2006")
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.