NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Interrupted time series regression for the evaluation of public health interventions:

from here for the theory, and here for the code.

In an ITS study, a time series of a particular outcome of interest is used to establish an underlying trend, which is ‘interrupted’ by an intervention at a known point in time.

The hypothetical scenario under which the intervention had not taken place and the trend continues unchanged (that is: the ‘expected’ trend, in the absence of the intervention, given the pre-existing trend) is referred to as the ‘counterfactual’.

The interrupted time series (ITS) study design is increasingly being used for the evaluation of public health interventions; it is particularly suited to interventions introduced at a population level over a clearly defined time period and that target population-level health outcomes.

ITS has been used for the evaluation of a wide range of public health interventions including new vaccines, cycle helmet legislation, changes to paracetamol packaging, traffic speed zones and precautions against nosocomial infections, as well as in the evaluation of health impacts of unplanned events such as the global financial crisis.

In standard ITS analyses, the following segmented regression model is used:

\[Y_t = β_0 + β_1 T + β_2 X_t + β_3 T X_t\]

where \(β_0\) represents the baseline level at \(T = 0\), \(β_1\) is the change in outcome per unit increase in time (the underlying pre-intervention trend), \(β_2\) is the level change following the intervention and \(β_3\) is the slope change following the intervention (captured by the interaction between time and intervention, \(TX_t\)).

The model above corresponds to impact model \((c)\):


models \((a)\) and \((b)\) can easily be specified by excluding the terms \(β_3TX_t\) or \(β_2X_t\), respectively. The other models require slightly more complex variable specifications.
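As a minimal sketch (simulated data, hypothetical coefficient values), model \((c)\) and its reduced forms \((a)\) and \((b)\) can be fitted with `lm`:

```r
# Sketch: segmented regression on simulated monthly data (hypothetical values).
set.seed(1)
T <- 1:60                       # elapsed time
X <- as.numeric(T > 36)         # intervention indicator: 0 before, 1 after
Y <- 100 + 0.5*T + 8*X + 0.3*T*X + rnorm(60, sd = 3)

fit_c <- lm(Y ~ T + X + T:X)    # level and slope change (model c)
fit_a <- lm(Y ~ T + X)          # level change only (model a: drops beta_3*T*X)
fit_b <- lm(Y ~ T + T:X)        # slope change only (model b: drops beta_2*X)
coef(fit_c)                     # estimates of beta_0 .. beta_3
```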


From this site:

In the dataset sicily, aces is the count of acute coronary episodes (ACEs) in Sicily per month (the outcome); smokban is the smoking ban (the intervention), coded 0 before the intervention and 1 after; time is the elapsed time (in months) since the start of the study; pop is the population of Sicily (in 10000s); and stdpop is the age-standardised population.

data <- read.csv('https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/sicily.csv')
names(data)[names(data) == 'aces'] <- 'infarcts'
head(data)
##   year month infarcts time smokban      pop   stdpop
## 1 2002     1      728    1       0 364277.4 379875.3
## 2 2002     2      659    2       0 364277.4 376495.5
## 3 2002     3      791    3       0 364277.4 377040.8
## 4 2002     4      734    4       0 364277.4 377116.4
## 5 2002     5      757    5       0 364277.4 377383.4
## 6 2002     6      726    6       0 364277.4 374113.1

Descriptive analysis:

Examining the data is an important first step. Looking at the pre-intervention trend can give an indication of how stable the trend is over time, whether a linear model is likely to be appropriate, and whether there appears to be a seasonal trend.

library(tidyverse) # for the pipe (%>%), gather(), etc.

data$infarct.rate <- with(data, infarcts/stdpop*10^5)

plot(data$infarct.rate, ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", 
      pch=19, bty="l", xaxt="n")

axis(1, at=0:4*12 + 6, labels=2002:2006)
title("Sicily, 2002-2006")

# Summary statistics:
data %>%
  gather(Variable, Value) %>%
  group_by(Variable) %>%
  summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
| Variable     | n  | Mean         | SD           | Median      | IQR        | Min         | Max         |
|--------------|---:|-------------:|-------------:|------------:|-----------:|------------:|------------:|
| infarct.rate | 59 | 2.160942e+02 |   17.3402530 |    215.4118 |   26.38489 |    175.0353 |    260.7162 |
| infarcts     | 59 | 8.290508e+02 |   72.4139022 |    831.0000 |  111.50000 |    659.0000 |   1021.0000 |
| month        | 59 | 6.406780e+00 |    3.4347044 |      6.0000 |    5.50000 |      1.0000 |     12.0000 |
| pop          | 59 | 3.641212e+05 |  481.2691260 | 364277.4000 |  588.20000 | 363350.8000 | 364700.4000 |
| smokban      | 59 | 3.898305e-01 |    0.4918981 |      0.0000 |    1.00000 |      0.0000 |      1.0000 |
| stdpop       | 59 | 3.834644e+05 | 5245.7738584 | 383428.4000 | 7898.05000 | 374113.1000 | 394005.6000 |
| time         | 59 | 3.000000e+01 |   17.1755640 |     30.0000 |   29.00000 |      1.0000 |     59.0000 |
| year         | 59 | 2.003966e+03 |    1.4138002 |   2004.0000 |    2.00000 |   2002.0000 |   2006.0000 |
# Statistics before and after smoking ban:
data %>%
  mutate(Period = case_when(smokban == 0 ~ "1. Before Ban",
                            smokban == 1 ~ "2. After Ban")) %>%
  select(Period, infarcts, infarct.rate) %>%
  gather(Variable, Value, -Period) %>%
  group_by(Period, Variable) %>%
    summarize(n = n(),
            Mean = mean(Value),
            SD = sd(Value),
            Median = median(Value),
            IQR = IQR(Value),
            Min = min(Value),
            Max = max(Value)) %>%
  knitr::kable()
| Period        | Variable     | n  | Mean     | SD       | Median   | IQR      | Min      | Max       |
|---------------|--------------|---:|---------:|---------:|---------:|---------:|---------:|----------:|
| 1. Before Ban | infarct.rate | 36 | 213.2719 | 16.81020 | 213.1061 | 24.62515 | 175.0353 |  244.3259 |
| 1. Before Ban | infarcts     | 36 | 810.8611 | 67.76521 | 811.0000 | 99.25000 | 659.0000 |  937.0000 |
| 2. After Ban  | infarct.rate | 23 | 220.5118 | 17.59869 | 217.8656 | 21.43912 | 190.9362 |  260.7162 |
| 2. After Ban  | infarcts     | 23 | 857.5217 | 71.62394 | 843.0000 | 83.00000 | 738.0000 | 1021.0000 |

Are the data stationary? That is, is there no seasonality, no trend, and does the variance remain constant over time?

From here, here and here.

# Seasonality, trend and variance:
library(forecast)

components <- decompose(ts(data$infarct.rate, frequency=12))
plot(components)

# Is there seasonality in the dataset?

x <- ts(data$infarct.rate, frequency=12) # 12 measurements per year
fit <- tbats(x)
!is.null(fit$seasonal)
## [1] TRUE
# Is the variance constant?

first.half <- data$infarct.rate[1: floor(0.5 * length(data$infarct.rate))]
second.half <- data$infarct.rate[ceiling(0.5 * length(data$infarct.rate)):length(data$infarct.rate)]

# The variance of the first half is about 1.3 times that of the second:

(ratio <- var(first.half) / var(second.half) )
## [1] 1.300939
# Is it statistically significant?

var.test(first.half,second.half)
## 
##  F test to compare two variances
## 
## data:  first.half and second.half
## F = 1.3009, num df = 28, denom df = 29, p-value = 0.4853
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.6164912 2.7589393
## sample estimates:
## ratio of variances 
##           1.300939
# Is there a trend?

summary(lm(infarct.rate ~ time, data))
## 
## Call:
## lm(formula = infarct.rate ~ time, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.866  -9.860  -3.236   9.517  35.518 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.7192     4.0603  49.681  < 2e-16 ***
## time          0.4792     0.1177   4.071 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.4 on 57 degrees of freedom
## Multiple R-squared:  0.2253, Adjusted R-squared:  0.2117 
## F-statistic: 16.57 on 1 and 57 DF,  p-value: 0.0001459

Poisson Regression Model:

We model the count data directly (rather than the rate, which doesn't follow a Poisson distribution), using the log-transformed age-standardised population as an offset variable in order to transform back to rates.
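The offset works because, with a log link, subtracting the log-population from both sides turns a model for counts into a model for rates:

\[\log E[Y_t] = \log(\mathrm{stdpop}_t) + β_0 + β_1 t + β_2 X_t
\quad\Longleftrightarrow\quad
\log\frac{E[Y_t]}{\mathrm{stdpop}_t} = β_0 + β_1 t + β_2 X_t\]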

library(Epi)

model1 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time, family=poisson, data)
summary(model1)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time, 
##     family = poisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4336  -1.4432  -0.4431   1.2735   5.0701  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -6.2431509  0.0111894 -557.955  < 2e-16 ***
## smokban     -0.1116284  0.0172248   -6.481 9.13e-11 ***
## time         0.0049450  0.0004992    9.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 197.31  on 56  degrees of freedom
## AIC: 708.03
## 
## Number of Fisher Scoring iterations: 3
round(ci.lin(model1,Exp=T),3)
##             Estimate StdErr        z P exp(Est.)  2.5% 97.5%
## (Intercept)   -6.243  0.011 -557.955 0     0.002 0.002 0.002
## smokban       -0.112  0.017   -6.481 0     0.894 0.865 0.925
## time           0.005  0.000    9.905 0     1.005 1.004 1.006

The predictions with this model are as follows:

# Predictions:

datanew <- data.frame(stdpop = mean(data$stdpop),
                      smokban = rep(c(0, 1), c(360, 240)),
                      time = 1:600/10,
                      month = rep(1:120/10, 5))

head(datanew)
##     stdpop smokban time month
## 1 383464.4       0  0.1   0.1
## 2 383464.4       0  0.2   0.2
## 3 383464.4       0  0.3   0.3
## 4 383464.4       0  0.4   0.4
## 5 383464.4       0  0.5   0.5
## 6 383464.4       0  0.6   0.6
pred1 <- predict(model1, type="response", datanew) / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
title("Sicily, 2002-2006")

# To plot the counterfactual scenario we create a data frame as if smokban (the intervention) had never been implemented.

datanew <- data.frame(stdpop=mean(data$stdpop),smokban=0,time=1:600/10,
  month=rep(1:120/10,5))

pred1b <- predict(model1, datanew, type="response") / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(0,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 0, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate,cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines((1:600/10), pred1, col=2)
lines(datanew$time, pred1b, col=2, lty=2)
title("Sicily, 2002-2006")

In the model above we have not allowed for overdispersion. To do this we can use a quasi-Poisson model, which allows the variance to be proportional to, rather than equal to, the mean.
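One way to gauge overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom (for the fit above it is about 3.56, as the quasi-Poisson summary below reports). A self-contained sketch with simulated overdispersed counts standing in for the infarct data:

```r
# Sketch: Pearson dispersion statistic for a Poisson fit.
# Negative-binomial counts simulate overdispersion (variance > mean).
set.seed(2)
t   <- 1:60
mu  <- exp(5 + 0.005 * t)
y   <- rnbinom(60, mu = mu, size = 20)
fit <- glm(y ~ t, family = poisson)

# ~1 for a well-specified Poisson model, >1 under overdispersion
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi
```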

model2 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time, 
              family=quasipoisson, data)
summary(model2)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time, 
##     family = quasipoisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4336  -1.4432  -0.4431   1.2735   5.0701  
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -6.2431509  0.0211096 -295.749  < 2e-16 ***
## smokban     -0.1116284  0.0324959   -3.435  0.00112 ** 
## time         0.0049450  0.0009418    5.250 2.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.559195)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 197.31  on 56  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3

A second assumption of standard regression models is that observations are independent. This assumption is often violated in time series data because consecutive observations tend to be more similar to one another than those further apart, a phenomenon known as autocorrelation. Fortunately, in many epidemiological data sets autocorrelation is largely explained by other variables, in particular seasonality (discussed above); after controlling for these factors, residual autocorrelation is rarely a problem. Nevertheless, autocorrelation should always be assessed by examining a plot of the residuals and the partial autocorrelation function and, where data are normally distributed, by conducting tests such as the Breusch-Godfrey test. Where residual autocorrelation remains, it should be adjusted for using methods such as Prais regression or autoregressive integrated moving average (ARIMA) models, described in more detail elsewhere. There is very little evidence of autocorrelation in the worked example, and even less after adjustment for seasonality.
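The Breusch-Godfrey test mentioned above is available as `bgtest()` in the lmtest package; for a quick check on any residual series, base R's Ljung-Box test is a simple alternative. A sketch with a stand-in residual series (in these notes it would be `residuals(model2, type = "deviance")`):

```r
# Sketch: Ljung-Box test for residual autocorrelation up to lag 12.
# White noise stands in for the model residuals here.
set.seed(3)
res <- rnorm(59)
Box.test(res, lag = 12, type = "Ljung-Box")   # H0: no autocorrelation
```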


Checking for correlation, autocorrelation and seasonality in the residuals:

# Checking autocorrelation and partial autocorrelation in the model (using residuals):
library(ggplot2)
library(cowplot) 
library(fpp2)  

res2 <- residuals(model2, type="deviance")

ggtsdisplay(res2)

pattern <- decompose(ts(res2, frequency=12))
plot(pattern)

!is.null(tbats(ts(res2, frequency=12))$seasonal)
## [1] TRUE

Clearly there is residual seasonality. There are various ways of adjusting for seasonality; here we use harmonic terms, specifying the number of sine and cosine pairs to include (in this case three) and the length of the period (12 months):
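Conceptually, the harmonic terms are just sine/cosine pairs at multiples of the base frequency. A base-R sketch of what `harmonic(month, 3, 12)` supplies (not tsModel's actual implementation):

```r
# Sketch: three sine/cosine pairs with a 12-month period, built in base R.
month <- rep(1:12, length.out = 59)
H <- do.call(cbind, lapply(1:3, function(k)
  cbind(sin(2*pi*k*month/12), cos(2*pi*k*month/12))))
dim(H)   # 59 rows, 6 columns (one sine and one cosine per harmonic)
```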

library(tsModel)

model3 <- glm(infarcts ~ offset(log(stdpop)) + smokban + time + harmonic(month,3,12), 
              family=quasipoisson, data)
summary(model3)
## 
## Call:
## glm(formula = infarcts ~ offset(log(stdpop)) + smokban + time + 
##     harmonic(month, 3, 12), family = quasipoisson, data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.189  -1.055  -0.107   1.037   3.468  
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -6.2521824  0.0176090 -355.056  < 2e-16 ***
## smokban                 -0.1210185  0.0271207   -4.462 4.62e-05 ***
## time                     0.0053751  0.0007985    6.731 1.58e-08 ***
## harmonic(month, 3, 12)1  0.0368454  0.0100163    3.679 0.000574 ***
## harmonic(month, 3, 12)2 -0.0182348  0.0096298   -1.894 0.064075 .  
## harmonic(month, 3, 12)3  0.0062659  0.0095570    0.656 0.515061    
## harmonic(month, 3, 12)4  0.0386223  0.0096767    3.991 0.000215 ***
## harmonic(month, 3, 12)5  0.0149246  0.0097112    1.537 0.130636    
## harmonic(month, 3, 12)6  0.0127700  0.0097201    1.314 0.194921    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.256077)
## 
##     Null deviance: 308.52  on 58  degrees of freedom
## Residual deviance: 112.39  on 50  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 3
round(ci.lin(model3,Exp=T),3)
##                         Estimate StdErr        z     P exp(Est.)  2.5% 97.5%
## (Intercept)               -6.252  0.018 -355.056 0.000     0.002 0.002 0.002
## smokban                   -0.121  0.027   -4.462 0.000     0.886 0.840 0.934
## time                       0.005  0.001    6.731 0.000     1.005 1.004 1.007
## harmonic(month, 3, 12)1    0.037  0.010    3.679 0.000     1.038 1.017 1.058
## harmonic(month, 3, 12)2   -0.018  0.010   -1.894 0.058     0.982 0.964 1.001
## harmonic(month, 3, 12)3    0.006  0.010    0.656 0.512     1.006 0.988 1.025
## harmonic(month, 3, 12)4    0.039  0.010    3.991 0.000     1.039 1.020 1.059
## harmonic(month, 3, 12)5    0.015  0.010    1.537 0.124     1.015 0.996 1.035
## harmonic(month, 3, 12)6    0.013  0.010    1.314 0.189     1.013 0.994 1.032
ci.lin(model3,Exp=T)["smokban",5:7]
## exp(Est.)      2.5%     97.5% 
## 0.8860176 0.8401506 0.9343886

After adjusting for seasonality:

# After adjusting for seasonality:

# Checking residuals:

res3 <- residuals(model3,type="deviance")

ggtsdisplay(res3)

pattern <- decompose(ts(res3, frequency=12))
plot(pattern)

!is.null(tbats(ts(res3, frequency=12))$seasonal)
## [1] FALSE

After adjustment for seasonality through Fourier (harmonic) terms, the results suggest that the association is largely unaffected (RR 0.886; 95% CI 0.840-0.934; P < 0.001).

Taking seasonality into account, the ACE rate was multiplied by about 1.07 each year, a long-term temporal trend:

exp(coef(model3)["time"]*12)
##     time 
## 1.066627
# Predict and plot of the seasonally adjusted model


datanew <- data.frame(stdpop = mean(data$stdpop),
                      smokban = rep(c(0, 1), c(360, 240)),
                      time = 1:600/10,
                      month = rep(1:120/10, 5))

pred3 <- predict(model3, type="response", datanew) / mean(data$stdpop) * 10^5

plot(data$infarct.rate, type="n", ylim=c(120,300), xlab="Year", ylab="Std rate x 100,000", bty="l", xaxt="n")
rect(36, 120, 60, 300, col=grey(0.9), border=F)
points(data$infarct.rate, cex=0.7)
axis(1, at=0:5*12, labels=F)
axis(1, at=0:4*12+6, tick=F, labels=2002:2006)
lines(1:600/10, pred3, col=2)
title("Sicily, 2002-2006")



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.