NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Orthogonal, Uncorrelated and Independent Random Variables:


From this post:

Orthogonality is defined as \(\mathbb E[XY^*]=0\) (for real-valued random variables, simply \(\mathbb E[XY]=0\)).

So:

If \(Y=X^2\) and \(X\) has a symmetric pdf, then they are dependent yet orthogonal.

If \(Y=X^2\) but the pdf of \(X\) is zero for negative values, then they are dependent and not orthogonal.

Therefore, orthogonality does not imply independence.

See a proof of orthogonality of these two random variables here.
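A quick Monte Carlo check of both claims (a sketch; the standard normal and the exponential are just convenient choices of a symmetric pdf and of a pdf that is zero for negative values):

set.seed(1)
n <- 1e6
## Symmetric pdf: X ~ N(0,1) and Y = X^2 are dependent but orthogonal
X <- rnorm(n)
Y <- X^2
mean(X * Y)   # approximately 0, since E[XY] = E[X^3] = 0 by symmetry
## Pdf zero for negative values: X ~ Exp(1) and Y = X^2 are dependent and not orthogonal
X <- rexp(n)
Y <- X^2
mean(X * Y)   # approximately 6, since E[XY] = E[X^3] = 3! for Exp(1)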

Another way of seeing this is by considering

\[\text{Cov}(X,Y)=\sigma_{X,Y}=\mathbb E[XY]-\mathbb E[X] \mathbb E[Y]\]

Since in the case of \(X \sim N(0,1)\) we have \(\mathbb E[X]=0,\) zero covariance is equivalent to orthogonality. See also this post.
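A quick numerical illustration (a sketch; the particular choice of \(Y\) below is arbitrary, the only point being that \(\mathbb E[X]=0\)):

set.seed(0)
X <- rnorm(1e6)            # E[X] = 0
Y <- 2 * X + rnorm(1e6)    # an arbitrary Y that depends on X
## With E[X] = 0 the term E[X]E[Y] vanishes, so cov(X, Y) = E[XY]:
c(cov(X, Y), mean(X * Y))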


From this post

I realized today that orthogonality and non-correlation (in the linear sense) of two random variables \(X\) and \(Y\) are strongly linked in our minds, and they shouldn't be. The fact is that whether orthogonality, i.e. \(\mathbb E(XY)=0\), goes together with correlation or with non-correlation depends on whether the expected values of the variables are zero or not. I believe in the power of tables, so here it is, easily verifiable from the definition \[\text{Cov}(X,Y) =\mathbb E(XY) - \mathbb E(X) \mathbb E(Y)\]


To illustrate the first scenario with an example, consider two jointly Gaussian random variables with non-zero expectations \(\mathbb E[X] = -1\) and \(\mathbb E[Y] = 1\) and a covariance of \(1.\) This implies that \(\mathbb E[XY]=\sigma_{X,Y}+\mathbb E[X] \mathbb E[Y]=1-1=0,\) i.e. the variables are orthogonal although they are correlated.

library(mvtnorm)
library(squash)
library(plot3D)
knitr::opts_chunk$set(fig.width=12, fig.height=8) 
set.seed(0)

n <- 10000

sigma <- matrix(c(1,1,1,1),nrow=2,byrow=T)
mean <- c(-1,1)

bivar <- rmvnorm(n,mean,sigma)
X <- bivar[,1]
Y <- bivar[,2]
print(paste('The mean of X is: ', round(mean(X)), ' and the mean of Y is: ', round(mean(Y)),'.'))
## [1] "The mean of X is:  -1  and the mean of Y is:  1 ."
print(paste('The covariance of X and Y is: ', round(cov(X,Y)),'.'))
## [1] "The covariance of X and Y is:  1 ."
print(paste('The expectation of XY is: ', round((X %*% Y) / n),'.'))
## [1] "The expectation of XY is:  0 ."

Here is the bivariate histogram:

##  Create cuts:
x_c <- cut(X, 50)
y_c <- cut(Y, 50)
##  Calculate joint counts at cut levels:
z <- table(x_c, y_c)
##  Plot as a 3D histogram:
hist3D(x = seq(-3, 3, length.out = nrow(z)),
       y = seq(-3, 3, length.out = nrow(z)),
       z=z, col = jet.col(100, alpha = 0.8), 
       phi = 0, theta=180, contour = T,
       border = NA, facets = T, lighting = 'exponent',
       colkey = F, scale = T, bty ='g',
       ticktype = 'detailed', zlab='', xlab='X', ylab='Y',
       axes3D='gray', frequency=T)

As for the second scenario, keeping the same non-zero expectations, we can produce non-orthogonal random variables with zero correlation: \(\mathbb E[XY]=\sigma_{X,Y}+\mathbb E[X] \mathbb E[Y]=0-1=-1.\)

Sig <- matrix(c(1,0,0,1),nrow=2,byrow=T)
bivar <- rmvnorm(n,mean,Sig)

X <- bivar[,1]
Y <- bivar[,2]
print(paste('The mean of X is: ', round(mean(X)), ' and the mean of Y is: ', round(mean(Y)),'.'))
## [1] "The mean of X is:  -1  and the mean of Y is:  1 ."
print(paste('The covariance of X and Y is: ', round(cov(X,Y)),'.'))
## [1] "The covariance of X and Y is:  0 ."
print(paste('The expectation of XY is: ', round((X %*% Y) / n),'.'))
## [1] "The expectation of XY is:  -1 ."
##  Create cuts:
x_c <- cut(X, 70)
y_c <- cut(Y, 70)
##  Calculate joint counts at cut levels:
z <- table(x_c, y_c)

##  Plot as a 3D histogram:
hist3D(x = seq(-5, 5, length.out = nrow(z)),
       y = seq(-5, 5, length.out = nrow(z)),
       z=z, 
       phi = 24, theta=-65, 
       border = NA, facets = T, lighting = 'ambient',
       scale = T, bty ='g', 
       ticktype = 'detailed', zlab='', xlab='X', ylab='Y',
       axes3D='gray')

The third scenario, in which at least one of the random variables has zero expectation and the variables are both orthogonal and uncorrelated, can be exemplified as before with \(X \sim N(0,1)\) and \(Y=X^2\): \(\mathbb E[XY]=\sigma_{X,Y}+\mathbb E[X] \mathbb E[Y]=0+0=0.\) The covariance is zero, as shown here:

The general form of the covariance depends on the first three moments of the distribution. To facilitate our analysis, we suppose that \(X\) has mean \(\mu\), variance \(\sigma^2\) and skewness \(\gamma\). The covariance of interest exists if \(\gamma < \infty\) and does not exist otherwise. Using the relationship between the raw moments and the cumulants, you have the general expression:

\[\begin{equation} \begin{aligned} \mathbb{C}(X,X^2) &= \mathbb{E}(X^3) - \mathbb{E}(X) \mathbb{E}(X^2) \\[6pt] &= ( \mu^3 + 3 \mu \sigma^2 + \gamma \sigma^3 ) - \mu ( \mu^2 + \sigma^2 ) \\[6pt] &= 2 \mu \sigma^2 + \gamma \sigma^3. \\[6pt] \end{aligned} \end{equation}\]

The special case for an unskewed distribution with zero mean (e.g., the centred normal distribution) occurs when \(\mu = 0\) and \(\gamma = 0\), which gives zero covariance.

n <- 1e6
X <- rnorm(n)
Y <- X^2
print(paste('The mean of X is: ', round(mean(X)), ' and the mean of Y is: ', round(mean(Y)),'.'))
## [1] "The mean of X is:  0  and the mean of Y is:  1 ."
print(paste('The covariance of X and Y is: ', round(cov(X,Y)),'.'))
## [1] "The covariance of X and Y is:  0 ."
print(paste('The expectation of XY is: ', round((X %*% Y) / n),'.'))
## [1] "The expectation of XY is:  0 ."

A similar example is \(X \sim \text{Unif}(-1,1)\) and \(Y=\vert X \vert.\) On the positive half of the support the correlation is \(1\), and on the negative half it is \(-1\), so the two halves contribute equal and opposite amounts to the covariance, which is therefore \(0.\) Since \(\mathbb E[X]=0,\) it follows that \(\mathbb E[XY]=\sigma_{X,Y}=0\):

X <- runif(1e6,min=-1,max=1)
Y <- ifelse(X>0,X,-X)
plot(X[1:100],Y[1:100],pch=16)
rug(X[1:100],side=1,col="grey")
rug(Y[1:100],side=2,col="grey")
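A quick check that the simulated variables behave as claimed (both quantities should be approximately zero):

round(cov(X, Y), 3)      # covariance, approximately 0
round(mean(X * Y), 3)    # E[XY], approximately 0, so X and Y are also orthogonal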


These concepts are nicely explained in this post:

Independence is a statistical concept. Two random variables \(X\) and \(Y\) are statistically independent if their joint distribution is the product of the marginal distributions, i.e. \[ f(x, y) = f(x) f(y) \] if each variable has a density \(f\), or more generally \[ F(x, y) = F(x) F(y) \] where \(F\) denotes each random variable’s cumulative distribution function.
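As a quick illustration of the CDF factorization (a sketch; the standard normal marginals and the evaluation point \((0.5, -0.3)\) are arbitrary choices):

set.seed(3)
X <- rnorm(1e5)
Y <- rnorm(1e5)                     # independent of X by construction
## Empirical check of F(x, y) = F(x) F(y) at the point (0.5, -0.3):
mean(X <= 0.5 & Y <= -0.3)          # joint CDF estimate
mean(X <= 0.5) * mean(Y <= -0.3)    # product of the marginal CDF estimates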

Correlation is a weaker but related statistical concept. The (Pearson) correlation of two random variables is the expectation of the product of the standardized variables, i.e. \[ \newcommand{\E}{\mathbf E} \rho = \E \left [ \frac{X - \E[X]}{\sqrt{\E[(X - \E[X])^2]}} \frac{Y - \E[Y]}{\sqrt{\E[(Y - \E[Y])^2]}} \right ]. \] The variables are uncorrelated if \(\rho = 0\). It can be shown that two random variables that are independent are necessarily uncorrelated, but not vice versa.
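The definition can be checked against R's built-in estimator (a sketch reusing the mvtnorm package loaded above; the bivariate normal parameters are arbitrary, and the hand-rolled estimate differs from cor() only by a factor of \((n-1)/n\)):

set.seed(4)
n <- 1e5
bivar <- rmvnorm(n, mean = c(-1, 1), sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
X <- bivar[, 1]
Y <- bivar[, 2]
## Expectation of the product of the standardized variables:
rho_hat <- mean((X - mean(X)) / sd(X) * (Y - mean(Y)) / sd(Y))
c(rho_hat, cor(X, Y))    # both approximately 0.5, the true correlation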

Orthogonality is a concept that originated in geometry, and was generalized in linear algebra and related fields of mathematics. In linear algebra, orthogonality of two vectors \(u\) and \(v\) is defined in inner product spaces, i.e. vector spaces with an inner product \(\langle u, v \rangle\), as the condition that \[ \langle u, v \rangle = 0. \] The inner product can be defined in different ways (resulting in different inner product spaces). If the vectors are given in the form of sequences of numbers, \(u = (u_1, u_2, \ldots u_n)\), then a typical choice is the dot product, \(\langle u, v \rangle = \sum_{i = 1}^n u_i v_i\).
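For example, a minimal sketch of the dot-product definition:

u <- c(1, 1, 0)
v <- c(1, -1, 3)
sum(u * v)       # 0, so u and v are orthogonal under the dot product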


Orthogonality is therefore not a statistical concept per se, and the confusion likely arises from different translations of the linear algebra concept to statistics:

  1. Formally, a space of random variables can be considered as a vector space. It is then possible to define an inner product in that space, in different ways. One common choice is to define it as the covariance: \[ \langle X, Y \rangle = \mathrm{cov} (X, Y) = \E [ (X - \E[X]) (Y - \E[Y]) ]. \] Since the correlation of two random variables is zero exactly when the covariance is zero, according to this definition uncorrelatedness is the same as orthogonality. (Another possibility is to define the inner product of random variables simply as the expectation of the product.) A numerical sketch of both readings follows this list.

  2. Not all the variables we consider in statistics are random variables. Especially in linear regression, we have independent variables which are not considered random but predefined. Independent variables are usually given as sequences of numbers, for which orthogonality is naturally defined by the dot product (see above). We can then investigate the statistical consequences of regression models where the independent variables are or are not orthogonal. In this context, orthogonality does not have a specifically statistical definition, and even more: it does not apply to random variables.
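A small numerical sketch of both readings (assumptions: the centered dot product is used as the sample analogue of the covariance inner product in 1., and the design matrix in 2. is just a toy example with an intercept and one centered predictor):

set.seed(5)
n <- 1e4
X <- rnorm(n, mean = 3)
Y <- rnorm(n, mean = -2)
## 1. Covariance as inner product: for the centered variables the dot product
##    equals (n - 1) times the sample covariance, so zero covariance amounts
##    to orthogonality of the centered vectors.
xc <- X - mean(X)
yc <- Y - mean(Y)
c(sum(xc * yc) / (n - 1), cov(X, Y))

## 2. Orthogonality of non-random regressors via the plain dot product:
x <- c(scale(1:10, scale = FALSE))    # a centered predictor
design <- cbind(intercept = 1, x = x)
crossprod(design)                     # off-diagonal entries are 0: orthogonal columns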

Addition responding to Silverfish’s comment: Orthogonality is not only relevant with respect to the original regressors but also with respect to contrasts, because (sets of) simple contrasts (specified by contrast vectors) can be seen as transformations of the design matrix, i.e. the set of independent variables, into a new set of independent variables. Orthogonality for contrasts is defined via the dot product. If the original regressors are mutually orthogonal and one applies orthogonal contrasts, the new regressors are mutually orthogonal, too. This ensures that the set of contrasts can be seen as describing a decomposition of variance, e.g. into main effects and interactions, the idea underlying ANOVA.
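For instance, a sketch using R's built-in Helmert contrasts for a four-level factor:

## Helmert contrasts: each column sums to zero (orthogonal to the intercept)
## and the columns are mutually orthogonal under the dot product.
C <- contr.helmert(4)
crossprod(C)     # diagonal matrix: all off-diagonal dot products are 0
colSums(C)       # all zero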

Since, according to variant 1, uncorrelatedness and orthogonality are just different names for the same thing, in my opinion it is best to avoid using the term orthogonality in that sense. If we want to talk about uncorrelatedness of random variables, let's just say so and not complicate matters by using another word with a different background and different implications. This also frees up the term orthogonality to be used according to variant 2, which is especially useful in discussing multiple regression. Conversely, we should avoid applying the term correlation to independent variables, since they are not random variables.


Rodgers et al.’s presentation is largely in line with this view, especially as they understand orthogonality to be distinct from uncorrelatedness. However, they do apply the term correlation to non-random variables (sequences of numbers). This only makes sense statistically with respect to the sample correlation coefficient \(r\). I would still recommend avoiding this use of the term, unless the number sequence is considered as a sequence of realizations of a random variable.




NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.