NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Properties of the matrix \(A^\top A\) (the Gramian matrix):


\(A^\top A\) is a key matrix structure because of the role it plays in orthogonal projection. Covariance matrices are just special cases.

\(A^\top A\) is a covariance matrix: you can define a multivariate normal distribution for which \(A^\top A\) is the covariance matrix.

This is equivalent to talking about symmetric positive semidefinite matrices (every s.p.s.d. matrix can be written as \(A^\top A\) for some \(A\)).
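A minimal sketch of that equivalence in base R: given an s.p.s.d. matrix \(S\) with eigendecomposition \(S = Q\Lambda Q^\top\), the factor \(A = \Lambda^{1/2} Q^\top\) satisfies \(A^\top A = S\) (the example matrix is arbitrary):

set.seed(0)
B = matrix(rnorm(9), nrow = 3)
S = B %*% t(B)                               # an arbitrary s.p.s.d. matrix
e = eigen(S)                                 # S = Q Lambda Q^T
A = diag(sqrt(e$values)) %*% t(e$vectors)    # A = Lambda^(1/2) Q^T
all.equal(t(A) %*% A, S)                     # TRUE: S is recovered as A^T A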

PROPERTIES (numerically checked in the sketch after this list):
  1. Symmetry
  2. Positive semidefiniteness (\(x^\top A^\top A x = \|Ax\|^2 \geq 0\), and can be zero)
  3. Real and non-negative eigenvalues
  4. The trace is non-negative (the trace is the sum of the eigenvalues)
  5. The determinant is non-negative (the determinant is the product of the eigenvalues)
  6. The diagonal entries are all non-negative (entry \((i,i)\) is the squared norm of the \(i\)-th column of \(A\))
  7. Orthogonal eigenvectors (an orthonormal eigenbasis can be chosen)
  8. Diagonalizable as \(Q\Lambda Q^\top\)
  9. It is possible to obtain a Cholesky decomposition (unique when \(A^\top A\) is positive definite).
  10. Rank of \(A^\top A\) is the same as rank of \(A\).
  11. \(\text{ker}(A^\top A)=\text{ker}(A)\)
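A quick numerical check of several of these properties in base R (a minimal sketch; the matrix and its dimensions are arbitrary):

set.seed(1)
A = matrix(rnorm(12), nrow = 4)     # a random 4 x 3 matrix A
G = t(A) %*% A                      # the Gramian matrix A^T A
ev = eigen(G)$values                # real and non-negative (properties 2, 3)
isSymmetric(G)                      # TRUE (property 1)
all.equal(sum(diag(G)), sum(ev))    # trace = sum of eigenvalues (4)
all.equal(det(G), prod(ev))         # determinant = product of eigenvalues (5)
all(diag(G) >= 0)                   # TRUE: non-negative diagonal (6)
qr(G)$rank == qr(A)$rank            # TRUE (property 10)
chol(G)                             # a Cholesky factor exists (9)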

COVARIANCE MATRIX:

If the entries of a column vector:

\[X=\begin{bmatrix}X_1\\\vdots\\X_n \end{bmatrix}\]

are random variables with finite variance, then the covariance matrix \(\Sigma\) is the matrix whose \((i,j)\) entry is the covariance

\[\Sigma_{ij}=\operatorname{cov}(X_i,X_j) = E\left[ (X_i - \mu_i)(X_j - \mu_j)\right] = E[X_i X_j]-E[X_i]E[X_j]\]

where \(\mu_i = E(X_i)\) is the expected value of the \(i\)-th entry of the vector \(X\).


A more succinct definition is \(\mathbb E\big((\mathbf X - \mu)(\mathbf X - \mu)^\top\big)\) for a random vector \(\mathbf X \in \mathbb R^n\) with mean vector \(\mu.\)

This is in line with the definition given in Wikipedia:

\[\large \Sigma = E\left[\left(X-E[X] \right)\left( X-E[X]\right)^\top \right]\]

However, notice that R's cov() function assumes that the columns of the dataset correspond to the variables (features) and the rows to the observations; if the features are in the rows instead, the data must be transposed first:

set.seed(0)

# COVARIANCE as X transpose X:

cols.R.features = matrix(rnorm(30, 5, 3), nrow= 10)
centered = sweep(cols.R.features, 2, apply(cols.R.features,2, function(x) mean(x)),"-")
# same as scale(cols.R.features, center = T, scale = F)
colnames(centered) = c("radius","temp","height")
centered[1:3,]
##         radius      temp     height
## [1,]  2.712091  3.378224 -0.8810328
## [2,] -2.055472 -1.309584  0.9239578
## [3,]  2.912626 -2.355527  0.1917799
1/(nrow(centered) - 1) * (t(centered) %*% centered)
##            radius       temp     height
## radius 13.0755218 -1.8084569 -0.9389149
## temp   -1.8084569  4.1478005 -0.5245252
## height -0.9389149 -0.5245252  4.3746191
cov(cols.R.features)
##            [,1]       [,2]       [,3]
## [1,] 13.0755218 -1.8084569 -0.9389149
## [2,] -1.8084569  4.1478005 -0.5245252
## [3,] -0.9389149 -0.5245252  4.3746191
# COVARIANCE as X X transposed:

cols.R.examples = t(cols.R.features)
centered2 = sweep(cols.R.examples, 1, apply(cols.R.examples,1, function(x) mean(x)),"-")
rownames(centered2) = c("radius","temp","height")
centered2[,1:5]
##              [,1]       [,2]       [,3]      [,4]       [,5]
## radius  2.7120910 -2.0554720  2.9126259 2.7405161  0.1671524
## temp    3.3782243 -1.3095838 -2.3555271 0.2190592  0.1897986
## height -0.8810328  0.9239578  0.1917799 2.2043394 -0.3795495
1/(ncol(centered2) - 1) * (centered2 %*% t(centered2))
##            radius       temp     height
## radius 13.0755218 -1.8084569 -0.9389149
## temp   -1.8084569  4.1478005 -0.5245252
## height -0.9389149 -0.5245252  4.3746191
cov(t(cols.R.examples))
##            [,1]       [,2]       [,3]
## [1,] 13.0755218 -1.8084569 -0.9389149
## [2,] -1.8084569  4.1478005 -0.5245252
## [3,] -0.9389149 -0.5245252  4.3746191
# cov(cols.R.examples) is not what we want, giving a 10 x 10 matrix.

From this post and this other one:

When the data matrix \(\mathbf X\) is centered (zero-mean columns), the covariance matrix is \(\frac{1}{n-1}\mathbf X^\top\mathbf X\), where \(n\) is the number of observations (rows).

set.seed(0)
mat = matrix(rnorm(25), nrow = 5)
centered = scale(mat,center = T, scale = F)
all.equal(cov(centered), 1/(nrow(mat)-1) * t(centered) %*% centered)
## [1] TRUE

Since the covariance matrix is symmetric, the matrix is diagonalizable, and the eigenvectors can be normalized such that they are orthonormal:

\(\mathbf X\mathbf X^\top=\mathbf W\mathbf D\mathbf W^\top\)

On the other hand, applying SVD to the data matrix \(\mathbf X\) as follows:

\(\mathbf X=\mathbf U\mathbf \Sigma\mathbf V^\top\)

and attempting to construct the covariance matrix from this decomposition gives

\(\begin{align*} \mathbf X\mathbf X^\top&=(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf U\mathbf \Sigma\mathbf V^\top)^\top\\ \mathbf X\mathbf X^\top&=(\mathbf U\mathbf \Sigma\mathbf V^\top)(\mathbf V\mathbf \Sigma\mathbf U^\top) \end{align*}\)

and since \(\mathbf V\) is an orthogonal matrix (\(\mathbf V^\top \mathbf V=\mathbf I\)),

\(\mathbf X\mathbf X^\top=\mathbf U\mathbf \Sigma^2 \mathbf U^\top\)

This can be seen continuing with the code example above:

SVD = svd(cols.R.examples)
SVD$u %*% (diag(SVD$d)^2 %*% t(SVD$u))
##          [,1]     [,2]     [,3]
## [1,] 486.9513 221.4810 308.0420
## [2,] 221.4810 190.4112 199.0542
## [3,] 308.0420 199.0542 310.6281
cols.R.examples %*% t(cols.R.examples)
##          [,1]     [,2]     [,3]
## [1,] 486.9513 221.4810 308.0420
## [2,] 221.4810 190.4112 199.0542
## [3,] 308.0420 199.0542 310.6281

and the correspondence is easily seen (the square roots of the eigenvalues of \(\mathbf X\mathbf X^\top\) are the singular values of \(\mathbf X\), etc.)
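Continuing the example, a quick check of that correspondence:

all.equal(SVD$d, sqrt(eigen(cols.R.examples %*% t(cols.R.examples))$values))
## [1] TRUE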


Just like the univariate variance is the average squared distance to the mean, \(trace(\hat{\bf{\Sigma}})\) is the average squared distance to the centroid: With \(\dot{\bf{X}}\) as the matrix of the centered variables, \(\hat{\bf{\Sigma}} = \frac{1}{n} \dot{\bf{X}}' \dot{\bf{X}}\) where \(\dot{\bf{X}}' \dot{\bf{X}}\) is the matrix of dot products of the columns of \(\dot{\bf{X}}\). Its diagonal elements are \(\dot{\bf{X}}_{\cdot i}' \dot{\bf{X}}_{\cdot i} = (\bf{X}_{\cdot i} - \overline{X}_{\cdot i})' (\bf{X}_{\cdot i} - \overline{X}_{\cdot i})\), i.e., the squared distance of variable \(i\) to its mean. As such, \(trace(\hat{\bf{\Sigma}})\) is a natural generalization of the univariate variance.
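A quick numerical check of this interpretation (a minimal sketch, using the \(1/n\) convention of the paragraph above; the data are random):

set.seed(0)
X = matrix(rnorm(40), nrow = 10)
X.c = scale(X, center = T, scale = F)        # the centered variables
Sigma.hat = t(X.c) %*% X.c / nrow(X)         # 1/n convention, as above
all.equal(sum(diag(Sigma.hat)), mean(rowSums(X.c^2)))
## [1] TRUE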

A second generalization is \(det(\hat{\bf{\Sigma}})\): this is a measure of the volume of the ellipsoid that characterizes the distribution. More precisely, \(|det(\hat{\bf{\Sigma}})|\) is the factor by which the volume of the unit cube changes after applying the linear transformation \(\hat{\bf{\Sigma}}\). Here is an illustration for the matrix \(\left(\begin{smallmatrix}1 & -.5\\ .5 & .5\end{smallmatrix}\right)\) with determinant \(0.75\) (left: before, right: after transformation):

[Figure: the unit square (left) and its image under the transformation (right).]
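The determinant of that matrix, checked in R:

M = matrix(c(1, .5, -.5, .5), nrow = 2)      # columns (1, .5) and (-.5, .5)
det(M)
## [1] 0.75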


The sample covariance matrix is:

\[\mathbf{S}=\dfrac{1}{n-1}\sum_{j=1}^{n}(\mathbf{X}_{j}-\bar{\mathbf{X}})(\mathbf{X}_{j}-\bar{\mathbf{X}})^{T}\]
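A sketch of this sum-of-outer-products form in base R, checked against cov() (the data are random and the names arbitrary):

set.seed(0)
X = matrix(rnorm(20), nrow = 5)              # n = 5 observations in the rows
x.bar = colMeans(X)
S = Reduce(`+`, lapply(1:nrow(X),
           function(j) tcrossprod(X[j, ] - x.bar))) / (nrow(X) - 1)
all.equal(S, cov(X))
## [1] TRUE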


Correlation Matrix:


The covariance matrix of the standardized random variables \(\displaystyle X_{i}/\sigma (X_{i})\) for \(\displaystyle i=1,\dots ,n\).
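A quick check of this definition (a minimal sketch; centering is unnecessary here because covariance is shift-invariant):

set.seed(0)
X = matrix(rnorm(30), nrow = 10)
Z = sweep(X, 2, apply(X, 2, sd), "/")        # X_i / sd(X_i), column-wise
all.equal(cov(Z), cor(X))
## [1] TRUE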


Sample Covariance and Correlation:


In the \(G^TG\) operation, if \(G^T\) is a \(6 \times 8\) matrix and \(G\) an \(8 \times 6\) one, the matrix multiplication will yield a \(6 \times 6\) matrix.

To make this concrete, let's pretend that we have a matrix corresponding to the returns of different stocks (in the columns) versus 5 consecutive years (in the rows) - completely fictitious stocks and years. Let's call the matrix \(A\):

\[A = \begin{bmatrix} & \color{red}{\text{yah(y)}} & \color{blue}{\text{goog(g)}} & \color{green}{\text{ms(m)}} \\ \text{Yr.1} & 1 &8 & 1\\ \text{Yr.2} & -4 &9 & 3 \\ \text{Yr.3} & 5 & 10 & 4 \\ \text{Yr.4} & 7 & 3 & 5\\ \text{Yr.5} & 8 & 7& 6 \end{bmatrix}\]

We want to calculate the correlations between the different vectors of returns, one for each company, “packaged” in the matrix \(A\).

The variance-covariance matrix of the returns will be:

\(\Large \sigma(A) = \frac{G^TG}{n-1}\) with \(G\) being the matrix of mean-centered observations and \(n\) the number of observations.

The mean-centered (or demeaned) matrix \(G\) is:

\[\begin{bmatrix} & \color{red}{\text{y}} & \color{blue}{\text{g}} & \color{green}{\text{m}} \\ \text{Yr.1} & -2.4 &0.6 & -2.8\\ \text{Yr.2} & -7.4 &1.6 & -0.8 \\ \text{Yr.3} & 1.6 & 2.6 & 0.2 \\ \text{Yr.4} & 3.6 & -4.4 & 1.2\\ \text{Yr.5} & 4.6 & -0.4& 2.2 \end{bmatrix} \]

And the variance-covariance matrix:

\[\begin{bmatrix} & \color{red}{y} & \color{blue}{g} & \color{green}{m} \\ \color{red}{y} & 24.30 &-6.70 & 6.85\\ \color{blue}{g} & -6.70 & 7.30 & -2.15 \\ \color{green}{m} & 6.85 & -2.15 & 3.70 \\ \end{bmatrix} \]

So it went from the \(5 \times 3\) \(A\) matrix to a \(3 \times 3\) matrix.

The operations involved in calculating the correlation matrix are similar, but each data point is standardized: after centering (subtracting the column mean, as in the covariance matrix), each entry is divided by the standard deviation of the returns of its company (column vector):

\[\small cor(A)=\tiny\frac{1}{n-1}\small\begin{bmatrix} \frac{\color{red}{y_1 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{red}{y_2 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{red}{y_3 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{red}{y_4 - \bar{y}}}{\color{red}{sd(y)}} &\frac{\color{red}{y_5 - \bar{y}}}{\color{red}{sd(y)}} \\ \frac{\color{blue}{g_1 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{blue}{g_2 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{blue}{g_3 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{blue}{g_4 - \bar{g}}}{\color{blue}{sd(g)}}& \frac{\color{blue}{g_5 - \bar{g}}}{\color{blue}{sd(g)}}\\ \frac{\color{green}{m_1 - \bar{m}}}{\color{green}{sd(m)}}& \frac{\color{green}{m_2 - \bar{m}}}{\color{green}{sd(m)}} &\frac{\color{green}{m_3 - \bar{m}}}{\color{green}{sd(m)}} & \frac{\color{green}{m_4 - \bar{m}}}{\color{green}{sd(m)}} & \frac{\color{green}{m_5 - \bar{m}}}{\color{green}{sd(m)}}\\ &&\color{purple} {3\times 5 \,\text{matrix}} \end{bmatrix} \begin{bmatrix} \frac{\color{red}{y_1 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{blue}{g_1 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{green}{m_1 - \bar{m}}}{\color{green}{sd(m)}} \\ \frac{\color{red}{y_2 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{blue}{g_2 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{green}{m_2 - \bar{m}}}{\color{green}{sd(m)}} \\ \frac{\color{red}{y_3 - \bar{y}}}{\color{red}{sd(y)}} &\frac{\color{blue}{g_3 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{green}{m_3 - \bar{m}}}{\color{green}{sd(m)}} \\ \frac{\color{red}{y_4 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{blue}{g_4 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{green}{m_4 - \bar{m}}}{\color{green}{sd(m)}} \\ \frac{\color{red}{y_5 - \bar{y}}}{\color{red}{sd(y)}} & \frac{\color{blue}{g_5 - \bar{g}}}{\color{blue}{sd(g)}} & \frac{\color{green}{m_5 - \bar{m}}}{\color{green}{sd(m)}} \\ &\color{purple} {5\times 3 \,\text{matrix}} \end{bmatrix}\]

One more quick thing for completeness' sake: we have so far a clunky matrix as the result, but in general we want to estimate the portfolio variance: \(1\) portfolio, \(1\) variance. To do that we multiply the variance-covariance matrix of \(A\) on the left and on the right by the vector \(W\) containing the proportions (weightings) in each stock. Since we want to end up with a single scalar, it is unsurprising that the algebra is \(W^T\,\sigma(A)\,W\), with the vector of weights (fractions) being in this case \(\color{blue}{3}\times \color{blue}{1}\) so that it matches on the left as \(W^T\) and on the right as \(W\).

Code in R:

Fictitious data set of returns (units are arbitrary) - the matrix \(A\):

yah = c(1, -4, 5, 7, 8)
goog = c(8, 9, 10, 3, 7)
ms = c(1, 3, 4, 5, 6)
returns <- cbind(yah, goog, ms)
row.names(returns) = c("Yr.1", "Yr.2", "Yr.3", "Yr.4", "Yr.5")

Centered matrix (G) of demeaned returns:

demeaned_returns <- scale(returns, scale = F, center = T)

Manual and R function calculation of the variance-covariance matrix:

(var_cov_A = (t(demeaned_returns)%*%demeaned_returns)/(nrow(returns)-1))
cov(returns)   # the R in-built function cov() returns the same results.

Correlation matrix calculation:

We need to divide by the standard deviation column-wise:

demeaned_scaled_returns <- scale(returns, scale = T, center = T)

and then proceed as above:

(corr_A = (t(demeaned_scaled_returns) %*% demeaned_scaled_returns)/(nrow(returns)-1))
cor(returns) # Again, the R function returns the same matrix.
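Finally, a sketch of the portfolio variance \(W^T\,\sigma(A)\,W\) discussed above, assuming equal weightings (the equal weights are an assumption for illustration, not part of the data):

W = rep(1/3, 3)                        # assumed equal weight in each stock
drop(t(W) %*% var_cov_A %*% W)         # portfolio variance: a single scalar
## [1] 3.477778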


NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.