NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Multivariate Gaussian:


The univariate Gaussian \(X \sim N(\mu, \sigma^2)\) has density:

\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right), \quad \forall x \in \mathbb R.\]

The degenerate Gaussian has variance equal to \(0\) and hence, \(X(\omega)=\mu, \forall \omega \in \Omega.\)

The multivariate Gaussian is defined for a random vector \(X \in \mathbb R^n\): \(X\) is multivariate Gaussian if every linear combination of its components

\(a^T X = \sum_{i=1}^n a_i X_i\), for all \(a \in \mathbb R^n\),

is a univariate Gaussian.

We will express it as \(X\sim N(\mu,\Sigma)\), where \(\mu \in \mathbb R^n\) is the mean vector, with \(\mathbb E(X_i)=\mu_i\); and \(\Sigma\) is the \(n \times n\) positive semidefinite covariance matrix, with \(\Sigma_{ij} = \text{Cov}(X_i, X_j).\)
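For example, in two dimensions, writing \(\rho\) for the correlation between \(X_1\) and \(X_2\):

\[\mu = \begin{bmatrix}\mu_1\\\mu_2\end{bmatrix}, \qquad \Sigma = \begin{bmatrix}\sigma_1^2&\rho\,\sigma_1\sigma_2\\\rho\,\sigma_1\sigma_2&\sigma_2^2\end{bmatrix}.\]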

A multivariate Gaussian is degenerate if \(\det(\Sigma)=0.\)

If the components are independent, the covariance matrix is diagonal; in two dimensions:

\[\Sigma = \begin{bmatrix}\sigma_1^2&0\\0&\sigma_2^2\end{bmatrix}\]

If the variance is \(1\) for both components, the density has circular contours; if the variances are different, say \(1\) and \(4\), the contours are elliptical.

This is the Octave/Matlab code to plot the second case:

pkg load statistics                  % only needed in Octave, not in Matlab
mu = [0 0];                          % mean vector
sigma = [1 0; 0 4];                  % covariance matrix (variances 1 and 4)
x = -5:.2:5;                         % x axis
y = -5:.2:5;                         % y axis

[X, Y] = meshgrid(x, y);             % all combinations of x, y
Z = mvnpdf([X(:) Y(:)], mu, sigma);  % evaluate the Gaussian pdf
Z = reshape(Z, size(X));             % reshape to the size of X, Y
colormap(jet)
surf(X, Y, Z)                        % 3D surface plot

The density of an \(M\)-dimensional multivariate Gaussian is:

\[f({\bf X \vert \mu, \Sigma}) = \frac{1}{(2\pi)^{M/2}\,\text{det}(\Sigma)^{1/2} }\text{exp}\left[-\frac{1}{2}({\bf X-\mu})^T \,{\bf \Sigma^{-1}\,(X-\mu)}\right]\]


The important part of the formula is the exponent, \(({\bf X-\mu})^T \,{\bf \Sigma^{-1}\,(X-\mu)}\), which is a positive definite quadratic form (assuming \(\Sigma\) is invertible, i.e. the non-degenerate case). The factor in front is just a normalizing constant.
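As a quick sanity check, here is a minimal Octave/Matlab sketch that evaluates this formula directly at one arbitrary point, reusing mu and sigma from the plot above, and compares the result with mvnpdf:

x0 = [1 2];                          % arbitrary evaluation point
M = numel(mu);                       % dimension
d = (x0 - mu)';                      % column vector x - mu
quad = d' * (sigma \ d);             % the quadratic form in the exponent
f = exp(-quad/2) / ((2*pi)^(M/2) * sqrt(det(sigma)));
f - mvnpdf(x0, mu, sigma)            % should be (numerically) zero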

A quadratic form in linear algebra is an expression of the form \(x^TAx\); when \(A\) is positive definite, the level sets \(x^TAx = c\) are ellipsoids in higher dimensions. So \(\Sigma\) can be visualized as an “error ellipsoid” around the mean.


Ellipsoids are of the form \(x^2/a^2+y^2/b^2+z^2/c^2 = 1\).


This is Dr. Strang’s example of a \(3 \times 3\) positive definite matrix. What he calls “the good matrix”:

\[A= \begin{bmatrix}\,\,\,2&-1&0\\-1&\,\,2&-1\\\,\,\,0&-1&2\end{bmatrix}\]

Proving that it is positive definite through the \(x^TAx\) rule…

\[x^TAx=2x_1^2 + 2x_2^2 + 2x_3^2 - 2x_1x_2 - 2x_2x_3 > 0 \quad \text{for all } x \neq 0;\]

we can complete the square to prove that the inequality is true.
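For instance, grouping the terms:

\[2x_1^2 + 2x_2^2 + 2x_3^2 - 2x_1x_2 - 2x_2x_3 = x_1^2 + (x_1-x_2)^2 + (x_2-x_3)^2 + x_3^2,\]

which is a sum of squares, zero only when \(x_1 = x_2 = x_3 = 0\).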

The graph of this quadratic function lives in four dimensions (three inputs plus the value of the function). But if we cut through it at height one, we get an ellipsoid (a lopsided football) whose axes point along the eigenvectors, with lengths determined by the eigenvalues, in the factorization \(Q\Lambda Q^T\), where \(Q\) is the matrix of eigenvectors and \(\Lambda\) is the diagonal matrix of eigenvalues, \(\lambda_i \geq 0\). Hence,

\[Q\Lambda Q^T= Q\Lambda^{1/2}\Lambda^{1/2} Q^T = (Q\Lambda^{1/2})(Q\Lambda^{1/2})^T=AA^T\tag 1\]

with \(A = Q\Lambda^{1/2}\). This factorization applies to any symmetric positive semidefinite matrix, in particular to a covariance matrix \(\Sigma\).
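A quick numerical check of this factorization in Octave/Matlab, using Strang's matrix above (B stands for the factor \(Q\Lambda^{1/2}\)):

A = [2 -1 0; -1 2 -1; 0 -1 2];   % the "good matrix"
[Q, L] = eig(A);                 % A = Q*L*Q', eigenvalues on the diagonal of L
B = Q * sqrt(L);                 % B = Q*L^(1/2), the factor in equation (1)
norm(Q*L*Q' - A)                 % ~0: the eigendecomposition reproduces A
norm(B*B' - A)                   % ~0: A = B*B', as in equation (1)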


Affine Property of the Gaussian:

Any affine transformation \(f(X) = AX + b\) of a Gaussian is a Gaussian: if \(X \sim N(\mu, \Sigma)\), then \(\color{red}{AX + b \sim N(A\mu + b, A\Sigma A^T)}\).

So if \(X_1, X_2, \dots, X_n \sim N(0,1)\) iid, placing them in a vector we get \(X \sim N(0, I)\), and \(\color{blue}{AX + \mu \sim N(\mu ,\Sigma)}\), where \(\Sigma = AA^T\). This is a way of generating multivariate Gaussians.
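A minimal Octave/Matlab sketch of this generation recipe (the mean and covariance below are made-up values for illustration):

mu = [1; 2];                 % example mean vector (arbitrary values)
Sigma = [2 0.8; 0.8 1];      % example covariance matrix (arbitrary values)
[Q, L] = eig(Sigma);         % Sigma = Q*L*Q'
A = Q * sqrt(L);             % A*A' = Sigma, as in equation (1)
X = randn(2, 1000);          % columns are iid N(0, I) vectors
Y = A*X + mu;                % columns are iid N(mu, Sigma) by the affine property
cov(Y')                      % sample covariance, should be close to Sigma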

Sphering:

“Sphering” turns a Gaussian into a “sphere” (a standard multivariate normal) through an affine transformation. So it converts a Gaussian back to the standard multivariate normal.

\(Y \sim N(\mu, \Sigma)\implies A^{-1}(Y-\mu) \sim N(0,\bf I)\), where \(\Sigma = AA^T\).
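Continuing the sketch above, sphering the generated samples recovers (approximately) the identity covariance:

Z = A \ (Y - mu);            % A^(-1)*(Y - mu), column by column
cov(Z')                      % sample covariance, should be close to the identity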


Using equation (1) we can express the \(A\) in \(\color{blue}{Y = AX + \mu \sim N(\mu,\Sigma)}\) (blue equation) as \(A = Q\Lambda^{1/2}\), giving \(Y = AX + \mu= Q\Lambda^{1/2}X + \mu\). Applying the red equation to \(\Lambda^{1/2}X\) gives \(\Lambda^{1/2}X \sim N(\Lambda^{1/2}\times 0,\ \Lambda^{1/2} I \Lambda^{1/2}) = N(0, \Lambda)\); geometrically, \(\Lambda\) is the degree of stretching of the distribution (the variances along the principal axes). Multiplying by the orthogonal matrix \(Q\) gives \(Q\Lambda^{1/2}X \sim N(0, Q\Lambda Q^T) = N(0, \Sigma)\); an orthogonal matrix gives a rotation or reflection. Finally, adding \(\mu\) shifts the mean: \(Q\Lambda^{1/2}X + \mu \sim N(\mu, \Sigma)\).

So by applying an affine transformation to \(X \sim N(0,I)\) we end up stretching and rotating it, and \(\mu\) centers (shifts) the resulting multivariate Gaussian.


MARGINAL DISTRIBUTION:

If we have a multivariate Gaussian \(X = [X_1,X_2] \in \mathbb R^2\) with two dimensions, the coordinates are also Gaussian (\(X_1\) and \(X_2\) are Gaussian).

Proof:

In general, we can decompose an \(n\)-dimensional multivariate Gaussian vector \(X \sim N(\mu, \Sigma)\) by taking the first \(k\) components, indexed by \(a = 1, \dots, k\). We'll show that these first \(k\) components are Gaussian. The rest of the components are indexed by \(b = k+1, \dots, n\). Now \(X\) can be expressed as a block vector:

\(X= \begin{bmatrix}X_a\\X_b\end{bmatrix}\) with \(X_a= \begin{bmatrix}X_1\\\vdots\\X_k\end{bmatrix}\) and \(X_b= \begin{bmatrix}X_{k+1}\\\vdots\\X_n\end{bmatrix}\).

We can decompose \(\mu = \begin{bmatrix}\mu_a\\\mu_b\end{bmatrix}\)

and \(\Sigma\) into the block matrix \(\Sigma = \begin{bmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{bmatrix}\), where each block is itself a matrix; for instance, \(\Sigma_{aa}=\small\begin{bmatrix}\sigma^2(X_1)&\dots&\text{cov}(X_1,X_k)\\\vdots&\ddots&\vdots\\\text{cov}(X_k,X_1)&\dots&\sigma^2(X_k)\end{bmatrix}\)

The marginalization property states that \(X_a \sim N({\bf \mu_a}, \Sigma_{aa})\) is multivariate normal. We can prove it using the affine property with the projection matrix \(A\) (in the two-dimensional case it is simply \([1\ 0]\) or \([0\ 1]\)):


\[A = \begin{bmatrix} {\color{red}{1}}&0&0&0&0&\dots&0&&0&0&0&\dots&0\\ 0&{\color{red}{1}}&0&0&0&\dots&0&&0&0&0&\dots&0\\ 0&0&{\color{red}{1}}&0&0&\dots&0&&0&0&0&\dots&0\\ 0&0&0&{\color{red}{1}}&0&\dots&0&&0&0&0&\dots&0\\ 0&0&0&0&{\color{red}{1}}&\dots&0&&0&0&0&\dots&0\\\vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&&\vdots&\vdots&\vdots&\ddots&\vdots\\0&0&0&0&0&\dots&{\color{red}{1}}&&0&0&0&\dots&0\\ \end{bmatrix}\]

which is \((k \times n)\).

And by construction,

\(AX=X_a\), which by the affine property (red equation), given that \(A\mu = \mu_a\) (the projection of the means) and \(A\Sigma A^T = \Sigma_{aa}\), gives

\(AX=X_a \sim N(\mu_a, \Sigma_{aa})\).

The same can be done with \(X_b\).
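A small numerical illustration of the marginalization property in Octave/Matlab (the mean and covariance below are made-up example values):

n = 4; k = 2;
Sigma = [4 1 0.5 0.2; 1 3 0.3 0.1; 0.5 0.3 2 0.4; 0.2 0.1 0.4 1];  % example covariance
mu = [1; 2; 3; 4];                    % example mean
A = [eye(k), zeros(k, n-k)];          % projection onto the first k components
A * mu                                % equals mu_a, the first k means
A * Sigma * A'                        % equals Sigma_aa, the top-left k-by-k block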


CONDITIONAL DISTRIBUTION:

If \(X= (X_1, X_2)^T \in \mathbb R^2\) is Gaussian, then \((X_1 \vert X_2 = x_2)\) is Gaussian.

Using the same block matrices as above:

\((X_a \vert X_b =x_b) \sim N(m, D)\) where \(m= \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b)\) and \(D = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\).
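A minimal Octave/Matlab sketch of these formulas, reusing the 4-dimensional example above with \(a\) the first two components and \(b\) the last two (the observed value xb is arbitrary):

a = 1:k;  b = k+1:n;                       % index sets for the blocks
Saa = Sigma(a,a);  Sab = Sigma(a,b);
Sba = Sigma(b,a);  Sbb = Sigma(b,b);
xb = [2.5; 4.5];                           % an arbitrary observed value of X_b
m  = mu(a) + Sab * (Sbb \ (xb - mu(b)));   % conditional mean
D  = Saa - Sab * (Sbb \ Sba);              % conditional covariance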

For a full derivation see here and here.



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.