NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


PCA Derivation:


Extracted from this series by Imperial College London.

We want lower dimensional orthogonal projections of data that retain as much information as possible.

We have a dataset \(\mathbf X=\{\vec x_1,\dots,\vec x_N\}\) with \(\vec x_i\in\mathbb R^D.\)

Each \[\vec x_n =\sum_{i=1}^D \beta_{in} \vec e_i\]

with \(\vec e_1,\dots,\vec e_D\) being an orthonormal basis of \(\mathbb R^D.\) This implies that

\[\beta_{in}= \vec x_n^\top\vec e_i,\] the projection of \(\vec x_n\) onto the \(\vec e_i\) vector.

If we have a subspace with an ONB \(\mathbf B=\begin{bmatrix}\vec e_1,\dots, \vec e_M\end{bmatrix}\) (principal subspace),

\[\tilde{\vec x}=BB^\top\vec x\]

is the orthogonal projection of \(\vec x\) onto the subspace: the general projection matrix \(B(B^\top B)^{-1}B^\top\) reduces to \(BB^\top\) since \(B^\top B=I\) for an ONB.
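A quick numerical check of this (an illustrative numpy sketch; the array names are placeholders, not from the notes): build an orthonormal \(B\) via QR, confirm \(B^\top B=I,\) and confirm that \(BB^\top\) acts as a projector.

```python
# Illustrative sketch: an orthonormal basis B of an M-dimensional subspace of R^D,
# with B'B = I and B B' projecting a vector onto span(B).
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 2
B, _ = np.linalg.qr(rng.standard_normal((D, M)))  # D x M, orthonormal columns

assert np.allclose(B.T @ B, np.eye(M))            # ONB condition B'B = I

x = rng.standard_normal(D)
x_tilde = B @ B.T @ x                             # projection onto the subspace
assert np.allclose(B @ B.T @ x_tilde, x_tilde)    # projecting twice changes nothing
```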

Every

\[\vec x_n=\color{red}{\sum_{j=1}^M \beta_{jn} \vec e_j} + \sum_{i=M+1}^D \beta_{in} \vec e_i \in \mathbb R^D;\]

the second sum lives in the orthogonal complement of the part we want, so the approximation keeps only the first term:

\[\tilde{\vec x}_n=\color{red}{\sum_{j=1}^M \beta_{jn} \vec e_j}\tag 1\]

The objective function to minimize is the average squared reconstruction error:

\[J=\frac 1 N\sum_{n=1}^N\Vert \vec x_n - \tilde{\vec x}_n\Vert^2\tag 2\]

Differentiating with respect to the parameters \(\beta_{in}\) in the red expression above and equating to zero will minimize the loss. We will need the chain rule, first differentiating with respect to \(\tilde{\vec x}_n:\)

\[\frac{\partial J}{\partial \tilde{\vec x}_n}=-\frac 2 N( \vec x_n - \tilde{\vec x}_n)^\top\tag 3\]

Now

\[\frac{\partial J}{\partial \beta_{in}}=\frac{\partial J}{\partial \tilde{\vec x}_n} \frac{\partial \tilde{\vec x}_n}{\partial \beta_{in}}\] where \[\frac{\partial \tilde{\vec x}_n}{\partial \beta_{in}}=\vec e_i\] because only one term of the sum in \((1)\) depends on \(\beta_{in}.\)

Therefore

\[\frac{\partial J}{\partial \beta_{in}}=-\frac 2 N( \vec x_n - \tilde{\vec x}_n)^\top \vec e_i\]

Substituting \((1)\) and using the orthonormality of the basis (cross terms vanish since \(\vec e_j^\top \vec e_i=0\) for \(j\neq i,\) and \(\vec e_i^\top \vec e_i=1\)):

\[\begin{align} \frac{\partial J}{\partial \beta_{in}} &=-\frac 2 N \left( \vec x_n - \sum_{j=1}^M \beta_{jn} \vec e_j \right)^\top \vec e_i\\ &=-\frac 2 N \left( x_n^\top \vec e_i -\beta_{in} \vec e_i^\top \vec e_i \right)\\ &= -\frac 2 N \left( x_n^\top \vec e_i -\beta_{in} \right) \end{align}\]
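As a sanity check on the form (and sign) of this derivative, a finite-difference comparison for a single data point (an illustrative numpy sketch; the helper `J` and the array names are hypothetical):

```python
# Finite-difference check of dJ/d(beta_in) = (2/N)(beta_in - x_n' e_i)
# for a single data point (N = 1) and a random orthonormal basis E.
import numpy as np

rng = np.random.default_rng(1)
D, M, N = 4, 2, 1
E, _ = np.linalg.qr(rng.standard_normal((D, D)))  # full ONB e_1..e_D as columns
x = rng.standard_normal(D)
beta = rng.standard_normal(M)                     # coordinates in the principal subspace

def J(beta):
    x_tilde = E[:, :M] @ beta                     # x_tilde = sum_j beta_j e_j
    return np.sum((x - x_tilde) ** 2) / N

i, h = 0, 1e-6
d = np.zeros(M); d[i] = h
num_grad = (J(beta + d) - J(beta - d)) / (2 * h)  # numerical derivative
ana_grad = (2 / N) * (beta[i] - x @ E[:, i])      # analytical derivative from above
assert np.isclose(num_grad, ana_grad)
```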

Setting this derivative to zero gives the optimal parameters \(\beta\):

\[\beta_{in}= x_n^\top \vec e_i \tag 4\] Hence, the optimal coordinates of \(\tilde{\vec x}_n\) with respect to the basis are the orthogonal projections of the original data point \(\vec x_n\) onto the \(i\)-th vector spanning the principal subspace.

Combining \((1)\) and \((4)\),

\[\begin{align} \tilde{\vec x}_n &=\color{red}{\sum_{j=1}^M \beta_{jn} \vec e_j}\\ &= \sum_{j=1}^M \left( x_n^\top \vec e_j \right) \vec e_j\\ &= \sum_{j=1}^M \vec e_j \vec e_j^\top x_n\\ &= \left( \sum_{j=1}^M \underset{\text{proj.mat.}}{\vec e_j \vec e_j^\top} \right) x_n \end{align}\]

so \(\tilde{\vec x}_n\) is the orthogonal projection of \(\vec x_n\) onto the subspace spanned by the \(M\) basis vectors \(\vec e_j\) with \(j=1,\dots, M\)
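The rearrangement above (coordinates as dot products versus a single projection matrix built from outer products) can be checked numerically; a small illustrative sketch, assuming a random ONB built with QR:

```python
# Check that sum_j (e_j' x) e_j and (sum_j e_j e_j') x give the same x_tilde.
import numpy as np

rng = np.random.default_rng(2)
D, M = 6, 3
E, _ = np.linalg.qr(rng.standard_normal((D, D)))            # ONB of R^D, columns e_1..e_D
x = rng.standard_normal(D)

coords = np.array([E[:, j] @ x for j in range(M)])           # beta_jn = e_j' x_n
x_tilde_coords = sum(coords[j] * E[:, j] for j in range(M))  # sum_j beta_jn e_j
P = sum(np.outer(E[:, j], E[:, j]) for j in range(M))        # sum_j e_j e_j' (proj. mat.)
x_tilde_proj = P @ x

assert np.allclose(x_tilde_coords, x_tilde_proj)
```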

We can similarly write \(\vec x_n\) including the orthogonal complement:

\[\vec x_n= \left( \sum_{j=1}^M \vec e_j \vec e_j^\top \right) x_n + \left( \sum_{j=M+1}^D \vec e_j \vec e_j^\top \right) x_n\]

and compare to the prior equation to see that

\[\vec x_n - \tilde{\vec x}_n=\left( \sum_{j=M+1}^D \vec e_j \vec e_j^\top \right) x_n\]

so the difference is in the orthogonal subspace that was ignored (orthogonal complement to the principal subspace).
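A quick check that the residual is indeed orthogonal to the principal subspace and equals the projection onto the ignored directions (illustrative numpy sketch):

```python
# The residual x - x_tilde is orthogonal to e_1..e_M and equals P_perp x.
import numpy as np

rng = np.random.default_rng(3)
D, M = 6, 3
E, _ = np.linalg.qr(rng.standard_normal((D, D)))
x = rng.standard_normal(D)

P = E[:, :M] @ E[:, :M].T          # projector onto the principal subspace
P_perp = E[:, M:] @ E[:, M:].T     # projector onto its orthogonal complement

residual = x - P @ x
assert np.allclose(E[:, :M].T @ residual, 0)   # orthogonal to e_1..e_M
assert np.allclose(residual, P_perp @ x)       # equals the complement projection
```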

The last equation can be re-written as

\[\vec x_n - \tilde{\vec x}_n= \sum_{j=M+1}^D \left( \vec e_j^\top x_n \right)\vec e_j \tag 5 \]

Going back to \((2),\) and using the fact that the \(\vec e_j\) form an ONB:

\[\begin{align} J &=\frac 1 N\sum_{n=1}^N \left \Vert \sum_{j=M+1}^D \left( \vec e_j^\top x_n \right)\vec e_j \right \Vert^2\\ &=\frac 1 N\sum_{n=1}^N \sum_{j=M+1}^D \left( \vec e_j^\top x_n \right)^2\\ &=\frac 1 N\sum_{n=1}^N \sum_{j=M+1}^D \vec e_j^\top x_n x_n^\top \vec e_j\\ &=\sum_{j=M+1}^D \vec e_j^\top \left( \frac 1 N\sum_{n=1}^N x_n x_n^\top \right) \vec e_j \\ \end{align}\]

And, critically, \(\frac 1 N\sum_{n=1}^N x_n x_n^\top\) in the above expression is the data covariance matrix \(\Sigma,\) since the data are centered!
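The centering assumption is worth stressing; a short illustrative check that \(\frac 1 N\sum_n x_n x_n^\top\) matches the (biased, \(1/N\)) covariance once the mean has been removed:

```python
# (1/N) sum_n x_n x_n' equals the data covariance only for centered data.
import numpy as np

rng = np.random.default_rng(4)
N, D = 500, 3
X = rng.standard_normal((N, D)) + 5.0      # raw data, rows are x_n, nonzero mean
Xc = X - X.mean(axis=0)                    # center the data first

Sigma = Xc.T @ Xc / N                      # (1/N) sum_n x_n x_n' on centered data
assert np.allclose(Sigma, np.cov(Xc, rowvar=False, bias=True))
```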

We can re-write the loss function as

\[\begin{align} J &= \sum_{j=M+1}^D \vec e_j^\top \Sigma \vec e_j \tag 6\\ &= \text{Trace} \left( \color{blue}{\left( \sum_{j=M+1}^D \vec e_j \vec e_j^\top \right)}\Sigma\right) \end{align}\]

with the blue expression projecting \(\Sigma\) onto the orthogonal complement of the principal subspace: the loss function is the variance of the data projected onto the subspace that we ignore. Minimizing the loss is therefore minimizing the variance of the data orthogonal to the principal subspace.
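This identity already holds for any ONB, optimal or not: the average squared reconstruction error equals the trace of the covariance projected onto the ignored directions. An illustrative numerical confirmation (`P_perp` is just a name for the blue projector):

```python
# Average squared reconstruction error == Trace(P_perp Sigma), for any ONB.
import numpy as np

rng = np.random.default_rng(5)
N, D, M = 400, 5, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))  # correlated data
X -= X.mean(axis=0)                                            # centered

Sigma = X.T @ X / N
E, _ = np.linalg.qr(rng.standard_normal((D, D)))   # an arbitrary ONB of R^D
P = E[:, :M] @ E[:, :M].T                          # principal-subspace projector

J_direct = np.mean(np.sum((X - X @ P) ** 2, axis=1))  # (1/N) sum_n ||x_n - x_tilde_n||^2
P_perp = np.eye(D) - P                                # projector onto ignored directions
J_trace = np.trace(P_perp @ Sigma)
assert np.isclose(J_direct, J_trace)
```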

We need to find the ONB for the subspace that we want to ignore.

In \(\mathbb R^2,\) we have \(\vec e_1, \vec e_2\) with \(\vec e_1\) spanning the principal subspace, and \(\vec e_2\) spanning the subspace to ignore. They are orthonormal, and hence, \(\vec e_i^\top \vec e_j=\delta_{ij}.\) The loss function is

\[J= \vec e_2^\top \Sigma \vec e_2, \; \vec e_2^\top \vec e_2=1\]

The Lagrangian is

\[L =\vec e_2^\top \Sigma \vec e_2 + \lambda \left(1 - \vec e_2^\top \vec e_2 \right)\] Calculating the partial derivatives:

\[\frac{\partial L}{\partial \lambda}=1 - \vec e_2^\top \vec e_2=0\implies \vec e_2^\top \vec e_2=1\]

which is the ONB constraint.

\[\frac{\partial L}{\partial \vec e_2} = 2 \vec e_2^\top \Sigma - 2 \lambda \vec e_2^\top =0 \implies \Sigma \vec e_2 = \lambda \vec e_2\]

this is an eigenvalue problem, and going back to the loss function:

\[J = \vec e_2^\top \Sigma \vec e_2 =\vec e_2^\top \vec e_2 \lambda =\lambda\]

the last step again using the ONB constraint. So the error is minimized if \(\lambda\) is the smallest eigenvalue of the data covariance matrix, corresponding to the eigenvector \(\vec e_2,\) which spans the subspace to be ignored. On the other hand, \(\vec e_1\) will correspond to the largest eigenvalue of the data covariance matrix. The e-vectors of the covariance matrix are orthogonal because of the symmetry of the covariance matrix.
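A small numerical illustration of the \(\mathbb R^2\) case, with a made-up \(2\times 2\) covariance: the unit vector minimizing \(\vec e^\top\Sigma\vec e\) is the eigenvector with the smallest eigenvalue, and the minimum value is that eigenvalue.

```python
# min over unit vectors of e' Sigma e is attained at the smallest-eigenvalue eigenvector.
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])              # a symmetric 2x2 "covariance"
evals, evecs = np.linalg.eigh(Sigma)        # eigh: ascending eigenvalues, orthonormal e-vectors

e2 = evecs[:, 0]                            # eigenvector of the smallest eigenvalue
assert np.isclose(e2 @ Sigma @ e2, evals[0])

# brute-force check over unit vectors in R^2
thetas = np.linspace(0, np.pi, 10_000)
vals = [np.array([np.cos(t), np.sin(t)]) @ Sigma @ np.array([np.cos(t), np.sin(t)])
        for t in thetas]
assert np.isclose(min(vals), evals[0], atol=1e-5)
```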

In the general \(\mathbb R^D\) case, solving for \(\vec e_j, \; j=M+1, \dots, D\) entails the eigenvalue problems

\[\Sigma \vec e_j =\lambda_j \vec e_j, \; j = M+1, \dots, D\]

and the loss function becomes

\[J = \sum_{j=M+1}^D \lambda_j\]

yielding a basis for the subspace to be ignored corresponding to the smallest eigenvalues. The principal subspace will be spanned by the e-vectors corresponding to the largest eigenvalues.
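Putting the whole derivation together numerically (an illustrative sketch, not code from the course): span the principal subspace with the e-vectors of the top \(M\) eigenvalues of the covariance and check that the reconstruction error equals the sum of the discarded eigenvalues.

```python
# With the principal subspace spanned by the top-M eigenvectors of Sigma,
# the average squared reconstruction error equals the sum of the D - M
# smallest eigenvalues.
import numpy as np

rng = np.random.default_rng(7)
N, D, M = 1000, 6, 2
X = rng.standard_normal((N, D)) @ rng.standard_normal((D, D))
X -= X.mean(axis=0)                         # PCA assumes centered data

Sigma = X.T @ X / N
evals, evecs = np.linalg.eigh(Sigma)        # ascending eigenvalues
B = evecs[:, -M:]                           # e-vectors of the M largest eigenvalues

X_tilde = X @ B @ B.T                       # project onto the principal subspace
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
assert np.isclose(J, evals[:-M].sum())      # J = sum of the smallest eigenvalues
```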



NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.