PRINCIPAL COMPONENT ANALYSIS (PCA):


[What follows was first posted on CrossValidated.]


The dataset and code can be copied directly from GitHub and pasted into R.

I used a data set on semiconductors that I found online here, and trimmed it to just two dimensions - “atomic number” and “melting point” - to facilitate plotting.


As a caveat, the idea is purely illustrative of the computational process: PCA is used to reduce more than two variables to a few derived principal components, or to identify collinearity among multiple features. So it wouldn’t find much application in the case of two variables, nor would there be a need to calculate eigenvectors of correlation matrices, as pointed out by @amoeba.


Further, I truncated the observations from 44 to 15 to ease the task of tracking individual points. The ultimate result was a skeleton data frame (dat1):

compounds   atomic.no      melting.point
AIN         10             498.0
AIP         14             625.0
AIAs        23             1011.5
...         ...            ... 

The “compounds” column indicates the chemical composition of the semiconductor and plays the role of row name.

This can be reproduced as follows (ready to copy and paste into the R console):

dat              <- read.csv(url("http://rinterested.github.io/datasets/semiconductors"))
colnames(dat)[2] <- "atomic.no"                # Rename the second column
dat1             <- subset(dat[1:15, 1:3])     # Keep the first 15 rows and first 3 columns
row.names(dat1)  <- dat1$compounds             # Use the compound names as row names
dat1             <- dat1[,-1]                  # Drop the "compounds" column
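A quick way to confirm that the result matches the skeleton table above (purely a sanity check, not part of the original computation):

head(dat1, 3)   # Should reproduce the three rows shown in the skeleton table
dim(dat1)       # 15 observations (rows) by 2 variables (columns)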

The data were then scaled:

X <- apply(dat1, 2, function(x) (x - mean(x)) / sd(x))
# This centers the data around the column means and standardizes by dividing by the SD.
# It is equivalent to `X <- scale(dat1, center = TRUE, scale = TRUE)`
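To verify the equivalence claimed in the comment (an optional check), the columns of X should now have mean 0 and standard deviation 1:

all.equal(X, scale(dat1, center = TRUE, scale = TRUE), check.attributes = FALSE)
# Expected: TRUE
round(colMeans(X), 10)    # ~ 0 for both columns
apply(X, 2, sd)           # 1 for both columns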

The linear algebra steps followed:

C <- cov(X)                                           # Covariance matrix of the scaled data (= correlation matrix of dat1)

\(\begin{bmatrix} &\text{at\_no}&\text{melt\_p}\\ \text{at\_no}&1&0.296\\ \text{melt\_p}&0.296&1 \end{bmatrix}\)

The correlation function cor(dat1) gives the same output on the non-scaled data as the function cov(X) on the scaled data.
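This can be verified directly (a small optional check):

all.equal(cov(X), cor(dat1))
# Expected: TRUE (up to floating-point tolerance)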

lambda        <- eigen(C)$values                      # Eigenvalues

lambda_matrix <- diag(lambda)                         # Diagonal matrix of eigenvalues

\(\begin{bmatrix} &\color{purple}{\lambda_{\text{PC1}}}&\color{orange}{\lambda_{\text{PC2}}}\\ &1.296422& 0\\ &0&0.7035783 \end{bmatrix}\)

e_vectors     <- eigen(C)$vectors                     # Eigenvectors

\(\frac{1}{\sqrt{2}}\begin{bmatrix} &\color{purple}{\text{PC1}}&\color{orange}{\text{PC2}}\\ &1&\,\,\,\,\,1\\ &1&-1 \end{bmatrix}\)

Since the first eigenvector is initially returned as \(\sim \small [-0.7,-0.7]\), we flip its sign to \(\small [0.7, 0.7]\) to make it consistent with R's built-in functions:

e_vectors[,1] <- - e_vectors[,1]; colnames(e_vectors) <- c("PC1","PC2")
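As a sanity check on the decomposition, the eigenvectors and eigenvalues should reconstruct the covariance matrix via \(C = V\Lambda V^{T}\) (the sign flip of the first eigenvector does not affect this identity):

e_vectors %*% lambda_matrix %*% t(e_vectors)    # Should reproduce C
all.equal(e_vectors %*% lambda_matrix %*% t(e_vectors), C, check.attributes = FALSE)
# Expected: TRUE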

The resultant eigenvalues were \(\small 1.2964217\) and \(\small 0.7035783\). Under less minimalistic conditions, this result would help decide which eigenvectors to keep (those with the largest eigenvalues). For instance, the relative contribution of the first eigenvalue is \(\small 64.8\%\): eigen(C)$values[1]/sum(eigen(C)$values) * 100, meaning that it accounts for \(\sim\small 65\%\) of the variability in the data. The variability in the direction of the second eigenvector is \(\small 35.2\%\). This is typically shown in a scree plot depicting the eigenvalues in decreasing order:

[Scree plot of the eigenvalues.]
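One way to produce such a scree plot with base R graphics (a minimal sketch; prop_var is just an illustrative name):

prop_var <- lambda / sum(lambda)               # 0.648 and 0.352
barplot(lambda, names.arg = c("PC1", "PC2"),
        ylab = "Eigenvalue", main = "Scree plot")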

We’ll include both eigenvectors given the small size of this toy example, understanding that excluding one of them would result in dimensionality reduction - the idea behind PCA.
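For illustration, keeping only the first eigenvector would project the two-dimensional data onto a single dimension (the name X_reduced below is just illustrative):

X_reduced <- X %*% e_vectors[, 1, drop = FALSE]   # 15 x 1 matrix containing only the PC1 scores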

The score matrix was determined as the matrix multiplication of the scaled data (X) by the matrix of eigenvectors (or “rotations”):

score_matrix <-  X %*% e_vectors    
# Identical to the often found operation: t(t(e_vectors) %*% t(X))
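For comparison (optional), the same scores can be obtained - up to a possible sign flip per column - from R's built-in prcomp():

pca <- prcomp(dat1, center = TRUE, scale. = TRUE)
round(pca$rotation, 4)    # Loadings: should match e_vectors up to sign
round(pca$x, 4)           # Scores:   should match score_matrix up to sign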

The concept entails a linear combination of each entry (row / subject / observation / semiconductor in this case) of the centered (and in this case scaled) data, weighted by the entries of each eigenvector. As a result, each of the final columns of the score matrix contains a contribution from every variable (column) of the data (the entire X), BUT only the corresponding eigenvector takes part in the computation - i.e. the first eigenvector \([0.7, 0.7]^{T}\) contributes to \(\text{PC}\,1\) (Principal Component 1) and \([0.7, -0.7]^{T}\) to \(\text{PC}\,2\), as in: