[What follows was first posted on CrossValidated.]
The dataset and code can be copied directly from GitHub and pasted into R.
I used a data set that I found online on semiconductors here, and I trimmed it to just two dimensions - “atomic number” and “melting point” - to facilitate plotting.
As a caveat, the idea is purely illustrative of the computational process: PCA is used to reduce more than two variables to a few derived principal components, or to identify collinearity in the case of multiple features. So it wouldn’t find much application in the case of two variables, nor would there be a need to calculate eigenvectors of correlation matrices, as pointed out by @amoeba.
Further, I truncated the observations from 44 to 15 to ease the task of tracking individual points. The ultimate result was a skeleton data frame (dat1):
compounds  atomic.no  melting.point
AlN        10         498.0
AlP        14         625.0
AlAs       23         1011.5
...        ...        ...
The “compounds” column indicates the chemical constitution of the semiconductor, and plays the role of row name.
This can be reproduced as follows (ready to copy and paste into the R console):
dat <- read.csv(url("http://rinterested.github.io/datasets/semiconductors"))
colnames(dat)[2] <- "atomic.no"    # Rename the second column
dat1 <- dat[1:15, 1:3]             # Keep the first 15 rows and first 3 columns
row.names(dat1) <- dat1$compounds  # Use the compound names as row names
dat1 <- dat1[, -1]                 # Drop the now-redundant "compounds" column
The data were then scaled:
X <- apply(dat1, 2, function(x) (x - mean(x)) / sd(x))
# This centers the data around the column means and standardizes by dividing by the SD.
# It is equivalent to `X <- scale(dat1, center = TRUE, scale = TRUE)`
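A quick sanity check of that equivalence, using only the objects defined above:

all.equal(X, scale(dat1, center = TRUE, scale = TRUE), check.attributes = FALSE)  # TRUE
# (attributes differ because scale() stores the centering and scaling values)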
The linear algebra steps followed:
C <- cov(X) # Covariance matrix of the centered and scaled data
\(\begin{bmatrix} &\text{at_no}&\text{melt_p}\\ \text{at_no}&1&0.296\\ \text{melt_p}&0.296&1 \end{bmatrix}\)
The correlation function cor(dat1) gives the same output on the non-scaled data as the function cov(X) on the scaled data.
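This equivalence is easy to verify:

all.equal(cor(dat1), cov(X), check.attributes = FALSE)  # TRUE
# The covariance matrix of standardized data is the correlation matrix.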
lambda <- eigen(C)$values               # Eigenvalues
lambda_matrix <- diag(eigen(C)$values)  # Diagonal matrix of eigenvalues
\(\begin{bmatrix} &\color{purple}{\lambda_{\text{PC1}}}&\color{orange}{\lambda_{\text{PC2}}}\\ &1.296422& 0\\ &0&0.7035783 \end{bmatrix}\)
e_vectors <- eigen(C)$vectors # Eigenvectors
\(\frac{1}{\sqrt{2}}\begin{bmatrix} &\color{purple}{\text{PC1}}&\color{orange}{\text{PC2}}\\ &1&\,\,\,\,\,1\\ &1&-1 \end{bmatrix}\)
Since the first eigenvector initially returns as \(\sim \small [-0.7,-0.7]\), we choose to change it to \(\small [0.7, 0.7]\) to make it consistent with the built-in functions, through:
e_vectors[,1] <- -e_vectors[,1]; colnames(e_vectors) <- c("PC1", "PC2")
The resultant eigenvalues were \(\small 1.2964217\) and \(\small 0.7035783\). Under less minimalistic conditions, this result would have helped decide which eigenvectors to include (those with the largest eigenvalues). For instance, the relative contribution of the first eigenvalue is \(\small 64.8\%\): eigen(C)$values[1] / sum(eigen(C)$values) * 100, meaning that it accounts for \(\sim\small 65\%\) of the variability in the data. The variability in the direction of the second eigenvector is \(\small 35.2\%\). This is typically shown on a scree plot depicting the value of the eigenvalues:
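A minimal sketch of such a scree plot, using the lambda vector computed above:

barplot(lambda, names.arg = c("PC1", "PC2"),
        ylab = "Eigenvalue", main = "Scree plot")
# Or with the built-in method: screeplot(prcomp(X))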
We’ll include both eigenvectors given the small size of this toy data set, understanding that excluding one of them would result in dimensionality reduction - the idea behind PCA.
The score matrix was determined as the matrix multiplication of the scaled data (X) by the matrix of eigenvectors (or “rotations”):
score_matrix <- X %*% e_vectors
# Identical to the often found operation: t(t(e_vectors) %*% t(X))
The concept entails a linear combination of each entry (row / subject / observation / semiconductor in this case) of the centered (and in this case scaled) data weighted by the rows of each eigenvector, so that in each of the final columns of the score matrix we’ll find a contribution from each variable (column) of the data (the entire X), BUT only the corresponding eigenvector will have taken part in the computation (i.e. the first eigenvector \([0.7, 0.7]^{T}\) will contribute to \(\text{PC}\,1\) (Principal Component 1) and \([0.7, -0.7]^{T}\) to \(\text{PC}\,2\)), as in:
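Spelled out with the rounded eigenvector entries, the scores for each observation \(i\) are:
\[\text{PC1}_i = 0.7 \cdot \text{atomic.no}_i + 0.7 \cdot \text{melting.point}_i\]
\[\text{PC2}_i = 0.7 \cdot \text{atomic.no}_i - 0.7 \cdot \text{melting.point}_i\]
with the variables taken on their centered and scaled values.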
Therefore each eigenvector weights each variable differently, and this is reflected in the “loadings” of the PCA. In our case, the negative sign in the second component of the second eigenvector \([0.7, -0.7]\) will change the sign of the melting point values in the linear combinations that produce PC2, whereas the effect of the first eigenvector will be consistently positive:
The eigenvectors are scaled to unit length (their squared entries sum to \(1\)):
> apply(e_vectors, 2, function(x) sum(x^2))
PC1 PC2
1 1
whereas the loadings are the eigenvectors scaled by the square roots of the eigenvalues, i.e. by the standard deviations of the principal components (despite the confusing terminology in the in-built R functions displayed below: what princomp() calls loadings are just the unit-length eigenvectors). Consequently, the loadings can be calculated as:
> e_vectors %*% sqrt(lambda_matrix)
          [,1]       [,2]
[1,] 0.8051154  0.5931182
[2,] 0.8051154 -0.5931182
> prcomp(X)$rotation %*% diag(princomp(covmat = C)$sdev)
                   [,1]       [,2]
atomic.no     0.8051154  0.5931182
melting.point 0.8051154 -0.5931182
It is interesting to note that the rotated data cloud (the score plot) will have variance along each component (PC) equal to the eigenvalues:
> apply(score_matrix, 2, function(x) var(x))
      PC1       PC2
1.2964217 0.7035783
> lambda
[1] 1.2964217 0.7035783
Utilizing the built-in functions the results can be replicated:
# For the SCORE MATRIX:
prcomp(X)$x
# or...
princomp(X)$scores # The signs of the PC 1 column will be reversed.
# and for EIGENVECTOR MATRIX:
prcomp(X)$rotation
# or...
princomp(X)$loadings
# and for EIGENVALUES:
prcomp(X)$sdev^2
# or...
princomp(covmat = C)$sdev^2
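The agreement can also be confirmed programmatically (the score signs match here because of the sign flip applied to e_vectors earlier):

all.equal(prcomp(X)$x, score_matrix, check.attributes = FALSE)  # TRUE
all.equal(prcomp(X)$sdev^2, lambda)                             # TRUE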
Alternatively, the singular value decomposition (\(\text{U}\Sigma\text{V}^\text{T}\)) method can be applied to manually calculate PCA; in fact, this is the method used in prcomp(). The steps can be spelled out as:
svd_scaled_dat <- svd(scale(dat1))
eigen_vectors <- svd_scaled_dat$v  # Right singular vectors
eigen_values <- (svd_scaled_dat$d / sqrt(nrow(dat1) - 1))^2  # Singular values squared / (n - 1)
scores <- scale(dat1) %*% eigen_vectors
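As a check, the SVD route agrees with the eigendecomposition above (comparing absolute values to sidestep the arbitrary signs):

all.equal(eigen_values, lambda)                                           # TRUE
all.equal(abs(eigen_vectors), abs(e_vectors), check.attributes = FALSE)   # TRUE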
The result is shown below: first, the distances from the individual points to the first eigenvector and, in a second plot, the orthogonal distances to the second eigenvector:
If instead we plotted the values of the score matrix (PC1 and PC2) - no longer “melting.point” and “atomic.no”, but really a change of basis of the point coordinates, with the eigenvectors as basis - these distances would be preserved, but would naturally become perpendicular to the xy axes:
The trick was now to recover the original data. The points had been transformed through a simple matrix multiplication by the eigenvectors. Now the data were rotated back by multiplying by the inverse of the matrix of eigenvectors, with a resultant marked change in the location of the data points. For instance, notice the change in the pink dot “GaN” in the upper left quadrant (black circle in the left plot, below), returning to its initial position in the lower left quadrant (black circle in the right plot, below).
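Since the eigenvector matrix is orthogonal, its inverse is simply its transpose, so the de-rotation can be sketched as follows (undoing the scaling as well, to land back on dat1; all object names are the ones defined above):

de_rotated <- score_matrix %*% solve(e_vectors)  # Equivalently: %*% t(e_vectors)
# Undo the standardization column by column:
restored <- sweep(de_rotated, 2, apply(dat1, 2, sd), "*")
restored <- sweep(restored, 2, colMeans(dat1), "+")
all.equal(restored, as.matrix(dat1), check.attributes = FALSE)  # TRUE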
Now we finally had the original data restored in this “de-rotated” matrix:
Beyond the change of coordinates in the rotation of the data in PCA, the results must be interpreted, and this process tends to involve a biplot, on which the data points are plotted with respect to the new eigenvector coordinates, and the original variables are now superimposed as vectors. It is interesting to note the equivalence in the position of the points between the plots in the second row of rotation graphs above (“Scores with xy Axis = Eigenvectors”) (to the left in the plots that follow), and the biplot (to the right):
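A minimal version of such a biplot can be generated with the built-in function (the plots referenced above carry additional styling):

biplot(prcomp(X), scale = 0)
# With scale = 0 the points are the unscaled scores and the red arrows
# are proportional to the columns of the rotation (eigenvector) matrix.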
The superimposition of the original variables as red arrows offers a path to the interpretation of PC1 as a vector in the direction of (or positively correlated with) both atomic no and melting point; and of PC2 as a component along increasing values of atomic no but negatively correlated with melting point, consistent with the values of the eigenvectors:
> prcomp(X)$rotation
                    PC1        PC2
atomic.no     0.7071068  0.7071068
melting.point 0.7071068 -0.7071068