SVD, pseudoinverse, and PCA in R
PCA
Let us carry out in R a principal component analysis in a concrete example.
We start with the creation of a data set of two quantities and its visualisation.
> N<-25
> X1 <- sort(rnorm(N)) # sort the first column
> X2 <- rep(2,N)+ 1.5*X1 + 0.25*rnorm(N) # data on a line + noise
> plot(X1, X2, type="p", col="blue", pch=1, asp=1)
If we want to do a proper analysis, we first have to centre the data so that the mean of each of the two quantities is zero.
> X1 <- X1-mean(X1)
> X2 <- X2-mean(X2)
> X <- cbind(X1,X2)
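The same centring can be done in one call with R's built-in function scale. A self-contained sketch (the seed is an assumption, not part of the original):

```r
set.seed(1)  # assumed seed so the snippet is reproducible
N  <- 25
X1 <- sort(rnorm(N))
X2 <- rep(2, N) + 1.5 * X1 + 0.25 * rnorm(N)
X  <- scale(cbind(X1, X2), center = TRUE, scale = FALSE)  # subtract column means
colMeans(X)  # both column means are numerically zero
```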
From the data matrix \(X\) with two quantities (two columns) and \(N=25\) observations (25 rows) we can construct the covariance matrix \(C\) as \[C = \frac{1}{N-1}X^{\top}\!X\] R provides the function cov
for this purpose.
> C <- 1/(N-1) * t(X) %*% X; C
          X1       X2
X1 0.7483376 1.089108
X2 1.0891078 1.661375
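The manual computation and the built-in cov agree, which is easy to verify. A self-contained sketch (the seed is an assumption, not part of the original):

```r
set.seed(1)  # assumed seed so the snippet is reproducible
N  <- 25
X1 <- sort(rnorm(N))
X2 <- rep(2, N) + 1.5 * X1 + 0.25 * rnorm(N)
X1 <- X1 - mean(X1)
X2 <- X2 - mean(X2)
X  <- cbind(X1, X2)
C_manual <- 1 / (N - 1) * t(X) %*% X  # covariance matrix by hand
C_cov    <- cov(X)                    # built-in covariance
max(abs(C_manual - C_cov))            # difference is numerically zero
```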
We now want to define new quantities whose values are uncorrelated. To do this, we diagonalise the covariance matrix by computing the singular value decomposition of \(C\) numerically.
> svdC <- svd(C) # singular value decomposition of C
> V <- svdC$v; V
[,1] [,2]
[1,] -0.5538142 -0.8326403
[2,] -0.8326403 0.5538142
> S <- diag(svdC$d); S
[,1] [,2]
[1,] 2.385773 0.00000000
[2,] 0.000000 0.02393922
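Since \(C\) is symmetric and positive semi-definite, its singular value decomposition coincides with its eigendecomposition, so eigen gives the same values. A sketch using the covariance matrix printed above:

```r
# Covariance matrix as computed above
C  <- matrix(c(0.7483376, 1.0891078, 1.0891078, 1.6613750), nrow = 2)
sv <- svd(C)    # singular value decomposition
ev <- eigen(C)  # eigendecomposition
sv$d            # singular values ...
ev$values       # ... equal the eigenvalues for a symmetric PSD matrix
```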
The eigenvectors define new quantities (as linear combinations of the old variables) that are uncorrelated, since zero correlation means nothing other than a zero dot product of the centred data columns. The figure below shows the two eigenvectors as the axes of a new coordinate system.
> plot(X1, X2, type="p", col="blue", pch=1, asp=1)
> arrows(0, 0, -V[1,1], -V[2,1], col="red")
> arrows(0, 0, -V[1,2], -V[2,2], col="red")
We can transform the data to the new coordinate system and visualise them. The singular values are returned in decreasing order, so the new columns already come sorted by decreasing variance.
> Y <- X %*% V; Y # transformed data
[,1] [,2]
[1,] 2.83855472 -0.02504232
[2,] 1.72813421 0.27607070
[3,] 1.77252406 0.18196905
[4,] 1.89608066 0.07139954
[5,] 2.04905404 -0.11153504
[6,] 1.03245369 0.10813638
[7,] 1.06534013 -0.01541172
[8,] 0.97677332 0.04009078
[9,] 0.78539110 -0.20678436
[10,] 0.89555155 -0.33257959
[11,] -0.20647606 0.14933024
[12,] -0.24741694 0.15946881
[13,] 0.03575947 -0.04127564
[14,] 0.09433655 -0.16633603
[15,] -0.59100438 0.17256342
[16,] -0.26799229 -0.05202771
[17,] -0.42391817 -0.05901045
[18,] -0.58422372 -0.07351742
[19,] -0.90964667 0.03826849
[20,] -0.66825077 -0.29052675
[21,] -1.15731713 -0.04643221
[22,] -2.04298603 0.24909232
[23,] -2.34162497 0.07674717
[24,] -2.16904238 -0.06608385
[25,] -3.56005399 -0.03657379
> Y1 <- Y[,1]
> Y2 <- Y[,2]
> plot(Y1, Y2, type="p", col="blue", pch=1, asp=1)
> arrows(0,0,1,0)
> arrows(0,0,0,1)
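That the new coordinates really are uncorrelated can be checked directly: the covariance matrix of \(Y\) is diagonal. A self-contained sketch (the seed is an assumption, not part of the original):

```r
set.seed(1)  # assumed seed so the snippet is reproducible
N  <- 25
X1 <- sort(rnorm(N))
X2 <- rep(2, N) + 1.5 * X1 + 0.25 * rnorm(N)
X  <- scale(cbind(X1, X2), center = TRUE, scale = FALSE)
V  <- svd(cov(X))$v
Y  <- X %*% V
round(cov(Y), 10)    # off-diagonal entries are zero
cor(Y[, 1], Y[, 2])  # correlation of the new quantities is numerically zero
```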
With the function var
we can determine the variance of each new quantity. One quantity then accounts for \(99\%\) of the variance of the data, the other for only \(1\%\).
> var(Y1)
[1] 2.385773
> var(Y2)
[1] 0.02393922
> varsY <- diag(var(Y)); varsY
[1] 2.38577326 0.02393922
> varsY/sum(varsY)
[1] 0.990065527 0.009934473
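Because \(V\) is orthogonal, the rotation leaves the total variance unchanged; the two proportions therefore always sum to one. A self-contained sketch (the seed is an assumption, not part of the original):

```r
set.seed(1)  # assumed seed so the snippet is reproducible
N  <- 25
X1 <- sort(rnorm(N))
X2 <- rep(2, N) + 1.5 * X1 + 0.25 * rnorm(N)
X  <- scale(cbind(X1, X2), center = TRUE, scale = FALSE)
V  <- svd(cov(X))$v
Y  <- X %*% V
sum(diag(cov(X)))  # total variance of the original coordinates ...
sum(diag(cov(Y)))  # ... equals the total variance of the new coordinates
```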
We can of course also use the built-in function prcomp
in R.
> pcaX <- prcomp(X, retx =TRUE, center=TRUE)
> summary(pcaX)
Importance of components:
PC1 PC2
Standard deviation 1.5446 0.15472
Proportion of Variance 0.9901 0.00993
Cumulative Proportion 0.9901 1.00000
> pcaX$rotation # coefficients of the principal components (factor loadings)
PC1 PC2
X1 0.5538142 0.8326403
X2 0.8326403 -0.5538142
> pcaX$x # transformed data
PC1 PC2
[1,] -2.83855472 0.02504232
[2,] -1.72813421 -0.27607070
[3,] -1.77252406 -0.18196905
[4,] -1.89608066 -0.07139954
[5,] -2.04905404 0.11153504
[6,] -1.03245369 -0.10813638
[7,] -1.06534013 0.01541172
[8,] -0.97677332 -0.04009078
[9,] -0.78539110 0.20678436
[10,] -0.89555155 0.33257959
[11,] 0.20647606 -0.14933024
[12,] 0.24741694 -0.15946881
[13,] -0.03575947 0.04127564
[14,] -0.09433655 0.16633603
[15,] 0.59100438 -0.17256342
[16,] 0.26799229 0.05202771
[17,] 0.42391817 0.05901045
[18,] 0.58422372 0.07351742
[19,] 0.90964667 -0.03826849
[20,] 0.66825077 0.29052675
[21,] 1.15731713 0.04643221
[22,] 2.04298603 -0.24909232
[23,] 2.34162497 -0.07674717
[24,] 2.16904238 0.06608385
[25,] 3.56005399 0.03657379
> round(sum(abs(Y + pcaX$x))) # in this example: Y = -pcaX$x
[1] 0
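A typical application of PCA is dimensionality reduction: keeping only PC1 yields the best rank-one approximation of the centred data. A self-contained sketch (the seed is an assumption, not part of the original):

```r
set.seed(1)  # assumed seed so the snippet is reproducible
N  <- 25
X1 <- sort(rnorm(N))
X2 <- rep(2, N) + 1.5 * X1 + 0.25 * rnorm(N)
X  <- scale(cbind(X1, X2), center = TRUE, scale = FALSE)
pcaX <- prcomp(X)
# Project onto PC1 only and map back to the original coordinates
X1d <- pcaX$x[, 1, drop = FALSE] %*% t(pcaX$rotation[, 1, drop = FALSE])
mean((X - X1d)^2)  # small residual, since PC1 carries about 99% of the variance
```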