SVD, pseudoinverse, and PCA in MATLAB
PCA
Let us carry out a principal component analysis in MATLAB on a concrete example.
We start by creating a data set of two quantities and visualising it.
>> rng(0); % control random number generation
>> data(:,1) = randn(25,1);
>> data(:,2) = 2 + 1.5*data(:,1) + 0.25*randn(size(data(:,1))); % data on a line + noise
>> data = sortrows(data,1); % sort rows by the first column
>> figure
axes('LineWidth',0.6,...
'FontName','Helvetica',...
'FontSize',8,...
'XAxisLocation','Origin',...
'YAxisLocation','Origin');
line(data(:,1),data(:,2),...
'LineStyle','None',...
'Marker','o');
axis equal
For a proper analysis we first need to centre the data, that is, shift each of the two quantities so that its mean value is zero.
>> data(:,1) = data(:,1)-mean(data(:,1));
>> data(:,2) = data(:,2)-mean(data(:,2));
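As a side remark, both columns can be centred in one step, since mean acts column-wise on a matrix; a minimal sketch (the implicit expansion used here requires MATLAB R2016b or newer):
>> data = data - mean(data); % subtract each column mean; no effect here, data is already centred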
From the data matrix \(X\) with two quantities (two columns) and \(N=25\) observations (25 rows) we can construct the covariance matrix \(C\) as \[C = \frac{1}{N-1}X^{\top}\!X\] (this holds because the columns of \(X\) are centred). MATLAB provides the function cov for this purpose.
>> C = cov(data)
C =
1.9179 2.7857
2.7857 4.1102
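To connect this output with the formula above, we can compute the covariance matrix by hand from the centred data matrix and compare; a quick sketch:
>> C_manual = (data' * data) / (size(data,1) - 1); % (1/(N-1)) X'X for centred X
>> norm(C_manual - C) % difference is numerically zero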
We now want to define new variables whose values are uncorrelated. For this we diagonalise the covariance matrix by numerically solving the eigenvalue problem for the matrix \(C\).
>> [V,D] = eig(C)
V =
-0.8265 0.5630
0.5630 0.8265
D =
0.0205 0
0 6.0076
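As a quick sanity check (a sketch, not part of the analysis proper): the columns of V are orthonormal and diagonalise C.
>> norm(V*D*V' - C) % C = V*D*V' holds up to rounding
>> V'*V % orthonormal eigenvectors: the 2-by-2 identity matrix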
The eigenvectors define new quantities (as linear combinations of the old variables) that are uncorrelated: for centred column vectors, zero correlation is nothing other than a vanishing dot product, and the transformation makes the new data columns orthogonal. The figure below shows the two eigenvectors as part of a new coordinate system.
>> figure
axes('LineWidth',0.6,...
'FontName','Helvetica',...
'FontSize',8,...
'XAxisLocation','Origin',...
'YAxisLocation','Origin');
line(data(:,1),data(:,2),...
'LineStyle','None',...
'Marker','o');
line([0 V(1,1)],[0 V(2,1)],...
'Color',[0.8 0.5 0.3],...
'LineWidth',0.75);
line([0 V(1,2)],[0 V(2,2)],...
'Color',[0.8 0.5 0.3],...
'LineWidth',0.75);
axis equal
We can transform the data into the new coordinate system and visualise them. Since \(V\) is orthogonal, the coordinates with respect to the eigenvector basis are given by \(XV\). Because eig returns the eigenvalues in increasing order, we switch the columns afterwards so that the dominant component comes first (see the sketch after the plot for the general case).
>> newdata = data * V; % coordinates w.r.t. the eigenvector basis
>> newdata = fliplr(newdata)
newdata =
-5.2475 -0.0866
-3.2490 0.1749
-3.2866 0.0982
-3.3236 -0.0482
-2.1315 -0.1726
-1.5034 -0.0214
-1.3865 -0.0396
-0.9434 0.1884
-0.7009 -0.1084
-0.7160 -0.1476
-0.0023 0.1615
0.0702 0.1519
0.0915 0.0045
-0.1780 -0.2314
0.3098 0.0978
0.0389 -0.0966
0.5021 0.0535
1.4919 0.0659
1.2632 -0.0997
1.6386 0.0683
2.0519 0.1797
2.3433 0.1318
3.2713 -0.3681
4.4254 0.0968
5.1706 -0.0531
>> figure
axes('LineWidth',0.6,...
'FontName','Helvetica',...
'FontSize',8,...
'XAxisLocation','Origin',...
'YAxisLocation','Origin')
line(newdata(:,1),newdata(:,2),...
'LineStyle','None',...
'Marker','o');
axis equal
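The fliplr trick above works because there are only two components. In higher dimensions one would instead sort the eigenvalues explicitly; a minimal sketch (newdata2 is just an illustrative name):
>> [d,idx] = sort(diag(D),'descend'); % eigenvalues, largest first
>> newdata2 = data * V(:,idx); % columns of V reordered to match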
With the function var
we can determine the variance of the new quantities. One quantity explains \(99.66\%\) of the variance of the data, the other only \(0.34\%\).
>> var(newdata)
ans =
6.0076 0.0205
>> var(newdata)/sum(var(newdata))
ans =
0.9966 0.0034
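The variances of the new quantities are exactly the eigenvalues of \(C\), so the same fractions can be read off from D directly; a quick sketch:
>> sort(diag(D),'descend')' / trace(C) % fractions of the total variance, cf. the result above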
We can also use the built-in function pca
in MATLAB. Note that eigenvectors are determined only up to sign and pca applies its own sign convention, so below the second column of the transformed data has the opposite sign compared with our result above.
>> [coeff,newdata,latent,tsquared,variance] = pca(data);
>> coeff % coefficients of the principal components (factor loadings)
coeff =
0.5630 0.8265
0.8265 -0.5630
>> newdata % transformed data
newdata =
-5.2475 0.0866
-3.2490 -0.1749
-3.2866 -0.0982
-3.3236 0.0482
-2.1315 0.1726
-1.5034 0.0214
-1.3865 0.0396
-0.9434 -0.1884
-0.7009 0.1084
-0.7160 0.1476
-0.0023 -0.1615
0.0702 -0.1519
0.0915 -0.0045
-0.1780 0.2314
0.3098 -0.0978
0.0389 0.0966
0.5021 -0.0535
1.4919 -0.0659
1.2632 0.0997
1.6386 -0.0683
2.0519 -0.1797
2.3433 -0.1318
3.2713 0.3681
4.4254 -0.0968
5.1706 0.0531
>> latent % eigenvalues of the covariance matrix of the data set
latent =
6.0076
0.0205
>> variance % percentage of the total variance explained by each principal component
variance =
99.6601
0.3399
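Keeping only the dominant component yields a rank-one approximation of the (centred) data that loses just \(0.34\%\) of the variance; a minimal sketch:
>> approx = newdata(:,1) * coeff(:,1)'; % projection onto the first principal axis
>> norm(data - approx,'fro')^2 / norm(data,'fro')^2 % relative squared error, about 0.0034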
Finally, corrcoef confirms that the new quantities are indeed uncorrelated.
>> corrcoef(newdata)
ans =
1.0000 0.0000
0.0000 1.0000
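To tie this back to the SVD: the same principal components can be obtained directly from a singular value decomposition of the centred data matrix, without ever forming the covariance matrix; a minimal sketch (the column signs may again differ):
>> [U,S,W] = svd(data,'econ'); % data = U*S*W'; the columns of W are the principal axes
>> score = U*S; % coordinates along the principal axes, cf. newdata
>> diag(S).^2 / (size(data,1) - 1) % squared singular values over N-1: the eigenvalues of C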