StatQuest (Josh Starmer) has published a great video explaining principal component analysis (PCA). In the video he mentions that he is using Singular Value Decomposition (SVD), but he never presents that part of the maths: no singular value decomposition is explicitly calculated.
The reason he states that he’s using SVD is that the example works directly with the data matrix, a table of measurements where each column represents a variable and each row represents a sample. (In his table of mice and genes the data matrix is transposed.) The alternative would be to first form a covariance matrix and then use Eigenvalue Decomposition (EvD). The two methods are equivalent: they give the same principal components, up to the sign of each component.
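The equivalence of the two routes can be written out in a few lines. The notation here is an assumption of this sketch, not something taken from the video: X is the centered data matrix with n samples in the rows, C is the sample covariance matrix, and X = UΣVᵀ is the thin SVD.

```latex
\begin{aligned}
C &= \frac{1}{n-1}\, X^{\mathsf T} X,
  \qquad X = U \Sigma V^{\mathsf T},
  \qquad U^{\mathsf T} U = I \\
C &= \frac{1}{n-1}\, V \Sigma\, U^{\mathsf T} U\, \Sigma V^{\mathsf T}
   = V \left( \frac{\Sigma^{2}}{n-1} \right) V^{\mathsf T} \\
\lambda_i &= \frac{\sigma_i^{2}}{n-1}
\end{aligned}
```

The last form is an eigenvalue decomposition of C, so the right-singular vectors (the columns of V) are the eigenvectors of the covariance matrix, and each eigenvalue is the corresponding squared singular value divided by n - 1. Each column of V is only determined up to its sign, so the two routes may disagree on signs.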
This post aims to fill in some of the gaps by presenting a program that does the maths. It also shows the connection between EvD and SVD, and demonstrates how to do PCA using ojAlgo.
Unfortunately the program doesn’t get exactly the same numbers as in the video. If you can spot what’s causing the differences then please let me and/or Josh Starmer know (depending on who made a mistake). In the video the loading scores for PC1 (2 mice version) are 0.97 and 0.242, but in our calculations they are 0.94 and 0.34. The corresponding eigenvalues are also different, which makes derived numbers like the relative importance of the principal components very different.
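One quick way to compare the two sets of PC1 loading scores is to look at the direction each one points in. The sketch below is plain Java with no ojAlgo dependency; the class and method names are made up for illustration. It only restates the four quoted numbers as a vector length and a slope, and it does not explain where the difference comes from.

```java
final class LoadingDirectionCheck {

    /** Euclidean length of a 2-element loading vector. */
    static double length(double x, double y) {
        return Math.hypot(x, y);
    }

    /** Slope (second loading divided by first) of the direction the vector points in. */
    static double slope(double x, double y) {
        return y / x;
    }

    public static void main(String[] args) {
        // PC1 loading scores as quoted in the text
        double[] video = { 0.97, 0.242 };
        double[] ours = { 0.94, 0.34 };

        System.out.printf("video: length=%.4f slope=%.4f%n",
                length(video[0], video[1]), slope(video[0], video[1]));
        System.out.printf("ours:  length=%.4f slope=%.4f%n",
                length(ours[0], ours[1]), slope(ours[0], ours[1]));
    }
}
```

Both vectors are very close to unit length, so neither looks mis-scaled. The slopes are about 0.25 for the video and about 0.36 for this program, which suggests the two calculations found genuinely different PC1 directions rather than the same direction reported differently.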
Remember!
PCA will not (directly) tell you which variables are the most important. It removes noise and reduces the dimensionality of the data, which makes the data easier to work with. Most likely further analysis is needed, though that further analysis may be as simple as plotting a chart.
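To make "reduces the dimensionality" concrete, here is a minimal plain-Java sketch (not the ojAlgo program from this post) that collapses the 2-variable example data onto PC1 alone. The data and the PC1 loadings are copied from the console output in this post; the class and method names are hypothetical.

```java
final class ProjectionSketch {

    // 2-variable example data: 6 samples (rows), 2 variables (columns)
    static final double[][] DATA = {
        { 10, 6 }, { 11, 4 }, { 8, 5 }, { 3, 3 }, { 2, 2.8 }, { 1, 1 }
    };

    // PC1 loadings as printed in the console output
    static final double[] PC1 = { 0.9417106889306233, 0.3364238076501293 };

    /** Returns a copy of the data with each column's mean subtracted. */
    static double[][] center(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[] mean = new double[p];
        for (double[] row : data) {
            for (int j = 0; j < p; j++) {
                mean[j] += row[j] / n;
            }
        }
        double[][] centered = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                centered[i][j] = data[i][j] - mean[j];
            }
        }
        return centered;
    }

    /** Projects each centered sample onto one loading vector (a dot product per row). */
    static double[] project(double[][] centered, double[] loadings) {
        double[] scores = new double[centered.length];
        for (int i = 0; i < centered.length; i++) {
            for (int j = 0; j < loadings.length; j++) {
                scores[i] += centered[i][j] * loadings[j];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // One number per sample instead of two
        for (double score : project(center(DATA), PC1)) {
            System.out.printf("%.6f%n", score);
        }
    }
}
```

Each sample is now described by a single number. These scores should agree with the first column of "Data (transformed)" in the 2-variable console output.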
Example Code
Console Output
2 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6
11 4
8 5
3 3
2 2.8
1 1
There are 2 variables and 6 samples.
Covariances
18.966667 6.486667
6.486667 3.126667
EvD
Eigenvalues (on the diagonal)
21.284012 0
0 0.809321
Eigenvectors (in the columns)
0.941711 0.336424
0.336424 -0.941711
Relative (Variance) Importance: { 0.963368085822159, 0.03663191417784095 }
Data (centered)
4.166667 2.366667
5.166667 0.366667
2.166667 1.366667
-2.833333 -0.633333
-3.833333 -0.833333
-4.833333 -2.633333
SVD
Left-singular Vectors (in the columns)
0.457541 -0.411087
0.483604 0.692426
0.242357 -0.277432
-0.279299 -0.177362
-0.377107 -0.250975
-0.527095 0.424429
Singular values (on the diagonal)
10.31601 0
0 2.011618
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.941711 0.336424
0.336424 -0.941711
Sum of eigenvalues/variance: 22.093333333333334 == 22.093333333333344
PC1: Variance=21.284012242764245 (96.34%%) Loadings={ 0.9417106889306233, 0.3364238076501293 }
PC2: Variance=0.8093210905690993 (3.66%%) Loadings={ 0.3364238076501293, -0.9417106889306233 }
Data (transformed)
4.719998 -0.826949
4.988861 1.392896
2.500152 -0.558086
-2.881249 -0.356784
-3.890244 -0.504866
-5.437518 0.85379
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
4.719998 -0.826949
4.988861 1.392896
2.500152 -0.558086
-2.881249 -0.356784
-3.890244 -0.504866
-5.437518 0.85379
Covariances (from SVD) – compare this to what we originally calculated
18.966667 6.486667
6.486667 3.126667
Covariances (from SVD using only 2 components)
18.966667 6.486667
6.486667 3.126667
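The covariance and EvD numbers for the 2-variable case can be cross-checked without any library. The following is a sketch in plain Java, not the ojAlgo code that produced the output above; the class and method names are hypothetical, and the eigen decomposition uses the closed form for a symmetric 2x2 matrix, which only works for exactly two variables.

```java
final class CovarianceEvdSketch {

    // 2-variable example data: 6 samples (rows), 2 variables (columns)
    static final double[][] DATA = {
        { 10, 6 }, { 11, 4 }, { 8, 5 }, { 3, 3 }, { 2, 2.8 }, { 1, 1 }
    };

    /** Sample covariance matrix (divides by n - 1) of row-per-sample data. */
    static double[][] covariance(double[][] data) {
        int n = data.length;
        int p = data[0].length;
        double[] mean = new double[p];
        for (double[] row : data) {
            for (int j = 0; j < p; j++) {
                mean[j] += row[j] / n;
            }
        }
        double[][] cov = new double[p][p];
        for (double[] row : data) {
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    cov[i][j] += (row[i] - mean[i]) * (row[j] - mean[j]) / (n - 1);
                }
            }
        }
        return cov;
    }

    /** Eigenvalues of a symmetric 2x2 matrix, largest first (closed form). */
    static double[] eigenvalues(double[][] c) {
        double mid = (c[0][0] + c[1][1]) / 2.0;
        double radius = Math.hypot((c[0][0] - c[1][1]) / 2.0, c[0][1]);
        return new double[] { mid + radius, mid - radius };
    }

    /** Unit eigenvector for the largest eigenvalue of a symmetric 2x2 matrix. */
    static double[] firstEigenvector(double[][] c) {
        double b = c[0][1];
        if (b == 0.0) {
            // Already diagonal: the coordinate axes are the eigenvectors
            return c[0][0] >= c[1][1] ? new double[] { 1, 0 } : new double[] { 0, 1 };
        }
        // (b, lambda1 - a) solves (C - lambda1 * I) v = 0 for C = [[a, b], [b, d]]
        double x = b;
        double y = eigenvalues(c)[0] - c[0][0];
        double norm = Math.hypot(x, y);
        return new double[] { x / norm, y / norm };
    }

    public static void main(String[] args) {
        double[][] cov = covariance(DATA);
        double[] lambda = eigenvalues(cov);
        double[] v1 = firstEigenvector(cov);
        System.out.printf("cov = [%.6f %.6f; %.6f %.6f]%n", cov[0][0], cov[0][1], cov[1][0], cov[1][1]);
        System.out.printf("eigenvalues = %.6f, %.6f%n", lambda[0], lambda[1]);
        System.out.printf("PC1 loadings = %.6f, %.6f%n", v1[0], v1[1]);
    }
}
```

The second eigenvector is perpendicular to the first, so (v1[1], -v1[0]) gives the PC2 loadings, again up to sign. The values printed by this sketch should match the Covariances and EvD blocks above to the displayed precision.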
3 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6 12
11 4 9
8 5 10
3 3 2.5
2 2.8 1.3
1 1 2
There are 3 variables and 6 samples.
Covariances
18.966667 6.486667 19.286667
6.486667 3.126667 7.486667
19.286667 7.486667 22.246667
EvD
Eigenvalues (on the diagonal)
42.45597 0 0
0 1.357664 0
0 0 0.526366
Eigenvectors (in the columns)
0.654759 -0.747649 0.110959
0.244157 0.348147 0.905086
0.715316 0.565522 -0.410496
Relative (Variance) Importance: { 0.9575094809015773, 0.030619393636449194, 0.011871125461973569 }
Data (centered)
4.166667 2.366667 5.866667
5.166667 0.366667 2.866667
2.166667 1.366667 3.866667
-2.833333 -0.633333 -3.633333
-3.833333 -0.833333 -4.833333
-4.833333 -2.633333 -4.133333
SVD
Left-singular Vectors (in the columns)
0.514936 0.393973 -0.120891
0.379072 -0.811392 0.16742
0.310108 0.400155 0.067738
-0.316323 -0.060214 -0.37223
-0.423529 -0.060447 -0.495894
-0.464265 0.137926 0.753857
Singular values (on the diagonal)
14.569827 0 0
0 2.60544 0
0 0 1.622291
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.654759 -0.747649 -0.110959
0.244157 0.348147 -0.905086
0.715316 0.565522 0.410496
Sum of eigenvalues/variance: 44.34 == 44.34000000000004
PC1: Variance=42.455970383175966 (95.75%%) Loadings={ 0.6547591745515717, 0.24415739058194894, 0.715316427858859 }
PC2: Variance=1.3576639138401558 (3.06%%) Loadings={ -0.7476486995434165, 0.34814683061538987, 0.5655220653550999 }
Data (transformed)
7.502525 1.026474
5.523021 -2.114035
4.518217 1.04258
-4.608767 -0.156885
-6.170737 -0.157492
-6.764258 0.359358
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
7.502525 1.026474 -0.196121
5.523021 -2.114035 0.271604
4.518217 1.04258 0.109891
-4.608767 -0.156885 -0.603865
-6.170737 -0.157492 -0.804485
-6.764258 0.359358 1.222976
Covariances (from SVD) – compare this to what we originally calculated
18.966667 6.486667 19.286667
6.486667 3.126667 7.486667
19.286667 7.486667 22.246667
Covariances (from SVD using only 2 components)
18.960186 6.433805 19.310642
6.433805 2.695478 7.68223
19.310642 7.68223 22.15797
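The last two blocks of the 3-variable output show what is lost by keeping only 2 components. One way to reproduce them is to rebuild the covariance matrix as a sum of eigenvalue times eigenvector outer products and stop after k terms. The sketch below does that in plain Java from the rounded EvD numbers printed above; it is an assumption that this matches what the ojAlgo program computes internally (it works from the SVD), and the class and method names are hypothetical.

```java
final class TruncatedCovarianceSketch {

    // Eigenvalues from the 3-variable EvD block (rounded as printed)
    static final double[] EIGENVALUES = { 42.45597, 1.357664, 0.526366 };

    // Eigenvectors from the 3-variable EvD block, one eigenvector per column
    static final double[][] EIGENVECTORS = {
        { 0.654759, -0.747649, 0.110959 },
        { 0.244157, 0.348147, 0.905086 },
        { 0.715316, 0.565522, -0.410496 }
    };

    /**
     * Rebuilds the covariance matrix from the first k components:
     * the sum over c < k of lambda_c * v_c * v_c^T.
     */
    static double[][] reconstruct(double[] eigenvalues, double[][] eigenvectors, int k) {
        int p = eigenvectors.length;
        double[][] cov = new double[p][p];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    cov[i][j] += eigenvalues[c] * eigenvectors[i][c] * eigenvectors[j][c];
                }
            }
        }
        return cov;
    }

    public static void main(String[] args) {
        for (int k = 2; k <= 3; k++) {
            System.out.println("Covariances using " + k + " components");
            for (double[] row : reconstruct(EIGENVALUES, EIGENVECTORS, k)) {
                System.out.printf("%.6f %.6f %.6f%n", row[0], row[1], row[2]);
            }
        }
    }
}
```

With k = 3 the result should agree with the original covariance matrix, and with k = 2 with the truncated one, in both cases only to roughly four decimals because the inputs are the rounded printed values. The sign of each eigenvector does not matter here, since it cancels in the outer product.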
4 Variables
class StatQuestPCA
ojAlgo
2019-05-05
Data
10 6 12 5
11 4 9 7
8 5 10 6
3 3 2.5 2
2 2.8 1.3 4
1 1 2 7
There are 4 variables and 6 samples.
Covariances
18.966667 6.486667 19.286667 3.033333
6.486667 3.126667 7.486667 -0.086667
19.286667 7.486667 22.246667 3.413333
3.033333 -0.086667 3.413333 3.766667
EvD
Eigenvalues (on the diagonal)
42.951938 0 0 0
0 3.730063 0 0
0 0 1.320309 0
0 0 0 0.104357
Eigenvectors (in the columns)
0.651053 0.001663 0.759018 0.004493
0.239553 -0.375033 -0.20981 0.8706
0.711502 -0.020939 -0.60817 -0.351361
0.111845 0.926773 -0.100005 0.344355
Relative (Variance) Importance: { 0.8928479327252333, 0.07753734201523488, 0.027445446117285287, 0.0021692791422464803 }
Data (centered)
4.166667 2.366667 5.866667 -0.166667
5.166667 0.366667 2.866667 1.833333
2.166667 1.366667 3.866667 0.833333
-2.833333 -0.633333 -3.633333 -3.166667
-3.833333 -0.833333 -4.833333 -1.166667
-4.833333 -2.633333 -4.133333 1.833333
SVD
Left-singular Vectors (in the columns)
0.507358 0.268131 0.34454 0.05478
0.388702 -0.349683 -0.746453 0.046354
0.312688 -0.042237 0.419224 -0.177092
-0.336798 0.608043 -0.197987 0.523235
-0.427491 0.156041 -0.125105 -0.766633
-0.444459 -0.640295 0.305781 0.319356
Singular values (on the diagonal)
14.654681 0 0 0
0 4.318601 0 0
0 0 2.569347 0
0 0 0 0.722346
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
0.651053 -0.001663 -0.759018 -0.004493
0.239553 0.375033 0.20981 -0.8706
0.711502 0.020939 0.60817 0.351361
0.111845 -0.926773 0.100005 -0.344355
Sum of eigenvalues/variance: 48.10666666666667 == 48.10666666666664
PC1: Variance=42.95193788363521 (89.28%%) Loadings={ 0.6510525699470171, 0.23955264672389256, 0.711502412186202, 0.11184542040771327 }
PC2: Variance=3.7300630665462307 (7.75%%) Loadings={ -0.001662974141421872, 0.3750331043127808, 0.020938613599998712, -0.9267734241156432 }
Data (transformed)
7.435167 1.157951
5.696298 -1.51014
4.58235 -0.182406
-4.935668 2.625896
-6.264743 0.67388
-6.513403 -2.76518
Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
7.435167 1.157951 0.885242 0.03957
5.696298 -1.51014 -1.917896 0.033484
4.58235 -0.182406 1.077132 -0.127922
-4.935668 2.625896 -0.508698 0.377957
-6.264743 0.67388 -0.321437 -0.553774
-6.513403 -2.76518 0.785657 0.230685
Covariances (from SVD) – compare this to what we originally calculated
18.966667 6.486667 19.286667 3.033333
6.486667 3.126667 7.486667 -0.086667
19.286667 7.486667 22.246667 3.413333
3.033333 -0.086667 3.413333 3.766667
Covariances (from SVD using only 2 components)
18.206025 6.696517 19.896302 3.133391
6.696517 2.98945 7.350117 -0.145655
19.896302 7.350117 21.745439 3.345658
3.133391 -0.145655 3.345658 3.741088
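The link between the SVD and EvD blocks, and the "Relative (Variance) Importance" line, come down to two small formulas: each eigenvalue is the squared singular value divided by (number of samples - 1), and each component's importance is its eigenvalue divided by the sum of all eigenvalues. The plain-Java sketch below applies both to the rounded singular values printed for the 4-variable case. It is an illustration with hypothetical names, not the ojAlgo code behind the output.

```java
final class VarianceShareSketch {

    // Singular values from the 4-variable SVD block (rounded as printed)
    static final double[] SINGULAR_VALUES = { 14.654681, 4.318601, 2.569347, 0.722346 };

    // Number of samples (rows) in the example data
    static final int SAMPLE_COUNT = 6;

    /** Covariance eigenvalues from singular values of the centered data: sigma^2 / (n - 1). */
    static double[] eigenvaluesFromSingularValues(double[] singularValues, int sampleCount) {
        double[] eigenvalues = new double[singularValues.length];
        for (int i = 0; i < singularValues.length; i++) {
            eigenvalues[i] = singularValues[i] * singularValues[i] / (sampleCount - 1);
        }
        return eigenvalues;
    }

    /** Share of the total variance carried by each component. */
    static double[] relativeImportance(double[] eigenvalues) {
        double total = 0.0;
        for (double value : eigenvalues) {
            total += value;
        }
        double[] shares = new double[eigenvalues.length];
        for (int i = 0; i < eigenvalues.length; i++) {
            shares[i] = eigenvalues[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        double[] eigenvalues = eigenvaluesFromSingularValues(SINGULAR_VALUES, SAMPLE_COUNT);
        double[] shares = relativeImportance(eigenvalues);
        for (int i = 0; i < eigenvalues.length; i++) {
            System.out.printf("PC%d: variance=%.6f share=%.4f%n", i + 1, eigenvalues[i], shares[i]);
        }
    }
}
```

The variances should agree with the eigenvalues in the 4-variable EvD block, and the shares with the relative importance line, to about five decimals given the rounded inputs.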