StatQuest PCA Example

StatQuest (Josh Starmer) has published a great video explaining principal component analysis (PCA). In the video he mentions that he is using Singular Value Decomposition (SVD), but he never presents that part of the maths: no singular value decomposition is explicitly calculated.

He says he’s using SVD because the example works directly with the data matrix: a table of measurements where each column represents a variable and each row represents a sample. (In his table of mice and genes the data matrix is transposed.) The alternative would be to first form the covariance matrix and then use Eigenvalue Decomposition (EvD). The two methods are equivalent: the right-singular vectors of the centered data matrix are, up to sign, the eigenvectors of the covariance matrix, and each eigenvalue equals the corresponding squared singular value divided by the number of samples minus one.
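As a minimal sketch of the covariance route, the plain-Java snippet below centers the 2-variable data set used later in this post and forms the sample covariance matrix with an n - 1 divisor. The class and method names (CovarianceSketch, center, covariance) are made up for illustration; this is not the ojAlgo code behind the console output.

```java
public class CovarianceSketch {

    /** Subtracts each column's mean, so every variable is centered on zero. */
    public static double[][] center(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[][] centered = new double[rows][cols];
        for (int j = 0; j < cols; j++) {
            double mean = 0.0;
            for (double[] row : data) {
                mean += row[j] / rows;
            }
            for (int i = 0; i < rows; i++) {
                centered[i][j] = data[i][j] - mean;
            }
        }
        return centered;
    }

    /** Sample covariance matrix: X'X / (n - 1), with X the centered data. */
    public static double[][] covariance(double[][] data) {
        double[][] x = center(data);
        int rows = x.length, cols = x[0].length;
        double[][] cov = new double[cols][cols];
        for (int a = 0; a < cols; a++) {
            for (int b = 0; b < cols; b++) {
                double sum = 0.0;
                for (int i = 0; i < rows; i++) {
                    sum += x[i][a] * x[i][b];
                }
                cov[a][b] = sum / (rows - 1);
            }
        }
        return cov;
    }

    public static void main(String[] args) {
        // Rows are samples, columns are variables (the 2-variable data set).
        double[][] data = { {10, 6}, {11, 4}, {8, 5}, {3, 3}, {2, 2.8}, {1, 1} };
        double[][] cov = covariance(data);
        System.out.println(cov[0][0] + " " + cov[0][1]);
        System.out.println(cov[1][0] + " " + cov[1][1]);
    }
}
```

Run on the 2-variable data, this should agree with the "Covariances" block of the console output to the printed precision.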

This post aims to fill in some of the gaps by presenting a program that does the maths. It also shows the connection between EvD and SVD, and demonstrates how to do PCA using ojAlgo.
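To make the EvD/SVD connection concrete for the 2-variable case, here is a small plain-Java sketch. It uses the closed-form eigendecomposition of a symmetric 2x2 matrix (not a general-purpose algorithm, and not what ojAlgo does internally) and then converts each covariance eigenvalue into the singular value it implies for the centered data matrix, sqrt(eigenvalue * (n - 1)). The class and method names are hypothetical; the input numbers are the rounded covariances from the console output.

```java
public class EvdSvdLinkSketch {

    /** Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]], largest first. */
    public static double[] eigenvalues(double a, double b, double d) {
        double mean = (a + d) / 2.0;
        double radius = Math.sqrt((a - d) * (a - d) / 4.0 + b * b);
        return new double[] { mean + radius, mean - radius };
    }

    /** Unit eigenvector of [[a, b], [b, d]] for eigenvalue lambda (assumes b != 0). */
    public static double[] eigenvector(double a, double b, double lambda) {
        // (a - lambda) * x + b * y = 0 is solved by x = b, y = lambda - a.
        double x = b, y = lambda - a;
        double norm = Math.hypot(x, y);
        return new double[] { x / norm, y / norm };
    }

    /** Singular value of the centered data matrix implied by a covariance eigenvalue. */
    public static double singularValue(double eigenvalue, int samples) {
        return Math.sqrt(eigenvalue * (samples - 1));
    }

    public static void main(String[] args) {
        // Rounded covariance entries taken from the 2-variable console output.
        double a = 18.966667, b = 6.486667, d = 3.126667;
        int samples = 6;
        for (double lambda : eigenvalues(a, b, d)) {
            double[] v = eigenvector(a, b, lambda);
            System.out.println("eigenvalue=" + lambda
                    + " eigenvector=(" + v[0] + ", " + v[1] + ")"
                    + " implied singular value=" + singularValue(lambda, samples));
        }
    }
}
```

The implied singular values can be compared with the "Singular values" block of the 2-variable run, and the eigenvectors with the right-singular vectors. Eigenvectors are only defined up to sign, so a sign flip between the two decompositions is not an error.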

Unfortunately the program doesn’t get exactly the same numbers as in the video. If you can spot what’s causing the differences, please let me and/or Josh Starmer know (depending on who made the mistake). In the video the loading scores for PC1 (2 mice version) are 0.97 and 0.242, but in our calculations they are 0.94 and 0.34. The corresponding eigenvalues also differ, which makes derived numbers like the relative importance of the principal components very different.
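For reference, the relative importance of a principal component is just its eigenvalue divided by the sum of all eigenvalues, so any difference in the eigenvalues carries straight through. A minimal plain-Java sketch of that step follows; the class and method names are hypothetical, and the input values are the rounded eigenvalues from the 2-variable console output.

```java
public class RelativeImportanceSketch {

    /** Share of the total variance explained by each eigenvalue. */
    public static double[] relativeImportance(double[] eigenvalues) {
        double total = 0.0;
        for (double value : eigenvalues) {
            total += value;
        }
        double[] shares = new double[eigenvalues.length];
        for (int i = 0; i < eigenvalues.length; i++) {
            shares[i] = eigenvalues[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        // Rounded eigenvalues from the 2-variable run.
        double[] shares = relativeImportance(new double[] { 21.284012, 0.809321 });
        for (int i = 0; i < shares.length; i++) {
            System.out.println("PC" + (i + 1) + ": " + shares[i]);
        }
    }
}
```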

Remember!

PCA will not (directly) tell you which variables are the most important. It removes noise and reduces the dimensionality of the data, which makes the data easier to work with. Most likely further analysis is needed, though that may be as simple as plotting a chart.

Example Code

Console Output

2 Variables

class StatQuestPCA
ojAlgo
2019-04-12

Data
  10   6
  11   4
   8   5
   3   3
   2 2.8
   1   1
There are 2 variables and 6 samples.

Covariances
 18.966667  6.486667
  6.486667  3.126667

EvD
Eigenvalues (on the diagonal)
 21.284012         0
         0  0.809321
Eigenvectors (in the columns)
  0.941711  0.336424
  0.336424 -0.941711

Relative (Variance) Importance: { 0.963368085822159, 0.03663191417784095 }

Data (centered)
  4.166667  2.366667
  5.166667  0.366667
  2.166667  1.366667
 -2.833333 -0.633333
 -3.833333 -0.833333
 -4.833333 -2.633333

SVD
Left-singular Vectors (in the columns)
  0.457541 -0.411087
  0.483604  0.692426
  0.242357 -0.277432
 -0.279299 -0.177362
 -0.377107 -0.250975
 -0.527095  0.424429
Singular values (on the diagonal)
 10.31601        0
        0 2.011618
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
  0.941711  0.336424
  0.336424 -0.941711

Sum of eigenvalues/variance: 22.093333333333334 == 22.093333333333344

PC1: Variance=21.284012242764245 (96.34%%) Loadings={ 0.9417106889306233, 0.3364238076501293 }
PC2: Variance=0.8093210905690993 (3.66%%) Loadings={ 0.3364238076501293, -0.9417106889306233 }

Data (transformed)
  4.719998 -0.826949
  4.988861  1.392896
  2.500152 -0.558086
 -2.881249 -0.356784
 -3.890244 -0.504866
 -5.437518   0.85379

Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
  4.719998 -0.826949
  4.988861  1.392896
  2.500152 -0.558086
 -2.881249 -0.356784
 -3.890244 -0.504866
 -5.437518   0.85379
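
The two ways of deriving the transformed data in the output above can be sketched in plain Java: multiply the centered data by the eigenvectors, or multiply the left-singular vectors by the diagonal matrix of singular values. The class and method names are hypothetical, only the first two samples are used to keep it short, and all numbers are copied (rounded) from the 2-variable console output.

```java
public class ProjectionSketch {

    /** Plain matrix product a * b. */
    public static double[][] multiply(double[][] a, double[][] b) {
        int rows = a.length, inner = b.length, cols = b[0].length;
        double[][] product = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                for (int k = 0; k < inner; k++) {
                    product[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return product;
    }

    public static void main(String[] args) {
        // First two rows of the centered data and of the left-singular vectors.
        double[][] centered = { {4.166667, 2.366667}, {5.166667, 0.366667} };
        double[][] leftSingular = { {0.457541, -0.411087}, {0.483604, 0.692426} };
        // Eigenvectors (equal to the right-singular vectors here) in the columns.
        double[][] eigenvectors = { {0.941711, 0.336424}, {0.336424, -0.941711} };
        double[][] singularValues = { {10.31601, 0}, {0, 2.011618} };

        double[][] viaEigenvectors = multiply(centered, eigenvectors);
        double[][] viaSvd = multiply(leftSingular, singularValues);
        for (int i = 0; i < viaEigenvectors.length; i++) {
            System.out.println(viaEigenvectors[i][0] + " " + viaEigenvectors[i][1]
                    + "  vs  " + viaSvd[i][0] + " " + viaSvd[i][1]);
        }
    }
}
```

Because the inputs are rounded to six decimals, the two results agree only to roughly that precision.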

3 Variables

class StatQuestPCA
ojAlgo
2019-04-12

Data
  10   6  12
  11   4   9
   8   5  10
   3   3 2.5
   2 2.8 1.3
   1   1   2
There are 3 variables and 6 samples.

Covariances
 18.966667  6.486667 19.286667
  6.486667  3.126667  7.486667
 19.286667  7.486667 22.246667

EvD
Eigenvalues (on the diagonal)
 42.45597        0        0
        0 1.357664        0
        0        0 0.526366
Eigenvectors (in the columns)
  0.654759 -0.747649  0.110959
  0.244157  0.348147  0.905086
  0.715316  0.565522 -0.410496

Relative (Variance) Importance: { 0.9575094809015773, 0.030619393636449194, 0.011871125461973569 }

Data (centered)
  4.166667  2.366667  5.866667
  5.166667  0.366667  2.866667
  2.166667  1.366667  3.866667
 -2.833333 -0.633333 -3.633333
 -3.833333 -0.833333 -4.833333
 -4.833333 -2.633333 -4.133333

SVD
Left-singular Vectors (in the columns)
  0.514936  0.393973 -0.120891
  0.379072 -0.811392   0.16742
  0.310108  0.400155  0.067738
 -0.316323 -0.060214  -0.37223
 -0.423529 -0.060447 -0.495894
 -0.464265  0.137926  0.753857
Singular values (on the diagonal)
 14.569827         0         0
         0   2.60544         0
         0         0  1.622291
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
  0.654759 -0.747649 -0.110959
  0.244157  0.348147 -0.905086
  0.715316  0.565522  0.410496

Sum of eigenvalues/variance: 44.34 == 44.34000000000004

PC1: Variance=42.455970383175966 (95.75%%) Loadings={ 0.6547591745515717, 0.24415739058194894, 0.715316427858859 }
PC2: Variance=1.3576639138401558 (3.06%%) Loadings={ -0.7476486995434165, 0.34814683061538987, 0.5655220653550999 }

Data (transformed)
  7.502525  1.026474
  5.523021 -2.114035
  4.518217   1.04258
 -4.608767 -0.156885
 -6.170737 -0.157492
 -6.764258  0.359358

Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
  7.502525  1.026474 -0.196121
  5.523021 -2.114035  0.271604
  4.518217   1.04258  0.109891
 -4.608767 -0.156885 -0.603865
 -6.170737 -0.157492 -0.804485
 -6.764258  0.359358  1.222976

4 Variables

class StatQuestPCA
ojAlgo
2019-04-12

Data
  10   6  12   5
  11   4   9   7
   8   5  10   6
   3   3 2.5   2
   2 2.8 1.3   4
   1   1   2   7
There are 4 variables and 6 samples.

Covariances
 18.966667  6.486667 19.286667  3.033333
  6.486667  3.126667  7.486667 -0.086667
 19.286667  7.486667 22.246667  3.413333
  3.033333 -0.086667  3.413333  3.766667

EvD
Eigenvalues (on the diagonal)
 42.951938         0         0         0
         0  3.730063         0         0
         0         0  1.320309         0
         0         0         0  0.104357
Eigenvectors (in the columns)
  0.651053  0.001663  0.759018  0.004493
  0.239553 -0.375033  -0.20981    0.8706
  0.711502 -0.020939  -0.60817 -0.351361
  0.111845  0.926773 -0.100005  0.344355

Relative (Variance) Importance: { 0.8928479327252333, 0.07753734201523488, 0.027445446117285287, 0.0021692791422464803 }

Data (centered)
  4.166667  2.366667  5.866667 -0.166667
  5.166667  0.366667  2.866667  1.833333
  2.166667  1.366667  3.866667  0.833333
 -2.833333 -0.633333 -3.633333 -3.166667
 -3.833333 -0.833333 -4.833333 -1.166667
 -4.833333 -2.633333 -4.133333  1.833333

SVD
Left-singular Vectors (in the columns)
  0.507358  0.268131   0.34454   0.05478
  0.388702 -0.349683 -0.746453  0.046354
  0.312688 -0.042237  0.419224 -0.177092
 -0.336798  0.608043 -0.197987  0.523235
 -0.427491  0.156041 -0.125105 -0.766633
 -0.444459 -0.640295  0.305781  0.319356
Singular values (on the diagonal)
 14.654681         0         0         0
         0  4.318601         0         0
         0         0  2.569347         0
         0         0         0  0.722346
Right-singular Vectors (in the columns) - compare these to the eigenvectors above
  0.651053 -0.001663 -0.759018 -0.004493
  0.239553  0.375033   0.20981   -0.8706
  0.711502  0.020939   0.60817  0.351361
  0.111845 -0.926773  0.100005 -0.344355

Sum of eigenvalues/variance: 48.10666666666667 == 48.10666666666664

PC1: Variance=42.95193788363521 (89.28%%) Loadings={ 0.6510525699470171, 0.23955264672389256, 0.711502412186202, 0.11184542040771327 }
PC2: Variance=3.7300630665462307 (7.75%%) Loadings={ -0.001662974141421872, 0.3750331043127808, 0.020938613599998712, -0.9267734241156432 }

Data (transformed)
  7.435167  1.157951
  5.696298  -1.51014
   4.58235 -0.182406
 -4.935668  2.625896
 -6.264743   0.67388
 -6.513403  -2.76518

Transformed data (derived another way) - compare the 2 first columns with what we just calculated above
  7.435167  1.157951  0.885242   0.03957
  5.696298  -1.51014 -1.917896  0.033484
   4.58235 -0.182406  1.077132 -0.127922
 -4.935668  2.625896 -0.508698  0.377957
 -6.264743   0.67388 -0.321437 -0.553774
 -6.513403  -2.76518  0.785657  0.230685
