# Understanding PCA deeper

Deeper than most data analysts.

## Why PCA?

Whenever I get a big data table, with features in rows and samples in columns, I like to first see how the samples cluster or distribute in the feature space in an unsupervised way. However, the feature space can be huge and cannot be visualized in our 3D world. PCA is a go-to choice for getting a low-dimensional (2D or 3D) summarization/representation of the original matrix. In a 2D view, the x axis is the first principal component (PC) of the data and the y axis is the second PC. I make a scatter plot, projecting the data points (samples) onto these two axes.

And I would color the data points according to some sample metadata, for example tissue type, experiment batch, gender, or marker mutation status. This type of visualization is usually very informative in many ways. It tells me whether the features capture the intrinsic differences/characteristics of the samples, and in QC I can directly spot a batch effect if there is one. Even better, I can investigate the loadings of the PCs to see which features contribute how much, and in which direction, to each PC.
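For concreteness, here is a minimal sketch of that workflow in base R. The objects `expr` (a numeric features-by-samples matrix) and `meta` (a data frame of sample metadata with a `tissue` column) are hypothetical placeholders I made up for illustration; swap in your own data.

```r
# Minimal sketch of the workflow above (base R).
# 'expr': features-by-samples matrix; 'meta': one row per sample with a
# 'tissue' column. Both are hypothetical placeholders.

pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)  # prcomp() wants samples in rows

# Percent of total variance explained by each PC
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Scatter plot of the samples on the first two PCs, colored by tissue type
plot(pca$x[, 1], pca$x[, 2],
     col  = as.factor(meta$tissue),
     xlab = sprintf("PC1 (%.1f%%)", var_explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", var_explained[2]))

# Loadings: which features contribute how much, and in which direction, to PC1/PC2
head(pca$rotation[order(abs(pca$rotation[, 1]), decreasing = TRUE), 1:2])
```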

\[maybe a PCA example\]

## PCA as a way to explain variance in the data

When a data analyst is asked to interpret a PCA plot, for example the one above, he or she would probably say something like: the samples cluster according to their tissue type along the first PC and according to xx along the second PC; the first PC explains xx% of the total variance and the second explains xx%, so in total we are examining xx% of the variance. We know that PCs are linear combinations of the original features, and that the first PC explains more variance than the second, which explains more than the third, and so on. But how are these linear combinations determined, and why do they "explain" variance? As you may know, the PCs are orthogonal to each other; why is that? And how many PCs are there for a given data matrix?

## PCA and the sample covariance matrix

Let's say the data matrix is $D$, with $p$ rows (features) and $n$ columns (samples). After you center the data so that each feature (row) has mean zero across the samples, you get a centered matrix $X$. The sample covariance matrix is then simply the matrix product

$${1 \over n-1}XX^T.$$

Let's ignore the scalar for now. $XX^T$ is a $p$ by $p$ symmetric matrix. Why is the covariance matrix important? Because it basically tells you the relationship between every pair of features, computed across ALL the samples.

Let $A = XX^T$. In linear algebra, a symmetric matrix has some very good properties. For one, it can be factorized as

$$A = Q\Lambda Q^T,$$

where $Q$ is an orthogonal matrix whose columns are the eigenvectors of $A$, and $\Lambda$ is a diagonal matrix with the eigenvalues of $A$ on its diagonal. Eigenvalues and eigenvectors have some very interesting properties: they capture the most fundamental characteristics of a matrix. And why does that matter here? Because the eigenvectors of the covariance matrix are exactly the directions of maximal variance. For any unit vector $w$, the variance of the samples projected onto $w$ is ${1 \over n-1}w^TXX^Tw = {1 \over n-1}w^TAw$ (the projections already have mean zero because $X$ is centered). The unit vector that maximizes this quantity is the eigenvector of $A$ with the largest eigenvalue, and the maximized variance is that eigenvalue (divided by $n-1$); the best direction orthogonal to the first is the second eigenvector, and so on. So the PCs are the eigenvectors of the covariance matrix, the eigenvalues are the variances they "explain", the orthogonality of the PCs is just the orthogonality of the eigenvectors of a symmetric matrix, and there are at most $p$ PCs, of which at most $\mathrm{rank}(X) \le \min(p, n-1)$ carry nonzero variance.

OK, now we have explored the calculation behind PCA. But how is it actually carried out in a modern computational program? Singular value decomposition!

## PCA and singular value decomposition

It is computationally expensive to calculate the covariance matrix and then eigendecompose it, especially when the dimensions are large. Singular value decomposition (SVD) comes in handy because there are fast and efficient algorithms for it. Better still, SVD can be performed on any matrix: it does not need to be symmetric, or even square. So for our centered data matrix $X$, we can decompose it into this form:

$$X = U\Sigma V^T$$

Here $U$ and $V$ are two orthogonal matrices, and $\Sigma$ is a diagonal matrix. This is exactly why math is elegant: you can view the action of any matrix on a vector as a rotation ($V^T$), followed by a stretch ($\Sigma$), followed by another rotation ($U$).

Then what about our covariance matrix?

$$XX^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma V^TV\Sigma^TU^T = U\Sigma^2U^T$$

Note that the last step holds because $V$ is orthogonal, so $V^TV = I$ (writing $\Sigma^2$ for $\Sigma\Sigma^T$). And guess what, we already know:

$$A = XX^T = Q\Lambda Q^T$$

So now we have:

$$U\Sigma^2U^T = Q\Lambda Q^T$$

Naturally, $U = Q$ and $\Sigma^2 = \Lambda$ (up to the sign and ordering of the columns). This derivation tells us that if we want the eigenvalues and eigenvectors of the covariance matrix, we just need to run an SVD on the centered data matrix. Isn't that wonderful?
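To make this less abstract, here is a small self-contained check in base R on a random toy matrix (the dimensions and variable names are my own, purely for illustration). It verifies that ${1 \over n-1}XX^T$ is what `cov()` computes on the samples-by-features matrix, that `eigen()` returns the factorization $A = Q\Lambda Q^T$ with an orthogonal $Q$, and that the eigenvalues and eigenvectors of $A$ are indeed the squared singular values and left singular vectors of $X$.

```r
set.seed(1)
p <- 4; n <- 10

# Toy data: p features (rows), n samples (columns), then center each feature (row)
D <- matrix(rnorm(p * n), nrow = p)
X <- D - rowMeans(D)

# Sample covariance matrix of the features, two ways
A <- X %*% t(X)                          # the unscaled version used in the text
max(abs(A / (n - 1) - cov(t(X))))        # ~0: matches cov() on samples-by-features

# Eigendecomposition A = Q Lambda Q^T
eig <- eigen(A)
Q   <- eig$vectors
Lam <- diag(eig$values)
max(abs(A - Q %*% Lam %*% t(Q)))         # ~0: the factorization reconstructs A
max(abs(t(Q) %*% Q - diag(p)))           # ~0: Q is orthogonal

# SVD of the centered data matrix: X = U Sigma V^T
sv <- svd(X)
rbind(eigenvalues = eig$values,          # eigenvalues of X X^T ...
      d_squared   = sv$d^2)              # ... equal the squared singular values of X
max(abs(abs(Q) - abs(sv$u)))             # ~0: eigenvectors = left singular vectors
                                         #     (up to sign flips)
```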
## Glue together with some experiment in R

You are probably familiar with the `prcomp` function in R; `svd` is the base R function for computing an SVD. Let's do a simple experiment to see the relationship between SVD and PCA.

Suppose we have the following simple matrix $X$, with 5 features (rows) and 3 samples (columns):

$$X = \begin{bmatrix} 3 & 4 & 2 \\ 4 & 6 & 1 \\ 2 & 5 & 7 \\ 6 & 7 & 2 \\ 2 & 1 & 4 \end{bmatrix}$$
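Here is one way that experiment might look in base R (the variable names are mine). Note that with only 3 samples the centered matrix has rank 2, so the third singular value is numerically zero and the third component is arbitrary; the remaining components may differ between the two routes by a sign flip.

```r
# The toy matrix above: 5 features (rows) x 3 samples (columns)
X <- matrix(c(3, 4, 2,
              4, 6, 1,
              2, 5, 7,
              6, 7, 2,
              2, 1, 4),
            nrow = 5, byrow = TRUE)

# prcomp() expects samples in rows and features in columns, hence t(X).
# center = TRUE removes each feature's mean, i.e. the centering step above.
pca <- prcomp(t(X), center = TRUE, scale. = FALSE)

# The same thing "by hand": center each feature (row), then SVD the centered matrix
Xc <- X - rowMeans(X)
sv <- svd(Xc)                  # Xc = U Sigma V^T

# Principal directions (loadings): prcomp's rotation vs the left singular vectors
pca$rotation
sv$u

# Variance explained by each PC: sdev^2 vs squared singular values / (n - 1)
pca$sdev^2
sv$d^2 / (ncol(X) - 1)

# Sample coordinates on the PCs (scores): pca$x vs V Sigma
pca$x
sv$v %*% diag(sv$d)
```

The rotation matrix from `prcomp`, the squared singular values divided by $n-1$, and the scores should line up with the hand-rolled SVD quantities, which is exactly the $U = Q$ and $\Sigma^2 = \Lambda$ correspondence derived above.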