Data obsessed

What’s Really Going On In Principal Component Analysis (PCA)?

PCA seems to have gained sufficient popularity that many “I know just enough statistics to be really dangerous” types are using it. I have no real problem with this: it’s a great technique that has a lot of applications. But on a fairly regular basis I stumble across questions about PCA on CrossValidated that would only be asked by someone who doesn’t understand at a fundamental level what PCA is doing. Often the situation is appropriate for PCA (usually dimensionality reduction), but the questions about how to use or interpret the output indicate that the person really doesn’t know what’s going on.

This article provides the best explanation of PCA I’ve come across, illustrating the linear algebra in a very intuitive geometrical context. I’ll provide a condensed version of the geometric explanation below without getting into the linear algebra.

Consider a sample of 50 points generated from $y=x + noise$. The first principal component will lie roughly along the line $y=x$, and the second component, orthogonal to the first, will lie roughly along the line $y=-x$, as shown below.
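
To make the setup concrete, here is a minimal NumPy sketch of it. The seed, the noise scale of 0.3, and all variable names are illustrative assumptions, not values from the original example. It generates the sample and reads the principal components off the covariance matrix as its eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed, for repeatability
n = 50
x = rng.normal(size=n)
y = x + rng.normal(scale=0.3, size=n)     # y = x + noise (0.3 is an assumed scale)
data = np.column_stack([x, y])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put the
# direction with the largest variance first
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]
slope1 = pc1[1] / pc1[0]                  # close to +1, i.e. near y = x
slope2 = pc2[1] / pc2[0]                  # close to -1, i.e. near y = -x
print(slope1, slope2, pc1 @ pc2)
```

The slopes come out near $1$ and $-1$ rather than exactly on them: the noise adds a little extra variance to $y$, which tilts the first component slightly. The dot product of the two components is zero up to rounding, which is the orthogonality the plot's aspect ratio hides.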

The aspect ratio messes it up a little, but take my word for it that the components are orthogonal. Applying PCA will rotate our data so the components become the $x$ and $y$ axes:

The data before the transformation are circles, the data after are crosses. In this particular example, the data weren’t rotated so much as they were flipped across a steep line, roughly $y=-2x$, but we could have just as easily inverted the y-axis to make this truly a rotation without loss of generality as described here.
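
Whether the transformation is a flip or a rotation comes down to the arbitrary signs of the components, and the determinant of the component matrix tells them apart. The sketch below uses a hypothetical, idealized matrix whose components lie exactly along $y=x$ and $y=-x$; the signs are chosen for illustration and are not taken from the original output.

```python
import numpy as np

# Hypothetical component matrix (columns are PC1 and PC2) for data
# scattered around y = x; the signs of the columns are arbitrary.
s = np.sqrt(0.5)
loadings = np.array([[-s, -s],
                     [-s,  s]])

det_before = np.linalg.det(loadings)      # close to -1: a reflection

# A 2x2 reflection matrix is symmetric; its eigenvector with eigenvalue +1
# points along the mirror line. eigh sorts eigenvalues ascending (-1, +1).
_, vecs = np.linalg.eigh(loadings)
mirror = vecs[:, 1]
mirror_slope = mirror[1] / mirror[0]      # -(1 + sqrt(2)), about -2.41

# Flipping the sign of one component is an equally valid PCA solution
flipped = loadings.copy()
flipped[:, 1] *= -1
det_after = np.linalg.det(flipped)        # close to +1: a pure rotation
print(det_before, mirror_slope, det_after)
```

In this idealized case the mirror line has slope $-(1+\sqrt{2}) \approx -2.41$, so a flip across a steep, negatively sloped line is what this sign convention produces.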

The bulk of the variance, i.e. the information in the data, is spread along the first principal component (which is represented by the $x$-axis after we have transformed the data). There’s a little variance along the second component (now the $y$-axis), but we can drop this component entirely without significant loss of information. So to collapse this from two dimensions into one, we let the projection of the data onto the first principal component completely describe our data.
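
The split of variance between the two components can be checked numerically. This is a sketch under assumed settings (seed 0, noise scale 0.3, illustrative variable names), using the singular value decomposition of the centered data rather than any particular PCA routine.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
x = rng.normal(size=50)
data = np.column_stack([x, x + rng.normal(scale=0.3, size=50)])
centered = data - data.mean(axis=0)

# Rows of vt are the principal directions, largest variance first
_, sing, vt = np.linalg.svd(centered, full_matrices=False)
variances = sing**2 / (len(centered) - 1)
explained = variances / variances.sum()   # share of variance per component

scores = centered @ vt.T                  # data in principal-component coordinates
reduced = scores[:, 0]                    # keep only the first component: 2D -> 1D
print(explained, reduced.shape)
```

With noise this small, nearly all of the variance lands on the first component, and `reduced` is the one-dimensional description the text refers to.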

We can partially recover our original data by rotating (ok, projecting) it back onto the original axes.

The dark blue points are the “recovered” data, whereas the empty points are the original data. As you can see, we have lost some of the information from the original data, specifically the variance in the direction of the second principal component. But for many purposes, this compressed description (using the projection along the first principal component) may suit our needs.
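
The "recovery" step and the information it loses can be sketched as follows, again with assumed settings (seed 0, noise scale 0.3) and illustrative names. The recovered points all sit on the line through the mean along the first component, and the variance that goes missing is exactly the variance along the second component.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary seed
x = rng.normal(size=50)
data = np.column_stack([x, x + rng.normal(scale=0.3, size=50)])
mean = data.mean(axis=0)
centered = data - mean

_, sing, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

scores1 = centered @ pc1                        # 1D compressed description
recovered = np.outer(scores1, pc1) + mean       # back in the original x/y axes

residual = data - recovered                     # what the compression threw away
lost = (residual**2).sum() / (len(data) - 1)
second_variance = sing[1]**2 / (len(data) - 1)  # variance along the second component
print(lost, second_variance)
```

The two printed numbers agree up to rounding, and the residual has no part along `pc1`: everything lost lies in the direction of the second principal component.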

Here’s the code I used to generate this example in case you want to replicate it yourself. If you reduce the variance of the noise component on the second line, the amount of information lost by the PCA transformation will decrease as well because the data will converge onto the first principal component:
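
What follows is not the original script but a minimal NumPy stand-in for the same experiment. The helper `lost_fraction`, the seed, and the three noise levels are hypothetical choices for illustration. It measures the share of variance that falls on the second component as the noise shrinks.

```python
import numpy as np

def lost_fraction(noise_sd, n=50, seed=0):
    """Share of total variance that lies on the second principal component."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x + rng.normal(scale=noise_sd, size=n)   # y = x + noise
    centered = np.column_stack([x, y])
    centered = centered - centered.mean(axis=0)
    sing = np.linalg.svd(centered, compute_uv=False)
    variances = sing**2                          # proportional to component variances
    return variances[1] / variances.sum()

# Assumed noise levels, from large to small
fractions = {sd: lost_fraction(sd) for sd in (1.0, 0.3, 0.05)}
for sd, frac in fractions.items():
    print(f"noise sd {sd}: fraction lost by keeping one component = {frac:.4f}")
```

As the noise standard deviation drops, the fraction shrinks toward zero, which is the convergence onto the first principal component described above.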

The bulk of this post was scavenged from a response I provided on CrossValidated.