PCA / FA tricky preprocessing

Introduction

I have stumbled across a tricky point in the preprocessing of data. The most relevant post is probably this one of April 7. Rather than lecture, let me ask and answer some questions. The fundamental question is:
Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
The answer is yes.

Suppose we have a data matrix of full rank. Suppose the matrix is a typical one, in having more rows than columns. That the matrix be of full rank is equivalent to: the columns are linearly independent.

There is a serious asymmetry here: not all of the rows are linearly independent. With more rows than columns, there cannot be more independent rows than there are columns.

We might as well imagine some specific numbers. Suppose we have 3 columns and 5 rows. Since we assume it is of full rank, the rank of the matrix is 3.

The term “centering” means “to make zero-mean.” A centered column is one with zero mean (equivalently, zero sum); a centered row is a row whose mean (hence sum) is zero.

What happens to the rank of the data matrix if we only row-center the data?
The rank decreases by 1.

Visualize the matrix as 3 column-vectors c1, c2, c3 of length 5. That the row sums (= row means) are zero says that c1 + c2 + c3 = 0, so the three columns are linearly dependent and the matrix is of rank 2.

What happens to the rank of the data matrix if we only column-center the data?
Nothing.

Visualize the matrix as 5 row-vectors r1, r2, r3, r4, r5 of length 3. We have more than 3 vectors of length 3, so they span a 3D space, but no larger. Adding the constraint r1 + r2 + r3 + r4 + r5 = 0 imposes, in general, no additional restriction. (If they had somehow spanned a 5D space, it would, but they can’t.)
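As a quick numerical check of the last two answers, here is a minimal Python sketch. The 5×3 matrix `X` is made up for illustration (it is not data from any of these posts), and `rank`, `row_center`, and `col_center` are hypothetical helper names. Exact fractions are used so that the rank is not blurred by rounding.

```python
from fractions import Fraction

# Made-up 5x3 data matrix of full rank (an assumption, for illustration only).
X = [[1, 2, 4],
     [3, 1, 5],
     [2, 7, 1],
     [6, 3, 2],
     [4, 4, 9]]

def rank(matrix):
    """Rank by exact Gaussian elimination on Fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        # Look for a row at or below r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def row_center(matrix):
    """Subtract each row's mean from that row."""
    out = []
    for row in matrix:
        mean = Fraction(sum(row)) / len(row)
        out.append([x - mean for x in row])
    return out

def col_center(matrix):
    """Subtract each column's mean from that column."""
    return transpose(row_center(transpose(matrix)))

print(rank(X))              # 3: full rank
print(rank(row_center(X)))  # 2: the columns now sum to the zero vector
print(rank(col_center(X)))  # 3: unchanged for this matrix
```

Column-centering can lower the rank in special cases (when a constant vector already lies in the span of the columns), which is why the answer above says "in general".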

What happens to the rank of the data matrix if we only set the row sums to 1 (or 100), i.e. to a nonzero constant value?
Nothing.

All we’ve done is assert that c1 + c2 + c3 = 1, where 1 is the constant vector of 1s. These 4 vectors (c1, c2, c3, and 1) are linearly dependent, but the original 3 are not.

But. It has another potential effect, probably undesirable.

If we were to use the resulting data matrix for a regression, we would find ourselves in trouble. We almost always want to adjoin a column of 1s to the data matrix, in order to include a constant term in the regression equation. We just got through saying that the combination c1, c2, c3, 1 is linearly dependent: the regression will explode with a singular matrix. Once we have set the row sums, we would have to drop a column of data in order to include a constant term. This is the same problem we face in regression when we define indicator (dummy) variables.
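The regression trouble can be seen in a small sketch. It assumes a made-up 5×3 matrix `X` (not data from these posts) and a hypothetical exact `rank` helper. The rows are scaled to sum to 1 and a column of 1s is adjoined to both the raw and the scaled data.

```python
from fractions import Fraction

# Made-up 5x3 data matrix of full rank (an assumption, for illustration only).
X = [[1, 2, 4],
     [3, 1, 5],
     [2, 7, 1],
     [6, 3, 2],
     [4, 4, 9]]

def rank(matrix):
    """Rank by exact Gaussian elimination on Fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

# Scale each row so that it sums to 1 (every row sum of X is nonzero).
S = [[Fraction(x, sum(row)) for x in row] for row in X]

# Adjoin a column of 1s, as one would for a constant term in a regression.
design_raw = [row + [1] for row in X]
design_scaled = [row + [1] for row in S]

print(rank(S))              # 3: setting the row sums to 1 kept full rank
print(rank(design_raw))     # 4: raw data plus a constant column is fine
print(rank(design_scaled))  # 3: four columns, but only rank 3
```

With four columns and rank 3, the normal equations for the scaled data would be singular, which is the failure described above.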

What happens to the rank of the data matrix if we only set the column sums to 1 or some other nonzero constant value? (I’ve never seen it done, but why not, at least in principle?)
Nothing. (It even has no effect on regression.)

First Summary

So, there were 4 cases: whether we set row or column sum to a constant value, and whether that constant was zero or not. Of those 4 cases, only one (row sum = 0) reduces the rank of the data matrix.

What I was unclear about – or downright wrong about – on April 7 was in saying that we would lose rank by setting the row sums to 1 or 100 etc. Not so; none of the other 3 cases reduce the rank of the matrix.

Is that it?
No. All those questions were about one single operation. What if we combine operations?

Can’t we center both the rows and the columns?
Yes. We saw here on April 13 that we could doubly-center the data.

Does that mean we lose rank?
Yes, because we are centering the rows. Centering the columns is irrelevant.

Can we set the row sums to 1 and then center the columns?
Yes.

But something else will happen. Centering the columns is no longer irrelevant.

Didn’t we see that if the row sums were constant, and then we centered the columns, the row sums went to zero?
Yes. It may seem weird, but it’s true.

That’s an easy way to get doubly-centered data using Mathematica®. Scale the rows so they each sum to 1, then center the columns. Voilà, the row sums will change to zero.

Any time you start messing with row and column sums or means, remember the grand mean, too: the average of all the data. It is also both the average row mean, and the average column mean. If the columns all have zero mean, then the grand mean is zero. Then the mean of the rows (edit: i.e. of the row means) must be zero. The challenge is to show that the rows still have a common sum (edit: equivalently, a common mean); then that common mean must equal the mean of the row means, i.e. zero.
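One way to meet that challenge: if every row sums to c, then the column means also add up to c, so subtracting the column means lowers every row sum by the same amount c, leaving zero. The Python sketch below checks this on a made-up 5×3 matrix (an assumption, not data from these posts); `rank` and `col_center` are hypothetical helper names.

```python
from fractions import Fraction

# Made-up 5x3 data matrix of full rank (an assumption, for illustration only).
X = [[1, 2, 4],
     [3, 1, 5],
     [2, 7, 1],
     [6, 3, 2],
     [4, 4, 9]]

def rank(matrix):
    """Rank by exact Gaussian elimination on Fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def col_center(matrix):
    """Subtract each column's mean from that column."""
    n = len(matrix)
    means = [Fraction(sum(col)) / n for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

# Step 1: scale each row so that it sums to 1.
S = [[Fraction(x, sum(row)) for x in row] for row in X]

# Step 2: center the columns.
C = col_center(S)

row_sums = [sum(row) for row in C]
print(all(s == 0 for s in row_sums))  # True: the row sums were reset to zero
print(rank(S), rank(C))               # 3 2: centering the columns cost one rank
```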

If we set the row sums, say to 1, and then column-center the data, do we lose rank?
Yes.

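Column-centering the data after setting the row sums will reset the row sums to zero, and that loses rank. Standardizing loses rank too: it centers the columns and then rescales each one, so the row sums are generally no longer zero, but the columns remain linearly dependent. If it is appropriate, however, that the data be column-centered, so be it. I do not believe that reduction of rank is sufficient grounds for not centering the data.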

Second summary

The apparently innocuous act of column-centering can lose rank, if we apply it to nonzero-constant-row-sum data. We can do it, and sometimes we should do it, but we should expect to lose rank when we do. But we’re losing rank because we first set the row sums, even though to a nonzero constant.

I had previously said I would favor doubly-centered data. Now I am inclined not to do so without cause, simply because it reduces the rank of the matrix.

Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
Hell yes. In particular, we might find that we effectively centered the data without realizing it.

Suppose we set the row sums of our data to 1, say, and then we decide that rather than do an SVD of the new matrix, we will do an eigendecomposition of the covariance (or correlation) matrix.

Oops. Computing either the covariance or correlation matrix is tantamount to column-centering the data. We saw that here on March 10.

Goodbye full rank.
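To see the loss of rank without ever centering anything by hand, here is a minimal Python sketch. It assumes a made-up 5×3 matrix (not data from these posts), scales the rows to sum to 1, and builds the sample covariance matrix with the N-1 divisor; `rank` and `covariance` are hypothetical helper names.

```python
from fractions import Fraction

# Made-up 5x3 data matrix of full rank (an assumption, for illustration only).
X = [[1, 2, 4],
     [3, 1, 5],
     [2, 7, 1],
     [6, 3, 2],
     [4, 4, 9]]

def rank(matrix):
    """Rank by exact Gaussian elimination on Fractions."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def covariance(matrix):
    """Sample covariance matrix of the columns (divisor N-1)."""
    n, p = len(matrix), len(matrix[0])
    means = [Fraction(sum(col)) / n for col in zip(*matrix)]
    # The centering happens here, inside the covariance computation.
    dev = [[x - m for x, m in zip(row, means)] for row in matrix]
    return [[sum(dev[i][j] * dev[i][k] for i in range(n)) / (n - 1)
             for k in range(p)] for j in range(p)]

# Rows scaled to sum to 1: still a full-rank data matrix.
S = [[Fraction(x, sum(row)) for x in row] for row in X]
cov = covariance(S)

print(rank(S))    # 3
print(rank(cov))  # 2: the 3x3 covariance matrix is singular
print(all(sum(row) == 0 for row in cov))  # True: each row of cov sums to zero
```

The zero row sums of the covariance matrix say that the vector (1, 1, 1) is an eigenvector with eigenvalue zero, which is the lost dimension.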

Third summary

Suppose we have data for which a constant nonzero row sum is appropriate. Suppose we choose to do an eigendecomposition of the covariance (or correlation) matrix.

Then we lose rank. Constructing either of those smaller matrices implicitly centers the columns, and that in turn implicitly changes the row sums of the centered data to zero (the correlation matrix also rescales the columns, which leaves them linearly dependent). This happens even if we never explicitly compute the transformed data.

The clearest path in such a case might well be as follows. First set the row sums to 1 or whatever nonzero. Then column-center or standardize the data. We will see the rank drop by 1 at that point. Then do an SVD. The only advantage to doing it this way, rather than via the covariance or correlation matrices, is that we see when we lose full rank.

If we live in a universe where data is supposed to be centered, or where the covariance or correlation matrix is appropriate, then we will lose rank if, in addition, the data had constant row sums.

Final summary

0. We could set row or column sums to zero or to a nonzero constant.
1. Setting row sums to zero is the only one of the four cases that will reduce rank.
2. Column-centering applied to data with constant row sums leaves us still with constant row sums but the sums have been changed to zero. This gives us (1) and loss of rank.
3. An accidental way to achieve (2) and thus (1) is to analyze the covariance (or correlation) matrix of constant-row-sum data.

It’s not a problem if you know what happened and why.

PCA / FA. “Preprocessing”

When I moved beyond the first couple of books, I was bewildered by the huge number of alternatives for PCA / FA. I think my final count was that there were 288 different ways of doing it. It was so bad that I put together a form so that whenever I read a new example I could check off boxes for the choices made. As I’ve gotten more experience, I no longer need that checklist.

A lot of those choices pertained to the starting point. If the analysis began with an eigendecomposition, well, it could have been applied to the correlation matrix, or to the covariance matrix – oh, and one text used N instead of N-1 to compute sample variances.

Or an eigendecomposition could have been applied to X^T X or to X X^T or to both… but X itself could have been raw data, centered data, doubly-centered data, standardized data, or small standard deviates. Oh, and X could have observations in rows (Jolliffe, Davis, and I) or in columns (Harman). Oh boy.

Or we could have done an SVD of X, where X itself could have been raw data, centered data, doubly-centered data, standardized data, or small standard deviates, with observations in rows or columns. Yikes!

(Not to mention that the data could very likely have been manipulated even before these possible transformations.)

Then I decided that all those choices were pre-processing, not really part of PCA / FA. Actually, I decided that I must have been careless, and missed the point when everyone made it, that one had to decide what pre-processing to do before starting PCA / FA. I was kicking myself a bit.

When I sat down to summarize the material, I discovered that most people seem to prefer a specific form of pre-processing, although most do, in fact, describe alternatives. (It felt like I had to look hard to see their alternatives.) I think a description of PCA / FA should not restrict itself to a specific choice of pre-processing. I’m not ready to summarize PCA / FA for my own purposes – we have to look at two more examples, at the very least – but I am prepared to say that the construction of X, pre-processing the data, should be viewed as a separate activity, possibly requiring its own justification in one’s report. I’d certainly like to see more discussion of it in the texts.

The social sciences, geology, chemistry, and oceanography all specify different constructions for X. The social sciences (and Jolliffe and Harman) seem to favor the correlation matrix (which is tantamount to using standardized data). Geology favors centered data (which is tantamount to using the covariance matrix). Chemistry, however, says we should use the raw data (which takes us into uncharted waters). And oceanography says we should remove a linear trend (or more) from the data!

These are serious decisions to make. Computing an SVD or an eigendecomposition is pretty easy nowadays, a lot easier than it was when PCA / FA were invented. Now that the computations are so easy, we can afford to pre-process the data in more than one way. But I’d still like to see more discussion of one choice over another.

And I can’t settle it. See what is customary in your own field. Talk to your colleagues. Try more than one form of preprocessing.

We’ve already seen that for Davis’ (admittedly artificial) example, an eigendecomposition of the correlation matrix did not lead to new data that was manifestly 2D (as we saw in an earlier post). In a sense the correlation matrix was not as effective at showing the 2D dimensionality of the data as the covariance matrix was (strictly speaking, as the SVD of the centered data was and as the covariance matrix would have been).

I see four major possibilities for preprocessing, but these are not the only possibilities:

  1. raw data,
  2. centered data,
  3. doubly-centered data,
  4. and standardized data.

Raw data and doubly-centered data are symmetric choices: we could view a Q-mode analysis of X as an R-mode analysis of X^T.

Maybe I should be more precise there. My working definition of an R-mode analysis of X is: an eigendecomposition of X^T X; my working definition of a Q-mode analysis of X is: an eigendecomposition of X X^T. We accomplish them simultaneously by looking at the SVD of X.
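Why the SVD delivers both at once can be written out in a few lines. This is a sketch in assumed notation: the SVD is written X = u Σ v^T, where Σ is a symbol introduced only for this sketch and holds the singular values σ_i on its diagonal, and the columns of u and of v are orthonormal.

```latex
\begin{align*}
X     &= u\,\Sigma\,v^T, \qquad u^T u = I, \quad v^T v = I \\
X^T X &= v\,\Sigma^T u^T u\,\Sigma\,v^T = v\,(\Sigma^T \Sigma)\,v^T \\
X X^T &= u\,\Sigma\,v^T v\,\Sigma^T u^T = u\,(\Sigma\,\Sigma^T)\,u^T
\end{align*}
```

So the columns of v are eigenvectors of X^T X (R-mode), the columns of u are eigenvectors of X X^T (Q-mode), and the two share the same nonzero eigenvalues σ_i^2.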

But they would be two distinct analyses if we required, for example, standardized columns for X^T X and standardized rows for X X^T. The X’s themselves would be different. This troubles me at present. OTOH, I have not seen Q-mode analysis carried out by its proponents, so maybe I’m troubled by shadows rather than substance.

One of my criteria for categorizing choices as major is: when do we affect the eigenvectors? The eigenvalues, by contrast, are all too easy to change: X^T X of centered data has the same eigenvectors as the covariance matrix of X, but the eigenvalues differ by a common scale factor. I don’t see that as a major difference. More and more, I am inclined to convert the eigenvalues to percentages just so I don’t have to worry about scale factors between slightly different computational choices.
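The scale factor can be made explicit. As a sketch, assume X is column-centered with N observations in rows, and that the covariance matrix C uses the N-1 divisor (a text that divides by N would simply replace N-1 by N below).

```latex
\begin{align*}
C &= \frac{1}{N-1}\,X^T X \\
X^T X\,v = \lambda\,v \;&\Longrightarrow\; C\,v = \frac{\lambda}{N-1}\,v \\
\frac{\lambda_i/(N-1)}{\sum_j \lambda_j/(N-1)} &= \frac{\lambda_i}{\sum_j \lambda_j}
\end{align*}
```

The eigenvectors are unchanged, every eigenvalue is divided by the same N-1, and the percentages in the last line are identical for the two matrices.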

Then we have two major choices for executing the PCA / FA:

  1. an eigendecomposition of some square matrix or matrices,
  2. or an SVD of the data matrix X.

There may be some relationships between the SVD and the eigendecomposition, depending on what square matrix or matrices we choose. We have a lot of freedom, perhaps too much, in our choice of square matrices.

Whichever we choose, however, we will end up with some eigenvalues and eigenvectors, or singular values (the principal values, as I call them here) and (technically) left and right singular vectors, to give the columns of u and v of the SVD their proper names. For our purposes, where X^T X and X X^T are lurking about if not standing in full sunlight, it is convenient to call the u and v eigenvector matrices even when they come from the SVD.

Frankly, when I’m reading a new example, the first thing I want to find is the eigenvector matrix; I’ll go so far as to say, if there isn’t one somewhere, it isn’t PCA / FA. “There must be a pony.”

As a computational consideration, the eigendecomposition may or may not yield an orthogonal eigenvector matrix; and the SVD may or may not yield orthogonal u and v. The eigendecomposition should, however, yield orthogonal eigenvectors even if they need to be normalized to length 1; conversely, the SVD should yield sets of orthonormal vectors, even if there aren’t enough of them to make square (orthogonal) matrices.

To rephrase that, the eigendecomposition should always yield a square matrix, whose columns may need to be normalized; the SVD should always yield orthonormal vectors, in matrices that may not be square.

Nevertheless, in principle we could have orthogonal matrices of eigenvectors from either the eigendecomposition or the SVD, and in practice, I do. And I keep them close to hand, in addition to doing whatever people want me to do to them.

Oh, we do some post-processing, too.

  1. Leave the orthonormal eigenvectors alone?
  2. Weight them by the square roots of their eigenvalues?
  3. How about by the inverse square roots of their eigenvalues?
  4. I may even be able to find an example where they were weighted by the eigenvalues!

(And of course, if we use the SVD (me raises hand) the principal values correspond to the square roots of the eigenvalues.)

Or, if we did want unique eigenvectors – at least from real data – we could scale each eigenvector so its largest component was +1. For artificial examples, that may not be unique, but for real data, it seems unlikely that any eigenvector would have two largest components of equal magnitude and opposite sign. (That’s when this prescription would fail to work.) In any case, it is a prescription I have encountered.
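These post-processing options are all simple rescalings, as the following Python sketch suggests. The eigenvalues and the 2×2 orthonormal eigenvector matrix `V` are hypothetical numbers chosen for illustration, and `scale_columns` and `scale_largest_to_one` are made-up helper names, not functions from any package.

```python
import math

# Hypothetical orthonormal eigenvectors (stored as columns) and eigenvalues.
eigenvalues = [4.0, 1.0]
V = [[0.6, -0.8],
     [0.8,  0.6]]

def scale_columns(matrix, factors):
    """Multiply column j of matrix by factors[j]."""
    return [[x * f for x, f in zip(row, factors)] for row in matrix]

# Option 2: weight each eigenvector by the square root of its eigenvalue.
weighted = scale_columns(V, [math.sqrt(lam) for lam in eigenvalues])

# Option 3: weight each eigenvector by the inverse square root instead.
reciprocal = scale_columns(V, [1 / math.sqrt(lam) for lam in eigenvalues])

def scale_largest_to_one(vector):
    """Divide by the component of largest magnitude, making it +1."""
    pivot = max(vector, key=abs)
    if pivot == 0:
        raise ValueError("cannot scale the zero vector")
    return [x / pivot for x in vector]

# An eigenvector and its negative are scaled to the same representative.
print(scale_largest_to_one([3, -4]))   # [-0.75, 1.0]
print(scale_largest_to_one([-3, 4]))   # [-0.75, 1.0]
```

Dividing by the signed component, rather than by its absolute value, is what removes the sign ambiguity between an eigenvector and its negative.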

Now I think I’m ready for chemistry. I hope you are, too.