PCA / FA tricky preprocessing


I have stumbled across a tricky point in the preprocessing of data. The most relevant post is probably the one of April 7. Rather than lecture, let me ask and answer some questions. The fundamental question is:
Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
The answer is yes.

Suppose we have a data matrix of full rank. Suppose the matrix is a typical one, in having more rows than columns. That the matrix be of full rank is equivalent to: the columns are linearly independent.

There is a serious asymmetry here: not all of the rows are linearly independent.

We might as well imagine some specific numbers. Suppose we have 3 columns and 5 rows. Since we assume it is of full rank, the rank of the matrix is 3.

The term “centering” means to “make zero-mean.” A centered column is one with zero mean = zero sum; a centered row is a row whose mean (hence sum) is zero.

What happens to the rank of the data matrix if we only row-center the data?
The rank decreases by 1.

Visualize the matrix as 3 column-vectors c1, c2, c3 of length 5. That the row sums (= row means) are zero says that c1 + c2 + c3 = 0, so the three columns are linearly dependent and the matrix is of rank 2.
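Here is a quick numerical check of that claim, as a minimal sketch in Python with NumPy. The 5×3 matrix X is invented for illustration (it is not data from any of these posts), and the explicit tolerance passed to matrix_rank is my own assumption.

```python
import numpy as np

# Hypothetical 5x3 data matrix of full rank (values invented for illustration).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Row-center: subtract each row's mean from that row.
R = X - X.mean(axis=1, keepdims=True)

# Explicit tolerance (an assumption) so rounding noise is not counted as rank.
rank_before = np.linalg.matrix_rank(X, tol=1e-10)
rank_after = np.linalg.matrix_rank(R, tol=1e-10)

# Summing across each row is the same as adding the columns: c1 + c2 + c3.
column_total = R.sum(axis=1)

print("rank before:", rank_before, "rank after:", rank_after)
print("c1 + c2 + c3 is zero:", np.allclose(column_total, 0.0))
```

For this particular matrix the rank goes from 3 to 2, and the three centered columns add up to the zero vector.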

What happens to the rank of the data matrix if we only column-center the data?
Nothing, in general.

Visualize the matrix as 5 row-vectors r1, r2, r3, r4, r5 of length 3. We have more than 3 vectors of length 3, so they span a 3D space, but no larger. Adding the constraint r1 + r2 + r3 + r4 + r5 = 0 imposes, in general, no additional restriction. (If they had somehow spanned a 5D space, it would, but they can’t.)
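The same kind of check for column-centering, again as a sketch with an invented 5×3 matrix and an assumed rank tolerance. For this matrix the vector of 1s is not a combination of the columns, which is the "in general" situation.

```python
import numpy as np

# Hypothetical 5x3 data matrix of full rank (values invented for illustration).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Column-center: subtract each column's mean from that column.
C = X - X.mean(axis=0)

rank_before = np.linalg.matrix_rank(X, tol=1e-10)
rank_after = np.linalg.matrix_rank(C, tol=1e-10)

# Summing down each column is the same as adding the rows: r1 + ... + r5.
row_total = C.sum(axis=0)

print("rank before:", rank_before, "rank after:", rank_after)
print("r1 + ... + r5 is zero:", np.allclose(row_total, 0.0))
```

Here the five centered rows add up to the zero vector, yet the rank is still 3.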

What happens to the rank of the data matrix if we only set the row sums to 1 (or 100), i.e. to a nonzero constant value?
Nothing.

All we’ve done is assert that c1 + c2 + c3 = 1, where 1 here means the constant vector of 1s. These 4 vectors (the three columns and the vector of 1s) are linearly dependent, but the original 3 are not.

But. It has another potential effect, probably undesirable.

If we were to use the resulting data matrix for a regression, we would find ourselves in trouble. We almost always want to adjoin a column of 1s to the data matrix, in order to include a constant term in the regression equation. We just got through saying that the combination c1, c2, c3, 1 is linearly dependent: the regression will explode with a singular matrix. Once we have set the row sums, we would have to drop a column of data in order to include a constant term. This is the same problem we face in regression when we define indicator (dummy) variables.
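A sketch of the regression problem in Python with NumPy. The matrix X, the names P and A, and the rank tolerance are all assumptions made for illustration; no actual regression is fitted, we only look at the rank of the design matrix.

```python
import numpy as np

# Hypothetical 5x3 data matrix with nonzero row sums (values invented).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Scale each row so that it sums to 1.
P = X / X.sum(axis=1, keepdims=True)
rank_P = np.linalg.matrix_rank(P, tol=1e-10)

# Adjoin a column of 1s for the constant term.
ones = np.ones((P.shape[0], 1))
A = np.hstack([ones, P])
rank_A = np.linalg.matrix_rank(A, tol=1e-10)

# Drop one data column and the design matrix has full column rank again.
A_dropped = np.hstack([ones, P[:, :-1]])
rank_A_dropped = np.linalg.matrix_rank(A_dropped, tol=1e-10)

print("P:", P.shape, "rank", rank_P)
print("[1 | P]:", A.shape, "rank", rank_A)
print("[1 | P without last column]:", A_dropped.shape, "rank", rank_A_dropped)
```

The scaled data keeps rank 3, but the 4-column design matrix also has rank 3, so its normal equations are singular. Dropping a data column leaves 3 columns of rank 3.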

What happens to the rank of the data matrix if we only set the column sums to 1 or some other nonzero constant value? (I’ve never seen it done, but why not, at least in principle?)
Nothing. (It even has no effect on regression.)
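For comparison, a sketch of the column-sum case, with the same invented matrix and assumed tolerance. The regression remark holds here because the vector of 1s is not a combination of the columns of this particular X.

```python
import numpy as np

# Hypothetical 5x3 data matrix with nonzero column sums (values invented).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Scale each column so that it sums to 1.
Q = X / X.sum(axis=0)
rank_Q = np.linalg.matrix_rank(Q, tol=1e-10)

# Adjoin a column of 1s, as we would for a regression constant term.
A = np.hstack([np.ones((Q.shape[0], 1)), Q])
rank_A = np.linalg.matrix_rank(A, tol=1e-10)

print("column sums:", Q.sum(axis=0))
print("rank of Q:", rank_Q, "rank of [1 | Q]:", rank_A)
```

Rescaling columns leaves the rank at 3, and the design matrix with the constant column has full column rank 4.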

First summary

So, there were 4 cases: whether we set row or column sum to a constant value, and whether that constant was zero or not. Of those 4 cases, only one (row sum = 0) reduces the rank of the data matrix.

What I was unclear about – or downright wrong about – on April 7 was in saying that we would lose rank by setting the row sums to 1 or 100 etc. Not so; none of the other 3 cases reduce the rank of the matrix.

Is that it?
No. All those questions were about one single operation. What if we combine operations?

Can’t we center both the rows and the columns?
Yes. We saw here on April 13 that we could doubly-center the data.

Does that mean we lose rank?
Yes, because we are centering the rows. Centering the columns is irrelevant.

Can we set the row sums to 1 and then center the columns?
Yes.

But something else will happen. Centering the columns is no longer irrelevant.

Didn’t we see that if the row sums were constant, and then we centered the columns, the row sums went to zero?
Yes. It may seem weird, but it’s true.

That’s an easy way to get doubly-centered data using Mathematica®. Scale the rows so they each sum to 1, then center the columns. Voilà, the row sums will change to zero.
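The same recipe, sketched in Python with NumPy rather than Mathematica. The matrix X and the variable names are invented for illustration.

```python
import numpy as np

# Hypothetical 5x3 data matrix with nonzero row sums (values invented).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Step 1: scale the rows so they each sum to 1.
P = X / X.sum(axis=1, keepdims=True)

# Step 2: center the columns.
D = P - P.mean(axis=0)

row_sums = D.sum(axis=1)
col_sums = D.sum(axis=0)

print("row sums are zero:", np.allclose(row_sums, 0.0))
print("column sums are zero:", np.allclose(col_sums, 0.0))
```

Both sets of sums come out as zero up to rounding, so D is doubly-centered even though we only centered the columns explicitly.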

Any time you start messing with row and column sums or means, remember the grand mean, too: the average of all the data. It is also both the average row mean, and the average column mean. If the columns all have zero mean, then the grand mean is zero. Then the mean of the row means must be zero. The challenge is to show that the rows still have a common sum (equivalently, a common mean); then that common mean must be the mean of the row means, i.e. zero.
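One way to meet that challenge is a direct matrix computation. The notation is my own, not from the earlier posts: X is n×p, 1_n and 1_p are vectors of 1s, s is the common row sum, and Y is the column-centered matrix.

```latex
\begin{aligned}
X\,\mathbf{1}_p &= s\,\mathbf{1}_n
  && \text{(every row of $X$ sums to $s$)} \\
Y &= X - \tfrac{1}{n}\,\mathbf{1}_n \mathbf{1}_n^{T} X
  && \text{(subtract the column means)} \\
Y\,\mathbf{1}_p &= X\,\mathbf{1}_p - \tfrac{1}{n}\,\mathbf{1}_n \mathbf{1}_n^{T} X\,\mathbf{1}_p \\
  &= s\,\mathbf{1}_n - \tfrac{s}{n}\,\mathbf{1}_n \left(\mathbf{1}_n^{T}\mathbf{1}_n\right) \\
  &= s\,\mathbf{1}_n - s\,\mathbf{1}_n = \mathbf{0}.
\end{aligned}
```

So every row of Y sums to zero, whatever the constant s was.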

If we set the row sums, say to 1, and then column-center the data, do we lose rank?
Yes.

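Column-centering the data after setting the row sums will reset the row sums to zero, and that loses rank. Standardizing loses rank too: it column-centers and then rescales each column, so the row sums need not stay at zero, but rescaling columns cannot restore the lost dimension. If it is appropriate, however, that the data be column-centered, so be it. I do not believe that reduction of rank is sufficient grounds for not centering the data.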

Second summary

The apparently innocuous act of column-centering can lose rank, if we apply it to nonzero-constant-row-sum data. We can do it, and sometimes we should do it, but we should expect to lose rank when we do. But we’re losing rank because we first set the row sums, even though to a nonzero constant.

I had previously said I would favor doubly-centered data. Now I am inclined not to do so without cause, simply because it reduces the rank of the matrix.

Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
Hell yes. In particular, we might find that we effectively centered the data without realizing it.

Suppose we set the row sums of our data to 1, say, and then we decide that rather than do an SVD of the new matrix, we will do an eigendecomposition of the covariance (or correlation) matrix.

Oops. Computing either the covariance or correlation matrix is tantamount to column-centering the data. We saw that here on March 10.

Goodbye full rank.
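A sketch of that accident in Python with NumPy. The matrix X is invented, and the threshold used to count nonzero eigenvalues is an assumption. Note that the data are never centered explicitly.

```python
import numpy as np

# Hypothetical 5x3 data matrix with nonzero row sums (values invented).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Set the row sums to 1; this matrix still has rank 3.
P = X / X.sum(axis=1, keepdims=True)

# Covariance and correlation matrices of the columns (variables).
cov = np.cov(P, rowvar=False)
corr = np.corrcoef(P, rowvar=False)

# Eigenvalues of the symmetric 3x3 matrices, in ascending order.
cov_eigs = np.linalg.eigvalsh(cov)
corr_eigs = np.linalg.eigvalsh(corr)

# Count eigenvalues above a small threshold (threshold is an assumption).
cov_rank = int(np.sum(cov_eigs > 1e-10))
corr_rank = int(np.sum(corr_eigs > 1e-10))

print("covariance eigenvalues:", cov_eigs, "rank", cov_rank)
print("correlation eigenvalues:", corr_eigs, "rank", corr_rank)
```

Each 3×3 matrix has one eigenvalue that is zero up to rounding, so only two components are available.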

Third summary

Suppose we have data for which a constant nonzero row sum is appropriate. Suppose we choose to do an eigendecomposition of the covariance (or correlation) matrix.

Then we lose rank. Constructing either of those smaller matrices implicitly centers the columns, and that in turn implicitly changes the row sums of the data to zero. Even if we never explicitly compute the transformed data.

The clearest path in such a case might well be as follows. First set the row sums to 1 or whatever nonzero. Then column-center or standardize the data. We will see the rank drop by 1 at that point. Then do an SVD. The only advantage to doing it this way, rather than via the covariance or correlation matrices, is that we see when we lose full rank.
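That path can be sketched as follows in Python with NumPy, looking at the singular values after each step. The matrix X is invented for illustration, and I use the sample standard deviation (ddof=1) for standardizing, which is my own choice.

```python
import numpy as np

# Hypothetical 5x3 data matrix with nonzero row sums (values invented).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 5.0]])

# Step 1: set the row sums to 1.
P = X / X.sum(axis=1, keepdims=True)

# Step 2a: column-center.
D = P - P.mean(axis=0)

# Step 2b (alternative): standardize, i.e. center and scale each column.
Z = D / D.std(axis=0, ddof=1)

# Step 3: singular values, largest first.
sv_P = np.linalg.svd(P, compute_uv=False)
sv_D = np.linalg.svd(D, compute_uv=False)
sv_Z = np.linalg.svd(Z, compute_uv=False)

print("row sums set to 1:", sv_P)
print("then centered:    ", sv_D)
print("then standardized:", sv_Z)

# Standardizing rescales columns, so its row sums are generally not zero.
print("standardized row sums are zero:", np.allclose(Z.sum(axis=1), 0.0))
```

The third singular value is comfortably nonzero after step 1 and drops to zero, up to rounding, as soon as the columns are centered or standardized. For this matrix the standardized rows do not sum to zero, but the rank is lost all the same.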

If we live in a universe where data is supposed to be centered, or where the covariance or correlation matrix is appropriate, then we will lose rank if, in addition, the data had constant row sums.

Final summary

0. We could set row or column sums to zero or to a nonzero constant.
1. Setting row sums to zero is the only one of the four cases that will reduce rank.
2. Column-centering applied to data with constant row sums leaves us still with constant row sums but the sums have been changed to zero. This gives us (1) and loss of rank.
3. An accidental way to achieve (2) and thus (1) is to analyze the covariance (or correlation) matrix of constant-row-sum data.

It’s not a problem if you know what happened and why.

2 Responses to “PCA / FA tricky preprocessing”

  1. Muhammad Alkarouri Says:

    Rephrasing that for understanding effect: Can’t I say that centering is generally introducing an additional constraint and losing, thus, one degree of freedom?
    For me, using the covariance / correlation means that I do not care about the origin and so I am explicitly losing that information. In other words, I don’t care if a constant (vector) was added to the whole data set. So I would say, if you care about that don’t use covariance / correlation (or may be store the value of the mean vector).

    In correlation you further give up information about the scales of individual variables, but that’s another matter.

  2. rip Says:

    I think that’s a fair summary of three kinds of preprocessing.

    The impetus for my post was that the combination of two steps which do not individually reduce the rank of the data matrix, nevertheless has the effect of reducing it.

    Although I said I might avoid centering the rows just because it reduces the rank, I’m not sure that’s a good reason.
