PCA / FA Example 7: Bartholomew et al. Correlation matrix

Edit 5 Oct 2008: I had omitted the word “constant”; see the marked edit below.

The following example comes from Bartholomew et al. “The Analysis and Interpretation of Multivariate Data for Social Scientists.”

It is an excellent example with which to wrap up PCA / FA. (There’s a lot we haven’t done, but it’s almost time for me to move on.)

The example is “employment in 26 European countries” (“eurojob” for short), from chapter 5 of either the 1st or the 2nd edition; data for both editions are available at http://www.cmm.bris.ac.uk/team/amssd.shtml . Please note that I am using the 1st edition of the book and the 1st edition data.

When I first worked this example, I knew that something interesting happened, but not why; and there was one thing I didn’t understand at all back then.

One reason why this example is so good is that they provide both the data and a correlation matrix. In fact, most of their analyses in chapter 5 seem to be based on the correlation matrix, except in those few cases when they compute scores (for which they need the data). We get to return to our starting point, using the correlation matrix.

Here’s the raw data.

data = \left(\begin{array}{lllllllll} 3.3 & 0.9 & 27.6 & 0.9 & 8.2 & 19.1 & 6.2 & 26.6 & 7.2 \\ 9.2 & 0.1 & 21.8 & 0.6 & 8.3 & 14.6 & 6.5 & 32.2 & 7.1 \\ 10.8 & 0.8 & 27.5 & 0.9 & 8.9 & 16.8 & 6. & 22.6 & 5.7 \\ 6.7 & 1.3 & 35.8 & 0.9 & 7.3 & 14.4 & 5. & 22.3 & 6.1 \\ 23.2 & 1. & 20.7 & 1.3 & 7.5 & 16.8 & 2.8 & 20.8 & 6.1 \\ 15.9 & 0.6 & 27.6 & 0.5 & 10. & 18.1 & 1.6 & 20.1 & 5.7 \\ 7.7 & 3.1 & 30.8 & 0.8 & 9.2 & 18.5 & 4.6 & 19.2 & 6.2 \\ 6.3 & 0.1 & 22.5 & 1. & 9.9 & 18. & 6.8 & 28.5 & 6.8 \\ 2.7 & 1.4 & 30.2 & 1.4 & 6.9 & 16.9 & 5.7 & 28.3 & 6.4 \\ 12.7 & 1.1 & 30.2 & 1.4 & 9. & 16.8 & 4.9 & 16.8 & 7. \\ 13. & 0.4 & 25.9 & 1.3 & 7.4 & 14.7 & 5.5 & 24.3 & 7.6 \\ 41.4 & 0.6 & 17.6 & 0.6 & 8.1 & 11.5 & 2.4 & 11. & 6.7 \\ 9. & 0.5 & 22.4 & 0.8 & 8.6 & 16.9 & 4.7 & 27.6 & 9.4 \\ 27.8 & 0.3 & 24.5 & 0.6 & 8.4 & 13.3 & 2.7 & 16.7 & 5.7 \\ 22.9 & 0.8 & 28.5 & 0.7 & 11.5 & 9.7 & 8.5 & 11.8 & 5.5 \\ 6.1 & 0.4 & 25.9 & 0.8 & 7.2 & 14.4 & 6. & 32.4 & 6.8 \\ 7.7 & 0.2 & 37.8 & 0.8 & 9.5 & 17.5 & 5.3 & 15.4 & 5.7 \\ 66.8 & 0.7 & 7.9 & 0.1 & 2.8 & 5.2 & 1.1 & 11.9 & 3.2 \\ 23.6 & 1.9 & 32.3 & 0.6 & 7.9 & 8. & 0.7 & 18.2 & 6.7 \\ 16.5 & 2.9 & 35.5 & 1.2 & 8.7 & 9.2 & 0.9 & 17.9 & 7. \\ 4.2 & 2.9 & 41.2 & 1.3 & 7.6 & 11.2 & 1.2 & 22.1 & 8.4 \\ 21.7 & 3.1 & 29.6 & 1.9 & 8.2 & 9.4 & 0.9 & 17.2 & 8. \\ 31.1 & 2.5 & 25.7 & 0.9 & 8.4 & 7.5 & 0.9 & 16.1 & 6.9 \\ 34.7 & 2.1 & 30.1 & 0.6 & 8.7 & 5.9 & 1.3 & 11.7 & 5. \\ 23.7 & 1.4 & 25.8 & 0.6 & 9.2 & 6.1 & 0.5 & 23.6 & 9.3 \\ 48.7 & 1.5 & 16.8 & 1.1 & 4.9 & 6.4 & 11.3 & 5.3 & 4.\end{array}\right)

I remark that the matrix is 26×9: for each of 26 countries, we have the percentage of people employed in each of 9 categories (agriculture, mining, etc.).

Simply taking us back to our roots, doing the PCA from the correlation matrix, would make this a good example to end with, though not yet an excellent one. I compute the correlation matrix and save it; but I also round it to 2 places and show it to you:

\left(\begin{array}{lllllllll} 1. & 0.04 & -0.67 & -0.4 & -0.54 & -0.74 & -0.22 & -0.75 & -0.56 \\ 0.04 & 1. & 0.45 & 0.41 & -0.03 & -0.4 & -0.44 & -0.28 & 0.16 \\ -0.67 & 0.45 & 1. & 0.39 & 0.49 & 0.2 & -0.16 & 0.15 & 0.35 \\ -0.4 & 0.41 & 0.39 & 1. & 0.06 & 0.2 & 0.11 & 0.13 & 0.38 \\ -0.54 & -0.03 & 0.49 & 0.06 & 1. & 0.36 & 0.02 & 0.16 & 0.39 \\ -0.74 & -0.4 & 0.2 & 0.2 & 0.36 & 1. & 0.37 & 0.57 & 0.19 \\ -0.22 & -0.44 & -0.16 & 0.11 & 0.02 & 0.37 & 1. & 0.11 & -0.25 \\ -0.75 & -0.28 & 0.15 & 0.13 & 0.16 & 0.57 & 0.11 & 1. & 0.57 \\ -0.56 & 0.16 & 0.35 & 0.38 & 0.39 & 0.19 & -0.25 & 0.57 & 1.\end{array}\right)

I do not quite match them. We differ in one number, at the end of the first row (and, of course, at the end of the first column): to five places I have -0.56492, which rounds to -0.56, while they show -0.57.
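For readers who want to follow along outside Mathematica, here is a minimal numpy sketch of how such a computation could go. The variable names and the file name are mine (hypothetical), not the book’s; I am only assuming that data holds the 26×9 matrix shown above.

import numpy as np

# hypothetical file name; the raw data are available on the book's website
data = np.loadtxt("eurojob.dat")            # 26 countries x 9 employment categories, in percent

R  = np.corrcoef(data, rowvar=False)        # 9x9 correlation matrix of the columns
R2 = np.round(R, 2)                         # rounded to 2 places, like the printed matrix
print(R2)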

That’s irrelevant. What is not irrelevant is the result of computing the eigenvalues of the rounded correlation matrix.

It could have been a big deal, but it isn’t; still, here be dragons either way. The point is that their apparent starting point is their printed (hence rounded) correlation matrix. (To be more precise: they clearly computed an eigendecomposition of the correlation matrix; what I used was their printed correlation matrix.) Whether I use theirs or mine, so long as it’s rounded to 2 places, life is interesting.

Let’s just do our thing: get the eigendecomposition of the rounded correlation matrix… and look at the eigenvalues.

\left(\begin{array}{l} 3.48828 \\ 2.14083 \\ 1.10125 \\ 0.992444 \\ 0.543472 \\ 0.379233 \\ 0.224842 \\ 0.133101 \\ -0.00344958\end{array}\right)

I hope you took a deep breath right there. The smallest eigenvalue is negative. But the correlation matrix is supposed to be positive semi-definite. (Because it can be written in the form X^T X.)
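In symbols (my notation, not the book’s): if Z is the n×p matrix whose columns are the standardized variables, then the correlation matrix is

R = \frac{1}{n-1} Z^T Z ,

and for any vector v we have v^T R\, v = \frac{1}{n-1}\|Z v\|^2 \geq 0, which is precisely the statement that R is positive semi-definite.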

Alarm bells should be ringing in your head: eigenvalues of a positive semi-definite matrix are non-negative. Where did that negative eigenvalue come from?

Presumably the smallest eigenvalue is very close to 0, and the negative number is numerical error.

Well, sort of, but we can say more. Here are the eigenvalues of the unrounded correlation matrix:

\left(\begin{array}{l} 3.48715 \\ 2.13017 \\ 1.09896 \\ 0.994483 \\ 0.543218 \\ 0.383428 \\ 0.225754 \\ 0.13679 \\ 0.0000456251\end{array}\right)

We see that the last one is very small but positive. The computed correlation matrix is positive definite. The problem comes from rounding off the correlation matrix.
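Continuing the numpy sketch above (R and R2 are my names, not the book’s), the two eigenvalue lists come from the symmetric eigensolver:

# eigvalsh is numpy's eigenvalue routine for symmetric matrices;
# it returns the eigenvalues in ascending order, so the troublesome one comes first
print(np.linalg.eigvalsh(R2))   # first entry is slightly negative
print(np.linalg.eigvalsh(R))    # first entry is tiny but positive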

But surely the rounded matrix must be positive definite, too, you cry?

Why?

Symmetry alone is not enough to guarantee positive definiteness.

This is just the tip of an iceberg. There was nothing special about rounding to 2 digits, nothing special about a correlation matrix. It is not trivial to manipulate a positive (semi-) definite matrix and guarantee that subsequent matrices are still positive (semi-) definite. But that’s about all I’ll say about the numerical hassles.
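Here is a toy illustration of that iceberg; it is my own construction, not from the book. Take three standardized variables with X and Y uncorrelated and Z proportional to -(X+Y). Their correlation matrix is

\left(\begin{array}{lll} 1 & 0 & -1/\sqrt{2} \\ 0 & 1 & -1/\sqrt{2} \\ -1/\sqrt{2} & -1/\sqrt{2} & 1\end{array}\right)

which is positive semi-definite and exactly singular: its eigenvalues are 2, 1, and 0, and its determinant 1 - 2c^2 (with c = -1/\sqrt{2} \approx -0.70711) is exactly zero. Round the off-diagonal entries to two places, i.e. to -0.71, and the determinant becomes 1 - 2(0.71)^2 = -0.0082: the rounded matrix has a negative eigenvalue, exactly the phenomenon we just saw with eurojob.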

As I said, the computed correlation matrix is, in fact, positive definite: its determinant and all its eigenvalues are positive. Mathematica® does just fine with it.

But the rounded correlation matrix – which is what they presented – is not positive definite: its determinant and one eigenvalue are negative.
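One quick numerical test of positive definiteness, if you would rather not stare at eigenvalues, is to attempt a Cholesky factorization. Continuing my numpy sketch, it succeeds on the computed matrix and fails on the rounded one:

print(np.linalg.det(R), np.linalg.det(R2))   # first is positive, second is negative
np.linalg.cholesky(R)                        # succeeds: R is positive definite
try:
    np.linalg.cholesky(R2)
except np.linalg.LinAlgError:
    print("the rounded correlation matrix is not positive definite")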

So we now know two good reasons to provide data instead of just providing a correlation matrix: one, we can do more with the data; two, a rounded-off correlation matrix may not be positive semi-definite. And, bless them, they did provide the data.

This insight alone upgrades it to a very good example. And the negative eigenvalue was the “something interesting”. We’ll see very shortly why it happened. (Perhaps you can guess.)

To summarize: the smallest eigenvalue of the computed correlation matrix is extremely close to zero, but positive, as it should be. The real problem is that the correlation matrix is nearly singular (nearly rank-deficient); the rounded correlation matrix is nearly singular, too, but its smallest eigenvalue landed on the other side of zero.

The lesson we have learned is: if the correlation matrix is barely of full rank, working from a rounded version of it could work out badly.

Now, why is the correlation matrix almost singular? I.e. why is the smallest eigenvalue so close to zero?

I swear I had no idea that this example would tie in so closely with recent posts. I knew that inadvertent reduction of rank was important, but I didn’t have a clear reason why.

What might cause the smallest eigenvalue to go to zero? All we did was compute a correlation matrix; that is, we implicitly centered (and then scaled) the columns.

Ah ha! Just what are the row sums of the raw data?

Here they are:

row sums = \left(\begin{array}{l} 100. \\ 100.4 \\ 100. \\ 99.8 \\ 100.2 \\ 100.1 \\ 100.1 \\ 99.9 \\ 99.9 \\ 99.9 \\ 100.1 \\ 99.9 \\ 99.9 \\ 100. \\ 99.9 \\ 100. \\ 99.9 \\ 99.7 \\ 99.9 \\ 99.8 \\ 100.1 \\ 100. \\ 100. \\ 100.1 \\ 100.2 \\ 100.\end{array}\right)

The min, mean, and max are 99.7, 99.9923, and 100.4 respectively.
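Continuing the numpy sketch, the row sums and their summary are one line each:

row_sums = data.sum(axis=1)                              # total percentage employed, one number per country
print(row_sums.min(), row_sums.mean(), row_sums.max())   # 99.7, 99.9923..., 100.4, as above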

They’re not constant, but they’re awfully close to constant. Hence the smallest eigenvalue of the correlation matrix isn’t zero, but it’s awfully close to zero. I explained here that having (edit: constant) non-zero row sums, followed by centering the columns, leads to row sums that are all zero.
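Here is a quick sketch of that argument in the present setting (my notation). Suppose every row of the n×p data matrix X sums to the same constant c, i.e. X\,\mathbf{1}_p = c\,\mathbf{1}_n, where \mathbf{1}_p and \mathbf{1}_n are vectors of ones. Centering the columns replaces X by X_c = X - \mathbf{1}_n \bar{x}^T, where \bar{x} is the vector of column means; and the column means themselves sum to c, since \bar{x}^T \mathbf{1}_p = \frac{1}{n}\mathbf{1}_n^T X\,\mathbf{1}_p = \frac{1}{n}\mathbf{1}_n^T (c\,\mathbf{1}_n) = c. Therefore

X_c\,\mathbf{1}_p = X\,\mathbf{1}_p - \mathbf{1}_n(\bar{x}^T \mathbf{1}_p) = c\,\mathbf{1}_n - c\,\mathbf{1}_n = 0 ,

so the centered columns are linearly dependent, the centered (and then scaled; scaling only changes the weights in the dependence) data matrix has rank at most p-1, and the correlation matrix is singular. In eurojob the row sums are only approximately constant, so instead of landing exactly on singular we land right next to it.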

So we have just seen a real-life example of the inadvertent (near) loss of rank caused by computing the correlation matrix of (nearly) constant-row-sum data. The computed correlation matrix is positive definite but almost singular: its determinant is almost zero but positive, and its smallest eigenvalue is almost zero but positive. And because it was so nearly singular, the rounded correlation matrix could fail to be positive definite; in fact it failed even to be positive semi-definite.

Next, we will confirm the part of their analysis that is based on the correlation matrix.
