PCA / FA. Example 4 ! Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)

It would be prudent to compute the means…

 \{10,\ 24,\ 15\}

and variances of the data…

 \{18.6667,\ 6.66667,\ 6.66667\}

It would also be prudent to calculate the row sums.

  \{49,\ 49,\ 49,\ 49\}

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

 X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.

Next, Davis chooses to analyze X^T X instead of the covariance matrix \frac{X^TX}{N-1}\ . We recall that these two have the same eigenvectors, but eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of X^T X\ .

Here is the SVD of X, X = u\ w\ v^T\ . (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)

v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Sure enough, there are only two nonzero singular values in the w matrix.

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

A^R = v\ w^T

S^R = X A^R (= u\ w\ w^T\ , but use the definition for computation: one less matrix product.)

A^Q = u\ w ( = X v)

S^Q = X^T A^Q (= v\ w^T\ w)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

 A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)

 S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data; in the sense that, the matrix product of corresponding scores S and loadings A is not X.

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, Y = A^Q\ .)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of c = X^T X\ . Here it is:

 c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)

Here are its eigenvalues:

 \lambda = \{84,\ 12,\ 0\}

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

We can check that V^T V = I\ . (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

 v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

In fact, they agree completely.

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)

Now we should recall A^R\ , to confirm the equivalence:

A^R = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

\{28.,\ 4.,\ 0.\}

and their sum is 32.

Let’s recall the variances of the columns of X…

\{18.6667,\ 6.66667,\ 6.66667\}

and their sum is also 32.

Now, what about Harman? He writes his model as

Z = A F,

where

Z = X^T\ ,

(i.e. his variables are in rows, not in columns) and A is a \sqrt{\text{eigenvalue}}-weighted eigenector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

\{84,\ 12,\ 0\}

we construct a diagonal matrix \Lambda of their square roots:

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

We should expect that these are the nonzero nonsingular values w. Sure enough:

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Note, however, that w is not square, hence is not equal to the diagonal matrix \Lambda\ . This will lead to an irrelevant difference between A and A^R\ , etc. In fact, I often ignore the distinction between \Lambda and w conceptually; in practice, of course, whatever matrices we’re using have to be “the right size” for whatever matrix products we compute (conformable).

We should recall that the nonzero eigenvalues are the nonzero values of both w^T w and w\ w^T\ :

The nonzero eigenvalues were…

\{84,\ 12,\ 0\}

whereas

w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)

and

w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)

Resuming our thread, we compute the weighted eigenvector matrix A

A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

and that should be Davis’ A^R\ ; and it is, except for an extra – to my mind, irrelevant – column of zeros.

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

Viewing A as a transition matrix, we know that Harman’s scores are the new data F^T\ . We also know (HERE) that we may compute them as

F^T= u\ ,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data F^T should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

We check that by computing A F and comparing it to Z = X^T\ ; we expect Z = AF. And we have it; here’s Z:

Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)

And here’s A F (using the two nonzero columns of A!):

A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)

I remind us that this new data F^T does not have redistributed variance; in fact, this new data has constant variance:

\{0.333333,\ 0.333333\}

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of X^T X rather than of the covariance matrix \frac {X^TX}{N-1}\ . Alternatively, our new data was given by columns of u rather than by columns of \sqrt{N-1}\ u\ . (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data F^T from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data F^T using a reciprocal basis instead; that is, one which weights the v matrix by \frac{1}{\sqrt{\text{eigenvalue}}} instead of \sqrt{\text{eigenvalue}}. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by \sqrt{\text{eigenvalue}}, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue – for finding a reciprocal basis – it is to set the zero eigenvalue to 1. Instead of the diagonal matrix

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

I use

\Lambda 1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1\end{array}\right)

and instead of A = v \Lambda\ , I compute the reciprocal basis B = v \Lambda1^{-1} (i.e. I divide each column by \sqrt{\text{eigenvalue}} instead of multiplying by \sqrt{\text{eigenvalue}}).

B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)

Then the new data with respect to A can be found by projecting X onto B:

X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)

That had better be F^T\ … and it is:

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)

For this example, Davis did not compute the new data with respect to A (i.e. wrt his A^R\ ). As I have said before, I have seen him do it in another example; but his scores S^R are not the new data F^T\ , which is, as far I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix A = v \Lambda was also the dispersion matrix between the new data F^T and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix X^T X\ .

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of F^T\ , so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with X^T Z\ , i.e. that I want to compute

X^T Z = X^T F^T

(finally reverting to my original notation). Fine, I compute that product, and I get

X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)

and I compare that to A:

A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though X^T X it is not the covariance matrix.)

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

D = v \Lambda^2\ .

But… we’ve just seen that weight here. Okay, it wasn’t \Lambda^2\ , but it was either w\ w^T or w^T w\ . Well, it must have been S^Q\ , i.e. the one that multiplied v:

S^Q = X^T A^Q (= v w^T w)\ .

Of course, I should check that computationally. Here is v \Lambda^2\ :

v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

and here is our S^Q\ :

S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Hang on. We need to explicitly see X^T Y\ , the cross-dispersion matrix analagous to X^T X\ : we compute it…

X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores S^Q are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings A^Q are the new data Y with respect to the transition matrix v).

To put that another way…. The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix X^T X\ , and defines new data Y; and S^Q = X^T Y is the cross-dispersion between old data X and new data Y.

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores S^Q\ are that dispersion matrix D.

PCA / FA Bartholomew et al.: discussion

This is the 4th post in the Bartholomew et al. sequence in PCA/FA, but it’s an overview of what I did last time. Before we plunge ahead with another set of computations, let me talk about things.

I want to elaborate on the previous post. We discussed

  • the choice of data corresponding to an eigendecomposition of the correlation matrix
  • the pesky \sqrt{N-1} that shows up when we relate the \sqrt{\text{eigenvalues}},\ \Lambda, of a correlation matrix to the principal values w of the standardized data
  • the computation of scores as \sqrt{N-1}\ u
  • the computation of scores as F^T\ where X^T = Z = A\ F
  • the computation of scores as projections of the data onto the reciprocal basis
  • different factorings of the data matrix as scores times loadings

The three computations of scores, of course, are all the same, only looking different. I will show you an example where A is not invertible (but not in this post). Although it looks harder at first, it simplifies so considerably as to be easier in practice.

The choice of data is actually dictated by the SVD. If we do an eigendecomposition of the correlation matrix c,

c = V\ \lambda\ V^T

then the data must be standardized or a constant multiple of the standardized data (e.g. small standard deviates), because that gives us that V from the eigendecomposition matches v from the SVD (to within signs of columns),

X = u\ w\ v^T\ .

Similarly, if we do eigendecomposition of the covariance matrix, then the data must be centered or a constant multiple of the centered data. (We could define “small centered deviates” by dividing the data by \sqrt{N-1}\ .) Any multiple of centered data give us that V from the eigendecomposition matches v from the SVD (again, to within signs of columns).

Generally, then, if we do an eigendecomposition of XTX or XXT, then the data must be X or a multiple of it. (In that form it looks pretty obvious, huh?)

We had gotten used to not seeing that N-1 or \sqrt{N-1}\ . Both Davis and Malinowski used eigendecompositions in preference to SVDs, but they computed eigendecompositions of X^T\ X not of \frac {X^T\ X}{N-1}. I was able to replicate their results by using an SVD of X, instead of their eigendecompositions, and we had that w = \Lambda \ : the principal values were equal to the square roots of the eigenvalues.

But Harman and jolliffe used eigendecompositions of the correlation matrix, and the square roots of the eigenvalues are proportional instead of equal to the principal values,

\sqrt{N-1}\ w = \Lambda = \sqrt{\lambda}

First remark: the same thing happens if we use the covariance matrix and centered data instead of the correlation matrix and standardized data. (It is not uncommon for people to refer to X^T\ X as a “covariance matrix”; I try to reserve that term for \frac {X^T\ X}{N-1} with X centered.)

Second remark: we saw that Harman’s model and Bartholomew’s is to factor the (preferably standardized, possibly centered) data matrix using a weighted \sqrt{\text{eigenvalue}} matrix A:

A = V\ \Lambda

X = F^T\ A^T

and that in fact we could write that same factoring of X using the SVD:

X = \left(\sqrt{N-1}\ u\right)\left( \Lambda\ v^T\right)

Third remark: Jolliffe wrote his model with symbols similar to Harman’s but with two completely different meanings: he writes

Z = X A

but this time Z is the principal components instead of  X^T\ , and his loadings A are the orthogonal eigenvector matrix (i.e. my V), and X is still the data matrix (with variables in columns). That is, in my usual notation,

Y = X V,

where I have written Y for Z because the product X V is precsiely the new components of the data wrt V. His “principal components” are the new components of X.

Now, we can write that as a factoring of the data matrix, and because V is orthogonal we get

X = Y\ V^T = \left(u\ w\right)\ v^T

Funny thing, that looks just like Malinowski’s

X = R\ C = \left(u\ w\right)\ v^T

It is.

Malinowski, of course, uses his decomposition for an arbitrary X: it may be standardized, or centered, or even raw data. Jolliffe uses his decomposition preferably for a correlation matrix, possibly for a covariance matrix. That difference aside, Jolliffe and Malinowski are using the same model.

Let me be explicit about something: we may start with v from the SVD and use that v in place of V in the eigendecomposition. Going the other way is harder: the SVD has consistent signs on u and v, but V may not be consistent with u. In the eigendecomposition, sign consistency is automatic between V and V^T, or between v and v^T. Ah, the analog of mixing u and V in the SVD would be to mix v and V in the eigendecomposition, but I can’t imagine that we would ever decompose the correlation matrix c as either V\ \lambda\ v^T or v\  \lambda\ V^T\ , using both v and V. That’s good, because it wouldn’t work in general.

Harman and Bartholomew are factoring the data as

X = \left(\sqrt{N-1}\ u\right)\left(\Lambda\ V^T\right)

which is similar to

X = \left(u\right)\left(w\ v^T\right)\ .

while Jolliffe and Malinowski are factoring the data as

\left(u\ w\right)\left(v^T\right)\ .

Note that I wrote singular values w not \Lambda\ , hence “similar to” for Harman & Bartholomew et al.

The first are very nearly associating the singular values w with v, the second are associating the singular values with u. That strikes me as a fundamental difference.

What is common is that they all are factoring the data matrix as scores times loadings.

Unfortunately, Davis is only sometimes in that camp, and sometimes not. From his definitions on p. 504 – the ones I showed – his scores times his loadings do not reproduce the data. He defines loadings

A^Q = u\ w

A^R = v\ w^T

and then defines scores as projections of the data X onto the loadings:

S^R = X\ A^R = u\ w\ w^T = A^Q\ w^T

S^Q = X^T\ A^Q = v\ w^T\ w = A^R\ w\ .

We know that by projecting data onto non-orthonormal bases A^R and A^Q\ , Davis is computing components of the data wrt the reciprocal bases instead of wrt the bases.

I don’t know why he uses scores and loadings that do not factor the data matrix. About the closest I can get to factoring the data is (making w square, and u, v conformable):

A^Q\ \left(A^R\right)^T = \left(u\ w\right)\left(w\ v^T\right) = u\ w^2\ v^T\ ,

which just isn’t the same as

u\ w\ v^T\ .

(So close and yet so far.)

Incidentally, if you are looking in Davis, remember that his uppercase U and V are my lowercase v and u respectively, not u and v; and his \Lambda is my w (made square).

Those definitions notwithstanding, Davis defines different scores on pp. 536-537, and this new definition is a factoring of the data matrix; more to the point, by this new definition of the scores, they can be computed as \sqrt{N-1}\ u\ .

That is, when he discusses “factor analysis” around p. 536, he agrees with Harman & Bartholomew et al.

Moving on.

Tell me again why we get that pesky \sqrt{N-1}\ ? Answer: because we want the eigenvalues of the correlation matrix, not the singular values of the data matrix.

And why do we want them? Because the eigenvalues are numerically equal to the variances of the new data Y wrt the basis V, i.e. the variances of the columns of Y,

Y = X V.

Why do we care that the eigenvalues are equal to the variances of the new data? Answer: I suspect that people wanted to make inferences about the new components of the data without computing it, i.e. without actually computing the scores. We can tell what the variances would be just from the eigenvalues.)

And yet, the new data does not depend on the eigenvalues or the singular values. Take the orthogonal eigenvector matrix V, premultiply it by X (“project the data onto the basis vectors”), and we’ve gotten data with the new variances, i.e. new data with the property that it has the same total variance as the original data X, but the variance has been redistributed maximally.

What’s important is that the new data does not depend on the \sqrt{\text{eigenvalue}} matrix A. so why did they create A instead of using the orthogonal V? Answer: the F matrix.

What does depend on the \sqrt{\text{eigenvalue}} matrix A is the F matrix. That pesky factor of \sqrt{N-1}\ gives us that

\frac{F^T F}{N-1} = I\ ,

i.e. F^T is not only standardized but also uncorrelated. That’s in complete contrast to Y = X V, which has the redistributed variances. Maybe that was the purpose, and the creation of A is the means to that end. (To look at it another way: u is orthogonal, the cut-down u0 is orthonormal, and F^T= \sqrt{N-1}\ u0 is standardized, instead of orthonormal.)

Nevertheless, we struggle to have it both ways: either the new data is XV, with redistributed variances, or the new data is F^T with unit variances. I would go so far as to say that Harman & Bartholomew et al. are assessing one thing (the variances of the new data Y), while computing another thing (standardized data).

People talk about both. We’ll apply all this to an example.

PCA / FA Example 7: Bartholomew et al.: the scores

edited 17 Oct 2008 to round-off the first two columns of 5 u.

This is the 3rd post about the Bartholomew et al. book.

Introduction

It would be convenient if Bartholomew’s model were one we had seen before.

It is.

We got their scores and loadings just by following their instructions. Although they didn’t use matrix notation, their equations amounted to

loadings = the \sqrt{\text{eigenvalue}}-weighted eigenvector matrix, A = V\ \Lambda

“component score coefficients” = reciprocal basis vectors, cs = V\ \Lambda^{-1}

scores = X cs.

where V is an orthogonal eigenvector matrix of the correlation matrix, \Lambda is the diagonal matrix of (nonzero) \sqrt{\text{eigenvalues}}\ , X is the standardized data.

Let’s recall Harman’s model. He had

Z = A\ F

where Z was the transpose of the standardized data X:

Z = X^T

and A is the \sqrt{\text{eigenvalue}}]-weighted eigenvector matrix, so we have

A = V\ \Lambda\ .

Transposing, we get

X = F^T\ A^T\ .

There are two crucial facts:

  • Harman has factored the data matrix X as scores times loadings.
  • The right factor is not orthogonal: A^T rather than V^T.

Terminology: Harman says the elements of F are the factor values, and the “coefficients of the factors” are loadings. I take it, therefore, that A is the loadings.

So can we compute F or F^T? In this case, yes: all the eigenvalues are nonzero, so A is invertible, and then

F^T = X\ A^{-T}\ .

So let’s compute F^T. (You may have recognized A^{-T}\ . It is in fact the reciprocal basis to A, and that equation says that F^T is found by projecting the data onto the reciprocal basis. That’s exactly what Bartholomew had us do, without telling us that we were computing A^{-T}\ along the way.)

Before we do that, I want to be very careful about the signs of columns. In the previous post, for example, the A matrix which I computed differed from the A matrix which Bartholomew et al. computed. We know that’s ok, because eigenvectors of length 1 can be multiplied by -1 and still be eigenvectors of length 1. But I was not consistent about how I changed signs to confirm that our graphs matched: in fact while I had a different A matrix from them, I defined my “component score coefficients” to match theirs. That was not consistent.)

But now, if I want to check my computations by multiplying the scores times the loadings, I need a consistent set of signs for the columns.

Rewriting Bartholomew’s scores

My personal preference for achieving that is to start with the orthogonal v matrix from the SVD (singular value decomposition) and construct A from it. You’ll see why, below.

What follows seems to be my adopted recipe for doing PCA / FA, although it’s cluttered here by comparisons to Bartholomew’s calculations. You’ll see it again.

First, decide what “the data” is to be: raw, centered, standardized, etc. We know that Bartholomew used standardized data, so let X be the standardized data, and let

X = u\ w\ v^T

be its SVD.

I remind us that the dimensions are

X: 26 x 9

u: 26 x 26

w: 26 x 9

v: 9 x 9

That is, u and v are square (as they must be: they are orthogonal); and w is the same shape as the data X.

Right now, I wish to use Harman’s & Bartholomew’s model, i.e. the correlation matrix. Rather than do algebra in my head, I’m going to compute the correlation matrix of X… get just its eigenvalues …

\begin{array}{l} 3.48715 \\ 2.13017 \\ 1.09896 \\ 0.994483 \\ 0.543218 \\ 0.383428 \\ 0.225754 \\ 0.13679 \\ 0.0000456251\end{array}

That’s all I want the correlation matrix for: its eigenvalues. Instead of its eigenvector matrix V, with whatever signs Mathematica® returns, I use the v matrix from the SVD.

Then create a diagonal matrix whose entries are the square roots of those eigenvalues; that is, the diagonal entries are

\sqrt{\lambda} = \begin{array}{l} 1.86739 \\ 1.45951 \\ 1.04831 \\ 0.997238 \\ 0.737033 \\ 0.619215 \\ 0.475136 \\ 0.369851 \\ 0.00675464\end{array}

Let me remind us that w0, the cut-down w, is the same shape as \Lambda\ . That is, let w0 be the square 9×9 submatrix of w:

w0 = \left(\begin{array}{lllllllll} 9.33696 & 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 7.29756 & 0. & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 5.24156 & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 4.98619 & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 3.68517 & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 3.09608 & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 2.37568 & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 0. & 1.84926 & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0.0337732\end{array}\right)

I promised in the previous post that there would be one caveat, and here it is: w0 is proportional to \Lambda\ , but not equal to it.

\Lambda = \left(\begin{array}{lllllllll} 1.86739 & 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 1.45951 & 0. & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 1.04831 & 0. & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.997238 & 0. & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 0.737033 & 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 0.619215 & 0. & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 0.475136 & 0. & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0.369851 & 0. \\ 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0. & 0.00675464\end{array}\right)

The proportionality constant is \sqrt{N-1}\ ; in this case, 5 because we have 26 observations. We have

w0 = 5\ \Lambda

It’s that proportionality that I know I will mess up unless I explicitly start with the eigenvalues of the correlation matrix.

then the \sqrt{\text{eigenvalue}}-weighted matrix A is

A = v\ \Lambda

and its first 2 columns, rounded off, are…

\left(\begin{array}{ll} -0.98 & 0.08 \\ 0 & 0.9 \\ 0.65 & 0.52 \\ 0.48 & 0.38 \\ 0.61 & 0.07 \\ 0.71 & -0.51 \\ 0.14 & -0.66 \\ 0.72 & -0.32 \\ 0.69 & 0.3\end{array}\right)

Those are loadings. They happen to match Bartholomew’s choice of signs.

The “component score coefficients” are computed as \frac{1}{\sqrt{\text{eigenvalue}}}-weighted matrix v… that inverse square root tells me, and you, I hope, that the “component score coefficients” are the reciprocal basis to A.

The first two columns of the “component score coefficients” are…

cs = \left(\begin{array}{ll} -0.28 & 0.04 \\ 0 & 0.42 \\ 0.19 & 0.24 \\ 0.14 & 0.18 \\ 0.17 & 0.04 \\ 0.2 & -0.24 \\ 0.04 & -0.31 \\ 0.21 & -0.15 \\ 0.2 & 0.14\end{array}\right)

They, too, match Bartholomew et al., as we would expect, since the first two columns of A matched. (I have labeled those “cs” but they are only the first two columns of the cs matrix, which is 9×9. I will consistently abuse that notation, because I’m going to be displaying matrices which are equal but computed differently. I really am displaying the results of computations, not just copying the thing that I’m supposed to get.)

The scores are computed by projecting the data onto the “component score coefficients”. That told me that Bartholomew’s scores were the new components of the data wrt the A basis. Here they are, as Bartholomew would compute them:

scores = \left(\begin{array}{ll} 0.9 & -0.82 \\ 0.5 & -1.43 \\ 0.4 & -0.75 \\ 0.45 & -0.01 \\ -0.05 & -0.28 \\ 0.2 & -0.52 \\ 0.56 & 0.51 \\ 0.89 & -1.35 \\ 0.86 & -0.25 \\ 0.62 & 0.1 \\ 0.52 & -0.5 \\ -1.11 & -0.24 \\ 0.89 & -0.72 \\ -0.53 & -0.51 \\ -0.23 & -0.41 \\ 0.57 & -1.06 \\ 0.56 & -0.5 \\ -3.33 & -0.72 \\ -0.38 & 1.01 \\ 0.22 & 1.79 \\ 0.94 & 1.89 \\ 0.3 & 2.11 \\ -0.58 & 1.28 \\ -1.08 & 1.08 \\ 0.03 & 0.85 \\ -2.07 & -0.55\end{array}\right)

I need to plot these, to confirm that they are the same as Bartholomew’s.

Yes.

We have reproduced the calculations in the previous post, but whereas I was changing signs all over the place in the previous post – in order to match their pictures – this time I know that I have a consistent set of signs for whatever calculations I choose to do.

Computing F transpose instead

Now I’m ready to get F^T = X\ A^{-T}\ . Let’s compute A^{-T} and look at its first two columns: they should be the “component score coefficients”.

A^{-T} = \left(\begin{array}{ll} -0.28 & 0.04 \\ 0 & 0.42 \\ 0.19 & 0.24 \\ 0.14 & 0.18 \\ 0.17 & 0.04 \\ 0.2 & -0.24 \\ 0.04 & -0.31 \\ 0.21 & -0.15 \\ 0.2 & 0.14\end{array}\right)

They are. Recall the “component score coefficients”, as computed according to Bartholomew:

cs = \left(\begin{array}{ll} -0.28 & 0.04 \\ 0 & 0.42 \\ 0.19 & 0.24 \\ 0.14 & 0.18 \\ 0.17 & 0.04 \\ 0.2 & -0.24 \\ 0.04 & -0.31 \\ 0.21 & -0.15 \\ 0.2 & 0.14\end{array}\right)

They match.

Then F^T is the projection of X onto the reciprocal basis, and must agree with the scores as we already computed them:

F^T = \left(\begin{array}{ll} 0.9 & -0.82 \\ 0.5 & -1.43 \\ 0.4 & -0.75 \\ 0.45 & -0.01 \\ -0.05 & -0.28 \\ 0.2 & -0.52 \\ 0.56 & 0.51 \\ 0.89 & -1.35 \\ 0.86 & -0.25 \\ 0.62 & 0.1 \\ 0.52 & -0.5 \\ -1.11 & -0.24 \\ 0.89 & -0.72 \\ -0.53 & -0.51 \\ -0.23 & -0.41 \\ 0.57 & -1.06 \\ 0.56 & -0.5 \\ -3.33 & -0.72 \\ -0.38 & 1.01 \\ 0.22 & 1.79 \\ 0.94 & 1.89 \\ 0.3 & 2.11 \\ -0.58 & 1.28 \\ -1.08 & 1.08 \\ 0.03 & 0.85 \\ -2.07 & -0.55\end{array}\right)

Yes, the same.

Let’s check things. We can compute AF, compare it with Z = X^T. They should be equal…. I get differences of about 10^–15 or less.

Good.

Bartholomew’s “component score coefficients” are the columns of the reciprocal basis A^{-T}\ , and his scores are the F^T matrix. Bartholomew et al. have factored the data matrix the same way Harman did. (Good. it’s nice that they agree with a classic text.)

There is another way

But there is another approach to all this. After all, we know how to factor a data matrix. In fact, I started this process by factoring it!

There’s just that one little caveat: the singular values w are not quite the square roots of the eigenvalues – because there’s a pesky factor of \sqrt{N-1}.

To be quite explicit, suppose we do an SVD of the standardized data X:

X = u\ w\ v^T\ .

Now, compute the correlation matrix c..

c = \frac{1}{N-1} \left(v\ w^T\ u^T\right)\ \left(u\ w\ v^T\right) = \frac{1}{N-1} \left(v\ w^T\ \ w\ v^T\right)

but the eigendecomposition gave us

c = V\ \lambda\ V^T = V \Lambda^2\ V^T

and we may choose, as I did, to set V = v. (That’s pretty major: we are requiring that u and V have consistent signs.) Then we see that

\Lambda^2 = \frac{w^T\ w}{N-1}

Hence (if we cut down w to w0 so that it is square)

\Lambda = \frac{w0}{\sqrt{N-1}}

or

w0 = \sqrt{N-1}\ \Lambda\ .

(as we confirmed earlier.)

Now go back and rewrite the SVD (cutting down u to u0 that it is conformable, i.e. the right size) :

X = u0\ w0\ v^T = u0\ \sqrt{N-1}\ \Lambda\ v^T\ .

Collect terms:

X = \left(u0\ \sqrt{N-1}\right)\  \left(\Lambda\ v^T\right)

Now recognize A^T = \Lambda\ v^T (since A = v\ \Lambda) and we have

X = \left(u0\ \sqrt{N-1}\right)\ A^T\ .

For A invertible, we may then infer that

X\ A^{-T} = F^T = u0\ \sqrt{N-1}\ .

in this case, with N = 26, we have \sqrt{N-1} = 5\ .

In other words, the scores are 5 u0.

(I have to tell you that I found that his scores were 5 u0 before I understood why. I was a little annoyed that they were so much simpler than projections onto a reciprocal basis.)

So let’s compute 5 u, and take its first two columns. (In fact, we could compute the full matrix 5u rather than 5 u0; I would never take more than 9 columns of it, because that’s all the columns I have of data X, and the scores are the new components of the data.)

Here are the first two columns of 5 u:

5u  = \left(\begin{array}{ll} 0.898195 & -0.820867 \\ 0.500376 & -1.42956 \\ 0.396261 & -0.753289 \\ 0.447681 & -0.00764342 \\ -0.0543495 & -0.27814 \\ 0.197129 & -0.517025 \\ 0.556321 & 0.507807 \\ 0.886495 & -1.34697 \\ 0.85616 & -0.250689 \\ 0.617761 & 0.0961429 \\ 0.520875 & -0.502753 \\ -1.10976 & -0.237475 \\ 0.885591 & -0.722726 \\ -0.533138 & -0.508625 \\ -0.228874 & -0.41409 \\ 0.569114 & -1.06054 \\ 0.555631 & -0.49878 \\ -3.33314 & -0.716295 \\ -0.381018 & 1.00561 \\ 0.223669 & 1.7902 \\ 0.935125 & 1.89129 \\ 0.303693 & 2.11194 \\ -0.584818 & 1.28005 \\ -1.07726 & 1.07835 \\ 0.0264838 & 0.850927 \\ -2.0742 & -0.546846\end{array}\right)

edit 17 oct 2008: while it isn’t essential, it would help if i rounded those off, and showed F^T:

5 u = 5 u0 = \left(\begin{array}{ll} 0.9 & -0.82 \\ 0.5 & -1.43 \\ 0.4 & -0.75 \\ 0.45 & -0.01 \\ -0.05 & -0.28 \\ 0.2 & -0.52 \\ 0.56 & 0.51 \\ 0.89 & -1.35 \\ 0.86 & -0.25 \\ 0.62 & 0.1 \\ 0.52 & -0.5 \\ -1.11 & -0.24 \\ 0.89 & -0.72 \\ -0.53 & -0.51 \\ -0.23 & -0.41 \\ 0.57 & -1.06 \\ 0.56 & -0.5 \\ -3.33 & -0.72 \\ -0.38 & 1.01 \\ 0.22 & 1.79 \\ 0.94 & 1.89 \\ 0.3 & 2.11 \\ -0.58 & 1.28 \\ -1.08 & 1.08 \\ 0.03 & 0.85 \\ -2.07 & -0.55\end{array}\right) F^T = \left(\begin{array}{ll} 0.9 & -0.82 \\ 0.5 & -1.43 \\ 0.4 & -0.75 \\ 0.45 & -0.01 \\ -0.05 & -0.28 \\ 0.2 & -0.52 \\ 0.56 & 0.51 \\ 0.89 & -1.35 \\ 0.86 & -0.25 \\ 0.62 & 0.1 \\ 0.52 & -0.5 \\ -1.11 & -0.24 \\ 0.89 & -0.72 \\ -0.53 & -0.51 \\ -0.23 & -0.41 \\ 0.57 & -1.06 \\ 0.56 & -0.5 \\ -3.33 & -0.72 \\ -0.38 & 1.01 \\ 0.22 & 1.79 \\ 0.94 & 1.89 \\ 0.3 & 2.11 \\ -0.58 & 1.28 \\ -1.08 & 1.08 \\ 0.03 & 0.85 \\ -2.07 & -0.55\end{array}\right)

Those are, indeed, first two columns of the scores, and the first two columns of F^T.

It’s the computation of 5u that requires the signs of A to be consistent with the signs of u. That’s why I will use v from the SVD instead of V from the eigendecomposition of a correlation or covariance matrix, even though I will calculate their eigenvalues. (If, and it’s a big if, I choose to use the correlation or covariance matrices at all.)

I want to confirm the factoring of the data, that the product of the scores and the loadings is the standardized data. Let’s compute (5 u0) A^T. For this I need the cut-down u. Since A is 9×9, u0 must have 9 columns.

So I compute it. the difference between the product and the data, X - 5\ u0\ A^T\ , is the zero matrix (to within 10^-15 on each entry).

Let me emphasize that the \sqrt{N-1} showed up because we used \Lambda instead of w. If we’d never seen the eigenvalues \lambda of the correlation matrix, computing scores would have been simpler, because we’d have used w instead of the square roots \Lambda\ .

But, of course, the scores and loadings would have been different: we would have gotten a different matrix A. Instead of

X = \left(5\ u\right)\ \left(\Lambda\ v^T\right)\

we would have had

X = u\ \left(w\ v^T\right)\

i.e. A = w\ v^T\ .

Anyway, we know 3 equivalent ways to compute the scores for Harman’s or Bartholomew’s model:

  • project onto the reciprocal basis: either as X\ \left(V\  \Lambda^{-1}\right) or as F^T= X\ A^{-T}\ .
  • compute \sqrt{N-1}\ u0\ .

Ah, they’re not quite equivalent: what we just worked is a case where A was invertible.

What if A is not invertible? Equivalently, what if there’s a zero eigenvalue?

That is a very real possibility, even for real data: if the rows sum to exactly 1, then, as we saw, both the correlation matrix and the covariance matrix have a zero eigenvalue.

In such a case, \sqrt{N-1}\ u0\ is probably the way to go, although we can make the other computations work. Let’s look at such a case next time..

PCA / FA Example 7: Bartholomew et al. Calculations

The familiar part

This is the second post about Example 7. We confirm their analysis, but we work with the computed correlation matrix rather than their published rounded-off correlation matrix.

I am pretty well standardizing my notation. V is an orthogonal eigenvector matrix from an eigendecomposition; \lambda is the associated eigenvalues, possibly as a list, possibly as a square diagonal matrix. \Lambda is the square roots of \lambda\ , possibly as a list, possibly as a square matrix, and possibly as a matrix with rows of zeroes appended to make it the same shape as w (below).

Ah, X is a data matrix with observations in rows. Its transpose is Z = X^T\ .

The (full) singular value decomposition (SVD) is the product u\ w\ v^T\ , with u and v orthogonal, w is generally rectangular, as it must be the same shape as X.

The cut-down SVD makes a square matrix w0 out of w, and I write uo\ wo\ vo^T\ . I only do it in order to make an invertible form of w. For the product to work, we must drop some columns of u, and possibly columns of v.

By the same token, if any eigenvalues \lambda are zero, we would drop them, and make the diagonal matrices \lambda and \Lambda invertible. And, again, conformability requires that we drop columns of V. (There is no U, unless I also do an eigendecomposition of X\ X^T\ , which is not all that likely.)

A is a weighted matrix derived from V. It is usually defined as

A = V\ \Lambda\ ,

which means that each column of V has been weighted by a common \sqrt{\text{eigenvalue}}.

You will note that I am being very sloppy about the sizes of \lambda\ , \Lambda\ , and V. I do distinguish things derived from the eigendecomposition (\lambda\ , \Lambda\ , V, A) from things derived from the SVD (u, v, w); in addition, I usually distinguish the cut-down SVD (uo, vo, wo) from the full one.

Here are the eigenvalues of the computed correlation matrix:

\begin{array}{l} 3.48715 \\ 2.13017 \\ 1.09896 \\ 0.994483 \\ 0.543218 \\ 0.383428 \\ 0.225754 \\ 0.13679 \\ 0.0000456251\end{array}

I can get a scree plot and confirm their fig 5.6. That is, we simply plot the each eigenvalue against the order in which it occurred. Personally, i would have used percentages rather than the actual eigenvalues. Regardless, we would probably keep 3 or 4 of the eigenvalues. (We discussed both of these way back in the Jolliffe posts, which followed Harman, the beginning of the PCA / FA category.)

I start to say that their definition of loadings is, in fact, Davis’ A^R = v\ w^T

but it’s not so. w \ne \Lambda = \sqrt{\lambda} edited: = instead of a 2nd \ne

Harman and Jolliffe both computed eigendecompositions of a correlation matrix; Davis and Malinowski both computed eigendecompositions of X^T\ X\ . There’s quite a difference. Do you recall it? (I will, later and in some detail.)

Their definition of loadings is the weighted eigenvector matrix: best to write it as A = V\ \Lambda.

First we get \Lambda as a diagonal matrix with the following diagonal elements, which are \sqrt{\text{eigenvalues}}:

\begin{array}{l} 1.86739 \\ 1.45951 \\ 1.04831 \\ 0.997238 \\ 0.737033 \\ 0.619215 \\ 0.475136 \\ 0.369851 \\ 0.00675464\end{array}

For reference, here are the first three columns of the orthogonal eigenvector matrix V:

\left(\begin{array}{lll} -0.523791 & -0.0535939 & -0.0486744 \\ -0.00132346 & -0.617807 & 0.2011 \\ 0.347495 & -0.355054 & 0.150463 \\ 0.255716 & -0.261096 & 0.561083 \\ 0.325179 & -0.0512884 & -0.153321 \\ 0.37892 & 0.350172 & 0.115096 \\ 0.0743736 & 0.453698 & 0.587361 \\ 0.387409 & 0.221521 & -0.311904 \\ 0.366823 & -0.202592 & -0.375106\end{array}\right)

Get loadings as A = V\ \Lambda and display only the first 3 columns, rounded.

A = \left(\begin{array}{lll} -0.98 & -0.08 & -0.05 \\ 0 & -0.9 & 0.21 \\ 0.65 & -0.52 & 0.16 \\ 0.48 & -0.38 & 0.59 \\ 0.61 & -0.07 & -0.16 \\ 0.71 & 0.51 & 0.12 \\ 0.14 & 0.66 & 0.62 \\ 0.72 & 0.32 & -0.33 \\ 0.69 & -0.3 & -0.39\end{array}\right)

That’s their table 5.7. (Except for the sign of the 2nd column, but we know that’s ok: any eigenvector can be multiplied by any number, and any unit eigenvector can be multiplied by –1.)

Now plot the first two. But change the sign of the second (only to match them for the drawing).

That’s their fig. 5.7.

Next, we get “component score coefficients”. They define them as eigenvectors weighted by \frac{1}{\sqrt{\text{eigenvalues}}}\ . (Doesn’t that division remind you of a reciprocal basis, namely the reciprocal basis of A? That’s exactly what it is.) For now, anyway, I simply compute what they tell me, and get, for the first two columns:

\left(\begin{array}{ll} -0.28 & 0.04 \\ 0 & 0.42 \\ 0.19 & 0.24 \\ 0.14 & 0.18 \\ 0.17 & 0.04 \\ 0.2 & -0.24 \\ 0.04 & -0.31 \\ 0.21 & -0.15 \\ 0.2 & 0.14\end{array}\right)

That’s their table 5.9. and those two columns are two of the reciprocal basis vectors to A.

The unfamiliar part

To compute scores, we need to introduce the data.

Well, what is the data?

Please don’t curse me out. That wasn’t as stupid a question as it sounded.

First off, we started with “data”. We had to: we couldn’t use the published correlation matrix because it was not positive definite. But what we started with was the “raw data”, and what we did to it was compute its correlation matrix.

So what’s the appropriate data for the analysis? After all, over the past several months we’ve discussed four possible forms of the data

  1. raw
  2. centered
  3. standardized
  4. small standard deviates

and every one of them has exactly the same correlation matrix. How do we choose?

There are only two sensible possibilities, but the fact is there are two, not just one. We can use standardized data, or we can use small standard deviates.

(In fact, we could use any multiple of the standardized data but why would we? Small standard deviates are one particular multiple, all divided by \sqrt{\text{N-1}}, where N is the number of observations.)

But why must we use some multiple (possibly 1) of the standardized data? Because then the eigenvector matrix V from the eigendecomposition of the correlation matrix matches the v matrix from the singular value decomposition u\ w^T\ v of the standardized data (to within signs of columns).

Multiplying the standardized data matrix by any constant simply multiplies the w’s or the \lambda’s, it has no effect on the orthogonal eigenvector matrices u, v, V.

Oh, if we had done an eigendecomposition of the covariance matrix instead of the correlation matrix, we would have to now choose our “data” as the centered data, not the standardized. The reasoning is the same: the eigenvector matrix V from the eigendecomposition of the covariance matrix matches the v matrix from the singular value decomposition of the centered data (or any constant multiple of it).

So now we do what most people take completely for granted, and we standardize the data, because we worked from the correlation matrix. But we could choose small standard deviates. (More to follow.)

I would check the variances; they are each 1, as they should be.

Henceforth, X is the matrix of standardized data. Here are its first two columns:

\left(\begin{array}{ll} -1.01828 & -0.364773 \\ -0.638776 & -1.18948 \\ -0.535859 & -0.467862 \\ -0.799583 & 0.0475791 \\ 0.261745 & -0.261685 \\ -0.207812 & -0.674038 \\ -0.73526 & 1.90317 \\ -0.825312 & -1.18948 \\ -1.05687 & 0.150667 \\ -0.413646 & -0.158597 \\ -0.394349 & -0.880214 \\ 1.43242 & -0.674038 \\ -0.65164 & -0.777126 \\ 0.55763 & -0.983302 \\ 0.242448 & -0.467862 \\ -0.838177 & -0.880214 \\ -0.73526 & -1.08639 \\ 3.06622 & -0.57095 \\ 0.287474 & 0.666108 \\ -0.169219 & 1.69699 \\ -0.96039 & 1.69699 \\ 0.16526 & 1.90317 \\ 0.769895 & 1.28464 \\ 1.00146 & 0.872284 \\ 0.293906 & 0.150667 \\ 1.90198 & 0.253755\end{array}\right)

Finally, having decided upon the standardized data X, they compute the scores by projecting X onto the “component score coefficients”,

scores = X\ cs

Hey, that’s why they computed the reciprocal basis: so they could get the components wrt the A basis by doing a projection. Here are the first two columns of the scores.

\left(\begin{array}{ll} 0.898195 & -0.820867 \\ 0.500376 & -1.42956 \\ 0.396261 & -0.753289 \\ 0.447681 & -0.00764342 \\ -0.0543495 & -0.27814 \\ 0.197129 & -0.517025 \\ 0.556321 & 0.507807 \\ 0.886495 & -1.34697 \\ 0.85616 & -0.250689 \\ 0.617761 & 0.0961429 \\ 0.520875 & -0.502753 \\ -1.10976 & -0.237475 \\ 0.885591 & -0.722726 \\ -0.533138 & -0.508625 \\ -0.228874 & -0.41409 \\ 0.569114 & -1.06054 \\ 0.555631 & -0.49878 \\ -3.33314 & -0.716295 \\ -0.381018 & 1.00561 \\ 0.223669 & 1.7902 \\ 0.935125 & 1.89129 \\ 0.303693 & 2.11194 \\ -0.584818 & 1.28005 \\ -1.07726 & 1.07835 \\ 0.0264838 & 0.850927 \\ -2.0742 & -0.546846\end{array}\right)

Let’s plot those numbers, as xy-coordinates.

That matches their figure 5.8.

To summarize: the familiar part was the eigendecomposition of the correlation matrix, yielding eigenvalues \lambda and their square roots \Lambda\ , and the orthogonal eigenvector matrix V and the \sqrt{\text{eigenvalue}}-weighted eigenvector matrix A = V\ \Lambda\ . We got a scree plot of the eigenvalues, and plotted the first two columns of A.

For the unfamiliar part, we computed the scores by computing the reciprocal basis (“component score coefficients”, i.e. weighted by the inverse square roots, or V\ \Lambda^{-1}), and then projecting the standardized data onto the reciprocal basis.

There are a couple of ways to get the scores without the reciprocal basis, but there is one wrinkle. Let’s look at it next time.

Review: reciprocal basis

if the reciprocal basis confuses you, look for it in the linear algebra category, and in the PCA / FA category…. or you could pick a month, and search for “reciprocal”.

Or, for this special case, when we only change the sizes but not the directions of an orthonormal basis, think of it as follows.

If we have old components of data, as a column vector z, and an orthogonal transition matrix V, then the new components of the data, as a column vector y, are described by

z = V\ y\ .

Transpose:

z^T = y^T\ V^T

Let x = z^T\ , since Z = X^T\ :

x = y^T\ V^T

Post-multply by V (edit: not by V^T\ ):

x\ V = y^T\ V^T\ V = y^T

since V is orthogonal, i.e. V^T\ V = I\ .

What we’ve just seen is that the usual change-of-basis equation

z = V\ y

or

Z = V\ Y

which represents old and new data components as columns is consistent with the projection

Y^T = X\ V\ ,

which represents old and new data components as rows, if V is orthogonal.

Now suppose we were to make all the vectors in V larger, say by a factor of 2; then the new components y would have to be smaller by a factor of 2:

z = \left(2\ V\right)\  \left(y\ /\ 2\right)\ .

If we were to project X onto the new basis (2 V), however, we’d get numbers that were larger by a factor of 2 instead of smaller. That’s just wrong.

This what the reciprocal basis takes care of. If the new basis were 2 V, then the reciprocal basis would be V / 2. If the new basis has columns A = V\ \Lambda\ , then the reciprocal basis has columns V\ \Lambda^{-1}\ .

And then we can get new components of the data by projecting onto the reciprocal basis. If you insist on projecting, then you need the reciprocal basis. (If you want the scores to be the new components of the data wrt the A basis. Not everyone does, apparently. Harman’s or Jolliffe’s definitions of the scores are not universal: Davis is an exception. More to follow.)

As I said, however, there are a couple of ways to get the scores without the reciprocal basis, but there is one wrinkle.

Let’s look at it next time.

PCA / FA Example 7: Bartholomew et al. Correlation matrix

edit 5 Oct 2008: I had omitted the word “constant”. see edit.

The following example comes from Bartholomew et al. “The Analysis and Interpretation of Multivariate Data for Social Scientists.”

It is an excellent example with which to wrap up PCA / FA. (There’s a lot we haven’t done, but it’s almost time for me to move on.)

The example is “employment in 26 European countries”, “eurojob” for short, from chapter 5 (either 1st or 2nd edition), and data for both editions is available at http://www.cmm.bris.ac.uk/team/amssd.shtml . Please note that I am using the 1st edition of the book, and the 1st edition data.

When I first worked this example, I knew that something interesting happened, but not why; and, there was one thing I didn’t understand at all, back then.

One reason why this example is so good is that they provide both the data and a correlation matrix. In fact, most of their analyses in chapter 5 seem to be based on the correlation matrix, except in those few cases when they compute scores (for which they need the data). We get to return to our starting point, using the correlation matrix.

Here’s the raw data.

data = \left(\begin{array}{lllllllll} 3.3 & 0.9 & 27.6 & 0.9 & 8.2 & 19.1 & 6.2 & 26.6 & 7.2 \\ 9.2 & 0.1 & 21.8 & 0.6 & 8.3 & 14.6 & 6.5 & 32.2 & 7.1 \\ 10.8 & 0.8 & 27.5 & 0.9 & 8.9 & 16.8 & 6. & 22.6 & 5.7 \\ 6.7 & 1.3 & 35.8 & 0.9 & 7.3 & 14.4 & 5. & 22.3 & 6.1 \\ 23.2 & 1. & 20.7 & 1.3 & 7.5 & 16.8 & 2.8 & 20.8 & 6.1 \\ 15.9 & 0.6 & 27.6 & 0.5 & 10. & 18.1 & 1.6 & 20.1 & 5.7 \\ 7.7 & 3.1 & 30.8 & 0.8 & 9.2 & 18.5 & 4.6 & 19.2 & 6.2 \\ 6.3 & 0.1 & 22.5 & 1. & 9.9 & 18. & 6.8 & 28.5 & 6.8 \\ 2.7 & 1.4 & 30.2 & 1.4 & 6.9 & 16.9 & 5.7 & 28.3 & 6.4 \\ 12.7 & 1.1 & 30.2 & 1.4 & 9. & 16.8 & 4.9 & 16.8 & 7. \\ 13. & 0.4 & 25.9 & 1.3 & 7.4 & 14.7 & 5.5 & 24.3 & 7.6 \\ 41.4 & 0.6 & 17.6 & 0.6 & 8.1 & 11.5 & 2.4 & 11. & 6.7 \\ 9. & 0.5 & 22.4 & 0.8 & 8.6 & 16.9 & 4.7 & 27.6 & 9.4 \\ 27.8 & 0.3 & 24.5 & 0.6 & 8.4 & 13.3 & 2.7 & 16.7 & 5.7 \\ 22.9 & 0.8 & 28.5 & 0.7 & 11.5 & 9.7 & 8.5 & 11.8 & 5.5 \\ 6.1 & 0.4 & 25.9 & 0.8 & 7.2 & 14.4 & 6. & 32.4 & 6.8 \\ 7.7 & 0.2 & 37.8 & 0.8 & 9.5 & 17.5 & 5.3 & 15.4 & 5.7 \\ 66.8 & 0.7 & 7.9 & 0.1 & 2.8 & 5.2 & 1.1 & 11.9 & 3.2 \\ 23.6 & 1.9 & 32.3 & 0.6 & 7.9 & 8. & 0.7 & 18.2 & 6.7 \\ 16.5 & 2.9 & 35.5 & 1.2 & 8.7 & 9.2 & 0.9 & 17.9 & 7. \\ 4.2 & 2.9 & 41.2 & 1.3 & 7.6 & 11.2 & 1.2 & 22.1 & 8.4 \\ 21.7 & 3.1 & 29.6 & 1.9 & 8.2 & 9.4 & 0.9 & 17.2 & 8. \\ 31.1 & 2.5 & 25.7 & 0.9 & 8.4 & 7.5 & 0.9 & 16.1 & 6.9 \\ 34.7 & 2.1 & 30.1 & 0.6 & 8.7 & 5.9 & 1.3 & 11.7 & 5. \\ 23.7 & 1.4 & 25.8 & 0.6 & 9.2 & 6.1 & 0.5 & 23.6 & 9.3 \\ 48.7 & 1.5 & 16.8 & 1.1 & 4.9 & 6.4 & 11.3 & 5.3 & 4.\end{array}\right)

I remark that the matrix is 26×9. What we have is: for 26 countries, the % of people employed in 9 categories (agriculture, mining, etc.).

This would be a good, rather than excellent, example to end with simply because it takes us back to our roots, using the correlation matrix for its PCA analysis. I compute the correlation matrix, and I save it; but I also round it to 2 places, and show it to you:

\left(\begin{array}{lllllllll} 1. & 0.04 & -0.67 & -0.4 & -0.54 & -0.74 & -0.22 & -0.75 & -0.56 \\ 0.04 & 1. & 0.45 & 0.41 & -0.03 & -0.4 & -0.44 & -0.28 & 0.16 \\ -0.67 & 0.45 & 1. & 0.39 & 0.49 & 0.2 & -0.16 & 0.15 & 0.35 \\ -0.4 & 0.41 & 0.39 & 1. & 0.06 & 0.2 & 0.11 & 0.13 & 0.38 \\ -0.54 & -0.03 & 0.49 & 0.06 & 1. & 0.36 & 0.02 & 0.16 & 0.39 \\ -0.74 & -0.4 & 0.2 & 0.2 & 0.36 & 1. & 0.37 & 0.57 & 0.19 \\ -0.22 & -0.44 & -0.16 & 0.11 & 0.02 & 0.37 & 1. & 0.11 & -0.25 \\ -0.75 & -0.28 & 0.15 & 0.13 & 0.16 & 0.57 & 0.11 & 1. & 0.57 \\ -0.56 & 0.16 & 0.35 & 0.38 & 0.39 & 0.19 & -0.25 & 0.57 & 1.\end{array}\right)

I do not quite match them. In fact, we differ in one number, at the end of the first row (and at the end of the first column, of course). To five places, I have the number as -0.56492. They show -.57 where I round to -.56.

That’s irrelevant. What is not irrelevant is the result of computing the eigenvalues of the rounded correlation matrix.

It could be a big deal, but it isn’t. Here be dragons, in either case. The point is that their apparent starting point is their printed (hence rounded) correlation matrix. (To be more precise: they clearly computed an eigendecomposition of the correlation matrix. I used their printed correlation matrix.) Whether I use theirs or mine, so long as it’s rounded to 2 places, life is interesting.

Let’s just do our thing: get the eigendecomposition of the rounded correlation matrix… and look at the eigenvalues.

\left(\begin{array}{l} 3.48828 \\ 2.14083 \\ 1.10125 \\ 0.992444 \\ 0.543472 \\ 0.379233 \\ 0.224842 \\ 0.133101 \\ -0.00344958\end{array}\right)

I hope you took a deep breath right there. The smallest eigenvalue is negative. But the correlation matrix is supposed to be positive semi-definite. (Because it’s of the form X^T\ X\ .)

Alarm bells should be ringing in your head: eigenvalues of a positive semi-definite matrix are non-negative. Where did that negative eigenvalue come from?

Presumably the smallest eigenvalue is very close to 0, and the negative number is numerical error.

Well, sort of, but we can say more. Here are the eigenvalues of the unrounded correlation matrix:

\left(\begin{array}{l} 3.48715 \\ 2.13017 \\ 1.09896 \\ 0.994483 \\ 0.543218 \\ 0.383428 \\ 0.225754 \\ 0.13679 \\ 0.0000456251\end{array}\right)

We see that the last one is very small but positive. The computed correlation matrix is positive definite. The problem comes from rounding off the correlation matrix.

But that must be positive definite, too, you cry?

Why?

Symmetry is not enough to guarantee positive definite.

This is just the tip of an iceberg. There was nothing special about rounding to 2 digits, nothing special about a correlation matrix. It is not trivial to manipulate a positive (semi-) definite matrix and guarantee that subsequent matrices are still positive (semi-) definite. But that’s about all I’ll say about the numerical hassles.

As I said, the computed correlation matrix is, in fact, positive definite: its determinant and all its eigenvalues are positive. Mathematica® does just fine with it.

But the rounded correlation matrix – which is what they presented – is not positive definite: its determinant and one eigenvalue are negative.

So we now know two good reasons to provide data instead of just providing a correlation matrix: one, we can do more with the data; two, a rounded-off correlation matrix may not be positive semi-definite. And, bless them, they did provide the data.

Just this insight alone upgrades this to a very good example. And this negative eigenvalue was the “something interesting”. We’ll see very shortly why it happened. (Perhaps you can guess.)

To summarize. The smallest eigenvalue of the correlation matrix is extremely close to zero, but non-negative, as it should be. The real problem is that the correlation matrix is almost singular (not of full rank); then the rounded correlation matrix is almost singular, too, but the smallest eigenvalue went to the other side of zero.

The lesson we have learned is: if the correlation matrix is barely of full rank, working from a rounded version of it could work out badly.

Now, why is the correlation matrix almost singular? I.e. why is the smallest eigenvalue so close to zero?

I swear I had no idea that this example would tie in so closely with recent posts. I knew that inadvertent reduction of rank was important, but I didn’t have a clear reason why.

What might cause the smallest eigenvalue to go to zero? All we did was compute a correlation matrix. That is, all we did was implicitly center the columns.

Ah ha! Just what are the row sums of the raw data?

Here they are:

row sums = \left(\begin{array}{l} 100. \\ 100.4 \\ 100. \\ 99.8 \\ 100.2 \\ 100.1 \\ 100.1 \\ 99.9 \\ 99.9 \\ 99.9 \\ 100.1 \\ 99.9 \\ 99.9 \\ 100. \\ 99.9 \\ 100. \\ 99.9 \\ 99.7 \\ 99.9 \\ 99.8 \\ 100.1 \\ 100. \\ 100. \\ 100.1 \\ 100.2 \\ 100.\end{array}\right)

The min, mean, and max are 99.7, 99.9923, and 100.4 respectively.

They’re not constant, but they’re awfully close to constant. Hence, the smallest eigenvalue of the correlation matrix isn’t zero, but it’s awfully close to zero. I explained here that making (edit: constant) non-zero row sums followed by centering the columns leads to constant zero row sums.

So: we just got a real-life example of the inadvertent loss of rank caused by using the correlation matrix on constant-row-sum data. The computed correlation matrix is positive definite but almost singular (determinant almost zero but positive, smallest eigenvalue almost zero but positive); and then because we were nearly singular, the rounded correlation matrix could fail to be positive definite, and in fact failed to be positive semi-definite.

Next, we will confirm that part of their analysis based on the correlation matrix.