PCA / FA. Example 4 ! Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)

It would be prudent to compute the means…

 \{10,\ 24,\ 15\}

and variances of the data…

 \{18.6667,\ 6.66667,\ 6.66667\}

It would also be prudent to calculate the row sums.

  \{49,\ 49,\ 49,\ 49\}

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

 X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.

Next, Davis chooses to analyze X^T X instead of the covariance matrix \frac{X^TX}{N-1}\ . We recall that these two have the same eigenvectors, but eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of X^T X\ .

Here is the SVD of X, X = u\ w\ v^T\ . (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)

v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Sure enough, there are only two nonzero singular values in the w matrix.

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

A^R = v\ w^T

S^R = X A^R (= u\ w\ w^T\ , but use the definition for computation: one less matrix product.)

A^Q = u\ w ( = X v)

S^Q = X^T A^Q (= v\ w^T\ w)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

 A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)

 S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data; in the sense that, the matrix product of corresponding scores S and loadings A is not X.

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, Y = A^Q\ .)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of c = X^T X\ . Here it is:

 c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)

Here are its eigenvalues:

 \lambda = \{84,\ 12,\ 0\}

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

We can check that V^T V = I\ . (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

 v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

In fact, they agree completely.

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)

Now we should recall A^R\ , to confirm the equivalence:

A^R = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

\{28.,\ 4.,\ 0.\}

and their sum is 32.

Let’s recall the variances of the columns of X…

\{18.6667,\ 6.66667,\ 6.66667\}

and their sum is also 32.

Now, what about Harman? He writes his model as

Z = A F,

where

Z = X^T\ ,

(i.e. his variables are in rows, not in columns) and A is a \sqrt{\text{eigenvalue}}-weighted eigenector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

\{84,\ 12,\ 0\}

we construct a diagonal matrix \Lambda of their square roots:

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

We should expect that these are the nonzero nonsingular values w. Sure enough:

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Note, however, that w is not square, hence is not equal to the diagonal matrix \Lambda\ . This will lead to an irrelevant difference between A and A^R\ , etc. In fact, I often ignore the distinction between \Lambda and w conceptually; in practice, of course, whatever matrices we’re using have to be “the right size” for whatever matrix products we compute (conformable).

We should recall that the nonzero eigenvalues are the nonzero values of both w^T w and w\ w^T\ :

The nonzero eigenvalues were…

\{84,\ 12,\ 0\}

whereas

w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)

and

w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)

Resuming our thread, we compute the weighted eigenvector matrix A

A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

and that should be Davis’ A^R\ ; and it is, except for an extra – to my mind, irrelevant – column of zeros.

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

Viewing A as a transition matrix, we know that Harman’s scores are the new data F^T\ . We also know (HERE) that we may compute them as

F^T= u\ ,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data F^T should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

We check that by computing A F and comparing it to Z = X^T\ ; we expect Z = AF. And we have it; here’s Z:

Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)

And here’s A F (using the two nonzero columns of A!):

A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)

I remind us that this new data F^T does not have redistributed variance; in fact, this new data has constant variance:

\{0.333333,\ 0.333333\}

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of X^T X rather than of the covariance matrix \frac {X^TX}{N-1}\ . Alternatively, our new data was given by columns of u rather than by columns of \sqrt{N-1}\ u\ . (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data F^T from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data F^T using a reciprocal basis instead; that is, one which weights the v matrix by \frac{1}{\sqrt{\text{eigenvalue}}} instead of \sqrt{\text{eigenvalue}}. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by \sqrt{\text{eigenvalue}}, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue – for finding a reciprocal basis – it is to set the zero eigenvalue to 1. Instead of the diagonal matrix

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

I use

\Lambda 1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1\end{array}\right)

and instead of A = v \Lambda\ , I compute the reciprocal basis B = v \Lambda1^{-1} (i.e. I divide each column by \sqrt{\text{eigenvalue}} instead of multiplying by \sqrt{\text{eigenvalue}}).

B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)

Then the new data with respect to A can be found by projecting X onto B:

X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)

That had better be F^T\ … and it is:

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)

For this example, Davis did not compute the new data with respect to A (i.e. wrt his A^R\ ). As I have said before, I have seen him do it in another example; but his scores S^R are not the new data F^T\ , which is, as far I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix A = v \Lambda was also the dispersion matrix between the new data F^T and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix X^T X\ .

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of F^T\ , so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with X^T Z\ , i.e. that I want to compute

X^T Z = X^T F^T

(finally reverting to my original notation). Fine, I compute that product, and I get

X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)

and I compare that to A:

A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though X^T X it is not the covariance matrix.)

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

D = v \Lambda^2\ .

But… we’ve just seen that weight here. Okay, it wasn’t \Lambda^2\ , but it was either w\ w^T or w^T w\ . Well, it must have been S^Q\ , i.e. the one that multiplied v:

S^Q = X^T A^Q (= v w^T w)\ .

Of course, I should check that computationally. Here is v \Lambda^2\ :

v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

and here is our S^Q\ :

S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Hang on. We need to explicitly see X^T Y\ , the cross-dispersion matrix analagous to X^T X\ : we compute it…

X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores S^Q are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings A^Q are the new data Y with respect to the transition matrix v).

To put that another way…. The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix X^T X\ , and defines new data Y; and S^Q = X^T Y is the cross-dispersion between old data X and new data Y.

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores S^Q\ are that dispersion matrix D.

PCA / FA example 4: davis. Davis & harman 3.

Hey, since we have the reciprocal basis, we can project onto it to get the components wrt the A^R basis. After all that work to show that the S^R are the components wrt the reciprocal basis, we ought to find the components wrt the A^R basis. And now we know that that’s just a projection onto the reciprocal basis: we want to compute X B.

Recall X:

X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

Recall B:

B = \left(\begin{array}{ll} 0.0890871 & 0. \\ -0.0445435 & -0.204124 \\ -0.0445435 & 0.204124\end{array}\right)

The product is:

X\ B = \left(\begin{array}{ll} -0.801784 & 0 \\ 0.267261 & -0.816497 \\ 0 & 0.408248 \\ 0.534522 & 0.408248\end{array}\right)

What are the column variances?

\{0.333333,0.333333\}

Does that surprise you? the 1/3 comes from our using eigenvalues of X^T\ X instead of eigenvalues of the covariance matrix. The new variables have a common variance, but it isn’t 1. this corresponds to our initial work in harman, when we discovered that using the weighted eigenvector matrix gave us new variables with variance 1.

at first glance, it seems not worthwhile to work that out using the covariance matrix, but i suppose it might be worthwhile to come at the covariance matrix from the SVD. (oh, yes! it is!)

Let’s do it.

The covariance matrix of X is \frac{X^T\ X}{N-1}

We should just replace X by X2 = \frac{X}{\sqrt{N-1}} and find the SVD of X2. but there’s a catch.

Do it anyway. here’s our X2:

X2 = \left(\begin{array}{lll} -3.4641 & 1.73205 & 1.73205 \\ 1.1547 & 0.57735 & -1.73205 \\ 0. & -0.57735 & 0.57735 \\ 2.3094 & -1.73205 & -0.57735\end{array}\right)

Compute the variances…

\{6.22222,2.22222,2.22222\}

and their sum … 10.6667 .

Hang on. The data X2 has 1/3 the variance of the original data. We’re trying to find new forms of X (not of X2) with variance 1. but let’s just keep going.

Get the SVD of X2:

here are new u…

u2 = \left(\begin{array}{llll} -0.801784 & 0 & -0.571429 & 0.174964 \\ 0.267261 & -0.816497 & -0.449762 & -0.24417   \\ 0. & 0.408248 & -0.267261 & -0.872872 \\ 0.534522 & 0.408248 & -0.632262 & 0.384531\end{array}\right)

new v…

v2 = \left(\begin{array}{lll} 0.816497 & 0 & 0.57735 \\ -0.408248 & -0.707107 & 0.57735 \\ -0.408248 & 0.707107 & 0.57735\end{array}\right)

and new w…

w2 = \left(\begin{array}{lll} 5.2915 & 0. & 0. \\ 0. & 2. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Then i compute new A’s and S’s from

A^R = v\ w^T.

A^Q = u\ w.

S^R = X\ A^R.

S^Q = X^T\ A^Q.

(using X2, u2, v2, and w2, of course)

A^Q are the components of the data X2 wrt the orthogonal eigenvector matrix v2.

A^Q = \left(\begin{array}{lll} -4.24264 & 0 & 0 \\ 1.41421 & -1.63299 & 0 \\ 0. & 0.816497 & 0 \\ 2.82843 & 0.816497 & 0\end{array}\right)

Compute the variances…

\{9.33333,1.33333,0.\}

and their sum is 10.6667 . Yes, the new orthogonal eigenvector matrix v has redistributed the total variance of X2 among two new variables.

What about the R-mode scores?

S^R = \left(\begin{array}{llll} -22.4499 & 0 & 0 & 0 \\ 7.48331 & -3.26599 & 0 & 0 \\ 0. & 1.63299 & 0 & 0 \\ 14.9666 & 1.63299 & 0 & 0\end{array}\right)

Compute the variances:

\{261.333,5.33333,0.,0.\}

i say again, yikes!

For the record, here are the new

A^R = \left(\begin{array}{llll} 4.32049 & 0 & 0 & 0 \\ -2.16025 & -1.41421 & 0 & 0 \\ -2.16025 & 1.41421 & 0 & 0\end{array}\right)

and

S^Q = \left(\begin{array}{lll} 22.8619 & 0 & 0 \\ -11.431 & -2.82843 & 0 \\ -11.431 & 2.82843 & 0\end{array}\right)

From the orthonormal basis v2 , i construct a reciprocal basis for the \sqrt{\text{eigenvalue}}-weighted matrix A^R. i’ll just use the diagonal of w2 (instead of the \sqrt{\text{eigenvalue}} since i haven’t computed them) to scale two of the vectors. Recall the new w matrix:

w2 = \left(\begin{array}{lll} 5.2915 & 0. & 0. \\ 0. & 2. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

i construct a diagonal matrix using 1/w when possible, 1 otherwise.

\left(\begin{array}{lll} 0.188982 & 0. & 0. \\ 0. & 0.5 & 0. \\ 0. & 0. & 1.\end{array}\right)

then i premultiply that by the new v, and keep only the first two columns; the new reciprocal basis is:

B2 = \left(\begin{array}{ll} 0.154303 & 0 \\ -0.0771517 & -0.353553 \\ -0.0771517 & 0.353553\end{array}\right)

Having the reciprocal basis, we project the data onto it get the new data wrt the A^R basis. But we project the original data. That’s the catch. As i said, it’s the variance of X we’re trying to scale to 1, not the variance of X2. we compute X B2:

X\ B2 = \left(\begin{array}{ll} -1.38873 & 0 \\ 0.46291 & -1.41421 \\ 0 & 0.707107 \\ 0.92582 & 0.707107\end{array}\right)

And the variances?

\{1.,1.\}.

Indeed. this corresponds even more closely to what we first saw in harman. (if instead we project X2 onto B2, we get a common variance of 1/3 again.)

PCA / FA example 4: davis. Davis & harman 2.

What about redistributing the variance?

We learned from harman that if we use the orthogonal eigenvector matrix v to compute new data, the result has redistributed the original variance according to the eigenvalues; and we also learned that if we use the weighted eigenvector matrix instead, we get new variables of unit variance.

(i think i thought this required us to start with the correlation matrix; it does not.)

let’s see this again for davis’ centered data X:

\left(\begin{array}{lll} -6&3&3\\ 2&1&-3\\ 0.&-1&1\\ 4&-3&-1\end{array}\right)

We have variances…

\{\frac{56}{3},\ \frac{20}{3},\ \frac{20}{3}\}

or…

{18.6667,\ 6.66667,\ 6.66667}

and the sum of the variances is…

32.

Now is a good time to recall w^T\ w:

\left(\begin{array}{lll} 84.&0.&0.\\ 0.&12.&0.\\ 0.&0.&0.\end{array}\right)

The sum of the diagonal elements is 96, and if we divide by 3 (= N-1, for the covariance matrix instead of for X^T\ X), gee, we get 32.

stop for a moment. We have two, admittedly proportional, sets of eigenvalues, one set from the covariance matrix, the other from X^T\ X. this gives us two different possible sets of weights to apply to the eigenvector matrix v. (recall that v is an eigenvector matrix both for X^T\ X and for the covariance matrix of X, as well as coming out of the SVD of X.)

Let’s stay with what i know: our new data, using the orthogonal eigenvector matrix v, is A^Q.

\left(\begin{array}{lll} -7.34847&0.&0.\\ 2.44949&-2.82843&0.\\ 0.&1.41421&0.\\ 4.89898&1.41421&0.\end{array}\right)

The variances are…

{28.,\ 4.,\ 0.}

and we don’t even need to consciously add those: the sum is 32. We have redistributed the original variance among our two new variables.

Now, everyone seems to say that the \sqrt{\text{eigenvalue}}-weighted eigenvectors have variances equal to the eigenvalues. What those eigenvectors really have is lengths equal to their \sqrt{\text{eigenvalue}}.

here indeed be dragons.

as linear algebra is my touchstone in general for all of this, that redistributed variance using the orthogonal eigenvector matrix is my vorpal blade (dragon, jabberwock, no big deal – snicker-snack either way). any different choice of basis must lead to different results.

we looked at that redistribution of variance in harman. We learned then that the correct way to get new data with redistributed variance is to use the orthogonal eigenvector matrix. We just reconfirmed it for davis’ example.

Instead of actually kicking one of the dragons awake, let’s try counting its scales while it sleeps. If the R-mode scores S^R are some form of data (and they are), what are their variances?

my S^R is

\left(\begin{array}{llll} -67.3498&0.&0.&0.\\ 22.4499&-9.79796&0.&0.\\ 0.&4.89898&0.&0.\\ 44.8999&4.89898&0.&0.\end{array}\right)

the variances of the first two columns are

{2352.,\ 48.}

yikes! if nothing else, that should convince you that the R-mode scores are weird.Life is so much simpler if we use the orthogonal eigenvector matrix. The new data is trivial to compute (X v), and we can see that it has redistributed the variance of the original data.Trying to use any weighted eigenvector matrix is dicey if there is a zero eigenvalue (because our transition matrix is not invertible); and confusing if we used X^T\ X instead of the covariance matrix (because they have different eigenvalues).i will eventually calculate the data wrt both of the weighted eigenvector matrices. The dragons will be awake and aloft and pretty to look at.

PCA / FA example 4: davis. Davis & harman 1.

Let us now recall what we saw harman do. Does he have anything to add to what we saw davis do? if we’re building a toolbox for PCA / FA, we need to see if harman gave us any additional tools.

here’s what harman did or got us to do:

  • a plot of old variables in terms of the first two new ones
  • redistribute the variance of the original variables
  • his model, written Z = A F, was (old data, variables in rows) = (some eigenvector matrix) x (new data)

We should remind ourselves, however, that harman used an eigendecomposition of the correlation matrix; and the eigenvectors of the correlation matrix are not linearly related to the eigenvectors of the covariance matrix (or, equivalently, of X^T\ X.) having said that, i’m going to go ahead and apply harman’s tools to the davis results. i will work from the correlation matrix when we get to jolliffe.

What exactly was that graph of harman’s? he took the first two columns of the weighted eigenvector matrix, then plotted each row as a point. He had 5 original variables. His plot

shows that the old 1st and 3rd variables, for example, are defined very similarly in terms of the first two new variables. let me emphasize that this is a description of defining variables, not a display of the resulting data. Simple counting can help with the distinction, e.g. 3 variables but 4 observations.

from davis we only have 3 variables to start, and our weighted eigenvector matrix has only two nonzero columns; and it’s one form of what we called A^R. we have to be careful, because there are sign differences among the various forms of A^R; not because the definitions are different, but because eigenvectors are not unique. In particular, my u and v from the SVD have some different signs from davis’ U and V from eigendecompositions.

most of the time, i will use my SVD results, but right now let’s compare davis with harman. Davis had A^R as…

\left(\begin{array}{ll} 7.48331&0.\\ -3.74166&2.44949\\ -3.74166&-2.44949\end{array}\right)

Davis showed us a plot of these numbers viewed as three points.

So the R-mode loadings A^R tell us how the 3 old variables are described by the new ones; we see that a plot of A^R corresponds to harman’s plot.

i want to emphasize that even the orthogonal eigenvector matrix shows us that the data is really only 2D. Don’t think that we had to scale the eigenvectors in order to get 2D.

my orthogonal eigenvector matrix v is

\left(\begin{array}{lll} 0.816497&0&0.57735\\ -0.408248&-0.707107&0.57735\\ -0.408248&0.707107&0.57735\end{array}\right)

but wrt that basis, the new data is A^Q:

\left(\begin{array}{lll} -7.34847&0.&0.\\ 2.44949&-2.82843&0.\\ 0.&1.41421&0.\\ 4.89898&1.41421&0.\end{array}\right)

i.e. the new data is described by the first two eigenvectors, hence it lies in a plane. but then so does the original data, because the transformation between them is linear (a plane cannot be mapped to 3D, although it could be mapped to a line). we could plot the data in 3D and rotate the point of view until we see the plane edge on.

Do davis and harman’s models resemble each other?

Let’s recall the SVD form of the A’s and S’s.

X = u\ w\ v^T

A^R = v\ w^T.

A^Q = u\ w.

S^R = X\ A^R = u\ w\ w^T= A^Q\ w^T.

S^Q = X^T\ A^Q = v\ w^T\ w = A^R\ w.

Then

X = u\ w\ v^T = A^Q\ v^T = u\ AR^T.

in particular,

X\ v = A^Q and then X^T = Z = v\ AQ^T.

Z = v\ AQ^T is harman’s model (Z = A F, where Z = X^T, A is a weighted eigenvector matrix, F is the new data), while A^Q = X\ v is jolliffe’s (Z = X A, where Z is the new data, X is the original data, and A is the orthogonal eigenvector matrix). i think it would be unfair to say that davis is using either of those models, especially since harman and jolliffe (usually) work from the correlation matrix.

i would summarize davis with all four A’s and S’s.

i can’t, however, say too many times that A^R and A^Q can be constructed using u or v:

A^R = X^T\ u = v\ w^T,

and

A^Q = X\ v = u\ w.

i see no grounds for associating A^R with just u or just v. and no grounds for associating either u or v primarily with R-mode or Q-mode; they’re equally important to the SVD.

i would probably focus on v and A^Q: one defines new variables, the other gives me the associated new data.

PCA / FA example 1: harman. discussion 2

recall that we had discovered that harman’s new F variables were uncorrelated and standardized: they had mean 0 and variance 1. he had implied, however, that the variances of the new variables would be the eigenvalues of the correlation matrix of Z.
 
he didn’t get that. can we?
 
instead of his model

Z = A F 

(where Z is the standardized data, A is a weighted eigenvector matrix, and F is the transformed, new, data), which implies 

F^T = X \ A^{-T}

we take

Z = P Y,

(where P is an orthogonal eigenvector matrix, and Y is the transformed, new, data), i.e.

Z^T = Y^T \ P^T

hence

Y^T = Z^T \ P^{-T} = X P.

(in words, we fix his messy A^{-T} by using an orthogonal eigenvector matrix P, for which P^{-T}= P.) here’s the transpose of the Y matrix:

Y^T = \left(\begin{array}{ccccc} -1.64367&0.955194&0.489728&0.0115288&-0.173057\\ 2.25697&1.0171&-0.14812&-0.592152&-0.0381633\\ 2.49521&-0.120329&0.518183&-0.146651&0.025367\\ -0.779124&1.73058&-0.377206&0.291712&-0.0255523\\ -0.564461&1.55725&-0.0522449&0.460059&-0.00751884\\ 1.17305&-1.54174&0.662974&0.40844&-0.000126565\\ 1.73452&1.57553&-0.172833&0.0361866&0.0856931\\ -0.0974502&-1.14587&-0.682363&0.0811312&0.073895\\ -1.32993&-0.75909&-0.320628&-0.110526&0.223974\\ -3.18609&0.0790644&0.54889&-0.437424&0.0779839\\ 0.384854&-1.7828&0.113252&0.160042&0.0140288\\ -0.443879&-1.56488&-0.579631&-0.162345&-0.256524\end{array}\right)

and the variances of the Y variables?

{2.87331,\ 1.79666,\ 0.214837,\ 0.0999341,\ 0.0152554}

ta da! those are so close to the eigenvalues…

{2.87331,\ 1.79666,\ 0.214832,\ 0.0999379,\ 0.0152551}

that we all are quite sure they are equal in principle. right?
 
for the record, here is the covariance matrix of Y:

\left(\begin{array}{ccccc} 2.87331&0&0&0&0\\ 0&1.79666&0&0&0\\ 0&0&0.214837&0&0\\ 0&0&0&0.0999341&0\\ 0&0&0&0&0.0152554\end{array}\right)

like the Fs, the Ys are uncorrelated; unlike the Fs, the Ys have different variances, and in fact, the first Y variable, Y_1, has the largest possible variance among linear combinations of the Xs, subject to the constraint that the total variance of the five variables is constant (= 5.) and so on: it’s no accident that Y_2 has the second largest variance.

we have somethig new here, but it does not come from the F variables in harman’s model. using an orthogonal eigenvector matrix as a change-of-basis, we have redistributed the variances of the Z data. harman did not accomplish this with the weighted eigenvector matrix.

it gets worse. if we were to recreate that wonderful picture using the first two coumns of the orthogonal eigenvector matrix -P… 

\left(\begin{array}{cc} 0.342731&0.601629\\ 0.452506&-0.406417\\ 0.396696&0.541664\\ 0.550056&-0.077817\\ 0.466738&-0.416428\end{array}\right)

(yes, it is convenient to change the sign of P; we did it to A, too.) we plot those pairs of numbers and get the left-hand image; the right-hand image is the previous plot…


we see exactly the same implications (2 & 5, 1 & 3 together, 4 alone), but the scales are different. maybe we should have stayed with P all along. 

next? let’s see how jolliffe does PCA in “principal component analysis”.

PCA / FA Example 1: Harman. discussion 1.



harman gave us two outputs as a result of his analysis: a table and a picture.

let’s consider the picture first. it clearly shows z2 and z5 similar, z1 and z3 similar, and z4 roughly in the middle between them.
the 5 z variables in terms of F1 and F2 

i don’t know about you, but this surprised me. should it have? well, let’s take a look at the correlation matrix, from which we got our results. i’m going to round it off just a little bit, so i can refer to 3-digit numbers instead of 5.

\left(\begin{array}{ccccc} 1.&0.01&0.972&0.439&0.022\\ 0.01&1.&0.154&0.691&0.863\\ 0.972&0.154&1.&0.515&0.122\\ 0.439&0.691&0.515&1.&0.778\\ 0.022&0.863&0.122&0.778&1.\end{array}\right)

we confirm that there is a high correlation (.972) between z1 and z3, and and one almost as large (.863) between z2 and z5. z4 is about equally related to z2 (.691) and to z5 (.778). the same information as the picture seems to be there, but i didn’t see it.

let’s learn from that blindness. how might we have expected it? this time, round off the correlation matrix severely (to the nearest .5).

\left(\begin{array}{ccccc} 1.&0&1.&0.5&0 \\ 0&1.&0&0.5&1. \\ 1.&0&1.&0.5&0 \\ 0.5&0.5&0.5&1.&1. \\ 0&1.&0&1.&1.\end{array}\right)

there we are: on the off-diagonal, we have 1s for z1 & z3, z2 & z5 (good), and also for z4 & z5 (not so good). at the other extreme, we have 0s for z1 & z2 and z1 & z5, so z2 & z5 are far from z1. finally, we have either .5 or 1 for z4 vs everything else. 

there was nothing sacred about .5. if we round to the nearest .25, instead, we get … 

\left(\begin{array}{ccccc} 1.&0&1.&0.5&0\\ 0&1.&0.25&0.75&0.75\\ 1.&0.25&1.&0.5&0\\ 0.5&0.75&0.5&1.&0.75\\ 0&0.75&0&0.75&1.\end{array}\right)

we see similar relationships, but now the only off-diagonal 1 is between z1 & z3; z2 seems to be equally related to z4 & z5. but z4 is still the only variable related to everything else. bear in mind that we – i at least – still think the third variable, P3 (cf. F3) might be important.

conclusion: the picture we got showed us some strong relationships that are not quite so clearly seen by playing with the correlation matrix. 

i find it reassuring that the relationships can be somewhat seen in the correlation matrix; i begin to trust that the relationships were not produced by our computations.

conclusion: severe rounding can be informative.

before we leave the picture for his table, let’s go to 3D. we take the first three columns of the weighted eigenvector matrix A…

\left(\begin{array}{ccc} 0.580958&0.806421&0.0275932\\ 0.767036&-0.544759&0.319267\\ 0.672433&0.726044&0.114925\\ 0.932392&-0.104306&-0.307804\\ 0.79116&-0.558179&-0.0647204\end{array}\right)

… and we plot them as points in 3D:
 
harman2.png 

that’s disappointing: z1 & z3 are still close, but z2 & z5 are not. ok, we have something to look at down the road. what we’re nibbling on the edges of is reduction of dimensionality: can we replace 5 variables by fewer? we are not done with this, not by a long shot.

open: investigate reduction of dimensionality.

now let’s go look at his table. recall it:

\left(\begin{array}{ccccccc} Variable&P1&P2&P3&P4&P5&Variance\\ 1&0.581&0.8064&0.0276&-0.0645&-0.0852&1.\\ 2&0.767&-0.5448&0.3193&0.1118&-0.0216&1.0002\\ 3&0.6724&0.726&0.1149&-0.0073&0.0862&0.9999 \\ 4&0.9324&-0.1043&-0.3078&0.1582&0&1.\\ 5&0.7912&-0.5582&-0.0647&-0.2413&0.0102&0.9999\\ Variance&2.8733&1.7967&0.2148&0.0999&0.0153&5.\\Percent&57.5&35.9&4.3&2.&0.3&100.\end{array}\right)

there are a few things which i find confusing at best, misleading at worst. (in fact, i think the table is inconsistent.) 

i do not count his use of P_i instead of F_i, although you might. maybe i should not have used his notation. he used P_i instead of F_i simply because his is a principal components example, but he will later do a factor analysis of the same data. eventually, he will obtain some other coefficients which he will multiply by F_i

this is just like my distinquishing the orthogonal eigenvector matrix P from the weighted eigenvector matrix A. you haven’t seen me use P since we got A, but i want both available; we won’t see him compute the other coefficients, but he wanted both labels available. i do plan to discuss his factor analysis; unfortunately, the example i need to use is not based on this data.

the first thing i object to is the column headings. whether P_i or F_i, they are misleading. the column headed P1 contains the first column of the weighted eigenvector matrix A. P1 (cf. F1) should denote the first column of the F matrix. instead, he is using the column heading as a reminder that in the equation…

z_2 = .767 P_1 - .545 P_2 + .319 P_3 + .112 P_4 - .022 P_5

we will multiply, for example, the 3rd column number .319 by P3. numerically, that just says that the second row of Z is the matrix product of the second row of A with (each column F1, F2, … of) the F matrix. his headings really stand for the columns of F. 

his fundamental model is

Z = A \ F

and for the second row of Z, that equation would be written with F_i instead of P_i:

z_2 = .767 F_1 - .545 F_2 + .319 F_3 + .112 F_4 - .022 F_5

as i said, maybe i made a mstake in switching to his notation; i’ll keep it in mind for the future. and no, he did not write that model as Z = A P; although he changed F_i to P_i, he did not change the matrix F to P. good thing, or i’d have had to use a symbol different from P for the orthogonal eigenvector matrix.

the second thing i object to is both of his “variance” labels. in fact, the numbers in the second-to-last row are the squared lengths of the columns of the weighted eigenvector matrix A. and the numbers in the last column are the squared lengths of the rows of the weighted eigenvector matrix A. 

i shouldn’t be hasty. maybe those squared lengths are the variances. 

ok, let’s compute F. hmm. we need Z. harman never computed Z. as i said, what he wanted was that drawing showing the relationship between the Z variables and the F varables.

ok, let’s figure out what Z might be. we can certainly construct a Z matrix with variances equal to 1, which is what he says they are. 

(the only reason i found this confusing is that for his theoretical work he has taken Z to be “small standard deviates” – where you divide each datum by\sqrt{N-1} = \sqrt{11} – instead of “standardized”.) 

in order for the variances of the Z variables to be 1, all we need to do is standardize the data. we recall our data matrix D…

D = \left(\begin{array}{ccccc} 5700&12.8&2500&270&25000\\ 1000&10.9&600&10&10000\\ 3400&8.8&1000&10&9000\\ 3800&13.6&1700&140&25000\\ 4000&12.8&1600&140&25000\\ 8200&8.3&2600&60&12000\\ 1200&11.4&400&10&16000\\ 9100&11.5&3300&60&14000\\ 9900&12.5&3400&180&18000\\ 9600&13.7&3600&390&25000\\ 9600&9.6&3300&80&12000\\ 9400&11.4&4000&100&13000\end{array}\right)

we standardize D and call it X:

X = \left(\begin{array}{ccccc} -0.157462&0.760313&0.134277&1.29792&1.25637\\ -1.52374&-0.303192&-1.39649&-0.964376&-1.09933\\ -0.826067&-1.47865&-1.07422&-0.964376&-1.25637\\ -0.709788&1.2081&-0.510254&0.166772&1.25637\\ -0.651648&0.760313&-0.590821&0.166772&1.25637\\ 0.569284&-1.75852&0.214844&-0.529319&-0.785234\\ -1.4656&-0.0233225&-1.55762&-0.964376&-0.157047\\ 0.830912&0.0326515&0.778809&-0.529319&-0.47114\\ 1.06347&0.592391&0.859375&0.514817&0.157047\\ 0.976261&1.26408&1.02051&2.34206&1.25637\\ 0.976261&-1.03085&0.778809&-0.355296&-0.785234\\ 0.918122&-0.0233225&1.34277&-0.181274&-0.628187\end{array}\right)

i have changed the name; this let’s me distinquish the data matrix D from a design matrix X. further, we will let 

Z = X^T

that may seem like a silly step to you, but i have enough trouble sorting out what’s going on without having to translate things in my head: i really want both X and Z. oh, why do i want both? because Z in harman’s model has variables in rows, not in columns. back to our model:

Z = A \ F

A^{-1}\ Z = F.

(BTW, A inverse exists because none of the eigenvalues were zero. A is an eigenvector matrix weighted by the (square roots of the) eigenvalues. it came from an orthogonal matrix P of unit eigenvectors. P is trivial to invert; just take the transpose: P^T = P^{-1}. A will be invertible so long as the weights are nonzero, because the corresponding column of A will be zero if and only if a weight is zero.)

next, bizarre as you may think me, i’m going to take the transpose of the last equation. and i’m going to abuse notation (so i’ve been told), writing A^{-T} for the inverse transpose of A…

Z^T \ A^{-T} = F^T

but that’s

X \ A^{-T} = F^T

why am computing F^T, the transpose of F? for one thing, both Z and F have 12 columns and don’t fit in the page! for another, Mathematica wants variables in columns: its Mean, Variance, Standardize, Covariance, and Correlation commands each want F^T, not F. (did you think i was computing the correlation matrix step-by-step?)

here is F^T= X \ A^{-T} =

\left(\begin{array}{ccccc} 0.96967&-0.712621&-1.05659&-0.0364685&1.40114\\ -1.33148&-0.758802&0.319568&1.87313&0.308985\\ -1.47203&0.0897711&-1.11798&0.463894&-0.205381\\ 0.459637&-1.29109&0.813821&-0.922761&0.206882\\ 0.332999&-1.16178&0.112718&-1.45529&0.0608755\\ -0.692033&1.15022&-1.43036&-1.292&0.00102472\\ -1.02327&-1.17542&0.372886&-0.114468&-0.693806\\ 0.0574899&0.854874&1.4722&-0.256639&-0.598284\\ 0.784582&0.566317&0.691755&0.349624&-1.81338\\ 1.8796&-0.0589858&-1.18423&1.38369&-0.631389\\ -0.227041&1.33005&-0.24434&-0.506254&-0.113582\\ 0.261862&1.16748&1.25055&0.513541&2.07692\end{array}\right)

ok, like the Z variables, the F variables have zero means:

{2.0 x 10^{-16},\ 2.0 x 10^{-16},\ 1.0 x 10^{-15},\ 9.0 x 10^{-16},\ 1.1 x 10^{-15}}

that is to say,

{0,\ 0,\ 0,\ 0,\ 0}

and, like the Z variables, the F variables have unit variances:

{1.,\ 0.999999,\ 1.00002,\ 0.999961,\ 1.00002}

oops. this is not ok. the table is wrong: harman said the variances were the eigenvalues. (and if you happen to know, vaguely or precisely, that PCA was supposed give us new variabes that redistribute the variance, you were really, really expecting to see the eigenvalues. welcome to the club.) 

what’s going on? it’s time to do a little algebra. 

i need one key relationship. if we have a matrix M with variables in columns, with N rows, and with each variable having zero mean (called centered data), then the covariance matrix of M is given by

c = \frac{1}{N-1}\ M^T M.

further, if the variance of each variable is 1, then the covariance matrix c is also the correlation matrix r.

let’s lay out all the equations we have. 

one, X contains columns of standardized variables, so the correlation matrix r (= the covariance matrix) of X (or of Z) is given by

r = \frac{1}{N-1} X^T X = \frac{1}{N-1} Z Z^T \ or\ Z Z^T = (N-1) r.

two, the covariance (not necessarily correlation) matrix of F is given by

c = \frac{1}{N-1} F F^T.

(i’m going to take it for granted that the means of the F values are zero; after all, the Fs a linear combination of Xs, which do have zero mean.)

three, the eigendecomposition was

\Lambda = P^{-1}\ r \ P = P^T\ r\ P, \ or \ r = P \Lambda P^{-1}

with P orthogonal and \Lambda diagonal.

four, the weighted eigenvector matrix A is

A = P \sqrt{\Lambda } = P \Lambda ^{1/2} \ or\ A^{-1} = \Lambda ^{-1/2} P^{-1}

where \sqrt{\Lambda } = \Lambda ^{1/2} denotes a diagonal matrix whose entries are square roots of the entries of \Lambda. (all of the eigenvalues are non-negative in general; and in particular, my changing all the signs of columns of A amounts to choosing the negative square roots! eigenvectors are not unique.)

perhaps i should remind you that multiplying diagonal matrices is almost trivial: we multiply corresponding diagonal elements. that means, for example, that \Lambda ^{-1/2} and \Lambda ^{1/2} have diagonal elements which are the inverses of each other. that’s also why it makes sense to take square roots of the elements of \Lambda and call the result \Lambda ^{1/2}.

finally, five, the fundamental model is

Z = A \ F, \ or \ F = A^{-1} Z

assuming that A is invertible (i.e. assuming all eigenvalues of r were positive, not just non-negative).

let’s compute the covariance (not correlation) matrix c of F. we have

c = \frac{1}{N-1} F F^T.

first we substitute F = A^{-1} Z and Z Z^T = (N-1) r :

c = \frac{1}{N-1} (A^{-1} Z) (A^{-1}Z)^T = \frac{1}{N-1}A^{-1}Z Z^T A^{-T} = \frac{N-1}{N-1}A^{-1}\ r \ A^{-T} = A^{-1}\ r \ A^{-T};

then we substitute A^{-1} = \Lambda ^{-1/2} P^{-1}:

c = (\Lambda ^{-1/2} P^{-1})\ r \ {(\Lambda ^{-1/2} P^{-1})}^T = \Lambda ^{-1/2} P^{-1} r \ P^{-T} \Lambda ^{-1/2}= \Lambda ^{-1/2} P^{-1} r \ P \Lambda ^{-1/2};
 
then we substitute r = P \Lambda P^{-1}:

c = \Lambda ^{-1/2} P^{-1} (P \Lambda P^{-1}) P \Lambda ^{-1/2} = \Lambda ^{-1/2} (P^{-1}P) \Lambda (P^{-1} P) \Lambda ^{-1/2} = \Lambda ^{-1/2} \Lambda \Lambda ^{-1/2} = I.

whoa! an identity matrix? boy, do we need to compute the covariance matrix of F for this data!

here it is, truncated so i don’t have to reformat a bunch of numbers of the form a x 10^-b:

c = \left(\begin{array}{ccccc} 1.&0&0&0&0\\ 0&0.999999&0&0&0\\ 0&0&1.00002&0&0\\ 0&0&0&0.999961&0\\ 0&0&0&0&1.00002\end{array}\right)

an identity matrix, conceptually if not numerically. yes, the algebra was right. not only are the F variables of unit variance, they are completely uncorrelated with each other. yessssss. now i remember seeing that statement somewhere.

conclusion: the F variables are uncorrelated variables of unit variance.

conclusion: the eigenvalues are not the variances of the F variables. not for Z = A F, with A a weighted eigenvector matrix. tsk, tsk.

conclusion: harman’s table can be fixed most simply by changing the first-column label “Variance” to “Eigenvalue”.

nevertheless, his “variance” label suggests that we should be able to construct combinations of the Zs which have the eigenvalues for variances.

open: how do we find new variables whose variance is the eigenvalues? 

but we’ve done enough for one post.

ok, ok, i won’t quite leave you hanging. what about using the orthogonal eigenvector matrix P instead of the weighted eigenvector matrix A?

PCA / FA Example 1: Harman. what he did.

Less is more. And more is huge. It is easy for me to end up with huge posts to put out here, but I’d rather go with smaller.

Let’s get started with PCA / FA, principal components analysis and factor analysis.

In case it matters, I am using Mathematica to do these computations.

Here is an example, the first of several. This comes from Harman’s “factor analysis”. In order to emphasize the distinction between PCA and FA, he has one example of principal component analysis, and this is it.

Let me tell you up front what he did:

  • Get some data;
  • Compute its correlation matrix;
  • Find the eigenstructure of the correlation matrix;
  • Weight each eigenvector by the square root of its eigenvalue;
  • Tabulate the results;
  • Plot the original variables in the space of the two largest principal components.

I also need to say that his conceptual model is written

Z = A F,

And from the dimensions of the matrices it is clear that

A is square, k by k

Z and F are the same shape, with k rows.

We infer from its size that A will be derived from the eigenvector matrix, and that Z is derived from the given data matrix. From the shapes, we conclude that Z has observations in columns, rather than in rows. (If you’re used to econometrics or regression, you expect the transpose, observations in rows.)

But this is a fine thing, because we recognize that Z = A F is a change-of-basis equation for corresponding columns of Z and F; A is a transition matrix mapping new components (any one column of F) to old components (a column of Z).

The following data comes from Harman, p. 14. We would customarily say it has 5 variables (k = 5) and 12 observations (n=12); to be precise, we mean it has 12 observations per variable, since the total number of data points is 60. I have chosen to use regression notation, and denote the number of variables by k.

Here’s the data.

D = \left(\begin{array}{ccccc} 5700&12.8&2500&270&25000\\ 1000&10.9&600&10&10000\\ 3400&8.8&1000&10&9000\\ 3800&13.6&1700&140&25000\\ 4000&12.8&1600&140&25000\\ 8200&8.3&2600&60&12000\\ 1200&11.4&400&10&16000\\ 9100&11.5&3300&60&14000\\ 9900&12.5&3400&180&18000\\ 9600&13.7&3600&390&25000\\ 9600&9.6&3300&80&12000\\ 9400&11.4&4000&100&13000\end{array}\right)

I have called it D, for data matrix; and I have displayed it with observations in rows. This is what I’m used to from regression analysis. This is also how Harman displayed it. The point of the discussion about Z is that Z is a transposed matrix, relative to D: D has variables in columns, Z has variables in rows. In addition, if Harman were to to compute Z – which he does not – it would almost certainly be standardized data. (That is, subtract the mean of each variable, and then divide each by its sample standard deviation.)

He printed means and standard deviations. I checked his printed means…

{6241.7,\ 11.4,\ 2333.3,\ 120.8,\ 17000.}

and, more importanty, his standard deviations…

{3440.,\ 1.8,\ 1241.2,\ 114.9,\ 6367.5}

Our numbers agree: despite using N in his discussion, he correctly used N-1 in the computation of the sample variance. I would have been stunned to disagree with his means, but he led me to expect different computed standard deviations. I am pleasantly surprised.

We compute the correlation matrix r:

r = \left(\begin{array}{ccccc} 1.&0.00975&0.97245&0.43887&0.02241 \\ 0.00975&1.&0.15428&0.69141&0.86307 \\ 0.97245&0.15428&1.&0.51472&0.12193 \\ 0.43887&0.69141&0.51472&1.&0.77765 \\ 0.02241&0.86307&0.12193&0.77765&1.\end{array}\right) 
 
This also agrees with his tabulated correlation matrix on p. 14. (I have rounded the display to match his.)

Now (jumping from his p. 14 to p. 135) we get the eigenstructure of the correlation matrix r. Here is an orthogonal eigenvector matrix:


P = \left(\begin{array}{ccccc} -0.342731&-0.601629&-0.0595321&0.204003&0.689505\\ -0.452506&0.406417&-0.688816&-0.353592&0.174837\\ -0.396696&-0.541664&-0.247949&0.0229598&-0.698016\\ -0.550056&0.077817&0.664087&-0.500371&-0.000134573\\ -0.466738&0.416428&0.139634&0.763189&-0.0823919\end{array}\right)

(I remind you that eigenvectors – even orthonormal ones – are not unique; as it happens, every one I got is the negative of his. that doesn’t matter, but for subsequent computations, I changed the sign of my matrix.)

(We have “diagonalized” the correlation matrix r. That is, we have computed a matrix P whose columns are unit eigenvectors, and  a diagonal matrix \Lambda whose diagonal elements \lambda are the eigenvalues, such that \Lambda = P^{-1}\ r\ P.)

We also agree on the eigenvalues (which are unique):

 \lambda = {2.8733,\ 1.7967,\ 0.2148,\ 0.0999,\ 0.0153}

Now he weights each eigenvector by the square root of its eigenvalue; e.g., the first eigenvector will have length \sqrt{2.8733} while the fifth will have length \sqrt{0.0153}.

We end up with the following (rounded) matrix of weighted eigenvectors (where I have also multiplied all of mine by -1):

A = \left(\begin{array}{ccccc} 0.581&0.8064&0.0276&-0.0645&-0.0852\\ 0.767&-0.5448&0.3193&0.1118&-0.0216\\ 0.6724&0.726&0.1149&-0.0073&0.0862\\ 0.9324&-0.1043&-0.3078&0.1582&0\\ 0.7912&-0.5582&-0.0647&-0.2413&0.0102\end{array}\right)

He presents the following summary table, whose center is precisely our weighted eigenvectors:

\left(\begin{array}{ccccccc} Variable&P1&P2&P3&P4&P5&Variance\\ 1&0.581&0.8064&0.0276&-0.0645&-0.0852&1.\\ 2&0.767&-0.5448&0.3193&0.1118&-0.0216&1.0002\\ 3&0.6724&0.726&0.1149&-0.0073&0.0862&0.9999 \\ 4&0.9324&-0.1043&-0.3078&0.1582&0&1.\\ 5&0.7912&-0.5582&-0.0647&-0.2413&0.0102&0.9999\\ Variance&2.8733&1.7967&0.2148&0.0999&0.0153&5.\\Percent&57.5&35.9&4.3&2.&0.3&100.\end{array}\right)

I’ll have a lot to say about that table, but not yet. Oh, I should point out that the row labeled “variance” consists of the five eigenvalues and their sum (yes, 5.).

There’s one more thing he did. Back on p. 16, as a preview of things to come, he wrote

z_2 = .767 P_1 - .545 P_2 + .319 P_3 + .112 P_4 - .022 P_5

(where he used P_i rather than F_i to emphasize that these were principal components from PCA rather than factors from FA; you should read them as F_i, but I don’t want to misrepresent exactly what he wrote).

But that is precisely the equation Z = A F written for the second row of the Z matrix: multiply the second row of A by each of the columns of F. I am happy to see this. It explains why people never display Z, and never compute F: they just want to describe the old variable names in terms of the new variable names.

Let us select all the coefficients of P1 and P2, not because only two new variables are important, but because it’s easy to plot things in 2D. By weighting the eigenvectors by \sqrt{\lambda}, he has emphasized the first eigenvector over the second, and the second over the third, etc. So let’s see what the first two tell us.

Here are the first two columns of the weighted eigenvector matrix A, i.e. the first two weighted eigenvectors. 

 \left(\begin{array}{cc} 0.580958&0.806421\\ 0.767036&-0.544759\\ 0.672433&0.726044\\ 0.932392&-0.104306\\ 0.79116&-0.558179\end{array}\right)

Let’s plot those pairs of points. Recall
 
z_2 = .767 P_1 - .545 P_2 + .319 P_3 + .112 P_4 - .022 P_5
 
but retain only the first two terms: 
 
z_2 = .767 P_1 - .545 P_2
 
Now plot the point (.767, -.545) and label it “2″. Do the same for the other four equations for z1 and z3 thru z5. (Please don’t be offended at my being explicit: it took me forever to get my mind off the data itself and to grok this relationship between variable names.)
the 5 z variables in terms of F1 and F2
 
Verily, points 1 and 3 (i.e. variables z1 and z3) are similar, and variables z2 and z5 are even more similar, and variable z4 is about equally different from the two clusters, but closer to 2-5.

Let me summarize Harman’s analysis to this point. For this example, he has:

  • given us data: 5 variables, 12 observations;
  • computed the eigenstructure of the correlation matrix;
  • Weighted each eigenvector by the square root of its eigenvalue;
  • Tabulated the results;
  • Plotted the original variables in the space of the first two principal components.

What has he not done?

  • He did not actually compute Z or F;
  • He did not explain why his table looks the way it does;
  • He did not explain the graph he drew;
  • I wonder if the fact that z4 is different from the two clusters suggests that there really ought to be three new variables P1, P2, P3 instead of just two;
  • Maybe we should do a 3D graph.

Enough for now. Next, I’ll talk about what he did.