## PCA / FA, Example 4: Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

$X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)$

It would be prudent to compute the means…

$\{10,\ 24,\ 15\}$

and variances of the data…

$\{18.6667,\ 6.66667,\ 6.66667\}$

It would also be prudent to calculate the row sums.

$\{49,\ 49,\ 49,\ 49\}$

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

$X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)$

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.
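For readers who want to follow along numerically, here is a minimal sketch of these preliminary computations in Python with NumPy (my choice of tool here, not necessarily the one used originally; the variable names are mine):

```python
import numpy as np

# Davis' raw data: rows are observations, columns are variables
Xr = np.array([[ 4, 27, 18],
               [12, 25, 12],
               [10, 23, 16],
               [14, 21, 14]], dtype=float)

means = Xr.mean(axis=0)             # column means: {10, 24, 15}
variances = Xr.var(axis=0, ddof=1)  # sample variances: {18.6667, 6.66667, 6.66667}
row_sums = Xr.sum(axis=1)           # constant row sums: {49, 49, 49, 49}

# Center each column by subtracting its mean; the row sums of X are then zero
X = Xr - means
```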

Next, Davis chooses to analyze $X^T X$ instead of the covariance matrix $\frac{X^T X}{N-1}$. We recall that these two have the same eigenvectors, but that their eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of $X^T X$, since the eigenvalues of $X^T X$ are exactly the squares of the singular values of X.

Here is the SVD of X, $X = u\ w\ v^T$. (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

$u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)$

$v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)$

$w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)$

Sure enough, there are only two nonzero singular values in the w matrix.
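A quick numerical check of that SVD, sketched in NumPy. (Bear in mind that a library's SVD may flip the signs of paired columns of u and v relative to the matrices shown above; the defining identities are unaffected.)

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)

# Full SVD: X = u w v^T, with w a 4x3 rectangular "diagonal" matrix
u, s, vt = np.linalg.svd(X, full_matrices=True)
w = np.zeros_like(X)
np.fill_diagonal(w, s)
v = vt.T

# Only two nonzero singular values: sqrt(84) and sqrt(12)
```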

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

$A^R = v\ w^T$

$S^R = X A^R$ ($= u\ w\ w^T$, but use the definition for computation: one less matrix product.)

$A^Q = u\ w$ ($= X v$)

$S^Q = X^T A^Q$ ($= v\ w^T w$)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

$A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)$

$A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)$

$S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)$

$S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)$

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data, in the sense that the matrix product of corresponding scores S and loadings A is not X.
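Davis’ four matrices can be sketched directly from the SVD (NumPy again; column signs may differ from the displays above, but the defining identities hold):

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
u, s, vt = np.linalg.svd(X, full_matrices=True)
w = np.zeros_like(X)
np.fill_diagonal(w, s)
v = vt.T

AR = v @ w.T    # R-mode loadings
SR = X @ AR     # R-mode scores (= u w w^T)
AQ = u @ w      # Q-mode loadings (= X v)
SQ = X.T @ AQ   # Q-mode scores (= v w^T w)
```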

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, $Y = A^Q$.)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of $c = X^T X$. Here it is:

$c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)$

Here are its eigenvalues:

$\lambda = \{84,\ 12,\ 0\}$

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

$V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)$

We can check that $V^T V = I$. (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

$v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)$

In fact, they agree completely.
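That agreement is easy to verify numerically; a sketch (NumPy's `eigh` returns eigenvalues in ascending order, so we re-sort to descending, and individual columns may still differ in sign):

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)

c = X.T @ X
eigvals, V = np.linalg.eigh(c)      # ascending order
order = np.argsort(eigvals)[::-1]   # re-sort to descending
eigvals, V = eigvals[order], V[:, order]

_, s, vt = np.linalg.svd(X)
v = vt.T
# columns of V and v agree up to sign; eigenvalues are squared singular values
```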

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

$Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)$

Now we should recall $A^Q$, to confirm the equivalence:

$A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)$

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

$\{28.,\ 4.,\ 0.\}$

and their sum is 32.

Let’s recall the variances of the columns of X…

$\{18.6667,\ 6.66667,\ 6.66667\}$

and their sum is also 32.
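That conservation of total variance is quick to confirm; a sketch:

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
_, _, vt = np.linalg.svd(X)
Y = X @ vt.T                     # new data Y = X v

var_X = X.var(axis=0, ddof=1)    # {18.6667, 6.66667, 6.66667}
var_Y = Y.var(axis=0, ddof=1)    # {28, 4, 0}: maximally redistributed
# both sets of variances sum to 32
```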

Now, what about Harman? He writes his model as

Z = A F,

where

$Z = X^T$,

(i.e. his variables are in rows, not in columns) and A is a $\sqrt{\text{eigenvalue}}$-weighted eigenvector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

$\{84,\ 12,\ 0\}$

we construct a diagonal matrix $\Lambda$ of their square roots:

$\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)$

We should expect that these are the nonzero singular values in w. Sure enough:

$w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)$

Note, however, that w is not square, hence is not equal to the diagonal matrix $\Lambda$. This will lead to an irrelevant difference between A and $A^R$, etc. In fact, I often ignore the distinction between $\Lambda$ and w conceptually; in practice, of course, whatever matrices we’re using have to be conformable (“the right size”) for whatever matrix products we compute.

We should recall that the nonzero eigenvalues are the nonzero values of both $w^T w$ and $w\ w^T$:

The eigenvalues were…

$\{84,\ 12,\ 0\}$

whereas

$w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)$

and

$w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)$

Resuming our thread, we compute the weighted eigenvector matrix A

$A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)$

and that should be Davis’ $A^R$; and it is, except for an extra (to my mind, irrelevant) column of zeros.

$A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)$

Viewing A as a transition matrix, we know that Harman’s scores are the new data $F^T$. We also know (HERE) that we may compute them as

$F^T = u$,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data $F^T$ should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

$F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)$

We check that by computing A F and comparing it to $Z = X^T$; we expect Z = A F. And we have it; here’s Z:

$Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)$

And here’s A F (using the two nonzero columns of A!):

$A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)$
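The whole Harman computation can be sketched in a few lines (NumPy once more; any sign flips in the columns of u and v cancel in the product A F):

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
u, s, vt = np.linalg.svd(X)
v = vt.T

Lam = np.diag(s)        # sqrt-eigenvalue weights, 3x3
A = v @ Lam             # Harman's weighted eigenvector matrix
Ft = u[:, :2]           # new data F^T: the first two columns of u

Z = A[:, :2] @ Ft.T     # cut A down to its two nonzero columns
# Z reproduces X^T exactly
```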

I remind us that this new data $F^T$ does not have redistributed variance; in fact, this new data has constant variance:

$\{0.333333,\ 0.333333\}$

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of $X^T X$ rather than of the covariance matrix $\frac{X^T X}{N-1}$. Alternatively, our new data was given by columns of u rather than by columns of $\sqrt{N-1}\ u$. (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data $F^T$ from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data $F^T$ using a reciprocal basis instead; that is, one which weights the v matrix by $\frac{1}{\sqrt{\text{eigenvalue}}}$ instead of $\sqrt{\text{eigenvalue}}$. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by $\sqrt{\text{eigenvalue}}$, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue (for finding a reciprocal basis) is to set the zero eigenvalue to 1. Instead of the diagonal matrix

$\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)$

I use

$\Lambda_1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1.\end{array}\right)$

and instead of $A = v \Lambda$, I compute the reciprocal basis $B = v \Lambda_1^{-1}$ (i.e. I divide each column by $\sqrt{\text{eigenvalue}}$ instead of multiplying by $\sqrt{\text{eigenvalue}}$).

$B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)$

Then the new data with respect to A can be found by projecting X onto B:

$X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)$

That had better be $F^T$… and it is:

$F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)$

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)
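Here is a sketch of the reciprocal-basis computation, with the zero singular value set to 1 before inverting:

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
u, s, vt = np.linalg.svd(X)
v = vt.T

s1 = np.where(s > 1e-10, s, 1.0)    # replace the zero singular value by 1
B = v / s1                          # reciprocal basis: each column of v divided by s1
F_via_B = X @ B                     # projecting X onto B recovers the new data
# the first two columns match u; the third is identically zero
```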

For this example, Davis did not compute the new data with respect to A (i.e. wrt his $A^R$). As I have said before, I have seen him do it in another example; but his scores $S^R$ are not the new data $F^T$, which is, as far as I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix $A = v \Lambda$ was also the dispersion matrix between the new data $F^T$ and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix $X^T X$.

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of $F^T$, so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with $X^T Z$, i.e. that I want to compute

$X^T Z = X^T F^T$

(finally reverting to my original notation). Fine, I compute that product, and I get

$X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)$

and I compare that to A:

$A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)$

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though $X^T X$ is not the covariance matrix.)
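That identity, too, is easy to check numerically; a sketch:

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
u, s, vt = np.linalg.svd(X)
A = vt.T @ np.diag(s)   # weighted eigenvector matrix A = v Lambda
Ft = u[:, :2]           # new data F^T

# the nonzero columns of A are the cross-dispersion between old and new data
cross = X.T @ Ft
```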

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

$D = v \Lambda^2$.

But… we’ve just seen that weight here. Okay, it wasn’t exactly $\Lambda^2$, but it was either $w\ w^T$ or $w^T w$. Well, it must have been $S^Q$, i.e. the product in which the weight multiplied v:

$S^Q = X^T A^Q$ ($= v\ w^T w$).

Of course, I should check that computationally. Here is $v \Lambda^2$:

$v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)$

and here is our $S^Q$:

$S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)$

Hang on. We need to explicitly see $X^T Y$, the cross-dispersion matrix analogous to $X^T X$: we compute it…

$X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)$

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores $S^Q$ are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings $A^Q$ are the new data Y with respect to the transition matrix v).

To put that another way… The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix $X^T X$, and defines new data Y; and $S^Q = X^T Y$ is the cross-dispersion between old data X and new data Y.
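That summary can be checked in one short sketch:

```python
import numpy as np

X = np.array([[-6,  3,  3],
              [ 2,  1, -3],
              [ 0, -1,  1],
              [ 4, -3, -1]], dtype=float)
u, s, vt = np.linalg.svd(X, full_matrices=True)
v = vt.T
w = np.zeros_like(X)
np.fill_diagonal(w, s)

Y = X @ v               # new data Y
SQ = X.T @ (u @ w)      # Davis' Q-mode scores S^Q = X^T A^Q
D = X.T @ Y             # cross-dispersion between old data X and new data Y
# S^Q, X^T Y, and v Lambda^2 are all the same matrix
```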

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores $S^Q$, are that dispersion matrix D.