PCA / FA. Example 4 ! Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)

It would be prudent to compute the means…

 \{10,\ 24,\ 15\}

and variances of the data…

 \{18.6667,\ 6.66667,\ 6.66667\}

It would also be prudent to calculate the row sums.

  \{49,\ 49,\ 49,\ 49\}

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

 X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.

Next, Davis chooses to analyze X^T X instead of the covariance matrix \frac{X^TX}{N-1}\ . We recall that these two have the same eigenvectors, but eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of X^T X\ .

Here is the SVD of X, X = u\ w\ v^T\ . (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)

v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Sure enough, there are only two nonzero singular values in the w matrix.

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

A^R = v\ w^T

S^R = X A^R (= u\ w\ w^T\ , but use the definition for computation: one less matrix product.)

A^Q = u\ w ( = X v)

S^Q = X^T A^Q (= v\ w^T\ w)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

 A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)

 S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data; in the sense that, the matrix product of corresponding scores S and loadings A is not X.

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, Y = A^Q\ .)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of c = X^T X\ . Here it is:

 c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)

Here are its eigenvalues:

 \lambda = \{84,\ 12,\ 0\}

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

We can check that V^T V = I\ . (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

 v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

In fact, they agree completely.

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)

Now we should recall A^R\ , to confirm the equivalence:

A^R = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

\{28.,\ 4.,\ 0.\}

and their sum is 32.

Let’s recall the variances of the columns of X…

\{18.6667,\ 6.66667,\ 6.66667\}

and their sum is also 32.

Now, what about Harman? He writes his model as

Z = A F,

where

Z = X^T\ ,

(i.e. his variables are in rows, not in columns) and A is a \sqrt{\text{eigenvalue}}-weighted eigenector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

\{84,\ 12,\ 0\}

we construct a diagonal matrix \Lambda of their square roots:

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

We should expect that these are the nonzero nonsingular values w. Sure enough:

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Note, however, that w is not square, hence is not equal to the diagonal matrix \Lambda\ . This will lead to an irrelevant difference between A and A^R\ , etc. In fact, I often ignore the distinction between \Lambda and w conceptually; in practice, of course, whatever matrices we’re using have to be “the right size” for whatever matrix products we compute (conformable).

We should recall that the nonzero eigenvalues are the nonzero values of both w^T w and w\ w^T\ :

The nonzero eigenvalues were…

\{84,\ 12,\ 0\}

whereas

w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)

and

w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)

Resuming our thread, we compute the weighted eigenvector matrix A

A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

and that should be Davis’ A^R\ ; and it is, except for an extra – to my mind, irrelevant – column of zeros.

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

Viewing A as a transition matrix, we know that Harman’s scores are the new data F^T\ . We also know (HERE) that we may compute them as

F^T= u\ ,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data F^T should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

We check that by computing A F and comparing it to Z = X^T\ ; we expect Z = AF. And we have it; here’s Z:

Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)

And here’s A F (using the two nonzero columns of A!):

A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)

I remind us that this new data F^T does not have redistributed variance; in fact, this new data has constant variance:

\{0.333333,\ 0.333333\}

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of X^T X rather than of the covariance matrix \frac {X^TX}{N-1}\ . Alternatively, our new data was given by columns of u rather than by columns of \sqrt{N-1}\ u\ . (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data F^T from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data F^T using a reciprocal basis instead; that is, one which weights the v matrix by \frac{1}{\sqrt{\text{eigenvalue}}} instead of \sqrt{\text{eigenvalue}}. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by \sqrt{\text{eigenvalue}}, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue – for finding a reciprocal basis – it is to set the zero eigenvalue to 1. Instead of the diagonal matrix

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

I use

\Lambda 1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1\end{array}\right)

and instead of A = v \Lambda\ , I compute the reciprocal basis B = v \Lambda1^{-1} (i.e. I divide each column by \sqrt{\text{eigenvalue}} instead of multiplying by \sqrt{\text{eigenvalue}}).

B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)

Then the new data with respect to A can be found by projecting X onto B:

X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)

That had better be F^T\ … and it is:

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)

For this example, Davis did not compute the new data with respect to A (i.e. wrt his A^R\ ). As I have said before, I have seen him do it in another example; but his scores S^R are not the new data F^T\ , which is, as far I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix A = v \Lambda was also the dispersion matrix between the new data F^T and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix X^T X\ .

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of F^T\ , so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with X^T Z\ , i.e. that I want to compute

X^T Z = X^T F^T

(finally reverting to my original notation). Fine, I compute that product, and I get

X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)

and I compare that to A:

A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though X^T X it is not the covariance matrix.)

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

D = v \Lambda^2\ .

But… we’ve just seen that weight here. Okay, it wasn’t \Lambda^2\ , but it was either w\ w^T or w^T w\ . Well, it must have been S^Q\ , i.e. the one that multiplied v:

S^Q = X^T A^Q (= v w^T w)\ .

Of course, I should check that computationally. Here is v \Lambda^2\ :

v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

and here is our S^Q\ :

S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Hang on. We need to explicitly see X^T Y\ , the cross-dispersion matrix analagous to X^T X\ : we compute it…

X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores S^Q are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings A^Q are the new data Y with respect to the transition matrix v).

To put that another way…. The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix X^T X\ , and defines new data Y; and S^Q = X^T Y is the cross-dispersion between old data X and new data Y.

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores S^Q\ are that dispersion matrix D.

PCA / FA example 4: davis. Summary.

Considering how much I learned from Davis’ little example, I almost hate to summarize it. The following just does not do justice to it. We’ll see if the passage of time – coming back to Davis with fresh eyes after doing the next example – improves the summary.

OTOH, an awful lot of what I got out of Davis came from bringing linear algebra to it, and just plain mucking about with the matrices. And I discovered, having picked up the two chemistry books since I drafted this post, that they made a lot more sense than when I first looked at them. I have learned a lot from working with Davis’ example.

Here’s what I have. My own strong preference, but I’d call it a personal choice, is to use the SVD of the data matrix X:

X = u\ w\ v^T

and compute the A’s and S’s (R-mode and Q-mode loadings and scores):

A^R = v\ w^T.

A^Q = u\ w.

S^R = X\ A^R = u\ w\ w^T = A^Q\ w^T.

S^Q = X^T\ A^Q = v\ w^T\ w = A^R\ w.

Those equations are worth having as a conceptual basis even if one chooses to do eigendecompositions of X^T\ X or X\ X^T instead of the SVD.

  • I would construct a reciprocal basis for A^R.
  • And, whenever I construct any basis, I would compute the data wrt that basis.
  • And I would probably compute the variances of any new forms of the data.

The Q-mode loadings A^Q are the new data wrt the orthogonal eigenvector matrix v (properly named, the columns of v are the right principal vectors of X, but i think of them as the eigenvectors of X^T\ X). The R-mode loadings A^R (cut down a little) can be thought of as a non-orthonormal basis, and the R-mode scores S^R are the new data wrt the reciprocal basis for A^R.

A lot of what I did with Davis’ example was look at alternative forms and the duality between R-mode and Q-mode. By using the SVD, I choose one particular form, and I have the duality.

Bear in mind that the SVD does not require that X be centered. But that’s a whole ‘nother story.

PCA / FA example 4: davis. Davis & Jolliffe.

What tools did jolliffe give us (that i could confirm)?

  1. Z = X A, A orthogonal, X old data variables in columns
  2. serious rounding
  3. plot data wrt first two PCs
  4. how many variables to keep?

but jolliffe would have used the correlation matrix. Before we do anything else, let’s get the correlation matrix for davis’ example. Recall the centered data X:

X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

i compute its correlation matrix:

\left(\begin{array}{lll} 1. & -0.83666 & -0.83666 \\ -0.83666 & 1. & 0.4 \\ -0.83666 & 0.4 & 1.\end{array}\right)

Now get an eigendecomposition of the correlation matrix; i called the eigenvalues \lambda and the eigenvector matrix A. Here’s A:

A = \left(\begin{array}{lll} -0.645497 & 0 & -0.763763 \\ 0.540062 & -0.707107 & -0.456435 \\ 0.540062 & 0.707107 & -0.456435\end{array}\right)

If we compute A^T\ A and get an identity matrix, then A is orthogonal. (it is.)

the eigenvalues are:

\lambda = \{2.4,\ 0.6,\ 0\}

That the third eigenvalue is zero would tell us, if we didn’t already know it from davis, that the data, X, is 2D.

now we can get (1) the new data Z, wrt the orthogonal eigenvector matrix A; it is given by the projection of X onto A (because A is orthogonal, i.e. the basis vectors are orthonormal):

Z = \left(\begin{array}{lll} 7.11335 & 0 & 1.84396 \\ -2.37112 & -2.82843 & -0.614654 \\ 0 & 1.41421 & 0 \\ -4.74224 & 1.41421 & -1.22931\end{array}\right)

Fascinating.

An understatement. There are 3 nonzero columns.

in sharp contrast to A^Q from the centered data X, this new data does not look 2D wrt the orthogonal eigenvector basis. It is still 2D, nothing changed about that, but it isn’t at all clear from the new components of the data. our 3 columns are not linearly independent.

i just gave us all a beautiful lesson: it can be very worthwhile to look at both covariance (or X^T\ X of centered data) and correlation.

BTW, since this is new data, get the variances…

\{26.2369,\ 4.,\ 1.76307\}

and the total is… 32. we remember that number from davis.

Sure enough, the orthogonal eigenvector matrix A has redistributed the total variance, which is 32 – but it has spread it out over 3 variables instead of 2.

Now we are ready to follow jolliffe. BTW, i will refer to things like “jolliffe’s Z” although jolliffe, of course, never worked out davis’ example! we could try (2) some serious rounding of everything in sight. The correlation matrix: original…

\left(\begin{array}{lll} 1. & -0.83666 & -0.83666 \\ -0.83666 & 1. & 0.4 \\ -0.83666 & 0.4 & 1.\end{array}\right)

rounded to the nearest .5…

\left(\begin{array}{lll} 1. & -1. & -1. \\ -1. & 1. & 0.5 \\ -1. & 0.5 & 1.\end{array}\right)

The eigenvector matrix? original…

\left(\begin{array}{lll} -0.645497 & 0 & -0.763763 \\ 0.540062 & -0.707107 & -0.456435 \\ 0.540062 & 0.707107 & -0.456435\end{array}\right)

rounded to the nearest .5…

\left(\begin{array}{lll} -0.5 & 0 & -1. \\ 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5\end{array}\right)

The new data? original…

\left(\begin{array}{lll} 7.11335 & 0 & 1.84396 \\ -2.37112 & -2.82843 & -0.614654 \\ 0 & 1.41421 & 0 \\ -4.74224 & 1.41421 & -1.22931\end{array}\right)

Rounded to the nearest .5 …

\left(\begin{array}{lll} 7. & 0 & 2. \\ -2.5 & -3. & -0.5 \\ 0 & 1.5 & 0 \\ -4.5 & 1.5 & -1.\end{array}\right)

Rounded to the nearest 1…

\left(\begin{array}{lll} 7. & 0 & 2. \\ -2. & -3. & -1. \\ 0 & 1. & 0 \\ -5. & 1. & -1.\end{array}\right)

i don’t see anything exciting in all that. (but if you’re thinking that i haven’t looked at everything in sight, you’re right. My bad. I’ll make it right. Soon. Very soon.)

As for (3) a plot of the data wrt the first two PCs, that should be: take the first two columns of the new data…

\left(\begin{array}{ll} 7.11335 & 0 \\ -2.37112 & -2.82843 \\ 0 & 1.41421 \\ -4.74224 & 1.41421\end{array}\right)

and plot those pairs of points. (this corresponds to davis’ plotting A^Q.)

That looks remarkably like a plot of my A^Q (mine or davis’ doesn’t matter much, but there are sign differences; nevertheless, the geometry was memorable.) here’s a plot of my A^Q:

Clearly we need to look at rounded A^Q and Z, and A and v. (i said i’d make it right.) here are jolliffe’s Z …

\left(\begin{array}{lll} 7.2 & 0 & 1.8 \\ -2.4 & -2.8 & -0.6 \\ 0 & 1.4 & 0 \\ -4.8 & 1.4 & -1.2\end{array}\right)

and my A^Q, rounded to the nearest .2 …

\left(\begin{array}{lll} -7.4 & 0 & 0 \\ 2.4 & -2.8 & 0 \\ 0 & 1.4 & 0 \\ 4.8 & 1.4 & 0\end{array}\right)

The first two columns are not identical, but awfully close, and everything else is identical (to within the usual sign ambiguity).

here are jolliffe’s A …

\left(\begin{array}{lll} -0.5 & 0 & -1. \\ 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5\end{array}\right)

and my v…

\left(\begin{array}{lll} 1. & 0 & 0.5 \\ -0.5 & -0.5 & 0.5 \\ -0.5 & 0.5 & 0.5\end{array}\right)

i am surprised at how similar these are. This will not always happen. After all, we saw, in jolliffe when he worked both the correlation and the covariance matrices, that the eigenvectors need not be that similar.

Moving on to (4), how many new variables would jolliffe tell us to keep? recall the eigenvalues of the correlation matrix…

\lambda = \{2.4,\ 0.6,\ 0\}

If we keep only those greater than 1 or even those greater than .7, we would keep only one.

Write them as percentages of total…

\{80., 20., 0\}

If we keep those whose cumulative contribution is 70 – 90%, we would keep only one.

a scree graph won’t help at all: we only have two nonzero eigenvalues, so we get a single line segment, and no “elbow” is possible.

What about the broken stick model? recall that if a stick of length 1 is broken into p pieces, then the expected length of the kth largest piece is given by

L(p,k) =  \frac{1}{p}\sum _{ j = k}^{ p} \frac{1}{j}

The expected lengths for a unit stick broken into 3 pieces are…

\{\frac{11}{18},\frac{5}{18},\frac{1}{9}\}

i.e.

\{0.611111,\ 0.277778,\ 0.111111\}

and our fractional eigenvalues were

\{0.8,\ 0.2,\ 0\}

So, once more, jolliffe would keep only one eigenvector (.8 > .61 but .2 < .278).

Jolliffe notwithstanding, i would keep both nonzero eigenvalues. it just doesn’t seem excessive.

PCA / FA example 4: davis. Reciprocal basis 4.

(this has nothing to do with the covariance matrix of X; we’re back with A^R for the SVD of X.)

in the course of computing the reciprocal basis for my cut down  A^R

\left(\begin{array}{ll} 7.48331 & 0. \\ -3.74166 & -2.44949 \\ -3.74166 & 2.44949\end{array}\right)

i came up with the following matrix:

\beta = \left(\begin{array}{ll} 0.0445435 & 0 \\ -0.0890871 & -0.204124 \\ -0.0890871 & 0.204124\end{array}\right)

now, \beta is very well behaved. Its two columns are orthogonal to each other:

\left(0.0445435,-0.0890871,-0.0890871\right) \left(\begin{array}{l} 0 \\ -0.204124 \\ 0.204124\end{array}\right) = 0

And each column of \beta is orthogonal to one of the A^R vectors…

\left(0.0445435,-0.0890871,-0.0890871\right) \left(\begin{array}{l} 0. \\ -2.44949 \\ 2.44949\end{array}\right) = 0

\left(0,-0.204124,0.204124\right) \left(\begin{array}{l} 7.48331 \\ -3.74166 \\ -3.74166\end{array}\right) = 0

and each column of \beta has dot product 1 with the other of the A^R vectors:

\left(0.0445435,-0.0890871,-0.0890871\right) \left(\begin{array}{l} 7.48331 \\ -3.74166 \\ -3.74166\end{array}\right) = 1

\left(0,-0.204124,0.204124\right) \left(\begin{array}{l} 0. \\ -2.44949 \\ 2.44949\end{array}\right) = 1

i know perfectly well that i could have simply said that \beta is diagonal and that \beta^T\ A^R = I, for my cut-down A^R.

I wanted to write out the orthogonality relations for \beta explicitly. you see, there’s just one tiny problem: i know that \beta is wrong. It is not the reciprocal basis for A^R.

What could be wrong? how can it be so well-behaved but wrong? on a hunch, i computed the normals to the planes spanned by the columns of B and by the columns of \beta, and by the (first two) columns of A^R. in 3D we just use the cross product of two vectors to find their normal.

The cross product of the first two columns of A^R is:

\{-18.3303,-18.3303,-18.3303\}

That’s a normal to the plane spanned by the first two columns of A^R. it’s also (anti-)parallel to the third column of v:

\{0.57735,0.57735,0.57735\}.

Of course: the eigenvector basis described by v is orthogonal, so the third one is normal to the plane spanned by the first two.

Similarly, the cross product of the first two columns of B is:

\{-0.0181848,-0.0181848,-0.0181848\}.

(how convenient that the normal vectors have all components equal!) that, too, is (anti-)parallel to the third column of v, so it’s also normal to the plane spanned by the first two columns of A^R.

and the cross product of the columns of \beta?

\{-0.0363696,-0.00909241,-0.00909241\}.

well, shiver my timbers. that is not like the other two cross products, hence not normal to the plane spanned by the first two columns of A^R.

In contrast to the two vectors of B, the two vectors of \beta do not lie in the same plane as the A^R vectors. They do not span the same plane. They are not a reciprocal basis for the same plane.

\beta is very nice 2D basis. it just happens to span a plane other than the one spanned by A^R. it’s not normal to the third column of v.

It isn’t sufficient to require \beta^T\ A = I; we also need to require that the columns of \beta span the same space – in this case, the same plane – as the columns of A^R!

And how would we realize that our alleged reciprocal basis \beta wasn’t one? well, the projections of X onto the reciprocal basis should be the S^R. if they’re not, something’s wrong.

(guess what? now you know how i discovered that \beta was wrong.)

i would hate weighted eigenvector matrices when there are eigenvalues equal to zero, except that i think i can handle them. this is when i decided to replace zero eigenvalues by 1s, take the inverse transpose, and then take a sub-basis (i.e. the appropriate number of columns).

My own list of things to do now includes constructing the reciprocal basis in precisely that fashion. i choose to construct it because it corresponds to S^R, and let’s me compute the data values for the A^R basis.

PCA / FA example 4: davis. Davis & harman 3.

Hey, since we have the reciprocal basis, we can project onto it to get the components wrt the A^R basis. After all that work to show that the S^R are the components wrt the reciprocal basis, we ought to find the components wrt the A^R basis. And now we know that that’s just a projection onto the reciprocal basis: we want to compute X B.

Recall X:

X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

Recall B:

B = \left(\begin{array}{ll} 0.0890871 & 0. \\ -0.0445435 & -0.204124 \\ -0.0445435 & 0.204124\end{array}\right)

The product is:

X\ B = \left(\begin{array}{ll} -0.801784 & 0 \\ 0.267261 & -0.816497 \\ 0 & 0.408248 \\ 0.534522 & 0.408248\end{array}\right)

What are the column variances?

\{0.333333,0.333333\}

Does that surprise you? the 1/3 comes from our using eigenvalues of X^T\ X instead of eigenvalues of the covariance matrix. The new variables have a common variance, but it isn’t 1. this corresponds to our initial work in harman, when we discovered that using the weighted eigenvector matrix gave us new variables with variance 1.

at first glance, it seems not worthwhile to work that out using the covariance matrix, but i suppose it might be worthwhile to come at the covariance matrix from the SVD. (oh, yes! it is!)

Let’s do it.

The covariance matrix of X is \frac{X^T\ X}{N-1}

We should just replace X by X2 = \frac{X}{\sqrt{N-1}} and find the SVD of X2. but there’s a catch.

Do it anyway. here’s our X2:

X2 = \left(\begin{array}{lll} -3.4641 & 1.73205 & 1.73205 \\ 1.1547 & 0.57735 & -1.73205 \\ 0. & -0.57735 & 0.57735 \\ 2.3094 & -1.73205 & -0.57735\end{array}\right)

Compute the variances…

\{6.22222,2.22222,2.22222\}

and their sum … 10.6667 .

Hang on. The data X2 has 1/3 the variance of the original data. We’re trying to find new forms of X (not of X2) with variance 1. but let’s just keep going.

Get the SVD of X2:

here are new u…

u2 = \left(\begin{array}{llll} -0.801784 & 0 & -0.571429 & 0.174964 \\ 0.267261 & -0.816497 & -0.449762 & -0.24417   \\ 0. & 0.408248 & -0.267261 & -0.872872 \\ 0.534522 & 0.408248 & -0.632262 & 0.384531\end{array}\right)

new v…

v2 = \left(\begin{array}{lll} 0.816497 & 0 & 0.57735 \\ -0.408248 & -0.707107 & 0.57735 \\ -0.408248 & 0.707107 & 0.57735\end{array}\right)

and new w…

w2 = \left(\begin{array}{lll} 5.2915 & 0. & 0. \\ 0. & 2. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Then i compute new A’s and S’s from

A^R = v\ w^T.

A^Q = u\ w.

S^R = X\ A^R.

S^Q = X^T\ A^Q.

(using X2, u2, v2, and w2, of course)

A^Q are the components of the data X2 wrt the orthogonal eigenvector matrix v2.

A^Q = \left(\begin{array}{lll} -4.24264 & 0 & 0 \\ 1.41421 & -1.63299 & 0 \\ 0. & 0.816497 & 0 \\ 2.82843 & 0.816497 & 0\end{array}\right)

Compute the variances…

\{9.33333,1.33333,0.\}

and their sum is 10.6667 . Yes, the new orthogonal eigenvector matrix v has redistributed the total variance of X2 among two new variables.

What about the R-mode scores?

S^R = \left(\begin{array}{llll} -22.4499 & 0 & 0 & 0 \\ 7.48331 & -3.26599 & 0 & 0 \\ 0. & 1.63299 & 0 & 0 \\ 14.9666 & 1.63299 & 0 & 0\end{array}\right)

Compute the variances:

\{261.333,5.33333,0.,0.\}

i say again, yikes!

For the record, here are the new

A^R = \left(\begin{array}{llll} 4.32049 & 0 & 0 & 0 \\ -2.16025 & -1.41421 & 0 & 0 \\ -2.16025 & 1.41421 & 0 & 0\end{array}\right)

and

S^Q = \left(\begin{array}{lll} 22.8619 & 0 & 0 \\ -11.431 & -2.82843 & 0 \\ -11.431 & 2.82843 & 0\end{array}\right)

From the orthonormal basis v2 , i construct a reciprocal basis for the \sqrt{\text{eigenvalue}}-weighted matrix A^R. i’ll just use the diagonal of w2 (instead of the \sqrt{\text{eigenvalue}} since i haven’t computed them) to scale two of the vectors. Recall the new w matrix:

w2 = \left(\begin{array}{lll} 5.2915 & 0. & 0. \\ 0. & 2. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

i construct a diagonal matrix using 1/w when possible, 1 otherwise.

\left(\begin{array}{lll} 0.188982 & 0. & 0. \\ 0. & 0.5 & 0. \\ 0. & 0. & 1.\end{array}\right)

then i premultiply that by the new v, and keep only the first two columns; the new reciprocal basis is:

B2 = \left(\begin{array}{ll} 0.154303 & 0 \\ -0.0771517 & -0.353553 \\ -0.0771517 & 0.353553\end{array}\right)

Having the reciprocal basis, we project the data onto it get the new data wrt the A^R basis. But we project the original data. That’s the catch. As i said, it’s the variance of X we’re trying to scale to 1, not the variance of X2. we compute X B2:

X\ B2 = \left(\begin{array}{ll} -1.38873 & 0 \\ 0.46291 & -1.41421 \\ 0 & 0.707107 \\ 0.92582 & 0.707107\end{array}\right)

And the variances?

\{1.,1.\}.

Indeed. this corresponds even more closely to what we first saw in harman. (if instead we project X2 onto B2, we get a common variance of 1/3 again.)

PCA / FA example 4: davis. Reciprocal basis 2 & 3.

The reciprocal basis for A^R using explicit bases.

Here I go again, realizing that I’m being sloppy. i call

\{2,1,-3\}

the second data vector, but of course, those are the components of the second data vector. a lot of us blur this distinction a lot of the time, between a vector and its components. So long as we work only with components, this isn’t an issue, but we’re about to write vectors. i wouldn’t go so far as to say that the whole point of matrix algebra is to let us blur that distinction, but it’s certainly a major reason why we do matrix algebra. but for now, let’s look at the linear algebra, distinquishing between vectors and their components.

i need a name for the second data vector; let’s call it s. we will write it two ways, with respect to two bases, and show that the two ways are equivalent.

We have the original basis. We never write it, because the basis vectors have components (1,0,0), (0,1,0), and (0,0,1). let me call this basis e_i. because it’s an orthonormal basis, it is its own reciprocal basis, and i will call the reciprocal basis e^i. yes, i use the same symbol e, but the original basis has subscripts, the reciprocal basis has superscripts.

We have the basis described by A^R. let me call those vectors f_i. we found the reciprocal basis B, and i will call it f^i. same convention: subscripts or superscripts on the common symbol f.

let us recall the (transition matrix B) for the reciprocal basis:

\left(\begin{array}{ll} 0.0890871 & 0. \\ -0.0445435 & -0.204124 \\ -0.0445435 & 0.204124\end{array}\right)

let us also recall the design matrix X:

\left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

When we say the components of the second data vector are

\{2,1,-3\}

we are asserting that the second data vector is

s = 2\ e_1 + e_2 - 3\ e_3.

When we say that the first reciprocal basis vector has components given by the first column of B…

\{0.0890871,-0.0445435,-0.0445435\}

we are asserting that the first reciprocal basis vector is

f^1= 0.0890871\ e^1\ -0.0445435\ e^2 - 0.0445435\ e^3

Similarly, the second reciprocal basis vector, whose components are the second column of B, is

f^2 = 0. e^1 - 0.204124\ e^2 + 0.204124\ e^3

Let us recall (the first two columns of) S^R:

\left(\begin{array}{ll} -67.3498 & 0. \\ 22.4499 & -9.79796 \\ 0. & 4.89898 \\ 44.8999 & 4.89898\end{array}\right)

When we say that the second row of S^R

\{22.4499,-9.79796\}

are the components of the second data vector wrt the reciprocal basis \left(f^i\right), we are saying that

s = 22.4499 f^1 - 9.79796 f^2.

but we defined the vector s as

s = 2\ e_1 + e_2 - 3\ e_3.

The two forms of s should be the same vector. We start with…

s = 22.4499 f^1 - 9.79796 f^2.

Plug in the expressions for f^1 and f^2

s = 22.4499 \left(0.0890871 e^1 - 0.0445435   e^2 - 0.0445435 e^3\right)
- 9.79796 \left(0.   e^1 - 0.204124 e^2 + 0.204124 e^3\right)

and Mathematica® simplifies that to

s = 2. e^1 + 1. e^2 - 3. e^3,

just what it should be.

The reciprocal basis for A^R using tensors.

Third, i can write this another way, and some of you will have seen it this way. If you have, i just want to refresh your memory. Using tensor notation (and the einstein summation convention), we could have written

s = s_i\ f^i and s = s^i\ f_i.

which is another way of saying that the components of s wrt the reciprocal basis f^i are s_i and the components of s wrt the basis f_i are s^i. It is absolutely crucial that corresponding components and vectors do not both have superscripts or subscripts, but are mixed.

what is significant are these equations

s\ \cdot\ f_i = s_i

and

s\ \cdot\ f^i = s^i,

which I offer as reminders; i’m not going to work them out for you at this time. they are another way of writing what we’ve been saying all along: dotting a vector into some basis vectors gives you the components wrt the reciprocal basis.

PCA / FA example 4: davis. Scores & reciprocal basis 1.

The reciprocal basis for A^R, using matrix multiplication.

i keep emphasizing that A^Q is the new data wrt the orthogonal eigenvector matrix; and i hope i’ve said that the R-mode scores S^R = X\ A^R are not the new data wrt the weighted eigenvector matrix A^R. In a very real sense, the R-mode scores S^R do not correspond to the R-mode loadings A^R. This is important enough that i want to work out the linear algebra. In fact, i’ll work it out thrice. (And i’m going to do it differently from the original demonstration that A^Q is the new data wrt v.)

Of course, people will expect us compute the scores S^R, and I’ll be perfectly happy to. I just won’t let anyone tell me they correspond to the A^R.

First off, we have been viewing the original data as vectors wrt an orthonormal basis. Recall the design matrix:

\left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

Let’s take just the second observation:

X_2 = (2,\ 1,\ -3)

Those are the components of a vector.

Second, we have an orthonormal basis defined by the orthogonal eigenvector matrix v. Recall A^Q:

\left(\begin{array}{lll} -7.34847 & 0. & 0. \\ 2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ 4.89898 & 1.41421 & 0.\end{array}\right)

for the second row, in particular, we have new components….

A^{Q}_2 = (2.44949,\ -2.82843,\ 0.)

We’ve already seen that old and new components (x and y, resp) are related by

X_2 = v\ A^{Q}_2,

although we wrote the whole thing as A^Q = X\ v. yes? we confirm it simply by computing…

v\ A^{Q}_2 = \left(\begin{array}{lll} 0.816497&0&0.57735\\ -0.408248&-0.707107&0.57735\\ -0.408248&0.707107&0.57735\end{array}\right) x \left(\begin{array}{l} 2.44949\\ -2.82843\\ 0.\end{array}\right)

= \left(\begin{array}{l} 2.\\ 1.\\ -3.\end{array}\right)

which is X_2.

The second row of A^Q represents the same vector as the second row of X, but the components are different because the basis vectors are different.

All that was review, albeit from a slightly different point of view. let’s move on to the weighted eigenvector matrix A^R; it is not an orthonormal basis because the basis vectors are not of length 1, although they are orthogonal to each other. Here’s my A^R from the SVD:

\left(\begin{array}{llll} 7.48331&0.&0.&0.\\ -3.74166&-2.44949&0.&0.\\ -3.74166&2.44949&0.&0.\end{array}\right)

i just want the first two columns. Here’s my A^R cut down…

\left(\begin{array}{ll} 7.48331&0.\\ -3.74166&-2.44949\\ -3.74166&2.44949\end{array}\right)

and my S^R cut down:

\left(\begin{array}{ll} -67.3498&0.\\ 22.4499&-9.79796\\ 0.&4.89898\\ 44.8999&4.89898\end{array}\right)

I’ve been implying, if not saying, that the projections of X onto A^R, namely X\ A^R (which are the R-mode scores S^R), are the components of the data wrt a basis other than A^R. it’s what may best be called the reciprocal basis.

(if you haven’t looked at “the reciprocal basis” in http://rip94550.wordpress.com/2008/03/21/transpose-matrix-adjoint-operator-2/, you might want to.)

Briefly, a reciprocal basis is constructed to make the dot product of vectors work out right when we have a non-orthonormal basis. What we do is get the components of one vector wrt the basis, the other vector wrt the reciprocal basis, and then we can compute their dot product in the usual way, as the sum of products of corresponding entries; i.e. as a row vector times a column vector.

If we have a transition matrix A, then the transition matrix B for the reciprocal basis is B = A^{-T}. (because it gives us B^T\ A = A^{-1}\ A = I, which says that the dot product of a row of B^T with a column of A is either 0 or 1. it’s equivalent to saying that the dot product of a column of B with a column of A is either 0 or 1. either way, the column vectors of B and of A are pairwise orthogonal.)

we have a problem, however: we have a matrix A^R which is not invertible. Another damned dragon to cope with.

We could look for a matrix B such that B^T\ A = I. unfortunately, such matrices are not unique.

Let’s construct the reciprocal basis geometrically. As A^R was constructed from v by multiplying columns by \sqrt{\text{eigenvalue}} we will construct B by dividing v by \sqrt{\text{eigenvalue}}, but we’ll just do it for the first two columns. The quickest way to accomplish that is to post-multiply by a diagonal matrix whose entries are \frac{1}{\sqrt{\text{Eigenvalue}}}, when possible and 1 otherwise… the diagonal eigenvalue matrix is…

\left(\begin{array}{lll} 9.16515&0.&0.\\ 0.&3.4641&0.\\ 0.&0.&0.\end{array}\right)

so i form a new diagonal matrix

\left(\begin{array}{lll} 0.109109&0.&0.\\ 0.&0.288675&0.\\ 0.&0.&1.\end{array}\right)

Note the 1 instead of 0 in the (3, 3) slot.

recall v…

\left(\begin{array}{lll} 0.816497&0&0.57735\\ -0.408248&-0.707107&0.57735\\ -0.408248&0.707107&0.57735\end{array}\right)

and compute the product of v and the new diagonal matrix

\left(\begin{array}{lll} 0.816497&0&0.57735\\ -0.408248&-0.707107&0.57735\\ -0.408248&0.707107&0.57735\end{array}\right) x \left(\begin{array}{lll} 0.109109&0.&0.\\ 0.&0.288675&0.\\ 0.&0.&1.\end{array}\right)

= \left(\begin{array}{lll} 0.0890871&0.&0.57735\\ -0.0445435&-0.204124&0.57735\\ -0.0445435&0.204124&0.57735\end{array}\right)

and since i only want the first two columns, i cut that down to

B = \left(\begin{array}{ll} 0.0890871&0.\\ -0.0445435&-0.204124\\ -0.0445435&0.204124\end{array}\right)

and call it B.

It is true that B^T\ A = I. (so the columns of B are mutually orthogonal to the columns of A.)

Now, B is the transition matrix for the reciprocal basis of A^R. S^R was defined as X\ A^R, the projections of the data onto the A^R basis. i claim that S^R is the components of the data wrt B, the reciprocal basis for A^R.

i want to confirm this just for the second data vector, namely that

X_2 = B\ S^{R}_2,

i.e. the old components are found by applying the transition matrix B to the new components. the RHS is

B\ S^{R}_2 = \left(\begin{array}{ll} 0.0890871&0.\\ -0.0445435&-0.204124\\ -0.0445435&0.204124\end{array}\right) x \left(\begin{array}{l} 22.4499\\ -9.79796\end{array}\right)

= \left(\begin{array}{l} 2\\ 1\\ -3\end{array}\right)

and, once again, we have found the old components of the second observation. More importantly, we have exhibited a basis B \ne\ A^R wrt which the R-mode scores represent the new components of the data.

PCA / FA example 4: davis. Davis & harman 2.

What about redistributing the variance?

We learned from harman that if we use the orthogonal eigenvector matrix v to compute new data, the result has redistributed the original variance according to the eigenvalues; and we also learned that if we use the weighted eigenvector matrix instead, we get new variables of unit variance.

(i think i thought this required us to start with the correlation matrix; it does not.)

let’s see this again for davis’ centered data X:

\left(\begin{array}{lll} -6&3&3\\ 2&1&-3\\ 0.&-1&1\\ 4&-3&-1\end{array}\right)

We have variances…

\{\frac{56}{3},\ \frac{20}{3},\ \frac{20}{3}\}

or…

{18.6667,\ 6.66667,\ 6.66667}

and the sum of the variances is…

32.

Now is a good time to recall w^T\ w:

\left(\begin{array}{lll} 84.&0.&0.\\ 0.&12.&0.\\ 0.&0.&0.\end{array}\right)

The sum of the diagonal elements is 96, and if we divide by 3 (= N-1, for the covariance matrix instead of for X^T\ X), gee, we get 32.

stop for a moment. We have two, admittedly proportional, sets of eigenvalues, one set from the covariance matrix, the other from X^T\ X. this gives us two different possible sets of weights to apply to the eigenvector matrix v. (recall that v is an eigenvector matrix both for X^T\ X and for the covariance matrix of X, as well as coming out of the SVD of X.)

Let’s stay with what i know: our new data, using the orthogonal eigenvector matrix v, is A^Q.

\left(\begin{array}{lll} -7.34847&0.&0.\\ 2.44949&-2.82843&0.\\ 0.&1.41421&0.\\ 4.89898&1.41421&0.\end{array}\right)

The variances are…

{28.,\ 4.,\ 0.}

and we don’t even need to consciously add those: the sum is 32. We have redistributed the original variance among our two new variables.

Now, everyone seems to say that the \sqrt{\text{eigenvalue}}-weighted eigenvectors have variances equal to the eigenvalues. What those eigenvectors really have is lengths equal to their \sqrt{\text{eigenvalue}}.

here indeed be dragons.

as linear algebra is my touchstone in general for all of this, that redistributed variance using the orthogonal eigenvector matrix is my vorpal blade (dragon, jabberwock, no big deal – snicker-snack either way). any different choice of basis must lead to different results.

we looked at that redistribution of variance in harman. We learned then that the correct way to get new data with redistributed variance is to use the orthogonal eigenvector matrix. We just reconfirmed it for davis’ example.

Instead of actually kicking one of the dragons awake, let’s try counting its scales while it sleeps. If the R-mode scores S^R are some form of data (and they are), what are their variances?

my S^R is

\left(\begin{array}{llll} -67.3498&0.&0.&0.\\ 22.4499&-9.79796&0.&0.\\ 0.&4.89898&0.&0.\\ 44.8999&4.89898&0.&0.\end{array}\right)

the variances of the first two columns are

{2352.,\ 48.}

yikes! if nothing else, that should convince you that the R-mode scores are weird.Life is so much simpler if we use the orthogonal eigenvector matrix. The new data is trivial to compute (X v), and we can see that it has redistributed the variance of the original data.Trying to use any weighted eigenvector matrix is dicey if there is a zero eigenvalue (because our transition matrix is not invertible); and confusing if we used X^T\ X instead of the covariance matrix (because they have different eigenvalues).i will eventually calculate the data wrt both of the weighted eigenvector matrices. The dragons will be awake and aloft and pretty to look at.

PCA / FA example 4: davis. Davis & harman 1.

Let us now recall what we saw harman do. Does he have anything to add to what we saw davis do? if we’re building a toolbox for PCA / FA, we need to see if harman gave us any additional tools.

here’s what harman did or got us to do:

  • a plot of old variables in terms of the first two new ones
  • redistribute the variance of the original variables
  • his model, written Z = A F, was (old data, variables in rows) = (some eigenvector matrix) x (new data)

We should remind ourselves, however, that harman used an eigendecomposition of the correlation matrix; and the eigenvectors of the correlation matrix are not linearly related to the eigenvectors of the covariance matrix (or, equivalently, of X^T\ X.) having said that, i’m going to go ahead and apply harman’s tools to the davis results. i will work from the correlation matrix when we get to jolliffe.

What exactly was that graph of harman’s? he took the first two columns of the weighted eigenvector matrix, then plotted each row as a point. He had 5 original variables. His plot

shows that the old 1st and 3rd variables, for example, are defined very similarly in terms of the first two new variables. let me emphasize that this is a description of defining variables, not a display of the resulting data. Simple counting can help with the distinction, e.g. 3 variables but 4 observations.

from davis we only have 3 variables to start, and our weighted eigenvector matrix has only two nonzero columns; and it’s one form of what we called A^R. we have to be careful, because there are sign differences among the various forms of A^R; not because the definitions are different, but because eigenvectors are not unique. In particular, my u and v from the SVD have some different signs from davis’ U and V from eigendecompositions.

most of the time, i will use my SVD results, but right now let’s compare davis with harman. Davis had A^R as…

\left(\begin{array}{ll} 7.48331&0.\\ -3.74166&2.44949\\ -3.74166&-2.44949\end{array}\right)

Davis showed us a plot of these numbers viewed as three points.

So the R-mode loadings A^R tell us how the 3 old variables are described by the new ones; we see that a plot of A^R corresponds to harman’s plot.

i want to emphasize that even the orthogonal eigenvector matrix shows us that the data is really only 2D. Don’t think that we had to scale the eigenvectors in order to get 2D.

my orthogonal eigenvector matrix v is

\left(\begin{array}{lll} 0.816497&0&0.57735\\ -0.408248&-0.707107&0.57735\\ -0.408248&0.707107&0.57735\end{array}\right)

but wrt that basis, the new data is A^Q:

\left(\begin{array}{lll} -7.34847&0.&0.\\ 2.44949&-2.82843&0.\\ 0.&1.41421&0.\\ 4.89898&1.41421&0.\end{array}\right)

i.e. the new data is described by the first two eigenvectors, hence it lies in a plane. but then so does the original data, because the transformation between them is linear (a plane cannot be mapped to 3D, although it could be mapped to a line). we could plot the data in 3D and rotate the point of view until we see the plane edge on.

Do davis and harman’s models resemble each other?

Let’s recall the SVD form of the A’s and S’s.

X = u\ w\ v^T

A^R = v\ w^T.

A^Q = u\ w.

S^R = X\ A^R = u\ w\ w^T= A^Q\ w^T.

S^Q = X^T\ A^Q = v\ w^T\ w = A^R\ w.

Then

X = u\ w\ v^T = A^Q\ v^T = u\ AR^T.

in particular,

X\ v = A^Q and then X^T = Z = v\ AQ^T.

Z = v\ AQ^T is harman’s model (Z = A F, where Z = X^T, A is a weighted eigenvector matrix, F is the new data), while A^Q = X\ v is jolliffe’s (Z = X A, where Z is the new data, X is the original data, and A is the orthogonal eigenvector matrix). i think it would be unfair to say that davis is using either of those models, especially since harman and jolliffe (usually) work from the correlation matrix.

i would summarize davis with all four A’s and S’s.

i can’t, however, say too many times that A^R and A^Q can be constructed using u or v:

A^R = X^T\ u = v\ w^T,

and

A^Q = X\ v = u\ w.

i see no grounds for associating A^R with just u or just v. and no grounds for associating either u or v primarily with R-mode or Q-mode; they’re equally important to the SVD.

i would probably focus on v and A^Q: one defines new variables, the other gives me the associated new data.

PCA / FA example 4: Davis review (3)

Let’s talk about centering the data. Recall the questions:

  • is there any chance that we can always center both the columns and the rows?
  • is there any chance that we can always standardize both the columns and the rows?
  • if we can’t have our cake and eat it too, should we give up the duality between R-mode and Q-mode?
  • what if rows of the data matrix add up to 1 or 100 (i.e. the variables are percentages)?

(My third question is poorly phrased. we always have the duality between A’s and S’s; what i feared losing was the idea that Q-mode is just R-mode applied to the transpose X^T; but what if X doesn’t have row-centered data?)

The third question is also a red herring. We can always make both the columns and the rows add up to zero. Proving it isn’t too hard; some convenient notation might simplify things. If the data is

x_{ij}

then

x_{\cdot j} and x_{i\cdot}

are good symbols for the column means and row means, respectively. The grand mean (the mean of all the matrix values = mean of column means = mean of row means) would be denoted

x_{\cdot\cdot}

BTW, it’s very handy that the mean of column means = mean of row means; we’ll see how handy, soon.

The only subtlety is that what we have here are row and column means of the original data, instead of, for example, column means of the original data and row means of the column-centered data. We can get “doubly-centered data” by computing

x_{ij} - x_{\cdot j} - x_{i\cdot} + x_{\cdot\cdot}

Why are we adding the grand mean? Because we subtracted it twice. If we had taken row means of the column centered data, we wouldn’t add the grand mean back in; but i chose, and i’m sure most people choose, to take row means of the original data.

I leave the proof that we can always get doubly-centered data for the interested reader. Really, it shouldn’t be difficult.

It means that we can always do both the R-mode and Q-mode analyses on a doubly-centered design matrix X.

That had bothered while i was looking at Q-mode. I knew that davis’ data was row-centered as well as column-centered (had zero-mean rows as well as zero-mean columns, resp.), but i just assumed he had picked the numbers that way in order to avoid discussing the case when the rows were not zero-mean.

We can always have both row-centered and column centered simultaneously.
(We can have our cake and eat it too, so we don’t have to worry about Q-mode being done with non-centered data.)

NOTE that we cannot do that with standardized data (or, therefore, with the correlation matrices). That is, we cannot standardize both the columns and the rows. The unit column-variances get messed up when we standardize the rows after standardizing the columns.

As I draft this, it is an open question for me whether we can always arrange to have column-centered data with row sums equal to 1. Let me follow some great advice, scary as it is: let me guess. (from John A. Wheeler, the great physics teacher: if my guess turns out right, i have reinforced my intuition; if my guess turns out wrong, i correct my intuition.)

I think we can do it. Adding that grand mean back in looks promising for adding 1 back in.

(I won’t string you along any further: my intuition was wrong. We’ll see what happens instead. It’s almost as good.)

Let’s try an example. I start with a real simple set of numbers:

\left(\begin{array}{llll} 1&2&3&4\\ 5&6&7&8\\ 9&10&11&12\end{array}\right)

Square the first row, square-root the second row, leave the third alone:

\left(\begin{array}{llll} 1&4&9&16\\ \sqrt{5}&\sqrt{6}&\sqrt{7}&2 \sqrt{2}  \\ 9&10&11&12\end{array}\right)

While i’ve got it in this form, get what will be the row means.

{4.07869,\ 5.48316,\ 7.54858,\ 10.2761}

(Much of Mathematica® works on rows; I’ve just been working with X^T.) Transpose. This is my example data matrix.

\left(\begin{array}{lll} 1&\sqrt{5}&9\\ 4&\sqrt{6}&10\\ 9&\sqrt{7}&11\\ 16&2 \sqrt{2}&12\end{array}\right)

Get the column means.

{7.5,\ 2.53993,\ 10.5}

The grand mean is any of the following: the mean of row means… the mean of column means… or the mean of all 12 values. No matter how we compute it, we get

6.84664 .

We get doubly-centered data by

x_{ij} - x_{\cdot j} - x_{i\cdot} + x_{\cdot\cdot}

which gives us

\left(\begin{array}{lll} -3.73204&2.46409&1.26796\\ -2.13652&1.27304&0.863481\\ 0.798061&-0.596122&-0.201939\\ 5.0705&-3.141&-1.9295\end{array}\right)

(Yes, the column means are zero and the row means are zero.)

That was easy enough.

Let’s back up and look at just column-centered data from the original.

\left(\begin{array}{lll} -6.5&-0.303866&-1.5\\ -3.5&-0.0904443&-0.5\\ 1.5&0.105817&0.5\\ 8.5&0.288493&1.5\end{array}\right)

The column means, of course, are zero… the row means are not (these, of course, are the row means of the column centered data, not of the original data)…

{-2.76796,\ -1.36348,\ 0.701939,\ 3.4295}

but the grand mean is zero. (go ahead, compute it if you must.)

The grand mean is zero because it is the mean of the column means, each of which is zero. This implies that the mean of the row means is zero. In particular, since 1 ≠ 0, we cannot have column-centered data with constant row sums = 1. (I said it was handy that the grand mean could be computed more than one way.)

maybe I should emphasize what we just saw: the act of column-centering the data implies that the sum of the new row means is zero. If each row mean is the same, then that common value must be zero. That example did not have constant row sums, but we see what must happen if it had.

I probably do not need to work an example, but i’m going to. And i’m going to do it because I had actually worked this example first; then i realized what had happened, and i told you you up front. So suppose we had started with a matrix where the rows each add up to 1. Go back to the raw data again….

\left(\begin{array}{llll} 1&4&9&16\\ \sqrt{5}&\sqrt{6}&\sqrt{7}&2 \sqrt{2}  \\ 9&10&11&12\end{array}\right)

but change the rows. Divide each row by its mean, and by 3. (I don’t want row mean = 1, I want row sum = 1.)

\left(\begin{array}{lll} 0.0817256&0.182744&0.73553\\ 0.243169&0.14891&0.607922\\ 0.397426&0.116832&0.485742\\ 0.519002&0.0917474&0.389251\end{array}\right)

Get the column means…

{0.31033,\ 0.135058,\ 0.554611}

and having said that we want row sums = 1, we confirm that each row mean is 1/3, because it’s an easier command to Mathematica®:

{0.333333,\ 0.333333,\ 0.333333,\ 0.333333}

I then construct column-centered data:

\left(\begin{array}{lll} -0.228605&0.0476857&0.180919\\ -0.0671617&0.0138515&0.0533102  \\ 0.0870952&-0.0182262&-0.068869  \\ 0.208671&-0.0433109&-0.16536\end{array}\right)

We could confirm that the column means are zero; and we compute the row means…

{2.77556 x 10^{-17},\ 9.25186 x 10^{-18},\ -1.85037 x 10^{-17},\ -9.25186 x 10^{-18}}

We automatically got row-centered data, as expected.

We could look at standardized data, but i’ll leave this to you, too: a counterexample should be convincing enough (and I’ve given you an matrix to play with). We can’t preserve unit variance of the columns when we then standardize the rows.

Summary:

  1. we can always get doubly-centered data.
  2. my guess was wrong: we cannot have column-centered with constant row sums, unless the constant is 0.
  3. but if the rows do have a common row mean (equivalently, a common row sum), then just centering the columns automatically gives us doubly-centered data.
  4. we cannot always get doubly-standardized data.
  5. if i center the columns, then i would be inclined to center the rows: use doubly-centered data in preference to only column-centered (assuming, as i prefer, that the variables are columns.)

Let me elaborate on (5). For a matrix X with variables in columns, column-centered data means that X^T\ X is proportional to the covariance matrix of X. If we want  X\ X^T to be proportional to the covariance matrix of X^T, then we need row-centered data. If we want both X^T\ X and  X\ X^T to be proportional to covariance matrices, then we need doubly-centered data.