Example: Is it a transition matrix? Part 2

We had three matrices from Jolliffe, P, V, and Q. They were allegedly a set of principal components P, a varimax rotation V of P, and a quartimin “oblique rotation” Q.

I’ll remind you that when they say “oblique rotation” they mean a general change-of-basis. A rotation preserves an orthonormal basis; a rotation cannot transform an orthonormal basis to a non-orthonormal basis, and that’s what they mean — a transformation from an orthonormal basis to a non-orthonormal basis, or possibly a transformation from a merely orthogonal basis to a non-orthogonal one. In either case, the transformation cannot be a rotation.

(It isn’t that complicated! If you change the lengths of basis vectors, it isn’t a rotation; if you change the angles between the basis vectors, it isn’t a rotation.)

Anyway, we showed in Part 1 that V and Q spanned the same 4D subspace of R^{10}\ .

Now, what about V and P? Let me recall them:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

As before, I suppose there is a transition matrix T relating them; each column of V^T = V' can be computed by applying a transition matrix T to the corresponding column of P^T\ :

V’ = T P’

i.e.

V = P T’.

Then I imagine briefly that I can “solve” for T’…

T' = P^{-1}\ V\ ,

but then quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (P'P)^{-1}\ P'\ V\ .

I compute T’, and then display its transpose T:

T = \left(\begin{array}{cccc} 0.905145 & -0.386677 & 0.0882059 & -0.141151 \\ 0.853806 & 0.568131 & -0.0875601 & -0.220572 \\ 0.64498 & -0.155196 & -0.506397 & 0.490735 \\ 0.507437 & 0.131036 & 0.67664 & 0.347268\end{array}\right)

and then I test whether

V = P T’ (?).

The largest difference (element by element) between the matrices V and P T’ is 0.170627, which is rather larger (by a factor of 23) than the differences we saw in Part 1. Here’s the full blown matrix — where I chose to round off first, then subtract (i.e. what we’d compute if we were subtracting published, already rounded-off, matrices):

V - P\ T' = \left(\begin{array}{cccc} 0. & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0. & 0 & 0. & 0. \\ 0. & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I am inclined to think there is a probem. But we don’t know much at all about it.

Let’s try one of our alternatives. We construct a projection operator onto the subspace spanned by P.

Get the SVD (Singular Value Decomposition) of P…

P = u\ w\ v^T\ .

Then, since u is 10×10 and V is of rank 4, the leftmost 4 columns (u1) of u….

u1 = \left(\begin{array}{cccc} 0.132041 & 0.280276 & -0.423196 & 0.0366383 \\ 0.24536 & 0.379423 & -0.288143 & -0.0619886 \\ -0.0781059 & 0.252968 & -0.245411 & -0.0642461 \\ -0.310642 & 0.402272 & -0.122076 & 0.641009 \\ 0.19934 & 0.202925 & -0.456543 & -0.212204 \\ -0.275595 & 0.504975 & 0.417433 & 0.130869 \\ 0.0279239 & 0.347505 & 0.199748 & -0.545498 \\ -0.547166 & 0.00466106 & -0.00216808 & -0.298924 \\ -0.571058 & -0.293548 & -0.488675 & -0.061545 \\ -0.278923 & 0.220314 & 0.0368869 & -0.356268\end{array}\right)

are an orthonormal basis for the range of P (i.e. the column space of P). We construct a projection operator, as we have before, by

R = u1\ u1^T\ .

We check it first by confirming that R R = R, i.e. it is idempotent, hence a projection operator. We check it further by applying it to the column vectors in P: we expect that R P = P; i.e. it actually projects onto the column space of P (the space spanned by the columns of P).

Now, having checked that R is a projection operator onto P, we apply it to V, and compute V – R V:

V - R\ V = \left(\begin{array}{cccc} 0.00221622 & 0.0115115 & -0.0249851 & -0.0145408 \\ 0.0137721 & 0.012178 & -0.0243583 & -0.0200425 \\ -0.00564412 & -0.011024 & 0.00933521 & 0.0112507 \\ 0.00222059 & 0.00165751 & -0.00118369 & -0.00423324 \\ 0.00223276 & 0.00638406 & -0.0153346 & -0.0119237 \\ -0.00896396 & -0.0106169 & 0.0272358 & 0.0230479 \\ -0.00941945 & -0.0152489 & 0.0456323 & 0.0330044 \\ 0.0470134 & 0.0703021 & -0.170627 & -0.121383 \\ -0.0232815 & -0.036102 & 0.0941953 & 0.068255 \\ -0.0227796 & -0.0330688 & 0.0740154 & 0.0474348\end{array}\right)

Let me round that to the nearest .01…

V - R V = \left(\begin{array}{cccc} 0 & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0 & 0 & 0 & 0 \\ 0 & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I should put that array of differences into perspective. Recall V:

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

This reinforces my belief that we have a problem. Some of the differences we found are larger than some of the values of V itself. I can see that V is not in the column space of P, and while it’s not extremely far away, it’s not close either. (This seems more definitive than “the hypothetical transition matrix T doesn’t quite work”.)

As before, however, we are tip-toeing around the real question: Is every column vector in V a linear combination of the columns of P?

Easy enough. Do a regression. Actually 4 of them, one for each column of V.

Let me name the individual columns of the P matrix, and let me set the first column of V (v1) as “y”, and the columns of V as independent variables.

(Some smaller pictures follow, in which you can see that the final option is “IncludeConstantBasis -> False”.) The results are:

Not bad at all. In fact, rather close to perfect.

What was the first row of T?

{0.905145, -0.386677, 0.0882059, -0.141151}.

Yes, those are the \beta s (“Estimates”) for that regression; first row of T is first column of T’ is regression coefficients for the first column of V as the dependent variable.

Let me emphasize another relationship. For the first regression, yhat should be the projection of v1 (the 1st column of V) onto the column space of P. But we have that projection directly as R v1:

R v1 = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

… and we have the yhat from the regression:

yhat = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

Good, they are the same. So if regression works better for you than the projection operator, then use regression. Personally, I prefer the projection operator because I can apply it to all of V at once. (And because sometimes I have a use for the orthonormal basis u1.)

Largely to display the R^2, let me run the other three regressions. (Oh, I also reduced the magnification in my Mathematica notebook, so the pictures are smaller and the right side of the regression command isn’t cut off, at least on my screen.)

Here’s v2 as the dependent variable:

Here’s v3 as the dependent variable:

Here’s v4 as the dependent variable:

We see that the R^2 is over 99% for v2, falls to 95% for v3, and is 97.6% for v4. Those last two are just not large enough, not for this question.

I’m certain that V and P do not span the same subspace, but that’s about all I know.

We could, however, have made the dual choice, projecting P onto V instead of V onto P. Are they the same?

No.

We already know — from the T matrix, from the projection operator, or from the regressions — that the P and V matrices do not represent the same data in two different coordinate systems. But there might be — and in this case, there is! — more information available by making the other choice.

Let’s get the projection onto V instead of onto P. We need the SVD of V…

V = u\ w\ v^T\ .

We need the 4 leftmost columns (u1) of u…

u1 = \left(\begin{array}{cccc} -0.292636 & 0.0434068 & -0.33927 & 0.269896 \\ -0.274013 & 0.218205 & -0.238496 & 0.365998 \\ -0.337357 & -0.00723082 & -0.119568 & 0.0745101 \\ -0.355459 & 0.224109 & -0.415783 & -0.490416 \\ -0.2739 & -0.0500253 & -0.225463 & 0.452829 \\ -0.308498 & 0.463936 & 0.232078 & -0.416388 \\ -0.287334 & 0.227908 & 0.427185 & 0.212631 \\ -0.303344 & -0.281983 & 0.503807 & 0.152613 \\ -0.329068 & -0.741253 & -0.129874 & -0.304713 \\ -0.382183 & -0.0396735 & 0.288053 & -0.0857624\end{array}\right)

We construct the projection operator…

R = u1\ u1^T

Now we apply it to P, and compute the difference P – R P:

P - R P = \left(\begin{array}{cccc} 0.00186923 & -0.00646007 & -0.00165699 & 0.0293708 \\ -0.00396322 & 0.00598219 & 0.00143136 & 0.0162366 \\ 0.0035053 & 0.00255499 & -0.00525352 & -0.0232869 \\ 0.000382829 & 0.00307153 & 0.00267228 & 0.124774 \\ 0.000923133 & -0.00219591 & 0.00142733 & -0.0228012 \\ 0.0000472374 & -0.00300758 & -0.00291228 & -0.00287352 \\ -0.00462961 & -0.00220958 & -0.0000400809 & -0.149176 \\ 0.00305495 & 0.00376593 & -0.000439512 & 0.29618 \\ -0.00367312 & -0.00153518 & -0.000838857 & -0.106747 \\ 0.00147883 & -0.000459218 & 0.00482354 & -0.141981\end{array}\right)

So what?

Round it off:

P - R P = \left(\begin{array}{cccc} 0 & -0.01 & 0 & 0.03 \\ 0 & 0.01 & 0 & 0.02 \\ 0 & 0 & -0.01 & -0.02 \\ 0 & 0 & 0 & 0.12 \\ 0 & 0 & 0 & -0.02 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.15 \\ 0 & 0 & 0 & 0.3 \\ 0 & 0 & 0 & -0.11 \\ 0 & 0 & 0 & -0.14\end{array}\right)

Ah ha! It’s time for Robin to squeak out, “Holy moly, Batman!”

The fourth column of P does not lie in the subspace spanned by the 4 columns of V, while the first three do. That is the problem. I don’t why or where it happened, but I know exactly what’s wrong: the 4th column of P is the only problem.

Somebody blew it somewhere. (OK, since I don’t know how a varimax rotation is computed, I have to entertain the possibility that V is not supposed to be P under a change of basis. But I can’t believe that, because V and Q have a common column space, and 3 of the 4 columns of P lie in that common column space. If that’s an accident, what does a deliberate act look like?)

Is the mistake in the orignal paper? Were P and V computed from the same correlation matrix but by independent algorithms? Or was V computed incorectly from P (and then Q computed from V)?

Or did someone copy P incorrectly? (P should be primary, but then how the heck does the 4th column of P fail to be in the column space of V? I would sooner have expected an error in V, given P.)

Or did they rotate the first 3 PCs, and then apply that transformation to the 4th column? (No. I’ve checked. You can too.)

Let me emphasize that we would get the same result from four regressions each with one column of P as the dependent variable and all the columns of V as the independent variables. The first three regressions would be extremely good fits, but the fourth would not be. We would, again, conclude that the fourth column of P was not exactly — or even to a fair approximation — a linear combination of the columns of V.

Summary: theory

Given two matrices, say P and V, which are alleged to be the same data in different coordinate systems, we may check that assertion very quickly (if both are of full rank):

Define

T' = (P'P)^{-1} P'\ V\ .

If we find it true that

V’ = T P’,

then T is a transition matrix between two bases, and P and V are indeed the same data in different coordinate systems. We’re done.

If, however, the equality fails, V' \ne T\ P'\ ,

then T is not a transition matrix, and P and V are not the same data in different coordinate systems. Unless we want more information, we’re done.

Finally, we have seen that, numerically, we may have T “close to” a transition matrix. (In Part 1, V and Q are very nearly the same data; in the color posts, the xyx bar and rgb bar tables are very nearly the same data.)

Summary: practice

Computing T tells us if something is wrong, but not much more. It’s quick, easy, and gives a simple answer: “good” or “no good”.

Computing both projection operators P onto V and V onto P (not just one of them) would likewise tell us whether or not something is wrong, and, in addition, the projection operators might isolate the problem.

We could use regression instead of computing explicit projection operators.

Summary: this example

We found that V and Q are the same data in two different coordinate systems, but that P is not. In addition, we actually know that the problem is limited to the 4th column of P: it does not lie in the space spanned by the columns of V (i.e. it does not lie in the column space of V, which is also the column space of Q).

Question:
Can someone tell me if these P, V and Q match the original paper by Yule et al.?

And there’s a question you might have.
How many examples did I have to search through to find such a good (because it was such a bad) example?

One. This was the very first case I checked, outside of the 1931 CIE tables.

That’s scary.

Example: Is it a transition matrix? Part 1

This example comes from PCA / FA (principal component analysis, factor analysis), namely from Jolliffe (see the bibliography). But it illustrates some very nice linear algebra.

More precisely, the source of this example is:
Yule, W., Berger, M., Butler, S., Newham, V. and Tizard, J. (1969). The WPPSL: An empirical evaluation with a British sample. Brit. J. Educ. Psychol., 39, 1-13.

I have not been able to find the original paper. There is a problem here, and I do not know whether the problem lies in the original paper or in Jolliffe’s version of it. If anyone out there can let me know, I’d be grateful. (I will present 3 matrices, taken from Jolliffe; my question is, does the original paper contain the same 3 matrices?)

Like the previous post on this topic, this one is self-contained. In fact, it has almost nothing to do with PCA, and everything to do with finding — or failing to find! — a transition matrix relating two matrices.

On p. 163, Jolliffe provides a matrix of “principal components”:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

On the same page, he provides two matrices of the “rotated factor loadings”. One is “Varimax”…

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

… and the other is “direct quartimin”:

Q = \left(\begin{array}{cccc} 0.51 & -0.05 & 0.05 & 0.05 \\ 0.53 & 0.04 & 0.05 & -0.14 \\ 0.32 & 0.13 & 0.16 & 0.15 \\ 0.17 & -0.19 & 0.65 & 0.2 \\ 0.54 & 0.06 & -0.13 & 0.05 \\ -0.07 & 0.28 & 0.67 & -0.12 \\ 0.16 & 0.53 & 0.13 & -0.17 \\ 0.03 & 0.62 & -0.09 & 0.26 \\ 0. & 0.09 & 0.02 & 0.87 \\ 0.08 & 0.45 & 0.24 & 0.21\end{array}\right)

I presume — not knowing how to comute a varimax rotation! — that each of these matrices represents the same data, i.e. the same 4D subspace of R^{10} (since we have 4 column vectors of length 10 in each matrix). It would seem silly if a varimax rotation of PCs was not supposed to represent rotated PCs within the same subspace.

That presumption is wrong: they do not all represent the same subspaces.

How do I know this?

Because given two matrices of the same shape, I know how to find the transition matrix if it exists. We did this here.

Let me elaborate on what I’m doing. Imagine that our matrices had 2 columns and 3 rows. Those 3D column vectors lie in a 2D subspace — a plane — in R^3\ . By doing 2D transformations within that plane, I might find a particularly nice basis for representing that data.

If, instead, I apply a 3D transformation, I would be moving that data into a different plane. What I am doing here is looking at 4D transformations within the 4D subspace. Everything is predicated on my belief that the PCs in P, and the varimax rotation of them in V, and the direct quartimin data in Q are all supposed to be in the same subspace.

Not knowing how they were supposed to be computed, I could be seriously wrong here. OTOH, since we will find that V and Q are the same subspace, and the first 3 of the 4 columns of P also lie in that common subspace, I really believe that the 4th column of P should too. That it does not strikes me as an error.

Let me make one simplification up front. Are the matrices P, V, Q of full rank (4)?

Yes. The Mathematica command MatrixRank says that each of those matrices is of rank 4.

Let us consider V and Q. If the observations in each are related by a transition matrix T, and if we arbitrarily take V to be the “old” data, then each column of V’ (the transpose of V) is found by applying the transition matrix to the corresponding column of Q’.

V’ = T Q’

i.e.

V = Q T’.

Then, as I showed before, I imagine briefly that I can “solve” for T’…

T' = Q^{-1} V\ ,

but quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (Q'Q)^{-1} Q'\ V\ .

(It’s worth repeating that the inappropriate pseudo-inverse would use QQ’, but that can’t work because QQ’ is 10×10 but of rank 4 at most, hence not invertible.)

If Q’Q is invertible, that recipe always gives me a matrix T, but T need not be a transition matrix. The resulting T is a transition matrix if and only if we actually have

V = Q T’,

(given that V and Q are of the same full rank).

Do we have V = Q T’ ?

By direct computation I get

T = \left(\begin{array}{cccc} 0.9273 & 0.0921888 & 0.15161 & 0.101824 \\ 0.22944 & 0.864017 & 0.176744 & 0.0556721 \\ 0.251353 & 0.0599739 & 0.913081 & 0.0549228 \\ 0.185642 & 0.11154 & 0.00492819 & 0.941728\end{array}\right)

and then I look at V – Q T’…. You know, it’s easy enough to put them side by side.

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right) Q\ T' = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.16 & 0.17 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0 & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

Even the difference of the rounded-off matrices is satisfying small. That is,
Round[V,.01]-Round[Q.TT,.01]//Chop//MatrixForm is

\left(\begin{array}{cccc} 0 & 0 & 0 & 0 \\ 0 & -0.01 & 0.01 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0\end{array}\right)

Judging that the difference is negligible, I conclude that the matrices V and Q do span the same subspace of R^{10}\ ; they are the same 4D subspace with two different coordinate systems (two sets of basis vectors).

While we’re here, let’s do this another way. After all, the fundamental question is: can the columns of V be written as linear combinations of the columns of Q? Alternatively, can each column of Q be written as a linear combination of the columns of V?

Bear in mind that we have already established that Q and V are of the same full rank. In that case, the alternatives are equivalent.

We can answer the fundamental question by trying to regress each column of V on Q. Just remember to tell the regression to drop the constant!

To be specific, here’s the first column of V as a linear combination of the 4 columns of Q. Here’s the parameter table and R^2 — not the adjusted R^2\ : I really want to know how good this fit is, never mind how many variables it has.

Here’s a snapshot of the Mathematica commands (version 7). (Oh, darn. The last option in the LinearModelFit command, which replaced “Regress” is “IncludeConstant->False”. And I print the names just to confirm that none of those symbols has a left-over numerical value!)

(Let me just remind you that the R^2 is a measure of a how close the fit is. When we add another independent variable, however, the R^2 cannot decrease; it is not a good measure of the value — i.e. the worth — of that latest independent variable. The adjusted R^2 can decrease; in fact, in my experience, if the t-statistic of that latest variable is greater than 1 (in magnitude), then the adjusted R^2 goes up. As I said, in this case, I really want the R^2 rather than adjusted R^2 because I know exactly how many independent variables I am supposed to use. That said, there isn’t a whole lot of difference between the two in this case.)

v1 on Q

That’s a really good fit. And it should be. And that column of \beta\ , i.e. “Estimate”? It’s the first column of T’, i.e. the first row of T:

{0.9273, 0.0921888, 0.15161, 0.101824}

We could run 3 more regressions, but they’d only confirm what we believe: T’ could, in fact, be computed as regression coefficients. If the two subspaces coincide, then T is also a transition matrix between two bases in one and the same space.

There is yet another way to look at that. We know that in regression, yhat is the projection of y onto the subspace spanned by the columns of X. If y itself is in that subspace, then yhat = y, and the relatonship is given by a transition matrix.

So we could do this a third way. We know how to compute a projection operator onto the subspace spanned by the columns of X. In this case, X is Q, so we find its SVD (Singular Value Decomposition)…

Q = u\ w\ v^T

Then the 4 leftmost columns (u1) of u are an orthonormal basis for the range of Q….

u1 = \left(\begin{array}{cccc} -0.203473 & 0.27677 & 0.353491 & 0.188167 \\ -0.188235 & 0.423093 & 0.310857 & 0.0177813 \\ -0.30583 & 0.10025 & 0.140569 & 0.094808 \\ -0.313324 & 0.177504 & -0.285416 & 0.609666 \\ -0.194156 & 0.220796 & 0.497051 & 0.0101643 \\ -0.354903 & 0.228756 & -0.604256 & -0.00173818 \\ -0.320761 & 0.191153 & -0.0758258 & -0.463325 \\ -0.379019 & -0.300366 & 0.0912622 & -0.459411 \\ -0.365508 & -0.681435 & 0.196437 & 0.360653 \\ -0.43321 & -0.103153 & -0.111086 & -0.166404\end{array}\right)

That is, u1′ u1 = I, if the column space of X is of full rank. Check it:

u1^T u1 = \left(\begin{array}{cccc} 1. & 0 & 0 & 0 \\ 0 & 1. & 0 & 0 \\ 0 & 0 & 1. & 0 \\ 0 & 0 & 0 & 1.\end{array}\right)

Bur the reversed matrix product

R = u1 u1′

gives us a projection operator R onto the range of Q:

We can check it. (I won’t print it: it’s 10×10.) One, R is a projection operator if and only if it is idempotent, i.e. R R = R. And by computation, I see that is true.

Two, applied to each column of Q, it reproduces that column of Q; i.e. applied to Q as a whole, it reproduces Q, i.e. R Q = Q. Again, computation shows that is true, too.

Having checked that R is a projection onto the column space of Q, let’s use it: apply it to V. If V lies in the subspace spanned by the columns of Q, then R V will reproduce V….

I find that the largest difference between the elements of R V and V is 0.00720726 .

That’s as close as the other two calculations. Not exactly zero, but then, our regression didn’t fit exactly: like the RGB and XYZ tables in the earlier post, V and Q are almost but not exactly related by a transition matrix.

So we have seen three ways of looking at this. All three tell the same story: the column spaces of V and Q are the same.

Let me say a little more about two of them, the projection operator and the explicit regression.

The projection operator R came from the SVD. The regression can also be written using a projection operator, namely the hat matrix H. I keep getting mixed up about the relationship between T and R and H, so let me write it out.

The normal equations for

\hat{y} = X\ \beta

are

\beta = (X'X)^{-1}\ X'\ y\ .

(I showed you a quick and dirty way to recover that equation in the earlier post.)

Then

\hat{y} = X\ \beta = X\ (X'X)^{-1}\ X'\ y

and so we may write

yhat = H y

with the hat matrix H defined as

H = X\ (X'X)^{-1}\ X'\ .

For our regression, we took y to be the first column of V, and X to be the four colums of Q, so our hat matrix would be

H = Q\ (Q'Q)^{-1}\ Q'\ .

It is a projection operator onto the range of Q. But that’s exactly what R was. Go ahead and check: you will find that

H = R.

This means that we could compute R as the hat matrix, instead of using the SVD. Personally, I know how to use the SVD, whereas I would have to work out the definition of the hat matrix every time. But that’s a preference; you may choose to compute the hat matrix instead. (And that is how Malinowski was doing it!)

Going a little further, T’ was

T' = (Q'Q)^{-1}\ Q'\ V

so we have

Q T’ = H V.

We’ll look at the relationship between P and V in Part 2.

PCA / FA. Example 4 ! Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)

It would be prudent to compute the means…

 \{10,\ 24,\ 15\}

and variances of the data…

 \{18.6667,\ 6.66667,\ 6.66667\}

It would also be prudent to calculate the row sums.

  \{49,\ 49,\ 49,\ 49\}

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

 X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.

Next, Davis chooses to analyze X^T X instead of the covariance matrix \frac{X^TX}{N-1}\ . We recall that these two have the same eigenvectors, but eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of X^T X\ .

Here is the SVD of X, X = u\ w\ v^T\ . (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)

v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Sure enough, there are only two nonzero singular values in the w matrix.

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

A^R = v\ w^T

S^R = X A^R (= u\ w\ w^T\ , but use the definition for computation: one less matrix product.)

A^Q = u\ w ( = X v)

S^Q = X^T A^Q (= v\ w^T\ w)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

 A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)

 S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data; in the sense that, the matrix product of corresponding scores S and loadings A is not X.

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, Y = A^Q\ .)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of c = X^T X\ . Here it is:

 c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)

Here are its eigenvalues:

 \lambda = \{84,\ 12,\ 0\}

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

We can check that V^T V = I\ . (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

 v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

In fact, they agree completely.

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)

Now we should recall A^R\ , to confirm the equivalence:

A^R = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

\{28.,\ 4.,\ 0.\}

and their sum is 32.

Let’s recall the variances of the columns of X…

\{18.6667,\ 6.66667,\ 6.66667\}

and their sum is also 32.

Now, what about Harman? He writes his model as

Z = A F,

where

Z = X^T\ ,

(i.e. his variables are in rows, not in columns) and A is a \sqrt{\text{eigenvalue}}-weighted eigenector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

\{84,\ 12,\ 0\}

we construct a diagonal matrix \Lambda of their square roots:

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

We should expect that these are the nonzero nonsingular values w. Sure enough:

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Note, however, that w is not square, hence is not equal to the diagonal matrix \Lambda\ . This will lead to an irrelevant difference between A and A^R\ , etc. In fact, I often ignore the distinction between \Lambda and w conceptually; in practice, of course, whatever matrices we’re using have to be “the right size” for whatever matrix products we compute (conformable).

We should recall that the nonzero eigenvalues are the nonzero values of both w^T w and w\ w^T\ :

The nonzero eigenvalues were…

\{84,\ 12,\ 0\}

whereas

w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)

and

w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)

Resuming our thread, we compute the weighted eigenvector matrix A

A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

and that should be Davis’ A^R\ ; and it is, except for an extra – to my mind, irrelevant – column of zeros.

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

Viewing A as a transition matrix, we know that Harman’s scores are the new data F^T\ . We also know (HERE) that we may compute them as

F^T= u\ ,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data F^T should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

We check that by computing A F and comparing it to Z = X^T\ ; we expect Z = AF. And we have it; here’s Z:

Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)

And here’s A F (using the two nonzero columns of A!):

A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)

I remind us that this new data F^T does not have redistributed variance; in fact, this new data has constant variance:

\{0.333333,\ 0.333333\}

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of X^T X rather than of the covariance matrix \frac {X^TX}{N-1}\ . Alternatively, our new data was given by columns of u rather than by columns of \sqrt{N-1}\ u\ . (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data F^T from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data F^T using a reciprocal basis instead; that is, one which weights the v matrix by \frac{1}{\sqrt{\text{eigenvalue}}} instead of \sqrt{\text{eigenvalue}}. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by \sqrt{\text{eigenvalue}}, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue – for finding a reciprocal basis – it is to set the zero eigenvalue to 1. Instead of the diagonal matrix

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

I use

\Lambda 1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1\end{array}\right)

and instead of A = v \Lambda\ , I compute the reciprocal basis B = v \Lambda1^{-1} (i.e. I divide each column by \sqrt{\text{eigenvalue}} instead of multiplying by \sqrt{\text{eigenvalue}}).

B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)

Then the new data with respect to A can be found by projecting X onto B:

X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)

That had better be F^T\ … and it is:

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)

For this example, Davis did not compute the new data with respect to A (i.e. wrt his A^R\ ). As I have said before, I have seen him do it in another example; but his scores S^R are not the new data F^T\ , which is, as far I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix A = v \Lambda was also the dispersion matrix between the new data F^T and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix X^T X\ .

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of F^T\ , so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with X^T Z\ , i.e. that I want to compute

X^T Z = X^T F^T

(finally reverting to my original notation). Fine, I compute that product, and I get

X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)

and I compare that to A:

A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though X^T X it is not the covariance matrix.)

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

D = v \Lambda^2\ .

But… we’ve just seen that weight here. Okay, it wasn’t \Lambda^2\ , but it was either w\ w^T or w^T w\ . Well, it must have been S^Q\ , i.e. the one that multiplied v:

S^Q = X^T A^Q (= v w^T w)\ .

Of course, I should check that computationally. Here is v \Lambda^2\ :

v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

and here is our S^Q\ :

S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Hang on. We need to explicitly see X^T Y\ , the cross-dispersion matrix analagous to X^T X\ : we compute it…

X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores S^Q are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings A^Q are the new data Y with respect to the transition matrix v).

To put that another way…. The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix X^T X\ , and defines new data Y; and S^Q = X^T Y is the cross-dispersion between old data X and new data Y.

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores S^Q\ are that dispersion matrix D.

PCA / FA example 4: davis. Davis & Jolliffe.

What tools did jolliffe give us (that i could confirm)?

  1. Z = X A, A orthogonal, X old data variables in columns
  2. serious rounding
  3. plot data wrt first two PCs
  4. how many variables to keep?

but jolliffe would have used the correlation matrix. Before we do anything else, let’s get the correlation matrix for davis’ example. Recall the centered data X:

X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

i compute its correlation matrix:

\left(\begin{array}{lll} 1. & -0.83666 & -0.83666 \\ -0.83666 & 1. & 0.4 \\ -0.83666 & 0.4 & 1.\end{array}\right)

Now get an eigendecomposition of the correlation matrix; i called the eigenvalues \lambda and the eigenvector matrix A. Here’s A:

A = \left(\begin{array}{lll} -0.645497 & 0 & -0.763763 \\ 0.540062 & -0.707107 & -0.456435 \\ 0.540062 & 0.707107 & -0.456435\end{array}\right)

If we compute A^T\ A and get an identity matrix, then A is orthogonal. (it is.)

the eigenvalues are:

\lambda = \{2.4,\ 0.6,\ 0\}

That the third eigenvalue is zero would tell us, if we didn’t already know it from davis, that the data, X, is 2D.

now we can get (1) the new data Z, wrt the orthogonal eigenvector matrix A; it is given by the projection of X onto A (because A is orthogonal, i.e. the basis vectors are orthonormal):

Z = \left(\begin{array}{lll} 7.11335 & 0 & 1.84396 \\ -2.37112 & -2.82843 & -0.614654 \\ 0 & 1.41421 & 0 \\ -4.74224 & 1.41421 & -1.22931\end{array}\right)

Fascinating.

An understatement. There are 3 nonzero columns.

in sharp contrast to A^Q from the centered data X, this new data does not look 2D wrt the orthogonal eigenvector basis. It is still 2D, nothing changed about that, but it isn’t at all clear from the new components of the data. our 3 columns are not linearly independent.

i just gave us all a beautiful lesson: it can be very worthwhile to look at both covariance (or X^T\ X of centered data) and correlation.

BTW, since this is new data, get the variances…

\{26.2369,\ 4.,\ 1.76307\}

and the total is… 32. we remember that number from davis.

Sure enough, the orthogonal eigenvector matrix A has redistributed the total variance, which is 32 – but it has spread it out over 3 variables instead of 2.

Now we are ready to follow jolliffe. BTW, i will refer to things like “jolliffe’s Z” although jolliffe, of course, never worked out davis’ example! we could try (2) some serious rounding of everything in sight. The correlation matrix: original…

\left(\begin{array}{lll} 1. & -0.83666 & -0.83666 \\ -0.83666 & 1. & 0.4 \\ -0.83666 & 0.4 & 1.\end{array}\right)

rounded to the nearest .5…

\left(\begin{array}{lll} 1. & -1. & -1. \\ -1. & 1. & 0.5 \\ -1. & 0.5 & 1.\end{array}\right)

The eigenvector matrix? original…

\left(\begin{array}{lll} -0.645497 & 0 & -0.763763 \\ 0.540062 & -0.707107 & -0.456435 \\ 0.540062 & 0.707107 & -0.456435\end{array}\right)

rounded to the nearest .5…

\left(\begin{array}{lll} -0.5 & 0 & -1. \\ 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5\end{array}\right)

The new data? original…

\left(\begin{array}{lll} 7.11335 & 0 & 1.84396 \\ -2.37112 & -2.82843 & -0.614654 \\ 0 & 1.41421 & 0 \\ -4.74224 & 1.41421 & -1.22931\end{array}\right)

Rounded to the nearest .5 …

\left(\begin{array}{lll} 7. & 0 & 2. \\ -2.5 & -3. & -0.5 \\ 0 & 1.5 & 0 \\ -4.5 & 1.5 & -1.\end{array}\right)

Rounded to the nearest 1…

\left(\begin{array}{lll} 7. & 0 & 2. \\ -2. & -3. & -1. \\ 0 & 1. & 0 \\ -5. & 1. & -1.\end{array}\right)

i don’t see anything exciting in all that. (but if you’re thinking that i haven’t looked at everything in sight, you’re right. My bad. I’ll make it right. Soon. Very soon.)

As for (3) a plot of the data wrt the first two PCs, that should be: take the first two columns of the new data…

\left(\begin{array}{ll} 7.11335 & 0 \\ -2.37112 & -2.82843 \\ 0 & 1.41421 \\ -4.74224 & 1.41421\end{array}\right)

and plot those pairs of points. (this corresponds to davis’ plotting A^Q.)

That looks remarkably like a plot of my A^Q (mine or davis’ doesn’t matter much, but there are sign differences; nevertheless, the geometry was memorable.) here’s a plot of my A^Q:

Clearly we need to look at rounded A^Q and Z, and A and v. (i said i’d make it right.) here are jolliffe’s Z …

\left(\begin{array}{lll} 7.2 & 0 & 1.8 \\ -2.4 & -2.8 & -0.6 \\ 0 & 1.4 & 0 \\ -4.8 & 1.4 & -1.2\end{array}\right)

and my A^Q, rounded to the nearest .2 …

\left(\begin{array}{lll} -7.4 & 0 & 0 \\ 2.4 & -2.8 & 0 \\ 0 & 1.4 & 0 \\ 4.8 & 1.4 & 0\end{array}\right)

The first two columns are not identical, but awfully close, and everything else is identical (to within the usual sign ambiguity).

here are jolliffe’s A …

\left(\begin{array}{lll} -0.5 & 0 & -1. \\ 0.5 & -0.5 & -0.5 \\ 0.5 & 0.5 & -0.5\end{array}\right)

and my v…

\left(\begin{array}{lll} 1. & 0 & 0.5 \\ -0.5 & -0.5 & 0.5 \\ -0.5 & 0.5 & 0.5\end{array}\right)

i am surprised at how similar these are. This will not always happen. After all, we saw, in jolliffe when he worked both the correlation and the covariance matrices, that the eigenvectors need not be that similar.

Moving on to (4), how many new variables would jolliffe tell us to keep? recall the eigenvalues of the correlation matrix…

\lambda = \{2.4,\ 0.6,\ 0\}

If we keep only those greater than 1 or even those greater than .7, we would keep only one.

Write them as percentages of total…

\{80., 20., 0\}

If we keep those whose cumulative contribution is 70 – 90%, we would keep only one.

a scree graph won’t help at all: we only have two nonzero eigenvalues, so we get a single line segment, and no “elbow” is possible.

What about the broken stick model? recall that if a stick of length 1 is broken into p pieces, then the expected length of the kth largest piece is given by

L(p,k) =  \frac{1}{p}\sum _{ j = k}^{ p} \frac{1}{j}

The expected lengths for a unit stick broken into 3 pieces are…

\{\frac{11}{18},\frac{5}{18},\frac{1}{9}\}

i.e.

\{0.611111,\ 0.277778,\ 0.111111\}

and our fractional eigenvalues were

\{0.8,\ 0.2,\ 0\}

So, once more, jolliffe would keep only one eigenvector (.8 > .61 but .2 < .278).

Jolliffe notwithstanding, i would keep both nonzero eigenvalues. it just doesn’t seem excessive.

PCA / FA example 3: Jolliffe. analyzing the covariance matrix


we have seen what jolliffe did with a correlation matrix. now jolliffe presents the eigenstructure of the covariance matrix of his data, rather than of the correlation matrix. in order for us to confirm his work, he must give us some additional information: the standard deviations of each variable. (recall that he did not give us the data.)
we have to figure how to recover the covariance matrix c from the correlation matrix r, when for each and every ith variable we have its standard deviation s_i.
it’s easy: multiply the (i,j) entry in the correlation matrix r by both s_i and s_j

c_{i j} = r_{i j} \ s_i \ s_j

the diagonal entries r_{i i}, which are 1, become variances c_{i i} = s_i^2, and each off-diagonal correlation r_{i j} becomes a covariance. maybe it would have been more recognizable if i’d written

r_{i j} = \frac{c_{i j}}{\ s_i \ s_j}

which says that we get from covariances to correlations by dividing by two standard deviations.

here are the standard deviations he gives:

{.371,\ 41.253,\ 1.935,\ .077,\ .071,\ 4.037,\ 2.732,\ .297}

we should expect an interesting result because of the huge variation in values. the standard deviation of the 2nd variable is 10 times that of the 6th, 20 times that of the 4th, 15 times that of the 7th, and 100-1000 times that of the others. the 2nd variable should dominate the results. that’s an understatement: it will completely overwhelm them. 

back to business. my correlation matrix is r; let c be the covariance matrix. i have it to machine accuracy, but i am going to round it off in order to display it relatively compactly.

c = \left(\begin{array}{cccccccc} 0.138&4.438&0.145&-0.002&-0.003&-0.377&-0.232&0.006\\ 4.438&1701.81&33.127&0.905&-1.101&-58.122&-18.483&-1.581\\ 0.145&33.127&3.744&0.062&-0.072&-3.445&-0.767&-0.044\\ -0.002&0.905&0.062&0.006&-0.005&-0.024&0.005&-0.003\\ -0.003&-1.101&-0.072&-0.005&0.005&0.059&0.007&0.003\\ -0.377&-58.122&-3.445&-0.024&0.059&16.297&2.118&0.092\\ -0.232&-18.483&-0.767&0.005&0.007&2.118&7.464&0.343\\ 0.006&-1.581&-0.044&-0.003&0.003&0.092&0.343&0.088\end{array}\right)

now get the eigenstructure of the covariance matrix. as before, he rounded eigenvectors to the nearest .2, and ony showed the first four eigenvectors, for the four largest eigenvalues.

here’s what i get when i select the first 4 eigenvectors and round them off to the nearest .2 .

\left(\begin{array}{cccc} 0.&0.&0.&0.\\ -1.&0.&0.&0.\\ 0.&-0.2&0.&1.\\ 0.&0.&0.&0.\\ 0.&0.&0.&0.\\ 0.&1.&0.2&0.2\\ 0.&0.2&-1.&0.\\ 0.&0.&0.&0.\end{array}\right)

as before, to match his numbers i need to take the negatives of the 1st and 3rd columns. i multiply by a diagonal matrix with entries {-1,\ 1,\ -1,\ 1} and get:

\left(\begin{array}{cccc} 0.&0.&0.&0.\\ 1.&0.&0.&0.\\ 0.&-0.2&0.&1.\\ 0.&0.&0.&0.\\ 0.&0.&0.&0.\\ 0.&1.&-0.2&0.2\\ 0.&0.2&1.&0.\\ 0.&0.&0.&0.\end{array}\right)

this matches jolliffe’s table 3.3. it is very significant that there is a 1 in each column, and no other number larger than .2. and sure enough, the first column says use the original 2nd variable for our first new variable. you remember, that old variable with darn near all the variance.
and these four vectors bear no resemblence to the previous four eigenvectors, which we got from the correlation matrix. here’s what they were:

\left(\begin{array}{cccc} 0.2&-0.4&0.4&0.6\\ 0.4&-0.2&0.2&0.\\ 0.4&0.&0.2&-0.2\\ 0.4&0.4&-0.2&0.2\\ -0.4&-0.4&0.&-0.2\\ -0.4&0.4&-0.2&0.6\\ -0.2&0.6&0.4&-0.2\\ -0.2&0.2&0.8&0.\end{array}\right)


we get the % variation explained as before. here are the eigenvalues…

{1704.68,\ 15.0561,\ 6.98002,\ 2.63927,\ 0.125282,\ 0.0656242,\ 0.00724308,\ 0.000562978}

their sum is 1729.55 and their average is 216.194 .

now divide each eigenvalue by their sum, and round off, and write it as a percentage…

{98.6,\ 0.9,\ 0.4,\ 0.2,\ 0.,\ 0.,\ 0.,\ 0.}

one new variable accounts for just about all the variation, and it’s essentially the second old variable. any of the ad hoc rules would tell us to keep exactly one new variable – and in fact that one new variable almost the same as the second old variable. this is what happens when one variance dominates.
what do we actually have, if we don’t round off so harshly? let’s look at that first columnwith no rounding…

\left(\begin{array}{c} -0.00261244\\ -0.999151\\ -0.0195343\\ -0.00053178\\ 0.000647557\\ 0.0344497\\ 0.0109335\\ 0.000930991\end{array}\right)

well, with a little rounding…

\left(\begin{array}{c} 0.\\ -1.\\ -0.02\\ 0.\\ 0.\\ 0.03\\ 0.01\\ 0\end{array}\right)

that first eigenvector is still almost entirely the 2nd original data variable, with extremely small contributions from the 3rd, 6th, and 7th original variables, and even less from the others.
it’s important to realize that the transformation between correlation matrix and covariance matrix is not a nice one, and the eigenstructures can be significantly different depending on which we use. to put that more starkly, our eigenvalues and eigenvectors can be significantly different depending on which matrix we analyze.
to state it in a way that we will see again: our choice of the initial preprocessing of the data may be the most significant modeling we do.

jolliffe strongly recommends using the correlation matrix.

since we’ve seen the methods for deciding how many PCs to keep, let’s quickly run thru them as a refresher, and because jolliffe and i stated them for a correlation matrix rather than for a covariance matrix.
method 1 was to keep enough eigenvectors so that the corresponding eigenvalues accounted for 70-90% of the cumulative variation. well, the first eigenvector corresponds to 98%. keep just one, and it’s overkill.
method 2 was to keep an eigenvector if its eigenvalue was greater than .7, except that’s .7 compared to an average variance of 1. in this case the average variance is 216.194 and .7 times it is 151.336. since the 2nd eigenvalue is 15.0561, one tenth our cutoff, once more we are told to keep only the first eigenvector.
method 3 was either a scree graph or an LEV diagram, and the LEV is supposedly more appropriate for these eigenvalues. (a scree graph plots the eigenvalues, an LEV plots their logarithms, in order.)
 
lev5.png

that doesn’t do much for me. a scree graph, by contrast, would sure as heck show a “leveling off”. this doesn’t really tell us anything we didn’t already know, but, boy oh boy, it’s in our faces and impossible to miss.

scree5.png

finally, i can’t resist computing the broken sticks. we already have them for a stick of length 1 broken into 8 pieces; we want to change the total length from 1 to the sum of the eigenvalues, which was 1729.55.
here are the eigenvalues again:

{1704.68,\ 15.0561,\ 6.98002,\ 2.63927,\ 0.125282,\ 0.0656242,\ 0.00724308,\ 0.000562978}

here are the 8 expected broken lengths for total length 1… 

{0.3397,\ 0.2147,\ 0.1522,\ 0.1106,\ 0.07932,\ 0.05432,\ 0.03348,\ 0.01563}

and for total length 1729.55… 

{587.584,\ 371.39,\ 263.293,\ 191.229,\ 137.18,\ 93.9415,\ 57.9091,\ 27.0243}

and also the running cumulative lengths of the broken pieces….

{587.584,\ 958.975,\ 1222.27,\ 1413.5,\ 1550.68,\ 1644.62,\ 1702.53,\ 1729.55}

so the stick of length 1730 would have its two largest pieces with lengths 588 and 371; and the sum of the first 7 broken pieces (1702) is still less than the value of the first eigenvalue alone (1705) !
we would keep only the first eigenvector. not a surprise.

next, we will move on to a statistics in geology book (davis) and see what he has to say about PCA / FA. that will be quite a bit.

PCA / FA example 2: jolliffe. discussion 3: how many PCs to keep?


from jolliffe’s keeping only 4 eigenvectors, i understand that he’s interested in reducing the dimensionality of his data. in this case, he wants to replace the 8 original variables by some smaller number of new variables. that he has no data, only a correlation matrix, suggests that he’s interested in the definitions of the new variables, as opposed to the numerical values of them.
there are 4 ad hoc rules he will use on the example we’ve worked. he mentions a 5th which i want to try.
from the correlation matrix, we got the following eigenvalues.

{2.79227, \ 1.53162, \ 1.24928, \ 0.778408, \ 0.621567, \ 0.488844, \ 0.435632, \ 0.102376}

we can compute the cumulative % variation. recall the eigenvalues as percentages…

{34.9034, \ 19.1452, \ 15.6161, \ 9.7301, \ 7.76958, \ 6.11054, \ 5.4454, \ 1.2797}

now we want cumulative sums, rounded….

{34.9, \ 54., \ 69.7, \ 79.4, \ 87.2, \ 93.3, \ 98.7, \ 100.}

his 1st rule is to keep 70-90% of the total variation. read literally, we would keep 4 or 5 eigenvectors, getting ether 79.4% or 87.2% of the variation; we would not keep just 3, because 69.7% is less than 70%. but it seems silly to interpret an ad hoc rule literally: interpreted loosely we could keep just 3 eigenvectors, rounding the third cumulative to 70%. and i suppose we could keep the sixth one, even though it pushes our total to 93%. anyway, this is a pretty vague answer: keep 3 to 5 eigenvectors, maybe even 6 of them.

that 1st rule based the cutoff on the cumulative sum of the eigenvalues.
his 2nd rule is based on the individual eigenvalues: keep all eigenvalues above a lower bound. the classical recommendation is 1 – because the original variables each have variance 1, so anything less is not as significant as an original variable – but jolliffe himself recommends about .7: he argues that a variable which was completely independent of all the others must lead to an eigenvalue less than 1. (i don’t really care to see why: this is, after all, a rule of thumb.)
so let’s round off the eigenvalues… 

{2.8, \ 1.5, \ 1.2, \ 0.8, \ 0.6, \ 0.5, \ 0.4, \ 0.1}

to better see what we would keep.
his 2nd rule would have us keep up to the 4th eigenvalue (.8), but not the 5th (.6). this is why he displayed only 4 eigenvectors earlier. so this gives a single recommendaton: keep the first 4. (having seen this rule with the classical cutoff of 1, i was expecting him to keep only 3.)

he mentions a 3rd rule, but says that it amounts to a cutoff between .1 and .2 (instead of his .7), which he finds keeps too many eigenvectors. in this case it would keep at least 7 of them.
incidentaly, for a covariance matrix instead of correlation matrix, we would modify the cutoffs to use 1 or .7 or about .15 times the average variance (for the correlation matrix, the average variance is 1.)

his 4th method is called a \pmb{ scree} graph: it’s just a plot of the eigenvalues in decreasing order.
 
jolliffe-scree.png
 
we’re supposed to look for an “elbow”, a flattening out of the curve. well, maybe after the 4th point. he agrees with me that the (magnitude of the) slope has actualy increased between 3 and 4, and he says he views the graph with skepticism. in this case the scree graph gives us not much more than a hint.
alternatively, there is something called an \pmb{ LEV diagram}, which replaces the eigenvalue by its logarithm. this is used in meteorology and oceanography, he says. it also seems more useful applied to the covariance matrix.
in this case, applied to the correlation matrix…
 
jolliffe-lev.png
 
he agrees that the diagram shows maximum (magnitude) slope between 7 and 8. so much for a flattening out. the LEV diagram gives us nothing.

the 5th rule is to compare our eigenvalues with “broken sticks”. that is, if we have a stick of some length and we break it into random-sized pieces, what are the expected sizes? to be specific, if we have a stick of unit length and we break it at random into p segments, then, he says, the expected length of the kth longest segment is 

L(p,k) = l_{k}^* = \frac{1}{p}\sum _{ j = k}^{ p} \frac{1}{j}.

having converted our eigenvalues to % variance explained, we retain those for which the % variation explained exceeds l_{k}^*. (in fact, we will use fraction explained, and fractions of 1, rather than % explained and fractions of a stick of length 100.)
he does not provide a table for l_{k}^*, hence does not apply it in this case, but we can create one easily enough. i define that function… and i check it. for p = k = 1: L(1,1) = 1.
good: if we left the stick in one piece, that piece is of length 1.
if we break the stick into two pieces (p = 2), we get: L(2,1) = 3/4 and L(2,2) = 1/4, where L(2,1) is the expected length of the first longest piece (i.e. the longer piece of 2), and L[2,2] is the expected length of the second longest piece (i.e. the shorter piece).
let’s see. the larger piece cannot be less than 1/2, and it’s equally likely (random = uniform distribution) to be anywhere between 1/2 and 1, hence its expected value is 3/4. then the shorter piece has expected length 1/4.
i actually want to get a list of the values for a fixed p. then mathematica says the expected lengths of a unit stick broken into 2 pieces are:

{\frac{3}{4}, \ \frac{1}{4}}

and the expected lengths for a unit stick broken into 3 pieces are:

{\frac{11}{18}, \ \frac{5}{18}, \ \frac{1}{9}}

or

{0.611111, \ 0.277778, \ 0.111111}

in our example, we have 8 variables ~ 8 eigenvalues, so we want the expected lengths for a unit stick broken into 8 pieces:

{\frac{761}{2240}, \ \frac{481}{2240}, \ \frac{341}{2240}, \ \frac{743}{6720}, \ \frac{533}{6720}, \ \frac{73}{1344}, \ \frac{15}{448}, \ \frac{1}{64}}

or

{0.339732, \ 0.214732, \ 0.152232, \ 0.110565, \ 0.0793155, \ 0.0543155, \ 0.0334821, \ 0.015625}

we could check that they add up to 1 (they do).

now recall the eigenvalues as a fraction of variation:

{0.349034, \ 0.191452, \ 0.156161, \ 0.097301, \ 0.0776958, \ 0.0611054, \ 0.054454, \ 0.012797}

and we are to keep those whose variation exeeds the broken stick number. take each eigenvalue and subtract the corresponding broken stick value; we get:

{0.0093, \ -0.023, \ 0.003928, \ -0.01326, \ -0.001619, \ 0.00679, \ 0.02097, \ -0.002828}

well, i can’t say i like the alternating signs: keep the 1st, 3rd, 6th and 7th? he does say, “principal components for which the proportion exceeds [the expected length of the kth largest segment] are then retained, and all other PCs deleted.” hmm. did he really mean that, or would he tell us to keep all of them thru the 7th eigenvector (the last one greater than the broken stick number)?
i’m a little more impressed – and not favorably! – by how close to random, i.e. how close to the broken stick numbers, the eigenvalues seem to be. 
maybe we should use cumulatives? we have cumulative % for the eigenvalues, so recall them:

{34.9, \ 54., \ 69.7, \ 79.4, \ 87.2, \ 93.3, \ 98.7, \ 100.}

let’s get cumulative % for the stick broken into 8 pieces:

{34., \ 55.4, \ 70.7, \ 81.7, \ 89.7, \ 95.1, \ 98.4, \ 100.}

subtract the stick cumulatives from the eigenvalue cumulatives. does the sign change at some point? 

{0.9, \ -1.4, \ -1., \ -2.3, \ -2.5, \ -1.8, \ 0.3, \ 0.}

yes, on the second number! and the cumulative sum of % differences is nothing like zero. well, of course it’s zero for all 8 pieces, just not for any smaller number of pieces. i wonder if there’s an overall test we can do on this. in my notes, i write “open”. i am intrigued.
he then cites one study which shows the broken stick gives very good results; another showing that it’s good but no rule is consistently good; another that it’s bad. no help here, except perhaps the conclusion that none of these ad hoc rules is consistently good!
well, the broken stick was an interesting idea, but i can’t say it floats my boat, not for this problem anyway. but there’s something fascinating about it….

PCA / FA example 2: jolliffe. discussion 2: what might we have ended up with?


back to the table. here’s what jolliffe showed for the “principal components based on the correlation matrix…” with a subheading of “coefficients” over the columns of the eigenvectors.
\left(\begin{array}{cccc} 0.2&-0.4&0.4&0.6\\ 0.4&-0.2&0.2&0.\\ 0.4&0.&0.2&-0.2\\ 0.4&0.4&-0.2&0.2\\ -0.4&-0.4&0.&-0.2\\ -0.4&0.4&-0.2&0.6\\ -0.2&0.6&0.4&-0.2\\ -0.2&0.2&0.8&0.\end{array}\right)

under each column, he also showed “percentage of total variation explained”. those numbers were derived from the eigenvalues. we saw this with harman: 
  • we have standardized data;
  • we find an orthogonal eigenvector matrix of the correlation matrix;
  • which we use as a change-of-basis to get data wrt new variables;
  • the variances of the new data are given by the eigenvalues of the correlation matrix.
the most important detail is that the eigenvalues are the variances of the new data if and only if the change-of-basis matrix is an orthogonal eigenvector matrix.
and that is what jolliffe has: the full eigenvector matrix P is orthogonal. OTOH, we don’t actually know that the data was standardized, but the derivation made it clear that if we want the transformed data to have variances = eigenvalues, then the original data needs to be standardized.
again, since jolliffe never uses data, we can’t very well transform it. 
but first, instead of listing the eigenvalues, jolliffe displayed percentages. recall the eigenvalues…
{2.79227,\ 1.53162,\ 1.24928,\ 0.778408,\ 0.621567,\ 0.488844,\ 0.435632,\ 0.102376}

we compute their sum: it’s 8.
(of course. each original standardized variable has variance 1, so the eight of them have total variance 8. to put it another way, the correlation matrix implicitly standardizes the variables.)
now divide each eigenvalue by the total, 8, and round off, and make it a percentage… 
{34.9,\ 19.1,\ 15.6,\ 9.7,\ 7.8,\ 6.1,\ 5.4,\ 1.3}
the first 4 numbers match jolliffe. he only printed 4 numbers. for whatever reason, he only looked at the first 4 eigenvectors, hence the first 4 eigenvalues.
looking at all of them, i wonder why he chose 4. from the eigenvalues themselves, some would argue that all values less than 1 – i.e. with variance less than the variance of the original variables – should be discarded. but once he keeps the 4th one (.778), i see no cause to discard the 5th one (.622). 
take a deep breath. i’m about to short-change jolliffe a little. he has an entire chapter on graphical methods – but most of them cannot be reproduced without data, which he didn’t provide.
let me be more precise. he has several plots of data with respect to their first two PCs. but he doesn’t provide the data, so we’ll make a note to do this kind of plot when we have some data. he has biplots. these, again, appear to require data whch he didn’t provide. make a note to grab a biplot in the first book that has one. he shows a graph which is an example of correspondence analysis. data? no. 
finally, he has graphs of something called  andrews’ curves, which appear to be trigonometric polynomials whose coefficients are the first 7 PCs out of 20 (i.e. the 7 new data variables z1,… z7). these look intriguing, except they’ve been clustered somehow, and the text leading up to them talks about andrews’ curves of the residuals, and the trig polynomial comes out of nowhere.
like i said, intriguing. boy do i wish i had the data! well, we will have data from other sources. make a note: try an andrews’ curve, too, someday.
BTW, harman did not do any of these graphs, although we worked an example with data.
nevertheless, jolliffe’s discussions of all these graphs, and the variations on them, may be indispensible if you do this professionally. me? i’ll settle for finding examples down the road.
in addition, he also has an entire chapter on deciding how many variables to keep, either transformed variables or original variables. we’ll continue this example later, looking at how many transformed variables to keep.

PCA / FA example 2: jolliffe. discussion 1: notation & vocabulary


Let’s do a little housekeeping. First off, Jolliffe ’s notation. Recall Harman’s:
Z = A \ F
where Z was the original data, variables in rows; A was a transition matrix, a \sqrt{eigenvalues} weighted eigenvector matrix; and F was the new (transformed) variables wrt the basis specified by A.
In that case, each column of Z is an old vector equal to the application of A to the new vector in the corresponding column of F. This is fairly natural. If we let lower case z and f be corresponding columns of Z and F, then 
z = A \ f
We call A a transition matrix because it says that the old vectors z are the images under A of the new vectors f. Finally, the columns of A are the (old) components of the new basis vectors; the transpose of A is called the attitude matrix of the new basis vectors.
Let me emphasize that when I say “new data” I mean the data transformed using the transition matrix. Each observation is a vector, and that vector has components wrt the original basis and the new basis (the eigenvector basis). When I say “new data” I mean components wrt the new basis.
That was Harman.
We decided that we needed the orthogonal eigenvector matrix P, and I decided that I needed the design matrix X which has the old data with variables in columns. thus, 
Z = X^T ,
Then we replaced 
Z = A \ F
by 
Z = P \ F
and then 
Z^T = F^T \ P^T
X = F^T \ P^T
F^T = X \ P.

F^T , of course, is the same size as X; it has variables in columns.
That was rip. 
Jolliffe writes his model as
Z = X \ A
His X is the same as mine: a design matrix of old data with variables in columns. His A is an orthogonal eigenvector matrix, so I would denote it by P. his Z is not the old data, but the new data. It corresponds to F^T . in any case, I can’t use Z for it: I still need Z = X^T , so I will use Y for the new data. I would write jolliffe’s model as
Y = X \ P
We know that it is equivalent to Harman’s model, and it is a perfectly sensible vector-matrix equation. It’s a good thing.
It’s a little hard to get too excited by this, because Jolliffe has no data in his book. Oh, yes, he cites literature where the data can be found. In that sense, he has lots of examples. Unfortunately, only 3 are self-contained: there are only 3 correlation matrices in the book. At least we can start from them and confirm his analyses.
But without data, he can never compute his Z, my Y. and there are plots for which we need data. We’ll just have to see them elsewhere.
The second piece of housekeeping is vocabulary. 
On p. 4, Jolliffe defines principal components: “… the kth PC is given by z_k = \alpha_k \ x , where \alpha _k is an eigenvector … corresponding to its kth largest eigenvalue \lambda _k “. That is, the new data Z contains PCs and A contains eigenvectors; X is the old data.
On p. 6, he says, “sometimes the vectors \alpha _k are referred to as ‘PCs’…. it is preferable to… refer to \alpha _k as the vector of coefficients or loadings for the kth PC.”. He’s saying that sometimes the eigenvectors are called PCs, but they should be called the coefficients or loadings, while the transformed data Z are the principal components.
He got this table of 4 eigenvectors, and we confirmed it:

\left(\begin{array}{cccc} 0.2&-0.4&0.4&0.6\\ 0.4&-0.2&0.2&0.\\ 0.4&0.&0.2&-0.2\\ 0.4&0.4&-0.2&0.2\\ -0.4&-0.4&0.&-0.2\\ -0.4&0.4&-0.2&0.6\\ -0.2&0.6&0.4&-0.2\\ -0.2&0.2&0.8&0.\end{array}\right)

He says, “Each of the first four PCs for the correlation matrix has moderate-sized coefficients for several of the variables….”
My first reaction to that statement was: he’s talking about the eigenvectors. Why is he calling them PCs? He explicitly said the eigenvectors are not the PCs.
Did you catch that distinction? He didn’t say the PCs had moderate-sized components, but moderate-sized coefficients.
My second reaction was: damn, the components of the eigenvectors are the coefficients of the PCs. Rephrase that: the components of one vector are called the coefficients of another vector.
No wonder some people think the eigenvectors are the PCs! 
My third reaction is: keep your eyes on the linear algebra, while using whatever terminology is customary in your field. Don’t expect that a different field uses the same terminology. (If you must not call them eigenvectors, use the synonym loadings.)
Frankly, with one exception, I can’t see that even “factor analysis” or “principal component analysis” are uniquely defined; as far as I can tell, with one exception, they are synonyms. (There is one model, which we will see later, which is always called factor analysis and never called principal components.)
If I learn otherwise as I go through the various books, I’ll let you know.

PCA / FA example 2: jolliffe. correlation matrix


what we got from harman’s “factor analysis” was his only example of principal component analysis. now let’s see what jolliffe’s “principal component analysis” has for us.

i’m going to work one example, but he did it two ways, and i’ll confirm both sets of calculations.

first, he gives us a correlation matrix. i note that he says who gave him the data, but there is no other detail. don’t plan on ever finding the data behind this. here is the correlation matrix.

\left(\begin{array}{cccccccc} 1&0.29&0.202&-0.055&-0.105&-0.252&-0.229&0.058\\ 0.29&1&0.415&0.285&-0.376&-0.349&-0.164&-0.129\\ 0.202&0.415&1&0.419&-0.521&-0.441&-0.145&-0.076\\ -0.055&0.285&0.419&1&-0.877&-0.076&0.023&-0.131\\ -0.105&-0.376&-0.521&-0.877&1&0.206&0.034&0.151\\ -0.252&-0.349&-0.441&-0.076&0.206&1&0.192&0.077\\ -0.229&-0.164&-0.145&0.023&0.034&0.192&1&0.423\\ 0.058&-0.129&-0.076&-0.131&0.151&0.077&0.423&1\end{array}\right)

we know what to do: get the eigenstructure. 

here are the eigenvalues:

{2.79227,\ 1.53162,\ 1.24928,\ 0.778408,\ 0.621567,\ 0.488844,\ 0.435632,\ 0.102376}

here is an eigenvector matrix; it is orthogonal:

\left(\begin{array}{cccccccc} -0.19422&-0.417123&-0.399761&0.651593&0.175206&0.362816&0.176312&-0.102404\\ -0.400362&-0.153929&-0.167729&0.06372&-0.8476&-0.230413&-0.110465&-0.0101724\\ -0.458879&0.000298497&-0.167775&-0.2738&0.251231&-0.402953&0.676969&-0.0503862\\ -0.430336&0.472442&0.171282&0.169149&0.117723&0.0645932&-0.236745&-0.677924\\ 0.493775&-0.36045&-0.0871641&-0.180372&-0.138999&-0.135721&0.157246&-0.723646\\ 0.319455&0.320166&0.276619&0.633314&-0.161544&-0.383745&0.376513&0.0521431\\ 0.176886&0.535273&-0.410277&-0.163143&-0.298896&0.512799&0.367054&-0.014846\\ 0.170516&0.245283&-0.708611&0.0869052&0.197851&-0.469135&-0.375711&0.0262083\end{array}\right)

here are just the 4 largest eigenvectors – of course, i mean the eigenvectors having the 4 largest eigenvalues (all of these eigenvectors are the same size, length 1):

\left(\begin{array}{cccc} -0.19422&-0.417123&-0.399761&0.651593\\ -0.400362&-0.153929&-0.167729&0.06372\\ -0.458879&0.000298497&-0.167775&-0.2738\\ -0.430336&0.472442&0.171282&0.169149\\ 0.493775&-0.36045&-0.0871641&-0.180372\\ 0.319455&0.320166&0.276619&0.633314\\ 0.176886&0.535273&-0.410277&-0.163143\\ 0.170516&0.245283&-0.708611&0.0869052\end{array}\right)

he rounded things to the nearest .2 (this is where i saw this device, which i applied in harman). when i do it, i get:

\left(\begin{array}{cccc} -0.2&-0.4&-0.4&0.6\\ -0.4&-0.2&-0.2&0.\\ -0.4&0.&-0.2&-0.2\\ -0.4&0.4&0.2&0.2\\ 0.4&-0.4&0.&-0.2\\ 0.4&0.4&0.2&0.6\\ 0.2&0.6&-0.4&-0.2\\ 0.2&0.2&-0.8&0.\end{array}\right)

this is not exactly what he had. the signs of eigenvectors are not unique; to match him, i need to take the negatives of the 1st and 3rd. this is very interesting, that the absolute signs of the factors are unknown, because in the course of my reading i’ve seen many comments about “all or most of the coefficients of the first eigenvector are positive”. but that’s not true. what was true is that all or most of the coefficients had the same sign, but it’s an indeterminate sign.

we could instruct Mathematica to change signs directly, but we can also do this using matrices. we want to change the signs of the 1st and 3rd columns, but not of the 2nd and 4th. we create a diagonal matrix \Gamma whose diagonal entries are 
{-1,\ 1,\ -1,\ 1}
i.e. -1 in the 1st and 3rd columns, +1 in the 2nd and 4th – and we compute

P \ \Gamma  

right? right. we get

\left(\begin{array}{cccc} 0.2&-0.4&0.4&0.6\\ 0.4&-0.2&0.2&0.\\ 0.4&0.&0.2&-0.2\\ 0.4&0.4&-0.2&0.2\\ -0.4&-0.4&0.&-0.2\\ -0.4&0.4&-0.2&0.6\\ -0.2&0.6&0.4&-0.2\\ -0.2&0.2&0.8&0.\end{array}\right)

that’s what he got. it’s worth mentioning that each of these 4 eigenvectors is a mixture of most of the original data: there are very zeroes in that matrix.

in keeping with “less is more”, we’ll pick this up next time.