Example: Is it a transition matrix? Part 2

We had three matrices from Jolliffe, P, V, and Q. They were allegedly a set of principal components P, a varimax rotation V of P, and a quartimin “oblique rotation” Q.

I’ll remind you that when they say “oblique rotation” they mean a general change-of-basis. A rotation preserves an orthonormal basis; a rotation cannot transform an orthonormal basis to a non-orthonormal basis, and that’s what they mean — a transformation from an orthonormal basis to a non-orthonormal basis, or possibly a transformation from a merely orthogonal basis to a non-orthogonal one. In either case, the transformation cannot be a rotation.

(It isn’t that complicated! If you change the lengths of basis vectors, it isn’t a rotation; if you change the angles between the basis vectors, it isn’t a rotation.)

Anyway, we showed in Part 1 that V and Q spanned the same 4D subspace of R^{10}\ .

Now, what about V and P? Let me recall them:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

As before, I suppose there is a transition matrix T relating them; each column of V^T = V' can be computed by applying a transition matrix T to the corresponding column of P^T\ :

V’ = T P’

i.e.

V = P T’.

Then I imagine briefly that I can “solve” for T’…

T' = P^{-1}\ V\ ,

but then quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (P'P)^{-1}\ P'\ V\ .

I compute T’, and then display its transpose T:

T = \left(\begin{array}{cccc} 0.905145 & -0.386677 & 0.0882059 & -0.141151 \\ 0.853806 & 0.568131 & -0.0875601 & -0.220572 \\ 0.64498 & -0.155196 & -0.506397 & 0.490735 \\ 0.507437 & 0.131036 & 0.67664 & 0.347268\end{array}\right)

and then I test whether

V = P T’ (?).

The largest difference (element by element) between the matrices V and P T’ is 0.170627, which is rather larger (by a factor of 23) than the differences we saw in Part 1. Here’s the full blown matrix — where I chose to round off first, then subtract (i.e. what we’d compute if we were subtracting published, already rounded-off, matrices):

V - P\ T' = \left(\begin{array}{cccc} 0. & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0. & 0 & 0. & 0. \\ 0. & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I am inclined to think there is a probem. But we don’t know much at all about it.

Let’s try one of our alternatives. We construct a projection operator onto the subspace spanned by P.

Get the SVD (Singular Value Decomposition) of P…

P = u\ w\ v^T\ .

Then, since u is 10×10 and V is of rank 4, the leftmost 4 columns (u1) of u….

u1 = \left(\begin{array}{cccc} 0.132041 & 0.280276 & -0.423196 & 0.0366383 \\ 0.24536 & 0.379423 & -0.288143 & -0.0619886 \\ -0.0781059 & 0.252968 & -0.245411 & -0.0642461 \\ -0.310642 & 0.402272 & -0.122076 & 0.641009 \\ 0.19934 & 0.202925 & -0.456543 & -0.212204 \\ -0.275595 & 0.504975 & 0.417433 & 0.130869 \\ 0.0279239 & 0.347505 & 0.199748 & -0.545498 \\ -0.547166 & 0.00466106 & -0.00216808 & -0.298924 \\ -0.571058 & -0.293548 & -0.488675 & -0.061545 \\ -0.278923 & 0.220314 & 0.0368869 & -0.356268\end{array}\right)

are an orthonormal basis for the range of P (i.e. the column space of P). We construct a projection operator, as we have before, by

R = u1\ u1^T\ .

We check it first by confirming that R R = R, i.e. it is idempotent, hence a projection operator. We check it further by applying it to the column vectors in P: we expect that R P = P; i.e. it actually projects onto the column space of P (the space spanned by the columns of P).

Now, having checked that R is a projection operator onto P, we apply it to V, and compute V – R V:

V - R\ V = \left(\begin{array}{cccc} 0.00221622 & 0.0115115 & -0.0249851 & -0.0145408 \\ 0.0137721 & 0.012178 & -0.0243583 & -0.0200425 \\ -0.00564412 & -0.011024 & 0.00933521 & 0.0112507 \\ 0.00222059 & 0.00165751 & -0.00118369 & -0.00423324 \\ 0.00223276 & 0.00638406 & -0.0153346 & -0.0119237 \\ -0.00896396 & -0.0106169 & 0.0272358 & 0.0230479 \\ -0.00941945 & -0.0152489 & 0.0456323 & 0.0330044 \\ 0.0470134 & 0.0703021 & -0.170627 & -0.121383 \\ -0.0232815 & -0.036102 & 0.0941953 & 0.068255 \\ -0.0227796 & -0.0330688 & 0.0740154 & 0.0474348\end{array}\right)

Let me round that to the nearest .01…

V - R V = \left(\begin{array}{cccc} 0 & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0 & 0 & 0 & 0 \\ 0 & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I should put that array of differences into perspective. Recall V:

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

This reinforces my belief that we have a problem. Some of the differences we found are larger than some of the values of V itself. I can see that V is not in the column space of P, and while it’s not extremely far away, it’s not close either. (This seems more definitive than “the hypothetical transition matrix T doesn’t quite work”.)

As before, however, we are tip-toeing around the real question: Is every column vector in V a linear combination of the columns of P?

Easy enough. Do a regression. Actually 4 of them, one for each column of V.

Let me name the individual columns of the P matrix, and let me set the first column of V (v1) as “y”, and the columns of V as independent variables.

(Some smaller pictures follow, in which you can see that the final option is “IncludeConstantBasis -> False”.) The results are:

Not bad at all. In fact, rather close to perfect.

What was the first row of T?

{0.905145, -0.386677, 0.0882059, -0.141151}.

Yes, those are the \beta s (“Estimates”) for that regression; first row of T is first column of T’ is regression coefficients for the first column of V as the dependent variable.

Let me emphasize another relationship. For the first regression, yhat should be the projection of v1 (the 1st column of V) onto the column space of P. But we have that projection directly as R v1:

R v1 = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

… and we have the yhat from the regression:

yhat = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

Good, they are the same. So if regression works better for you than the projection operator, then use regression. Personally, I prefer the projection operator because I can apply it to all of V at once. (And because sometimes I have a use for the orthonormal basis u1.)

Largely to display the R^2, let me run the other three regressions. (Oh, I also reduced the magnification in my Mathematica notebook, so the pictures are smaller and the right side of the regression command isn’t cut off, at least on my screen.)

Here’s v2 as the dependent variable:

Here’s v3 as the dependent variable:

Here’s v4 as the dependent variable:

We see that the R^2 is over 99% for v2, falls to 95% for v3, and is 97.6% for v4. Those last two are just not large enough, not for this question.

I’m certain that V and P do not span the same subspace, but that’s about all I know.

We could, however, have made the dual choice, projecting P onto V instead of V onto P. Are they the same?

No.

We already know — from the T matrix, from the projection operator, or from the regressions — that the P and V matrices do not represent the same data in two different coordinate systems. But there might be — and in this case, there is! — more information available by making the other choice.

Let’s get the projection onto V instead of onto P. We need the SVD of V…

V = u\ w\ v^T\ .

We need the 4 leftmost columns (u1) of u…

u1 = \left(\begin{array}{cccc} -0.292636 & 0.0434068 & -0.33927 & 0.269896 \\ -0.274013 & 0.218205 & -0.238496 & 0.365998 \\ -0.337357 & -0.00723082 & -0.119568 & 0.0745101 \\ -0.355459 & 0.224109 & -0.415783 & -0.490416 \\ -0.2739 & -0.0500253 & -0.225463 & 0.452829 \\ -0.308498 & 0.463936 & 0.232078 & -0.416388 \\ -0.287334 & 0.227908 & 0.427185 & 0.212631 \\ -0.303344 & -0.281983 & 0.503807 & 0.152613 \\ -0.329068 & -0.741253 & -0.129874 & -0.304713 \\ -0.382183 & -0.0396735 & 0.288053 & -0.0857624\end{array}\right)

We construct the projection operator…

R = u1\ u1^T

Now we apply it to P, and compute the difference P – R P:

P - R P = \left(\begin{array}{cccc} 0.00186923 & -0.00646007 & -0.00165699 & 0.0293708 \\ -0.00396322 & 0.00598219 & 0.00143136 & 0.0162366 \\ 0.0035053 & 0.00255499 & -0.00525352 & -0.0232869 \\ 0.000382829 & 0.00307153 & 0.00267228 & 0.124774 \\ 0.000923133 & -0.00219591 & 0.00142733 & -0.0228012 \\ 0.0000472374 & -0.00300758 & -0.00291228 & -0.00287352 \\ -0.00462961 & -0.00220958 & -0.0000400809 & -0.149176 \\ 0.00305495 & 0.00376593 & -0.000439512 & 0.29618 \\ -0.00367312 & -0.00153518 & -0.000838857 & -0.106747 \\ 0.00147883 & -0.000459218 & 0.00482354 & -0.141981\end{array}\right)

So what?

Round it off:

P - R P = \left(\begin{array}{cccc} 0 & -0.01 & 0 & 0.03 \\ 0 & 0.01 & 0 & 0.02 \\ 0 & 0 & -0.01 & -0.02 \\ 0 & 0 & 0 & 0.12 \\ 0 & 0 & 0 & -0.02 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.15 \\ 0 & 0 & 0 & 0.3 \\ 0 & 0 & 0 & -0.11 \\ 0 & 0 & 0 & -0.14\end{array}\right)

Ah ha! It’s time for Robin to squeak out, “Holy moly, Batman!”

The fourth column of P does not lie in the subspace spanned by the 4 columns of V, while the first three do. That is the problem. I don’t why or where it happened, but I know exactly what’s wrong: the 4th column of P is the only problem.

Somebody blew it somewhere. (OK, since I don’t know how a varimax rotation is computed, I have to entertain the possibility that V is not supposed to be P under a change of basis. But I can’t believe that, because V and Q have a common column space, and 3 of the 4 columns of P lie in that common column space. If that’s an accident, what does a deliberate act look like?)

Is the mistake in the orignal paper? Were P and V computed from the same correlation matrix but by independent algorithms? Or was V computed incorectly from P (and then Q computed from V)?

Or did someone copy P incorrectly? (P should be primary, but then how the heck does the 4th column of P fail to be in the column space of V? I would sooner have expected an error in V, given P.)

Or did they rotate the first 3 PCs, and then apply that transformation to the 4th column? (No. I’ve checked. You can too.)

Let me emphasize that we would get the same result from four regressions each with one column of P as the dependent variable and all the columns of V as the independent variables. The first three regressions would be extremely good fits, but the fourth would not be. We would, again, conclude that the fourth column of P was not exactly — or even to a fair approximation — a linear combination of the columns of V.

Summary: theory

Given two matrices, say P and V, which are alleged to be the same data in different coordinate systems, we may check that assertion very quickly (if both are of full rank):

Define

T' = (P'P)^{-1} P'\ V\ .

If we find it true that

V’ = T P’,

then T is a transition matrix between two bases, and P and V are indeed the same data in different coordinate systems. We’re done.

If, however, the equality fails, V' \ne T\ P'\ ,

then T is not a transition matrix, and P and V are not the same data in different coordinate systems. Unless we want more information, we’re done.

Finally, we have seen that, numerically, we may have T “close to” a transition matrix. (In Part 1, V and Q are very nearly the same data; in the color posts, the xyx bar and rgb bar tables are very nearly the same data.)

Summary: practice

Computing T tells us if something is wrong, but not much more. It’s quick, easy, and gives a simple answer: “good” or “no good”.

Computing both projection operators P onto V and V onto P (not just one of them) would likewise tell us whether or not something is wrong, and, in addition, the projection operators might isolate the problem.

We could use regression instead of computing explicit projection operators.

Summary: this example

We found that V and Q are the same data in two different coordinate systems, but that P is not. In addition, we actually know that the problem is limited to the 4th column of P: it does not lie in the space spanned by the columns of V (i.e. it does not lie in the column space of V, which is also the column space of Q).

Question:
Can someone tell me if these P, V and Q match the original paper by Yule et al.?

And there’s a question you might have.
How many examples did I have to search through to find such a good (because it was such a bad) example?

One. This was the very first case I checked, outside of the 1931 CIE tables.

That’s scary.

Example: Is it a transition matrix? Part 1

This example comes from PCA / FA (principal component analysis, factor analysis), namely from Jolliffe (see the bibliography). But it illustrates some very nice linear algebra.

More precisely, the source of this example is:
Yule, W., Berger, M., Butler, S., Newham, V. and Tizard, J. (1969). The WPPSL: An empirical evaluation with a British sample. Brit. J. Educ. Psychol., 39, 1-13.

I have not been able to find the original paper. There is a problem here, and I do not know whether the problem lies in the original paper or in Jolliffe’s version of it. If anyone out there can let me know, I’d be grateful. (I will present 3 matrices, taken from Jolliffe; my question is, does the original paper contain the same 3 matrices?)

Like the previous post on this topic, this one is self-contained. In fact, it has almost nothing to do with PCA, and everything to do with finding — or failing to find! — a transition matrix relating two matrices.

On p. 163, Jolliffe provides a matrix of “principal components”:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

On the same page, he provides two matrices of the “rotated factor loadings”. One is “Varimax”…

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

… and the other is “direct quartimin”:

Q = \left(\begin{array}{cccc} 0.51 & -0.05 & 0.05 & 0.05 \\ 0.53 & 0.04 & 0.05 & -0.14 \\ 0.32 & 0.13 & 0.16 & 0.15 \\ 0.17 & -0.19 & 0.65 & 0.2 \\ 0.54 & 0.06 & -0.13 & 0.05 \\ -0.07 & 0.28 & 0.67 & -0.12 \\ 0.16 & 0.53 & 0.13 & -0.17 \\ 0.03 & 0.62 & -0.09 & 0.26 \\ 0. & 0.09 & 0.02 & 0.87 \\ 0.08 & 0.45 & 0.24 & 0.21\end{array}\right)

I presume — not knowing how to comute a varimax rotation! — that each of these matrices represents the same data, i.e. the same 4D subspace of R^{10} (since we have 4 column vectors of length 10 in each matrix). It would seem silly if a varimax rotation of PCs was not supposed to represent rotated PCs within the same subspace.

That presumption is wrong: they do not all represent the same subspaces.

How do I know this?

Because given two matrices of the same shape, I know how to find the transition matrix if it exists. We did this here.

Let me elaborate on what I’m doing. Imagine that our matrices had 2 columns and 3 rows. Those 3D column vectors lie in a 2D subspace — a plane — in R^3\ . By doing 2D transformations within that plane, I might find a particularly nice basis for representing that data.

If, instead, I apply a 3D transformation, I would be moving that data into a different plane. What I am doing here is looking at 4D transformations within the 4D subspace. Everything is predicated on my belief that the PCs in P, and the varimax rotation of them in V, and the direct quartimin data in Q are all supposed to be in the same subspace.

Not knowing how they were supposed to be computed, I could be seriously wrong here. OTOH, since we will find that V and Q are the same subspace, and the first 3 of the 4 columns of P also lie in that common subspace, I really believe that the 4th column of P should too. That it does not strikes me as an error.

Let me make one simplification up front. Are the matrices P, V, Q of full rank (4)?

Yes. The Mathematica command MatrixRank says that each of those matrices is of rank 4.

Let us consider V and Q. If the observations in each are related by a transition matrix T, and if we arbitrarily take V to be the “old” data, then each column of V’ (the transpose of V) is found by applying the transition matrix to the corresponding column of Q’.

V’ = T Q’

i.e.

V = Q T’.

Then, as I showed before, I imagine briefly that I can “solve” for T’…

T' = Q^{-1} V\ ,

but quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (Q'Q)^{-1} Q'\ V\ .

(It’s worth repeating that the inappropriate pseudo-inverse would use QQ’, but that can’t work because QQ’ is 10×10 but of rank 4 at most, hence not invertible.)

If Q’Q is invertible, that recipe always gives me a matrix T, but T need not be a transition matrix. The resulting T is a transition matrix if and only if we actually have

V = Q T’,

(given that V and Q are of the same full rank).

Do we have V = Q T’ ?

By direct computation I get

T = \left(\begin{array}{cccc} 0.9273 & 0.0921888 & 0.15161 & 0.101824 \\ 0.22944 & 0.864017 & 0.176744 & 0.0556721 \\ 0.251353 & 0.0599739 & 0.913081 & 0.0549228 \\ 0.185642 & 0.11154 & 0.00492819 & 0.941728\end{array}\right)

and then I look at V – Q T’…. You know, it’s easy enough to put them side by side.

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right) Q\ T' = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.16 & 0.17 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0 & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

Even the difference of the rounded-off matrices is satisfying small. That is,
Round[V,.01]-Round[Q.TT,.01]//Chop//MatrixForm is

\left(\begin{array}{cccc} 0 & 0 & 0 & 0 \\ 0 & -0.01 & 0.01 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0\end{array}\right)

Judging that the difference is negligible, I conclude that the matrices V and Q do span the same subspace of R^{10}\ ; they are the same 4D subspace with two different coordinate systems (two sets of basis vectors).

While we’re here, let’s do this another way. After all, the fundamental question is: can the columns of V be written as linear combinations of the columns of Q? Alternatively, can each column of Q be written as a linear combination of the columns of V?

Bear in mind that we have already established that Q and V are of the same full rank. In that case, the alternatives are equivalent.

We can answer the fundamental question by trying to regress each column of V on Q. Just remember to tell the regression to drop the constant!

To be specific, here’s the first column of V as a linear combination of the 4 columns of Q. Here’s the parameter table and R^2 — not the adjusted R^2\ : I really want to know how good this fit is, never mind how many variables it has.

Here’s a snapshot of the Mathematica commands (version 7). (Oh, darn. The last option in the LinearModelFit command, which replaced “Regress” is “IncludeConstant->False”. And I print the names just to confirm that none of those symbols has a left-over numerical value!)

(Let me just remind you that the R^2 is a measure of a how close the fit is. When we add another independent variable, however, the R^2 cannot decrease; it is not a good measure of the value — i.e. the worth — of that latest independent variable. The adjusted R^2 can decrease; in fact, in my experience, if the t-statistic of that latest variable is greater than 1 (in magnitude), then the adjusted R^2 goes up. As I said, in this case, I really want the R^2 rather than adjusted R^2 because I know exactly how many independent variables I am supposed to use. That said, there isn’t a whole lot of difference between the two in this case.)

v1 on Q

That’s a really good fit. And it should be. And that column of \beta\ , i.e. “Estimate”? It’s the first column of T’, i.e. the first row of T:

{0.9273, 0.0921888, 0.15161, 0.101824}

We could run 3 more regressions, but they’d only confirm what we believe: T’ could, in fact, be computed as regression coefficients. If the two subspaces coincide, then T is also a transition matrix between two bases in one and the same space.

There is yet another way to look at that. We know that in regression, yhat is the projection of y onto the subspace spanned by the columns of X. If y itself is in that subspace, then yhat = y, and the relatonship is given by a transition matrix.

So we could do this a third way. We know how to compute a projection operator onto the subspace spanned by the columns of X. In this case, X is Q, so we find its SVD (Singular Value Decomposition)…

Q = u\ w\ v^T

Then the 4 leftmost columns (u1) of u are an orthonormal basis for the range of Q….

u1 = \left(\begin{array}{cccc} -0.203473 & 0.27677 & 0.353491 & 0.188167 \\ -0.188235 & 0.423093 & 0.310857 & 0.0177813 \\ -0.30583 & 0.10025 & 0.140569 & 0.094808 \\ -0.313324 & 0.177504 & -0.285416 & 0.609666 \\ -0.194156 & 0.220796 & 0.497051 & 0.0101643 \\ -0.354903 & 0.228756 & -0.604256 & -0.00173818 \\ -0.320761 & 0.191153 & -0.0758258 & -0.463325 \\ -0.379019 & -0.300366 & 0.0912622 & -0.459411 \\ -0.365508 & -0.681435 & 0.196437 & 0.360653 \\ -0.43321 & -0.103153 & -0.111086 & -0.166404\end{array}\right)

That is, u1′ u1 = I, if the column space of X is of full rank. Check it:

u1^T u1 = \left(\begin{array}{cccc} 1. & 0 & 0 & 0 \\ 0 & 1. & 0 & 0 \\ 0 & 0 & 1. & 0 \\ 0 & 0 & 0 & 1.\end{array}\right)

Bur the reversed matrix product

R = u1 u1′

gives us a projection operator R onto the range of Q:

We can check it. (I won’t print it: it’s 10×10.) One, R is a projection operator if and only if it is idempotent, i.e. R R = R. And by computation, I see that is true.

Two, applied to each column of Q, it reproduces that column of Q; i.e. applied to Q as a whole, it reproduces Q, i.e. R Q = Q. Again, computation shows that is true, too.

Having checked that R is a projection onto the column space of Q, let’s use it: apply it to V. If V lies in the subspace spanned by the columns of Q, then R V will reproduce V….

I find that the largest difference between the elements of R V and V is 0.00720726 .

That’s as close as the other two calculations. Not exactly zero, but then, our regression didn’t fit exactly: like the RGB and XYZ tables in the earlier post, V and Q are almost but not exactly related by a transition matrix.

So we have seen three ways of looking at this. All three tell the same story: the column spaces of V and Q are the same.

Let me say a little more about two of them, the projection operator and the explicit regression.

The projection operator R came from the SVD. The regression can also be written using a projection operator, namely the hat matrix H. I keep getting mixed up about the relationship between T and R and H, so let me write it out.

The normal equations for

\hat{y} = X\ \beta

are

\beta = (X'X)^{-1}\ X'\ y\ .

(I showed you a quick and dirty way to recover that equation in the earlier post.)

Then

\hat{y} = X\ \beta = X\ (X'X)^{-1}\ X'\ y

and so we may write

yhat = H y

with the hat matrix H defined as

H = X\ (X'X)^{-1}\ X'\ .

For our regression, we took y to be the first column of V, and X to be the four colums of Q, so our hat matrix would be

H = Q\ (Q'Q)^{-1}\ Q'\ .

It is a projection operator onto the range of Q. But that’s exactly what R was. Go ahead and check: you will find that

H = R.

This means that we could compute R as the hat matrix, instead of using the SVD. Personally, I know how to use the SVD, whereas I would have to work out the definition of the hat matrix every time. But that’s a preference; you may choose to compute the hat matrix instead. (And that is how Malinowski was doing it!)

Going a little further, T’ was

T' = (Q'Q)^{-1}\ Q'\ V

so we have

Q T’ = H V.

We’ll look at the relationship between P and V in Part 2.

PCA / FA. A Map to my posts

The Posts

The purpose of this post is to provide guidance to a reader who has just discovered that I have a large pile of posts about principal components / factor analysis. This pile of posts might seem a very jungle, without any map.

Here, have a map.

As I finalize this post, it will be number 52 in PCA / FA. Here’s a list of the 52 posts, including the dates spanned by any group, and the number of posts in that group. (When the picture was taken, I didn’t know when this would be published. In fact, post 51 was scheduled but not yet published. Even more, post 51 did not even exist when the first picture was created.)

posts-table
transition/attitude matrices is a post that is sometimes relevant when we discuss “new data” in PCA, but it is not in the PCA / FA category.

“tricky prepro” is short for “tricky preprocessing”, and discusses the combination of constant row sums and covariance or correlation matrix.

Posts which have no named author, such as “example 8″, have the tag, “Rip”.

Right now the tag cloud includes entries such as Harman and Davis, i.e. the authors of textbooks I used. This means you can see all the posts based on Harman (including those which include Davis) by clicking on the Harman tag. The number of tags shown, however, is limited to 45. In time, some or all of these author tags may disappear. What to do then?

This post shows you that the original Harman posts all occurred in January 2008, and that there were some posts about Harman & Davis in April 2008, and the “et al” on post 51 means that it talks about Harman, Jolliffe, and Basilevsky.. The dates would let you pick whatever you wanted out of the monthly archives.

The Authors

Let me summarize the sources quickly. There are three texts which specialize in PCA / FA in general: Harman, Jolliffe, Basilevsky. There is one text which specializes in PCA / FA in chemistry: Malinowski. There are three texts which specialize in data analysis in a particular field: chemistry, geology, social science (resp. Brereton, Davis, and Bartholomew et al.)

Harman’s primary focus is factor analysis, but I started with him because he had a simple example with data. He also has a lot of material on what they call rotations, but I did not look at it. If you need to understand this, I think he’s a good place to start. (What people mean in this field when they say “rotation” is anything but; in fact, they want to turn an orthogonal basis into a non-orthogonal basis, and that is not done by a rotation!)

Jolliffe is all about principal component analysis. As I said before, if you do this professionally, you probably need this book: he talks a lot about what you’re doing and what it means. I didn’t start here because his examples are not self-contained: he references the literature which contains the data.

Davis is about data analysis primarily for geology. His material on PCA / FA is limited to one chapter. Nevertheless, as you might guess from the number of posts, Davis is where I began to put it all together. I would recommend his book for data analysis in general, not just for PCA / FA.

Malinowski and Brereton are special-purpose texts, even within the category “PCA in chemistry” (to judge from the comment here.) From Malinowski I learned quite a bit more about the geometry of the SVD, the plausibility of using data which was not even centered, and an interesting approach to missing data.

Brereton has some nice data, but you have to own the book to access the data. Also, he did not look like the kind of book I could have learned this from. (Practice with, yes; learn the concepts, no. But then, I’m not trying to call a PCA command in a statistics package – not that it’s a bad thing to do, but it’s not my cup of tea.)

Bartholomew et al. is about data analysis primarily for the social sciences. Like Davis, their material on PCA is limited to one chapter, but with another on classical FA (which I have not discussed). The book, however, has publicly available data, and this makes it an excellent source for practice. In fact, it was here that I finally saw how easy it was to compute the scores. They do not use matrix notation, so it might help if you had learned that elsewhere before picking up this book.

Basilevsky is devoted to factor analytic methods: factor analysis, principal component analysis, and regression. He tends to have data printed in tables, so I have scanned things into the computer for myself in order to play with his examples. I think I’m just as glad I found him after I understood how to do this; he might have made me dizzy in the beginning.

Introduction to The Calculations

A summary of just about all the calculations I could do for PCA is here but you might also look at the post immediately preceding this one.

Now that I understand the basics of PCA, I suspect that I personally would use it in two ways. One, I might use it on data before I ran a regression. Two, I might do just what I have been doing, namely checking someone else’s calculations.

Using PCA Myself

Let’s take care of the simple case first. If I have some data, and if I choose to run PCA, I would be interested – at this stage of my understanding – in two things. One, how would a PCA redistribute the variance? Two, how many linearly independent variables do I have?

I would answer both of those questions by getting the SVD (singular value decomposition, X = u\ w\ v^T\ ) of my data matrix X. The number of nonzero singular values w will tell me the rank of the matrix; the number of “small” singular values will let me estimate multicolinearity. Since I want to know about the redistributed variance, I need to use either centered or standardized data. (Otherwise, the eigenvalues are not equal to variances.) In either case, it is the orthogonal eigenvector matrix v (not the weighted matrix A) which will show me the redistributed variances.

When I do this, I’ll show it to you. (I have yet to do it, since understanding PCA.) Who knows? At that time, I may figure out that there are other things I would wish to get out of the PCA.

Anyway, it seems to me today that for my own purposes I would simply get the SVD of my data, and look at w and v. This is way, way simpler than what I’ve usually been doing while checking other peoples calculations.

(I do wonder if the matrix u can be used to isolate variables which are multi-collinear. We saw in Malinowski that we could construct a projection operator…. Let me think about that.)

The other case is somewhat more complicated. This takes us back to my beginnings.

If I am checking someone else’s analysis, the very first question is: did they give me the data?

Checking Someone Else, No Data

If they did not give me the data, they must have given me a “dispersion matrix”: some form of either X^T X or X X^T\ , in fact probably either a covariance matrix or a correlation matrix. Whatever it is, denote it by c.

I would then get an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ,

where the eigenvalues of c are the nonzero elements of the diagonal matrix \Lambda^2\ . To put that another way, the diagonal matrix \Lambda has \sqrt{\text{eigenvalues}} for its diagonal elements.

You will recall that minor hell can break loose. Although any matrix of the form X^T X or X X^T must be positive definite – when computed exactly – any rounded off representation of it need not be positive definite. We saw an example here. The specific problem we might encounter is that a very small positive eigenvalue – really – might come out negative when we compute using a printed – rounded-off – dispersion matrix instead of the computed matrix. (In principle, of course, even the computed one could fail to be positive definite; there’s nothing sacred about rounding to 6 places or 15 places instead of 3 or so.)

If that happens, I would set the troublesome eigenvalue to zero, and continue. Incidentally, this can occur if we started with constant-row-sum data and then computed either the covariance or correlation matrix. I discussed that here.

Moving on, let us suppose we have the orthogonal eigenvector matrix v and the diagonal matrix \Lambda^2 of nonnegative eigenvalues (i.e. we took care of any negative ones, by setting them to 0).

If the work we are checking stays with v and \Lambda^2\ , then he is probably doing the kinds of things Jolliffe discussed.

Otherwise, I expect to see a weighted eigenvector matrix

A = v\ \Lambda\ .

If we were not given data, the only conclusions we can check are those drawn from v, A, and \Lambda\ . Computing the “scores” requires the data, and even if the analysis shows the scores, we cannot check them.

But what if we were given the data? Well, this could get interesting. (This, after all, is why I’ve been doing this stuff for a year and a half!)

Checking Someone Else, With Data

The first significant consequence of being given the data is this: we should be able to relate their eigenvalues and eigenvectors to the data. (We’ll get to the scores….)

That may not be straightforward.

The first challenge is: are the variables in columns or in rows? Harman, we will recall, puts the variables in rows, using the transpose of our customary X matrix: hence we had his

Z = X^T

and

Z = A\ F\ .

I would compute the means and variances of each variable (i.e. of each column or row). If I were at all uncertain whether the variables were rows or columns, I would compute the means and variances of each row and column. In fact, even if I know that the variables are in columns, say, I would compute the row sums.

This tells me whether I was given raw data, centered data, standardized data, or small standard deviates; or something else. It remains to be seen if they preprocessed the data before computing the dispersion matrix.

Okay, if the given data is standardized (or small standard deviates), I expect that they used the given data for the analysis. But I might be wrong.

Now, what did they do? They either computed an SVD of the data, or an eigendecomposition of a dispersion matrix. I would compute whatever I think they did with my best guess for the preprocessing.

If my eigenvectors are right, then I have the correct dispersion matrix to within a scale factor; if my eigenvalues are also right, then the scale factor is right.

Bear in mind that any multiple of an eigenvector is also an eigenvector. I would have computed an orthogonal eigenvector matrix v. If the given eigenvectors are of unit length, we have a match so long as they agree with mine to within a sign. If the given eigenvectors do not agree with mine to within a sign, I would compute their lengths. If their lengths turn out to be \sqrt{\text{eigenvalues}}, I would immediately compute

A = v\ \Lambda\ ,

expecting that my columns of A would agree with theirs to within a sign. About the only other scaling I have seen is to set the largest component of each eigenvector to 1. (Apart from Basilevsky normalizing the rows of the A matrix by the inverse standard deviations! And that, by the way, means the columns of A are no longer eigenvectors.)

Let me remind you that raw data, centered data, and standardized data (each denoted X) lead to different eigenvectors of X^T X\ ; so if I match the eigenvectors, I have correctly matched the author’s raw, centered, or standardized data. Until I match the eigenvectors, there’s not much point in worrying about the eigenvalues.

If the eigenvalues are off by a scale factor, once the eigenvectors are right, then I need to use a different multiple of X^T X\ . That is,

X^T X\ , \frac{X^T X}{N}\ , and \frac{X^T X}{N-1}

have the same eigenvectors, but the eigenvalues differ by factors of N or N-1 or (N-1)/N.

The same considerations apply to the computation of the singular value decomposition. If my matrices u and v agree with the author (to within signs, assuming orthonormal eigenvectors, as usual), I have correctly matched raw, centered, or standardized. If the singular values w are off by a scale factor, then they scaled the data somehow.

Why all the fuss?

  • I have seen people use 1/N in their theoretical discussion, but use 1/(N-1) in their computation.
  • I have seen someone use 1/N to get “the correlation matrix” or “the covariance matrix”, but I would have used 1/(N-1).
  • I have seen people say they were using “the covariance matrix”, but they actually used X^T X instead of \frac{X^T X}{N-1}\ .
  • I have seen people call X^T X the covariance matrix when X wasn’t even centered.
  • I have seen someone say the data was standardized, when in fact it was small standard deviates. (Same eigenvectors, different eigenvalues.)

Of course, either the author or I might have made a mistake getting the data into the computer; but I have learned to consider instead the possibility that an author and I are doing different computations on the same given data.

Finally, I have sometimes found that I agree with an author’s largest few eigenvectors (by which I mean, the eigenvectors associated with the largest eigenvalues), but not with the smaller ones. I suspect that this is a result of their using a sequential algorithm to find eigenvectors one after another.

I remind you that it is very easy to check an eigenvector-eigenvalue pair, my own or an author’s. Suppose we have done an eigendecomposition of a matrix c,

c = v\ \Lambda^2\ v^T\ .

Suppose, for example, that v_3 is the third eigenvector, with associated eigenvalue \lambda_3 = \Lambda_{33}^2\ ). Then the definitions of eigenvector and eigenvalue tell us that

c\ v_3 = \lambda_{3}\ v_3\ .

The LHS is a matrix-vector product and the RHS is a scalar times a vector. Compute both sides and compare them.

Whew! So far, I have matched the author’s eigendecomposition. Presumably, I know whether they used the orthogonal eigenvector matrix v or the \sqrt{\text{eigenvalue}} weighted eigenvector matrix A. The eigenvector matrix, whichever one they chose, is usually the “loadings”.

We’re actually home free, now.

If they present the “scores”, I expect to find that they are either X v or – if they come from A – the first few columns of \sqrt{\text{N-1}}\ u ( here), unless I have a reason to believe they are using Davis’ scheme:

S^R = X A^R\ .

(And for more about that, see the post immediately preceding this one.)

PCA / FA. Example 4 ! Davis, and almost everyone else

I would like to revisit the work we did in Davis (example 4). For one thing, I did a lot of calculations with that example, and despite the compare-and-contrast posts towards the end, I fear it may be difficult to sort out what I finally came to.

In addition, my notation has settled down a bit since then, and I would like to recast the work using my current notation.

The original (“raw”) data for example 4 was (p. 502, and columns are variables):

X_r = \left(\begin{array}{lll} 4 & 27 & 18 \\ 12 & 25 & 12 \\ 10 & 23 & 16 \\ 14 & 21 & 14\end{array}\right)

It would be prudent to compute the means…

 \{10,\ 24,\ 15\}

and variances of the data…

 \{18.6667,\ 6.66667,\ 6.66667\}

It would also be prudent to calculate the row sums.

  \{49,\ 49,\ 49,\ 49\}

They are all the same (we have “constant row sums”).

Now, Davis chooses to analyze centered data. Let’s get it, by subtracting its mean from each column.

 X = \left(\begin{array}{lll} -6 & 3 & 3 \\ 2 & 1 & -3 \\ 0 & -1 & 1 \\ 4 & -3 & -1\end{array}\right)

That is what he has on page 503. I remind us that we expect (HERE) that the row sums of the centered data are now zero. That in turn means that we have lost rank: the centered data should have two nonzero eigenvalues or singular values, instead of 3.

Next, Davis chooses to analyze X^T X instead of the covariance matrix \frac{X^TX}{N-1}\ . We recall that these two have the same eigenvectors, but eigenvalues differ by a factor of N-1. In fact, Davis also uses the singular value decomposition (SVD); that is a good reason for doing an eigendecomposition of X^T X\ .

Here is the SVD of X, X = u\ w\ v^T\ . (Always remember that my u and v correspond to Davis’ V and U respectively, if you’re looking at both of us.)

u = \left(\begin{array}{llll} 0.801784 & 0. & 0.597614 & 0. \\ -0.267261 & -0.816497 & 0.358569 & 0.365148 \\ 0. & 0.408248 & 0. & 0.912871 \\ -0.534522 & 0.408248 & 0.717137 & -0.182574\end{array}\right)

v =\left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Sure enough, there are only two nonzero singular values in the w matrix.

Let us now compute Davis’ loadings and scores. Using my notation for the SVD, he has

A^R = v\ w^T

S^R = X A^R (= u\ w\ w^T\ , but use the definition for computation: one less matrix product.)

A^Q = u\ w ( = X v)

S^Q = X^T A^Q (= v\ w^T\ w)

(I should remark that we see that we could have done all of those calculations using the data X, the singular values w, and the matrix v: we do not need u.)

and we get

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

 A^Q = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

S^R = \left(\begin{array}{llll} 67.3498 & 0. & 0. & 0. \\ -22.4499 & -9.79796 & 0. & 0. \\ 0. & 4.89898 & 0. & 0. \\ -44.8999 & 4.89898 & 0. & 0.\end{array}\right)

 S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Davis does not have any of these extra columns of zeros, but all of our nonzero numbers agree (to within a sign for each column). At this point, I think the most significant thing to say is that these scores and loadings do not reproduce the data; in the sense that, the matrix product of corresponding scores S and loadings A is not X.

Let us see what Jolliffe would have done.

He writes his model as

Z = X A,

with Z as the new data, and A as an orthogonal eigenvector matrix; but I would write that as

Y = X v,

since I reserve A for the weighted eigenvector matrix.

(I have to point out that, as we saw in previous posts, Y = A^Q\ .)

Now, Jolliffe would usually do an eigendecomposition of the covariance matrix; but we have enough experience to understand that we can do an eigendecomposition of c = X^T X\ . Here it is:

 c = \left(\begin{array}{lll} 56 & -28 & -28 \\ -28 & 20 & 8 \\ -28 & 8 & 20\end{array}\right)

Here are its eigenvalues:

 \lambda = \{84,\ 12,\ 0\}

Here is an orthogonal eigenvector matrix (now I use my standard notation, V for an orthogonal matrix computed from an eigendecomposition, versus v from the SVD):

V = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

We can check that V^T V = I\ . (It does.)

We know that V should differ from v (from the SVD) by at most the signs of columns. Recall v from the SVD:

 v = \left(\begin{array}{lll} -0.816497 & 0. & 0.57735 \\ 0.408248 & -0.707107 & 0.57735 \\ 0.408248 & 0.707107 & 0.57735\end{array}\right)

In fact, they agree completely.

Jolliffe would project the data X onto the eigenvector matrix V to get new data Y = XV:

Y = X V = \left(\begin{array}{lll} 7.34847 & 0 & 0 \\ -2.44949 & -2.82843 & 0 \\ 0 & 1.41421 & 0 \\ -4.89898 & 1.41421 & 0\end{array}\right)

Now we should recall A^R\ , to confirm the equivalence:

A^R = \left(\begin{array}{lll} 7.34847 & 0. & 0. \\ -2.44949 & -2.82843 & 0. \\ 0. & 1.41421 & 0. \\ -4.89898 & 1.41421 & 0.\end{array}\right)

We recall that Y = X v = X V is equivalent to the change of basis formula

x = v y,

where x is an old observation and y is the corresponding new one.

We learned – from Harman! – that the new data Y has maximally redistributed variance. It is crucial that we used an orthogonal eigenvector matrix v or V. Let’s confirm that by computing the variances of the columns of Y…

\{28.,\ 4.,\ 0.\}

and their sum is 32.

Let’s recall the variances of the columns of X…

\{18.6667,\ 6.66667,\ 6.66667\}

and their sum is also 32.

Now, what about Harman? He writes his model as

Z = A F,

where

Z = X^T\ ,

(i.e. his variables are in rows, not in columns) and A is a \sqrt{\text{eigenvalue}}-weighted eigenector matrix. And I generally favor his notation for Z, but I will follow Jolliffe for a few seconds later on.

Let us compute. From the eigenvalues of c,

\{84,\ 12,\ 0\}

we construct a diagonal matrix \Lambda of their square roots:

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

We should expect that these are the nonzero nonsingular values w. Sure enough:

w = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Note, however, that w is not square, hence is not equal to the diagonal matrix \Lambda\ . This will lead to an irrelevant difference between A and A^R\ , etc. In fact, I often ignore the distinction between \Lambda and w conceptually; in practice, of course, whatever matrices we’re using have to be “the right size” for whatever matrix products we compute (conformable).

We should recall that the nonzero eigenvalues are the nonzero values of both w^T w and w\ w^T\ :

The nonzero eigenvalues were…

\{84,\ 12,\ 0\}

whereas

w^T w = \left(\begin{array}{lll} 84. & 0. & 0. \\ 0. & 12. & 0. \\ 0. & 0. & 0.\end{array}\right)

and

w\ w^T = \left(\begin{array}{llll} 84. & 0. & 0. & 0. \\ 0. & 12. & 0. & 0. \\ 0. & 0. & 0. & 0. \\ 0. & 0. & 0. & 0.\end{array}\right)

Resuming our thread, we compute the weighted eigenvector matrix A

A = v\ \Lambda = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

and that should be Davis’ A^R\ ; and it is, except for an extra – to my mind, irrelevant – column of zeros.

A^R = \left(\begin{array}{llll} -7.48331 & 0. & 0. & 0. \\ 3.74166 & -2.44949 & 0. & 0. \\ 3.74166 & 2.44949 & 0. & 0.\end{array}\right)

Viewing A as a transition matrix, we know that Harman’s scores are the new data F^T\ . We also know (HERE) that we may compute them as

F^T= u\ ,

that is, the “first few” columns of u, not all of u.

Ah, life gets interesting because A is of rank 2 instead of 3. The new data F^T should be the first two columns of u, but if we want to see (Harman’s) Z = A F, then we must either cut down A or keep the first 3 columns of u, to have conformable matrices. I will cut down A.

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

We check that by computing A F and comparing it to Z = X^T\ ; we expect Z = AF. And we have it; here’s Z:

Z = X^T = \left(\begin{array}{llll} -6 & 2 & 0 & 4 \\ 3 & 1 & -1 & -3 \\ 3 & -3 & 1 & -1\end{array}\right)

And here’s A F (using the two nonzero columns of A!):

A F = \left(\begin{array}{llll} -6. & 2. & 0. & 4. \\ 3. & 1. & -1. & -3. \\ 3. & -3. & 1. & -1.\end{array}\right)

I remind us that this new data F^T does not have redistributed variance; in fact, this new data has constant variance:

\{0.333333,\ 0.333333\}

There are a couple of (intertwined) ways to understand why the variances are not 1. Fundamentally, we did an eigendecomposition of X^T X rather than of the covariance matrix \frac {X^TX}{N-1}\ . Alternatively, our new data was given by columns of u rather than by columns of \sqrt{N-1}\ u\ . (And that’s a consequence of the “fundamental” reason, but it may make sense by itself.)

That constant variance is in marked contrast to the maximally redistributed variances of the new data Y constructed from the orthogonal matrix v instead of from A.

It is my strong preference to compute the new data F^T from the first few columns of u. To make sure I have the scaling correct, I then check the product A F.

Any number of people, including Davis himself in a subsequent example, show us how to compute the new data F^T using a reciprocal basis instead; that is, one which weights the v matrix by \frac{1}{\sqrt{\text{eigenvalue}}} instead of \sqrt{\text{eigenvalue}}. (Whether or not an author uses the phrase reciprocal basis or dual basis, if he divides each column of v by \sqrt{\text{eigenvalue}}, he is computing a reciprocal basis.)

Life is made interesting, but not difficult, by the presence of a zero eigenvalue. Frankly, the simplest way to deal with a zero eigenvalue – for finding a reciprocal basis – it is to set the zero eigenvalue to 1. Instead of the diagonal matrix

\Lambda = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 0.\end{array}\right)

I use

\Lambda 1 = \left(\begin{array}{lll} 9.16515 & 0. & 0. \\ 0. & 3.4641 & 0. \\ 0. & 0. & 1\end{array}\right)

and instead of A = v \Lambda\ , I compute the reciprocal basis B = v \Lambda1^{-1} (i.e. I divide each column by \sqrt{\text{eigenvalue}} instead of multiplying by \sqrt{\text{eigenvalue}}).

B = \left(\begin{array}{lll} -0.0890871 & 0. & 0.57735 \\ 0.0445435 & -0.204124 & 0.57735 \\ 0.0445435 & 0.204124 & 0.57735\end{array}\right)

Then the new data with respect to A can be found by projecting X onto B:

X B = \left(\begin{array}{lll} 0.801784 & 0 & 0 \\ -0.267261 & -0.816497 & 0 \\ 0 & 0.408248 & 0 \\ -0.534522 & 0.408248 & 0\end{array}\right)

That had better be F^T\ … and it is:

F^T = \left(\begin{array}{ll} 0.801784 & 0. \\ -0.267261 & -0.816497 \\ 0. & 0.408248 \\ -0.534522 & 0.408248\end{array}\right)

(Okay, that was the simplest way to get a reciprocal basis when we have a zero eigenvalue; but the simplest way to get the new data is to avoid the reciprocal basis entirely by taking the first few columns of u instead – assuming that you have the SVD available to you.)

For this example, Davis did not compute the new data with respect to A (i.e. wrt his A^R\ ). As I have said before, I have seen him do it in another example; but his scores S^R are not the new data F^T\ , which is, as far I have seen in textbooks, what other people use for the scores corresponding to loadings A.

Moving on, we learned from Basilevsky (HERE) that the weighted eigenvector matrix A = v \Lambda was also the dispersion matrix between the new data F^T and the old data X. I say “dispersion matrix” because A itself was not computed from a covariance matrix but from a more general “dispersion” matrix X^T X\ .

Let’s confirm that. For just a moment I want to denote the new data by (Jolliffe’s) Z instead of F^T\ , so as not to get confused by transposes. A quick computation (on the side) shows me that I want to compare A with X^T Z\ , i.e. that I want to compute

X^T Z = X^T F^T

(finally reverting to my original notation). Fine, I compute that product, and I get

X^T F^T = \left(\begin{array}{ll} -7.48331 & 0 \\ 3.74166 & -2.44949 \\ 3.74166 & 2.44949\end{array}\right)

and I compare that to A:

A = \left(\begin{array}{lll} -7.48331 & 0. & 0. \\ 3.74166 & -2.44949 & 0. \\ 3.74166 & 2.44949 & 0.\end{array}\right)

Good. Our matrix A is, as we expect, a cross-dispersion matrix between old and new data. (Perhaps I should be less meticulous, and refer to A as a cross-covariance matrix; that term may be more familiar and hence clearer in sense, even though X^T X it is not the covariance matrix.)

Now let me show you something new. We had also discovered from Basilevsky (HERE) that the orthogonal eigenvector matrix v was not a dispersion matrix D between its old data X and its new data Y. The v matrix does not have this property of the A matrix. We had seen that v and D differed by eigenvalue column-weights:

D = v \Lambda^2\ .

But… we’ve just seen that weight here. Okay, it wasn’t \Lambda^2\ , but it was either w\ w^T or w^T w\ . Well, it must have been S^Q\ , i.e. the one that multiplied v:

S^Q = X^T A^Q (= v w^T w)\ .

Of course, I should check that computationally. Here is v \Lambda^2\ :

v \Lambda^2 = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

and here is our S^Q\ :

S^Q = \left(\begin{array}{lll} -68.5857 & 0 & 0 \\ 34.2929 & -8.48528 & 0 \\ 34.2929 & 8.48528 & 0\end{array}\right)

Hang on. We need to explicitly see X^T Y\ , the cross-dispersion matrix analagous to X^T X\ : we compute it…

X^T Y = \left(\begin{array}{lll} -68.5857 & 0. & 0. \\ 34.2929 & -8.48528 & 0. \\ 34.2929 & 8.48528 & 0.\end{array}\right)

I don’t know if it’s useful, but there it is anyway. I can hear my favorite Vulcan whisper, “Fascinating.”

Don’t misunderstand me: I am sure that the Q-mode scores and loadings are interesting in their own right. (I don’t understand the goal of a Q-mode analysis, but that’s another question.)

But I do find it… fascinating… that Davis’ Q-mode scores S^Q are the cross-dispersion matrix between the old data X and the new data Y (in addition to our earlier finding that his Q-mode loadings A^Q are the new data Y with respect to the transition matrix v).

To put that another way…. The orthogonal eigenvector matrix v could have been computed from an eigendecomposition of the dispersion matrix X^T X\ , and defines new data Y; and S^Q = X^T Y is the cross-dispersion between old data X and new data Y.

I had asked a question: is the orthogonal matrix v the dispersion matrix D between the old data X and the new data Y? The answer was, “No” (HERE).

Now we know the rest of the story: not v, but the Q-mode scores S^Q\ are that dispersion matrix D.

PCA/FA Answers to some Basilevsky questions

Let us look at three of the questions I asked early in February, and answer two of them.

First, what do we know? What have we done?

We assume that we have data X, with the variables in columns, as usual. In fact, we assume that the data is at least centered, and possibly standardized.

We compute the covariance matrix

c = \frac{X^T X}{N-1}\ ,

then its eigendecomposition

c = v\ \Lambda^2\ v^T\ ,

where \Lambda^2 is the diaginal matrix of eigenvalues. We define the \sqrt{\text{eigenvalue}}-weighted matrix

A = v\ \Lambda\ .

Finally, we use A as a transition matrix to define new data Z:

X^T = A\ Z^T\ .

We discovered two things. One, the matrix A is the cross covariance between Z and X:

A = \frac{X^T Z}{N-1}\ .

I find this interesting, and I suspect that it would jump off the page at me out of either Harman or Jolliffe; that is, I suspect it is written there but it didn’t register.

Two, we discovered that we could find a matrix Ar which is the cross covariance between Zc and Xs. That is, the general relation

A = \frac{X^T Z}{N-1}

specializes to both centered data Xc and standardized data Xs, and gives us

Ac = \frac{Xc^T Zc}{N-1}

As = \frac{Xs^T Zs}{N-1}\ .

In addition, however, we found a matrix Ar such that

Ar = \frac{Xs^T Zc}{N-1}\ ,

using the new centered data Zc and the old standardized data Xs. Oh, the definition of the matrix Ar was:

Ar = \sigma^{-1}Ac,

where \sigma^{-1} was a diagonal matrix whose entries were the inverse standard deviations of the centered data Xc.

That is, each row of Ac was weighted by the inverse standard deviation to define Ar. And it all worked out because the definition of Ar

Ar = \sigma^{-1}Ac

was dual to the definition of standardized data, which weighted the columns of Xc:

Xs = Xc\  \sigma^{-1}\ .

I will remind you that I find this interesting, but I’m not convinced it is important. I cannot forgive Ar for the fact that it is not a matrix of eigenvectors. I assure you, I could be missing something about Ar.

Anyway, that is what we did.

What were the questions?

  1. when is A a covariance matrix?
  2. is v a covariance matrix?
  3. can we do something similar for constant row sums?

For question 3, I have the feeling there’s something we can do, but I haven’t put the pieces together. It can’t be same as for question 1, because a matrix that weights the rows of the data matrix is not the same size as one that weights the columns (in general). Perhaps we need to look at XXT instead of XTX. But then things don’t drop out if A and Z are defined the same way, so…. This is still open.

For question 1, two partial answers are:
If A = v\ \Lambda\ , i.e. A is formed by weighting the columns of an orthonormal eigenvector matrix v by the \sqrt{\text{eigenvalues}}\ , then A is a covariance matrix of the same form as the one from which we got v.

And the construction that gave us Ar (weighting the rows of A) works for arbitrary weights.

I’ll explain.

Suppose we have a data matrix X. Suppose we compute a “dispersion matrix”, by which I mean a multiple of X^T X\ :

c = \alpha\ X^T X\ .

(Of course, \alpha is usually 1/(N-1).)

Suppose we do an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ;

then define a matrix

A = v\ \Lambda\ ,

and finally define new data using A as the transition matrix:

X^T = A\ Z^T\ .

(I’ll remind you that our observations are columns of X^T and Z^T\ ; and a transition matrix applied to new components gives us old components. If you need to, look here.)

I claim that

A = \alpha\ X^T Z\ ,

i.e., that A is the “cross dispersion matrix” between X and Z.

Please note, in particular, that the data X need not be centered.

The proof goes just the way it did before.

We will need a couple of things. From

X^T = A\ Z^T\ ,

we get

Z = X A^{-T}\ .

And also from

A^T = \Lambda\ v^T

we get

A^{-T} = v^{-T}\Lambda^{-1} = v \Lambda^{-1} (because v is orthogonal and \Lambda is diagonal).

Then

\alpha\ X^T Z

= \alpha\ X^T (X\ A^{-T})

= \alpha X^T X\ A^{-T}

= c\ A^{-T}

= (v\ \Lambda^2v^T) (v \Lambda^{-1})

= v\ \Lambda^2 \Lambda^{-1}

= v\ \Lambda

= A\ .

QED.

(If we had started with \alpha\ Z^T X instead of \alpha\ X^T Z\ , we would have ended up with A^T\ , which is a good thing, since \alpha\ Z^T X = (\alpha\ X^T Z)^T\ . Things are just a little different because the cross-dispersion in general, and the cross-covariance in particular, are not symmetric.)

This result, that A is the corresponding dispersion matrix between X and Z means that we can say something about Z without ever computing it.

(I’m not particularly fond of that, but then I haven’t been trying to deal with thousands of observations and a hundred variables. I’m used to datasets small enough to hold in my hand, as it were.)

Example

We could verify this, in particular, for the raw data that we used for the previous Basilevsky computations. We knew it was true for centered data or for standardized data, i.e. that it was true for the covariance matrix or the correlation matrix; we have seen now that it is true for an arbitrary dispersion matrix

\alpha\ X^T X\ .

It didn’t depend on 1/(N-1) or on X being centered. If we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha\  X^T Z\ .

Well, let’s do it. Here’s the raw data that we used here:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will take \alpha = \pi\ , just so that it can’t be mistaken for anything else (I need some number; let’s make sure that if anything interesting happens, we see it. OK, nothing interesting will happen, not that kind of interesting, anyway.)

We compute

c = \pi X^T X = \left(\begin{array}{lll} 174.934 & 1578.09 & 720.323 \\ 1578.09 & 25917.9 & 11760.3 \\ 720.323 & 11760.3 & 5715.79\end{array}\right)

Now that I’ve written it once, let me emphasize: the weird expression \pi X^T X occurs only because I needed a number for my abstract scaling factor \alpha\ . In retrospect, perhaps I should not have used greek letters for both abstract and numerical.)

We get an eigendecomposition, but all I want right now is \Lambda\ , the diagonal matrix of \sqrt{\text{eigenvalues}}\ . OK, the eigenvalues are

\{31415.9,\ 314.159,\ 78.5398\}

and \Lambda is

\Lambda = \left(\begin{array}{lll} 177.245 & 0. & 0. \\ 0. & 17.7245 & 0. \\ 0. & 0. & 8.86227\end{array}\right)

On second thought, I really should give you the orthonormal eigenvector matrix for reference:

v = \left(\begin{array}{lll} 0.0554414 & -0.0173938 & 0.99831 \\ 0.907331 & -0.416446 & -0.0576447 \\ 0.416745 & 0.908994 & -0.00730637\end{array}\right)

The matrix A is

A = v \Lambda = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

The new data Z is given by

Z = X\ A^{-T} = \left(\begin{array}{lll} -0.0206618 & -0.359791 & 0.24738 \\ 0.0665383 & -0.299765 & -0.282435 \\ 0.169678 & -0.247659 & 0.213996 \\ 0.286832 & -0.0841641 & -0.317697 \\ 0.44988 & 0.17488 & 0.174979\end{array}\right)

Finally, the cross-dispersion between Z and X is given by

\alpha\ X^T Z = \pi\ X^T Z =\left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

And that is the same as A, as it should be.

We continue with the answer to question 1. We found that if we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha X^T Z\ .

What about the matrix Ar, found by weighting the rows of A by the inverse standard deviations?

There was nothing special about the inverse standard deviations. If we weight the columns of X by a diagonal matrix \Delta\ , we can find a corresponding matrix Ar.

Let’s work that out. I am going to take weights (1,2,4} in place of the inverse standard deviations. That is, I take the diagonal matrix

\Delta = \left(\begin{array}{lll} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4\end{array}\right)

and scale the columns of X and the rows of A:

X_d = X\ \Delta = \left(\begin{array}{lll} 2.09653 & -1.58697 & -29.356 \\ -1.75252 & 26.1153 & 0.414196 \\ 3.63702 & 58.0128 & 34.1178 \\ 0.0338101 & 93.8239 & 79.407 \\ 5.91502 & 141.939 & 144.149\end{array}\right)

A_d = \Delta\ A = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

(Perhaps now is a good time to remind you that we can scale the columns of a matrix by post-multiplying it by a diagonal matrix of the scale factors; and we can scale the rows of a matrix by pre-multiplying it by a diagonal matrix.)

i expect that

A_d = \alpha\ X_d^T Z = \pi\ X_d^T Z\ .

Compute the RHS:

 \pi\ X_d^T Z\ = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

and that is, indeed, A_d\ .

So, the construction of Ar using the inverse standard deviations could be done with any set of weights instead, and applied to any dispersion matrix, not just the A = Ac from centered data.

In other words, I know two kinds of transition matrices that are dispersion matrices: A between Z and X, Ar (Ad in this case) between a newer Z and the older X.

I do not have an “iff” result. I can’t guarantee that these are the only kinds of matrices that work.

Question 2. Is v a dispersion matrix?

No, at least not between its new data Z and the old X, i.e. not the way A is. (I wonder if there’s a way to make it happen. Would it ever be worthwhile if it were possible?)

This time, let’s first confirm the failure numerically.

We need new data from v instead of from A. Rather than destroy the Z’s, call the new data Y. We should have

Y = X\ v\ ,

i.e.

Y^T = v^T X^T\ ,

i.e.

v^{-T}Y^T = X^T = v\ Y^T\ .

That last equality corresponds to X^T= A\ Z^T\ , so we’re good.

Here’s the new data Y from old data X using the transition matrix v:

Y = X\ v =\left(\begin{array}{lll} -3.66221 & -6.37712 & 2.19235 \\ 11.7936 & -5.31319 & -2.50302 \\ 30.0746 & -4.38964 & 1.89649 \\ 50.8397 & -1.49177 & -2.81552 \\ 79.7392 & 3.09967 & 1.55071\end{array}\right)

now compute \alpha\ X^T Z\ , i.e. in this case,

\pi\ X^T Y\ = \left(\begin{array}{lll} 1741.74 & -5.46444 & 78.4071 \\ 28504.6 & -130.83 & -4.5274 \\ 13092.4 & 285.569 & -0.573841\end{array}\right)

and that is to be compared with v – but I’ve already done the algebra, and I know they should have proportional columns, so I took the ratio instead of the difference, i.e. here is (\pi\ X^T Y) / v\ , term by term:

\left(\begin{array}{lll} 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398\end{array}\right)

Those should look familiar: they are, as they should be, the eigenvalues of \pi\ X^T X\ :

\{31415.9,\ 314.159,\ 78.5398\}

So the cross-dispersion matrix between X and Y is a column-weighted version of v, and the weights are the eigenvalues.

Time to do the algebra. This time, Y = Xv. We start with the dispersion matrix…

\alpha\ X^T Y

= \alpha\ X^T\ (X\ v)

= \alpha\ X^T X\ v

= c\ v

= ( v\ \Lambda^2\ v^T)\ v

= v\ \Lambda^2

and that’s what we got: the cross-dispersion \alpha\ X^T Y is the column-weighted v, and the weights are the eigenvalues \Lambda^2\ .

PCA / FA Basilevsky: with data

Introduction

I am looking into Basilevsky because he did something I didn’t understand: he normalized the rows of the Ac matrix (which I denoted Ar). We discussed that, and we illustrated the computations, in the previous two posts. But we did those computations without having any data. I want to take a closer look, with data.

In contrast to As and Ac, which are eigenvector matrices, his Ar matrix is not. Nevertheless, as I said, his Ar is not without some redeeming value. In fact, all three of the A matrices have the same redeeming value.

I will show, first by direct computation and then by proof, that each of these A matrices is the cross covariance between X data and Z data.

(That doesn’t mean I want to use his Ar matrix; all it means is that I learned something new about an A matrix.)

Recall the raw data of example 9. We’re about to go get the A matrices and the new data Z for both standardized and centered data.

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Yes, we’ve seen all these forms of the data before, but I want this post to be self-contained.

Now, I’m going to switch to Basilevsky’s notation, using Z for F^T. If you will, we’re mixing Harman’s model

Z = A\ F\ \text{or}\ X = F^T\ A^T

where Z = X^T, and F^T is the new data (“scores”) WRT the transition matrix (“loadings”) A; and Jolliffe’s model

Z = X\ V\ \text{or}\ X = Z\ V^{-1}

where Z is the new data (“scores”)….

We start with Harman’s model

Z = A\ F\ .

eliminate Z:

X^T = A\ F\ ,

and we know that F^T is the new data corresponding to X WRT the transition matrix A .

Now we reintroduce Z, redefining it as Jolliffe uses it, for the new data; that is, we let

Z = F^T

and get

X^T = A\ Z^T\ .

If I had to do it all over again, I would probably not have used Basilevsky’s notation. This is the notation I had for the previous two posts, but i’ll admit, it’s confusing.

But for this post, as for the previous two, that is our model.

First, let’s go get the standardized and the centered data (“X”), the corresponding A matrices, and the corresponding new data (“Z”).

Basilevsky Standardized

If you are comfortable with these calculations – which we’ve seen before for this very data – you should move right along. If, on the other hand, you need a refresher, here it is. And if you need to see the calculations explained, here

I standardize the data and call it Xs:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Get the eigenvalues of the correlation matrix:

\lambda s = \{2.43279,\ 0.565781,\ 0.00143288\}

Get the SVD (Singular Value Decomposition) of Xs:

Xs = us\ ws\ vs^T.

ws = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Form the diagonal matrix of \sqrt{\text{eigenvalues}}\

\Lambda s = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Form the weighted eigenvector matrix A = v\ \Lambda

As = vs\ \Lambda s = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Get the scores FT as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). This is as close to the u matrix itself as I need to get. But instead of calling these F^T, we are calling them Z:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Basilevsky centered

Recall the raw data:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will center the data and call it Xc. Get the column means…

\{1.98597,31.8304,11.4366\}

and subtract them from each column to get Xc:

Xc = \left(\begin{array}{lll} 0.110558 & -32.6239 & -18.7756 \\ -3.73849 & -18.7728 & -11.333 \\ 1.65105 & -2.82404 & -2.90714 \\ -1.95216 & 15.0815 & 8.41516 \\ 3.92905 & 39.1392 & 24.6006\end{array}\right)

Proceeding as before, we get the SVD (Singular Value Decomposition) and display w…

Xc = uc\ wc\ vc^T

wc = \left(\begin{array}{lll} 66.0141 & 0. & 0. \\ 0. & 5.01572 & 0. \\ 0. & 0. & 1.54942 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues of the covariance matrix (and that’s the only change we make after switching from standardized to centered data):

\lambda c = \{1089.47,6.28937,0.600172\}

Get the diagonal matrix of \sqrt{\text{eigenvalues}} of the covariance matrix…

\Lambda c = \left(\begin{array}{lll} 33.0071 & 0. & 0. \\ 0. & 2.50786 & 0. \\ 0. & 0. & 0.774708\end{array}\right)

Get the weighted eigenvector matrix A = v\ \Lambda

Ac = vc\ \Lambda c = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Get the scores F^T as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). except that we are calling them Z, as Basilevsky does.

Zc = \left(\begin{array}{lll} 1.13849 & 0.83693 & 0.73261 \\ 0.669241 & -1.03777 & 0.41852 \\ 0.116099 & 0.683164 & -1.59786 \\ -0.519249 & -1.14658 & -0.340197 \\ -1.40458 & 0.664253 & 0.786923\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Key: the A matrices are cross-covariances

We have As and Ac:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

We have new data Zs and old Xs:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

and old:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Now, what is the cross-covariance between Xs and Zs? I need either

\frac{X^T\ Z}{N-1}

or

\frac{Z^T\ X}{N-1}\ ,

and N-1 = 4. Note that they are the transposes of each other. We compute the former:

\frac{Xs^T\ Zs}{4} = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Not symmetric. But it’s a cross-covariance, between two different variables, not the covariance of one variable.

Now recall As:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

They are the same.

For centered data:

\frac{Xc^T\ Zc}{4} = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

and we recall

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Again, they are the same.

Now, get Ar, by normalizing the rows of Ac:

Ar = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

and compute the cross-covariance between Xs and Zc:

\frac{Xs^T\ Zc}{4} = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

0nce again, they are the same. We have good reason, now, to believe that

\text{loadings}\ A = \frac{X^T\ Z}{N-1} =  covariance(X,Z)\ ,

where Z is the new data WRT A (i.e. the scores, as most people define them).

That’s what I set out to show.

We have seen it for As, relating standardized data Xs to the scores Zs; for Ac, relating centered data Xc to the scores Zc; and finally for Ar, relating standardized data Xs to the scores Zc.

Proving it

Let’s take a look at the proofs. The question in the back of my mind is: when is a transition matrix also a cross-covariance matrix? (I will confess that I do not have a universal answer; it appears that I can generalize from the Ar matrix somewhat.)

That is, given old data X and new data Z WRT transition matrix T,

X^T = T\ Z^T

X = Z\ T^T\ .

when can we conclude that T is the covariance between X and Z?

For either standardized data or centered data X, one argument works. We have

A = V \Lambda\ ,

where \Lambda is the diagonal matrix of \sqrt{\text{eigenvalues}} and V is an orthogonal eigenvector matrix of the covariance matrix c; that is, we have the eigendecomposition

\Lambda^2 = V^T\ c\ V = V^T \frac{X^T\ X}{N-1}\ V

\left(\text{or}\  \frac{1}{N-1}\ X^T\ X = V \Lambda^2\ V^T\right)\ .

We have the decomposition of X:

X = Z\ A^T\ \left(\text{or}\ X^T = A\ Z^T\right)\ .

For simplicity, i’m going to assume that A is invertible (i.e. that \Lambda is invertible; i.e. that all the eigenvalues are nonzero): then

Z = X\ A^{-T} \left(\text{or}\ Z^T = A^{-1}X^T\right)

and I note that

A^{-T} = V^{-T}\ \Lambda^{-T} = V\ \Lambda^{-T} = V\ \Lambda^{-1}\ .

Let me name those three equations: we have an eigendecomposition of the covariance matrix, the decomposition of X, and the definition of A.

Now we compute the cross-covariance. There may be simpler ways, and there are certainly other paths to the same answer, but this works for me. As I said, I’m going to assume that A is invertible, because for one thing we can cope if it is not; and for another, if it is not, we would have fewer Z variables than X variables.

\frac{1}{N-1}\ X^T\ Z

\frac{1}{N-1}\ X^T\ \left(X\ A^{-T}\right) (the decomposition of X)

= c\ A^{-T} (definition of the covariance matrix)

= \left(V\ \Lambda^2\ V^T\right)\ \left(V\ \Lambda^{-1}\right) (the eigendecomposition of c and definition of A)

= \left(V\ \Lambda^2\right)\ \Lambda^{-1} = V\  \Lambda (simplify)

= A (definition of A)

QED.

The eigendecomposition is crucial. Without it, we are left hanging at

c\ A^{-T}\ ,

Now let us try to prove it for Ar. It’s a surprisingly simple consequence of what we just did. We need only one new equation, the relationship between standardized and centered data:

Xs = Xc\ \sigma^{-1}\ ,

which says that we get standardized data from centered data by dividing each column by its standard deviation; and we recall that scaling columns can be accomplished by post-multiplication by a diagonal matrix. (After all, that’s how we get the A matrix from the V matrix.)

What we actually need is the transpose:

Xs^T = \sigma^{-1}\ Xc^T\ .

Let us compute the cross covariance of Xs and Zc.

\frac{ Xs^T\ Zc}{N-1}

= \frac{\sigma^{-1}\ Xc^TZc}{N-1} (Xs in terms of Xc)

= \sigma^{-1}\ Ac (by the previous result for Ac!)

= Ar (by definition of Ar)

QED.

The key, then, was that Ar was to Ac as Xs was to Xc (OK, not exactly: scale the rows of one and the columns of the other). I don’t know of any other suitable relationships, but if we found one to use, we should apply it to Xc and Ac.

Let me be more specific. We never really used the fact that \sigma was a diagonal matrix of standard deviations. If we had constructed completely different data Xd by scaling the columns of centered data Xc, using an arbitrary diagonal matrix \delta^{-1}\ , so that

Xd = Xc\ \delta^{-1}\ ,

and if we had defined a new A matrix by scaling the rows of Ac using the same diagonal matrix:

Ad = \delta^{-1}\ Ac\ ,

then we would find that Ad was the cross covariance of Xd and Zc.

I need to look at this some more.

PCA / FA Basilevsky: what he did.

Introduction

The previous post discussed something interesting that Basilevsky did. (Search the bibliography, he’s there.) I can’t say I like it — because it leads to a basis which is non-orthogonal, and whose vectors are not eigenvectors of either the correlation or covariance matrices.

But I had to understand it.

I don’t know how widespread it is nowadays, but even if Basilevsky is the only author who does it, it’s another example of the lack of standardization (no pun intend, I swear) in PCA / FA. This branch of applied statistics is like the mythical Wild West: everybody’s got a gun and there are bullets flying all over the place. Law and order have not arrived yet.

OTOH, it’s nice to find something different in just about every book I open.

Let me set the stage again. What we have is the following 3 models.

Xc^T = Ac\ Zc^T

Xs^T = Ar\ Zc^T

Xs^T = As\ Zs^T\ .

where X is data (variables in columns, observations in rows, and there are N rows), Z is principal components (PCs), and A is constructed from an eigenvector matrix V using the diagonal matrix \Lambda of \sqrt{\text{eigenvalues}}.

Now is a good time, before we need to be reminded of it, to emphasize that all three models are of the same form:

X^T = A\ Z^T\ .

In the first model,

Xc^T = Ac\ Zc^T\ ,

Xc is centered data, and

Ac = Vc\ \Lambda c\ ,

where Vc is an orthogonal eigenvector matrix of the covariance matrix c = \frac{Xc^T\ Xc}{N-1}\ :

\Lambda c^2 = Vc{^{-1}}\ c\ Vc = Vc^T\ c\ Vc\ .

In the third model,

Xs^T = As\ Zs^T\ ,

Xs is standardized data, and

As = Vs\ \Lambda s\ ,

where Vs is an orthogonal eigenvector matrix of the correlation matrix r = \frac{Xs^T Xs}{N-1}\ :

\Lambda s^2 = Vs{^{-1}}\ r\ Vs = Vs^T\ r Vs\ .

In the second model,

Xs^T = Ar\ Zc^T

we have the standardized data Xs as in the third model, but the PCs Zc from the first model. The transition matrix Ar may be found by normalizing the rows of Ac:

Ar = \sigma{^{-1}}\ Ac\ ,

where \sigma is a diagonal matrix of the standard deviations (\sqrt{\text{variances}}) of Xc.

The example

Basilevsky gives us a covariance matrix c. (So we have no data, for this example.)

c = \left(\begin{array}{llll} 471.51 & 324.71 & 73.24 & 4.35 \\ 324.71 & 224.84 & 50.72 & 2.81 \\ 73.24 & 50.72 & 11.99 & 1.23 \\ 4.35 & 2.81 & 1.23 & 0.98\end{array}\right)

We will want both the correlation matrix r and the diagonal matrix \sigma\ , so let’s get them. \sigma\ is to be a diagonal matrix of the square roots of the variances, hence of the square roots of the diagonal of c. (i should call it s, because it’s made up of sample standard deviations, but I don’t want “s” as well as “As”, “Vs”, etc.) Here it is:

\sigma = \left(\begin{array}{llll} 21.7143 & 0. & 0. & 0. \\ 0. & 14.9947 & 0. & 0. \\ 0. & 0. & 3.46266 & 0. \\ 0. & 0. & 0. & 0.989949\end{array}\right)

Once we have \sigma\ , the fastest way for me to get the correlation matrix is

r = \sigma^{-1}\ c\ \sigma^{-1}\ :

r = \left(\begin{array}{llll} 1. & 0.9973 & 0.9741 & 0.2024 \\ 0.9973 & 1. & 0.9769 & 0.1893 \\ 0.9741 & 0.9769 & 1. & 0.3588 \\ 0.2024 & 0.1893 & 0.3588 & 1.\end{array}\right)

From the covariance matrix:

So let’s do the eigendecompositions. First from the covariance matrix, which will give us the first model. We get eigenvalues…

\lambda c = \{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

an eigenvector matrix…

V = \left(\begin{array}{llll} -0.8164 & -0.0236 & -0.5705 & 0.0862 \\ -0.5633 & -0.1048 & 0.7665 & -0.2903 \\ -0.1272 & 0.5679 & 0.2742 & 0.7656 \\ -0.0075 & 0.816 & -0.1089 & -0.5676\end{array}\right)

These are not quite Basilevsky’s numbers. The 4th eigenvector is off, by as much as .0014 . ah, the 2nd is off too, in two terms. Not great, but not a big deal. (Actually, I want to check his numbers, but not today. I will be using my numbers for eigenvectors and eigenvalues.)

(I suspect that most packages which do PCA / FA use iterative schemes to find one eigenvector after another; I’m getting used to seeing more and more errors as we move to the right in the eigenvector matrices. I think they’re suffering from accumulated round-off errors. Of course, I generally take it for granted that my answers are the right ones, but see below.)

Now I’m going change the signs to agree with Basilevsky:

Vc = \left(\begin{array}{llll} 0.8164 & -0.0236 & 0.5705 & 0.0862 \\ 0.5633 & -0.1048 & -0.7665 & -0.2903 \\ 0.1272 & 0.5679 & -0.2742 & 0.7656 \\ 0.0075 & 0.816 & 0.1089 & -0.5676\end{array}\right)

We will discover that Basilevsky should have changed the sign of the second column, too, but he didn’t so I won’t either.

Let’s keep going. We get the diagonal matrix \Lambda c of \sqrt{\text{eigenvalues}}\

\Lambda c = \left(\begin{array}{llll} 26.5891 & 0. & 0. & 0. \\ 0. & 1.16153 & 0. & 0. \\ 0. & 0. & 0.945676 & 0. \\ 0. & 0. & 0. & 0.311697\end{array}\right)

and Ac = Vc\ \Lambda c

Ac = \left(\begin{array}{llll} 21.7075 & -0.0274391 & 0.539517 & 0.0268828 \\ 14.9764 & -0.121762 & -0.724841 & -0.0904817 \\ 3.38092 & 0.659679 & -0.259284 & 0.238627 \\ 0.199249 & 0.947838 & 0.102961 & -0.176925\end{array}\right)

Basilevsky then displays equations for the variables, using Vc not Ac:

\begin{array}{c} \ \text{Xc1}=0.8164 \ \text{Zu1}-0.0236 \ \text{Zu2}+0.5705 \ \text{Zu3}+0.0862 \ \text{Zu4} \\ \ \text{Xc2}=0.5633 \ \text{Zu1}-0.1048 \ \text{Zu2}-0.7665 \ \text{Zu3}-0.2903 \ \text{Zu4} \\ \ \text{Xc3}=0.1272 \ \text{Zu1}+0.5679 \ \text{Zu2}-0.2742 \ \text{Zu3}+0.7656 \ \text{Zu4} \\ \ \text{Xc4}=0.0075 \ \text{Zu1}+0.816 \ \text{Zu2}+0.1089 \ \text{Zu3}-0.5676 \ \text{Zu4}\end{array}

What he has written is

basil

i.e.

Xc^T = Vc\ Zu^T\ ,

where the Zu are unstandardized, in contrast to Zc and Zs which are standardized. (The Zu were computed from Vc, not from Ac.) I remark that Basilevsky did not compute Ac.

From the correlation matrix:

Now let’s do the correlation matrix, which will give us the third model. Here are the eigenvalues…

\lambda s = \{3.0569,\ 0.9283,\ 0.013,\ 0.0018\}

and eigenvectors…

Vs = \left(\begin{array}{llll} 0.5627 & 0.1712 & 0.5688 & 0.5749 \\ 0.5624 & 0.1848 & 0.1817 & -0.7852 \\ 0.5696 & -0.0008 & -0.7906 & 0.2248 \\ 0.2065 & -0.9677 & 0.136 & -0.0485\end{array}\right)

Let’s stop there for a moment.

I want to compare the two V’s: recall Vc…

Vc = \left(\begin{array}{llll} 0.8164 & -0.0236 & 0.5705 & 0.0862 \\ 0.5633 & -0.1048 & -0.7665 & -0.2903 \\ 0.1272 & 0.5679 & -0.2742 & 0.7656 \\ 0.0075 & 0.816 & 0.1089 & -0.5676\end{array}\right)

He says, “… even though \alpha_{42} is positive as a covariance loading, it becomes highly negative as a correlations loadings [sic].”

He correctly points out that the 2nd column of Vc and the second column of Vs would have the same signs if one were multiplied by negative 1, and correctly concludes that we can take care of the sign of \alpha_{42} by changing the sign of the second column. (At least, I hope that what he said. And \alpha is his notation for an element of what i’m calling V.)

Unfortunately, he missed the fact that the 3rd columns cannot be made to have the same signs. The 3rd column of Vc has 2 negative signs; the 3rd column of Vs has only 1.

That is, even though Vs_{23} is positive, Vc_{23} is negative, when all other signs are the same.

This serves as a reminder that Vc and Vs, eigenvectors of the covariance matrix and of the correlation matrix, do not have a simple relationship to each other.

i will also point out that both Ac and Ar will be derived from Vc by scaling by positive numbers, and As will be derived from Vs by scaling by positive numbers, so this sign discrepancy between the (2,3) elements of Vs and Vc will propagate to the (2,3) elements of Ac and Ar versus As.

Now, lets get back to doing our thing. Get \Lambda s\

\Lambda s = \left(\begin{array}{llll} 1.74839 & 0. & 0. & 0. \\ 0. & 0.963503 & 0. & 0. \\ 0. & 0. & 0.114029 & 0. \\ 0. & 0. & 0. & 0.0424615\end{array}\right)

and As…

As = \left(\begin{array}{llll} 0.9839 & 0.1649 & 0.0649 & 0.0244 \\ 0.9832 & 0.1781 & 0.0207 & -0.0333 \\ 0.9959 & -0.0008 & -0.0901 & 0.0095 \\ 0.361 & -0.9324 & 0.0155 & -0.0021\end{array}\right)

Yes and yes, Basilevsky and I agree. Even to all his signs.

We use As and get the equations

\begin{array}{c} \ \text{Xs1}=0.9839 \ \text{Zs1}+0.1649 \ \text{Zs2}+0.0649 \ \text{Zs3}+0.0244 \ \text{Zs4} \\ \ \text{Xs2}=0.9832 \ \text{Zs1}+0.1781 \ \text{Zs2}+0.0207 \ \text{Zs3}-0.0333 \ \text{Zs4} \\ \ \text{Xs3}=0.9959 \ \text{Zs1}-0.0008 \ \text{Zs2}-0.0901 \ \text{Zs3}+0.0095 \ \text{Zs4} \\ \ \text{Xs4}=0.361 \ \text{Zs1}-0.9324 \ \text{Zs2}+0.0155 \ \text{Zs3}-0.0021 \ \text{Zs4}\end{array}

These Xs are standardized, and so are the Zs, in contrast to the Xc and Zu of the first set of equations:

\begin{array}{c} \ \text{Xc1}=0.8164 \ \text{Zu1}-0.0236 \ \text{Zu2}+0.5705 \ \text{Zu3}+0.0862 \ \text{Zu4} \\ \ \text{Xc2}=0.5633 \ \text{Zu1}-0.1048 \ \text{Zu2}-0.7665 \ \text{Zu3}-0.2903 \ \text{Zu4} \\ \ \text{Xc3}=0.1272 \ \text{Zu1}+0.5679 \ \text{Zu2}-0.2742 \ \text{Zu3}+0.7656 \ \text{Zu4} \\ \ \text{Xc4}=0.0075 \ \text{Zu1}+0.816 \ \text{Zu2}+0.1089 \ \text{Zu3}-0.5676 \ \text{Zu4}\end{array}

Basilevsky’s model:

Basilevsky has shown us that if we define

Ar = \sigma^{-1}Ac\ ,

then we would have the second model,

Xs^T = Ar\ Zc^T\ .

here’s Ar…

Ar = \left(\begin{array}{llll} 0.9997 & -0.0013 & 0.0248 & 0.0012 \\ 0.9988 & -0.0081 & -0.0483 & -0.006 \\ 0.9764 & 0.1905 & -0.0749 & 0.0689 \\ 0.2013 & 0.9575 & 0.104 & -0.1787\end{array}\right)

And here’s the corresponding equations…

\begin{array}{c} \ \text{Xs1}=0.9997 \ \text{Zc1}-0.0013 \ \text{Zc2}+0.0248 \ \text{Zc3}+0.0012 \ \text{Zc4} \\ \ \text{Xs2}=0.9988 \ \text{Zc1}-0.0081 \ \text{Zc2}-0.0483 \ \text{Zc3}-0.006 \ \text{Zc4} \\ \ \text{Xs3}=0.9764 \ \text{Zc1}+0.1905 \ \text{Zc2}-0.0749 \ \text{Zc3}+0.0689 \ \text{Zc4} \\ \ \text{Xs4}=0.2013 \ \text{Zc1}+0.9575 \ \text{Zc2}+0.104 \ \text{Zc3}-0.1787 \ \text{Zc4}\end{array}

followed immediately by the second set computed earlier…

\begin{array}{c} \ \text{Xs1}=0.9839 \ \text{Zs1}+0.1649 \ \text{Zs2}+0.0649 \ \text{Zs3}+0.0244 \ \text{Zs4} \\ \ \text{Xs2}=0.9832 \ \text{Zs1}+0.1781 \ \text{Zs2}+0.0207 \ \text{Zs3}-0.0333 \ \text{Zs4} \\ \ \text{Xs3}=0.9959 \ \text{Zs1}-0.0008 \ \text{Zs2}-0.0901 \ \text{Zs3}+0.0095 \ \text{Zs4} \\ \ \text{Xs4}=0.361 \ \text{Zs1}-0.9324 \ \text{Zs2}+0.0155 \ \text{Zs3}-0.0021 \ \text{Zs4}\end{array}

There are some satisfying similarities. The first columns differ significantly only in the last entry, and even that is .2 vs .36 . For the second column, as we said, if we had changed its sign, all the signs would match up, and in particular the single large terms would then be +0.9324 and +0.9575 .

I should emphasize that he calls both Ar and As, “correlation loadings”, at one point or another.

We could also severely round both As and Ar to confirm their similarity:

As = \left(\begin{array}{llll} 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 0.5 & -1. & 0 & 0\end{array}\right)

Ar = \left(\begin{array}{llll} 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 0 & 1. & 0 & 0\end{array}\right)

As he said, we shoulda changed the sign of the second column of Vc.

Without data, there isn’t a whole lot more we can do here. But I do want to show you one last thing: let’s take a closer look at Ar. We computed it from

Ar = \sigma^{-1}Ac

and the rows of Ar are supposed to be of unit length. As usual, the fastest way to compute the lengths (more precisely, the squared lengths) of vectors in an array A, is to compute either A\ A^T or A^T\ A\ . The former gives us dot products of all the rows with each other, including with themselves, while the latter gives us all the dot products among the columns.

So if we compute Ar Ar^T\ , the diagonal elements should be 1. We see that they are:

Ar Ar^T = \left(\begin{array}{llll} 1. & 0.997272 & 0.974077 & 0.202363 \\ 0.997272 & 1. & 0.976861 & 0.189303 \\ 0.974077 & 0.976861 & 1. & 0.358825 \\ 0.202363 & 0.189303 & 0.358825 & 1.\end{array}\right)

In fact, we see that Ar Ar^T\ is equal to our old friend the correlation matrix r:

r = \left(\begin{array}{llll} 1. & 0.997272 & 0.974077 & 0.202363 \\ 0.997272 & 1. & 0.976861 & 0.189303 \\ 0.974077 & 0.976861 & 1. & 0.358825 \\ 0.202363 & 0.189303 & 0.358825 & 1.\end{array}\right)

Don’t be distracted, however, by his scaling the rows. The new basis – which gives us the PCs Zc from Xs – is specified by the columns of Ar, just as the columns of Ac and As were the bases that gave us Zc from Xc, and Zs from Xs. (As I said, all three models are of the same form: X^T = A\ Z^T\ . And as I’ve said before, we should keep our eyes on the linear algebra.)

So what does the basis Ar look like? (I’ve said some terrible things about it; let me justify them.) For this, we compute Ar^T\ Ar\ :

Ar^T\ Ar = \left(\begin{array}{llll} 2.9908 & 0.369352 & -0.0756215 & 0.0265267 \\ 0.369352 & 0.953093 & 0.0856776 & -0.157942 \\ -0.0756215 & 0.0856776 & 0.0193785 & -0.023426 \\ 0.0265267 & -0.157942 & -0.023426 & 0.0367285\end{array}\right)

That the diagonals are not 1 says the columns are not of unit length; that the off-diagonal elements are not zero says the columns are not mutually orthogonal.

Ar is not a very nice basis at all.

I think it has some redeeming social value, but we’ll look at that when we have data.

Looking at eigenvectors

Let’s check one more thing. The columns of Vc are eigenvectors of the covariance matrix: we should have

c\ Vc = Vc\ \Lambda c^2\ .

Do we? We do. (When I compute the difference, I get zero, to within 10^{-13}\ .)

And columns of Ac are eigenvectors of the covariance matrix, too:

c\ Ac = Ac\ \Lambda c^2\ .

Similarly, columns of Vs and As are eigenvectors of r, and we should have both

r\ Vs = Vr\ \Lambda s^2

and

r\ As = As\ \Lambda s^2\ .

We do.

(This is why I’m so sure my answers are right: I’ve checked the eigendecompositions.)

What about Ar? Even supposing that its columns are eigenvectors of c, we don’t know what the eigenvalues are. No problem, just see if

r Ar

is proportional to Ar. (That is, see if columns are proportional; see below.)

We have

r\ Ar = \left(\begin{array}{llll} 2.98756 & 0.369966 & -0.0752539 & 0.0261815 \\ 2.98765 & 0.357973 & -0.0770204 & 0.0286877 \\ 2.99806 & 0.524909 & -0.0605794 & 0.0000958891 \\ 0.942999 & 1.02403 & 0.0730146 & -0.154885\end{array}\right)

and when we divide, element by element, by Ar, we get

\left(\begin{array}{llll} 2.98849 & -292.778 & -3.02879 & 21.1478 \\ 2.9913 & -44.0836 & 1.59331 & -4.75413 \\ 3.07054 & 2.75525 & 0.809018 & 0.00139143 \\ 4.68519 & 1.06953 & 0.70202 & 0.866628\end{array}\right)

so we conclude that the columns of Ar are not eigenvectors of r. (Put that the other way: if columns of Ar are eigenvectors, then applying r to them should yield columns which are proportional to the eigenvectors, i.e. proportional to Ar.)

Are the columns of Ar eigenvectors of c? here’s c Ar…

c\ Ar = \left(\begin{array}{llll} 868.064 & 14.8855 & -9.01303 & 2.8942 \\ 599.263 & 10.1172 & -6.30661 & 2.03838 \\ 135.83 & 2.95751 & -1.40195 & 0.391071 \\ 8.55343 & 1.14433 & -0.0179306 & -0.101953\end{array}\right)

divided by Ar…

\left(\begin{array}{llll} 868.334 & -11779.8 & -362.753 & 2337.76 \\ 599.995 & -1245.91 & 130.464 & -337.802 \\ 139.114 & 15.524 & 18.7226 & 5.67474 \\ 42.4968 & 1.19517 & -0.172399 & 0.570459\end{array}\right)

Again, no; the columns of Ar are not eigenvectors of c.

Oh, do you need to see how that calculation works when it comes out right? here’s c Ac…

c\ Ac = \left(\begin{array}{llll} 15346.8 & -0.0370194 & 0.482492 & 0.0026118 \\ 10588. & -0.164275 & -0.648228 & -0.00879074 \\ 2390.24 & 0.890006 & -0.231879 & 0.0231838 \\ 140.865 & 1.27878 & 0.0920784 & -0.0171892\end{array}\right)

and dividing by Ac gives us…

\left(\begin{array}{llll} 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549\end{array}\right)

which says the 4 eigenvalues for the 4 columns are

\{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

and that’s exactly what we got a long time ago:

\lambda c = \{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

The purpose of this post was to demonstrate Basilevsky’s computations. Next time, I’ll do these kinds of calculations with data.

PCA / FA Basilevsky on standardizing. Discussion.

Introduction and review

Basilevsky presents an extremely interesting idea. For all I know, it’s become common in the last 10-20 years, but I hadn’t seen it in any of the other books we’ve looked at.

I’ll tell you up front that he’s going to normalize the rows of an A matrix, specifically the A matrix computed from the eigendecomposition of the covariance matrix.

I’ll also tell you up front that I don’t see any good reason for doing it, but I’m not averse to finding such a reason someday.

Suppose we have centered data X with variables in columns, and N rows (observations). The covariance matrix is c = \frac{X^T\ X}{N-1}\ ), and we find its eigendecomposition. We get a matrix V having the eigenvectors as columns, and we construct a diagonal matrix  \Lambda whose entries are the square roots of the eigenvalues.

That is, we have the decomposition

\Lambda^2 = V^{-1}\ c\ V\ .

Then we define the matrix A = V\ \Lambda\ : each column is an eigenvector, of length \sqrt{\text{eigenvalue}}\ .

For the next few posts, I will use a notation of Basilevsky (and Jolliffe): the principal components will be denoted by Z. (Yes, Harman and I almost invariably use Z = X^T\ , but I need to abandon that notation for a while. Sorry.)

This coincides with Jolliffe’s notation, which was to write

Z = X\ V\ ,

with Z as the principal components, X the data, and V the orthogonal eigenvector matrix.

As we have observed many times, the starting point for making sense of that is the change-of-basis equation for a vector:

x^T = V\ z^T\ ,

where x^T and z^T are column vectors, x is the old components, z is the new components, and V is the transition matrix. (Its columns are the old components of the new basis vectors. Using the symbol V is not an accident: the matrix V from the eigendecomposition is a transition matrix.)

Applied to the entire matrix X, the change-of-basis equation is

X^T = V\ Z^T

Transpose:

X = Z\ V^T

Solve for Z:

X\ V^{-T} = Z.\

But V is orthogonal, V^T = V^{-1}\ , so V^{-T} = V\ :

X\ V = Z\ .

Among the very first things we learned in PCA (in Harman) were two: first, that the Z’s we just defined have maximally redistributed the variance of the X’s: the 1st Z variable has the largest variance of any linear combination of the X’s; the 2nd Z variable has the 2nd largest variance of any other linear combination of the X’s, etc. (All subject to the constraint that the total variance of the Z’s is equal to the total variance of the X’s.)

Second, we learned that if we used the transition matrix A = V\ \Lambda\ , then the Z’s defined by it would be both standardized and uncorrelated.

In our present notation (our X^T is Harman’s Z, and our Z^T is Harman’s F), we would write Harman’s model as

X^T = A\ Z^T\

which, column by column, is precisely the change-of-basis equation

x^T = A\ z^T\ ,

with transition matrix A instead of V.

Or we could write

X = Z\ A^T.

(I will try not to mix these up.)

Let me emphasize that the two Z’s in

Z = X\ V

and

X^T = A\ Z^T

are different. In what follows, I will distinguish them by subscripts (Zu for the first, Zc for the second.)

(It is true that the second set of Z’s could have been found from the first set: just standardize the first set! I had never thought to ask that or check it before, but it would be rather annoying if it were not true. See below.)

We will, in fact, be dealing with both centered data Xc and the corresponding standardized data Xs; we will have corresponding orthogonal eigenvector matrices Vc and Vs; corresponding A matrices Ac and As, and a third one Ar; we will have 3 Z matrices, although a fourth exists in principle.

Zu will be the unstandardized Z from Vc:  Zu = Xc\ Vc

Zc will be the standardized Z from Ac: Xc^T = Ac\ Zc^T

Zs will be the standardized Z from As: Xs^T = As\ Zs^T\ .

(I have tried to minimize the occurence of Z’s as opposed to Zs, and X’s versus Xs, but I can’t reduce the number to zero. Zs and Xs are specific matrices, as opposed to collections of Z’s and X’s, which have apostrophes)

A new basis

We will show that

Xs^T = Ar\ Zc^T\ ,

where Ar can be computed from Ac.

That last is a mix of apples and oranges: standardized Xs, but standardized Zc from the covariance matrix, instead of either standardized Xs and standardized Zs from the correlation matrix, or centered Xc and standardized Zc from the covariance matrix.

As I intimated at the beginning, the Ar is Ac with its rows normalized to length 1.

Let’s work that out. The key is that each row of Ac has a sample std deviation \sigma as its length.

Let’s see that. To compute the mutual dot products of rows of Ac, we could compute

Ac\ Ac^T\ .

The diagonals of the result would be the squared lengths of rows of Ac (the dot products of rows with themselves).

OK. Let me drop the subscript c for the moment:

AA^T = \left(V\ \Lambda\right)\ \left(V\ \Lambda \right)^T = V\  \Lambda\ \Lambda^T\ V^T = V\ \Lambda^2\ V^{-1} = c\ ,

where c is the covariance matrix.

(the eigendecomposition was \Lambda^2 = V^{-1}\ c\ V\ .)

So the diagonal elements of Ac\ Ac^T are not only the squared lengths of the rows, but also the sample variances of Xc.

We want to scale each row of Ac by its length. To scale columns of V, we post-multiplied by the diagonal matrix \Lambda\ . To scale rows of Ac, we will premultiply by a diagonal matrix of inverse standard deviations.

So let \sigma = diagonal matrix of square roots of variances, and let \sigma^{-1} be its inverse. (Ignore the possibility of zero variances; after all, we do that every time we talk about the correlation matrix.)

We want to scale the rows of Ac to unit length to get Ar, so we define

Ar = \sigma^{-1}Ac\ .

We have the relationship between Xc and Zc:

Xc = Zc\ Ac^T\ .

We standardize Xc by scaling columns by standard deviations, so

Xs = Xc\ \sigma^{-1} = Zc\ Ac^T\ \sigma^{-1} = Zc\ Ar^T\ .

FInally, transposing

Xs^T = Ar\ Zc^T\ ,

QED.

Let me lay out all three equations again (I really like to see them all together):

Xc^T = Ac\ Zc^T

Xs^T = Ar\ Zc^T

Xs^T = As\ Zs^T\ .

That’s how I would present Basilevsky’s work. He never introduced Ac, and he computed Zc from the unstandardized Zu:

Zu = Xc\ Vc

Zc = Zu\ \Lambda^{-1}

Standardizing Zu

Yes, as I said, the standardized Zc can be computed by standardizing the Zu; and the variances of the Zu are given by the eigenvalues of the covariance matrix \left(\lambda = \text{diagonal of }\Lambda^2\right)\ . It’s easy enough to confirm.

From our usual definition of Zc,

Xc^T = Ac\ Zc^T\ , so Xc = Zc\ Ac^T\ ,

and the old definition of Zu

Zu = Xc\ Vc\ ,

we could write

Zu =  \left(Zc\ Ac^T\right)\ Vc = Zc\ \left(Vc\ \Lambda\right)^T\ Vc = Zc\  \Lambda\ Vc^T\ Vc = Zc\ \Lambda

i.e.

Zc = Zu\ \Lambda^{-1}\ ,

QED.

Summary and query

Overall, then, I learned two things. One, that standardized Zc and unstandardized Zu, defined by

Xc^T = Ac\ Zc^T

and

Zu = Xc\ Vc

were related in the sensible way; we could get Zc by standardizing Zu.

Two, and more importantly, I learned that we could write

Xs^T = Ar\ Zc^T\ ,

where

Ar = \sigma^{-1}\ Ac\ .

Fine.

Now, why would we want to?

I don’t know.

Basilevsky got there by looking at

Zu = Xc\ Vc

and saying he wanted to standardize both Z and X.

But we already know how to do that:

Xs^T = As\ Zs^T\ ,

with standardized Xs and standardized Zs.

Why doesn’t he do that?

He does.

And he knows and shows that As \ne Ar\ , and Zs \ne Zc\ , so by using Ar he has gotten standardized Zc which are different from standardized Zs.

I am perfectly happy with Zs derived from Xs. I am perfectly happy with Zc derived from Xc. I have no idea why we would want to use Ar to relate Zc to Xs. It’s true, but why do we care?

OK, Ac is not an orthonormal basis, but neither is Ar, so that’s not an improvement. OK, the columns of Ac do not have unit length, but they are mutually orthogonal and they are eigenvectors of the covariance matrix. In contrast, the columns of Ar are neither mutually orthogonal , nor are they eigenvectors (of Xc or of Xs).

I find myself wondering if there is some reason for liking this non-orthogonal basis Ar. Does its failure to be orthogonal tell us something? But it didn’t give us anything new. It says (once again) that

Xs^T = Ar\ Zc^T\ ,

but Zc is found from Xc and the eigenvectors of the covariance matrix:

Xc^T = Ac\ Zc^T\ ,

and Xs is found by standardizing Xc. We don’t need Ar to get the two things it relates.

I think Ar is cute, but I need some reason for going to a basis which isn’t at least orthogonal. Heck, I need a reason for going from the orthogonal Vc to a basis (Ac) whose vectors are not of unit length, but the reason is that it provides standardized Zc.

Finally, I need some reason for abandoning the eigenvectors. I’m in no hurry to substitute Ar for either Ac or As. The eigenvectors have significant properties; what does Ar have to offer that makes up for losing those properties?

If anyone knows….

PCA / FA Example 9: scores & loadings

I want to look at reconstituting the data. Equivalently, I want to look at setting successive singular values to zero.

This example was actually built on the previous one. Before I set the row sums to 1, I had started with

t1 = \left(\begin{array}{lll} 1 & 1 & -3 \\ -1 & 2 & -2 \\ 1 & 3 & -1 \\ -1 & 4 & 1 \\ 1 & 5 & 4\end{array}\right)

I’m going to continue with Harmon’s & Bartholomew’s model: Z = A F, Z = X^T, X is standardized, A is an eigenvector matrix weighted by the square roots of the eigenvalues of the correlation matrix of X.

I want data with one eigenvalue so large that we could sensibly retain only that one. Let me show you how I got that.

Get the SVD (Singular Value Decomposition) of t1… and look at w:

w = \left(\begin{array}{lll} 7.84944 & 0. & 0. \\ 0. & 4.9565 & 0. \\ 0. & 0. & 2.19533 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

OK, keep u and v – just because they’re handy – but redefine w, and compute a new data matrix using this w:

w = \left(\begin{array}{lll} 100 & 0. & 0. \\ 0. & 10 & 0. \\ 0. & 0. & 5 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Let t2 = u\ w\ v^T\ :

t2 = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Standardize t2 to get data X (“the data”):

X = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Get the SVD of X, X = u\ w\ v^T\ , but look at w… (we’ll see u and v later)

w = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues of the correlation matrix and look at the percentages:

\lambda = \{2.43279, 0.565781, 0.00143288\}

\text{percent} = \{81.0929, 18.8594, 0.0477627\}

\text{cumulative percent} = \{81.0929, 99.9522, 100.\}

so our first eigenvalue is 81% of the sum, the second is nearly all the rest, 19%.

Get the diagonal matrix of square roots, \Lambda\ :

\Lambda = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Get A and the scores 2\ u = F^T\ , except that I never want more than 3 columns. (The first 3 columns of 2 u are the components of the new data wrt the A basis.)

A = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

F^T = 2 u = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check by confirming that  F^T\ A^T = X\ , i.e. that we have factored the data matrix.

Let’s quickly compute the reconstituted data from 1 and 2 singular values. I could use square forms of w (w1 and w2) with u and v cut down to u1 and v1 (or u2, v2) to be conformable. To be specific,

w1 = \left(\begin{array}{l} 3.11948\end{array}\right)

and compute X1 = u1\ w1\ v1^T\ :

X1 = \left(\begin{array}{lll} -0.665482 & -0.852582 & -0.856648 \\ -0.68722 & -0.880431 & -0.884631 \\ 0.0472632 & 0.0605512 & 0.06084 \\ 0.155502 & 0.199221 & 0.200171 \\ 1.14994 & 1.47324 & 1.48027\end{array}\right)

or take

w2 = \left(\begin{array}{ll} 3.11948 & 0. \\ 0. & 1.50437\end{array}\right)

and then X2 = u2\ w2\ v2^T\ :

X2 = \left(\begin{array}{lll} 0.0373248 & -1.13549 & -1.12105 \\ -1.24659 & -0.655262 & -0.674191 \\ 0.549727 & -0.141712 & -0.128192 \\ -0.651287 & 0.523988 & 0.503694 \\ 1.31082 & 1.40848 & 1.41974\end{array}\right)

You might note that my “1″ and “2″ indicate how many singular values I retained.

But what I was saying about leaving u and v untouched is that I can keep u and v full-size

u = \left(\begin{array}{lllll} 0.44229 & 0.533396 & 0.391409 & 0.605413 &   -0.0119286 \\ 0.456737 & -0.424532 & 0.190168 & -0.0677073 &   0.755259 \\ -0.0314119 & 0.381346 & -0.782255 & 0.201541 &   0.448384 \\ -0.103349 & -0.612313 & -0.198473 & 0.740037 &   -0.165366 \\ -0.764266 & 0.122103 & 0.39915 & 0.201541 &   0.448384\end{array}\right)

v = \left(\begin{array}{lll} -0.482334 & 0.875854 & -0.0152912 \\ -0.617941 & -0.35257 & -0.70274 \\ -0.620889 & -0.329506 & 0.711283\end{array}\right)

and use the following matrices for w, all the same size as X. The full-size w is:

w = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

and the two others are:

w2 = \left(\begin{array}{lll} 3.11948 & 0 & 0 \\ 0 & 1.50437 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

w1 = \left(\begin{array}{lll} 3.11948 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

It’s easy enough to reconstitute the data that way,

X1 = u\ w1\ v^T

X2 = u\ w2\ v^T

and we do get the same answers for X1 and X2.

Now, what if take just the first two columns of F^T, and the first two columns of A (i.e. the first two rows of A^T)? (We saw this last time, too.) That is, I cut F^T down to:

F^T = \left(\begin{array}{ll} 0.88458 & 1.06679 \\ 0.913474 & -0.849064 \\ -0.0628237 & 0.762691 \\ -0.206697 & -1.22463 \\ -1.52853 & 0.244206\end{array}\right)

and I cut A^T down to

A^T = \left(\begin{array}{lll} -0.752315 & -0.963827 & -0.968424 \\ 0.658804 & -0.265197 & -0.247849\end{array}\right)

Do we get X2? Yes.

What if take just the first column of FT, and the first row of A^T? That is, I take FT to be:

F^T = \left(\begin{array}{l} 0.88458 \\ 0.913474 \\ -0.0628237 \\ -0.206697 \\ -1.52853\end{array}\right)

and AT to be

A^T = \left(\begin{array}{lll} -0.752315 & -0.963827 & -0.968424\end{array}\right)

Is their product equal to X1? Yes.

What we just saw is obvious in retrospect, but worth stating explicitly. When we throw away small singular values or eigenvalues, we do not change the scores and loadings. What we change, instead, is the number of scores and loadings used.

When we throw away a nonzero singular value or eigenvalue, we are throwing away a scores-and-loadings pair. We don’t change any of the other scores and loadings.

We have 3 pairs of “scores & loadings”, and then for each pair we compute a product. The individual pairs, and the 3 products, are not affected by our decision to “drop a factor”. Instead of adding all three products, we may choose to add only the first two products, or to keep only the first product.

It is probably customary to say that we are retaining 1 or 2 factors when we retain 1 or 2 scores-loadings pairs.

That “the scores” are selected columns of \sqrt{N-1}\ u tells us that the individual scores & loadings cannot change no matter how many factors we keep.

Throwing away – choosing not to use – a nonzero scores-loadings pair does affect the reconstituted data. X2 is close to X, but not the same, and X1 is quite different.

X2 - X = \left(\begin{array}{lll} 0.000453114 & 0.0208238 & -0.021077 \\ 0.000220147 & 0.0101173 & -0.0102403 \\ -0.000905575 & -0.0416176 & 0.0421236 \\ -0.000229762 & -0.0105592 & 0.0106876 \\ 0.000462075 & 0.0212357 & -0.0214938\end{array}\right)

X1 - X = \left(\begin{array}{lll} -0.702354 & 0.303734 & 0.243327 \\ 0.559586 & -0.215052 & -0.22068 \\ -0.503369 & 0.160646 & 0.231156 \\ 0.806559 & -0.335327 & -0.292835 \\ -0.160422 & 0.0859985 & 0.0390325\end{array}\right)

To look at that another way, the means of X2 are zero, and the variances of X2 (2 factors retained) are:

\{1., 0.999292, 0.999275\}

“Not far from 1″ is an understatement.

The means of X1 are also zero, but the variances of X1 (1 factor retained) are:

\{0.565977, 0.928963, 0.937846\}

You might note that the X1 data is still centered, but it is no longer standardized. That was lost when we threw away a fairly large singular value.

Next, I think I will show you what I reckon I would do, today, for PCA / FA.

PCA / FA Example 8: the pseudo-inverse

introduction

Recall Harman’s or Bartholomew’s model

Z = A F

with Z = X^T\ , X standardized, and A a \sqrt{\text{eigenvalue}}\ -weighted eigenvector matrix, with eigenvalues from the correlation matrix.

We saw how to compute the scores F^T in the case that A was invertible (here). If, however, any eigenvalues are zero then A will have that many columns of zeroes and will not be invertible.

What to do?

One possibility – shown in at least one of the references, and, quite honestly, one of the first things I considered – is to use a particular example of a pseudo-inverse. I must tell you up front that this is not what I would recommend, but since you will see it out there, you should see why I don’t recommend it.

(Answer: it works, it gets the same answer, but computing the pseudo-inverse explicitly is unnecessary. In fact, it’s unnecessary even if we don’t have the Singular Value Decomposition (SVD) available to us.)

Suppose we are given raw data which has, in fact, constant row sums (in this case, 1):

raw = \left(\begin{array}{lll} -1 & -1 & 3 \\ 1 & -2 & 2 \\ \frac{1}{3} & 1 & -\frac{1}{3} \\ -\frac{1}{4} & 1 & \frac{1}{4} \\ \frac{1}{10} & \frac{1}{2} & \frac{2}{5}\end{array}\right)

A couple of those fractions do not display well on my screen, so let me convert to decimals:

\left(\begin{array}{lll} -1. & -1. & 3. \\ 1. & -2. & 2. \\ 0.333333 & 1. & -0.333333 \\ -0.25 & 1. & 0.25 \\ 0.1 & 0.5 & 0.4\end{array}\right)

PCA a la Harman or Bartholomew et al.

Now, we do a PCA using Harman’s or Bartholomew’s model – but my way, not theirs. Standardize the input:

X = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

The X matrix is “the data”.

Get the eigenvalues of the correlation matrix:

\lambda 1 = \{1.86202,1.13798,0.\}

Yes, one of them is identically zero. We learned here that this happens when we start with constant nonzero row sums and compute the correlation matrix.

Get the diagonal matrix of square roots:

\Lambda 1 = \left(\begin{array}{lll} 1.36456 & 0. & 0. \\ 0. & 1.06676 & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the SVD of X… and look at w:

w1 = \left(\begin{array}{lll} 2.72912 & 0. & 0. \\ 0. & 2.13352 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the \sqrt{\text{eigenvalue}}\ -weighted matrix A, as A1 = v1\  \Lambda 1\ . (The only reason I’m using “1″ is so that in subsequent algebra and computations I can use A, v, \Lambda\ etc. consistently for cut-down matrices.)

A1 = \left(\begin{array}{lll} -0.136612 & 0.990625 & 0. \\ -0.938341 & -0.34571 & 0. \\ 0.981263 & -0.192673 & 0.\end{array}\right)

Compute F^T\ as I would usually, as columns of \sqrt{N-1}\ u\ , in this case 2 u1, and only the first 3 columns:

F1^T = \left(\begin{array}{lll} 1.17769 & -1.25613 & 0.660501 \\ 0.974085 & 1.45252 & 0.269327 \\ -0.97385 & 0.271651 & 1.59142 \\ -0.693986 & -0.487969 & -0.0525462 \\ -0.483941 & 0.0199256 & -0.977656\end{array}\right)

Let’s check it by computing F^T\ A^T\ :

 F1^T\ A1^T\  = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

and that is, indeed, X:

\left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

We’re done.

But what if the SVD is not available to us, and we don’t have u?

Let us recall the derivation in the case that A was invertible (here it was called A1; this is old stuff, so try not to worry about all the missing “1″s), we could also have computed F^T\ by projecting X onto the reciprocal basis A^{-T}\ :

F^T = X\ A^{-T}\ ,

but we can also derive that formula without ever knowing about the reciprocal basis:

Z = A\ F

A^{-1}\ Z = F

F^T = Z^T\ A^{-T}

F^T = X\ A^{-T}\ .

In that case, we could also have used A = v\ \Lambda\ to simplify things further, and get

A^T = \Lambda\ v^T

A^{-T} = v^{-T}\ \Lambda^{-1}

A^{-T} = v\ \Lambda^{-1}\ (because v is orthogonal)

and finally

F^T = X\ v\ \Lambda^{-1}\ .

That was fine (and it’s what Bartholomew had us do), but our A is not invertible.

the pseudo-inverse

So let’s work with the cut-down matrices. To be specific, we start with the two nonzero columns of A1:

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

This cut-down A has, in a sense, eliminated the problem: although the new A can’t be inverted because it’s not square, it is of full rank. Even better, it’s A1 with one column removed: in a very real sense, it’s the same linear operator.

Any time I have a formula which involves a matrix inverse, and I want to generalize it to a case where I have a rectangular matrix – hence, no inverse – I investigate whether a pseudo-inverse will work.

In this case, A is of rank 2, so the two square matrices A^T\ A and A\ A^T are of rank 2. But A is 5×2, so A^T\ A is 2×2 and A\ A^T is 5×5, so A^T\ A is invertible, even though A1 wasn’t. There’s no point to considering A\ A^T\ , because it’s not invertible.

So, for the rectangular A – because what I dropped was a column of zeroes – we still have

Z = A F

(That is crucial. We didn’t change the data Z = X^T\ , because we didn’t lose any singular values.)

Premultiply by A^T\ :

A^T Z = A^T\ A\ F\ .

now premultiply by \left(A^T\ A\right)^{-1}\ , which we know exists:

\left(A^T\ A\right)^{-1}\ A^T\ Z = F\ .

It may not be pretty, but it will work.

I should remind us that this is exactly what’s goes on in regression (OLS) here: the “normal equations” for the regression model

\hat{Y} = X\ \beta

are

\beta = \left(X^T\ X\right)^{-1}\ X^T\ Y \ .

Let me also point out that if A is invertible, then the inverse expands as

\left(A^T\ A\right)^{-1} = A^{-1}\ A^{-T}\ ,

so from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get

F = A^{-1}\ A^{-T}\ A^T\ Z

i.e.

F = A^{-1}\ Z

and finally

F^T = X A^{-T}\ .

That is, the equation written with the pseudo-inverse

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

contains our earlier special case

F^T = X\ A^{-T}

when A is invertible. That derivation, incidentally, reminds me that although I think of \left(A^T\ A\right)^{-1} as the pseudo-inverse, it’s really the product \left(A^T\ A\right)^{-1}\ A^T which is the pseudo-inverse, since it’s the product that collapsed to A^{-1}\ .

It gets better, in this case. The pseudo-inverse is far prettier than it looks, but let’s see it before we show it.

Here is A^T\ A\ :

A^T\ A = \left(\begin{array}{ll} 1.86202 & 0 \\ 0 & 1.13798\end{array}\right)

Diagonal?

Diagonal!

Even better, it’s \Lambda^2\ , where \Lambda is the cut-down \Lambda 1\ :

\Lambda = \left(\begin{array}{ll} 1.36456 & 0. \\ 0. & 1.06676\end{array}\right)

\Lambda^2 = \left(\begin{array}{ll} 1.86202 & 0. \\ 0. & 1.13798\end{array}\right)

That is, it’s the diagonal matrix of \lambda instead of \sqrt{\lambda}\ . Not only is it diagonal, but its elements are almost already computed. And in fact the inverse is almost immediate:

\left(A^T\ A\right)^{-1} = \Lambda^{-2}\ .

Then a quick check shows that just as

A1 = v1\ \Lambda 1\ ,

we have

A = v\ \Lambda\ ,

hence

A^T = \Lambda\ v^T

(We also need a cut-down version v of v1 (its first two columns) as well as the cut down \Lambda of \Lambda 1 (2×2) which we got earlier.)

v = \left(\begin{array}{ll} -0.100114 & 0.92863 \\ -0.687651 & -0.324075 \\ 0.719106 & -0.180615\end{array}\right)

Then A = v\ \Lambda is

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

(That is, whether we start with A1 and cut it down to A, or compute the cut-down A from v and \Lambda\ , we get the same thing.)

Putting it all together, from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get, as before,

F = \Lambda^{-2} \Lambda\  v^T\ Z

F^T = X\ v\ \Lambda^{-1}\ .

That is, if A is invertible, we can compute F^T as any one of

  • F^T = X\ v\ \Lambda^{-1}
  • F^T = X\ A^{-T}
  • FT = \sqrt{N-1}\ u (the first “few” columns).

If A is not invertible but the cut-down A is of full rank (so v and \Lambda are also cut down as above), we can compute F^T as any of

  • FT = X\ v\ \Lambda^{-1}
  • F = \left(A^T\ A\right)^{-1}\ A^T\ X^T and take the transpose.
  • FT = \sqrt{N-1}\ u (the first “few” columns).

The only difference between A not invertible and A invertible is F = \left(A^T\ A\right)^{-1}\ A^T\ X^T in place of F^T = X\ A^{-T}\ .

(Yes, one of those uses F, the other F^T\ .)

But as I said at the beginning, I didn’t go thru all that in order to encourage you to compute the pseudo-inverse \left(A^T\ A\right)^{-1}A^T\ , but unfortunately you may very well see it in texts.

When you do, realize that it’s true but utterly unnecessary for computing F.

(There’s no reason to tell you which of the references show it.)

I would always compute F^T as F^T = \sqrt{N-1}\ u\ . If the SVD makes you uncomfortable, or isn’t available (I weep for you), then compute it as F^T = X\ V \Lambda^{-1}\ , using V from the eigendecomposition.

Even though the pseudo-inverse isn’t all that useful in this case, it’s a good thing to know about in general.

Oh, and don’t even dream that it’s unnecessary for regression. For PCA / FA, it’s

A^T\ A = \Lambda^2

which makes it unnecessary to actually compute the pseudo-inverse . For regression, X^T\ X isn’t at all likely to be diagonal.