PCA / FA. A Map to my posts

The Posts

The purpose of this post is to provide guidance to a reader who has just discovered that I have a large pile of posts about principal components / factor analysis. This pile of posts might seem a very jungle, without any map.

Here, have a map.

As I finalize this post, it will be number 52 in PCA / FA. Here’s a list of the 52 posts, including the dates spanned by any group, and the number of posts in that group. (When the picture was taken, I didn’t know when this would be published. In fact, post 51 was scheduled but not yet published. Even more, post 51 did not even exist when the first picture was created.)

posts-table
transition/attitude matrices is a post that is sometimes relevant when we discuss “new data” in PCA, but it is not in the PCA / FA category.

“tricky prepro” is short for “tricky preprocessing”, and discusses the combination of constant row sums and covariance or correlation matrix.

Posts which have no named author, such as “example 8″, have the tag, “Rip”.

Right now the tag cloud includes entries such as Harman and Davis, i.e. the authors of textbooks I used. This means you can see all the posts based on Harman (including those which include Davis) by clicking on the Harman tag. The number of tags shown, however, is limited to 45. In time, some or all of these author tags may disappear. What to do then?

This post shows you that the original Harman posts all occurred in January 2008, and that there were some posts about Harman & Davis in April 2008, and the “et al” on post 51 means that it talks about Harman, Jolliffe, and Basilevsky.. The dates would let you pick whatever you wanted out of the monthly archives.

The Authors

Let me summarize the sources quickly. There are three texts which specialize in PCA / FA in general: Harman, Jolliffe, Basilevsky. There is one text which specializes in PCA / FA in chemistry: Malinowski. There are three texts which specialize in data analysis in a particular field: chemistry, geology, social science (resp. Brereton, Davis, and Bartholomew et al.)

Harman’s primary focus is factor analysis, but I started with him because he had a simple example with data. He also has a lot of material on what they call rotations, but I did not look at it. If you need to understand this, I think he’s a good place to start. (What people mean in this field when they say “rotation” is anything but; in fact, they want to turn an orthogonal basis into a non-orthogonal basis, and that is not done by a rotation!)

Jolliffe is all about principal component analysis. As I said before, if you do this professionally, you probably need this book: he talks a lot about what you’re doing and what it means. I didn’t start here because his examples are not self-contained: he references the literature which contains the data.

Davis is about data analysis primarily for geology. His material on PCA / FA is limited to one chapter. Nevertheless, as you might guess from the number of posts, Davis is where I began to put it all together. I would recommend his book for data analysis in general, not just for PCA / FA.

Malinowski and Brereton are special-purpose texts, even within the category “PCA in chemistry” (to judge from the comment here.) From Malinowski I learned quite a bit more about the geometry of the SVD, the plausibility of using data which was not even centered, and an interesting approach to missing data.

Brereton has some nice data, but you have to own the book to access the data. Also, he did not look like the kind of book I could have learned this from. (Practice with, yes; learn the concepts, no. But then, I’m not trying to call a PCA command in a statistics package – not that it’s a bad thing to do, but it’s not my cup of tea.)

Bartholomew et al. is about data analysis primarily for the social sciences. Like Davis, their material on PCA is limited to one chapter, but with another on classical FA (which I have not discussed). The book, however, has publicly available data, and this makes it an excellent source for practice. In fact, it was here that I finally saw how easy it was to compute the scores. They do not use matrix notation, so it might help if you had learned that elsewhere before picking up this book.

Basilevsky is devoted to factor analytic methods: factor analysis, principal component analysis, and regression. He tends to have data printed in tables, so I have scanned things into the computer for myself in order to play with his examples. I think I’m just as glad I found him after I understood how to do this; he might have made me dizzy in the beginning.

Introduction to The Calculations

A summary of just about all the calculations I could do for PCA is here but you might also look at the post immediately preceding this one.

Now that I understand the basics of PCA, I suspect that I personally would use it in two ways. One, I might use it on data before I ran a regression. Two, I might do just what I have been doing, namely checking someone else’s calculations.

Using PCA Myself

Let’s take care of the simple case first. If I have some data, and if I choose to run PCA, I would be interested – at this stage of my understanding – in two things. One, how would a PCA redistribute the variance? Two, how many linearly independent variables do I have?

I would answer both of those questions by getting the SVD (singular value decomposition, X = u\ w\ v^T\ ) of my data matrix X. The number of nonzero singular values w will tell me the rank of the matrix; the number of “small” singular values will let me estimate multicolinearity. Since I want to know about the redistributed variance, I need to use either centered or standardized data. (Otherwise, the eigenvalues are not equal to variances.) In either case, it is the orthogonal eigenvector matrix v (not the weighted matrix A) which will show me the redistributed variances.

When I do this, I’ll show it to you. (I have yet to do it, since understanding PCA.) Who knows? At that time, I may figure out that there are other things I would wish to get out of the PCA.

Anyway, it seems to me today that for my own purposes I would simply get the SVD of my data, and look at w and v. This is way, way simpler than what I’ve usually been doing while checking other peoples calculations.

(I do wonder if the matrix u can be used to isolate variables which are multi-collinear. We saw in Malinowski that we could construct a projection operator…. Let me think about that.)

The other case is somewhat more complicated. This takes us back to my beginnings.

If I am checking someone else’s analysis, the very first question is: did they give me the data?

Checking Someone Else, No Data

If they did not give me the data, they must have given me a “dispersion matrix”: some form of either X^T X or X X^T\ , in fact probably either a covariance matrix or a correlation matrix. Whatever it is, denote it by c.

I would then get an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ,

where the eigenvalues of c are the nonzero elements of the diagonal matrix \Lambda^2\ . To put that another way, the diagonal matrix \Lambda has \sqrt{\text{eigenvalues}} for its diagonal elements.

You will recall that minor hell can break loose. Although any matrix of the form X^T X or X X^T must be positive definite – when computed exactly – any rounded off representation of it need not be positive definite. We saw an example here. The specific problem we might encounter is that a very small positive eigenvalue – really – might come out negative when we compute using a printed – rounded-off – dispersion matrix instead of the computed matrix. (In principle, of course, even the computed one could fail to be positive definite; there’s nothing sacred about rounding to 6 places or 15 places instead of 3 or so.)

If that happens, I would set the troublesome eigenvalue to zero, and continue. Incidentally, this can occur if we started with constant-row-sum data and then computed either the covariance or correlation matrix. I discussed that here.

Moving on, let us suppose we have the orthogonal eigenvector matrix v and the diagonal matrix \Lambda^2 of nonnegative eigenvalues (i.e. we took care of any negative ones, by setting them to 0).

If the work we are checking stays with v and \Lambda^2\ , then he is probably doing the kinds of things Jolliffe discussed.

Otherwise, I expect to see a weighted eigenvector matrix

A = v\ \Lambda\ .

If we were not given data, the only conclusions we can check are those drawn from v, A, and \Lambda\ . Computing the “scores” requires the data, and even if the analysis shows the scores, we cannot check them.

But what if we were given the data? Well, this could get interesting. (This, after all, is why I’ve been doing this stuff for a year and a half!)

Checking Someone Else, With Data

The first significant consequence of being given the data is this: we should be able to relate their eigenvalues and eigenvectors to the data. (We’ll get to the scores….)

That may not be straightforward.

The first challenge is: are the variables in columns or in rows? Harman, we will recall, puts the variables in rows, using the transpose of our customary X matrix: hence we had his

Z = X^T

and

Z = A\ F\ .

I would compute the means and variances of each variable (i.e. of each column or row). If I were at all uncertain whether the variables were rows or columns, I would compute the means and variances of each row and column. In fact, even if I know that the variables are in columns, say, I would compute the row sums.

This tells me whether I was given raw data, centered data, standardized data, or small standard deviates; or something else. It remains to be seen if they preprocessed the data before computing the dispersion matrix.

Okay, if the given data is standardized (or small standard deviates), I expect that they used the given data for the analysis. But I might be wrong.

Now, what did they do? They either computed an SVD of the data, or an eigendecomposition of a dispersion matrix. I would compute whatever I think they did with my best guess for the preprocessing.

If my eigenvectors are right, then I have the correct dispersion matrix to within a scale factor; if my eigenvalues are also right, then the scale factor is right.

Bear in mind that any multiple of an eigenvector is also an eigenvector. I would have computed an orthogonal eigenvector matrix v. If the given eigenvectors are of unit length, we have a match so long as they agree with mine to within a sign. If the given eigenvectors do not agree with mine to within a sign, I would compute their lengths. If their lengths turn out to be \sqrt{\text{eigenvalues}}, I would immediately compute

A = v\ \Lambda\ ,

expecting that my columns of A would agree with theirs to within a sign. About the only other scaling I have seen is to set the largest component of each eigenvector to 1. (Apart from Basilevsky normalizing the rows of the A matrix by the inverse standard deviations! And that, by the way, means the columns of A are no longer eigenvectors.)

Let me remind you that raw data, centered data, and standardized data (each denoted X) lead to different eigenvectors of X^T X\ ; so if I match the eigenvectors, I have correctly matched the author’s raw, centered, or standardized data. Until I match the eigenvectors, there’s not much point in worrying about the eigenvalues.

If the eigenvalues are off by a scale factor, once the eigenvectors are right, then I need to use a different multiple of X^T X\ . That is,

X^T X\ , \frac{X^T X}{N}\ , and \frac{X^T X}{N-1}

have the same eigenvectors, but the eigenvalues differ by factors of N or N-1 or (N-1)/N.

The same considerations apply to the computation of the singular value decomposition. If my matrices u and v agree with the author (to within signs, assuming orthonormal eigenvectors, as usual), I have correctly matched raw, centered, or standardized. If the singular values w are off by a scale factor, then they scaled the data somehow.

Why all the fuss?

  • I have seen people use 1/N in their theoretical discussion, but use 1/(N-1) in their computation.
  • I have seen someone use 1/N to get “the correlation matrix” or “the covariance matrix”, but I would have used 1/(N-1).
  • I have seen people say they were using “the covariance matrix”, but they actually used X^T X instead of \frac{X^T X}{N-1}\ .
  • I have seen people call X^T X the covariance matrix when X wasn’t even centered.
  • I have seen someone say the data was standardized, when in fact it was small standard deviates. (Same eigenvectors, different eigenvalues.)

Of course, either the author or I might have made a mistake getting the data into the computer; but I have learned to consider instead the possibility that an author and I are doing different computations on the same given data.

Finally, I have sometimes found that I agree with an author’s largest few eigenvectors (by which I mean, the eigenvectors associated with the largest eigenvalues), but not with the smaller ones. I suspect that this is a result of their using a sequential algorithm to find eigenvectors one after another.

I remind you that it is very easy to check an eigenvector-eigenvalue pair, my own or an author’s. Suppose we have done an eigendecomposition of a matrix c,

c = v\ \Lambda^2\ v^T\ .

Suppose, for example, that v_3 is the third eigenvector, with associated eigenvalue \lambda_3 = \Lambda_{33}^2\ ). Then the definitions of eigenvector and eigenvalue tell us that

c\ v_3 = \lambda_{3}\ v_3\ .

The LHS is a matrix-vector product and the RHS is a scalar times a vector. Compute both sides and compare them.

Whew! So far, I have matched the author’s eigendecomposition. Presumably, I know whether they used the orthogonal eigenvector matrix v or the \sqrt{\text{eigenvalue}} weighted eigenvector matrix A. The eigenvector matrix, whichever one they chose, is usually the “loadings”.

We’re actually home free, now.

If they present the “scores”, I expect to find that they are either X v or – if they come from A – the first few columns of \sqrt{\text{N-1}}\ u ( here), unless I have a reason to believe they are using Davis’ scheme:

S^R = X A^R\ .

(And for more about that, see the post immediately preceding this one.)

PCA/FA Answers to some Basilevsky questions

Let us look at three of the questions I asked early in February, and answer two of them.

First, what do we know? What have we done?

We assume that we have data X, with the variables in columns, as usual. In fact, we assume that the data is at least centered, and possibly standardized.

We compute the covariance matrix

c = \frac{X^T X}{N-1}\ ,

then its eigendecomposition

c = v\ \Lambda^2\ v^T\ ,

where \Lambda^2 is the diaginal matrix of eigenvalues. We define the \sqrt{\text{eigenvalue}}-weighted matrix

A = v\ \Lambda\ .

Finally, we use A as a transition matrix to define new data Z:

X^T = A\ Z^T\ .

We discovered two things. One, the matrix A is the cross covariance between Z and X:

A = \frac{X^T Z}{N-1}\ .

I find this interesting, and I suspect that it would jump off the page at me out of either Harman or Jolliffe; that is, I suspect it is written there but it didn’t register.

Two, we discovered that we could find a matrix Ar which is the cross covariance between Zc and Xs. That is, the general relation

A = \frac{X^T Z}{N-1}

specializes to both centered data Xc and standardized data Xs, and gives us

Ac = \frac{Xc^T Zc}{N-1}

As = \frac{Xs^T Zs}{N-1}\ .

In addition, however, we found a matrix Ar such that

Ar = \frac{Xs^T Zc}{N-1}\ ,

using the new centered data Zc and the old standardized data Xs. Oh, the definition of the matrix Ar was:

Ar = \sigma^{-1}Ac,

where \sigma^{-1} was a diagonal matrix whose entries were the inverse standard deviations of the centered data Xc.

That is, each row of Ac was weighted by the inverse standard deviation to define Ar. And it all worked out because the definition of Ar

Ar = \sigma^{-1}Ac

was dual to the definition of standardized data, which weighted the columns of Xc:

Xs = Xc\  \sigma^{-1}\ .

I will remind you that I find this interesting, but I’m not convinced it is important. I cannot forgive Ar for the fact that it is not a matrix of eigenvectors. I assure you, I could be missing something about Ar.

Anyway, that is what we did.

What were the questions?

  1. when is A a covariance matrix?
  2. is v a covariance matrix?
  3. can we do something similar for constant row sums?

For question 3, I have the feeling there’s something we can do, but I haven’t put the pieces together. It can’t be same as for question 1, because a matrix that weights the rows of the data matrix is not the same size as one that weights the columns (in general). Perhaps we need to look at XXT instead of XTX. But then things don’t drop out if A and Z are defined the same way, so…. This is still open.

For question 1, two partial answers are:
If A = v\ \Lambda\ , i.e. A is formed by weighting the columns of an orthonormal eigenvector matrix v by the \sqrt{\text{eigenvalues}}\ , then A is a covariance matrix of the same form as the one from which we got v.

And the construction that gave us Ar (weighting the rows of A) works for arbitrary weights.

I’ll explain.

Suppose we have a data matrix X. Suppose we compute a “dispersion matrix”, by which I mean a multiple of X^T X\ :

c = \alpha\ X^T X\ .

(Of course, \alpha is usually 1/(N-1).)

Suppose we do an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ;

then define a matrix

A = v\ \Lambda\ ,

and finally define new data using A as the transition matrix:

X^T = A\ Z^T\ .

(I’ll remind you that our observations are columns of X^T and Z^T\ ; and a transition matrix applied to new components gives us old components. If you need to, look here.)

I claim that

A = \alpha\ X^T Z\ ,

i.e., that A is the “cross dispersion matrix” between X and Z.

Please note, in particular, that the data X need not be centered.

The proof goes just the way it did before.

We will need a couple of things. From

X^T = A\ Z^T\ ,

we get

Z = X A^{-T}\ .

And also from

A^T = \Lambda\ v^T

we get

A^{-T} = v^{-T}\Lambda^{-1} = v \Lambda^{-1} (because v is orthogonal and \Lambda is diagonal).

Then

\alpha\ X^T Z

= \alpha\ X^T (X\ A^{-T})

= \alpha X^T X\ A^{-T}

= c\ A^{-T}

= (v\ \Lambda^2v^T) (v \Lambda^{-1})

= v\ \Lambda^2 \Lambda^{-1}

= v\ \Lambda

= A\ .

QED.

(If we had started with \alpha\ Z^T X instead of \alpha\ X^T Z\ , we would have ended up with A^T\ , which is a good thing, since \alpha\ Z^T X = (\alpha\ X^T Z)^T\ . Things are just a little different because the cross-dispersion in general, and the cross-covariance in particular, are not symmetric.)

This result, that A is the corresponding dispersion matrix between X and Z means that we can say something about Z without ever computing it.

(I’m not particularly fond of that, but then I haven’t been trying to deal with thousands of observations and a hundred variables. I’m used to datasets small enough to hold in my hand, as it were.)

Example

We could verify this, in particular, for the raw data that we used for the previous Basilevsky computations. We knew it was true for centered data or for standardized data, i.e. that it was true for the covariance matrix or the correlation matrix; we have seen now that it is true for an arbitrary dispersion matrix

\alpha\ X^T X\ .

It didn’t depend on 1/(N-1) or on X being centered. If we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha\  X^T Z\ .

Well, let’s do it. Here’s the raw data that we used here:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will take \alpha = \pi\ , just so that it can’t be mistaken for anything else (I need some number; let’s make sure that if anything interesting happens, we see it. OK, nothing interesting will happen, not that kind of interesting, anyway.)

We compute

c = \pi X^T X = \left(\begin{array}{lll} 174.934 & 1578.09 & 720.323 \\ 1578.09 & 25917.9 & 11760.3 \\ 720.323 & 11760.3 & 5715.79\end{array}\right)

Now that I’ve written it once, let me emphasize: the weird expression \pi X^T X occurs only because I needed a number for my abstract scaling factor \alpha\ . In retrospect, perhaps I should not have used greek letters for both abstract and numerical.)

We get an eigendecomposition, but all I want right now is \Lambda\ , the diagonal matrix of \sqrt{\text{eigenvalues}}\ . OK, the eigenvalues are

\{31415.9,\ 314.159,\ 78.5398\}

and \Lambda is

\Lambda = \left(\begin{array}{lll} 177.245 & 0. & 0. \\ 0. & 17.7245 & 0. \\ 0. & 0. & 8.86227\end{array}\right)

On second thought, I really should give you the orthonormal eigenvector matrix for reference:

v = \left(\begin{array}{lll} 0.0554414 & -0.0173938 & 0.99831 \\ 0.907331 & -0.416446 & -0.0576447 \\ 0.416745 & 0.908994 & -0.00730637\end{array}\right)

The matrix A is

A = v \Lambda = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

The new data Z is given by

Z = X\ A^{-T} = \left(\begin{array}{lll} -0.0206618 & -0.359791 & 0.24738 \\ 0.0665383 & -0.299765 & -0.282435 \\ 0.169678 & -0.247659 & 0.213996 \\ 0.286832 & -0.0841641 & -0.317697 \\ 0.44988 & 0.17488 & 0.174979\end{array}\right)

Finally, the cross-dispersion between Z and X is given by

\alpha\ X^T Z = \pi\ X^T Z =\left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

And that is the same as A, as it should be.

We continue with the answer to question 1. We found that if we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha X^T Z\ .

What about the matrix Ar, found by weighting the rows of A by the inverse standard deviations?

There was nothing special about the inverse standard deviations. If we weight the columns of X by a diagonal matrix \Delta\ , we can find a corresponding matrix Ar.

Let’s work that out. I am going to take weights (1,2,4} in place of the inverse standard deviations. That is, I take the diagonal matrix

\Delta = \left(\begin{array}{lll} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4\end{array}\right)

and scale the columns of X and the rows of A:

X_d = X\ \Delta = \left(\begin{array}{lll} 2.09653 & -1.58697 & -29.356 \\ -1.75252 & 26.1153 & 0.414196 \\ 3.63702 & 58.0128 & 34.1178 \\ 0.0338101 & 93.8239 & 79.407 \\ 5.91502 & 141.939 & 144.149\end{array}\right)

A_d = \Delta\ A = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

(Perhaps now is a good time to remind you that we can scale the columns of a matrix by post-multiplying it by a diagonal matrix of the scale factors; and we can scale the rows of a matrix by pre-multiplying it by a diagonal matrix.)

i expect that

A_d = \alpha\ X_d^T Z = \pi\ X_d^T Z\ .

Compute the RHS:

 \pi\ X_d^T Z\ = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

and that is, indeed, A_d\ .

So, the construction of Ar using the inverse standard deviations could be done with any set of weights instead, and applied to any dispersion matrix, not just the A = Ac from centered data.

In other words, I know two kinds of transition matrices that are dispersion matrices: A between Z and X, Ar (Ad in this case) between a newer Z and the older X.

I do not have an “iff” result. I can’t guarantee that these are the only kinds of matrices that work.

Question 2. Is v a dispersion matrix?

No, at least not between its new data Z and the old X, i.e. not the way A is. (I wonder if there’s a way to make it happen. Would it ever be worthwhile if it were possible?)

This time, let’s first confirm the failure numerically.

We need new data from v instead of from A. Rather than destroy the Z’s, call the new data Y. We should have

Y = X\ v\ ,

i.e.

Y^T = v^T X^T\ ,

i.e.

v^{-T}Y^T = X^T = v\ Y^T\ .

That last equality corresponds to X^T= A\ Z^T\ , so we’re good.

Here’s the new data Y from old data X using the transition matrix v:

Y = X\ v =\left(\begin{array}{lll} -3.66221 & -6.37712 & 2.19235 \\ 11.7936 & -5.31319 & -2.50302 \\ 30.0746 & -4.38964 & 1.89649 \\ 50.8397 & -1.49177 & -2.81552 \\ 79.7392 & 3.09967 & 1.55071\end{array}\right)

now compute \alpha\ X^T Z\ , i.e. in this case,

\pi\ X^T Y\ = \left(\begin{array}{lll} 1741.74 & -5.46444 & 78.4071 \\ 28504.6 & -130.83 & -4.5274 \\ 13092.4 & 285.569 & -0.573841\end{array}\right)

and that is to be compared with v – but I’ve already done the algebra, and I know they should have proportional columns, so I took the ratio instead of the difference, i.e. here is (\pi\ X^T Y) / v\ , term by term:

\left(\begin{array}{lll} 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398\end{array}\right)

Those should look familiar: they are, as they should be, the eigenvalues of \pi\ X^T X\ :

\{31415.9,\ 314.159,\ 78.5398\}

So the cross-dispersion matrix between X and Y is a column-weighted version of v, and the weights are the eigenvalues.

Time to do the algebra. This time, Y = Xv. We start with the dispersion matrix…

\alpha\ X^T Y

= \alpha\ X^T\ (X\ v)

= \alpha\ X^T X\ v

= c\ v

= ( v\ \Lambda^2\ v^T)\ v

= v\ \Lambda^2

and that’s what we got: the cross-dispersion \alpha\ X^T Y is the column-weighted v, and the weights are the eigenvalues \Lambda^2\ .

PCA / FA Basilevsky: with data

Introduction

I am looking into Basilevsky because he did something I didn’t understand: he normalized the rows of the Ac matrix (which I denoted Ar). We discussed that, and we illustrated the computations, in the previous two posts. But we did those computations without having any data. I want to take a closer look, with data.

In contrast to As and Ac, which are eigenvector matrices, his Ar matrix is not. Nevertheless, as I said, his Ar is not without some redeeming value. In fact, all three of the A matrices have the same redeeming value.

I will show, first by direct computation and then by proof, that each of these A matrices is the cross covariance between X data and Z data.

(That doesn’t mean I want to use his Ar matrix; all it means is that I learned something new about an A matrix.)

Recall the raw data of example 9. We’re about to go get the A matrices and the new data Z for both standardized and centered data.

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Yes, we’ve seen all these forms of the data before, but I want this post to be self-contained.

Now, I’m going to switch to Basilevsky’s notation, using Z for F^T. If you will, we’re mixing Harman’s model

Z = A\ F\ \text{or}\ X = F^T\ A^T

where Z = X^T, and F^T is the new data (“scores”) WRT the transition matrix (“loadings”) A; and Jolliffe’s model

Z = X\ V\ \text{or}\ X = Z\ V^{-1}

where Z is the new data (“scores”)….

We start with Harman’s model

Z = A\ F\ .

eliminate Z:

X^T = A\ F\ ,

and we know that F^T is the new data corresponding to X WRT the transition matrix A .

Now we reintroduce Z, redefining it as Jolliffe uses it, for the new data; that is, we let

Z = F^T

and get

X^T = A\ Z^T\ .

If I had to do it all over again, I would probably not have used Basilevsky’s notation. This is the notation I had for the previous two posts, but i’ll admit, it’s confusing.

But for this post, as for the previous two, that is our model.

First, let’s go get the standardized and the centered data (“X”), the corresponding A matrices, and the corresponding new data (“Z”).

Basilevsky Standardized

If you are comfortable with these calculations – which we’ve seen before for this very data – you should move right along. If, on the other hand, you need a refresher, here it is. And if you need to see the calculations explained, here

I standardize the data and call it Xs:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Get the eigenvalues of the correlation matrix:

\lambda s = \{2.43279,\ 0.565781,\ 0.00143288\}

Get the SVD (Singular Value Decomposition) of Xs:

Xs = us\ ws\ vs^T.

ws = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Form the diagonal matrix of \sqrt{\text{eigenvalues}}\

\Lambda s = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Form the weighted eigenvector matrix A = v\ \Lambda

As = vs\ \Lambda s = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Get the scores FT as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). This is as close to the u matrix itself as I need to get. But instead of calling these F^T, we are calling them Z:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Basilevsky centered

Recall the raw data:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will center the data and call it Xc. Get the column means…

\{1.98597,31.8304,11.4366\}

and subtract them from each column to get Xc:

Xc = \left(\begin{array}{lll} 0.110558 & -32.6239 & -18.7756 \\ -3.73849 & -18.7728 & -11.333 \\ 1.65105 & -2.82404 & -2.90714 \\ -1.95216 & 15.0815 & 8.41516 \\ 3.92905 & 39.1392 & 24.6006\end{array}\right)

Proceeding as before, we get the SVD (Singular Value Decomposition) and display w…

Xc = uc\ wc\ vc^T

wc = \left(\begin{array}{lll} 66.0141 & 0. & 0. \\ 0. & 5.01572 & 0. \\ 0. & 0. & 1.54942 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues of the covariance matrix (and that’s the only change we make after switching from standardized to centered data):

\lambda c = \{1089.47,6.28937,0.600172\}

Get the diagonal matrix of \sqrt{\text{eigenvalues}} of the covariance matrix…

\Lambda c = \left(\begin{array}{lll} 33.0071 & 0. & 0. \\ 0. & 2.50786 & 0. \\ 0. & 0. & 0.774708\end{array}\right)

Get the weighted eigenvector matrix A = v\ \Lambda

Ac = vc\ \Lambda c = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Get the scores F^T as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). except that we are calling them Z, as Basilevsky does.

Zc = \left(\begin{array}{lll} 1.13849 & 0.83693 & 0.73261 \\ 0.669241 & -1.03777 & 0.41852 \\ 0.116099 & 0.683164 & -1.59786 \\ -0.519249 & -1.14658 & -0.340197 \\ -1.40458 & 0.664253 & 0.786923\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Key: the A matrices are cross-covariances

We have As and Ac:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

We have new data Zs and old Xs:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

and old:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Now, what is the cross-covariance between Xs and Zs? I need either

\frac{X^T\ Z}{N-1}

or

\frac{Z^T\ X}{N-1}\ ,

and N-1 = 4. Note that they are the transposes of each other. We compute the former:

\frac{Xs^T\ Zs}{4} = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Not symmetric. But it’s a cross-covariance, between two different variables, not the covariance of one variable.

Now recall As:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

They are the same.

For centered data:

\frac{Xc^T\ Zc}{4} = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

and we recall

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Again, they are the same.

Now, get Ar, by normalizing the rows of Ac:

Ar = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

and compute the cross-covariance between Xs and Zc:

\frac{Xs^T\ Zc}{4} = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

0nce again, they are the same. We have good reason, now, to believe that

\text{loadings}\ A = \frac{X^T\ Z}{N-1} =  covariance(X,Z)\ ,

where Z is the new data WRT A (i.e. the scores, as most people define them).

That’s what I set out to show.

We have seen it for As, relating standardized data Xs to the scores Zs; for Ac, relating centered data Xc to the scores Zc; and finally for Ar, relating standardized data Xs to the scores Zc.

Proving it

Let’s take a look at the proofs. The question in the back of my mind is: when is a transition matrix also a cross-covariance matrix? (I will confess that I do not have a universal answer; it appears that I can generalize from the Ar matrix somewhat.)

That is, given old data X and new data Z WRT transition matrix T,

X^T = T\ Z^T

X = Z\ T^T\ .

when can we conclude that T is the covariance between X and Z?

For either standardized data or centered data X, one argument works. We have

A = V \Lambda\ ,

where \Lambda is the diagonal matrix of \sqrt{\text{eigenvalues}} and V is an orthogonal eigenvector matrix of the covariance matrix c; that is, we have the eigendecomposition

\Lambda^2 = V^T\ c\ V = V^T \frac{X^T\ X}{N-1}\ V

\left(\text{or}\  \frac{1}{N-1}\ X^T\ X = V \Lambda^2\ V^T\right)\ .

We have the decomposition of X:

X = Z\ A^T\ \left(\text{or}\ X^T = A\ Z^T\right)\ .

For simplicity, i’m going to assume that A is invertible (i.e. that \Lambda is invertible; i.e. that all the eigenvalues are nonzero): then

Z = X\ A^{-T} \left(\text{or}\ Z^T = A^{-1}X^T\right)

and I note that

A^{-T} = V^{-T}\ \Lambda^{-T} = V\ \Lambda^{-T} = V\ \Lambda^{-1}\ .

Let me name those three equations: we have an eigendecomposition of the covariance matrix, the decomposition of X, and the definition of A.

Now we compute the cross-covariance. There may be simpler ways, and there are certainly other paths to the same answer, but this works for me. As I said, I’m going to assume that A is invertible, because for one thing we can cope if it is not; and for another, if it is not, we would have fewer Z variables than X variables.

\frac{1}{N-1}\ X^T\ Z

\frac{1}{N-1}\ X^T\ \left(X\ A^{-T}\right) (the decomposition of X)

= c\ A^{-T} (definition of the covariance matrix)

= \left(V\ \Lambda^2\ V^T\right)\ \left(V\ \Lambda^{-1}\right) (the eigendecomposition of c and definition of A)

= \left(V\ \Lambda^2\right)\ \Lambda^{-1} = V\  \Lambda (simplify)

= A (definition of A)

QED.

The eigendecomposition is crucial. Without it, we are left hanging at

c\ A^{-T}\ ,

Now let us try to prove it for Ar. It’s a surprisingly simple consequence of what we just did. We need only one new equation, the relationship between standardized and centered data:

Xs = Xc\ \sigma^{-1}\ ,

which says that we get standardized data from centered data by dividing each column by its standard deviation; and we recall that scaling columns can be accomplished by post-multiplication by a diagonal matrix. (After all, that’s how we get the A matrix from the V matrix.)

What we actually need is the transpose:

Xs^T = \sigma^{-1}\ Xc^T\ .

Let us compute the cross covariance of Xs and Zc.

\frac{ Xs^T\ Zc}{N-1}

= \frac{\sigma^{-1}\ Xc^TZc}{N-1} (Xs in terms of Xc)

= \sigma^{-1}\ Ac (by the previous result for Ac!)

= Ar (by definition of Ar)

QED.

The key, then, was that Ar was to Ac as Xs was to Xc (OK, not exactly: scale the rows of one and the columns of the other). I don’t know of any other suitable relationships, but if we found one to use, we should apply it to Xc and Ac.

Let me be more specific. We never really used the fact that \sigma was a diagonal matrix of standard deviations. If we had constructed completely different data Xd by scaling the columns of centered data Xc, using an arbitrary diagonal matrix \delta^{-1}\ , so that

Xd = Xc\ \delta^{-1}\ ,

and if we had defined a new A matrix by scaling the rows of Ac using the same diagonal matrix:

Ad = \delta^{-1}\ Ac\ ,

then we would find that Ad was the cross covariance of Xd and Zc.

I need to look at this some more.

PCA / FA example 9: centered and raw, 3 models

What follows is simple computation, solely to show us exactly what happens. It continues the work of the previous post, which did my default calculations for standardized data. Here I do the same calculations for centered and raw data.

Centered

The raw data is still

\text{raw} = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will center the data and call it Xc. Get the column means…

\{1.98597,\ 31.8304,\ 11.4366\}

and subtract them from each column to get Xc:

Xc = \left(\begin{array}{lll} 0.110558 & -32.6239 & -18.7756 \\ -3.73849 & -18.7728 & -11.333 \\ 1.65105 & -2.82404 & -2.90714 \\ -1.95216 & 15.0815 & 8.41516 \\ 3.92905 & 39.1392 & 24.6006\end{array}\right)

Proceeding as before, we get the SVD and display w…

wc = \left(\begin{array}{lll} 66.0141 & 0. & 0. \\ 0. & 5.01572 & 0. \\ 0. & 0. & 1.54942 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues \lambda of the covariance matrix (and that’s the only change we make after switching to centered data):

\lambda c = \{1089.47,\ 6.28937,\ 0.600172\}

I’m getting fond of just doing pie charts:

Get the diagonal matrix \Lambda of square roots of those eigenvalues of the covariance matrix…

\Lambda c = \left(\begin{array}{lll} 33.0071 & 0. & 0. \\ 0. & 2.50786 & 0. \\ 0. & 0. & 0.774708\end{array}\right)

Get the weighted eigenvector matrix A

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Get the scores F^T as the first three columns of 2u (where 2 = \sqrt{N-1}\ , N=5).

Fc^T = \left(\begin{array}{lll} 1.13849 & 0.83693 & 0.73261 \\ 0.669241 & -1.03777 & 0.41852 \\ 0.116099 & 0.683164 & -1.59786 \\ -0.519249 & -1.14658 & -0.340197 \\ -1.40458 & 0.664253 & 0.786923\end{array}\right)

Confirm that we have X = F^T\ A^T\ . (Yes.)

We compute the new data (Y = Xv = uw)

Yc = \left(\begin{array}{lll} 37.5783 & 2.0989 & 0.567558 \\ 22.0897 & -2.60257 & 0.324231 \\ 3.8321 & 1.71328 & -1.23787 \\ -17.1389 & -2.87546 & -0.263554 \\ -46.3612 & 1.66585 & 0.609635\end{array}\right)

And we confirm the product Xc = Yc\ vc^T\ – which amounts to confirming the SVD. Not a bad thing to do considering how many u’s and w’s I have floating around in my Mathematica® notebook.

Then, on second thought, I’m going to forget about computing Davis’ loadings and scores.

So we should display the two forms of loadings A, v together:

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

vc = \left(\begin{array}{lll} -0.0506666 & 0.991715 & -0.118041 \\ -0.854657 & -0.104205 & -0.508629 \\ -0.516715 & 0.0751137 & 0.852856\end{array}\right)

We should display the corresponding scores F^T, Y:

{Fc}^T = \left(\begin{array}{lll} 1.13849 & 0.83693 & 0.73261 \\ 0.669241 & -1.03777 & 0.41852 \\ 0.116099 & 0.683164 & -1.59786 \\ -0.519249 & -1.14658 & -0.340197 \\ -1.40458 & 0.664253 & 0.786923\end{array}\right)

Yc = \left(\begin{array}{lll} 37.5783 & 2.0989 & 0.567558 \\ 22.0897 & -2.60257 & 0.324231 \\ 3.8321 & 1.71328 & -1.23787 \\ -17.1389 & -2.87546 & -0.263554 \\ -46.3612 & 1.66585 & 0.609635\end{array}\right)

And we could confirm that the new data Y have mean 0 (yes) and redistributed variances equal to the eigenvalues:

\text{variances} = \{1089.47,\ 6.28937,\ 0.600172\}

(yes, the same as the eigenvalues \lambda r\ .)

And we could confirm that F^T is standardized: the means and variances of the columns are 0 and 1 respectively. in fact F^T is uncorrelated: its covariance matrix is the identity.

What about reconstituted data? I would look at w in preference to \lambda r since it’s the w’s I will be setting to zero.

wc = \left(\begin{array}{lll} 66.0141 & 0. & 0. \\ 0. & 5.01572 & 0. \\ 0. & 0. & 1.54942 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

As before, setting w_{33} to zero will have a very small effect on the data: the reconstituted matrix would differ from the original by a square root sum of squares of 1.55 . So I’m going to set all but the first w equal to zero, just to get reconstituted data that differs nontrivially from the original.

wc0 = \left(\begin{array}{lll} 66.0141 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

The reconstituted data is X0 = u\ w0\ v^T\ , i.e. Xc0 = uc\ wc0\ {vc}^T\

Xc0 = \left(\begin{array}{lll} -1.90396 & -32.1165 & -19.4173 \\ -1.11921 & -18.8791 & -11.4141 \\ -0.194159 & -3.27513 & -1.9801 \\ 0.868368 & 14.6479 & 8.85592 \\ 2.34896 & 39.6229 & 23.9555\end{array}\right)

and the leftover is X1 = X – X0 is

Xc1 = Xc - Xc0 = \left(\begin{array}{lll} 2.01452 & -0.507392 & 0.641702 \\ -2.61928 & 0.106287 & 0.0810332 \\ 1.8452 & 0.451085 & -0.927034 \\ -2.82053 & 0.433688 & -0.44076 \\ 1.58009 & -0.483668 & 0.645059\end{array}\right)

I’ll show you again that the difference between X and X0 is the sum of squares of the w’s we omitted. the squares of the entries in Xc1 are

\left(\begin{array}{lll} 4.05829 & 0.257447 & 0.411781 \\ 6.86065 & 0.0112969 & 0.00656638 \\ 3.40478 & 0.203478 & 0.859393 \\ 7.9554 & 0.188085 & 0.194269 \\ 2.49669 & 0.233934 & 0.416101\end{array}\right)

Their sum is 27.5582 and the square root is 5.24959 . And the squares of w_{22} and w_{33} are 25.1575 and 2.40069, with matching sum 27.5582 .

We could compute the reconstituted new data Y, but as before, it would just be the first column of Yc.

Raw

I use the raw data and call it Xr.

Xr = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Get the SVD and display w…

wr = \left(\begin{array}{lll} 100. & 0. & 0. \\ 0. & 10. & 0. \\ 0. & 0. & 5. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Neither the correlation matrix nor covariance matrix is appropriate, so we look at the singular values. But we want to honor

w = \sqrt{N-1}\ \Lambda\ ,

I could compute the \Lambda matrix from the w matrix. Given the sizes of the matrices, I might actually compute

\Lambda^2 = \frac{w^T w}{N-1}

and take the square root. But conceptually it is cleaner to compute the eigenvalues of \frac{X^T X}{N-1}\ , which is exactly the correlation and covariance matrices for standardized and centered data. (Let me put that another way: if I did not have built-in Mathematica® commands Covariance and Correlation, I would be computing \frac{X^T X}{N-1}\ in every case.) Note, however, that for the raw data it is not remotely a covariance matrix. I get

\lambda r = \{2500.,\ 25.,\ 6.25\}

and here’s the pie chart of them:

and then I form the diagonal matrix \Lambda of square roots:

\Lambda r = \left(\begin{array}{lll} 50. & 0. & 0. \\ 0. & 5. & 0. \\ 0. & 0. & 2.5\end{array}\right)

Get the weighted eigenvector matrix A = v\ \Lambda

Ar = \left(\begin{array}{lll} -2.77207 & 0.0869692 & 2.49578 \\ -45.3666 & 2.08223 & -0.144112 \\ -20.8372 & -4.54497 & -0.0182659\end{array}\right)

Get the scores FT as the first three columns of 2u (where 2 = \sqrt{N-1}\ , N=5).

{Fr}^T = \left(\begin{array}{lll} 0.0732441 & 1.27542 & 0.87694 \\ -0.235872 & 1.06264 & -1.00121 \\ -0.601493 & 0.877928 & 0.758596 \\ -1.01679 & 0.298354 & -1.12621 \\ -1.59478 & -0.619934 & 0.620283\end{array}\right)

Check by confirming that X = F^T\ A^T\ for Xr, Fr, Ar. (Yes.)

we compute the new data (Y = u w):

Yr = \left(\begin{array}{lll} 3.66221 & 6.37712 & 2.19235 \\ -11.7936 & 5.31319 & -2.50302 \\ -30.0746 & 4.38964 & 1.89649 \\ -50.8397 & 1.49177 & -2.81552 \\ -79.7392 & -3.09967 & 1.55071\end{array}\right)

And we confirm the product X = Y\ v^T\ – which amounts to confirming the SVD. Not a bad thing to do considering how many u’s and w’s I have floating around.

Now let’s look at the new data Y. Here are its means…

\{-33.757,\ 2.89441,\ 0.064203\}

and its variances…

\{1075.58,\ 14.528,\ 6.24485\}

The sum of the variances is 1096.36, and that is, in fact, the sum of the variances of the raw data. we have redistributed the variance.

But.

Recall the eigenvalues:

\lambda r = \{2500.,\ 25.,\ 6.25\}

The first “but” is that the redistributed variances have nothing to do with the eigenvalues. Hey, Xc is not centered, so \frac{X^T X}{N-1}\ has nothing to do with the variance of the raw data.

The second “but” is that the redistributed variances of the centered data were

\{1089.47,\ 6.28937,\ 0.600172\}.

and the first one is larger than what I got for Yr. That’s actually good, since the first variance of Yc was supposed to be the largest I could get. The redistribution of variance of the raw data is not maximal. It’s not that far off, but it’s not maximal.

This has implications for regression analysis. That is, if I want to redistribute the variance of a bunch of variables for regression, to do it properly I have to center the data, which is not something I like doing under the circumstances. The uncentered data seems more appropriate.

The change to Y is far more significant than the change to F^T. In this case, it is still true that

\frac{X^T X}{N-1} = I\ ,

but it is no longer appropriate to say that the F^T are uncorrelated. In fact, the means are nonzero

\{-0.67514,\ 0.578882,\ 0.0256812\}

and the variances are not 1:

\{0.430233,\ 0.581119,\ 0.999176\}

The matrix F^T is almost orthonormal, except that instead of

F^T\ F = I

we have

F^T\ F = \left(N-1\right)\ I

What about reconstituted data? I would look at w in preference to \lambda r\ :

wr = \left(\begin{array}{lll} 100. & 0. & 0. \\ 0. & 10. & 0. \\ 0. & 0. & 5. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

As before, setting w_{33} to zero will have a very small effect on the data: the reconstituted matrix would differ from the original by a total sum of squares of 25 . so I’m going to set all but the first w equal to zero, just to get reconstituted data that differs nontrivially from the original.

Here’s w0:

wr0 = \left(\begin{array}{lll} 100. & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{array}\right)

Here’s the reconstituted data:

Xr0 = \left(\begin{array}{lll} -0.203038 & -3.32283 & -1.52621 \\ 0.653854 & 10.7007 & 4.91492 \\ 1.66738 & 27.2877 & 12.5335 \\ 2.81862 & 46.1284 & 21.1872 \\ 4.42085 & 72.3499 & 33.2309\end{array}\right)

It is still true that the raw data X and the reconstituted data X0 differ by the square root of the sum of squares of the two omitted w’s.

Finally, if we were to compute the new components of the reconstituted data we would get the first column of the Yr matrix.

PCA / FA example 9: standardized data, 3 models

Introduction

edited 16 Jan 2009: I found a place where I called F^T the loadings instead of the scores. That’s all.

I want to run thru what is admittedly a toy case, but this seems to be where I stand on the computation of PCA / FA.

Recall the raw data of example 9:

\text{raw} = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Get the mean and variance of each column. The means are

\{1.98597,\ 31.8304,\ 11.4366\}

and the variances are

\{8.99072,\ 796.011,\ 291.354\}

We see that the raw data will differ from the centered data, and that will differ from the standardized data. Let’s do the standardized data first, because that’s what we’ve been doing most recently.

Here’s what I’m going to do. For a data matrix X

  1. get the SVD, X = u\ w\ v^T
  2. get the eigenvalues \lambda\ \text{of } X^T\ X/\left(N-1\right) (in 2 cases, that’s the correlation matrix or the covariance matrix)
  3. form the diagonal matrix \Lambda\ \text{of } \sqrt{\lambda}
  4. form the weighted eigenvector matrix A = v\ \Lambda
  5. form the loadings scores F^T= \sqrt{N-1}\ u
  6. form the new data Y wrt v, Y = u w
  7. form Davis’ loadings A^R = v\ w^T
  8. form Davis’ scores S^R = X\ A^R\ .

I will also keep the largest “few” singular values w
and form the reconstituted data X_0\ and the residual data X_1\ , where:

X = X_0 + X_1\ .

From given raw data, I’m going to do all that for the raw data, the centered data, and the standardized data. Chances are only one of those is really appropriate for any given problem, but I don’t know how to decide, so I’m going to use my really cheap computing power to see it all.

That lays out everyone’s models. For Harman and Bartholomew, we’re getting

X = F^T\ A^T (i.e. Z = A F)

for Jolliffe and Malinowski we’re getting

X = Y\ v^T (i.e. X = R C and Y = X v)

and I’ve included Davis’s R-mode scores and loadings.

This is complicated by my doing it three times, but that encourages me to check things along the way. Two of the obvious checks are

  • X = F^T\ A^T
  • X = Y\ v^T\

but we can also check that

  • Y has redistributed the variance
  • F^T is uncorrelated: covariance matrix = identity

but we have to be very careful about the last two when we work with the raw data.

Finally, I will look across loadings v, A, AR and across corresponding scores Y, F^T, SR

Gentemen, start your engines.

Standardized

i standardize the data and call it Xs:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Although we saw most of these calculations in the previous post, my focus is a little different this time. Among other things, we will see just how few things need to be computed.

Get the SVD X = u\ w\ v^T\ , labeled by “s”. For now, look only at w…

ws = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues \lambda of the correlation matrix:

\lambda s = \{2.43279,\ 0.565781,\ 0.00143288\}

If I simply do a pie chart, I don’t have to explicitly compute percentages:

Form the diagonal matrix of square roots \Lambda\

\Lambda s = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Form the weighted eigenvector matrix A

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Get F^T as the first three columns of 2u (where 2 = \sqrt{N-1}\ , N=5). This is as close to the u matrix itself as I need to get.

Fs^T = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check it by confirming that we have X = F^T\ A^T\ . We do.

We compute the new data Y wrt the v basis (Y = u w = Xv):

Ys = \left(\begin{array}{lll} 1.37971 & 0.802425 & 0.0296324 \\ 1.42478 & -0.638652 & 0.014397 \\ -0.0979886 & 0.573684 & -0.059222 \\ -0.322394 & -0.921145 & -0.0150258 \\ -2.38411 & 0.183688 & 0.0302184\end{array}\right)

And we confirm the product Xs = Ys\ vs^T\ – which amounts to confirming the SVD. (That is, from Y = u w, the SVD becomes X = Y v^T\ .) Not a bad thing to do considering how many u’s and w’s I have floating around.

Finally, Davis defines R-mode loadings AR and scores SR:

A^R = v\ w^T

S^R = X\ A^R\ ,

so we can compute them:

As^R = \left(\begin{array}{lllll} -1.50463 & 1.31761 & -0.00115765 & 0. & 0. \\ -1.92765 & -0.530395 & -0.0532022 & 0. & 0. \\ -1.93685 & -0.495698 & 0.0538489 & 0. & 0.\end{array}\right)

Ss^R = \left(\begin{array}{lllll} 4.30399 & 1.20714 & 0.00224337 & 0. & 0. \\ 4.44458 & -0.960768 & 0.00108995 & 0. & 0. \\ -0.305673 & 0.863033 & -0.00448351 & 0. & 0. \\ -1.0057 & -1.38574 & -0.00113755 & 0. & 0. \\ -7.43719 & 0.276335 & 0.00228774 & 0. & 0.\end{array}\right)

OK, he would of course drop the zero columns:

As^R = \left(\begin{array}{lll} -1.50463 & 1.31761 & -0.00115765 \\ -1.92765 & -0.530395 & -0.0532022 \\ -1.93685 & -0.495698 & 0.0538489\end{array}\right)

Ss^R = \left(\begin{array}{lll} 4.30399 & 1.20714 & 0.00224337 \\ 4.44458 & -0.960768 & 0.00108995 \\ -0.305673 & 0.863033 & -0.00448351 \\ -1.0057 & -1.38574 & -0.00113755 \\ -7.43719 & 0.276335 & 0.00228774\end{array}\right)

In case I’ve never pointed it out before, although Davis’ scores S^R are strange, his loadings A^R = v\ w^T are exactly analagous to A = v\ \Lambda\ . The difference is in using w for weights instead of \Lambda\ .
We should display the three definitions of loadings A, v, A^R (and this is when I choose to see v from the SVD computed way back at the beginning):

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

vs = \left(\begin{array}{lll} -0.482334 & 0.875854 & -0.0152912 \\ -0.617941 & -0.35257 & -0.70274 \\ -0.620889 & -0.329506 & 0.711283\end{array}\right)

As^R\left(\begin{array}{lll} -1.50463 & 1.31761 & -0.00115765 \\ -1.92765 & -0.530395 & -0.0532022 \\ -1.93685 & -0.495698 & 0.0538489\end{array}\right)

Since we have no zeroes, we can display the ratios (no, I wouldn’t usually do this):

As / vs = \left(\begin{array}{lll} 1.55974 & 0.752184 & 0.0378534 \\ 1.55974 & 0.752184 & 0.0378534 \\ 1.55974 & 0.752184 & 0.0378534\end{array}\right)

As^R / vs = \left(\begin{array}{lll} 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068\end{array}\right)

As^R / As = \left(\begin{array}{lll} 2. & 2. & 2. \\ 2. & 2. & 2. \\ 2. & 2. & 2.\end{array}\right)

i hope there were no surprises there. Each column of A or A^R is a multiple of the corresponding column of v.

We should display the corresponding scores (“new data”) F^T, Y, S^R (where F^T and Y are new components of X wrt A and v resp., but S^R is new components wrt the reciprocal basis of A.)

Fs^T = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Ys = \left(\begin{array}{lll} 1.37971 & 0.802425 & 0.0296324 \\ 1.42478 & -0.638652 & 0.014397 \\ -0.0979886 & 0.573684 & -0.059222 \\ -0.322394 & -0.921145 & -0.0150258 \\ -2.38411 & 0.183688 & 0.0302184\end{array}\right)

Ss^R = \left(\begin{array}{lll} 4.30399 & 1.20714 & 0.00224337 \\ 4.44458 & -0.960768 & 0.00108995 \\ -0.305673 & 0.863033 & -0.00448351 \\ -1.0057 & -1.38574 & -0.00113755 \\ -7.43719 & 0.276335 & 0.00228774\end{array}\right)

Since we have no zeroes, we can display the ratios:

Fs^T / Ys = \left(\begin{array}{lll} 0.641133 & 1.32946 & 26.4177 \\ 0.641133 & 1.32946 & 26.4177 \\ 0.641133 & 1.32946 & 26.4177 \\ 0.641133 & 1.32946 & 26.4177 \\ 0.641133 & 1.32946 & 26.4177\end{array}\right)

Ss^R / Ys = \left(\begin{array}{lll} 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068 \\ 3.11948 & 1.50437 & 0.0757068\end{array}\right)

Again, one set of scores is a column weighted multiple of any other.

We could confirm that the new data Y are centered and have redistributed variances equal to the eigenvalues: each column of Y does have mean 0, and the variances are

\{2.43279,\ 0.565781,\ 0.00143288\}

and those are indeed the eigenvalues

\lambda = \{2.43279,\ 0.565781,\ 0.00143288\}

And we could confirm that F^T is standardized: each column does in fact have mean 0 and variance 1. In fact, far more is true: F^T is uncorrelated, it’s covariance matrix is the identity.

What about reconstituted data? I would look at w in preference to \lambda\ :

ws = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

We know by now that setting w_{33} to zero will have a very small effect on the data: the reconstituted matrix would differ from the original by a root sum of squares = \sqrt{w_{33}^2}  = .076\ . So I’m going to set all but the first w equal to zero, just to get reconstituted data that differs nontrivially from the original.

There are many equivalent ways to do this, as we saw last time. What come to mind are

  1. take 1 column of F and one column of A.
  2. set two columns of F and two columns of A to zero.
  3. reduce w to a 1×1 matrix and take skinny forms of u and v.
  4. set two columns of Y to zero.
  5. leave sizes alone, but reset two w’s to zero.

i’ll do the last one, (5). And we didn’t actually see the second-to-last one (4), but I’ll show it to you after we get the reconstituted data. I zero-out two entries in w:

ws0 = \left(\begin{array}{lll} 3.11948 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

and I compute Xs0 = u\ ws0\ v^T\ :

Xs0 = \left(\begin{array}{lll} -0.665482 & -0.852582 & -0.856648 \\ -0.68722 & -0.880431 & -0.884631 \\ 0.0472632 & 0.0605512 & 0.06084 \\ 0.155502 & 0.199221 & 0.200171 \\ 1.14994 & 1.47324 & 1.48027\end{array}\right)

Now recall our initial Xs:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

and look at their difference:

Xs1=Xs - Xs0 = \left(\begin{array}{lll} 0.702354 & -0.303734 & -0.243327 \\ -0.559586 & 0.215052 & 0.22068 \\ 0.503369 & -0.160646 & -0.231156 \\ -0.806559 & 0.335327 & 0.292835 \\ 0.160422 & -0.0859985 & -0.0390325\end{array}\right)

Incidentally, we could square each element in Xs1

\left(\begin{array}{lll} 0.493301 & 0.0922546 & 0.0592079 \\ 0.313137 & 0.0462474 & 0.0486997 \\ 0.253381 & 0.0258071 & 0.0534331 \\ 0.650537 & 0.112444 & 0.0857524 \\ 0.0257352 & 0.00739575 & 0.00152354\end{array}\right)

and add up all those squares: the sum is 2.26886 . That, as we’ve seen before, is also the sum of squares of the omitted w’s; that is,

w_{22}^2 + w_{33}^2 = 2.26886\ .

Is it plausible to ask about reconstituted Y? yeah, that’s just part of the previous calculation. We had computed Xs0 = u\ ws0\ v^T\ ; the reconstituted Y is Y0 = u\ ws0\

Ys0 = \left(\begin{array}{lll} 1.37971 & 0. & 0. \\ 1.42478 & 0. & 0. \\ -0.0979886 & 0. & 0. \\ -0.322394 & 0. & 0. \\ -2.38411 & 0. & 0.\end{array}\right)

Ah ha! And sure enough, the difference between Ys and Ys0 is extremely simple:

Ys - Ys0 = \left(\begin{array}{lll} 0. & 0.802425 & 0.0296324 \\ 0. & -0.638652 & 0.014397 \\ 0. & 0.573684 & -0.059222 \\ 0. & -0.921145 & -0.0150258 \\ 0. & 0.183688 & 0.0302184\end{array}\right)

So the reconstituted Y was the first column of Ys and the other two columns set to zero. We could have just taken it and post-multiplied by vs^T to get the reconstituted X: this was variant (4) of ways to do the calculation.

OK, we’ve seen how to run thru the calculations, for the standardized data. There’s a lot redundancy in all that I’ve computed, more than I need for any analysis of my own. Still, if I’m checking someone’s else’s work, I should have routinely gotten anything they came up with.

That’s enough for now; next time, centered data and raw data.

PCA / FA Example 9: scores & loadings

I want to look at reconstituting the data. Equivalently, I want to look at setting successive singular values to zero.

This example was actually built on the previous one. Before I set the row sums to 1, I had started with

t1 = \left(\begin{array}{lll} 1 & 1 & -3 \\ -1 & 2 & -2 \\ 1 & 3 & -1 \\ -1 & 4 & 1 \\ 1 & 5 & 4\end{array}\right)

I’m going to continue with Harmon’s & Bartholomew’s model: Z = A F, Z = X^T, X is standardized, A is an eigenvector matrix weighted by the square roots of the eigenvalues of the correlation matrix of X.

I want data with one eigenvalue so large that we could sensibly retain only that one. Let me show you how I got that.

Get the SVD (Singular Value Decomposition) of t1… and look at w:

w = \left(\begin{array}{lll} 7.84944 & 0. & 0. \\ 0. & 4.9565 & 0. \\ 0. & 0. & 2.19533 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

OK, keep u and v – just because they’re handy – but redefine w, and compute a new data matrix using this w:

w = \left(\begin{array}{lll} 100 & 0. & 0. \\ 0. & 10 & 0. \\ 0. & 0. & 5 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Let t2 = u\ w\ v^T\ :

t2 = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Standardize t2 to get data X (“the data”):

X = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Get the SVD of X, X = u\ w\ v^T\ , but look at w… (we’ll see u and v later)

w = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues of the correlation matrix and look at the percentages:

\lambda = \{2.43279, 0.565781, 0.00143288\}

\text{percent} = \{81.0929, 18.8594, 0.0477627\}

\text{cumulative percent} = \{81.0929, 99.9522, 100.\}

so our first eigenvalue is 81% of the sum, the second is nearly all the rest, 19%.

Get the diagonal matrix of square roots, \Lambda\ :

\Lambda = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Get A and the scores 2\ u = F^T\ , except that I never want more than 3 columns. (The first 3 columns of 2 u are the components of the new data wrt the A basis.)

A = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

F^T = 2 u = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check by confirming that  F^T\ A^T = X\ , i.e. that we have factored the data matrix.

Let’s quickly compute the reconstituted data from 1 and 2 singular values. I could use square forms of w (w1 and w2) with u and v cut down to u1 and v1 (or u2, v2) to be conformable. To be specific,

w1 = \left(\begin{array}{l} 3.11948\end{array}\right)

and compute X1 = u1\ w1\ v1^T\ :

X1 = \left(\begin{array}{lll} -0.665482 & -0.852582 & -0.856648 \\ -0.68722 & -0.880431 & -0.884631 \\ 0.0472632 & 0.0605512 & 0.06084 \\ 0.155502 & 0.199221 & 0.200171 \\ 1.14994 & 1.47324 & 1.48027\end{array}\right)

or take

w2 = \left(\begin{array}{ll} 3.11948 & 0. \\ 0. & 1.50437\end{array}\right)

and then X2 = u2\ w2\ v2^T\ :

X2 = \left(\begin{array}{lll} 0.0373248 & -1.13549 & -1.12105 \\ -1.24659 & -0.655262 & -0.674191 \\ 0.549727 & -0.141712 & -0.128192 \\ -0.651287 & 0.523988 & 0.503694 \\ 1.31082 & 1.40848 & 1.41974\end{array}\right)

You might note that my “1″ and “2″ indicate how many singular values I retained.

But what I was saying about leaving u and v untouched is that I can keep u and v full-size

u = \left(\begin{array}{lllll} 0.44229 & 0.533396 & 0.391409 & 0.605413 &   -0.0119286 \\ 0.456737 & -0.424532 & 0.190168 & -0.0677073 &   0.755259 \\ -0.0314119 & 0.381346 & -0.782255 & 0.201541 &   0.448384 \\ -0.103349 & -0.612313 & -0.198473 & 0.740037 &   -0.165366 \\ -0.764266 & 0.122103 & 0.39915 & 0.201541 &   0.448384\end{array}\right)

v = \left(\begin{array}{lll} -0.482334 & 0.875854 & -0.0152912 \\ -0.617941 & -0.35257 & -0.70274 \\ -0.620889 & -0.329506 & 0.711283\end{array}\right)

and use the following matrices for w, all the same size as X. The full-size w is:

w = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

and the two others are:

w2 = \left(\begin{array}{lll} 3.11948 & 0 & 0 \\ 0 & 1.50437 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

w1 = \left(\begin{array}{lll} 3.11948 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)

It’s easy enough to reconstitute the data that way,

X1 = u\ w1\ v^T

X2 = u\ w2\ v^T

and we do get the same answers for X1 and X2.

Now, what if take just the first two columns of F^T, and the first two columns of A (i.e. the first two rows of A^T)? (We saw this last time, too.) That is, I cut F^T down to:

F^T = \left(\begin{array}{ll} 0.88458 & 1.06679 \\ 0.913474 & -0.849064 \\ -0.0628237 & 0.762691 \\ -0.206697 & -1.22463 \\ -1.52853 & 0.244206\end{array}\right)

and I cut A^T down to

A^T = \left(\begin{array}{lll} -0.752315 & -0.963827 & -0.968424 \\ 0.658804 & -0.265197 & -0.247849\end{array}\right)

Do we get X2? Yes.

What if take just the first column of FT, and the first row of A^T? That is, I take FT to be:

F^T = \left(\begin{array}{l} 0.88458 \\ 0.913474 \\ -0.0628237 \\ -0.206697 \\ -1.52853\end{array}\right)

and AT to be

A^T = \left(\begin{array}{lll} -0.752315 & -0.963827 & -0.968424\end{array}\right)

Is their product equal to X1? Yes.

What we just saw is obvious in retrospect, but worth stating explicitly. When we throw away small singular values or eigenvalues, we do not change the scores and loadings. What we change, instead, is the number of scores and loadings used.

When we throw away a nonzero singular value or eigenvalue, we are throwing away a scores-and-loadings pair. We don’t change any of the other scores and loadings.

We have 3 pairs of “scores & loadings”, and then for each pair we compute a product. The individual pairs, and the 3 products, are not affected by our decision to “drop a factor”. Instead of adding all three products, we may choose to add only the first two products, or to keep only the first product.

It is probably customary to say that we are retaining 1 or 2 factors when we retain 1 or 2 scores-loadings pairs.

That “the scores” are selected columns of \sqrt{N-1}\ u tells us that the individual scores & loadings cannot change no matter how many factors we keep.

Throwing away – choosing not to use – a nonzero scores-loadings pair does affect the reconstituted data. X2 is close to X, but not the same, and X1 is quite different.

X2 - X = \left(\begin{array}{lll} 0.000453114 & 0.0208238 & -0.021077 \\ 0.000220147 & 0.0101173 & -0.0102403 \\ -0.000905575 & -0.0416176 & 0.0421236 \\ -0.000229762 & -0.0105592 & 0.0106876 \\ 0.000462075 & 0.0212357 & -0.0214938\end{array}\right)

X1 - X = \left(\begin{array}{lll} -0.702354 & 0.303734 & 0.243327 \\ 0.559586 & -0.215052 & -0.22068 \\ -0.503369 & 0.160646 & 0.231156 \\ 0.806559 & -0.335327 & -0.292835 \\ -0.160422 & 0.0859985 & 0.0390325\end{array}\right)

To look at that another way, the means of X2 are zero, and the variances of X2 (2 factors retained) are:

\{1., 0.999292, 0.999275\}

“Not far from 1″ is an understatement.

The means of X1 are also zero, but the variances of X1 (1 factor retained) are:

\{0.565977, 0.928963, 0.937846\}

You might note that the X1 data is still centered, but it is no longer standardized. That was lost when we threw away a fairly large singular value.

Next, I think I will show you what I reckon I would do, today, for PCA / FA.

PCA / FA Example 8: the pseudo-inverse

introduction

Recall Harman’s or Bartholomew’s model

Z = A F

with Z = X^T\ , X standardized, and A a \sqrt{\text{eigenvalue}}\ -weighted eigenvector matrix, with eigenvalues from the correlation matrix.

We saw how to compute the scores F^T in the case that A was invertible (here). If, however, any eigenvalues are zero then A will have that many columns of zeroes and will not be invertible.

What to do?

One possibility – shown in at least one of the references, and, quite honestly, one of the first things I considered – is to use a particular example of a pseudo-inverse. I must tell you up front that this is not what I would recommend, but since you will see it out there, you should see why I don’t recommend it.

(Answer: it works, it gets the same answer, but computing the pseudo-inverse explicitly is unnecessary. In fact, it’s unnecessary even if we don’t have the Singular Value Decomposition (SVD) available to us.)

Suppose we are given raw data which has, in fact, constant row sums (in this case, 1):

raw = \left(\begin{array}{lll} -1 & -1 & 3 \\ 1 & -2 & 2 \\ \frac{1}{3} & 1 & -\frac{1}{3} \\ -\frac{1}{4} & 1 & \frac{1}{4} \\ \frac{1}{10} & \frac{1}{2} & \frac{2}{5}\end{array}\right)

A couple of those fractions do not display well on my screen, so let me convert to decimals:

\left(\begin{array}{lll} -1. & -1. & 3. \\ 1. & -2. & 2. \\ 0.333333 & 1. & -0.333333 \\ -0.25 & 1. & 0.25 \\ 0.1 & 0.5 & 0.4\end{array}\right)

PCA a la Harman or Bartholomew et al.

Now, we do a PCA using Harman’s or Bartholomew’s model – but my way, not theirs. Standardize the input:

X = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

The X matrix is “the data”.

Get the eigenvalues of the correlation matrix:

\lambda 1 = \{1.86202,1.13798,0.\}

Yes, one of them is identically zero. We learned here that this happens when we start with constant nonzero row sums and compute the correlation matrix.

Get the diagonal matrix of square roots:

\Lambda 1 = \left(\begin{array}{lll} 1.36456 & 0. & 0. \\ 0. & 1.06676 & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the SVD of X… and look at w:

w1 = \left(\begin{array}{lll} 2.72912 & 0. & 0. \\ 0. & 2.13352 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the \sqrt{\text{eigenvalue}}\ -weighted matrix A, as A1 = v1\  \Lambda 1\ . (The only reason I’m using “1″ is so that in subsequent algebra and computations I can use A, v, \Lambda\ etc. consistently for cut-down matrices.)

A1 = \left(\begin{array}{lll} -0.136612 & 0.990625 & 0. \\ -0.938341 & -0.34571 & 0. \\ 0.981263 & -0.192673 & 0.\end{array}\right)

Compute F^T\ as I would usually, as columns of \sqrt{N-1}\ u\ , in this case 2 u1, and only the first 3 columns:

F1^T = \left(\begin{array}{lll} 1.17769 & -1.25613 & 0.660501 \\ 0.974085 & 1.45252 & 0.269327 \\ -0.97385 & 0.271651 & 1.59142 \\ -0.693986 & -0.487969 & -0.0525462 \\ -0.483941 & 0.0199256 & -0.977656\end{array}\right)

Let’s check it by computing F^T\ A^T\ :

 F1^T\ A1^T\  = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

and that is, indeed, X:

\left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

We’re done.

But what if the SVD is not available to us, and we don’t have u?

Let us recall the derivation in the case that A was invertible (here it was called A1; this is old stuff, so try not to worry about all the missing “1″s), we could also have computed F^T\ by projecting X onto the reciprocal basis A^{-T}\ :

F^T = X\ A^{-T}\ ,

but we can also derive that formula without ever knowing about the reciprocal basis:

Z = A\ F

A^{-1}\ Z = F

F^T = Z^T\ A^{-T}

F^T = X\ A^{-T}\ .

In that case, we could also have used A = v\ \Lambda\ to simplify things further, and get

A^T = \Lambda\ v^T

A^{-T} = v^{-T}\ \Lambda^{-1}

A^{-T} = v\ \Lambda^{-1}\ (because v is orthogonal)

and finally

F^T = X\ v\ \Lambda^{-1}\ .

That was fine (and it’s what Bartholomew had us do), but our A is not invertible.

the pseudo-inverse

So let’s work with the cut-down matrices. To be specific, we start with the two nonzero columns of A1:

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

This cut-down A has, in a sense, eliminated the problem: although the new A can’t be inverted because it’s not square, it is of full rank. Even better, it’s A1 with one column removed: in a very real sense, it’s the same linear operator.

Any time I have a formula which involves a matrix inverse, and I want to generalize it to a case where I have a rectangular matrix – hence, no inverse – I investigate whether a pseudo-inverse will work.

In this case, A is of rank 2, so the two square matrices A^T\ A and A\ A^T are of rank 2. But A is 5×2, so A^T\ A is 2×2 and A\ A^T is 5×5, so A^T\ A is invertible, even though A1 wasn’t. There’s no point to considering A\ A^T\ , because it’s not invertible.

So, for the rectangular A – because what I dropped was a column of zeroes – we still have

Z = A F

(That is crucial. We didn’t change the data Z = X^T\ , because we didn’t lose any singular values.)

Premultiply by A^T\ :

A^T Z = A^T\ A\ F\ .

now premultiply by \left(A^T\ A\right)^{-1}\ , which we know exists:

\left(A^T\ A\right)^{-1}\ A^T\ Z = F\ .

It may not be pretty, but it will work.

I should remind us that this is exactly what’s goes on in regression (OLS) here: the “normal equations” for the regression model

\hat{Y} = X\ \beta

are

\beta = \left(X^T\ X\right)^{-1}\ X^T\ Y \ .

Let me also point out that if A is invertible, then the inverse expands as

\left(A^T\ A\right)^{-1} = A^{-1}\ A^{-T}\ ,

so from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get

F = A^{-1}\ A^{-T}\ A^T\ Z

i.e.

F = A^{-1}\ Z

and finally

F^T = X A^{-T}\ .

That is, the equation written with the pseudo-inverse

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

contains our earlier special case

F^T = X\ A^{-T}

when A is invertible. That derivation, incidentally, reminds me that although I think of \left(A^T\ A\right)^{-1} as the pseudo-inverse, it’s really the product \left(A^T\ A\right)^{-1}\ A^T which is the pseudo-inverse, since it’s the product that collapsed to A^{-1}\ .

It gets better, in this case. The pseudo-inverse is far prettier than it looks, but let’s see it before we show it.

Here is A^T\ A\ :

A^T\ A = \left(\begin{array}{ll} 1.86202 & 0 \\ 0 & 1.13798\end{array}\right)

Diagonal?

Diagonal!

Even better, it’s \Lambda^2\ , where \Lambda is the cut-down \Lambda 1\ :

\Lambda = \left(\begin{array}{ll} 1.36456 & 0. \\ 0. & 1.06676\end{array}\right)

\Lambda^2 = \left(\begin{array}{ll} 1.86202 & 0. \\ 0. & 1.13798\end{array}\right)

That is, it’s the diagonal matrix of \lambda instead of \sqrt{\lambda}\ . Not only is it diagonal, but its elements are almost already computed. And in fact the inverse is almost immediate:

\left(A^T\ A\right)^{-1} = \Lambda^{-2}\ .

Then a quick check shows that just as

A1 = v1\ \Lambda 1\ ,

we have

A = v\ \Lambda\ ,

hence

A^T = \Lambda\ v^T

(We also need a cut-down version v of v1 (its first two columns) as well as the cut down \Lambda of \Lambda 1 (2×2) which we got earlier.)

v = \left(\begin{array}{ll} -0.100114 & 0.92863 \\ -0.687651 & -0.324075 \\ 0.719106 & -0.180615\end{array}\right)

Then A = v\ \Lambda is

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

(That is, whether we start with A1 and cut it down to A, or compute the cut-down A from v and \Lambda\ , we get the same thing.)

Putting it all together, from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get, as before,

F = \Lambda^{-2} \Lambda\  v^T\ Z

F^T = X\ v\ \Lambda^{-1}\ .

That is, if A is invertible, we can compute F^T as any one of

  • F^T = X\ v\ \Lambda^{-1}
  • F^T = X\ A^{-T}
  • FT = \sqrt{N-1}\ u (the first “few” columns).

If A is not invertible but the cut-down A is of full rank (so v and \Lambda are also cut down as above), we can compute F^T as any of

  • FT = X\ v\ \Lambda^{-1}
  • F = \left(A^T\ A\right)^{-1}\ A^T\ X^T and take the transpose.
  • FT = \sqrt{N-1}\ u (the first “few” columns).

The only difference between A not invertible and A invertible is F = \left(A^T\ A\right)^{-1}\ A^T\ X^T in place of F^T = X\ A^{-T}\ .

(Yes, one of those uses F, the other F^T\ .)

But as I said at the beginning, I didn’t go thru all that in order to encourage you to compute the pseudo-inverse \left(A^T\ A\right)^{-1}A^T\ , but unfortunately you may very well see it in texts.

When you do, realize that it’s true but utterly unnecessary for computing F.

(There’s no reason to tell you which of the references show it.)

I would always compute F^T as F^T = \sqrt{N-1}\ u\ . If the SVD makes you uncomfortable, or isn’t available (I weep for you), then compute it as F^T = X\ V \Lambda^{-1}\ , using V from the eigendecomposition.

Even though the pseudo-inverse isn’t all that useful in this case, it’s a good thing to know about in general.

Oh, and don’t even dream that it’s unnecessary for regression. For PCA / FA, it’s

A^T\ A = \Lambda^2

which makes it unnecessary to actually compute the pseudo-inverse . For regression, X^T\ X isn’t at all likely to be diagonal.

PCA / FA tricky preprocessing

Introduction

I have stumbled across a tricky point in the preprocessing of data. The most relevant post is probably

this of April 7. Rather than lecture, let me ask and answer some questions. The fundamental question is:
Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
The answer is yes.

Suppose we have a data matrix of full rank. Suppose the matrix is a typical one, in having more rows than columns. That the matrix be of full rank is equivalent to: the columns are linearly independent.

There is a serious asymmetry here: not all of the rows are linearly independent.

We might as well imagine some specific numbers. Suppose we have 3 columns and 5 rows. Since we assume it is of full rank, the rank of the matrix is 3.

The term “centering” means to “make zero-mean.” A centered column is one with zero mean = zero sum; a centered row is a row whose mean (hence sum) is zero..

What happens to the rank of the data matrix if we only row-center the data?
The rank decreases by 1.

Visualize the matrix as 3 column-vectors c1, c2, c3 of length 5. That the row sums (= row means) are zero says that c1 + c2 + c3 = 0, so the three columns are linearly dependent and the matrix is of rank 2.

What happens to the rank of the data matrix if we only column-center the data?
Nothing.

Visualize the matrix as 5 row-vectors r1, r2, r3, r4, r5 of length 3. We have more than 3 vectors of length 3, so they span a 3D space, but no larger. Adding the constraint r1 + r2 + r3 + r4 + r5 = 0 imposes, in general, no additional restriction. (If they had somehow spanned a 5D space, it would, but they can’t.)

What happens to the rank of the data matrix if we only set the row sums to 1 (or 100), i.e. to a nonzero constant value?
Nothing.

All we’ve done is assert that r1 + r2 + r3 = 1, a constant vector. These 4 vectors are linearly dependent, but the original 3 are not.

But. It has another potential effect, probably undesirable.

If we were to use the resulting data matrix for a regression, we would find ourselves in trouble. We almost always want to adjoin a column of 1s to the data matrix, in order to include a constant term in the regression equation. We just got through saying that the combination r1, r2, r3, 1 is linearly dependent: the regression will explode with a singular matrix. Once we have set the row sums, we would have to drop a column of data in order to include a constant term. This is the same problem we face in regression when we define instrumental (dummy) variables.

What happens to the rank of the data matrix if we only set the column sums to 1 or some other nonzero constant value? (I’ve never seen it done, but why not, at least in principle?)
Nothing. (It even has no effect on regression.)

First Summary

So, there were 4 cases: whether we set row or column sum to a constant value, and whether that constant was zero or not. Of those 4 cases, only one (row sum = 0) reduces the rank of the data matrix.

What I was unclear about – or downright wrong about – on April 7 was in saying that we would lose rank by setting the row sums to 1 or 100 etc. Not so; none of the other 3 cases reduce the rank of the matrix.

Is that it?
No. All those questions were about one single operation. What if we combine operations?

Can’t we center both the rows and the columns?
Yes. We saw
here on April 13 that we could doubly-center the data.

Does that mean we lose rank?
Yes, because we are centering the rows. Centering the columns is irrelevant.

Can we set the row sums to 1 and then center the columns?
Yes.

But something else will happen. Centering the columns is no longer irrelevant.

Didn’t we see that if the row sums were constant, and then we centered the data, the row sums went to zero?
Yes. It may seem weird, but it’s true.

That’s an easy way to get doubly-centered data using Mathematica®. Scale the rows so they each sum to 1, then center the columns. Voila’, the row sums will change to zero.

Any time you start messing with row and column sums or means, remember the grand mean, too: the average of all the data. It is also both the average row mean, and the average column mean. If the columns all have zero mean, then the grand mean is zero. Then the mean of the rows (edit: i.e. of the row means) must be zero. The challenge is to show that the rows still have a common sum (edit: equivalently, a common mean); then that common sum mean must be the mean of the rows, i.e. zero.

If we set the row sums, say to 1, and then column-center the data, do we lose rank?
Yes.

Column-centering (or standardizing) the data after setting the row sums will reset the row sums to zero, and that loses rank. If it is appropriate, however, that the data be column-centered, so be it. I do not believe that reduction of rank is sufficient grounds for not centering the data.

Second summary

The apparently innocuous act of column-centering can lose rank, if we apply it to nonzero-constant-row-sum data. We can do it, and sometimes we should do it, but we should expect to lose rank when we do. But we’re losing rank because we first set the row sums, even though to a nonzero constant.

I had previously said I would favor doubly-centered data. Now I am inclined not to do so without cause, simply because it reduces the rank of the matrix.

Can I inadvertently reduce the rank (the dimensionality) of the data matrix?
Hell yes. In particular, we might find that we effectively centered the data without realizing it.

Suppose we set the row sums of our data to 1, say, and then we decide that rather than do an SVD of the new matrix, we will do an eigendecomposition of the covariance (or correlation) matrix.

Oops. Computing either the covariance or correlation matrix is tantamount to column-centering the data.

here on March 10.

Goodbye full rank.

Third summary

Suppose we have data for which a constant nonzero row sum is appropriate. Suppose we choose to do an eigendecomposition of the covariance (or correlation) matrix.

Then we lose rank. Constructing either of those smaller matrices implicitly centers the columns, and that in turn implicitly changes the row sums of the data to zero. Even if we never explicitly compute the transformed data.

The clearest path in such a case might well be as follows. First set the row sums to 1 or whatever nonzero. Then column-center or standardize the data. We will see the rank drop by 1 at that point. Then do an SVD. The only advantage to doing it this way, rather than via the covariance or correlation matrices, is that we see when we lose full rank.

If we live in a universe where data is supposed to be centered, or where the covariance or correlation matrix is appropriate, then we will lose rank if, in addition, the data had constant row sums.

Final summary

0. We could set row or column sums to zero or to a nonzero constant.
1. Setting row sums to zero is the only one of the four cases that will reduce rank.
2. Column-centering applied to data with constant row sums leaves us still with constant row sums but the sums have been changed to zero. This gives us (1) and loss of rank.
3. An accidental way to achieve (2) and thus (1) is to analyze the covariance (or correlation) matrix of constant-row-sum data.

It’s not a problem if you know what happened and why.