The purpose of this post is to provide guidance to a reader who has just discovered that I have a large pile of posts about principal components / factor analysis. This pile of posts might seem a very jungle, without any map.
Here, have a map.
As I finalize this post, it will be number 52 in PCA / FA. Here’s a list of the 52 posts, including the dates spanned by any group, and the number of posts in that group. (When the picture was taken, I didn’t know when this would be published. In fact, post 51 was scheduled but not yet published. Even more, post 51 did not even exist when the first picture was created.)
transition/attitude matrices is a post that is sometimes relevant when we discuss “new data” in PCA, but it is not in the PCA / FA category.
“tricky prepro” is short for “tricky preprocessing”, and discusses the combination of constant row sums and covariance or correlation matrix.
Posts which have no named author, such as “example 8”, have the tag, “Rip”.
Right now the tag cloud includes entries such as Harman and Davis, i.e. the authors of textbooks I used. This means you can see all the posts based on Harman (including those which include Davis) by clicking on the Harman tag. The number of tags shown, however, is limited to 45. In time, some or all of these author tags may disappear. What to do then?
This post shows you that the original Harman posts all occurred in January 2008, and that there were some posts about Harman & Davis in April 2008. The “et al” on post 51 means that it talks about Harman, Jolliffe, and Basilevsky. The dates would let you pick whatever you wanted out of the monthly archives.
Let me summarize the sources quickly. There are three texts which specialize in PCA / FA in general: Harman, Jolliffe, Basilevsky. There is one text which specializes in PCA / FA in chemistry: Malinowski. There are three texts which specialize in data analysis in a particular field: chemistry, geology, social science (resp. Brereton, Davis, and Bartholomew et al.)
Harman’s primary focus is factor analysis, but I started with him because he had a simple example with data. He also has a lot of material on what they call rotations, but I did not look at it. If you need to understand this, I think he’s a good place to start. (What people mean in this field when they say “rotation” is anything but; in fact, they want to turn an orthogonal basis into a non-orthogonal basis, and that is not done by a rotation!)
Jolliffe is all about principal component analysis. As I said before, if you do this professionally, you probably need this book: he talks a lot about what you’re doing and what it means. I didn’t start here because his examples are not self-contained: he references the literature which contains the data.
Davis is about data analysis primarily for geology. His material on PCA / FA is limited to one chapter. Nevertheless, as you might guess from the number of posts, Davis is where I began to put it all together. I would recommend his book for data analysis in general, not just for PCA / FA.
Malinowski and Brereton are special-purpose texts, even within the category “PCA in chemistry” (to judge from the comment here.) From Malinowski I learned quite a bit more about the geometry of the SVD, the plausibility of using data which was not even centered, and an interesting approach to missing data.
Brereton has some nice data, but you have to own the book to access the data. Also, it did not look like the kind of book I could have learned this from. (Practice with, yes; learn the concepts, no. But then, I’m not trying to call a PCA command in a statistics package – not that it’s a bad thing to do, but it’s not my cup of tea.)
Bartholomew et al. is about data analysis primarily for the social sciences. Like Davis, their material on PCA is limited to one chapter, but with another on classical FA (which I have not discussed). The book, however, has publicly available data, and this makes it an excellent source for practice. In fact, it was here that I finally saw how easy it was to compute the scores. They do not use matrix notation, so it might help if you had learned that elsewhere before picking up this book.
Basilevsky is devoted to factor analytic methods: factor analysis, principal component analysis, and regression. He tends to have data printed in tables, so I have scanned things into the computer for myself in order to play with his examples. I think I’m just as glad I found him after I understood how to do this; he might have made me dizzy in the beginning.
Introduction to The Calculations
A summary of just about all the calculations I could do for PCA is here but you might also look at the post immediately preceding this one.
Now that I understand the basics of PCA, I suspect that I personally would use it in two ways. One, I might use it on data before I ran a regression. Two, I might do just what I have been doing, namely checking someone else’s calculations.
Using PCA Myself
Let’s take care of the simple case first. If I have some data, and if I choose to run PCA, I would be interested – at this stage of my understanding – in two things. One, how would a PCA redistribute the variance? Two, how many linearly independent variables do I have?
I would answer both of those questions by getting the SVD (singular value decomposition, $X = u\,w\,v^T$) of my data matrix X. The number of nonzero singular values w will tell me the rank of the matrix; the number of “small” singular values will let me estimate multicollinearity. Since I want to know about the redistributed variance, I need to use either centered or standardized data. (Otherwise, the eigenvalues are not equal to variances.) In either case, it is the orthogonal eigenvector matrix v (not the weighted matrix A) which will show me the redistributed variances.
When I do this, I’ll show it to you. (I have yet to do it, since understanding PCA.) Who knows? At that time, I may figure out that there are other things I would wish to get out of the PCA.
Anyway, it seems to me today that for my own purposes I would simply get the SVD of my data, and look at w and v. This is way, way simpler than what I’ve usually been doing while checking other people’s calculations.
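A minimal sketch of that "SVD, then look at w and v" routine, in Python with NumPy. The data here is synthetic and purely hypothetical (the fourth column is built as a linear combination of the first two), and the 1/(N-1) convention and the rank tolerance are assumptions chosen for illustration, not anything taken from the posts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 observations of 4 variables, where the last
# column is an exact linear combination of the first two.
X_raw = rng.normal(size=(20, 3))
X_raw = np.column_stack([X_raw, X_raw[:, 0] + 2.0 * X_raw[:, 1]])

# Center each column so that squared singular values relate to variances.
X = X_raw - X_raw.mean(axis=0)
N = X.shape[0]

# Thin SVD: X = u @ diag(w) @ vT, with v = vT.T the orthogonal eigenvector matrix.
u, w, vT = np.linalg.svd(X, full_matrices=False)
v = vT.T

# Numerical rank: count the singular values above a relative tolerance
# (an assumed tolerance, similar in spirit to what numpy.linalg.matrix_rank uses).
tol = w.max() * max(X.shape) * np.finfo(float).eps
rank = int(np.sum(w > tol))

# Variances along the new axes, using the (assumed) 1/(N-1) convention.
new_variances = w**2 / (N - 1)
total_variance = X.var(axis=0, ddof=1).sum()

print("singular values:", w)
print("rank:", rank)
print("redistributed variances:", new_variances)
```

The redistributed variances add up to the total variance of the centered columns, and one singular value is zero to within rounding because only three of the four variables are linearly independent.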
(I do wonder if the matrix u can be used to isolate variables which are multi-collinear. We saw in Malinowski that we could construct a projection operator…. Let me think about that.)
The other case is somewhat more complicated. This takes us back to my beginnings.
If I am checking someone else’s analysis, the very first question is: did they give me the data?
Checking Someone Else, No Data
If they did not give me the data, they must have given me a “dispersion matrix”: some form of either $X^T X$ or $X X^T$, in fact probably either a covariance matrix or a correlation matrix. Whatever it is, denote it by c.
I would then get an eigendecomposition of c,
$c = v\,\Lambda^2\,v^T$
where the eigenvalues $\lambda_i$ of c are the nonzero elements of the diagonal matrix $\Lambda^2$. To put that another way, the diagonal matrix $\Lambda$ has $\sqrt{\lambda_i}$ for its diagonal elements.
You will recall that minor hell can break loose. Although any matrix of the form $X^T X$ or $X X^T$ must be positive semidefinite – when computed exactly – any rounded off representation of it need not be. We saw an example here. The specific problem we might encounter is that a very small eigenvalue, one which is really positive or zero, might come out negative when we compute using a printed – rounded-off – dispersion matrix instead of the computed matrix. (In principle, of course, even the computed one could fail to be positive semidefinite; there’s nothing sacred about rounding to 6 places or 15 places instead of 3 or so.)
If that happens, I would set the troublesome eigenvalue to zero, and continue. Incidentally, this can occur if we started with constant-row-sum data and then computed either the covariance or correlation matrix. I discussed that here.
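A small NumPy sketch of that repair. The matrix below is a contrived, hypothetical "printed correlation matrix" chosen so that one eigenvalue comes out slightly negative; it is not taken from any of the texts.

```python
import numpy as np

# Hypothetical printed dispersion matrix (symmetric, unit diagonal),
# contrived so that it is not quite positive semidefinite.
c = np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, v = np.linalg.eigh(c)

# Reorder so the largest eigenvalue comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
v = v[:, order]

# Count the troublesome eigenvalues, then set them to zero and continue.
n_negative = int(np.sum(eigenvalues < 0))
clipped = np.where(eigenvalues < 0, 0.0, eigenvalues)

print("eigenvalues:", eigenvalues)
print("negative ones:", n_negative)
print("after clipping:", clipped)
```

For this particular matrix the eigenvalues work out to 1 and 1 ± √1.06, so exactly one of them is negative, and only barely.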
Moving on, let us suppose we have the orthogonal eigenvector matrix v and the diagonal matrix $\Lambda^2$ of nonnegative eigenvalues (i.e. we took care of any negative ones, by setting them to 0).
If the work we are checking stays with v and $\Lambda^2$, then the author is probably doing the kinds of things Jolliffe discussed.
Otherwise, I expect to see a weighted eigenvector matrix, each eigenvector scaled by the square root of its eigenvalue:
$A = v\,\Lambda$
If we were not given data, the only conclusions we can check are those drawn from v, A, and $\Lambda^2$. Computing the “scores” requires the data, and even if the analysis shows the scores, we cannot check them.
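As a sketch of what can be checked without data, here is the weighted eigenvector matrix computed from a dispersion matrix in NumPy. The 3x3 correlation matrix is hypothetical, and the variable name `Lambda` (the diagonal matrix of square roots of the eigenvalues) is just a label for this illustration.

```python
import numpy as np

# Hypothetical correlation matrix (positive definite).
c = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# Eigendecomposition, reordered so the largest eigenvalue comes first.
lam, v = np.linalg.eigh(c)
lam, v = lam[::-1], v[:, ::-1]

# Diagonal matrix of square roots of the eigenvalues.
Lambda = np.diag(np.sqrt(lam))

# Weighted eigenvector matrix: each column of v scaled by sqrt(eigenvalue).
A = v @ Lambda

# Two things worth checking: column lengths, and that A reproduces c.
column_lengths = np.linalg.norm(A, axis=0)
print("column lengths:", column_lengths)
print("sqrt eigenvalues:", np.sqrt(lam))
print("A A^T reproduces c:", np.allclose(A @ A.T, c))
```

The columns of v have unit length, while the columns of A have lengths equal to the square roots of the eigenvalues, which is how one can tell the two kinds of printed "loadings" apart.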
But what if we were given the data? Well, this could get interesting. (This, after all, is why I’ve been doing this stuff for a year and a half!)
Checking Someone Else, With Data
The first significant consequence of being given the data is this: we should be able to relate their eigenvalues and eigenvectors to the data. (We’ll get to the scores….)
That may not be straightforward.
The first challenge is: are the variables in columns or in rows? Harman, we will recall, puts the variables in rows, using the transpose of our customary X matrix: hence we had his
I would compute the means and variances of each variable (i.e. of each column or row). If I were at all uncertain whether the variables were rows or columns, I would compute the means and variances of each row and column. In fact, even if I know that the variables are in columns, say, I would compute the row sums.
This tells me whether I was given raw data, centered data, standardized data, or small standard deviates; or something else. It remains to be seen if they preprocessed the data before computing the dispersion matrix.
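A rough sketch of that diagnosis in NumPy, assuming variables in columns. The function names are hypothetical, and so are two conventions: variances are computed with 1/(N-1), and "small standard deviates" is taken to mean standardized data divided by √(N-1), so that each column has unit length. An author who standardized with 1/N would need a different check.

```python
import numpy as np

def describe_preprocessing(X, atol=1e-8):
    """Guess how the columns of X were preprocessed (variables in columns)."""
    N = X.shape[0]
    means = X.mean(axis=0)
    variances = X.var(axis=0, ddof=1)  # assumed 1/(N-1) convention
    if not np.allclose(means, 0.0, atol=atol):
        return "raw"
    if np.allclose(variances, 1.0, atol=atol):
        return "standardized"
    if np.allclose(variances * (N - 1), 1.0, atol=atol):
        return "small standard deviates"
    return "centered"

def has_constant_row_sums(X):
    """True if every row of X adds up to the same number."""
    row_sums = X.sum(axis=1)
    return bool(np.allclose(row_sums, row_sums[0]))

# Hypothetical data, preprocessed four ways.
rng = np.random.default_rng(1)
raw = rng.normal(loc=5.0, scale=2.0, size=(12, 3))
centered = raw - raw.mean(axis=0)
standardized = centered / raw.std(axis=0, ddof=1)
small = standardized / np.sqrt(raw.shape[0] - 1)

# Hypothetical constant-row-sum data: each row rescaled to add up to 1.
shares = np.abs(raw) / np.abs(raw).sum(axis=1, keepdims=True)

for name, data in [("raw", raw), ("centered", centered),
                   ("standardized", standardized), ("small", small)]:
    print(name, "->", describe_preprocessing(data))
print("constant row sums:", has_constant_row_sums(shares))
```

To check rows instead of columns, pass the transpose of the data.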
Okay, if the given data is standardized (or small standard deviates), I expect that they used the given data for the analysis. But I might be wrong.
Now, what did they do? They either computed an SVD of the data, or an eigendecomposition of a dispersion matrix. I would compute whatever I think they did with my best guess for the preprocessing.
If my eigenvectors are right, then I have the correct dispersion matrix to within a scale factor; if my eigenvalues are also right, then the scale factor is right.
Bear in mind that any multiple of an eigenvector is also an eigenvector. I would have computed an orthogonal eigenvector matrix v. If the given eigenvectors are of unit length, we have a match so long as they agree with mine to within a sign. If the given eigenvectors do not agree with mine to within a sign, I would compute their lengths. If their lengths turn out to be $\sqrt{\lambda_i}$, the square roots of the eigenvalues, I would immediately compute
$A = v\,\Lambda$
expecting that my columns of A would agree with theirs to within a sign. About the only other scaling I have seen is to set the largest component of each eigenvector to 1. (Apart from Basilevsky normalizing the rows of the A matrix by the inverse standard deviations! And that, by the way, means the columns of A are no longer eigenvectors.)
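The comparison "to within a sign" is easy to automate. The following NumPy sketch uses a hypothetical helper, `match_up_to_sign`, and simulates two hypothetical authors: one who printed unit eigenvectors and one who printed weighted eigenvectors, each with some signs flipped.

```python
import numpy as np

def match_up_to_sign(mine, theirs, atol=1e-6):
    """True if each column of `theirs` equals the same column of `mine` or its negative."""
    # The sign of each column's dot product says whether to flip my column.
    signs = np.sign(np.sum(mine * theirs, axis=0))
    return bool(np.allclose(mine * signs, theirs, atol=atol))

# Hypothetical correlation matrix and its eigendecomposition (largest first).
c = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
lam, v = np.linalg.eigh(c)
lam, v = lam[::-1], v[:, ::-1]

# Pretend one author printed unit eigenvectors with a sign flipped...
theirs_unit = v * np.array([1.0, -1.0, 1.0])
# ...and another printed weighted eigenvectors, also with a sign flipped.
theirs_weighted = (v * np.sqrt(lam)) * np.array([-1.0, 1.0, 1.0])

unit_ok = match_up_to_sign(v, theirs_unit)        # matches v directly
direct_ok = match_up_to_sign(v, theirs_weighted)  # does not match v

# So look at their column lengths; if they are sqrt(eigenvalue), compare with A.
their_lengths = np.linalg.norm(theirs_weighted, axis=0)
lengths_are_sqrt_lam = bool(np.allclose(their_lengths, np.sqrt(lam)))
A = v @ np.diag(np.sqrt(lam))
weighted_ok = match_up_to_sign(A, theirs_weighted)

print(unit_ok, direct_ok, lengths_are_sqrt_lam, weighted_ok)
```

With printed (rounded) eigenvectors, `atol` would need to be loosened to about the printed precision.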
Let me remind you that raw data, centered data, and standardized data (each denoted X) lead to different eigenvectors of $X^T X$; so if I match the eigenvectors, I have correctly matched the author’s raw, centered, or standardized data. Until I match the eigenvectors, there’s not much point in worrying about the eigenvalues.
If the eigenvalues are off by a scale factor, once the eigenvectors are right, then I need to use a different multiple of $X^T X$. That is,
$X^T X$, $\frac{X^T X}{N}$, and $\frac{X^T X}{N-1}$
have the same eigenvectors, but the eigenvalues differ by factors of N or N-1 or (N-1)/N.
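A quick numerical confirmation of those three factors, as a NumPy sketch on hypothetical centered data with N = 10 observations of 3 variables:

```python
import numpy as np

# Hypothetical centered data.
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 3))
X = X - X.mean(axis=0)
N = X.shape[0]

xtx = X.T @ X

# Eigenvalues (largest first) of the three candidate dispersion matrices.
lam_xtx = np.linalg.eigvalsh(xtx)[::-1]
lam_N = np.linalg.eigvalsh(xtx / N)[::-1]
lam_N1 = np.linalg.eigvalsh(xtx / (N - 1))[::-1]

print("X^T X      vs X^T X / N    :", lam_xtx / lam_N)    # factor N
print("X^T X      vs X^T X / (N-1):", lam_xtx / lam_N1)   # factor N-1
print("X^T X / N  vs X^T X / (N-1):", lam_N / lam_N1)     # factor (N-1)/N

# The eigenvectors agree to within a sign, whichever multiple is used.
_, v_a = np.linalg.eigh(xtx)
_, v_b = np.linalg.eigh(xtx / (N - 1))
same_vectors = bool(np.allclose(np.abs(v_a.T @ v_b), np.eye(3), atol=1e-8))
print("same eigenvectors:", same_vectors)
```

If an author's eigenvalues differ from mine by a constant ratio close to one of these, that ratio identifies which multiple they used.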
The same considerations apply to the computation of the singular value decomposition. If my matrices u and v agree with the author’s (to within signs, assuming orthonormal eigenvectors, as usual), I have correctly matched raw, centered, or standardized. If the singular values w are off by a scale factor, then they scaled the data somehow.
Why all the fuss?
- I have seen people use 1/N in their theoretical discussion, but use 1/(N-1) in their computation.
- I have seen someone use 1/N to get “the correlation matrix” or “the covariance matrix”, but I would have used 1/(N-1).
- I have seen people say they were using “the covariance matrix”, but they actually used $X^T X$ instead of $\frac{X^T X}{N-1}$.
- I have seen people call $X^T X$ the covariance matrix when X wasn’t even centered.
- I have seen someone say the data was standardized, when in fact it was small standard deviates. (Same eigenvectors, different eigenvalues.)
Of course, either the author or I might have made a mistake getting the data into the computer; but I have learned to consider instead the possibility that an author and I are doing different computations on the same given data.
Finally, I have sometimes found that I agree with an author’s largest few eigenvectors (by which I mean, the eigenvectors associated with the largest eigenvalues), but not with the smaller ones. I suspect that this is a result of their using a sequential algorithm to find eigenvectors one after another.
I remind you that it is very easy to check an eigenvector-eigenvalue pair, my own or an author’s. Suppose we have done an eigendecomposition of a matrix c,
$c = v\,\Lambda^2\,v^T$
Suppose, for example, that $v_3$ is the third eigenvector (with associated eigenvalue $\lambda_3$). Then the definitions of eigenvector and eigenvalue tell us that
$c\,v_3 = \lambda_3\,v_3$
The LHS is a matrix-vector product and the RHS is a scalar times a vector. Compute both sides and compare them.
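In NumPy the check is a couple of lines. This sketch uses a hypothetical correlation matrix and checks the third eigenpair that NumPy itself computed; for an author's printed pair, one would type in their vector and eigenvalue instead and expect agreement only to about the printed precision.

```python
import numpy as np

# Hypothetical dispersion matrix.
c = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# Eigendecomposition, largest eigenvalue first.
lam, v = np.linalg.eigh(c)
lam, v = lam[::-1], v[:, ::-1]

# Third eigenvector and its eigenvalue (index 2, since Python counts from 0).
v3 = v[:, 2]
lam3 = lam[2]

lhs = c @ v3       # matrix-vector product
rhs = lam3 * v3    # scalar times vector

residual = np.max(np.abs(lhs - rhs))
print("LHS:", lhs)
print("RHS:", rhs)
print("largest discrepancy:", residual)
```

Any nonzero multiple of `v3` passes the same test, which is why this check says nothing about how the eigenvector was scaled.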
Whew! So far, I have matched the author’s eigendecomposition. Presumably, I know whether they used the orthogonal eigenvector matrix v or the weighted eigenvector matrix A. The eigenvector matrix, whichever one they chose, is usually the “loadings”.
We’re actually home free, now.
If they present the “scores”, I expect to find that they are either X v or – if they come from A – the first few columns of ( here), unless I have a reason to believe they are using Davis’ scheme:
(And for more about that, see the post immediately preceding this one.)