## PCA / FA Example 1: Harman. What he did.

Less is more. And more is huge. It is easy for me to end up with huge posts to put out here, but I’d rather go with smaller.

Let’s get started with PCA / FA, principal components analysis and factor analysis.

In case it matters, I am using Mathematica to do these computations.

Here is an example, the first of several. This comes from Harman’s “Modern Factor Analysis”. In order to emphasize the distinction between PCA and FA, he has one example of principal components analysis, and this is it.

Let me tell you up front what he did:

• Get some data;
• Compute its correlation matrix;
• Find the eigenstructure of the correlation matrix;
• Weight each eigenvector by the square root of its eigenvalue;
• Tabulate the results;
• Plot the original variables in the space of the two largest principal components.

I also need to say that his conceptual model is written

Z = A F,

and from the dimensions of the matrices it is clear that

A is square, k by k

Z and F are the same shape, with k rows.

We infer from its size that A will be derived from the eigenvector matrix, and that Z is derived from the given data matrix. From the shapes, we conclude that Z has observations in columns, rather than in rows. (If you’re used to econometrics or regression, you expect the transpose, observations in rows.)

But this is a fine thing, because we recognize that Z = A F is a change-of-basis equation for corresponding columns of Z and F; A is a transition matrix mapping new components (any one column of F) to old components (a column of Z).

The following data comes from Harman, p. 14. We would customarily say it has 5 variables (k = 5) and 12 observations (n = 12); to be precise, we mean it has 12 observations per variable, since the total number of data points is 60. I have chosen to use regression notation, and denote the number of variables by k.

Here’s the data.

$D = \left(\begin{array}{ccccc} 5700&12.8&2500&270&25000\\ 1000&10.9&600&10&10000\\ 3400&8.8&1000&10&9000\\ 3800&13.6&1700&140&25000\\ 4000&12.8&1600&140&25000\\ 8200&8.3&2600&60&12000\\ 1200&11.4&400&10&16000\\ 9100&11.5&3300&60&14000\\ 9900&12.5&3400&180&18000\\ 9600&13.7&3600&390&25000\\ 9600&9.6&3300&80&12000\\ 9400&11.4&4000&100&13000\end{array}\right)$

I have called it D, for data matrix; and I have displayed it with observations in rows. This is what I’m used to from regression analysis. This is also how Harman displayed it. The point of the discussion about Z is that Z is a transposed matrix, relative to D: D has variables in columns, Z has variables in rows. In addition, if Harman were to compute Z (which he does not), it would almost certainly be standardized data. (That is, subtract the mean of each variable, and then divide each by its sample standard deviation.)
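To make the relation between D and Z concrete, here is a minimal sketch in plain Python (a stand-in for the Mathematica session, and not something Harman does). The helper name `standardize` is mine, and the assumption that Z is standardized with the N-1 standard deviation is just that, an assumption.

```python
from statistics import mean, stdev

# Harman's data as displayed in D: 12 observations (rows) by 5 variables (columns).
D = [
    [5700, 12.8, 2500, 270, 25000],
    [1000, 10.9,  600,  10, 10000],
    [3400,  8.8, 1000,  10,  9000],
    [3800, 13.6, 1700, 140, 25000],
    [4000, 12.8, 1600, 140, 25000],
    [8200,  8.3, 2600,  60, 12000],
    [1200, 11.4,  400,  10, 16000],
    [9100, 11.5, 3300,  60, 14000],
    [9900, 12.5, 3400, 180, 18000],
    [9600, 13.7, 3600, 390, 25000],
    [9600,  9.6, 3300,  80, 12000],
    [9400, 11.4, 4000, 100, 13000],
]


def standardize(values):
    """Subtract the mean, then divide by the sample (N-1) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


# zip(*D) transposes D, so each variable becomes one row, as in Z.
Z = [standardize(variable) for variable in zip(*D)]

print(len(Z), "rows (variables) by", len(Z[0]), "columns (observations)")
```

Each row of Z then has mean 0 and sample standard deviation 1, and Z is k by n where D is n by k.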

He printed means and standard deviations. I checked his printed means…

$\{6241.7,\ 11.4,\ 2333.3,\ 120.8,\ 17000.\}$

and, more importantly, his standard deviations…

$\{3440.,\ 1.8,\ 1241.2,\ 114.9,\ 6367.5\}$

Our numbers agree: despite using N in his discussion, he correctly used N-1 in the computation of the sample variance. I would have been stunned to disagree with his means, but he led me to expect different computed standard deviations. I am pleasantly surprised.
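The N versus N-1 question is easy to check by computing both. This is a small sketch in plain Python rather than Mathematica; `stdev` divides by N-1 and `pstdev` divides by N, and the variable names are only illustrative.

```python
from statistics import mean, stdev, pstdev

# Harman's data: 12 observations (rows) by 5 variables (columns).
D = [
    [5700, 12.8, 2500, 270, 25000],
    [1000, 10.9,  600,  10, 10000],
    [3400,  8.8, 1000,  10,  9000],
    [3800, 13.6, 1700, 140, 25000],
    [4000, 12.8, 1600, 140, 25000],
    [8200,  8.3, 2600,  60, 12000],
    [1200, 11.4,  400,  10, 16000],
    [9100, 11.5, 3300,  60, 14000],
    [9900, 12.5, 3400, 180, 18000],
    [9600, 13.7, 3600, 390, 25000],
    [9600,  9.6, 3300,  80, 12000],
    [9400, 11.4, 4000, 100, 13000],
]

columns = list(zip(*D))                       # one tuple per variable
means = [mean(c) for c in columns]
sample_sd = [stdev(c) for c in columns]       # denominator N - 1
population_sd = [pstdev(c) for c in columns]  # denominator N

for m, s, p in zip(means, sample_sd, population_sd):
    print(f"mean {m:9.1f}   sd (N-1) {s:8.1f}   sd (N) {p:8.1f}")
```

The N-1 column is the one that matches the printed standard deviations; the N column comes out visibly smaller.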

We compute the correlation matrix r:

$r = \left(\begin{array}{ccccc} 1.&0.00975&0.97245&0.43887&0.02241 \\ 0.00975&1.&0.15428&0.69141&0.86307 \\ 0.97245&0.15428&1.&0.51472&0.12193 \\ 0.43887&0.69141&0.51472&1.&0.77765 \\ 0.02241&0.86307&0.12193&0.77765&1.\end{array}\right)$

This also agrees with his tabulated correlation matrix on p. 14. (I have rounded the display to match his.)
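For readers without Mathematica, the correlation matrix can be reproduced with a few lines of plain Python. This is a sketch; the function name `correlation_matrix` is my own, and it computes the ordinary Pearson correlation, which I am assuming is what Harman tabulated.

```python
from math import sqrt

# Harman's data: 12 observations (rows) by 5 variables (columns).
D = [
    [5700, 12.8, 2500, 270, 25000],
    [1000, 10.9,  600,  10, 10000],
    [3400,  8.8, 1000,  10,  9000],
    [3800, 13.6, 1700, 140, 25000],
    [4000, 12.8, 1600, 140, 25000],
    [8200,  8.3, 2600,  60, 12000],
    [1200, 11.4,  400,  10, 16000],
    [9100, 11.5, 3300,  60, 14000],
    [9900, 12.5, 3400, 180, 18000],
    [9600, 13.7, 3600, 390, 25000],
    [9600,  9.6, 3300,  80, 12000],
    [9400, 11.4, 4000, 100, 13000],
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def correlation_matrix(rows):
    """Pearson correlations between the columns of a row-oriented data matrix."""
    n = len(rows)
    columns = list(zip(*rows))
    # Deviations of each variable from its own mean.
    deviations = [[x - sum(c) / n for x in c] for c in columns]
    # The N or N-1 factor cancels between numerator and denominator.
    return [[dot(u, v) / sqrt(dot(u, u) * dot(v, v)) for v in deviations]
            for u in deviations]


r = correlation_matrix(D)
for row in r:
    print("  ".join(f"{x:8.5f}" for x in row))
```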

Now (jumping from his p. 14 to p. 135) we get the eigenstructure of the correlation matrix r. Here is an orthogonal eigenvector matrix:

$P = \left(\begin{array}{ccccc} -0.342731&-0.601629&-0.0595321&0.204003&0.689505\\ -0.452506&0.406417&-0.688816&-0.353592&0.174837\\ -0.396696&-0.541664&-0.247949&0.0229598&-0.698016\\ -0.550056&0.077817&0.664087&-0.500371&-0.000134573\\ -0.466738&0.416428&0.139634&0.763189&-0.0823919\end{array}\right)$

(I remind you that eigenvectors, even orthonormal ones, are not unique; as it happens, every one I got is the negative of his. That doesn’t matter, but for subsequent computations, I changed the sign of my matrix.)

(We have “diagonalized” the correlation matrix r. That is, we have computed a matrix P whose columns are unit eigenvectors, and a diagonal matrix $\Lambda$ whose diagonal elements $\lambda$ are the eigenvalues, such that $\Lambda = P^{-1}\ r\ P$. Since P is orthogonal, $P^{-1} = P^T$.)

We also agree on the eigenvalues (which are unique):

$\lambda = \{2.8733,\ 1.7967,\ 0.2148,\ 0.0999,\ 0.0153\}$
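One way to convince yourself that these really are eigenpairs, without rerunning an eigensolver, is to check $r\,p_i = \lambda_i\,p_i$ column by column using the rounded numbers printed above. The sketch below is plain Python (not the Mathematica session), and the helper names are mine; because the inputs are rounded, the residuals are small but not zero.

```python
# Correlation matrix, eigenvector matrix and eigenvalues as printed above.
r = [
    [1.0,     0.00975, 0.97245, 0.43887, 0.02241],
    [0.00975, 1.0,     0.15428, 0.69141, 0.86307],
    [0.97245, 0.15428, 1.0,     0.51472, 0.12193],
    [0.43887, 0.69141, 0.51472, 1.0,     0.77765],
    [0.02241, 0.86307, 0.12193, 0.77765, 1.0],
]
P = [
    [-0.342731, -0.601629, -0.0595321,  0.204003,   0.689505],
    [-0.452506,  0.406417, -0.688816,  -0.353592,   0.174837],
    [-0.396696, -0.541664, -0.247949,   0.0229598, -0.698016],
    [-0.550056,  0.077817,  0.664087,  -0.500371,  -0.000134573],
    [-0.466738,  0.416428,  0.139634,   0.763189,  -0.0823919],
]
lam = [2.8733, 1.7967, 0.2148, 0.0999, 0.0153]


def column(M, j):
    return [row[j] for row in M]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


# Largest entry of r p - lambda p for each eigenpair.
residuals = []
for j, l in enumerate(lam):
    p = column(P, j)
    rp = matvec(r, p)
    residuals.append(max(abs(a - l * b) for a, b in zip(rp, p)))

# P^T P should be close to the identity if the columns are orthonormal.
gram = [[sum(a * b for a, b in zip(column(P, i), column(P, j)))
         for j in range(5)] for i in range(5)]

print("largest residual:", max(residuals))
print("sum of eigenvalues:", sum(lam))
```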

Now he weights each eigenvector by the square root of its eigenvalue; e.g., the first eigenvector will have length $\sqrt{2.8733}$ while the fifth will have length $\sqrt{0.0153}$.

We end up with the following (rounded) matrix of weighted eigenvectors (where I have also multiplied all of mine by -1):

$A = \left(\begin{array}{ccccc} 0.581&0.8064&0.0276&-0.0645&-0.0852\\ 0.767&-0.5448&0.3193&0.1118&-0.0216\\ 0.6724&0.726&0.1149&-0.0073&0.0862\\ 0.9324&-0.1043&-0.3078&0.1582&0\\ 0.7912&-0.5582&-0.0647&-0.2413&0.0102\end{array}\right)$
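The weighting step is just a column scaling. Here is a minimal sketch in plain Python, starting from the rounded P and eigenvalues printed above, so the last digit can differ slightly from the table; the overall sign flip is the one described in the text.

```python
from math import sqrt

# Eigenvector matrix (columns are unit eigenvectors) and eigenvalues, as printed.
P = [
    [-0.342731, -0.601629, -0.0595321,  0.204003,   0.689505],
    [-0.452506,  0.406417, -0.688816,  -0.353592,   0.174837],
    [-0.396696, -0.541664, -0.247949,   0.0229598, -0.698016],
    [-0.550056,  0.077817,  0.664087,  -0.500371,  -0.000134573],
    [-0.466738,  0.416428,  0.139634,   0.763189,  -0.0823919],
]
lam = [2.8733, 1.7967, 0.2148, 0.0999, 0.0153]
k = len(lam)

# Scale column j by sqrt(lambda_j); the leading minus sign flips every
# eigenvector so that the signs match Harman's.
A = [[-P[i][j] * sqrt(lam[j]) for j in range(k)] for i in range(k)]

# Length of each weighted eigenvector; should equal sqrt(lambda_j).
lengths = [sqrt(sum(A[i][j] ** 2 for i in range(k))) for j in range(k)]

for row in A:
    print("  ".join(f"{x:8.4f}" for x in row))
```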

He presents the following summary table, whose center is precisely our weighted eigenvectors:

$\left(\begin{array}{ccccccc} \text{Variable}&P_1&P_2&P_3&P_4&P_5&\text{Variance}\\ 1&0.581&0.8064&0.0276&-0.0645&-0.0852&1.\\ 2&0.767&-0.5448&0.3193&0.1118&-0.0216&1.0002\\ 3&0.6724&0.726&0.1149&-0.0073&0.0862&0.9999 \\ 4&0.9324&-0.1043&-0.3078&0.1582&0&1.\\ 5&0.7912&-0.5582&-0.0647&-0.2413&0.0102&0.9999\\ \text{Variance}&2.8733&1.7967&0.2148&0.0999&0.0153&5.\\ \text{Percent}&57.5&35.9&4.3&2.&0.3&100.\end{array}\right)$

I’ll have a lot to say about that table, but not yet. Oh, I should point out that the row labeled “variance” consists of the five eigenvalues and their sum (yes, 5.).
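The two bottom rows can be checked from the table's own entries. Since each column of A is a unit eigenvector rescaled to length $\sqrt{\lambda}$, its squared length should give back the eigenvalue, and the percent row is each eigenvalue as a share of the total. A sketch in plain Python, using the rounded A from the table (so expect agreement to about three decimals):

```python
# Weighted eigenvectors, copied from the center of the summary table.
A = [
    [0.581,   0.8064,  0.0276, -0.0645, -0.0852],
    [0.767,  -0.5448,  0.3193,  0.1118, -0.0216],
    [0.6724,  0.726,   0.1149, -0.0073,  0.0862],
    [0.9324, -0.1043, -0.3078,  0.1582,  0.0],
    [0.7912, -0.5582, -0.0647, -0.2413,  0.0102],
]
k = len(A)

# Sum of squares down each column: the "Variance" row.
variance = [sum(A[i][j] ** 2 for i in range(k)) for j in range(k)]
total = sum(variance)

# Each column's share of the total: the "Percent" row.
percent = [100 * v / total for v in variance]

print("variance:", [round(v, 4) for v in variance], "total:", round(total, 4))
print("percent: ", [round(p, 1) for p in percent])
```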

There’s one more thing he did. Back on p. 16, as a preview of things to come, he wrote

$z_2 = .767 P_1 - .545 P_2 + .319 P_3 + .112 P_4 - .022 P_5$

(where he used $P_i$ rather than $F_i$ to emphasize that these were principal components from PCA rather than factors from FA; you should read them as $F_i$, but I don’t want to misrepresent exactly what he wrote).

But that is precisely the equation Z = A F written for the second row of the Z matrix: multiply the second row of A by each of the columns of F. I am happy to see this. It explains why people never display Z, and never compute F: they just want to describe the old variable names in terms of the new variable names.
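To see that the equation for $z_2$ is nothing more than the second row of A, here is a tiny sketch in plain Python that reads an equation off any row. The function `row_equation` is a hypothetical helper of mine, and the $P_j$ are just labels here; no component scores are computed.

```python
# Weighted eigenvector matrix A, as tabulated above (rounded).
A = [
    [0.581,   0.8064,  0.0276, -0.0645, -0.0852],
    [0.767,  -0.5448,  0.3193,  0.1118, -0.0216],
    [0.6724,  0.726,   0.1149, -0.0073,  0.0862],
    [0.9324, -0.1043, -0.3078,  0.1582,  0.0],
    [0.7912, -0.5582, -0.0647, -0.2413,  0.0102],
]


def row_equation(A, i):
    """Write z_i (1-based i) as a combination of P_1..P_k using row i of A."""
    pieces = []
    for j, a in enumerate(A[i - 1], start=1):
        sign = "-" if a < 0 else "+"
        pieces.append(f"{sign} {abs(a):.3f} P_{j}")
    body = " ".join(pieces)
    # Drop a leading "+ " so the equation starts with a bare coefficient.
    if body.startswith("+ "):
        body = body[2:]
    return f"z_{i} = {body}"


for i in range(1, len(A) + 1):
    print(row_equation(A, i))
```

The line printed for i = 2 reproduces Harman's equation to three decimals.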

Let us select all the coefficients of P1 and P2, not because only two new variables are important, but because it’s easy to plot things in 2D. By weighting the eigenvectors by $\sqrt{\lambda}$, he has emphasized the first eigenvector over the second, and the second over the third, etc. So let’s see what the first two tell us.

Here are the first two columns of the weighted eigenvector matrix A, i.e. the first two weighted eigenvectors.

$\left(\begin{array}{cc} 0.580958&0.806421\\ 0.767036&-0.544759\\ 0.672433&0.726044\\ 0.932392&-0.104306\\ 0.79116&-0.558179\end{array}\right)$

Let’s plot those pairs of points. Recall

$z_2 = .767 P_1 - .545 P_2 + .319 P_3 + .112 P_4 - .022 P_5$

but retain only the first two terms:

$z_2 = .767 P_1 - .545 P_2$

Now plot the point (.767, -.545) and label it “2”. Do the same for the other four equations, for z1 and for z3 through z5. (Please don’t be offended at my being explicit: it took me forever to get my mind off the data itself and to grok this relationship between variable names.)

Verily, points 1 and 3 (i.e. variables z1 and z3) are similar, and variables z2 and z5 are even more similar, and variable z4 is about equally different from the two clusters, but closer to 2-5.
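The eyeball judgment can be backed up with numbers. As a sketch in plain Python, here are the straight-line distances between the five plotted points, using the first two columns of A given above. Euclidean distance in this plane is my own choice of yardstick for "similar", not something Harman uses.

```python
from itertools import combinations
from math import dist

# Coordinates of each variable in the plane of the first two components.
points = {
    1: (0.580958,  0.806421),
    2: (0.767036, -0.544759),
    3: (0.672433,  0.726044),
    4: (0.932392, -0.104306),
    5: (0.79116,  -0.558179),
}

# Distance between every pair of variables.
distances = {(i, j): dist(points[i], points[j])
             for i, j in combinations(sorted(points), 2)}

for (i, j), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"d({i},{j}) = {d:.3f}")
```

The pair 2-5 comes out closest, then 1-3, and point 4 is nearer to 2 and 5 than to 1 and 3.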

Let me summarize Harman’s analysis to this point. For this example, he has:

• given us data: 5 variables, 12 observations;
• computed the eigenstructure of the correlation matrix;
• weighted each eigenvector by the square root of its eigenvalue;
• tabulated the results;
• plotted the original variables in the space of the first two principal components.

What has he not done?

• He did not actually compute Z or F;
• He did not explain why his table looks the way it does;
• He did not explain the graph he drew.

And two thoughts of my own: I wonder if the fact that z4 is different from the two clusters suggests that there really ought to be three new variables P1, P2, P3 instead of just two. Maybe we should do a 3D graph.

Enough for now. Next, I’ll talk about what he did.