PCA / FA example 4: davis. R mode FA

since it’s been more than a week, we should all probably review davis’ R-mode FA. (i needed to!) the challenge may be that we have kept track of not much information, but in multiple ways. the bottom line is that we have
  • an orthogonal 3×3 eigenvector matrix P,
  • a 3×3 diagonal matrix \Sigma of eigenvalues,
  • and a 3×3 diagonal matrix \Lambda = \sqrt{\Sigma }, i.e. of square roots of eigenvalues.
in order to be very sure that my calculations match davis’, i also have cut-down versions of those three matrices:
  • an orthonormal 3×2 matrix U containing the first two eigenvectors,
  • a 2×2 diagonal matrix \Sigma_2 of the nonzero eigenvalues,
  • and a 2×2 diagonal matrix \Lambda_2 = \sqrt{\Sigma_2}, of the square roots of the nonzero eigenvalues.
moving on… he computes what’s called the matrix of R-mode loadings, in his notation A^R = U \ \Lambda , for which my notation is A^R = U \ \Lambda_2,
but which i would write and compute as
A^R = P \ \Lambda .
here it is, my way:
A^R = \left(\begin{array}{ccc} 0.816497&0.&0.57735\\ -0.408248&0.707107&0.57735\\ -0.408248&-0.707107&0.57735\end{array}\right) x \left(\begin{array}{ccc} 9.16515&0.&0.\\ 0.&3.4641&0.\\ 0.&0.&0.\end{array}\right)
= \left(\begin{array}{ccc} 7.48331&0.&0.\\ -3.74166&2.44949&0.\\ -3.74166&-2.44949&0.\end{array}\right)
i’m carrying around an extra column of zeroes. i then compute it his way:
\left(\begin{array}{cc} 0.816497&0.\\ -0.408248&0.707107\\ -0.408248&-0.707107\end{array}\right) x \left(\begin{array}{cc} 9.16515&0.\\ 0.&3.4641\end{array}\right)
= \left(\begin{array}{cc} 7.48331&0.\\ -3.74166&2.44949\\ -3.74166&-2.44949\end{array}\right)
the nonzero numbers are the same; using the smaller matrices has no numerical effect. so far, all we’ve seen is that i get the same answer either way. but what about his answer? if i round my answer just a little bit i get…
\left(\begin{array}{cc} 7.4833&0\\ -3.7417&2.4495\\ -3.7417&-2.4495\end{array}\right)
and then i copy what he shows (on p. 505):
\left(\begin{array}{cc} 7.4832&0\\ -3.7412&2.4494\\ -3.7412&-2.4494\end{array}\right)
we see that davis and i differ by .0001 or .0005; not at all significant. he hasn’t lost any numbers by throwing away a zero eigenvalue and the corresponding eigenvector. my guess is that mathematica carried more precision in the computations.
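for anyone who wants to replay the loadings computation outside mathematica, here is a minimal numpy sketch (python and the variable names are assumptions of this sketch, not what was used above). it builds P from the unnormalized eigenvectors (2,-1,-1), (0,1,-1), (1,1,1), takes the eigenvalues 84, 12, 0 as given, and forms the loadings both the 3×3 way and the cut-down way.

```python
import numpy as np

# eigenvector matrix P: start from integer eigenvectors as columns,
# then scale each column to unit length
P = np.array([[ 2.0,  0.0, 1.0],
              [-1.0,  1.0, 1.0],
              [-1.0, -1.0, 1.0]])
P /= np.linalg.norm(P, axis=0)

# eigenvalues of X^T X, and Lambda = sqrt(Sigma)
eigenvalues = np.array([84.0, 12.0, 0.0])
Lam = np.diag(np.sqrt(eigenvalues))

# "my way": full 3x3 matrices, last column comes out as zeroes
A_full = P @ Lam

# "his way": keep only the two eigenvectors with nonzero eigenvalues
U = P[:, :2]
Lam2 = Lam[:2, :2]
A_cut = U @ Lam2

print(np.round(A_full, 5))
print(np.round(A_cut, 5))
```

the first two columns of A_full should match A_cut, which is the whole point of the comparison.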
now he computes what are called the R-mode scores as
S^R = X \ A^R.
using my matrices, i get…
S^R = \left(\begin{array}{ccc} -6&3&3\\ 2&1&-3\\ 0.&-1&1\\ 4&-3&-1\end{array}\right) x \left(\begin{array}{ccc} 7.48331&0.&0.\\ -3.74166&2.44949&0.\\ -3.74166&-2.44949&0.\end{array}\right)
= \left(\begin{array}{ccc} -67.3498&0.&0.\\ 22.4499&9.79796&0.\\ 0.&-4.89898&0.\\ 44.8999&-4.89898&0.\end{array}\right)
and doing it his way i get
\left(\begin{array}{ccc} -6&3&3\\ 2&1&-3\\ 0.&-1&1\\ 4&-3&-1\end{array}\right) x \left(\begin{array}{cc} 7.48331&0.\\ -3.74166&2.44949\\ -3.74166&-2.44949\end{array}\right)
= \left(\begin{array}{cc} -67.3498&0.\\ 22.4499&9.79796\\ 0.&-4.89898\\ 44.8999&-4.89898\end{array}\right)
(again, that verified only that i get the same answer two ways.) i round it for comparison with him, getting
\left(\begin{array}{cc} -67.3&0.\\ 22.4&9.8\\ 0.&-4.9\\ 44.9&-4.9\end{array}\right)
and to this accuracy, davis and i agree exactly (again p. 505, but i didn’t bother to copy and display his numbers per se).
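the scores computation is a single matrix product, so it is easy to replay. a minimal numpy sketch, assuming python and using the rounded loadings displayed above (the names X, A, S are just for this sketch):

```python
import numpy as np

# centered data matrix X (4 observations, 3 variables)
X = np.array([[-6.0,  3.0,  3.0],
              [ 2.0,  1.0, -3.0],
              [ 0.0, -1.0,  1.0],
              [ 4.0, -3.0, -1.0]])

# R-mode loadings, cut-down (3x2) version, rounded as displayed
A = np.array([[ 7.48331,  0.0    ],
              [-3.74166,  2.44949],
              [-3.74166, -2.44949]])

# R-mode scores S^R = X A^R
S = X @ A

print(np.round(S, 4))
print(np.round(S, 1))   # one decimal place, for comparison with davis
```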
now let’s consider what we did: an eigendecomposition of X^T X, where X was “centered” data, i.e. with mean zero. as we discussed, the covariance matrix of zero-mean data X is just
cov(X) = \frac{1}{N-1} X^T X,
X^T X = (N-1) cov(X)
where N is the number of observations. right? the covariance matrix of our data is…
cov(X) = \left(\begin{array}{ccc} \frac{56}{3}&-\frac{28}{3}&-\frac{28}{3}\\ -\frac{28}{3}&\frac{20}{3}&\frac{8}{3}\\ -\frac{28}{3}&\frac{8}{3}&\frac{20}{3}\end{array}\right)
then multiply by 3 (=N-1)…
3 cov(X) = \left(\begin{array}{ccc} 56&-28&-28\\ -28&20&8\\ -28&8&20\end{array}\right)
and recall
X^T X = \left(\begin{array}{ccc} 56&-28&-28\\ -28&20&8\\ -28&8&20\end{array}\right)
the same.
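the same check in a minimal numpy sketch (python is an assumption here; note that np.cov divides by N-1 by default, and rowvar=False tells it the variables are in columns):

```python
import numpy as np

# centered data: each column sums to zero
X = np.array([[-6.0,  3.0,  3.0],
              [ 2.0,  1.0, -3.0],
              [ 0.0, -1.0,  1.0],
              [ 4.0, -3.0, -1.0]])
N = X.shape[0]

XtX = X.T @ X
cov = np.cov(X, rowvar=False)   # sample covariance, divides by N-1

print(X.mean(axis=0))           # column means, should all be zero
print(XtX)
print((N - 1) * cov)            # should reproduce X^T X
```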
i know, i used c = X^T X when i started this example, and maybe i shouldn’t have, but i wanted a one-letter symbol for the equations involving this matrix and i chose “c” because i knew this was effectively if not exactly the covariance matrix.
there is a significant similarity: X^T X and the covariance matrix have the same eigenvectors (as close to “the same” as it gets, for things that really only specify directions in space). if i ask mathematica for an eigenvector matrix of the covariance matrix, i get:
\left(\begin{array}{ccc} -2&0&1\\ 1&-1&1\\ 1&1&1\end{array}\right)
OTOH, the eigenvector matrix we got for X^T X was:
\left(\begin{array}{ccc} -2&0&1\\ 1&-1&1\\ 1&1&1\end{array}\right)
(ok, i’m dancing on the high wire. unit eigenvectors may differ by a sign; not being normalized, these eigenvectors could have differed by arbitrary scale factors! by finding the smallest possible integer components, mathematica gave me the same answers for the two computations. i wasn’t really lucky, per se: i was sort of expecting that it would work out that way. otherwise, i could have converted both sets to orthonormal vectors in order to compare them.)
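the "convert both to orthonormal vectors and compare" route can be sketched in numpy (an assumption of this sketch; the post used mathematica). np.linalg.eigh returns unit eigenvectors in order of ascending eigenvalue, with arbitrary signs, so the comparison below uses absolute values of the dot products between the two sets.

```python
import numpy as np

X = np.array([[-6.0,  3.0,  3.0],
              [ 2.0,  1.0, -3.0],
              [ 0.0, -1.0,  1.0],
              [ 4.0, -3.0, -1.0]])

XtX = X.T @ X
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices; columns of V are unit eigenvectors
w_xtx, V_xtx = np.linalg.eigh(XtX)
w_cov, V_cov = np.linalg.eigh(cov)

# |v_i . u_j| is 1 when two unit vectors agree up to sign, 0 when orthogonal
overlap = np.abs(V_xtx.T @ V_cov)
print(np.round(overlap, 6))     # identity matrix if the directions match
```

this works cleanly here because the three eigenvalues are distinct; with a repeated eigenvalue the eigenvectors would not be pinned down individually and the overlap test would need more care.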
there is also a significant dissimilarity: the eigenvalues of cov(X) and of X^T X differ by a factor of N-1, where N is the number of observations. (this is the N-1 in the computation of the sample variance.) just as the matrices differ by a factor of N-1,
X^T X = (N-1) cov(X)
so do the eigenvalues:
eigenvalues of X^T X = (N-1) * eigenvalues of cov(X).
for our example, here they are:
the eigenvalues of X^T X are… {84,\ 12,\ 0}
the eigenvalues of cov(X) are… {28,\ 4,\ 0}
and 3 times the eigenvalues of cov(X) gives… {84,\ 12,\ 0}
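a minimal numpy sketch of the eigenvalue comparison (python is an assumption; eigvalsh returns eigenvalues in ascending order, so they are reversed to match the listing above, and the zero eigenvalue may show up as a tiny rounding-level number):

```python
import numpy as np

X = np.array([[-6.0,  3.0,  3.0],
              [ 2.0,  1.0, -3.0],
              [ 0.0, -1.0,  1.0],
              [ 4.0, -3.0, -1.0]])
N = X.shape[0]

# eigenvalues of the two symmetric matrices, largest first
ev_xtx = np.linalg.eigvalsh(X.T @ X)[::-1]
ev_cov = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(np.round(ev_xtx, 6))
print(np.round(ev_cov, 6))
print(np.round((N - 1) * ev_cov, 6))
```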
so, whether we use X^T X of zero-mean data X, or whether we use the covariance matrix of the raw data, has no effect on the eigenvectors, and only a common scaling effect on the eigenvalues. recall, by contrast, that whether we use the covariance matrix or the correlation matrix has unpredictable effects on the eigenvectors and eigenvalues.
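to see the covariance-versus-correlation contrast on this same small data set, here is a hedged numpy sketch (python, np.corrcoef, and the "cosine between leading eigenvectors" measure are choices of this sketch, not something computed in the post):

```python
import numpy as np

X = np.array([[-6.0,  3.0,  3.0],
              [ 2.0,  1.0, -3.0],
              [ 0.0, -1.0,  1.0],
              [ 4.0, -3.0, -1.0]])

cov = np.cov(X, rowvar=False)
corr = np.corrcoef(X, rowvar=False)   # covariance rescaled to unit variances

w_cov, V_cov = np.linalg.eigh(cov)    # ascending eigenvalues
w_cor, V_cor = np.linalg.eigh(corr)

# leading eigenvector = last column; compare directions up to sign
lead_cov = V_cov[:, -1]
lead_cor = V_cor[:, -1]
cosine = abs(lead_cov @ lead_cor)

print(np.round(w_cov, 6))
print(np.round(w_cor, 6))
print(round(float(cosine), 4))        # 1.0 would mean the same direction
```

unlike the X^T X versus cov(X) comparison, the eigenvalues here are not related by one common factor, and the leading eigenvectors point in measurably different directions.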
but what are we doing using the covariance matrix or something like it? jolliffe and harman both think we should be using the correlation matrix. and jolliffe has shown us how significant the difference can be between using the correlation matrix or the covariance matrix. and using X^T X exacerbates the problem, because it has even larger eigenvalues than the covariance matrix.
this is why i will say, over and over, your choice of preprocessing is more important than the subsequent eigendecomposition or singular value decomposition, or your scaling of the eigenvectors, or whether you throw away eigenvectors associated with zero eigenvalues. preprocessing is far more important than whether we write the eigenvalue matrix as a 2×2 or as a 3×3. (all that is, however, my opinion, and i am an outsider to this field.)
we definitely need to talk about this. for the record, i am certain that davis knows exactly what he’s doing, and that – jolliffe and harman notwithstanding – it may be correct to use centered data (or the covariance matrix) for some analyses.
