PCA / FA Bartholomew et al.: discussion

This is the 4th post in the Bartholomew et al. sequence in PCA/FA, but it’s an overview of what I did last time. Before we plunge ahead with another set of computations, let me step back and discuss what we have seen.

I want to elaborate on the previous post. We discussed

  • the choice of data corresponding to an eigendecomposition of the correlation matrix
  • the pesky \sqrt{N-1} that shows up when we relate the \sqrt{\text{eigenvalues}},\ \Lambda, of a correlation matrix to the principal values w of the standardized data
  • the computation of scores as \sqrt{N-1}\ u
  • the computation of scores as F^T\ where X^T = Z = A\ F
  • the computation of scores as projections of the data onto the reciprocal basis
  • different factorings of the data matrix as scores times loadings

The three computations of scores, of course, are all the same, only looking different. I will show you an example where A is not invertible (but not in this post). Although it looks harder at first, it simplifies so considerably as to be easier in practice.
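As a rough numerical check that the three computations agree, here is a minimal sketch in Python with NumPy. The data are random and invented purely for illustration, the variable names are mine, and the sketch assumes A is invertible. It also takes V to be the v from the SVD, so that the signs are consistent.

```python
import numpy as np

# Synthetic data (an assumption for illustration): N observations of p correlated variables.
rng = np.random.default_rng(0)
N, p = 50, 4
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)  # standardized, variables in columns

u, w, vT = np.linalg.svd(X, full_matrices=False)
v = vT.T
Lam = np.diag(w / np.sqrt(N - 1))   # square roots of the eigenvalues of the correlation matrix
A = v @ Lam                          # A = V Lambda, using v from the SVD as V

scores_u = np.sqrt(N - 1) * u                # 1. sqrt(N-1) u
scores_F = np.linalg.solve(A, X.T).T         # 2. F^T, where F solves X^T = A F
reciprocal = np.linalg.inv(A).T              # reciprocal basis: reciprocal^T A = I
scores_proj = X @ reciprocal                 # 3. projections of the data onto the reciprocal basis

all_same = np.allclose(scores_u, scores_F) and np.allclose(scores_u, scores_proj)
print(all_same)
```

If V were taken from a separate eigendecomposition instead, some columns of the scores could differ in sign from \sqrt{N-1}\ u.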

The choice of data is actually dictated by the SVD. If we do an eigendecomposition of the correlation matrix c,

c = V\ \lambda\ V^T

then the data must be standardized or a constant multiple of the standardized data (e.g. small standard deviates), because that gives us that V from the eigendecomposition matches v from the SVD (to within signs of columns),

X = u\ w\ v^T\ .

Similarly, if we do an eigendecomposition of the covariance matrix, then the data must be centered or a constant multiple of the centered data. (We could define “small centered deviates” by dividing the data by \sqrt{N-1}\ .) Any multiple of centered data gives us that V from the eigendecomposition matches v from the SVD (again, to within signs of columns).

Generally, then, if we do an eigendecomposition of X^T\ X or X\ X^T, then the data must be X or a multiple of it. (In that form it looks pretty obvious, huh?)
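The match between V and v can be checked directly. The following is only a sketch with NumPy on made-up random data (not the data from the earlier posts). It assumes the eigenvalues are distinct, so that each eigenvector is determined up to sign.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
N, p = 40, 3
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)  # standardized data

c = X.T @ X / (N - 1)                      # correlation matrix of the raw data
is_corr = np.allclose(c, np.corrcoef(raw, rowvar=False))

lam, V = np.linalg.eigh(c)                 # eigh returns eigenvalues in ascending order
lam, V = lam[::-1], V[:, ::-1]             # reorder to descending, like the SVD

u, w, vT = np.linalg.svd(X, full_matrices=False)
v = vT.T

signs = np.sign(np.sum(V * v, axis=0))     # +1 or -1 for each column
match = np.allclose(V * signs, v)          # V equals v once column signs are aligned
print(is_corr, match)
```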

We had gotten used to not seeing that N-1 or \sqrt{N-1}\ . Both Davis and Malinowski used eigendecompositions in preference to SVDs, but they computed eigendecompositions of X^T\ X not of \frac {X^T\ X}{N-1}. I was able to replicate their results by using an SVD of X, instead of their eigendecompositions, and we had that w = \Lambda \ : the principal values were equal to the square roots of the eigenvalues.

But Harman and Jolliffe used eigendecompositions of the correlation matrix, and the square roots of the eigenvalues are proportional instead of equal to the principal values,

w = \sqrt{N-1}\ \Lambda\ , \quad \Lambda = \sqrt{\lambda}

First remark: the same thing happens if we use the covariance matrix and centered data instead of the correlation matrix and standardized data. (It is not uncommon for people to refer to X^T\ X as a “covariance matrix”; I try to reserve that term for \frac {X^T\ X}{N-1} with X centered.)

Second remark: we saw that Harman’s model and Bartholomew’s both factor the (preferably standardized, possibly centered) data matrix using a weighted \sqrt{\text{eigenvalue}} matrix A:

A = V\ \Lambda

X = F^T\ A^T

and that in fact we could write that same factoring of X using the SVD:

X = \left(\sqrt{N-1}\ u\right)\left( \Lambda\ v^T\right)
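To see that this factoring really reproduces X when \Lambda comes from the correlation matrix, here is a small NumPy sketch. The data are random and invented for illustration, and the names `scores` and `loadings_T` are mine.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(2)
N, p = 30, 3
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)  # standardized data

# Eigenvalues of the correlation matrix, sorted in descending order.
lam = np.sort(np.linalg.eigvalsh(X.T @ X / (N - 1)))[::-1]
Lam = np.diag(np.sqrt(lam))                # Lambda = sqrt(eigenvalues)

u, w, vT = np.linalg.svd(X, full_matrices=False)

scores = np.sqrt(N - 1) * u                # sqrt(N-1) u
loadings_T = Lam @ vT                      # Lambda v^T
reproduces = np.allclose(scores @ loadings_T, X)

ratio = w / np.sqrt(lam)                   # singular value over sqrt(eigenvalue), per component
print(reproduces, ratio)
```

Every entry of `ratio` comes out equal to \sqrt{N-1}, which is the factor that moves from the singular values into the scores.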

Third remark: Jolliffe wrote his model with symbols similar to Harman’s but with two completely different meanings: he writes

Z = X A

but this time Z is the principal components instead of X^T\ , and his loadings A are the orthogonal eigenvector matrix (i.e. my V), and X is still the data matrix (with variables in columns). That is, in my usual notation,

Y = X V,

where I have written Y for Z because the product X V is precisely the new components of the data wrt V. His “principal components” are the new components of X.

Now, we can write that as a factoring of the data matrix, and because V is orthogonal we get

X = Y\ V^T = \left(u\ w\right)\ v^T

Funny thing, that looks just like Malinowski’s

X = R\ C = \left(u\ w\right)\ v^T

It is.
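A quick numerical confirmation, sketched with NumPy on invented random data. The names `Y`, `R` and `C` follow the notation above; everything else is an assumption for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
N, p = 30, 3
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

u, w, vT = np.linalg.svd(X, full_matrices=False)
v = vT.T

Y = X @ v            # Jolliffe's principal components, Y = X V with V = v
R = u * w            # Malinowski's R = u w (broadcasting scales column j of u by w[j])
C = vT               # Malinowski's C = v^T

same_scores = np.allclose(Y, R)            # X v = u w
jolliffe_ok = np.allclose(Y @ v.T, X)      # X = Y V^T
malinowski_ok = np.allclose(R @ C, X)      # X = R C
print(same_scores, jolliffe_ok, malinowski_ok)
```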

Malinowski, of course, uses his decomposition for an arbitrary X: it may be standardized, or centered, or even raw data. Jolliffe uses his decomposition preferably for a correlation matrix, possibly for a covariance matrix. That difference aside, Jolliffe and Malinowski are using the same model.

Let me be explicit about something: we may start with v from the SVD and use that v in place of V in the eigendecomposition. Going the other way is harder: the SVD has consistent signs on u and v, but V may not be consistent with u. In the eigendecomposition, sign consistency is automatic between V and V^T, or between v and v^T. Ah, the analog of mixing u and V in the SVD would be to mix v and V in the eigendecomposition, but I can’t imagine that we would ever decompose the correlation matrix c as either V\ \lambda\ v^T or v\  \lambda\ V^T\ , using both v and V. That’s good, because it wouldn’t work in general.

Harman and Bartholomew are factoring the data as

X = \left(\sqrt{N-1}\ u\right)\left(\Lambda\ V^T\right)

which is similar to

X = \left(u\right)\left(w\ v^T\right)\ .

while Jolliffe and Malinowski are factoring the data as

X = \left(u\ w\right)\left(v^T\right)\ .

Note that I wrote singular values w not \Lambda\ , hence “similar to” for Harman & Bartholomew et al.

The first are very nearly associating the singular values w with v, the second are associating the singular values with u. That strikes me as a fundamental difference.

What is common is that they all are factoring the data matrix as scores times loadings.

Unfortunately, Davis is only sometimes in that camp, and sometimes not. From his definitions on p. 504 – the ones I showed – his scores times his loadings do not reproduce the data. He defines loadings

A^Q = u\ w

A^R = v\ w^T

and then defines scores as projections of the data X onto the loadings:

S^R = X\ A^R = u\ w\ w^T = A^Q\ w^T

S^Q = X^T\ A^Q = v\ w^T\ w = A^R\ w\ .

We know that by projecting data onto non-orthonormal bases A^R and A^Q\ , Davis is computing components of the data wrt the reciprocal bases instead of wrt the bases.

I don’t know why he uses scores and loadings that do not factor the data matrix. About the closest I can get to factoring the data is (making w square, and u, v conformable):

A^Q\ \left(A^R\right)^T = \left(u\ w\right)\left(w\ v^T\right) = u\ w^2\ v^T\ ,

which just isn’t the same as

u\ w\ v^T\ .

(So close and yet so far.)
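The same point can be made numerically. This is only a sketch with NumPy on made-up data: it takes the definitions of A^Q, A^R, S^R and S^Q exactly as quoted above, with w made square, and the variable names are mine.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(4)
N, p = 30, 3
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

u, w, vT = np.linalg.svd(X, full_matrices=False)
v = vT.T
W = np.diag(w)                 # w made square

AQ = u @ W                     # A^Q = u w
AR = v @ W.T                   # A^R = v w^T
SR = X @ AR                    # S^R = X A^R
SQ = X.T @ AQ                  # S^Q = X^T A^Q

sr_ok = np.allclose(SR, AQ @ W.T)                   # S^R = A^Q w^T
sq_ok = np.allclose(SQ, AR @ W)                     # S^Q = A^R w
product = AQ @ AR.T
is_uw2v = np.allclose(product, u @ W @ W @ vT)      # u w^2 v^T
reproduces_X = np.allclose(product, X)              # compares against u w v^T
print(sr_ok, sq_ok, is_uw2v, reproduces_X)
```

The last flag is False: the product carries the singular values squared, so it cannot reproduce X unless every singular value happens to equal 1.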

Incidentally, if you are looking in Davis, remember that his uppercase U and V are my lowercase v and u respectively, not u and v; and his \Lambda is my w (made square).

Those definitions notwithstanding, Davis defines different scores on pp. 536-537, and this new definition is a factoring of the data matrix; more to the point, by this new definition of the scores, they can be computed as \sqrt{N-1}\ u\ .

That is, when he discusses “factor analysis” around p. 536, he agrees with Harman & Bartholomew et al.

Moving on.

Tell me again why we get that pesky \sqrt{N-1}\ ? Answer: because we want the eigenvalues of the correlation matrix, not the singular values of the data matrix.

And why do we want them? Because the eigenvalues are numerically equal to the variances of the new data Y wrt the basis V, i.e. the variances of the columns of Y,

Y = X V.

Why do we care that the eigenvalues are equal to the variances of the new data? Answer: I suspect that people wanted to make inferences about the new components of the data without computing them, i.e. without actually computing the scores. We can tell what the variances would be just from the eigenvalues.

And yet, the new data does not depend on the eigenvalues or the singular values. Take the orthogonal eigenvector matrix V, premultiply it by X (“project the data onto the basis vectors”), and we’ve gotten data with the new variances, i.e. new data with the property that it has the same total variance as the original data X, but the variance has been redistributed maximally.
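Both claims about the variances (they equal the eigenvalues, and they sum to the total variance of X) are easy to check. Here is a minimal NumPy sketch on invented random data, with standardized X so that the total variance is just the number of variables.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(5)
N, p = 60, 4
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

lam, V = np.linalg.eigh(X.T @ X / (N - 1))   # eigendecomposition of the correlation matrix
lam, V = lam[::-1], V[:, ::-1]               # descending order

Y = X @ V                                    # new data: no eigenvalues or singular values used
var_Y = Y.var(axis=0, ddof=1)                # sample variances of the columns of Y

variances_are_eigenvalues = np.allclose(var_Y, lam)
total_preserved = np.isclose(var_Y.sum(), X.var(axis=0, ddof=1).sum())
print(variances_are_eigenvalues, total_preserved)
```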

What’s important is that the new data does not depend on the \sqrt{\text{eigenvalue}} matrix A. So why did they create A instead of using the orthogonal V? Answer: the F matrix.

What does depend on the \sqrt{\text{eigenvalue}} matrix A is the F matrix. That pesky factor of \sqrt{N-1}\ gives us that

\frac{F\ F^T}{N-1} = I\ ,

i.e. F^T is not only standardized but also uncorrelated. That’s in complete contrast to Y = X V, which has the redistributed variances. Maybe that was the purpose, and the creation of A is the means to that end. (To look at it another way: u is orthogonal, the cut-down u0 is orthonormal, and F^T= \sqrt{N-1}\ u0 is standardized, instead of orthonormal.)
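The contrast can be seen by computing the sample covariance matrices of the columns of F^T and of Y side by side. This is a sketch with NumPy on made-up data; here `u` is already the cut-down u0 because the SVD is computed with `full_matrices=False`.

```python
import numpy as np

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(6)
N, p = 60, 4
raw = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

u, w, vT = np.linalg.svd(X, full_matrices=False)   # u is the cut-down u0 (N x p)

FT = np.sqrt(N - 1) * u          # F^T = sqrt(N-1) u0
Y = X @ vT.T                     # Y = X V, with V = v

cov_F = np.cov(FT, rowvar=False)     # sample covariance of the columns of F^T
cov_Y = np.cov(Y, rowvar=False)      # sample covariance of the columns of Y

F_is_identity = np.allclose(cov_F, np.eye(p))               # unit variances, zero correlations
Y_is_diag_lam = np.allclose(cov_Y, np.diag(w**2 / (N - 1)))  # eigenvalues on the diagonal
print(F_is_identity, Y_is_diag_lam)
```

Both sets of new variables are uncorrelated; what differs is whether the variances are all 1 or are the eigenvalues.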

Nevertheless, we struggle to have it both ways: either the new data is XV, with redistributed variances, or the new data is F^T with unit variances. I would go so far as to say that Harman & Bartholomew et al. are assessing one thing (the variances of the new data Y), while computing another thing (standardized data).

People talk about both. We’ll apply all this to an example.
