PCA / FA: Basilevsky on standardizing. Discussion.

Introduction and review

Basilevsky presents an extremely interesting idea. For all I know, it’s become common in the last 10-20 years, but I hadn’t seen it in any of the other books we’ve looked at.

I’ll tell you up front that he’s going to normalize the rows of an A matrix, specifically the A matrix computed from the eigendecomposition of the covariance matrix.

I’ll also tell you up front that I don’t see any good reason for doing it, but I’m not averse to finding such a reason someday.

Suppose we have centered data X with variables in columns, and N rows (observations). The covariance matrix is c = \frac{X^T\ X}{N-1}\ , and we find its eigendecomposition. We get a matrix V having the eigenvectors as columns, and we construct a diagonal matrix \Lambda whose entries are the square roots of the eigenvalues.

That is, we have the decomposition

\Lambda^2 = V^{-1}\ c\ V\ .

Then we define the matrix A = V\ \Lambda\ : each column is an eigenvector, of length \sqrt{\text{eigenvalue}}\ .
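(Here is a minimal numpy sketch of that construction on made-up data, just to pin the notation down; the names Xc, V, Lam, and A are mine, not Basilevsky's.)

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
Xc = rng.normal(size=(N, 3)) * np.array([3.0, 1.0, 0.4])   # made-up data, variables in columns
Xc -= Xc.mean(axis=0)                                      # center it

c = Xc.T @ Xc / (N - 1)                                    # sample covariance matrix

eigvals, V = np.linalg.eigh(c)                             # V: eigenvectors as columns (orthogonal)
order = np.argsort(eigvals)[::-1]                          # largest eigenvalue first
eigvals, V = eigvals[order], V[:, order]

Lam = np.diag(np.sqrt(eigvals))                            # Lambda: square roots of the eigenvalues
A = V @ Lam                                                # each column: eigenvector of length sqrt(eigenvalue)

# the eigendecomposition, written as Lambda^2 = V^{-1} c V
assert np.allclose(Lam @ Lam, V.T @ c @ V)
```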

For the next few posts, I will use a notation of Basilevsky (and Jolliffe): the principal components will be denoted by Z. (Yes, Harman and I almost invariably use Z = X^T\ , but I need to abandon that notation for a while. Sorry.)

This coincides with Jolliffe’s notation, which was to write

Z = X\ V\ ,

with Z as the principal components, X the data, and V the orthogonal eigenvector matrix.

As we have observed many times, the starting point for making sense of that is the change-of-basis equation for a vector:

x^T = V\ z^T\ ,

where x^T and z^T are column vectors, x contains the old components, z contains the new components, and V is the transition matrix. (Its columns are the old components of the new basis vectors. Using the symbol V is not an accident: the matrix V from the eigendecomposition is a transition matrix.)

Applied to the entire matrix X, the change-of-basis equation is

X^T = V\ Z^T

Transpose:

X = Z\ V^T

Solve for Z:

X\ V^{-T} = Z\ .

But V is orthogonal, V^T = V^{-1}\ , so V^{-T} = V\ :

X\ V = Z\ .
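(A quick numerical check of that change of basis, again my own sketch on random data: Z = X V and X = Z V^T are consistent.)

```python
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.normal(size=(100, 3))
Xc -= Xc.mean(axis=0)                          # centered data

c = Xc.T @ Xc / (Xc.shape[0] - 1)
_, V = np.linalg.eigh(c)                       # orthogonal eigenvector matrix

Z = Xc @ V                                     # the principal components, Z = X V

# going back through the transition matrix recovers the data: X = Z V^T
assert np.allclose(Xc, Z @ V.T)
```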

Among the very first things we learned in PCA (in Harman) were two. First, the Z's we just defined have maximally redistributed the variance of the X's: the 1st Z variable has the largest variance of any normalized linear combination of the X's; the 2nd Z variable has the largest variance of any normalized linear combination uncorrelated with the 1st; and so on. (And the total variance of the Z's is equal to the total variance of the X's.)

Second, we learned that if we used the transition matrix A = V\ \Lambda\ , then the Z’s defined by it would be both standardized and uncorrelated.
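(Both facts are easy to confirm numerically. A sketch, with my own variable names, on random data: the Z's from V have the eigenvalues as variances, and the Z's defined through A = V Λ have unit variances and zero correlations.)

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
Xc = rng.normal(size=(N, 3)) * np.array([3.0, 1.0, 0.2])
Xc -= Xc.mean(axis=0)

c = Xc.T @ Xc / (N - 1)
eigvals, V = np.linalg.eigh(c)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Lam = np.diag(np.sqrt(eigvals))

# first: Z = X V has the eigenvalues as variances (largest first), and is uncorrelated
Z = Xc @ V
assert np.allclose(Z.T @ Z / (N - 1), np.diag(eigvals))

# second: the Z defined through A = V Lambda, i.e. X^T = A Z^T, so Z = X (A^T)^{-1},
# is standardized (unit variances) as well as uncorrelated
A = V @ Lam
Z_std = Xc @ np.linalg.inv(A.T)
assert np.allclose(Z_std.T @ Z_std / (N - 1), np.eye(3))
```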

In our present notation (our X^T is Harman’s Z, and our Z^T is Harman’s F), we would write Harman’s model as

X^T = A\ Z^T\ ,

which, column by column, is precisely the change-of-basis equation

x^T = A\ z^T\ ,

with transition matrix A instead of V.

Or we could write

X = Z\ A^T.

(I will try not to mix these up.)

Let me emphasize that the two Z’s in

Z = X\ V

and

X^T = A\ Z^T

are different. In what follows, I will distinguish them by subscripts (Zu for the first, Zc for the second).

(It is true that the second set of Z’s could have been found from the first set: just standardize the first set! I had never thought to ask that or check it before, but it would be rather annoying if it were not true. See below.)

We will, in fact, be dealing with both centered data Xc and the corresponding standardized data Xs; we will have corresponding orthogonal eigenvector matrices Vc and Vs; corresponding A matrices Ac and As, and a third one Ar; we will have 3 Z matrices, although a fourth exists in principle.

Zu will be the unstandardized Z from Vc: Zu = Xc\ Vc

Zc will be the standardized Z from Ac: Xc^T = Ac\ Zc^T

Zs will be the standardized Z from As: Xs^T = As\ Zs^T\ .

(I have tried to minimize the occurrence of Z's as opposed to Zs, and X's versus Xs, but I can't reduce the number to zero. Zs and Xs are specific matrices, as opposed to collections of Z's and X's, which have apostrophes.)

A new basis

We will show that

Xs^T = Ar\ Zc^T\ ,

where Ar can be computed from Ac.

That last is a mix of apples and oranges: standardized Xs, but standardized Zc from the covariance matrix, instead of either standardized Xs and standardized Zs from the correlation matrix, or centered Xc and standardized Zc from the covariance matrix.

As I intimated at the beginning, the Ar is Ac with its rows normalized to length 1.

Let’s work that out. The key is that each row of Ac has the corresponding variable’s sample standard deviation \sigma as its length.

Let’s see that. To compute the mutual dot products of rows of Ac, we could compute

Ac\ Ac^T\ .

The diagonals of the result would be the squared lengths of rows of Ac (the dot products of rows with themselves).

OK. Let me drop the subscript c for the moment:

AA^T = \left(V\ \Lambda\right)\ \left(V\ \Lambda \right)^T = V\  \Lambda\ \Lambda^T\ V^T = V\ \Lambda^2\ V^{-1} = c\ ,

where c is the covariance matrix.

(the eigendecomposition was \Lambda^2 = V^{-1}\ c\ V\ .)

So the diagonal elements of Ac\ Ac^T are not only the squared lengths of the rows, but also the sample variances of Xc.

We want to scale each row of Ac by its length. To scale columns of V, we post-multiplied by the diagonal matrix \Lambda\ . To scale rows of Ac, we will premultiply by a diagonal matrix of inverse standard deviations.

So let \sigma = diagonal matrix of square roots of variances, and let \sigma^{-1} be its inverse. (Ignore the possibility of zero variances; after all, we do that every time we talk about the correlation matrix.)

We want to scale the rows of Ac to unit length to get Ar, so we define

Ar = \sigma^{-1}Ac\ .
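(Numerically, on random data: the rows of Ac have the standard deviations as their lengths, and Ar = \sigma^{-1} Ac has unit-length rows. The names, again, are mine.)

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
Xc = rng.normal(size=(N, 4)) * np.array([5.0, 2.0, 1.0, 0.3])
Xc -= Xc.mean(axis=0)

c = Xc.T @ Xc / (N - 1)
eigvals, V = np.linalg.eigh(c)
Ac = V @ np.diag(np.sqrt(eigvals))                  # A from the covariance matrix

# Ac Ac^T reproduces the covariance matrix, so its diagonal holds the variances
assert np.allclose(Ac @ Ac.T, c)

# each row of Ac has the corresponding standard deviation as its length
assert np.allclose(np.sqrt((Ac**2).sum(axis=1)), np.sqrt(np.diag(c)))

# premultiplying by sigma^{-1} scales the rows to unit length
sigma_inv = np.diag(1.0 / np.sqrt(np.diag(c)))
Ar = sigma_inv @ Ac
assert np.allclose(np.sqrt((Ar**2).sum(axis=1)), 1.0)
```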

We have the relationship between Xc and Zc:

Xc = Zc\ Ac^T\ .

We standardize Xc by scaling columns by standard deviations, so

Xs = Xc\ \sigma^{-1} = Zc\ Ac^T\ \sigma^{-1} = Zc\ Ar^T\ .

Finally, transposing:

Xs^T = Ar\ Zc^T\ ,

QED.

Let me lay out all three equations again (I really like to see them all together):

Xc^T = Ac\ Zc^T

Xs^T = Ar\ Zc^T

Xs^T = As\ Zs^T\ .
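(Here is a sketch that builds all of the pieces from random data and checks the three equations numerically; the helper function and the names are mine, not Basilevsky's.)

```python
import numpy as np

rng = np.random.default_rng(4)
N = 400
Xc = rng.normal(size=(N, 3)) * np.array([4.0, 1.5, 0.5])
Xc -= Xc.mean(axis=0)

def A_and_Z(X):
    """A = V Lambda and the standardized Z, from the covariance of the given centered data."""
    n = X.shape[0]
    cov = X.T @ X / (n - 1)
    eigvals, V = np.linalg.eigh(cov)
    Lam = np.diag(np.sqrt(eigvals))
    A = V @ Lam
    Z = X @ V @ np.linalg.inv(Lam)             # standardized, uncorrelated components
    return A, Z

sigma = np.sqrt(np.diag(Xc.T @ Xc / (N - 1)))  # sample standard deviations
Xs = Xc / sigma                                # standardized data

Ac, Zc = A_and_Z(Xc)                           # from the covariance matrix
As, Zs = A_and_Z(Xs)                           # from the correlation matrix
Ar = np.diag(1.0 / sigma) @ Ac                 # Ac with its rows scaled to unit length

assert np.allclose(Xc.T, Ac @ Zc.T)            # Xc^T = Ac Zc^T
assert np.allclose(Xs.T, Ar @ Zc.T)            # Xs^T = Ar Zc^T
assert np.allclose(Xs.T, As @ Zs.T)            # Xs^T = As Zs^T
```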

That’s how I would present Basilevsky’s work. He never introduced Ac, and he computed Zc from the unstandardized Zu:

Zu = Xc\ Vc

Zc = Zu\ \Lambda^{-1}

Standardizing Zu

Yes, as I said, the standardized Zc can be computed by standardizing the Zu; and the variances of the Zu are given by the eigenvalues of the covariance matrix \left(\lambda = \text{diagonal of }\Lambda^2\right)\ . It’s easy enough to confirm.

From our usual definition of Zc,

Xc^T = Ac\ Zc^T\ , so Xc = Zc\ Ac^T\ ,

and the old definition of Zu

Zu = Xc\ Vc\ ,

we could write

Zu =  \left(Zc\ Ac^T\right)\ Vc = Zc\ \left(Vc\ \Lambda\right)^T\ Vc = Zc\  \Lambda\ Vc^T\ Vc = Zc\ \Lambda

i.e.

Zc = Zu\ \Lambda^{-1}\ ,

QED.
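(And numerically, on random data: the columns of Zu have the eigenvalues as their variances, so dividing each column by its standard deviation, i.e. postmultiplying by \Lambda^{-1}, gives Zc. Names mine.)

```python
import numpy as np

rng = np.random.default_rng(5)
N = 250
Xc = rng.normal(size=(N, 3)) * np.array([3.0, 1.0, 0.4])
Xc -= Xc.mean(axis=0)

c = Xc.T @ Xc / (N - 1)
eigvals, Vc = np.linalg.eigh(c)
Lam = np.diag(np.sqrt(eigvals))

Zu = Xc @ Vc                                   # unstandardized principal components
Zc = Xc @ Vc @ np.linalg.inv(Lam)              # standardized components (the Zc from Ac)

# the variances of the Zu are the eigenvalues of the covariance matrix ...
assert np.allclose(Zu.var(axis=0, ddof=1), eigvals)
# ... so standardizing Zu column by column is the same as postmultiplying by Lambda^{-1}
assert np.allclose(Zu / np.sqrt(eigvals), Zc)
```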

Summary and query

Overall, then, I learned two things. One, that standardized Zc and unstandardized Zu, defined by

Xc^T = Ac\ Zc^T

and

Zu = Xc\ Vc

were related in the sensible way; we could get Zc by standardizing Zu.

Two, and more importantly, I learned that we could write

Xs^T = Ar\ Zc^T\ ,

where

Ar = \sigma^{-1}\ Ac\ .

Fine.

Now, why would we want to?

I don’t know.

Basilevsky got there by looking at

Zu = Xc\ Vc

and saying he wanted to standardize both Z and X.

But we already know how to do that:

Xs^T = As\ Zs^T\ ,

with standardized Xs and standardized Zs.

Why doesn’t he do that?

He does.

And he knows and shows that As \ne Ar\ , and Zs \ne Zc\ , so by using Ar he has gotten standardized Zc which are different from standardized Zs.

I am perfectly happy with Zs derived from Xs. I am perfectly happy with Zc derived from Xc. I have no idea why we would want to use Ar to relate Zc to Xs. It’s true, but why do we care?

OK, Ac is not an orthonormal basis, but neither is Ar, so that’s not an improvement. OK, the columns of Ac do not have unit length, but they are mutually orthogonal and they are eigenvectors of the covariance matrix. In contrast, the columns of Ar are neither mutually orthogonal, nor are they eigenvectors (of the covariance matrix of Xc or of the correlation matrix, the covariance of Xs).
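(That loss of orthogonality is easy to see numerically. In this sketch, on correlated random data of my own making, the Gram matrix of the columns of Ac is diagonal and the Gram matrix of the columns of Ar is not.)

```python
import numpy as np

rng = np.random.default_rng(6)
N = 300
mix = np.array([[2.0, 0.7, 0.3],
                [0.0, 1.0, 0.5],
                [0.0, 0.0, 0.4]])              # mixing matrix, so the variables are correlated
Xc = rng.normal(size=(N, 3)) @ mix
Xc -= Xc.mean(axis=0)

c = Xc.T @ Xc / (N - 1)
eigvals, V = np.linalg.eigh(c)
Ac = V @ np.diag(np.sqrt(eigvals))
Ar = np.diag(1.0 / np.sqrt(np.diag(c))) @ Ac

def max_off_diagonal(M):
    """Largest off-diagonal entry of M^T M; zero means the columns of M are orthogonal."""
    G = M.T @ M
    return np.abs(G - np.diag(np.diag(G))).max()

print(max_off_diagonal(Ac))   # ~0: Ac^T Ac = Lambda^2, so the columns of Ac are orthogonal
print(max_off_diagonal(Ar))   # generically nonzero: the columns of Ar are not orthogonal
```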

I find myself wondering if there is some reason for liking this non-orthogonal basis Ar. Does its failure to be orthogonal tell us something? But it didn’t give us anything new. It says (once again) that

Xs^T = Ar\ Zc^T\ ,

but Zc is found from Xc and the eigenvectors of the covariance matrix:

Xc^T = Ac\ Zc^T\ ,

and Xs is found by standardizing Xc. We don’t need Ar to get the two things it relates.

I think Ar is cute, but I need some reason for going to a basis which isn’t at least orthogonal. Heck, I need a reason for going from the orthogonal Vc to a basis (Ac) whose vectors are not of unit length, but the reason is that it provides standardized Zc.

Finally, I need some reason for abandoning the eigenvectors. I’m in no hurry to substitute Ar for either Ac or As. The eigenvectors have significant properties; what does Ar have to offer that makes up for losing those properties?

If anyone knows….
