PCA/FA Answers to some Basilevsky questions

Let us look at three of the questions I asked early in February, and answer two of them.

First, what do we know? What have we done?

We assume that we have data X, with the variables in columns, as usual. In fact, we assume that the data is at least centered, and possibly standardized.

We compute the covariance matrix

c = \frac{X^T X}{N-1}\ ,

then its eigendecomposition

c = v\ \Lambda^2\ v^T\ ,

where \Lambda^2 is the diagonal matrix of eigenvalues. We define the \sqrt{\text{eigenvalue}}-weighted matrix

A = v\ \Lambda\ .

Finally, we use A as a transition matrix to define new data Z:

X^T = A\ Z^T\ .

We discovered two things. One, the matrix A is the cross covariance between Z and X:

A = \frac{X^T Z}{N-1}\ .

I find this interesting, and I suspect that it would jump off the page at me out of either Harman or Jolliffe; that is, I suspect it is written there but it didn’t register.
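For anyone who wants to see this numerically without re-deriving it, here is a minimal numpy sketch of the pipeline above, with made-up centered data (the names X, c, v, Lam, A, Z mirror the text; nothing about the particular data matters). It checks that A reproduces X^T Z/(N-1).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 3))
X = X - X.mean(axis=0)              # center the columns, as assumed above

c = X.T @ X / (N - 1)               # covariance matrix

eigvals, v = np.linalg.eigh(c)      # eigendecomposition c = v Lam^2 v^T
order = np.argsort(eigvals)[::-1]   # conventionally, largest eigenvalue first
eigvals, v = eigvals[order], v[:, order]
Lam = np.diag(np.sqrt(eigvals))     # Lam carries the sqrt-eigenvalues

A = v @ Lam                         # the transition matrix A = v Lam
Z = X @ np.linalg.inv(A).T          # from X^T = A Z^T, i.e. Z = X A^{-T}

# A is the cross covariance between Z and X
print(np.allclose(A, X.T @ Z / (N - 1)))    # True
```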

Two, we discovered that we could find a matrix Ar which is the cross covariance between Zc and Xs. That is, the general relation

A = \frac{X^T Z}{N-1}

specializes to both centered data Xc and standardized data Xs, and gives us

Ac = \frac{Xc^T Zc}{N-1}

As = \frac{Xs^T Zs}{N-1}\ .

In addition, however, we found a matrix Ar such that

Ar = \frac{Xs^T Zc}{N-1}\ ,

using the new data Zc (computed from the centered data) and the old standardized data Xs. Oh, the definition of the matrix Ar was:

Ar = \sigma^{-1}Ac,

where \sigma^{-1} was a diagonal matrix whose entries were the inverse standard deviations of the centered data Xc.

That is, each row of Ac was weighted by the inverse standard deviation to define Ar. And it all worked out because the definition of Ar

Ar = \sigma^{-1}Ac

was dual to the definition of standardized data, which weighted the columns of Xc:

Xs = Xc\  \sigma^{-1}\ .
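Here is a quick numpy check of that duality, again with made-up data: weighting the rows of Ac by the inverse standard deviations (Ar = \sigma^{-1} Ac) gives exactly the cross covariance of the standardized Xs with the centered-data Zc.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
Xc = rng.normal(size=(N, 3)) * np.array([1.0, 5.0, 0.2])
Xc = Xc - Xc.mean(axis=0)                        # centered data Xc

c = Xc.T @ Xc / (N - 1)                          # covariance matrix
eigvals, v = np.linalg.eigh(c)
Ac = v @ np.diag(np.sqrt(eigvals))               # A from the centered data
Zc = Xc @ np.linalg.inv(Ac).T                    # new data from the centered X

sigma_inv = np.diag(1 / Xc.std(axis=0, ddof=1))  # inverse standard deviations
Xs = Xc @ sigma_inv                              # standardized data: columns of Xc weighted
Ar = sigma_inv @ Ac                              # rows of Ac weighted

print(np.allclose(Ar, Xs.T @ Zc / (N - 1)))      # True
```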

I will remind you that I find this interesting, but I’m not convinced it is important. I cannot forgive Ar for the fact that it is not a matrix of eigenvectors. I assure you, I could be missing something about Ar.

Anyway, that is what we did.

What were the questions?

  1. when is A a covariance matrix?
  2. is v a covariance matrix?
  3. can we do something similar for constant row sums?

For question 3, I have the feeling there’s something we can do, but I haven’t put the pieces together. It can’t be the same as for question 1, because a matrix that weights the rows of the data matrix is not (in general) the same size as one that weights the columns. Perhaps we need to look at X X^T instead of X^T X\ . But then things don’t drop out if A and Z are defined the same way, so…. This is still open.

For question 1, two partial answers are:
If A = v\ \Lambda\ , i.e. A is formed by weighting the columns of an orthonormal eigenvector matrix v by the \sqrt{\text{eigenvalues}}\ , then A is a dispersion matrix of the same form as the one from which we got v: specifically, the cross dispersion between X and the new data Z.

And the construction that gave us Ar (weighting the rows of A) works for arbitrary weights.

I’ll explain.

Suppose we have a data matrix X. Suppose we compute a “dispersion matrix”, by which I mean a multiple of X^T X\ :

c = \alpha\ X^T X\ .

(Of course, \alpha is usually 1/(N-1).)

Suppose we do an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ;

then define a matrix

A = v\ \Lambda\ ,

and finally define new data using A as the transition matrix:

X^T = A\ Z^T\ .

(I’ll remind you that our observations are columns of X^T and Z^T\ ; and a transition matrix applied to new components gives us old components. If you need to, look here.)

I claim that

A = \alpha\ X^T Z\ ,

i.e., that A is the “cross dispersion matrix” between X and Z.

Please note, in particular, that the data X need not be centered.

The proof goes just the way it did before.

We will need a couple of things. From

X^T = A\ Z^T\ ,

we get

Z = X A^{-T}\ .

And also from

A^T = \Lambda\ v^T

we get

A^{-T} = v^{-T}\Lambda^{-1} = v \Lambda^{-1} (because v is orthogonal and \Lambda is diagonal).

Then

\alpha\ X^T Z

= \alpha\ X^T (X\ A^{-T})

= \alpha X^T X\ A^{-T}

= c\ A^{-T}

= (v\ \Lambda^2v^T) (v \Lambda^{-1})

= v\ \Lambda^2 \Lambda^{-1}

= v\ \Lambda

= A\ .

QED.
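Before the worked example, here is the same check as a numpy sketch with random, deliberately uncentered data and an arbitrary \alpha (2.5 here; both the data and the value of \alpha are made up, since the whole point is that they don’t matter).

```python
import numpy as np

rng = np.random.default_rng(2)
N, alpha = 40, 2.5
X = rng.normal(loc=3.0, size=(N, 3))   # deliberately NOT centered

c = alpha * X.T @ X                    # a "dispersion matrix"
eigvals, v = np.linalg.eigh(c)         # c = v Lam^2 v^T
A = v @ np.diag(np.sqrt(eigvals))      # A = v Lam
Z = X @ np.linalg.inv(A).T             # Z = X A^{-T}

print(np.allclose(A, alpha * X.T @ Z)) # True: A is the cross dispersion matrix
```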

(If we had started with \alpha\ Z^T X instead of \alpha\ X^T Z\ , we would have ended up with A^T\ , which is a good thing, since \alpha\ Z^T X = (\alpha\ X^T Z)^T\ . Things are just a little different because the cross-dispersion in general, and the cross-covariance in particular, are not symmetric.)

This result, that A is the cross dispersion matrix (with the same \alpha\ ) between X and Z, means that we can say something about Z without ever computing it.

(I’m not particularly fond of that, but then I haven’t been trying to deal with thousands of observations and a hundred variables. I’m used to datasets small enough to hold in my hand, as it were.)

Example

We could verify this, in particular, for the raw data that we used for the previous Basilevsky computations. We knew it was true for centered data or for standardized data, i.e. that it was true for the covariance matrix or the correlation matrix; we have seen now that it is true for an arbitrary dispersion matrix

\alpha\ X^T X\ .

It didn’t depend on 1/(N-1) or on X being centered. If we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha\  X^T Z\ .

Well, let’s do it. Here’s the raw data that we used here:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will take \alpha = \pi\ , just so that it can’t be mistaken for anything else. (I need some number; let’s make sure that if anything interesting happens, we see it. OK, nothing interesting will happen, not that kind of interesting, anyway.)

We compute

c = \pi X^T X = \left(\begin{array}{lll} 174.934 & 1578.09 & 720.323 \\ 1578.09 & 25917.9 & 11760.3 \\ 720.323 & 11760.3 & 5715.79\end{array}\right)

Now that I’ve written it once, let me emphasize: the weird expression \pi X^T X occurs only because I needed a number for my abstract scaling factor \alpha\ . In retrospect, perhaps I should not have used Greek letters for both the abstract factor and the numerical one.

We get an eigendecomposition, but all I want right now is \Lambda\ , the diagonal matrix of \sqrt{\text{eigenvalues}}\ . OK, the eigenvalues are

\{31415.9,\ 314.159,\ 78.5398\}

and \Lambda is

\Lambda = \left(\begin{array}{lll} 177.245 & 0. & 0. \\ 0. & 17.7245 & 0. \\ 0. & 0. & 8.86227\end{array}\right)

On second thought, I really should give you the orthonormal eigenvector matrix for reference:

v = \left(\begin{array}{lll} 0.0554414 & -0.0173938 & 0.99831 \\ 0.907331 & -0.416446 & -0.0576447 \\ 0.416745 & 0.908994 & -0.00730637\end{array}\right)

The matrix A is

A = v \Lambda = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

The new data Z is given by

Z = X\ A^{-T} = \left(\begin{array}{lll} -0.0206618 & -0.359791 & 0.24738 \\ 0.0665383 & -0.299765 & -0.282435 \\ 0.169678 & -0.247659 & 0.213996 \\ 0.286832 & -0.0841641 & -0.317697 \\ 0.44988 & 0.17488 & 0.174979\end{array}\right)

Finally, the cross-dispersion between Z and X is given by

\alpha\ X^T Z = \pi\ X^T Z =\left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

And that is the same as A, as it should be.
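For anyone following along in code rather than by hand, here is a numpy transcription of this example, using the X and \alpha = \pi above. (The signs and the order of the columns of v can differ from one eigensolver to another, but since A and Z are built from the same v, the final check is unaffected.)

```python
import numpy as np

X = np.array([[ 2.09653,   -0.793484, -7.33899 ],
              [-1.75252,   13.0576,    0.103549],
              [ 3.63702,   29.0064,    8.52945 ],
              [ 0.0338101, 46.912,    19.8517  ],
              [ 5.91502,   70.9696,   36.0372  ]])
alpha = np.pi

c = alpha * X.T @ X
eigvals, v = np.linalg.eigh(c)          # eigenvalues come out in increasing order
order = np.argsort(eigvals)[::-1]       # put the largest first, as in the text
eigvals, v = eigvals[order], v[:, order]

A = v @ np.diag(np.sqrt(eigvals))
Z = X @ np.linalg.inv(A).T

print(eigvals)                          # approximately 31415.9, 314.159, 78.5398
print(np.allclose(A, alpha * X.T @ Z))  # True: the cross dispersion recovers A
```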

We continue with the answer to question 1. We found that if we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha X^T Z\ .

What about the matrix Ar, found by weighting the rows of A by the inverse standard deviations?

There was nothing special about the inverse standard deviations. If we weight the columns of X by a diagonal matrix \Delta\ , we can find a corresponding matrix Ar.

Let’s work that out. I am going to take the weights \{1, 2, 4\} in place of the inverse standard deviations. That is, I take the diagonal matrix

\Delta = \left(\begin{array}{lll} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4\end{array}\right)

and scale the columns of X and the rows of A:

X_d = X\ \Delta = \left(\begin{array}{lll} 2.09653 & -1.58697 & -29.356 \\ -1.75252 & 26.1153 & 0.414196 \\ 3.63702 & 58.0128 & 34.1178 \\ 0.0338101 & 93.8239 & 79.407 \\ 5.91502 & 141.939 & 144.149\end{array}\right)

A_d = \Delta\ A = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

(Perhaps now is a good time to remind you that we can scale the columns of a matrix by post-multiplying it by a diagonal matrix of the scale factors; and we can scale the rows of a matrix by pre-multiplying it by a diagonal matrix.)

I expect that

A_d = \alpha\ X_d^T Z = \pi\ X_d^T Z\ .

Compute the RHS:

\pi\ X_d^T Z\ = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

and that is, indeed, A_d\ .
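In numpy terms (re-using the X and \alpha = \pi of the example, and the weights 1, 2, 4): post-multiplying by \Delta scales the columns of X, pre-multiplying scales the rows of A, and the cross dispersion of the rescaled X with the original Z is indeed A_d.

```python
import numpy as np

X = np.array([[ 2.09653,   -0.793484, -7.33899 ],
              [-1.75252,   13.0576,    0.103549],
              [ 3.63702,   29.0064,    8.52945 ],
              [ 0.0338101, 46.912,    19.8517  ],
              [ 5.91502,   70.9696,   36.0372  ]])
alpha = np.pi

c = alpha * X.T @ X
eigvals, v = np.linalg.eigh(c)
A = v @ np.diag(np.sqrt(eigvals))
Z = X @ np.linalg.inv(A).T

Delta = np.diag([1.0, 2.0, 4.0])
Xd = X @ Delta                           # post-multiply: columns of X scaled
Ad = Delta @ A                           # pre-multiply: rows of A scaled

print(np.allclose(Ad, alpha * Xd.T @ Z)) # True
```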

So, the construction of Ar using the inverse standard deviations could be done with any set of weights instead, and applied to any dispersion matrix, not just the A = Ac from centered data.

In other words, I know two kinds of transition matrices that are dispersion matrices: A, between Z and X; and Ar (Ad in this case), between that same Z and a rescaled version of X.

I do not have an “iff” result. I can’t guarantee that these are the only kinds of matrices that work.

Question 2. Is v a dispersion matrix?

No, at least not between its new data Z and the old X, i.e. not the way A is. (I wonder if there’s a way to make it happen. Would it ever be worthwhile if it were possible?)

This time, let’s first confirm the failure numerically.

We need new data from v instead of from A. Rather than destroy the Z’s, call the new data Y. We should have

Y = X\ v\ ,

i.e.

Y^T = v^T X^T\ ,

i.e.

X^T = v^{-T}\ Y^T = v\ Y^T\ .

That last equality corresponds to X^T= A\ Z^T\ , so we’re good.

Here’s the new data Y from old data X using the transition matrix v:

Y = X\ v =\left(\begin{array}{lll} -3.66221 & -6.37712 & 2.19235 \\ 11.7936 & -5.31319 & -2.50302 \\ 30.0746 & -4.38964 & 1.89649 \\ 50.8397 & -1.49177 & -2.81552 \\ 79.7392 & 3.09967 & 1.55071\end{array}\right)

Now compute \alpha\ X^T Y\ , i.e. in this case,

\pi\ X^T Y\ = \left(\begin{array}{lll} 1741.74 & -5.46444 & 78.4071 \\ 28504.6 & -130.83 & -4.5274 \\ 13092.4 & 285.569 & -0.573841\end{array}\right)

and that is to be compared with v. But I’ve already done the algebra, and I know they should have proportional columns, so I took the ratio instead of the difference; here is (\pi\ X^T Y) / v\ , term by term:

\left(\begin{array}{lll} 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398\end{array}\right)

Those should look familiar: they are, as they should be, the eigenvalues of \pi\ X^T X\ :

\{31415.9,\ 314.159,\ 78.5398\}

So the cross-dispersion matrix between X and Y is a column-weighted version of v, and the weights are the eigenvalues.
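The corresponding numpy check, with the same X and \alpha = \pi\ : the cross dispersion \pi\ X^T Y comes out as v with its columns weighted by the eigenvalues, i.e. v\ \Lambda^2\ .

```python
import numpy as np

X = np.array([[ 2.09653,   -0.793484, -7.33899 ],
              [-1.75252,   13.0576,    0.103549],
              [ 3.63702,   29.0064,    8.52945 ],
              [ 0.0338101, 46.912,    19.8517  ],
              [ 5.91502,   70.9696,   36.0372  ]])
alpha = np.pi

c = alpha * X.T @ X
eigvals, v = np.linalg.eigh(c)           # c = v Lam^2 v^T
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first, as in the text
eigvals, v = eigvals[order], v[:, order]

Y = X @ v                                # new data, using v itself as the transition matrix

cross = alpha * X.T @ Y
print(np.allclose(cross, v * eigvals))   # True: columns of v weighted by the eigenvalues
print(cross / v)                         # each column is constant: the eigenvalues
```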

Time to do the algebra. This time, Y = X\ v\ . We start with the cross-dispersion matrix…

\alpha\ X^T Y

= \alpha\ X^T\ (X\ v)

= \alpha\ X^T X\ v

= c\ v

= ( v\ \Lambda^2\ v^T)\ v

= v\ \Lambda^2

and that’s what we got: the cross-dispersion \alpha\ X^T Y is the column-weighted v, and the weights are the eigenvalues \Lambda^2\ .
