PCA/FA Answers to some Basilevsky questions

Let us look at three of the questions I asked early in February, and answer two of them.

First, what do we know? What have we done?

We assume that we have data X, with the variables in columns, as usual. In fact, we assume that the data is at least centered, and possibly standardized.

We compute the covariance matrix

c = \frac{X^T X}{N-1}\ ,

then its eigendecomposition

c = v\ \Lambda^2\ v^T\ ,

where \Lambda^2 is the diaginal matrix of eigenvalues. We define the \sqrt{\text{eigenvalue}}-weighted matrix

A = v\ \Lambda\ .

Finally, we use A as a transition matrix to define new data Z:

X^T = A\ Z^T\ .

We discovered two things. One, the matrix A is the cross covariance between Z and X:

A = \frac{X^T Z}{N-1}\ .

I find this interesting, and I suspect that it would jump off the page at me out of either Harman or Jolliffe; that is, I suspect it is written there but it didn’t register.

Two, we discovered that we could find a matrix Ar which is the cross covariance between Zc and Xs. That is, the general relation

A = \frac{X^T Z}{N-1}

specializes to both centered data Xc and standardized data Xs, and gives us

Ac = \frac{Xc^T Zc}{N-1}

As = \frac{Xs^T Zs}{N-1}\ .

In addition, however, we found a matrix Ar such that

Ar = \frac{Xs^T Zc}{N-1}\ ,

using the new centered data Zc and the old standardized data Xs. Oh, the definition of the matrix Ar was:

Ar = \sigma^{-1}Ac,

where \sigma^{-1} was a diagonal matrix whose entries were the inverse standard deviations of the centered data Xc.

That is, each row of Ac was weighted by the inverse standard deviation to define Ar. And it all worked out because the definition of Ar

Ar = \sigma^{-1}Ac

was dual to the definition of standardized data, which weighted the columns of Xc:

Xs = Xc\  \sigma^{-1}\ .

I will remind you that I find this interesting, but I’m not convinced it is important. I cannot forgive Ar for the fact that it is not a matrix of eigenvectors. I assure you, I could be missing something about Ar.

Anyway, that is what we did.

What were the questions?

  1. when is A a covariance matrix?
  2. is v a covariance matrix?
  3. can we do something similar for constant row sums?

For question 3, I have the feeling there’s something we can do, but I haven’t put the pieces together. It can’t be same as for question 1, because a matrix that weights the rows of the data matrix is not the same size as one that weights the columns (in general). Perhaps we need to look at XXT instead of XTX. But then things don’t drop out if A and Z are defined the same way, so…. This is still open.

For question 1, two partial answers are:
If A = v\ \Lambda\ , i.e. A is formed by weighting the columns of an orthonormal eigenvector matrix v by the \sqrt{\text{eigenvalues}}\ , then A is a covariance matrix of the same form as the one from which we got v.

And the construction that gave us Ar (weighting the rows of A) works for arbitrary weights.

I’ll explain.

Suppose we have a data matrix X. Suppose we compute a “dispersion matrix”, by which I mean a multiple of X^T X\ :

c = \alpha\ X^T X\ .

(Of course, \alpha is usually 1/(N-1).)

Suppose we do an eigendecomposition of c,

c = v\ \Lambda^2\ v^T\ ;

then define a matrix

A = v\ \Lambda\ ,

and finally define new data using A as the transition matrix:

X^T = A\ Z^T\ .

(I’ll remind you that our observations are columns of X^T and Z^T\ ; and a transition matrix applied to new components gives us old components. If you need to, look here.)

I claim that

A = \alpha\ X^T Z\ ,

i.e., that A is the “cross dispersion matrix” between X and Z.

Please note, in particular, that the data X need not be centered.

The proof goes just the way it did before.

We will need a couple of things. From

X^T = A\ Z^T\ ,

we get

Z = X A^{-T}\ .

And also from

A^T = \Lambda\ v^T

we get

A^{-T} = v^{-T}\Lambda^{-1} = v \Lambda^{-1} (because v is orthogonal and \Lambda is diagonal).

Then

\alpha\ X^T Z

= \alpha\ X^T (X\ A^{-T})

= \alpha X^T X\ A^{-T}

= c\ A^{-T}

= (v\ \Lambda^2v^T) (v \Lambda^{-1})

= v\ \Lambda^2 \Lambda^{-1}

= v\ \Lambda

= A\ .

QED.

(If we had started with \alpha\ Z^T X instead of \alpha\ X^T Z\ , we would have ended up with A^T\ , which is a good thing, since \alpha\ Z^T X = (\alpha\ X^T Z)^T\ . Things are just a little different because the cross-dispersion in general, and the cross-covariance in particular, are not symmetric.)

This result, that A is the corresponding dispersion matrix between X and Z means that we can say something about Z without ever computing it.

(I’m not particularly fond of that, but then I haven’t been trying to deal with thousands of observations and a hundred variables. I’m used to datasets small enough to hold in my hand, as it were.)

Example

We could verify this, in particular, for the raw data that we used for the previous Basilevsky computations. We knew it was true for centered data or for standardized data, i.e. that it was true for the covariance matrix or the correlation matrix; we have seen now that it is true for an arbitrary dispersion matrix

\alpha\ X^T X\ .

It didn’t depend on 1/(N-1) or on X being centered. If we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha\  X^T Z\ .

Well, let’s do it. Here’s the raw data that we used here:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will take \alpha = \pi\ , just so that it can’t be mistaken for anything else (I need some number; let’s make sure that if anything interesting happens, we see it. OK, nothing interesting will happen, not that kind of interesting, anyway.)

We compute

c = \pi X^T X = \left(\begin{array}{lll} 174.934 & 1578.09 & 720.323 \\ 1578.09 & 25917.9 & 11760.3 \\ 720.323 & 11760.3 & 5715.79\end{array}\right)

Now that I’ve written it once, let me emphasize: the weird expression \pi X^T X occurs only because I needed a number for my abstract scaling factor \alpha\ . In retrospect, perhaps I should not have used greek letters for both abstract and numerical.)

We get an eigendecomposition, but all I want right now is \Lambda\ , the diagonal matrix of \sqrt{\text{eigenvalues}}\ . OK, the eigenvalues are

\{31415.9,\ 314.159,\ 78.5398\}

and \Lambda is

\Lambda = \left(\begin{array}{lll} 177.245 & 0. & 0. \\ 0. & 17.7245 & 0. \\ 0. & 0. & 8.86227\end{array}\right)

On second thought, I really should give you the orthonormal eigenvector matrix for reference:

v = \left(\begin{array}{lll} 0.0554414 & -0.0173938 & 0.99831 \\ 0.907331 & -0.416446 & -0.0576447 \\ 0.416745 & 0.908994 & -0.00730637\end{array}\right)

The matrix A is

A = v \Lambda = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

The new data Z is given by

Z = X\ A^{-T} = \left(\begin{array}{lll} -0.0206618 & -0.359791 & 0.24738 \\ 0.0665383 & -0.299765 & -0.282435 \\ 0.169678 & -0.247659 & 0.213996 \\ 0.286832 & -0.0841641 & -0.317697 \\ 0.44988 & 0.17488 & 0.174979\end{array}\right)

Finally, the cross-dispersion between Z and X is given by

\alpha\ X^T Z = \pi\ X^T Z =\left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 160.82 & -7.38131 & -0.510863 \\ 73.8661 & 16.1115 & -0.064751\end{array}\right)

And that is the same as A, as it should be.

We continue with the answer to question 1. We found that if we do an eigendecomposition of \alpha\ X^T X\ , then A = \alpha X^T Z\ .

What about the matrix Ar, found by weighting the rows of A by the inverse standard deviations?

There was nothing special about the inverse standard deviations. If we weight the columns of X by a diagonal matrix \Delta\ , we can find a corresponding matrix Ar.

Let’s work that out. I am going to take weights (1,2,4} in place of the inverse standard deviations. That is, I take the diagonal matrix

\Delta = \left(\begin{array}{lll} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 4\end{array}\right)

and scale the columns of X and the rows of A:

X_d = X\ \Delta = \left(\begin{array}{lll} 2.09653 & -1.58697 & -29.356 \\ -1.75252 & 26.1153 & 0.414196 \\ 3.63702 & 58.0128 & 34.1178 \\ 0.0338101 & 93.8239 & 79.407 \\ 5.91502 & 141.939 & 144.149\end{array}\right)

A_d = \Delta\ A = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

(Perhaps now is a good time to remind you that we can scale the columns of a matrix by post-multiplying it by a diagonal matrix of the scale factors; and we can scale the rows of a matrix by pre-multiplying it by a diagonal matrix.)

i expect that

A_d = \alpha\ X_d^T Z = \pi\ X_d^T Z\ .

Compute the RHS:

 \pi\ X_d^T Z\ = \left(\begin{array}{lll} 9.82673 & -0.308298 & 8.8473 \\ 321.641 & -14.7626 & -1.02173 \\ 295.465 & 64.446 & -0.259004\end{array}\right)

and that is, indeed, A_d\ .

So, the construction of Ar using the inverse standard deviations could be done with any set of weights instead, and applied to any dispersion matrix, not just the A = Ac from centered data.

In other words, I know two kinds of transition matrices that are dispersion matrices: A between Z and X, Ar (Ad in this case) between a newer Z and the older X.

I do not have an “iff” result. I can’t guarantee that these are the only kinds of matrices that work.

Question 2. Is v a dispersion matrix?

No, at least not between its new data Z and the old X, i.e. not the way A is. (I wonder if there’s a way to make it happen. Would it ever be worthwhile if it were possible?)

This time, let’s first confirm the failure numerically.

We need new data from v instead of from A. Rather than destroy the Z’s, call the new data Y. We should have

Y = X\ v\ ,

i.e.

Y^T = v^T X^T\ ,

i.e.

v^{-T}Y^T = X^T = v\ Y^T\ .

That last equality corresponds to X^T= A\ Z^T\ , so we’re good.

Here’s the new data Y from old data X using the transition matrix v:

Y = X\ v =\left(\begin{array}{lll} -3.66221 & -6.37712 & 2.19235 \\ 11.7936 & -5.31319 & -2.50302 \\ 30.0746 & -4.38964 & 1.89649 \\ 50.8397 & -1.49177 & -2.81552 \\ 79.7392 & 3.09967 & 1.55071\end{array}\right)

now compute \alpha\ X^T Z\ , i.e. in this case,

\pi\ X^T Y\ = \left(\begin{array}{lll} 1741.74 & -5.46444 & 78.4071 \\ 28504.6 & -130.83 & -4.5274 \\ 13092.4 & 285.569 & -0.573841\end{array}\right)

and that is to be compared with v – but I’ve already done the algebra, and I know they should have proportional columns, so I took the ratio instead of the difference, i.e. here is (\pi\ X^T Y) / v\ , term by term:

\left(\begin{array}{lll} 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398 \\ 31415.9 & 314.159 & 78.5398\end{array}\right)

Those should look familiar: they are, as they should be, the eigenvalues of \pi\ X^T X\ :

\{31415.9,\ 314.159,\ 78.5398\}

So the cross-dispersion matrix between X and Y is a column-weighted version of v, and the weights are the eigenvalues.

Time to do the algebra. This time, Y = Xv. We start with the dispersion matrix…

\alpha\ X^T Y

= \alpha\ X^T\ (X\ v)

= \alpha\ X^T X\ v

= c\ v

= ( v\ \Lambda^2\ v^T)\ v

= v\ \Lambda^2

and that’s what we got: the cross-dispersion \alpha\ X^T Y is the column-weighted v, and the weights are the eigenvalues \Lambda^2\ .

PCA / FA Basilevsky: with data

Introduction

I am looking into Basilevsky because he did something I didn’t understand: he normalized the rows of the Ac matrix (which I denoted Ar). We discussed that, and we illustrated the computations, in the previous two posts. But we did those computations without having any data. I want to take a closer look, with data.

In contrast to As and Ac, which are eigenvector matrices, his Ar matrix is not. Nevertheless, as I said, his Ar is not without some redeeming value. In fact, all three of the A matrices have the same redeeming value.

I will show, first by direct computation and then by proof, that each of these A matrices is the cross covariance between X data and Z data.

(That doesn’t mean I want to use his Ar matrix; all it means is that I learned something new about an A matrix.)

Recall the raw data of example 9. We’re about to go get the A matrices and the new data Z for both standardized and centered data.

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

Yes, we’ve seen all these forms of the data before, but I want this post to be self-contained.

Now, I’m going to switch to Basilevsky’s notation, using Z for F^T. If you will, we’re mixing Harman’s model

Z = A\ F\ \text{or}\ X = F^T\ A^T

where Z = X^T, and F^T is the new data (“scores”) WRT the transition matrix (“loadings”) A; and Jolliffe’s model

Z = X\ V\ \text{or}\ X = Z\ V^{-1}

where Z is the new data (“scores”)….

We start with Harman’s model

Z = A\ F\ .

eliminate Z:

X^T = A\ F\ ,

and we know that F^T is the new data corresponding to X WRT the transition matrix A .

Now we reintroduce Z, redefining it as Jolliffe uses it, for the new data; that is, we let

Z = F^T

and get

X^T = A\ Z^T\ .

If I had to do it all over again, I would probably not have used Basilevsky’s notation. This is the notation I had for the previous two posts, but i’ll admit, it’s confusing.

But for this post, as for the previous two, that is our model.

First, let’s go get the standardized and the centered data (“X”), the corresponding A matrices, and the corresponding new data (“Z”).

Basilevsky Standardized

If you are comfortable with these calculations – which we’ve seen before for this very data – you should move right along. If, on the other hand, you need a refresher, here it is. And if you need to see the calculations explained, here

I standardize the data and call it Xs:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Get the eigenvalues of the correlation matrix:

\lambda s = \{2.43279,\ 0.565781,\ 0.00143288\}

Get the SVD (Singular Value Decomposition) of Xs:

Xs = us\ ws\ vs^T.

ws = \left(\begin{array}{lll} 3.11948 & 0. & 0. \\ 0. & 1.50437 & 0. \\ 0. & 0. & 0.0757068 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Form the diagonal matrix of \sqrt{\text{eigenvalues}}\

\Lambda s = \left(\begin{array}{lll} 1.55974 & 0. & 0. \\ 0. & 0.752184 & 0. \\ 0. & 0. & 0.0378534\end{array}\right)

Form the weighted eigenvector matrix A = v\ \Lambda

As = vs\ \Lambda s = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Get the scores FT as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). This is as close to the u matrix itself as I need to get. But instead of calling these F^T, we are calling them Z:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Basilevsky centered

Recall the raw data:

X = \left(\begin{array}{lll} 2.09653 & -0.793484 & -7.33899 \\ -1.75252 & 13.0576 & 0.103549 \\ 3.63702 & 29.0064 & 8.52945 \\ 0.0338101 & 46.912 & 19.8517 \\ 5.91502 & 70.9696 & 36.0372\end{array}\right)

I will center the data and call it Xc. Get the column means…

\{1.98597,31.8304,11.4366\}

and subtract them from each column to get Xc:

Xc = \left(\begin{array}{lll} 0.110558 & -32.6239 & -18.7756 \\ -3.73849 & -18.7728 & -11.333 \\ 1.65105 & -2.82404 & -2.90714 \\ -1.95216 & 15.0815 & 8.41516 \\ 3.92905 & 39.1392 & 24.6006\end{array}\right)

Proceeding as before, we get the SVD (Singular Value Decomposition) and display w…

Xc = uc\ wc\ vc^T

wc = \left(\begin{array}{lll} 66.0141 & 0. & 0. \\ 0. & 5.01572 & 0. \\ 0. & 0. & 1.54942 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the eigenvalues of the covariance matrix (and that’s the only change we make after switching from standardized to centered data):

\lambda c = \{1089.47,6.28937,0.600172\}

Get the diagonal matrix of \sqrt{\text{eigenvalues}} of the covariance matrix…

\Lambda c = \left(\begin{array}{lll} 33.0071 & 0. & 0. \\ 0. & 2.50786 & 0. \\ 0. & 0. & 0.774708\end{array}\right)

Get the weighted eigenvector matrix A = v\ \Lambda

Ac = vc\ \Lambda c = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Get the scores F^T as the first three columns of 2u (where 2 = Sqrt[N-1], N=5). except that we are calling them Z, as Basilevsky does.

Zc = \left(\begin{array}{lll} 1.13849 & 0.83693 & 0.73261 \\ 0.669241 & -1.03777 & 0.41852 \\ 0.116099 & 0.683164 & -1.59786 \\ -0.519249 & -1.14658 & -0.340197 \\ -1.40458 & 0.664253 & 0.786923\end{array}\right)

Check it by confirming that we have

X^T = A\ Z^T \text{or}\ X = Z\ A^T\ .

(We do.)

Key: the A matrices are cross-covariances

We have As and Ac:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

We have new data Zs and old Xs:

Zs = \left(\begin{array}{lll} 0.88458 & 1.06679 & 0.782819 \\ 0.913474 & -0.849064 & 0.380336 \\ -0.0628237 & 0.762691 & -1.56451 \\ -0.206697 & -1.22463 & -0.396946 \\ -1.52853 & 0.244206 & 0.798301\end{array}\right)

and old:

Xs = \left(\begin{array}{lll} 0.0368717 & -1.15632 & -1.09998 \\ -1.24681 & -0.665379 & -0.663951 \\ 0.550632 & -0.100095 & -0.170316 \\ -0.651057 & 0.534548 & 0.493006 \\ 1.31036 & 1.38724 & 1.44124\end{array}\right)

Now, what is the cross-covariance between Xs and Zs? I need either

\frac{X^T\ Z}{N-1}

or

\frac{Z^T\ X}{N-1}\ ,

and N-1 = 4. Note that they are the transposes of each other. We compute the former:

\frac{Xs^T\ Zs}{4} = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

Not symmetric. But it’s a cross-covariance, between two different variables, not the covariance of one variable.

Now recall As:

As = \left(\begin{array}{lll} -0.752315 & 0.658804 & -0.000578824 \\ -0.963827 & -0.265197 & -0.0266011 \\ -0.968424 & -0.247849 & 0.0269245\end{array}\right)

They are the same.

For centered data:

\frac{Xc^T\ Zc}{4} = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

and we recall

Ac = \left(\begin{array}{lll} -1.67235 & 2.48708 & -0.0914469 \\ -28.2097 & -0.261331 & -0.394039 \\ -17.0553 & 0.188375 & 0.660714\end{array}\right)

Again, they are the same.

Now, get Ar, by normalizing the rows of Ac:

Ar = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

and compute the cross-covariance between Xs and Zc:

\frac{Xs^T\ Zc}{4} = \left(\begin{array}{lll} -0.557739 & 0.829456 & -0.030498 \\ -0.99986 & -0.00926256 & -0.0139662 \\ -0.99919 & 0.011036 & 0.0387082\end{array}\right)

0nce again, they are the same. We have good reason, now, to believe that

\text{loadings}\ A = \frac{X^T\ Z}{N-1} =  covariance(X,Z)\ ,

where Z is the new data WRT A (i.e. the scores, as most people define them).

That’s what I set out to show.

We have seen it for As, relating standardized data Xs to the scores Zs; for Ac, relating centered data Xc to the scores Zc; and finally for Ar, relating standardized data Xs to the scores Zc.

Proving it

Let’s take a look at the proofs. The question in the back of my mind is: when is a transition matrix also a cross-covariance matrix? (I will confess that I do not have a universal answer; it appears that I can generalize from the Ar matrix somewhat.)

That is, given old data X and new data Z WRT transition matrix T,

X^T = T\ Z^T

X = Z\ T^T\ .

when can we conclude that T is the covariance between X and Z?

For either standardized data or centered data X, one argument works. We have

A = V \Lambda\ ,

where \Lambda is the diagonal matrix of \sqrt{\text{eigenvalues}} and V is an orthogonal eigenvector matrix of the covariance matrix c; that is, we have the eigendecomposition

\Lambda^2 = V^T\ c\ V = V^T \frac{X^T\ X}{N-1}\ V

\left(\text{or}\  \frac{1}{N-1}\ X^T\ X = V \Lambda^2\ V^T\right)\ .

We have the decomposition of X:

X = Z\ A^T\ \left(\text{or}\ X^T = A\ Z^T\right)\ .

For simplicity, i’m going to assume that A is invertible (i.e. that \Lambda is invertible; i.e. that all the eigenvalues are nonzero): then

Z = X\ A^{-T} \left(\text{or}\ Z^T = A^{-1}X^T\right)

and I note that

A^{-T} = V^{-T}\ \Lambda^{-T} = V\ \Lambda^{-T} = V\ \Lambda^{-1}\ .

Let me name those three equations: we have an eigendecomposition of the covariance matrix, the decomposition of X, and the definition of A.

Now we compute the cross-covariance. There may be simpler ways, and there are certainly other paths to the same answer, but this works for me. As I said, I’m going to assume that A is invertible, because for one thing we can cope if it is not; and for another, if it is not, we would have fewer Z variables than X variables.

\frac{1}{N-1}\ X^T\ Z

\frac{1}{N-1}\ X^T\ \left(X\ A^{-T}\right) (the decomposition of X)

= c\ A^{-T} (definition of the covariance matrix)

= \left(V\ \Lambda^2\ V^T\right)\ \left(V\ \Lambda^{-1}\right) (the eigendecomposition of c and definition of A)

= \left(V\ \Lambda^2\right)\ \Lambda^{-1} = V\  \Lambda (simplify)

= A (definition of A)

QED.

The eigendecomposition is crucial. Without it, we are left hanging at

c\ A^{-T}\ ,

Now let us try to prove it for Ar. It’s a surprisingly simple consequence of what we just did. We need only one new equation, the relationship between standardized and centered data:

Xs = Xc\ \sigma^{-1}\ ,

which says that we get standardized data from centered data by dividing each column by its standard deviation; and we recall that scaling columns can be accomplished by post-multiplication by a diagonal matrix. (After all, that’s how we get the A matrix from the V matrix.)

What we actually need is the transpose:

Xs^T = \sigma^{-1}\ Xc^T\ .

Let us compute the cross covariance of Xs and Zc.

\frac{ Xs^T\ Zc}{N-1}

= \frac{\sigma^{-1}\ Xc^TZc}{N-1} (Xs in terms of Xc)

= \sigma^{-1}\ Ac (by the previous result for Ac!)

= Ar (by definition of Ar)

QED.

The key, then, was that Ar was to Ac as Xs was to Xc (OK, not exactly: scale the rows of one and the columns of the other). I don’t know of any other suitable relationships, but if we found one to use, we should apply it to Xc and Ac.

Let me be more specific. We never really used the fact that \sigma was a diagonal matrix of standard deviations. If we had constructed completely different data Xd by scaling the columns of centered data Xc, using an arbitrary diagonal matrix \delta^{-1}\ , so that

Xd = Xc\ \delta^{-1}\ ,

and if we had defined a new A matrix by scaling the rows of Ac using the same diagonal matrix:

Ad = \delta^{-1}\ Ac\ ,

then we would find that Ad was the cross covariance of Xd and Zc.

I need to look at this some more.

PCA / FA Basilevsky: what he did.

Introduction

The previous post discussed something interesting that Basilevsky did. (Search the bibliography, he’s there.) I can’t say I like it — because it leads to a basis which is non-orthogonal, and whose vectors are not eigenvectors of either the correlation or covariance matrices.

But I had to understand it.

I don’t know how widespread it is nowadays, but even if Basilevsky is the only author who does it, it’s another example of the lack of standardization (no pun intend, I swear) in PCA / FA. This branch of applied statistics is like the mythical Wild West: everybody’s got a gun and there are bullets flying all over the place. Law and order have not arrived yet.

OTOH, it’s nice to find something different in just about every book I open.

Let me set the stage again. What we have is the following 3 models.

Xc^T = Ac\ Zc^T

Xs^T = Ar\ Zc^T

Xs^T = As\ Zs^T\ .

where X is data (variables in columns, observations in rows, and there are N rows), Z is principal components (PCs), and A is constructed from an eigenvector matrix V using the diagonal matrix \Lambda of \sqrt{\text{eigenvalues}}.

Now is a good time, before we need to be reminded of it, to emphasize that all three models are of the same form:

X^T = A\ Z^T\ .

In the first model,

Xc^T = Ac\ Zc^T\ ,

Xc is centered data, and

Ac = Vc\ \Lambda c\ ,

where Vc is an orthogonal eigenvector matrix of the covariance matrix c = \frac{Xc^T\ Xc}{N-1}\ :

\Lambda c^2 = Vc{^{-1}}\ c\ Vc = Vc^T\ c\ Vc\ .

In the third model,

Xs^T = As\ Zs^T\ ,

Xs is standardized data, and

As = Vs\ \Lambda s\ ,

where Vs is an orthogonal eigenvector matrix of the correlation matrix r = \frac{Xs^T Xs}{N-1}\ :

\Lambda s^2 = Vs{^{-1}}\ r\ Vs = Vs^T\ r Vs\ .

In the second model,

Xs^T = Ar\ Zc^T

we have the standardized data Xs as in the third model, but the PCs Zc from the first model. The transition matrix Ar may be found by normalizing the rows of Ac:

Ar = \sigma{^{-1}}\ Ac\ ,

where \sigma is a diagonal matrix of the standard deviations (\sqrt{\text{variances}}) of Xc.

The example

Basilevsky gives us a covariance matrix c. (So we have no data, for this example.)

c = \left(\begin{array}{llll} 471.51 & 324.71 & 73.24 & 4.35 \\ 324.71 & 224.84 & 50.72 & 2.81 \\ 73.24 & 50.72 & 11.99 & 1.23 \\ 4.35 & 2.81 & 1.23 & 0.98\end{array}\right)

We will want both the correlation matrix r and the diagonal matrix \sigma\ , so let’s get them. \sigma\ is to be a diagonal matrix of the square roots of the variances, hence of the square roots of the diagonal of c. (i should call it s, because it’s made up of sample standard deviations, but I don’t want “s” as well as “As”, “Vs”, etc.) Here it is:

\sigma = \left(\begin{array}{llll} 21.7143 & 0. & 0. & 0. \\ 0. & 14.9947 & 0. & 0. \\ 0. & 0. & 3.46266 & 0. \\ 0. & 0. & 0. & 0.989949\end{array}\right)

Once we have \sigma\ , the fastest way for me to get the correlation matrix is

r = \sigma^{-1}\ c\ \sigma^{-1}\ :

r = \left(\begin{array}{llll} 1. & 0.9973 & 0.9741 & 0.2024 \\ 0.9973 & 1. & 0.9769 & 0.1893 \\ 0.9741 & 0.9769 & 1. & 0.3588 \\ 0.2024 & 0.1893 & 0.3588 & 1.\end{array}\right)

From the covariance matrix:

So let’s do the eigendecompositions. First from the covariance matrix, which will give us the first model. We get eigenvalues…

\lambda c = \{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

an eigenvector matrix…

V = \left(\begin{array}{llll} -0.8164 & -0.0236 & -0.5705 & 0.0862 \\ -0.5633 & -0.1048 & 0.7665 & -0.2903 \\ -0.1272 & 0.5679 & 0.2742 & 0.7656 \\ -0.0075 & 0.816 & -0.1089 & -0.5676\end{array}\right)

These are not quite Basilevsky’s numbers. The 4th eigenvector is off, by as much as .0014 . ah, the 2nd is off too, in two terms. Not great, but not a big deal. (Actually, I want to check his numbers, but not today. I will be using my numbers for eigenvectors and eigenvalues.)

(I suspect that most packages which do PCA / FA use iterative schemes to find one eigenvector after another; I’m getting used to seeing more and more errors as we move to the right in the eigenvector matrices. I think they’re suffering from accumulated round-off errors. Of course, I generally take it for granted that my answers are the right ones, but see below.)

Now I’m going change the signs to agree with Basilevsky:

Vc = \left(\begin{array}{llll} 0.8164 & -0.0236 & 0.5705 & 0.0862 \\ 0.5633 & -0.1048 & -0.7665 & -0.2903 \\ 0.1272 & 0.5679 & -0.2742 & 0.7656 \\ 0.0075 & 0.816 & 0.1089 & -0.5676\end{array}\right)

We will discover that Basilevsky should have changed the sign of the second column, too, but he didn’t so I won’t either.

Let’s keep going. We get the diagonal matrix \Lambda c of \sqrt{\text{eigenvalues}}\

\Lambda c = \left(\begin{array}{llll} 26.5891 & 0. & 0. & 0. \\ 0. & 1.16153 & 0. & 0. \\ 0. & 0. & 0.945676 & 0. \\ 0. & 0. & 0. & 0.311697\end{array}\right)

and Ac = Vc\ \Lambda c

Ac = \left(\begin{array}{llll} 21.7075 & -0.0274391 & 0.539517 & 0.0268828 \\ 14.9764 & -0.121762 & -0.724841 & -0.0904817 \\ 3.38092 & 0.659679 & -0.259284 & 0.238627 \\ 0.199249 & 0.947838 & 0.102961 & -0.176925\end{array}\right)

Basilevsky then displays equations for the variables, using Vc not Ac:

\begin{array}{c} \ \text{Xc1}=0.8164 \ \text{Zu1}-0.0236 \ \text{Zu2}+0.5705 \ \text{Zu3}+0.0862 \ \text{Zu4} \\ \ \text{Xc2}=0.5633 \ \text{Zu1}-0.1048 \ \text{Zu2}-0.7665 \ \text{Zu3}-0.2903 \ \text{Zu4} \\ \ \text{Xc3}=0.1272 \ \text{Zu1}+0.5679 \ \text{Zu2}-0.2742 \ \text{Zu3}+0.7656 \ \text{Zu4} \\ \ \text{Xc4}=0.0075 \ \text{Zu1}+0.816 \ \text{Zu2}+0.1089 \ \text{Zu3}-0.5676 \ \text{Zu4}\end{array}

What he has written is

basil

i.e.

Xc^T = Vc\ Zu^T\ ,

where the Zu are unstandardized, in contrast to Zc and Zs which are standardized. (The Zu were computed from Vc, not from Ac.) I remark that Basilevsky did not compute Ac.

From the correlation matrix:

Now let’s do the correlation matrix, which will give us the third model. Here are the eigenvalues…

\lambda s = \{3.0569,\ 0.9283,\ 0.013,\ 0.0018\}

and eigenvectors…

Vs = \left(\begin{array}{llll} 0.5627 & 0.1712 & 0.5688 & 0.5749 \\ 0.5624 & 0.1848 & 0.1817 & -0.7852 \\ 0.5696 & -0.0008 & -0.7906 & 0.2248 \\ 0.2065 & -0.9677 & 0.136 & -0.0485\end{array}\right)

Let’s stop there for a moment.

I want to compare the two V’s: recall Vc…

Vc = \left(\begin{array}{llll} 0.8164 & -0.0236 & 0.5705 & 0.0862 \\ 0.5633 & -0.1048 & -0.7665 & -0.2903 \\ 0.1272 & 0.5679 & -0.2742 & 0.7656 \\ 0.0075 & 0.816 & 0.1089 & -0.5676\end{array}\right)

He says, “… even though \alpha_{42} is positive as a covariance loading, it becomes highly negative as a correlations loadings [sic].”

He correctly points out that the 2nd column of Vc and the second column of Vs would have the same signs if one were multiplied by negative 1, and correctly concludes that we can take care of the sign of \alpha_{42} by changing the sign of the second column. (At least, I hope that what he said. And \alpha is his notation for an element of what i’m calling V.)

Unfortunately, he missed the fact that the 3rd columns cannot be made to have the same signs. The 3rd column of Vc has 2 negative signs; the 3rd column of Vs has only 1.

That is, even though Vs_{23} is positive, Vc_{23} is negative, when all other signs are the same.

This serves as a reminder that Vc and Vs, eigenvectors of the covariance matrix and of the correlation matrix, do not have a simple relationship to each other.

i will also point out that both Ac and Ar will be derived from Vc by scaling by positive numbers, and As will be derived from Vs by scaling by positive numbers, so this sign discrepancy between the (2,3) elements of Vs and Vc will propagate to the (2,3) elements of Ac and Ar versus As.

Now, lets get back to doing our thing. Get \Lambda s\

\Lambda s = \left(\begin{array}{llll} 1.74839 & 0. & 0. & 0. \\ 0. & 0.963503 & 0. & 0. \\ 0. & 0. & 0.114029 & 0. \\ 0. & 0. & 0. & 0.0424615\end{array}\right)

and As…

As = \left(\begin{array}{llll} 0.9839 & 0.1649 & 0.0649 & 0.0244 \\ 0.9832 & 0.1781 & 0.0207 & -0.0333 \\ 0.9959 & -0.0008 & -0.0901 & 0.0095 \\ 0.361 & -0.9324 & 0.0155 & -0.0021\end{array}\right)

Yes and yes, Basilevsky and I agree. Even to all his signs.

We use As and get the equations

\begin{array}{c} \ \text{Xs1}=0.9839 \ \text{Zs1}+0.1649 \ \text{Zs2}+0.0649 \ \text{Zs3}+0.0244 \ \text{Zs4} \\ \ \text{Xs2}=0.9832 \ \text{Zs1}+0.1781 \ \text{Zs2}+0.0207 \ \text{Zs3}-0.0333 \ \text{Zs4} \\ \ \text{Xs3}=0.9959 \ \text{Zs1}-0.0008 \ \text{Zs2}-0.0901 \ \text{Zs3}+0.0095 \ \text{Zs4} \\ \ \text{Xs4}=0.361 \ \text{Zs1}-0.9324 \ \text{Zs2}+0.0155 \ \text{Zs3}-0.0021 \ \text{Zs4}\end{array}

These Xs are standardized, and so are the Zs, in contrast to the Xc and Zu of the first set of equations:

\begin{array}{c} \ \text{Xc1}=0.8164 \ \text{Zu1}-0.0236 \ \text{Zu2}+0.5705 \ \text{Zu3}+0.0862 \ \text{Zu4} \\ \ \text{Xc2}=0.5633 \ \text{Zu1}-0.1048 \ \text{Zu2}-0.7665 \ \text{Zu3}-0.2903 \ \text{Zu4} \\ \ \text{Xc3}=0.1272 \ \text{Zu1}+0.5679 \ \text{Zu2}-0.2742 \ \text{Zu3}+0.7656 \ \text{Zu4} \\ \ \text{Xc4}=0.0075 \ \text{Zu1}+0.816 \ \text{Zu2}+0.1089 \ \text{Zu3}-0.5676 \ \text{Zu4}\end{array}

Basilevsky’s model:

Basilevsky has shown us that if we define

Ar = \sigma^{-1}Ac\ ,

then we would have the second model,

Xs^T = Ar\ Zc^T\ .

here’s Ar…

Ar = \left(\begin{array}{llll} 0.9997 & -0.0013 & 0.0248 & 0.0012 \\ 0.9988 & -0.0081 & -0.0483 & -0.006 \\ 0.9764 & 0.1905 & -0.0749 & 0.0689 \\ 0.2013 & 0.9575 & 0.104 & -0.1787\end{array}\right)

And here’s the corresponding equations…

\begin{array}{c} \ \text{Xs1}=0.9997 \ \text{Zc1}-0.0013 \ \text{Zc2}+0.0248 \ \text{Zc3}+0.0012 \ \text{Zc4} \\ \ \text{Xs2}=0.9988 \ \text{Zc1}-0.0081 \ \text{Zc2}-0.0483 \ \text{Zc3}-0.006 \ \text{Zc4} \\ \ \text{Xs3}=0.9764 \ \text{Zc1}+0.1905 \ \text{Zc2}-0.0749 \ \text{Zc3}+0.0689 \ \text{Zc4} \\ \ \text{Xs4}=0.2013 \ \text{Zc1}+0.9575 \ \text{Zc2}+0.104 \ \text{Zc3}-0.1787 \ \text{Zc4}\end{array}

followed immediately by the second set computed earlier…

\begin{array}{c} \ \text{Xs1}=0.9839 \ \text{Zs1}+0.1649 \ \text{Zs2}+0.0649 \ \text{Zs3}+0.0244 \ \text{Zs4} \\ \ \text{Xs2}=0.9832 \ \text{Zs1}+0.1781 \ \text{Zs2}+0.0207 \ \text{Zs3}-0.0333 \ \text{Zs4} \\ \ \text{Xs3}=0.9959 \ \text{Zs1}-0.0008 \ \text{Zs2}-0.0901 \ \text{Zs3}+0.0095 \ \text{Zs4} \\ \ \text{Xs4}=0.361 \ \text{Zs1}-0.9324 \ \text{Zs2}+0.0155 \ \text{Zs3}-0.0021 \ \text{Zs4}\end{array}

There are some satisfying similarities. The first columns differ significantly only in the last entry, and even that is .2 vs .36 . For the second column, as we said, if we had changed its sign, all the signs would match up, and in particular the single large terms would then be +0.9324 and +0.9575 .

I should emphasize that he calls both Ar and As, “correlation loadings”, at one point or another.

We could also severely round both As and Ar to confirm their similarity:

As = \left(\begin{array}{llll} 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 0.5 & -1. & 0 & 0\end{array}\right)

Ar = \left(\begin{array}{llll} 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 1. & 0 & 0 & 0 \\ 0 & 1. & 0 & 0\end{array}\right)

As he said, we shoulda changed the sign of the second column of Vc.

Without data, there isn’t a whole lot more we can do here. But I do want to show you one last thing: let’s take a closer look at Ar. We computed it from

Ar = \sigma^{-1}Ac

and the rows of Ar are supposed to be of unit length. As usual, the fastest way to compute the lengths (more precisely, the squared lengths) of vectors in an array A, is to compute either A\ A^T or A^T\ A\ . The former gives us dot products of all the rows with each other, including with themselves, while the latter gives us all the dot products among the columns.

So if we compute Ar Ar^T\ , the diagonal elements should be 1. We see that they are:

Ar Ar^T = \left(\begin{array}{llll} 1. & 0.997272 & 0.974077 & 0.202363 \\ 0.997272 & 1. & 0.976861 & 0.189303 \\ 0.974077 & 0.976861 & 1. & 0.358825 \\ 0.202363 & 0.189303 & 0.358825 & 1.\end{array}\right)

In fact, we see that Ar Ar^T\ is equal to our old friend the correlation matrix r:

r = \left(\begin{array}{llll} 1. & 0.997272 & 0.974077 & 0.202363 \\ 0.997272 & 1. & 0.976861 & 0.189303 \\ 0.974077 & 0.976861 & 1. & 0.358825 \\ 0.202363 & 0.189303 & 0.358825 & 1.\end{array}\right)

Don’t be distracted, however, by his scaling the rows. The new basis – which gives us the PCs Zc from Xs – is specified by the columns of Ar, just as the columns of Ac and As were the bases that gave us Zc from Xc, and Zs from Xs. (As I said, all three models are of the same form: X^T = A\ Z^T\ . And as I’ve said before, we should keep our eyes on the linear algebra.)

So what does the basis Ar look like? (I’ve said some terrible things about it; let me justify them.) For this, we compute Ar^T\ Ar\ :

Ar^T\ Ar = \left(\begin{array}{llll} 2.9908 & 0.369352 & -0.0756215 & 0.0265267 \\ 0.369352 & 0.953093 & 0.0856776 & -0.157942 \\ -0.0756215 & 0.0856776 & 0.0193785 & -0.023426 \\ 0.0265267 & -0.157942 & -0.023426 & 0.0367285\end{array}\right)

That the diagonals are not 1 says the columns are not of unit length; that the off-diagonal elements are not zero says the columns are not mutually orthogonal.

Ar is not a very nice basis at all.

I think it has some redeeming social value, but we’ll look at that when we have data.

Looking at eigenvectors

Let’s check one more thing. The columns of Vc are eigenvectors of the covariance matrix: we should have

c\ Vc = Vc\ \Lambda c^2\ .

Do we? We do. (When I compute the difference, I get zero, to within 10^{-13}\ .)

And columns of Ac are eigenvectors of the covariance matrix, too:

c\ Ac = Ac\ \Lambda c^2\ .

Similarly, columns of Vs and As are eigenvectors of r, and we should have both

r\ Vs = Vr\ \Lambda s^2

and

r\ As = As\ \Lambda s^2\ .

We do.

(This is why I’m so sure my answers are right: I’ve checked the eigendecompositions.)

What about Ar? Even supposing that its columns are eigenvectors of c, we don’t know what the eigenvalues are. No problem, just see if

r Ar

is proportional to Ar. (That is, see if columns are proportional; see below.)

We have

r\ Ar = \left(\begin{array}{llll} 2.98756 & 0.369966 & -0.0752539 & 0.0261815 \\ 2.98765 & 0.357973 & -0.0770204 & 0.0286877 \\ 2.99806 & 0.524909 & -0.0605794 & 0.0000958891 \\ 0.942999 & 1.02403 & 0.0730146 & -0.154885\end{array}\right)

and when we divide, element by element, by Ar, we get

\left(\begin{array}{llll} 2.98849 & -292.778 & -3.02879 & 21.1478 \\ 2.9913 & -44.0836 & 1.59331 & -4.75413 \\ 3.07054 & 2.75525 & 0.809018 & 0.00139143 \\ 4.68519 & 1.06953 & 0.70202 & 0.866628\end{array}\right)

so we conclude that the columns of Ar are not eigenvectors of r. (Put that the other way: if columns of Ar are eigenvectors, then applying r to them should yield columns which are proportional to the eigenvectors, i.e. proportional to Ar.)

Are the columns of Ar eigenvectors of c? here’s c Ar…

c\ Ar = \left(\begin{array}{llll} 868.064 & 14.8855 & -9.01303 & 2.8942 \\ 599.263 & 10.1172 & -6.30661 & 2.03838 \\ 135.83 & 2.95751 & -1.40195 & 0.391071 \\ 8.55343 & 1.14433 & -0.0179306 & -0.101953\end{array}\right)

divided by Ar…

\left(\begin{array}{llll} 868.334 & -11779.8 & -362.753 & 2337.76 \\ 599.995 & -1245.91 & 130.464 & -337.802 \\ 139.114 & 15.524 & 18.7226 & 5.67474 \\ 42.4968 & 1.19517 & -0.172399 & 0.570459\end{array}\right)

Again, no; the columns of Ar are not eigenvectors of c.

Oh, do you need to see how that calculation works when it comes out right? here’s c Ac…

c\ Ac = \left(\begin{array}{llll} 15346.8 & -0.0370194 & 0.482492 & 0.0026118 \\ 10588. & -0.164275 & -0.648228 & -0.00879074 \\ 2390.24 & 0.890006 & -0.231879 & 0.0231838 \\ 140.865 & 1.27878 & 0.0920784 & -0.0171892\end{array}\right)

and dividing by Ac gives us…

\left(\begin{array}{llll} 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549 \\ 706.979 & 1.34915 & 0.894303 & 0.0971549\end{array}\right)

which says the 4 eigenvalues for the 4 columns are

\{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

and that’s exactly what we got a long time ago:

\lambda c = \{706.979,\ 1.34915,\ 0.894303,\ 0.0971549\}

The purpose of this post was to demonstrate Basilevsky’s computations. Next time, I’ll do these kinds of calculations with data.

PCA / FA Basilevsky on standardizing. Discussion.

Introduction and review

Basilevsky presents an extremely interesting idea. For all I know, it’s become common in the last 10-20 years, but I hadn’t seen it in any of the other books we’ve looked at.

I’ll tell you up front that he’s going to normalize the rows of an A matrix, specifically the A matrix computed from the eigendecomposition of the covariance matrix.

I’ll also tell you up front that I don’t see any good reason for doing it, but I’m not averse to finding such a reason someday.

Suppose we have centered data X with variables in columns, and N rows (observations). The covariance matrix is c = \frac{X^T\ X}{N-1}\ ), and we find its eigendecomposition. We get a matrix V having the eigenvectors as columns, and we construct a diagonal matrix  \Lambda whose entries are the square roots of the eigenvalues.

That is, we have the decomposition

\Lambda^2 = V^{-1}\ c\ V\ .

Then we define the matrix A = V\ \Lambda\ : each column is an eigenvector, of length \sqrt{\text{eigenvalue}}\ .

For the next few posts, I will use a notation of Basilevsky (and Jolliffe): the principal components will be denoted by Z. (Yes, Harman and I almost invariably use Z = X^T\ , but I need to abandon that notation for a while. Sorry.)

This coincides with Jolliffe’s notation, which was to write

Z = X\ V\ ,

with Z as the principal components, X the data, and V the orthogonal eigenvector matrix.

As we have observed many times, the starting point for making sense of that is the change-of-basis equation for a vector:

x^T = V\ z^T\ ,

where x^T and z^T are column vectors, x is the old components, z is the new components, and V is the transition matrix. (Its columns are the old components of the new basis vectors. Using the symbol V is not an accident: the matrix V from the eigendecomposition is a transition matrix.)

Applied to the entire matrix X, the change-of-basis equation is

X^T = V\ Z^T

Transpose:

X = Z\ V^T

Solve for Z:

X\ V^{-T} = Z.\

But V is orthogonal, V^T = V^{-1}\ , so V^{-T} = V\ :

X\ V = Z\ .

Among the very first things we learned in PCA (in Harman) were two: first, that the Z’s we just defined have maximally redistributed the variance of the X’s: the 1st Z variable has the largest variance of any linear combination of the X’s; the 2nd Z variable has the 2nd largest variance of any other linear combination of the X’s, etc. (All subject to the constraint that the total variance of the Z’s is equal to the total variance of the X’s.)

Second, we learned that if we used the transition matrix A = V\ \Lambda\ , then the Z’s defined by it would be both standardized and uncorrelated.

In our present notation (our X^T is Harman’s Z, and our Z^T is Harman’s F), we would write Harman’s model as

X^T = A\ Z^T\

which, column by column, is precisely the change-of-basis equation

x^T = A\ z^T\ ,

with transition matrix A instead of V.

Or we could write

X = Z\ A^T.

(I will try not to mix these up.)

Let me emphasize that the two Z’s in

Z = X\ V

and

X^T = A\ Z^T

are different. In what follows, I will distinguish them by subscripts (Zu for the first, Zc for the second.)

(It is true that the second set of Z’s could have been found from the first set: just standardize the first set! I had never thought to ask that or check it before, but it would be rather annoying if it were not true. See below.)

We will, in fact, be dealing with both centered data Xc and the corresponding standardized data Xs; we will have corresponding orthogonal eigenvector matrices Vc and Vs; corresponding A matrices Ac and As, and a third one Ar; we will have 3 Z matrices, although a fourth exists in principle.

Zu will be the unstandardized Z from Vc:  Zu = Xc\ Vc

Zc will be the standardized Z from Ac: Xc^T = Ac\ Zc^T

Zs will be the standardized Z from As: Xs^T = As\ Zs^T\ .

(I have tried to minimize the occurence of Z’s as opposed to Zs, and X’s versus Xs, but I can’t reduce the number to zero. Zs and Xs are specific matrices, as opposed to collections of Z’s and X’s, which have apostrophes)

A new basis

We will show that

Xs^T = Ar\ Zc^T\ ,

where Ar can be computed from Ac.

That last is a mix of apples and oranges: standardized Xs, but standardized Zc from the covariance matrix, instead of either standardized Xs and standardized Zs from the correlation matrix, or centered Xc and standardized Zc from the covariance matrix.

As I intimated at the beginning, the Ar is Ac with its rows normalized to length 1.

Let’s work that out. The key is that each row of Ac has a sample std deviation \sigma as its length.

Let’s see that. To compute the mutual dot products of rows of Ac, we could compute

Ac\ Ac^T\ .

The diagonals of the result would be the squared lengths of rows of Ac (the dot products of rows with themselves).

OK. Let me drop the subscript c for the moment:

AA^T = \left(V\ \Lambda\right)\ \left(V\ \Lambda \right)^T = V\  \Lambda\ \Lambda^T\ V^T = V\ \Lambda^2\ V^{-1} = c\ ,

where c is the covariance matrix.

(the eigendecomposition was \Lambda^2 = V^{-1}\ c\ V\ .)

So the diagonal elements of Ac\ Ac^T are not only the squared lengths of the rows, but also the sample variances of Xc.

We want to scale each row of Ac by its length. To scale columns of V, we post-multiplied by the diagonal matrix \Lambda\ . To scale rows of Ac, we will premultiply by a diagonal matrix of inverse standard deviations.

So let \sigma = diagonal matrix of square roots of variances, and let \sigma^{-1} be its inverse. (Ignore the possibility of zero variances; after all, we do that every time we talk about the correlation matrix.)

We want to scale the rows of Ac to unit length to get Ar, so we define

Ar = \sigma^{-1}Ac\ .

We have the relationship between Xc and Zc:

Xc = Zc\ Ac^T\ .

We standardize Xc by scaling columns by standard deviations, so

Xs = Xc\ \sigma^{-1} = Zc\ Ac^T\ \sigma^{-1} = Zc\ Ar^T\ .

FInally, transposing

Xs^T = Ar\ Zc^T\ ,

QED.

Let me lay out all three equations again (I really like to see them all together):

Xc^T = Ac\ Zc^T

Xs^T = Ar\ Zc^T

Xs^T = As\ Zs^T\ .

That’s how I would present Basilevsky’s work. He never introduced Ac, and he computed Zc from the unstandardized Zu:

Zu = Xc\ Vc

Zc = Zu\ \Lambda^{-1}

Standardizing Zu

Yes, as I said, the standardized Zc can be computed by standardizing the Zu; and the variances of the Zu are given by the eigenvalues of the covariance matrix \left(\lambda = \text{diagonal of }\Lambda^2\right)\ . It’s easy enough to confirm.

From our usual definition of Zc,

Xc^T = Ac\ Zc^T\ , so Xc = Zc\ Ac^T\ ,

and the old definition of Zu

Zu = Xc\ Vc\ ,

we could write

Zu =  \left(Zc\ Ac^T\right)\ Vc = Zc\ \left(Vc\ \Lambda\right)^T\ Vc = Zc\  \Lambda\ Vc^T\ Vc = Zc\ \Lambda

i.e.

Zc = Zu\ \Lambda^{-1}\ ,

QED.

Summary and query

Overall, then, I learned two things. One, that standardized Zc and unstandardized Zu, defined by

Xc^T = Ac\ Zc^T

and

Zu = Xc\ Vc

were related in the sensible way; we could get Zc by standardizing Zu.

Two, and more importantly, I learned that we could write

Xs^T = Ar\ Zc^T\ ,

where

Ar = \sigma^{-1}\ Ac\ .

Fine.

Now, why would we want to?

I don’t know.

Basilevsky got there by looking at

Zu = Xc\ Vc

and saying he wanted to standardize both Z and X.

But we already know how to do that:

Xs^T = As\ Zs^T\ ,

with standardized Xs and standardized Zs.

Why doesn’t he do that?

He does.

And he knows and shows that As \ne Ar\ , and Zs \ne Zc\ , so by using Ar he has gotten standardized Zc which are different from standardized Zs.

I am perfectly happy with Zs derived from Xs. I am perfectly happy with Zc derived from Xc. I have no idea why we would want to use Ar to relate Zc to Xs. It’s true, but why do we care?

OK, Ac is not an orthonormal basis, but neither is Ar, so that’s not an improvement. OK, the columns of Ac do not have unit length, but they are mutually orthogonal and they are eigenvectors of the covariance matrix. In contrast, the columns of Ar are neither mutually orthogonal , nor are they eigenvectors (of Xc or of Xs).

I find myself wondering if there is some reason for liking this non-orthogonal basis Ar. Does its failure to be orthogonal tell us something? But it didn’t give us anything new. It says (once again) that

Xs^T = Ar\ Zc^T\ ,

but Zc is found from Xc and the eigenvectors of the covariance matrix:

Xc^T = Ac\ Zc^T\ ,

and Xs is found by standardizing Xc. We don’t need Ar to get the two things it relates.

I think Ar is cute, but I need some reason for going to a basis which isn’t at least orthogonal. Heck, I need a reason for going from the orthogonal Vc to a basis (Ac) whose vectors are not of unit length, but the reason is that it provides standardized Zc.

Finally, I need some reason for abandoning the eigenvectors. I’m in no hurry to substitute Ar for either Ac or As. The eigenvectors have significant properties; what does Ar have to offer that makes up for losing those properties?

If anyone knows….