PCA / FA Example 8: the pseudo-inverse

introduction

Recall Harman’s or Bartholomew’s model

Z = A F

with Z = X^T\ , X standardized, and A a \sqrt{\text{eigenvalue}}\ -weighted eigenvector matrix, with eigenvalues from the correlation matrix.

We saw how to compute the scores F^T in the case that A was invertible (here). If, however, any eigenvalues are zero then A will have that many columns of zeroes and will not be invertible.

What to do?

One possibility – shown in at least one of the references, and, quite honestly, one of the first things I considered – is to use a particular example of a pseudo-inverse. I must tell you up front that this is not what I would recommend, but since you will see it out there, you should see why I don’t recommend it.

(Answer: it works, it gets the same answer, but computing the pseudo-inverse explicitly is unnecessary. In fact, it’s unnecessary even if we don’t have the Singular Value Decomposition (SVD) available to us.)

Suppose we are given raw data which has, in fact, constant row sums (in this case, 1):

raw = \left(\begin{array}{lll} -1 & -1 & 3 \\ 1 & -2 & 2 \\ \frac{1}{3} & 1 & -\frac{1}{3} \\ -\frac{1}{4} & 1 & \frac{1}{4} \\ \frac{1}{10} & \frac{1}{2} & \frac{2}{5}\end{array}\right)

A couple of those fractions do not display well on my screen, so let me convert to decimals:

\left(\begin{array}{lll} -1. & -1. & 3. \\ 1. & -2. & 2. \\ 0.333333 & 1. & -0.333333 \\ -0.25 & 1. & 0.25 \\ 0.1 & 0.5 & 0.4\end{array}\right)

PCA a la Harman or Bartholomew et al.

Now, we do a PCA using Harman’s or Bartholomew’s model – but my way, not theirs. Standardize the input:

X = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

The X matrix is “the data”.
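If you want to follow along numerically, here is a minimal sketch in Python with NumPy. That's not what produced the numbers above, and the variable names raw and X are just for the sketch; standardizing means subtracting each column mean and dividing by each column's sample standard deviation.

import numpy as np

# the raw data, with constant row sums of 1
raw = np.array([
    [-1.0,  -1.0,  3.0 ],
    [ 1.0,  -2.0,  2.0 ],
    [ 1/3,   1.0, -1/3 ],
    [-0.25,  1.0,  0.25],
    [ 0.1,   0.5,  0.4 ],
])

# standardize: subtract each column mean, divide by each column's sample std (ddof=1)
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
print(X.round(5))    # should match the standardized matrix above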

Get the eigenvalues of the correlation matrix:

\lambda 1 = \{1.86202,1.13798,0.\}

Yes, one of them is identically zero. We learned here that this happens when we start with constant nonzero row sums and compute the correlation matrix.

Get the diagonal matrix of square roots:

\Lambda 1 = \left(\begin{array}{lll} 1.36456 & 0. & 0. \\ 0. & 1.06676 & 0. \\ 0. & 0. & 0.\end{array}\right)
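Continuing the Python sketch (again, just a hedged illustration; np.corrcoef and np.linalg.eigvalsh are the standard NumPy calls):

# correlation matrix of the columns (of the raw data, or equivalently of X)
R = np.corrcoef(raw, rowvar=False)

# eigenvalues, sorted in descending order; the smallest should be (numerically) zero
lam = np.linalg.eigvalsh(R)[::-1]
print(lam.round(5))                         # approximately [1.86202, 1.13798, 0.]

# Lambda1: the diagonal matrix of square roots of the eigenvalues
Lambda1 = np.diag(np.sqrt(np.clip(lam, 0.0, None)))   # clip guards against tiny negative round-off
print(Lambda1.round(5))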

Get the SVD of X… and look at w:

w1 = \left(\begin{array}{lll} 2.72912 & 0. & 0. \\ 0. & 2.13352 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Get the \sqrt{\text{eigenvalue}}\ -weighted matrix A, as A1 = v1\  \Lambda 1\ . (The only reason I’m using “1” is so that in subsequent algebra and computations I can use A, v, \Lambda\ etc. consistently for cut-down matrices.)

A1 = \left(\begin{array}{lll} -0.136612 & 0.990625 & 0. \\ -0.938341 & -0.34571 & 0. \\ 0.981263 & -0.192673 & 0.\end{array}\right)
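In the sketch, the SVD and A1 look like this. Keep in mind that a numerical SVD determines the columns of u and v only up to sign, so individual entries may come out with flipped signs relative to the matrices printed here:

# full SVD of the standardized data: X = u1 . w1 . v1^T
u1, s, v1t = np.linalg.svd(X)       # u1 is 5x5, s holds the three singular values, v1t is 3x3
v1 = v1t.T

# the singular values are sqrt(N-1) times the square roots of the eigenvalues
N = X.shape[0]
print(s.round(5))                                                      # approximately [2.72912, 2.13352, 0.]
print((np.sqrt(N - 1) * np.sqrt(np.clip(lam, 0.0, None))).round(5))    # same numbers

# the sqrt-eigenvalue-weighted matrix A1 = v1 Lambda1
# (column signs may be flipped relative to the values printed in the post)
A1 = v1 @ Lambda1
print(A1.round(6))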

Compute F^T\ as I would usually, as columns of \sqrt{N-1}\ u\ , in this case 2 u1, and only the first 3 columns:

F1^T = \left(\begin{array}{lll} 1.17769 & -1.25613 & 0.660501 \\ 0.974085 & 1.45252 & 0.269327 \\ -0.97385 & 0.271651 & 1.59142 \\ -0.693986 & -0.487969 & -0.0525462 \\ -0.483941 & 0.0199256 & -0.977656\end{array}\right)

Let’s check it by computing F^T\ A^T\ :

F1^T\ A1^T\  = \left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

and that is, indeed, X:

\left(\begin{array}{lll} -1.40524 & -0.67082 & 1.39765 \\ 1.30584 & -1.41618 & 0.675971 \\ 0.402143 & 0.819892 & -1.00794 \\ -0.388588 & 0.819892 & -0.586964 \\ 0.0858508 & 0.447214 & -0.478713\end{array}\right)

We’re done.
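The same computation and check in the sketch. The third column of the scores goes with the zero singular value and is arbitrary, but it multiplies A1's zero column, so the check still works:

# scores: sqrt(N-1) = 2 times the first three columns of u1
F1T = np.sqrt(N - 1) * u1[:, :3]

# the check: F1^T A1^T reproduces X, because A1's third column is zero,
# so whatever sits in F1T's third column drops out of the product
print(np.allclose(F1T @ A1.T, X))    # True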

But what if the SVD is not available to us, and we don’t have u?

Let us recall the derivation in the case that A was invertible (here it was called A1; this is old stuff, so try not to worry about all the missing “1”s): we could also have computed F^T\ by projecting X onto the reciprocal basis A^{-T}\ :

F^T = X\ A^{-T}\ ,

but we can also derive that formula without ever knowing about the reciprocal basis:

Z = A\ F

A^{-1}\ Z = F

F^T = Z^T\ A^{-T}

F^T = X\ A^{-T}\ .

In that case, we could also have used A = v\ \Lambda\ to simplify things further, and get

A^T = \Lambda\ v^T

A^{-T} = v^{-T}\ \Lambda^{-1}

A^{-T} = v\ \Lambda^{-1}\ (because v is orthogonal)

and finally

F^T = X\ v\ \Lambda^{-1}\ .

That was fine (and it’s what Bartholomew had us do), but our A is not invertible.

the pseudo-inverse

So let’s work with the cut-down matrices. To be specific, we start with the two nonzero columns of A1:

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

This cut-down A has, in a sense, eliminated the problem: although the new A can’t be inverted because it’s not square, it is of full rank. Even better, it’s A1 with one column removed: in a very real sense, it’s the same linear operator.

Any time I have a formula which involves a matrix inverse, and I want to generalize it to a case where I have a rectangular matrix – hence, no inverse – I investigate whether a pseudo-inverse will work.

In this case, A is of rank 2, so the two square matrices A^T\ A and A\ A^T are of rank 2. But A is 3×2, so A^T\ A is 2×2 and A\ A^T is 3×3, so A^T\ A is invertible, even though A1 wasn’t. There’s no point to considering A\ A^T\ , because it’s not invertible.
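A quick rank check in the sketch (A here is the cut-down 3×2 matrix just shown):

# the cut-down A: the two nonzero columns of A1
A = A1[:, :2]

print(np.linalg.matrix_rank(A))          # 2: full column rank
print(np.linalg.matrix_rank(A.T @ A))    # 2: the 2x2 matrix A^T A is invertible
print(np.linalg.matrix_rank(A @ A.T))    # 2: the 3x3 matrix A A^T is rank-deficient, hence not invertible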

So, for the rectangular A – because what I dropped was a column of zeroes – we still have

Z = A F

(That is crucial. We didn’t change the data Z = X^T\ , because we didn’t lose any singular values.)

Premultiply by A^T\ :

A^T Z = A^T\ A\ F\ .

Now premultiply by \left(A^T\ A\right)^{-1}\ , which we know exists:

\left(A^T\ A\right)^{-1}\ A^T\ Z = F\ .

It may not be pretty, but it will work.
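In the sketch, here is the pseudo-inverse formed explicitly, plus the way I would actually solve it numerically if I had to:

Z = X.T                                   # the model is Z = A F

# F via the pseudo-inverse (A^T A)^{-1} A^T, formed explicitly here just for illustration
F = np.linalg.inv(A.T @ A) @ A.T @ Z      # 2 x 5
print(F.T.round(5))     # the first two columns of F1^T (column signs may differ from the post)

# equivalent, and numerically preferable: solve the normal equations instead of inverting
F_alt = np.linalg.solve(A.T @ A, A.T @ Z)
print(np.allclose(F, F_alt))              # True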

I should remind us that this is exactly what goes on in regression (OLS) here: the “normal equations” for the regression model

\hat{Y} = X\ \beta

are solved by

\beta = \left(X^T\ X\right)^{-1}\ X^T\ Y \ .
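Here is a hedged OLS sketch in the same vein; the design matrix Xd, the response Y, and the coefficients are all made up, just to show the shape of the computation:

# OLS via the normal equations, on hypothetical data
rng = np.random.default_rng(0)
Xd = rng.normal(size=(20, 3))                           # hypothetical design matrix
Y = Xd @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

beta = np.linalg.inv(Xd.T @ Xd) @ Xd.T @ Y              # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(Xd, Y, rcond=None)     # the numerically preferred route
print(np.allclose(beta, beta_lstsq))                    # True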

Let me also point out that if A is invertible, then the inverse expands as

\left(A^T\ A\right)^{-1} = A^{-1}\ A^{-T}\ ,

so from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get

F = A^{-1}\ A^{-T}\ A^T\ Z

i.e.

F = A^{-1}\ Z

and finally

F^T = X A^{-T}\ .

That is, the equation written with the pseudo-inverse

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

contains our earlier special case

F^T = X\ A^{-T}

when A is invertible. That derivation, incidentally, reminds me that although I think of \left(A^T\ A\right)^{-1} as the pseudo-inverse, it’s really the product \left(A^T\ A\right)^{-1}\ A^T which is the pseudo-inverse, since it’s the product that collapsed to A^{-1}\ .

It gets better, in this case: the pseudo-inverse is far prettier than it looks, but let’s compute it before we show why.

Here is A^T\ A\ :

A^T\ A = \left(\begin{array}{ll} 1.86202 & 0 \\ 0 & 1.13798\end{array}\right)

Diagonal?

Diagonal!

Even better, it’s \Lambda^2\ , where \Lambda is the cut-down \Lambda 1\ :

\Lambda = \left(\begin{array}{ll} 1.36456 & 0. \\ 0. & 1.06676\end{array}\right)

\Lambda^2 = \left(\begin{array}{ll} 1.86202 & 0. \\ 0. & 1.13798\end{array}\right)

That is, it’s the diagonal matrix of \lambda instead of \sqrt{\lambda}\ . Not only is it diagonal, but its elements are almost already computed. And in fact the inverse is almost immediate:

\left(A^T\ A\right)^{-1} = \Lambda^{-2}\ .
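Checked in the sketch (Lambda_ is just my name for the cut-down 2×2 \Lambda):

Lambda_ = Lambda1[:2, :2]                  # the cut-down 2x2 Lambda

print((A.T @ A).round(5))                  # diagonal: diag(1.86202, 1.13798)
print(np.allclose(A.T @ A, Lambda_ @ Lambda_))                                 # True: A^T A = Lambda^2
print(np.allclose(np.linalg.inv(A.T @ A), np.linalg.inv(Lambda_ @ Lambda_)))   # True: the inverse is Lambda^{-2}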

Then a quick check shows that just as

A1 = v1\ \Lambda 1\ ,

we have

A = v\ \Lambda\ ,

hence

A^T = \Lambda\ v^T

(We also need the cut-down version v of v1, namely its first two columns, as well as the cut-down 2×2 \Lambda of \Lambda 1\ , which we got earlier.)

v = \left(\begin{array}{ll} -0.100114 & 0.92863 \\ -0.687651 & -0.324075 \\ 0.719106 & -0.180615\end{array}\right)

Then A = v\ \Lambda is

A = \left(\begin{array}{ll} -0.136612 & 0.990625 \\ -0.938341 & -0.34571 \\ 0.981263 & -0.192673\end{array}\right)

(That is, whether we start with A1 and cut it down to A, or compute the cut-down A from v and \Lambda\ , we get the same thing.)

Putting it all together, from

F = \left(A^T\ A\right)^{-1}\ A^T\ Z

we get, as before,

F = \Lambda^{-2}\ \Lambda\ v^T\ Z = \Lambda^{-1}\ v^T\ Z

and, taking the transpose (with Z^T = X),

F^T = X\ v\ \Lambda^{-1}\ .
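In the sketch, with the cut-down v:

v = v1[:, :2]                              # the cut-down v: first two columns of v1

FT = X @ v @ np.linalg.inv(Lambda_)        # F^T = X v Lambda^{-1}
print(np.allclose(FT, F.T))                # True: same scores as the pseudo-inverse route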

That is, if A is invertible, we can compute F^T as any one of

  • F^T = X\ v\ \Lambda^{-1}
  • F^T = X\ A^{-T}
  • F^T = \sqrt{N-1}\ u (the first “few” columns).

If A is not invertible but the cut-down A is of full rank (so v and \Lambda are also cut down as above), we can compute F^T as any of

  • F^T = X\ v\ \Lambda^{-1}
  • F = \left(A^T\ A\right)^{-1}\ A^T\ X^T and take the transpose.
  • F^T = \sqrt{N-1}\ u (the first “few” columns).

The only difference between A not invertible and A invertible is F = \left(A^T\ A\right)^{-1}\ A^T\ X^T in place of F^T = X\ A^{-T}\ .

(Yes, one of those uses F, the other F^T\ . All three routes are checked numerically in the sketch below.)
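That check, in the sketch (all quantities as defined in the earlier snippets):

# the three routes to the (cut-down) scores agree
FT_svd  = np.sqrt(N - 1) * u1[:, :2]                    # sqrt(N-1) u, first two columns
FT_eig  = X @ v @ np.linalg.inv(Lambda_)                # X v Lambda^{-1}
FT_pinv = (np.linalg.inv(A.T @ A) @ A.T @ X.T).T        # pseudo-inverse route, transposed
print(np.allclose(FT_svd, FT_eig) and np.allclose(FT_eig, FT_pinv))    # True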

But as I said at the beginning, I didn’t go thru all that in order to encourage you to compute the pseudo-inverse \left(A^T\ A\right)^{-1}A^T\ , but unfortunately you may very well see it in texts.

When you do, realize that it’s true but utterly unnecessary for computing F.

(There’s no reason to tell you which of the references show it.)

I would always compute F^T as F^T = \sqrt{N-1}\ u\ . If the SVD makes you uncomfortable, or isn’t available (I weep for you), then compute it as F^T = X\ v\ \Lambda^{-1}\ , using the eigenvector matrix v from the eigendecomposition of the correlation matrix.

Even though the pseudo-inverse isn’t all that useful in this case, it’s a good thing to know about in general.

Oh, and don’t even dream that it’s unnecessary for regression. For PCA / FA, it’s

A^T\ A = \Lambda^2

which makes it unnecessary to actually compute the pseudo-inverse. For regression, X^T\ X isn’t at all likely to be diagonal.
