## PCA / FA Malinowski Example 5: noise and both X and Y data

I think I understand what Malinowski is talking about on p. 63. He has no example to illustrate it, and I have to assume that he was a little sloppy: I must assume that the matrix T in eq. 3.136 is different from the matrix T in eq. 3.134. I don’t like having to assume that, but the math makes sense this way, and it is the best way I have found to reconcile his equations.

I had thought that eq. 3.136 was a typo, but I no longer think so.

If what I’m about to do is what Malinowski was suggesting, then the credit is his; in any case, the mathematics looks good to me, and if it’s not what he suggested, it is nevertheless my reaction to his work.

## the data and its SVD

The cleanest way to make my case is to do a simulation. The following is the data from my example 5. It is of rank 2. In keeping with my latest notation, this is denoted by a “D”, specifically D0 (rather than “X”).

$D0 = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)$

Let me add some noise to it.

Here’s randomly generated and rounded noise…

$\left(\begin{array}{lll} -0.1 & 0.2 & -0.1 \\ 0 & -0.2 & 0 \\ 0 & -0.1 & 0.1 \\ 0 & -0.1 & 0.1 \\ 0 & -0.1 & -0.1\end{array}\right)$

which I add to D0 to get the final data matrix for this example. In keeping with my latest notation, that data matrix will be called D.

$D = \left(\begin{array}{lll} 1.9 & 3.2 & 3.9 \\ 1 & -0.2 & -1 \\ 4 & 4.9 & 6.1 \\ 3 & 1.9 & 1.1 \\ 6 & 6.9 & 7.9\end{array}\right)$
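If you want to follow along numerically, here is a minimal sketch of the setup in Python with numpy; the variable names (D0, noise, D) are mine.

```python
import numpy as np

# The rank-2 "true" data D0 from the example.
D0 = np.array([[2, 3, 4],
               [1, 0, -1],
               [4, 5, 6],
               [3, 2, 1],
               [6, 7, 8]], dtype=float)

# The randomly generated, rounded noise shown above.
noise = np.array([[-0.1,  0.2, -0.1],
                  [ 0.0, -0.2,  0.0],
                  [ 0.0, -0.1,  0.1],
                  [ 0.0, -0.1,  0.1],
                  [ 0.0, -0.1, -0.1]])

# The observed data: true data plus noise.
D = D0 + noise
```

Adding the noise raises the rank from 2 to 3, which is why the third singular value below is small but not zero.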

That is what we know, not D0. We can talk about D0 because this is a simulation, but in practice D0 is unknowable: we are trying to estimate it.

Get the Singular Value Decomposition of D, $D = u\ w\ v^T\$. Here’s w:

$w = \left(\begin{array}{lll} 16.194 & 0. & 0. \\ 0. & 2.41991 & 0. \\ 0. & 0. & 0.238533 \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)$

Set the 3rd singular value to zero. We are assuming that it is a measurement of the noise – which it is, by design. (There will still be some noise left over.)

We could write either

$w0 = \left(\begin{array}{lll} 16.194 & 0 & 0 \\ 0 & 2.41991 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0\end{array}\right)$

or

$w1 = \left(\begin{array}{ll} 16.194 & 0. \\ 0. & 2.41991\end{array}\right)$

For w1, we need the cut-down u and v. Because w1 is 2×2, u1 is the first 2 columns of u, v1 is the first 2 columns of v. Then compute the “reconstituted data” $D1 = u1\ w1\ v1^T\$, using the cut-down SVD. As usual, it is also true that $D1 = u\ w0\ v^T\$, using the full SVD with only two singular values. Here is the reconstituted data D1:

$D1 = \left(\begin{array}{lll} 1.95859 & 3.05258 & 3.98415 \\ 0.992857 & -0.182028 & -1.01026 \\ 3.95401 & 5.0157 & 6.03396 \\ 3.02136 & 1.84625 & 1.13068 \\ 6.0016 & 6.89597 & 7.9023\end{array}\right)$
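Here is a sketch of the same truncation in numpy. Note that numpy returns the singular values as a vector and returns v already transposed, and it may flip the signs of matched columns of u and v relative to the matrices printed in this post; D1 is unaffected by such flips.

```python
import numpy as np

D = np.array([[1.9,  3.2,  3.9],
              [1.0, -0.2, -1.0],
              [4.0,  4.9,  6.1],
              [3.0,  1.9,  1.1],
              [6.0,  6.9,  7.9]])

# Thin SVD: u is 5x3, s holds the 3 singular values, vt is v transposed.
u, s, vt = np.linalg.svd(D, full_matrices=False)

# The cut-down SVD: keep only the 2 largest singular values.
u1  = u[:, :2]         # first 2 columns of u
w1  = np.diag(s[:2])   # 2x2 diagonal w1
v1t = vt[:2, :]        # first 2 rows of v^T, i.e. v1 transposed

# The reconstituted data.
D1 = u1 @ w1 @ v1t
```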

The difference between the data D and the reconstituted matrix is…

$D - D1 = \left(\begin{array}{lll} -0.0585929 & 0.147418 & -0.0841454 \\ 0.00714334 & -0.0179725 & 0.0102586 \\ 0.0459858 & -0.115699 & 0.0660403 \\ -0.0213625 & 0.0537476 & -0.0306787 \\ -0.00160247 & 0.00403179 & -0.00230131\end{array}\right)$

There is no way in the real world we would know the original matrix D0, but this is a simulation and here’s the difference between D0 and D1:

$D0 - D1 = \left(\begin{array}{lll} 0.0414071 & -0.0525815 & 0.0158546 \\ 0.00714334 & 0.182028 & 0.0102586 \\ 0.0459858 & -0.0156993 & -0.0339597 \\ -0.0213625 & 0.153748 & -0.130679 \\ -0.00160247 & 0.104032 & 0.0976987\end{array}\right)$

The u and v computed from D (which has noise) are different from the u and v computed from D0 (which does not have noise); the two largest singular values (w1) are also different: I infer that we cannot eradicate the noise (D vs D0) just by eliminating the smallest singular value. If we could have, D1 would match D0.

In other words, the noise has caused D1 to differ from D0.

For reference, in case you are working along with me, here are the u and v matrices.

$u = \left(\begin{array}{lllll} -0.32924 & -0.320222 & -0.752815 & -0.205034 & -0.424605 \\ 0.0179826 & 0.577752 & 0.0917793 & -0.775085 & -0.238114 \\ -0.542068 & -0.155692 & 0.590837 & 0.11501 & -0.565338 \\ -0.201861 & 0.731055 & -0.27447 & 0.56307 & -0.180079 \\ -0.746118 & 0.0705562 & -0.0205889 & -0.1641 & 0.641075\end{array}\right)$

$v = \left(\begin{array}{lll} -0.485249 & 0.811213 & 0.326294 \\ -0.570892 & -0.0112849 & -0.820948 \\ -0.662281 & -0.584642 & 0.468591\end{array}\right)$

## the RC decomposition, and Xhat from X

Moving on, do what we’ve done before: here’s $R = u1\ w1$ and $C = v1^T\$, so D1 = R C:

$R = \left(\begin{array}{ll} -5.33172 & -0.774911 \\ 0.291211 & 1.39811 \\ -8.77828 & -0.37676 \\ -3.26895 & 1.76909 \\ -12.0827 & 0.17074\end{array}\right)$

$C = \left(\begin{array}{lll} -0.485249 & -0.570892 & -0.662281 \\ 0.811213 & -0.0112849 & -0.584642\end{array}\right)$
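The RC decomposition is one line each, continuing from the truncated SVD. Because numpy may negate a matched column of u1 and row of v1^T, a column of R and the corresponding row of C can come out with opposite signs from those shown above; the product RC is unchanged.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]
D1 = u1 @ w1 @ v1t

R = u1 @ w1   # 5x2 "column" factor
C = v1t       # 2x3 "row" factor
```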

Now, we test the following two X vectors (i.e. two columns of this X matrix):

$X = \left(\begin{array}{ll} 1 & 1 \\ 2 & -1 \\ 3 & 1 \\ 4 & -1 \\ 5 & 1\end{array}\right)$

You should assume that I tested thousands of other data records looking for likely candidates, and selected these two based on the following calculations.

To get $\hat{X}$ I want the hat matrix, $H = u1\ u1^T\$:

$H = \left(\begin{array}{lllll} 0.210941 & -0.19093 & 0.228326 & -0.167639 & 0.223058 \\ -0.19093 & 0.334121 & -0.099699 & 0.418739 & 0.0273469 \\ 0.228326 & -0.099699 & 0.318078 & -0.0043966 & 0.393462 \\ -0.167639 & 0.418739 & -0.0043966 & 0.57519 & 0.202193 \\ 0.223058 & 0.0273469 & 0.393462 & 0.202193 & 0.56167\end{array}\right)$

I compute $\hat{X} = H\ X$. (As I said, in the real world this computation is what selected the columns of X; the computation preceded the construction of X.)

$\hat{X} = \left(\begin{array}{ll} 0.958792 & 1.02089 \\ 1.98991 & -1.01614 \\ 2.93289 & 1.04396 \\ 3.96837 & -0.963772 \\ 5.07526 & 0.94865\end{array}\right)$

I compute the difference $\hat{X} - X\$:

$\hat{X} - X = \left(\begin{array}{ll} -0.0412078 & 0.0208948 \\ -0.0100941 & -0.0161423 \\ -0.067115 & 0.0439618 \\ -0.0316295 & 0.0362281 \\ 0.0752581 & -0.0513498\end{array}\right)$

OK, they will do: the differences are small. We would have done that calculation for thousands of candidates. Now get $T = w1^{-1}\ u1^T\ X$ for the two candidates we chose.

$T = \left(\begin{array}{ll} -0.398759 & -0.0885231 \\ 1.50634 & -0.708358\end{array}\right)$

As before, we can confirm that the product R T

$R\ T = \left(\begin{array}{ll} 0.958792 & 1.02089 \\ 1.98991 & -1.01614 \\ 2.93289 & 1.04396 \\ 3.96837 & -0.963772 \\ 5.07526 & 0.94865\end{array}\right)$

is equal to $\hat{X}\$.
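Here is the chain from the hat matrix through T as a numpy sketch. T inherits numpy's sign conventions (its rows may be negated relative to the T printed above), but H, $\hat{X}$, and the product R T are sign-invariant.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1 = u[:, :2], np.diag(s[:2])
R = u1 @ w1

H = u1 @ u1.T                      # hat matrix: projects onto the column space of u1
Xhat = H @ X                       # projections of the two test vectors
T = np.linalg.inv(w1) @ u1.T @ X   # T = w1^-1 u1^T X
```

Since H is a projection it is symmetric and idempotent, and R T collapses to u1 u1^T X, which is exactly $\hat{X}$.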

## Yhat from Y

Now, I’m going to change notation. Somebody has to do something! We write, as before

$D1 = R\ C = R\ T\ T^{-1}\ C = \hat{X}\ T^{-1}\ C\$.

All I want to do is change the name of $T^{-1}\ C$: call it $\hat{C}$ (instead of $\hat{Y}$). Then we have

$D1 = \hat{X}\ \hat{C}\$,

and $\hat{C}$ is

$\hat{C} = \left(\begin{array}{lll} 0.999354 & 0.970148 & 1.00377 \\ 0.979945 & 2.07897 & 2.95989\end{array}\right)$
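A sketch of $\hat{C}$ in numpy: any sign flips in T and C cancel in the product $T^{-1}\ C$, so $\hat{C}$ itself should match the values above, and the factorization $D1 = \hat{X}\ \hat{C}$ is exact.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]
D1 = u1 @ w1 @ v1t
C = v1t

Xhat = u1 @ u1.T @ X               # H X
T = np.linalg.inv(w1) @ u1.T @ X
Chat = np.linalg.inv(T) @ C        # C-hat = T^-1 C
```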

Why change notation? Because I’m about to test some vectors y and construct an independent $\hat{Y}$. To put it another way, I desperately need something which is to Y as $\hat{X}$ is to X.

Duh, use $\hat{Y}\$. (My first wish is that Malinowski had done this.)

Let $J = v1\ v1^T$ (This is the hat matrix for Y; it projects a vector onto the 2D space spanned by the columns of v1. It corresponds neatly to $H = u1\ u1^T\$.)

$J = \left(\begin{array}{lll} 0.893533 & 0.26787 & -0.152898 \\ 0.26787 & 0.326045 & 0.384689 \\ -0.152898 & 0.384689 & 0.780423\end{array}\right)$

Let the two test vectors y be:

$Y = \left(\begin{array}{lll} 2 & 2 & 2 \\ 0.5 & 1. & 1.5\end{array}\right)$

NOTE that my two vectors are the rows of Y, whereas my x vectors were the columns of X.

Instead of $\hat{X} = H\ X$, we have $\hat{Y} = Y\ J$ (because I put the vectors into the rows, and J is symmetric: $J^T = J$). Here is $\hat{Y}$:

$\hat{Y} = \left(\begin{array}{lll} 2.01701 & 1.95721 & 2.02443 \\ 0.485289 & 1.03701 & 1.47887\end{array}\right)$

and here is the difference between Y and $\hat{Y}\$:

$Y - \hat{Y} = \left(\begin{array}{lll} -0.0170087 & 0.0427934 & -0.0244262 \\ 0.014711 & -0.0370126 & 0.0211265\end{array}\right)$

The projections are very close to Y, of course. Again, you should assume that I searched high and low and tested hundreds or thousands of potential y vectors before selecting these two; the computations dictated the selection, in contrast to my selecting first and then computing.

Let’s look at it the other way: what are the new components of Y? Of $\hat{Y}$? To get new components, we multiply each vector $y_i$ by the inverse (i.e. transpose) of the transition matrix, i.e. by $v^T$. But the matrix Y has the vectors $y_i$ in its rows, so the individual products $v^T\ y_i$, which make sense for column vectors $y_i$, combine into the single product $Y\ v$, which makes sense for the row vectors making up Y; it gives us the new components in rows. And just as we used u instead of u1 for this, we use v instead of v1.

$Y\ v = \left(\begin{array}{lll} -3.43684 & 0.430572 & -0.0521269 \\ -1.80694 & -0.482641 & 0.0450852\end{array}\right)$

The 3rd components are nearly zero, as they should be: the vectors in Y are almost in the 2D subspace spanned by v1.

As further reassurance, we may compute the new components of the $\hat{Y}\$:

$\hat{Y}\ v = \left(\begin{array}{lll} -3.43684 & 0.430572 & 0 \\ -1.80694 & -0.482641 & 0\end{array}\right)$

Yes: their 3rd components are 0.
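A sketch of the Y side: the hat matrix J, the projection $\hat{Y}$, and the new components. The first two columns of the components can flip sign with numpy's conventions, so the test below checks only the third components, which are what matters here.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
Y = np.array([[2.0, 2.0, 2.0], [0.5, 1.0, 1.5]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
v1t = vt[:2, :]

J = v1t.T @ v1t   # hat matrix for Y: v1 v1^T
Yhat = Y @ J      # the test vectors are rows, and J is symmetric

comp_Y = Y @ vt.T        # new components of the rows of Y, using the full v
comp_Yhat = Yhat @ vt.T  # new components of the projections
```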

We need the “T” matrix for these $\hat{Y}$. First, give it a new name, S. (That’s the second thing I wish Malinowski had done, because I take it to be a different matrix.)

We have the hat matrix J

$J = v1\ v1^T$

and the projections $\hat{Y} = Y\ J$

and instead of $\hat{X} = R\ T\$, we will write

$\hat{Y} = S\ C = S\ v1^T$

so

$S = \hat{Y}\ v1\$.

i.e. we compute

$S = \hat{Y}\ v1\ = \left(\begin{array}{ll} -3.43684 & 0.430572 \\ -1.80694 & -0.482641\end{array}\right)$

I have checked that $\hat{Y} = S\ C\$.

This time we write

$D1 = R\ C = R\ S^{-1}\ S\ C = R\ S^{-1}\ \hat{Y}$

(The second reason for using S instead of T is the reversal: we insert $S^{-1}\ S$ instead of $T\ T^{-1}$.)

And now we use $\hat{R}$ to denote $R\ S^{-1}\$, getting

$D1 = \hat{R}\ \hat{Y}\$,

with

$\hat{R} = \left(\begin{array}{ll} 0.481412 & 2.03504 \\ 0.979059 & -2.02336 \\ 1.45929 & 2.08248 \\ 1.95929 & -1.91752 \\ 2.51976 & 1.89416\end{array}\right)$
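A sketch of S and $\hat{R}$ in numpy. As with $\hat{C}$, the sign ambiguity cancels between R and $S^{-1}$, so $\hat{R}$ should reproduce the values above, and $D1 = \hat{R}\ \hat{Y}$ is exact.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
Y = np.array([[2.0, 2.0, 2.0], [0.5, 1.0, 1.5]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]
D1 = u1 @ w1 @ v1t
R = u1 @ w1

Yhat = Y @ (v1t.T @ v1t)       # Y J
S = Yhat @ v1t.T               # S = Yhat v1
Rhat = R @ np.linalg.inv(S)    # R-hat = R S^-1
```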

Maybe I should point out that we inserted $T\ T^{-1}$ inside D1 = RC because $\hat{X} = R\ T$; we insert the reverse, $S^{-1}\ S$, inside D1 = RC because $\hat{Y} = S\ C$.

## approximate products

There is no doubt in my mind that Malinowski suggested replacing $\hat{X}$ by X in the equation

$D1 = \hat{X}\ \hat{C}\$,

to get a different product which is approximately D or D1:

$\hat{D}_X = X\ \hat{C}$.

I showed you that replacement in the previous post.

There is no doubt in my mind that it is equally valid to replace $\hat{Y}$ by Y in the equation

$D1 = \hat{R}\ \hat{Y}\$,

to get a different product which is also approximately D or D1:

$\hat{D}_Y = \hat{R}\ Y$.

Then, we could compare the original data matrix D with the two partially reconstructed data matrices $\hat{D}_X$ and $\hat{D}_Y$. Each of $\hat{D}_X$ and $\hat{D}_Y$ combines data (X and Y, respectively) with model ($\hat{C}$ and $\hat{R}$, respectively).

I have to say that Malinowski does not describe the details of such a comparison between D and, say, $\hat{D}_X$.

The only doubt in my mind has nothing to do with my mathematics. What I did is valid and plausible; but is it what he was recommending? I have to assume that his T in equation 3.133 is not the same T as in equation 3.136. To be specific, where he writes (3.132 and 3.133)

$X_{basic}\ \hat{Y}_{load} = \hat{D}, \text{ with } \hat{Y}_{load} = T^{-1}\ C,$

I write

$X\ \hat{C} = \hat{D}_X, \text{ with } \hat{C} = T^{-1}\ C.$

All I did was change notation.

But where he writes (3.135 and 3.136)

$\hat{X}_{load}\ Y_{basic} = \hat{D}, \text{ with } \hat{X}_{load} = R\ T^{-1}$

(and with the same symbol $\hat{D}$ as before), I write

$\hat{R}\ Y = \hat{D}_Y, \text{ with } \hat{R} = R\ S^{-1}.$

Not only is S not equal to T, but $\hat{D}_X \ne \hat{D}_Y\$. They’re close – that’s the point of studying their deviations from D: they’re both close to both D1 and D.

We could certainly have defined $\hat{X}_{load}= R\ T^{-1}\$, but I can’t see how we get anything useful from it. I have to believe that his T in the second pair of equations is a generic symbol, and that we really do have the two distinct computed matrices T and S.

Here are the two $\hat{D}$ matrices:

$\hat{D}_X = \left(\begin{array}{lll} 1.9793 & 3.04912 & 3.96366 \\ 1.01876 & -0.138675 & -0.952347 \\ 3.97801 & 4.98942 & 5.97121 \\ 3.01747 & 1.80162 & 1.0552 \\ 5.97671 & 6.92971 & 7.97875\end{array}\right)$

$\hat{D}_Y = \left(\begin{array}{lll} 1.98034 & 2.99786 & 4.01538 \\ 0.946439 & -0.0652406 & -1.07692 \\ 3.95983 & 5.00107 & 6.04231 \\ 2.95983 & 2.00107 & 1.04231 \\ 5.98661 & 6.93369 & 7.88077\end{array}\right)$

These are perfectly reasonable things to look at. X is close to $\hat{X}$ and Y is close to $\hat{Y}\$; each of X, Y was chosen because it was close. Replacing $\hat{X}$ by X and $\hat{Y}\$ by Y is a reasonable way to study the goodness of the model.

We would compute and investigate the differences $D - \hat{D}_X$ and $D - \hat{D}_Y\$. Just to be clear, let me compute, instead, the difference between the two $\hat{D}$ matrices; they are not the same:

$\hat{D}_X - \hat{D}_Y = \left(\begin{array}{lll} -0.00104385 & 0.0512587 & -0.0517172 \\ 0.0723242 & -0.0734349 & 0.124573 \\ 0.0181767 & -0.0116541 & -0.0711045 \\ 0.0576412 & -0.199449 & 0.0128861 \\ -0.00989663 & -0.00397833 & 0.0979791\end{array}\right)$
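Here is a sketch of the two mixed products in numpy: $\hat{D}_X$ combines the data X with the model $\hat{C}$, and $\hat{D}_Y$ combines the model $\hat{R}$ with the data Y, and both stay close to D.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0]])
Y = np.array([[2.0, 2.0, 2.0], [0.5, 1.0, 1.5]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]
R, C = u1 @ w1, v1t

T = np.linalg.inv(w1) @ u1.T @ X
Chat = np.linalg.inv(T) @ C            # T^-1 C
S = (Y @ v1t.T @ v1t) @ v1t.T          # Yhat v1
Rhat = R @ np.linalg.inv(S)            # R S^-1

DX = X @ Chat    # data times model
DY = Rhat @ Y    # model times data
```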

## sometimes approximate products

Finally, and this is what I think he meant by “the crown jewels of factor analysis” on p. 24, we could compute the product XY. We could also compute the product $\hat{X}\ \hat{Y}$ and the product $\hat{R}\ \hat{C}$. I remind you that when he writes $\hat{X}\ \hat{Y}$ on p. 64, he means what I’ve called $\hat{X}\ \hat{C}$. He has no symbol for the projection $\hat{Y}$ of his data $Y_{basic}$.

What do we know about those products? Well, XY should be close to $\hat{X}\ \hat{Y}\$. But are they related to D or D1?

Well,

$\hat{X}\ \hat{Y} = R\ T\ S\ C$

so $\hat{X}\ \hat{Y}$ (and therefore XY) should be close to RC (hence close to D1 and D) if TS is the identity. By the same token,

$\hat{R}\ \hat{C} = R\ S^{-1}\ T^{-1}\ C = R\ {\left(TS\right)}^{-1}\ C$

so $\hat{R}\ \hat{C}\$ should be close to RC if ${\left(TS\right)}^{-1}$ is close to the identity; that’s equivalent to TS close to the identity.

It’s time to write down all those products and relationships. We have the equalities

$D1 = R\ C = \hat{X}\ \hat{C} = \hat{R}\ \hat{Y}\$;

and we have approximately

$\hat{X}\ \hat{C} \approx X\ \hat{C}$

$\hat{R}\ \hat{Y} \approx \hat{R}\ Y$

$D \approx D1$

so we could write

$D1 = R\ C = \hat{X}\ \hat{C} = \hat{R}\ \hat{Y}\ \approx X\ \hat{C} \approx \hat{R}\ Y\approx D\$.

The other 3 products are

$X\ Y \approx \hat{X}\ \hat{Y}$

and

$\hat{R}\ \hat{C}\$.

We have seen that all three of them will be close to D1 if TS is close to the identity. It’s all or nothing for the three musketeers.

Well, let’s just look at them. Here’s XY…

$X\ Y = \left(\begin{array}{lll} 2.5 & 3. & 3.5 \\ 3.5 & 3. & 2.5 \\ 6.5 & 7. & 7.5 \\ 7.5 & 7. & 6.5 \\ 10.5 & 11. & 11.5\end{array}\right)$

and for comparison, we recall D…

$D = \left(\begin{array}{lll} 1.9 & 3.2 & 3.9 \\ 1 & -0.2 & -1 \\ 4 & 4.9 & 6.1 \\ 3 & 1.9 & 1.1 \\ 6 & 6.9 & 7.9\end{array}\right)$

They are not close. Just what is TS?

$T\ S = \left(\begin{array}{ll} 1.53043 & -0.128969 \\ -3.89709 & 0.99047\end{array}\right)$

Not close to the identity. $\hat{R}\ \hat{C}$ is no closer to D:

$\hat{R}\ \hat{C} = \left(\begin{array}{lll} 2.47532 & 4.69783 & 6.50672 \\ -1.00435 & -3.25667 & -5.00617 \\ 3.49907 & 5.74515 & 7.62871 \\ 0.0789657 & -2.08566 & -3.70896 \\ 4.37431 & 6.38245 & 8.13578\end{array}\right)$

Nor is XY, which we displayed above. Note that all of XY, $\hat{X}\ \hat{Y}$, and $\hat{R}\ \hat{C}$ are quite different from each other, and quite different from D.
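The failure can be checked directly. One caveat: with numpy's sign conventions the off-diagonal entries of TS may come out with flipped signs relative to the matrix above, but the diagonal is unambiguous, and "far from the identity" is unambiguous either way.

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0]])
Y = np.array([[2.0, 2.0, 2.0], [0.5, 1.0, 1.5]])

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]

T = np.linalg.inv(w1) @ u1.T @ X   # from the X data
S = (Y @ v1t.T @ v1t) @ v1t.T      # from the Y data
TS = T @ S

XY = X @ Y
```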

## winding down

Please do not misunderstand me. $\hat{D}_X = X\ \hat{C}$ and $\hat{D}_Y = \hat{R}\ Y$ are always good things to look at. But XY, $\hat{X}\ \hat{Y}$, and $\hat{R}\ \hat{C}$ need not be.

Please do not misunderstand me: they could be. If we had just used a different Y matrix, namely

$Y0 = \left(\begin{array}{lll} 1 & 1 & 1 \\ 1 & 2 & 3\end{array}\right)$

instead of our

$Y = \left(\begin{array}{lll} 2 & 2 & 2 \\ 0.5 & 1. & 1.5\end{array}\right)$

then everything would have worked out fine. We would have found that TS was very nearly the identity, and that $D \approx X\ Y0$ (hence also $D \approx \hat{X}\ \hat{Y}$, and $D \approx \hat{R}\ \hat{C}$).
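That claim is easy to verify: redo the S computation with Y0 in place of Y, and TS comes out very nearly the identity. (The tolerance below is my own rough bound; the leftover noise keeps TS from being exactly I.)

```python
import numpy as np

D = np.array([[1.9, 3.2, 3.9], [1.0, -0.2, -1.0], [4.0, 4.9, 6.1],
              [3.0, 1.9, 1.1], [6.0, 6.9, 7.9]])
X = np.array([[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0], [5.0, 1.0]])
Y0 = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]])  # the unscaled Y

u, s, vt = np.linalg.svd(D, full_matrices=False)
u1, w1, v1t = u[:, :2], np.diag(s[:2]), vt[:2, :]

T = np.linalg.inv(w1) @ u1.T @ X
S0 = (Y0 @ v1t.T @ v1t) @ v1t.T   # S computed from Y0 instead of Y
TS0 = T @ S0
```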

But the proof of the pudding is the product TS, where T comes from the X data and S comes from the Y data; if they mesh, then they mesh. That is, if $TS \approx I$, then they mesh.

I have deliberately computed a case where they did not mesh. I deliberately messed up the Y matrix. I hope to make this failure more memorable than the possibility that they could have meshed.

The key is that any change of scale at all would have left me with a matrix Y which is still close to its projection $\hat{Y}$. If any vector y is close to the 2D subspace, so is any multiple of it. Such closeness by itself only guarantees that XY is close to $\hat{X}\ \hat{Y}$; it says nothing about TS being close to the identity, hence nothing about XY being close to D.

Now, if there were an underlying chemical or physical model such that D ~ XY, and if we chose among such X and Y data, then we might more than hope – we might expect – that things would mesh. The challenge is to get X and Y that go together. In fact, one would hope that simply choosing consistent units would make them go together.

Nevertheless, I’d rather you took a deep sigh of relief if they meshed rather than expect it.

Finally, I really did have X and Y that would have made everything work out peachy. Instead, I scaled the rows of Y, and although I could still get perfectly reasonable independent projections $\hat{X}$ and $\hat{Y}\$, they were just that: independent.

As we wind down, I think you should be asking: how did I know X and Y? Obviously they weren’t real data.

For that matter, how did I construct D0, so precisely of rank 2? That’s the question, and here’s the answer.

I started with a 5×2 (“column”) matrix

$\left(\begin{array}{ll} 1 & 1 \\ 2 & -1 \\ 3 & 1 \\ 4 & -1 \\ 5 & 1\end{array}\right)$

and a 2×3 (“row”) matrix

$\left(\begin{array}{lll} 1 & 1 & 1 \\ 1 & 2 & 3\end{array}\right)$

and I multiplied them together to get D0.

$D0 = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)$ = $\left(\begin{array}{ll} 1 & 1 \\ 2 & -1 \\ 3 & 1 \\ 4 & -1 \\ 5 & 1\end{array}\right)$ $\left(\begin{array}{lll} 1 & 1 & 1 \\ 1 & 2 & 3\end{array}\right)$
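In numpy the construction is one line, and the last assertion below records the bookkeeping: the Y I actually used is the row matrix with its rows scaled by 2 and 1/2.

```python
import numpy as np

col = np.array([[1, 1], [2, -1], [3, 1], [4, -1], [5, 1]])   # the 5x2 "column" matrix
row = np.array([[1, 1, 1], [1, 2, 3]])                       # the 2x3 "row" matrix

D0 = col @ row   # rank 2 by construction: a product of a 5x2 and a 2x3
```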

Then later on I took that column matrix for X and a scaling of that row matrix for Y.

Please do not think that we could have reconstructed X and Y. We invented them! In the real world, they are data to be searched out. We can talk about them here because this is a simulation. Remember example 6, and the fact that our X matrix was made up of independent data for o-xylene and p-xylene. In that case, we would need independent data for Y.

If they do not mesh, and if you cannot rescale things or find other data that does mesh, maybe you settle for studying

$D \approx \hat{D}_X = X\ \hat{C}$

and

$D \approx \hat{D}_Y = \hat{R}\ Y$.

Of course, if you have only X data, then you study only $D \approx \hat{D}_X = X\ \hat{C}\$, and if you have only Y data, then you study only $D \approx \hat{D}_Y = \hat{R}\ Y\$.

Next will be a “summary” of Malinowski, but in fact I expect that it will be more in the nature of “compare and contrast”. Although this example has introduced Y data and its projection $\hat{Y}\$, I hope it was also a good review and summary of what we’ve done with Malinowski.