PCA / FA Malinowski Summary

Malinowski’s work is considerably different from everything else we’ve seen before.

First of all, he expects that in most cases one will neither standardize nor even center the data X. We can do his computations as an SVD of X, or an eigendecomposition of X^{T}X or of XX^T – but because the data isn’t even centered, X^{T}X and XX^T are not remotely covariance matrices. For this reason, I assert that preprocessing is a separate issue.

Nevertheless, the underlying mathematics is the same: get either an SVD of X or an eigendecomposition of X^{T}X and/or of XX^T. But what do we do to X first? That’s a separate question.

Second, he is not primarily interested in eliminating “small” eigenvalues or singular values: he is interested in eliminating “experimental error”, i.e. “noise”. Although I worked thru his chapter 4 on estimating noise, I have not discussed it: his main interest is in deciding when x and \hat{x} are “close enough”, and without a real-world application, and without a more rigorous treatment, I’d rather pass for now. I’ll come back to his error stuff, however, if I ever find myself anywhere else looking at error estimation in PCA / FA.

In addition to omitting chapter 4, I stopped after chapter 5. What he has in common with Harman and Jolliffe is a lot of references to the literature. I didn’t see anything after chapter 5 that I could confirm the computations of.

Third, he doesn’t much use the usual vocabulary of scores and loadings, although he does use the subscript “load” for his \hat{X} in contrast to the subscript “basic” for his matrix X of successful test vectors x. (I sometimes think that he uses subscripts for emphasis rather than to distinguish entities, but I’m probably exaggerating.) In any case, I decided that the customary vocabulary was secondary to the mathematics: find the eigenvector matrix or matrices.

Fourth, he has no graphical techniques; he provides none of the graphs we came to expect in Harman, Jolliffe, and Davis. Such graphs do, in fact, have a place in chemistry; the Brereton “Chemometrics” – which I have not yet discussed – has a few. We will not do much with it: their internet data is available only to owners of the book, so I can compute to my heart’s content, but I can’t very well publish the data and you can’t very well follow along without it. But I will see if there’s anything I need to say about it.

Fifth, his notion of target testing (including using it to fill in missing values) is a whole new world, a brave new world, and I like it. I think I did it more simply than he did, but I was just cleaning up the math.

I do wonder. Can target testing be used in PCA “rotations”, i.e. change of basis? For handling multicollinearity in OLS? To tell us we should have centered the data? To tell us to subtract a constant? I don’t know yet. I’ll keep my eyes open.

I learned a lot from Malinowski about using the available tools. Davis taught me to use the SVD, Malinowski got me comfortable with using both the full and the cut-down SVDs. Not to mention using the u and v bases, and constructing the hat matrix.

OTOH, there is at least one thing we did along with Malinowski that the other authors did not do. They might have, but they did not: reconstitute the data matrix using the reduced set of eigenvalues or singular values.

The classical techniques generally just list the eigenvectors v1, possible weighted by the w0 (equivalently, by the square roots of the eigenvalues). But in principle, they could have computed D1. In practice, they usually got no closer than describing the new correlation matrix or variances.

When we start with the SVD

D = u\ w\ v^T

and replace the smallest singular value in w by 0 (and call the new matrix w0), we can reconstitute the data D as

D1 = u\ w_0\ v^T

(“Reconstitute” is intended to convey that impression that D1 is not the real data, not fresh orange juice. )

I want to show you more detail of the difference between D and D1. I’ve been a little vague about it.

Recall example 5 with noise. Here’s the data matrix:

D = \left(\begin{array}{lll} 1.9 & 3.2 & 3.9 \\ 1 & -0.2 & -1 \\ 4 & 4.9 & 6.1 \\ 3 & 1.9 & 1.1 \\ 6 & 6.9 & 7.9\end{array}\right)

The singular values were \{16.194,\ 2.41991,\ 0.238533\}

We interpret 3 nonzero singular values to mean that the matrix D is technically of rank 3; we know we can go further, however, and say that D differs from a matrix of rank 2 by its smallest singular value, namely .238533.

(I could go so far as to say the matrix D is of rank 2.238533, but perhaps it’s better to leave it at “differs from a matrix of rank 2 by .238533.” After all, if the smallest singular value were 10, the matrix would differ from a matrix of rank 2 by 10, and I don’t want to say that it’s of rank 12 = 2+10.)

We replaced the smallest singular value (0.238533) by 0, and reconstituted the data, calling it D1.

D1 = \left(\begin{array}{lll} 1.95859 & 3.05258 & 3.98415 \\ 0.992857 & -0.182028 & -1.01026 \\ 3.95401 & 5.0157 & 6.03396 \\ 3.02136 & 1.84625 & 1.13068 \\ 6.0016 & 6.89597 & 7.9023\end{array}\right)

What is the difference between D and D1? D1 is of rank 2, and the difference is supposed to be 0.238533.

Just how do we compute that difference? That’s the question.

The appropriate “norm” – appropriate because it gives this answer – is called the Frobenius norm, and it’s pretty simple: pretend the matrix is a vector, and compute its Euclidean norm (2-norm), namely the square root of the sum of squares.

Here we are. The element-by-element differences between D and D1 are:

e1 = \left(\begin{array}{lll} -0.0585929 & 0.147418 & -0.0841454 \\ 0.00714334 & -0.0179725 & 0.0102586 \\ 0.0459858 & -0.115699 & 0.0660403 \\ -0.0213625 & 0.0537476 & -0.0306787 \\ -0.00160247 & 0.00403179 & -0.00230131\end{array}\right)

Now take the square root of the sum of the squares of the “components”.

In case you’re working thru this with me, the squares of those numbers are

\left(\begin{array}{lll} 0.00343313 & 0.0217322 & 0.00708044 \\ 0.0000510273 & 0.000323011 & 0.000105238 \\ 0.0021147 & 0.0133863 & 0.00436132 \\ 0.000456356 & 0.00288881 & 0.000941184 \\ 2.56792E^{-6} & 0.0000162553 & 5.29604E^{-6} \end{array}\right)

and the sum of them is 0.0568979, and the square root of that is 0.238533.

That number should seem familiar: it’s exactly the singular value that we set to zero; it’s exactly what it ought to be.

By computing the difference e1, and then the square root of the sum of squares, I explicitly computed the difference D – D1 and confirmed that it was equal to the smallest singular value of D. In practice, of course, there is no need to compute the sum-of-squares, because we already know it.

I’ll close that by saying that if we set two singular values to zero, the difference between the original (D) and the reconstituted (D1) is the square root of the sum of squares of the two singular values. What’s going on is that the norm of D is the same as the norm of w, and the norm of D1 is the same as the norm of w0; so the difference between D and D1 is the difference between w and w0.

I think we’re done here.

Oh, let me be clear. I’m glad I own Malinowski (ah, the book). If you’re going to be doing PCA / FA in the physical sciences, you probably want it, too. If you’re going to be doing PCA / FA in chemistry, don’t even think about not buying it. Just my opinion.

PCA / FA Malinowski: Example 5. Simplified Target Testing

introduction

For a good reason which I have not yet discussed, Malinowski wants to find x vectors which are close to \hat{x} = H\ x vectors. (His x and \hat{x} are usually written y and \hat{y} for a least-squares fit). He finds a possible x and tests it to see if xhat is close. He recommended computing an intermediate t vector which was the \beta for his least-squares fit to x.

Since he seems to care about t only when \hat{x} is close to x, and since \hat{x} is incredibly easy to compute directly, I prefer to delay the computation of t. Find t after we’ve found a good x.

It will also turn out that he wants a collection of t vectors in order to pick a nicer basis than u or u1. And I’m not going to follow him there, because all of that is what practitioners call “non-orthogonal rotations”. (That strikes me as an oxymoron.) It’s what Harman spends most of his book doing, and that’s where I’ll look if I ever I want to. It’s important, but I’m not going to look at it this time around.

Anyway, we factored the data matrix

X = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)

as X = R C, with R = u1 w1 and C = v1^T\ .

In addition, I’m keeping the full SVD handy: X = u\ w\ v^T\ .

For partial reference, we had

R = \left(\begin{array}{ll} 5.33135 & 0.759381 \\ -0.175256 & -1.40331 \\ 8.76875 & 0.330094 \\ 3.26214 & -1.8326 \\ 12.2062 & -0.0991938\end{array}\right)

u = \left(\begin{array}{lllll} 0.327517 & 0.309419 & -0.813733 & 0.257097 & -0.262167   \\ -0.0107664 & -0.571797 & -0.464991 & -0.668451 &   0.0994427 \\ 0.538684 & 0.134501 & 0. & 0. & 0.831703 \\ 0.200401 & -0.746715 & 0. & 0.634172 & -0.00904025 \\ 0.749851 & -0.0404178 & 0.348743 & -0.291376 & -0.479133\end{array}\right)

u_1 = \left(\begin{array}{ll} 0.327517 & 0.309419 \\ -0.0107664 & -0.571797 \\ 0.538684 & 0.134501 \\ 0.200401 & -0.746715 \\ 0.749851 & -0.0404178\end{array}\right)

(u1 is the first two columns of u.)

We computed the hat matrix H directly as

H = R\ \left(R^T\ R \right)^{-1}\ R^T\ ,

and we saw a few examples of computing \hat{x} = H\ x\ . In particular, we applied H to a vector of all 1s and got

\hat{x} = \left(\begin{array}{l} 0.308271 \\ 0.503759 \\ 0.849624 \\ 1.04511 \\ 1.39098\end{array}\right)

=

\left(\begin{array}{lllll} 0.203008 & -0.180451 & 0.218045 & -0.165414 & 0.233083   \\ -0.180451 & 0.327068 & -0.0827068 & 0.424812 & 0.0150376   \\ 0.218045 & -0.0827068 & 0.308271 & 0.0075188 & 0.398496   \\ -0.165414 & 0.424812 & 0.0075188 & 0.597744 & 0.180451   \\ 0.233083 & 0.0150376 & 0.398496 & 0.180451 & 0.56391\end{array}\right) \left(\begin{array}{l} 1 \\ 1 \\ 1 \\ 1 \\ 1\end{array}\right)

I closed by saying that I could simplify the computation, clarify the concept, and even dispense with H if I chose.

Let’s see all that.

computational simplification

We saw that R^T\ R was quite simple (diagonal, just the square of w1) when we computed it. First, let’s confirm the simplicity of R^T\ R. Since

R = u_1\ w_1\ ,

R^T\ R = \left(w_1\ {u_1}^T\right)\ \left(u_1 w_1\right) = {w_1}^2\ ,

because {u_1}^T\ u_1 = I (u1 is orthonormal although not orthogonal).

Let’s take a look at the H matrix. We have

{\left(R^T\ R\ \right)}^{-1} = w_1^{-2} because w1 is diagonal.

Now we can expand the definition of H:

H = R\ \left(R^T\ R\right)^{-1}\ R^T\

= \left(u_1\ w_1 \right) w_1^{-2}\ \left(w_1\ {u_1}^T \right) = u_1\ {u_1}^T\ .

In other words, if we want the matrix H, we just compute u_1 {u_1}^T\ . That’s a lot easier than using the normal equations. (I should probably say that for regression, the hat matrix is defined as H = X\ \left(X^T\ X \right)^{-1}\ X^T\ ; it simplified to u_1 {u_1}^T\ here because R was special.)

conceptual simplification

Fine, we have simplified the computation of H. This is about the time that I decided to find the matrix of H wrt the new basis u (not u1). I will confess that the hat matrix is a new friend of mine, not an old one; it took me a while to recognize him. The following computation helped a lot.

To transform H to the u basis, we would compute

P = u^{-1}\ H\ u

i.e.

P = u^T\ H\ u

because u, unlike u1, is orthogonal. We get

P = \left(\begin{array}{lllll} 1. & 0 & 0 & 0 & 0 \\ 0 & 1. & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0\end{array}\right)

OMG. Now that is a projection operator, in all its glory. It’s the identity on a 2D subspace, and the zero operator elsewhere.

It projects any vector onto the 2D subspace spanned by the data. All components other than the first two are mapped to zero. (If w1 were 3×3, we would have three 1’s, and be projecting onto a 3D subspace, and so on.)

It turns out that the hat matrix H is always a projection operator; it doesn’t require the special form H = u_1\ {u_1}^T\ .

Incidentally, there is a very simple test for whether a matrix A represents a projection operator: a matrix is idempotent, A^2 = A\ , if and only if the matrix represents a projection operator. And if A is idempotent then so is any similar matrix; i.e. if the linear operator is a projection, then all of its matrix representations are idempotent. It’s easy to show that: if B is similar to an idempotent matrix A,

B = Q^{-1}\ A\ Q

then B is idempotent too:

B^2 = \left(Q^{-1}\ A\ Q\ \right) \left(Q^{-1}\ A\ Q \right) = Q^{-1}\ A^2\ Q = Q^{-1}\ A\ Q = B\ .

We can show that the hat matrix is always a projection operator.

I personally find it far more revealing, for this application, that \hat{x} is the projection of x into the 2D subspace than that it is the result of a least-squares fit. It is both, of course; I just prefer seeing H as a projection operator. That speaks to me.

So, although I computed H = R\ \left(R^T\ R\right)^{-1}\ R^T\ , I could (and should!) have computed

H = u_1\ {u_1}^T\ .

or, we could have computed H from that nice simple projection operator P:

H = u\ P\ u^T\ .

Yes, I found P from H, but now that we know that H has that form wrt the u basis, we could define P and compute H from it. Nevertheless, although the projection P is visually stunning, it’s not an easier way to compute H. I would still compute H from u1.

But only if I had some use for H.

H is convenient, not essential.

Tell me again: what is xhat? It’s the projection of x into the 2D subspace spanned by u1.

So what would happen if we took x, found its components wrt the u basis, and zeroed out the components outside the subspace?

We would have the new components of xhat.

We would also have seen explicitly how far from the subspace x was. Then we could get the old components of xhat. (This works because u1 is a basis for the 2D subspace, and u1 is a subset of u.)

For an example, let’s take his all-1s vector. Get its new components by applying the inverse transition matrix u^{-1} = u^T\ , and we get

\left(\begin{array}{lllll} 0.327517 & -0.0107664 & 0.538684 & 0.200401 & 0.749851   \\ 0.309419 & -0.571797 & 0.134501 & -0.746715 & -0.0404178   \\ -0.813733 & -0.464991 & 0. & 0. & 0.348743 \\ 0.257097 & -0.668451 & 0. & 0.634172 & -0.291376 \\ -0.262167 & 0.0994427 & 0.831703 & -0.00904025 &   -0.479133\end{array}\right) \left(\begin{array}{l} 1 \\ 1 \\ 1 \\ 1 \\ 1\end{array}\right)

=

\left(\begin{array}{l} 1.80569 \\ -0.91501 \\ -0.929981 \\ -0.0685591 \\ 0.180805\end{array}\right)

These are the new components of x; we see that the 4th component is very small, the 5th is small, but the 3rd component is as large as the second. This vector isn’t close to being in the 2D subspace. Now zero out all but the first two components and we have the new components of xhat

s2 = \left(\begin{array}{l} 1.80569 \\ -0.91501 \\ 0 \\ 0 \\ 0\end{array}\right)

– but we want the old components of xhat. Fine, the old components of xhat are found by applying the transition matrix u …

\left(\begin{array}{lllll} 0.327517 & 0.309419 & -0.813733 & 0.257097 & -0.262167   \\ -0.0107664 & -0.571797 & -0.464991 & -0.668451 &   0.0994427 \\ 0.538684 & 0.134501 & 0. & 0. & 0.831703 \\ 0.200401 & -0.746715 & 0. & 0.634172 & -0.00904025 \\ 0.749851 & -0.0404178 & 0.348743 & -0.291376 & -0.479133\end{array}\right)

to s2, and we get exactly what we computed in the previous post by applying H to x:

\hat{x} = \left(\begin{array}{l} 0.308271 \\ 0.503759 \\ 0.849624 \\ 1.04511 \\ 1.39098\end{array}\right)

Because we understand that xhat is the projection of x onto the 2D subspace, and because u was designed to include a subbasis for that 2D subspace, we can compute xhat without using H.

As it happens, I find it easier to compute and use H for finding xhat, but I also really like knowing that I can get additional insight by looking at the new components of x, especially the ones outside the 2D subspace.

In summary then, if I must compare x with \hat{x}\ , I would compute H = u_1\ {u_1}^T\ \text{and } \hat{x} = H\ x\ . In the case that x and xhat are not close, that’s when I would use u^T to look at the new components of x, to see the components outside the subspace.

PCA / FA malinowski: example 5. target testing

Recall that we computed the SVD X = u\ w\ v^T\ of this matrix:

X = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)

and we found that the w matrix was

w = \left(\begin{array}{lll} 16.2781 & 0. & 0. \\ 0. & 2.45421 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

Because w has only two nonzero entries, we know that X is of rank 2. Its three columns only span a 2D space.

Given a column of data x (a variable, in this example, of length 5), Malinowski wants to know if it is in that 2D space. As he puts it, “if the suspected test vector [x] is a real factor, then the regeneration \hat{x} = R\ t will be successful.” He gives us a formula for computing t; by a successful regeneration, he means that \hat{x}\ is close to x.

He says that this is how it is done in practice: given a “suspected test vector” x, compute t …, compute R t, and compare R t to x.

We will see that he has some use for t apart from this testing of x. Otherwise, if he doesn’t need t, what he’s doing is unnecessarily complicated. I want to focus on his “regeneration” computation. He says that

t = {\left(R^T\ R\right)}^{-1}\ R^T\ x\ .

You might recognize that as the normal equations

\beta = {\left(X^T\ X\right)}^{-1}\ X^T\ y\

of ordinary least squares regression. (You might look at my “math OLS” category.) He counts on this, and says (p. 57) that “A least-squares procedure is employed to generate a transformation vector [t] that will yield a predicted vector… most closely matching [x]….”

(I would argue that’s a little misleading. The predicted vector is restricted to lie in the 2D subspace.)

Instead of

y = X\ \beta + e = \hat{y} + e

his implicit regression model is

x = R\ t + e = \hat{x} + e\ .

The analog of \hat{y} = X\ \beta is \hat{x} = R\ t\ : his t’s are the \beta\ ’s, his xhat is the yhat. In the customary notation, he’s computing yhat and comparing it to a “suspected test vector” y.

In his notation, he wants to compute R t and compare it to x.

Fine, let’s do it. We know enough to label his R t as
\hat{x}\ . We have

\hat{x} = R\ t = R\ {\left(R^T\ R\right)}^{-1}\ R^T\ x =: H x\ ,

so let’s find the matrix

H = R\ {\left(R^T\ R\right)}^{-1}\ R^T\ .

I have called it H because it has a name, the hat matrix. It is ubiquitous in regression diagnostics. Note that we are working with the R matrix, not with the full X matrix. Boy, does that matter!

we have R ( = u w)…

R = \left(\begin{array}{ll} 5.33135 & 0.759381 \\ -0.175256 & -1.40331 \\ 8.76875 & 0.330094 \\ 3.26214 & -1.8326 \\ 12.2062 & -0.0991938\end{array}\right)

and

R^T\ R = \left(\begin{array}{ll} 264.977 & 0 \\ 0 & 6.02317\end{array}\right)

and its inverse…

{\left(R^T\ R\right)}^{-1} = \left(\begin{array}{ll} 0.00377391 & 0. \\ 0. & 0.166026\end{array}\right)

We really want to observe that R^T\ R\ is the squared nonzero singular values, w1^2:

w1 = \left(\begin{array}{ll} 16.2781 & 0. \\ 0. & 2.45421\end{array}\right)

{w_1}^2 = \left(\begin{array}{ll} 264.977 & 0. \\ 0. & 6.02317\end{array}\right)

(Believe me, we will pursue that later.)

We premultiply by R and post-multiply by R^T\ , getting

H = \left(\begin{array}{lllll} 0.203008 & -0.180451 & 0.218045 & -0.165414 & 0.233083   \\ -0.180451 & 0.327068 & -0.0827068 & 0.424812 & 0.0150376   \\ 0.218045 & -0.0827068 & 0.308271 & 0.0075188 & 0.398496   \\ -0.165414 & 0.424812 & 0.0075188 & 0.597744 & 0.180451   \\ 0.233083 & 0.0150376 & 0.398496 & 0.180451 & 0.56391\end{array}\right)

Suppose we now try this on one of the columns of the X matrix. Here’s the first column:

x_1 = \left(\begin{array}{l} 2 \\ 1 \\ 4 \\ 3 \\ 6\end{array}\right)

And here’s H applied to it:

H\ x_1 = \left(\begin{array}{l} 2. \\ 1. \\ 4. \\ 3. \\ 6.\end{array}\right)

He would say, we regenerated the input. I would say, our regression has been perfect. (My suspect was known to lie in the 2D space. because it was a column of X.) Let’s try his favorite suspect in general: all 1’s. Here’s our test vector…

x_2 = \left(\begin{array}{l} 1 \\ 1 \\ 1 \\ 1 \\ 1\end{array}\right)

and here’s H applied to it:

H\ x_2 = \left(\begin{array}{l} 0.308271 \\ 0.503759 \\ 0.849624 \\ 1.04511 \\ 1.39098\end{array}\right)

Now that’s quite a different outcome. This \hat{x}\ is not close to x = x2. We did not regenerate the input. This is not a good regression. That vector x2 is not in the 2D subspace, not by a long shot.

His next favorite test is the basis vectors. We could apply H to each of the basis vectors and see what we get. But we know that the images under H of the basis vectors are the columns of H, so what we would get is the columns of H. Let’s do some serious rounding and just see if any of the columns of H look like the appropriate basis vectors. (That is, for example, does the second column look like the second basis vector, i.e. 1 on the diagonal and zero elsewhere?)

H \approx\ \left(\begin{array}{lllll} 0 & 0 & 0 & 0 & 0 \\ 0 & 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0 & 0.5 \\ 0 & 0.5 & 0 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0 & 0.5\end{array}\right)

I would say not: never mind the diagonals, the columns have either zero or two large entries, instead of one large entry.

In summary, we can compute \hat{x}\ . But what’s going on? As we see this more clearly, we will discover three things.

  1. we can make a significant computational simplification.
  2. we can make a significant conceptual breakthrough.
  3. we can dispense with the hat matrix H completely.

But all of that depends on having u and u_1\ , that is, all of it depends on using the SVD in both forms:

X = u\ w\ v^T = u_1\ w_1\ {v_1}^T

We must know that R = u_1\ w_1\ , not just that R = X\ v_1\ .

PCA / FA Malinowski: example 5.

(June 10: i have made 4 edits, all cosmetic. you may search on “edit:”)

Malinowski (edit: “Factor Analysis in Chemistry”, 3rd ed.) does a lot of things differently from what we’ve seen. Fortunately, his model is simple enough, although his notation is… different. His model is

X = R C,

and he calls R and C the row and column matrices respectively. He wants X to have more rows than columns, so he transposes if necessary; then he chooses C to have more columns than rows, and R will have more rows than columns. For starters, then, his X matrix looks like the usual design matrix for regression. (Incidentally, he didn’t call it X.)

He chooses C = {v_1}^T, from the cut-down SVD. That is, I write the SVD of X as

X = u\ w\ v^T\ ,

where u and v are orthogonal and w is the same shape as X. But we know from the derivation and our experience with Davis that we may also write

X = u_1\ w_1\ {v_1}^T\ ,

where w_1 is square, diagonal, and invertible (it is a cut-down w), and u_1 and v_1 are the submatrices of u and v which are conformable with w_1. (We’ll see all this shortly.) We have dropped the parts of u, w, and v which are not required for reproducing X. (I remind you that what we’ve lost is the orthogonality of the matrices u and v.)

Malinowski’s model is

X = \left(u_1\ w_1\right)\ {v_1}^T\ ,

i.e. he has chosen

C = {v_1}^T

R = u_1\ w_1\ .

In particular, C is (part of) an eigenvector matrix. I would argue that whether we use those choices or

C = {v}^T

R = u\ w

is purely a matter of convenience, of no real significance.

Have we seen this before? Yes. This is effectively Harman’s model. That was

Z = A F

with

eigenvector matrix A and Z = X^T\ , i.e.

X = F^T\ A^T\ .

In addition, when we looked at Harman, we found F by computing

F^T = X\ A^{-T}\ .

Malinowski does the same thing. Having chosen C = v^T\ , he computes

R = X\ v_1\ .

That’s ok. We know that u_1 and v_1 are not orthogonal, but they are orthonormal matrices:

{u_1}^T\ u_1 = I = {v_1}^T\ v_1\ .

Then from

X = R\ {v_1}^T

we post-multiply by v1 and get

 X\  v_1 = R\ {v_1}^T\ v_1 = R\ .

I should probably recall that v could be found as an eigenvector matrix of X^T\ X\ , and v_1 is the subset of eigenvectors with nonzero eigenvalues. What’s important about this is that we could have gotten R and C without ever knowing u or u_1\ .

He knows about the SVD, but he seems not to really use it. I say that because i’m going to get a lot of mileage from u and u_1\ . I should probably also recall that (edit: add R to the following)

R = X\ v = u\ w\ = A^Q

(see Davis) is the new components of the data wrt the v basis. So we understand R.

He prefers the term “factor analysis” to “principal component analysis” because “component” has special meaning in chemistry; otherwise, he considers the terms to be synonyms, along with “singular value decomposition”. He also prefers “row matrix” and “column matrix” to “scores” and “factors”. He says, in fact (p.29), “… if we focus attention on R, we would call R the score matrix and C the loading matrix. If we focus attention on C, we would call C the score matrix and R the loading matrix.” Maybe that’s when I stopped caring about the usual terminology in PCA / FA.

Far more important than his breaking free of the conventional names is his determination that in chemistry, the data should not be centered. Do take logarithms, for example, if it seems you should, but do not subtract out the means of the variables. “By using covariance or correlation about the mean, we lose information concerning the zero point of the experimental scale.” This is what broke me free of covariance and correlation matrices.

His entire chapter 5 is devoted to one manufactured example. I do not wish to reproduce an entire chapter, so I am going to build a similar but different example.

Here is my data: 3 variables, 5 observations each.

X = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)

We get the SVD of X, i.e. X = u\ w\ v^T\ without any preprocessing at all. We look at w to see the rank of X. Both w and X are of rank 2 because there are 2 nonzero principal values.

w = \left(\begin{array}{lll} 16.2781 & 0. & 0. \\ 0. & 2.45421 & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0. \\ 0. & 0. & 0.\end{array}\right)

In this case, for example, with only two nonzero singular values in w, if drop three rows and one column of w, we get w1 an invertible diagonal 2×2:

w_1 = \left(\begin{array}{ll} 16.2781 & 0. \\ 0. & 2.45421\end{array}\right)

In order to have X = u_1\ w_1\ {v_1}^T\ , we drop three columns of u and one column of v. (We are partially reversing the derivation of the SVD. In that, we had v, split it into v1 and v2, then we cut down w to w1 and computed u1; then we extended u1 to an orthonormal basis u.)

Let me be explicit. Here are u and u1:

u = \left(\begin{array}{lllll} 0.327517 & 0.309419 & -0.813733 & 0.257097 &   -0.262167 \\ -0.0107664 & -0.571797 & -0.464991 & -0.668451 &   0.0994427 \\ 0.538684 & 0.134501 & 0. & 0. & 0.831703 \\ 0.200401 & -0.746715 & 0. & 0.634172 & -0.00904025   \\ 0.749851 & -0.0404178 & 0.348743 & -0.291376 &   -0.479133\end{array}\right)

u_1 = \left(\begin{array}{ll} 0.327517 & 0.309419 \\ -0.0107664 & -0.571797 \\ 0.538684 & 0.134501 \\ 0.200401 & -0.746715 \\ 0.749851 & -0.0404178\end{array}\right)

u1 is the first two columns of u. Here are v and v1:

v = \left(\begin{array}{lll} 0.485272 & -0.773204 & 0.408248 \\ 0.5729 & -0.0715479 & -0.816497 \\ 0.660528 & 0.630108 & 0.408248\end{array}\right)

v_1 = \left(\begin{array}{ll} 0.485272 & -0.773204 \\ 0.5729 & -0.0715479 \\ 0.660528 & 0.630108\end{array}\right)

v1 is the first two columns of v. You can confirm that we really do have

X = u_1\ w_1\ {v_1}^T\ .

For Malinowski’s model, we choose C = {v_1}^T\ and R = u_1\ w_1\ . We get

R = \left(\begin{array}{ll} 5.33135 & 0.759381 \\ -0.175256 & -1.40331 \\ 8.76875 & 0.330094 \\ 3.26214 & -1.8326 \\ 12.2062 & -0.0991938\end{array}\right)

C = \left(\begin{array}{lll} 0.485272 & 0.5729 & 0.660528 \\ -0.773204 & -0.0715479 & 0.630108\end{array}\right)

If we had used the full SVD, we would have had

R = \left(\begin{array}{lll} 5.33135 & 0.759381 & 0. \\ -0.175256 & -1.40331 & 0. \\ 8.76875 & 0.330094 & 0. \\ 3.26214 & -1.8326 & 0. \\ 12.2062 & -0.0991938 & 0.\end{array}\right)

which simply adds a column of zeroes to Malinowski’s R. I no longer care about that as much as I did when I started Davis. Either form of R is explicitly of rank 2.

I should point out that Malinowski does not describe his decomposition as an SVD. But this is his model. He also says that from

X = PQ

and

X = RC

where P and Q formed a purely hypothetical decomposition of X, we may conclude R = P and C = Q.

No, not quite. From X = RC we may conclude that we have a decomposition, but the decomposition of X into a product of matrices is not unique, and he knows it. (Which is important. Read him carefully, just as you must read me carefully. He knows what he’s doing, and is well worth reading, though his language is sometimes imprecise.) After asserting R = P and C = Q on p. 31, he shows us on p. 33 that the decomposition is not unique. We can even write a recipe for creating an infinite number of decompositions, from a given one. Take any change-of-basis transition matrix (edit: replace P, which was already in use, by T) T, insert I = T\ T^{-1} into X = R C, and get

X = \left(R\ T\right)\ \left(T^{-1}\ C\right)\ .

That is, we have a new R and a new C. edit: Even if the original R and C had been equal to P and Q, the new ones are not.

Next, he moves on to something rather fascinating, which he calls target testing.