PCA / FA Malinowski Summary

Malinowski’s work is considerably different from everything else we’ve seen before.

First of all, he expects that in most cases one will neither standardize nor even center the data X. We can do his computations as an SVD of X, or an eigendecomposition of X^{T}X or of XX^T – but because the data isn’t even centered, X^{T}X and XX^T are not remotely covariance matrices. For this reason, I assert that preprocessing is a separate issue.

Nevertheless, the underlying mathematics is the same: get either an SVD of X or an eigendecomposition of X^{T}X and/or of XX^T. But what do we do to X first? That’s a separate question.
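
Here is a minimal NumPy sketch of that equivalence, run on the uncentered data matrix of example 5 (it appears later in this post). The variable names are mine, not Malinowski's, and NumPy is only an assumption about tooling. It checks that the singular values of X are the square roots of the eigenvalues of X^{T}X, with no centering or scaling anywhere.

```python
import numpy as np

# Uncentered data matrix (example 5 with noise, from later in this post).
X = np.array([
    [1.9,  3.2,  3.9],
    [1.0, -0.2, -1.0],
    [4.0,  4.9,  6.1],
    [3.0,  1.9,  1.1],
    [6.0,  6.9,  7.9],
])

# Route 1: singular values of X itself, largest first.
s = np.linalg.svd(X, compute_uv=False)

# Route 2: eigenvalues of the symmetric matrix X^T X.
# eigvalsh returns them in ascending order, so reverse to match s.
evals = np.linalg.eigvalsh(X.T @ X)[::-1]

print(s)               # singular values of X
print(np.sqrt(evals))  # square roots of the eigenvalues of X^T X
```

Either route gives the same numbers; what differs between authors is only what was done to X beforehand.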

Second, he is not primarily interested in eliminating “small” eigenvalues or singular values: he is interested in eliminating “experimental error”, i.e. “noise”. Although I worked thru his chapter 4 on estimating noise, I have not discussed it: his main interest is in deciding when x and \hat{x} are “close enough”, and without a real-world application, and without a more rigorous treatment, I’d rather pass for now. I’ll come back to his error stuff, however, if I ever find myself anywhere else looking at error estimation in PCA / FA.

In addition to omitting chapter 4, I stopped after chapter 5. What he has in common with Harman and Jolliffe is a lot of references to the literature. I didn’t see anything after chapter 5 that I could confirm the computations of.

Third, he doesn’t much use the usual vocabulary of scores and loadings, although he does use the subscript “load” for his \hat{X} in contrast to the subscript “basic” for his matrix X of successful test vectors x. (I sometimes think that he uses subscripts for emphasis rather than to distinguish entities, but I’m probably exaggerating.) In any case, I decided that the customary vocabulary was secondary to the mathematics: find the eigenvector matrix or matrices.

Fourth, he has no graphical techniques; he provides none of the graphs we came to expect in Harman, Jolliffe, and Davis. Such graphs do, in fact, have a place in chemistry; the Brereton “Chemometrics” – which I have not yet discussed – has a few. We will not do much with it: their internet data is available only to owners of the book, so I can compute to my heart’s content, but I can’t very well publish the data and you can’t very well follow along without it. But I will see if there’s anything I need to say about it.

Fifth, his notion of target testing (including using it to fill in missing values) is a whole new world, a brave new world, and I like it. I think I did it more simply than he did, but I was just cleaning up the math.

I do wonder. Can target testing be used in PCA “rotations”, i.e. change of basis? For handling multicollinearity in OLS? To tell us we should have centered the data? To tell us to subtract a constant? I don’t know yet. I’ll keep my eyes open.

I learned a lot from Malinowski about using the available tools. Davis taught me to use the SVD, Malinowski got me comfortable with using both the full and the cut-down SVDs. Not to mention using the u and v bases, and constructing the hat matrix.
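
As a rough illustration of the two forms, and assuming that "cut-down" corresponds to what NumPy calls the reduced SVD (`full_matrices=False`), here is a sketch on the data matrix of example 5. The names `u_full`, `u_cut`, and `w` are illustrative only.

```python
import numpy as np

# Data matrix from example 5 (with noise), as given later in this post.
D = np.array([
    [1.9,  3.2,  3.9],
    [1.0, -0.2, -1.0],
    [4.0,  4.9,  6.1],
    [3.0,  1.9,  1.1],
    [6.0,  6.9,  7.9],
])

# Full SVD: u is 5x5, v^T is 3x3, and w must be padded out to 5x3.
u_full, s, vt = np.linalg.svd(D, full_matrices=True)
w = np.zeros(D.shape)
w[:len(s), :len(s)] = np.diag(s)
D_from_full = u_full @ w @ vt

# Reduced ("cut-down") SVD: u is 5x3, and w is just the 3x3 diagonal matrix.
u_cut, s_cut, vt_cut = np.linalg.svd(D, full_matrices=False)
D_from_cut = u_cut @ np.diag(s_cut) @ vt_cut

print(u_full.shape, u_cut.shape)  # shapes of the two u matrices
print(np.allclose(D_from_full, D), np.allclose(D_from_cut, D))
```

Both forms rebuild D; the full form simply carries extra columns of u that get multiplied by the zero rows of w.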

OTOH, there is at least one thing we did along with Malinowski that the other authors did not do. They might have, but they did not: reconstitute the data matrix using the reduced set of eigenvalues or singular values.

The classical techniques generally just list the eigenvectors v1, possibly weighted by the w0 (equivalently, by the square roots of the eigenvalues). But in principle, they could have computed D1. In practice, they usually got no closer than describing the new correlation matrix or variances.

When we start with the SVD

D = u\ w\ v^T

and replace the smallest singular value in w by 0 (and call the new matrix w0), we can reconstitute the data D as

D1 = u\ w_0\ v^T

(“Reconstitute” is intended to convey the impression that D1 is not the real data, not fresh orange juice.)

I want to show you more detail of the difference between D and D1. I’ve been a little vague about it.

Recall example 5 with noise. Here’s the data matrix:

D = \left(\begin{array}{lll} 1.9 & 3.2 & 3.9 \\ 1 & -0.2 & -1 \\ 4 & 4.9 & 6.1 \\ 3 & 1.9 & 1.1 \\ 6 & 6.9 & 7.9\end{array}\right)

The singular values were \{16.194,\ 2.41991,\ 0.238533\}

We interpret 3 nonzero singular values to mean that the matrix D is technically of rank 3; we know we can go further, however, and say that D differs from a matrix of rank 2 by its smallest singular value, namely .238533.

(I could go so far as to say the matrix D is of rank 2.238533, but perhaps it’s better to leave it at “differs from a matrix of rank 2 by .238533.” After all, if the smallest singular value were 10, the matrix would differ from a matrix of rank 2 by 10, and I don’t want to say that it’s of rank 12 = 2+10.)

We replaced the smallest singular value (0.238533) by 0, and reconstituted the data, calling it D1.

D1 = \left(\begin{array}{lll} 1.95859 & 3.05258 & 3.98415 \\ 0.992857 & -0.182028 & -1.01026 \\ 3.95401 & 5.0157 & 6.03396 \\ 3.02136 & 1.84625 & 1.13068 \\ 6.0016 & 6.89597 & 7.9023\end{array}\right)
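
For anyone following along numerically, this is a minimal sketch of that reconstitution in NumPy (my choice of tool for illustration; `s0` plays the role of w0). The explicit tolerance passed to `matrix_rank` is an assumption, chosen to ignore floating-point dust.

```python
import numpy as np

# Data matrix D from example 5 (with noise).
D = np.array([
    [1.9,  3.2,  3.9],
    [1.0, -0.2, -1.0],
    [4.0,  4.9,  6.1],
    [3.0,  1.9,  1.1],
    [6.0,  6.9,  7.9],
])

# Reduced SVD: D = u @ diag(s) @ vt, singular values sorted largest first.
u, s, vt = np.linalg.svd(D, full_matrices=False)

# Replace the smallest singular value by 0 (this is w0 in the text).
s0 = s.copy()
s0[-1] = 0.0

# Reconstitute the data.
D1 = u @ np.diag(s0) @ vt

print(np.round(D1, 5))
# Rank of D1, treating anything below 1e-10 as zero.
print(np.linalg.matrix_rank(D1, tol=1e-10))
```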

What is the difference between D and D1? D1 is of rank 2, and the difference is supposed to be 0.238533.

Just how do we compute that difference? That’s the question.

The appropriate “norm” – appropriate because it gives this answer – is called the Frobenius norm, and it’s pretty simple: pretend the matrix is a vector, and compute its Euclidean norm (2-norm), namely the square root of the sum of squares.

Here we are. The element-by-element differences between D and D1 are:

e1 = \left(\begin{array}{lll} -0.0585929 & 0.147418 & -0.0841454 \\ 0.00714334 & -0.0179725 & 0.0102586 \\ 0.0459858 & -0.115699 & 0.0660403 \\ -0.0213625 & 0.0537476 & -0.0306787 \\ -0.00160247 & 0.00403179 & -0.00230131\end{array}\right)

Now take the square root of the sum of the squares of the “components”.

In case you’re working thru this with me, the squares of those numbers are

\left(\begin{array}{lll} 0.00343313 & 0.0217322 & 0.00708044 \\ 0.0000510273 & 0.000323011 & 0.000105238 \\ 0.0021147 & 0.0133863 & 0.00436132 \\ 0.000456356 & 0.00288881 & 0.000941184 \\ 2.56792 \times 10^{-6} & 0.0000162553 & 5.29604 \times 10^{-6} \end{array}\right)

and the sum of them is 0.0568979, and the square root of that is 0.238533.

That number should seem familiar: it’s exactly the singular value that we set to zero; it’s exactly what it ought to be.
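
The same check can be sketched in a few lines of NumPy (again an assumed tool, with my own variable names): compute the Frobenius norm of D – D1 by hand and with the built-in, and compare both with the singular value that was dropped.

```python
import numpy as np

# Data matrix D from example 5 (with noise).
D = np.array([
    [1.9,  3.2,  3.9],
    [1.0, -0.2, -1.0],
    [4.0,  4.9,  6.1],
    [3.0,  1.9,  1.1],
    [6.0,  6.9,  7.9],
])

u, s, vt = np.linalg.svd(D, full_matrices=False)

# Zero the smallest singular value and reconstitute.
s0 = s.copy()
s0[-1] = 0.0
D1 = u @ np.diag(s0) @ vt

# Element-by-element differences.
e1 = D - D1

# Frobenius norm "by hand": treat the matrix as one long vector.
by_hand = np.sqrt(np.sum(e1 ** 2))

# The same thing using NumPy's built-in Frobenius norm.
built_in = np.linalg.norm(e1, 'fro')

print(by_hand, built_in, s[-1])  # all three should agree
```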

By computing the difference e1 = D – D1, and then the square root of its sum of squares, I explicitly computed the Frobenius norm of D – D1 and confirmed that it was equal to the smallest singular value of D. In practice, of course, there is no need to compute the sum of squares, because we already know the answer.

I’ll close that by saying that if we set two singular values to zero, the difference between the original (D) and the reconstituted (D1) is the square root of the sum of squares of the two singular values. What’s going on is that D – D1 = u (w – w0) v^T, and multiplying by the orthogonal matrices u and v^T does not change the Frobenius norm (just as the norm of D is the same as the norm of w, and the norm of D1 is the same as the norm of w0); so the norm of the difference between D and D1 is the norm of the difference between w and w0.
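
A quick numerical sketch of the two-singular-value case, under the same assumptions as before (NumPy, the example 5 data, illustrative names; `D2` is my own label for the matrix rebuilt from the largest singular value alone):

```python
import numpy as np

# Data matrix D from example 5 (with noise).
D = np.array([
    [1.9,  3.2,  3.9],
    [1.0, -0.2, -1.0],
    [4.0,  4.9,  6.1],
    [3.0,  1.9,  1.1],
    [6.0,  6.9,  7.9],
])

u, s, vt = np.linalg.svd(D, full_matrices=False)

# Keep only the largest singular value; zero the other two.
s0 = s.copy()
s0[1:] = 0.0
D2 = u @ np.diag(s0) @ vt

# Frobenius norm of the difference between original and reconstituted data.
err = np.linalg.norm(D - D2, 'fro')

# Square root of the sum of squares of the two discarded singular values.
predicted = np.sqrt(s[1] ** 2 + s[2] ** 2)

print(err, predicted)  # these should agree

# The norm of D is the same as the norm of its singular values.
print(np.linalg.norm(D, 'fro'), np.linalg.norm(s))
```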

I think we’re done here.

Oh, let me be clear. I’m glad I own Malinowski (ah, the book). If you’re going to be doing PCA / FA in the physical sciences, you probably want it, too. If you’re going to be doing PCA / FA in chemistry, don’t even think about not buying it. Just my opinion.

6 Responses to “PCA / FA Malinowski Summary”

  1. Dana H. Says:

    I’m a chemist who does statistical data modeling, and I must say that (from your description), I find Malinowski’s approach just plain weird. If I’m doing PCA to try to understand where the variance in a set of molecular descriptors is coming from, I’d be crazy not to mean-center-and-scale the data first. Otherwise, a property such as molecular weight (with typical values in the 100s) will completely dominate a property such as AlogP (with typical values in the 1s). Perhaps in spectroscopic applications, using raw data in PCAs makes sense, but not in QSAR/QSPR-type applications.

    And, yes, it’s true that “component” has a special meaning in chemistry, but that’s no reason for Malinowski to use “factor analysis” in preference to “principal component analysis”. Many words have more than one meaning; the context should make the meaning clear in a given case. To my knowledge, *no one* in the QSAR/molecular modeling world uses the term “FA” in preference to “PCA”.

    OK, having gotten that out of my system, I have a question for you. I have a piece of code that does PCA, and for some reason it incorporates a constant term in the analysis — i.e., the first column of the X matrix is all 1’s (with all other columns centered and scaled). This seems pointless to me, as you obviously get no additional variance from the column. However, when you uncenter and unscale to express the PCs in terms of the original variables, you do get a constant term left over. Could this be the reason for including the extra column — essentially as a placeholder? Just wondering if you ever ran across this.

  2. Dana H. Says:

    In the PCA code I have, the constant term only gets added at the end, when the loadings get expressed in terms of the uncentered, unscaled variables. So there’s no wasted column being put into the X matrix, and the whole thing makes a lot more sense.

  3. rip Says:

    Hi Dana,

    Thank you very much for taking the time to talk about your experience with PCA. Maybe you’ll find things to say about my earlier posts on PCA / FA, too. They may seem terribly elementary to you.

    I have been trying to emphasize that the preprocessing of data – just whether to use raw or centered or standardized data – is a very important issue, and people do it differently, with good reason, I hope. Your comment reinforces my assessment: you do what Harman and Jolliffe recommend; but not what my geology text (Davis) recommends, at least for some data; and not what Malinowski recommends, at least for some data.

    The only example I have taken from Malinowski is my “example 6”. The data was hypothetical, he said, “involving the ultraviolet absorbances of five different mixtures of the same absorbing components measured at six wavelengths.” The subsequent SVD was computed for that clearly uncentered data. You did say, “Perhaps in spectroscopic applications, using raw data in PCAs makes sense, but not in QSAR/QSPR-type applications.” That sounds like Malinowski’s application to me, but I’m no chemist.

    Finally, you did make me check if I exaggerated Malinowski’s explanation for calling it FA instead of PCA. You be the judge; he said on p. 17, “_Principal component analysis_ (PCA) is another popular name for _eigenanalysis_. To chemists the word ‘component’ conjures up a different meaning and therefore the terminology _principal factor analysis_ (PFA) is offered as an alternative.” Thanks for letting me know that he’s apparently not mainstream on this.

    Again, thanks for throwing in your two cents. Please feel free to throw more coins or even rocks – small ones, I hope – for it’s nice to have an actual practitioner weighing in.

    (Separate comment about the column of 1s.)

  4. rip Says:

    Hello again Dana,

    A column of 1s in the “X” matrix for regression is how we incorporate a constant term. As you say, it makes far more sense to put it in at the end, after the PCA is done, and when we move on to other parts of the analysis.

    (Is your code performing “principal component regression”?)

    I have a strong suspicion, but only a suspicion, that “target testing” can say to us something like: “subtract a constant term from your data”.

    Ah, I did close my post with the bold, “If you’re going to be doing PCA / FA in chemistry, don’t even think about not buying it [Malinowski].” From what you say, Malinowski’s focus is not so wide as all of chemistry. (OTOH, I think “target testing” looks promising.)

    Is there a book (or books or even just chapters) on PCA / FA that you would recommend for chemists, or for scientists in general?

  5. Dana H. Says:

    Yes, I know that the column of 1’s is used in regression calculations for the constant term. I just couldn’t understand why it was there for the PCA. It turns out that I had just misread the code.

    I can’t say that I have a good book on PCA to recommend to chemists specifically. The discussion in “Modern Applied Statistics with S” by Venables and Ripley is pretty good, but very brief. Hastie et al. in “The Elements of Statistical Learning” have a few interesting things to say about PCA and its applications, though none of their examples are chemistry-focused.

  6. rip Says:

    OK, thanks. I’ll keep them in mind.
