## PCA / FA Brereton Summary

(No, you didn’t miss any previous posts about Brereton’s “Chemometrics”. This is it.)

Having made it through chapter 4 of Brereton, I will revise the bibliographic entry to echo the following sentiment: it might be fair to say that I view this as a workbook; figure the stuff out elsewhere, but play with it here. He has several examples in chapter 4: two case studies, a small worked-out example of cross-validation, and a small example illustrating various forms of preprocessing. He also provides data to illustrate “procrustes analysis”, but he does not show how to do it.

On my first pass thru here, all I tried was the two big case studies. I could match his first, but it was ugly getting there, and I could not match his second. On my second pass thru here, a week ago, having learned the SVD from Davis and the X = RC factoring from Malinowski, it was clear what Brereton was doing. I tore thru the chapter, and matched all of his results; more importantly, I got them easily and cleanly. (Mostly. His cross-validation example was more like tiptoeing than tearing thru, and there is one figure in the chapter that must come from some data not supplied.)

As I said in the bibliography, the data is available in electronic form once you purchase the book. Under the circumstances, I won’t publish any of his examples.

But let’s talk about them anyway.

First of all, what he’s doing is factoring the data matrix as Malinowski did, and I think this is best expressed using the SVD. Write

$X = u\ w\ v^T$

and then set

$R = u\ w$

$C = v^T$

so

$X = R\ C$.

He always calls the columns of R the scores, and the rows of C (the columns of v) the loadings. (I have used Malinowski’s notation rather than Brereton’s.)
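A minimal sketch of that factoring, in Python with NumPy (my choice of tool, not Brereton's), on made-up random data rather than any of his examples. The names `R` and `C` follow the notation above; everything else is just illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # made-up data matrix, 6 rows x 4 columns

# Thin SVD: X = u @ diag(s) @ vt, where s holds the diagonal of w
u, s, vt = np.linalg.svd(X, full_matrices=False)

R = u * s        # scores: same as u @ np.diag(s)
C = vt           # loadings: the rows of C are the columns of v

reconstruction_ok = np.allclose(X, R @ C)                        # X = R C
loadings_orthonormal = np.allclose(C @ C.T, np.eye(C.shape[0]))  # rows of C orthonormal
print(reconstruction_ok, loadings_orthonormal)
```

The second check is the orthonormality of the loadings vectors, which is the property that identifies them with the columns of v.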

I have the impression that this book was aimed at people who can ask a computer to compute a PCA, and need to know what to do with the output. I don’t have a problem with that, but for myself I require a wider understanding. For example, he says “… a simple one [definition] defines the eigenvalue of a PC as the sum of squares of the scores….”

Let’s see. The scores are

$R = u\ w$,

the sum of squares of the scores is

$w^T\ u^T\ u\ w = w^T\ w$.

Indeed, we recognize the diagonal entries of $w^T\ w$ (the squared singular values) as the nonzero eigenvalues of both $X^T\ X$ and $X\ X^T$. That is, the sums of squares of the scores are equal to the eigenvalues, but I wouldn’t define them that way.
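This is easy to check numerically. The sketch below uses made-up random data (not Brereton's) and compares the column-wise sums of squares of the scores with the eigenvalues of $X^T\ X$ computed directly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(7, 3))          # made-up data

u, s, vt = np.linalg.svd(X, full_matrices=False)
R = u * s                            # scores

score_ss = (R ** 2).sum(axis=0)      # sum of squares of each column of scores
eig_xtx = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]   # eigenvalues, descending

# Both should equal the squared singular values
matches = np.allclose(score_ss, eig_xtx) and np.allclose(score_ss, s ** 2)
print(matches)
```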

In the same vein, he says, “The first scores vector and the first loadings vector are often called the eigenvectors of the first principal component.” That’s as close as he gets to a definition of an eigenvector. I actually figured out what he was doing when he said that the loadings vectors were orthonormal: then they pretty much had to be the columns of the v matrix.

His case study 1 is a 30×28 matrix; it resembles Malinowski’s introductory example. That is, we have the responses of a mixture, we analyze the raw data, and we hope to identify the individual compounds in the mixture. He provides a plot of the “pure spectra” of the compounds which he decided were in the mixture, but he never named them or supplied that data. (That, of course, is that figure of his which I can’t reproduce.)

His case study 2 is an 8×32 matrix, i.e. with 8 rows and 32 columns: he carefully provides a figure showing that it is short and wide instead of tall and thin. He has taken 32 measurements using 8 similar but different devices, and he wants to assess the similarities and differences. He wants to know which devices are effectively duplicates, and which are not. (What combination of instruments will give him the most information?) For this, he standardizes the data.

Be very careful. One, he uses the number of observations N (in this case, N = 8) rather than N-1. That is, he does not compute the sample variance. He is consistent and adamant about using N instead of N-1. Two, he presents the data matrix as a 32×8, both in the book and electronically: it is imperative that we transpose the given matrix from 32×8 to 8×32. Once we do both of these, everything works out just fine.
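Both cautions can be sketched in a few lines of NumPy. The array `given` is a made-up stand-in with the same 32×8 shape as the supplied table, not his data. In NumPy, `ddof=0` divides by N and `ddof=1` divides by N-1.

```python
import numpy as np

rng = np.random.default_rng(2)
given = rng.normal(loc=5.0, scale=2.0, size=(32, 8))   # made-up stand-in, 32 x 8

X = given.T                          # transpose to 8 rows x 32 columns
mean = X.mean(axis=0)
sd_pop = X.std(axis=0, ddof=0)       # divide by N, as described above
sd_sample = X.std(axis=0, ddof=1)    # divide by N-1, shown only for comparison

Z = (X - mean) / sd_pop              # standardized columns

col_ss = (Z ** 2).sum(axis=0)        # with N in the denominator, each equals N = 8
ratio = sd_sample / sd_pop           # the same factor sqrt(N / (N-1)) for every column
print(col_ss[:3], ratio[0])
```

The two conventions differ only by a constant factor on every column, so the loadings are unaffected, but the scores and eigenvalues change scale, which is enough to keep numbers from matching his.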

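(Oh, because his matrix is short and wide, its rank is equal to the number of rows, instead of the number of columns. This means that his standardizing the columns will reduce the rank of the matrix: centering each column makes the 8 rows add up to the zero vector, so at most 7 of them are linearly independent.)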
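A quick numerical check of the rank remark, again on a made-up random 8×32 matrix rather than his data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 32))                 # made-up short, wide matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize columns (N in the denominator)

rank_raw = np.linalg.matrix_rank(X)          # number of rows, for generic data
rank_std = np.linalg.matrix_rank(Z)          # one less, because every column is centered
rows_sum_to_zero = np.allclose(Z.sum(axis=0), 0.0)
print(rank_raw, rank_std, rows_sum_to_zero)
```

A tall, thin matrix would not show this: there the rank is limited by the number of columns, and centering the columns generally leaves it alone.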

His third example is an 8×10 cross-validation. As with case study 2, having returned to it with Davis and Malinowski under my belt, it was easy to figure out what he had to be doing. The basic idea is to do 11 PCA’s: one with all 10 rows, and then the ten possible PCA’s on the ten 8×9 matrices formed by omitting one row at a time. The objective is to figure out how many components (singular values) to retain, based on how well we predict the omitted rows.
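To make the basic idea concrete, here is a sketch of one simple leave-one-row-out scheme. It is an assumption on my part, not necessarily Brereton's exact recipe: fit the loadings on the remaining rows, project the omitted row onto the first k of them, rebuild it, and add up the squared errors. The data are made up, and the function name `press` is just a label for that accumulated error.

```python
import numpy as np

rng = np.random.default_rng(4)
# Made-up data: 10 rows, 8 columns, roughly two components plus noise
X = rng.normal(size=(10, 2)) @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(10, 8))

def press(X, k):
    """Leave-one-row-out prediction error using the first k loadings."""
    total = 0.0
    for i in range(X.shape[0]):
        rest = np.delete(X, i, axis=0)              # PCA on the other rows
        _, _, vt = np.linalg.svd(rest, full_matrices=False)
        V = vt[:k].T                                # first k loadings, as columns
        x = X[i]
        x_hat = (x @ V) @ V.T                       # project omitted row, then rebuild
        total += np.sum((x - x_hat) ** 2)
    return total

press_by_k = [press(X, k) for k in range(1, 6)]
print(np.round(press_by_k, 3))
```

In this naive form the error can only go down as k grows, because the omitted row is itself used in the projection. So the number alone does not pick k; some further criterion is needed, and that is where the authors differ.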

Once I realized that his criterion for choosing the number of components was different from Malinowski’s, I moved cross-validation, as such, to the back burner. Hell, it’s probably going into the pantry. All I wanted to do was understand PCA / FA, with no idea of the can of worms I was opening. Cross-validation looks useful, but I’m not really interested in it.

Do note, however, that Brereton provides an example of cross-validation; Malinowski does not. It was satisfying to match Brereton’s output.

His fourth example is a different 8×10 illustration of preprocessing. For one set of data, he illustrates PCA applied to the raw data, column-centered data, standardized data, and row-sum = 1 data. For the constant row sum, we get a beautiful and informative plot of the scores. Here is my plot of the first two scores of his row-sum = 1 data.

(Please note that the origin is a long way from the vertical axis. The y-intercept of that line is about +5, and any plot that shows it will compress those points to a very small blob.) I tried to construct an example, but I didn’t get such a magnificent picture from my data.

That picture is what hammered home the difference between a row-sum of zero and a row-sum of 1. That line says that the y components of our scores are affinely – not linearly – related to the x components. That is, the two columns are not linearly dependent, but they do (very nearly) satisfy an equation of the form

$a\ x + b\ y = \text{nonzero constant}$

Now you know why I started to write a post about the difference between row sums equal to zero and row sums equal to a nonzero constant. While I was looking at it, I was reminded of the tricky interaction between constant row sums and centered columns, and the post grew.
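Here is a sketch of where that affine relation comes from, on made-up two-component mixture data (not his). If every row of X sums to 1 and X = R C, then R times the row sums of C is a column of 1s. When only two scores are significant, that collapses to a x + b y ≈ 1.

```python
import numpy as np

rng = np.random.default_rng(5)
# Made-up mixtures of two non-negative "pure" profiles, plus a little noise
profiles = rng.uniform(0.1, 1.0, size=(2, 8))
amounts = rng.uniform(0.2, 1.0, size=(10, 2))
raw = amounts @ profiles + 0.001 * rng.uniform(size=(10, 8))
X = raw / raw.sum(axis=1, keepdims=True)     # every row now sums to 1

u, s, vt = np.linalg.svd(X, full_matrices=False)
R, C = u * s, vt                             # scores and loadings, X = R C

# X @ ones = ones and X = R C, so R @ (C @ ones) = ones
coeffs = C @ np.ones(X.shape[1])
exact = np.allclose(R @ coeffs, 1.0)

# Keeping only the first two scores gives a*x + b*y close to 1
a, b = coeffs[:2]
two_term = a * R[:, 0] + b * R[:, 1]
max_deviation = np.abs(two_term - 1.0).max()
print(exact, max_deviation)
```

With row sums of zero the right-hand side would be 0 instead of 1, and the same argument would give a linear dependence among the scores rather than an affine one.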

His final example or two revisits case study 2, but takes the same measurements with “acetonitrile as mobile phase” and “THF as mobile phase” instead of methanol. (Nope, I have no idea what that means!) The point is to overlay the scores plots for methanol and acetonitrile, and then to overlay the scores plots for methanol and THF.

For methanol and acetonitrile, this appears to work just fine: pairs of points fall close to each other. For methanol and THF, we find that the scales are sufficiently different that the points do not fall close to each other.

That is, for methanol and acetonitrile, I get the following picture when I simply overlay the two sets of scores which I computed:

It appears to match his figure 4.21. We didn’t have to scale one set to get it to lie near the other set. (At least, I didn’t, and if he did, then he didn’t do much.)

For methanol and THF, on the other hand, it is crystal clear that one set of points needs to be transformed. Such a transformation (translation, scaling, and rotation) is called procrustes analysis. He doesn’t show what he did, and the wiki article

here