## PCA / FA example 3: Jolliffe. analyzing the covariance matrix

we have seen what jolliffe did with a correlation matrix. now jolliffe presents the eigenstructure of the covariance matrix of his data, rather than of the correlation matrix. in order for us to confirm his work, he must give us some additional information: the standard deviations of each variable. (recall that he did not give us the data.)
we have to figure out how to recover the covariance matrix c from the correlation matrix r, when for each and every ith variable we have its standard deviation $s_i$.
it’s easy: multiply the (i,j) entry in the correlation matrix r by both $s_i$ and $s_j$:
$c_{i j} = r_{i j} \ s_i \ s_j$
the diagonal entries $r_{i i}$, which are 1, become variances $c_{i i} = s_i^2$, and each off-diagonal correlation $r_{i j}$ becomes a covariance. maybe it would have been more recognizable if i’d written
$r_{i j} = \frac{c_{i j}}{\ s_i \ s_j}$
which says that we get from covariances to correlations by dividing by two standard deviations.
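here is a minimal sketch of that conversion in python with numpy. the function names `corr_to_cov` and `cov_to_corr` and the 3x3 toy matrix are my own inventions for illustration; they are not jolliffe’s data.

```python
import numpy as np

def corr_to_cov(r, s):
    """Covariance matrix from a correlation matrix r and standard deviations s."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    # np.outer(s, s)[i, j] is s[i] * s[j], so this is c_ij = r_ij * s_i * s_j
    return r * np.outer(s, s)

def cov_to_corr(c):
    """Correlation matrix from a covariance matrix c (the reverse direction)."""
    c = np.asarray(c, dtype=float)
    s = np.sqrt(np.diag(c))  # standard deviations are roots of the variances
    return c / np.outer(s, s)

# made-up correlation matrix and standard deviations, for illustration only
r_toy = np.array([[1.0, 0.5, -0.2],
                  [0.5, 1.0, 0.3],
                  [-0.2, 0.3, 1.0]])
s_toy = np.array([2.0, 10.0, 0.5])

c_toy = corr_to_cov(r_toy, s_toy)
print(c_toy)
print(np.allclose(cov_to_corr(c_toy), r_toy))  # the round trip recovers r
```

the diagonal of `c_toy` holds the variances (the squares of `s_toy`), which is the same thing the formula says about $c_{i i}$.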
here are the standard deviations he gives:
${.371,\ 41.253,\ 1.935,\ .077,\ .071,\ 4.037,\ 2.732,\ .297}$
we should expect an interesting result because of the huge variation in values. the standard deviation of the 2nd variable is 10 times that of the 6th, 20 times that of the 3rd, 15 times that of the 7th, and 100-1000 times that of the others (the 1st, 4th, 5th, and 8th). the 2nd variable should dominate the results. that’s an understatement: it will completely overwhelm them.
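a quick way to see the spread is to divide the 2nd standard deviation by each of the eight. this is just arithmetic on the numbers quoted above; the variable names are mine.

```python
# standard deviations as quoted from jolliffe
s = [0.371, 41.253, 1.935, 0.077, 0.071, 4.037, 2.732, 0.297]

# ratio of the 2nd standard deviation to each of the eight
ratios = [s[1] / s_i for s_i in s]
for i, ratio in enumerate(ratios, start=1):
    print(f"s_2 / s_{i} = {ratio:.1f}")
```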
back to business. my correlation matrix is r; let c be the covariance matrix. i have it to machine accuracy, but i am going to round it off in order to display it relatively compactly.
$c = \left(\begin{array}{cccccccc} 0.138&4.438&0.145&-0.002&-0.003&-0.377&-0.232&0.006\\ 4.438&1701.81&33.127&0.905&-1.101&-58.122&-18.483&-1.581\\ 0.145&33.127&3.744&0.062&-0.072&-3.445&-0.767&-0.044\\ -0.002&0.905&0.062&0.006&-0.005&-0.024&0.005&-0.003\\ -0.003&-1.101&-0.072&-0.005&0.005&0.059&0.007&0.003\\ -0.377&-58.122&-3.445&-0.024&0.059&16.297&2.118&0.092\\ -0.232&-18.483&-0.767&0.005&0.007&2.118&7.464&0.343\\ 0.006&-1.581&-0.044&-0.003&0.003&0.092&0.343&0.088\end{array}\right)$
now get the eigenstructure of the covariance matrix. as before, he rounded eigenvectors to the nearest .2, and only showed the first four eigenvectors, for the four largest eigenvalues.
here’s what i get when i select the first 4 eigenvectors and round them off to the nearest .2 .
$\left(\begin{array}{cccc} 0.&0.&0.&0.\\ -1.&0.&0.&0.\\ 0.&-0.2&0.&1.\\ 0.&0.&0.&0.\\ 0.&0.&0.&0.\\ 0.&1.&0.2&0.2\\ 0.&0.2&-1.&0.\\ 0.&0.&0.&0.\end{array}\right)$
as before, to match his numbers i need to take the negatives of the 1st and 3rd columns. i multiply by a diagonal matrix with entries ${-1,\ 1,\ -1,\ 1}$ and get:
$\left(\begin{array}{cccc} 0.&0.&0.&0.\\ 1.&0.&0.&0.\\ 0.&-0.2&0.&1.\\ 0.&0.&0.&0.\\ 0.&0.&0.&0.\\ 0.&1.&-0.2&0.2\\ 0.&0.2&1.&0.\\ 0.&0.&0.&0.\end{array}\right)$
this matches jolliffe’s table 3.3. it is very significant that there is a 1 in each column, and no other number larger than .2. and sure enough, the first column says use the original 2nd variable for our first new variable. you remember, that old variable with darn near all the variance.
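the two mechanical steps here, rounding to the nearest .2 and flipping the signs of columns, can be sketched as follows. `round_to_step` is a hypothetical helper name and the three-number example fed to it is made up; the 8x4 matrix is the rounded one displayed above.

```python
import numpy as np

def round_to_step(a, step=0.2):
    """Round every entry of a to the nearest multiple of step."""
    return np.round(np.asarray(a, dtype=float) / step) * step

# made-up numbers, just to show the rounding
example = round_to_step([0.93, -0.27, 0.09])
print(example)

# the four rounded eigenvectors as displayed above, one per column
v = np.array([
    [0.0,  0.0,  0.0, 0.0],
    [-1.0, 0.0,  0.0, 0.0],
    [0.0, -0.2,  0.0, 1.0],
    [0.0,  0.0,  0.0, 0.0],
    [0.0,  0.0,  0.0, 0.0],
    [0.0,  1.0,  0.2, 0.2],
    [0.0,  0.2, -1.0, 0.0],
    [0.0,  0.0,  0.0, 0.0],
])

# an eigenvector is only defined up to sign, so flipping a column is harmless;
# right-multiplying by a diagonal matrix scales the columns
signs = np.diag([-1.0, 1.0, -1.0, 1.0])
flipped = v @ signs
print(flipped)
```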
and these four vectors bear no resemblance to the previous four eigenvectors, which we got from the correlation matrix. here’s what they were:
$\left(\begin{array}{cccc} 0.2&-0.4&0.4&0.6\\ 0.4&-0.2&0.2&0.\\ 0.4&0.&0.2&-0.2\\ 0.4&0.4&-0.2&0.2\\ -0.4&-0.4&0.&-0.2\\ -0.4&0.4&-0.2&0.6\\ -0.2&0.6&0.4&-0.2\\ -0.2&0.2&0.8&0.\end{array}\right)$
we get the % variation explained as before. here are the eigenvalues…
${1704.68,\ 15.0561,\ 6.98002,\ 2.63927,\ 0.125282,\ 0.0656242,\ 0.00724308,\ 0.000562978}$
their sum is 1729.55 and their average is 216.194 .
now divide each eigenvalue by their sum, and round off, and write it as a percentage…
${98.6,\ 0.9,\ 0.4,\ 0.2,\ 0.,\ 0.,\ 0.,\ 0.}$
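here is a small sketch of that arithmetic, using only numbers quoted in this post. it also checks one thing the text does not state but which must hold: the sum of the eigenvalues equals the trace of the covariance matrix, i.e. the sum of the variances $s_i^2$ (up to the rounding in the quoted standard deviations).

```python
# standard deviations and eigenvalues as quoted in the text
s = [0.371, 41.253, 1.935, 0.077, 0.071, 4.037, 2.732, 0.297]
eigenvalues = [1704.68, 15.0561, 6.98002, 2.63927,
               0.125282, 0.0656242, 0.00724308, 0.000562978]

total = sum(eigenvalues)
average = total / len(eigenvalues)
trace = sum(sd ** 2 for sd in s)  # sum of the variances on the diagonal of c

# percentage of the total variation carried by each eigenvalue
percent = [round(100 * ev / total, 1) for ev in eigenvalues]

print("sum of eigenvalues:", round(total, 2))
print("sum of variances:  ", round(trace, 2))
print("average eigenvalue:", round(average, 3))
print("percent explained: ", percent)
```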
one new variable accounts for just about all the variation, and it’s essentially the second old variable. any of the ad hoc rules would tell us to keep exactly one new variable, and that one new variable is almost the same as the second old variable. this is what happens when one variance dominates.
what do we actually have, if we don’t round off so harshly? let’s look at that first column with no rounding…
$\left(\begin{array}{c} -0.00261244\\ -0.999151\\ -0.0195343\\ -0.00053178\\ 0.000647557\\ 0.0344497\\ 0.0109335\\ 0.000930991\end{array}\right)$
well, with a little rounding…
$\left(\begin{array}{c} 0.\\ -1.\\ -0.02\\ 0.\\ 0.\\ 0.03\\ 0.01\\ 0.\end{array}\right)$
that first eigenvector is still almost entirely the 2nd original data variable, with extremely small contributions from the 3rd, 6th, and 7th original variables, and even less from the others.
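one way to put a number on “almost entirely” is to square the components. for a unit-length eigenvector the squares sum to 1, so each square is that variable’s share of the vector. this is my own sketch, using the unrounded column shown above.

```python
# first eigenvector of the covariance matrix, as displayed above
v1 = [-0.00261244, -0.999151, -0.0195343, -0.00053178,
      0.000647557, 0.0344497, 0.0109335, 0.000930991]

squared = [x * x for x in v1]
print("squared length:", sum(squared))  # should be very close to 1
for i, sq in enumerate(squared, start=1):
    print(f"variable {i}: {100 * sq:.4f}% of the squared length")
```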
it’s important to realize that the transformation between correlation matrix and covariance matrix is not a nice one, and the eigenstructures can be significantly different depending on which we use. to put that more starkly, our eigenvalues and eigenvectors can be significantly different depending on which matrix we analyze.
to state it in a way that we will see again: our choice of the initial preprocessing of the data may be the most significant modeling we do.
jolliffe strongly recommends using the correlation matrix.
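a tiny made-up example shows how different the two eigenstructures can be. this is not jolliffe’s data: i am assuming two variables with correlation 0.5 and standard deviations 1 and 100, and `eigen_desc` is a hypothetical helper name.

```python
import numpy as np

def eigen_desc(m):
    """Eigenvalues (largest first) and matching eigenvectors (as columns)."""
    values, vectors = np.linalg.eigh(m)  # eigh is for symmetric matrices
    order = np.argsort(values)[::-1]     # eigh returns ascending order
    return values[order], vectors[:, order]

# made-up correlation matrix and standard deviations, for illustration only
r = np.array([[1.0, 0.5],
              [0.5, 1.0]])
s = np.array([1.0, 100.0])
c = r * np.outer(s, s)  # c_ij = r_ij * s_i * s_j

vals_r, vecs_r = eigen_desc(r)
vals_c, vecs_c = eigen_desc(c)

print("correlation: eigenvalues", vals_r, "first eigenvector", vecs_r[:, 0])
print("covariance:  eigenvalues", vals_c, "first eigenvector", vecs_c[:, 0])
```

for the correlation matrix the first eigenvector weights the two variables equally (in absolute value); for the covariance matrix it is almost exactly the high-variance variable, the same effect we just saw with jolliffe’s 2nd variable.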
since we’ve seen the methods for deciding how many PCs to keep, let’s quickly run thru them as a refresher, and because jolliffe and i stated them for a correlation matrix rather than for a covariance matrix.
method 1 was to keep enough eigenvectors so that the corresponding eigenvalues accounted for 70-90% of the cumulative variation. well, the first eigenvalue alone accounts for 98.6%. keep just one, and it’s overkill.
method 2 was to keep an eigenvector if its eigenvalue was greater than .7, except that’s .7 compared to an average variance of 1. in this case the average variance is 216.194 and .7 times it is 151.336. since the 2nd eigenvalue is 15.0561, one tenth our cutoff, once more we are told to keep only the first eigenvector.
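methods 1 and 2 are simple enough to sketch in a few lines. the function names are mine, and the thresholds (70%, 90%, and .7 times the average) are the ones quoted above.

```python
eigenvalues = [1704.68, 15.0561, 6.98002, 2.63927,
               0.125282, 0.0656242, 0.00724308, 0.000562978]

def keep_by_cumulative(eigs, target):
    """Method 1: fewest leading eigenvalues whose cumulative share reaches target."""
    total = sum(eigs)
    running = 0.0
    for count, ev in enumerate(eigs, start=1):
        running += ev
        if running / total >= target:
            return count
    return len(eigs)

def keep_by_average(eigs, factor=0.7):
    """Method 2: count eigenvalues above factor times the average eigenvalue."""
    cutoff = factor * sum(eigs) / len(eigs)
    return sum(1 for ev in eigs if ev > cutoff), cutoff

n_70 = keep_by_cumulative(eigenvalues, 0.70)
n_90 = keep_by_cumulative(eigenvalues, 0.90)
n_avg, cutoff = keep_by_average(eigenvalues)

print("method 1 at 70%:", n_70)
print("method 1 at 90%:", n_90)
print("method 2 cutoff:", round(cutoff, 3), "keeps", n_avg)
```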
method 3 was either a scree graph or an LEV diagram, and the LEV is supposedly more appropriate for these eigenvalues. (a scree graph plots the eigenvalues, an LEV plots their logarithms, in order.)
the LEV diagram doesn’t do much for me. a scree graph, by contrast, would sure as heck show a “leveling off”. this doesn’t really tell us anything we didn’t already know, but, boy oh boy, it’s in our faces and impossible to miss.
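for anyone who wants to draw the two graphs, here are the numbers each one would plot against the index 1 through 8. i am assuming base-10 logarithms for the LEV values; a different base only rescales the vertical axis.

```python
import math

eigenvalues = [1704.68, 15.0561, 6.98002, 2.63927,
               0.125282, 0.0656242, 0.00724308, 0.000562978]

# a scree graph plots the eigenvalues; an LEV diagram plots their logarithms
log_eigenvalues = [math.log10(ev) for ev in eigenvalues]

for k, (ev, lev) in enumerate(zip(eigenvalues, log_eigenvalues), start=1):
    print(f"{k}: eigenvalue = {ev:12.6f}   log10 = {lev:7.3f}")
```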
finally, i can’t resist computing the broken sticks. we already have them for a stick of length 1 broken into 8 pieces; we want to change the total length from 1 to the sum of the eigenvalues, which was 1729.55.
here are the eigenvalues again:
${1704.68,\ 15.0561,\ 6.98002,\ 2.63927,\ 0.125282,\ 0.0656242,\ 0.00724308,\ 0.000562978}$
here are the 8 expected broken lengths for total length 1…
${0.3397,\ 0.2147,\ 0.1522,\ 0.1106,\ 0.07932,\ 0.05432,\ 0.03348,\ 0.01563}$
and for total length 1729.55…
${587.584,\ 371.39,\ 263.293,\ 191.229,\ 137.18,\ 93.9415,\ 57.9091,\ 27.0243}$
and also the running cumulative lengths of the broken pieces….
${587.584,\ 958.975,\ 1222.27,\ 1413.5,\ 1550.68,\ 1644.62,\ 1702.53,\ 1729.55}$
so the stick of length 1730 would have its two largest pieces with lengths 588 and 371; and the sum of the first 7 broken pieces (1702.5) is still less than the first eigenvalue alone (1704.7)!
we would keep only the first eigenvector. not a surprise.
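the broken-stick numbers above can be reproduced with a short sketch. the expected length of the kth largest of n pieces of a unit stick is $\frac{1}{n}\sum_{i=k}^{n}\frac{1}{i}$. the function name `broken_stick` is mine, and the keep rule coded here (keep leading eigenvalues for as long as each exceeds its expected piece) is my reading of the comparison made above.

```python
from itertools import accumulate

def broken_stick(n, total=1.0):
    """Expected piece lengths, largest first, for a stick of length total
    broken at random into n pieces."""
    return [total / n * sum(1.0 / i for i in range(k, n + 1))
            for k in range(1, n + 1)]

eigenvalues = [1704.68, 15.0561, 6.98002, 2.63927,
               0.125282, 0.0656242, 0.00724308, 0.000562978]
total = sum(eigenvalues)

unit = broken_stick(8)             # pieces for total length 1
scaled = broken_stick(8, total)    # pieces for the sum of the eigenvalues
cumulative = list(accumulate(scaled))

# count leading eigenvalues that exceed their expected broken-stick length
keep = 0
for ev, piece in zip(eigenvalues, scaled):
    if ev <= piece:
        break
    keep += 1

print([round(x, 4) for x in unit])
print([round(x, 2) for x in scaled])
print([round(x, 2) for x in cumulative])
print("keep:", keep)
```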
next, we will move on to a statistics in geology book (davis) and see what he has to say about PCA / FA. that will be quite a bit.