PCA / FA example 4: Davis. covariance, correlation, and all that

“Every working mathematician knows that if one does not control oneself (best of all by examples), then after some ten pages half of all the signs in formulae will be wrong and twos will find their way from denominators into numerators.” (V.I. Arnold, “On Teaching Mathematics”)
i keep losing track of factors of N-1 in my head, so maybe you do too. to be specific, my recollection is that if the design matrix X contains standardized data, then the correlation matrix is
c = \frac{X^T \ X}{N-1}.
but another voice asks: shouldn’t an N-1 in the numerator have canceled the N-1 in the denominator? after all, i’m dividing each covariance by the product of two standard deviations.
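in the spirit of arnold’s advice, here’s a numerical check before any algebra. this is a numpy sketch on made-up data; the matrix, its size, and numpy itself are my choices for illustration, not part of the davis example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
raw = rng.normal(size=(N, 3))            # made-up data, variables in columns

# standardize: subtract column means, divide by sample standard deviations
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

c = Z.T @ Z / (N - 1)                    # the claimed correlation matrix
print(np.allclose(c, np.corrcoef(raw, rowvar=False)))   # True
```

so the recollection checks out numerically; what follows is the bookkeeping that says why.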
let’s take a look. we are going to use something closer to standard notation for a little while, rather than the notations i’ve been using. so, for the definitions that follow, forget about my usual conventions for X and Z.
we start with a data matrix X = X_{ij}, with variables in columns. we compute the mean of each column,
\bar{X}_j = \frac{1}{N} \sum_{i=1}^{N} X_{ij}.
we subtract each column mean from the entries in that column, and denote the resulting deviates by lowercase x:
x_{ij} = X_{ij} - \bar{X}_j.
that is called centered data or zero-mean data.
the sample variance of raw data or of centered data is
s_j^2 = \frac{1}{N-1} \sum_{ i = 1}^{ N}{x_{ij}}^2
where we recall that N-1 rather than N appears because it gives us an unbiased estimate of the population variance. the square root, s_j, of the sample variance is the sample standard deviation. to get the sample variance of raw data, just center it first.
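here’s what centering and the N-1 divisor look like in numpy, with toy numbers of my own rather than the davis data; ddof=1 is how you ask numpy for the sample (rather than population) variance.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])                      # toy data, variables in columns
N = X.shape[0]

x = X - X.mean(axis=0)                          # centered (zero-mean) data
s2 = (x ** 2).sum(axis=0) / (N - 1)             # sample variance, by the formula
print(np.allclose(s2, X.var(axis=0, ddof=1)))   # True: ddof=1 gives the N-1 divisor
```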
we can create standardized variables, also called standard deviates, from centered data by dividing each column entry by the standard deviation of that column; denote these by lowercase z, and we have
z_{ij}= \frac{x_{ij}}{s_j}.
we can also create what are called small standard deviates by dividing standard deviates by \sqrt{N-1}, getting:
z^{*}_{ij} = \frac{z_{ij}}{\sqrt{N-1}}.
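with the same toy data, both kinds of deviates are one line each. the unit-variance check on z is the whole point of standardizing, and the columns of the small deviates come out with unit length.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])                      # toy data again
N = X.shape[0]

x = X - X.mean(axis=0)                          # centered data
z = x / X.std(axis=0, ddof=1)                   # standard deviates
z_star = z / np.sqrt(N - 1)                     # small standard deviates

print(np.allclose(z.var(axis=0, ddof=1), 1.0))        # True: unit sample variance
print(np.allclose((z_star ** 2).sum(axis=0), 1.0))    # True: unit-length columns
```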
the sample covariance between the jth and kth zero-mean variables is
s_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} x_{ij}\ x_{ik}.
(for j = k, this is the sample variance.)
i remark that i have used the usual notation s_{jk} rather than s^2_{jk}, although the latter would be consistent with the notation for the sample variance. the sample correlation coefficient between the jth and kth variables is their sample covariance divided by the product of their standard deviations:
r_{jk} = \frac{s_{jk}}{s_j\ s_k}.
(there, didn’t that just eliminate an N-1? well, no.)
now let’s start relating all that to matrices. the first one we care about is the matrix x of centered data. the sample variances and covariances can be combined in matrix form as
cov(x) = \frac{1}{N-1}\ x^T\ x.
the key is that the summation \sum_{i=1}^{N} x_{ij}\ x_{ik} is exactly the (j,k) entry of x^T\ x. that is, for any centered data x, the product x^T\ x is N-1 times the covariance matrix of the raw data.
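numpy agrees; np.cov divides by N-1 by default, and rowvar=False says the variables are in columns. the data here are again made up.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8
X = rng.normal(size=(N, 3))                   # made-up raw data
x = X - X.mean(axis=0)                        # centered data

print(np.allclose(x.T @ x / (N - 1),
                  np.cov(X, rowvar=False)))   # True
```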
now, let’s return to my usual notation.
i encourage you to actually do the following calculations if you need to, but perhaps it suffices to do them in principle. for the davis example, we already have the raw data D and the zero-mean data X. go ahead and compute the standardized data Z and the small standard deviates
Y = Z / \sqrt{N-1}
for each of those four matrices of data M = {D, X, Z, Y}, compute three things:
\frac{M^T M}{N-1},
the covariance matrix cov(M),
and the correlation matrix corr(M).
what will we find?
first, two of the four covariance matrices are the same, for the raw data D and the zero-mean data X:
cov(D) = cov(X).
second, all four correlation matrices are the same; the correlation matrix is unique:
corr(D) = corr(X) = corr(Z) = corr(Y).
third, the covariance matrix of the standardized data is also its correlation matrix, hence is the common correlation matrix:
cov(Z) = corr(Z).
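i won’t reproduce the davis numbers here, but a stand-in D makes all three findings checkable; the sizes and the random entries below are made-up substitutes for the real example.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 25
D = rng.normal(loc=5.0, scale=2.0, size=(N, 4))   # stand-in for the davis raw data

X = D - D.mean(axis=0)                  # zero-mean data
Z = X / D.std(axis=0, ddof=1)           # standardized data
Y = Z / np.sqrt(N - 1)                  # small standard deviates

cov = lambda M: np.cov(M, rowvar=False)
corr = lambda M: np.corrcoef(M, rowvar=False)

print(np.allclose(cov(D), cov(X)))                              # first finding
print(all(np.allclose(corr(D), corr(M)) for M in (X, Z, Y)))    # second finding
print(np.allclose(cov(Z), corr(Z)))                             # third finding
```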
now what about \frac{M^T M}{N-1} ?
for three of the four data matrices (all except D, which is not zero-mean), we must have
\frac{M^T M}{N-1} = cov(M).
in particular, for standardized data Z, the correlation matrix is also proportional to Z^T Z, because it is also the covariance matrix:
\frac{Z^T Z}{N-1} = cov(Z) = corr(Z).
finally, the definition of small standard deviates says
Y^T Y = \frac{Z^T Z}{N-1}
so
Y^T Y = cov(Z) = corr(Z) = corr(Y).
in conclusion, introducing small standard deviates Y lets us get at the unique correlation matrix indirectly, by computing either an eigendecomposition of Y^T Y or an SVD of Y. the eigenvalues and eigenvectors of Y^T Y are precisely the eigenvalues and eigenvectors of the correlation matrix.
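here’s a sketch of that last claim, again on stand-in data: the eigenvalues of Y^T Y match those of the correlation matrix, and the singular values of Y come back as their square roots.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 25
D = rng.normal(size=(N, 4))                             # stand-in raw data
Z = (D - D.mean(axis=0)) / D.std(axis=0, ddof=1)
Y = Z / np.sqrt(N - 1)

R = np.corrcoef(D, rowvar=False)                        # the correlation matrix

# eigenvalues of Y^T Y match those of the correlation matrix ...
print(np.allclose(np.linalg.eigvalsh(Y.T @ Y), np.linalg.eigvalsh(R)))   # True

# ... and the singular values of Y are their square roots
sv = np.linalg.svd(Y, compute_uv=False)
print(np.allclose(np.sort(sv ** 2), np.linalg.eigvalsh(R)))              # True
```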
perhaps i should remind you that
corr(Y) \ne cov(Y).
small standard deviates would let us use Y instead of the correlation matrix; they do nothing for any of the other covariance matrices.
for any other zero-mean data matrix, M^T M and cov(M) have the same eigenvectors, but the eigenvalues of M^T M are N-1 times those of cov(M).
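and a quick check of that factor, on generic made-up zero-mean data:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 25
M = rng.normal(size=(N, 4))
M = M - M.mean(axis=0)                          # any zero-mean data

ev_cov = np.linalg.eigvalsh(np.cov(M, rowvar=False))
ev_gram = np.linalg.eigvalsh(M.T @ M)
print(np.allclose(ev_gram, (N - 1) * ev_cov))   # True: eigenvalues differ by N-1
```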
i hate to say the following, but i want to move on. when i think of the correlation matrix as the covariance matrix of standardized data – which
has standard deviations equal to 1 – i know there’s a factor of N-1. my brain takes two premises and reaches a conclusion:
corr(Z) = cov(Z),
and
cov(Z) = \frac{1}{N-1} Z^T Z,
hence
corr(Z)= \frac{1}{N-1} Z^TZ.
but when i think of the correlation matrix as
r_{ ij}= \frac{s_{ ij}}{s_{ i}s_{ j}},
i still can’t see why there’s an N-1. someday i’ll track it down and beat it to death, but not today. i suspect i’ll be embarrassed, but that’s not all that important.