## PCA / FA example 4: Davis review (3)

Let’s talk about centering the data. Recall the questions:

• is there any chance that we can always center both the columns and the rows?
• is there any chance that we can always standardize both the columns and the rows?
• if we can’t have our cake and eat it too, should we give up the duality between R-mode and Q-mode?
• what if rows of the data matrix add up to 1 or 100 (i.e. the variables are percentages)?

(My third question is poorly phrased. We always have the duality between A’s and S’s; what I feared losing was the idea that Q-mode is just R-mode applied to the transpose $X^T$. But what if X doesn’t have row-centered data?)

The third question is also a red herring. We can always make both the columns and the rows add up to zero. Proving it isn’t too hard; some convenient notation might simplify things. If the data is

$x_{ij}$

then

$x_{\cdot j}$ and $x_{i\cdot}$

are good symbols for the column means and row means, respectively. The grand mean (the mean of all the matrix values = mean of column means = mean of row means) would be denoted

$x_{\cdot\cdot}$

BTW, it’s very handy that the mean of column means = mean of row means; we’ll see how handy, soon.

The only subtlety is that what we have here are row and column means of the original data, instead of, for example, column means of the original data and row means of the column-centered data. We can get “doubly-centered data” by computing

$x_{ij} - x_{\cdot j} - x_{i\cdot} + x_{\cdot\cdot}$

Why are we adding the grand mean? Because we subtracted it twice: once as part of the column means and once as part of the row means. If we had taken row means of the column-centered data, we wouldn’t add the grand mean back in; but I chose, and I’m sure most people choose, to take row means of the original data.

I leave the proof that we can always get doubly-centered data to the interested reader. Really, it shouldn’t be difficult: average the expression above over i, then over j, and each time the four terms cancel.

It means that we can always do both the R-mode and Q-mode analyses on a doubly-centered design matrix X.
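As a numerical sanity check (not a proof), here is a minimal Python sketch of the double-centering formula applied to an arbitrary matrix. The helper names `mean` and `double_center`, and the random 5×3 test matrix, are my own illustrative assumptions; the post itself used Mathematica®.

```python
import random

def mean(values):
    return sum(values) / len(values)

def double_center(x):
    """Return x_ij - x_.j - x_i. + x_.. for a matrix stored as a list of rows."""
    row_means = [mean(row) for row in x]
    col_means = [mean(col) for col in zip(*x)]
    grand_mean = mean(row_means)  # same as the mean of the column means
    return [[x[i][j] - col_means[j] - row_means[i] + grand_mean
             for j in range(len(col_means))]
            for i in range(len(row_means))]

# Hypothetical test data: any rectangular matrix should do.
random.seed(0)
x = [[random.uniform(-10, 10) for _ in range(3)] for _ in range(5)]

d = double_center(x)
max_row_mean = max(abs(mean(row)) for row in d)
max_col_mean = max(abs(mean(col)) for col in zip(*d))
print(max_row_mean, max_col_mean)  # both should be zero up to rounding
```

Both printed numbers should be at the level of floating-point rounding, whatever matrix is fed in.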

That had bothered me while I was looking at Q-mode. I knew that Davis’ data was row-centered as well as column-centered (had zero-mean rows as well as zero-mean columns, resp.), but I just assumed he had picked the numbers that way in order to avoid discussing the case when the rows were not zero-mean.

We can always have both row-centered and column-centered data simultaneously.
(We can have our cake and eat it too, so we don’t have to worry about Q-mode being done with non-centered data.)

NOTE that we cannot, in general, do that with standardized data (or, therefore, with the correlation matrices). That is, we cannot always standardize both the columns and the rows. The unit column-variances get messed up when we standardize the rows after standardizing the columns.

As I draft this, it is an open question for me whether we can always arrange to have column-centered data with row sums equal to 1. Let me follow some great advice, scary as it is: let me guess. (From John A. Wheeler, the great physics teacher: if my guess turns out right, I have reinforced my intuition; if my guess turns out wrong, I correct my intuition.)

I think we can do it. Adding that grand mean back in looks promising for adding 1 back in.

(I won’t string you along any further: my intuition was wrong. We’ll see what happens instead. It’s almost as good.)

Let’s try an example. I start with a real simple set of numbers:

$\left(\begin{array}{llll} 1&2&3&4\\ 5&6&7&8\\ 9&10&11&12\end{array}\right)$

Square the first row, square-root the second row, leave the third alone:

$\left(\begin{array}{llll} 1&4&9&16\\ \sqrt{5}&\sqrt{6}&\sqrt{7}&2 \sqrt{2} \\ 9&10&11&12\end{array}\right)$

While I’ve got it in this form, get the column means, which will be the row means of the data matrix once I transpose.

$\{4.07869,\ 5.48316,\ 7.54858,\ 10.2761\}$

(Much of Mathematica® works on rows; I’ve just been working with $X^T$.) Transpose. This is my example data matrix.

$\left(\begin{array}{lll} 1&\sqrt{5}&9\\ 4&\sqrt{6}&10\\ 9&\sqrt{7}&11\\ 16&2 \sqrt{2}&12\end{array}\right)$

Get the column means.

$\{7.5,\ 2.53993,\ 10.5\}$

The grand mean is any of the following: the mean of row means… the mean of column means… or the mean of all 12 values. No matter how we compute it, we get

6.84664.

We get doubly-centered data by

$x_{ij} - x_{\cdot j} - x_{i\cdot} + x_{\cdot\cdot}$

which gives us

$\left(\begin{array}{lll} -3.73204&2.46409&1.26796\\ -2.13652&1.27304&0.863481\\ 0.798061&-0.596122&-0.201939\\ 5.0705&-3.141&-1.9295\end{array}\right)$

(Yes, the column means are zero and the row means are zero.)

That was easy enough.
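For anyone following along without Mathematica®, the computation above can be sketched in plain Python. The variable names are mine; the data matrix and the formula are the ones in the text.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

# The example data matrix: 4 observations (rows), 3 variables (columns).
x = [[1.0, sqrt(5), 9.0],
     [4.0, sqrt(6), 10.0],
     [9.0, sqrt(7), 11.0],
     [16.0, 2 * sqrt(2), 12.0]]

row_means = [mean(row) for row in x]
col_means = [mean(col) for col in zip(*x)]
grand_mean = mean([v for row in x for v in row])  # mean of all 12 values

# x_ij - x_.j - x_i. + x_..
doubly_centered = [[x[i][j] - col_means[j] - row_means[i] + grand_mean
                    for j in range(3)]
                   for i in range(4)]

print([round(m, 5) for m in row_means])
print([round(m, 5) for m in col_means])
print(round(grand_mean, 5))
for row in doubly_centered:
    print([round(v, 5) for v in row])
```

The printed values should agree with the row means, column means, grand mean and doubly-centered matrix quoted above, to the digits shown there.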

Let’s back up and look at just column-centered data from the original.

$\left(\begin{array}{lll} -6.5&-0.303866&-1.5\\ -3.5&-0.0904443&-0.5\\ 1.5&0.105817&0.5\\ 8.5&0.288493&1.5\end{array}\right)$

The column means, of course, are zero… the row means are not (these, of course, are the row means of the column-centered data, not of the original data)…

$\{-2.76796,\ -1.36348,\ 0.701939,\ 3.4295\}$

but the grand mean is zero. (Go ahead, compute it if you must.)

The grand mean is zero because it is the mean of the column means, each of which is zero. This implies that the mean of the row means is zero. In particular, since 1 ≠ 0, we cannot have column-centered data with constant row sums = 1. (I said it was handy that the grand mean could be computed more than one way.)

Maybe I should emphasize what we just saw: the act of column-centering the data implies that the sum of the new row means is zero. If each row mean is the same, then that common value must be zero. That example did not have constant row sums, but we see what must happen if it had.
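The point can be checked with a short Python sketch on the same example matrix (the variable names are my own): after column-centering, the row means are nonzero individually but average to zero.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

x = [[1.0, sqrt(5), 9.0],
     [4.0, sqrt(6), 10.0],
     [9.0, sqrt(7), 11.0],
     [16.0, 2 * sqrt(2), 12.0]]

col_means = [mean(col) for col in zip(*x)]

# Subtract each column's mean from that column only.
col_centered = [[v - m for v, m in zip(row, col_means)] for row in x]

# Row means of the column-centered data (not of the original data).
row_means_cc = [mean(row) for row in col_centered]
grand_mean_cc = mean(row_means_cc)

print([round(m, 5) for m in row_means_cc])
print(grand_mean_cc)  # zero up to rounding
```

Since the row means must average to zero, they cannot all equal 1/3 (which is what row sums of 1 would require with three columns).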

I probably do not need to work an example, but I’m going to. And I’m going to do it because I had actually worked this example first; then I realized what had happened, and I told you up front. So suppose we had started with a matrix where the rows each add up to 1. Go back to the raw data again….

$\left(\begin{array}{llll} 1&4&9&16\\ \sqrt{5}&\sqrt{6}&\sqrt{7}&2 \sqrt{2} \\ 9&10&11&12\end{array}\right)$

but change the rows of the data matrix (the columns of the display above). Divide each row by its mean, and by 3. (I don’t want row mean = 1, I want row sum = 1.)

$\left(\begin{array}{lll} 0.0817256&0.182744&0.73553\\ 0.243169&0.14891&0.607922\\ 0.397426&0.116832&0.485742\\ 0.519002&0.0917474&0.389251\end{array}\right)$

Get the column means…

$\{0.31033,\ 0.135058,\ 0.554611\}$

and having said that we want row sums = 1, we confirm that each row mean is 1/3, because it’s an easier command in Mathematica®:

$\{0.333333,\ 0.333333,\ 0.333333,\ 0.333333\}$

I then construct column-centered data:

$\left(\begin{array}{lll} -0.228605&0.0476857&0.180919\\ -0.0671617&0.0138515&0.0533102 \\ 0.0870952&-0.0182262&-0.068869 \\ 0.208671&-0.0433109&-0.16536\end{array}\right)$

We could confirm that the column means are zero; and we compute the row means…

$\{2.77556 \times 10^{-17},\ 9.25186 \times 10^{-18},\ -1.85037 \times 10^{-17},\ -9.25186 \times 10^{-18}\}$

We automatically got row-centered data (to within rounding error), as expected.
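Here is a hedged Python sketch of the same experiment: rescale each row of the data matrix to sum to 1, center the columns, and look at the row means. Dividing each row by its sum is the same as dividing by its mean and by 3; the variable names are my own.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

x = [[1.0, sqrt(5), 9.0],
     [4.0, sqrt(6), 10.0],
     [9.0, sqrt(7), 11.0],
     [16.0, 2 * sqrt(2), 12.0]]

# Make every row sum to 1 (so every row mean is 1/3).
unit_rows = [[v / sum(row) for v in row] for row in x]

col_means = [mean(col) for col in zip(*unit_rows)]
col_centered = [[v - m for v, m in zip(row, col_means)] for row in unit_rows]

# Row means after column-centering: zero without any explicit row-centering.
row_means_cc = [mean(row) for row in col_centered]

print([round(m, 6) for m in col_means])
print(row_means_cc)  # rounding-level numbers, not exact zeros
```

The exact rounding-level values will likely differ from the Mathematica® output above, but they should be of the same tiny size.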

We could look at standardized data, but I’ll leave this to you, too: a counterexample should be convincing enough (and I’ve given you a matrix to play with). We can’t preserve unit variance of the columns when we then standardize the rows.
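One way to try that counterexample, sketched in Python with the standard `statistics` module. My assumptions here: "standardize" means subtracting the mean and dividing by the sample standard deviation (the n − 1 divisor), and the columns are standardized first, then the rows. The helper names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

def standardize(values):
    """Shift and scale a sequence to zero mean and unit sample variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

x = [[1.0, sqrt(5), 9.0],
     [4.0, sqrt(6), 10.0],
     [9.0, sqrt(7), 11.0],
     [16.0, 2 * sqrt(2), 12.0]]

# Step 1: standardize each column.
col_std = transpose([standardize(col) for col in transpose(x)])
col_vars_before = [variance(col) for col in transpose(col_std)]

# Step 2: standardize each row of the column-standardized data.
both = [standardize(row) for row in col_std]
col_vars_after = [variance(col) for col in transpose(both)]

print(col_vars_before)  # all 1, up to rounding
print(col_vars_after)   # no longer all 1
```

After the second step the rows have unit variance, but the column variances have drifted away from 1, which is the behavior described above.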

Summary:

1. we can always get doubly-centered data.
2. my guess was wrong: we cannot have column-centered data with constant row sums, unless the constant is 0.
3. but if the rows do have a common row mean (equivalently, a common row sum), then just centering the columns automatically gives us doubly-centered data.
4. we cannot always get doubly-standardized data.
5. if I center the columns, then I would be inclined to center the rows: use doubly-centered data in preference to only column-centered (assuming, as I prefer, that the variables are columns).

Let me elaborate on (5). For a matrix X with variables in columns, column-centered data means that $X^T\ X$ is proportional to the covariance matrix of X. If we want $X\ X^T$ to be proportional to the covariance matrix of $X^T$, then we need row-centered data. If we want both $X^T\ X$ and $X\ X^T$ to be proportional to covariance matrices, then we need doubly-centered data.
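A small Python sketch can illustrate this on the example matrix. I am assuming the usual sample-covariance constants, so that the proportionality factors are 1/(n − 1) for $X^T\ X$ and 1/(p − 1) for $X\ X^T$, where X is n × p; the helper names are my own.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]

def scale(m, c):
    return [[v / c for v in row] for row in m]

def covariance(m):
    """Sample covariance matrix of the columns of m, from the definition."""
    cols = transpose(m)
    mus = [mean(c) for c in cols]
    n_obs = len(m)
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n_obs - 1)
             for cb, mb in zip(cols, mus)]
            for ca, ma in zip(cols, mus)]

def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

x = [[1.0, sqrt(5), 9.0],
     [4.0, sqrt(6), 10.0],
     [9.0, sqrt(7), 11.0],
     [16.0, 2 * sqrt(2), 12.0]]
n, p = len(x), len(x[0])

col_means = [mean(c) for c in transpose(x)]
row_means = [mean(r) for r in x]
grand = mean(row_means)

col_centered = [[x[i][j] - col_means[j] for j in range(p)] for i in range(n)]
doubly_centered = [[x[i][j] - col_means[j] - row_means[i] + grand
                    for j in range(p)] for i in range(n)]

def gaps(m):
    """How far X^T X/(n-1) and X X^T/(p-1) are from the covariance matrices."""
    r_gap = max_abs_diff(scale(matmul(transpose(m), m), n - 1), covariance(m))
    q_gap = max_abs_diff(scale(matmul(m, transpose(m)), p - 1),
                         covariance(transpose(m)))
    return r_gap, q_gap

r_gap_cc, q_gap_cc = gaps(col_centered)
r_gap_dc, q_gap_dc = gaps(doubly_centered)
print(r_gap_cc, q_gap_cc)  # first is ~0, second is not
print(r_gap_dc, q_gap_dc)  # both ~0
```

With only column-centered data the R-mode product matches its covariance matrix but the Q-mode product does not; with doubly-centered data both match.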