PCA / FA Malinowski: Example 5. Missing data.

Malinowski does use H for something else, namely missing data points. The X matrix must be complete, but a test vector x need not be.

For quick reference, X, H and u are

$X = \left(\begin{array}{lll} 2 & 3 & 4 \\ 1 & 0 & -1 \\ 4 & 5 & 6 \\ 3 & 2 & 1 \\ 6 & 7 & 8\end{array}\right)$

$H =\left(\begin{array}{lllll} 0.203008 & -0.180451 & 0.218045 & -0.165414 & 0.233083 \\ -0.180451 & 0.327068 & -0.0827068 & 0.424812 & 0.0150376 \\ 0.218045 & -0.0827068 & 0.308271 & 0.0075188 & 0.398496 \\ -0.165414 & 0.424812 & 0.0075188 & 0.597744 & 0.180451 \\ 0.233083 & 0.0150376 & 0.398496 & 0.180451 & 0.56391\end{array}\right)$

$u = \left(\begin{array}{lllll} 0.327517 & 0.309419 & -0.813733 & 0.257097 & -0.262167 \\ -0.0107664 & -0.571797 & -0.464991 & -0.668451 & 0.0994427 \\ 0.538684 & 0.134501 & 0. & 0. & 0.831703 \\ 0.200401 & -0.746715 & 0. & 0.634172 & -0.00904025 \\ 0.749851 & -0.0404178 & 0.348743 & -0.291376 & -0.479133\end{array}\right)$
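For anyone who wants to reproduce these numbers, here is a minimal NumPy sketch. It assumes that H is the orthogonal projector onto the column space of X, which does reproduce the entries printed above (for instance, the (4,4) entry is 159/266 ≈ 0.597744). Because X has rank 2, the sketch uses the pseudoinverse instead of inverting $X^T X$; that choice is mine, not necessarily how H was originally computed.

```python
import numpy as np

X = np.array([[2, 3, 4],
              [1, 0, -1],
              [4, 5, 6],
              [3, 2, 1],
              [6, 7, 8]], dtype=float)

# X has only two independent columns, so X^T X is singular;
# the pseudoinverse still gives the projector onto the column space.
rank = np.linalg.matrix_rank(X)
H = X @ np.linalg.pinv(X)

print(rank)                    # dimension of the subspace
print(np.round(H, 6))          # compare with the H printed above
print(np.allclose(H @ H, H))   # projector check: H is idempotent
print(np.trace(H))             # trace of a projector equals its rank
```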

Let’s try a magic vector, with one missing value, marked NA. This vector came to me in a dream. (Not! But it might as well have. There is no way in the real world I would know this vector.)

$x = \{1,\ 2,\ 3,\ \text{NA},\ 5\}$

Is there a value of NA which would put this vector in the 2D subspace? (Yes, but I know this because I used this vector to construct the data matrix X!)

Let’s just try an iteration, essentially looking for a fixed-point solution. We will start by replacing NA with 0, but I suspect anything would work.

$m0 = \{1,\ 2,\ 3,\ 0,\ 5\}$

Apply the hat matrix H to that vector; we get

$m1 = \{1.66165,\ 0.300752,\ 2.96992,\ 1.60902,\ 4.2782\}$

That’s not very close at all; even the other components are way off. But take that 4th component, and use it in place of NA. That is, we return to the original vector m0, and replace only the missing entry. (H is a projection operator, hence idempotent: $H^2 = H$; applying H to m1 would deliver m1 again. Really.)

So we redefine m1 as…

$m1 = \{1,\ 2,\ 3,\ 1.60902,\ 5\}$

and apply the hat matrix to it. We get

$\{1.3955,\ 0.984284,\ 2.98202,\ 2.57081,\ 4.56855\}$

Better. Again, replace the missing value NA in m0 by the latest 4th component, getting

$\{1,\ 2,\ 3,\ 2.57081,\ 5\}$

Apply the hat matrix…

$\{1.23641,\ 1.39286,\ 2.98925,\ 3.14571,\ 4.7421\}$

It’s going to take a while (27 iterations total), but that process will converge to

$\{1,\ 2,\ 3,\ 4,\ 5\}$

i.e. we decide that the missing value could be replaced by 4. In this case, the vector does lie in the 2D subspace. To see that, just apply H to it; since the components don’t change, the vector must lie in the 2D subspace.
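The iteration just described can be sketched in a few lines of NumPy. The function name `fill_missing`, the starting guess, and the stopping tolerance are all my own choices for illustration, and H is rebuilt from X under the assumption that it is the orthogonal projector onto the column space of X.

```python
import numpy as np

X = np.array([[2, 3, 4],
              [1, 0, -1],
              [4, 5, 6],
              [3, 2, 1],
              [6, 7, 8]], dtype=float)
H = X @ np.linalg.pinv(X)  # assumed: projector onto the column space of X

def fill_missing(H, x, k, start=0.0, tol=1e-6, max_iter=1000):
    """Repeat x[k] <- (H x)[k], leaving the known entries untouched."""
    x = np.array(x, dtype=float)
    x[k] = start                      # initial guess for the missing entry
    for n in range(1, max_iter + 1):
        new_value = (H @ x)[k]        # only the k-th projected component is used
        change = abs(new_value - x[k])
        x[k] = new_value
        if change < tol:
            return x, n
    raise RuntimeError("iteration did not converge")

filled, n_iter = fill_missing(H, [1, 2, 3, np.nan, 5], k=3)
print(np.round(filled, 4), n_iter)
```

The number of iterations reported depends on the tolerance, so it need not match the count quoted above.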

Or maybe you would find it more convincing to compute the new components of that vector; we apply the inverse transition matrix $u^{-1} = u^T$ and get:

$\{6.47289,\ -3.61962,\ 0,\ 0,\ 0\}$

Only the first two components are nonzero: therefore it lies in the 2D subspace.
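Here is a hedged sketch of the same check using a full SVD from NumPy. I am assuming u is the matrix of left singular vectors of X. NumPy may pick different signs for the first two columns and a different basis for the last three, so the individual numbers need not match the ones above; what should agree is that the last three components vanish and that the length of the vector is preserved.

```python
import numpy as np

X = np.array([[2, 3, 4],
              [1, 0, -1],
              [4, 5, 6],
              [3, 2, 1],
              [6, 7, 8]], dtype=float)

# full_matrices=True gives a 5x5 orthogonal u: the first two columns span
# the column space of X, the last three span its orthogonal complement.
u, s, vt = np.linalg.svd(X, full_matrices=True)

x = np.array([1, 2, 3, 4, 5], dtype=float)
components = u.T @ x           # u is orthogonal, so u^{-1} = u^T

print(np.round(components, 5))
print(np.linalg.norm(components[2:]))   # size of the part outside the subspace
```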

So, we have created a plausible value for the missing entry.

I find this convergence to be plausible, but the whole idea isn’t yet important enough for me to prove that it must always converge. So I’m not going to justify that iteration.

Alternatively, suppose we use a symbol and try to find a value which zeroes out a component. Start with

$\{1,\ 2,\ 3,\ \text{NA},\ 5\}$

Get new components by applying $u^T$:

$\left(\begin{array}{l} 0.200401\ \text{NA}+5.67129 \\ -0.746715\ \text{NA}-0.632761 \\ 0 \\ 0.634172\ \text{NA}-2.53669 \\ 0.036161-0.00904025\ \text{NA}\end{array}\right)$

(interesting: the 3rd component is always 0.) Set the last component (it’s either that or the 4th one) to zero…

$0.036161-0.00904025\ \text{NA}=0$

and solve for NA: we get NA = 4.
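Since each new component is linear in NA, solving for it is a one-liner. The sketch below copies columns 4 and 5 of u from the matrix printed earlier (to the 6 digits shown, so the answers are only approximately 4); the helper name `solve_for_missing` is hypothetical.

```python
import numpy as np

# Columns 4 and 5 of u, copied from the printed matrix (rounded values).
u4 = np.array([0.257097, -0.668451, 0.0, 0.634172, -0.291376])
u5 = np.array([-0.262167, 0.0994427, 0.831703, -0.00904025, -0.479133])

k = 3                                          # index of the missing entry
known = np.array([1.0, 2.0, 3.0, 0.0, 5.0])    # missing entry set to 0

def solve_for_missing(col, known, k):
    # The component is col @ known + col[k] * NA; set it to zero and solve.
    return -(col @ known) / col[k]

na_from_u5 = solve_for_missing(u5, known, k)
na_from_u4 = solve_for_missing(u4, known, k)
print(na_from_u5, na_from_u4)
```

The value from the 5th component is the less accurate of the two, because its coefficient of NA is small and the rounding in the printed u is magnified.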

This is a rather novel way of estimating missing values. It depends on our assumption that the data vector lies in, or close to, the 2D subspace.

The mean of the 4 known values was 2.75, and the most common prescription, I think, would have been to use that 2.75 for the missing value. It’s not clear to me that choosing NA to put the data vector into the subspace is any more right than using the mean value, but it is an alternative.

Incidentally, we could have solved the equation for the 4th component being zero. We get the same answer, NA = 4.

It’s crucial that, in this example, there was a value of NA such that the vector was in the 2D subspace. But if there is no such exact solution, I think that the equations for different components would have given different answers. And I’m not sure what the iteration converges to in such a case.

Well, let’s try this on a vector that does not lie in the 2D subspace, our old friend with all 1s. Suppose the 3rd value is unknown:

$\{1,\ 1,\ \text{NA},\ 1,\ 1\}$

I’ll start with 0 as a guess and see what happens. (Remember that the second line, for example, is not the result of applying H to the first line; instead, it is the original vector with only its 3rd component replaced by the 3rd component of H applied to the first line.)

{0, {1, 1, 0, 1, 1}}

{1, {1, 1, 0.541353, 1, 1}}

{2, {1, 1, 0.708237, 1, 1}}

{3, {1, 1, 0.759682, 1, 1}}

{4, {1, 1, 0.775541, 1, 1}}

{5, {1, 1, 0.78043, 1, 1}}

{6, {1, 1, 0.781937, 1, 1}}

{7, {1, 1, 0.782402, 1, 1}}

{8, {1, 1, 0.782545, 1, 1}}

{9, {1, 1, 0.782589, 1, 1}}

{10, {1, 1, 0.782603, 1, 1}}

{11, {1, 1, 0.782607, 1, 1}}

{12, {1, 1, 0.782608, 1, 1}}

{13, {1, 1, 0.782609, 1, 1}}

{14, {1, 1, 0.782609, 1, 1}}

We converged to 0.782609 on the 13th iteration.
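The table above can be regenerated with a short loop. As an added observation of my own (not something taken from Malinowski): because only one entry changes, each step is the affine map $t \mapsto c + H_{33}\,t$, where $c$ is the 3rd component of H applied to the vector with the unknown set to 0. So the limit can also be computed directly as $c/(1 - H_{33})$. The sketch assumes H is the orthogonal projector onto the column space of X.

```python
import numpy as np

X = np.array([[2, 3, 4],
              [1, 0, -1],
              [4, 5, 6],
              [3, 2, 1],
              [6, 7, 8]], dtype=float)
H = X @ np.linalg.pinv(X)   # assumed: projector onto the column space of X

k = 2                                        # the 3rd value is unknown
x = np.array([1.0, 1.0, 0.0, 1.0, 1.0])      # start with 0 as the guess
history = [x[k]]
for _ in range(14):
    x[k] = (H @ x)[k]       # replace only the missing entry
    history.append(x[k])

# Direct fixed point of t -> c + H[k, k] * t
known = np.array([1.0, 1.0, 0.0, 1.0, 1.0])
c = (H @ known)[k]
fixed_point = c / (1.0 - H[k, k])

for n, value in enumerate(history):
    print(n, round(value, 6))
print(fixed_point)
```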

Just what are the new components of that final vector? Apply $u^T$ to it, and we get

$\{1.68858,\ -0.944249,\ -0.929981,\ -0.0685591,\ 0\}$

About all I’m really sure of is that this is an interesting alternative to replacing NA by the mean value of the other components.