Example: Is it a transition matrix? Part 2

We had three matrices from Jolliffe, P, V, and Q. They were allegedly a set of principal components P, a varimax rotation V of P, and a quartimin “oblique rotation” Q.

I’ll remind you that when they say “oblique rotation” they mean a general change-of-basis. A rotation preserves an orthonormal basis; a rotation cannot transform an orthonormal basis to a non-orthonormal basis, and that’s what they mean — a transformation from an orthonormal basis to a non-orthonormal basis, or possibly a transformation from a merely orthogonal basis to a non-orthogonal one. In either case, the transformation cannot be a rotation.

(It isn’t that complicated! If you change the lengths of basis vectors, it isn’t a rotation; if you change the angles between the basis vectors, it isn’t a rotation.)

Anyway, we showed in Part 1 that V and Q spanned the same 4D subspace of R^{10}\ .

Now, what about V and P? Let me recall them:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

As before, I suppose there is a transition matrix T relating them; each column of V^T = V' can be computed by applying a transition matrix T to the corresponding column of P^T\ :

V’ = T P’

i.e.

V = P T’.

Then I imagine briefly that I can “solve” for T’…

T' = P^{-1}\ V\ ,

but then quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (P'P)^{-1}\ P'\ V\ .

I compute T’, and then display its transpose T:

T = \left(\begin{array}{cccc} 0.905145 & -0.386677 & 0.0882059 & -0.141151 \\ 0.853806 & 0.568131 & -0.0875601 & -0.220572 \\ 0.64498 & -0.155196 & -0.506397 & 0.490735 \\ 0.507437 & 0.131036 & 0.67664 & 0.347268\end{array}\right)

and then I test whether

V = P T’ (?).

The largest difference (element by element) between the matrices V and P T’ is 0.170627, which is rather larger (by a factor of 23) than the differences we saw in Part 1. Here’s the full blown matrix — where I chose to round off first, then subtract (i.e. what we’d compute if we were subtracting published, already rounded-off, matrices):

V - P\ T' = \left(\begin{array}{cccc} 0. & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0. & 0 & 0. & 0. \\ 0. & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I am inclined to think there is a probem. But we don’t know much at all about it.

Let’s try one of our alternatives. We construct a projection operator onto the subspace spanned by P.

Get the SVD (Singular Value Decomposition) of P…

P = u\ w\ v^T\ .

Then, since u is 10×10 and V is of rank 4, the leftmost 4 columns (u1) of u….

u1 = \left(\begin{array}{cccc} 0.132041 & 0.280276 & -0.423196 & 0.0366383 \\ 0.24536 & 0.379423 & -0.288143 & -0.0619886 \\ -0.0781059 & 0.252968 & -0.245411 & -0.0642461 \\ -0.310642 & 0.402272 & -0.122076 & 0.641009 \\ 0.19934 & 0.202925 & -0.456543 & -0.212204 \\ -0.275595 & 0.504975 & 0.417433 & 0.130869 \\ 0.0279239 & 0.347505 & 0.199748 & -0.545498 \\ -0.547166 & 0.00466106 & -0.00216808 & -0.298924 \\ -0.571058 & -0.293548 & -0.488675 & -0.061545 \\ -0.278923 & 0.220314 & 0.0368869 & -0.356268\end{array}\right)

are an orthonormal basis for the range of P (i.e. the column space of P). We construct a projection operator, as we have before, by

R = u1\ u1^T\ .

We check it first by confirming that R R = R, i.e. it is idempotent, hence a projection operator. We check it further by applying it to the column vectors in P: we expect that R P = P; i.e. it actually projects onto the column space of P (the space spanned by the columns of P).

Now, having checked that R is a projection operator onto P, we apply it to V, and compute V – R V:

V - R\ V = \left(\begin{array}{cccc} 0.00221622 & 0.0115115 & -0.0249851 & -0.0145408 \\ 0.0137721 & 0.012178 & -0.0243583 & -0.0200425 \\ -0.00564412 & -0.011024 & 0.00933521 & 0.0112507 \\ 0.00222059 & 0.00165751 & -0.00118369 & -0.00423324 \\ 0.00223276 & 0.00638406 & -0.0153346 & -0.0119237 \\ -0.00896396 & -0.0106169 & 0.0272358 & 0.0230479 \\ -0.00941945 & -0.0152489 & 0.0456323 & 0.0330044 \\ 0.0470134 & 0.0703021 & -0.170627 & -0.121383 \\ -0.0232815 & -0.036102 & 0.0941953 & 0.068255 \\ -0.0227796 & -0.0330688 & 0.0740154 & 0.0474348\end{array}\right)

Let me round that to the nearest .01…

V - R V = \left(\begin{array}{cccc} 0 & 0.01 & -0.02 & -0.01 \\ 0.01 & 0.01 & -0.02 & -0.02 \\ -0.01 & -0.01 & 0.01 & 0.01 \\ 0 & 0 & 0 & 0 \\ 0 & 0.01 & -0.02 & -0.01 \\ -0.01 & -0.01 & 0.03 & 0.02 \\ -0.01 & -0.02 & 0.05 & 0.03 \\ 0.05 & 0.07 & -0.17 & -0.12 \\ -0.02 & -0.04 & 0.09 & 0.07 \\ -0.02 & -0.03 & 0.07 & 0.05\end{array}\right)

I should put that array of differences into perspective. Recall V:

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

This reinforces my belief that we have a problem. Some of the differences we found are larger than some of the values of V itself. I can see that V is not in the column space of P, and while it’s not extremely far away, it’s not close either. (This seems more definitive than “the hypothetical transition matrix T doesn’t quite work”.)

As before, however, we are tip-toeing around the real question: Is every column vector in V a linear combination of the columns of P?

Easy enough. Do a regression. Actually 4 of them, one for each column of V.

Let me name the individual columns of the P matrix, and let me set the first column of V (v1) as “y”, and the columns of V as independent variables.

(Some smaller pictures follow, in which you can see that the final option is “IncludeConstantBasis -> False”.) The results are:

Not bad at all. In fact, rather close to perfect.

What was the first row of T?

{0.905145, -0.386677, 0.0882059, -0.141151}.

Yes, those are the \beta s (“Estimates”) for that regression; first row of T is first column of T’ is regression coefficients for the first column of V as the dependent variable.

Let me emphasize another relationship. For the first regression, yhat should be the projection of v1 (the 1st column of V) onto the column space of P. But we have that projection directly as R v1:

R v1 = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

… and we have the yhat from the regression:

yhat = {0.477784,0.476228,0.355644,0.257779,0.487767,
0.058964,0.209419,0.0529866,0.123281,0.19278}

Good, they are the same. So if regression works better for you than the projection operator, then use regression. Personally, I prefer the projection operator because I can apply it to all of V at once. (And because sometimes I have a use for the orthonormal basis u1.)

Largely to display the R^2, let me run the other three regressions. (Oh, I also reduced the magnification in my Mathematica notebook, so the pictures are smaller and the right side of the regression command isn’t cut off, at least on my screen.)

Here’s v2 as the dependent variable:

Here’s v3 as the dependent variable:

Here’s v4 as the dependent variable:

We see that the R^2 is over 99% for v2, falls to 95% for v3, and is 97.6% for v4. Those last two are just not large enough, not for this question.

I’m certain that V and P do not span the same subspace, but that’s about all I know.

We could, however, have made the dual choice, projecting P onto V instead of V onto P. Are they the same?

No.

We already know — from the T matrix, from the projection operator, or from the regressions — that the P and V matrices do not represent the same data in two different coordinate systems. But there might be — and in this case, there is! — more information available by making the other choice.

Let’s get the projection onto V instead of onto P. We need the SVD of V…

V = u\ w\ v^T\ .

We need the 4 leftmost columns (u1) of u…

u1 = \left(\begin{array}{cccc} -0.292636 & 0.0434068 & -0.33927 & 0.269896 \\ -0.274013 & 0.218205 & -0.238496 & 0.365998 \\ -0.337357 & -0.00723082 & -0.119568 & 0.0745101 \\ -0.355459 & 0.224109 & -0.415783 & -0.490416 \\ -0.2739 & -0.0500253 & -0.225463 & 0.452829 \\ -0.308498 & 0.463936 & 0.232078 & -0.416388 \\ -0.287334 & 0.227908 & 0.427185 & 0.212631 \\ -0.303344 & -0.281983 & 0.503807 & 0.152613 \\ -0.329068 & -0.741253 & -0.129874 & -0.304713 \\ -0.382183 & -0.0396735 & 0.288053 & -0.0857624\end{array}\right)

We construct the projection operator…

R = u1\ u1^T

Now we apply it to P, and compute the difference P – R P:

P - R P = \left(\begin{array}{cccc} 0.00186923 & -0.00646007 & -0.00165699 & 0.0293708 \\ -0.00396322 & 0.00598219 & 0.00143136 & 0.0162366 \\ 0.0035053 & 0.00255499 & -0.00525352 & -0.0232869 \\ 0.000382829 & 0.00307153 & 0.00267228 & 0.124774 \\ 0.000923133 & -0.00219591 & 0.00142733 & -0.0228012 \\ 0.0000472374 & -0.00300758 & -0.00291228 & -0.00287352 \\ -0.00462961 & -0.00220958 & -0.0000400809 & -0.149176 \\ 0.00305495 & 0.00376593 & -0.000439512 & 0.29618 \\ -0.00367312 & -0.00153518 & -0.000838857 & -0.106747 \\ 0.00147883 & -0.000459218 & 0.00482354 & -0.141981\end{array}\right)

So what?

Round it off:

P - R P = \left(\begin{array}{cccc} 0 & -0.01 & 0 & 0.03 \\ 0 & 0.01 & 0 & 0.02 \\ 0 & 0 & -0.01 & -0.02 \\ 0 & 0 & 0 & 0.12 \\ 0 & 0 & 0 & -0.02 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & -0.15 \\ 0 & 0 & 0 & 0.3 \\ 0 & 0 & 0 & -0.11 \\ 0 & 0 & 0 & -0.14\end{array}\right)

Ah ha! It’s time for Robin to squeak out, “Holy moly, Batman!”

The fourth column of P does not lie in the subspace spanned by the 4 columns of V, while the first three do. That is the problem. I don’t why or where it happened, but I know exactly what’s wrong: the 4th column of P is the only problem.

Somebody blew it somewhere. (OK, since I don’t know how a varimax rotation is computed, I have to entertain the possibility that V is not supposed to be P under a change of basis. But I can’t believe that, because V and Q have a common column space, and 3 of the 4 columns of P lie in that common column space. If that’s an accident, what does a deliberate act look like?)

Is the mistake in the orignal paper? Were P and V computed from the same correlation matrix but by independent algorithms? Or was V computed incorectly from P (and then Q computed from V)?

Or did someone copy P incorrectly? (P should be primary, but then how the heck does the 4th column of P fail to be in the column space of V? I would sooner have expected an error in V, given P.)

Or did they rotate the first 3 PCs, and then apply that transformation to the 4th column? (No. I’ve checked. You can too.)

Let me emphasize that we would get the same result from four regressions each with one column of P as the dependent variable and all the columns of V as the independent variables. The first three regressions would be extremely good fits, but the fourth would not be. We would, again, conclude that the fourth column of P was not exactly — or even to a fair approximation — a linear combination of the columns of V.

Summary: theory

Given two matrices, say P and V, which are alleged to be the same data in different coordinate systems, we may check that assertion very quickly (if both are of full rank):

Define

T' = (P'P)^{-1} P'\ V\ .

If we find it true that

V’ = T P’,

then T is a transition matrix between two bases, and P and V are indeed the same data in different coordinate systems. We’re done.

If, however, the equality fails, V' \ne T\ P'\ ,

then T is not a transition matrix, and P and V are not the same data in different coordinate systems. Unless we want more information, we’re done.

Finally, we have seen that, numerically, we may have T “close to” a transition matrix. (In Part 1, V and Q are very nearly the same data; in the color posts, the xyx bar and rgb bar tables are very nearly the same data.)

Summary: practice

Computing T tells us if something is wrong, but not much more. It’s quick, easy, and gives a simple answer: “good” or “no good”.

Computing both projection operators P onto V and V onto P (not just one of them) would likewise tell us whether or not something is wrong, and, in addition, the projection operators might isolate the problem.

We could use regression instead of computing explicit projection operators.

Summary: this example

We found that V and Q are the same data in two different coordinate systems, but that P is not. In addition, we actually know that the problem is limited to the 4th column of P: it does not lie in the space spanned by the columns of V (i.e. it does not lie in the column space of V, which is also the column space of Q).

Question:
Can someone tell me if these P, V and Q match the original paper by Yule et al.?

And there’s a question you might have.
How many examples did I have to search through to find such a good (because it was such a bad) example?

One. This was the very first case I checked, outside of the 1931 CIE tables.

That’s scary.

Example: Is it a transition matrix? Part 1

This example comes from PCA / FA (principal component analysis, factor analysis), namely from Jolliffe (see the bibliography). But it illustrates some very nice linear algebra.

More precisely, the source of this example is:
Yule, W., Berger, M., Butler, S., Newham, V. and Tizard, J. (1969). The WPPSL: An empirical evaluation with a British sample. Brit. J. Educ. Psychol., 39, 1-13.

I have not been able to find the original paper. There is a problem here, and I do not know whether the problem lies in the original paper or in Jolliffe’s version of it. If anyone out there can let me know, I’d be grateful. (I will present 3 matrices, taken from Jolliffe; my question is, does the original paper contain the same 3 matrices?)

Like the previous post on this topic, this one is self-contained. In fact, it has almost nothing to do with PCA, and everything to do with finding — or failing to find! — a transition matrix relating two matrices.

On p. 163, Jolliffe provides a matrix of “principal components”:

P = \left(\begin{array}{cccc} 0.34 & -0.39 & 0.09 & -0.08 \\ 0.34 & -0.37 & -0.08 & -0.23 \\ 0.35 & -0.1 & 0.05 & 0.03 \\ 0.3 & -0.24 & -0.2 & 0.63 \\ 0.34 & -0.32 & 0.19 & -0.28 \\ 0.27 & 0.24 & -0.57 & 0.3 \\ 0.32 & 0.27 & -0.27 & -0.34 \\ 0.3 & 0.51 & 0.19 & 0.27 \\ 0.23 & 0.22 & 0.69 & 0.43 \\ 0.36 & 0.33 & -0.03 & 0.02\end{array}\right)

On the same page, he provides two matrices of the “rotated factor loadings”. One is “Varimax”…

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

… and the other is “direct quartimin”:

Q = \left(\begin{array}{cccc} 0.51 & -0.05 & 0.05 & 0.05 \\ 0.53 & 0.04 & 0.05 & -0.14 \\ 0.32 & 0.13 & 0.16 & 0.15 \\ 0.17 & -0.19 & 0.65 & 0.2 \\ 0.54 & 0.06 & -0.13 & 0.05 \\ -0.07 & 0.28 & 0.67 & -0.12 \\ 0.16 & 0.53 & 0.13 & -0.17 \\ 0.03 & 0.62 & -0.09 & 0.26 \\ 0. & 0.09 & 0.02 & 0.87 \\ 0.08 & 0.45 & 0.24 & 0.21\end{array}\right)

I presume — not knowing how to comute a varimax rotation! — that each of these matrices represents the same data, i.e. the same 4D subspace of R^{10} (since we have 4 column vectors of length 10 in each matrix). It would seem silly if a varimax rotation of PCs was not supposed to represent rotated PCs within the same subspace.

That presumption is wrong: they do not all represent the same subspaces.

How do I know this?

Because given two matrices of the same shape, I know how to find the transition matrix if it exists. We did this here.

Let me elaborate on what I’m doing. Imagine that our matrices had 2 columns and 3 rows. Those 3D column vectors lie in a 2D subspace — a plane — in R^3\ . By doing 2D transformations within that plane, I might find a particularly nice basis for representing that data.

If, instead, I apply a 3D transformation, I would be moving that data into a different plane. What I am doing here is looking at 4D transformations within the 4D subspace. Everything is predicated on my belief that the PCs in P, and the varimax rotation of them in V, and the direct quartimin data in Q are all supposed to be in the same subspace.

Not knowing how they were supposed to be computed, I could be seriously wrong here. OTOH, since we will find that V and Q are the same subspace, and the first 3 of the 4 columns of P also lie in that common subspace, I really believe that the 4th column of P should too. That it does not strikes me as an error.

Let me make one simplification up front. Are the matrices P, V, Q of full rank (4)?

Yes. The Mathematica command MatrixRank says that each of those matrices is of rank 4.

Let us consider V and Q. If the observations in each are related by a transition matrix T, and if we arbitrarily take V to be the “old” data, then each column of V’ (the transpose of V) is found by applying the transition matrix to the corresponding column of Q’.

V’ = T Q’

i.e.

V = Q T’.

Then, as I showed before, I imagine briefly that I can “solve” for T’…

T' = Q^{-1} V\ ,

but quickly replace the non-existent inverse of Q by the appropriate pseudo-inverse

T' = (Q'Q)^{-1} Q'\ V\ .

(It’s worth repeating that the inappropriate pseudo-inverse would use QQ’, but that can’t work because QQ’ is 10×10 but of rank 4 at most, hence not invertible.)

If Q’Q is invertible, that recipe always gives me a matrix T, but T need not be a transition matrix. The resulting T is a transition matrix if and only if we actually have

V = Q T’,

(given that V and Q are of the same full rank).

Do we have V = Q T’ ?

By direct computation I get

T = \left(\begin{array}{cccc} 0.9273 & 0.0921888 & 0.15161 & 0.101824 \\ 0.22944 & 0.864017 & 0.176744 & 0.0556721 \\ 0.251353 & 0.0599739 & 0.913081 & 0.0549228 \\ 0.185642 & 0.11154 & 0.00492819 & 0.941728\end{array}\right)

and then I look at V – Q T’…. You know, it’s easy enough to put them side by side.

V = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.15 & 0.18 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0. & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right) Q\ T' = \left(\begin{array}{cccc} 0.48 & 0.09 & 0.17 & 0.14 \\ 0.49 & 0.16 & 0.17 & -0.03 \\ 0.35 & 0.22 & 0.24 & 0.22 \\ 0.26 & 0 & 0.64 & 0.2 \\ 0.49 & 0.16 & 0.02 & 0.15 \\ 0.05 & 0.34 & 0.6 & -0.09 \\ 0.2 & 0.51 & 0.18 & -0.07 \\ 0.1 & 0.54 & -0.02 & 0.32 \\ 0.1 & 0.13 & 0.07 & 0.83 \\ 0.17 & 0.46 & 0.28 & 0.26\end{array}\right)

Even the difference of the rounded-off matrices is satisfying small. That is,
Round[V,.01]-Round[Q.TT,.01]//Chop//MatrixForm is

\left(\begin{array}{cccc} 0 & 0 & 0 & 0 \\ 0 & -0.01 & 0.01 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0\end{array}\right)

Judging that the difference is negligible, I conclude that the matrices V and Q do span the same subspace of R^{10}\ ; they are the same 4D subspace with two different coordinate systems (two sets of basis vectors).

While we’re here, let’s do this another way. After all, the fundamental question is: can the columns of V be written as linear combinations of the columns of Q? Alternatively, can each column of Q be written as a linear combination of the columns of V?

Bear in mind that we have already established that Q and V are of the same full rank. In that case, the alternatives are equivalent.

We can answer the fundamental question by trying to regress each column of V on Q. Just remember to tell the regression to drop the constant!

To be specific, here’s the first column of V as a linear combination of the 4 columns of Q. Here’s the parameter table and R^2 — not the adjusted R^2\ : I really want to know how good this fit is, never mind how many variables it has.

Here’s a snapshot of the Mathematica commands (version 7). (Oh, darn. The last option in the LinearModelFit command, which replaced “Regress” is “IncludeConstant->False”. And I print the names just to confirm that none of those symbols has a left-over numerical value!)

(Let me just remind you that the R^2 is a measure of a how close the fit is. When we add another independent variable, however, the R^2 cannot decrease; it is not a good measure of the value — i.e. the worth — of that latest independent variable. The adjusted R^2 can decrease; in fact, in my experience, if the t-statistic of that latest variable is greater than 1 (in magnitude), then the adjusted R^2 goes up. As I said, in this case, I really want the R^2 rather than adjusted R^2 because I know exactly how many independent variables I am supposed to use. That said, there isn’t a whole lot of difference between the two in this case.)

v1 on Q

That’s a really good fit. And it should be. And that column of \beta\ , i.e. “Estimate”? It’s the first column of T’, i.e. the first row of T:

{0.9273, 0.0921888, 0.15161, 0.101824}

We could run 3 more regressions, but they’d only confirm what we believe: T’ could, in fact, be computed as regression coefficients. If the two subspaces coincide, then T is also a transition matrix between two bases in one and the same space.

There is yet another way to look at that. We know that in regression, yhat is the projection of y onto the subspace spanned by the columns of X. If y itself is in that subspace, then yhat = y, and the relatonship is given by a transition matrix.

So we could do this a third way. We know how to compute a projection operator onto the subspace spanned by the columns of X. In this case, X is Q, so we find its SVD (Singular Value Decomposition)…

Q = u\ w\ v^T

Then the 4 leftmost columns (u1) of u are an orthonormal basis for the range of Q….

u1 = \left(\begin{array}{cccc} -0.203473 & 0.27677 & 0.353491 & 0.188167 \\ -0.188235 & 0.423093 & 0.310857 & 0.0177813 \\ -0.30583 & 0.10025 & 0.140569 & 0.094808 \\ -0.313324 & 0.177504 & -0.285416 & 0.609666 \\ -0.194156 & 0.220796 & 0.497051 & 0.0101643 \\ -0.354903 & 0.228756 & -0.604256 & -0.00173818 \\ -0.320761 & 0.191153 & -0.0758258 & -0.463325 \\ -0.379019 & -0.300366 & 0.0912622 & -0.459411 \\ -0.365508 & -0.681435 & 0.196437 & 0.360653 \\ -0.43321 & -0.103153 & -0.111086 & -0.166404\end{array}\right)

That is, u1′ u1 = I, if the column space of X is of full rank. Check it:

u1^T u1 = \left(\begin{array}{cccc} 1. & 0 & 0 & 0 \\ 0 & 1. & 0 & 0 \\ 0 & 0 & 1. & 0 \\ 0 & 0 & 0 & 1.\end{array}\right)

Bur the reversed matrix product

R = u1 u1′

gives us a projection operator R onto the range of Q:

We can check it. (I won’t print it: it’s 10×10.) One, R is a projection operator if and only if it is idempotent, i.e. R R = R. And by computation, I see that is true.

Two, applied to each column of Q, it reproduces that column of Q; i.e. applied to Q as a whole, it reproduces Q, i.e. R Q = Q. Again, computation shows that is true, too.

Having checked that R is a projection onto the column space of Q, let’s use it: apply it to V. If V lies in the subspace spanned by the columns of Q, then R V will reproduce V….

I find that the largest difference between the elements of R V and V is 0.00720726 .

That’s as close as the other two calculations. Not exactly zero, but then, our regression didn’t fit exactly: like the RGB and XYZ tables in the earlier post, V and Q are almost but not exactly related by a transition matrix.

So we have seen three ways of looking at this. All three tell the same story: the column spaces of V and Q are the same.

Let me say a little more about two of them, the projection operator and the explicit regression.

The projection operator R came from the SVD. The regression can also be written using a projection operator, namely the hat matrix H. I keep getting mixed up about the relationship between T and R and H, so let me write it out.

The normal equations for

\hat{y} = X\ \beta

are

\beta = (X'X)^{-1}\ X'\ y\ .

(I showed you a quick and dirty way to recover that equation in the earlier post.)

Then

\hat{y} = X\ \beta = X\ (X'X)^{-1}\ X'\ y

and so we may write

yhat = H y

with the hat matrix H defined as

H = X\ (X'X)^{-1}\ X'\ .

For our regression, we took y to be the first column of V, and X to be the four colums of Q, so our hat matrix would be

H = Q\ (Q'Q)^{-1}\ Q'\ .

It is a projection operator onto the range of Q. But that’s exactly what R was. Go ahead and check: you will find that

H = R.

This means that we could compute R as the hat matrix, instead of using the SVD. Personally, I know how to use the SVD, whereas I would have to work out the definition of the hat matrix every time. But that’s a preference; you may choose to compute the hat matrix instead. (And that is how Malinowski was doing it!)

Going a little further, T’ was

T' = (Q'Q)^{-1}\ Q'\ V

so we have

Q T’ = H V.

We’ll look at the relationship between P and V in Part 2.

regression example 2

a smaller range

I want to show you something interesting. What if our data had spanned a much smaller range? Let the x’s range from 1 to 2 instead of 1 to 4:

{1.,\ 1.1,\ 1.5,\ 1.7,\ 2.}

I use the very same u values. And the same model: my data is generated as y = 2 + x^2 + u\ . My u values are still

u = {-0.0494377,\ 0.0900309,\ -0.0316807,\ 0.0148669,\ -0.0842221}

and my (x, y) points become:

\left(\begin{array}{ll} 1. & 2.95056 \\ 1.1 & 3.30003 \\ 1.5 & 4.21832 \\ 1.7 & 4.90487 \\ 2. & 5.91578\end{array}\right)

Here is a drawing of the true function without noise (y = x^2 + 2) and the (x,y) points with noise.

Over that span, a straight line would not be a bad fit. In fact, it would be very good. (There’s just not all that much curvature over this small a range. Ah, FWIW, the R^2 for the straight line would be 0.992583, not what you’d think of as far from 1. I cringe when I see R^2 less than 0.9, because I know how far off I can be in simulations and get 0.99.)

But for now, i’m going to plunge ahead. I know the data came from a parabola, so i’m going to fit one; but i’m going to forget that there’s no “x” term. That is, as before y = a + b\ x + c\ x^2\ , i.e. y = \beta_1 + \beta_2\ x + \beta_3\ x^2\ .

Here’s the equation Mathematica gives me. Instead of 2 + x^2 we have

y = 1.63209 + 0.597983\ x + 0.772142\ x^2

That’s pretty far off. 1.6 instead of 2, .6 instead of 0, and .77 instead of 1. As I said, over this small a range of x, we don’t have a lot of evidence for the curvature of the data. But, as before, the t-statistic on the linear term x is less than 1…

…so we drop the linear term and get a new regression. Our new equation is

y = 2.05047 + 0.972441\ x^2

so dropping the x term has gotten us much, much closer to the true model: 2.05 instead of 2 and .97 instead of 1.

What do the R-squares tell us? Old and new R^2 are

\{0.998042,\ 0.997672\}

That says R^2 went down when I dropped a variable. Alternatively, R^2 was higher when I had one more variable. That’s what’s supposed to happen.

The fit with 3 variables has slightly smaller total error than the fit with two variables. In this case the difference is tiny, but if our goal is the best interpolation, then the higher R-squared equation may be preferable to the lower.

What about the adjusted R-squares? Old and new are

\{0.996085,\ 0.996896\}

The newer one is the higher; in some sense, the newer fitted equation is more likely true.

Finally, for the newer fit, all the t-stats (all two of them!) are larger than 1.

Incidentally, the estimated variance of the noise is 0.0044978. Recall that the sample variance of the actual u is 0.00453426 . As before, the model with the highest adjusted R^2 has the best estimate of the noise variance. (But we know this only because we know the true errors u.)

By looking at the t-statistics and the adjusted R^2, we have recovered the model used to construct the data. (They go hand-in-hand: it is approximately true that adding a variable whose t-statistic is greater than 1 (in absolute value) results in a higher adjusted R^2.)

instead, a cubic

You know, I just can’t resist fitting a cubic to the data. (I said I like playing with regression.) And it’s an arbitrary cubic, i.e. y = \beta_1 + \beta_2\ x + \beta_3\ x^2 + beta_4\ x^3\ .

The fitted equation is

y = -1.09742 + 6.35766\ x -3.12847\ x^2 + 0.852566\ x^3

and the parameter table is

Note that all of the t-statistics are small, less than 1 (in absolute value). That’s what I wanted to show you: all the t-statistics going to hell.

even worse, the two smallest t-statistics are for the constant and for the x^2 term, i.e. for the two correct variables!

(I didn’t plan that, but it was too good to pass up.)

That cubic regression does not give us good clues about what to do next. Had it been our first regression, we might as well not have run it. Because it comes after a few smaller ones, we know that removing x^3 will fix things. What we’re not sure of, in principle, is whether removing x, for example, will also fix things.

This is why, in general, I do stepwise regression: I start by running all possible regressions with a constant term and one variable, and I choose the fit that has the highest adjusted R^2. then I add one more variable – one at a time – and pick the equation with the highest adjusted R^2, assuming it’s larger than the previous stage. At each stage I add the variable which gives me the highest adjusted R^2.

It would be called forward regression (or forward selection) if that were all I did, but I also remove an earlier variable if its t-statistic becomes small. (Yes, that can happen.) In principle I could stop if the new adjusted R^2 decreases, but in fact I usually keep going so that I can look at the models with “too many” variables, too.

(So, when I started all this by fitting y = \beta_1 + \beta_2\ x + \beta_3\ x^2\ to the data, I was not following my own general guidelines. Hey, it’s just a polynomial in one variable. In this case, we could actually run every possible regression. For n variables, there are 2^n possible regressions. Since we only have 5 data points, we cannot fit higher than a quartic, so the maximum number of variables is 5 and there are 32 possible regressions.)

Now, I do not know if it has been shown that stepwise regression always finds the best possible regression. For a few variables, when it has been feasible for me to compute all possible regressions, I have seen that stepwise found the best one.

It seems to me that there is so much freedom in choosing and constructing variables, that stepwise is more than good enough: at some point, time and energy should be devoted to exploring transformations of the variables. To put that another way, if you do stepwise regression and the answer is not satisfactory, finding, transforming, or constructing more variables is more important than whether stepwise got the best possible answer. I’m sure it’s nearly optimal if not in fact optimal.

If, instead, we start with all of the variables first, and remove variables one at a time, that’s backward regression. My objection is that we would be starting with a bunch of t-statistics which cannot reliably tell us what to remove first. In this example, if we leave the constant, despite its low t-statistic, we would presumably drop the x^2 term first, which is exactly the wrong term to drop.

(There are some arguments for always leaving a constant term in, even when its t-statistic is small. One, it acts as a surrogate for all the variables we omitted; not just the ones we measured and omitted, but all the things we didn’t even measure. Two, the definitions of R^2 and adjusted R^2 need to be changed if there is no constant.)

Having said all that, the R^2 is higher for the cubic fit, as it should be: Here are the R^2 for constant & x^2… constant, x, and x^2… and the cubic…

\{0.997672,\ 0.998042,\ 0.998291\}

so the cubic might actually be a better interpolating equation than those with fewer powers of x. it depends on what our purpose is. BTW, when I say interpolation, I do not mean to include extrapolation; I do mean restricting x between the outermost x-values of the data, in this case between 1 and 2.)

the SVD

There is something else we should do with that cubic. When more than one of the t-statistics gets small upon adding one variable, that’s a strong signal that the design matrix is “multicollinear”: it’s of full rank but not by much. (Had it not been of full rank, we could not have computed the inverse of X^T\ X\ )

What is the real-valued rank of our design matrix with the cubic term added? That’s a job for the Singular Value Decomposition (SVD). If you haven’t look at the SVD, you might want to. here

For that cubic, the design matrix is…

X = \left(\begin{array}{llll} 1 & 1. & 1. & 1. \\ 1 & 1.1 & 1.21 & 1.331 \\ 1 & 1.5 & 2.25 & 3.375 \\ 1 & 1.7 & 2.89 & 4.913 \\ 1 & 2. & 4. & 8.\end{array}\right)

and the w-matrix from its singular value decomposition X = u\ w\ v^T\ , is…

w = \left(\begin{array}{llll} 12.1472 & 0. & 0. & 0. \\ 0. & 1.71458 & 0. & 0. \\ 0. & 0. & 0.182043 & 0. \\ 0. & 0. & 0. & 0.00503654 \\ 0. & 0. & 0. & 0.\end{array}\right)

Sure enough. There are 4 nonzero values of w (rank is 4), but the smallest is only .005: X is of rank 4, but it differs from a matrix of rank 3 by just .005.

Although x, x^2, and x^3 are linearly independent in principle, in this case there is a nearly linear relationship among the columns of X. We know that the problem is fixed if we drop x^3. as far as I know, so far, the SVD doesn’t tell us that. It doesn’t tell us if we have a near-linear relationship between x^2 and x^3 (which is my guess) or between all three powers of x, or some other possibility. I’m keeping an open mind on whether there’s a way to extract more information about just what is related to which. In fact, open mind be damned, there’s got to be a way (short of regressing the x’s on each other)….

Incidentally, it is an accident that the smallest w is close to the noise variance: the x’s themselves are not noisy.

In contrast to our PCA / FA examples, this design matrix has a column of 1s, i.e. a column whose variance is exactly 0. but I don’t want to remove it, before computing the SVD, because it’s conceivable that the problem is that some combination of the other columns adds up to 1.

finally, the 3rd singular value (.18 ) is two orders of magnitude smaller than the largest. Maybe we should drop two variables instead of just one. So far, I cannot see that the SVD tells me which variables to drop, but it seems to suggest that the best model has only two variables.

gee, it does: constant and x^2.

Let me close by re-iterating: get a good book (e.g. Ramanathan or Draper & Smith). The normal equations will pretty much give you an answer unless you broke them by building a design matrix of less than full rank. You need to know how to assess the answer which the normal equations gave you, and the clues may not always be right in your face.

Regression Example 1

setup

Let’s just fit a parabola to noisy data. Instead of using real data I will manufacture some. First I pick some x values; you might note that they do not have to be equally spaced.

x = \{1.,\ 1.1,\ 1.5,\ 3.,\ 4.\}

Now I generate 5 random variables u from a normal(0, .05) distribution. (the 2nd param is std dev, not variance.) my data is generated as

y = 2 + x^2 + u\ .

that is, my 5 noise values are

u = \{-0.0494377,\ 0.0900309,\ -0.0316807,\ 0.0148669,\ -0.0842221\}

and my 5 data values are

y = \{2.95056,\ 3.30003,\ 4.21832,\ 11.0149,\ 17.9158\}

Here, then, are my (x,y) points:

\left(\begin{array}{ll} 1. & 2.95056 \\ 1.1 & 3.30003 \\ 1.5 & 4.21832 \\ 3. & 11.0149 \\ 4. & 17.9158\end{array}\right)

Here is a plot of the true function without noise (y = x^2 + 2) and the (x,y) points (with noise).

In the real world, of course, what we have is only the 5 points; we want to fit a curve to them. I’m going to do this “by hand”, as I showed you in the expository post.

“by hand”

Let’s confirm all that, “by hand” as it were. I am going to just jump in and fit a general polynomial of degree 2: y = a + b\ x + c\ x^2\ (i.e. y = \beta_1 + \beta_2\ x + \beta_3\ x^2), following the computations described in the expository post.

i need a design matrix whose first column is 1s, whose second column is the x’s, and \[Dash] because I want to fit a quadratic \[Dash] whose 3rd column is the squares of the x’s. That is, I construct the design matrix X

X =\left(\begin{array}{lll} 1 & 1. & 1. \\ 1 & 1.1 & 1.21 \\ 1 & 1.5 & 2.25 \\ 1 & 3. & 9. \\ 1 & 4. & 16.\end{array}\right)

We compute X^T\ X\ … and its inverse…

X^T\ X = \left(\begin{array}{lll} 5. & 10.6 & 29.46 \\ 10.6 & 29.46 & 96.706 \\ 29.46 & 96.706 & 344.527\end{array}\right)

\left(X^T\ X\right)^{-1} = \left(\begin{array}{lll} 7.43111 & -7.48067 & 1.46434 \\ -7.48067 & 7.96246 & -1.59534 \\ 1.46434 & -1.59534 & 0.325488\end{array}\right)

Incidentally, that is the step for which you probably do not want to write your own code: you want someone else to provide code for computing the inverse of a matrix.

having the inverse, we compute the \beta from the normal equations:

\beta = \left(X^T\ X\right)^{-1}\ X^T\ y

and get

\beta = \{1.92723,\ 0.0964893,\ 0.975582\}

That is, we have just fitted the equation…

y = 1.92723 + 0.0964893\ x+0.975582\  x^2

to our data. We can plot it, but we shouldn’t see anything surprising. After all, the coefficients of the fitted equation are 1.97 instead of 2, .096 instead of 0, and .976 instead of 1. Pretty close.

Now we compute the predicted values from our equation; we are just using the data values of X in the fitted equation:

\hat{y} = X\ \beta = \{2.9993,\ 3.21382,\ 4.26702,\ 10.9969,\ 17.9225\}

The residuals are the difference between the predictions and the actual data. I never remember which is subtracted, but I write

y = X\ \beta + e = \hat{y} + e\ ,

and then I see that it’s y - \hat{y}  = e\ .

so we compute the e’s and the sum of squares of the e’s (the residuals and the error sum of squares):

e = \{-0.0487348,\ 0.0862128,\ -0.0486997,\ 0.0179365,\ -0.00671478\}

SSE = e \cdot e = 0.0125462

We should plot the residuals (the 5th one is on the “5″ on the x-axis).

For our design matrix, we have n = 5 observations and k = 3 variables. Then the “estimated error variance” SSE/(n-k) is…

= s^2 = \frac{SSE}{5-3} = 0.0062731

Let’s get the total sum of squares. We compute the mean of the y values, center the data by subtracting the mean, and get the sum of squares (as the dot product of two zero-mean vectors).

The mean of the y’s is 7.87991; after subtracting it, we get centered data

yc = \{-4.92935,\ -4.57988,\ -3.66159,\ 3.13496,\ 10.0359\}

and then

SST = yc \cdot yc = 169.228

Now we can compute the R^2:

R^2 = 1 - \frac{SSE}{SST} = 0.999926

and the adjusted R2:

adjR2 = 1-\frac{SSE/n-k}{SST/n-1} = 1 - \frac{SSE/2}{SST/4} = 0.999852

The covariance matrix of the \beta\ s is

\left(\begin{array}{lll} 0.0466161 & -0.046927 & 0.00918597 \\ -0.046927 & 0.0499493 & -0.0100077 \\ 0.00918597 & -0.0100077 & 0.00204182\end{array}\right)

and, in particular, the standard errors are the square roots of the diagonal elements of c:

se = \{0.215908,\ 0.223493,\ 0.0451865\}

i remark that although the matrix inverse X^T\ X^{-1} was one of the first things we computed, we need the estimated error variance s2, computed much later, in order to get the standard errors.

The t-statistics are the \beta\ s divided by the standard errors:

t = \frac{\beta}{se} = \{8.92616,\ 0.431732,\ 21.5901\}

Let me remark that some people present the \beta\ s and the t-statistics, others present the \beta\ s and the standard errors. No problem: given any two of the \beta\ s, t-statistics, and standard errors, you can compute the third.

let Mathematica® do it

Here’s where you use whatever you’ve got available; of all my choices, I’ll go with Mathematica®. Just in case you’re also using Mathematica®, here’s the specific command I used, and the output:

(“data” consists only of the (x,y) data; “{1,x,x^2}” tells Mathematica® what design matrix to construct; the “x” after that says that the first column of “data” is “x”; the “Clear” command is because I need “x” to be a symbol, not a list of numbers. I wouldn’t have this problem if I didn’t insist on using “x” as the name of the independent variable, i.e. doing double duty.)

The first line, BestFit, shows us the fitted equation. Again, just in case you’re using Mathematica®, here’s the command that extracted the BestFit and the result:

That’s exactly what we computed by hand. Similarly, the FitResiduals are the residuals e:

\{-0.0487348,\ 0.0862128,\ -0.0486997,\ 0.0179365,\ -0.00671478\}

The PredictedResponse are the 5 computed values, yhat, used to compute the residuals.

\{2.9993,\ 3.21382,\ 4.26702,\ 10.9969,\ 17.9225\}

In the ParameterTable, “Estimate” refers to \beta\ , “SE” is its standard error, and TStat is its T-statistic.

PValue is a probability corresponding to the t-statistic. I don’t remember whether Mathematica® is doing a 1-sided or 2-sided test and the documentation doesn’t seem to say. I don’t really care: my rule of thumb is to compare the (absolute value of the) t-statistic to 1. If I ever need to know exactly what PValue means, I’ll grab a t-distribution from one of my stat books and see what Mathematica® did.

Mathematica’s standard errors and t-statistics, R^2, adjusted R^2, and estimated error variance all agree with our earlier computations.

Finally, the ANOVAtable contains, among other things, the error and total sum of squares. (i’ll remark that for experimental data, the F-test can more useful than the R^2 and the adjusted R^2, because repeated x’s with different y’s means that we have some data points on vertical lines; we cannot make the error sum of squares zero. That in turn means that the R^2’s cannot be 1. that in turn means that we can’t tell what constitutes “a good” R^2.)

Our computed SSE and SST were

\{0.0125462,\ 169.228\}

respectively, and they agree with Mathematica®.

better model, worse fit

So, I have confirmed that my recipe matches Mathematica®. So much for computation. What about the results themselves?

The parameter table, or our own calculations, showed a t-statistic of 0.431732 for the x term in the fitted equation. That the t-stat for x is less than 1 suggests that the linear term vanishes. That’s very nice, since we know that true model did not have a linear term. We created the data without a linear term. It’s the small t-statistic that says: for this data there’s a high probability that \beta = 0\ .

i cannot overemphasize that if our goal is the best interpolation, instead of finding the “true model”, we might choose to use the equation we have.

but, I want to find the true model, so let’s drop the x term; and let Mathematica® do it all for us:

Our two fitted equations, then, are

so dropping the x term has gotten us a little closer to the true equation.

What do the R-squares tell us? Old and new are {0.999926, 0.999919}, so the new one is very slightly smaller. In absolute terms, the new fit is not as good as the first one. In this case the difference is tiny – miniscule, even – , but if our goal is the best interpolation, then a higher R-squared equation may be preferable to a lower.

What about the adjusted R-squares? Old and new are {0.999852, 0.999892}. The newer one is the higher; in some sense, the newer fitted equation is more likely true. Again, in this case, the difference is damned small.

Finally, for the newer ft, all the t-stats (all two of them!) are larger than 1. That’s why we might go with the new equation.

Incidentally, the estimated variance for the equation is smaller, 0.00457182 vs. 0.0062731.

and, FWIW, the sample variance of our noise – which we only know because we created this data – was 0.00453426.

The second equation has a smaller estimated error variance, even though the first equation has smaller errors in total. This is hand in glove with “closeness of fit” versus “true model”, R^2 versus adjusted R^2.

BTW, had I done the newer fit “by hand”, the design matrix would have had two columns: 1s and the squares. It’s ok to drop the x’s themselves. This is exactly how we specify a model, by computing and using whatever columns we choose. To be specific,

\left(\begin{array}{ll} 1 & 1. \\ 1 & 1.21 \\ 1 & 2.25 \\ 1 & 9. \\ 1 & 16.\end{array}\right)

There you are. That’s how to fit y = a + b x^2 to the data.

Regression (OLS, ordinary least squares) Exposition

Introduction

OK, suppose you know how to fit a line to a collection of (x,y) data points. There are some questions that people ask next:

  1. how would I fit a parabola to that data?
  2. how would I fit a plane to a collection of (x,y,z) data?

and, of course, one wants to know how to do more complicated fits:

  1. cubic and higher degree polynomials.
  2. higher dimensional “planes” (i.e more independent variables).

Fitting a line to (x,y) data is relatively straightforward, but the usual prescription gives no clue about how to generalize it. The usual prescription is as follows.

If our model, the line to be fitted, is

y = m x + b,

then we solve the following two equations for the two unknowns m and b:

\sum _{i=1}^n y_i = b\ n+m \sum _{i=1}^n x_i

(*)

\sum _{i=1}^n x_i y_i = b \sum _{i=1}^n x_i+m \sum _{i=1}^n x_i^2

where n is the number of observations. This is the line that minimizes the sum of the squared errors, an error being defined as the difference between each datum y_i and the computed \hat{y_i} = m\  x_i + b\ . we would usually get a different line if we fitted x = M y + B, i.e. if we measured errors horizontally.

So how would we begin to generalize that pair of equations? Fortunately, if we go to a matrix representation, one generalization gets us higher degree polynomials, planes instead of lines, higher dimensions, and a whole lot more.

This is one of my favorite pieces of mathematics, because without matrices this would be a huge collection of monstrous ugly equations; with matrices, one equation suffices for an incredible variety of cases.

The model

We will write two matrix equations, one for the data

y = X \beta + e\

and one for the fitted equation

\hat{y} = X \beta\

which imply

y = \hat{y} + e

or

e = y - \hat{y}

This representation has

  • a column vector y of the dependent variable.
  • a data matrix X, in which each column is a variable; the j^{th}\ row is an observation, with values for each of the variables.
  • a column vector \beta of coefficients to be found.

That is, for a fixed row, say i, the matrix equation becomes

\hat{y_i} = \sum _{j=1}^k X_{\text{ij}} \beta _j

Incidentally, the notations are almost universal:

  • dependent variable y,
  • matrix of independent variables X,
  • coefficients of the fit \beta,
  • and the computed values \hat{y}

For working with the theory, one would also want to distinguish the true coefficients \beta from the computed coefficients \hat{\beta}, and the true disturbances e from the computed errors \hat{e}, which are almost invariably called the residuals.

Nevertheless, I have written \beta and e where one should, more precisely, write \hat{\beta} and \hat{e}\ . One should be prepared to make that distinction when necessary.

Let me say that another way. For working with the theory, we would assume that the true model is

y = X\ \beta + e,

for any data matrix X. Then we take a specific data matrix, compute the (ordinary) least-squares (OLS) coefficients \hat{\beta}\ , which we think of as estimates of the true \beta. And for our specific data matrix X, we compute the residuals \hat{e}\ , which are a specific realization of the possible errors e.

Creating the design matrix X

The key is that we get to define the model to be fitted by what we put in the X matrix. What we actually use for the least-squares fit is often called the design matrix, because we usually add columns to the original data matrix.

The simple case of fitting a line is a special case: our design matrix X has a column of 1s, and a column of the x’s: i.e.

X_{i1} = 1

and

X_{i2} = x_i\ .

Then we compute predicted (fitted) values of y as:

\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i = \beta_1 + \beta_2\ x_i\ .

(And, no, whether the 1s are in column 1 or column 2 is irrelevant, except that, of course, if the 1s are in column 2, then the constant term would be \beta_2 instead of \beta_1\ .

Incidentally, since we have found the equation of a line, we could draw that line; we are not limited to computing just the fitted values. We could use that fitted equation for interpolation, or extrapolation: that is, we have the function

y(x) = \beta_1 + \beta_2\ x\ .

Want to fit a parabola? The first column of X is still all 1s; the second column is still the list of x’s; we put in a third column which is the squares of the x’s. That is,

X_{i1} = 1

X_{i2} = x_i

X_{i3} = {x_i}^2

then we compute predicted values

\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i  + \beta_3\ {x_i}^2

which lie on a parabola fitted to the data, and the parabola itself is

y(x) = \beta_1 + \beta_2\ x + \beta_3\ x^2\ .

Want to fit a plane? Just to preserve the notation, I will write it as

\hat{y_i} = a + b\ x_i +c\ z_i\ ,

because I insist on using y as the dependent variable; but that’s just a name. Anyway, for the constant term a, we put a column of 1s; for x and z we put columns of the x’s and of z’s. i.e.

X_{i1} = 1

X_{i2} = x_i

X_{i3} = z_i

then we get

\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i  + \beta_3\ z_i\ ,

a plane.

Fitting the model to the data

Here’s the magic formula. We compute the least-squares values of the coefficients \beta as

(**) \beta = {(X^T\ X)}^{-1}\ X^T\ y\ ,

where

X^T\ is the transpose of X

and

{(X^TX)}^{-1} is the matrix inverse of X^T\ X\ ;

y and \beta are (column) vectors.

The matrix equation (**) is called the normal equations.

So easy. So fraught with peril. It will almost always work, which means that it may give you an answer which isn’t very nice. You will want to graph the data and the fitted values, and the errors.

(To be specific, if the design matrix X is of full-rank, then {(X^TX)}^{-1} exists; if X has more rows than columns – i.e. more observations than variables – X is of full rank if and only if its columns are linearly independent.)

Incidentally, if we wrote (**) as

X^T\ X\ \beta = X^T\ y\ ,

and built X for the special case of fitting a line y = m x + b, we would recover that pair of equations (*), as we should expect.

Assessing the fit

Having computed the coefficients \beta\ , we compute

the vector of fitted values, yhat: \hat{y } = X\ \beta
the vector of residuals: e = y - \hat{y }
the error sum of squares: SSE = e\ \cdot e
and let n be the number of observations = no. of rows of X
the let k be the number of variables = no. of columns of X

Then we compute the following.

we estimate the error variance: s^2 = \frac{SSE}{n-k}

the total sum of squares: SST = (y - \hat{y})\ \cdot\ (y - \hat{y})

the R-squared: R^2 = 1- \frac{SSE}{SST}

the adjusted R-squared: \bar{R}^2 = 1 - \frac{SSE / (n-k)}{SST / (n-1)}

the covariance matrix of the \beta\ : c = s^2\ (X^T\ X)^{-1}

although in practice we only want the square root of the diagonal
i.e. the vector of standard errors: se = \sqrt{c_{ii}}\ .

the vector of t-statistics: t_i = \beta_i / se_i\ .

What do they all mean?

the R-squared , as it’s called, is a measure of how much of the variation in y is explained by the fit. It’s 1 if and only if the SSE = 0, or very close to zero; that means there are no errors. Unfortunately, R-squared cannot decrease when you add a variable to the model. The fit cannot get worse when we add a variable.

(Incidentally, if you try to fit, say, a parabola to exactly 3 (x,y) points, you will get the Lagrange interpolating polynomial which goes exactly thru those 3 points; and the R^2 will be exactly 1. Assuming that none of the points lie on a vertical line.)

We compensate for the fact that R-square cannot decrease as we add a variable, balancing an additional variable against the reduced variation, by using the adjusted R-squared, usually written with a bar over the R: \bar{R}^2\ . There are other measures of overall goodness-of-fit, and I usually print half a dozen. But I am only looking for discrepancies between what they say and what the adjusted R-squared says.

There are issues with the R-squareds if there is no constant term in the model. There are issues if some of the rows of the design matrix X are repeated. (That can happen for experimental data, with different measurements of y at the same values of the x’s; some of our data points lie on vertical lines.)

Hell, there are issues with the t-statistics if the X matrix doesn’t have exactly the true variables, no more and no less!

Each t-statistic is used as an estimate of the chance that each \beta is really 0. It is approximately true that if you have a t-statistic less than 1 (in absolute value), and if you drop that variable from the model – which is equivalent to setting its \beta to zero – then your adjusted R-squared will go up: your overall fit, adjusted for the number of variables, will be better without that variable.

Now I think you should go get a good book on the subject. Draper & Smith is excellent, but if your interest is specifically in economic-type data and modeling, I’d recommend Ramanathan. Both have data to play with and learn from.