I want to revisit my old 2nd regression example of May 2008. I have more tools available to me today than I did when I first created it – and it was originally done before Regress was replaced by LinearModelFit.

## Recap: fitting a quadratic and a cubic

What I had was five observations x, five disturbances u – and an equation defining the true model: y = 2 + x^2 + u. Here they are:

Construct a full data matrix with x, x^2, and y:
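For readers without the original notebook, here is a minimal sketch of this step in Python rather than Mathematica. The `x` and `u` values are hypothetical placeholders, not the values used in the post; only the model y = 2 + x^2 + u and the column layout come from the text.

```python
# Hypothetical inputs: these are NOT the post's x and u, just stand-ins.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
u = [0.05, -0.10, 0.02, 0.08, -0.04]

# True model from the post: y = 2 + x^2 + u
y = [2.0 + xi ** 2 + ui for xi, ui in zip(x, u)]

# Data matrix: one row per observation, columns x, x^2, y
data = [[xi, xi ** 2, yi] for xi, yi in zip(x, y)]

for row in data:
    print(row)
```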

Run forward selection… and backward selection…

Forward and backward agree. In the original post I had gotten

y = 1.63209 + 0.597983 x + 0.772142 x^2

and

y = 2.05047 + 0.972441 x^2…

and this time?

That’s what I got this time, too. (That’s good.) Note that X is not significant in the 2-variable fit – which has the lower Adjusted R^2.

These two fits make a useful point. We are usually concerned when t-statistics fall from significant to insignificant, but here we see two t-statistics fall from about 30 down to 2.4. Yes, the drop alone may be suggestive of multicollinearity, whether or not the resulting t-statistics are insignificant. (The standard errors have increased a lot.)

On the other hand, checking for multicollinearity is easy, so do it… we don’t need a hall pass; no justification is required for checking. Here are the R^2 computed from the Variance Inflation Factors:
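The link between a Variance Inflation Factor and its R^2 is VIF = 1/(1 – R^2), so R^2 = 1 – 1/VIF, where R^2 comes from regressing one explanatory variable on the others. Here is a small Python sketch of that conversion for the two-variable case (x and x^2), where that R^2 is just the squared correlation. The `x` values are hypothetical, so the numbers will not match the post’s output.

```python
def r_squared_simple(a, b):
    """R^2 from regressing b on a (with a constant): the squared correlation."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    return sab * sab / (saa * sbb)

def vif_from_r2(r2):
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    return 1.0 - 1.0 / vif

# Hypothetical x values, not the post's data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [v * v for v in x]

r2 = r_squared_simple(x, x2)
vif = vif_from_r2(r2)
print("R^2 =", r2, " VIF =", vif, " back to R^2 =", r2_from_vif(vif))
```

Even for these tame placeholder values the R^2 is well above 0.9, which is why a variable and its own square are such a reliable source of multicollinearity.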

Highly multicollinear (m/c). So high, in fact, that I want to check the inverse of X’X. Set X to the design matrix of the 2-variable regression… get the transpose XT… invert X’X… then multiply the alleged inverse by X’X, and see how close the product is to an identity matrix… finally, clear X so that I can still use it as a name in the next regression:

OK.
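The same sanity check can be sketched in plain Python: build the design matrix, form X’X, invert it, multiply the alleged inverse by X’X, and look at the largest deviation from the identity. The `x` values are hypothetical and the Gauss-Jordan `inverse` helper is my own stand-in for Mathematica’s built-in inverse, so treat this as an illustration of the check, not a reproduction of the post’s numbers.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def inverse(a):
    """Gauss-Jordan elimination with partial pivoting (illustrative helper)."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Hypothetical x values; design matrix columns are 1, x, x^2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[1.0, v, v * v] for v in x]

XtX = matmul(transpose(X), X)
product = matmul(inverse(XtX), XtX)

# Largest absolute deviation of the product from the identity matrix.
max_dev = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
print("max deviation from identity:", max_dev)
```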

Let’s make matters worse; let’s add x^3 to the variables. (I did this back in the original post, and what happens is the reason I chose to look at this again.)

First off, we see that forward and backward did not get the same regressions. Second, and more interestingly, forward still picked out the true model (x^2), but backward selection did not – in fact, the true variable, x^2, was the very first variable dropped by backward selection.

It is a little nice that backward selection ended up choosing X, because it lets us see what we gained by using just X:

We know the true model is 2 + X^2, but X alone isn’t bad at all (with .99 for Adjusted R^2). That is, if we hadn’t run with X^2, we might never have moved beyond X.

But, returning to the key observation…. Although backward selection has worked very well for most of the other examples we’ve looked at, in this case it is utterly wrong. The t-statistics in the 3-variable regression do not reflect the fact that x^2 is, in fact, a better variable than x and x^3.

Let me be blunt: backward selection fails badly in this example, so badly that it rejects the true model at the first opportunity.

My old impression that starting with all the variables was a bad idea… is not always true – we’ve seen it work in several examples – but it can indeed sometimes be a bad idea. It can sometimes, as in this case, be like fishing around in garbage. In this case, the true variable X^2 has the worst t-statistic in the largest regression.

OK: keep using forward and stepwise and backward selection, hoping that one of them will deliver the best possible regression. (For a guarantee, however, we’d have to run all possible regressions – which we have done a few times. And even then we’re limited to whatever variables we chose.)

Oh, here is the 3-variable regression… all t-statistics are insignificant, and X2 had the lowest one (excluding the constant):

We should expect even more severe m/c… and we are not disappointed:

We should check the inverse of X’X again…

Still ok… but we’re seeing off-diagonal numbers from 10^-10 to 10^-12 instead of from 10^-13 to 10^-14. Not a big deal, I think.

## Moving on: selection criteria

We have a lot of very good fits here. Can our selection criteria sort it out?

Well, if we want to call our selection criteria… we’re going to have to have more observations. Some of the criteria divide by n – k – 2 or by n – k – 1, where (n,k) are the dimensions of the design matrix, namely

(n,k) = (5, 4)

We see that n – k – 1 = 0… and I didn’t write my code to catch that. It will die on division by zero. I could keep it from dying, but I can’t make it give answers. We need to increase n, the number of observations.
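The arithmetic is worth spelling out. With five observations and a design matrix holding a constant plus x, x^2, x^3, we have n = 5 and k = 4. Here is a hedged Python sketch of the kind of guard my code lacks; the function names are hypothetical, and requiring both denominators to be strictly positive is my assumption about what makes the criteria meaningful.

```python
def criterion_denominators(n, k):
    """Denominators that appear in some of the selection criteria."""
    return {"n-k-1": n - k - 1, "n-k-2": n - k - 2}

def criteria_usable(n, k):
    """Assumption: every denominator must be strictly positive."""
    return all(d > 0 for d in criterion_denominators(n, k).values())

for n in (5, 6, 7):
    k = 4  # constant, x, x^2, x^3
    print(n, k, criterion_denominators(n, k), criteria_usable(n, k))
```

Under that assumption, n = 7 is the smallest number of observations for which both n – k – 1 and n – k – 2 are positive when k = 4.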

Let me add a couple of observations, so that I can get answers. That is, I insert two more values of x… two more disturbances u, and then compute y = 2 + x^2 + u.

Assemble a data matrix with x, x^2, x^3, and y… run forward… backward…

We see that backward has again dropped x^2 from the 3-variable regression, so it still fails to find that x^2 alone has the highest Adjusted R^2. Backward selection still gets it wrong; forward selection still gets it right.

Let’s try our selection criteria on the forward regressions…

Unanimous (except, as usual, for Cp). And what one model did they choose? The true model:

Good.

And on the backward regressions…

Without the actual true model in there, they are torn between the 1-variable and the 2-variable fits.

We do, of course, still have very extreme multicollinearity:

But not extreme enough to threaten the inversion of X’X – so we have not crossed over into “damn near linear dependence”. (There’s no clear boundary, but there’s certainly a suggestive one: an inaccurate inverse of X’X. The vagueness comes from: how inaccurate?)

Anyway, we have multicollinearity? Orthogonalize!

But we know the true model. Still, we ought to see what orthogonalizing the data will do for us. (What can I say? It won’t improve the fit significantly, and the variables may be weird, but this is new enough to me that I enjoy seeing the multicollinearity eliminated.)

As usual, get the transpose of the design matrix (because the Orthogonalize command works on rows)… call the result ZT… transpose to get Z… then confirm that Z’Z = I…

Then drop the first column – we’ll get it back with the next design matrix – and add the dependent variable, to get a data matrix.
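The orthogonalization step can be sketched in Python as Gram-Schmidt applied to the rows of the transposed design matrix, followed by the check that Z’Z = I. This uses hypothetical `x` values and a hand-written `gram_schmidt` helper standing in for Mathematica’s Orthogonalize, so it shows the procedure rather than the post’s actual numbers.

```python
import math

def gram_schmidt(rows):
    """Orthonormalize a list of row vectors (modified Gram-Schmidt)."""
    basis = []
    for v in rows:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

# Hypothetical x values (seven observations, as in the enlarged data set).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
n = len(x)

# Rows of XT are the columns of the design matrix: 1, x, x^2, x^3.
XT = [[1.0] * n, x, [v ** 2 for v in x], [v ** 3 for v in x]]
ZT = gram_schmidt(XT)

# Z'Z is the matrix of dot products between rows of ZT; it should be I.
ZtZ = [[sum(a * b for a, b in zip(r1, r2)) for r2 in ZT] for r1 in ZT]
max_dev = max(abs(ZtZ[i][j] - (1.0 if i == j else 0.0))
              for i in range(4) for j in range(4))
print("max deviation of Z'Z from identity:", max_dev)
```

Because the constant row comes first, the first orthonormal vector is itself constant (every entry 1/sqrt(n)), which is why that column can be dropped and recovered from the next design matrix.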

Run forward and backward…

OK, they agree on the regressions. Let’s print one set.

Now two of the variables are significant. By orthogonalizing X and X2 to get Z1 and Z2, we seem to have spread x^2 out over Z1 and Z2.

I should be able to see that. Recall that the relationship between the original data K and the orthogonalized data J is

K = J T’ + M

and

T’ = J’ K,

where J is the new data, K the old, and M is an array of column means. Heck, the easiest thing to do is… set K by dropping the dependent variable from the data matrix d3… set J by dropping the constant column from Z… compute T’… compute M as K – J T’…
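Here is a Python sketch of that computation, again with hypothetical `x` values and hand-written helpers in place of the Mathematica commands. It computes T’ = J’K and M = K – J T’, then checks that every column of M is constant and equal to the corresponding column mean of K.

```python
import math

def gram_schmidt(rows):
    """Orthonormalize a list of row vectors (modified Gram-Schmidt)."""
    basis = []
    for v in rows:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        basis.append([wi / norm for wi in w])
    return basis

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

# Hypothetical x values, not the post's data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
n = len(x)
XT = [[1.0] * n, x, [v ** 2 for v in x], [v ** 3 for v in x]]

Z = transpose(gram_schmidt(XT))   # n x 4, orthonormal columns
J = [row[1:] for row in Z]        # new data: Z1, Z2, Z3 (constant dropped)
K = transpose(XT[1:])             # old data: X, X2, X3

Tt = matmul(transpose(J), K)      # T' = J'K; rows Z1..Z3, columns X, X2, X3
JTt = matmul(J, Tt)
M = [[k - j for k, j in zip(krow, jrow)] for krow, jrow in zip(K, JTt)]

col_means = [sum(col) / n for col in transpose(K)]
print("T' =", Tt)
print("column means of K =", col_means)
print("first row of M    =", M[0])
```

In this sketch T’ comes out upper triangular (up to rounding): X loads only on Z1, and X2 loads only on Z1 and Z2, with a zero coefficient on Z3. Each column of K is then that combination of the Z’s plus its own mean, which is the role M plays.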

Now, the columns of K are named…

n3 = {X,X2,X3}

and the columns of J are named…

n4 = {Z1,Z2,Z3}

so the relationship is

So, X2 == 2.84605 Z1 + 0.25236 Z2.

Which says that X2 has been written as a combination of Z1 and Z2… so, yes, both Z1 and Z2 should be significant. Our best regression for the orthogonalized data has 2 variables while our true model has one… but the one is a combination of the two. OK? OK by me.

And our selection criteria?

A minority of our criteria would select a 1-variable regression, even though we know the true model requires two of the orthogonal variables.

Oh, half the point of the orthogonalization was to see this:

The multicollinearity really is gone.

The key result of this example is that backward selection can fail to find the best regression. It’s probably worth rephrasing that: if we’re fishing around in a pile of insignificant t-statistics, the smallest t-statistic can belong to the best variable.
