Regression 1: The Hald Data: Forward, Backward, and Stepwise

Introduction

We pick up where we left off: we have run all possible regressions using the four variables of the Hald data.

Forward Selection

Let me cut to the chase. I have written a Mathematica® function called stepwise; it takes two inputs — the data and the list of variable names. Applied to the Hald data (data d1 and names n1), we get

It has given us 4 regressions. What are they?

The first uses X4, and has an Adjusted R Squared of about 0.65. It is, in fact, the 1-variable regression with the highest Adjusted R Squared.

The second adds X1 to X4: it is not the highest ranked 2-variable regression overall, but it is the highest ranked (by Adjusted R Squared) 2-variable regression that includes X4, the variable chosen at the first stage.

The third adds X2 to {X1, X4}: every criterion except Cp considers this the highest ranked 3-variable regression.

The fourth adds the only remaining variable, X3.

Note that the fourth regression has a lower Adjusted R Squared than the third one. Stepwise runs to the end, ultimately adding every variable.

Let me put that another way: what I did was find the best 1-variable regression (X4), then the best variable to add to that regression (X1), then the best variable to add to that regression (X2), and finally I added the last variable.
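In case the idea is clearer as code, here is a minimal sketch of that forward pass in Mathematica. It is not my stepwise function itself, just an illustration; it assumes the columns of d1 are X1, X2, X3, X4, Y and that n1 = {X1, X2, X3, X4}.

(* a minimal sketch of forward selection, not the stepwise function itself;  *)
(* assumes the columns of d1 are X1, X2, X3, X4, Y and n1 = {X1, X2, X3, X4} *)
forwardSelect[data_, names_] :=
 Module[{chosen = {}, remaining = names, fits = {}, best},
  While[remaining =!= {},
   (* the candidate whose addition gives the highest Adjusted R Squared *)
   best = First[SortBy[remaining,
      -LinearModelFit[data, Append[chosen, #], names]["AdjustedRSquared"] &]];
   chosen = Append[chosen, best];
   remaining = DeleteCases[remaining, best];
   AppendTo[fits, LinearModelFit[data, chosen, names]]];
  fits]

fwd = forwardSelect[d1, n1]   (* four fits: X4, then {X4, X1}, {X4, X1, X2}, then all four *)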

Yes, I used Adjusted R Squared to select the “best” at each stage. But we know from the previous post that thirteen of the fifteen criteria must agree with each other — when we restrict ourselves to regressions with a set number of variables.

That is, thirteen of the criteria would agree that X4 is the highest ranked 1-variable regression, that {X1, X4} is the highest ranked 2-variable regression containing X4, and that {X1, X2, X4} is the highest ranked 3-variable regression containing {X1, X4}. (I’ll confirm that very shortly.)

Only Cp is known to give a different order; but I firmly believe that PRESS can, too.

So I should probably add the ability to use PRESS as an alternative to Adjusted R Squared… I suppose I’ll do it someday. The simple fact is that I’m comfortable using Adjusted R Squared.

As for Cp, I’m not likely to introduce it as an alternative… but I won’t rule it out. It appears to be far too sensitive to the choice of touchstone regression; I plan to show you that in subsequent posts.

Now let me recall all possible regressions sorted by Adjusted R Squared. Oh, let me emphasize that I would not normally have run all possible subsets before I ran stepwise. Frankly, stepwise is the very first thing I do when I get a dataset.

(For shame. Some would say I should plot the data first, but I don’t. I let stepwise tell me what variables are of interest, and then I look at them.)

Let me emphasize that I am looking at all possible regressions, when feasible, for the sole purpose of seeing how well stepwise (and forward and backward selection) perform.

Anyway, here are all possible subset regressions sorted by Adjusted R Squared (from the previous post):

Looking at just the 1-variable regressions, we see X4 ranked highest. Looking at 2-variable regressions, we see {X1, X2} ranked highest, but it doesn’t contain X4. {X1, X4} is second to {X1, X2}, and since it contains X4, it’s the one stepwise picked.

Oh, I should emphasize that stepwise did not run all possible regressions. Once it decided on X4, it only ran {X1, X4}, {X2, X4}, and {X3, X4}. Similarly, once it decided on {X1, X4}, it only ran {X1, X4, X2} and {X1, X4, X3}. Anyway, stepwise chose {X1, X2, X4}. Then it added the last remaining variable, X3. When it was done, it had given me a set of four regressions.

Now I’ll remind you that PRESS got exactly the same rankings. So I can tell from the previous table that PRESS would also have chosen X4, then {X1, X4}, then {X1, X2, X4} (and ended with the 4-variable regression).

But Cp did not get the same rankings. I can tell from the following table — without running an alternative form of stepwise — that stepwise using Cp would have chosen X4, then {X1, X4}, but then {X1, X3, X4}.

That is, having run all possible regressions, I can assess not only what stepwise did using Adjusted R Squared, but also what stepwise would have done if I could run it using Cp instead.

Why stepwise?

Instead of running all 16 possible regressions, I ran four 1-variable regressions, three 2-variable, two 3-variable, and one 4-variable… i.e. 4 + 3 + 2 + 1 = 10.

No, that wasn’t much of a savings over the 16 all possible subsets. But since the sum 1 + 2 + … + n = 1/2 n (1+n), we see that the number of regressions to be run grows significantly more slowly (approximately as n^2) than all possible subsets (which goes as 2^n).

Here, have a table showing n, 2^n, and 1/2 n (1+n) for n = 1..10:
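That table is a one-liner to generate; something along these lines produces it:

(* n, 2^n (all possible subsets) and n(1+n)/2 (forward selection) for n = 1..10 *)
TableForm[Table[{n, 2^n, n (n + 1)/2}, {n, 1, 10}],
 TableHeadings -> {None, {"n", "2^n", "n(1+n)/2"}}]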

We pretty much know what’s going on, but the obvious thing to do is to let our criteria select the best.

Interesting. HQc and AICu chose regression 2, while everything else chose regression 3. Note that even though I used Adjusted R Squared to choose the best regression at each stage, the criteria do not agree on whether we should go with two variables or with three.

Just what are regressions #2 and #3?

I should recall from the previous post the rankings of the top 6 regressions (all possible subsets) by all the criteria.

We see that there were two criteria (AICu and HQc) that ranked #8 above #13. Guess what? The old #8 is our new #2…

Why didn’t any of our criteria choose the old #6?

all[[6]]["BestFit"] = 52.5773 + 1.46831 X1 + 0.66225 X2.

Because #6 doesn’t have X4 in it.

So, we seem to have missed a good possibility.

Well, let’s take a closer look at our new #2 and #3:

What I’ve actually done so far should be called “forward selection”. I have stepped forward through the list of variables, adding the best remaining variable to the previous regression.

Stepwise Regression

How do we get stepwise from forward selection? By taking a big step backwards.

Note that the parameter table preserves the order in which variables were added. We see that the t-statistic on X4 fell when we added X2. In fact, it fell a lot… so far, in fact, that it is now insignificant.

It certainly looks as though X4 is not as good as X2, once X2 makes it into the regression.
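If you would rather pull those t-statistics out programmatically than read them off the displayed parameter table, something like this works; lm3 is my name (an assumption, not output from the post) for the third forward-selection fit, with the terms in the order they were added.

(* a sketch: extract the t-statistics from a fitted model;                        *)
(* lm3 is assumed to be the third forward-selection fit, with terms {X4, X1, X2}  *)
lm3 = LinearModelFit[d1, {X4, X1, X2}, n1];
lm3["ParameterTable"]                    (* estimate, SE, t-statistic, p-value for each term *)
lm3["ParameterTableEntries"][[All, 3]]   (* just the t-statistics, constant term first       *)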

So drop it from consideration. Just remove X4 from the dataset.

Here’s how I drop a variable name… and then a similar command drops the column from the dataset d1. (The “None” says “drop none of the rows”; the {-2} says drop the second column from the end; -2 without braces would have dropped the last two columns from the end.)
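Concretely, the two commands look something like this, assuming n1 = {X1, X2, X3, X4} and the columns of d1 in the order X1, X2, X3, X4, Y:

(* a sketch of removing X4, assuming n1 = {X1, X2, X3, X4}     *)
(* and the columns of d1 in the order X1, X2, X3, X4, Y        *)
n2 = Drop[n1, -1]          (* X4 is the last name in the list  *)
d2 = Drop[d1, None, {-2}]  (* keep every row; drop the second column from the end *)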

With X4 removed, the best 1-variable regression uses X2, then the best 2-variable regression containing X2 is {X1,X2}, and then we add the only remaining variable X3. And our criteria now choose:

Now, combine those two lists, reg1 and reg2. (I do it in reverse order so that the last regression has all four variables — because Cp will assume that the last regression is the touchstone.)
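In code, combining “in reverse order” is just a Join with reg2 first; a sketch:

(* combine the two stepwise runs: reg2 (the run without X4) comes first, so that  *)
(* the 4-variable regression at the end of reg1 stays last, as Cp's touchstone    *)
reg = Join[reg2, reg1]
(* select, the criteria-comparison function, is then applied to reg *)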

That just confirms what we already know: the two stepwise runs have delivered the old #6 and #13, and the criteria are still divided over them.

Because reg combines reg1 and reg2 in reverse order, the newest #2 is the 2-variable regression, and the newest #6 is the 3-variable regression. Here are the two candidates for best regression by stepwise… followed by the two best candidates out of all possible regressions. The two sets of two are the same.

So.

Stepwise, run twice, gives me the same choice of “best regression” as running all possible subsets.

I know, however, that stepwise will not always lead to the best regressions….

The only way to know for sure if we found the “best” regressions is to run all possible subsets, but we can’t always do that either.

What else can we do?

Backward Selection

We could go backwards instead of forwards. Start with the regression using all the variables:

Actually, what we start with is the full dataset and list of names. Drop the variable having the lowest (absolute value) t-statistic. Then do it again. And again…. Here’s what happens:

It dropped X3, as we said it would. Then, having {X1, X2, X4}, it dropped X4, which is exactly what we dropped when we progressed from forward selection to stepwise:

We should call “select”…

I did something devious, but essential, in my code. Because my select function expects that the touchstone for Cp is the last regression, while I ran regressions in order of 4,3,2,1 variables, I stored the regressions in reverse order.

Regression #2 has 2 variables and regression #3 has 3 variables, even though regression #2 was run third, after regression #3.
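Here is a minimal sketch of that backward pass, again not the actual code: fit the current regression, find the non-constant term with the smallest |t|, drop it, refit, and prepend each fit so that the full 4-variable regression ends up last (for Cp’s touchstone).

(* a minimal sketch of backward selection, not the actual code from the post;        *)
(* assumes the columns of data are X1, X2, X3, X4, Y and names = {X1, X2, X3, X4}    *)
backwardSelect[data_, names_] :=
 Module[{vars = names, fits = {}, lm, terms, tstats, worst},
  While[vars =!= {},
   lm = LinearModelFit[data, vars, names];
   fits = Prepend[fits, lm];                          (* reverse order: full model ends up last *)
   terms = Rest[lm["BasisFunctions"]];                (* skip the constant term                 *)
   tstats = Rest[lm["ParameterTableEntries"][[All, 3]]];
   worst = terms[[First[Ordering[Abs[tstats], 1]]]];  (* the term with the smallest |t|         *)
   vars = DeleteCases[vars, worst]];
  fits]

breg = backwardSelect[d1, n1]   (* four fits: 1-variable first, 4-variable last *)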

Discussion

I had always preferred forward selection and stepwise to backward selection, partly because I just didn’t like fishing through all the lousy t-statistics in the largest regression. On the other hand, I’ve already run all subsets and backward selection for three other cases, and I have learned that backward selection is easier than stepwise, and — to my surprise — at least as effective in these four cases, and more effective in some.

It seems that even though the coefficients may not be trustworthy in the all-variable regressions, the t-statistics are. (This is a conclusion based on less experience than I would like, but there’s a simple cure: run more regressions.)

Backward selection, however, has one major restriction. It chooses one regression for each number of variables. If it ever happens that the criteria applied to all possible subsets are divided over two regressions with the same number of variables — well, backward selection cannot possibly include both of them. In such a case, backward selection could not possibly agree with the result of all possible subsets.

The same is true of pure forward selection (that is, whenever I run my stepwise code only once).

True stepwise (i.e. run more than once), by contrast, will have more regressions to choose from. For the Hald data, backward and forward selection had four regressions each — while stepwise had 7 distinct regressions. (It had four regressions with X4, and three without.)

So, I will routinely run backward selection from now on, in addition to stepwise. We have no guarantee that either will always find the best answers.

Computational Summary and More Discussion

Given the Hald data {X1, X2, X3, X4}, the very first thing I would do is run stepwise once and find the “best forward selections”:

Second, look at #2 and #3:

… and observe that the t-statistic on X4 dropped an awful lot, so I would do stepwise again, without X4. Having constructed data d2 and names n2 to omit it…

Third, what I really want is to run select on the combined set of regressions. If I combine them in reverse order, the touchstone will still be the 4-variable regression.

We have found two candidates for the best regression… save them.

Finally, if I could afford the time — and with only four variables, I can — I would run regressions on all possible subsets and select the best:
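That computation is small; here is a sketch of the all-possible-subsets run, sorted by Adjusted R Squared, assuming d1 and n1 as before (the variable names in this sketch are mine):

(* a sketch of all-possible-subsets regression: one fit per nonempty subset of the variables *)
subsets = Rest[Subsets[n1]];                       (* the 15 subsets with at least one variable *)
allFits = LinearModelFit[d1, #, n1] & /@ subsets;
ranked = SortBy[allFits, -#["AdjustedRSquared"] &];
ranked[[1]]["BestFit"]                             (* the top-ranked fit *)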

Save them.

The key question is: how many distinct candidates do we actually have? (Let’s pretend we don’t know.) We saw that they all had two or three variables, so let’s look at the 2-variable candidates:

They’re all the same. All three agree on the best 2-variable regression: {X1, X2}. That’s another way of saying that both stepwise and backward selection agree with the definitive “all possible subsets” on the best 2-variable regression.

And 3-variables?

Of course: they all agree on the best 3-variable regression: {X1, X2, X4}. And since one of the three is the result of all possible subsets, it’s definitive: stepwise and backward selection got the right two choices.

Come on. Once I run all possible subsets, I’m done: I’ve got the right answers; I have more than one candidate, but I made my choice based on every possible regression (with a constant term). But the point is to see how stepwise and backward selection perform in cases where we can check them. For the Hald data, they get the right answers – even though they only looked at seven and four regressions respectively.

Where the criteria disagree is on whether the best 2-variable regression is better than the best 3-variable regression. That’s why we have two candidates: not because I set out to consider both 2-variable and 3-variable regressions, but because the criteria are not unanimous. (OK, as I said, backward selection could not have chosen two regressions with the same number of variables… but stepwise could have.)

I emphasize that forward selection by itself would exclude {X1, X2} from consideration. This is why we went from forward selection to stepwise, running a set of regressions without X4. On the other hand, the omission of X4 meant that the very fine {X1, X2, X4} was omitted from consideration on the second step — so I combined the results.

Anyway, I would now investigate these two regressions. “We’ve only just begun”, but at least we can now focus on two regressions.

Oh, when I started all this, I plotted each of the variables X1, X2, X3, X4, and Y; and I plotted the points (Xi, Y) for i = 1,2,3,4. I would still do that — but I would probably not include the plots involving X3 in a final report.
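Those plots are one-liners; a sketch, assuming as before that the columns of d1 are X1, X2, X3, X4, Y:

(* a sketch of the data plots, assuming the columns of d1 are X1, X2, X3, X4, Y *)
Table[ListPlot[d1[[All, i]], AxesLabel -> {"obs", n1[[i]]}], {i, 4}]      (* each Xi by itself  *)
ListPlot[d1[[All, -1]], AxesLabel -> {"obs", "Y"}]                        (* Y by itself        *)
Table[ListPlot[d1[[All, {i, -1}]], AxesLabel -> {n1[[i]], "Y"}], {i, 4}]  (* the points (Xi, Y) *)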

Why not include it? Because it’s not used in the two best candidates.

But why even plot it? Because that might show me why it’s not used. If there’s a problem with the data for X3, I come to its plots knowing that it was omitted, and therefore I am sensitive to possible anomalies in its plots. (Notice how cleverly I justify running regressions first and looking at the data second.)

OK, in this case I don’t save much. On the other hand, if I had started with 12 variables, but my best regression included only 6 of the 12, I think I should still plot the unused ones — just to find out if they were omitted because there were problems with them. I might report such problems, but I also might choose not to investigate them further… not in the first analysis, anyway.

I expect to show you two more examples of forward selection and stepwise, backward selection, and all subsets regression. So if this all seemed a bit overwhelming, fear not… we will be looking at it some more.

And I’ll repeat the executive summary. I would always run both stepwise and backward selection; if feasible, I would run all possible subsets, too. (For what it’s worth, on my MacBook, it took 3 minutes wall-clock-time to run all the 4096 regressions on 12 variables.)

I also need to close with a few comments.

First, my approach to stepwise regression is not exactly what Draper & Smith describe (see the bibliography page). (Now I tell you!) I went all the way back to the beginning, removing X4 from further consideration; and I applied my criteria to all seven of the regressions which I obtained.

As I read Draper & Smith, they would have removed X4, then decided that X4 was the better variable to add back (better than X3), and they would have stopped rather than cycle, adding and deleting X4. They are describing a process of stepping either forward or backward at each stage. To put that another way: they decide what variable to add… step forward by running a regression… then ask if any variable should be deleted… and, if so, immediately step backwards, removing it.

I strongly prefer my own approach. I’m not trying to stop on the best regression: I’m trying to get a set of regressions worth further consideration, to which I get to apply my selection criteria.

Second, did you notice that backward selection only computed four regressions? The selection of which variable to delete was made before running the regression with that variable deleted. (Looking at the 4-variable regression, I decided to omit X3, and so I ran exactly one 3-variable regression, and so on.) This appears to be what Draper & Smith describe.

Just as I am considering extending my forward selection / stepwise to use Cp and PRESS as alternatives to Adjusted R Squared, I consider extending my backward selection so that it actually computes all four 3-variable regressions, and chooses the one with best Cp or PRESS. Then, like the first forward selection, it would compute ten regressions on the Hald data, instead of just four. And it would let me use other criteria than the t-statistics for the choice of the best next-smallest regression.
