(**Dec 6: a couple of edits**, which you can find by searching on “edit”.)

Let’s work another example of forward selection, stepwise regression, and backward selection, and then let’s run all possible subsets just to see how well they did. (I didn’t call the previous one example 1, but it was the Hald data.)

## The Data

The following data can be found on Ramanathan’s website, on the page for the fifth edition of his book, “Introductory Econometrics with Applications” (see the bibliography page).

I have attempted to download the text file version of the data, but I get “page not found”. I sent an e-mail at the end of August 2010, but the problem has not been fixed as of 11/14/2010.

You can download a collection of Excel files, instead, from his site.

The particular example I am about to work is from the file DATA4-4.XLS.

This data appears to be the same as in the 3rd and 4th editions of the book.

In the 3rd edition of the book, which I own, this example is section 4.9, the data is printed as table 4-16, and the data was compiled by Sean Naughton. It is cross-section data for 40 US cities for 1988. The variables are described as:

- BUSTRAVL: demand for urban transportation by bus, in thousands of (person) hours;
- FARE: bus fare in dollars;
- GASPRICE: price of a gallon of gasoline in dollars;
- INCOME: average income per capita in thousands of dollars;
- POP: population of the city;
- DENSITY: population density (persons/sq mile);
- LANDAREA: land area of the city (sq miles).

In the following Mathematica® instructions, the first command reads the file, and the second confirms that the first line contains the names of the variables. Our dependent variable is BUSTRAVL, in column 1. The third command puts the names of the independent variables into a list, n1.

The next set of commands drops the first line (the names), and then rotates column 1 to the back, making it column 7. All of the data, including the dependent variable, is now in the array d1; as usual, each column is a variable.

The array d1 contains 40 observations and 7 variables, of which 6 are independent variables; the Dimensions command told me the size of d1:

Dimensions[d1] = {40,7}

(As I sometimes do, I have equated the command and the subsequent answer, rather than add another screenshot.)
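The commands themselves were shown as screenshots. Here is a minimal sketch of the same steps, assuming the worksheet has already been imported as a list of rows. The toy table `raw` stands in for the file, its numbers are placeholders rather than Ramanathan’s data, and the `Import` call in the comment uses an assumed file path.

```mathematica
(* Toy stand-in for the imported worksheet: a header row plus two data rows.
   With the real file one might use raw = Import["DATA4-4.XLS"][[1]]
   (first worksheet; the path is an assumption). *)
raw = {{"BUSTRAVL", "FARE", "GASPRICE", "INCOME", "POP", "DENSITY", "LANDAREA"},
       {1, 2, 3, 4, 5, 6, 7},
       {8, 9, 10, 11, 12, 13, 14}};

First[raw]                      (* the first line holds the variable names *)
n1 = Rest[First[raw]];          (* names of the independent variables *)
d1 = RotateLeft /@ Rest[raw];   (* drop the names; rotate column 1 to the back *)
Dimensions[d1]                  (* {2, 7} for the toy table *)
```

`RotateLeft` mapped over the rows is one way to move the dependent variable from the first column to the last without touching the order of the others.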

Although the data is available on the internet — well, because the data is available elsewhere on the internet — I feel free to display it here, so that this post is self-contained.

## Ramanathan’s Analysis

What he did was very similar to backward selection. He started by running all the variables, then removed GASPRICE because it had the highest p-value (equivalently, because it had the lowest (absolute value) t-statistic).

Then, in the regression without GASPRICE, he says that FARE had the highest p-value, but he chose to drop the second-highest instead, namely LANDAREA, “…since FARE is the price measure that is important in a demand equation….”

In the subsequent regression without LANDAREA, he found that FARE remained insignificant, so he finally dropped it.

He ended up with a regression using INCOME, POP, and DENSITY.

What he did was very nearly backward selection, except that he did it manually, and chose not to drop FARE when backward selection would have.

## Backward Selection

Okay, let me jump in with both feet. Since backward selection is very straightforward, and since it’s very close to what Ramanathan did, let me do it first.

Backward selection confirms my description of what Ramanathan saw: GASPRICE drops first, then FARE, then LANDAREA. Let me emphasize that he chose not to drop FARE second because (I would say) he was hoping FARE would be significant after he dropped LANDAREA. But he bowed to the inevitable, and removed FARE third.
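For readers who want to see the mechanics, here is a minimal sketch of backward selection on synthetic data. It is not the function used in this post: `fitCols` and `backwardSketch` are hypothetical helpers, the data are randomly generated for illustration, and the results are listed from the largest regression down (which may not match the ordering of the list `bac` used below).

```mathematica
(* Hypothetical helper: regress the last column of data on the chosen
   columns plus a constant. *)
fitCols[data_, cols_] := Module[{v = Table[Unique["x"], {Length[cols]}]},
  LinearModelFit[data[[All, Append[cols, -1]]], v, v]]

(* Backward selection sketch: fit, record, then drop the variable whose
   t-statistic is smallest in absolute value; repeat until none are left. *)
backwardSketch[data_, names_] :=
 Module[{cols = Range[Length[names]], fit, t, out = {}},
  While[cols =!= {},
   fit = fitCols[data, cols];
   AppendTo[out, names[[cols]] -> fit];
   t = Abs[Rest[fit["ParameterTStatistics"]]];   (* Rest skips the constant *)
   cols = Delete[cols, First[Ordering[t, 1]]]];
  out]

(* Synthetic data, for illustration only: y depends on a and b, not on c. *)
SeedRandom[1];
toy = Table[With[{x = RandomReal[1, 3]},
    Append[x, 1 + 2 x[[1]] - 3 x[[2]] + RandomReal[{-0.1, 0.1}]]], {40}];

result = backwardSketch[toy, {"a", "b", "c"}];
First /@ result   (* the variable sets, from all three down to a single one *)
```

Note that this sketch always drops the weakest variable, which is exactly the mechanical step Ramanathan declined to take when FARE came up second.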

I’ve also told you that he settled on the 3-variable regression (INCOME, POP, DENSITY).

What would all my criteria select?

The choice is almost unanimous: Cp is the only exception. The parameter tables and Adjusted R Squared for the two regressions are:

Let me point out my counting scheme: **strictly speaking, the first of those regressions has 4 independent variables — but I call it a 3-variable regression, for the three non-constant variables**.

Let me save those two regressions in a short list:

bestB={bac[[3]],bac[[5]]};

## Forward selection

Now let’s run forward selection, i.e. the first stage of a possible stepwise regression. (But I just can’t bring myself to name my function “forward” instead of “stepwise”.)

I find that interesting. In marked contrast to backward selection, forward selection offers us a choice of four regressions out of the six it ran. (Not very selective, huh?) Edit: oops. Stepwise offered us 6 regressions, but it ran 6 + 5 + 4 + 3 + 2 + 1 = 6*7/2 = 21 regressions. Still, four out of a list of six isn’t very selective.
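The count of 21 can be checked with a small sketch of forward selection that tallies every fit it makes. This is an illustration under assumptions, not the `stepwise` function of this post: `fitCols` and `forwardSketch` are hypothetical names, the data are random placeholders, and the candidate at each step is chosen by Adjusted R Squared.

```mathematica
(* Hypothetical helper: regress the last column of data on the chosen
   columns plus a constant. *)
fitCols[data_, cols_] := Module[{v = Table[Unique["x"], {Length[cols]}]},
  LinearModelFit[data[[All, Append[cols, -1]]], v, v]]

(* Forward selection sketch: at each step try every remaining variable and
   keep the one giving the largest Adjusted R Squared. *)
forwardSketch[data_, k_] :=
 Module[{chosen = {}, rest = Range[k], count = 0, scores, best, path = {}},
  While[rest =!= {},
   scores = (count++; fitCols[data, Append[chosen, #]]["AdjustedRSquared"]) & /@ rest;
   best = rest[[First[Ordering[scores, -1]]]];
   chosen = Append[chosen, best];
   rest = DeleteCases[rest, best];
   AppendTo[path, chosen]];
  {path, count}]

SeedRandom[1];
toy = RandomReal[1, {40, 7}];   (* 6 placeholder regressors plus a response *)
{path, count} = forwardSketch[toy, 6];
{Length[path], count}   (* {6, 21}: six regressions offered, 21 fitted *)
```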

PRESS would go with the 1-variable regression; HQc would choose the 3-variable; Cp would choose the 5-variable; everyone else would choose the 4-variable. Let’s save these four regressions, and display them.

best1={reg1[[1]],reg1[[3]],reg1[[4]],reg1[[5]]};
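As an illustration of how several criteria can be asked to vote on a list of regressions, here is a sketch that uses only properties built into `LinearModelFit`, plus PRESS computed from the residuals and the hat-matrix diagonal. It is not the `select` function of this post (which also consults Cp, HQc and others); `fitCols` and `press` are hypothetical names, and the data are synthetic.

```mathematica
(* Hypothetical helper: regress the last column of data on the chosen
   columns plus a constant. *)
fitCols[data_, cols_] := Module[{v = Table[Unique["x"], {Length[cols]}]},
  LinearModelFit[data[[All, Append[cols, -1]]], v, v]]

(* PRESS: sum of squared leave-one-out prediction errors, e_i/(1 - h_ii). *)
press[fit_] := Total[(fit["FitResiduals"]/(1 - fit["HatDiagonal"]))^2]

(* Synthetic data, for illustration only. *)
SeedRandom[1];
toy = Table[With[{x = RandomReal[1, 3]},
    Append[x, 1 + 2 x[[1]] - 3 x[[2]] + RandomReal[{-0.1, 0.1}]]], {40}];

(* A nested sequence of regressions, as forward selection would produce. *)
fits = fitCols[toy, #] & /@ {{1}, {1, 2}, {1, 2, 3}};
table = {#["AdjustedRSquared"], #["AIC"], #["BIC"], press[#]} & /@ fits;

(* Position of the regression each criterion prefers:
   largest Adjusted R Squared, smallest AIC, BIC and PRESS. *)
{First[Ordering[table[[All, 1]], -1]],
 First[Ordering[table[[All, 2]], 1]],
 First[Ordering[table[[All, 3]], 1]],
 First[Ordering[table[[All, 4]], 1]]}
```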

That’s what forward selection does.

## Stepwise Regression

But what about stepwise? We see that the (absolute value) t-statistic for LANDAREA fell from 3.8 in regression #3 to 0.8 when DENSITY was introduced in regression #4, so I drop LANDAREA.

Before I show you this, let me say that I believe (but I’m not – edit! – positive) that what Draper and Smith would do, after dropping LANDAREA, is to return to the regression with POP and INCOME, and see what is the next best variable to add. Hold this thought.

Let me drop the sixth name (LANDAREA) from the list of names, and the sixth column from the data array:
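The original commands were a screenshot. A minimal sketch of the two deletions, using toy stand-ins for `n1` and `d1` (the numbers are placeholders, and the names `n2` and `d2` are assumptions):

```mathematica
(* Toy stand-ins for the name list and the data array. *)
n1 = {"FARE", "GASPRICE", "INCOME", "POP", "DENSITY", "LANDAREA"};
d1 = {{2, 3, 4, 5, 6, 7, 1}, {9, 10, 11, 12, 13, 14, 8}};

n2 = Delete[n1, 6];          (* remove the sixth name, LANDAREA *)
d2 = Delete[#, 6] & /@ d1;   (* remove the sixth column from every row *)
{n2, Dimensions[d2]}
```

The dependent variable stays in the last column, which is now column 6.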

Now run stepwise on the reduced dataset, and call “select”:

This time our criteria are almost unanimous on #3, with Cp — of course? — the only exception.

Note that with LANDAREA removed, the second variable introduced is DENSITY; it enters before INCOME. **This is why I go all the way back after I drop a variable. I do not assume that because INCOME was third after POP and LANDAREA that it will be second after POP**.

Let me say that a different way. I think that Draper and Smith would have kept POP and INCOME, and I’m sure they would discover that DENSITY was the best variable to add. I do not assume that POP and INCOME is the appropriate 2-variable regression. I believe we do, as it happens, get the same answer (I believe we both get the same 3-variable regression); but I won’t count on it.

Note that for Cp, the touchstone regression has 5 variables instead of 6. Among other things, this means that the two regressions chosen by Cp are not directly comparable, since they were based on different touchstones.
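To see why the touchstone matters, recall the usual definition Cp = SSE_p / s^2 - n + 2p, where SSE_p is the error sum of squares of the regression being judged, p is its number of parameters (constant included), n is the number of observations, and s^2 is the estimated error variance of the touchstone regression. A different touchstone means a different s^2, hence a different scale for every Cp. The sketch below assumes that definition; `fitCols` and `mallowsCp` are hypothetical names, and the data are synthetic.

```mathematica
(* Hypothetical helper: regress the last column of data on the chosen
   columns plus a constant. *)
fitCols[data_, cols_] := Module[{v = Table[Unique["x"], {Length[cols]}]},
  LinearModelFit[data[[All, Append[cols, -1]]], v, v]]

(* Mallows' Cp of a regression relative to a touchstone regression. *)
mallowsCp[sub_, touchstone_] :=
 Module[{e = sub["FitResiduals"], p = Length[sub["BestFitParameters"]]},
  Total[e^2]/touchstone["EstimatedVariance"] - Length[e] + 2 p]

(* Synthetic data, for illustration only. *)
SeedRandom[1];
toy = Table[With[{x = RandomReal[1, 3]},
    Append[x, 1 + 2 x[[1]] - 3 x[[2]] + RandomReal[{-0.1, 0.1}]]], {40}];

full = fitCols[toy, {1, 2, 3}];   (* plays the role of the touchstone *)
sub = fitCols[toy, {1, 2}];

mallowsCp[sub, full]
mallowsCp[full, full]   (* equals the touchstone's parameter count, 4, up to rounding *)
```

The second result is a handy sanity check: a touchstone judged against itself always scores its own number of parameters.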

Let me save both of these…

best2={reg2[[3]],reg2[[4]]};

and let me display them:

Since they are consecutive regressions, they show us what we need in order to consider stepwise again. Yes, FARE came in with a low t-statistic, but any other variable would have had a lower one; what matters is that no earlier variable fell down. **This is when I stop: no t-statistic before the last entry has fallen below ~2 in absolute value.** (And the next regression would be all the remaining variables; dropping the lowest t-statistic amounts to starting backward selection.)

Let me combine the two short lists of best regressions. (In the previous post, I combined the longer lists of all the regressions that had been constructed by the stepwise function; this time, I’m only going to use the results of the select function.)

Furthermore, at this point, since I did not include the 6-variable regression, any subsequent computation of Cp is rather nonsensical: neither the 5-variable touchstone nor the 6-variable touchstone is in the short list.

Note that there are no duplicates in this list. (My “select” fails if there are, so I would have had to remove any duplicates.)

Now ask the criteria to make their choices:

Again, that Cp is silly — we have the wrong touchstone regression. So, all the valid criteria are unanimous: regression #5 in the combined short list. Save it under a name I can remember…

bestS={best[[5]]};

… and display it:

And that is the very same regression chosen by backward selection. This is the one I would investigate.

## Mallows’ Cp

Suppose we wanted to use Cp. We have seen these three regressions, the first two from forward and stepwise, the third from backward selection:

Wait a minute. The first and third of those used the same touchstone, so we can choose between them. By including the 6-variable touchstone in the following list, I get a valid computation; all I care about is the Cp choice:

Choose the regression selected by Cp in backward selection over the one in forward selection. Our two remaining choices are then:

I can’t say I would give either of those a second look. Sure, those are good Adjusted R Squared values, but there are insignificant t-statistics in these two regressions.

Let me remind you that stepwise uses Adjusted R Squared, and we saw that any criterion except Cp and PRESS would give the same sequence of regressions; backward selection uses t-statistics.

**It seems more than a little unfair to expect Cp to do well, when it wasn’t consulted in getting the sequence of regressions**. I think that if I really want to give Cp a fair test, I should use it for stepwise, as an elective alternative to Adjusted R Squared. I’m in no hurry to do that, though. This is a personal decision: I would rather use what I already have on all the datasets I’ve got for playing with.

## All possible subsets of variables

Now, for a different question.

If we look at all possible regressions, what will we choose?

Unanimous, except for Cp. What are the two?

The first of those is precisely the common choice of backward selection and stepwise. That is, both stepwise and backward selection got the right answer. Even though backward selection looked at only six regressions, and stepwise looked at ten (OK, eleven, but only ten distinct), they both picked out the same regression we would have gotten by looking at all possible 2^6 = 64 subsets of variables.
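A brute-force search over all subsets is easy to sketch. The version below is an illustration on three synthetic variables, not the code used for this post: `fitCols` is a hypothetical helper and the winner is picked by Adjusted R Squared alone.

```mathematica
Length[Subsets[Range[6]]]   (* 2^6 = 64, counting the empty (constant-only) subset *)

(* Hypothetical helper: regress the last column of data on the chosen
   columns plus a constant. *)
fitCols[data_, cols_] := Module[{v = Table[Unique["x"], {Length[cols]}]},
  LinearModelFit[data[[All, Append[cols, -1]]], v, v]]

(* Synthetic data, for illustration only. *)
SeedRandom[1];
toy = Table[With[{x = RandomReal[1, 3]},
    Append[x, 1 + 2 x[[1]] - 3 x[[2]] + RandomReal[{-0.1, 0.1}]]], {40}];

subsets = Rest[Subsets[Range[3]]];   (* the 7 non-empty subsets of 3 variables *)
scores = fitCols[toy, #]["AdjustedRSquared"] & /@ subsets;
subsets[[First[Ordering[scores, -1]]]]   (* subset with the largest Adjusted R Squared *)
```

With 6 variables this is only 63 non-trivial fits, which is why running all subsets is a cheap check here; the count doubles with every added variable.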

Now let’s look at the Cp choices.

Here are the choices of stepwise, backward selection, and all possible subsets:

We’re looking at three different regressions. I infer that, for the Cp criterion, stepwise and backward selection did not find the best choice among all subsets. As I said earlier, there’s a good reason for that. My small subsets in forward selection and stepwise were chosen by Adjusted R Squared — not by Cp. Only after I get the relative handful of regressions do I give Cp a chance to look at them. Similarly, my small subset in backward selection was chosen by t-statistics — not by Cp.

If I really want to give Cp a fair shake, I should either use all subsets — or I should modify my stepwise code so that it can (optionally) get the subsets based on Cp.

I think it is fair to say that **applying Cp to the results of forward selection or stepwise, after using Adjusted R Squared, is unfair to Cp**.

Similar considerations might apply to PRESS — except that it doesn’t seem to be so very different from the other criteria.

## Summary

For this dataset, the selection criteria (except for Cp) applied to stepwise regression and to backward selection all agree on the best regression: INCOME, POP, DENSITY.

Running all possible regressions shows us that stepwise regression and backward selection didn’t miss anything: for this data, we can’t do better than the result of stepwise or backward.

Until and unless I change my code, the only fair treatment of Cp would be to run all possible subsets.

My stepwise criterion, when I look at a k-variable regression (not counting the constant), is to drop the variable with minimum (absolute value) t-statistic less than about 2, excluding the constant and the k-th variable. I stop when there’s nothing to remove. What I’m looking for is a variable that came in early, but became insignificant when something else was added.
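That rule can be written down as a small function. This is a sketch of the rule as stated, not code from this post: `dropCandidate` is a hypothetical name, the input is assumed to list the t-statistic of the constant first and then those of the variables in their order of entry (at least one variable), and the example numbers are made up.

```mathematica
(* Return the position (among the variables) of the one to drop, or None.
   The constant (first entry) and the most recently added variable
   (last entry) are never candidates. *)
dropCandidate[tstats_List, cutoff_: 2] :=
 Module[{t = Abs[Most[Rest[tstats]]], pos},
  If[t === {}, None,
   pos = First[Ordering[t, 1]];
   If[t[[pos]] < cutoff, pos, None]]]

(* Made-up t-statistics: {constant, var1, var2, var3, var4}. *)
dropCandidate[{5.1, 4.2, 0.8, 3.0, 2.5}]   (* 2: the second variable fell to 0.8 *)
dropCandidate[{5.1, 4.2, 3.1, 1.2}]        (* None: only the newest variable is weak *)
```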

I’ll remind you that my stepwise algorithm is not what (I think) Draper and Smith describe as stepwise: after I remove a variable, I go back to the beginning; I believe they do not.

I will be showing you more examples of this.
