Let us work another example from Ramanathan. Perhaps I should emphasize that although I am using his dataset, this is not his analysis. Let me assure you that his analysis is worth reading.
I described how to get the data from his website in the previous regression post. This is dataset DATA4-5.XLS, and it appears to be the same for both the 4th and 5th editions of the text. The data for the 3rd edition, however, is different, and the regressions are slightly different.
Here is his description of the data from the 4th edition file, which included descriptive information; the 5th edition data contains the variable names but nothing else (that I saw). I am, in fact, using the 5th edition data, despite the 4th edition description.
DATA4-5: THE FOLLOWING ARE 1990 CENSUS DATA BY STATES PARTICIPATION RATE (IN %) OF ALL WOMEN OVER 16
Compiled by Louis Cruz
- wlfp = persons 16 years & over–percent in labor force who are female
- yf = median earnings (in thousands of dollars) by females 15 years & over with income in 1989
- ym = median earnings (in thousands of dollars) by males 15 years & over with income in 1989
- educ = females 25 years & over–percent high school graduate or higher
- ue = civilian labor force–percent unemployed
- mr = female population 15 & over–percent now married (excluding separated)
- dr = female population 15 & over–percent who are divorced
- urb = percent of population living in urban areas
- wh = female population–percent 16 years and over who are white
The following commands import the Excel file… display the first line (variable names)… put the variable names into the list n1… create a data matrix d1 with the dependent variable WLFP in the last column… and print the dimensions of the data matrix….
We see that we have 9 variables (8 independent) and 50 observations.
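As a rough illustration of the layout just described, here is a minimal Python sketch (the post itself uses Mathematica, and those commands are not reproduced here). The array `raw` is synthetic filler, not the census data; the column order is assumed to follow the description above; and `dependent_last` is a hypothetical helper name.

```python
import numpy as np

# Assumed column order, taken from the variable description above.
names = ["wlfp", "yf", "ym", "educ", "ue", "mr", "dr", "urb", "wh"]

# Synthetic stand-in for the spreadsheet: 50 rows, NOT the real census values.
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, len(names)))

def dependent_last(data, names, dep):
    """Return (matrix, names) with the column called `dep` moved to the end."""
    k = names.index(dep)
    order = [i for i in range(len(names)) if i != k] + [k]
    return data[:, order], [names[i] for i in order]

d1, n1 = dependent_last(raw, names, "wlfp")
print(d1.shape)   # (50, 9): 50 observations, 8 independent variables plus WLFP
```

With this ordering, everything except the last column is an independent variable, which is convenient for the selection sketches further on.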
As I did last time, here is the data… it is, after all, already publicly available on the Internet. Oh, for some reason I had to do it in two pieces: the first 25 observations, then the remaining 25 observations.
the remaining 25 observations:
For no particular reason, except to be different, let me run stepwise first. (Okay, until and unless I drop a variable, this is forward selection.)
We have three candidates; interestingly, Cp agrees with Adjusted R Squared… a few criteria would choose #6 instead of #7… and HQc would choose #4.
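For readers who want to see the mechanics, here is a minimal Python sketch of forward selection. It assumes the rule "add whichever remaining variable lowers the residual sum of squares the most"; the stepwise command used for this post may apply a different rule. Only Adjusted R Squared is computed, as one example of a criterion, and the data are synthetic.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X plus a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def adjusted_r2(X, y):
    """Adjusted R^2 for an OLS fit with a constant and X.shape[1] slopes."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (sse(X, y) / (n - k - 1)) / (sst / (n - 1))

def forward_selection(X, y):
    """Return a list of (columns, adjusted R^2), one entry per step."""
    chosen, remaining, path = [], list(range(X.shape[1])), []
    while remaining:
        # Assumed rule: add the variable that gives the smallest SSE.
        best = min(remaining, key=lambda j: sse(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), adjusted_r2(X[:, chosen], y)))
    return path

# Synthetic data for illustration only (NOT the census data).
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + 0.1 * rng.normal(size=50)

path = forward_selection(X, y)
for cols, r2 in path:
    print(cols, round(r2, 4))
```

Each step yields one regression, so a run over k variables produces k nested regressions, which the criteria can then rank.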
Let me save those three…
Now let’s look at numbers 4 thru 7… (Note that the Adjusted R Squared is printed below each parameter table.)
We see that regression #4 is the only one all of whose t-statistics are “significant”. In #5, we see that URB comes in with a t-statistic less than 2, but nothing earlier falls too far. But in #6, the t-statistic for YM has fallen below 2.
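The "t-statistic less than 2" check can be sketched as follows. This is a minimal Python illustration on synthetic data, assuming ordinary least squares with a constant term; `t_statistics` is a hypothetical helper, not the code behind the tables in this post.

```python
import numpy as np

def t_statistics(X, y):
    """t-statistics of the slope coefficients in an OLS fit with a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    s2 = float(resid @ resid) / (len(y) - A.shape[1])   # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))  # standard errors
    return (beta / se)[1:]                              # skip the constant

# Synthetic data for illustration only: just column 0 really matters.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=50)

t = t_statistics(X, y)
weak = [j for j, tj in enumerate(t) if abs(tj) < 2]   # the rule of thumb used here
print(np.round(t, 2), weak)
```

The cutoff of 2 is only the rule of thumb used in this post, not an exact critical value.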
So take it out.
We see that it is the 2nd name… so we drop it… and then we drop the 2nd column from the data matrix d1….
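In Python terms, dropping a variable from both the name list and the data matrix might look like this. It is a sketch with a placeholder matrix, and `drop_variable` is a hypothetical helper; the name list is assumed to end with the dependent variable, as described earlier.

```python
import numpy as np

def drop_variable(data, names, name):
    """Remove the column called `name` from both the matrix and the name list."""
    k = names.index(name)
    return np.delete(data, k, axis=1), names[:k] + names[k + 1:]

# Placeholder matrix with the assumed layout (dependent variable last).
n1 = ["yf", "ym", "educ", "ue", "mr", "dr", "urb", "wh", "wlfp"]
d1 = np.arange(50 * 9, dtype=float).reshape(50, 9)

d2, n2 = drop_variable(d1, n1, "ym")   # YM is the 2nd name, so the 2nd column goes
print(d2.shape, n2[:3])
```

Looking the column up by name keeps the two objects in step, instead of relying on remembering that YM sits in position 2.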
Wow, almost unanimous. Even Cp agrees with the consensus. Note, however, that this Cp was computed using a different touchstone regression from the previous output from stepwise.
Save these two, anyway…
Let’s look at regressions #5 through #7:
In regression #5, all t-statistics are significant. In regression #6, DR comes in with a t-statistic slightly less than 2, but no prior t-statistic falls below 2.
In regression #7, MR comes in with a t-stat less than 2, and DR is still below 2, but nothing earlier has fallen below 2. (In fact, DR has a higher t-stat in regression #7 than in #6… I suppose that if it had fallen a lot instead, I might drop DR… but I’m not sure. Hey, do what you like – this is largely heuristic. After all, I’m going to let the criteria check out whatever I run, so it won’t kill me to run too many regressions.)
I’m done. That is, I choose to stop here.
I’m going to combine the candidates from the two stepwise runs, and I’m going to add the original touchstone regression – all 8 independent variables (that’s reg1[[-1]], the first from the end of the list reg1).
As I said in the previous example… I don’t see how to relate two Cp values which used different touchstone regressions. The first stepwise run used the 8-variable regression as the touchstone, and in what follows both backward selection and all possible subsets will use it too, so I’m going to use it here as well.
On top of it all, since the choices in stepwise and in backward selection do not use Cp to get the small subsets of regressions, we’re really not being fair to Cp anyway.
Until and unless I change my code, all possible subsets is the only fair way to assess Cp… at least, that’s my take on it.
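The dependence on the touchstone can be made concrete. The sketch below assumes the textbook form Cp = SSE_p / s^2 - n + 2p, where s^2 is the residual variance of the touchstone regression and p counts the constant; the code used for this post may define Cp differently. The data are synthetic.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from an OLS fit of y on X plus a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def mallows_cp(X_sub, X_touch, y):
    """Cp of the regression on X_sub, with s^2 taken from the touchstone X_touch."""
    n = len(y)
    p = X_sub.shape[1] + 1                              # parameters, constant included
    s2 = sse(X_touch, y) / (n - X_touch.shape[1] - 1)   # touchstone residual variance
    return sse(X_sub, y) / s2 - n + 2 * p

# Synthetic data for illustration only (NOT the census data).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=50)

cp_a = mallows_cp(X[:, :2], X, y)          # touchstone: all 4 variables
cp_b = mallows_cp(X[:, :2], X[:, :3], y)   # touchstone: first 3 variables
cp_full = mallows_cp(X, X, y)              # touchstone scored against itself
print(cp_a, cp_b, cp_full)
```

Under this definition, the touchstone scored against itself always gets Cp equal to its own parameter count (5 here), while the same two-variable regression gets two different Cp values under the two touchstones. That is why Cp values from different touchstones are hard to compare.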
Anyway, here are the five candidates from reg1 and reg2, and the 8-variable regression for the touchstone.
Those appear to be six different regressions, just judging from the constants. (Oh, the Union command doesn’t do much, and I should have dropped it. I did confirm that the last regression in this list is, as it should be, the 8-variable one.)
Save those three candidates…
Now let’s run backward selection…
This time Cp disagrees.
Note that we dropped MR first and then YM, and then DR.
I decided to look at regressions 5–8, and I did it in reverse order, with #8 first.
In the 3rd edition, Ramanathan ran 5 models. The first was all the variables… and the next one dropped YM, MR, and DR. Then he said he preferred to see what happened if he dropped them one at a time – he did backwards selection without calling it that. He dropped only YM, then YM and DR, then YM, MR, DR and URB. (He already had the regression that omitted YM, MR, and DR.) He effectively presented the results of a backward selection, to the point of dropping four variables.
On the different data for the 4th and 5th editions, my backward selection dropped MR, then YM, and then DR, instead of his YM, DR, and then MR. As far as I can see, I would have matched his sequence for the 3rd edition data, but the sequence is different for the 4th = 5th edition data.
Let’s save the three candidates…
Let’s recall the three candidates from stepwise…
They appear to agree, judging once again from just the constant terms. (I have no idea why the model display changes.)
Both stepwise and backward selection would choose three candidates: one selected by Cp, one selected by HQc, and the third selected by the other thirteen criteria.
All Possible Regressions
Now, did they make the right choices? Let’s run all possible subsets:
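For scale: with 8 independent variables there are 2^8 - 1 = 255 non-empty subsets, so an exhaustive search is cheap. Here is a minimal Python sketch on synthetic data, ranking by Adjusted R Squared only; the criteria and code used for this post are not reproduced.

```python
import itertools
import numpy as np

def adjusted_r2(X, y):
    """Adjusted R^2 for an OLS fit of y on X plus a constant."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (float(resid @ resid) / (n - k - 1)) / (sst / (n - 1))

def all_subsets(X, y):
    """Fit every non-empty subset of columns; return (columns, adj R^2), best first."""
    k = X.shape[1]
    results = []
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            results.append((cols, adjusted_r2(X[:, list(cols)], y)))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Synthetic data for illustration only (NOT the census data).
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=50)

ranked = all_subsets(X, y)
print(len(ranked))     # 255
print(ranked[0])
```

Because every subset is fitted, any criterion can be applied afterwards without the search path favoring one criterion over another.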
Nice. We get three candidates again: one chosen by Cp again, one chosen by HQc again, and the third chosen by the other thirteen criteria, again.
We already know that stepwise and backwards agreed, so compare backwards with all possible regressions:
They agree… that was the choice of HQc.
They agree, too. That was the choice of the other thirteen criteria.
What’s the difference between these two candidates? The addition of DR, with a t-statistic slightly less than 2. I find it worth noting that only HQc decided against adding DR.
These are Cp choices. They are comparable, because they had the same touchstone regression. We see that backwards selection (and therefore stepwise) did not agree with all possible subsets – backward selection and stepwise got the wrong answer.
Once again, I can’t be all that surprised. Backward selection uses the t-statistics to choose what variable to drop… a fair test might be to compute the Cp for all the regressions with one fewer variable, and choose the best of those.
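That proposed test is easy to sketch. The Python below is an illustration on synthetic data, assuming the textbook Cp = SSE_p / s^2 - n + 2p with a fixed touchstone and assuming that backward selection drops the variable with the smallest |t|. The helper names are hypothetical and this is not the code used for this post.

```python
import numpy as np

def fit(X, y):
    """OLS of y on X plus a constant: returns (design matrix, beta, SSE)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return A, beta, float(resid @ resid)

def mallows_cp(X_sub, X_touch, y):
    """Cp of the regression on X_sub, with s^2 taken from the touchstone X_touch."""
    n = len(y)
    s2 = fit(X_touch, y)[2] / (n - X_touch.shape[1] - 1)
    return fit(X_sub, y)[2] / s2 - n + 2 * (X_sub.shape[1] + 1)

def smallest_t_drop(X, cols, y):
    """Variable that t-based backward selection would drop from `cols`."""
    A, beta, ss = fit(X[:, cols], y)
    s2 = ss / (len(y) - A.shape[1])
    t = beta / np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
    return cols[int(np.argmin(np.abs(t[1:])))]   # ignore the constant

def cp_best_drop(X, cols, X_touch, y):
    """Variable whose removal gives the lowest Cp among one-smaller regressions."""
    return min(cols, key=lambda j: mallows_cp(
        X[:, [c for c in cols if c != j]], X_touch, y))

# Synthetic data for illustration only (NOT the census data).
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.2, 2.0, 0.0, 1.0]) + rng.normal(size=50)

cols = list(range(5))
by_t = smallest_t_drop(X, cols, y)
by_cp = cp_best_drop(X, cols, X, y)
print(by_t, by_cp)
```

One thing the sketch suggests, under these assumptions: the two rules pick the same variable. For regressions of equal size and a fixed touchstone, Cp rises with the residual sum of squares, and the squared t-statistic of a variable is proportional to the increase in that sum when the variable is dropped. If so, the disagreement with all possible subsets would come from the one-variable-at-a-time search rather than from using t-statistics to pick the variable.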
As in the previous example, I myself wouldn’t give a second look to any of the selections made by Cp. I would look at the other two candidates – and if I were pressed for time, I would look only at the one selected by thirteen of the criteria.