Let us work another example from Ramanathan. In fact, this is one of his exercises… for which he provides a fairly detailed list of questions. Perhaps I should emphasize that although I am using his dataset, this does not answer his questions. In other words, I’m about to single out the best regressions, but I’m not going beyond that.
On the other hand, his problem statement does suggest a backward selection, so I’m “doing the homework” to that extent.
Here is a description of the data based on the 4th edition file, which included descriptive information; the 5th edition data contains the variable names but nothing else (that I saw). I am, in fact, using the 5th edition data, despite the 4th edition description.
I showed you earlier where to get his data.
Data on the factors affecting baseball attendance — compiled by Scott Daniel.
- ATTEND = Attendance in thousands for the years 1984-86.
- POP = the population of the metropolitan area where the team is located (in thousands).
- CAPACITY = Capacity of the Stadium where the team plays its home games. (in thousands).
- PRIORWIN = the number of wins the team had in the year prior to the year under examination.
- CURNTWIN = the number of wins the team had in the year under study.
- G1 = the number of games behind the division leader the team is in the standings on April 30.
- G2 = the number of games behind the division leader the team is on May 31.
- G3 = the number of games behind the division leader the team is on June 30.
- G4 = the number of games behind the division leader the team is on July 31.
- G5 = the number of games behind the division leader the team is on August 31.
- GF = the number of games behind the division leader the team is after the season was over.
- OTHER = the number of other baseball teams in the metropolitan area.
- TEAMS = the number of Football, Basketball, or Hockey teams found in the Metropolitan area.
I described how to get the data from his website in the previous regression post. This is dataset DATA4-13.XLS, and it appears to be the same for the 3rd, 4th and 5th editions of the text.
The following commands import the Excel file… display the first line (variable names)… put the variable names into the list n1… create a data matrix d1 without the first line and with the dependent variable ATTEND in the last column… and print the dimensions of the data matrix.
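For readers not following along in my own software, here is a minimal Python sketch of the same bookkeeping: split off the header row, then move the dependent variable to the last column. The function name `dependent_last` and the tiny table are hypothetical; the numbers are made up and are not Ramanathan's data, and I am only assuming that ATTEND starts out in some other column.

```python
def dependent_last(rows, dependent):
    """Split the header row from the data and move one column to the end."""
    header, body = rows[0], rows[1:]
    j = header.index(dependent)
    # Keep every other column in its original order, then append the dependent one.
    order = [i for i in range(len(header)) if i != j] + [j]
    names = [header[i] for i in order]
    data = [[row[i] for i in order] for row in body]
    return names, data

# Made-up numbers, NOT Ramanathan's data; only the layout matters here.
sheet = [
    ["ATTEND", "POP", "CAPACITY"],
    [1500, 3000, 50],
    [2100, 8000, 55],
]

n1, d1 = dependent_last(sheet, "ATTEND")
print(n1)                    # variable names, dependent variable last
print(len(d1), len(d1[0]))   # dimensions of the data matrix: rows, columns
```

Keeping the dependent variable in a fixed position (last) means every later step can treat "all columns but the last" as the candidate regressors.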
This time I’m not going to display the data (it doesn’t fit well); it is already publicly available on his website.
forward selection (stepwise 1)
For no particular reason, except force of habit, let me run stepwise first.
NOTE that the Adjusted R Squared is not monotone. #7 is worse than #6, but #8 is better (it’s nice to know that it can happen):
We see that G4 went insignificant (absolute value of t-stat less than 1) in 7.
I would also remark that although G1 was the best variable to add to regression 6, it came in with a t-statistic of less than 1 (in absolute value), so the Adjusted R Squared fell. But then G3 was added to regression 7, and came in with a t-statistic greater than 1, so the Adjusted R Squared went up. But the key is that G3 did not come in with a lower t-statistic than G1 did on the previous step; it is not true that the t-statistic for the next variable is always decreasing.
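The link between a t-statistic of 1 and the direction of Adjusted R Squared can be made explicit. This is a sketch under the usual OLS definitions; the symbols are my own notation here: n observations, k coefficients (including the constant) before the new variable is added, SSE for the error sum of squares, and SST for the total sum of squares of the dependent variable.

```latex
\begin{aligned}
\bar{R}^2 &= 1 - \frac{SSE/(n-k)}{SST/(n-1)}, \\[4pt]
t^2 &= \frac{SSE_{\text{old}} - SSE_{\text{new}}}{SSE_{\text{new}}/(n-k-1)}
      \quad\text{(for the one added variable)}, \\[4pt]
\frac{SSE_{\text{new}}}{n-k-1} < \frac{SSE_{\text{old}}}{n-k}
  &\iff (n-k)\,SSE_{\text{new}} < (n-k-1)\,SSE_{\text{old}} \\
  &\iff SSE_{\text{new}} < (n-k-1)\,\bigl(SSE_{\text{old}} - SSE_{\text{new}}\bigr) \\
  &\iff t^2 > 1 .
\end{aligned}
```

So the estimated error variance falls, and Adjusted R Squared rises, exactly when the added variable comes in with a t-statistic larger than 1 in absolute value.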
This is interesting… considering that backward selection works by removing the variable with the lowest t-statistic at each stage. For now, just remember this oddity.
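To make "removing the variable with the lowest t-statistic at each stage" concrete, here is a rough pure-Python sketch of that loop. It is not the code I actually ran: the helper names (`invert`, `t_stats`, `backward_order`) are hypothetical, the data are simulated, and the loop simply keeps dropping until nothing is left so that the whole removal order is visible.

```python
import random

def invert(a):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def t_stats(columns, y):
    """OLS of y on a constant plus `columns`; t-statistics for the columns only."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
           for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [y[i] - sum(X[i][a] * beta[a] for a in range(k)) for i in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)   # estimated error variance
    return [beta[a] / (s2 * inv[a][a]) ** 0.5 for a in range(1, k)]

def backward_order(names, columns, y):
    """Names in the order backward selection drops them (smallest |t| first)."""
    names, columns = list(names), list(columns)
    dropped = []
    while names:
        t = t_stats(columns, y)
        worst = min(range(len(names)), key=lambda j: abs(t[j]))
        dropped.append(names.pop(worst))
        columns.pop(worst)
    return dropped

# Simulated data (an assumption for illustration): x1 matters a lot,
# x2 matters somewhat, and "noise" does not enter y at all.
random.seed(0)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]
y = [3 + 5 * a + 1 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

order = backward_order(["x1", "x2", "noise"], [x1, x2, noise], y)
print(order)
```

Each pass refits the regression before choosing the next variable to drop, because the t-statistics of the survivors change once a variable leaves.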
From the list of names, G4 is the 8th variable… so we drop the 8th name from n1 and the 8th column from d1:
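As a sketch of that drop in Python, where lists are 0-based, so the 8th variable sits at index 7. The function name `drop_variable` is hypothetical, the single data row is a placeholder rather than real data, and I am assuming the names appear in the order of the description above with ATTEND last.

```python
def drop_variable(names, data, position):
    """Drop the variable at a 1-based position from the names and the data columns."""
    i = position - 1  # convert the 1-based position to a 0-based index
    new_names = names[:i] + names[i + 1:]
    new_data = [row[:i] + row[i + 1:] for row in data]
    return new_names, new_data

# Assumed ordering: the 12 regressors as described, dependent variable last.
n1 = ["POP", "CAPACITY", "PRIORWIN", "CURNTWIN", "G1", "G2", "G3",
      "G4", "G5", "GF", "OTHER", "TEAMS", "ATTEND"]
d1 = [list(range(1, 14))]  # one placeholder row: column j holds the number j

n2, d2 = drop_variable(n1, d1, 8)
print(n2)
print(d2[0])
```

The original lists are left untouched, which makes it easy to go back to n1 and d1 for a later pass.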
Run stepwise on the reduced data:
We have gotten four candidates. Save them…
I see G1 fall in #7… and that is one reason why I’m posting this analysis. In every other case, stepwise has ended after dropping one variable. Not this time.
… so I drop G1. It is variable 5, so drop the 5th entry from names n2 and the 5th column from the dataset d2:
Again, run stepwise on the new reduced data:
… and print them:
So, GF is barely acceptable when it enters in #7, falls in #8, but comes back up in #9. OTOH, the variables added in 8 and 9 are not acceptable.
Yes, it is true that G5 has become insignificant in #9, but the last two variables added are not worthwhile. Still, I could remove G5… but I’m not going to.
Command decision: stop here.
combining the stepwise selections
As I have done before, I am going to include the regression with all variables, so that Cp uses it for the touchstone. (We realize that the Cp for the second and third passes in stepwise did not use the same touchstone, and are not really comparable. I’m still in the position of choosing Cp unfairly: as an additional separate procedure, I ought to be moving stepwise based on Cp, to make it fair, for Cp.)
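To see why the touchstone matters, here is a small sketch of Mallows' Cp in its usual textbook form, where the error variance is estimated from the touchstone (full) regression. The function name and all the numbers are made up for illustration, and this may not match my own code in every detail.

```python
def mallows_cp(sse_p, p, sse_full, k_full, n):
    """Mallows' Cp for a candidate with p coefficients (constant included).

    sse_full and k_full describe the touchstone regression, which supplies
    the estimate of the error variance; n is the number of observations.
    """
    s2_full = sse_full / (n - k_full)   # error variance from the touchstone
    return sse_p / s2_full - n + 2 * p

# Made-up numbers: 50 observations, touchstone with 13 coefficients.
n, k_full, sse_full = 50, 13, 370.0

print(mallows_cp(sse_full, k_full, sse_full, k_full, n))  # the touchstone itself
print(mallows_cp(400.0, 5, sse_full, k_full, n))          # a smaller candidate
```

For the touchstone itself the formula collapses to Cp = k_full, whatever the data. Change the touchstone and `s2_full` changes with it, which is why Cp values computed against different touchstones are not really comparable.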
Make the selections out of that set:
The criteria are almost all in favor of a very small model.
OK, now let’s run backwards selection:
Yikes! What is the world coming to? Cp agrees with many other criteria. SHIBATA agrees with Adjusted R Squared. (SIGMASQ always agrees with Adjusted R Squared.)
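The reason SIGMASQ cannot disagree with Adjusted R Squared is a one-line identity. This assumes SIGMASQ is the usual estimated error variance, SSE/(n-k), with n observations and k coefficients including the constant; the symbols are my notation here.

```latex
\bar{R}^2 \;=\; 1 - \frac{SSE/(n-k)}{SST/(n-1)}
          \;=\; 1 - \frac{\hat{\sigma}^2}{SST/(n-1)},
\qquad \hat{\sigma}^2 = \frac{SSE}{n-k}.
```

Since SST/(n-1) depends only on the dependent variable, it is the same for every candidate regression, so the regression with the largest Adjusted R Squared is the one with the smallest estimated error variance.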
We have three candidates; save them:
Comparing stepwise and backward selection
We see that backwards selection and stepwise agree on the maximum Adjusted R Squared:
The small candidates from stepwise and backward selection are different:
The Cp candidate from backwards selection was one of the smaller candidates favored by other criteria; let’s compare it to the Cp candidate from stepwise. They are different from each other, too.
We got a lot of candidates, between stepwise and backward selection. We could merge these candidates, bearing in mind that there is one duplicate, and that we want to include the touchstone regression.
This time we see that all of the criteria choose a regression from the 3 stepwise candidates. (We had dropped the duplicate from backward selection.)
all possible subsets
For this dataset, we have 12 variables… there are 2^12 = 4096 possible regressions. This is still doable, so let’s see what the best candidates really are, and how well stepwise and backward selection performed.
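As a quick check on that count, here is a Python sketch that enumerates every subset of the 12 variable names. It only lists the subsets and fits nothing; I am assuming the count of 4096 includes the empty subset (the constant-only regression).

```python
from itertools import combinations

names = ["POP", "CAPACITY", "PRIORWIN", "CURNTWIN", "G1", "G2", "G3",
         "G4", "G5", "GF", "OTHER", "TEAMS"]

def all_subsets(items):
    """Yield every subset of items, from the empty one up to the full set."""
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            yield subset

subsets = list(all_subsets(names))
print(len(subsets))
```

Each additional candidate variable doubles the count, which is why all possible subsets stops being practical fairly quickly.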
For what it’s worth, it took 3 minutes wall clock time to run all those and select them.
And print them:
We should not be surprised that Adjusted R Squared (and its equivalent, SigmaSq) made one choice, Cp made a second choice, and all other criteria made a third choice.
Backwards selection and stepwise both found the following maximum Adjusted R Squared:
It agrees with all possible subsets, so both backward selection and stepwise did find the best Adjusted R Squared.
For stepwise, the other criteria (except for Cp) were unanimous on
It agrees with all possible subsets, so again stepwise did find the best candidate according to the other criteria. (And, as we saw, the other criteria would have picked this regression over the choices made by backward selection.)
Since backward selection did not choose that regression, we know that it does not match all possible subsets. While I cannot detail why, I suspect that it has something to do with that odd behavior we saw at the beginning of stepwise… where G3 came into a regression with a higher t-statistic than G1 had on the previous step, which says that G3 looked better after G1 was in than before G1 was in. Backward selection uses t-statistics to make its choices, so maybe something went awry. I don’t know in detail; all I know is that backward selection did not get the best regression according to the other criteria (where “best” was determined by looking at every possible regression).
For Cp, we know that the favored choice was
As we should expect… as we’ve discussed before… this does not agree with all possible subsets. (Neither stepwise nor backward selection is looking for good Cp.)
So: backward selection did not do as good a job as stepwise, for the other criteria. So, backward selection is easier, but it is not always as effective. And that’s the second reason why I felt I had to post this analysis.
I strongly suspect that there could be situations in which backward selection would agree with all possible subsets while stepwise did not; and I strongly suspect that there could be situations in which neither stepwise nor backward selection would agree with all possible subsets.
But if we can’t afford to run all possible subsets, stepwise and backward selection seem to be our best bets, even if we can’t guarantee they will find the best regressions.
Now, I think that’s enough. I hope I’ve shown you how to do this for yourself.
All we’ve done is select candidates using stepwise regression and backward selection and all possible regressions; we have not examined the resulting selected regressions.
And I’m probably not going to! Not for a while, anyway. Every book will show you that stuff.
Next up? The Hald data exhibits multicollinearity, and I’m going to show you how to pinpoint it.