I have found some data which illustrates a few points I want to make when I summarize what I’ve shown you about ordinary least squares regression – I should be publishing a summary soon. Let me provide some evidence for part of my summary.
This example comes from Atkinson, “Plots, Transformations and Regression”, Oxford Science Publications, reprinted 1988, ISBN 0198533594. It is Example 8, on p. 106, and it deals with the properties of nylon yarn.
He, in turn, took it from John, “Outliers in Factorial Experiments”, Appl. Statist., 27, 111–119, 1978.
I do not know of an online source for this data.
The data appears to have been coded – that’s the name I was groping for: in a designed experiment, the factor levels are mapped onto -1, 0, 1. “A”, for example, is the “ratio of the speeds of the draw and feed rolls” – but the values -1, 0, 1 are almost certainly coded levels of that ratio, not ratios themselves.
Anyway, this is what we have: 6 variables. 27 observations.
Let me just naively run forward selection. While I would not go out of my way to add observation number to the data, I will use it if it’s provided.
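For readers who want the mechanics: forward selection just adds, at each step, whichever remaining variable helps the current regression most. Here is a minimal sketch in Python with made-up data – the greedy adjusted-R^2 stopping rule and the variable names are my illustration, not the tool I actually ran.

```python
import numpy as np

def adj_r2(y, X):
    """Adjusted R^2 of an OLS fit of y on X (X already includes a constant)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - p)

def forward_select(y, cols):
    """Greedily add the column that most improves adjusted R^2;
    stop when nothing improves it."""
    n = len(y)
    chosen, remaining, best = [], dict(cols), -np.inf
    while remaining:
        def score(name):
            X = np.column_stack(
                [np.ones(n)] + [cols[c] for c in chosen] + [remaining[name]])
            return adj_r2(y, X)
        name = max(remaining, key=score)
        if score(name) <= best:
            break
        best, chosen = score(name), chosen + [name]
        del remaining[name]
    return chosen, best

# Made-up data: y depends strongly on A, not at all on B.
rng = np.random.default_rng(0)
A, B = rng.normal(size=27), rng.normal(size=27)
y = 2 * A + 0.1 * rng.normal(size=27)
order, best = forward_select(y, {"A": A, "B": B})
```

With data like this, A comes in first and the selection stops as soon as a candidate fails to raise the adjusted R^2.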
Yuck. An adjusted R^2 less than 0.2.
Interesting nevertheless. A comes in first, then the observation number. What would my selection criteria choose?
The second regression, with A and OBS, has the highest adjusted R^2, but every other criterion (sing along with me: “except Cp as usual”) chooses the first regression. Here are the two:
As we might expect from the selection, OBS came in with a t-statistic greater than 1 – hence the adjusted R^2 went up – but less than 2 – hence the coefficient could really be 0, and the other criteria are skeptical.
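That rule of thumb is exact, not approximate: adding one regressor raises the adjusted R^2 precisely when its t-statistic exceeds 1 in absolute value. A quick derivation, with n observations and p parameters in a model:

```latex
% Adjusted R^2 in terms of the residual sum of squares:
\bar{R}^2 = 1 - \frac{n-1}{n-p}\,(1 - R^2)
          = 1 - (n-1)\,\frac{SSE/(n-p)}{SST},
% so adjusted R^2 rises exactly when SSE/(n-p) falls.
% Going from a k-parameter model to k+1 parameters:
\frac{SSE_{\text{new}}}{n-k-1} < \frac{SSE_{\text{old}}}{n-k}
\iff (n-k-1)\,\frac{SSE_{\text{old}} - SSE_{\text{new}}}{SSE_{\text{new}}} > 1
\iff t^2 > 1,
```

since that last ratio is the one-degree-of-freedom F statistic for the added regressor, which equals the square of its t-statistic.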
Now, neither John nor Atkinson seems to have used OBS as a variable, so I’m going to drop it for the rest of this post. (I did put it back in at the end, for myself. With quadratic terms added, OBS is still significant, but I think it was the last significant variable added.)
The key, however, is that John added quadratic terms – but not all of them. To quote Atkinson, “The effects of factor D were, according to John, known to be relatively small and no interaction with the other three factors was considered likely.”
So, let me construct the squares of A, B, C, D, and the product terms which do not involve D. That is, the data column aa = a*a = a^2, etc., and will be named AA.
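The construction of those columns can be sketched as follows – the arrays here are stand-ins, not the actual yarn data.

```python
import numpy as np
from itertools import combinations

# Stand-in coded factor columns (the real data has 27 observations
# at levels -1, 0, 1).
rng = np.random.default_rng(1)
factors = {name: rng.choice([-1, 0, 1], size=27) for name in "ABCD"}

# Squares of all four factors: AA = A*A, etc.
squares = {name * 2: col * col for name, col in factors.items()}

# Cross-products among A, B, C only -- the interactions with D are
# excluded on John's theoretical grounds.
crosses = {a + b: factors[a] * factors[b] for a, b in combinations("ABC", 2)}

new_cols = {**factors, **squares, **crosses}
```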
Having defined the new variables, let’s build the new data matrix, omitting OBS this time.
Here’s forward selection:
Still yucky! What do the selection criteria pick out?
We have choices: 1,2,4 (sing along with me, “except for Cp”). The highest adjusted R^2, however, is now 0.254868.
reg2[]["AdjustedRSquared"] = 0.254868
Let’s just go ahead and add the other possible interactions, the ones involving D – the ones that shouldn’t matter.
Construct the variables…
Run forward selection…
Significant improvement. Still not good, but much better. Our highest adjusted R^2 is .46 instead of .25.
What I find most interesting is that the second variable introduced is BD – one of the interactions expected to be unlikely. The third is BC – good, that one doesn’t involve D. Then C, and AD is the fifth variable introduced.
So two of the first five variables introduced are interactions with D – the very ones ruled out on theoretical grounds.
Call upon the selection criteria…
Well, we should look at regressions 3,4,6.
Regression 3 has all t-statistics greater than 2… in regression 4 C comes in with a t-statistic between 1 and 2… in regression 6, we see t-statistics falling, but not all that far.
Let’s note that BD is significant.
We could look at the ultimate regression, with all variables…
We see many low t-statistics – but most of them were low all along. The t-statistic for BD is still significant; the one for C is .87 instead of 1.3.
But just to be safe, let’s check quickly for multicollinearity. (I will be shocked if we have it – but it wouldn’t be the first shock of my math.)
Here are the singular values of the design matrix and the condition number (largest over smallest):
After condition numbers in the millions, or merely thousands, 6 is pretty nice. We have seen, however, condition numbers in the 30s and 40s for the standardized and the centered Hald data, respectively, and those were both still multicollinear.
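For reference, the condition number quoted here is just the ratio of the extreme singular values of the design matrix. A sketch of the computation, with made-up matrices rather than the yarn data:

```python
import numpy as np

def condition_number(X):
    """Ratio of largest to smallest singular value of the design matrix."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s[-1]

# Made-up examples: independent columns vs. a nearly collinear extra column.
rng = np.random.default_rng(2)
X = rng.normal(size=(27, 3))
X_bad = np.column_stack([X, X[:, 0] + 1e-8 * rng.normal(size=27)])
```

The near-duplicate column drives the smallest singular value toward zero, so the condition number of `X_bad` explodes while that of `X` stays small.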
Let’s look at the VIFs (Variance Inflation Factors) converted to R^2: here’s the list, and the maximum value…
The highest R^2 is .15. Not one of those 14 variables is well explained by the other 13. I conclude that multicollinearity is not an issue.
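The conversion I’m using is VIF_j = 1/(1 - R^2_j), where R^2_j comes from regressing column j on all the other columns; equivalently, R^2_j = 1 - 1/VIF_j. A sketch, again with made-up data:

```python
import numpy as np

def r2(y, X):
    """Plain R^2 of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def vifs_as_r2(X):
    """For each column j: R^2 of regressing it on the others (plus a
    constant).  The corresponding VIF is 1 / (1 - R^2_j)."""
    n, k = X.shape
    return [r2(X[:, j], np.column_stack([np.ones(n), np.delete(X, j, axis=1)]))
            for j in range(k)]

# Independent columns: every R^2 should be small.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
# Append an exact linear combination: its R^2 should be essentially 1.
X_bad = np.column_stack([X, X.sum(axis=1)])
```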
Let’s try something entirely novel. Something that ought to be routine, but I haven’t gotten around to it much. Let’s actually look at the residuals. I’m going to use the sixth regression, maximum adjusted R^2.
Let’s even look at three kinds of residuals, the usual, the standardized, and the studentized. Ask for all three:
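All three flavors can be computed from the hat matrix. The formulas below are the standard ones; the data is made up, with one planted outlier, since I’m not reproducing the yarn numbers here.

```python
import numpy as np

def residual_triple(y, X):
    """Raw, standardized (internal), and studentized (external,
    leave-one-out) residuals for an OLS fit of y on X (constant included)."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    e = y - H @ y                           # raw residuals
    h = np.diag(H)                          # leverages
    s2 = (e @ e) / (n - p)
    standardized = e / np.sqrt(s2 * (1 - h))
    # leave-one-out variance estimate for each observation
    s2_loo = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    studentized = e / np.sqrt(s2_loo * (1 - h))
    return e, standardized, studentized

# Made-up line with one planted outlier at index 10.
rng = np.random.default_rng(4)
x = rng.normal(size=30)
y = 1 + 2 * x + 0.1 * rng.normal(size=30)
y[10] += 3.0
X = np.column_stack([np.ones(30), x])
raw, std, stu = residual_triple(y, X)
```

The studentized residual excludes observation i from its own variance estimate, which is why a genuine outlier stands out even more sharply there than in the standardized residuals.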
Here are the usual (raw) residuals.
Number 11 is pretty far out there, but so are #4 and #10.
Here are the standardized residuals. I’ll look for the biggest, but I think we should, in general, look for any outside 3.
Well, #11 is closer to -3 than to -2.
Here are the studentized, and we should look for values approximately outside 2 (t-distribution).
The 11th observation is most extreme, #10 is also outside 2. Let me remove #11 and see what happens.
Well, at least BD comes in after BC. More importantly, the fit is substantially better…
FYI, backward selection seems to generate the same set of regressions (I’ve checked the first five and the last two.)
Our selection criteria?
Looking at 9-11, we see a maximum adjusted R^2 of .75. But we also see a lot of small t-statistics.
But, as before, BD is significant.
Since #4 and #7 were selected by a couple of criteria, let’s look at them, and the two around #4:
We see that #3 is the last regression with all t-statistics significant, and it includes BD.
What happens if I now remove observation 10?
The highest adjusted R^2 rises to almost 84%.
I’m not going to do anything with these last regressions – except that, having added all those quadratic terms, I want to check multicollinearity again. As before, I do not expect it, because my t-statistics did not fall precipitously when I added something.
The VIFs for the regression with all variables, again converted to R^2… the largest is under .22.
The singular values and condition number for the design matrix of the regression with all variables…
Now, the summary for this post.
- I do not suspect multicollinearity because the t-statistics are small – I suspect it when t-statistics that used to be big have become small.
- I do not advocate dropping observations simply because the fit gets better – that’s bad science at best.
- I do, however, advocate trying variables that “should not” be relevant.
Let me elaborate.
One of the benefits of forward selection is that I get to see what happens as variables are added. I get to see if hitherto significant t-statistics fall sharply when some variable is introduced. And although I focus on t-statistics becoming insignificant, their falling from 50 to 5 can be noteworthy. In this case, we saw lots of insignificant t-statistics when we ran lots of variables – but the first few variables retained their significance.
I like knowing that removing observation #11 raises the adjusted R^2 from .46 to .75 – but that doesn’t mean I’m going to throw it out for the final report. It does mean that I would investigate #11 – might something have gone wrong with the data collection? It might mean that I want more data. But “the fit is closer without it” doesn’t mean “the fit is truer without it”.
As for the cross-terms that were expected not to matter… I think Aristotle and Galileo should move in together. Expecting a heavier body to fall faster than a lighter one is all well and good – until you do the experiment. Theory may say that some variable should not matter – but I want the data to speak for itself. I am free to reject what it says, but I want to know when a theoretically irrelevant variable is significant and does in fact lead to a closer fit.
In particular, although I did not show it to you… if I include OBS, the observation number, in with the quadratic variables, it has a significant t-statistic. So I would want to know why the dependent variable appears to depend on the observation number. (Is this a time series?)
Of course, not being the experimenter, I don’t get to investigate these things. More generally, not being an expert in all the subjects for which I want to run regressions, I have to let the data speak for itself… I wouldn’t know how to exclude variables on theoretical grounds – even if I wanted to!
But I don’t personally find that a bad thing.