Regression 1 – All possible regressions of the Hald data

Introduction

Well, I’ve been battering you with mostly theory. I’ve shown you all of the properties that Mathematica® can provide for a regression… and I’ve shown you how almost all of them are computed.

Not content with that large collection of properties, I then took Mathematica’s four selection criteria and added 11 more criteria.

And through it all, I’ve only shown you two regressions.

Let me go to the other extreme: our example data set, the Hald data, only has four variables… so there are only 16 possible distinct regression equations (all of which contain a constant term — more about that later, probably in another post).

Let’s run them all, all sixteen.

Hang on one moment. This post has 7 sections:

  1. Introduction
  2. Run all sixteen possible regressions and see which are “best” according to each criterion.
  3. See which regressions are in the top six according to each criterion.
  4. Study the rankings for each fixed number of variables.
  5. See which regressions have significant t-statistics.
  6. Provide the supporting material in an appendix.
  7. Summary

Note that the summary is at the end of the post, after the appendix.

OK, let’s go. We start with the Hald data; the first four columns are X1 through X4, and the last column is the dependent variable.

\left(\begin{array}{ccccc} 7 & 26 & 6 & 60 & 78.5 \\ 1 & 29 & 15 & 52 & 74.3 \\ 11 & 56 & 8 & 20 & 104.3 \\ 11 & 31 & 8 & 47 & 87.6 \\ 7 & 52 & 6 & 33 & 95.9 \\ 11 & 55 & 9 & 22 & 109.2 \\ 3 & 71 & 17 & 6 & 102.7 \\ 1 & 31 & 22 & 44 & 72.5 \\ 2 & 54 & 18 & 22 & 93.1 \\ 21 & 47 & 4 & 26 & 115.9 \\ 1 & 40 & 23 & 34 & 83.8 \\ 11 & 66 & 9 & 12 & 113.3 \\ 10 & 68 & 8 & 12 & 109.4\end{array}\right)

We started looking at it back in this post. We used one particular regression to see how various properties were computed; it used X1, X2, and X4 (and a constant), but did not use X3:

reg["BestFit"] = 71.6483+1.45194 X1+0.41611 X2-0.23654 X4

Is that the best one? Is that why I picked it? (Yes, sort of. Perhaps I should remark that the property “BestFit” gives me the equation; it does not assert that this regression was better than any other.)

Selecting The “Best”

I should point out that finding the best possible regression does not, in general, mean we have found a good one. In this case, I think we did find a good one (two, even), but I almost certainly will not prove that our best regressions are flawless.

The Mathematica command to run all possible regressions using those 4 variables is rather straightforward, once you realize that the index i can range over the subsets of the names n1. The semicolon at the end suppresses the printing of output but not the computations.
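I can’t reproduce the original notebook cell here, but here is a minimal sketch of the idea, assuming the data sit in a 13-row matrix hald whose columns are X1, X2, X3, X4 and the response (hald is my stand-in name; n1 and all are the names used elsewhere in this post; LinearModelFit and Subsets are built in):

n1 = {X1, X2, X3, X4};

all = Table[
    LinearModelFit[hald, i, n1],   (* columns of hald: X1, X2, X3, X4, response *)
    {i, Subsets[n1]}
    ];   (* i runs over all 16 subsets; the empty subset should give the constant-only regression #1 *)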

Fine.

Now I call a routine of my own that computes all of the selection criteria presented in the previous post, and finds which regression in that set of sixteen has the optimum value (maximum for Adjusted R Squared, minimum for everything else) for each criterion.
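I won’t reproduce that routine, but a toy version of the bookkeeping, restricted to a few criteria that LinearModelFit provides directly, might look like the following. (I am assuming SIGMASQ corresponds to the built-in “EstimatedVariance”; the remaining criteria would be computed from ESS, n, and k as in the previous post.)

crit = Table[
    {r["AdjustedRSquared"], r["EstimatedVariance"], r["AIC"], r["BIC"]},
    {r, all}];

best = Flatten[{
    Ordering[crit[[All, 1]], -1],   (* position of the maximum Adjusted R Squared *)
    Ordering[crit[[All, 2]], 1],    (* positions of the minima for the rest *)
    Ordering[crit[[All, 3]], 1],
    Ordering[crit[[All, 4]], 1]}]

Ordering[list, -1] returns the position of the largest entry; Ordering[list, 1] returns the position of the smallest.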

That table says that out of the sixteen regressions, regression #13, whatever it is, has the highest Adjusted R Squared… and the lowest SIGMASQ, AIC, FPE, HQ, SHIBATA, and PRESS. It also says that regression #6 has the lowest values of everything else. That is, SIC (BIC) chose #6, as did every modification of AIC, FPE, and HQ… as did GCV, RICE, and Mallows’ Cp.

We didn’t get one best answer, but we only got two.

I don’t know about you, but I like that. I like it a lot.

Although the criteria do not agree on one “best” regression, they give us a choice of only two. After all, if they always agreed on one regression, we’d only need one criterion; but if they gave too many different answers, we wouldn’t have anything close to “an” answer.

OK, exactly what are those two regressions? Here are the equations for #6 and #13:

all[[6]]["BestFit"] = 52.5773+1.46831 X1+0.66225 X2

all[[13]]["BestFit"] = 71.6483+1.45194 X1+0.41611 X2-0.23654 X4

The two regressions differ by the presence or absence of X4. This is nice. Very, very nice.

And, yes, #13 is the one I used to illustrate calculations in previous posts. I knew perfectly well when I picked it that among these sixteen it had maximum Adjusted R Squared.

Let’s look at the two parameter tables:
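(In the notation of my sketch above, the tables come straight from the fitted models, something along the lines of:)

all[[6]]["ParameterTable"]
all[[13]]["ParameterTable"]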

The smaller regression, #6, has all three t-statistics significant… when we add X4 to #6, getting #13, the coefficient on X4 has a t-statistic of -1.36… which means Adjusted R Squared goes up (because the added t-statistic is greater than 1 in absolute value)… but the coefficient may not be significantly different from 0 (because the t-statistic is less than about 2 in absolute value). In addition, the t-statistic on X2 falls from 14 to 2 when we add X4. Incidentally, the t-statistic for X4 is equivalent to an F-test for deciding whether we should add X4 to the {X1, X2} regression.

Which one is better?

Take your pick.

Aww, what kind of answer is that?

If you require significant t-statistics, then you go with the two-variable regression #6. If you think a higher Adjusted R Squared is worthwhile, then go with #13. And always remember that for some purposes, the 4-variable regression #16 — which has maximum R Squared, i.e. minimum ESS, and is the touchstone for Mallows’ Cp — might be “best”.

In addition, I expect to show, in a subsequent post, that we might be safer in choosing #13 over #6. (If I recall correctly… if #13 is true, but we choose #6, then we are worse off than if #6 is true but we choose #13…. I’ve checked: I do recall correctly.)

I would also comment that if PRESS measures the sensitivity of the equation to the observations, then #13 is, in that sense, less sensitive than #6, despite having a shaky t-statistic.

In a sense, we are done. But there are a few things we might think about or look at.

I don’t particularly like ignoring all of the other regressions. I like that we only got two “best” candidates, but I don’t want to stop there; I’d still like to know more about the other regressions. In this case I can look at the other fourteen, but even 64 total regressions would be overwhelming to look at. To put that another way, why run them all if we’re not going to look at them all?

In addition, we can learn something about the criteria by looking at additional regressions. Furthermore, having looked at the others, I know that these criteria do not even agree on which are the top six regressions.

So let’s take a look at these 16. Let me construct a table with all of the criteria for the top six regressions (except for Cp, which I will do separately). (In fact, I created a table for all of the regressions, then kept the best six for each criterion. See the appendix.)

Here is the first version of the summary table. It has the criteria in the usual order:

The first line, for example, means that #13 has the highest Adjusted R Squared, #12 the second-highest, down to #8 having the sixth highest.

It turns out that most but not all of the criteria agree on the top four: 6, 12, 13, 14. But note that #6 ranks only 4th on Adjusted R Squared and SHIBATA. In fact, only seven regressions will appear in the top six: 6, 8, 12, 13, 14, 15, 16. We will see that the sets of top six disagree only on the inclusion of #15 or #16; the other five regressions are always among the top six.

Let me say that again: every one of these selection criteria ranks the five regressions 6, 8, 12, 13, 14 in the top six. Then they choose either #15 or #16 for the sixth element — not necessarily sixth in rank, but as the sixth entry in the set.

I will not be doing anything in particular with these rankings — not now anyway. One could conceivably apply voting techniques to these results, but I would be doing it for fun, not for the purpose of refining the rankings.

Let me now change the order in which those appear.

Now we can see that the seven criteria which rank #13 first fall into two common sets, and both sets choose #16 (rather than #15):
Adjusted R Squared, SIGMASQ, and SHIBATA completely agree (on the top six);
AIC, FPE, HQ, and PRESS completely agree among themselves;

The eight criteria which rank #6 first are much more diverse. Three of them choose #16:
GCV and FPEu completely agree;
SIC disagrees with them about the order of #8 and #16;

The remaining five choose #15 instead of #16:
AICu and HQc agree with each other;
RICE, Cp, and AICc each have a different order from AICu and HQc.

Bear in mind that #16 is the 4-variable regression… 12-15 are the 3-variable regressions… #1 is the constant only (i.e. 0-variables), 2-5 are the 1-variable regressions, and 6-11 are the 2-variable regressions.
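(If, as in my sketch above, the sixteen regressions were generated by running over Subsets[n1], this numbering is just Mathematica’s canonical subset order; a quick way to display the correspondence:)

TableForm[Transpose[{Range[16], Subsets[n1]}]]
(* 1 -> {} (constant only); 2-5 -> the 1-variable sets; 6 -> {X1, X2}; 8 -> {X1, X4};
   12-15 -> the 3-variable sets, with 13 -> {X1, X2, X4} and 14 -> {X1, X3, X4}; 16 -> all four *)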

Thus, the seven distinct regressions which are ranked somewhere in the top six are the 4-variable #16, all four 3-variable regressions (#12-15), and two of the 2-variable ones (#6 and 8).

So, any criterion (like AICu) which rejects the 4-variable #16 in favor of the 3-variable #15 has more heavily penalized the addition of a fourth variable.

Similarly, any criterion which chooses the 2-variable #6 over the 3-variable #13 has more heavily penalized the addition of a third variable. And the criteria (AICu and HQc) which also choose the 2-variable #8 over the 3-variable #13 are most heavily penalizing the addition of a third variable.

The rankings for fixed k

Note that, with one exception, the 3-variable regressions are always ranked 13, 12, 14, 15. There’s a reason for that. With two exceptions, these criteria are functions of ESS, n, and k (respectively, the sum of squared errors, the number of observations, and the number of coefficients in the equation). For the entire set of 16 regressions, the number of observations n is constant. But restricted to the set of 3-variable regressions, k is a constant; of course, it’s also a constant if we restrict to the set of 2-variable regressions, or the set of 1-variable regressions.

But with n and k constant, all but two of our criteria are functions only of ESS… and that means that for fixed k, they must always rank each set of the k-variable regressions in the same order.
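For example, take AIC in one common form (if the earlier posts used a slightly different scaling, the argument is unchanged, since the forms differ only by constants and monotone transformations):

\text{AIC} = n \ln\!\left(\frac{ESS}{n}\right) + 2k

With n and k both fixed, this is an increasing function of ESS alone, so ranking the k-variable regressions by AIC is the same as ranking them by ESS; the same reasoning applies to any criterion that depends on the data only through ESS.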

The two exceptions are Cp and PRESS… PRESS because it uses the deletion residuals instead of the residuals; Cp because it really uses |Cp – k| — and I will show you how changing the sign of the ESS term switches things around. (We clearly see that Cp has a different order for the 3-variable regressions; I do not know whether PRESS can actually produce a different order, but I can’t rule it out.)

We’ll see more about this when we look at “forward selection” and “stepwise” regression, but the key point is this: every one of these criteria says that X4 is the best 1-variable regression, {X1, X2} is the best possible 2-variable regression, and with the exception of Cp, they all agree that {X1, X2, X4} is the best 3-variable regression.

This means that thirteen of these criteria are equivalent for forward selection and stepwise regression. More in a subsequent post.

I also have to say now that one reason for studying the Hald data is this little oddity: the best 2-variable regression does not contain X4, which is the best 1-variable regression.

Where the criteria differ is on the relative rankings of different-sized regressions; e.g. whether {X1, X2}, regression #6, is better than {X1, X2, X4}, regression #13.

Let’s look at the complete ranking by Adjusted R Squared… but I know that everything except PRESS and Cp must agree with it about X4, {X1, X2}, and {X1, X2, X4}.

Just run your finger down the list of variables… {X1, X2, X4} is the highest ranked 3-variable regression, {X1, X2} the highest ranked 2-variable regression, and X4 the highest ranked 1-variable regression.

What about PRESS?

It agrees: {X1, X2, X4}, then {X1, X2}, then X4.

What about Cp?

Start from the bottom this time. Like everyone else, Cp thinks X4 is the best 1-variable regression, and that {X1, X2}, #6, is the best 2-variable regression, but it disagrees with everyone else about the best 3-variable regression, choosing {X1, X3, X4}, #14.

Let’s take a look at the Cp calculations. Here’s the end of a table which you will find in the appendix:

Reading column 2 – even though it’s not sorted – we see that Cp itself would rank the 3-variable regressions as 13, 12, 14, 15 — just like everyone else. And Cp – 4 would do the same thing. But when we compute the absolute value |Cp – 4|, column 3, we are actually using 4 – Cp for three of the regressions; with the ESS term now entering with a minus sign, the order of those three is reversed, giving us 14, 12, 13.
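To spell out the sign flip, here is the usual definition of Mallows’ Cp, with \hat{\sigma}^2 taken from the full regression #16 and k_p the number of coefficients in the candidate regression:

C_p = \frac{ESS_p}{\hat{\sigma}^2} - (n - 2k_p)

For fixed k_p this is an increasing function of ESS_p, so Cp itself orders the 3-variable regressions by ESS: 13, 12, 14, 15. But three of those regressions have Cp below 4, so |Cp – 4| equals 4 – Cp for them, and smaller ESS now means a larger value of the criterion; that is exactly the reversal in column 3.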

Looking at the t-statistics

Let me now look for regressions not all of whose t statistics are significant. First, what is the minimum absolute value of all the t statistics in each regression?
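Here is a sketch of that computation, again using my assumed list all and the built-in “ParameterTableEntries” property, whose rows are {estimate, standard error, t-statistic, p-value}:

minT = Table[
    {i, Min[Abs[Rest[all[[i]]["ParameterTableEntries"]][[All, 3]]]]},
    {i, 2, 16}];   (* Rest drops the constant term; #1 is constant-only, so start at #2 *)
TableForm[minT]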

We could also see what t statistics for the constant terms are significant. I did that, and it turns out that only #10 and #16 have insignificant constant terms.

I won’t show you that. Let’s stay with the table before us.

#15 has all of its t-statistics significant, while the other three 3-variable regressions each have either an insignificant one or a marginally significant one. (You may recall that for a regression with 13 observations and 4 coefficients, hence 9 degrees of freedom, a t-statistic is significant at the 5% level when its absolute value exceeds 2.262. The commonly cited 1.96 applies only in the limit of infinitely many degrees of freedom.)

Here is the parameter table for regression #15:

As expected, all of its t-statistics are significant. That is amazing and stunning and utterly unexpected.

Why?

Because according to all fifteen criteria, #15 ranks dead last among the four 3-variable regressions… despite having the best set of t-statistics among the four.

Nevertheless, the 2-variable regressions #6 and #8 unanimously outrank #15.

We have found an interesting regression — but we have not found a better candidate for the best one.

Let me take the sorted table of rankings….

… and manually remove 12, 13, 14, and 16:

Then we see that all criteria would choose #6 first; and all of them would choose #8 over #15. (That is so much easier than trying to re-rank everything.)

Appendix

The following tables… ta, tb, tc… are supporting material. They exhibit all of my computed selection criteria, except for Cp. They were cut up for display purposes.

Note the headings:

Let me show you only the sort on Adjusted R Squared… and take the top six regressions. The last command saves the ranking for me, and was used to construct the summary table.

Adjusted R Squared is the 2nd column… so I sort everything in table ta on column 2… I want the maximum first, hence the minus sign… I drop the 10 lowest regressions, and columns 3-6:
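In my assumed notation (ta with the regression number in column 1, Adjusted R Squared in column 2, and four more criteria in columns 3–6), the sort might look like:

(* largest Adjusted R Squared first, hence the minus sign; keep the top six;
   keep only the regression number and Adjusted R Squared *)
rankAdjRsq = Drop[SortBy[ta, -#[[2]] &], -10][[All, {1, 2}]]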

I do Cp separately because it requires two regressions, one being the touchstone against which the others are measured.

For reference, here is a table of Cp and |Cp – k|. With all four variables as well as a constant, #16 is the largest regression, and therefore it is my choice of touchstone. (Oh, I referenced it as -1 from the end — which means I don’t have to change that line when I run on a different number of regressions.) We see that for #16, as I said when we discussed Mallows’ Cp, we have Cp = 5 and Cp – k = 0, and so we learn nothing about #16 from Cp.

Since I need to exclude that minimum value of 0 from the sort, let me rebuild the table without #16.

You saw part of that table earlier when I looked at the 3-variable regressions.

Then do a sort like the other ones. Sort the table tcp on column 3, drop the last 9 regressions (the 10th is already gone), and drop the second column.
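Again in my assumed notation (tcp with the regression number in column 1, Cp in column 2, and |Cp – k| in column 3):

(* best |Cp - k| first; keep six of the remaining fifteen; drop the Cp column *)
rankCp = Drop[SortBy[tcp, #[[3]] &], -9][[All, {1, 3}]]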

Summary

I ran the 16 = 2^4 regressions possible with a constant term and every possible subset of the 4 variables X1, X2, X3, X4:

I asked my 15 criteria which of the 16 regressions they thought was best. Every one of them chose either regression #6 (X1 and X2) or regression #13 (X1, X2, and X4):

Then I looked at the top 6 regressions according to each of the criteria. We saw that the criteria differ in the relative rankings of different-sized regressions… where are 6 and 8 and 16 ranked compared to 13, 12, 14, 15?

But we also saw that all the criteria agreed that X4 alone was the best 1-variable regression, {X1, X2} was the best 2-variable regression; Cp was the only criterion that chose {X1, X3, X4} over {X1, X2, X4}.

While we’re here, let me point out that {X1, X4} is unanimously the second-ranked 2-variable regression. This is significant because it contains X4, the best single variable.

I displayed the rankings for Adjusted R Squared, and argued that all but PRESS and Cp must match it:

We saw that PRESS did in fact match it:

And we saw that Cp (strictly speaking, the absolute difference |Cp – k| between Cp and k) altered the relative rankings of three of the four 3-variable regressions.

Note that I only need 3 tables to summarize the relative rankings for fixed k.

Finally, we looked at the minimum t-statistics in each regression, and discovered that the lowest-ranked of the 3-variable regressions, #15, was the only one all of whose t-statistics were greater than 2.262 in absolute value.

That is, none of our criteria would have chosen #15, although it had a nice property which some higher-ranked regressions did not have. Nevertheless, two regressions, #6 and #8, were unanimously ranked above #15 — so our criteria are telling us that even if we care about the t-statistics, we can do “better” than #15.

This was a lot of work for just 16 regressions. It’s instructive for what we learned about the criteria, and I personally will continue running all possible regressions on textbook problems.

But I won’t look at anywhere near this many tables. And I will use another approach, too: stepwise regression.
