## Introduction

Aug 12. Edit: I’ve added a few remarks in one place. As usual, search on “edit”.

I want to look at the t-statistics for two regressions in particular. I will refresh our memories very soon, but what we had was two regressions that we could not particularly decide between. Let’s go back and look at them.

Let me get the Hald data. I set the file path…

I set the usual uninformative names – I wouldn’t dare change them after all the time I’ve spent getting used to them!… and I might as well display the data matrix…

We have 4 independent variables, and the last column is the dependent variable. One of the things we did was to get all possible regressions, and to run my selection criteria on them:

So, my selection criteria are about equally divided on two regressions. (We had also found that backward selection found the best 2-variable case, while forward selection did not. That’s why I just ran all possible regressions.)

Here are the two best, based on my assorted criteria. Oh, as usual in these circumstances, I am ignoring the constant term, and counting the number of variables other than the constant… hence “2-variable” and “3-variable” rather than “3-variable” and “4-variable”.

So they differ in X4 only. I do believe the P value is 2-sided. I note that the t-statistic for X1 is slightly, very slightly, higher in the 3-var case than in the 2-var case.

In summary, the 3-variable regression has a higher Adjusted R^2, but its third variable, X4, has a low t-statistic.

## The fundamental theorem

Well, what does our fundamental theorem say?

As I said in the last post, to get the usual supporting information, we would assume that the true model is

y = X B + u,

where

1. The matrix X consists of fixed numbers,
2. and is of full rank.
3. The u are drawn from identical and independent probability distributions,
4. with a mean of 0 (E(u) = 0) and each with variance $\sigma^2$,
5. and the u are drawn from a normal distribution.

I deliberately numbered the assumptions that are given prominence. Just at the beginning, I wanted to down-play the zeroth assumption:

• the true model is y = X B + u.

It gets short shrift in the usual treatments, if it gets mentioned at all!

That assumption does not mean that the true model is “of the form” X B + u; it means that the regression design matrix X is the truth, the whole truth, and nothing but the truth: it contains exactly all the true variables… it isn’t missing any… and it doesn’t have any extra variables in it.

So… we have two regressions and we want to apply our theorem. Unfortunately, our two regressions cannot both be the true model… so the theorem cannot apply to both of them.

Oops. That sucks.

Worse, it need not apply to either of them, because it is possible that neither is the true model. (There were additional variables, which I never had access to.)

The usual approach, of course, is just go ahead and apply the theorem to both of them! And that’s where I begin to decide that the whole business of t-statistics and F-tests should be considered heuristic suggestions, rather than precise determinations.

Ok, let’s go with this. For the moment, let’s forget about the 2-variable regression, and pretend that the 3-variable regression is the true model.

If all those assumptions hold, then the t-statistic on X4 in the 3-variable regression says that if $\beta_4$ were zero, there was a 20% chance (the P-value) of computing an estimate at least as large in absolute value as .23654. (I’m pretty sure it’s computing the P-value for a two-sided test.)

Should we conclude that X4 is not part of the true model?

But how can we do that?

In order to conclude that X4 is not part of the true model, we have to start by assuming it is part of the true model. And once we conclude it is not part of the true model, then the theorem couldn’t have been applied.

(This seems different from the proof, for example, that the square root of 2 is irrational, which begins by assuming it is rational. In this case, if the theorem does not apply, then the numbers don’t mean the same thing as if it does apply. Maybe I’m wrong on this, but among other things, we’re going to learn something about the t-statistics when we have an extraneous variable.)

Edit:
I’m being sloppy in my thinking. We do not have a theorem which says that a variable cannot be part of the true model if it has an insignificant t statistic. The fundamental theorem says that tests of hypotheses are valid – but it doesn’t say that all true variables must have significant t statistics; we have no theorem that says X4 is not part of the true model. And if we did have such a theorem, it would have to specify the level of significance. 5% is convenient and customary – but what’s so special about it?
End edit

We cannot use the theorem as it stands to justify removing X4 from the regression. At this point, it seems to me that all we can conclude is that $\beta_4$ is small, if it is part of the true model.

This whole issue is usually called specification error… but I rarely see it accompanied by the explicit statement that the theorem does not apply.

What can we do?

First of all, I still use t-statistics as suggestions – in fact, as strong suggestions (I use t-statistics to run backward selection… but we have seen a case where backward selection dropped the true variable first)… but I’m not likely to really worry about the fact that the Hald data has 9 degrees of freedom, so the critical t-statistic is 2.262 (if I recall correctly) for rejecting the null hypothesis at 5% (for deciding that there is only a 5% chance that the computed nonzero $\beta_i$ corresponds to a true coefficient of 0). If it’s all approximate, then let’s just use approximations.
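Neither number needs a table lookup, by the way. Here is a standard-library sketch that recovers both the 2.262 critical value and the roughly 20% P-value; the one assumption (not stated in the text) is that the magnitude of X4’s t-statistic is about 1.365, which is consistent with the quoted P-value.

```python
import math

def t_pdf(x, df):
    # Student's t probability density.
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    # CDF via symmetry plus composite Simpson's rule on [0, |x|].
    h = abs(x) / steps
    s = t_pdf(0.0, df) + t_pdf(abs(x), df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = s * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half

def t_critical(alpha, df):
    # Two-sided critical value: the t whose CDF equals 1 - alpha/2, by bisection.
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.05, 9), 3))        # the 2.262 quoted above
print(round(2 * (1 - t_cdf(1.365, 9)), 3))  # two-sided P-value, about 0.20
```

In practice one would just call a statistics library, but spelling out the density, CDF, and bisection makes it clear there is nothing magic about 2.262.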

Second, this is one reason I use selection criteria: they balance the complexity of the equation against the goodness of fit.

(Incidentally, in my own work I am usually interested in a plausible-looking good fit over a restricted range of data, and my primary goal is the tightest possible fit that has a plausible shape. I don’t care at all about the t-statistics. I use t-statistics and selection criteria when I’m trying to find relationships in economic or sociological data.)

Third, we can actually investigate the cases where we have extra variables, or not enough variables. I suppose we could even investigate the case where we have some extra and are also missing some… and in fact I’ll do that, too.

So… Let’s suppose that the true model is

y = X B + u

but that we have fitted the equation

$y = \chi B + u$.

From the fit, we have

$\beta = (\chi'\chi)^{-1}\chi'y$

but we know the true model, so we substitute for y:

$\beta = (\chi'\chi)^{-1}\chi'(XB + u)$

$= (\chi'\chi)^{-1}\chi'XB + (\chi'\chi)^{-1}\chi'u$.

If we now take the expected value, we get

$E(\beta) = (\chi'\chi)^{-1}\chi'XB + (\chi'\chi)^{-1}\chi'E(u)$

$= (\chi'\chi)^{-1}\chi'XB$, because E(u) = 0… and we may name that matrix

$=: PB$

We need to look at the matrix

$P = (\chi'\chi)^{-1}\chi'X$.

I’ll give you a hint: consider

$\beta = (\chi'\chi)^{-1}\chi'y$.

But I’ll show you what it is in two special cases; then I’ll give you the symbolic answers.

## Assuming we omit a true variable

First, let’s suppose that the true model is the 3-variable case, but that we fitted the 2-variable case. (We have omitted a true variable.) Then X is the design matrix

and $\chi$ is the design matrix

Let’s compute $P = (\chi'\chi)^{-1}\chi'X$:

Isn’t that interesting? An identity matrix augmented by 1 column.

Oh, I can’t compute PB because I don’t know B, but I could compute $P\beta$ for the putative true model:

The point is not that we matched the 2-variable fit – we should have (really? I’m no longer quite so confident) – but that these numbers do not match the 3-variable estimates. More generally, suppose our true parameters are B = {b0, b1, b2, b3}, with b3 the coefficient of X4; then $E(\beta) = PB$…

and we see that every estimate is biased by some multiple of b3, the true coefficient of the omitted variable.
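Symbolically – a sketch, labeling the true coefficients of {1, X1, X2, X4} as b0, b1, b2, b3, and writing the entries of P’s augmenting column as p0, p1, p2 – the bias is:

$E(\beta) = \begin{pmatrix}1 & 0 & 0 & p_0\\ 0 & 1 & 0 & p_1\\ 0 & 0 & 1 & p_2\end{pmatrix}\begin{pmatrix}b_0\\ b_1\\ b_2\\ b_3\end{pmatrix} = \begin{pmatrix}b_0 + p_0\,b_3\\ b_1 + p_1\,b_3\\ b_2 + p_2\,b_3\end{pmatrix}$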

## Assuming we have an extraneous variable

What if we did the other relevant case? This time, assume that the true model is the 2-variable case – we have an extraneous variable – so X is the design matrix

and $\chi$ is the design matrix

(Perhaps it would have been cleaner to assume here that the 3-variable case was still the true model, but that we fitted the 4-variable case. Nevertheless, since the two cases of most interest to me are the 2-variable and 3-variable case, I want specifically to see what happens if I choose the wrong one out of this pair.)

We compute P:

Equally interesting: an identity matrix again, but augmented by a row of zeroes on the bottom.

And P B?

This time PB matches part of B. More generally,

And we see that our estimates are unbiased.
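Both expectations are easy to check by simulation. A minimal numpy sketch with made-up data (not the Hald set): the true model uses {1, x1, x2}; fitting with an extraneous x4 leaves the estimates unbiased, while omitting the true x2 (deliberately correlated with x1, so the bias shows) shifts them by exactly what E(beta) = P B predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 2000

# Made-up columns: x2 is correlated with x1 so omitting it produces visible bias;
# x4 is pure noise, i.e. an extraneous variable.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
x4 = rng.normal(size=n)
ones = np.ones(n)

X_true  = np.column_stack([ones, x1, x2])      # the true design matrix
X_extra = np.column_stack([ones, x1, x2, x4])  # includes the extraneous x4
X_omit  = np.column_stack([ones, x1])          # omits the true x2
B = np.array([1.0, 2.0, -1.0])                 # true coefficients, chosen arbitrarily

sum_extra, sum_omit = np.zeros(4), np.zeros(2)
for _ in range(trials):
    y = X_true @ B + rng.normal(size=n)
    sum_extra += np.linalg.lstsq(X_extra, y, rcond=None)[0]
    sum_omit  += np.linalg.lstsq(X_omit, y, rcond=None)[0]

mean_extra = sum_extra / trials  # close to [1, 2, -1, 0]: unbiased, x4 coeff near 0
mean_omit  = sum_omit / trials   # biased away from [1, 2]

# The bias matches E(beta) = P B, with P = (chi'chi)^{-1} chi' X:
P = np.linalg.solve(X_omit.T @ X_omit, X_omit.T @ X_true)
print(mean_extra)
print(mean_omit, P @ B)
```

The averaged estimates from the extraneous fit sit on the true coefficients (with essentially zero for x4), while the averaged estimates from the omitted fit sit on P B rather than on B.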

## Understanding the special forms of the P matrix

So what’s going on?

The key is to understand the definition of P. We have

$P = (\chi'\chi)^{-1}\chi'X$,

and we should compare it to

$\beta = (\chi'\chi)^{-1}\chi'y$.

For the latter we have a vector y, for the former we have a matrix X. Ah ha!

Each column of P holds the coefficients of the corresponding column of X fitted as a function of $\chi$. In the first case, then, when the true model has X with columns 1, X1, X2, X4, we are regressing each of them against $\chi$ with columns 1, X1, X2. (“1” being the constant term, a column of 1s.)

Well, what’s 1 as a function of {1, X1, X2}? The coefficient of the constant term is 1, and the other two coefficients are 0… so the first column of P is 3 entries, {1, 0, 0}.

What’s X1 as a function of {1, X1, X2}? The constant term and the X2 term have coefficients of zero, while the X1 term has a coefficient of 1… so the second column of P has entries {0,1,0}.

Similarly, the third column of P is the coefficients of X2 as a function of {1, X1, X2}, hence {0,0,1}. And that completes the identity matrix portion of P.

The fourth column? Well, that fits X4 as a function of {1, X1, X2}. Things would be just dandy if X4 were one of the independent variables, but it isn’t. Let’s get the fit of X4 as a function of {1, X1, X2}; I expect to see that the betas are the last column of P:

That is, the computed $\beta$’s are the elements of the fourth column of P: the fourth column of P describes X4 as a function of {1, X1, X2}. Remember that in this case X is a superset of $\chi$. In general, in this case, we would end up with

$P = (I\ \Pi)$,

where the size k of the k×k identity I is equal to the number of columns of $\chi$ (3), and the number of columns r of $\Pi$ is the difference between the number of columns of X and the number of columns of $\chi$. In this example, k = 3 and r = 4 – 3 = 1.

What about the other case? Now we have X = {1, X1, X2} as the true model, and $\chi$ = {1, X1, X2, X4}. We fit 1 as a function of {1, X1, X2, X4} and get coefficients {1, 0, 0, 0}, which become the first column of P… we fit X1 as a function of {1, X1, X2, X4} and get {0, 1, 0, 0}, the second column of P… we fit X2 as a function of {1, X1, X2, X4} and get {0, 0, 1, 0}, the third and last column of P. So we got an identity matrix, with a row of zeroes on the bottom, because the true variables are being fitted as functions of a set that includes X4, the extraneous variable.

In general, for X a subset of $\chi$ – i.e. for extraneous variables – we will get

$\begin{pmatrix}I\\ 0\end{pmatrix}$

This time the size of the identity matrix is determined by the number of columns (3) of X, and the size of the zero matrix is determined by the number of columns of $\chi$ minus the number of columns of X (4 – 3 = 1).

So that’s why P has those two forms in these two special cases.
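All of this is easy to verify numerically. A sketch with hypothetical random columns standing in for the Hald variables (the shapes of the P matrices, not the particular numbers, are the point):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 13  # same number of rows as the Hald data, but made-up columns
ones = np.ones(n)
X1, X2, X4 = rng.normal(size=(3, n))

big   = np.column_stack([ones, X1, X2, X4])  # columns {1, X1, X2, X4}
small = np.column_stack([ones, X1, X2])      # columns {1, X1, X2}

def P_of(chi, X):
    """P = (chi'chi)^{-1} chi' X."""
    return np.linalg.solve(chi.T @ chi, chi.T @ X)

# Omitted-variable case: true design 'big', fitted design 'small'.
P1 = P_of(small, big)
print(np.round(P1, 4))  # 3x3 identity augmented by one column on the right

# That augmenting column is just X4 regressed on {1, X1, X2}:
fit = np.linalg.lstsq(small, X4, rcond=None)[0]
print(np.round(fit, 4))  # matches P1[:, 3]

# Extraneous-variable case: true design 'small', fitted design 'big'.
P2 = P_of(big, small)
print(np.round(P2, 4))  # 3x3 identity with a row of zeroes on the bottom
```

The identity blocks and the zero row come out exact (to floating point), because the shared columns regress on themselves perfectly.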

I suppose I should illustrate the case where we have omitted one true variable and included one extraneous variable. Suppose the true model is the 2-variable case {1, X1, X2}:

So the design matrix of the true model is:

and that we fit {1, X1, X4}, i.e. omitting the true X2 and adding the extraneous X4. Hmm. Which regression is that? #8:

We compute P:

Remember that the true model is assumed to be {1,X1, X2}, while the fitted model is {1, X1, X4}. As we might have expected, we have a 2×2 identity, with a row of zeroes under it, and a column augmenting it on the right. The first column is 1 as a function of {1, X1, X4}, hence fitted coefficients {1, 0, 0}… the second column is X1 as a function of {1, X1, X4}, hence coefficients {0, 1, 0}… and the third column is X2 as a function of {1, X1, X4}, giving us three nonzero coefficients.

So that’s what P looks like. We append one column on the right for the omitted X2, and a partial row of zeroes under the identity matrix for the added extraneous X4.
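The same numerical check works for this mixed case (again with made-up columns rather than the actual Hald data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 13
ones = np.ones(n)
X1, X2, X4 = rng.normal(size=(3, n))  # hypothetical columns, not the Hald data

true_X = np.column_stack([ones, X1, X2])  # true model {1, X1, X2}
chi    = np.column_stack([ones, X1, X4])  # fitted model {1, X1, X4}

# P = (chi'chi)^{-1} chi' X
P = np.linalg.solve(chi.T @ chi, chi.T @ true_X)
print(np.round(P, 4))
# 2x2 identity in the upper left (for the shared {1, X1}), zeroes beneath it
# (the row for the extraneous X4), and a final, generally nonzero column:
# X2 regressed on {1, X1, X4}.
```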

## Summary

Now what was the question?

What’s the expected value of $\beta$ for the fitted (incorrect) model?

We have

$E(\beta) = PB$

(that’s where P came from!) and we know two simple forms for P: either

$P = (I\ \Pi)$

or

$\begin{pmatrix}I\\ 0\end{pmatrix}$,

the first for omitted variables, the second for additional extraneous variables. (Note that B stood for different things in the three examples I worked, but here it is fixed: B is the true coefficients, B = {b1,… bk}.)
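Spelled out, with B partitioned conformably into the coefficients $B_1$ of the fitted variables and $B_2$ of the omitted ones:

$E(\beta) = (I\ \Pi)\begin{pmatrix}B_1\\ B_2\end{pmatrix} = B_1 + \Pi B_2$ for omitted variables, and

$E(\beta) = \begin{pmatrix}I\\ 0\end{pmatrix}B = \begin{pmatrix}B\\ 0\end{pmatrix}$ for extraneous variables.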

What we saw was that in the first case, each computed $E(\beta_i)$ was biased, through the columns of $\Pi$, by the coefficients of the omitted variables. In the second case, the extraneous variables get expected coefficients of zero – and we get unbiased expected values for all the true coefficients.

Oh, when the expected values of the $\beta$ are biased, our statistical tests are trashed. Whether you phrase that as “the variances are biased, because we subtracted the wrong expected values”, or “the numerators of the standardized variables are biased, because we subtracted the wrong expected values”, the result is the same: we can’t do a valid t-test. This is pretty major.

If the expected values of the $\beta$ are unbiased, then our numerators for the standardization are fine… but I understand that the variances are over-estimated, so our computed t-statistics are lower than they should be… so we are more likely than we should be to dismiss a coefficient as zero.

(So the low t-statistic for X4 in the 3-variable case… is really higher, if the 2-variable case is the true model. Talk about paradox!)

All I’ve looked at is the expected values of the coefficients. I’m not going to work out anything for the variances. Instead, I’m going to quote two texts. But first, let me state my position:

My selection criteria are evenly split between a 2-variable and a 3-variable case.

I know that the fundamental theorem cannot apply to both of these regressions… it need not even apply to either one of them… so if I use the t-statistics to reach a conclusion about one, I cannot use the t-statistic of the other. In a nutshell, I regard t-statistics as useful heuristics, not as quantitatively precise measures.

I am better off incorrectly choosing the 3-variable regression instead of the true 2-, than I am incorrectly choosing the 2-variable case over the 3-.

Here is Ramanathan’s summary of the results (almost none of which I’ve proved) from p. 187 and pp. 189-190 of the 3rd edition:

If we omit a true variable….

“A. If an independent variable whose true regression coefficient is nonzero is excluded from a model, the estimated values of all the other regression coefficients will be biased unless the excluded variable is uncorrelated with every included variable.

“B. Even if this condition is met, the estimated constant term is generally biased and hence forecasts will also be biased.

“C. The estimated variance of the regression coefficient of an included variable will generally be biased, and hence tests of hypotheses are invalid.”

If we include an extraneous variable….

“A. If an independent variable whose true regression coefficient is 0… is included in the model, the estimated values of all the other regression coefficients will still be unbiased and consistent.

“B. Their variance, however, will be higher than that without the irrelevant variable, and hence the coefficients will be inefficient.

“C. Because the estimated variances of the regression coefficients are unbiased, tests of hypotheses are still valid.”

Let me continue by quoting from Intriligator’s “Econometric Models, Techniques, and Applications”, Prentice Hall 1978, p. 189.

“The asymmetry between the results in the 2 cases should be noted: excluding relevant variables yields biased and inconsistent estimators. Thus, in terms of bias and consistency, it is better to include too many than to include too few explanatory variables. Such practice is not generally recommended, however, because of other problems that can arise with included irrelevant variables, namely multicollinearity, inefficiency, and reduced degrees of freedom…. In general, the best approach is to include only explanatory variables that, on theoretical grounds, directly influence the dependent variable and that are not accounted for by any other included variables.”

For the record, I disagree with Intriligator. I think the data should speak for itself.

You might look at the regression about nylon yarn, where the original investigator asserted that on theoretical grounds certain products of variables should not be significant – but they were.

I’m also not sure why reduced degrees of freedom matters – if the alternative to inefficient t-statistics is that the t-statistics are garbage. (This will get me into trouble.)

You might also remember that once upon a time, running regressions was nontrivial. Forward selection, stepwise, backward selection, and even all possible subsets are not all that expensive to run: it really is possible to let the data speak.

## Aside: things not done

Oh, let me close with a complete change of subject: What have we not done with ordinary least squares regression?

• We have not looked at transformations of the data.
• We have not looked at tests of the main assumptions about the errors u:
  • constant variance (homoscedasticity)
  • constant mean
  • normality
  • independence.
• We have not looked at single deletion statistics for detecting unusual observations.

But I think I’m done with regression for a while… except, I hope, for a bibliography post (which I should have done for color, too).