## Regression 1 – Example 6 Revisited (Housing Starts)

There was more I had intended to say about the housing starts data, i.e. example 6. Here it is.

## Review and Introduction

Recall that we had 5 independent variables

As usual, I want to have a list available which includes the name of the constant:

You might also recall that we have only 23 observations.

We had, also as usual, run both forward selection and backward selection:

We saw that we got exactly the same set of regressions from both searches, and that our selection criteria were unanimous (except, of course, for Cp) in choosing the 2nd regression, with housing starts as a function of GNP and interest rate. The adjusted R^2 is an underwhelming 0.375.

In addition, we decided that we probably had multicollinearity… initially, and most simply, by looking at the following table:

UNEMP enters the third regression with a low t-statistic, but the other t-statistics do not change. When YEAR enters in the fourth regression, however, the t-statistics for CON and GNP fall to insignificance.

Having conjectured that we have multicollinearity, I ask for the variance inflation factors – which I never print, because I prefer to see the associated R^2 instead: from the definition

VIF = 1/ (1-R^2),

I solve for R^2, since Mathematica provides VIFs upon request.

In this case we found that YEAR, POP, GNP were very, very closely related… UNEMP was closely related to the other variables… and even INTRATE wasn’t what I would call independent:

Note, however, that none of the regressions delivered any warning messages. Despite that, if we attempt to compute $(X'X)^{-1}\$ directly, where X is the full design matrix…

then we get a warning message.

## Testing the alleged inverse

One of the things I did not point out in the previous post is that, despite the warning, the inverse isn’t very bad at all:

In other words, it really is a warning as opposed to an error message.

While we’re looking at this, I should also point out that Mathematica® needs to compute $(X'X)^{-1}\$… because the inverse is proportional to the covariance matrix of the $\beta\$ coefficients – and the proportionality constant is the estimated variance. That is,

Covariance matrix of $\beta\$ = $\sigma^2\ (X'X)^{-1}$.

Let me illustrate that. I ask for, but do not print, the covariance matrix of the $\beta\$… then I ask for, but do not print, $(X'X)^{-1}\$… and then I asked for the ratio, covariance matrix divided by inverse:

And that constant ratio is, indeed, the estimated variance:

So. Sometimes we can get a warning message if we ask for the inverse of X’X… sometimes the inverse works and sometimes it doesn’t (as we saw for the singular matrix in this post).

Asking for the inverse also has additional difficulties: we don’t know exactly what small number triggers it – but, okay; worse, we have no guarantee that the same calculation will always trigger it – that is, Wolfram could change its practice at any time.

Nevertheless, it is so easy to ask for… that I intend to continue asking for it. I will, however, also check whether or not the inverse is correct at some level – for vague levels of “correct”.

In the meantime, I think we should also look at condition numbers… in order to build up some intuition about them.

## Splitting the data

Before we do that, I want to split the data in half – just in order to see how sensitive the results are to the data. Rather than split it into the first half and the last half – which might be different as older versus newer – I select the odd-numbered observations…

and the even-numbered observations…

Now run forward selection:

Interesting. Well, I think it’s interesting.

For one thing, although the 1-variable and 5-variable regressions agree on the variables, the other 3 are different: for the odd observations, POP is the last variable added; for the even observations, it is the 2nd variable added.

For another, the closest fit is substantially better (0.398) for the even observations than for the odd (0.239).

These sets of variables are different from each other, but they match the corresponding sets chosen by forward selection.

Now, how different are the fits themselves?

Well, we have a problem: for the even observations, the closest fit is

We’re going to have to run a couple more regressions ourselves. For the even-numbered observations, tell it to use GNP and interest rate… And compare that to our odd-numbered regression:

Yes, different – but perhaps not very far.

Then, for the odd-numbered observations, tell it to use population and interest rate… and compare that to our even-numbered regression:

These differ a bit more.

Are these differences significant? I really don’t know – and I suspect that we can’t tell without looking much more deeply into this data and the residuals.

Not what I care to do at this time. I’ll just make a mental note that I have a multi-collinear data set with somewhat different fits for even and odd observations.

## Subsets of the variables

Now let’s look at subsets of the variables. First off, let’s look at the entire design matrix, X.

The first line of output is the singular values of X… the 2nd line is the singular values rounded to the nearest 0.01… and the 3rd line is the condition number of X – that is, the ratio of the largest to the smallest singular value.

So. We see a condition number of 5×10^7… and this was associated with a warning message about the inverse of X’X, but the inverse worked; and three of the variables were very, very closely related, according to the R^2 from VIFs.

Okay… now let’s look at subsets of 5 variables. Here’s the commands and the raw singular values…

Here are the rounded singular values and the condition numbers…

The only subset of 5 variables with a condition number less than 10,000… requires that we drop the constant term!

The next best condition number is 500,000… and it requires that we drop YEAR.

The other 4 condition numbers are 10^7, like the set of all 6 variables. To be specific (reading down in order), removing interest rate, or removing unemployment, or removing GNP, or removing population… reduces the condition number by a factor of 2 at most.

Let us drop to the other extreme: subsets of 2 variables. Here are the commands and the raw singular values…

And here are the rounded singular values and the condition numbers…

Isn’t this interesting? The worst condition number – the only bad one – is for a matrix which contains only the constant term and YEAR. I have to look at this. First I run with just a constant… and then I add YEAR:

A highly significant constant term has become statistically insignificant when we add YEAR. Combined with a condition number of 500,000… yes, we have multicollinearity.

Do we get a warning when we attempt the inverse directly? (Let me not overwrite my matrix X.)

Yes. It should – the inverse worked for the 6 x 6 matrix, so it ought to work for a 2 x 2 submatrix.

Now let’s look at subsets of three variables. Here are the commands and the raw singular values…

Here are the rounded singular values…

And here are the condition numbers…

There are 5 condition numbers greater than about 150,000… there are 2 between 15,000 and 20,000. The 4 highest condition numbers involve the constant term and YEAR… the 5th involves population and GNP.

We cannot quite summarize this by saying that any 2 of YEAR, POP, GNP are highly multi-collinear. This is true – but it only takes care of three of the five large condition numbers. In fact, there are only four possible subsets of three variables containing CON and YEAR – and those four subsets are the four highest condition numbers.

Not too surprising… since we already know that CON and YEAR alone are highly multi-collinear.

Let me look at the one subset among those five which does not involve YEAR: namely, population and GNP. In one image I show the regression, the attempt to invert X’X (though it’s called Y’Y), and then a test of whether the inverse works.

So. A condition number about 175,000 still leads to a warning message, with an inverse that works pretty well nevertheless.

At this point, I’m going to stop looking at subsets: I really do think we should investigate removing YEAR from the data set. It’s the largest single problem.

But I’ll do that if and when I decide to play with this data again. For now, I’m done with it.