## Regression 1 – Example 6: Housing Starts

Time for another regression using Ramanathan’s data. Here’s the description of this data from the 4th edition. See this post for information about obtaining his datasets.

```
(*
DATA4-3: Annual data on new housing units and their determinants
Source: 1987 Economic Report of the President. Because the housing
series has been discontinued, this data set could not be updated.
housing = total new housing units started, in thousands (Table B-50)
(Range 1072.1 – 2378.5)
pop = U.S. population in millions (Table B-30), Range 189.242 – 239.283
gnp = gross national product in constant 1982 dollars in billions
(Table B-2), Range 1873.3 – 3585.2
unemp = unemployment rate in % among all workers (Table B-35)
(Range 3.4 – 9.5)
intrate = new home mortgage yields, FHLBB, in % (Table B-68)
(Range 5.81 – 15.14)
*)
year housing pop gnp unemp intrate ;
1 1963 1985
```

(Those last two lines are from the data file’s header: the variable list, then what appears to be the frequency and sample range: annual data, 1963 through 1985.)

It appears that this dataset matches Table 4.10 of the 3rd edition. There are six variables. Get the data, and construct a data matrix d1 with the dependent variable (HOUSING) in the last column.

The Dimensions command confirms 6 variables, and tells me I have 23 observations.

Let me get my usual two lists of names, one for Mathematica® and one for me.

Let me run my forward and backward selections:

Yuck! An adjusted R^2 of 0.375 is the best we get. On the other hand, we got the same set of regressions from both searches.
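For readers following along without Mathematica, here is a minimal sketch of the forward-selection idea in Python (numpy assumed). It is a greedy search on adjusted R^2 with made-up data, not Ramanathan’s variables or the author’s actual code:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 for OLS of y on the columns of X (constant appended)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    k = A.shape[1]                      # parameters, constant included
    return 1.0 - (sse / (n - k)) / (sst / (n - 1))

def forward_select(X, y):
    """Greedy forward selection: repeatedly add the regressor that most
    improves adjusted R^2; stop when no addition improves it."""
    chosen, order = [], []
    remaining = list(range(X.shape[1]))
    best = -np.inf
    while remaining:
        score, j = max((adj_r2(X[:, chosen + [j]], y), j) for j in remaining)
        if score <= best:
            break
        best = score
        chosen.append(j)
        remaining.remove(j)
        order.append((j, score))
    return order
```

Backward selection is the mirror image: start with all regressors and repeatedly drop the one whose removal most improves adjusted R^2.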

In any case, let’s see what the selection criteria would choose:

They all agree on #2, except for Cp, as is so often the case. Here are #1 and #2:

Now that is rather interesting. INTRATE alone has an insignificant t-statistic, but things get better when I add GNP. GNP might be called a suppressor variable, but that’s all I’ll say. Here are #2 and #3:

But UNEMP comes in with a low t-statistic. Stop at 2… and that’s what the criteria said, choose 2.
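As an aside, the selection criteria themselves are simple formulas. Here is a hedged sketch of one of them, Mallows’ Cp, with made-up numbers in the usage: for a subset model with p parameters (constant included), Cp = SSE_p / s^2 - (n - 2p), where s^2 is the error-variance estimate from the full model, and a model with little bias should have Cp close to p:

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Mallows' Cp for a subset model with p parameters (constant included).

    sse_p   -- residual sum of squares of the subset model
    s2_full -- estimated error variance from the full model
    n       -- number of observations
    """
    return sse_p / s2_full - (n - 2 * p)
```

One then looks for the smallest model whose Cp is near p; since Cp is not on the same scale as adjusted R^2 or the information criteria, it is not surprising that it sometimes disagrees with them.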

But let’s keep looking anyway.

GNP has fallen, and so has the constant: we do have multicollinearity. We still do not have a decent fit, but we have multicollinearity anyway.

Here is the maximum-number-of-variables regression, with the variables in two different orders:

How serious is the multicollinearity?

An easy first test is the R^2 computed from the VIFs.

It certainly looks like YEAR, POP, GNP are very, very closely related. This looks like very serious multicollinearity. Ramanathan looks at correlation coefficients – so he never observes, in print anyway, that POP, for example, is fitted almost perfectly by YEAR and GNP.
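The auxiliary R^2’s here come from the standard identity VIF_j = 1/(1 - R^2_j), where R^2_j is the R^2 from regressing variable j on all the others. A minimal Python sketch (numpy assumed; toy columns rather than the housing data):

```python
import numpy as np

def r2_from_vifs(X):
    """R^2 of each column regressed on all the others, recovered as
    1 - 1/VIF_j, where VIF_j is the j-th diagonal entry of the inverse
    of the correlation matrix of the columns of X."""
    R = np.corrcoef(X, rowvar=False)
    vifs = np.diag(np.linalg.inv(R))
    return 1.0 - 1.0 / vifs

# toy columns (not the housing data): the second is nearly twice the first
t = np.arange(10.0)
wiggle = 0.01 * (-1.0) ** np.arange(10)
other = np.array([5., 1., 4., 1., 5., 9., 2., 6., 5., 3.])
X = np.column_stack([t, 2.0 * t + wiggle, other])
```

For the two nearly dependent toy columns, `r2_from_vifs` returns values close to 1, which is the pattern being described for YEAR, POP, and GNP.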

OK, let’s look at the design matrix, its singular values and its condition number:

Yikes. And that 10^7 is just the condition number for X; the condition number of X’X is roughly its square, about 10^14. That’s not literally exact linear dependence, but it’s closer than I like.
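For concreteness, the condition number in question is the ratio of the largest to the smallest singular value. A sketch in Python (numpy assumed), with made-up matrices rather than the housing design matrix:

```python
import numpy as np

def condition_number(X):
    """Ratio of largest to smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0] / s[-1]

# toy matrices (not the housing data):
good = np.array([[1., 0.], [0., 1.], [0., 0.]])         # orthonormal columns
bad  = np.array([[1., 1.0], [1., 1.001], [1., 0.999]])  # nearly dependent columns
```

The orthonormal case has condition number 1; the nearly dependent case is already in the thousands, and the housing design matrix is far worse still.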

Well then, try to invert X’X:

Nice to see a warning message. (That’s an understatement.)
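For comparison, here is a hedged numpy sketch of the same failure mode on a made-up, exactly dependent matrix. numpy raises a LinAlgError rather than printing a warning; and unlike Mathematica here, it will invert a merely near-singular matrix silently, which is one reason to check the condition number first:

```python
import numpy as np

# toy matrix (not the housing data): third column exactly twice the second
X = np.array([[1., 1., 2.],
              [1., 2., 4.],
              [1., 3., 6.]])
try:
    np.linalg.inv(X.T @ X)
    got_error = False
except np.linalg.LinAlgError:
    got_error = True  # numpy's analog of Mathematica's warning message
```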

I might as well run the last test, looking at X.v (X times the matrix of right singular vectors); rounding to the nearest 0.001 is enough to zero out the rightmost column, so we are very close to having a 1-dimensional nullspace.
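The same check can be sketched in numpy, on a made-up nearly dependent matrix: multiply X by the right singular vector belonging to the smallest singular value. The norm of the product equals that singular value, so the product is nearly the zero vector:

```python
import numpy as np

# toy design matrix (not the housing data): third column nearly 2x the second
t = np.arange(6.0)
eps = 0.001 * (-1.0) ** np.arange(6)
X = np.column_stack([np.ones(6), t, 2.0 * t + eps])

U, s, Vt = np.linalg.svd(X, full_matrices=False)
v_min = Vt[-1]        # right singular vector for the smallest singular value
residual = X @ v_min  # its norm equals s[-1], which is nearly zero here
```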

The regression with the most variables that Ramanathan ran was:

I got the same coefficients as he did.

Interesting that only INTRATE has a significant t-statistic. Recalling an earlier regression:

We see that adding POP has wiped out most of the t-stats, so we have multicollinearity in the regression lm – let’s just try to invert X’X for lm:

I’ll admit that I’m surprised. Maybe I should look more closely at this data… but I’m not going to. My guess is that although a fit of any one of YEAR, POP, GNP as a function of the other two has an R^2 of nearly 1, the t-statistics are insignificant: we have excellent fits, but with multicollinearity.

As soon as any two of YEAR, POP, GNP are included in a regression, we have severe, inverse-threatening multicollinearity.

We’ve seen less severe multicollinearity in both the Hald data and the Toyota data: the inverse of X’X could be computed.

The only thing we’ve seen that rivals the housing starts multicollinearity is the outright linear dependence when all three dummy variables were included in the Toyota data. In both of these cases, we get answers because Mathematica is inverting the singular values of X in order to compute $(X'X)^{-1}X'$ in one piece, without getting $(X'X)^{-1}$ separately.
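In numpy terms, the analogous computation solves the least-squares problem from the SVD of X directly (as `lstsq` and `pinv` do), never forming $(X'X)^{-1}$. A sketch with a made-up, exactly dependent matrix:

```python
import numpy as np

# toy matrix (not the housing data): third column is exactly twice the second,
# so X'X is singular and cannot be inverted
X = np.array([[1., 1., 2.],
              [1., 2., 4.],
              [1., 3., 6.],
              [1., 4., 8.]])
y = np.array([1., 2., 2., 4.])

# lstsq works from the SVD of X, inverting only the nonzero singular values,
# and returns the minimum-norm least-squares solution
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

Even though X'X is singular, the fitted values are well defined: they are the projection of y onto the column space of X.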

So. The best we can do – without transformations or additional variables – seems to be a 2-variable regression:

I’m just not impressed.