We have seen that we can eliminate the multicollinearity from the Hald data if we orthogonalize the design matrix – thereby guaranteeing that the new data vectors will be orthogonal to a column of 1s. That, in turn, centers the new data, so that it is uncorrelated as well as orthogonal.
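The claim above can be checked directly: once the design matrix includes a column of 1s, orthogonalizing forces every later column to have zero sum (i.e., to be centered), and zero-mean orthogonal columns are automatically uncorrelated. A minimal sketch with made-up numbers standing in for the Hald columns:

```python
import numpy as np

# Toy data standing in for the Hald predictors (hypothetical values).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(8), rng.normal(5, 2, 8), rng.normal(10, 3, 8)])

# Orthogonalize via QR (a numerically stable Gram-Schmidt).
Q, R = np.linalg.qr(X)

# Each later column of Q is orthogonal to the first (constant) column,
# which forces its sum to be zero -- i.e., the data are centered.
print(np.allclose(Q[:, 1:].sum(axis=0), 0))

# Zero-mean orthogonal columns are also uncorrelated.
print(np.allclose(np.corrcoef(Q[:, 1:], rowvar=False), np.eye(2)))
```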
Doing that to the Toyota data will seem strange… because we have to do it to the dummy variables, too! But it will eliminate the multicollinearity.
I’m not sure it’s worthwhile to eliminate it… but we can… so let’s do it.
I get the data… given as COST, AGE, MILES…
Set up the variables and their names… run backward selection on COST as a function of AGE and MILES… and get the R^2 values implied by the Variance Inflation Factors…
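For the record, a Variance Inflation Factor is computed by regressing each predictor on all the others: VIF = 1/(1 − R²) for that auxiliary regression. A sketch of that computation (the AGE/MILES data here are hypothetical, built to be strongly collinear):

```python
import numpy as np

def vifs(X):
    """VIF for each column of X: regress it on the remaining columns
    (plus a constant) and return 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return np.array(out)

# Hypothetical stand-in for AGE and MILES: miles track age closely,
# so both VIFs come out large.
rng = np.random.default_rng(1)
age = rng.uniform(1, 6, 30)
miles = 12000 * age + rng.normal(0, 2000, 30)
print(vifs(np.column_stack([age, miles])))
```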
We have multicollinearity, and it’s very high. But we know we can do better by adding dummy vars. Here’s the first one, although I won’t use it:
Here’s the dummy variable for the second region, 21-42 inclusive…
Finally the dummy variable for the third region, from 43 on.
We can – and should – check that their sum is 1 for every observation… and this is a good reason to define all three of them, even though we can’t use one of them (not with a constant term in the regression):
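The construction and the sum-to-1 check can be sketched as follows, using the region boundaries described above (observations 1–20, 21–42, and 43 on, out of 57):

```python
import numpy as np

n = 57                          # the Toyota data has 57 observations
idx = np.arange(1, n + 1)

# Region dummies by observation index, as described above.
dum1 = (idx <= 20).astype(float)
dum2 = ((idx >= 21) & (idx <= 42)).astype(float)
dum3 = (idx >= 43).astype(float)

# Sanity check: every observation falls in exactly one region.
assert np.all(dum1 + dum2 + dum3 == 1)

# With a constant term in the regression we must drop one dummy
# (here dum1): otherwise the three dummies sum to the constant
# column, which is perfect collinearity.
```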
OK, let’s call backward selection after including dum2 and dum3 among the independent variables.
We get a much better fit: Adjusted R^2 = 0.9978. Let’s compute the SSE, the sum of squares of the residuals:
Let’s confirm that we do, indeed, have significant multicollinearity:
There’s our multicollinearity, still there: between MILES and AGE, with even a little from D3.
What regression would we pick? #4.
Unanimous (except, of course and as usual, for Cp, which can never choose the last regression).
Maybe the constant isn’t significant. Note that we have reasonably high t-stats despite the multicollinearity.
Now, orthogonalize everything, including the dummy vars! Here’s the transposed design matrix X’ from data matrix d2… orthogonalize it and call that Z’… get the new matrix Z… confirm that Z’Z is the identity (which says that the columns of Z are orthonormal: each has unit length and is orthogonal to every other column)…
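A minimal sketch of that step, using QR decomposition for the orthogonalization (the AGE/MILES values are hypothetical; the dummy boundaries follow the regions described earlier):

```python
import numpy as np

# Stand-in design matrix: constant, AGE, MILES, dum2, dum3.
rng = np.random.default_rng(2)
n = 57
idx = np.arange(1, n + 1)
age = rng.uniform(1, 6, n)
miles = 12000 * age + rng.normal(0, 2000, n)   # hypothetical, collinear
dum2 = ((idx >= 21) & (idx <= 42)).astype(float)
dum3 = (idx >= 43).astype(float)
X = np.column_stack([np.ones(n), age, miles, dum2, dum3])

# Orthonormalize the columns (QR is Gram-Schmidt done stably);
# Z plays the role of the orthogonalized design matrix.
Z, R = np.linalg.qr(X)

# Z'Z is the identity: unit-length, mutually orthogonal columns.
print(np.allclose(Z.T @ Z, np.eye(5)))
```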
(I’m not displaying Z itself, because it has 57 rows.)
Let me be nice, and confirm that the columns of Z are uncorrelated (first without rounding, then chopped by default to 10^-6):
Next, drop the first column of Z (the first row of ZT), which is
– it’s constant, but I want to replace it by a column of 1s, automatically, in the new design matrix. The following command for d3 drops the first column from Z and appends the dependent variable y… then set the names… call backward selection… and print the 4-variable regression…
Every t-statistic is highly significant, most especially the constant. And four of the standard errors are the same… (Oh, I get it! Our generic X’X is now Z’Z – an identity matrix, except for the (1,1) entry, and so is the inverse – so those standard errors all have 1 in the numerator.)
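That parenthetical observation is easy to verify: with se(b_j) = s·√([(Z’Z)⁻¹]_jj), a design whose columns are orthonormal apart from the constant makes every non-constant diagonal entry equal to 1, so those standard errors all collapse to s. A quick check on synthetic data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 57
X = rng.normal(size=(n, 4))

# Orthonormalize, then restore the constant as a column of 1s,
# mimicking the design matrix built above.
Q, _ = np.linalg.qr(np.column_stack([np.ones(n), X]))
Z = np.column_stack([np.ones(n), Q[:, 1:]])

# Fit by least squares and compute the usual OLS standard errors.
y = Z @ np.array([10.0, 2.0, -3.0, 1.0, 0.5]) + rng.normal(0, 1, n)
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
s2 = resid @ resid / (n - 5)
se = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))

# The four non-constant standard errors are identical.
print(np.allclose(se[1:], se[1]))
```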
What about multicollinearity?
Beautiful! It’s all gone.
Let’s get the sum of the squared errors for the 4-variable regression… and compare it to the sum of the squared errors for the original 4-variable regression SSE2… heck, let’s compare the residuals directly, for the raw data and the orthogonal data…
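The reason the comparison comes out equal: the orthogonalized design spans the same column space as the original, so the projection of y onto it – and hence yhat, the residuals, and the SSE – cannot change. A sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 57, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

Z, _ = np.linalg.qr(X)          # same column space as X

def residuals(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ beta

# Same residuals (hence same SSE and same yhat) either way.
print(np.allclose(residuals(X, y), residuals(Z, y)))
```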
The fit is exactly the same… y and e are the same, so the yhat values are the same – but the fit is harder to interpret. What used to be dummy variables are no longer 0/1 values, so we can’t interpret the fit as nicely as before.
Let’s at least look at the orthogonalized dummy variables. Here’s the new fourth column, which replaced D3:
Here’s the new orthogonal third column, which replaced D2.
I just happened to try the sum of the last two columns… it seems to have some structure…
Striking… but I have no idea what it means. Approximately piecewise linear.
Do they get cleaner if I orthogonalize the dummy vars immediately after the constant rather than last? No.
So what did we get? Nice solid t-statistics for a 4-variable regression – whereas the original 4-variable regression had a marginal t-statistic for the constant term. I take it that we can confidently say, based on the orthogonalized data, that the constant term is significant in the original regression despite its low t-statistic.
Personally, I think I’ll use orthogonalization to try to eliminate multicollinearity – but my purpose will be to assess the quality of the original fit based on the quality of the orthogonalized fit.
Maybe there’s more to be gotten from this, but I don’t see it yet.
Let me close by emphasizing that I orthogonalized everything – including the constant term and the dummy variables (except the dependent variable). Nothing less suffices as far as I can tell.