Edit: 2011 Nov 26. I computed a correlation matrix of the parameters when I meant to compute the correlation matrix of the data. Find “edit”.
Let’s do another regression, okay? I’m sick of the Hald data. It’s been so long since we did something other than multicollinearity… you might refresh your recollection of the earlier examples and the material leading up to them.
This data comes from Ramanathan, dataset 3-7 in both the 4th and 5th editions, but 3-6 in the 3rd. See this post for information about obtaining his data.
From the description in the 4th ed. data, I infer that this is data for one vehicle over a period of 14+ years.
DATA3-7: Data for a Toyota station wagon (57 observations)
cost = cumulative repair cost in actual dollars (11 – 3425)
age = age of car in weeks of ownership (Range 5 – 538)
miles = miles driven in thousands (Range 0.8 – 74.4)
Set up the data matrix with age, miles, and cost:
We see that there are 57 observations:
Let me show you all the data. Here are the first 19 observations:
Here are the next 19 observations:
and here are the last 19 observations:
There’s something I should do before I just start running regressions – what do you think that might be? – but I’m going to dive right in anyway. We’ll get to what I should probably learn to do first. I always get to it eventually, but only after I’ve run some regressions.
Stepwise (actually, forward selection) says that AGE is the best single variable; and we only have one other independent variable, MILES, so we see two regressions:
We really might as well take a look at COST as a function of MILES alone. That is, look at the other 1-variable regression:
Run the regression…
Append it to my list of regressions…
And change the order, because I want the 2-variable regression last – not just for my comfort, but especially because the last regression in the list will be used as the touchstone by the Cp calculations.
Now look at them all:
The Adjusted R^2 values range from about 85% to 95%.
What would our criteria select?
Everything says go with the 2-variable regression – except for Mallows’ Cp, which cannot select the 2-variable regression, because it’s its touchstone. (How’s that for a compact example of the contraction versus the possessive?) That is, Cp judges all other regressions against the last one in the list.
The constant went insignificant as we went from regression 2 to regression 3. That’s our signal that we may have multicollinearity. (Because its t-statistic is still greater than 1, the Adjusted R^2 went up, and regression 3 is ranked above regression 2, despite the reduced t-statistic on the constant.)
Well, what tools do we have available for multicollinearity?
First, because we only have two independent variables, the correlation matrix of the design matrix is definitive:
Yikes! Our best fit has an adjusted R^2 of .95 – but AGE and MILES have a correlation coefficient of .996. They are more closely related to each other than COST is to the two of them.
Yikes, indeed. That’s not the correlation matrix of the data. It’s the correlation matrix of the parameters (the fitted coefficients) – and I know that. I just goofed. This is the correlation matrix of the independent variables:
Now the correlation between AGE and MILES is positive. I’ll confess that I am surprised that the correlation between the coefficients has the same magnitude… I’m not so surprised that the sign is reversed: an increase in one coefficient should be offset by a decrease in the other to keep yhat the same.
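The equal magnitude turns out not to be a coincidence. With a constant term and exactly two regressors, the covariance of the two slope estimates is proportional to the inverse of the 2×2 matrix of centered sums of squares and products, and inverting a 2×2 matrix flips the sign of the off-diagonal entry while leaving the correlation's magnitude alone. Here is a minimal sketch in Python; the data are synthetic stand-ins I made up for illustration (not Ramanathan's numbers), and the names are my own.

```python
import math
import random

# Synthetic stand-in data (NOT Ramanathan's): miles tracks age closely.
rng = random.Random(0)
n = 57
age = [5 + 533 * i / (n - 1) for i in range(n)]
miles = [0.14 * a + rng.gauss(0, 1.5) for a in age]

def centered_products(x, y):
    """Return S_xx, S_yy, S_xy: centered sums of squares and products."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

saa, smm, sam = centered_products(age, miles)

# Correlation of the data (the two independent variables).
r_data = sam / math.sqrt(saa * smm)

# With a constant in the model, the covariance of the two slope estimates
# is sigma^2 times the inverse of [[saa, sam], [sam, smm]].
det = saa * smm - sam ** 2
inv = [[smm / det, -sam / det],
       [-sam / det, saa / det]]

# Correlation of the two slope estimates (sigma^2 cancels out).
r_coef = inv[0][1] / math.sqrt(inv[0][0] * inv[1][1])

print("correlation of the data:        ", r_data)
print("correlation of the coefficients:", r_coef)
```

The two printed numbers should be negatives of each other, up to rounding. This holds only for the two-regressor case; with more regressors the relationship is no longer that simple.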
Second, let’s look at the R^2 computed from the VIFs.
Again, those .99 say that AGE and MILES are very closely related, more closely related than COST as a function of them (.95).
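With only two independent variables, the R^2 of either one regressed on the other is just the square of their correlation coefficient, and the VIF is 1/(1 - R^2). A quick arithmetic check, using the rounded correlation of .996 quoted above (the variable names are mine):

```python
r = 0.996                   # correlation between AGE and MILES, as quoted above
r_squared = r ** 2          # R^2 of either regressor on the other
vif = 1 / (1 - r_squared)   # variance inflation factor

print(f"R^2 = {r_squared:.3f}, VIF = {vif:.1f}")
```

That gives an R^2 of about .992 and a VIF of roughly 125, consistent with the .99 values from the VIFs.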
Let’s be clear: this multicollinearity is more severe than that of the Hald data, judged by these two criteria.
Third, we can look at the Singular Value Decomposition, X = u w v’. Using it, we can consider the rounding required to zero-out the last column or columns; and we can look at the smallest singular value and the condition number of the design matrix. Let me define my matrices:
(Note that I have now got v – I will use it at the end.)
The required rounding to zero out the last column is quite large – to the nearest 2 – even though it is already clear from the correlation matrix or the VIF R^2 that we have a very strong relationship between AGE and MILES. Here’s the beginning of the printout…
… and here’s the end of the printout. (57 lines is too many to show, and not all that informative.)
We’ve already seen the R^2 from the VIF (.992), which says the same thing as the high correlation coefficients. Let’s look at the singular values.
Let’s see. The smallest singular value is not too small, so severe rounding should be required to zero out the last column of Xv. That’s what we saw.
The condition number is fairly large, but not as large as it was for the Hald data. The inversion of X’X, however, is not at risk, even with condition number for X’X ~ 1 million.
Let me say that another way. Even though this multicollinearity – judged either by the correlation matrix of the design matrix, or by the VIF R^2 – is more severe than that for the Hald data, the singular value decomposition (smallest singular value and condition number) shows less of a threat to the inversion.
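For reference, here is why the condition number of X’X is the square of the condition number of X. This is a sketch assuming the thin SVD, so that u has orthonormal columns and w is the diagonal matrix of singular values:

```latex
X = u\,w\,v^{\top}
\;\Longrightarrow\;
X^{\top}X = v\,w\,u^{\top}u\,w\,v^{\top} = v\,w^{2}\,v^{\top},
\qquad
\kappa(X^{\top}X) = \frac{w_{\max}^{2}}{w_{\min}^{2}} = \kappa(X)^{2}.
```

So a condition number of about 1 million for X’X corresponds to a condition number of about 1000 for X itself.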
Let’s check the inversion, in a sense. (The Inverse command seems to use a different algorithm than Linear Model Fit. I will show you in the next post that the Inverse command is more sensitive.)
All I wanted was to see no warning messages. I take it that I have confirmed that the matrix inversion is trustworthy.
Improving the Fit
Look at the fit we got:
Oh Lord. The black points are the data, and they say that at about observation 20 there was an expensive repair… and at about observation 42 there was a really expensive repair. The fit is struggling with these jumps in the data.
I really should have looked at the data first – I would have seen those jumps and known I needed to do something… but I did look eventually…. That’s what I do only after I’ve run regressions. Probably a terrible habit, so I encourage you to do what I say and not what I do: start by looking at the data.
Anyway, we can do better.
We need what are called dummy variables (at least in casual speech; we may have to call them indicator variables for publication)… Let me just show this process to you. Those jumps are actually at 21 and 43…
Frankly, that data looks like different slopes, too; but for starters, go with 0-1 dummy variables D1 and D2. More pedagogy: go with D1, D2, D3. That is, I will create three variables, one for each of the three regions… each variable will be 1 in one region and 0 in the other two.
(A very common occurrence is to have a dummy variable for each of the four quarters of a year, or for each of the twelve months of a year, or for each of the seven days of a week….)
Here’s the dummy variable for the first region, i.e. before the first jump at 21… and I check it by printing its values at 20 and 21:
Here’s the dummy variable for the second region, 21-42 inclusive… and two checks, one at each break:
Finally the dummy variable for the third region, from 43 on.
Now, I’m not going to use all three variables: they add up to 1 – and that’s the constant term in the design matrix.
Of course they add up to the constant term. They were carefully constructed so that exactly one of the three had the value 1 at any time. Now, since dum1 + dum2 + dum3 = CON, I am going to use two of these, not all three. I choose to use the 2nd and 3rd.
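For readers who want to reproduce the construction, here is a minimal Python sketch. It assumes the breaks at 21 and 43 are observation numbers (1 through 57); the names dum1, dum2, dum3 follow the text, while the helper `at` is my own invention for 1-based lookups.

```python
n = 57                     # number of observations, numbered 1..57
obs = range(1, n + 1)

dum1 = [1 if i < 21 else 0 for i in obs]          # region 1: before the jump at 21
dum2 = [1 if 21 <= i <= 42 else 0 for i in obs]   # region 2: 21-42 inclusive
dum3 = [1 if i >= 43 else 0 for i in obs]         # region 3: from 43 on

def at(v, i):
    """Value of a dummy variable at 1-based observation number i."""
    return v[i - 1]

# Check each dummy on both sides of its breaks.
print("dum1 at 20, 21:", at(dum1, 20), at(dum1, 21))
print("dum2 at 20, 21:", at(dum2, 20), at(dum2, 21))
print("dum2 at 42, 43:", at(dum2, 42), at(dum2, 43))
print("dum3 at 42, 43:", at(dum3, 42), at(dum3, 43))

# The three dummies sum to the constant column of ones.
con = [a + b + c for a, b, c in zip(dum1, dum2, dum3)]
print("sum is all ones:", all(c == 1 for c in con))
```

Because the three columns sum to the constant, a design matrix containing the constant and all three dummies would be exactly rank-deficient, which is why one of them has to go.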
Here’s my forward selection…
They agree. And we see that our best fit now has an adjusted R^2 of .997 . Note that both dummy variables enter before MILES.
Let’s look at the fit:
That is one hell of an improvement. OK, we knew it from the Adjusted R^2 of .997 – but the visual is pretty striking.
Let me show you how that worked. We have fitted the equation
We have changed the coefficient of the constant term, but not the coefficients for AGE or MILES. We have fitted the same slopes (coefficients) for AGE and MILES, but we have shifted the curve upward by a constant.
In region 3, we have D2 = 0 and D3 = 1:
Again, we have a different constant, but the same coefficients for AGE and MILES.
That’s what 0-1 dummy variables do.
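In symbols, the effect of the two dummies can be sketched as follows. The b's here are placeholder symbols for the fitted coefficients, not the actual fitted values:

```latex
\begin{aligned}
\text{all regions:}\quad & \mathrm{COST} = b_0 + b_1\,\mathrm{AGE} + b_2\,\mathrm{MILES} + b_3\,D_2 + b_4\,D_3 \\
\text{region 1 } (D_2 = 0,\ D_3 = 0):\quad & \mathrm{COST} = b_0 + b_1\,\mathrm{AGE} + b_2\,\mathrm{MILES} \\
\text{region 2 } (D_2 = 1,\ D_3 = 0):\quad & \mathrm{COST} = (b_0 + b_3) + b_1\,\mathrm{AGE} + b_2\,\mathrm{MILES} \\
\text{region 3 } (D_2 = 0,\ D_3 = 1):\quad & \mathrm{COST} = (b_0 + b_4) + b_1\,\mathrm{AGE} + b_2\,\mathrm{MILES}
\end{aligned}
```

Only the constant changes from region to region; the slopes on AGE and MILES are shared.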
If I wanted to investigate changing the coefficients of AGE and MILES, I would use ramps for dummy variables instead. That is, for example, I would let R2 be 1,2,3,… in region 2 and 0 elsewhere.
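A ramp of that kind is easy to build. Here is a hypothetical sketch in Python, again assuming region 2 runs over observations 21 through 42; the name ramp2 is mine.

```python
n = 57  # observations numbered 1..57

# Ramp for region 2: counts 1, 2, 3, ... inside observations 21-42, 0 elsewhere.
ramp2 = [i - 20 if 21 <= i <= 42 else 0 for i in range(1, n + 1)]

print(ramp2[18:24])  # observations 19..24 -> [0, 0, 1, 2, 3, 4]
```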
Back to the regressions. What do our criteria select?
As before, they all agree on the last regression – except for Cp which can never choose the last one.
Let’s look at the regressions we ran.
The constant went insignificant when we added MILES at the end, but there is no indication of multicollinearity before then. We still have multicollinearity – adding D2 and D3 didn’t change that – but now at least our fit of COST is as good as the fit of MILES as a function of AGE.
Now that we have more than two independent variables, the correlation matrix is not definitive: it cannot detect relationships involving more than two variables.
The condition number has doubled, but it’s still less than for the Hald data.
But let me ask for the Inverse of X’X directly:
No warnings or errors, so the inversion is in no danger.
Let’s look at the R^2 from VIFs (I use the backward selection answer “bac” rather than forward selection “reg” simply because bac has the variables in their original order):
D3 shows up a little – D3 as a function of AGE and MILES has an R^2 = .9, which is higher than our original regression of COST as a function of AGE – but it’s still really AGE and MILES that are very closely related.
I’m not going to look at the required rounding to zero-out the last column of X.v; I’m sure nothing significant has changed.
Sensitivity of Coefficients
One potential effect of multicollinearity is sensitivity of the coefficients to the data. So let’s split the data. First get all the odd-numbered observations (the index i starts at 1 and steps by 2)…
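The split itself is just a stride-2 selection. A minimal Python sketch, using the observation numbers 1 to 57 as a stand-in for the actual data rows:

```python
n = 57
rows = list(range(1, n + 1))   # stand-in for the 57 data rows

odd_rows = rows[0::2]    # observations 1, 3, 5, ..., 57
even_rows = rows[1::2]   # observations 2, 4, 6, ..., 56

print(len(odd_rows), len(even_rows))  # 29 28
```

With 57 observations the halves are not quite equal: 29 odd-numbered rows and 28 even-numbered ones.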
Run backward selection on the odd-numbered observations…
OK, they look a little bit different. Let’s print the 1-variable regressions for all data, odd data, and even data:
We see that the coefficients for AGE don’t change much at all. Well, with just AGE in there, we don’t have multicollinearity. And maybe it’s not surprising that the constant term is ±5% of the “all data” case.
And here they are for the 2-variable regressions (all, odd, even, respectively):
The “insignificant” constant term varies by ± sizable percentages… the coefficients for AGE and MILES vary by about ±5%.
Somewhat sensitive, but nothing drastic. Still, more variation than with just AGE.
Nothing to worry about, IMHO. Let me just emphasize that we were wondering if the fitted coefficients are sensitive to the data. (Will the fit change if we collect more data?)
Strange Sign on MILES
Let us look at another one of the possible effects of multicollinearity.
What most people react to in the all-variable regressions is that the sign on “miles” is negative. That seems to say that the more miles on the car, the less its cost.
Yes. What it actually says is that if you increase the MILES while holding everything else constant, the COST would be less. But how would you do that? If you increase MILES, then AGE must increase.
Let me emphasize that this issue arises because MILES and AGE cannot be varied independently of each other; this isn’t about the coefficients per se: it’s about the existence of a strong relationship, not about the specific coefficients.
Let’s do this from scratch, using the fit without dummy variables. Here’s the equation for MILES as a function of AGE:
Here’s the fitted equation for AGE as a function of MILES… and then I solve it for MILES as a function of AGE:
The two equations for MILES, called em and ea, agree fairly closely. Can I plot those equations?
What are the ranges for MILES and AGE?
Here are the equations and the data:
Here are just the lines:
How much closer could they be?
Let’s take a closer look at the bottom:
and at the top:
We see that they cross somewhere in the middle.
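The crossing point is not arbitrary: both least-squares lines pass through the point of means (mean AGE, mean MILES), and the ratio of their slopes is exactly r^2, which is why the lines are nearly indistinguishable when r is close to 1. Here is a sketch on synthetic stand-in data (not Ramanathan's numbers); all names are my own.

```python
import random

# Synthetic stand-in data (NOT Ramanathan's numbers).
rng = random.Random(1)
n = 57
age = [5 + 533 * i / (n - 1) for i in range(n)]
miles = [0.14 * a + rng.gauss(0, 1.5) for a in age]

mean_a = sum(age) / n
mean_m = sum(miles) / n
saa = sum((a - mean_a) ** 2 for a in age)
smm = sum((m - mean_m) ** 2 for m in miles)
sam = sum((a - mean_a) * (m - mean_m) for a, m in zip(age, miles))

# Fit 1: MILES = a0 + a1 * AGE.
a1 = sam / saa
a0 = mean_m - a1 * mean_a

# Fit 2: AGE = c0 + c1 * MILES, then solved for MILES = s0 + s1 * AGE.
c1 = sam / smm
c0 = mean_a - c1 * mean_m
s1 = 1 / c1
s0 = -c0 / c1

r_squared = sam ** 2 / (saa * smm)

print("slope ratio a1/s1:", a1 / s1, " r^2:", r_squared)
print("line 1 at mean AGE:", a0 + a1 * mean_a)
print("line 2 at mean AGE:", s0 + s1 * mean_a)
print("mean MILES:        ", mean_m)
```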
Now suppose we take our fit for cost as a function of AGE and MILES… and substitute our least squares fit for AGE as a function of MILES… and compare that to our fit for COST as a function of AGE.
That is, here is our 2-variable fit:
but we also know from
but then equation e1 says that COST would go up by roughly 28 * 7…
… having previously dropped by -154.635, we have a net change in COST of
and what does our formula for cost as a function of miles alone say? The equation is
What I just did was to illustrate that the pair of equations
The last equation is a 1-step process: change MILES and see a change in COST – the associated change in AGE is implicit. The first two equations are a 2-step process: change MILES alone to get one change in COST; then compute the associated change in AGE, and a second change in COST due to the change in AGE.
We got the same answers. The negative sign on MILES seems wrong because our intuition doesn’t want to change MILES while holding AGE fixed.
Here’s how to show the equivalence. Write out the 2-variable equation… solve equation ea for AGE… and use the solution to eliminate AGE:
which is our single equation e2.
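This equivalence is exact for least-squares fits, not just approximately true. Substituting the fit of AGE on MILES into the 2-variable fit reproduces the 1-variable fit of COST on MILES. Here is a sketch that checks it on synthetic stand-in data; the numbers are made up (COST is deliberately generated with a negative MILES coefficient), and the names are mine.

```python
import random

# Synthetic stand-in data (NOT Ramanathan's numbers).
rng = random.Random(2)
n = 57
age = [5 + 533 * i / (n - 1) for i in range(n)]
miles = [0.14 * a + rng.gauss(0, 1.5) for a in age]
cost = [6 * a - 15 * m + rng.gauss(0, 5) for a, m in zip(age, miles)]

def mean(v):
    return sum(v) / len(v)

def s(x, y):
    """Centered sum of cross products of x and y."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

ma, mm, mc = mean(age), mean(miles), mean(cost)
saa, smm, sam = s(age, age), s(miles, miles), s(age, miles)
sac, smc = s(age, cost), s(miles, cost)

# 2-variable fit: COST = b0 + b_age * AGE + b_miles * MILES
# (normal equations in centered form, solved by Cramer's rule).
det = saa * smm - sam ** 2
b_age = (smm * sac - sam * smc) / det
b_miles = (saa * smc - sam * sac) / det
b0 = mc - b_age * ma - b_miles * mm

# Auxiliary fit: AGE = c0 + c1 * MILES.
c1 = sam / smm
c0 = ma - c1 * mm

# Substitute the auxiliary fit into the 2-variable fit to eliminate AGE.
sub_const = b0 + b_age * c0
sub_slope = b_miles + b_age * c1

# Direct 1-variable fit: COST = d0 + d1 * MILES.
d1 = smc / smm
d0 = mc - d1 * mm

print("2-variable coefficient on MILES:", b_miles)
print("slope after substitution:", sub_slope, " direct slope:", d1)
print("const after substitution:", sub_const, " direct const:", d0)
```

The substituted and direct coefficients should agree to rounding error, whatever sign the 2-variable coefficient on MILES happens to have.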
We escaped from the Hald data, but not from multicollinearity. We have a very strong relationship between AGE and MILES, but it doesn’t seem dangerous to the fit of COST as a function of AGE and MILES – even though the AGE-MILES relationship is a better fit than the one for COST.
The introduction of dummy variables improved the fit of COST considerably, and did not change the multicollinearity significantly.
We investigated whether the coefficients of the COST fit were sensitive to the data (are they likely to change if we get more data?), and we investigated the odd fact that the coefficient of MILES in the all-variable COST fit was negative.