Now I want to illustrate another problem, this time with the powers of x. The following comes from Draper & Smith, p. 463, Archer Daniels Midland data; it may be in a file, but – with only 8 observations – it was easier to type the data in. Heck, I didn’t even look to see if it was all in some file somewhere.
I have chosen to divide the years by 1000; in the next post I will do something else.
The output of the following command is the given y values… I typed integers and then divided by 100 once rather than type decimal points.
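For the record, here is roughly what that data entry looks like in NumPy (the post itself works in Mathematica). The years 1986–1993 follow from the stated x-range of 1.986 to 1.993 with 8 observations; the actual ADM y-values from Draper & Smith are not reproduced here, so y stays a placeholder.

```python
import numpy as np

years = np.arange(1986, 1994)   # 8 observations, 1986 through 1993
x = years / 1000                # divide the years by 1000: 1.986, ..., 1.993
# y = np.array([...]) / 100     # placeholder: integers divided by 100 once,
                                # rather than typing decimal points

print(x)
```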
Run forward and backward selections:
We have considerably different choices between forward and backward regression. Still, let’s just blindly ask what the selection criteria would pick… but we have a problem, as in the previous post: the selection criteria need
n – k – 2 > 0,
and with n = 8,
then 6 – k > 0,
i.e. 6 > k;
since k is an integer, we must have
k ≤ 5
– which translates to a constant plus 4 “variables”. We can only run “select” on the first four regressions; the last three have too many variables.
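The counting argument above is small enough to check mechanically; here is a one-liner version of it.

```python
# The selection-criteria constraint worked out above: with n = 8 observations,
# n - k - 2 > 0 forces k <= 5, i.e. a constant plus at most 4 "variables".
n = 8
feasible = [k for k in range(1, 8) if n - k - 2 > 0]
print(feasible)   # [1, 2, 3, 4, 5]
```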
So, take the first four regressions.
OK, let’s look at regressions 1 and 4… (and I have looked at 5 too – it’s worse):
Oh, balderdash. Coefficients of 10^10 – we’re beyond multicollinearity and into linear dependence! At least, that’s my first reaction. But we will see. That 4-variable regression has R^2 = .993 and estimated variance 20% of the 1-variable regression.
Let’s check the inversion of X’X. It turns out that even regression #2 is not reliable: We get a warning message… and what should be an identity matrix is off by 10^-4.
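That inversion check can be sketched in NumPy (the post's exact column choice for regression #2 isn't shown, so I use 1, x, x^2 as a representative 2-variable design; the x-values are the years divided by 1000).

```python
import numpy as np

x = np.arange(1986, 1994) / 1000                    # 1.986, ..., 1.993
X = np.column_stack([np.ones(8), x, x**2])          # constant, x, x^2
XtX = X.T @ X
should_be_identity = np.linalg.inv(XtX) @ XtX       # ideally the identity
deviation = np.max(np.abs(should_be_identity - np.eye(3)))
print(deviation)   # noticeably far from machine epsilon
```

The point is that even this small design gives a product visibly different from the identity matrix.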
Are the backward selections any better?
No, not really. The second regression doesn’t get the inverse of X’X very accurately.
Once I know X’X is being inverted somewhat inaccurately, there’s no reason to check the VIF R^2. We unquestionably have multicollinearity. What the heck? Do it anyway.
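Checking a VIF R^2 amounts to regressing one predictor on all the others. Here is a NumPy sketch using the 4-variable design x, x^2, x^3, x^4 – my stand-in, since the post's outputs aren't reproduced here.

```python
import numpy as np

x = np.arange(1986, 1994) / 1000
target = x                                           # regress x on the other powers
others = np.column_stack([np.ones(8), x**2, x**3, x**4])
coef, *_ = np.linalg.lstsq(others, target, rcond=None)
resid = target - others @ coef
r_squared = 1 - resid @ resid / np.sum((target - target.mean())**2)
print(r_squared)   # essentially 1, so VIF = 1/(1 - R^2) blows up
```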
Under the circumstances, however, maybe I’d better check the first regressions, too. The inversion is good, for forward selection…
and for backward selection:
So, at least two regressions involving just one variable are reliable… I conjecture that all such regressions would be reliable; and at least two regressions involving two variables look unreliable… and I conjecture that all such are. But I’m not going to look at all of them.
While we’re exhibiting signs of multicollinearity, let’s look at the singular values of the data matrix. Its dimensions are…
For an 8×6 matrix of full rank, we would get 6 nonzero singular values. We got only 5 – and two of them are under 10^-7, i.e. zero in single precision.
Three nonzero singular values, and the smallest is still around 10^-4, and the condition number for just the first three columns is
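Both of these checks – the singular values of the full 8×6 matrix, and the condition number of its first three columns – can be sketched in NumPy (note that NumPy reports all six singular values, tiny ones included, rather than dropping the effectively-zero ones).

```python
import numpy as np

x = np.arange(1986, 1994) / 1000
A = np.column_stack([x**j for j in range(6)])   # 8x6: constant, x, ..., x^5
s = np.linalg.svd(A, compute_uv=False)
print(s)                                        # the trailing values are effectively zero

# condition number of just the first three columns (1, x, x^2)
s3 = np.linalg.svd(A[:, :3], compute_uv=False)
print(s3[0] / s3[-1])
```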
What about the backward selection? Again, take the first four regressions – not because they’re the first four, but because they have 5 or fewer variables counting the constant.
Regressions 1 and 4…
The same kind of result as for forward selection: the 4-variable regression has R^2 > .993 and estimated variance 20% of the 1-variable regression.
We might as well literally take a look at the first forward selection, which used only X. Here is the fitted equation:
Here is a graph of the data and the fitted equation:
What about the 1-variable regression from backward selection?
Note that I am leaving the 1-variable fit in the graph.
Interesting. They seem to overlap over the interval of the data. Let’s forecast out to the year 3000 (i.e. x = 3.000):
We do see them diverge. Good.
Now let me try something reckless. Let’s look at forward regression #4. Here is the fitted equation – in two forms, one with X2 etc. and the other with X^2 etc.
Here’s a graph of the data and the fitted equation from the 4-variable regression:
OK, I’m impressed but also shocked that Mathematica gave me something so visually reasonable, although the coefficients of the fit are outrageously large.
I would not be surprised if other software packages were not so obliging.
What about the 4-variable fit from backward selection? Again, the fitted equation has X2 etc, and I need to turn that into X^2 etc.
Again, they appear to overlap almost perfectly. Let’s forecast out to the year 2500.
We should note the vertical scale: the predicted numbers are huge before we can see a difference between the functions.
We have used orthogonalization successfully to eliminate multicollinearity. OK, let’s orthogonalize the data. Here’s the design matrix (from backward selection because it has the variables in order).
Let’s orthogonalize the transpose of that matrix and call the result Z. Let’s compute Z’Z to see if we get an identity matrix.
Wait a minute! The two rightmost diagonal entries are zero; Z’Z cannot be inverted. We had better look at Z itself.
We only get 5 vectors out of the Gram-Schmidt process (counting the constant)! And I know that if I had not divided the years by 1000, we would only have gotten 4.
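Here is a Gram-Schmidt sketch in NumPy of what just happened: feed in the six columns 1, x, …, x^5 and keep only the vectors whose residual isn’t numerically zero. The tolerance for “numerically zero” is my choice, not the post’s, and I re-orthogonalize each vector once for numerical stability.

```python
import numpy as np

x = np.arange(1986, 1994) / 1000
cols = [x**j for j in range(6)]                  # 1, x, ..., x^5

Q = []                                           # orthonormal vectors kept so far
for v in cols:
    w = v.copy()
    for q in Q:                                  # modified Gram-Schmidt step...
        w -= (q @ w) * q
    for q in Q:                                  # ...repeated once ("twice is enough")
        w -= (q @ w) * q
    if np.linalg.norm(w) > 1e-13 * np.linalg.norm(v):
        Q.append(w / np.linalg.norm(w))          # drop numerically dependent columns

print(len(Q))   # only 5 orthogonal vectors survive, counting the constant
Z = np.column_stack(Q)
print(np.max(np.abs(Z.T @ Z - np.eye(len(Q)))))  # Z'Z is now an identity matrix
```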
OK, let’s run with what we have, four orthogonal variables:
This time we got the same set of regressions from both forward and backward selection.
The 3-variable fit doesn’t seem all that much better than the previous one, which was
Let’s check the inversion of X’X for regression 3:
We really can gain by orthogonalizing. We have more reasonable coefficients, and a more accurate inverse of X’X.
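The gain in inversion accuracy is easy to demonstrate side by side. A NumPy sketch, with columns 1, x, x^2 standing in for one of the smaller regressions, and QR standing in for the Gram-Schmidt step:

```python
import numpy as np

x = np.arange(1986, 1994) / 1000
X = np.column_stack([np.ones(8), x, x**2])
Z, _ = np.linalg.qr(X)          # orthonormal columns spanning the same space

def inversion_error(M):
    """Max deviation of inv(M'M) @ (M'M) from the identity."""
    MtM = M.T @ M
    return np.max(np.abs(np.linalg.inv(MtM) @ MtM - np.eye(M.shape[1])))

err_X = inversion_error(X)      # poor: X'X is badly conditioned
err_Z = inversion_error(Z)      # essentially exact: Z'Z is already the identity
print(err_X, err_Z)
```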
On the other hand, it’s harder to look at things. Z2 etc are not simple powers of Z1: Z2 != Z1^2, etc.
(The multicollinearity is gone, as usual.)
Here are the predicted values for the 1- and 3-variable regressions:
Let’s look at the fitted values. Yes, I’m plotting them against X, not against Z1.
The red dots do lie on a straight line: Z1 is a linear function of X, and the fitted 1-variable equation is linear in Z1, hence linear in X.
The blue dots are nonlinear in Z1, hence nonlinear in X.
Finally, let me overlay the original 3-variable fit:
The orthogonalized 3-variable fit (shown only as blue points)
is very close to the original 4-variable fit…
over the region of the actual data.
With x ranging from 1.986 to 1.993, and using powers of x, we encountered significant multicollinearity: R^2 from the VIFs were 1, the inversion of X’X gave warnings for any variable beyond x itself, and the coefficients of the fits were quite large for the 4-variable regression.
And yet, the fits themselves appeared to be valid. Furthermore, the 4-variable fit seemed to be a better fit than the 1-variable fit which used x alone.
We did, however, get different recommendations from forward and backward selection. And yet their fitted equations were visually indistinguishable.
When we orthogonalized the design matrix, we discovered that we had a new problem: we could only get five orthogonal columns, including the constant one.
Still, we were able to run regressions with just the created variables. Multicollinearity was eliminated, and we again got a good regression with multiple variables, this time with 3 instead of 4. And, this time we had nice coefficients.
I showed that the original 4-variable fit appeared to match the orthogonalized 3-variable fit.
In other words, orthogonalizing the data gives nicer equations, but the fitted values are effectively the same. We eliminated the multicollinearity, but the fits themselves were no better than those from the original horrible-looking equations.
Next time, I hope to show you how a simple change to the x-values can also clean up our horrible-looking equations.