Let’s look again at a polynomial fit for our small set of annual data. We started this in the previous technical post.
What we used last time was
That is, I had divided the year by 1000… because, as messy as our results were, they would have been a little worse using the years themselves.
But there’s a simple transformation that we ought to try – and it will have a nice side effect.
Just center the data. Start with the years themselves, and subtract the mean:
I’ll observe that if we wanted to work with integers, we could just multiply by 2. In either case, our new x is not a unit vector.
Oh, the nice side effect? Our centered data is orthogonal to a constant vector.
Let’s see what happens.
As before, I compute the powers of x up to x^6, and I run both forward and backward selection.
Forward and backward did not agree in the previous post – but now they do. Maybe this is promising.
Let’s check the last inversion: no warning, and the identity matrix looks perfect.
This is a nice improvement. I at least am convinced that centering the years is a better way to treat them.
Nevertheless, having been shocked by the previous post, let me look at the singular values of the data matrix…
Now let’s drop the last two regressions, and see what our criteria would choose.
Regressions 1, 3, 4. Let’s also print 5, and 6 because we couldn’t include them in the selection…. Heck, let’s print all 6… as two sets of three:
The third regression has all t-statistics significant… in fact, x^2 is more significant with x^4 added.
In regression 4, x^6 enters with a t-statistic whose absolute value is greater than 1, but less than 2… so the Adjusted R^2 went up. (Note that I am printing R^2 rather than Adjusted R^2… because Draper & Smith talked about the best R^2 they could get. But we can see the Adjusted R^2 in the output from the forward and backward selections.)
In regression 5, x^3 enters with a low t-statistic, but everything else remains significant. Not until we add x^5 do we see other t-statistics fall below 1. And yet, the degradation is fairly slow.
Do we still have multicollinearity? Yes, indeed! Here are the R^2 computed from the Variance Inflation Factors:
So, just a better choice for x gave us computationally nicer results.
Let’s look at the 3-variable fit. First the predicted values… then the fitted equation… then that equation with X2 replaced by X^2, etc.:
Here’s a picture: the equation and the fitted points. (Hmm. I should have shown the data, too, but I didn’t. In that respect, this fit is as good as what we had last time.)
Visually, that’s pretty satisfying. I wouldn’t forecast very far out, though. Let’s go four years out…
That’s the problem with powers of x – they grow quickly.
Now, we should expect that orthogonalizing will eliminate the multicollinearity. And we hope that the we’ll still have nicely behaved computations.
As usual, transpose the design matrix… then orthogonalize… transpose again… and confirm that Z is orthogonal… (I use the backward selection because it keeps the variables in the original order.)
Here’s what the orthogonalized data looks like… in particular, we have a constant first column, and a second column which is proportional to the centered data.
Drop the first (constant) column… and append the dependent variable…
Run forward… and backward selection…
As before, take only the first four regressions (i.e. 5 variables including the constant), and see what our criteria choose:
Also as before, let’s go ahead and look at all 6 regressions, in two subsets:
We see pretty much the same thing. The third regression has all t-statistics significant… the fourth regression raises the Adjusted R^2 when it brings in a variable with t-statistic greater than 1… then the last two variables enter with insignificant t-statistics.
Orthogonalization hasn’t changed the quality of the fit – in one sense… but it has eliminated multicollinearity.
OK, fine, we’ve eliminated the multicollinearity… but at what cost? Our new data Z are linear combinations of powers of x… how can we look at what we’ve got? More importantly, how would we forecast using the orthogonalized fit?
Let me show the answer first. Here’s the previously fitted polynomial in X… with four predictions from the orthogonalized fit overlaid. That is, the curve comes from the powers of X, while the four points come from the fit using Z. They appear to coincide – and I will confirm that.
OK, let’s do a forecast using the Z fit. Given four new values of x, I want to compute the appropriate values of Z. This isn’t perfectly straight-forward… because I don’t want – I can’t do! – another fit. I want to compute new observations based on the existing definition of Z.
Well, we know how to do that. I let K be the matrix of “old” data (the powers of x, excluding the constant):
Then I let J be the matrix of “new” data (the orthogonal matrix Z, excluding the constant):
We know that the transition matrix T can be computed as
T’ = J’ K
(OK, OK, that’s the transpose T’ of T.)
We also recall that orthogonalizing the data was an affine transformation rather than a linear one; we have
K = J T’ – M.
Well, the fastest way to get M is to compute
M = K – J T’,
It’s certainly nice to see that M has constant columns. If it didn’t, then we would conclude that there was something wrong with the transformation.
(I also saved one row of M – because I’m going to work with vectors in a moment.)
You may well be wondering what the hell I’m doing. If I have a new observation, say, x = 9/2 – that is, the next number after 7/2 – then I want to know what Z values it generates. Well, first let’s just get the powers of our new x… there’s got to be a slick way of having Mathematica® generate them….
Those are the powers of 9/2.
What are the corresponding Z values? From
J T’ = K – M
we solve for J:
Check it. start with K… get J? I take our first observation… subtract M (actually M1, one row of M)… post-multiply by the inverse of T’… and compare that to the first observation of the J matrix:
Oh, when we get a new row for the J matrix, we’re going to want to use our orthogonalized fit. Just what is it?
I hope you see the problem: I don’t know how Z2 is related to Z, or how Z3 is related to Z and Z2. When we were working with X, it was trivial: X2 = X^2, X3 = X^3 etc.
Now, take our first forecast, x = 9/2. We subtract M1 from its powers… post-multiply by the inverse of T’… and we now have Z1, Z2, and Z4, so plug them into the equation.
(Yes, we also have Z3, Z5, and Z6, but they’re not used in the equation.)
The final number is the predicted value using our orthogonalized fit.
Now, there’s probably a cleaner way to do three more predictions, but I decided to take it easy and just do them one after another. My next three “old” values are 11/2, 13/2, and 15/2:
Get the set of Z (J) values and the predicted yhat for 11/2:
… and for 13/2:
… and for 15/2:
Now assemble points from those predictions. We know the x-value and we have the predicted y-value, and we get pairs of coordinates.
Now, let us return to our fitted polynomial in powers of X. Let’s see what it predicts for the same four values (9/2, 11/2, 13/2, 15/2).
Those four values are exactly what we got by transforming K to get J. The centered data and the orthogonalized data lead to apparently the same forecasts. Here’s that picture again. Looking at the commands, you see that the curve (f2) comes from the equation in powers of X, but the points come from data4, the fit using Z.
We can eliminate the multicollinearity… but we didn’t change the fit in any significant way. We may, as a result, have more confidence in the validity of the fit to powers of X, but it’s a pain to use the orthogonalized data!
There are some odds and ends I want to show you about this… in what will probably be the next post. In particular, we have in fact used orthogonal polynomials… without the hassle of working them out or looking them up… but they would show us how to get from X to Z.
I also want to show you what happens if I include x^7 in my data matrices.