## a smaller range

I want to show you something interesting. What if our data had spanned a much smaller range? Let the x’s range from 1 to 2 instead of 1 to 4:

${1.,\ 1.1,\ 1.5,\ 1.7,\ 2.}$

I use the very same u values. And the same model: my data is generated as $y = 2 + x^2 + u$. My u values are still

$u = {-0.0494377,\ 0.0900309,\ -0.0316807,\ 0.0148669,\ -0.0842221}$

and my (x, y) points become:

$\left(\begin{array}{ll} 1. & 2.95056 \\ 1.1 & 3.30003 \\ 1.5 & 4.21832 \\ 1.7 & 4.90487 \\ 2. & 5.91578\end{array}\right)$

Here is a drawing of the true function without noise (y = x^2 + 2) and the (x,y) points with noise.

Over that span, a straight line would not be a bad fit. In fact, it would be very good. (There’s just not all that much curvature over this small a range. Ah, FWIW, the R^2 for the straight line would be 0.992583, not what you’d think of as far from 1. I cringe when I see an R^2 less than 0.9, because I know how far off I can be in simulations and still get 0.99.)

But for now, I’m going to plunge ahead. I know the data came from a parabola, so I’m going to fit one; but I’m going to forget that there’s no “x” term. That is, as before, $y = a + b\ x + c\ x^2$, i.e. $y = \beta_1 + \beta_2\ x + \beta_3\ x^2$.

Here’s the equation Mathematica gives me. Instead of 2 + x^2 we have

$y = 1.63209 + 0.597983\ x + 0.772142\ x^2$

That’s pretty far off: 1.6 instead of 2, 0.6 instead of 0, and 0.77 instead of 1. As I said, over this small a range of x, we don’t have a lot of evidence for the curvature of the data. But, as before, the t-statistic on the linear term x is less than 1…
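The post’s computations were done in Mathematica, but the fit and the t-statistic are easy to reproduce. Here is a sketch in NumPy (the translation and variable names are mine):

```python
import numpy as np

# the data: x's over the smaller range, plus the same u values as before
x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u                       # the true model: y = 2 + x^2 + u

# design matrix for the full quadratic: constant, x, x^2
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                            # ≈ [1.63209, 0.597983, 0.772142]

# t-statistics: coefficient / standard error, with s^2 = SSR / (n - k)
resid = y - X @ beta
s2 = resid @ resid / (len(x) - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(beta / se)                       # the t on the x term is well below 1
```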

…so we drop the linear term and get a new regression. Our new equation is

$y = 2.05047 + 0.972441\ x^2$

so dropping the x term has gotten us much, much closer to the true model: 2.05 instead of 2 and .97 instead of 1.
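Refitting without the linear term is the same computation with one column removed; a sketch, assuming the same data as above:

```python
import numpy as np

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u

# design matrix with only a constant and x^2 -- the x column is gone
X2 = np.column_stack([np.ones_like(x), x**2])
beta2, *_ = np.linalg.lstsq(X2, y, rcond=None)
print(beta2)        # ≈ [2.05047, 0.972441], close to the true (2, 1)
```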

What do the R-squares tell us? Old and new R^2 are

$\{0.998042,\ 0.997672\}$

That says R^2 went down when I dropped a variable. Alternatively, R^2 was higher when I had one more variable. That’s what’s supposed to happen.

The fit with 3 variables has slightly smaller total error than the fit with two variables. In this case the difference is tiny, but if our goal is the best interpolation, then the higher-R^2 equation may be preferable to the lower. The adjusted R^2, however, tell a different story. Old and new adjusted R^2 are

$\{0.996085,\ 0.996896\}$

The newer one is the higher; in some sense, the newer fitted equation is more likely true.
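The adjusted R^2 is a one-line correction to the plain R^2, namely $1 - (1 - R^2)(n-1)/(n-k)$ for n points and k parameters. A sketch checking both fits (my NumPy, not the post’s Mathematica):

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean())**2)
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u

old = r2_and_adjusted(np.column_stack([np.ones_like(x), x, x**2]), y)
new = r2_and_adjusted(np.column_stack([np.ones_like(x), x**2]), y)
print(old)   # R^2 ≈ 0.998042, adjusted ≈ 0.996085
print(new)   # R^2 ≈ 0.997672, adjusted ≈ 0.996896
```

Note that the plain R^2 goes down when the x column is dropped, while the adjusted R^2 goes up.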

Finally, for the newer fit, all the t-stats (all two of them!) are larger than 1.

Incidentally, the estimated variance of the noise is 0.0044978. Recall that the sample variance of the actual u is 0.00453426. As before, the model with the highest adjusted R^2 has the best estimate of the noise variance. (But we know this only because we know the true errors u.)
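Both numbers are quick to check: the estimated noise variance is SSR/(n − k) for the constant-and-x^2 fit, and the other is just the sample variance of u. A sketch:

```python
import numpy as np

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u

X = np.column_stack([np.ones_like(x), x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (len(x) - X.shape[1])   # estimated noise variance
print(s2)                  # ≈ 0.0044978
print(np.var(u, ddof=1))   # sample variance of the true u ≈ 0.00453426
```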

By looking at the t-statistics and the adjusted R^2, we have recovered the model used to construct the data. (They go hand-in-hand: it is approximately true that adding a variable whose t-statistic is greater than 1 (in absolute value) results in a higher adjusted R^2.)

You know, I just can’t resist fitting a cubic to the data. (I said I like playing with regression.) And it’s an arbitrary cubic, i.e. $y = \beta_1 + \beta_2\ x + \beta_3\ x^2 + \beta_4\ x^3$.

The fitted equation is

$y = -1.09742 + 6.35766\ x - 3.12847\ x^2 + 0.852566\ x^3$

and the parameter table shows the estimates together with their standard errors and t-statistics.

Note that all of the t-statistics are small, less than 1 (in absolute value). That’s what I wanted to show you: all the t-statistics going to hell.

Even worse, the two smallest t-statistics are for the constant and for the x^2 term, i.e. for the two correct variables!

(I didn’t plan that, but it was too good to pass up.)
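The cubic fit and its collapsed t-statistics can be reproduced the same way as before; a sketch, with the same data:

```python
import numpy as np

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u

# arbitrary cubic: columns 1, x, x^2, x^3
X = np.column_stack([x**p for p in range(4)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # ≈ [-1.09742, 6.35766, -3.12847, 0.852566]

resid = y - X @ beta
s2 = resid @ resid / (len(x) - X.shape[1])   # only 1 degree of freedom left!
t = beta / np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(t)      # every |t| is below 1
```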

That cubic regression does not give us good clues about what to do next. Had it been our first regression, we might as well not have run it. Because it comes after a few smaller ones, we know that removing x^3 will fix things. What we’re not sure of, in principle, is whether removing x, for example, will also fix things.

This is why, in general, I do stepwise regression: I start by running all possible regressions with a constant term and one variable, and I choose the fit that has the highest adjusted R^2. Then I try adding each remaining variable, one at a time, and keep the equation with the highest adjusted R^2, assuming it’s larger than at the previous stage. At each stage, then, I add the variable which gives me the highest adjusted R^2.

It would be called forward regression (or forward selection) if that were all I did, but I also remove an earlier variable if its t-statistic becomes small. (Yes, that can happen.) In principle I could stop if the new adjusted R^2 decreases, but in fact I usually keep going so that I can look at the models with “too many” variables, too.

(So, when I started all this by fitting $y = \beta_1 + \beta_2\ x + \beta_3\ x^2\$ to the data, I was not following my own general guidelines. Hey, it’s just a polynomial in one variable. In this case, we could actually run every possible regression. For n variables, there are 2^n possible regressions. Since we only have 5 data points, we cannot fit higher than a quartic, so the maximum number of variables is 5 and there are 32 possible regressions.)
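For this little example, running the regressions by brute force is trivial. Here is a sketch over subsets of {x, x^2, x^3} (always keeping the constant; I leave out the quartic, which fits the 5 points exactly and leaves no degrees of freedom):

```python
import numpy as np
from itertools import combinations

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
u = np.array([-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221])
y = 2 + x**2 + u

def adjusted_r2(cols):
    """Adjusted R^2 for y regressed on a constant plus the given powers of x."""
    X = np.column_stack([x**0] + [x**p for p in cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean())**2)
    return 1 - (1 - r2) * (n - 1) / (n - k)

powers = [1, 2, 3]
subsets = [c for r in range(1, 4) for c in combinations(powers, r)]
best = max(subsets, key=adjusted_r2)
print(best)   # (2,): the winner is constant + x^2
```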

Now, I do not know if it has been shown that stepwise regression always finds the best possible regression. For a few variables, when it has been feasible for me to compute all possible regressions, I have seen that stepwise found the best one.

It seems to me that there is so much freedom in choosing and constructing variables that stepwise is more than good enough: at some point, time and energy should be devoted to exploring transformations of the variables. To put that another way, if you do stepwise regression and the answer is not satisfactory, finding, transforming, or constructing more variables is more important than whether stepwise got the best possible answer. I’m sure it’s nearly optimal, if not in fact optimal.

If, instead, we start with all of the variables first, and remove variables one at a time, that’s backward regression. My objection is that we would be starting with a bunch of t-statistics which cannot reliably tell us what to remove first. In this example, if we leave the constant, despite its low t-statistic, we would presumably drop the x^2 term first, which is exactly the wrong term to drop.

(There are some arguments for always leaving a constant term in, even when its t-statistic is small. One, it acts as a surrogate for all the variables we omitted; not just the ones we measured and omitted, but all the things we didn’t even measure. Two, the definitions of R^2 and adjusted R^2 need to be changed if there is no constant.)

Having said all that, the R^2 is higher for the cubic fit, as it should be. Here are the R^2 for constant & x^2… for constant, x, and x^2… and for the cubic…

$\{0.997672,\ 0.998042,\ 0.998291\}$

so the cubic might actually be a better interpolating equation than those with fewer powers of x. It depends on what our purpose is. (BTW, when I say interpolation, I do not mean to include extrapolation; I do mean restricting x between the outermost x-values of the data, in this case between 1 and 2.)

## the SVD

There is something else we should do with that cubic. When more than one of the t-statistics gets small upon adding one variable, that’s a strong signal that the design matrix is “multicollinear”: it’s of full rank, but not by much. (Had it not been of full rank, we could not have computed the inverse of $X^T X$.)

What is the real-valued rank of our design matrix with the cubic term added? That’s a job for the Singular Value Decomposition (SVD). If you haven’t looked at the SVD, you might want to.

For that cubic, the design matrix is…

$X = \left(\begin{array}{llll} 1 & 1. & 1. & 1. \\ 1 & 1.1 & 1.21 & 1.331 \\ 1 & 1.5 & 2.25 & 3.375 \\ 1 & 1.7 & 2.89 & 4.913 \\ 1 & 2. & 4. & 8.\end{array}\right)$

and the w-matrix from its singular value decomposition $X = u\ w\ v^T$ is…

$w = \left(\begin{array}{llll} 12.1472 & 0. & 0. & 0. \\ 0. & 1.71458 & 0. & 0. \\ 0. & 0. & 0.182043 & 0. \\ 0. & 0. & 0. & 0.00503654 \\ 0. & 0. & 0. & 0.\end{array}\right)$

Sure enough. There are 4 nonzero values of w (the rank is 4), but the smallest is only .005: X is of full rank, but it differs from a matrix of rank 3 by just .005.
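If you want to check those singular values without Mathematica, they are one call in NumPy; a sketch (only the w-values are computed, not the full decomposition):

```python
import numpy as np

x = np.array([1.0, 1.1, 1.5, 1.7, 2.0])
# design matrix for the cubic: columns 1, x, x^2, x^3
X = np.column_stack([x**p for p in range(4)])

w = np.linalg.svd(X, compute_uv=False)
print(w)   # ≈ [12.1472, 1.71458, 0.182043, 0.00503654]
```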

Although x, x^2, and x^3 are linearly independent in principle, in this case there is a nearly linear relationship among the columns of X. We know that the problem is fixed if we drop x^3; as far as I know, the SVD doesn’t tell us that. It doesn’t tell us if we have a near-linear relationship between x^2 and x^3 (which is my guess), or among all three powers of x, or some other possibility. I’m keeping an open mind on whether there’s a way to extract more information about just what is related to which. In fact, open mind be damned, there’s got to be a way (short of regressing the x’s on each other)…

Incidentally, it is an accident that the smallest w is close to the noise variance: the x’s themselves are not noisy.

In contrast to our PCA / FA examples, this design matrix has a column of 1s, i.e. a column whose variance is exactly 0. But I don’t want to remove it before computing the SVD, because it’s conceivable that the problem is that some combination of the other columns adds up to 1.

Finally, the 3rd singular value (.18) is nearly two orders of magnitude smaller than the largest. Maybe we should drop two variables instead of just one. So far, I cannot see that the SVD tells me which variables to drop, but it seems to suggest that the best model has only two variables.

Gee, it does: constant and x^2.

Let me close by re-iterating: get a good book (e.g. Ramanathan or Draper & Smith). The normal equations will pretty much give you an answer unless you broke them by building a design matrix of less than full rank. You need to know how to assess the answer which the normal equations gave you, and the clues may not always be right in your face.