Regression Example 1

setup

Let’s just fit a parabola to noisy data. Instead of using real data I will manufacture some. First I pick some x values; you might note that they do not have to be equally spaced.

x = \{1.,\ 1.1,\ 1.5,\ 3.,\ 4.\}

Now I generate 5 random variables u from a normal(0, .05) distribution. (The second parameter is the standard deviation, not the variance.) My data is generated as

y = 2 + x^2 + u\ .

That is, my 5 noise values are

u = \{-0.0494377,\ 0.0900309,\ -0.0316807,\ 0.0148669,\ -0.0842221\}

and my 5 data values are

y = \{2.95056,\ 3.30003,\ 4.21832,\ 11.0149,\ 17.9158\}

Here, then, are my (x,y) points:

\left(\begin{array}{ll} 1. & 2.95056 \\ 1.1 & 3.30003 \\ 1.5 & 4.21832 \\ 3. & 11.0149 \\ 4. & 17.9158\end{array}\right)
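If you want to follow along in Mathematica®, here is a sketch of that setup. The variable names (xdata, udata, ydata) are my own, and I plug in the fixed noise values quoted above rather than drawing new ones, so the numbers reproduce exactly.

    (* the x values; note they need not be equally spaced *)
    xdata = {1., 1.1, 1.5, 3., 4.};

    (* in general one would draw the noise, e.g.
       udata = RandomVariate[NormalDistribution[0, 0.05], 5];
       here I reuse the values quoted above *)
    udata = {-0.0494377, 0.0900309, -0.0316807, 0.0148669, -0.0842221};

    (* y = 2 + x^2 + u, applied elementwise *)
    ydata = 2 + xdata^2 + udata;

    (* the (x, y) points *)
    Transpose[{xdata, ydata}]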

Here is a plot of the true function without noise (y = x^2 + 2) and the (x,y) points (with noise).

In the real world, of course, what we have is only the 5 points; we want to fit a curve to them. I’m going to do this “by hand”, as I showed you in the expository post.

“by hand”

Let’s confirm all that, “by hand” as it were. I am going to just jump in and fit a general polynomial of degree 2: y = a + b\ x + c\ x^2\ (i.e. y = \beta_1 + \beta_2\ x + \beta_3\ x^2), following the computations described in the expository post.

I need a design matrix whose first column is 1s, whose second column is the x’s, and, because I want to fit a quadratic, whose third column is the squares of the x’s. That is, I construct the design matrix X

X =\left(\begin{array}{lll} 1 & 1. & 1. \\ 1 & 1.1 & 1.21 \\ 1 & 1.5 & 2.25 \\ 1 & 3. & 9. \\ 1 & 4. & 16.\end{array}\right)
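Continuing that sketch, the design matrix can be assembled column by column (or with the built-in DesignMatrix, if you prefer):

    (* columns: 1s, the x's, the squares of the x's *)
    X = Transpose[{ConstantArray[1., Length[xdata]], xdata, xdata^2}];
    X // MatrixForm

    (* equivalently: DesignMatrix[Transpose[{xdata, ydata}], {1, t, t^2}, t] *)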

We compute X^T\ X\ … and its inverse…

X^T\ X = \left(\begin{array}{lll} 5. & 10.6 & 29.46 \\ 10.6 & 29.46 & 96.706 \\ 29.46 & 96.706 & 344.527\end{array}\right)

\left(X^T\ X\right)^{-1} = \left(\begin{array}{lll} 7.43111 & -7.48067 & 1.46434 \\ -7.48067 & 7.96246 & -1.59534 \\ 1.46434 & -1.59534 & 0.325488\end{array}\right)

Incidentally, that is the step for which you probably do not want to write your own code: you want someone else to provide code for computing the inverse of a matrix.
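In Mathematica® that someone else is the built-in Inverse; continuing the sketch:

    XtX = Transpose[X].X;      (* X^T X *)
    XtXinv = Inverse[XtX];     (* (X^T X)^-1 *)
    MatrixForm /@ {XtX, XtXinv}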

Having the inverse, we compute the \beta from the normal equations:

\beta = \left(X^T\ X\right)^{-1}\ X^T\ y

and get

\beta = \{1.92723,\ 0.0964893,\ 0.975582\}
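In the sketch, the normal equations are one line:

    beta = XtXinv.Transpose[X].ydata
    (* {1.92723, 0.0964893, 0.975582} *)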

That is, we have just fitted the equation…

y = 1.92723 + 0.0964893\ x+0.975582\  x^2

to our data. We can plot it, but we shouldn’t see anything surprising. After all, the coefficients of the fitted equation are 1.93 instead of 2, .096 instead of 0, and .976 instead of 1. Pretty close.

Now we compute the predicted values from our equation; we are just using the data values of X in the fitted equation:

\hat{y} = X\ \beta = \{2.9993,\ 3.21382,\ 4.26702,\ 10.9969,\ 17.9225\}
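In the sketch, that is just the matrix product:

    yhat = X.beta
    (* {2.9993, 3.21382, 4.26702, 10.9969, 17.9225} *)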

The residuals are the differences between the actual data and the predictions. I never remember which is subtracted, but I write

y = X\ \beta + e = \hat{y} + e\ ,

and then I see that it’s y - \hat{y}  = e\ .

So we compute the e’s and the sum of squares of the e’s (the residuals and the error sum of squares):

e = \{-0.0487348,\ 0.0862128,\ -0.0486997,\ 0.0179365,\ -0.00671478\}

SSE = e \cdot e = 0.0125462
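In the sketch:

    e = ydata - yhat;
    SSE = e.e
    (* 0.0125462 *)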

We should plot the residuals (the 5th one is on the “5” on the x-axis).

For our design matrix, we have n = 5 observations and k = 3 variables. Then the “estimated error variance” SSE/(n-k) is…

s^2 = \frac{SSE}{5-3} = 0.0062731
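In the sketch, reading n and k off the design matrix:

    {n, k} = Dimensions[X];   (* n = 5 observations, k = 3 variables *)
    s2 = SSE/(n - k)
    (* 0.0062731 *)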

Let’s get the total sum of squares. We compute the mean of the y values, center the data by subtracting the mean, and get the sum of squares (as the dot product of two zero-mean vectors).

The mean of the y’s is 7.87991; after subtracting it, we get centered data

yc = \{-4.92935,\ -4.57988,\ -3.66159,\ 3.13496,\ 10.0359\}

and then

SST = yc \cdot yc = 169.228

Now we can compute the R^2:

R^2 = 1 - \frac{SSE}{SST} = 0.999926

and the adjusted R^2:

adjR^2 = 1-\frac{SSE/(n-k)}{SST/(n-1)} = 1 - \frac{SSE/2}{SST/4} = 0.999852
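In the sketch, those three quantities are:

    yc = ydata - Mean[ydata];                 (* centered y's *)
    SST = yc.yc                               (* 169.228 *)
    R2 = 1 - SSE/SST                          (* 0.999926 *)
    adjR2 = 1 - (SSE/(n - k))/(SST/(n - 1))   (* 0.999852 *)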

The covariance matrix of the \beta\ s is

\left(\begin{array}{lll} 0.0466161 & -0.046927 & 0.00918597 \\ -0.046927 & 0.0499493 & -0.0100077 \\ 0.00918597 & -0.0100077 & 0.00204182\end{array}\right)

and, in particular, the standard errors are the square roots of the diagonal elements of the covariance matrix:

se = \{0.215908,\ 0.223493,\ 0.0451865\}

I remark that although the matrix inverse \left(X^T\ X\right)^{-1} was one of the first things we computed, we need the estimated error variance s^2, computed much later, in order to get the standard errors.

The t-statistics are the \beta\ s divided by the standard errors:

t = \frac{\beta}{se} = \{8.92616,\ 0.431732,\ 21.5901\}
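In the sketch, the covariance matrix, standard errors, and t-statistics are:

    cov = s2 XtXinv;             (* covariance matrix of the beta's *)
    se = Sqrt[Diagonal[cov]]     (* {0.215908, 0.223493, 0.0451865} *)
    t = beta/se                  (* {8.92616, 0.431732, 21.5901} *)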

Let me remark that some people present the \beta\ s and the t-statistics, others present the \beta\ s and the standard errors. No problem: given any two of the \beta\ s, t-statistics, and standard errors, you can compute the third.

let Mathematica® do it

Here’s where you use whatever you’ve got available; of all my choices, I’ll go with Mathematica®. Just in case you’re also using Mathematica®, here’s the specific command I used, and the output:

(“data” consists only of the (x,y) data; “{1,x,x^2}” tells Mathematica® what design matrix to construct; the “x” after that says that the first column of “data” is “x”; the “Clear” command is because I need “x” to be a symbol, not a list of numbers. I wouldn’t have this problem if I didn’t insist on using “x” as the name of the independent variable, i.e. doing double duty.)
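To be concrete, here is one way to make that call in current Mathematica®, using LinearModelFit; I won’t swear this is the exact command behind the output discussed below (older versions used Regress from the Statistics packages instead), and the names data and lm are mine.

    Clear[x];   (* x must be a symbol here, not the list of numbers *)
    data = Transpose[{xdata, ydata}];
    lm = LinearModelFit[data, {1, x, x^2}, x];

    lm["BestFit"]
    lm["FitResiduals"]
    lm["PredictedResponse"]
    lm["ParameterTable"]
    lm["ANOVATable"]
    {lm["RSquared"], lm["AdjustedRSquared"], lm["EstimatedVariance"]}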

The first line of the output, BestFit, shows us the fitted equation: it is exactly what we computed by hand. Similarly, the FitResiduals are the residuals e:

\{-0.0487348,\ 0.0862128,\ -0.0486997,\ 0.0179365,\ -0.00671478\}

The PredictedResponse values are the 5 computed values, \hat{y}, used to compute the residuals.

\{2.9993,\ 3.21382,\ 4.26702,\ 10.9969,\ 17.9225\}

In the ParameterTable, “Estimate” refers to \beta\ , “SE” is its standard error, and “TStat” is its t-statistic.

PValue is a probability corresponding to the t-statistic. I don’t remember whether Mathematica® is doing a 1-sided or 2-sided test and the documentation doesn’t seem to say. I don’t really care: my rule of thumb is to compare the (absolute value of the) t-statistic to 1. If I ever need to know exactly what PValue means, I’ll grab a t-distribution from one of my stat books and see what Mathematica® did.

Mathematica’s standard errors and t-statistics, R^2, adjusted R^2, and estimated error variance all agree with our earlier computations.

Finally, the ANOVATable contains, among other things, the error and total sums of squares. (I’ll remark that for experimental data, the F-test can be more useful than the R^2 and the adjusted R^2, because repeated x’s with different y’s mean that we have some data points on vertical lines; we cannot make the error sum of squares zero. That in turn means that the R^2’s cannot be 1, which in turn means that we can’t tell what constitutes “a good” R^2.)

Our computed SSE and SST were

\{0.0125462,\ 169.228\}

respectively, and they agree with Mathematica®.

better model, worse fit

So, I have confirmed that my recipe matches Mathematica®. So much for computation. What about the results themselves?

The parameter table, or our own calculations, showed a t-statistic of 0.431732 for the x term in the fitted equation. That the t-statistic for x is less than 1 suggests that the linear term vanishes. That’s very nice, since we know the true model did not have a linear term; we created the data without one. It’s the small t-statistic that says: for this data there’s a high probability that \beta_2 = 0\ .

I cannot overemphasize that if our goal is the best interpolation, instead of finding the “true model”, we might choose to use the equation we have.

But I want to find the true model, so let’s drop the x term and let Mathematica® do it all for us:
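Concretely, one way to ask for the reduced fit (same caveats as before; lm2 is my name for it):

    lm2 = LinearModelFit[data, {1, x^2}, x];
    lm2["BestFit"]
    lm2["ParameterTable"]
    {lm2["RSquared"], lm2["AdjustedRSquared"], lm2["EstimatedVariance"]}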

Comparing the two fitted equations, dropping the x term has gotten us a little closer to the true equation.

What do the R-squares tell us? Old and new are {0.999926, 0.999919}, so the new one is very slightly smaller. In absolute terms, the new fit is not as good as the first one. In this case the difference is tiny, minuscule even, but if our goal is the best interpolation, then a higher R-squared equation may be preferable to a lower one.

What about the adjusted R-squares? Old and new are {0.999852, 0.999892}. The newer one is higher; in some sense, the newer fitted equation is more likely to be true. Again, in this case, the difference is damned small.

Finally, for the newer fit, all the t-stats (all two of them!) are larger than 1. That’s why we might go with the new equation.

Incidentally, the estimated error variance for the new equation is smaller: 0.00457182 vs. 0.0062731.

And, FWIW, the sample variance of our noise (which we only know because we created this data) was 0.00453426.
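In the sketch, that is just

    Variance[udata]
    (* 0.00453426 *)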

The second equation has a smaller estimated error variance, even though the first equation has smaller errors in total. This goes hand in glove with “closeness of fit” versus “true model”, R^2 versus adjusted R^2.

BTW, had I done the newer fit “by hand”, the design matrix would have had two columns: 1s and the squares. It’s ok to drop the x’s themselves. This is exactly how we specify a model, by computing and using whatever columns we choose. To be specific,

\left(\begin{array}{ll} 1 & 1. \\ 1 & 1.21 \\ 1 & 2.25 \\ 1 & 9. \\ 1 & 16.\end{array}\right)

There you are. That’s how to fit y = a + b x^2 to the data.
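In the sketch, the by-hand version of the reduced fit is the same recipe with the two-column design matrix:

    X2 = Transpose[{ConstantArray[1., Length[xdata]], xdata^2}];
    beta2 = Inverse[Transpose[X2].X2].Transpose[X2].ydata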
