edit 19 Sept 2010: the total sum of squares was written incorrectly; it needed to use $SST = \sum (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of y.
OK, suppose you know how to fit a line to a collection of (x,y) data points. There are some questions that people ask next:
- how would I fit a parabola to that data?
- how would I fit a plane to a collection of (x,y,z) data?
and, of course, one wants to know how to do more complicated fits:
- cubic and higher degree polynomials.
- higher dimensional “planes” (i.e. more independent variables).
Fitting a line to (x,y) data is relatively straightforward, but the usual prescription gives no clue about how to generalize it. The usual prescription is as follows.
If our model, the line to be fitted, is
y = m x + b,
then we solve the following two equations for the two unknowns m and b:

$m \sum x_i + b\,n = \sum y_i$
$m \sum x_i^2 + b \sum x_i = \sum x_i y_i$    (*)

where n is the number of observations. This is the line that minimizes the sum of the squared errors, an error being defined as the difference between each datum $y_i$ and the computed $\hat{y}_i = m x_i + b$. We would usually get a different line if we fitted x = M y + B, i.e. if we measured errors horizontally.
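To make that concrete, here is a minimal numerical sketch (the data points are made up, and numpy is assumed to be available) that solves that pair of equations for m and b:

```python
import numpy as np

# hypothetical data points
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
n = len(x)

# the two equations for m and b, written as a 2x2 linear system:
#   m*sum(x)   + b*n      = sum(y)
#   m*sum(x^2) + b*sum(x) = sum(x*y)
A = np.array([[np.sum(x),    n        ],
              [np.sum(x**2), np.sum(x)]])
rhs = np.array([np.sum(y), np.sum(x * y)])

# the fitted line minimizes the sum of squared vertical errors
m, b = np.linalg.solve(A, rhs)
```

For these four points the fit comes out to m = 1.94, b = 1.09; fitting x = M y + B instead would indeed give a (slightly) different line.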
So how would we begin to generalize that pair of equations? Fortunately, if we go to a matrix representation, one generalization gets us higher degree polynomials, planes instead of lines, higher dimensions, and a whole lot more.
This is one of my favorite pieces of mathematics, because without matrices this would be a huge collection of monstrous ugly equations; with matrices, one equation suffices for an incredible variety of cases.
We will write two matrix equations, one for the data

$y = X \beta + e$

and one for the fitted equation

$\hat{y} = X \beta$.
This representation has
- a column vector y of the dependent variable.
- a data matrix X, in which each column is a variable and each row is an observation, with values for each of the variables.
- a column vector $\beta$ of coefficients to be found.
That is, for a fixed row, say i, the matrix equation becomes

$y_i = X_{i1} \beta_1 + X_{i2} \beta_2 + \dots + X_{ik} \beta_k + e_i$.
Incidentally, the notations are almost universal:
- dependent variable y,
- matrix of independent variables X,
- coefficients of the fit $\beta$,
- and the computed values $\hat{y}$.
For working with the theory, one would also want to distinguish the true coefficients $\beta$ from the computed coefficients $\hat{\beta}$, and the true disturbances e from the computed errors $\hat{e}$, which are almost invariably called the residuals.
Nevertheless, I have written $\beta$ and e where one should, more precisely, write $\hat{\beta}$ and $\hat{e}$. One should be prepared to make that distinction when necessary.
Let me say that another way. For working with the theory, we would assume that the true model is

$y = X \beta + e$

for any data matrix X. Then we take a specific data matrix, compute the (ordinary) least-squares (OLS) coefficients $\hat{\beta}$, which we think of as estimates of the true $\beta$. And for our specific data matrix X, we compute the residuals $\hat{e} = y - X \hat{\beta}$, which are a specific realization of the possible errors e.
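A quick simulation may make that distinction vivid. Here is a sketch (made-up true coefficients, simulated disturbances, numpy assumed) in which we know the true $\beta$ and e, and can compare them with the computed $\hat{\beta}$ and residuals:

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up true model: y = X beta + e
n = 200
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])  # constant + one variable
beta_true = np.array([2.0, 0.5])
e_true = rng.normal(0.0, 0.1, n)        # the true disturbances
y = X @ beta_true + e_true

# the computed (OLS) coefficients, an estimate of the true beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# the residuals: our specific realization of the possible errors
residuals = y - X @ beta_hat
```

The residuals are close to, but not identical with, the true disturbances; and beta_hat is close to, but not identical with, beta_true.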
Creating the design matrix X
The key is that we get to define the model to be fitted by what we put in the X matrix. What we actually use for the least-squares fit is often called the design matrix, because we usually add columns to the original data matrix.
The simple case of fitting a line is a special case: our design matrix X has a column of 1s, and a column of the x’s: i.e.

$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$

Then we compute predicted (fitted) values of y as:

$\hat{y} = X \beta$, i.e. $\hat{y}_i = \beta_1 + \beta_2 x_i$.
(And, no, whether the 1s are in column 1 or column 2 is irrelevant, except that, of course, if the 1s are in column 2, then the constant term would be $\beta_2$ instead of $\beta_1$.)
Incidentally, since we have found the equation of a line, we could draw that line; we are not limited to computing just the fitted values. We could use that fitted equation for interpolation, or extrapolation: that is, we have the function

$f(x) = \beta_1 + \beta_2 x$.
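Put together, the line fit looks like this in code (hypothetical data; numpy's lstsq is used as the least-squares solver):

```python
import numpy as np

# hypothetical data
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# design matrix: a column of 1s and a column of the x's
X = np.column_stack([np.ones_like(x), x])

# least-squares coefficients: beta[0] is the constant, beta[1] the slope
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# the fitted values lie on the line...
yhat = X @ beta

# ...and the fitted equation is a function we can use anywhere,
# for interpolation or extrapolation
def f(t):
    return beta[0] + beta[1] * t
```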
Want to fit a parabola? The first column of X is still all 1s; the second column is still the list of x’s; we put in a third column which is the squares of the x’s. That is,

$X = \begin{pmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{pmatrix}$

then we compute predicted values

$\hat{y} = X \beta$, i.e. $\hat{y}_i = \beta_1 + \beta_2 x_i + \beta_3 x_i^2$,

which lie on a parabola fitted to the data, and the parabola itself is

$f(x) = \beta_1 + \beta_2 x + \beta_3 x^2$.
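A sketch of the parabola fit, with hypothetical data; note that only the design matrix changes, while the solver call is identical (numpy assumed):

```python
import numpy as np

# hypothetical data near a parabola
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([ 2.1, 1.0, 2.0, 4.9])

# design matrix: 1s, the x's, and the squares of the x's
X = np.column_stack([np.ones_like(x), x, x**2])
b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# the fitted parabola as a function
def parabola(t):
    return b1 + b2 * t + b3 * t**2
```

As a sanity check, np.polyfit(x, y, 2) gives the same three coefficients (in highest-degree-first order).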
Want to fit a plane? Just to preserve the notation, I will write it as

$y = a + b x + c z$,

because I insist on using y as the dependent variable; but that’s just a name. Anyway, for the constant term a, we put a column of 1s; for x and z we put columns of the x’s and of the z’s, i.e.

$X = \begin{pmatrix} 1 & x_1 & z_1 \\ \vdots & \vdots & \vdots \\ 1 & x_n & z_n \end{pmatrix}$

then we get

$\hat{y} = X \beta$, i.e. $\hat{y}_i = a + b x_i + c z_i$.
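A sketch of the plane fit; the (x, z, y) observations are made up to lie exactly on a plane, so the fit should recover a, b, c exactly (numpy assumed):

```python
import numpy as np

# made-up observations lying exactly on the plane y = 1 + 2x + 3z
x = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
z = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x + 3.0 * z

# design matrix: 1s, the x's, the z's
X = np.column_stack([np.ones_like(x), x, z])
a, b, c = np.linalg.lstsq(X, y, rcond=None)[0]
```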
Fitting the model to the data
Here’s the magic formula. We compute the least-squares values of the coefficients as

$\beta = (X^T X)^{-1} X^T y$    (**)

where
$X^T$ is the transpose of X;
$(X^T X)^{-1}$ is the matrix inverse of $X^T X$;
y and $\beta$ are (column) vectors.

The matrix equation (**) is the solved form of what are called the normal equations.
So easy. So fraught with peril. It will almost always work, which means that it may give you an answer which isn’t very nice. You will want to graph the data and the fitted values, and the errors.
(To be specific, if the design matrix X is of full rank, then $(X^T X)^{-1}$ exists; if X has more rows than columns – i.e. more observations than variables – X is of full rank if and only if its columns are linearly independent.)
Incidentally, if we wrote (**) as

$X^T X \beta = X^T y$

and built X for the special case of fitting a line y = m x + b, we would recover that pair of equations (*), as we should expect.
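Here is the magic formula in code, next to the route of solving the normal equations without an explicit inverse; both give the same coefficients on hypothetical data (numpy assumed). In floating point, forming the explicit inverse is usually avoided in favor of a solve:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])   # design matrix for a line

# the magic formula: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# the same thing, solving X^T X beta = X^T y without an explicit inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
```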
Assessing the fit
Having computed the coefficients $\beta$, we compute
the vector of fitted values, yhat: $\hat{y} = X \beta$;
the vector of residuals: $e = y - \hat{y}$;
the error sum of squares: $SSE = \sum e_i^2$;
and let $n$ be the number of observations = no. of rows of X,
let $k$ be the number of variables = no. of columns of X.
Then we compute the following.
we estimate the error variance: $s^2 = SSE/(n-k)$;
the total sum of squares: $SST = \sum (y_i - \bar{y})^2$, where $\bar{y}$ is the mean of y;
the R-squared: $R^2 = 1 - SSE/SST$;
the adjusted R-squared: $\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k}$;
the covariance matrix of the $\beta$: $s^2 (X^T X)^{-1}$,
although in practice we only want the square root of the diagonal,
i.e. the vector of standard errors: $se_i = \sqrt{s^2\,[(X^T X)^{-1}]_{ii}}$;
the vector of t-statistics: $t_i = \beta_i / se_i$.
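All of those quantities, in one short sketch (hypothetical nearly-linear data; numpy assumed):

```python
import numpy as np

# hypothetical data, close to a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.9, 6.1, 8.0, 9.8, 12.3])
X = np.column_stack([np.ones_like(x), x])       # design matrix
n, k = X.shape                                  # observations, variables

beta = np.linalg.solve(X.T @ X, X.T @ y)        # least-squares coefficients
yhat = X @ beta                                 # fitted values
e = y - yhat                                    # residuals
SSE = np.sum(e**2)                              # error sum of squares
s2 = SSE / (n - k)                              # estimated error variance
SST = np.sum((y - y.mean())**2)                 # total sum of squares
R2 = 1.0 - SSE / SST                            # R-squared
R2_adj = 1.0 - (1.0 - R2) * (n - 1) / (n - k)   # adjusted R-squared
cov = s2 * np.linalg.inv(X.T @ X)               # covariance matrix of the betas
se = np.sqrt(np.diag(cov))                      # standard errors
t = beta / se                                   # t-statistics
```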
What do they all mean?
The R-squared, $R^2$, as it’s called, is a measure of how much of the variation in y is explained by the fit. It’s 1 if and only if SSE = 0 (and very close to 1 when SSE is very close to zero); $R^2 = 1$ means there are no errors. Unfortunately, R-squared cannot decrease when you add a variable to the model. The fit cannot get worse when we add a variable.
(Incidentally, if you try to fit, say, a parabola to exactly 3 (x,y) points, you will get the Lagrange interpolating polynomial which goes exactly thru those 3 points; and the $R^2$ will be exactly 1. Assuming that no two of the points lie on the same vertical line.)
We compensate for the fact that R-square cannot decrease as we add a variable, balancing an additional variable against the reduced variation, by using the adjusted R-squared, usually written with a bar over the R: $\bar{R}^2$. There are other measures of overall goodness-of-fit, and I usually print half a dozen. But I am only looking for discrepancies between what they say and what the adjusted R-squared says.
There are issues with the R-squareds if there is no constant term in the model. There are issues if some of the rows of the design matrix X are repeated. (That can happen for experimental data, with different measurements of y at the same values of the x’s; some of our data points lie on vertical lines.)
Hell, there are issues with the t-statistics if the X matrix doesn’t have exactly the true variables, no more and no less!
Each t-statistic is used as an estimate of the chance that the corresponding $\beta_i$ is really 0. It is approximately true that if you have a t-statistic less than 1 (in absolute value), and if you drop that variable from the model – which is equivalent to setting its $\beta_i$ to zero – then your adjusted R-squared will go up: your overall fit, adjusted for the number of variables, will be better without that variable.
Now I think you should go get a good book on the subject. Draper & Smith is excellent, but if your interest is specifically in economic-type data and modeling, I’d recommend Ramanathan. Both have data to play with and learn from.