## Introduction

edit 19 Sept 2010: the total sum of squares was written incorrectly; it needed to use $y - \bar{y}\$, where $\bar{y}\$ is the mean of y.

OK, suppose you know how to fit a line to a collection of (x,y) data points. There are some questions that people ask next:

1. how would I fit a parabola to that data?
2. how would I fit a plane to a collection of (x,y,z) data?

and, of course, one wants to know how to do more complicated fits:

1. cubic and higher degree polynomials.
2. higher dimensional “planes” (i.e more independent variables).

Fitting a line to (x,y) data is relatively straightforward, but the usual prescription gives no clue about how to generalize it. The usual prescription is as follows.

If our model, the line to be fitted, is

y = m x + b,

then we solve the following two equations for the two unknowns m and b:

$\sum _{i=1}^n y_i = b\ n+m \sum _{i=1}^n x_i$

(*)

$\sum _{i=1}^n x_i y_i = b \sum _{i=1}^n x_i+m \sum _{i=1}^n x_i^2$

where n is the number of observations. This is the line that minimizes the sum of the squared errors, an error being defined as the difference between each datum $y_i$ and the computed $\hat{y_i} = m\ x_i + b\$. we would usually get a different line if we fitted x = M y + B, i.e. if we measured errors horizontally.

So how would we begin to generalize that pair of equations? Fortunately, if we go to a matrix representation, one generalization gets us higher degree polynomials, planes instead of lines, higher dimensions, and a whole lot more.

This is one of my favorite pieces of mathematics, because without matrices this would be a huge collection of monstrous ugly equations; with matrices, one equation suffices for an incredible variety of cases.

## The model

We will write two matrix equations, one for the data

$y = X \beta + e\$

and one for the fitted equation

$\hat{y} = X \beta\$

which imply

$y = \hat{y} + e$

or

$e = y - \hat{y}$

This representation has

• a column vector y of the dependent variable.
• a data matrix X, in which each column is a variable; the $j^{th}\$ row is an observation, with values for each of the variables.
• a column vector $\beta$ of coefficients to be found.

That is, for a fixed row, say i, the matrix equation becomes

$\hat{y_i} = \sum _{j=1}^k X_{\text{ij}} \beta _j$

Incidentally, the notations are almost universal:

• dependent variable y,
• matrix of independent variables X,
• coefficients of the fit $\beta$,
• and the computed values $\hat{y}$

For working with the theory, one would also want to distinguish the true coefficients $\beta$ from the computed coefficients $\hat{\beta}$, and the true disturbances e from the computed errors $\hat{e}$, which are almost invariably called the residuals.

Nevertheless, I have written $\beta$ and e where one should, more precisely, write $\hat{\beta}$ and $\hat{e}\$. One should be prepared to make that distinction when necessary.

Let me say that another way. For working with the theory, we would assume that the true model is

$y = X\ \beta + e,$

for any data matrix X. Then we take a specific data matrix, compute the (ordinary) least-squares (OLS) coefficients $\hat{\beta}\$, which we think of as estimates of the true $\beta$. And for our specific data matrix X, we compute the residuals $\hat{e}\$, which are a specific realization of the possible errors e.

## Creating the design matrix X

The key is that we get to define the model to be fitted by what we put in the X matrix. What we actually use for the least-squares fit is often called the design matrix, because we usually add columns to the original data matrix.

The simple case of fitting a line is a special case: our design matrix X has a column of 1s, and a column of the x’s: i.e.

$X_{i1} = 1$

and

$X_{i2} = x_i\$.

Then we compute predicted (fitted) values of y as:

$\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i = \beta_1 + \beta_2\ x_i\$.

(And, no, whether the 1s are in column 1 or column 2 is irrelevant, except that, of course, if the 1s are in column 2, then the constant term would be $\beta_2$ instead of $\beta_1\$.

Incidentally, since we have found the equation of a line, we could draw that line; we are not limited to computing just the fitted values. We could use that fitted equation for interpolation, or extrapolation: that is, we have the function

$y(x) = \beta_1 + \beta_2\ x\$.

Want to fit a parabola? The first column of X is still all 1s; the second column is still the list of x’s; we put in a third column which is the squares of the x’s. That is,

$X_{i1} = 1$

$X_{i2} = x_i$

$X_{i3} = {x_i}^2$

then we compute predicted values

$\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i + \beta_3\ {x_i}^2$

which lie on a parabola fitted to the data, and the parabola itself is

$y(x) = \beta_1 + \beta_2\ x + \beta_3\ x^2\$.

Want to fit a plane? Just to preserve the notation, I will write it as

$\hat{y_i} = a + b\ x_i +c\ z_i\$,

because I insist on using y as the dependent variable; but that’s just a name. Anyway, for the constant term a, we put a column of 1s; for x and z we put columns of the x’s and of z’s. i.e.

$X_{i1} = 1$

$X_{i2} = x_i$

$X_{i3} = z_i$

then we get

$\hat{y_i} = \sum_j X_{ij} \beta_j = 1\ \beta_1 + \beta_2\ x_i + \beta_3\ z_i\$,

a plane.

## Fitting the model to the data

Here’s the magic formula. We compute the least-squares values of the coefficients $\beta$ as

(**) $\beta = {(X^T\ X)}^{-1}\ X^T\ y\$,

where

$X^T\$ is the transpose of X

and

${(X^TX)}^{-1}$ is the matrix inverse of $X^T\ X\$;

y and $\beta$ are (column) vectors.

The matrix equation (**) is called the normal equations.

So easy. So fraught with peril. It will almost always work, which means that it may give you an answer which isn’t very nice. You will want to graph the data and the fitted values, and the errors.

(To be specific, if the design matrix X is of full-rank, then ${(X^TX)}^{-1}$ exists; if X has more rows than columns – i.e. more observations than variables – X is of full rank if and only if its columns are linearly independent.)

Incidentally, if we wrote (**) as

$X^T\ X\ \beta = X^T\ y\$,

and built X for the special case of fitting a line y = m x + b, we would recover that pair of equations (*), as we should expect.

## Assessing the fit

Having computed the coefficients $\beta\$, we compute

the vector of fitted values, yhat: $\hat{y } = X\ \beta$
the vector of residuals: $e = y - \hat{y }$
the error sum of squares: $SSE = e\ \cdot e$
and let $n$ be the number of observations = no. of rows of X
the let $k$ be the number of variables = no. of columns of X

Then we compute the following.

we estimate the error variance: $s^2 = \frac{SSE}{n-k}$

the total sum of squares: $SST = (y - \bar{y})\ \cdot\ (y - \bar{y})\$, where $\bar{y}\$ is the mean of y.

the R-squared: $R^2 = 1- \frac{SSE}{SST}$

the adjusted R-squared: $\bar{R}^2 = 1 - \frac{SSE / (n-k)}{SST / (n-1)}$

the covariance matrix of the $\beta\$: $c = s^2\ (X^T\ X)^{-1}$

although in practice we only want the square root of the diagonal
i.e. the vector of standard errors: $se = \sqrt{c_{ii}}\$.

the vector of t-statistics: $t_i = \beta_i / se_i\$.

What do they all mean?

the R-squared , as it’s called, is a measure of how much of the variation in y is explained by the fit. It’s 1 if and only if the SSE = 0, or very close to zero; that means there are no errors. Unfortunately, R-squared cannot decrease when you add a variable to the model. The fit cannot get worse when we add a variable.

(Incidentally, if you try to fit, say, a parabola to exactly 3 (x,y) points, you will get the Lagrange interpolating polynomial which goes exactly thru those 3 points; and the $R^2$ will be exactly 1. Assuming that none of the points lie on a vertical line.)

We compensate for the fact that R-square cannot decrease as we add a variable, balancing an additional variable against the reduced variation, by using the adjusted R-squared, usually written with a bar over the R: $\bar{R}^2\$. There are other measures of overall goodness-of-fit, and I usually print half a dozen. But I am only looking for discrepancies between what they say and what the adjusted R-squared says.

There are issues with the R-squareds if there is no constant term in the model. There are issues if some of the rows of the design matrix X are repeated. (That can happen for experimental data, with different measurements of y at the same values of the x’s; some of our data points lie on vertical lines.)

Hell, there are issues with the t-statistics if the X matrix doesn’t have exactly the true variables, no more and no less!

Each t-statistic is used as an estimate of the chance that each $\beta$ is really 0. It is approximately true that if you have a t-statistic less than 1 (in absolute value), and if you drop that variable from the model – which is equivalent to setting its $\beta$ to zero – then your adjusted R-squared will go up: your overall fit, adjusted for the number of variables, will be better without that variable.

Now I think you should go get a good book on the subject. Draper & Smith is excellent, but if your interest is specifically in economic-type data and modeling, I’d recommend Ramanathan. Both have data to play with and learn from.

### 3 Responses to “Regression (OLS, ordinary least squares) Exposition”

1. Gabe Says:

My wiki has articles on System Indentification and Parameter Estimation. You’ve covered a lot of it but I’ve got some stuff for different online Least Squares techniques for updating Kalman filters in addition to the least squares stuff you’ve presented. Here’s a link:

http://wikis.controltheorypro.com/index.php?title=Introduction_to_System_Identification

2. clockbackward Says:

Thanks for the nice, concise introduction to linear least squares regression. The allusion to ordinary least squares being “fraught with peril” I think is very accurate, and is a subject which I feel is not sufficiently often addressed. Readers who want a more in depth account of the many problems that can arise when trying to apply least squares might want to check out the following essay:

http://www.clockbackward.com/2009/06/18/ordinary-least-squares-linear-regression-flaws-problems-and-pitfalls/

3. rip Says:

As I said on his blog, I appreciate this link. It suggests many alternatives to OLS.

Down the road I will get around to regression diagnostics.

On the one hand, I love learning new ways to model data – witness wavelets; on the other hand, there are a lot of tools for evaluating and modifying the result of good old ordinary least squares.

Someone once said to me, the challenge isn’t in realizing that something is wrong – it’s in knowing how to fix it.