## introduction

We have spent a lot of time getting least-squares fits to data. We’ve also spent a lot of time looking at numbers which Mathematica can compute after we’ve gotten a fit. We didn’t have to use LinearModelFit… we could have used a simple Fit command instead.

But the Fit command just does the fit… it doesn’t even give us standard errors and t-statistics.

To get those, however, requires that we make some additional assumptions about the fit. And that’s what I want to talk about, in this and – probably – three more posts.

Let’s get started.

We assume that there is a true linear elationship between a vector y of dependent data, and a “design matrix” X, consisting of columns of explanatory (or independent) variables:

y = X B + u,

where X is an nxk matrix, y and u are column vectors of length n, B is a column vector of length k.

(Hey, I really ought to say that we use “independent” in at least three senses. We will assume that the columns of X are linearly independent vectors, we will assume that the components of u are independent random variables, and we refer to the columns of X as the independent variables.)

Then we make the following assumptions:

The vector u is a random variable, with

1) mean zero: E(u) = 0.

2) the covariance matrix of the u is a multiple of the (nxn) identity: $E(uu') = \sigma^2\ I\$.

Another way to phrase the second assumption is that the u are identically and independently distributed. (They are independent random variables because their cross-covariances are zero; we also say they are uncorrelated. We have also seen that for zero-mean vectors, uncorrelated is equivalent to linear independence; but this is a different property.)

We also assume

3) X is a set of fixed numbers.

4) X is of rank k ≤ n. (That is, X is of full rank.)

The effect of assumption (3) is that there is no uncertainty in the values of X. This can be somewhat unrealistic, but there you have it. This also means that all of the uncertainty – all of the probability – is contained in the random variables u.

Although we don’t often see it, k = n is perfectly fine – provided we're counting the constant term if there is one; we end up fitting the data exactly. Note that if there is a constant term in the model, it's part of X – a column of 1s.

My statement of assumption (2) brings me to my second general comment. I should not refer to the "expected value" E(x) as the mean of x – but even when we are careful about that, you will often find that people denote E(x) by m or $\mu\$, which of course stands for "mean". It's just too easy and comfortable a usage.

And why should we not refer to the expected value as the mean? Consider the residuals e of an ordinary-least-squares regression. It is true that their expected value is zero… but it is also true that their sum is always zero (if there's a constant term in the regression) – and if their sum is exactly zero, then their mean value is exactly zero.

Those are not the same thing. If we were to take a two random samples from some probability distribution with mean zero, then the expected value of the two random samples would be zero – but chances are very high that neither will actually be zero, and the chances are fairly high that their sum – hence the sample mean – will not be zero. But if we take more and more samples, the sample mean will tend to the expected value of zero.

"Tend to" is the operative phrase. By contrast, the sum of the residuals of a regression (with a constant term) does not "tend to" zero; it is zero.

So, I will try not to refer to the expected value E(x) as the mean of x – but I'm not going to be absolutely meticulous about it.

Back to our story.

I want to work out the expected values and covariance matrices of

• the fitted coefficients $\beta\$
• the residuals e
• the fitted values yhat – in more than one way
• the observations y – in more than one way

I will wait for the next post to show you that the sum of the residuals (hence their mean value) is zero, and to compute the expected value of the sum of squares of the residuals.

## the solution

We know how to compute the least squares estimates $\beta\$ of B:

$\beta = (X'X)^{-1}X'y\$.

That is, I am carefully distinguishing the true – but unknown – vector B from the computed least-squares estimate $\beta\$.

(As ever, X' denotes the transpose of X.) That equation says – because it's a matrix applied to the vector y – that the $\beta\$ are linear functions of the y. Incidentally, it will be useful to recall the definition of the catcher matrix C:

$\beta = Cy\$,

so $C = (X'X)^{-1}X'\$.

(I've just googled it, hoping to find out why it's called the catcher matrix. No luck.)

Next, write out the formula for yhat:

$yhat = X \beta = X(X'X)^{-1}X'y\$,

and then the definition of the hat matrix H,

yhat = H y,

so we see that $H = X(X'X)^{-1}X' = XC\$.

Then we would compute the residuals e. They are defined by analogy with the true model; we have both

$y = X \beta + e\$

and

$y = X B + u\$.

Then

$e = y - X \beta = y - \hat{y}\$

(I never remember whether the residuals are y – yhat or yhat – y, but you've just seen how I recover the sign.)

So, let us look at the $\beta\$, the e, and the yhat, in that order.

First, we will need some operational formulae for the mean – oops, expected value – and variance. In fact, let me work with just the expected value first.

## The expectation operator E

And let me work with just the scalar case first.

First, the expected value of a constant is that constant:

E(a) = a.

Second, E is linear, in the following senses:

E(u + v) = E(u) + E(v)

and if a and b are constants, then

E(a u + b) = a E(u) + b.

These formulae generalize to vectors. In particular, if A is a constant matrix and b is a constant vector, and u is a vector of random variables, then

E(Au + b) = AE(u) + b,

where Au and AE(u) are matrix (specifically matrix-vector) products.

So what's the expected value of $\beta\$?

First, write $\beta\$ in terms of u.

$\beta = C y = C (XB + u) = CXB + Cu\$

but $C = (X'X)^{-1}X'\$

so $CXB = (X'X)^{-1}X'X B = B\$,

so

$\beta = B + Cu\$.

Now we can compute the expected value. Remember that the expectation operator E is linear, B and X (hence C) are constants, and E(u) = 0:

$E(\beta)\$

= E(B + C u)

= B + C E(u)

= B + C * 0

= B

Thus, the expected value of $\beta\$ is B, and we say that our computed $\beta\$ is an unbiased estimate of the true, unknown, B.

What's the expected value of e?

First we write e in terms of u.

$e = y - yhat = y - X\beta = y - X(B + Cu)\$

= X B + u – X(B + C u)

= XB – XB + u – XCu

= u – XCu

= u – Hu

= (I – H)u

=: Mu.

(I don't know of any name for M; H, of course, is the hat matrix.)

Then the expected value of e is zero:

E(e) = E(Mu) = M E(u) = 0.

What's the expected value of y given X? To be precise, what's the expected value of

y = XB + u ?

Let me emphasize something: I’ve said it backwards. The real question is: what is the expected value of XB + u? People say that in English as: What is the expected value of y given X? I’m suggesting that you not waste any time on the English, if you’re not convinced of the equivalence. No matter how it’s described in words, I want the expected value of XB + u.

Having said that, I still going to write E( y | X) for the English phrase.

Anyway,

E(y | X) = E(XB + u) = E(XB) + E(u) = XB + 0 = XB.

Now, having

E(y | X) = XB

we estimate B by $\beta\$ and estimate E(y | X) by $X\beta\$, i.e. by yhat.

OK, what's the expected value of yhat given X? To be precise, what's the expected value of

$yhat = X \beta\$ ?

Well,

$E(\hat{y} | X) = E(X\beta) = X E(\beta) = X B\$.

What's the expected value of yhat, if we are given a single observation xo? To be precise, what's the expected value of

$yhat = x_o \beta\$ ?

Well,

$E(yhat) = x_o E(\beta) = x_o\ B\$.

That was a lot of work to show that the yhat and y have the same expected value. The corresponding result for the variance, however, will be nontrivial.

What's the expected value of y in such a case? To be precise, what's the expected value of

xo B + u?

We need to be a little more precise. We have computed $\beta\$ from a set of observations, and we hypothesize an underlying vector u of disturbances. Now we imagine that we have a new observation xo – and we need to hypothesize a new disturbance uo. So what is the expected value of

$y_o = x_o B + u_o\$ ?

Well,

$E(y_o | x_o) = E( x_o B + u_o) = E(x_o B) + E(u_o) = x_o B + 0 = x_o B\$.

You might note explicitly that xo is a row vector rather than a column vector… because it is, in principle, a new row of the X matrix, except that we’re not actually going to use it, in this case, to re-estimate the least-squares fit.

In summary, we have:

$E(\beta) = B$

E(e) = 0

E(y | X) = E(XB + u) = X B

$E(\hat{y} | X) = E(X\beta) = X B\$

$E(\hat{y} | x_o) = E(x_o\beta) = x_o B\$

$E(y | x_o) = E(x_o\beta) = x_o B\$.

## The covariance operator

In the scalar case, we are looking at this definition: the variance of u is the expected value of the square of the “centered” variable u – E(u):

$V(u) = E( (u-E(u)^2 )\$.

In the vector case, we will actually compute a matrix, the covariance matrix, as:

V(u) = E( (u-E(u)) (u-E(u))' ),

where the vectors u and E(u) are column vectors, etc. I'm sure you know (I hope you know!) that the diagonal of the covariance matrix gives the variance of each component of u.

There are some useful properties for the covariance.

First, the covariance of a constant is 0:

V(a) = 0.

Second, for a single random variable u, and a constant a, we pull out a^2:

$V(a\ u) = a^2\ u\$.

We can also show that – in very sloppy but simple language – the variance of u is the squared mean subtracted from the mean square:

$V(u) = E(u^2) - E(u)^2\$.

(There’s a place where it would be distracting to say “expected value” instead of “mean”.)

We can also define a covariance matrix for two different random variables u and w:

V(u, w) = E ( (u-E(u))(w-E(w)) )

and that in turn is equal to

V(u, w) = E(u w) – E(u) E(w).

Some of these formulae, like those for the expected value operator E, generalize to vectors of random variables. The definition becomes – note the transpose (‘) on the w-E(w) term –

V(u, w) = E ( (u-E(u))(w-E(w))' )

and that in turn – I'm pretty sure – can be written as

V(u, w) = E(u w') – E(u) E(w').

If A is a constant matrix, and b a constant vector, then

V( Au + b) = A V(u) A'.

(Instead of pulling out A^2, we pull out A and A’ in the appropriate places.)

If u and w are independent random vectors, then

V(u+w) = V(u) + V(w).

We could combine those two equations, and deal with the case when u and w are not independent:

V(Au + Bw) = A V(u) A' + B V(w) B' + A V(u,w) B' + B V(u,w) A'

Now, what's the covariance matrix of $\beta\$?

$\beta = B + Cu\$

so

$V(\beta) = V(B) + V(Cu)\$

= 0 + C V(u) C' = C V(u) C’ (Remember this.)

$= 0 + C I \sigma^2 C'\$

so

$V(\beta) = \sigma^2 CC'\$.

but

$C = (X'X)^{-1}X'\$

$C' = X (X'X)^-T = X (X'X)^{-1}\$ (since $(X'X)^{-1}\$is symmetric because X'X is).

Then

$CC' = (X'X)^{-1}X' X (X'X)^{-1} = (X'X)^{-1}\$

so

$V(\beta) = \sigma^2 CC' = \sigma^2 (X'X)^{-1}\$.

And so it's the diagonal terms of $\sigma^2 (X'X)^{-1}\$ which give us the variances of each $\beta_i\$… and their square roots are what we call the standard errors, namely the standard deviations of the $\beta\$.

What's the covariance matrix of the e?

e = Mu and

V(e) = E(ee') since E(e) = 0

= E(Muu'M')

= M E(uu') M' = $M I\sigma^2 M' = MM' \sigma^2 \ = M^2 \sigma^2 \ = M \sigma^2 \$

$= (I-H) \sigma^2 \$

since M = M' and M is idempotent (MM = M) because it is a projection operator.

This result is crucial for standardized and studentized residuals.

What's the covariance matrix of y given X? That is, what is the covariance matrix of XB+u?

Well,

$V(y | X) = V(XB + u) = 0 + V(u) = I \sigma^2\$.

This result is used in the computation of the coefficient of variation.

The covariance matrix of yhat given X? That is, what is the covariance matrix of $X\beta\$?

Well,

$E(yhat | X) = E(X\beta) = XB\$, so

$\hat{y} - E(\hat{y}) = X\beta - XB = X(B + Cu - B) = XCu\$.

Then $V(yhat | X) = V(XCu) = XC V(u) C'X' = X V(\beta) X'\$

since we just worked out that $V(\beta) = C V(u) C'\$ (“Remember this”, I said).

What if we have just one single observation xo? What's the covariance matrix of $\hat{y_o} = x_o\ \beta\$ ?

$V(x_o\ \beta) = x_o\ V(\beta) x_o'\$.

Since $x_o\ \beta\$ is also the expected value of y given xo, we also say that this is the covariance matrix of the mean value of y given xo.

And this is the variance which is used for mean prediction bands.

Finally, what's the covariance matrix of $y_o = x_o\ \beta + u\$ ?

$V(x_o\ \beta + u_o) = V(x_o\ \beta) + V(u_o)\$, which is true if $\beta\$ and uo are independent.

Our assumption that the individual $u_i\$ are mutually independent – that their covariance matrix is diagonal – says that uo is independent of all the ui in u. and while $\beta\$ depends on the $u_i\$, it is not recomputed using uo, so uo independent of u should imply uo independent of $\beta\$.

Why $\beta\$ instead of B? That is, why $x_o\ \beta\ + u_o\$ instead of xo B + u? Well, damn it, regardless of what we call it, we are going to compute the variance of $x_o\ \beta\ + u_o\$.

(How was that for cold rational mathematics? Spock would be shaking his head at me. Hell, I’m still shaking my head a little bit. See below.)

So we have, in general,

$V(x_o\ \beta + u_o) = V(x_o\ \beta) + V(u_o) + 2 V(x_o\ \beta, u_o)\$

Now, if $V(x_o\ \beta, u_o) = 0\$, i.e. $x_o\ V(\beta, u_o) x_o' = 0\$, i..e $V(\beta, u_o) = 0\$, i.e. $\beta\$ is independent of uo…

then

$V(x_o\ \beta + u) = V(x_o\ \beta) + V(u_o)\$

$= x_o V(\beta) x_o' + \sigma^2\$,

And this is the variance which is used for single prediction bands. (Yes, that’s the same link as for mean prediction bands.) The additional detail in that post may help illustrate why we care about the variance of $xo \beta+u_o\$. We understand it as the combined variance of the expected value of y (namely the variance of yhat) and the variance of y about its mean, given a new observation xo.

In summary, we have

$V(\beta) = \sigma^2 (X'X)^{-1}\$

$V(e) = (I-H) \sigma^2\$

$V(y | X) = V(XB + u) = V(u) = I \sigma^2\$

$V(\hat{y} | X) = E( XB) = X V(\beta) X'\$

$V(\hat{y} | x_o) = V(x_o \beta) = x_o\ V(\beta) x_o'\$

$V(x_o \beta + u_o) = x_o V(\beta) x_o' + \sigma^2\$.

And

$E(\beta) = B$

E(e) = 0

E(y | X) = E(XB + u) = X B

$E(\hat{y} | X) = X E(\beta) = X B\$

$E(\hat{y} | x_o) = xo E(\beta) = x_o B\$

$E(y | x_o) = x_o E(\beta) = x_o B.$

Whew!

The next post will look at the expected value of the error sum of squares.