Regression 1 – Assumptions and the error sum of squares

There’s one thing I didn’t work out in the previous post: the relationship between the error sum of squares and the variance of the u. We have already computed the variance of the e, that is,

V(e) = E(ee’).

What we want now is the expected value of the error sum of squares:

E(e’e).

(I should perhaps remind us that e is, by convention, a column vector… so its transpose e’ is a row vector… so e’e is a scalar, equal to the dot product of e with itself… while ee’ is a square matrix. Vectors can be pretty handy for this kind of stuff.)

The expected value of the sum of squared errors is surprisingly complicated. Well, maybe I should just say it’s different from what we did in the last post… and that’s one reason I moved it to a post of its own.

Let me show this to you quickly, and then justify it. Recall that

e = Mu

so

e’e = u’M’Mu = u’Mu

(M is symmetric, so M’M = M^2, and M is idempotent, so M^2 = M. We saw this last time.)
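By the way, if you like to check these things numerically, here’s a minimal sketch in numpy. The design matrix X, the coefficients, and the errors u are synthetic – made up purely for illustration, not taken from the earlier posts.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 3                        # n observations, k columns in X (including a constant)
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])   # arbitrary "true" coefficients
u = rng.normal(scale=2.0, size=n)   # true errors
y = X @ beta + u

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix
M = np.eye(n) - H

# M is symmetric and idempotent...
print(np.allclose(M, M.T), np.allclose(M @ M, M))

# ...and the least-squares residuals are exactly M u (equivalently M y)
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(e, M @ u))
```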

Now for the expected value… (life would be really simple if we could pull the M out of the E(), but we can’t)… we make two new assertions. We have

E(e’e) = E(u’Mu)

and we claim first that

E(u'Mu) = \sigma^2 \text{ trace}(M)

and then we claim that

trace(M) = n – k,

so that

E(e'e) = \sigma^2 (n-k)\ ,

and then

\sigma^2 = \frac{E(e'e)}{n-k}\ .

Then we replace E(e’e) by the computed value of e’e, and the symbol \sigma^2\ by s^2\ , and say that s^2\ defined by

s^2 = \frac{e'e} {n-k}

is an unbiased estimate of \sigma^2\ .
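In code, s^2\ is just the residual sum of squares divided by n – k. Here’s a sketch, again with a made-up X and y (any OLS fit would do):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=3.0, size=n)  # true sigma = 3

b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_hat              # residuals
s2 = (e @ e) / (n - k)         # unbiased estimate of sigma^2
print(s2, np.sqrt(s2))         # should be in the neighborhood of 9 (= sigma^2) and 3
```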

Let me say that all again; this proof boils down to

E( e’e )

= E (u’ Mu)

= tr( M V(u) )

= \sigma^2 \text{ tr}(M)

= \sigma^2 (\text{ tr}( I - H) )

= \sigma^2 (n-k)\ .

We need to show two things: that

E (u’ Mu) = tr( M V(u) )

and

tr( I – H) = n – k.

Let me justify the second claim first. We can compute the trace of M. We will also need some properties of the trace:

E(tr A) = tr E(A): expectation and trace commute
tr(cA) = c tr(A): we can take a scalar multiple outside the trace
tr(A+B) = tr(A) + tr(B): the trace of a sum is equal to the sum of the traces
tr(In) = n: the trace of the nxn identity is n
tr(AB) = tr(BA)

That last one requires comment. It is usually stated for square matrices A and B – but it is true whenever both products exist – which need not happen. We can, for example, multiply a 2×3 matrix times a 3×4 matrix… but the reverse product does not exist, because we cannot multiply a 3×4 times a 2×3.

Let’s look at it. The i,j element of the product AB is

(AB)_{ij} = \sum_k A_{ik}B_{kj}\ .

The trace of AB is the sum of the diagonal terms:

\sum_j (AB)_{jj} = \sum_{jk} A_{jk} B_{kj}\ .

On the other hand, the i, j element of BA is

(BA)_{ij} = \sum_k B_{ik}A_{kj}\ .

The trace of BA is the sum of the diagonal terms:

\sum_j (BA)_{jj} = \sum_{jk} B_{jk} A_{kj}\ ,

which is exactly the same thing, once we swap the names of the dummy indices j and k. (I’m being careless in not showing the limits of the summation. I’m going to let it slide. I may regret this.)

We will even need the most extreme form of this theorem: take A to be a row vector, B to be a column vector, each of length n. Then AB is a scalar, the dot product of the two vectors… but BA is an nxn matrix – whose trace is the dot product of the two vectors.

Do you need to see some examples? Here’s a small but not extreme case: take A to be 3×2 and B to be 2×3.

Then AB is 3×3… and BA is 2×2… and the two traces are the same.

Similarly, if we take two vectors of length 6 – the extreme case – the dot product is a scalar… and the 6×6 matrix product (the outer product) has that same scalar as its trace.
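Here’s a quick numerical check, with matrices and vectors of my own choosing (any would do):

```python
import numpy as np

rng = np.random.default_rng(2)

# the small but not extreme case: A is 3x2, B is 2x3
A = rng.integers(-3, 4, size=(3, 2)).astype(float)
B = rng.integers(-3, 4, size=(2, 3)).astype(float)
print(np.trace(A @ B), np.trace(B @ A))    # same number, from a 3x3 and a 2x2

# the extreme case: a row vector times a column vector of length 6
a = rng.normal(size=6)
b = rng.normal(size=6)
print(a @ b, np.trace(np.outer(b, a)))     # dot product equals the trace of the 6x6 outer product
```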

Now let’s start computing.

tr M = tr(In – H )…where In is the nxn identity…

= tr In – tr(H)

= n - \text{ tr} (X (X'X)^{-1} X')

= n - \text{ tr} (X' X(X'X)^{-1})\ … (tr AB = tr BA, with A = X (X'X)^{-1}\ , B = X’)

= n – tr (Ik) … because X’X (X'X)^{-1}\ is the kxk identity.

= n – k.
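Here’s the same computation in numpy, for a random n×k design matrix made up on the spot – the traces come out (numerically) exact:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 25, 4
X = rng.normal(size=(n, k))            # any full-rank n x k matrix will do

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix
M = np.eye(n) - H

print(np.trace(H))    # ~ 4.0, i.e. k
print(np.trace(M))    # ~ 21.0, i.e. n - k
```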

There’s an intuitive explanation behind all that, too.

The quick insight is that M (like H) is a projection operator – there is a basis such that it is a cut-down identity matrix… that is, some of its diagonal elements are equal to 0 instead of to 1.

How many are nonzero? That is, how many 1s are there? Well, M = I – H, and H is a projection operator onto a k-dimensional space, so tr(H) = k… and then tr(M) = tr(I-H) = tr(I) – tr(H) = n – k.

That’s all well and good, but what has the trace when H is diagonal got to do with the trace when H is not?

They’re the same. Two similar matrices, I hope you recall, represent a common linear operator with respect to two bases; and the trace is, in fact, a property of the linear operator (it’s one coefficient in the characteristic polynomial). That is, if two matrices are similar, then they have the same trace.

So:
H is similar to a diagonal matrix D with k 1s and n-k 0s on the diagonal…

tr(D) = k…
then tr H = k…
then tr(M) = tr(I-H) = n – k.
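If you want to see the “cut-down identity” directly, look at the eigenvalues of H and M – again for a made-up X:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 3
X = rng.normal(size=(n, k))
H = X @ np.linalg.inv(X.T @ X) @ X.T

# H is similar to (indeed orthogonally diagonalizable as) a diagonal matrix
# with k ones and n-k zeros on the diagonal; its eigenvalues say so directly.
print(np.round(np.linalg.eigvalsh(H), 6))              # seven (numerical) 0s and three 1s
print(np.round(np.linalg.eigvalsh(np.eye(n) - H), 6))  # three 0s and seven 1s
```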

Now we have to back up and prove the first claim.

Why is E(u'Mu) = \sigma^2 \text{ tr}(M)\ ?

For starters,

u'Mu = \sum_{ij} u_i M_{ij} u_j = \sum_{ij} M_{ij}u_i u_j = \sum_{ij} M_{ij} D_{ji}

where D_{ji} = u_j u_i\ .

Now I want to say that u_j u_i = 0\ unless i = j… but that’s not true. What is true is that E(u_j u_i) = E(D_{ji}) = \sigma^2 \delta_{ji}\ – that is, I need to be taking expected values – so it’s time to write

E(u'Mu) = E(\sum_{ij} M_{ij} D_{ji}) = \sum_{ij} E(M_{ij} D_{ji}) = \sum_{ij} M_{ij} E(D_{ji})

= \sigma^2 \sum_{ij} M_{ij} \delta_{ij} = \sigma^2 \sum_i M_{ii} = \sigma^2 \text{ trace}(M)\ .

There is a more elegant way of doing all that. It turns out that we could do the calculation leading to the trace in more general circumstances: we could show that

E(B'AB) =\text{ tr}(AV) + \mu'A\mu\ ,

where A is a constant matrix and B is a random vector, with mean E(B) = \mu\ and variance V(B) = V. Note that we have not taken V outside the trace – that won’t happen until and unless we decide V is a multiple of the identity.

Now we let A = M and B = u, so V = V(u) = \sigma^2 I\ and \mu = E(u) = 0\ … and get

E(u’Mu) =\text{ tr}(MV) + 0'M0 =\text{ tr}(M I \sigma^2) =\text{ tr}(M \sigma^2)

= \sigma^2\ \text{ tr}(M)\ .

So let’s back up and get the equation

E(B'AB) =\text{ tr}(AV) + \mu'A\mu\ ,

which I found in Christensen “Plane Answers to Complex Questions” (bibliography). Unfortunately, his derivation is initially unsatisfying and ultimately tricky.

He starts with

(B-\mu)' A (B-\mu) = B'AB - \mu'AB - B'A\mu + \mu'A\mu

then takes expected values, so that he is computing something like a variance – but there’s a matrix A inside…

E( (B-\mu)' A (B-\mu) ) = E(B'AB) - \mu'A\mu - \mu'A\mu + \mu'A\mu = E(B'AB) - \mu'A\mu

so

E(B'AB) = E( (B-\mu)' A (B-\mu) ) + \mu'A\mu\ .

He then writes

E( (B-\mu)' A (B-\mu) ) = E \text{ tr}( (B-\mu)' A (B-\mu) )\ ,

and I have to confess I had to think about that. Then the light went on.

Duh! (B-\mu)' A (B-\mu)\ is a scalar – a row vector times a column vector – i.e. a 1×1 matrix, and the trace of a 1×1 matrix is the scalar itself! So we may introduce the trace! (Boy is that devious!)

Let’s simplify the notation. Let c = B-\mu\ , so we’re looking at

E(c'Ac) = E(\text{ tr}( c'Ac) )\ , because c’Ac is a scalar

= E(\text{ tr}( c'd) )\ , with d = Ac

= E(\text{ tr}( dc') )\ , because I can swap c’d and dc’

= E(\text{ tr}( Acc') )

=\text{ tr}( E( Acc') )\ , because E and tr commute

=\text{ tr}( A E( cc') )\ , because A is constant

=\text{ tr}( A E( (B-\mu)(B-\mu)' ) )

=\text{ tr}( AV(B) )

=\text{ tr}(AV)

That gives us

E(B'AB) =\text{ tr}(AV) + \mu'A\mu\ ,

which was to be proved.
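Here’s a Monte Carlo sanity check of that general formula, with an A, \mu\ , and V made up for the purpose – note that this V is not a multiple of the identity, so tr(AV) really does stay inside the trace:

```python
import numpy as np

rng = np.random.default_rng(5)
p = 4
A = rng.normal(size=(p, p))             # any constant matrix
mu = rng.normal(size=p)                 # mean of B
L = rng.normal(size=(p, p))
V = L @ L.T + np.eye(p)                 # a positive definite covariance, not sigma^2 * I

B = rng.multivariate_normal(mu, V, size=200_000)   # many draws of B
quad = np.einsum('ni,ij,nj->n', B, A, B)           # B'AB for each draw

print(quad.mean())                      # simulated E(B'AB)
print(np.trace(A @ V) + mu @ A @ mu)    # tr(AV) + mu'A mu -- should be close to the line above
```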

And now I’ll repeat the original proof:

E( e’e )

= E (u’ Mu)

=\text{ tr}( M V(u) )

= \sigma^2\text{ tr}(M)

= \sigma^2 (\text{ tr}( I - H) )

= \sigma^2 (n-k)\ .
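And, to close the loop, here’s a simulation of the whole chain: many independent samples of y = X\beta + u\ , each fit by least squares, and the average of e’e compared to \sigma^2 (n-k)\ . Everything (X, \beta\ , \sigma\ ) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma = 30, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([1.0, -2.0, 0.5])

sse = []
for _ in range(20_000):
    u = rng.normal(scale=sigma, size=n)
    y = X @ beta + u
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]   # residuals from one fit
    sse.append(e @ e)

sse = np.array(sse)
print(sse.mean())               # should be close to sigma^2 * (n - k) = 108
print((sse / (n - k)).mean())   # should be close to sigma^2 = 4, i.e. s^2 is unbiased
```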

And that’s enough for now.
