introduction
It was only on this round of regression studies that it really registered with me that the residuals e are correlated with the dependent variable y, but are uncorrelated with the fitted values yhat.
And it was only a couple of weeks ago that the more precise statements registered, and I decided to prove them. In fact, what we have are the following. Suppose we run a regression, with any number of variables, and we end up with some R^2.
 If we fit a line to the residuals e as a function of the fitted values yhat, we will get a slope of zero.

If we fit a line to the residuals e as a function of the dependent variable y, we will get a slope of 1 – R^2.
We will see that we can rephrase those statements:
 the correlation coefficient between e and yhat is zero;
 the correlation coefficient between e and y is Sqrt[1 – R^2].
What that means is that if we look at the residuals as a function of y, we should expect to see a significant slope – unless the R^2 is close to 1. If our purpose in drawing the graph is to look for structure in the residuals which might point to problems in our fit, well, we’ll almost always see such structure – a line of slope 1 – R^2 – and it’s meaningless.
Therefore:
Look at e versus yhat, not e versus y.
Along the way, I came across two other interesting relationships:
 for the case of y as a function of one variable x, the correlation coefficient between x and yhat is ±1.

in general, the correlation coefficient between y and yhat is R.
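Before the proofs, here is a quick numerical check. The post's own computations are in Mathematica; this is an equivalent NumPy sketch on made-up data (not the Hald data), with variable names of my choosing, verifying all four claims.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# A k-variable regression: intercept plus two predictors, with noise.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(scale=2.0, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
e = y - yhat
yc = y - y.mean()
R2 = 1 - (e @ e) / (yc @ yc)          # R^2 = 1 - ESS/TSS

def slope(u, v):
    """Least-squares slope of v regressed on u (with an intercept)."""
    uc = u - u.mean()
    return uc @ (v - v.mean()) / (uc @ uc)

print(abs(slope(yhat, e)) < 1e-10)                          # slope of e vs yhat is 0
print(np.isclose(slope(y, e), 1 - R2))                      # slope of e vs y is 1 - R^2
print(np.isclose(np.corrcoef(y, yhat)[0, 1], np.sqrt(R2)))  # corr(y, yhat) = R

# And for a single predictor: corr(x, yhat) is +/-1.
x1 = X[:, 1]
b = slope(x1, y)
yhat1 = y.mean() + b * (x1 - x1.mean())
print(np.isclose(abs(np.corrcoef(x1, yhat1)[0, 1]), 1.0))
```

Each of the four checks prints True: the relationships are identities, so they hold up to roundoff on any data set.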
In order to prove these, I needed to take a much closer look at the computation of R^2.
If you do not care about the proofs, but just want to see these results illustrated, I will in fact illustrate them on the Hald data. You could skip ahead to the section “the Hald data”.
It turns out that – for me at least – there were 2 tricky points. I can’t go so far as to say they are subtle, but they were not obvious. Except in retrospect.
One of them is this. We will want to prove something about a 1-variable regression, specifically about e as a function of y. More specifically, we want to show that the slope computed by the 1-variable regression is the same as 1 – R^2 computed by the original k-variable regression.
That is, it is not enough to know how to compute R^2 for the 1-variable regression… we need its generalization to a matrix X rather than one independent variable X.
(By "1-variable regression", in this case, I mean y as a function of one variable x. I hate to say it, but in what follows I will refer to y as a function of a single x as the "2-variable case". Sorry.)
The other point actually caused me a difficulty that I swept under the rug for a very long time. Let me approach it slowly.
One possible definition of the variance of a single variable X with mean Xbar and n observations is:
Var(X) = (1/n) Σ (Xi – Xbar)^2.
Note that the definition requires that we center the data. More to the point, if we use lower case for the centered data and upper case for the raw data,
xi = Xi – Xbar,
then the definition of the variance can be more simply written as
Var(X) = (1/n) Σ xi^2
and, in fact, using vectors, as
x.x / n.
The point, however, is that we are computing the variance of the raw data X, although we may choose to write the equation in terms of the centered data x.
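In NumPy terms (a sketch with arbitrary numbers), the raw-data definition and the centered-vector form x.x / n are the same computation:

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # raw data
x = X - X.mean()                                          # centered data

var_raw = ((X - X.mean()) ** 2).mean()  # (1/n) sum (Xi - Xbar)^2 on the raw data
var_centered = x @ x / len(x)           # x.x / n on the centered data

print(var_raw, var_centered)  # both 4.0
```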
The very same thing happens when we talk about the correlation coefficient R, about R^2, and even about the coefficients β.
It will be extremely convenient to write equations in terms of the centered data x and y – but we will still be talking about a regression involving the raw data X and Y.
I didn’t understand this when I first learned the theory of regression… and I have always avoided equations written in terms of the centered data.
Well, it’s time I got over that. And with that out of the way, let’s get on with the purpose of this post.
Now let me illustrate this with two examples.
examples: the hald data
Let me use the Hald data. I set the file path…
I set my usual informative (not!) names… and I might as well display the data matrix…
one independent variable
Let me use only X1 as an independent variable… and display the parameter table and the R^2 for the fit…
Now let’s get the residuals e and the yhat…
What’s the correlation between x and yhat? We expect to see ±1:
And, since we have both vectors, what is yhat.e? Answer: 1.3074*10^-11 – zero, up to roundoff.
Now let’s fit e as a function of yhat:
Yes, the intercept is zero, too – but we know that the mean value of the residuals is zero.
Now let’s fit e as a function of y:
I print the parameter table and the equation – because I want the slope to pop out at us – and it will let me graph it.
And so it is.
Here’s a picture of e as a function of y:
two independent variables
I want to fit a regression that isn’t too good, and I want more than one independent variable, so I choose the two worst variables, X1 and X3… I print the parameter table and the R^2…
Next I extract the residuals e and the fitted values yhat…
We expect that the correlation between y and yhat is R. Here’s the correlation…
And while we’re here, what is e.yhat? It’s 7.7307*10^-12 – again zero, up to roundoff.
Now I run a regression of e as a function of yhat… we expect that the slope will be zero… and it is:
Yes, the intercept is zero, too – but we know that the mean value of the residuals is zero.
Now for the more interesting case. We run a regression of the residuals e as a function of Y… I print the parameter table and the equation – because I want the slope to pop out at you.
… and so it is.
Have a picture:
getting started
We start with just two variables, called X and Y, each with n observations.
We use lower case to designate the centered variables:
x = X – Xbar, y = Y – Ybar.
The correlation coefficient r is defined as the product-moment (Pearsonian) coefficient of correlation (this is r, not r^2, and corresponds to R in the general case):
r = x.y / (n sx sy),
where sx and sy are the following estimates of standard deviation – biased because they are divided by n instead of by n–1:
sx = Sqrt[x.x / n], sy = Sqrt[y.y / n].
Note that I am writing vector dot products of the ndimensional vectors x and y.
I am also going to use r to denote the correlation coefficient for this special case when we have only one independent variable X.
We can confirm that the factor of n in the definition of r cancels those in the definitions of sx and sy:
r = x.y / (Sqrt[x.x] Sqrt[y.y]).
Do you recognize that? We have the dot product of two vectors, divided by the lengths of each of them. That, in turn, is just the cosine of the angle between the (ndimensional) vectors x and y.
So the correlation coefficient r between two variables X and Y is the cosine of the angle between the centered variables x and y.
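A quick NumPy check of that statement (numbers invented): the cosine of the angle between the centered vectors matches the built-in correlation coefficient.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

x, y = X - X.mean(), Y - Y.mean()
cos_angle = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print(np.isclose(cos_angle, np.corrcoef(X, Y)[0, 1]))  # True
```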
Then r^2 is:
r^2 = (x.y)^2 / (x.x y.y).
the least-squares line and the correlation coefficient
The least-squares line is
Yhat = α + β X
with
β = x.y / x.x
and we can get α from
α = Ybar – β Xbar.
Note the similarity between this special, scalar, case
β = x.y / x.x
and the general matrix solution
β = (X′X)^-1 X′Y.
Instead of the dot product x.y of the centered data, we have the matrix product X′Y of the raw data; instead of dividing by the dot product x.x of the centered data, we premultiply by the inverse of the matrix X′X of the raw data.
We could derive that equation for β in the 2-variable case simply by specializing our known general matrix solution. If this is new to you, here’s my initial exposition of it.
It would appear that we can insert β into the equation for r^2. That is, we have
x.y = β x.x,
which means that we may rewrite r^2 as
r^2 = β^2 x.x / y.y.
It’s easier to have Mathematica confirm that equation than to derive it. I start with that equation, and recover our second one:
β^2 x.x / y.y = (x.y / x.x)^2 x.x / y.y = (x.y)^2 / (x.x y.y) = r^2.
One implication is that for the 2-variable case, if the slope β is zero, then the correlation is zero. What about the converse? If r = 0, then either β = 0 or x.y = 0 (or y.y is infinite – not plausible). But wait:
β = x.y / x.x,
which says that β = 0 if and only if x.y = 0 (assuming we actually have data, i.e. x ≠ 0).
In other words, so long as x and y are nontrivial (not zero vectors), β = 0 is equivalent to r = 0; a fitted slope of zero for Y as a function of X is equivalent to X and Y being uncorrelated.
Now we can confirm yet another equation for r^2. I believe that
r^2 = 1 – e.e / y.y.
Again, it’s easier to have Mathematica confirm that than to derive it:
1 – e.e / y.y = (y.y – e.e) / y.y,
but the numerator expands to
y.y – e.e = y.y – (y – β x).(y – β x) = 2 β x.y – β^2 x.x = (x.y)^2 / x.x,
so we get
1 – e.e / y.y = (x.y)^2 / (x.x y.y),
which is, indeed, our equation for r^2.
And that is the equation that generalizes to more than one independent variable, and gives us R^2 in the matrix case:
R^2 = 1 – ESS / TSS,
where ESS and TSS are the error sum of squares e.e and the total sum of squares y.y .
In summary, we have 3 equations for r or r^2:
r = x.y / (Sqrt[x.x] Sqrt[y.y]);
r^2 = β^2 x.x / y.y;
r^2 = 1 – e.e / y.y.
The first was the original definition – and we can read it as “r is the cosine of the angle between the centered x and the centered y.”
The third is what we use as the definition of R^2 in the general case, when the vector X has been replaced by a matrix X – because e and y are still just vectors, and the dot products make sense.
Oh, the first one can be applied to any two variables; the other two assume that we have done a least-squares fit, and gotten β, yhat, and e.
And another remark: the second equation is written for the 2-variable case. It does, in fact, generalize to the k-variable case (X a matrix instead of a vector):
R^2 = β′(x′x)β / y.y.
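Here is a NumPy sketch (arbitrary invented data) checking that the three 2-variable formulas agree:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.0, 2.5, 3.0, 5.0, 4.5, 7.0])
x, y = X - X.mean(), Y - Y.mean()

beta = x @ y / (x @ x)        # least-squares slope
e = y - beta * x              # residuals, in centered form

r_sq_1 = (x @ y) ** 2 / ((x @ x) * (y @ y))   # from the definition of r
r_sq_2 = beta ** 2 * (x @ x) / (y @ y)        # via the slope
r_sq_3 = 1 - (e @ e) / (y @ y)                # 1 - ESS/TSS

print(np.isclose(r_sq_1, r_sq_2), np.isclose(r_sq_2, r_sq_3))  # True True
```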
the correlation between x and yhat
While we’re sitting here with our equations for r, let’s play with these equations a little bit. We have a general formula for r, the correlation between x and y:
r = x.y / (Sqrt[x.x] Sqrt[y.y]).
What about the r between x and yhat? I showed by example that it was ±1.
We replace y by yhat… and then replace yhat by β x…
r = x.(β x) / (Sqrt[x.x] Sqrt[(β x).(β x)]),
but that gives us β in the numerator and Sqrt[β^2] in the denominator, and x.x in the numerator and Sqrt[x.x] Sqrt[x.x] in the denominator.
Now, x.x is always positive (unless x = 0), so
Sqrt[x.x] Sqrt[x.x] = x.x,
and the x.x terms in the numerator and denominator cancel. But β can be negative, so
Sqrt[β^2] = |β|.
That is, r = β / |β| = ±1… the correlation between x and yhat is always ±1, and depends on the sign of β (equivalently, on the sign of x.y).
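A sketch with a deliberately decreasing Y (numbers invented), confirming that a negative slope gives correlation exactly –1 between x and yhat:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([10.0, 8.5, 7.0, 6.5, 4.0])   # decreasing, so beta < 0
x, y = X - X.mean(), Y - Y.mean()

beta = x @ y / (x @ x)
yhat = beta * x                            # centered fitted values

r = x @ yhat / (np.sqrt(x @ x) * np.sqrt(yhat @ yhat))
print(beta < 0, np.isclose(r, -1.0))  # True True
```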
regress e on yhat
We need one more piece of information before we can show that a regression of e on yhat has a slope of zero.
The error vector e is orthogonal to the fitted vector yhat:
yhat.e = 0.
I’ll wait until the next post to prove that, but geometrically we have the following picture:
Here’s how I think of this. We have a vector y with n components; I’ve drawn n = 2. We have a set X of k vectors, each also with n components; I’ve drawn k = 1, along the x-axis. The question is, what’s the closest approach – the minimum distance – between y (considered as a point, the coordinates of the head of the vector) and the x-axis?
Answer: it’s the length of the line e – which is perpendicular to the x-axis. And the resulting best approximation – restricted to the subspace spanned by X – is yhat.
Maybe I should emphasize that, although yhat has n components, it lies in a k-dimensional subspace of R^n – namely the subspace spanned by the k columns (vectors) of X. (And that yhat is a linear combination of the columns of X is why this fails, I guess, for nonlinear fits.)
Anyway, geometrically I can see that yhat.e = 0, that is, e and yhat are perpendicular. The point on the x-axis closest to the tip of the y-vector is at the foot of a perpendicular.
One way or another, believe it: e and yhat are orthogonal vectors.
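If the geometry doesn’t convince you, here is a numerical sketch (random made-up data): project y onto the column space of X via the normal equations and check that yhat.e vanishes up to roundoff.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = rng.normal(size=n)

# yhat = X (X'X)^{-1} X' y is the orthogonal projection of y onto col(X)
beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ beta
e = y - yhat

print(abs(e @ yhat) < 1e-10)  # True: e is orthogonal to yhat
```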
If I regress e on yhat, my general equation for the slope
β = x.y / x.x
becomes (y → e and x → yhat)
b = yhat.e / yhat.yhat = 0,
because yhat.e = 0.
(We’re computing a slope for a different regression, so don’t call it β!)
Anyway, the slope b is zero. In addition, our general equation
r^2 = β^2 x.x / y.y
says that r^2 vanishes when the slope does… so b = 0 implies that e and yhat are uncorrelated.
There’s one last point that leaves me a little confused. Draper and Smith say, on page 64:
“Because the e’s and the Y’s are usually correlated but the e’s and the Ŷ’s are not… there will always be a slope of 1 – R^2 in the ei versus Yi plot, even if there is nothing wrong. However, a slope in the ei versus Ŷi plot indicates that something is wrong.”
Excuse me, but we’ve just shown that the slope must be zero in the ei versus Ŷi plot. What are they talking about?
They even have an exercise asking us to show that e and yhat are uncorrelated – which is nothing more than yhat.e = 0 – but that in turn implies that the least-squares fitted slope is zero.
Am I missing something?
On p. 67, they suggest computing yhat.e, and say, “This should always be zero.”
I agree. How can it not be?
Well, as we learned while we were looking at multicollinearity, “zero” can be a little imprecise on a computer.
I’ve just gone and run 38 regressions out of Draper & Smith (exercises, pp. 96–114), and one of them has a slight discrepancy: the yhat.e I compute is not particularly close to zero.
On the other hand, the average magnitude of the errors is about 34,000, and the mean value of y is about 855,000. The numbers are huge. One residual exceeds 100,000.
The condition number (largest singular value divided by smallest) of the design matrix X is 2 x 10^6. Asking for the inverse of X’X gets us a warning message… but the computed inverse works, nevertheless.
Suppose we normalize the computation by asking for r^2 in the form
(yhat.e)^2 / (yhat.yhat e.e)?
That gives a satisfying 6 x 10^-15.
In other words, yhat.e is relatively close to zero, considering the size of the residuals and the yhat.
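The effect is easy to reproduce. This sketch uses synthetic numbers chosen to mimic the scales above (nearly collinear columns, y of order 855,000): yhat.e is mathematically zero but need not be close to machine zero in absolute terms, while the normalized version stays tiny.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 25
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # nearly collinear: ill-conditioned X'X
X = np.column_stack([np.ones(n), x1, x2])
y = 855000 + 34000 * rng.normal(size=n)  # large-magnitude data

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta
e = y - yhat

raw = yhat @ e                # "zero", but possibly far from machine zero in absolute terms
normalized = yhat @ e / (np.linalg.norm(yhat) * np.linalg.norm(e))
print(abs(normalized) < 1e-8)  # True: tiny once normalized
```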
I think what they really mean is that any pattern to the graph of e versus yhat signifies a problem… whereas we will see that we must expect a linear pattern – at the very least – on a graph of e versus y.
Let me summarize this section.
 yhat.e = 0 in principle, but zero in principle isn’t always exactly zero in practice on a computer.
 if the model is nonlinear, I believe yhat.e ≠ 0.
 these residuals e are neither standardized nor studentized.
Still, I do not understand how the residuals can ever appear to have a nonzero slope when plotted against yhat.
regress e on y
Now let’s tackle the other case: that the least-squares slope b of e as a function of y is b = 1 – R^2. For this, we need a formula for R^2 in the k-variable case… but it’s very simple: we define it to be
R^2 = 1 – e.e / y.y
and then
1 – R^2 = e.e / y.y .
We do not need to deal with any of the complexities of X being a matrix… we have a simple formula involving two (centered) vectors, and that’s all we need, because we’re looking only at e and y.
We want to show that e.e / y.y is the slope of the regression line of e against y… and that’s just a 2-variable regression.
Our general x-y equation
β = x.y / x.x
becomes (first y → e and then x → y)
b = y.e / y.y.
(As before, we’re computing a slope for a different regression, so don’t call it β!)
Then
e = y – yhat
y = e + yhat
y.e = (e + yhat).e
= e.e + yhat.e
= e.e + 0
so
b = y.e / y.y
becomes
b = e.e / y.y,
which was to be proved.
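A one-line NumPy check of that algebra (data invented): the slope of e regressed on y is exactly e.e / y.y, because y.e = e.e.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([3.0, 5.0, 4.0, 7.0, 8.0, 6.0])
x, y = X - X.mean(), Y - Y.mean()

e = y - (x @ y / (x @ x)) * x            # residuals of the original fit

b = (y @ e) / (y @ y)                    # slope of e regressed on y
print(np.isclose(b, (e @ e) / (y @ y)))  # True: y.e = e.e
```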
the correlation between y and yhat
We have seen two examples showing that r (R) is also the correlation between y and yhat.
We have
r = x.y / (Sqrt[x.x] Sqrt[y.y]).
Suppose we let x → yhat… (and just as we distinguished b and β, let us distinguish this correlation coefficient from the general r by calling it r1):
r1 = yhat.y / (Sqrt[yhat.yhat] Sqrt[y.y]).
(Substituting yhat + e for y everywhere would go seriously awry inside Sqrt[y.y]. But the numerator simplifies, because yhat.y = yhat.(yhat + e) = yhat.yhat + yhat.e = yhat.yhat, so we end up with
r1 = yhat.yhat / (Sqrt[yhat.yhat] Sqrt[y.y]) = Sqrt[yhat.yhat] / Sqrt[y.y],
so we get
r1^2 = yhat.yhat / y.y = 1 – e.e / y.y,
and that we recognize as the R^2, so we’re done: the correlation coefficient between y and yhat is R, the square root of the R^2 of the regression. (In this case, yhat and y are positively correlated, so we don’t have to worry about the sign of R.)
In summary, we have seen four relationships; the first three are general, the fourth holds only for y as a function of one variable x:
 the slope of e as a function of yhat is zero;
 the slope of e as a function of y is 1 – R^2;
 the correlation between y and yhat is R;
 the correlation between x and yhat (when X is a single vector) is ±1.