It was only on this round of regression studies that it really registered with me that the residuals e are correlated with the dependent variable y, but are uncorrelated with the fitted values yhat.
And it was only a couple of weeks ago that the more precise statements registered, and I decided to prove them. In fact, what we have are the following. Suppose we run a regression, with any number of variables, and we end up with some R^2.
- If we fit a line to the residuals e as a function of the fitted values yhat, we will get a slope of zero.
- If we fit a line to the residuals e as a function of the dependent variable y, we will get a slope of 1 – R^2.
We will see that we can rephrase those statements:
- the correlation coefficient between e and yhat is zero;
- the correlation coefficient between e and y is Sqrt[1-R^2].
What that means is that if we look at the residuals as a function of y, we should expect to see a significant slope – unless the R^2 is close to 1. If our purpose in drawing the graph is to look for structure in the residuals which might point to problems in our fit, well, we’ll almost always see such structure – a line of slope 1 – R^2 – and it’s meaningless.
Look at e versus yhat, not e versus y.
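Both slope claims are easy to check numerically. Here is a quick sketch in Python/NumPy (not the Mathematica used below), on synthetic data of my own choosing; the identities hold for any data:

```python
import numpy as np

# Synthetic data: 50 observations, 3 independent variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

# Ordinary least squares with an intercept.
A = np.column_stack([np.ones(50), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ beta
e = y - yhat

# R^2 from the usual definition, using the centered dependent variable.
yc = y - y.mean()
R2 = 1 - (e @ e) / (yc @ yc)

# Slope of a line fitted to e as a function of yhat: expect zero.
slope_e_yhat = np.polyfit(yhat, e, 1)[0]
# Slope of a line fitted to e as a function of y: expect 1 - R^2.
slope_e_y = np.polyfit(y, e, 1)[0]

print(slope_e_yhat)
print(slope_e_y, 1 - R2)
```

The first slope comes out at machine-precision zero; the second matches 1 – R^2 to machine precision.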
Along the way, I came across two other interesting relationships:
- for the case of y as a function of one variable x, the correlation coefficient between x and yhat is ±1.
- in general, the correlation coefficient between y and yhat is R.
In order to prove these, I needed to take a much closer look at the computation of R^2.
If you do not care about the proofs, but just want to see these results illustrated, I will in fact illustrate them on the Hald data. You could skip ahead to the section “examples: the hald data”.
It turns out that – for me at least – there were 2 tricky points. I can’t go so far as to say they are subtle, but they were not obvious. Except in retrospect.
One of them is this. We will want to prove something about a 1-variable regression, specifically about e as a function of y. More specifically, we want to show that the slope computed by the 1-variable regression is the same as 1 – R^2 computed by the original k-variable regression.
That is, it is not enough to know how to compute R^2 for the 1-variable regression… we need its generalization to a matrix X rather than one independent variable X.
(By “1-variable regression”, in this case, I mean y as a function of one variable x. I hate to say it, but in what follows I will refer to y as a function of a single x as the “2-variable case”. Sorry.)
The other point actually caused me a difficulty that I swept under the rug for a very long time. Let me approach it slowly.
One possible definition of the variance of a single variable X with mean X̄ and n observations is:

Var(X) = Sum[ (X_i – X̄)^2 ] / n.

Note that the definition requires that we center the data. More to the point, if we use lower-case for the centered data and upper case for the raw data,

x_i = X_i – X̄,

then the definition of the variance can be more simply written as

Var(X) = Sum[ x_i^2 ] / n

and, in fact, using vectors, as

x.x / n.
The point, however, is that we are computing the variance of the raw data X, although we may choose to write the equation in terms of the centered data x.
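A tiny NumPy check of that point, with made-up numbers: the (divide-by-n) variance of the raw data equals x.x/n for the centered data.

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # raw data
x = X - X.mean()                # centered data

var_raw = np.var(X)             # biased (divide-by-n) variance of the raw data
var_centered = x @ x / len(x)   # x.x / n, written in terms of the centered data
print(var_raw, var_centered)    # both 4.0
```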
The very same thing happens when we talk about the correlation coefficient R, about R^2, and even about the coefficients β.
It will be extremely convenient to write equations in terms of the centered data x and y – but we will still be talking about a regression involving the raw data X and Y.
I didn’t understand this when I first learned the theory of regression… and I have always avoided equations written in terms of the centered data.
Well, it’s time I got over that. And with that out of the way, let’s get about the purpose of this post.
Now let me illustrate this with two examples.
examples: the hald data
Let me use the Hald data. I set the file path…
I set my usual informative (not!) names… and I might as well display the data matrix…
one independent variable
Let me use only X1 as an independent variable… and display the parameter table and the R^2 for the fit…
Now let’s get the residuals e and the yhat…
What’s the correlation between x and yhat? We expect to see ±1:
And, since we have both vectors, what is yhat.e? Answer: -1.3074*10^-11.
Now let’s fit e as a function of yhat:
Yes, the intercept is zero, too – but we know that the mean value of the residuals is zero.
Now let’s fit e as a function of y:
I print the parameter table and the equation – because I want the slope to pop out at us – and it will let me graph it.
And so it is.
Here’s a picture of e as a function of y:
two independent variables
I want to fit a regression that isn’t too good, and I want more than one independent variable, so I choose the two worst variables, X1 and X3… I print the parameter table and the R^2…
Next I extract the residuals e and the fitted values yhat…
We expect that the correlation between y and yhat is R. Here’s the correlation…
And while we’re here, what is e.yhat? It’s -7.7307*10^-12.
Now I run a regression of e as a function of yhat… we expect that the slope will be zero… and it is:
Yes, the intercept is zero, too – but we know that the mean value of the residuals is zero.
Now for the more interesting case. We run a regression of the residuals e as a function of y… I print the parameter table and the equation – because I want the slope to pop out at you.
… and so it is.
Have a picture:
the 2-variable case

We start with just two variables, called X and Y, each with n observations.
We use lower case to designate the centered variables:

x = X – X̄, y = Y – Ȳ.
The correlation coefficient r is defined as the product-moment (pearsonian) coefficient of correlation (this is r, not r^2, and corresponds to R in the general case):
r = x.y / (n sx sy),
where sx and sy are the following estimates of standard deviation – biased because they are divided by n instead of by n–1:

sx = Sqrt[x.x / n], sy = Sqrt[y.y / n].
Note that I am writing vector dot products of the n-dimensional vectors x and y.
I am also going to use r to denote the correlation coefficient for this special case when we have only one independent variable X.
We can confirm that the factor of n in the definition of r cancels those in the definitions of sx and sy:

r = x.y / (Sqrt[x.x] Sqrt[y.y]).
Do you recognize that? We have the dot product of two vectors, divided by the lengths of each of them. That, in turn, is just the cosine of the angle between the (n-dimensional) vectors x and y.
So the correlation coefficient r between two variables X and Y is the cosine of the angle between the centered variables x and y.
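That geometric reading is easy to verify in NumPy (my own small data, purely for illustration): the Pearson correlation of the raw data equals the cosine of the angle between the centered vectors.

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
x, y = X - X.mean(), Y - Y.mean()   # centered variables

r = np.corrcoef(X, Y)[0, 1]         # Pearson correlation of the raw data
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine between centered vectors
print(r, cos)                       # identical
```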
Then r^2 is:

r^2 = (x.y)^2 / (x.x y.y).
the least-squares line and the correlation coefficient
The least-squares line (in centered form) is

yhat = β x,

and we can get β from

β = x.y / x.x.

Note the similarity between this special, scalar, case

β = x.y / x.x

and the general matrix solution

β = (X'X)^-1 X'Y.

Instead of the dot product x.y of the centered data, we have the matrix product X'Y of the raw data; instead of dividing by the dot product x.x of the centered data, we pre-multiply by the matrix inverse of X'X for the raw data.

We could derive that equation for β in the 2-variable case simply by specializing our known general matrix solution. If this is new to you, here’s my initial exposition of it.

It would appear that we can insert β into the equation for r^2.

That is, we have

x.y = β x.x,

which means that we may rewrite r^2 as

r^2 = β^2 x.x / y.y.
It’s easier to have Mathematica confirm that equation than to derive it. I start with that equation, and recover our second one:
One implication is that for the 2-variable case, if the slope β is zero, then the correlation is zero. What about the converse? If r = 0, then either β = 0 or x.y = 0 (or y.y is infinity. not plausible.). But wait:
β = x.y / x.x,

which says that β = 0 if and only if x.y = 0 (assuming we actually have data, i.e. x ≠ 0).
In other words, so long as x and y are non-trivial (not zero vectors), β = 0 is equivalent to r = 0; a fitted slope of zero for Y as a function of X is equivalent to X and Y being uncorrelated.
Now we can confirm yet another equation for r^2. I believe that

r^2 = 1 – e.e / y.y.
Again, it’s easier to have Mathematica confirm that than to derive it. Writing

1 – e.e / y.y = (y.y – e.e) / y.y,

the numerator expands (using e = y – β x and x.y = β x.x) to

y.y – e.e = β x.y = β^2 x.x,

so we get

(y.y – e.e) / y.y = β^2 x.x / y.y,

which is, indeed, our equation for r^2.
And that is the equation that generalizes to more than one independent variable, and gives us R^2 in the matrix case:
R^2 = 1 – ESS / TSS,
where ESS and TSS are the error sum of squares e.e and the total sum of squares y.y .
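Before moving on, we can check in NumPy (synthetic data of my own) that all three expressions for r^2 agree in the one-variable case:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=40)
Y = 2.0 * X + rng.normal(size=40)
x, y = X - X.mean(), Y - Y.mean()   # centered data

beta = x @ y / (x @ x)              # least-squares slope
e = y - beta * x                    # residuals, in centered form

r2_def   = (x @ y) ** 2 / ((x @ x) * (y @ y))  # (cosine of the angle)^2
r2_slope = beta ** 2 * (x @ x) / (y @ y)       # beta^2 x.x / y.y
r2_ess   = 1 - (e @ e) / (y @ y)               # 1 - ESS / TSS
print(r2_def, r2_slope, r2_ess)                # all three equal
```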
In summary, we have 3 equations for r or r^2:

r = x.y / (Sqrt[x.x] Sqrt[y.y]),

r^2 = β^2 x.x / y.y,

r^2 = 1 – e.e / y.y.
The first was the original definition – and we can read it as “r is the cosine of the angle between the centered x and the centered y.”
The third is what we use as the definition of R^2 in the general case, when the vector X has been replaced by a matrix X – because e and y are still just vectors, and the dot products make sense.
Oh, the first one can be applied to any two variables; the other two assume that we have done a least squares fit, and gotten β, yhat, and e.
And another remark: the second equation is written for the 2-variable case. It does, in fact, generalize to the k-variable case (X a matrix instead of a vector): since yhat = x β in centered form, β^2 x.x becomes yhat.yhat, and

R^2 = yhat.yhat / y.y.
the correlation between x and yhat
While we’re sitting here with our equations for r, let’s play with these equations a little bit. We have a general formula for r, the correlation between x and y:

r = x.y / (Sqrt[x.x] Sqrt[y.y]).
What about the r between x and yhat? I showed by example that it was ±1.
We replace y by yhat… and then replace yhat by β x…

r = x.yhat / (Sqrt[x.x] Sqrt[yhat.yhat]) = β x.x / (Sqrt[x.x] |β| Sqrt[x.x]),

but that gives us β in the numerator and |β| in the denominator, and x.x in the numerator and in the denominator, and the x.x terms in the numerator and denominator cancel. But β can be negative, so

r = β / |β| = ±1.

That is, r = ±1… the correlation between x and yhat is always ±1, and depends on the sign of β (equivalently, on the sign of x.y).
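A NumPy sketch of that fact, with data I made up so that the slope is negative:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=30)
Y = -3.0 * X + rng.normal(size=30)   # downward-sloping relationship
x, y = X - X.mean(), Y - Y.mean()

beta = x @ y / (x @ x)               # least-squares slope (negative here)
yhat = beta * x                      # centered fitted values

# Correlation between x and yhat: magnitude 1, sign matching beta.
r_x_yhat = x @ yhat / (np.linalg.norm(x) * np.linalg.norm(yhat))
print(r_x_yhat, np.sign(beta))
```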
regress e on yhat
We need one more piece of information before we can show that a regression of e on yhat has a slope of zero.
The error vector e is orthogonal to the fitted vector yhat:

yhat.e = 0.
I’ll wait until the next post to prove that, but geometrically we have the following picture:
Here’s how I think of this. We have a vector y with n components; I’ve drawn n = 2. We have a set X of k vectors, each also with n components; I’ve drawn k = 1, along the x-axis. The question is, what’s the closest approach – the minimum distance – between y (considered as a point, the coordinates of the head of the vector) and the x-axis?
Answer: it’s the length of the line e – which is perpendicular to the x-axis. And the resulting best approximation – restricted to the subspace spanned by X – is yhat.
Maybe I should emphasize that, although yhat has n components, it lies in a k-dimensional subspace of R^n – namely the subspace spanned by the k columns (vectors) of X. (And that yhat is a linear combination of the columns of X is why this fails, I guess, for nonlinear fits.)
Anyway, geometrically I can see that yhat.e = 0, that is, e and yhat are perpendicular. The point on the x-axis closest to the tip of the y-vector is at the foot of a perpendicular.
One way or another, believe it: e and yhat are orthogonal vectors.
If I regress e on yhat, my general equation for the slope,

β = x.y / x.x,

becomes (y -> e and x -> yhat)

b = yhat.e / yhat.yhat = 0.
(We’re computing a slope for a different regression, so don’t call it β!)
Anyway, the slope b is zero. In addition, our general equation

r = x.y / (Sqrt[x.x] Sqrt[y.y])

says that, after the same substitutions, r and b differ only by the positive factor Sqrt[yhat.yhat] / Sqrt[e.e]… so b = 0 implies that e and yhat are uncorrelated.
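A numerical check of the orthogonality, in NumPy rather than Mathematica, for a regression (of my own construction) with two independent variables and an intercept:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(size=25)

# Least squares with an intercept column.
A = np.column_stack([np.ones(25), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ beta
e = y - yhat

print(e @ yhat)                  # ~0: residuals are orthogonal to fitted values
b = np.polyfit(yhat, e, 1)[0]    # slope of e regressed on yhat
print(b)                         # ~0 as well
```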
There’s one last point that leaves me a little confused. Draper and Smith say, on page 64:
“Because the e’s and the Y’s are usually correlated but the e’s and the Ŷ’s are not…. there will always be a slope of 1 – R^2 in the ei versus Yi plot, even if there is nothing wrong. However, a slope in the ei versus Ŷi plot indicates that something is wrong.”
Excuse me, but we’ve just shown that the slope must be zero in the e versus yhat plot. What are they talking about?
They even have an exercise asking us to show that e and yhat are uncorrelated – which is nothing more than yhat.e = 0 – but that in turn implies that the least-squares fitted slope is zero.
Am I missing something?
On p. 67, they suggest computing yhat.e, and say, “This should always be zero.”
I agree. How can it not be?
Well, as we learned while we were looking at multicollinearity, “zero” can be a little imprecise on a computer.
I’ve just gone and run 38 regressions out of Draper & Smith (exercises, pp. 96–114), and one of them has a slight discrepancy: the yhat.e I compute is not negligibly small.
On the other hand, the average magnitude of the errors is about 34,000, and the mean value of y is about 855,000. The numbers are huge. One residual exceeds 100,000.
The condition number (largest singular value divided by smallest) of the design matrix X is 2 x 10^6. Asking for the inverse of X’X gets us a warning message… but the computed inverse works, nevertheless.
Suppose we normalize the computation by asking for the correlation in the form

yhat.e / (Sqrt[yhat.yhat] Sqrt[e.e]).

That gives a satisfying -6 x 10^-15.
In other words, yhat.e is relatively close to zero, considering the size of the residuals and the yhat.
I think what they really mean is that any pattern to the graph of e versus yhat signifies a problem… whereas we will see that we must expect a linear pattern – at the very least – on a graph of e versus y.
Let me summarize this section.
- yhat.e = 0 in principle, but zero in principle isn’t always exactly zero in practice on a computer.
- if the model is nonlinear, I believe yhat.e ≠ 0.
- these residuals e are neither standardized nor studentized.
Still, I do not understand how the residuals can ever appear to have a nonzero slope when plotted against yhat.
regress e on y
Now let’s tackle the other case: that the least-squares slope b of e as a function of y is b = 1 – R^2. For this, we need a formula for R^2 in the k-variable case… but it’s very simple: we define it to be
R^2 = 1 – e.e / y.y,

so that

1 – R^2 = e.e / y.y.
We do not need to deal with any of the complexities of X being a matrix… we have a simple formula involving two (centered) vectors, and that’s all we need, because we’re looking only at e and y.
We want to show that e.e / y.y is the slope of the regression line of e against y… and that’s just a 2-variable regression.
Our general x-y equation

β = x.y / x.x
becomes (first y -> e and then x-> y)
b = y.e / y.y
(As before, we’re computing a slope for a different regression, so don’t call it β!)
Using e = y – yhat, i.e. y = e + yhat, we have

y.e = (e + yhat).e = e.e + yhat.e = e.e + 0,

so

b = y.e / y.y = e.e / y.y,

which was to be proved.
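The key step in that chain – y.e = e.e, because yhat.e = 0 – can be checked numerically; a NumPy sketch on synthetic data, with an intercept included so the residuals have mean zero:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=20)

A = np.column_stack([np.ones(20), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
e = Y - A @ beta            # residuals (mean zero, thanks to the intercept)
y = Y - Y.mean()            # centered dependent variable

print(y @ e, e @ e)         # equal: y.e = e.e since yhat.e = 0
print((y @ e) / (y @ y))    # the slope b = e.e / y.y = 1 - R^2
```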
the correlation between y and yhat
We have seen two examples showing that r (R) is also the correlation between y and yhat.
Suppose we let x -> yhat… (and just as we distinguished b and β, let us distinguish this correlation coefficient from the general r):

yhat.y / (Sqrt[yhat.yhat] Sqrt[y.y]).
(That substitution of yhat + e for y goes seriously awry if we apply it everywhere at once. But the numerator simplifies to yhat.y = yhat.(yhat + e) = yhat.yhat + yhat.e = yhat.yhat.) So we end up with

yhat.yhat / (Sqrt[yhat.yhat] Sqrt[y.y]) = Sqrt[yhat.yhat] / Sqrt[y.y],

so, squaring, we get

yhat.yhat / y.y,

and that we recognize as the R^2, so we’re done: the correlation coefficient between y and yhat is R, the square root of the R^2 of the regression. (In this case, yhat and y are positively correlated, so we don’t have to worry about the sign of R.)
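One last NumPy check (my own synthetic data, three independent variables): the correlation between y and yhat equals the R of the regression.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
Y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=30)

A = np.column_stack([np.ones(30), X])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
yhat = A @ beta
e = Y - yhat
y = Y - Y.mean()

R = np.sqrt(1 - (e @ e) / (y @ y))       # R from the regression's R^2
r_y_yhat = np.corrcoef(Y, yhat)[0, 1]    # correlation between y and yhat
print(R, r_y_yhat)                       # equal
```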
In summary, we have seen four relationships; the first three are general, the fourth holds only for y as a function of one variable x:

- the slope of e as a function of yhat is zero;
- the slope of e as a function of y is 1 – R^2;
- the correlation between y and yhat is R;
- the correlation between x and yhat (for a vector X) is ±1.