(There are more drawings of the distributions under discussion, but they’re at the end of the post. This one, as you might guess from the “NormalDistribution[0,1]”, is a standard normal.)
In a previous post, I listed all but one of the assumptions we usually make when doing a least squares fit – that is, “ordinary least squares”, or curve fitting. In fact, if all we’re trying to get is a plausible fit to some data, we pretty much ignore even that set of assumptions.
Still, to get the usual supporting information, we would assume that the true model is
y = X B + u,
- The matrix X consists of fixed numbers,
- and is of full rank.
- the u are drawn from identical and independent probability distributions
- with a mean of 0 (E(u) = 0), and each with variance σ^2.
We saw that we could figure out the expected value and variance of a whole lot of things of interest, especially for the errors e, the estimated coefficients Bhat, and the fitted values yhat.
Now I want to add one more assumption: the probability distribution from which the u are drawn is normal. Given our earlier assumptions about the mean and variance of u, we have for each u_i,
u_i ~ N(0, σ^2),
or, for the vector as a whole,
u ~ N(0, σ^2 I).
It is this assumption of normality that justifies – as far as possible – our computation of t-statistics in a parameter table, and our use of the F-test, if we wish. (As it happens, I prefer other things to the F-test.)
Before I present the actual theorems, let me describe the results in general terms.
1st of all, any linear combination of independent normal variables is normal. (I will have to go back and correct an earlier assertion that the residuals are not normal – they are, because they are a linear combination of the u_i; they just aren’t independent.)
2nd, the square of a standard normal variable is called a chi-square with one degree of freedom… and the sum of 2 independent chi-squares is again a chi-square.
3rd, speaking very roughly, a standard normal divided by a chi – the square root of a chi-squared – is called a t distribution. (It’s also multiplied by the square root of the degrees of freedom; and the normal and the chi must be independent.)
4th, the ratio of 2 independent chi-squares is called an F distribution. (Again, we actually have to include the degrees of freedom in the division.)
And the “degrees of freedom” associated with the chi-square, the t, and the F distributions all come from the chi-square: sometimes, but far from always, however many squares of normal variables we added up, that’s the degrees of freedom of the chi-square.
Since we then use a chi to get a t distribution, the degrees of freedom of the chi-square carry over to the t. By the same token, since the F distribution is the ratio of 2 chi-squares, the F distribution acquires two degrees of freedom from the 2 chi-squares.
Okay, let me try presenting this in an organized fashion.
The normal distribution
Theorem on the standardized normal:
the standardized normal is normal.
That is, if X ~ N(μ, σ^2), then the standardized variable
Z = (X - μ)/σ
is normal with mean 0 and variance 1:
Z ~ N(0,1).
The point isn’t that N(0,1) is normal – of course it is – but that the result of standardizing any normal variable by subtracting its mean and dividing by the standard deviation – that result is normal.
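Here is a small numerical check of that theorem, sketched in Python's standard library rather than in Mathematica (my substitution; the values mu = 10, sigma = 2 and x = 13 are made up for illustration). If standardizing really gives N(0,1), then the probability below x under N(μ, σ^2) should match the probability below the z-score under the standard normal.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Return the z-score (x - mu) / sigma."""
    return (x - mu) / sigma

# Hypothetical example values, not taken from the post.
mu, sigma, x = 10.0, 2.0, 13.0
z = standardize(x, mu, sigma)

# P(X <= x) for X ~ N(mu, sigma^2) ...
p_original = NormalDist(mu, sigma).cdf(x)
# ... should equal P(Z <= z) for the standard normal Z ~ N(0, 1).
p_standard = NormalDist(0.0, 1.0).cdf(z)
```

Note that NormalDist takes the standard deviation, not the variance, as its second argument.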
Theorem on linear combinations of independent normals:
a linear combination of independent normals is normal.
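That is, if the X_i ~ N(μ_i, σ_i^2) are independent, then the weighted sum
Z = a_1 X_1 + … + a_n X_n
is normal, with mean Σ a_i μ_i and variance Σ a_i^2 σ_i^2:
Z ~ N(Σ a_i μ_i, Σ a_i^2 σ_i^2).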
That is, the mean is the weighted sum of the means and the variance is the square-weighted sum of the variances.
Note that while we require the X_i to be independent, they need not have a common mean or a common variance. This theorem is pretty major.
Consider the special case where we take a very simple linear combination – the unweighted, or equally weighted, sum – of variables drawn from two significantly different normal distributions. Here’s a picture.
Imagine that we draw X from the red distribution, which is N(1,1)… we draw Y from the blue distribution, which is N(5, 0.5)… the theorem says that the sum X + Y (a1 = a2 = 1) is drawn from a normal distribution (black) whose mean is the sum of the means, 1 + 5 = 6, and whose variance is the sum of the variances. (How sly of me to use equal weights of 1, so the variances just add.)
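A rough simulation of that picture, in Python rather than Mathematica. One assumption to flag loudly: I am reading the second parameter of N(5, 0.5) as a standard deviation, as Mathematica's NormalDistribution does; if it is meant as a variance, the numbers change accordingly. Python's NormalDist happens to implement the sum of independent normals directly, so we can compare it with brute-force draws.

```python
import random
import statistics
from statistics import NormalDist

def simulate_sum(mu_x, sd_x, mu_y, sd_y, n_draws=100_000, seed=0):
    """Draw X and Y independently and return the mean and variance of X + Y."""
    rng = random.Random(seed)
    sums = [rng.gauss(mu_x, sd_x) + rng.gauss(mu_y, sd_y) for _ in range(n_draws)]
    return statistics.fmean(sums), statistics.variance(sums)

# Theory: adding two NormalDist objects treats them as independent,
# so the means add and the variances add.
# ASSUMPTION: 0.5 is a standard deviation, not a variance.
theory = NormalDist(1.0, 1.0) + NormalDist(5.0, 0.5)

# Simulation: should land close to theory.mean and theory.variance.
sim_mean, sim_var = simulate_sum(1.0, 1.0, 5.0, 0.5)
```

Under that reading, the theoretical variance is 1 + 0.25 = 1.25, and the simulated values should agree to a couple of decimal places.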
As I said for the general linear combination, while we require the X_i to be independent, they need not have a common mean or a common variance. Nevertheless, one of the first applications of this theorem is when the X_i do have common mean μ and variance σ^2. Suppose we have n independent normal variables X_i, and we form the linear combination
Xbar = (1/n) Σ X_i.
Recognize it? It’s what we call the sample mean.
But it’s a linear combination of the X_i, with a_i = 1/n, so it has mean
Σ (1/n) μ = μ
and variance
Σ (1/n)^2 σ^2 = σ^2/n.
That is, we have just shown (Theorem on the sample mean) that if the population is
X_i ~ N(μ, σ^2)
then the sample mean is
Xbar ~ N(μ, σ^2/n).
That is, the variance of the sample mean gets smaller as we take a larger sample. (We knew that…. But it’s not as obvious as we want to think. By contrast, the variance of the sample spectrum does not decrease as we take more samples.)
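To see the variance of the sample mean shrink like σ^2/n, here is a minimal simulation sketch in Python. The choices sigma = 2, the sample sizes 4 and 16, and the number of repetitions are all mine, picked only for illustration.

```python
import random
import statistics

def var_of_sample_mean(n, mu=0.0, sigma=2.0, reps=20_000, seed=1):
    """Estimate Var(Xbar) by drawing many samples of size n and taking each mean."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.variance(means)

# Theory says sigma^2 / n: 4/4 = 1.0 and 4/16 = 0.25.
var_n4 = var_of_sample_mean(4)
var_n16 = var_of_sample_mean(16)
```

Quadrupling the sample size should cut the variance of the sample mean to about a quarter.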
That theorem on a linear combination of independent normals deals with the formation of one new variable Z. We can generalize to a vector of linear combinations.
Theorem on affine combinations of independent normals:
an affine combination of independent normals is normal.
That is, if X_1, …, X_n is a set of n independent normal variables denoted X, and A is a k x n matrix, and b is a constant vector with k components, then
A X + b
is normal, with mean A E(X) + b and variance A V(X) A'.
Note: that’s what the theorem says. I believe we must require k <= n, and that A be of full rank. To be specific, I cannot imagine that we can construct more than n independent normals (i.e. k > n) starting from n of them; and I know a counterexample for k = n, but A of less than full rank… namely, A = I-H and the construction of residuals from errors…
e = (I-H) u.
Each e_i is normal, but the n of them are not independent. In fact, the vector e – in contrast to its components – is said to have a singular normal distribution, rather than a normal.
We've already seen the equations for expectation and variance… what's new is that the result is normal.
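The simplest instance I can think of for that counterexample is an intercept-only regression, where X is a single column of ones and H = (1/n) 11'. That special case is my choice for illustration, not something worked in the post. Then e = (I-H) u is just u minus its mean, and the residuals obey an exact linear constraint, which is why they cannot be n independent normals. A sketch in Python:

```python
import random

def residuals_intercept_only(u):
    """Residuals e = (I - H) u when X is one column of ones, so H = (1/n) 11'."""
    ubar = sum(u) / len(u)
    return [ui - ubar for ui in u]

rng = random.Random(2)
u = [rng.gauss(0.0, 1.0) for _ in range(5)]   # five independent N(0,1) disturbances
e = residuals_intercept_only(u)

# The residuals satisfy one exact linear constraint: they sum to zero
# (up to rounding), so any one of them is determined by the other four.
constraint = sum(e)
last_from_others = -sum(e[:-1])
```

With k columns in X there are k such constraints, which is where the n-k will come from later.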
So much for normal variables. Well, we had theorems on
- the standardized normal
- linear combinations of independent normals
- the sample mean
- affine combinations of independent normals
Now let's look at the sums of squares of normal variables.
The chi-square distribution
Definition of the square of a normal variable:
the square of a standard normal variable has a distribution called a chi-square with one degree of freedom. That is, if
Z ~ N(0,1)
then
Z^2 ~ χ^2(1).
That takes care of one… how about more?
Definition of the sum of independent chi-squares:
the sum of n independent variables, each being χ^2(1), is χ^2(n).
This immediately gives us a decomposition. However we end up with a χ^2(n), we can view it as the sum of two independent chi-square variables with degrees of freedom that add up to n, n1 + n2 = n.
Theorem on the sum of two independent chi-squares:
The sum of two independent chi-squares is chi-square, and the degrees of freedom add. That is, if
X1 ~ χ^2(n1) and X2 ~ χ^2(n2) are independent, then
X1 + X2 ~ χ^2(n1 + n2),
and similarly for a sum of more than 2. That is, the sum of any finite number of independent chi-squares is itself chi-square, and the degrees of freedom add.
These 2 theorems are often stated as one, which sounds a bit different at 1st. For most applications, we want this alternate statement, or something like it.
Theorem on the sum of squares of independent standard normals:
the sum of n squares of independent standard normals is a chi-square with n degrees of freedom.
Note that we are squaring and summing standard normals.
You see how these theorems are related?
In the theorem on the sum of squares of independent standard normals, set n = 1 and it says that the square of a standard normal is χ^2(1). Good: that matches the definition of χ^2(1).
2nd, if I have a sum of n squares of independent standard normals, I could split it into the sum of n1 and n2 squares, for any n1 and n2 that add up to n – which must then be χ^2(n1) and χ^2(n2) respectively.
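A quick way to convince ourselves, sketched in Python: add up n squared independent standard normals many times and look at the moments. I am leaning on two standard facts that the post does not state, namely that a chi-square with n degrees of freedom has mean n and variance 2n. The choice n = 5 and the number of repetitions are arbitrary.

```python
import random
import statistics

def sum_of_squares_draws(n, reps=40_000, seed=3):
    """Each draw is the sum of n squared independent N(0,1) variables."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(reps)]

draws = sum_of_squares_draws(5)
sim_mean = statistics.fmean(draws)     # theory for chi-square(5): mean 5
sim_var = statistics.variance(draws)   # theory for chi-square(5): variance 10
```

This checks only the first two moments, not the whole distribution, but it is enough to see the degrees of freedom at work.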
Now for another similar theorem. What if we use the sample mean Ybar, and a common variance σ^2? I would let
Z_i = (Y_i - Ybar)/σ,
and then the sum of the Z_i^2 is χ^2(n-1) rather than χ^2(n).
That is, if we compute the squares by using the sample mean Ybar instead of the unknown population mean μ, we lose one degree of freedom. We are adding up n squares, but we only have n-1 DOF. That is,
Theorem on the sum of squares using the sample mean:
if Yi are independent normal variables with unknown mean μ but variance σ^2, and we create standardized variables using the sample mean Ybar,
Z_i = (Y_i - Ybar)/σ,
then
Σ Z_i^2 ~ χ^2(n-1).
We could also write that result as:
Σ (Y_i - Ybar)^2 / σ^2 ~ χ^2(n-1).
That's my favorite way to use it, with the unknown variance explicitly shown.
But you may find it in yet another form, and I will often end with it in this form. If we recall that the sample variance
s^2 = Σ (Y_i - Ybar)^2 / (n-1)
is an unbiased estimate of the population variance, then what we have could be written in terms of that sample variance as
(n-1) s^2 / σ^2 ~ χ^2(n-1).
Let me hold up here. First, the variance of the sample mean is not the sample variance.
Given the observations Y_i, I can compute the sample variance s^2 of the Y_i, and we learned in our first statistics course that we needed to divide the sum of squares by n-1 rather than by n (to get an unbiased estimate of the population variance σ^2).
The variance of the population (the underlying normal distribution) is σ^2, and our estimate of σ^2 is the sample variance s^2 of the Y_i.
The variance of the sample mean, as we saw earlier, is σ^2/n. Our estimate of it would be s^2/n.
Second, we will find that it is extremely useful to separate out that common factor of σ^2. Let me restate those two theorems in the forms we use in practice – and add another one.
Practical theorem on SSQ using the population mean:
With SSQ = Σ (Y_i - μ)^2, n observations, and common population variance σ^2, we have
SSQ / σ^2 ~ χ^2(n).
Practical theorem on SSQ using the sample mean:
With SSQ = Σ (Y_i - Ybar)^2, n observations, and common population variance σ^2, we have
SSQ / σ^2 ~ χ^2(n-1).
Practical theorem on SSQ of regression residuals:
With SSQ = Σ e_i^2, n observations, k variables in the regression, and common error variance σ^2, we have
SSQ / σ^2 ~ χ^2(n-k).
We see that the degrees of freedom may be n, n-1, or n-k in these three distinct cases. I should probably emphasize that the first of the three corresponds to the theorem on the sum of squares of independent normals… but the other two deal with the sums of squares of not-entirely-independent normals.
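The first two cases are easy to compare by simulation. Since a chi-square has mean equal to its degrees of freedom, the average SSQ about the population mean should be near n σ^2, and the average SSQ about the sample mean should be near (n-1) σ^2. Here is a sketch in Python; n = 6, mu = 3 and sigma = 1.5 are illustrative values of my own.

```python
import random
import statistics

def mean_ssq(n, mu, sigma, reps=40_000, seed=4):
    """Average SSQ about the true mean and about the sample mean, over many samples."""
    rng = random.Random(seed)
    total_pop = 0.0
    total_samp = 0.0
    for _ in range(reps):
        y = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = statistics.fmean(y)
        total_pop += sum((v - mu) ** 2 for v in y)      # uses the population mean
        total_samp += sum((v - ybar) ** 2 for v in y)   # uses the sample mean
    return total_pop / reps, total_samp / reps

# Theory: n * sigma^2 = 6 * 2.25 = 13.5 and (n - 1) * sigma^2 = 5 * 2.25 = 11.25.
avg_pop, avg_samp = mean_ssq(n=6, mu=3.0, sigma=1.5)
```

The gap between the two averages is one σ^2, which is the lost degree of freedom, and it is also why dividing by n-1 gives an unbiased sample variance.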
Worse, in the third case, the residuals do not have a common variance. As we worked out in this post, the ith residual e_i has variance σ^2 (1 - h_ii), where h_ii is the ith diagonal entry of the hat matrix H = X (X'X)^-1 X', so the third case somehow eliminates the factors of (1 - h_ii) from the result.
Let me confess that I do not fully understand the proof – I can follow it, but I don’t grok it – but it hinges on a change of variables that replaces the hat matrix by a diagonal matrix with n-k 1s and k 0s on the diagonal… and that gets us the result, by transforming to equal-variance variates.
Let me remind you of the earlier discussion, after the theorem on affine transformations: the residuals are the result of applying a symmetric projection operator I-H to the u… the projection operator is of rank n-k… and the sum of squares of the normalized residuals is chi-square(n-k). I am quoting a theorem which says that is true in general – for a symmetric projection operator.
So we had a couple of definitions and several theorems, but some are variants of others.
- the square of a normal variable
- the sum of independent chi-squares
- the sum of squares of independent standard normals
- the sum of squares using the sample mean
- SSQ using the population mean
- SSQ using the sample mean
- SSQ of regression residuals
The t distribution and the F distribution
Now we get to the really good stuff. I want to start using ν for the DOF (degrees of freedom) of a χ^2, since I may need to replace the DOF by any of n, or n-1, or n-k.
Definition of the t-distribution:
the scaled ratio of a standard normal and an independent χ^2 has a distribution called t with ν degrees of freedom. That is, if
Z ~ N(0,1)
and
χ^2 ~ χ^2(ν)
(and the two variables are independent) then
t = Z / sqrt(χ^2/ν) ~ t(ν).
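I call it “scaled” because of that factor of 1/ν (one over the degrees of freedom) inside the square root. A chi-square divided by its degrees of freedom is, I think, sometimes called a reduced chi-square.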
We might also take the square root of the chi-square, χ = sqrt(χ^2), and write
t = sqrt(ν) Z / χ.
Let me remind us that ν is the degrees of freedom of the χ^2, not necessarily the number of squared normals that were summed to get it.
While I'm presenting theory, let me give you the (second-to-)last piece.
Definition of the ratio of two independent chi-squares:
If we have two independent variables distributed as χ^2, namely X1 and X2 with n1 and n2 degrees of freedom respectively, then the ratio of ratios
(X1/n1) / (X2/n2)
is said to have an F-distribution with (n1, n2) degrees of freedom.
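Both definitions are constructive, so we can build t and F variables straight from independent normals. The sketch below is Python, and it checks two standard facts that are not stated in the post: a t with ν > 2 degrees of freedom has variance ν/(ν-2), and an F(n1, n2) with n2 > 2 has mean n2/(n2-2). The particular degrees of freedom (10, and (4, 12)) are arbitrary choices of mine.

```python
import math
import random
import statistics

def chi_square(rng, dof):
    """Sum of dof squared independent N(0,1) draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))

def draw_t(rng, dof):
    """Standard normal over the square root of an independent chi-square / dof."""
    z = rng.gauss(0.0, 1.0)
    return z / math.sqrt(chi_square(rng, dof) / dof)

def draw_f(rng, n1, n2):
    """Ratio of two independent chi-squares, each divided by its own dof."""
    return (chi_square(rng, n1) / n1) / (chi_square(rng, n2) / n2)

rng = random.Random(5)
t_draws = [draw_t(rng, 10) for _ in range(40_000)]
f_draws = [draw_f(rng, 4, 12) for _ in range(40_000)]

t_var = statistics.variance(t_draws)   # theory for t(10): 10/8 = 1.25
f_mean = statistics.fmean(f_draws)     # theory for F(4,12): 12/10 = 1.2
```

The numerator and denominator are independent here by construction, because each uses fresh draws; that is exactly the condition the definitions insist on.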
Using the t distribution
Let's back up and apply the practical theorem on the sample mean to the sample mean itself. If we have independent normal X_i with common population mean and variance,
X_i ~ N(μ, σ^2),
then the sample mean Xbar is normal,
Xbar ~ N(μ, σ^2/n),
and so is the standardized sample mean, which I will call Z:
Z = (Xbar - μ) / (σ/sqrt(n)) ~ N(0,1).
Don't misunderstand. I can't actually compute Xbar - μ, because I don't know μ.
And what will we do for the sum of squares? We subtract the sample mean from each X_i:
Z_i = (X_i - Xbar)/σ
and compute Σ Z_i^2.
(Yes, I used Z for one thing and Z_i for another. Z is not the vector of Zi, in this case.)
Then our practical theorem on the SSQ using the sample mean says that
Σ Z_i^2 = Σ (X_i - Xbar)^2 / σ^2 ~ χ^2(n-1).
Anyway, we now have a standardized normal and an independent (I didn’t prove that) chi-square. Here's where I had to learn to be careful of the algebra. It's the sum of squares Σ Z_i^2 that is χ^2. So we form the scaled ratio
t = Z / sqrt(χ^2/ν)
and we substitute for Z:
t = [(Xbar - μ) / (σ/sqrt(n))] / sqrt(χ^2/ν)
then we substitute for χ^2, and the unknown σ cancels:
t = sqrt(n) (Xbar - μ) / sqrt(Σ (X_i - Xbar)^2 / ν).
Finally, we substitute for ν = n-1:
t = sqrt(n) (Xbar - μ) / sqrt(Σ (X_i - Xbar)^2 / (n-1))
and at last I could choose to replace 1/sqrt(Σ (X_i - Xbar)^2 / (n-1)) by 1/s, where s is the sample standard deviation:
t = (Xbar - μ) / (s/sqrt(n)).
We would use this statistic to compute a confidence interval for μ given Xbar.
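As a concrete instance, here is that statistic computed for a small made-up sample, in Python. The data and the hypothesized mean mu0 = 4 are invented for illustration. The standard library has no t quantiles, so this stops at the statistic; a confidence interval would still need a critical value from a table or from other software.

```python
import math
import statistics

def t_statistic(data, mu0):
    """t = (Xbar - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(data)
    xbar = statistics.fmean(data)
    s = statistics.stdev(data)          # divides the sum of squares by n - 1
    return (xbar - mu0) / (s / math.sqrt(n))

# Hypothetical sample: mean 5, sum of squared deviations 32, so s^2 = 32/7.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
t_value = t_statistic(sample, mu0=4.0)
```

By hand, s^2/n = 32/56 = 4/7, so the statistic is 1/sqrt(4/7) = sqrt(1.75), to be compared against a t with n-1 = 7 degrees of freedom.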
Let's try another application – to the computed coefficients in a regression.
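We want to test the hypothesis that B_i = 0, that is, that the true value of a particular coefficient B_i is zero.
Well, when we worked out the expected value of the estimated coefficients Bhat, we found that
Bhat = B + C u, with C = (X'X)^-1 X',
and then our theorem on affine combinations of normals says that Bhat is normal. We also computed that V(Bhat) = σ^2 CC', so the variance of Bhat_i is σ^2 times the ith diagonal of CC', call it Cii. (We knew the variance last time; what we did not know was that the Bhat_i were normal.)
Now, our hypothesis is that B_i = 0, so consider the standardized Bhat_i:
Z = Bhat_i / (σ sqrt(Cii)) ~ N(0,1).
We also know that with SSQ = Σ e_i^2, i.e. with SSQ equal to the sum of squares of the residuals,
χ^2 = SSQ / σ^2 ~ χ^2(n-k),
so we construct
t = Z / sqrt(χ^2/ν).
Substitute for the degrees of freedom ν = n-k:
t = Z / sqrt(χ^2/(n-k)).
Now substitute for Z and χ^2, and the unknown σ cancels:
t = Bhat_i / (sqrt(Cii) sqrt(SSQ/(n-k))).
Then, finally, replace sqrt(SSQ/(n-k)) by s, the estimated standard deviation of the u:
t = Bhat_i / (s sqrt(Cii)).
And that, of course, is exactly how we compute the t-statistics that show up in the parameter table of a regression: divide each Bhat_i by its standard error, which is s sqrt(Cii).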
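To make the recipe concrete, here is a sketch for the smallest interesting case, a regression on a constant and one variable, written in plain Python so that the 2 x 2 inverse of X'X can be done by hand. The four data points are made up; the names c00, c01 and c11 are my labels for the entries of (X'X)^-1, which equals CC'.

```python
import math

def simple_regression_t(x, y):
    """Fit y = b0 + b1*x by least squares; return the coefficients and their t-statistics."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    # Entries of (X'X)^-1 for X = [1, x]; the diagonal gives the Cii.
    c00, c01, c11 = sxx / det, -sx / det, n / det
    b0 = c00 * sy + c01 * sxy
    b1 = c01 * sy + c11 * sxy
    # Error sum of squares and s, using n - k degrees of freedom with k = 2.
    ssq = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ssq / (n - 2))
    t0 = b0 / (s * math.sqrt(c00))
    t1 = b1 / (s * math.sqrt(c11))
    return (b0, b1), (t0, t1)

# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
coefs, t_stats = simple_regression_t([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 2.0, 5.0])
```

For these numbers (X'X)^-1 has diagonal 0.7 and 0.2, both coefficients come out as 1.1, and s^2 = 2.7/2 = 1.35, so the standard errors are sqrt(0.945) and sqrt(0.27).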
That's pretty much it for the use of the t-distribution. Although I never use the F-test, now is the time for me to look at the usual one….
Using the F distribution
Well, I've looked… let me illustrate some things using the Hald data. I set the file path…
I set my usual informative (not!) names… and I might as well display the data matrix…
Let's do stepwise… and look at the 3-variable regression… For the first time, I ask for the ANOVATable in addition to the ParameterTable:
Those "SS" for X4, X1, and X2 are called the sequential sums of squares. We can get them directly from… and we can add them together to get something called the regression sum of squares…
Each of those 3 is presumably a χ^2 with one degree of freedom, and each is presumably independent of the others. Their sum, then, is a χ^2 with 3 degrees of freedom.
We can extract the error sum of squares from the list of all the SSQ:
It is, as before, χ^2 with n-k = 13 – 4 = 9 degrees of freedom. Our theorem on the F-distribution says that the regression sum of squares (divided by its DOF) divided by the error sum of squares (divided by its DOF)…
is F(3,9). That matches what Draper & Smith compute for the F-test on that regression. (It tests the hypothesis that all of the coefficients are zero.)
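The arithmetic of that ratio is simple enough to write down as a function. The sketch below is Python, and the sums of squares in it are hypothetical round numbers, NOT the Hald values, which are not reproduced here; only the degrees of freedom (3 and 9) match the example above.

```python
def f_statistic(regression_ss, regression_dof, error_ss, error_dof):
    """(regression SS / its DOF) divided by (error SS / its DOF)."""
    return (regression_ss / regression_dof) / (error_ss / error_dof)

# Hypothetical sums of squares for illustration only (not the Hald data).
f_value = f_statistic(regression_ss=30.0, regression_dof=3,
                      error_ss=9.0, error_dof=9)
```

Under the hypothesis that all the coefficients are zero, the common factor of σ^2 cancels between numerator and denominator, which is why the unknown variance never appears.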
As far as I can tell, LinearModelFit does not supply either the regression sum of squares or that F-test; we have to compute them ourselves.
Oh, the Total SSQ is, indeed, the sum of the Sequential SSQ and the Error SSQ. We have that the sum of the first four SSQ
is equal to the fifth, which was labeled "TOTAL".
There is something important to see here. I deliberately used stepwise regression because I knew that in this case it would not give me the variables in lexicographic order. Let's look at such a regression, which specifies the variables in named order:
We get the same fit as reg, but the variables and parameters are in different order. Here are the two parameter tables:
The only difference between the two tables is the order in which the variables are listed. But now let's look at the two ANOVA tables:
Whoa! The Error SSQ and the Total SSQ are the same – but the individual sequential SSQ are different.
No, I don't understand the sequential SSQ – but I had read and have now confirmed that they depend on the order of the variables. ("Sequential" is a good name for them, but that makes me dislike them, in the sense that they're not about the variable they're assigned to – they're about the sequence of variables. Perhaps I should not be so straight-laced.)
Oh, since total SSQ and error SSQ are the same, the two regression SSQ are the same. The three individual sequential sums of squares differ, but they have the same sum. Here are the Regression SSQ for the two regressions:
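One way to see why the order matters, sketched in Python with tiny made-up data (not the Hald data): each sequential SSQ is the extra sum of squares explained by a variable after it has been made orthogonal to the constant and to every variable entered before it. I am using Gram-Schmidt orthogonalization to compute that; it is my choice of method and may not be how Mathematica does it.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sequential_ss(columns, y):
    """Sequential sums of squares for y on a constant, then each column in the given order."""
    n = len(y)
    basis = [[1.0] * n]                  # start with the intercept column
    result = []
    for col in columns:
        q = [float(v) for v in col]
        for b in basis:                  # remove what earlier columns already span
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        result.append(dot(y, q) ** 2 / dot(q, q))   # SS explained by the new direction
        basis.append(q)
    return result

# Hypothetical data with two correlated explanatory variables.
x1 = [1, 2, 3, 4, 5]
x2 = [1, 1, 2, 3, 5]
y = [2, 3, 5, 4, 8]
ss_12 = sequential_ss([x1, x2], y)   # x1 entered first
ss_21 = sequential_ss([x2, x1], y)   # x2 entered first
```

The individual entries differ between the two orders because whichever correlated variable goes first gets credit for the shared part, while the totals agree because both orders span the same space.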
And that's where I'm going to put this down. I've shown us uses of the t- and F-tests, and of the χ^2 distribution. I've given us a summary of the relevant distribution theory.
Let me suggest that I have probably neglected to throw in the word "independent" as often as I should have. To add up the squares of n standard normal variables and get a chi-square with n DOF, the n normals must have been independent. Then the theorem on the t-distribution requires that the standard normal and the chi be independent. To apply the theory to the Bhat_i, we need to prove that each normally distributed Bhat_i is independent of the error sum of squares. (No, I'm not going to do that. But you should expect your books to either prove it or assign it as an exercise.)
The practical theorem on SSQ using the sample mean said that we added up the squares of n normals but got chi-square with n-1 DOF. We conclude that they were not all independent. But the theorem on the t-distribution requires that Xbar - μ and the chi be independent; again, we can prove that they are, but I won't.
The practical theorem on SSQ for the residuals of a regression said that we added up the squares of n normals but got chi-square with n-k DOF. We conclude that not all n were independent. In fact, as I said earlier, the residuals of a regression are individually normal but collectively said to have a singular normal distribution. We did not construct a t-statistic using the residuals.
Finally, to construct either a t-test or an F-test, by taking the scaled ratio of two variables, the two variables – numerator and denominator – must be independent.
While we're on the subject…
Theorem on independence and correlation:
if two random variables are independent, then their covariance is zero; the converse is not true in general, but if two jointly normal variables have zero covariance, then they are independent.
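The standard counterexample for the general converse is easy to simulate. Take X standard normal and Y = X^2: Y is completely determined by X, yet their covariance is E(X^3) = 0. This example is mine, not the post's, and it does not contradict the theorem, because Y is not normal. A sketch in Python:

```python
import random
import statistics

rng = random.Random(6)
x = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
y = [v * v for v in x]   # a function of x, so certainly not independent of it

# Sample covariance, computed by hand to avoid depending on newer library functions.
xbar, ybar = statistics.fmean(x), statistics.fmean(y)
cov_xy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (len(x) - 1)
```

The sample covariance should be close to zero even though knowing x tells us y exactly.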
Since pictures of distributions are easy….
Now let me throw some pictures at you. Having used a standard “Plot” command to get the image that opened this post, I will revert to using Dave Park’s Presentations package, as I do for all my graphics. (Strictly speaking, I’m using a preliminary version.)
Here’s a standard normal, as I would create it:
Here’s a t-distribution with 1 degree of freedom…
Here they are overlaid. We see that the t has fatter tails… more of the probability is further from the center.
Here is a selection of t-distributions… as the DOF rises, the t distribution approaches the standard normal. (I’ll let you overlay one yourself.)
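For readers without Mathematica, the same comparison can be made numerically. The sketch below is Python: the normal density comes from the standard library, and the t density is the usual textbook formula, which I have written out myself (using log-gamma so that large degrees of freedom do not overflow). The evaluation points are arbitrary.

```python
import math
from statistics import NormalDist

def t_pdf(x, dof):
    """Density of Student's t with dof degrees of freedom, via log-gamma for stability."""
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(x * x / dof))

normal_pdf = NormalDist(0.0, 1.0).pdf

# Fatter tails: at x = 3 the t with 1 DOF has several times the normal's density.
tail_ratio = t_pdf(3.0, 1) / normal_pdf(3.0)

# Convergence: with 1000 DOF the t density is very close to the standard normal.
gap_at_1000_dof = abs(t_pdf(1.0, 1000) - normal_pdf(1.0))
```

With 1 degree of freedom the formula reduces to the Cauchy density 1/(π(1 + x^2)), which is a handy check at x = 0.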
Here is a selection of chi-square distributions, with DOF from 2 to 6… “2” is the sharp peaked one that reaches .5 .
Here are four F-distributions: F(2,2), F(2,3), F(3,2), F(3,3)…
And finally, here are F(n,1) for n = 2,…6 .
I suspect it goes without saying – but obviously I’m going to say it – there’s a whole lot more you can do computationally with Mathematica’s probability density functions, more than just draw them.