Regression 1: More Properties

Having mentioned some of the most important properties in the previous regression post, let me now discuss the properties in a more organized fashion. And, as I said before, even if you’re not using Mathematica® to do regression, you will want to find out how to get many of these properties, whatever your software calls them, or how to compute them yourself if you must.

Let me begin by following exactly the computational sequence I laid out here as though we were trying to write code ourselves. If the following discussion seems a little too brief for you, you may want to look at that link, henceforth referred to as “the introductory post”.

I should be explicit about something: I figure that if I can’t compute something myself, then I do not understand it. Even if I can compute it, I may not understand it, but first of all, I must be able to calculate it. And that’s what I’m about to do with the regression properties: show how they are computed, and confirm that Mathematica and I get the same answers.

Even before we begin with the properties of a fitted equation, let’s look at the inputs.

Among the properties of a regression are Data, Response, and DesignMatrix. This means that regardless of what we think we put into the command, we can see after the fact what the regression actually used, just by asking for those 3 properties. On the other hand, recall that I named each of the variables in order to plot them; I don’t need to recover the “Response”, because I already have y defined separately from the data matrix d1.

The property “Data” is precisely the d1 matrix we supplied in the command… The “DesignMatrix” is precisely the X required by the normal equations… “Response” is the dependent variable y. Trust me on “Response”, but let me display the DesignMatrix:

reg[“DesignMatrix”]//MatrixForm =

\left(\begin{array}{cccc} 1. & 7. & 26. & 60. \\ 1. & 1. & 29. & 52. \\ 1. & 11. & 56. & 20. \\ 1. & 11. & 31. & 47. \\ 1. & 7. & 52. & 33. \\ 1. & 11. & 55. & 22. \\ 1. & 3. & 71. & 6. \\ 1. & 1. & 31. & 44. \\ 1. & 2. & 54. & 22. \\ 1. & 21. & 47. & 26. \\ 1. & 1. & 40. & 34. \\ 1. & 11. & 66. & 12. \\ 1. & 10. & 68. & 12.\end{array}\right)

Comparing it to the data matrix,

reg[“Data”]//MatrixForm =

\left(\begin{array}{ccccc} 7 & 26 & 6 & 60 & 78.5 \\ 1 & 29 & 15 & 52 & 74.3 \\ 11 & 56 & 8 & 20 & 104.3 \\ 11 & 31 & 8 & 47 & 87.6 \\ 7 & 52 & 6 & 33 & 95.9 \\ 11 & 55 & 9 & 22 & 109.2 \\ 3 & 71 & 17 & 6 & 102.7 \\ 1 & 31 & 22 & 44 & 72.5 \\ 2 & 54 & 18 & 22 & 93.1 \\ 21 & 47 & 4 & 26 & 115.9 \\ 1 & 40 & 23 & 34 & 83.8 \\ 11 & 66 & 9 & 12 & 113.3 \\ 10 & 68 & 8 & 12 & 109.4\end{array}\right)

we see that the DesignMatrix has dropped the last column (the dependent variable) and the third column X3 from the data matrix, and it has added the column of 1s for the constant term. Just as we would expect.
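By the way, here is a minimal sketch (not the original command from the introductory post) of how such a fit might be set up, assuming d1 is the 13×5 Hald data matrix shown above and that X1 … X4 are free symbols rather than data lists; listing only {X1, X2, X4} as basis functions is exactly what drops X3 from the DesignMatrix.

reg = LinearModelFit[d1, {X1, X2, X4}, {X1, X2, X3, X4}];  (* constant term included by default *)
reg["DesignMatrix"] // MatrixForm
reg["Data"] // MatrixForm
reg["Response"]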

The following notation is common but not universal: n observations, k parameters (including the constant).

{n, k} = Dimensions[reg[“DesignMatrix”]] = {13, 4}

(As I have done before, I often display the Mathematica command – in this case {n,k} = Dimensions[reg[“DesignMatrix”]] – combined with the output which appeared in the next cell – in this case, {13,4}.)

Our immediate goal is to compute the \betas, and then the yhats. We know the normal equations for \beta:

\beta = (X'X)^{-1}\ X'\ y\ .

In order to compute \beta, we need the matrices

(X'X)^{-1}

and then

(X'X)^{-1}\ X'\ .

(Is it too late to remind you that X’ is the transpose of X?)

If we let

C = (X'X)^{-1}\ X'\ ,

then we could rewrite the normal equations as

\beta = C\ y\ .

If, now, we post-multiply the definition of C by X, we see that CX = I, so CX is an identity matrix. In other words, C is a pseudo-inverse of X. (We have talked about pseudo-inverses before, especially in this color post.)

C is called — I have no idea why — the CatcherMatrix. It’s 4×13, so I’m going to display its transpose:

cm = reg[“CatcherMatrix”];  (* the symbol C is Protected in Mathematica, so I use cm for it *)
(CT = Transpose[cm]) // MatrixForm

C^T = \left(\begin{array}{cccc} -2.70897 & 0.00638718 & 0.0325943 & 0.0389565 \\ -0.301897 & -0.0115814 & 0.00396992 & 0.00913561 \\ 0.645335 & 0.00664865 & -0.00725193 & -0.00896044 \\ 0.593987 & 0.012522 & -0.0105539 & -0.00340957 \\ -3.14611 & 0.0006999 & 0.0423596 & 0.0392678 \\ 0.225413 & 0.00724223 & -0.00199076 & -0.00355552 \\ -0.00457347 & -0.0165621 & 0.00613753 & -0.00301569 \\ 2.35477 & -0.0143279 & -0.0300013 & -0.0242085 \\ 1.07225 & -0.0160562 & -0.0101287 & -0.0129262 \\ 1.91833 & 0.0329933 & -0.0283874 & -0.0240208 \\ 2.50043 & -0.0165514 & -0.030454 & -0.0277843 \\ -0.605898 & 0.00539082 & 0.0104831 & 0.00459322 \\ -1.54306 & 0.00319493 & 0.0232236 & 0.0159278\end{array}\right)
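If you’d like to verify the pseudo-inverse claim numerically, here is a minimal sketch; X is my own label for the design matrix, and each Chop’d norm should come out 0 up to rounding.

X = reg["DesignMatrix"];
Chop[Norm[Inverse[Transpose[X].X].Transpose[X] - reg["CatcherMatrix"]]]   (* 0: C really is (X'X)^-1 X' *)
Chop[Norm[reg["CatcherMatrix"].X - IdentityMatrix[4]]]                    (* 0: CX is the 4x4 identity *)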

Next, if we start with the normal equations

\beta = (X'X)^{-1}\ X'\ y

and pre-multiply by X, we get

X\beta = X\ (X'X)^{-1}\ X'\ y

but X\beta = \hat{y}\ , so

\hat{y} = X\ (X'X)^{-1}\ X'\ y\ ,

and then we define that whole mess as the hat matrix H,

H = X\ (X'X)^{-1}\ X'

and get

yhat = H y.

(We have discussed the hat matrix before, most notably in this Malinowski PCA post.) The matrix H is a projection operator (H^2 = H): it projects the y vector from (in this case) a 13-dimensional space onto the 4-dimensional subspace spanned by the vectors 1, x1, x2, x4, i.e. onto the column space of the design matrix X. That’s what it means to fit the equation.

The hat matrix is not a supplied property of the regression — but its diagonal (HatDiagonal) is:

reg[“HatDiagonal”] = {0.520579, 0.276701, 0.133154, 0.244309, 0.357325, 0.117365, 0.363411, 0.345222, 0.20881, 0.65244, 0.321051, 0.200403, 0.259231}

If we compute the hat matrix — it’s 13×13 so I’m not going to display it — and extract its diagonal entries, those are the numbers we get. I will show you in a subsequent post why we want the HatDiagonal.
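Here is a minimal sketch of that computation, using the X defined in the sketch above; h is my own label for the hat matrix.

h = X.Inverse[Transpose[X].X].Transpose[X];
Chop[Norm[h.h - h]]                            (* 0 up to rounding: H is idempotent *)
Chop[Norm[Diagonal[h] - reg["HatDiagonal"]]]   (* 0 up to rounding *)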

With those distractions out of the way, we get the \beta from the normal equations; they are called “BestFitParameters”:

reg[“BestFitParameters”] = {71.6483, 1.45194, 0.41611, -0.23654}

I should point out that “BestFit” and “Function” provide the \beta in different forms:

reg[“Function”] = 71.6483 + 1.45194 #1 + 0.41611 #2 - 0.23654 #4&

reg[“BestFit”] = 71.6483 + 1.45194 X1 + 0.41611 X2 - 0.23654 X4

We could compute \hat{y} = X\beta\ , and then from y = X \beta + e\ , we get the residuals

e = y - yhat.

As we saw in the previous post, the yhat are called “PredictedResponse” and the residuals are called “FitResiduals”.

yhat=reg[“PredictedResponse”] = {78.4383, 72.8673, 106.191, 89.4016, 95.6438, 105.302, 104.129, 75.5919, 91.8182, 115.546, 81.7023, 112.244, 111.625}

e=reg[“FitResiduals”] = {0.0616864, 1.43266, -1.89097, -1.80164, 0.256247, 3.89822, -1.42867, -3.09188, 1.28177, 0.353883, 2.09773, 1.05561, -2.22467}

We could confirm that y = yhat + e…. The input

reg[“Response”] == reg[“PredictedResponse”] + reg[“FitResiduals”]

returns “True”.

Equivalently, I could have typed

y == yhat + e

and would also have gotten “True”.
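For completeness, here is a minimal sketch of the whole chain done by hand, using the X from the earlier sketch and the y defined at the start; b is my own label for the fitted coefficients.

b = Inverse[Transpose[X].X].Transpose[X].y;    (* the betas, from the normal equations *)
Chop[Norm[b - reg["BestFitParameters"]]]       (* 0 up to rounding *)
Chop[Norm[X.b - reg["PredictedResponse"]]]     (* 0 up to rounding: yhat by hand *)
Chop[Norm[(y - X.b) - reg["FitResiduals"]]]    (* 0 up to rounding: residuals by hand *)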

Having the residuals, we should compute the error sum of squares and the estimated variance of the disturbances. The error sum of squares is just the dot product of e with itself…

ESS = e.e = 47.9727

but we have to divide by n-k to get the estimated variance, where n is the number of observations and k is the number of parameters. (I defined them earlier, as the dimensions of the DesignMatrix.)

{n,k} = {13,4}

So: ESS/(n-k) = 5.3303

… and that is the “EstimatedVariance“:

reg[“EstimatedVariance”] = 5.3303

I want its square root (the associated standard deviation), and I will call it s:
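In code, that’s just one line (s is my own label):

s = Sqrt[reg["EstimatedVariance"]]   (* roughly 2.309 *)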

Now we should get the total sum of squares and compute the R2 (R-squared) and the adjusted R2:

TSS = (y - \bar{y})\ \cdot\ (y - \bar{y})\ ,

where \bar{y}\ is the mean of y.

(I have just noticed that I’m now using ESS and TSS to symbolize the error and total sum of squares, whereas I used SSE and SST in the introductory post. Well, the notation just isn’t standardized, and as with so many things, I’m influenced by what I’ve seen recently.)

Let me remark that the total sum of squares is part of the calculation of the variance of y; if we divide TSS by (n-1) we get the (unbiased estimate of the) sample variance of y.

Anyway, the TSS is… then get 1 – ESS/TSS… and then get the adjusted R2:
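Since those displays aren’t reproduced here, here is a minimal sketch of the three steps, using the ESS, n, and k defined earlier; TSS is my own label.

TSS = (y - Mean[y]).(y - Mean[y]);
1 - ESS/TSS                         (* R^2 *)
1 - (ESS/(n - k))/(TSS/(n - 1))     (* adjusted R^2 *)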

Sure enough, the last two values are the properties RSquared and AdjustedRSquared:

reg[“RSquared”] = 0.982335

reg[“AdjustedRSquared”] = 0.976447

You should note that the R^2 is 1 iff the residuals are all zero — i.e. we have a perfect fit. Something like the Adjusted R^2 is necessary, however, if we want to compare regressions… because the R^2 cannot decrease when we add a variable. Don’t misunderstand me: if we have a perfect fit, I want to know about it — but R^2 can’t very well help me assess the tradeoff between improving the fit and adding more variables.

To put that another way, R^2 by itself is an unambiguous measure of how good a fit is… and which one of several is best… but we often want a measure which balances the cost and the improvement that come with increased complexity – hence the definition of a whole slew of measures of which the adjusted R^2 was probably the first.

The specific definition of the Adjusted R^2 is such that it can be written and understood as 1 minus the ratio of two unbiased variance estimates:

\text{Adjusted } R^2 = 1 - \frac{ESS/(n-k)}{TSS/(n-1)}\ ,

that is, 1 minus the estimated variance of the disturbances divided by the (unbiased) sample variance of y.

Now let’s get the covariance matrix of the \beta, also called the parameter covariance matrix:

s^2\ (X'X)^{-1}\ .

It’s called simply the covariance matrix — but we have to understand that it is not the covariance matrix of the data, but of the \beta\ :

\left(\begin{array}{cccc} 200.007 & -0.212234 & -2.60378 & -2.42105 \\ -0.212234 & 0.0136884 & 0.000991841 & 0.00207788 \\ -2.60378 & 0.000991841 & 0.0344513 & 0.0312474 \\ -2.42105 & 0.00207788 & 0.0312474 & 0.0300287\end{array}\right)

Recall that the square roots of its diagonal elements are the “standard errors” of the \beta, and that we get t-statistics by dividing each \beta by the corresponding standard error. In any case, here are the square roots of the diagonal, saved as “se” because I will need them again in a moment.

(Oops. That sentence in the middle of the next picture should read “Recall the parameter table”.)

We have indeed obtained the column headed “standard error”. Trust me – or confirm for yourself – that each t-statistic is the quotient of Estimate / Standard Error.
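Here is a minimal sketch of those calculations, with cov and se as my own labels for the parameter covariance matrix and the standard errors.

cov = reg["EstimatedVariance"] Inverse[Transpose[X].X];
Chop[Norm[cov - reg["CovarianceMatrix"]]]      (* 0 up to rounding *)
se = Sqrt[Diagonal[cov]];                      (* the standard errors *)
reg["BestFitParameters"]/se                    (* the t-statistics *)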

Let me say all too little about the t-statistics and the P-values. If you’ve never seen this before, it will be rather mysterious. I’m hoping, instead, that you have seen it, but you don’t really remember it. Let me jog your memory.

I assert that for this purpose I have n - k = 13 - 4 = 9 degrees of freedom (“dof”). And we will assume that each \beta\ has a Student’s t distribution with 9 dof, with mean \beta\ and the given standard error.

I looked in a statistics book and found that the critical t value for 9 degrees of freedom and a 95% confidence level was 2.262. (I’m sure I could have gotten Mathematica to tell me that….) I checked it by asking Mathematica for the cumulative probabilities at ± 2.262:
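Here is a sketch of that check:

CDF[StudentTDistribution[9], #] & /@ {-2.262, 2.262}   (* roughly 0.025 and 0.975 *)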

What that all means is that the two red points on the following curve mark the boundaries of 95% of the area under the curve. In other words, 2.5% of the area is to the left of the left red dot, and 2.5% of the area is to the right of the right red dot.

And that’s what the PValues are telling us: the probability that, if \beta\ really were zero, we would nonetheless see data that leaves us this far out in the tails.

To be specific, look at the t-statistic and P-Value for X2. The t-statistic in the parameter table is marginally less than 2.262, and the P-value is marginally greater than 5%.

Anyway, that’s where the P-Value comes from.
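In code, a minimal sketch, using the se defined above; tstats is my own label.

tstats = reg["BestFitParameters"]/se;
2 (1 - CDF[StudentTDistribution[n - k], Abs[#]]) & /@ tstats   (* the two-sided P-values *)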

What about “CorrelationMatrix“? Is it the parameter correlation matrix? Yes.

Given a covariance matrix, we can construct the associated correlation matrix by dividing each entry by the appropriate square roots of the diagonal. That is, we get from the (i, j) element of the parameter covariance matrix to the (i, j) element of the parameter correlation matrix by dividing by s_i and s_j, where s_i is the square root of the (i, i) diagonal entry.

As I claim, that is the regression property CorrelationMatrix:

reg[“CorrelationMatrix”]//MatrixForm =

\left(\begin{array}{cccc} 1. & -0.128267 & -0.991927 & -0.987899 \\ -0.128267 & 1. & 0.0456733 & 0.102489 \\ -0.991927 & 0.0456733 & 1. & 0.971503 \\ -0.987899 & 0.102489 & 0.971503 & 1.\end{array}\right)
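A minimal sketch of that conversion, using the cov and se defined earlier:

Chop[Norm[cov/Outer[Times, se, se] - reg["CorrelationMatrix"]]]   (* 0 up to rounding *)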

What properties have we not talked about?

\left(\begin{array}{cc} \text{AIC} & \text{ANOVATable} \\ \text{BetaDifferences} & \text{BIC} \\ \text{CoefficientOfVariation} & \text{CookDistances} \\ \text{CovarianceRatios} & \text{DurbinWatsonD} \\ \text{EigenstructureTable} & \text{FitDifferences} \\ \text{FVarianceRatios} & \text{MeanPredictionBands} \\ \text{MeanPredictionConfidenceIntervalTable} &   \text{ParameterConfidenceIntervalTable} \\ \text{PartialSumOfSquares} & \text{SequentialSumOfSquares} \\ \text{SingleDeletionVariances} & \text{SinglePredictionBands} \\ \text{SinglePredictionConfidenceIntervalTable} &   \text{StandardizedResiduals} \\ \text{StudentizedResiduals} & \text{VarianceInflationFactors}\end{array}\right)

That’s still a lot of properties.

Well, the AIC and BIC are the “Akaike Information Criterion” and the “Bayesian Information Criterion”; like RSquared and AdjustedRSquared, they are measures of how good the fit is. I will talk about them in a subsequent post, when I introduce about 8 more measures. For what it’s worth, here they are for this regression:

reg[“AIC”] = 63.8663

reg[“BIC”] = 66.691

The ANOVATable is a summary of an analysis of variance of the regression. Well, analysis of variance is one of those things that I have seen but never played with enough to understand. I do not expect to be talking about it in the near future. I will probably also not talk about PartialSumOfSquares and SequentialSumOfSquares. By the way, I’m certainly not saying they’re unimportant, merely that I’ve gotten along without them. In any case, here’s what the ANOVATable looks like:

The following are all related to “deletion of a single observation”: BetaDifferences, CookDistances, CovarianceRatios, FitDifferences, FVarianceRatios, and SingleDeletionVariances. I will discuss them in the next post, along with “StandardizedResiduals” and “StudentizedResiduals“, which are used in single-deletion analysis, but could certainly have been discussed apart from it.

The Coefficient of Variation was unfamiliar to me — I knew what it is, but not why it shows up here. And it’s a good thing I knew what it is in general, because the Mathematica documentation is wrong. For a probability distribution, the Coefficient of Variation is its standard deviation divided by its mean, i.e. the square root of its variance divided by its mean.

(The Mathematica tutorial on Statistical Modeling Analysis says mean divided by standard deviation. After all, we almost always divide by the standard deviation. But not this time.)

For a regression, this property is the estimated standard deviation divided by the mean of y. What seems weird is that it’s not using the standard deviation of y. Calling the component properties directly, I get s, ybar, and s/ybar:
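Those three numbers, as a minimal sketch (using the s defined earlier, since the original display isn’t reproduced here):

{s, Mean[y], s/Mean[y]}   (* the last entry should be the CoefficientOfVariation *)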

Calling for it by name, we get

reg[“CoefficientOfVariation”] = 0.0241948,

which confirms that I know what they did.

In fact, this is the major thing I learned while writing this post, and I’ve added another post to the schedule. The usual variance of the y data (sum of squared deviations from the mean, divided by n-1) is the obvious “variance of y”. But we have an equation for y – not that I’ve ever actually shown it to you!

y = X\ B + u\ .

Note that instead of \beta\ I have written B, and instead of e I have written u. We assume that this is the true model, and that our least-squares fit is an estimate of it. More to the point, the only random part of this model is the disturbances u. Among other things, they are assumed to have an unknown variance \sigma^2\ .

But if they have a variance \sigma^2\ and y is given by that equation, then we may compute that the variance of y given X, Var(y | X), is also \sigma^2\ . But since \sigma^2\ is unknown, we use the unbiased estimate “estimated variance”.

That all needs to be said in more detail, and I will say it – but that’s why it makes sense to use the mean of y and the square root of the Estimated Variance: they really do belong together.

The DurbinWatsonD property is the Durbin–Watson statistic, which is related to the lag-1 autocorrelation of the residuals.

reg[“DurbinWatsonD”] = 2.01056
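A minimal sketch, assuming DurbinWatsonD is the usual Durbin–Watson statistic computed from the residuals e:

Total[Differences[e]^2]/Total[e^2]   (* should reproduce 2.01056 *)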

Frankly, I would just as soon compute the autocorrelation function and partial autocorrelation function of the residuals, and see what structure, if any, they suggest. Of course, with only 13 observations, we shouldn’t compute more than about 4 values. (For the autocorrelation function, the first value is always 1; it doesn’t count.)

I may not talk about the autocorrelation and partial autocorrelation functions in the near future, but I’ll probably get to them some day, because they are essential ingredients for ARMA (auto-regressive, moving average) time series analysis – which I do want to talk about someday.

The following commands come from the Mathematica TimeSeries Package:

CorrelationFunction[e, 4] = {1., -0.0569033, -0.566034, -0.140973, 0.134833}

PartialCorrelationFunction[e, 4] = {-0.0569033, -0.571122, -0.333626, -0.44192}

The EigenstructureTable is unfamiliar to me.

Here’s what I know about it. The first column is the eigenvalues of the correlation matrix of X1, X2, X4:

The second column is the square root of (the largest eigenvalue divided by the eigenvalue in that row):
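A sketch of those two columns, assuming the X1, X2, X4 data are columns 1, 2, and 4 of d1; evals is my own label.

evals = Eigenvalues[N[Correlation[d1[[All, {1, 2, 4}]]]]];
evals                    (* the first column of the table *)
Sqrt[Max[evals]/evals]   (* the second column *)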

Finally, I cannot figure out how to compute the remaining 3×3 subtable. (And I’ve tried some principal components analysis on the correlation matrix!) I’ve asked on the Mathematica users’ group, and it’s possible the answer will come in before I post this, but I’m not going to delay this post. (And, no, no replies to my question.)

The two “bands”, MeanPredictionBands and SinglePredictionBands, are computing slightly different “error bars”, or interval estimates, for the yhat. Here they are – clearly they are equations in X1, X2, and X4.

They appear to differ in the constant terms inside the square roots: 200.007 for the Mean and 205.338 for the Single.

Here’s a related property, the MeanPredictionConfidenceIntervalTable:

My guess was that if I used the X1, X2, X4 of the first observation in the MeanPredictionBands equations, I would get the first tabulated confidence interval… and I did.

{74.67, 82.2066}

Then I tried the X1, X2, X4 for the last observation, and it worked too.

{108.966, 114.284}

So I concluded that the two “prediction tables” are the corresponding “prediction bands” applied to the data… but the bands are more general, because they could be applied to any other set of values, not just the original observations. As with the Eigenstructure table, I’ve asked about the bands properties on the Mathematica newsgroup, but I’m going full steam ahead without waiting for an answer. (Good thing, too – no answers in sight.)

(In fact, I have found a clue about the difference, but I haven’t had time to pursue the thread and work it out. When I know exactly what these are, I’ll let you know.)

This next one, the ParameterConfidenceIntervalTable, I understand and can compute for myself: it’s giving us interval estimates of the \betas. It transforms the Student’s t distribution to the \betas: instead of mean zero and standard deviation 1, we shift and scale the distribution to a mean of \beta and a standard deviation equal to the standard error.
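A minimal sketch of that shift and scale, using the se defined earlier and the 97.5% point of the t distribution with n - k = 9 dof; each pair should reproduce a row of the ParameterConfidenceIntervalTable.

tc = Quantile[StudentTDistribution[n - k], 0.975];   (* about 2.262 *)
Transpose[{reg["BestFitParameters"] - tc se, reg["BestFitParameters"] + tc se}]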

Finally, I read that the VarianceInflationFactors are used for assessing multicollinearity. I prefer to use the Singular Value Decomposition, but I will try to figure out what the VarianceInflationFactors tell us before I get to the post about multicollinearity. (Oh, yes indeed, the Hald data is multicollinear.)

Whew! I think we’ve run through all of them, although there are several about which I plan to say more, and there were a few that I don’t understand yet.

The next regression post will talk about the “single deletion” properties.


Responses to “Regression 1: More Properties”

  1. rip Says:

    OK, I’ve found out what Variance Inflation Factors are, and they are described in the last section of this post: https://rip94550.wordpress.com/2011/01/24/regression-1-multicollinearity-in-the-hald-data-%E2%80%93-1/

    rip

  2. rip Says:

    I figured out how the eigenstructure table is computed, in this post:
    https://rip94550.wordpress.com/2011/01/31/regression-1-%E2%80%93-mathematicas-eigenstructure-table-for-the-hald-data/

    … and, while I’m at it, in old news along the same lines, mean prediction bands and single prediction bands were figured out in this post:
    https://rip94550.wordpress.com/2010/10/04/regression-1-%E2%80%93-two-kinds-of-prediction-bands/


