Regression 1: Introduction

This is the first of several posts discussing multivariable regression (“ordinary least squares”, or OLS).

I will begin by introducing a data set which we will use for most of this set of posts… I will show you how to run a regression using Mathematica version 7 (it has changed from version 6)… I will just scratch the surface looking at the output from a regression.

Let’s get started.

setup

The following is referred to as “the Hald data”. It’s not very large — 13 observations on 5 variables.

If you have electronic data from Draper & Smith, it is file 15A. I also found it available online for download.

This is a famous dataset, so I encourage you to search the internet for articles about it. In addition to “hald data”, you might try “hald cement”, since the data comes from an analysis of cement.

I read the data, flatten it first and then partition it into 5 columns… I name all 5 columns using lower-case… and then I collect the corresponding upper-case names of the independent variables in the vector n1… display the entire data matrix d1… and print its dimensions:
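In case it helps, here is a minimal sketch of that setup; I'm assuming the full path of the data file is in the string name (see the comment at the end of this post) and that the file reads cleanly as a whitespace-delimited table:

    raw = Import[name, "Table"];            (* read the raw file *)
    d1 = Partition[Flatten[raw], 5];        (* flatten, then regroup into rows of 5 values *)
    {x1, x2, x3, x4, y} = Transpose[d1];    (* lower-case vectors, one per column, y last *)
    n1 = {X1, X2, X3, X4};                  (* upper-case names for the independent variables *)
    MatrixForm[d1]                          (* display the data matrix *)
    Dimensions[d1]                          (* 13 observations on 5 variables *)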

Let me repeat that each column of this matrix is a variable; the dependent variable is last. Let me also repeat that I have defined data vectors x1, etc. and y.

Having the data, I want to plot each x by itself, and y as a function of x (for each x = x1, x2, x3, x4), and display the two drawings side by side.
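Here's one way such a side-by-side pair might be produced; this is just a sketch, and the plot options are my own, not necessarily the ones behind the pictures below:

    GraphicsRow[{
      ListPlot[x1, PlotLabel -> "x1"],                       (* x1 against its index *)
      ListPlot[Transpose[{x1, y}], PlotLabel -> "y vs x1"]   (* y as a function of x1 *)
    }]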

The first pair of drawings is x1 on the left and (x1,y) on the right…

… the second pair of drawings in the previous picture is x2 and (x2,y).

The first pair of drawings in the next picture is x3 and (x3, y)…

… the second pair of drawings in the previous picture was x4 and (x4,y).

The final drawing is just y by itself:

All in all, that data is pretty well scattered.

one selected regression

Now, let me do the usual. For reasons known only to me (or, for reasons I will share later), I choose to regress using 3 variables, namely x1, x2, and x4. Here’s how I would do that in Mathematica version 7. The input parameters are the same as in previous versions, but the command has a different name. I already have a matrix “d1” of data, and a vector n1 of the names of the first 4 variables, the independent variables. I issue the command LinearModelFit with 3 arguments…
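For the record, the call itself is short; a sketch, using the d1 and n1 defined above:

    reg = LinearModelFit[d1, {X1, X2, X4}, n1]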

Mathematica returns what appears to be an equation in X1, X2, and X4. It also appears to be an object called a FittedModel. Oh, and I named it “reg”.

I want to emphasize that the model has 3 inputs. We saw the matrix “d1” at the beginning of the post: 5 columns of variables, the dependent variable being last. Recall that the list (vector) “n1” was defined to hold 4 names, for the independent variables:

{X1, X2, X3, X4}

The middle input is a list of the variables to be used in this model. It means that we do not need to change the data matrix in order to run different models with the same data. That’s pretty convenient.
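For example (a hypothetical variation, not one I'm analyzing here), the same data matrix would support a model on X1 and X2 alone just by changing the middle argument:

    LinearModelFit[d1, {X1, X2}, n1]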

So we have an equation. What else?

the lists of properties

I can get a list of what Mathematica knows about that regression by asking for “Properties” of the object named “reg” (and, in case you’re wondering, I have formatted the output; by default it is just a long list)…
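The request itself is just the string “Properties”; the two-column layout below is my own formatting, something along these lines (assuming an even number of entries):

    reg["Properties"]
    TableForm[Partition[reg["Properties"], 2]]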

\left(\begin{array}{cc} \text{AdjustedRSquared} & \text{AIC} \\ \text{ANOVATable} & \text{ANOVATableDegreesOfFreedom} \\ \text{ANOVATableEntries} & \text{ANOVATableFStatistics} \\ \text{ANOVATableMeanSquares} & \text{ANOVATablePValues} \\ \text{ANOVATableSumsOfSquares} & \text{BetaDifferences} \\ \text{BestFit} & \text{BestFitParameters} \\ \text{BIC} & \text{CatcherMatrix} \\ \text{CoefficientOfVariation} & \text{CookDistances} \\ \text{CorrelationMatrix} & \text{CovarianceMatrix} \\ \text{CovarianceRatios} & \text{Data} \\ \text{DesignMatrix} & \text{DurbinWatsonD} \\ \text{EigenstructureTable} & \text{EigenstructureTableEigenvalues} \\ \text{EigenstructureTableEntries} & \text{EigenstructureTableIndexes} \\ \text{EigenstructureTablePartitions} & \text{EstimatedVariance} \\ \text{FitDifferences} & \text{FitResiduals} \\ \text{Function} & \text{FVarianceRatios} \\ \text{HatDiagonal} & \text{MeanPredictionBands} \\ \text{MeanPredictionConfidenceIntervals} & \text{MeanPredictionConfidenceIntervalTable} \\ \text{MeanPredictionConfidenceIntervalTableEntries} & \text{MeanPredictionErrors} \\ \text{ParameterConfidenceIntervals} & \text{ParameterConfidenceIntervalTable} \\ \text{ParameterConfidenceIntervalTableEntries} & \text{ParameterConfidenceRegion} \\ \text{ParameterErrors} & \text{ParameterPValues} \\ \text{ParameterTable} & \text{ParameterTableEntries} \\ \text{ParameterTStatistics} & \text{PartialSumOfSquares} \\ \text{PredictedResponse} & \text{Properties} \\ \text{Response} & \text{RSquared} \\ \text{SequentialSumOfSquares} & \text{SingleDeletionVariances} \\ \text{SinglePredictionBands} & \text{SinglePredictionConfidenceIntervals} \\ \text{SinglePredictionConfidenceIntervalTable} & \text{SinglePredictionConfidenceIntervalTableEntries} \\ \text{SinglePredictionErrors} & \text{StandardizedResiduals} \\ \text{StudentizedResiduals} & \text{VarianceInflationFactors}\end{array}\right)

That’s quite a list. Fortunately, many of those entries (e.g. ANOVATableEntries) are subsets of, i.e. specific entries in, tables (e.g. ANOVATable). Let me remove the subordinate entries associated with ANOVATable, EigenstructureTable, MeanPredictionConfidenceIntervalTable, ParameterConfidenceIntervalTable, ParameterTable, and SinglePredictionConfidenceIntervalTable.
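One way that pruning might be scripted, as a sketch (it assumes the property names come back as strings, and that every subordinate entry starts with the name of its table):

    tables = {"ANOVATable", "EigenstructureTable",
       "MeanPredictionConfidenceIntervalTable", "ParameterConfidenceIntervalTable",
       "ParameterTable", "SinglePredictionConfidenceIntervalTable"};
    short = Select[reg["Properties"],
       Function[p, ! Or @@ (StringMatchQ[p, # ~~ __] & /@ tables)]];   (* keep the tables, drop their sub-entries *)
    TableForm[Partition[short, 2]]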

The result is still not a short list, but it’s more manageable:

\left(\begin{array}{cc} \text{AdjustedRSquared} & \text{AIC} \\ \text{ANOVATable} & \text{BetaDifferences} \\ \text{BestFit} & \text{BestFitParameters} \\ \text{BIC} & \text{CatcherMatrix} \\ \text{CoefficientOfVariation} & \text{CookDistances} \\ \text{CorrelationMatrix} & \text{CovarianceMatrix} \\ \text{CovarianceRatios} & \text{Data} \\ \text{DesignMatrix} & \text{DurbinWatsonD} \\ \text{EigenstructureTable} & \text{EstimatedVariance} \\ \text{FitDifferences} & \text{FitResiduals} \\ \text{Function} & \text{FVarianceRatios} \\ \text{HatDiagonal} & \text{MeanPredictionBands} \\ \text{MeanPredictionConfidenceIntervalTable} & \text{ParameterConfidenceIntervalTable} \\ \text{ParameterTable} & \text{PartialSumOfSquares} \\ \text{PredictedResponse} & \text{Properties} \\ \text{Response} & \text{RSquared} \\ \text{SequentialSumOfSquares} & \text{SingleDeletionVariances} \\ \text{SinglePredictionBands} & \text{SinglePredictionConfidenceIntervalTable} \\ \text{StandardizedResiduals} & \text{StudentizedResiduals}\end{array}\right)

Anyway, we appear to have an object named reg, which is an instance of a class called a FittedModel, with properties we may access. It’s sometimes a little frustrating to have to run a regression in order to get a complete list of “Properties”, but I’ve gotten used to it. In contrast to previous versions, we have all the properties of any regression we run; in the old days, we used to have to specify a RegressionReport with selected properties.

If you have Mathematica, you should read the Statistical Modeling Analysis Tutorial… but if you’re using a different statistics package, then these are things you should expect to have available. And if you are writing your own regression code, then these are things you will want to be able to compute (most of them above and beyond the elementary material in my introductory description).

OK, before I begin to talk about them — in the next post — let me grab a few, a very few.

One of the most fundamental of those properties is the parameter table; it summarizes the fit.
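Like everything else here, it is obtained by name; a sketch, using the fitted model reg from above:

    reg["ParameterTable"]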

It confirms that the regression uses X1, X2 and X4… and it includes a constant term called “1”. The column headed “Estimate” contains the coefficients of the equation. The t-statistics are all greater than 1 (in absolute value), but the t-statistic for X4 is less than the magic number 1.96 in absolute value.

The P-value seems to say that there is about a 5% chance that the estimated coefficient for X2 is really zero, and about a 20% chance that the coefficient for X4 is really zero. More precisely, it’s reporting the (two-sided) tail probabilities for the t-statistics, which are supposed to be values from a t-distribution. I’ll talk a little more about this in the next post.

One of the most common properties is the Adjusted R Squared; it is one of four properties which measure goodness of fit (and I will be showing you many more such properties, down the road):
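Again, it is just a named property (a sketch):

    reg["AdjustedRSquared"]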

Since the Adjusted R^2 can never exceed 1, and ours is quite close to 1, we seem to have a pretty good fit.

An empirical observation of mine — which I have read can be proved — is that the t-statistic for a variable is greater than 1 (in absolute value) if and only if adding that variable to the regression will increase the Adjusted R Squared. I will show you in a subsequent post that, indeed, this regression has a higher Adjusted R Squared than a regression with just X1 and X2. That is, I would have added X4 to the regression on X1 and X2 even though its t-statistic is relatively low, because the Adjusted R Squared would go up when I added it. (In fact, I got here by a different path, but the effect is the same.)

(I believe that the t-statistic can be interpreted in two ways. One, it measures the chance that its \beta coefficient is really zero. Two, it measures the benefit of having added that variable.)

But the most fundamental of all these properties is the residuals, called the “FitResiduals”. I daresay that everything we know about this fit is contained in the residuals, which I habitually denote by e:
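A sketch of the assignment, using the fitted model reg from above:

    e = reg["FitResiduals"]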

There are three absolutely essential plots at this point:

  1. a list plot of the residuals;
  2. a plot of the residuals as a function of the fitted values (yhat);
  3. a plot of the y and yhat.

That middle plot is a significant choice. The e and y are usually correlated… the e and yhat are not (for OLS, the residuals are orthogonal to the fitted values by construction): we plot e versus yhat.

So let’s get the fitted values, called “PredictedResponse”, which I habitually denote by yhat…
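As a sketch:

    yhat = reg["PredictedResponse"]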

And let’s get the two plots of residuals.
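Something like the following would produce them; the labels are mine (a sketch, assuming e and yhat from above):

    ListPlot[e, PlotLabel -> "residuals e"]                    (* plot 1: residuals by observation *)
    ListPlot[Transpose[{yhat, e}], PlotLabel -> "e vs yhat"]   (* plot 2: residuals against fitted values *)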

And here’s the y and yhat on one graph (black for y, red for yhat).
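One way to get that overlay, with the colors matching the description (a sketch):

    ListPlot[{y, yhat}, PlotStyle -> {Black, Red}]   (* black for y, red for yhat *)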

(That lone red dot on the left? It lies right on top of the black dot for that y.)

Now, I’m not going to actually talk about these graphs yet, but I think of them as part and parcel of running a regression, which is why I show them immediately after it.

OK, there’s one caveat. I really didn’t pick that regression out of the air, so it would be more precise to say that displaying these graphs is part of examining a particular chosen regression.

In the next post, I’ll talk about most of the properties in the short list. I will save a handful of them (“single deletion”) for the third post.

One Response to “Regression 1: Introduction”

  1. rip Says:

    I notice that my command for reading the dataset uses “name” but doesn’t show you what this is.

    The Mathematica® command I omitted was:

    name = "/Users/rip/mathematics/time series etc ƒ/draper & smith f/REGRESS/15A.txt"

    That is the full path name on my system for the data file “15A.txt”. What a mess!

    That’s why there is a Mathematica command Insert / File Path. You click the command and then you navigate to the file, select it, and Mathematica will provide the string. I chose to drop it into an input cell after “name =”.

