Regression 1 – Multicollinearity in Review

As I draft this, I plan to do four things in this post.

  1. Summarize the methods I’ve used to analyze multicollinearity.
  2. Suggest that multicollinearity is a continuum with no clear-cut boundaries.
  3. Summarize the conventional wisdom on its diagnosis and treatment.
  4. Flag significant points made in my posts.

Let me say up front that there is one more thing I know of that I want to learn about multicollinearity – but it won’t happen this time around. I would like to know what economists did to get around the multicollinearity involved in estimating production functions, such as the Cobb-Douglas.

Using the QR Decomposition to orthogonalize data

This is going to be a very short post, illustrating one idea with one example (yes, one, not five).

It turns out that there is another way to have Mathematica® orthogonalize a matrix: it’s called the QR decomposition. The matrix Q will contain the orthogonalized data… and the matrix R will specify the relationship between the original data and the orthogonalized data.

That means we do not have to do the laborious computations described in this post. Understand, if we do not care about the relationship between the original data and the orthogonalized data, then I see no advantage in Mathematica to using the QR over using the Orthogonalize command.
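
To make the roles of Q and R concrete, here is the relationship in symbols. This is a sketch in my own notation, assuming the "thin" form of the decomposition and a data matrix X (n rows, k columns) whose columns are linearly independent, with ' denoting transpose. How Mathematica arranges its output may differ from this (for example, by a transpose).

X = QR, \qquad Q'Q = I_k, \qquad R = Q'X \quad \text{(upper triangular)}

x_j = \sum_{i=1}^{j} r_{ij} \, q_i

In words: column j of the original data is a combination of the first j orthogonalized columns, and the coefficients are read off column j of R.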

Regression 1: ADM polynomials – 2

Let’s look again at a polynomial fit for our small set of annual data. We started this in the previous technical post.

What we used last time was

That is, I had divided the year by 1000… because, as messy as our results were, they would have been a little worse using the years themselves.

But there’s a simple transformation that we ought to try – and it will have a nice side effect.

Just center the data. Start with the years themselves, and subtract the mean:

I’ll observe that if we wanted to work with integers, we could just multiply by 2. In either case, our new x is not a unit vector.

Oh, the nice side effect? Our centered data is orthogonal to a constant vector.
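
Here is a quick numerical check of the centering step. It is a minimal sketch in Python rather than Mathematica, and the eight consecutive years are hypothetical stand-ins, not the values from Draper & Smith.

```python
# Hypothetical years: eight consecutive values, chosen only for illustration.
years = list(range(1986, 1994))

# Center the data by subtracting the mean.
mean_year = sum(years) / len(years)
centered = [y - mean_year for y in years]

# With an even number of consecutive years the centered values are
# half-integers, so multiplying by 2 gives integers.
doubled = [2 * c for c in centered]

# The dot product with a constant vector of 1s is just the sum of the entries.
dot_with_ones = sum(c * 1.0 for c in centered)

# Squared length of the centered vector; a unit vector would give 1.
squared_length = sum(c * c for c in centered)

print(centered)
print(doubled)
print(dot_with_ones, squared_length)
```

The dot product with the vector of 1s is zero because the entries of a centered vector sum to zero; the squared length is far from 1, so the centered x is orthogonal to the constant but not a unit vector.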

Let’s see what happens.

Regression 1: Archer Daniels Midland (polynomials) – 1

Now I want to illustrate another problem, this time with the powers of x. The following comes from Draper & Smith, p. 463, Archer Daniels Midland data; it may be in a file, but – with only 8 observations – it was easier to type the data in. Heck, I didn’t even look to see if it was all in some file somewhere.

raw data

I have chosen to divide the years by 1000; in the next post I will do something else.

The output of the following command is the given y values… I typed integers and then divided by 100 once rather than type decimal points.


Regression 1: Example 8, Fitting a polynomial

I want to revisit my old 2nd regression example of May 2008. I have more tools available to me today than I did when I first created it – and it was originally done before Regress was replaced by LinearModelFit.

Recap: fitting a quadratic and a cubic

What I had was five observations x, five disturbances u – and an equation defining the true model: y = 2 + x^2 + u. Here they are:

Construct a full data matrix with x, x^2, and y:

Run forward selection… and backward selection…


Regression 1: eliminating multicollinearity from the Toyota data

We have seen that we can eliminate the multicollinearity from the Hald data if we orthogonalize the design matrix – thereby guaranteeing that the new data vectors will be orthogonal to a column of 1s. That, in turn, centers the new data, so that it is uncorrelated as well as orthogonal.

Doing that to the Toyota data will seem strange… because we have to do it to the dummy variables, too! But it will eliminate the multicollinearity.

I’m not sure it’s worthwhile to eliminate it… but we can… so let’s do it.

Regression 1: eliminating multicollinearity from the Hald data

I can eliminate the multicollinearity from the Hald dataset. I’ve seen it said that this is impossible. Nevertheless I conjecture that we can always do this – provided the data is not linearly dependent. (I expect orthogonalization to fail precisely when X’X is not invertible, and to be uncertain when X’X is on the edge of being not invertible.)

The challenge of multicollinearity is that it is a continuum, not usually a yes/no condition. Even exact linear dependence – which is yes/no in theory – can be ambiguous on a computer. In theory we either have linear dependence or linear independence. In practice, we may have approximate linear dependence, i.e. multicollinearity – but in theory approximate linear dependence is still linear independence.

But if approximate linear dependence is a continuum then it is also a continuum of linear independence.

So what’s the extreme form of linear independence?


What happens if we orthogonalize our data?

The procedure isn’t complicated: use the Gram-Schmidt algorithm – on the design matrix. Let me emphasize that: use the design matrix, which includes the column of 1s. (We will also, in a separate calculation, see what happens if we do not include the vector of 1s.)
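
The idea can be sketched in a few lines. This is a minimal illustration in Python rather than Mathematica, using the modified form of Gram-Schmidt; the function name `gram_schmidt`, the tolerance `1e-12`, and the two small data columns are all my own assumptions for the example, not the Hald data.

```python
import math

def gram_schmidt(columns, tol=1e-12):
    """Orthonormalize column vectors in the order given (modified Gram-Schmidt).

    The tolerance is an arbitrary illustrative threshold for declaring
    the columns numerically linearly dependent.
    """
    basis = []
    for col in columns:
        v = list(col)
        # Remove the component of v along each vector already in the basis.
        for q in basis:
            coeff = sum(a * b for a, b in zip(q, v))
            v = [a - coeff * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm < tol:
            raise ValueError("columns are (numerically) linearly dependent")
        basis.append([a / norm for a in v])
    return basis

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: a column of 1s plus two made-up variables.
ones = [1.0] * 5
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]

# Design matrix, with the column of 1s first.
q0, q1, q2 = gram_schmidt([ones, x1, x2])
print(sum(q1), sum(q2), dot(q1, q2))   # all essentially zero

# Separate calculation: leave out the column of 1s.
p1, p2 = gram_schmidt([x1, x2])
print(sum(p1), sum(p2), dot(p1, p2))   # orthogonal, but the sums are not zero
```

With the 1s included, each new data column is orthogonal to the constant, so its entries sum to zero: it is centered, and the columns are uncorrelated as well as orthogonal. Without the 1s, the new columns are still orthogonal to each other but are not centered. The `ValueError` branch is where exact (or numerically near-exact) linear dependence would show up.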

Here we go….

Regression 1 – Example 5: A Toyota Car

Edit: 2011 Nov 26. I computed a correlation matrix of the parameters when I meant to compute the correlation matrix of the data. Find “edit”.

Let’s do another regression, okay? I’m sick of the Hald data. It’s been so long since we did something other than multicollinearity… you might refresh your recollection of the earlier examples and the material leading up to them.


This data comes from Ramanathan, dataset 3-7 in both the 4th and 5th editions, but 3-6 in the 3rd. See this post for information about obtaining his data.

From the description in the 4th ed. data, I infer that this is data for one vehicle over a period of 14+ years.

DATA3-7: Data for a Toyota station wagon (57 observations)
cost = cumulative repair cost in actual dollars (11 – 3425)
age = age of car in weeks of ownership (Range 5 – 538)
miles = miles driven in thousands (Range 0.8 – 74.4)

Regression 1: Multicollinearity in the Centered Hald Data

Edit, 2011 Nov 25: added a link to the Norms and Condition Numbers post. Find “Edit”.

Let us put the Hald data to rest, by making a transformation intermediate between the raw data and the standardized data. We will merely center the data, by subtracting the means of each variable.

Some things will change, but some things will not. I learned a lot from both.

In particular, we will see – as we did with the standardized data – that the relationships among the independent variables remain. The details change, but we still have a very strong relationship among X1, X2, X3, and X4… a strong relationship between X2 and X4… and a weak relationship between X1 and X3.

These relationships will be identified, as before, in a couple of ways. The VIFs, variance inflation factors – which I recast as RSquareds using the equation

R^2 = 1 - \frac{1}{\text{VIF}}

which came from the definition in this post

\text{VIF} = \frac{1}{1-R^2}

are what would show me that these three relationships among the raw variables still hold at the same strength, even when we center or standardize the data.
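
The two equations above are inverses of each other, which a few lines of code can confirm. This is a minimal sketch in Python rather than Mathematica; the function names and the sample VIF values are hypothetical, chosen only to show the scale, and are not the Hald results.

```python
def r_squared_from_vif(vif):
    """R^2 = 1 - 1/VIF. A VIF below 1 cannot come from an R^2 in [0, 1)."""
    if vif < 1:
        raise ValueError("VIF must be at least 1")
    return 1.0 - 1.0 / vif

def vif_from_r_squared(r2):
    """VIF = 1 / (1 - R^2), defined for 0 <= R^2 < 1."""
    if not 0 <= r2 < 1:
        raise ValueError("R^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2)

# Hypothetical VIFs, for illustration only.
for vif in (1.0, 10.0, 250.0):
    r2 = r_squared_from_vif(vif)
    print(vif, r2, vif_from_r_squared(r2))
```

I find the RSquared form easier to read: a VIF of 10 corresponds to an RSquared of 0.9, and very large VIFs all crowd up against an RSquared of 1.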

Regression 1 – Multicollinearity in subsets of the standardized Hald data

Edit: 8 Aug. Remarks added in “Regressions with no constant term”.


It would be fair to say that this post is primarily for my reference, but it does provide a second example of looking at all subsets of multicollinear data.

As we originally did for the raw data, so for the standardized data: we looked at multicollinearity for the three regressions of most interest – namely, the best 2- and 3-variable regressions, and the all-4-variable regression.

Now, as we did for the raw data, so for the standardized data: let’s look at all subsets (of columns) of the design matrix X. (In fact, it is easier to look at the equivalent: all subsets of rows of the transpose of the design matrix, X’.)
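
Enumerating the subsets is mechanical. Here is a minimal sketch in Python rather than Mathematica; it assumes just the four variable names X1 to X4, and the numbers stored under each name are placeholders, not the Hald data. Storing the transpose as one row per variable is what makes picking a subset easy.

```python
from itertools import combinations

# The transpose of the data matrix: one row per variable.
# The values are placeholders for illustration only.
Xt = {
    "X1": [1.0, 2.0, 3.0],
    "X2": [4.0, 5.0, 6.0],
    "X3": [7.0, 8.0, 9.0],
    "X4": [1.0, 0.0, 1.0],
}
names = list(Xt)

# All non-empty subsets of the variables, smallest first.
subsets = [
    combo
    for size in range(1, len(names) + 1)
    for combo in combinations(names, size)
]

for combo in subsets:
    rows = [Xt[name] for name in combo]   # the chosen rows of the transpose
    print(combo, len(rows))

print(len(subsets))
```

Four variables give 15 non-empty subsets: 4 singles, 6 pairs, 4 triples, and the full set.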

Nevertheless, let me summarize what we will find. You may not feel a need to look at the computations. I think at the very least you will want to look at the section on regressions without a constant term, and at the 3-variable subsets.