I can eliminate the multicollinearity from the Hald dataset. I’ve seen it said that this is impossible. Nevertheless I conjecture that we can always do this – provided the data is not linearly dependent. (I expect orthogonalization to fail precisely when X’X is not invertible, and to be uncertain when X’X is on the edge of being not invertible.)

The challenge of multicollinearity is that it is a continuum, not a yes/no condition. Even exact linear dependence – which is yes/no in theory – can be ambiguous on a computer. In theory we have either linear dependence or linear independence; in practice we may have approximate linear dependence, i.e. multicollinearity – and yet, in theory, approximate linear dependence is still linear independence.

But if approximate linear dependence is a continuum, then linear independence is a continuum too.

So what’s the extreme form of linear independence?

Orthogonal.

What happens if we orthogonalize our data?

The procedure isn’t complicated: use the Gram-Schmidt algorithm – on the design matrix. Let me emphasize that: use the design matrix, which includes the column of 1s. (We will also, in a separate calculation, see what happens if we do not include the vector of 1s.)

Here we go….

## preliminaries

Recall the Hald data:

Recall the wonderfully descriptive names of the independent variables:

Let’s run backward selection:

We know from our earlier studies that {X1, X2} is the best 2-variable regression, and {X1, X2, X4} is the best 3-variable regression.
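If you’d rather follow along in NumPy than Mathematica, here’s a rough sketch of backward selection. I’m typing in the standard published Hald values (an assumption on my part, since the data isn’t reproduced here), and I’m using |t| ≥ 2 as a crude stand-in for “significant”:

```python
import numpy as np

# Hald cement data, as commonly published (13 observations, 4 regressors).
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])

def t_stats(X, y):
    """OLS with a constant; return the t-statistics of the slopes only."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta = np.linalg.solve(A.T @ A, A.T @ y)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - p - 1)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return (beta / se)[1:]

# Backward elimination: drop the variable with the smallest |t| each round,
# stopping once every remaining variable clears the cutoff.
names = ['X1', 'X2', 'X3', 'X4']
dropped = []
while X.shape[1] > 1:
    t = t_stats(X, y)
    j = int(np.argmin(np.abs(t)))
    if abs(t[j]) >= 2:
        break
    dropped.append(names.pop(j))
    X = np.delete(X, j, axis=1)

print(dropped, names)  # the elimination order, and what survives
```

With these values, X3 leaves first and X4 second, which is consistent with {X1, X2, X4} as the best 3-variable regression and {X1, X2} as the best 2-variable one.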

We also know that our selection criteria are divided on whether 2 variables or 3 variables is the best regression:

Most importantly for our present purpose, we know that the Hald dataset is multicollinear. I compute the R^2 associated with the Variance Inflation Factors (only because R^2 means more to me), and we see that each variable is a very-well-fitted function of the other three:

I should probably remind us that from the definition of the VIF,

VIF = 1/(1-R^2)

I solved for R^2, getting

R^2 = 1 – 1/VIF.

This lets me use my intuition for R^2 with Mathematica’s computation of the VIFs.
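In NumPy terms, the same VIF R^2 can be sketched like this – regress each column on the others (with a constant) and read off R^2; the VIF is then 1/(1 − R^2). Again I’m assuming the standard published Hald values:

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)

def vif_r2(X):
    """For each column j, regress X[:, j] on the other columns (plus a
    constant) and return that regression's R^2."""
    n, p = X.shape
    out = []
    for j in range(p):
        yj = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(A, yj, rcond=None)
        r = yj - A @ b
        out.append(1 - r @ r / ((yj - yj.mean()) @ (yj - yj.mean())))
    return np.array(out)

r2 = vif_r2(X)
vif = 1 / (1 - r2)          # VIF = 1/(1 - R^2), so R^2 = 1 - 1/VIF
print(np.round(r2, 3))      # every variable is well fitted by the other three
```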

Let me suggest that these VIF R^2 are an excellent tool for detecting (as opposed to identifying) multicollinearity. We need more information to isolate the multicollinearity, but these R^2 are definitive if anything is.

In order to get the VIF and these VIF R^2, we must have gotten a regression equation. We know, however, that very severe multicollinearity should make the inversion of X’X unreliable or impossible – so the very first indicator of multicollinearity may be warning or fatal error messages about the matrix inversion.

## orthogonalize the world

Now, I’m going to let Mathematica® do a Gram-Schmidt orthogonalization. Given a matrix, it constructs a new one in which every row is orthogonal to every other row, and every row is normalized to be a unit vector.

I have shown you how to do a Gram-Schmidt… the key idea is to start with any one of the original vectors, normalize it… then take a second vector, write it as two pieces, one parallel and one perpendicular to the first vector… then take as our new second vector just the piece perpendicular to the first vector. Oh, then normalize it.
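That recipe – keep only the piece perpendicular to what we already have, then normalize – can be written in a few lines. Here’s a sketch (modified Gram-Schmidt, subtracting projections one at a time) on a small made-up example:

```python
import numpy as np

def gram_schmidt(vectors):
    """Gram-Schmidt on a list of vectors: subtract from each vector its
    projections onto the previously accepted unit vectors, then normalize
    what is left over."""
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)
        for q in basis:
            w -= (w @ q) * q              # remove the piece parallel to q
        basis.append(w / np.linalg.norm(w))  # keep the perpendicular piece, normalized
    return np.array(basis)

Q = gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
print(np.round(Q @ Q.T, 10))  # the identity: the rows are orthonormal
```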

So, first I get the design matrix – so column one is all 1s… transpose it… orthogonalize it… and transpose back:

We see that our column of 1s is still a vector of identical components – it is, in fact, a unit vector.

Let’s confirm that the columns are orthonormal to each other:

(The rows of Z are not orthogonal. That’s how I rediscovered that the Orthogonalize command works on rows, so I had to transpose before doing it.)

Just for the fun of it – and because so much of the literature focuses on it – let’s compute the correlation matrix of Z:

(I’m surprised that worked for the constant column, but I’ll take it.)

So, our new data is both uncorrelated and orthogonal.

Before I move on, let me say something about that: for centered data, orthogonal is equivalent to uncorrelated.

Orthogonal means that Z’Z has zeroes in all its off-diagonal slots; since our columns are also unit vectors, Z’Z is in fact the identity – diagonal with 1s on the diagonal. Uncorrelated says that the covariance matrix and the correlation matrix have zeroes in all their off-diagonal slots: both are diagonal, and the correlation matrix has 1s on the diagonal. But only for centered data is the covariance matrix proportional to X’X; otherwise, we have to subtract the means first.

And, indeed, our new data is centered – except for the constant term:

OK, but how did that happen?

Well, any vector orthogonal to the column of 1s must have zero mean: dotting a vector with the column of 1s just sums its components, so orthogonality to the 1s says that the sum of the components is 0. (For concreteness, consider 2D vectors: if (a,b) is orthogonal to (1,1), then (a,b).(1,1) = a + b = 0, i.e. (a,b) has mean zero.)
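Here’s a NumPy sketch of the whole step – orthonormalize the design matrix, constant column first. I use QR, which gives the same orthonormal columns as Gram-Schmidt once the signs are fixed to make the diagonal of R positive (and, as before, I’m assuming the standard published Hald values):

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)
n = X.shape[0]
design = np.column_stack([np.ones(n), X])   # column of 1s FIRST

Q, R = np.linalg.qr(design)
Q = Q * np.sign(np.diag(R))                 # match the Gram-Schmidt sign convention

print(np.allclose(Q.T @ Q, np.eye(5)))      # columns are orthonormal
print(np.round(Q[0, 0], 5))                 # constant column: 1/sqrt(13) ~ 0.27735
print(np.round(Q[:, 1:].sum(axis=0), 12))   # zero: orthogonal to 1s means zero mean
```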

Now let’s run backward selection on the new data.

## use the new data

But this isn’t quite what I need: I want to drop the first column of the Z matrix, so that I can replace it with a column of 1s in a new design matrix. I will still have that all 5 columns are orthogonal to each other – but the column of 1s will not be a unit vector.

So let’s make a new dataset. I drop the row of .27735s from Z’, append the dependent variable y, and transpose back. I also create a new set of very imaginative names for the columns of the Z matrix… and I run backward selection…

Do we have multicollinearity?

No! Every R^2 from the VIFs is zero… not one of these variables Z is a function of the others.
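A quick NumPy check of that claim: take the orthonormalized design matrix, drop the constant column, and compute the VIF R^2 of the four new variables. (Same caveats as before: QR with a sign fix in place of Mathematica’s Gram-Schmidt, and the standard published Hald values.)

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)
n = X.shape[0]

Q, R = np.linalg.qr(np.column_stack([np.ones(n), X]))
Q = Q * np.sign(np.diag(R))
Z = Q[:, 1:]                     # drop the constant column: the new variables

def vif_r2(M):
    """R^2 from regressing each column on the others, with a constant."""
    n, p = M.shape
    out = []
    for j in range(p):
        yj = M[:, j]
        A = np.column_stack([np.ones(n), np.delete(M, j, axis=1)])
        b, *_ = np.linalg.lstsq(A, yj, rcond=None)
        r = yj - A @ b
        out.append(1 - r @ r / ((yj - yj.mean()) @ (yj - yj.mean())))
    return np.array(out)

print(np.round(vif_r2(Z), 12))   # all zero: no multicollinearity left
```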

But have we really gained anything?

Our criteria are still divided between #2 and #3. Let’s look at them all:

OK, Z3 enters the third regression with a t-statistic greater than 1 (so the Adjusted R^2 goes up) but less than 2 (so it is plausible that some criteria would prefer the second regression). Then Z4 enters with a negligible contribution to R^2 (so we reject the fourth regression), and the other t-statistics are essentially unchanged.

Note that the fourth variable is not significant; making the data orthogonal doesn’t add dimensionality. Furthermore, the third variable is still in limbo: not quite significant, but close enough to raise the adjusted R^2.

I also note that the coefficients don’t change as we add variables – just what we should expect with orthogonal regressors: each coefficient can be computed from its own variable alone, regardless of which other variables are in the regression.

Finally, I note that most of the standard errors are the same within each regression. That, too, follows from orthonormality: apart from the constant, (Z’Z)^-1 is the identity, so every slope’s standard error is just the estimated standard deviation of the residuals.

By the way, I review these regressions as though they were forward selection – but they are backward selection: Z4 was dropped from the fourth regression because it had the smallest t-statistic… then Z3 was dropped….

Let’s recall the corresponding original regressions:

Two of the four Adjusted R^2 are the same across the two sets. I know why they’re the same for the two 4-variable regressions: they both contain all the variables, and hence all the same information. One is multicollinear and one is not – but the fits are identical. Here are the residuals from the original fit… and from the orthogonalized fit:

The largest difference between them is ~10^-13.

Perhaps this will serve to emphasize that multicollinearity is about the relationships among the independent variables, and has nothing to do with their relationships to the dependent variable.
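That identity of fits is easy to check directly in NumPy: the orthonormalized design matrix spans the same column space as the original one, so the projections of y – the fitted values – must agree to roundoff. (Standard published Hald values assumed, as before.)

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
n = X.shape[0]

D = np.column_stack([np.ones(n), X])   # original design matrix
Q, _ = np.linalg.qr(D)                 # orthonormal columns, same column span

def fitted(A, y):
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ b

diff = np.abs(fitted(D, y) - fitted(Q, y)).max()
print(diff)   # roundoff-sized: the two full regressions give identical fits
```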

By contrast, the two 3-variable regressions are different:

I do find it striking that the 2-variable regressions lead to exactly the same fit: that is, the residuals are identical. I’m surprised, and I’d like to know more about this. Someday. (One candidate explanation: Gram-Schmidt took the vectors in order, so Z1 lies in the span of {1, X1} and Z2 in the span of {1, X1, X2} – which would make {1, Z1, Z2} and {1, X1, X2} span the same subspace, and least squares only sees the span.)

I wonder: is this stability a justification for choosing the 2-variable fit over the 3-variable fit? I don’t know.

Let me emphasize that it makes perfect sense for the 4-variable regressions to match each other; and I’m not surprised that the 1-variable and 3-variable regressions do not match; but I’m startled that the 2-variable regressions match.

Let me confirm that the 2-variable regressions have exactly the same residuals:

The 1-variable regressions are quite different, judging from the residuals:

## orthogonalize only the data matrix

Now let’s back up and suppose that we had orthogonalized the data matrix instead of the design matrix – i.e. the “variables” but not the constant.

I start with the independent variables – i.e. all but the last column – in the data matrix:

I orthogonalize the rows of X’, calling the result Z’, and compute and display Z:

Orthogonal – of course, that’s how we created the matrix:

but not uncorrelated:

Again, the obstacle is that the data is not zero mean; to get from Z to its correlation matrix, we must start by subtracting the means of each column:
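In NumPy, the contrast looks like this: orthogonalize the data matrix alone (no column of 1s), and the result passes the orthogonality check but fails the correlation check, because the correlation subtracts the (nonzero) means first. (Standard published Hald values assumed.)

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)

Q, R = np.linalg.qr(X)            # orthogonalize the DATA matrix only
Z = Q * np.sign(np.diag(R))

print(np.allclose(Z.T @ Z, np.eye(4)))           # orthogonal, by construction
C = np.corrcoef(Z, rowvar=False)                 # correlations center the columns first
print(np.round(np.abs(C - np.eye(4)).max(), 3))  # far from zero: correlated!
```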

Let’s see what this did for our regressions. Restore the dependent variable… and run backward selection again:

Yikes! We still have multicollinearity. Well, after all, the most serious dependence was that the sum of all four variables was almost constant – i.e. almost a multiple of the constant column.

That’s still true, apparently.

Looking at more detail doesn’t tell us much: the determinant of X’X, the singular values of X, and computed values of X.v are not very small. But we know that there are issues of scaling involved, and these data columns have been severely scaled.

But let’s see what we have for fewer than four variables. I don’t want to regress Z4 on all of Z1, Z2, Z3 – that would merely confirm what we’ve already seen. What I want is, for example, Z4 on each pair in {Z1, Z2, Z3}.

Take the Z matrix – so Z4 is the last, hence dependent, variable… and drop the name Z4 from the list of names…

Run all possible 2-variable regressions (using {2} instead of 2 in the Subsets function):

Damn. The .67 for the first pair of R^2 tells me that Z1 and Z2 are multicollinear – but I’ll bet the constant term is included. Yes it is – because here’s a fit of Z2 as a function of Z1, and the constant is significant:

So even though Z1 and Z2 are orthogonal, if you give me a vector of 1s (or any constant) to add to the mix, I may get approximate linear dependence. In this case, {1, Z1, Z2} are approximately linearly dependent, even though Z1 and Z2 are in fact linearly independent (because they’re orthogonal).
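Here’s a NumPy sketch of exactly that: Z1 and Z2 are orthogonal to machine precision, yet regressing Z2 on a constant plus Z1 gives a substantial R^2 – the constant term creates the approximate dependence. (Standard published Hald values assumed.)

```python
import numpy as np

# Hald cement data, as commonly published.
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12]], dtype=float)

Q, R = np.linalg.qr(X)           # orthogonalize the data matrix (no constant)
Z = Q * np.sign(np.diag(R))
z1, z2 = Z[:, 0], Z[:, 1]

print(abs(z1 @ z2))              # essentially zero: Z1 and Z2 are orthogonal...

# ...yet Z2 is well fitted by a constant plus Z1:
A = np.column_stack([np.ones(len(z1)), z1])
b, *_ = np.linalg.lstsq(A, z2, rcond=None)
r = z2 - A @ b
r2 = 1 - r @ r / ((z2 - z2.mean()) @ (z2 - z2.mean()))
print(round(r2, 2))              # substantial: {1, Z1, Z2} nearly dependent
```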

Isn’t that a little frustrating? But approximate linear dependence is not exact linear dependence… and that is precisely the definition of linear independence! (A set of vectors is said to be linearly independent if it is not linearly dependent.)

Need to double check the orthogonality? Here’s the dot-product of Z1 and Z2.

So. It appears that by orthogonalizing the design matrix instead of the data matrix, and specifically taking the column of 1s as the first vector so that it is effectively unchanged, we can eliminate multicollinearity.

It is crucial – for non-centered data – that we orthogonalized the design matrix. In particular, this also applies to using Principal Component Analysis to construct orthogonal data columns: they are not orthogonal to the column of 1s unless the data is centered. PCA may give us something, but it won’t eliminate multicollinearity (unless, perhaps, the data is centered).

Did we gain anything when we eliminated the multicollinearity?

We still can’t decide between the best 2-variable and 3-variable regressions. The adjusted R^2 haven’t changed much, if at all. Our new variables are weird combinations of the original variables. (I haven’t shown you how to compute the relationship between new and old, but I will.)

The only thing that seems to have improved is the t-statistics… and presumably the uncertainty in the fitted coefficients. I’m not sure how to verify that. (That means I’ve tried and gotten a strange result. In fact, I get that the original symmetric uncertainties are replaced by smaller unsymmetric uncertainties – and the loss of symmetry bothers me.)

It may be that it will always be worthwhile to investigate the complete removal of multicollinearity… if only because, for the Hald data, it says that the original t-statistics were reliable in telling us that two variables were significant, the third was marginal, and the fourth was irrelevant.

Next I propose to do the same thing to the Toyota data – which used dummy variables. Does that sound interesting? It does to me, obviously.
