Regression 1: Multicollinearity in the Centered Hald Data

Edit, 2011 Nov 25: added a link to the Norms and Condition Numbers post. Find “Edit”.

Let us put the Hald data to rest, by making a transformation intermediate between the raw data and the standardized data. We will merely center the data, by subtracting from each variable its mean.

Some things will change, but some things will not. I learned a lot from both.

In particular, we will see – as we did with the standardized data – that the relationships among the independent variables remain. The details change, but we still have a very strong relationship among X1, X2, X3, and X4… a strong relationship between X2 and X4… and a weak relationship between X1 and X3.

These relationships will be identified, as before, in a couple of ways. The VIFs, variance inflation factors – which I recast as RSquareds using the equation

$R^2 = 1 - \frac{1}{\text{VIF}}$

which came from the definition in this post

$\text{VIF} = \frac{1}{1-R^2}$

are what would show me that these three relationships among the raw variables still hold at the same strength, even when we center or standardize the data.
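Here is a minimal sketch of that conversion and of the invariance claim, in Python with NumPy rather than the Mathematica used in this post. The data are hypothetical (three random columns plus a fourth that nearly completes a constant sum), and the helper names `vif_r_squared`, `vif_from_r2`, and `r2_from_vif` are made up for illustration.

```python
import numpy as np

def vif_r_squared(X):
    """R^2 from regressing each column of X on all the others, plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        out[j] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return out

def vif_from_r2(r2):
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    return 1.0 - 1.0 / vif

# Hypothetical data, NOT the Hald data: the four columns sum to roughly 100.
rng = np.random.default_rng(0)
raw = rng.uniform(0, 30, size=(13, 3))
fourth = 100 - raw.sum(axis=1) + rng.normal(0, 1, size=13)
raw = np.column_stack([raw, fourth])

centered = raw - raw.mean(axis=0)
standardized = centered / raw.std(axis=0, ddof=1)

r2_raw = vif_r_squared(raw)
r2_centered = vif_r_squared(centered)
r2_standardized = vif_r_squared(standardized)

print("R^2 raw:         ", np.round(r2_raw, 4))
print("R^2 centered:    ", np.round(r2_centered, 4))
print("R^2 standardized:", np.round(r2_standardized, 4))
print("VIFs:            ", np.round(vif_from_r2(r2_raw), 1))
```

The three printed rows of R^2 should agree to rounding, because shifting or rescaling a column does not change how well the other columns (with an intercept) explain it.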

Bear in mind, however, that we must successfully fit a model in order to compute VIFs. If we have exact linear dependence in the design matrix X, we will be unable to invert X’X. And we must fit a series of models in order to look at a series of VIFs.

In contrast to the VIFs, however, the SVD can still be used even when we have exact linear dependence – that is, even when we cannot invert X’X – and it will show us precisely what that dependence is. In addition, the SVD is more easily handled for several variables; we can use it as soon as we get the design matrix for all the variables.
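A small sketch of that point, again in NumPy and with a hypothetical design matrix in which the last column is exactly the sum of the two before it. X’X is singular here, yet the SVD goes through, and the right singular vector for the zero singular value spells out the dependence.

```python
import numpy as np

# Hypothetical design matrix: constant, x1, x2, and x3 = x1 + x2 exactly.
x1 = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
x2 = np.array([3.0, 1.0, 5.0, 2.0, 8.0])
X = np.column_stack([np.ones(5), x1, x2, x1 + x2])

u, sv, vt = np.linalg.svd(X, full_matrices=False)

# The last row of vt (last column of v) goes with the smallest singular value.
null_vec = vt[-1]
scaled = null_vec / null_vec[1]   # rescale so the x1 coefficient is 1

print("singular values:", sv)
print("dependence (CON, x1, x2, x3):", np.round(scaled, 6))
```

The rescaled vector should come out as (0, 1, 1, -1), that is, x1 + x2 - x3 = 0, with the constant not involved.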

Furthermore, I have found that there is an intermediate area… where LinearModelFit will “invert” X’X when it shouldn’t. The SVD will show us that the answers are unreliable.

And that, ultimately, is why I expect to continue using _both_ the VIFs and the SVD.

The singular value decomposition – the smallest singular values in particular, and the condition number, and the rounding of X.v – seems to be much more sensitive to scaling the data… and we are about to see that it is sensitive to simply centering the data. While it still provides an ordering of the different multicollinearities, the numerical values it assigns to them are related more to loss of accuracy in computing the model.

My strongest motivation for the post about norms and condition numbers was to illustrate the meaning of the condition number using an example of a Hilbert matrix. I, at least, am convinced that the condition number of X reflects threats to the invertibility of X’X – and we have seen that the SVD can detect threats where the VIF sees no relationship.

So, I have come to believe that the VIFs are definitive for relationships among the independent variables… so long as we can invert X’X; but that the SVD is very useful, too. I won’t be abandoning the SVD.

We have already seen that the standardized data and the raw data have three multicollinearities in common; we will see that the centered data also has the same three multicollinearities – as measured by the VIFs.

But we have already seen that the SVD would have flagged X1 and X2 in the raw data, but not in the standardized data. We will see that, for the centered data, the SVD will flag the constant, just the constant all by itself.

OK, let’s go.

Get the data, which I got from Draper & Smith…

For completeness, here’s the data matrix:

As always, it is convenient to have two sets of names…

Get the mean of each column, including the dependent variable…

Subtract the column mean from each entry in that column…

Check our work; if I subtract the centered data from the raw data, I should see columns of constants, each constant being the mean of the original column:
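For readers without Mathematica, the centering and the check can be sketched in NumPy. The numbers below are the Hald cement data as commonly tabulated; that they match the matrix displayed in this post is an assumption.

```python
import numpy as np

# Hald cement data as commonly tabulated (assumed to match the post's matrix).
x1 = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
x2 = [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68]
x3 = [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8]
x4 = [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12]
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]
raw = np.column_stack([x1, x2, x3, x4, y]).astype(float)

means = raw.mean(axis=0)      # one mean per column, dependent variable included
centered = raw - means        # broadcasting subtracts each column's mean

# The check: raw minus centered should be columns of constants (the means).
difference = raw - centered
print(np.round(means, 4))
print(np.round(difference[:3], 4))
```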

Good. Let me cut to the chase, and get the design matrix for the 4-variable regression. Note that I could just construct the design matrix myself; it is convenient to run a regression in order to get the design matrix, but I didn’t have to.

Only one t-statistic is significant, for X1, although there is no doubt that the constant term is 0, despite its t-statistic of 0.

Oops. Old habits die hard. “Despite” is the wrong thing to say.

Let me explain. A large t-statistic is strong evidence that the coefficient is nonzero; that is, a value that large would be very unlikely if the true coefficient were zero. So an extremely small t-statistic is what we want when we think the coefficient really is zero. Thus, a t-statistic of 7 x 10^-15 is very, very nice for the constant term in this case.

To put that another way, a high t-statistic does not mean there’s a high probability that the coefficient is right; it means the data are hard to reconcile with a coefficient of zero. Similarly, a small t-statistic, as for the constant term in this case, does not mean the coefficient is wrong; it means the coefficient is indistinguishable from zero – and that’s what we expect in this case, since all the variables, including the dependent, have been centered.
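To see the effect in isolation, here is a sketch with a tiny hypothetical data set (not the Hald data). It fits the same model to raw and to centered data by ordinary least squares; the helper `fit_with_constant` is made up for this illustration and is not LinearModelFit.

```python
import numpy as np

# Hypothetical data: two predictors and a response, six observations.
x = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
              [4.0, 3.0], [5.0, 8.0], [6.0, 6.0]])
y = np.array([3.1, 2.9, 8.2, 7.1, 12.8, 12.1])

def fit_with_constant(x, y):
    """Least-squares coefficients and t-statistics, constant term first."""
    A = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (len(y) - A.shape[1])          # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))  # standard errors
    return coef, coef / se

raw_coef, raw_t = fit_with_constant(x, y)
cen_coef, cen_t = fit_with_constant(x - x.mean(axis=0), y - y.mean())

print("raw:      coef", raw_coef, " t", raw_t)
print("centered: coef", cen_coef, " t", cen_t)
```

The slopes and their t-statistics should be the same in both fits; only the constant changes, and in the centered fit both it and its t-statistic should be zero to within rounding error.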

I’m about to look at subsets – without running regressions. I want to make it clear that I can do this. I know, however, that the VIFs flag the same three relationships among the variables. I will show you that at the end of the post.

Here’s the design matrix X (and I define but do not print its transpose XT, for later use):

I call for its singular value decomposition, and for simplicity I also call for its singular values “sv” — just so I don’t have to extract them from the 13×5 matrix w. We have

X = u w v’

where v’ denotes the transpose of v.

I print v – chopped but not rounded, to get rid of numbers less than 10^-15:

The only nonzero entry in the first row, and in the fourth column, is a 1. That says that v – which is a transition matrix for a new basis – simply moves the first column to the fourth. It does a lot to the other four columns, but does nothing to the first one except to move it. We can see this explicitly:

There is another interpretation. The columns of v are eigenvectors of X’X (that’s how they were found), and so we are seeing that {1,0,0,0,0} is one eigenvector of X’X – but why? Is it always a consequence of centered data?

I begin to think it is. It was true of the standardized data, but I didn’t notice it. I’m not going to pursue this now, but someday….
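A sketch of why this should hold for any centered data, assuming the design matrix is a column of 1s followed by centered columns. The symbols $\mathbf{1}$, $X_c$, $n$ and $e_1$ are introduced here for illustration: $X = [\mathbf{1} \mid X_c]$, with every column of $X_c$ summing to zero, so that $\mathbf{1}'X_c = 0$. Then

$X'X = \begin{pmatrix} \mathbf{1}'\mathbf{1} & \mathbf{1}'X_c \\ X_c'\mathbf{1} & X_c'X_c \end{pmatrix} = \begin{pmatrix} n & 0 \\ 0 & X_c'X_c \end{pmatrix}$

and therefore

$X'X\,e_1 = n\,e_1, \qquad e_1 = (1,0,0,0,0)'$

So $e_1$ is an eigenvector of X’X with eigenvalue $n$, and the corresponding singular value of X is $\sqrt{n}$; with 13 observations that is $\sqrt{13} \approx 3.60555$. Which column of v it lands in depends only on where $\sqrt{n}$ ranks among the other singular values.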

Let’s round X.v off, just to get a clearer picture of what we have:

It appears that as we move to the right, each column is generally smaller than the one before it.

This is true. We are multiplying columns of unit vectors by singular values, and getting columns whose length (as vectors) is the corresponding singular value.

Let me show you this. Get the transpose of Xv – only because Mathematica is row-oriented… for each row (vector), compute the 2-norm (the square root of the sum of squares of its entries):

The lists match. The singular values are in decreasing order, and so are the columns – decreasing in the sense of the 2-norm of each column.
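The same check can be sketched in NumPy. The Hald values below are the commonly tabulated ones (an assumption that they match the matrix used in this post); `np.linalg.norm(..., axis=0)` takes the 2-norm of each column directly, so no transpose is needed.

```python
import numpy as np

# Hald predictors as commonly tabulated (assumed to match the post's data).
x1 = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
x2 = [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68]
x3 = [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8]
x4 = [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12]
Xc = np.column_stack([x1, x2, x3, x4]).astype(float)
Xc -= Xc.mean(axis=0)                     # center each column
X = np.column_stack([np.ones(13), Xc])    # columns: CON, X1, X2, X3, X4

u, sv, vt = np.linalg.svd(X, full_matrices=False)
v = vt.T
Xv = X @ v

col_norms = np.linalg.norm(Xv, axis=0)    # 2-norm of each column of X.v
k = int(np.argmin(np.abs(sv - np.sqrt(13))))  # column carrying the constant

print("singular values:", np.round(sv, 5))
print("column 2-norms: ", np.round(col_norms, 5))
print("constant sits in column", k + 1, "of X.v")
```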

Incidentally, this means that if a column of X.v remains the same as we look at different subsets, then it will have the same singular value associated with it. To be specific, any time that column of 1s appears in X.v, it will have the same singular value (3.60555) associated with it – because that’s its unchanging length. I wondered about that in an earlier post, but I hadn’t realized that I was looking at the singular value associated with the constant term… I hadn’t realized there was a constant term in X.v. (It’s not obvious even in retrospect, because I invariably rounded off X.v. Perhaps I should go back and look at v – it will show clearly if it is just moving the column of 1s.)

But if we look again at the rounded off X.v in this case, we see that there is one entry in the fifth column which is greater than 1 in absolute value.

Hmm.

If we start rounding off to make column 5 disappear, column 4 will vanish first. That is, rounding to the nearest multiple of a number slightly greater than 2 will zero out the fourth column…

… but it will take about 2.3 to zero out the fifth column…

What does this mean? Well, first, why does the fourth column vanish first; second, why do both vanish?

First, by trying to round off so that every entry in a column vanishes, we are essentially using the $\infty$-norm of the column vector – because we have to make the largest component in absolute value, whose size is the $\infty$-norm, vanish.
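A sketch of that connection, assuming that “rounding” means rounding to the nearest multiple of a step, as Mathematica’s two-argument Round does. The fifth column below is hypothetical, chosen only so that its largest entry is 1.15 in absolute value; the helper names are made up.

```python
import numpy as np

def round_to(x, step):
    """Round each entry to the nearest multiple of step."""
    return np.round(np.asarray(x, dtype=float) / step) * step

def vanishing_step(col):
    """Twice the infinity-norm: any larger step rounds the whole column to 0."""
    return 2.0 * np.linalg.norm(col, ord=np.inf)

col4 = np.ones(13)  # the constant column, infinity-norm 1
# Hypothetical stand-in for the fifth column of X.v (infinity-norm 1.15).
col5 = np.array([0.30, -0.42, 0.18, 0.77, -1.15, 0.05, -0.61,
                 0.49, 0.26, -0.33, 0.91, -0.08, -0.37])

print(vanishing_step(col4), vanishing_step(col5))
print(round_to(col4, 2.01))   # all zeros: column 4 is gone
print(round_to(col5, 2.01))   # one entry survives
print(round_to(col5, 2.31))   # all zeros: column 5 is gone too
```

An entry rounds to zero when it is less than half the step in absolute value, so a column vanishes once the step exceeds twice its largest entry; its 2-norm, which is the singular value, plays no part.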

Second, as far as the singular value decomposition is concerned, the closest we come to a nullspace is 2-dimensional, not 1-dimensional.

What are the last two columns of X.v? That is, in terms of the original basis, which is {CON, X1, X2, X3, X4}, what are the fourth and fifth basis vectors defined by v?

Here are the last two columns of v:

The first result says that the constant term is the fourth basis vector – we knew that – and the second result says that the fifth basis vector is approximately half the sum of the four Xi.

We shouldn’t be surprised by the fifth basis vector: the raw data added up to almost 100. But the constant?

Zeroing out the last two columns simultaneously says that the closest we come to a nullspace is spanned by two vectors, the constant term and half the sum (or, what amounts to the same thing, the sum) of the original variables.

The SVD still picks out the sum of the variables as a close relationship, but with a singular value of 1.69 instead of the 0.14 we found for the standardized data and the 0.035 we found for the raw data. Here’s the thing: the relationship hasn’t changed much – the sum of the four variables Xi is nearly constant, 100 for the raw data, 0 for the standardized and for the centered – but the singular value associated with this same multicollinearity has changed a lot.

The size of the smallest and largest singular values – strictly, the square of their ratio, which is the condition number of X’X – reflects the invertibility of X’X. Scaling affects this, in general, although with such condition numbers, we’re quite safe.

The most severe multicollinearity in the Hald data does not endanger the inversion of X’X – not when we work in double precision.

As a rule of thumb, I said (edit) at the end of this post that the log base 10 of the condition number is how many digits of accuracy we might lose when we invert a matrix. The ratio of largest to smallest singular values for all the raw data was… and I square that to get a condition number for X’X:

We may be losing 7 digits when we invert X’X for the raw data, but we’re working with 16, which leaves 9, and we only print 6 of those.
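The arithmetic behind that rule of thumb, as a sketch. The small design matrix is hypothetical, the helper `digits_lost` is made up for illustration, and the figure of 16 working digits is the usual approximation for double precision.

```python
import math
import numpy as np

def digits_lost(cond):
    """Rule of thumb: log10 of the condition number."""
    return math.log10(cond)

# Hypothetical design matrix whose third column is nearly 1.5 times the second.
X = np.array([[1.0, 2.0, 3.1],
              [1.0, 4.0, 5.9],
              [1.0, 6.0, 9.2],
              [1.0, 8.0, 11.8],
              [1.0, 10.0, 15.1]])

sv = np.linalg.svd(X, compute_uv=False)
cond_X = sv[0] / sv[-1]     # ratio of largest to smallest singular value
cond_XtX = cond_X ** 2      # condition number of X'X (2-norm)

print("cond(X)   =", cond_X)
print("cond(X'X) =", cond_XtX, " direct:", np.linalg.cond(X.T @ X))
print("digits possibly lost inverting X'X:", digits_lost(cond_XtX))

# A condition number near 10^7 costs about 7 digits, leaving about 9 of 16.
print("digits left when cond = 1e7:", 16 - digits_lost(1e7))
```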

And that was the worst case.

Let me elaborate.

The smaller condition numbers for the centered data and standardized data tell us that the centered and the standardized data are much safer than the raw data when it comes to the inversion.

And that’s a different thing from the form of the multicollinearity.

The four variables Xi still add up to a constant, but the numerical threat posed by that relationship is less when the data is centered or standardized.

Fine, then. So much for the fifth column.

There’s no way around it: as far as the SVD is concerned, the second threat to the inversion – not a serious threat, but still ranked second – is the constant term, all by itself. It’s closer to zero – not very close, but closer – than any of the other basis vectors in v.

We rarely consider the possibility that we have actually been given a zero vector rather than a combination which sums to the zero vector, but that’s what’s happening here approximately.

The fourth column is approximately linearly dependent all by itself.

So: looking at all the variables, the SVD has flagged two multicollinearities almost simultaneously… only one of them is a relationship among the variables… and neither of them is a serious threat to inverting X’X.

Let’s look at subsets of 4 variables.

The fifth subset has the lowest singular value; then all other minimum sv’s are equal – and we recognize that 3.61 is the sv associated with the constant term, which the SVD thinks is the second most serious collinearity.

And the two distinct singular values are what we saw before. But now we have clearly got a subset without the constant term.

We do see two pairs of condition numbers, for subsets #1 and #3, #2 and #4 – as usual; but the condition numbers differ only because the largest singular values differ, since the smallest singular values are the same in those four subsets. (Remember to look at larger condition numbers, not smaller ones, as corresponding to smaller minimum singular values.)
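Looking at subsets without running regressions can be sketched like this in NumPy. The Hald values are the commonly tabulated ones (assumed to match the post’s data), the helper `subset_summary` is made up, and the subsets are generated in the order `itertools.combinations` gives, which puts the subset without the constant fifth.

```python
import itertools
import numpy as np

# Hald predictors as commonly tabulated (assumed to match the post's data).
x1 = [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10]
x2 = [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68]
x3 = [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8]
x4 = [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12]
names = ["CON", "X1", "X2", "X3", "X4"]
Xc = np.column_stack([x1, x2, x3, x4]).astype(float)
Xc -= Xc.mean(axis=0)
X = np.column_stack([np.ones(13), Xc])

def subset_summary(size):
    """For each subset of columns: names, singular values, min sv, ratio."""
    rows = []
    for cols in itertools.combinations(range(len(names)), size):
        sv = np.linalg.svd(X[:, list(cols)], compute_uv=False)
        rows.append(([names[c] for c in cols], sv, sv[-1], sv[0] / sv[-1]))
    return rows

for i, (cols, sv, min_sv, ratio) in enumerate(subset_summary(4), start=1):
    print(f"#{i} {cols}: min sv = {min_sv:.5f}, max/min = {ratio:.3f}")
```

Calling `subset_summary(3)` or `subset_summary(2)` gives the smaller subsets in the same way. The ratio printed is for the subset of X itself; square it for the condition number of the corresponding X’X.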

Look at the fifth subset, the most serious multicollinearity:

As before, a rounding of 2.3 wipes out the column that refers to the sum of all four Xi.

Here’s a subset (#2) that contains the constant.

This time, it’s the constant that vanished.

X.v is not picking up X2 and X4! It’s picking up the constant term!

Because it’s the constant term that was wiped out, the VIFs aren’t quite telling me the truth. Here I was, getting all set to rely predominantly on the VIFs, and I see that I still need X.v .

Let me say that again. The VIFs are picking up X2 and X4 – but the X.v is picking up the constant term by itself. We’re getting mixed – albeit true – signals.

I’m not going to look at any of the other subsets of four. Let’s move on to subsets of three.

This is quite different from the subsets of four. Where the subset without the constant had been the most serious multicollinearity, now all the subsets with the constant are more severe than those without. This is good, because it’s true: the constant term, as I’ve said before, is the second most serious threat to inverting X’X. Not a very serious threat at all, but ranked second among the threats.

Let’s look at the subset with the highest condition number, #5:

The required rounding was 2.1 to wipe out the last column. Since the subset included the constant term, that should have been the last column. Was it?

Yes. Once again, what went to zero was the constant term, not a combination of X2 and X4.

There is another oddity. {CON, X1, X3} has a much lower condition number than the others; in particular, it has a lower – safer – condition number than {CON, X1, X2}. Well, we’ve seen the SVD flag {CON, X1, X2} in the raw data – but it wasn’t as severe as {CON, X1, X3}. Although X1 and X3 are more closely related than X1 and X2, the SVD says that the values of X1 and X2 pose a greater threat to inversion.

Centering the data has wreaked some changes in the geometry… but it also made the inversion of X’X much safer than for the raw data. The SVD is quibbling over the threats to inversion from rather safe matrices.

Let’s look at subsets of 2:

Judged by the smallest singular values, any subset containing the constant term is less safe than any subset without a constant – but the condition numbers tell a different story: the condition number for {X2, X4} is larger – less safe – than those for {CON, X1} and {CON, X3}.

But we still have oddities. The more weakly related {X1, X3} has a smaller minimal sv than the more strongly related {X2, X4}. But the condition numbers are ordered properly: largest is {X2, X4}, next is {X1, X3}.

I think I will continue to use both the singular values and the condition number, because they may not agree.

Oh, let me choose the subset which contains {X1, X3} – I want to see its VIF RSquareds:

We see that each of X1 and X3 is a weak function of the other.

Finally, having looked at subsets of the data without ever needing to run a regression, let me get what I know to be the best k-variables regressions. Backward selection picks out the best 4-, 3-, and 2-variable regressions…

The two approaches agree on the best 3- and 4-variable regressions. (Yes, there’s only one 4-variable regression, so they can’t very well disagree on it.)

Here are the RSquareds from the VIFs for the 4-variable regression…

We see that all four variables are related. Furthermore, those numbers are exactly the same as for the raw data, and for the standardized data. As I said, the strength of these relationships is not affected by centering or standardizing the data (although the numerical coefficients in the equations are affected).

Here are the RSquareds from the VIFs for the best 3-variable regression:

We know that this is X2 and X4, but let’s see it. The order of the variables is as shown…

… so the second and third VIF R^2 pertain to X2 and X4; it’s X1 that has a low R^2 as a function of X2 and X4.

But what about X1 and X3? Well, that’s why I looked at it back when I had subsets of two variables. Among the best regressions, there isn’t one that includes both X1 and X3 – except for the 4-variable regression, which overlooks {X1, X3} in the face of the stronger relationship among all four variables.

That’s the weakness of the VIFs – we need to have a regression in hand; and I find it easier to just look at subsets directly, rather than construct an exhaustive collection of regressions.

Need a summary? Go back and reread the introduction.

And breathe a sigh of relief. I think I’m done with multicollinearity of the Hald data.