Regression 1: Multicollinearity in the Hald data – 1

Edited 2011 Jan 25: one “edit” and two “aside” comments, all pertaining to vector and matrix norms.

Introduction and Review

Let me say up front that this post closes by explaining and using the “Variance Inflation Factors” from Mathematica’s Linear Model Fit. If that’s what you’re looking for, do a find on “VIF”. (I didn’t know how they were computed when I looked at some of Mathematica’s properties of a linear regression. Now I do.)

Before we begin investigating multicollinearity of the Hald data, let’s review what we have learned about the data.

I have shown you three or four ways of selecting the best regression from a specified set of variables – subject to the very significant caveat that we haven’t constructed new variables from them. That is, we have not transformed the variables, or taken powers and products of the variables, or taken lagged variables, and other such things. Had I done any of those things, I would have included the newly constructed variables in the specified set.

One of the ways was simply to run all possible regressions. When we did that for the Hald data, we found that our selection criteria were divided between two of the regressions:

Regression #6 has two variables, and all three of the coefficients in the fitted equation have significant T statistics:

Regression #13 has three variables, but – once again – three parameters have significant T statistics; the fourth T statistic is, however, greater than one (in absolute value), so the Adjusted R Squared would be lower if we omitted that variable:
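
For anyone who wants to reproduce the exhaustive search, here is a minimal sketch in Mathematica – not my original code – assuming the Hald data sit in a 13-row matrix called hald whose columns are X1, X2, X3, X4 and the response, in that order, with symbols x1 through x4 standing for the variables:

vars = {x1, x2, x3, x4};
allRegressions = Table[
  sub -> LinearModelFit[hald, sub, vars][{"RSquared", "AdjustedRSquared"}],
  {sub, Rest[Subsets[vars]]}]   (* Rest drops the empty subset, leaving 15 regressions *)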

Then we looked at less-than-exhaustive searches for finding good regressions.

The second way of searching for the best regression was called “backward selection”. We start with a regression using all of the variables… drop the variable with the lowest T statistic and run a regression without it… and then drop the variable with the lowest T statistic out of that regression… and so on:

We see that backward selection has chosen the two variable regression {X1,X2} and the three variable regression {X1, X2, X4} – which are precisely the two selected out of all possible regressions.
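
Here is a minimal sketch of backward elimination – just the idea, not my "backwards" routine – under the same assumptions about hald and vars as above:

current = vars;
While[current =!= {},
 lm = LinearModelFit[hald, current, vars];
 Print[current -> lm["AdjustedRSquared"]];
 tstats = Abs[lm["ParameterTableEntries"][[2 ;;, 3]]];  (* |t| for each variable, skipping the constant *)
 current = Delete[current, Ordering[tstats, 1]]         (* drop the variable with the smallest |t| *)
]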

The third way of searching for the best regression was called “forward selection”. I view it as the first step in “stepwise regression”, and my subroutine for doing it is called “stepwise” instead of the perhaps more appropriate “forward”.

What we got was:

We see our by now familiar three variable regression {X1, X2, X4}… but the two variable regression is {X1, X4}, instead of {X1, X2} which we know is superior.
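
And a minimal sketch of plain forward selection, greedily adding whichever remaining variable most increases the R Squared (again, just the idea, not my "stepwise" routine):

chosen = {}; remaining = vars;
While[remaining =!= {},
 scores = (LinearModelFit[hald, Append[chosen, #], vars]["RSquared"] &) /@ remaining;
 best = remaining[[First[Ordering[scores, -1]]]];   (* the variable giving the biggest R Squared *)
 chosen = Append[chosen, best];
 remaining = DeleteCases[remaining, best];
 Print[chosen -> LinearModelFit[hald, chosen, vars]["AdjustedRSquared"]]
]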

The transition from forward selection to the fourth way, stepwise regression, is made by dropping X4 from consideration – because it had a low T statistic…

With X4 out of the running, we choose the two variable regression {X1, X2}.

By the way, because backward selection chose {X1, X2}, we could have seen that it had a higher Adjusted R Squared than the {X1, X4} chosen by forward selection. (Without running all possible regressions, we don’t know that {X1, X2} is the best two variable regression – but we know it’s better than {X1, X4}, at least by one criterion, namely Adjusted R Squared.)

At this point we have two regressions of interest. There are two more.

One is the regression with all four variables, because it exhibits one of the signs of multicollinearity: a lot of T statistics which used to be significant cease to be so.

I find it convenient to access it from the backwards selection rather than from stepwise – because the result of backwards selection displays the variables in the correct order:

I remind you that #4 was the first regression run backwards, but I store them in reverse order for consistency with the stepwise results, from smallest to largest number of variables.

The fourth regression of interest turned up when I took a closer look at all possible regressions. It was a three variable regression using {X2, X3, X4}…

There are two significant facts about that regression. One, all of its T statistics are significant. Two, by every selection criterion we have, it is the worst of the three-variable regressions.

(I’m still in the market for a handy criterion – rather than searching all possible subsets – which would have selected regression #15 precisely because all of its t-statistics were fine upstanding members of the community… i.e. were all significant. I found this interesting regression accidentally as the result of an exhaustive enumeration, not by an elegant search.)

For now, let me confine myself to investigating all four variables, rather than any smaller subsets.

Using the Singular Value Decomposition

We spent the previous post using the SVD to detect and to isolate linear dependence. We still don’t know much about multicollinearity – except that I’ve called it “approximate linear dependence”. Oh, and that I am most suspicious of it being an issue when I see significant t-statistics fall into insignificance when I add a variable.

In this post we will be studying the possible multicollinearity of the variables in the 4-variable regression…

which I will frequently call “the regression”. Let’s just apply the SVD to the design matrix for the Hald data. Just as for linear dependence, we really only need the list of singular values, rather than the entire SVD

X = u w v’,

where v’ denotes the transpose of v.

Since we are running regressions on the Hald data, we know on general principles that we need to include the column of 1s in our analysis. We learned this in the last post.
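
Here is a sketch of that setup: prepend a column of 1s to the four X columns to get the design matrix, and ask for its singular values (again assuming hald holds the four X columns followed by the response):

X = Prepend[#, 1] & /@ hald[[All, 1 ;; 4]];   (* 13 x 5 design matrix: each row is {1, X1, X2, X3, X4} *)
W = SingularValueList[N[X]]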

There are 5 nonzero singular values… the matrix X is of rank 5… but, of course, we knew that, because X’X was invertible. Otherwise we could not have actually gotten an answer for the regression.

But one of those five is significantly smaller than the others: the ratio of the fourth singular value to the fifth one is…

W[[4]] / W[[5]] = 294.192

The fourth singular value is roughly 300 times as large as the fifth.

Coupled with the tell-tale sign that four of the five t-statistics in the regression are insignificant, that single small singular value cries out “multicollinearity” to me.

The uncertainty comes from the question: how small is small enough to be of concern? We’ll be getting to that, as best we can.

Again: if I had no other evidence, I wouldn’t be sure that 0.035 was really small for this design matrix. But I have multiple t-statistics that used to be good until we added the fourth variable.

We will get other evidence, too.

But for now, I take it that 0.035 is really small. That is, I declare we have detected multicollinearity.

Let’s try isolation. Just as I did for linear dependence, I want to investigate subsets of the columns of X. And just as I did for linear dependence, I’m actually going to examine subsets of the rows of the transpose X’ – strictly for Mathematica’s convenience.

I get the transpose of X… define r = {1,2,3,4,5}… and then get all possible subsets of 4 elements of r…

Now I can use the names instead of the indices when I print things out – i.e. use names[[s[[i]]]] instead of just s[[i]]:
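
Here is a sketch of that bookkeeping, using the design matrix X from above: take the transpose, pick every 4-element subset of its rows, and list the singular values of each submatrix under the corresponding names.

XT = Transpose[X];
names = {"con", "X1", "X2", "X3", "X4"};
r = Range[5];
s = Subsets[r, {4}];
Table[names[[s[[i]]]] -> SingularValueList[N[XT[[s[[i]]]]]], {i, Length[s]}] // TableForm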

Hmm. We’re not out of the woods yet: 0.139 and 0.163 in lines 2 and 4 aren’t all that large. Yes, they’re about 4 times larger than the smallest of the original five singular values, but is that large enough to rule out a problem?

And just what are the two subsets with small singular values? Answers: {con, X1, X2, X4} and {con, X2, X3, X4}. What’s common to them? {con, X2, X4}.

But note that {X1, X2, X3, X4}, which contains both X2 and X4, is OK (smallest singular value ~ 10), apparently because it doesn’t contain the constant.

Let’s come back to this. I’m going to put it off for another post.

assessing the smallest singular value

Let me remind us of what the smallest singular value means. (We’ve seen it before, and perhaps this post is a good additional explanation.) Suppose we take the complete singular value decomposition of X:

X = u w v’

The matrix w is as close to diagonal as a rectangular matrix can be:

Those five nonzero entries are precisely the singular values, which we got earlier as a list:

W = {211.367, 77.2361, 28.4597, 10.2674, 0.0349002}.

Now, set the smallest singular value to zero…

w[[5,5]]=0;

… so our w matrix becomes

… and build a matrix M using the altered w matrix (but the same u and v matrices):

M = u w v’


How much different is the matrix M from X? Here are the differences…

and here is the sum of squares of all those numbers, followed by its square root:
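
Collecting those steps into one sketch (the same u, w, v as before, from the full singular value decomposition of the design matrix X):

{u, w, v} = SingularValueDecomposition[N[X]];
w[[5, 5]] = 0;                     (* zero out the smallest singular value *)
M = u.w.Transpose[v];              (* rebuild using the same u and v *)
diff = X - M;
Total[diff^2, 2]                   (* the sum of squares of all the differences *)
Sqrt[Total[diff^2, 2]]             (* ... and its square root *)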

and what was that smallest singular value?

W[[5]] = 0.0349002

M is a matrix of rank 4 constructed from the matrix X of rank 5 by setting the smallest singular value of X to zero. The 2-norm (the Euclidean norm, the square root of the sum of squares) of the difference is precisely that smallest singular value.

X is of rank 5 but differs from a matrix of rank 4 by 0.0349002.

I should remark that M is not unique, in the sense that there are other matrices of rank 4 which are just as close to X… but there is no matrix of rank 4 which is closer to X.

A Mathematica note: we can get the 2-norm by using Mathematica’s Norm command; see the aside, and the sketch after it, for the details.

Aside
Norms of vectors and norms of matrices are not the same thing. We have a matrix, diff, of differences; we could “flatten” it to get a vector vd. The 2-norm of the vector corresponds to the Frobenius norm of the matrix, and they are equivalent: the square root of the sum of squares.

But while the infinity-norm of a vector is the maximum absolute value of the entries in the vector, the infinity-norm of a matrix is the maximum row sum of the absolute values.

I’ve been meaning to write about vector and matrix norms for quite some time.
(end aside)
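
Here is a sketch of the distinction, applied to the difference matrix diff from above. The first three numbers coincide here because diff has only one nonzero singular value; the last two do not measure the same thing at all.

Norm[Flatten[diff]]             (* vector 2-norm of all the entries, i.e. the Frobenius norm *)
Norm[diff, "Frobenius"]         (* the same number *)
Norm[diff]                      (* matrix 2-norm: the largest singular value of diff *)
Norm[Flatten[diff], Infinity]   (* vector infinity-norm: the largest absolute entry *)
Norm[diff, Infinity]            (* matrix infinity-norm: the largest absolute row sum *)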

Ah, ha! If we drop all five singular values from X, we end up with the zero matrix. The only question is whether X itself should differ from the zero matrix by the sum of its singular values or by the largest one. I was a little surprised at first that it’s not the sum of the singular values, but it makes sense: in the matrix 2-norm, the distance from X to the zero matrix is the largest singular value; in the Frobenius norm it would be the square root of the sum of their squares, which still isn’t the plain sum.

The upshot is that the ratio of smallest to largest singular value gives us some idea of whether the smallest value is very small compared to the numbers in the matrix. The smallest singular value is an absolute measure of how far X is from rank 4. The ratio of smallest to largest singular values of X is then a relative measure of how far X is from rank 4. Converted to a percentage:

100 W[[5]] / W[[1]] = 0.0165116

There’s another way of looking at the relationship between X of rank 5 and M of rank 4. As displayed, they are obviously different.

What rounding of M would be required to make it match X visually?

That amounts to looking at (edit: see the aside just below) the 1-norm, the largest absolute difference. The singular value told us the total difference between X and M… the largest absolute difference would tell us the worst-case difference between pairs of entries.

Aside:
My bad. It isn’t the 1-norm but the infinity-norm; and it’s the vector norm rather than the matrix norm.
End aside

So, we could do either or both of two things.

One, compute the largest absolute difference:

Two, do what we first thought of, make M and X look the same; but it’s easier to round off and display the difference matrix X – M… and we don’t have to use trial-and-error: given a maximum error of 0.0235, we know we can round to twice that.
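
Both steps are one-liners in Mathematica, again using diff from above (the 0.0235 is the maximum error quoted just above):

Max[Abs[diff]]                    (* the worst-case entry difference, about 0.0235 *)
Round[diff, 0.05] // MatrixForm   (* rounded to the nearest 0.05, every entry displays as zero *)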

Please note: it is very convenient that all of the Hald data were integers, and that they were about the same size. If our variables had had different precisions or were of substantially different sizes, we would not be able to use just one single rounding value (0.05 in this case).

one of the standard tools: Variance Inflation Factors

First, let’s just ask for the Variance Inflation Factors (VIFs) of the 4-variable regression:
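
In Mathematica this is a single property of the fitted model; a sketch, reusing hald and vars from above:

lm4 = LinearModelFit[hald, vars, vars];   (* the 4-variable regression *)
vifs = lm4["VarianceInflationFactors"]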

What does Ryan (“Modern Regression Methods”, 0-471-52912-5) say? “Multicollinearity is declared to exist whenever any VIF is at least equal to 10.” OK: by that criterion, we have multicollinearity and it involves all of X1, X2, X3, X4.

More importantly, he says that the i-th VIF is computed from the R Squared (not the Adjusted R Squared):

VIF(i) = 1 / (1 – R(i)^2).

We can turn that around and compute each R Squared from the given VIF:

R^2 = 1 – 1 / VIF

(Ahem… not for the first one, so drop the zero.)
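
As a sketch, dropping that leading entry for the constant:

rsq = 1 - 1/Rest[vifs];
100 rsq   (* as percentages *)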

We see that writing it as VIF rather than as R Squared magnifies the differences. Those two 97.5% values don’t look all that different from the two 99.6% values.

Oh, the reason the multicollinearity involves all four variables is that any one of the four can be fitted very well by the other three. If, in contrast, X3 had not been involved in the multicollinearity, we would find that X3 could not be fitted by the other three.

Again: the R^2 (R Squared) telling me that X4 can be fitted by the other three variables doesn’t, by itself, tell me that all three were significant in that fit. Maybe X3 doesn’t matter for fitting X4. Well, then X3 itself would not be fitted well by the other three.

Let’s check the VIF for one case. Let’s run a regression of X4 on X1, X2, X3. I need to drop the dependent variable from my data matrix….


Now get the R Squared… subtract it from 1… and invert: i.e. compute

VIF = 1 / (1 – R^2)

and that is, indeed, what they got for the VIF of X4.
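
Here is a sketch of that check, assuming as before that the columns of hald are X1, X2, X3, X4 and the response:

d2 = hald[[All, 1 ;; 4]];                               (* drop the response; the last remaining column, X4, now plays the role of the dependent variable *)
lmX4 = LinearModelFit[d2, {x1, x2, x3}, {x1, x2, x3}];  (* regress X4 on X1, X2, X3 *)
r2 = lmX4["RSquared"];
1/(1 - r2)                                              (* should reproduce the VIF reported for X4 *)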

Okay, now we know how VIFs are computed.

So the VIFs are a compact summary of what we would get if we ran those four auxiliary regressions ourselves.

Well, just what did we get for that regression?

Whoa! That is awfully close to

X4 = 98.65 – X1 – X2 – X3

i.e.

X1 + X2 + X3 + X4 = 98.65

which is close to

X1 + X2 + X3 + X4 = 100.

Do you get the impression that there might have been a few more variables in the data, and that all of them added up to 100%?

Darn it, I know I saw a detailed exposition of the full data set, but I can’t find it now. Anyway, I have read somewhere that our suspicion is correct: the sum of all variables was 100%, but some variables were omitted, and the remaining four do not quite add up to 100.

That’s easy enough to confirm; we should compute the row sums. But don’t use the full data set… we have to omit the dependent variable. Fine, use matrix d2, which we used for our regression of X4 on the others:

And then ask Mathematica to add up each column of the transpose:
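
A sketch: Total by itself sums down the columns, so applying it to the transpose of d2 gives the row sums.

Total[Transpose[d2]]   (* row sums of the four X columns; each is close to, but not exactly, 100 *)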

So, we have detected multicollinearity involving all four variables, and we even know why it happened; the constant term was crucial. (We saw in the post about linear dependence that X1 + X2 = 0 is linearly dependent, but X1 + X2 = 1 is not – unless there’s a constant term.) So this is a real example where we would have missed the multicollinearity had we not investigated the design matrix, with its column of 1s. The approximate relationship among the X’s by themselves should be far weaker without the constant term.

Let’s see.

Summary

We reviewed the regressions we had seen for the Hald data. I believed, even if you didn’t, that the Hald data was multicollinear because the t statistics went to hell when we used all four variables.

We used two techniques – the Singular Value Decomposition and Variance Inflation Factors – to detect and to isolate multicollinearity. I might argue that for multicollinearity, detection isn’t very clear-cut. At some level, any two vectors which are not orthogonal are somewhat multicollinear.

Let me elaborate. Given two vectors, either one can be split into two component vectors, one parallel to the other and one orthogonal to the other. That parallel component is multicollinearity, to some extent. But is it enough to matter? Is it serious?

I think we will find that these techniques isolate multicollinearity, and they quantify it – but I’m not convinced that they can distinguish serious multicollinearity from benign. (I’m not convinced that a VIF > 10 always marks serious multicollinearity.)

We saw such quantification when we showed that the rank 5 design matrix for the Hald data differed from a matrix of rank 4 by less than 0.025 in each entry.

We ran a regression of X4 as a function of X1, X2, X3 to identify the multicollinearity – the data has almost-constant row sums.

The computational situation is different for linear dependence and multicollinearity, too. For linear dependence, the symptom is that X’X is not invertible; for multicollinearity, by contrast, X’X must be invertible, otherwise we couldn’t get a regression equation at all.

Next? I expect that in the next couple of posts I will look at two more techniques; and I will look for multicollinearity in subsets of the Hald data. (Some of the singular values for subsets were only about a factor of 4 larger than the one related to the nearly-constant row sum of all four variables.)
