Regression 1: Multicollinearity in the Hald data – 2

Introduction

The Hald data turns out to have been an excellent choice for investigating multicollinearity: it has at least four “near linear dependencies”. I’m about to show the details of three of them. (And in a subsequent post, I think I can eliminate all but three of them – but not the same three!)

We have already seen one of them: we know that the four independent variables have a nearly constant sum just under 100. (These four variables are, in fact, a subset of a larger set of variables – whose sum was 100%.)

We have seen two approaches to finding (exact) linear dependence of a set of variables:

We used the singular value decomposition of the design matrix for all four variables (that is, our subset was the entire set) to discover that all four variables (with a constant, too) were multi-collinear. But I did not continue looking at all the subsets.

I think I will return to that approach, looking at all subsets… but not today. Instead, I want to look at an orthonormal basis for the closest thing we have to a null space. And I want to do it for four regressions which I decided were worth investigating.

I’m going to use one additional tool, the variance inflation factors – indirectly. You will see that I view them as one possible explanation for multicollinearity detected by the SVD. (And yes, that should be surprising.)

At the end, I’m also going to display the eigenstructure table… not because I will be using it, but precisely because I don’t understand it and I want to see what it says. (I’m collecting experimental data, if you will.) Oh, I will also finally display the correlation matrix for the Hald data, and explain why I don’t much care for it.

This post, then, has two introductory sections… investigations of four regressions… one section about some other measures of multicollinearity… and a summary.

Let’s just get started. I will include a leisurely introduction (“initialization”)… but the new material will consist of applying one technique (two subordinate calculations) to four different regressions. If you’re up to speed on the Hald data and want to skip over the following section, feel free to do so.

Initialization

Get the data. (Incidentally, the RHS of the following was supplied by my executing Mathematica’s “Insert / File Path” command, and then finding the file I wanted on my computer.)
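For anyone following along, here is a minimal sketch of the import; the file name is a placeholder, and I’m assuming the file holds 13 rows of the form {X1, X2, X3, X4, Y}:

data = Import["hald.csv"];   (* placeholder path; each row assumed to be {X1, X2, X3, X4, Y} *)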

Regressions will require the following list of names:

I get set up by running both forward and backward selection on the data (as I have said before, my “stepwise” function should have been called “forward”, but I’m too comfortable with the original, inaccurate, name):

My selection criteria applied to the stepwise regressions reg recommend either reg[[2]], which uses X1, X4; or reg[[3]], which uses X1, X2, X4. (I automatically number the backwards selection in reverse order, so that bac[[4]] corresponds to reg[[4]], etc.)

Similarly, our selection criteria applied to the results bac of backward selection recommend either bac[[2]], which uses X1, X2; or bac[[3]], which uses X1, X2, X4:
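To keep later sketches concrete, here are those regressions written out directly with LinearModelFit. This is only a sketch, not my stepwise or backward code; x1…x4 are symbols standing for the four columns, under the column layout assumed above:

reg2 = LinearModelFit[data[[All, {1, 4, 5}]], {x1, x4}, {x1, x4}];             (* y ~ X1, X4 *)
bac2 = LinearModelFit[data[[All, {1, 2, 5}]], {x1, x2}, {x1, x2}];             (* y ~ X1, X2 *)
bac3 = LinearModelFit[data[[All, {1, 2, 4, 5}]], {x1, x2, x4}, {x1, x2, x4}];  (* y ~ X1, X2, X4 *)
bac4 = LinearModelFit[data, {x1, x2, x3, x4}, {x1, x2, x3, x4}];               (* y ~ X1, X2, X3, X4 *)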

It’s easy enough to compare the two different 2-variable regressions…

Call the selection criteria with just those two regressions. Remembering that Cp treats the last regression in the list as its touchstone, and that Cp cannot possibly select the touchstone, we expect that Cp must choose reg[[2]], the first in the list:

Every criterion (other than Cp) recommends bac[[2]] over reg[[2]]. And that’s what stepwise would have done, too, in the sense that it would have chosen bac[[2]] once we dropped X4 from consideration.

The reason we started looking at multicollinearity was reg[[4]] – or, the form I prefer, bac[[4]] – all four variables:

(I prefer bac[[4]] because the variables are listed in the “correct” order.) Only one t statistic is greater than 2 — and even it is, technically, insignificant: with only 13 observations and five estimated coefficients (the constant plus the four variables), we have 8 degrees of freedom, and the two-sided 95% critical t value is about 2.31.
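A quick way to check that critical value in Mathematica (13 observations minus 5 coefficients leaves 8 residual degrees of freedom):

N[Quantile[StudentTDistribution[8], 0.975]]   (* two-sided 95% critical t, about 2.306 *)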

Now, it makes no difference in what order we look at regressions. Let’s start with the 4-variable case, just because that’s the one we’ve seen before.

4 variables

Here’s the regression under investigation:

As I have said, we are looking at it not because we think it’s a particularly good fit, but because it exhibits multicollinearity (when compared to the best 3-variable regression — otherwise we wouldn’t know that some of those t-statistics had been significant in other fits).

Let’s do a singular value decomposition of the design matrix for that regression. First, here are the singular values:

The smallest sv is two orders of magnitude below the second smallest one.

Here I compute the SVD, and display the orthogonal matrix v:
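In code, the computation might look like this sketch; DesignMatrix takes the same data and formula as LinearModelFit, and prepends the constant column X0:

X = DesignMatrix[data, {x1, x2, x3, x4}, {x1, x2, x3, x4}];   (* 13 x 5; first column is all 1s *)
SingularValueList[X]                                          (* the five singular values *)
{u, w, v} = SingularValueDecomposition[X];                    (* X == u.w.Transpose[v] for real X *)
MatrixForm[v]                                                 (* the orthogonal matrix v *)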

If the smallest singular value, only, had been zero, the rightmost column of v would have been a basis for the nullspace of X; and X.v would therefore have had zeroes in its rightmost column.

Well, X is of full rank, so there is no nullspace… but the rightmost column of X.v will still be very small, because the fifth singular value is very small:

How small? Rounding off to the nearest 0.05 makes the fifth column all zeroes.

Well, just what is the basis vector that leads to the column of small numbers?

It’s this, the rightmost column of v:

{-0.999788, 0.0102846, 0.0103039, 0.0105194, 0.0100992}

and I can describe it by writing N1.v4 ~ 0, where N1 is the list of names {X0, X1, X2, X3, X4}, i.e.

-0.999788 X0 + 0.0102846 X1 + 0.0103039 X2 + 0.0105194 X3 + 0.0100992 X4 ~ 0.

(That is exactly analogous to describing a plane as the set of vectors perpendicular to the normal to the plane. The only hassle is setting it to zero because I want to solve for X4, say, when it’s not zero, merely very small. Try not to curse me out too badly.)

Hmm. Maybe it would be better to divide by the smallest absolute value in that vector. Bear in mind that I’m about to change the scale of things, which I will not always want to do. But for understanding the 1-dimensional space spanned by that vector v4, in this case, it is harmless:
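A sketch of that rescaling, continuing from the SVD sketch above:

v4 = v[[All, -1]];               (* the rightmost column of v *)
v4/First[MinimalBy[v4, Abs]]     (* divide through by the entry of smallest magnitude *)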

That is: N1.v4 is now

-98.9965 X0 + 1.01835 X1 + 1.02027 X2 + 1.0416 X3 + 1. X4 ~ 0

And if we round off, we see (lying to Mathematica® and setting the expression to zero):

Can we check that by looking at fits among the independent variables? Yes. And I don’t even have to run the regressions myself! Mathematica supplies what we need.

Recall that the VIFs (here, V) are defined as

V = 1 / (1 - R^2)

where R^2 is the RSquared of a regression. Well, what I really want is the R^2, so I’m going to compute

R2 = 1 - 1 / V.

Some remarks are in order. We have seen that Mathematica does indeed relate VIFs and R^2 by those equations. I first saw this in Ryan’s “Modern Regression Methods”. But I have read things which suggest that this may not have been the original definition of the VIFs, and other things which suggest that this may not be a universal definition of VIFs. You may wish to confirm or deny this relationship for whatever software you are using.

Anyway, I want the R^2 and I know I can get them from Mathematica’s Variance Inflation Factors.

Here are the VIFs and the corresponding R^2 for our regression:

From now on I will simply call a function, rsq, which I wrote to perform that two-line calculation:
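Something along these lines would do it; this is a sketch of such a function, relying on the “VarianceInflationFactors” property of a fitted model, not necessarily the original:

rsq[lm_] := 1 - 1/lm["VarianceInflationFactors"]
(* if your version of Mathematica also reports a factor for the constant term, drop it: *)
(* rsq[lm_] := 1 - 1/Rest[lm["VarianceInflationFactors"]] *)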

What do those R^2 mean? From left to right,

  • X1 as a function of X2, X3, X4 has an R^2 of .974023
  • X2 as a function of X1, X3, X4 has an R^2 of .99607
  • X3 as a function of X1, X2, X4 has an R^2 of .978664
  • X4 as a function of X1, X2, X3 has an R^2 of .99646

(Why aren’t the R^2 all the same? Because we don’t treat the independent and dependent variables the same; all our deviations from the fit are assigned to whichever one is the dependent variable.)

That each variable has a very high R^2 says that any one is a well-fitted function of the other three; this in turn means that all four variables are related, and highly so.

Ah… as much as I wish to distinguish the detection-isolation-identification of multicollinearity from an assessment of its severity, I can’t completely divorce the two. We have numbers, namely these R^2, so let’s at least pay them some notice, however scant.

What about the R^2 for the 4-variable regression of y?

bac[[4]][“RSquared”] = 0.982376 .

Whoa! Two of our R^2 for independent variables are higher than the one for y: X2 and X4 are fitted by the data better than y is!

We could now compare the fits of y to each of those fits; e.g. y and X4 as functions of the three variables X1, X2, X3, etc., as opposed to y as a function of all four variables, which is what we just did. It would be a terrible idea in principle, and we’ll see why later. (Okay… we’ll see that it seems not to be a consistent indicator.)

I’m not going to declare that we are at DEFCON 1 – I’m not going to assert “extreme multicollinearity” – because X2 or X4 is fitted better than y is… I’m just going to note that y is not quite as closely related to our four independent variables as they are to each other.

Let me put that another way. I don’t have a ruler with severity of multicollinearity marked off on it; I don’t have any unambiguous measures of the severity. I simply note that the fit for y is not quite as good as the fits for X2 and X4, whatever that means.

Oh, while we’re at it: X4 had the best fit, so just what was that fit? I didn’t have to run those four regressions to assess them, but now I want to see the best of them:
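A sketch of that fit, under the same assumed column layout, with the X4 column acting as the response:

lmX4 = LinearModelFit[data[[All, {1, 2, 3, 4}]], {x1, x2, x3}, {x1, x2, x3}];
lmX4["BestFit"]      (* X4 as a linear function of X1, X2, X3 *)
lmX4["RSquared"]     (* about 0.996, the largest of the four VIF R^2 above *)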

So: the OLS fit of X4 as a function of the other three variables corresponds nicely to the near linear dependence described by the rightmost column of v. Nevertheless, let me remark that I consider the SVD to be definitive for measuring the rank of X, while the OLS fits among the independent variables explain, in this case, why X is nearly of rank 4.

Incidentally, the parameter table for the regression of X4 is rather satisfying…

It shows that all the t statistics are significant. OLS says there is a pretty solid relationship among the four independent variables.

3 variables

Here’s our best 3-variable regression:

It is one of the two “best” regressions under consideration.

Let’s do the SVD. Here are the singular values:

Once again, the smallest sv is two orders of magnitude below the second smallest one.

Here’s X.v:

How small is that rightmost column? This time we have to round to the nearest 0.2 (rather than 0.05) to get zeroes in the rightmost column. I start to think that we can say this multicollinearity is less severe than the 4-variable case.

What’s the rightmost basis vector v?

Better would be to make the smallest component 1 – again, remembering that I’m changing the scale of things; that’s OK for a conceptual explanation, but our “approximately zero” is becoming a poorer approximation:

That’s not such a bad spread.

What can the VIFs tell us about this multicollinearity?

Here are the R^2 corresponding to the VIFs (I’ll start calling these the “VIF R^2”)…

rsq[bac[[3]]] = {0.0622037, 0.946753, 0.947202}

Those say that X2, X4 are related, but that X1 is not.

And here is the R^2 for our 3-variable regression:

bac[[3]][“RSquared”] = 0.982335 .

So two of those relationships among the Xi have a relatively high R^2, but they are both less than the R^2 (.982335) of y as a function of X1, X2, X4.

I may not be able to say how severe this multicollinearity is, but I might be able to say, again, that it is less severe than that among all four variables.

What exactly is the regression of X4 on X1 and X2? (That’s the one with the highest R^2.)
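Again a sketch of the call, same assumed layout, with X4 as the response:

lmX42 = LinearModelFit[data[[All, {1, 2, 4}]], {x1, x2}, {x1, x2}];
lmX42["BestFit"]     (* X4 as a linear function of X1 and X2 *)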

Similar to what we saw for the near-null space: X2 + X4 ~ 81.

So. In contrast to searching for linear dependence, the search for multicollinearity appears to go sequentially. When we had all four variables, we found that their sum was approximately 100. When we dropped X3 from consideration, we found that X4 plus X2 was about 80. The former fit is better, and overshadows the latter.

Oh, let’s look at the parameter table for X4 as a function of X1 and X2:

The t statistic for X1 is insignificant, as we expected from the VIF R^2.

Now, I want to set you up for something worth seeing. Having found a relationship for X4 as a function of X1 and X2, let’s fit y as a function of X1, X2… and print its R^2.
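That is just bac2 from the setup sketch, so a one-liner suffices:

bac2["RSquared"]     (* y as a function of X1 and X2 *)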

Well, cool. Its R^2 is still greater than the VIF R^2.

Well, we’ll see. (Don’t put too much stock in it.)

2 variables

Here’s the best 2-variable regression:

OK, this time there is only one order of magnitude difference between the two smallest singular values.

Here’s the SVD computed and X.v (not v) printed:

Hmm. The rightmost column is still fairly small. How small? We would need to round to the nearest integer to get zeroes; if we rounded to the nearest 0.5 instead, several of those values would come out as 0.5. (Note that number.)

Still, that says the data is nearly 2D even with the constant.

Now would be a good time for me to remind you that v is an orthogonal matrix with determinant +1 – which means that it is a rotation. That the third column of X.v is so small says that I can make it small just by rotating the data in space, i.e. just by changing the point of view from which I look at it.
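Two quick sanity checks on that, assuming u, w, v have been recomputed for this 13 x 3 design matrix; in general the v from an SVD can also have determinant -1, so it is worth checking:

Det[v]                                      (* +1 here, so v is a rotation *)
Chop[Transpose[v].v - IdentityMatrix[3]]    (* orthogonality: all zeroes *)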

Well, what’s the rightmost column of v?

Of course: X dotted with that vector gives the third column of X.v.

If I were to change the length of v3, I would find

That’s what I meant by conceptual insight versus changing the scale. Those numbers aren’t small anymore; far from it.

What do the VIF R^2 of the independent variables tell us?

Here are the R^2 for X1 of X2 and X2 of X1:

rsq[bac[[2]]] = {0.0522486, 0.0522486}.

Those are pretty crappy fits. Good. There’s no strong relationship between X1 and X2.

Wait a minute! The SVD says we have multicollinearity, but the VIF R^2 say there’s no relationship between X1 and X2. But the SVD is definitive: our X matrix is of rank 3, yet it differs from a matrix of rank 2 only by its smallest singular value. That value isn’t all that small (1.05), but we can clearly see that the last column of X.v isn’t very far from zeroes.

This is why I view the VIF R^2 as possibly supplying an explanation for the multicollinearity.

In this case, it doesn’t.

I need to keep thinking about the geometry of this… and I almost certainly owe you a picture someday, but here’s how I see it at present:

X1 and X2 lie close to a plane in 3D, but not on a line. The multicollinearity isn’t caused by a straight line fit between X1 and X2 — they don’t lie on a line — but by the fact that the data lies exactly in one plane not through the origin (X0 = 1), in such a way that it almost lies in a plane through the origin (perpendicular to v3).

Roughly speaking, the plane X0 = 1 which the data does lie in can be twisted just a little to give a plane through the origin, from which the data is never more than 0.5 away.

One might suspect that this multicollinearity is caused by the scale of our data. One would be right. (Next post.)

Let me also suggest that it is worthwhile discovering that multicollinearity can have at least two causes: the kinds of relationships we saw for the 4-variable and 3-variable regressions; and the odd geometry we see in this case.

the fourth interesting regression

We had come across the following regression, which uses X2, X3, X4:

It is interesting for two reasons: one, all of its t statistics are significant; two, it is ranked worst, by all my criteria, among all the 3-variable regressions. In particular, it is ranked worse than our best 3-variable regression, which has an insignificant t statistic for X4.

That all of its t statistics are significant is another illustration of the severity of the multicollinearity among all four independent variables: adding X1 to this regression drops all five t statistics to insignificance.

I simply want to look at it. I expect to find that X2 and X4 are multicollinear, just as they were when I ran X1, X2, X4.

Let’s do the SVD. Here are the singular values:

As it was for X1, X2, X4, the smallest sv is two orders of magnitude below the second smallest one.

I call for the SVD and display the X.v matrix (not v):

How small is that third column? We have to round to the nearest .2, just as we did for X1, X2, X4.

What’s the basis vector in the small direction? And how would I describe it?

It is more informative to use a vector which is not unit length:

This one seems to involve X3 more than the previous one involved X1. Still, the vector v4 suggests that we have a near linear relationship among the independent variables.

What about the VIF R^2?

rsq[lm] = {0.958863, 0.229724, 0.958087}.

That says that X2 and X4 are related. (What a surprise! Not. I hope you agree.) What would OLS give us?

Since the R^2 for y was

lm[“RSquared”] = 0.97282,

we see that the VIF R^2 are less than it, as they were for the best 3-variable regression.

Now I finally get to illustrate the warning I made earlier. Let’s look at y as a function of just X2, X3 — since here we have X4 as a function of X2, X3:

Yikes! X4 is fitted better by X2, X3 than y is. But is that really telling us something? We have simply thrown away too much of the explanatory power.

Why do I say that?

I think that on the basis of the SVD and the VIF R^2, this multicollinearity ought to be comparable to the multicollinearity among X1, X2, X4… even though in that case the R^2 for y is higher than the VIF R^2, and in this case lower.

If the multicollinearity is of the same severity (and in both cases it seems to be the multicollinearity of X2 and X4), then it can’t matter that we had R^2 higher for one, lower for the other. That’s a nontrivial “if”, so I’ll take it under advisement, but more than tentatively, I would say don’t drop variables from the regression of y.

As I said, I think this comparison — y as a function of the same variables as the chosen independent variable — is not consistent.

It remains to be seen if the other assessment of the R^2 of y versus the VIF R^2 is really as useful as it looks so far.

other measures

I want to close by showing you a few other things.

Here’s the correlation matrix of the data:
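A sketch of the computation, taking just the four columns of independent variables; include the fifth column if you want y in the matrix as well:

Correlation[data[[All, 1 ;; 4]]] // MatrixForm   (* pairwise correlations of X1, X2, X3, X4 *)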

That picks out X2 and X4 (-.973), and maybe X1 and X3 (-.82)… but it certainly does not pick out all four. But we know there is multicollinearity among all four variables, and it seems to be more severe than that between X2 and X4.

This is why I don’t use the correlation matrix. It is not definitive. It fails to find the most severe multicollinearity in this data.

Oh, I have learned something I need to correct. The correlation coefficient does pick out a relationship including the constant term, not just proportionality, between two variables. One can prove that, but let me illustrate it instead of proving it.

What is the fit of X4 as a function of X1? And what is the square root of the R^2 for that fit?

That is to be compared to the (1,4) and (4,1) entry, -0.245445, in the correlation matrix. The negative sign could be inferred from the negative slope in the fit.

X4 as a function of X2? Here’s the fitted equation and the square root of its R^2:

This is to be compared to the (2,4) and (4,2) entry, -0.972955, in the correlation matrix. Again, the sign can be picked out from the negative slope in the fitted equation.

The absolute value of the correlation coefficient between two variables is the square root of the R^2 when one is fitted as a function of the other. (For such a fit, the R^2 is the same either way.)
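A quick numerical check of that claim, sketched with the X2, X4 pair:

lm24 = LinearModelFit[data[[All, {2, 4}]], {x2}, {x2}];    (* X4 as a function of X2 *)
Sqrt[lm24["RSquared"]]                                     (* about 0.973 *)
Abs[Correlation[data[[All, 2]], data[[All, 4]]]]           (* the same number *)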

The correlation matrix does a fine job of telling us the R^2 for one independent variable as a function of one other, but it misses larger relationships.

Here is the eigenstructure table for the 4-variable regression:

We see that the ratio is not always either less than 1 or greater than 1. Whatever the relationship between R^2 and the eigenstructure table, it’s not monotonic.

It appears that the four numbers on the bottom row under X1…X4 are telling us something very similar to the four VIF R^2: all four variables are involved.

Here is the eigenstructure table for the best 3-variable regression:

Again the ratio is not always either less than 1 or greater than 1.

Still, the three numbers on the bottom row under X1, X2, X4 are telling us the same thing as the three VIF R^2: variables X2 and X4 are related.

Here is the eigenstructure table for the best 2-variable regression:

This time we see that the VIF R^2 and the partitions are the same for each variable; but the partition values are substantially larger than the R^2. In contrast to the R^2, which I understand (no significant relationship between X1 and X2), I do not know how to interpret values of .61 for the partitions. Perhaps I should infer that .6 is insignificant.

summary

For the regression involving all four independent variables….

  • the smallest singular value was two orders of magnitude smaller than the next smallest;
  • the rightmost column of the v matrix suggested that the sum of the independent variables was approximately 99;
  • the rightmost column of X.v only had to be rounded to the nearest 0.05 to become all zeroes;
  • according to the VIF R^2, OLS finds a significant dependence of each independent variable on all of the other three;
  • the R^2 for y as a function of the four independent variables was less than the R^2 for X4 as a function of the other three, and less than the R^2 for X2 as a function of the other three;
  • I conclude that the four independent variables are multicollinear (SVD) because there is a relationship among them (SVD and OLS via the VIF R^2).

For the best regression on three variables…

  • the smallest singular value was two orders of magnitude smaller than the second smallest singular value;
  • the rightmost column of X.v became all zeroes if we rounded to the nearest 0.2;
  • the rightmost column of v suggested that the sum X2 + X4 was roughly 80;
  • according to the VIF R^2, OLS finds a significant relationship between X2 and X4;
  • the R^2 for X2 as a function of X1 and X4, and the R^2 for X4 as a function of X1 and X2, are both less than the R^2 for y as a function of X1, X2, X4;
  • I conclude that X2 and X4 are multicollinear (SVD), because there is a relationship between X2 and X4 (OLS);
  • I think this multicollinearity is less severe than that among all four variables (both from the SVD for the rounding on X.v, and OLS for the VIF R^2).

For the best regression on two variables…

  • the smallest singular value is one order of magnitude smaller than the second smallest singular value;
  • we have multicollinearity because X.v rounded to the nearest integer becomes all zeroes in its rightmost column – the data is nearly 2-dimensional;
  • but we do not have any significant relationship between X1 and X2;
  • instead, we conjecture that the multicollinearity is caused by the scaling of the data: it lies exactly in the plane X0 = 1 (the constant term), and approximately in a plane through the origin.

For the interesting regression…

  • we saw multicollinearity very similar to that of the best 3-variable regression: X2 and X4, with comparable numbers; in fact, I count it as the same: X2 and X4;
  • we also saw that the R^2 for y as a function of only X2 and X3 was much worse than that for X4 as a function of X2 and X3 – which is the opposite of what we found for the best 3-variable regression;
  • I conclude that dropping variables from the fit for y is not a consistent measure of the severity of the multicollinearity.
From the best 2-variable, best 3-variable, and the 4-variable regression, we learned that the SVD appears to detect multicollinearity sequentially: when it finds the 4-variable multicollinearity, it does not simultaneously detect the X2–X4 multicollinearity, or the weaker X1–X3, or the even weaker X1–X2.

Bear in mind, however, that we did not have two independent multicollinearities, such as X1 and X3, and X2 and X4, without the 4-variable multicollinearity. In that case I would expect to see both of them at once. I’ll let you know, if I find such a case.

From the correlation matrix…

  • We learned that X1 and X3 are multicollinear, though not as severely as X2 and X4;
  • that’s one multicollinearity I did not investigate, because it doesn’t show up in any of the four regressions I was interested in.

(So I found a use for the correlation matrix despite my condemnation of it!)

We also saw that the eigenstructure table seemed to convey the same information as the VIF R^2. I don’t know yet if it conveys any additional information.

Finally, I would expect that just as X1 and X2 are (mildly) multicollinear without being remotely close to a line, X1 and X4, and X2 and X3, and X3 and X4, might be, too. That is, all four possible pairs other than X1 and X3, and X2 and X4 – which we know are somewhat multicollinear – might exhibit the same multicollinearity as X1 and X2.

Whew!

4 Responses to “Regression 1: Multicollinearity in the Hald data – 2”

  1. fourier Says:

    Your problem of collinearity can easily be solved by Fourier basis regression analysis (FB Regression)… it simply creates a new basis for your design matrix whose columns are mutually orthogonal…

  2. rip Says:

    I know how to create a new orthonormal basis – the columns of the v matrix in the SVD, X = u w v’, and I can reduce the multicollinearity.

    I’m curious about “Fourier basis regression”, but a google search gets only 7 hits.

    Can you explain what to do?

  3. Fourier Says:

    Please leave me your email and I will discuss the FB Regression Analysis with you.

