Regression 1 – Multicollinearity in the Hald Data 3 (subsets)

Introduction

edit: two links. I first looked at subsets of data in this post; I decided that the matrix product X.v was worth computing in this post. Both links were for linear dependence rather than for multicollinearity, but these are where I first explained what I was doing with subsets and X.v.

For most purposes, what we have already done with the Hald data might be sufficient. We have identified the multicollinearity only in the regressions of interest, i.e. in the closest fits using 2 or 3 variables, and in the regression using all 4 variables. In particular, we never explicitly looked at the relationship between X1 and X3 (not beyond what the correlation matrix told us, anyway), because X1 and X3 did not occur together except for the regression on all variables. And for that regression, all four variables were multicollinear.

Maybe I should remind us that we’re pretty sure the Hald data has 3 multicollinearities:

  1. all four variables sum to approximately 98.5;
  2. X2 and X4 are multicollinear, but not as severely as all four;
  3. X1 and X3 are multicollinear, but not as severely as X2 and X4.

On the other hand, there may be times when we want to investigate all of the multicollinearity, not just some of it. Then I, at least, would look at all subsets of the variables. (In this case, we didn’t look very closely at subsets which dropped the constant, because the singular values were relatively large.)

Let me do that.

This is very much like what we did when looking for linear dependence (“exact multicollinearity”). In fact, I’m going to use three tools:

  1. the singular values of subsets
  2. the rounded matrix product X.v
  3. the R Squareds computed from the VIFs (Variance Inflation Factors)
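(For the record, tool 3 rests on the usual relation between a VIF and an R^2: VIF_j = 1/(1 - R_j^2), so R_j^2 = 1 - 1/VIF_j, where R_j^2 comes from regressing the j-th independent variable on all the other independent variables.)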

Get the data…

We haven’t seen this in a while, so here’s the data matrix:

It is convenient to have two sets of names…

Now, assign – set – data and ns to my generic names d1 and n1 (because I’m going to copy this whole section of computations for modifications of the Hald data)…

Get what we already know to be the best 2-,3-,4-variable regressions…

Get the design matrix X, and – more importantly for subsets – its transpose XT. Nevertheless, it is X which fits on the screen, so print X rather than XT:

Define r, a list from 1 to 5:
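In Mathematica that is presumably just

    r = Range[5]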

Let’s try isolation. Just as I did for linear dependence, I want to investigate subsets of the columns of X. And just as I did for linear dependence, I’m actually going to examine subsets of the rows of the transpose X’ – strictly for Mathematica’s convenience.
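To make that concrete (my reconstruction, assuming the X and XT defined above): taking rows of the transpose and transposing back is the same as taking columns of X.

    (* rows 1, 2, 3, 5 of XT are columns CON, X1, X2, X4 of X *)
    Transpose[ XT[[{1, 2, 3, 5}]] ] == X[[All, {1, 2, 3, 5}]]   (* True *)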

Let me remind you that when I refer to a regression with 4 variables, I mean X1, X2, X3, and X4 – but the regression actually has a fifth variable, the constant. When I start looking at subsets of the variables, I must count all 5 variables. This is why I have two sets of names: the regressions do not want the constant term named explicitly, but “CON” really helps a lot when looking at subsets. The standard phrase at this point is, “The different usages should not cause confusion.” Let me hope so, because I ain’t gonna clean it up!

In addition to printing the singular values, let me also print the condition numbers. (The condition number is the ratio of largest to smallest singular value.) In fact, let me print the singular values twice: first raw, then rounded off – you can’t very well check my condition numbers using only the rounded-off values.
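A little function along these lines would produce all three things (a sketch, not the code behind the screenshots):

    (* raw singular values, rounded singular values, and condition number of a numeric matrix m *)
    svReport[m_] := Module[{sv = SingularValueList[N[m]]},
      {sv, Round[sv, 0.01], Max[sv]/Min[sv]}
    ]

Fed the full design matrix, svReport[X] should reproduce the singular values running from about 211 down to 0.03, and the condition number of roughly 6056, quoted below.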

All Five Variables

Here’s the “subset” of all 5 variables, i.e. the entire design matrix:

We have 1 subset… it contains CON, X1, X2, X3, X4… the singular values range from 211 down to 0.03… the condition number is 6056 (using the unrounded singular values, of course).

Well… we already know that we have multicollinearity involving all four variables. But from subsets alone, we have not yet isolated the multicollinearity. Detected, yes; isolated, no.

Nevertheless, we already understand that all four variables are highly multicollinear, and that knowledge suggests that a condition number like “6056” might well mark a severe multicollinearity.

Subsets of Four Variables

Let’s look at subsets of 4 variables.
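Presumably the subsets came from something like the following; the order Subsets produces appears to match the numbering I use below (the second subset is {CON, X1, X2, X4}, the fourth is {CON, X2, X3, X4}, and the fifth is the one without the constant):

    s = Subsets[r, {4}]
    (* {{1,2,3,4}, {1,2,3,5}, {1,2,4,5}, {1,3,4,5}, {2,3,4,5}} *)
    (* i.e. {CON,X1,X2,X3}, {CON,X1,X2,X4}, {CON,X1,X3,X4}, {CON,X2,X3,X4}, {X1,X2,X3,X4} *)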

As before, the first light-green output box is the raw singular values… the second box adds an index and names to the rounded singular values… the third output box shows the indices and the names and the condition number for that subset.

The smallest singular values are for the second and fourth subsets (.16 and .14 respectively). Then we can look at the first and third just to see what singular values of about .5 mean, if anything.

The largest condition numbers are associated with the second and fourth subsets.

NOTE that the following presumes I have a column of 1s in the first column… so we will NOT look at the fifth subset, which omits the constant. Besides, its smallest singular value is 10 and its condition number 20.
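If I wanted to automate that restriction, a hypothetical one-liner keeps only the subsets that contain index 1, the constant column:

    Select[s, MemberQ[#, 1] &]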

Here is more detail on the second subset, specified by “s[[2]]”. The module “tools” returns “xv” and “rs”, which are printed, and then I printed the variable names. You should infer that “xv” is the matrix product X.v and “rs” is the R^2 computed from the VIFs. (Yes, “tools” computed the Singular Value Decomposition of a subset of X, given XT; and ran a regression from which I could get the Variance Inflation Factors and the corresponding R^2s.)
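I haven’t shown the code for “tools”, but here is a sketch of the sort of thing it does. One difference: instead of pulling the VIFs out of the fitted regression, this sketch computes them from the diagonal of the inverse correlation matrix of the non-constant columns (which amounts to the same thing), and then converts each VIF to an R^2 via R^2 = 1 - 1/VIF. The name toolsSketch and the argument tol (the rounding) are my own, and the sketch assumes the first entry of idx is the constant column.

    toolsSketch[XT_, idx_, tol_] :=
      Module[{XS, u, w, v, xv, Z, vifs, rs},
        XS = Transpose[ N[ XT[[idx]] ] ];               (* the chosen subset of columns of X *)
        {u, w, v} = SingularValueDecomposition[XS];
        xv = Round[XS.v, tol];                          (* round X.v to multiples of tol *)
        Z = Transpose[ N[ XT[[Rest[idx]]] ] ];          (* drop the constant column for the VIFs *)
        vifs = Diagonal[ Inverse[ Correlation[Z] ] ];   (* VIF_j is the j-th diagonal of the inverse correlation matrix *)
        rs = 1 - 1/vifs;                                (* R^2_j = 1 - 1/VIF_j *)
        {xv, rs}
      ]

Something like toolsSketch[XT, s[[2]], 0.2] would then correspond to the second subset, {CON, X1, X2, X4}.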

The rounding required to zero-out the last column of X.v is 0.2.

The R^2 tell me that the last two variables are the collinear ones: that’s X2 and X4.

Let me review that. I ran a regression of y on that subset of variables. The .06 says that X1 is not well-fitted by X2 and X4. The .9467 says that X2 is well-fitted by X1 and X4 – but it does not say that both X1 and X4 are significant. Similarly, the .947 says that X4 is well-fitted by X1 and X2 – but it does not say that both X1 and X2 are significant.

The lack of fit of X1 as a function of X2 and X4 is what tells me – without looking further – that X1 is not significant in the other two fits.

Of course, we saw all this in an earlier post, when we studied the regression with X1, X2, X4, which was our best 3-variable regression.

Moving on, here is more detail about the fourth subset:

This time it’s also X2 and X4 that are closely related. And the required rounding on X.v is, again, 0.2.

So the two subsets with the smallest singular values are both flagging the relationship between X2 and X4. That is, they are both flagging the same multicollinearity, in the two subsets {X1, X2, X4} and {X2, X3, X4}.

Let’s look at the first and third subsets. Here’s s[[1]]…

So, X1 and X3… the required rounding is 0.8 instead of 0.2. And the R^2 are just under .7 instead of in the high .90s.

Here’s the third subset, s[[3]]…

Again, X1 and X3. The required rounding-off is 0.6, comparable to – slightly lower than – the first subset. With slightly less rounding, we might expect that the R^2 are slightly higher – and they are, slightly above .7 instead of below.

And that exhausts all the subsets of four variables which include the constant term. The two lowest (and similar) singular values are flagging one multicollinearity, X2 and X4; the next two (also similar to each other) are flagging another, lesser, multicollinearity, X1 and X3.

Subsets of Three Variables

Let’s keep going. Subsets of three. This should eliminate the extraneous variables in the subsets of four.
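Again my reconstruction of the bookkeeping – call the list s3 here, though the post may simply reuse s – the ten subsets of three of the five columns, in the order referred to below:

    s3 = Subsets[r, {3}]
    (* {{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}} *)
    (* subsets #1-6 contain the constant (index 1); #7-10 do not *)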

The first screenshot is the command and the unformatted and un-rounded-off singular values, if you want to check the computations:

The second screenshot is the formatted and rounded-off singular values, and the condition numbers.

As before, the condition numbers are pretty small for the subsets (#7-10) which omit the constant; and the corresponding singular values are greater than 14.

For the six subsets that include the constant, on the other hand, the smallest singular value belongs to the 5th subset, then the 2nd, then the 4th, 1st, 6th, and 3rd; i.e. the order is 5, 2, 4, 1, 6, 3.

Now is a good time to note that, sorted on condition numbers, we would get a different order: 5, 4, 1, 6, 3, 2. Only “2” has moved, but it has moved a lot.

Note especially that the 2nd subset has the smallest condition number (among subsets which include the constant), but the second smallest singular value. Which matters?
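Just to spell out the two rankings being compared (purely illustrative; these helper names are mine):

    rankBySmallestSV[minSV_List] := Ordering[minSV]        (* positions in ascending order of smallest singular value *)
    rankByConditionNumber[cn_List] := Reverse[Ordering[cn]] (* positions in descending order of condition number *)

Fed the values in the screenshots, these should return 5, 2, 4, 1, 6, 3 and 5, 4, 1, 6, 3, 2 respectively.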

Let’s look at the fifth subset, with a smallest singular value of .16. That’s about the same as the smaller pair among the four-variable subsets. Is it flagging X2 and X4 also?

Yes. That’s picking out X2 and X4 again, this time without X1 distracting us. I hope this isn’t surprising. We have already seen the multicollinearity of X2 and X4 in, for example, the subsets {X1, X2, X4} and {X2, X3, X4}; we really should expect the 2-element subset {X2, X4} to show the same multicollinearity.

We see that the required rounding on X.v is still 0.2, and the R^2 are still in the .90s.

Now let’s look at the second subset, with a much larger smallest singular value, .65 – which isn’t much different from the other pair, with .52 and .59, in the four-variable subsets. Is this picking out X1 and X3?

Yes.

Again, the rounding (0.7) and the R^2 (< 0.7) are what we saw before on the subsets containing X1 and X3.

Let’s look at the others. Here’s the fourth subset (sv = .87):

The required rounding is higher (1 versus .7) for the fourth subset than for the second, and the RSquared between X2 and X3 is only 0.02 (versus .68). Judged by the required rounding on X.v, and by the R^2 among the independent variables, the collinearity between X1 and X3 is the third most serious in the data. So the singular values, which ranked the second subset as worse than the fourth, are the more appropriate guide, not the condition numbers, which ranked the fourth as worse than the second. The fourth subset is nowhere near as multicollinear as the second.

Counterexample: Condition Numbers Can Be Misleading

Let me run that by you again. Subset #2 is {X1, X3}. It has the lowest, and presumably the safest, condition number among all these subsets. But its minimal singular value says this is the second most dangerous subset of two variables – and the R^2 show that it is certainly more dangerous than subset #4… so the condition numbers are misleading in this case.

(It turns out that for centered data, many of the smallest singular values are equal – hence uninformative about severity. I don’t understand why they’re all equal, but they can be. When that happens, the condition numbers should be definitive.)

We Now Return to Our Regularly Scheduled Program

Let’s quickly check the other subsets, all of which have smallest singular values larger than 1. The first:

We see a required rounding of 1. and almost-zero R^2.

And the sixth:

Again, the required rounding is 1., and the R^2 are next to nothing.

And, finally, the third:

The required rounding is even higher, 1.6; the R^2 are negligible.

Most of those have confirmed what we either knew (for one subset of two variables) or conjectured (for the other subsets): the SVD says any pair of variables is multicollinear, because the data lie in a plane which almost goes through the origin, even though the independent variables are not well-fitted by each other.

The key lesson there was that we had to look at the RSquareds among the independent variables, and not just at the singular values and the rounding required to zero out a column of the matrix X.v.

And the most important thing I learned here was that the singular values are a better indicator than the condition numbers.

Looking for a Pattern

We did an awful lot of work. How much could we have inferred without looking at the R^2 from the VIFs? I don’t know if the following will always hold up, but we should consider it.

From the SVD of the full design matrix, we saw one extremely small singular value, 0.03.

When we looked at subsets of four variables (counting the constant, now), we found that the smallest singular values were much larger, but in two pairs – one pair about .15 and the other pair about .55.

I might have conjectured that “much larger” meant we were looking at a different multicollinearity from the previous subset, and the “two pairs” meant that we were looking at two new multicollinearities. In other words, conjecture that all four variables were multicollinear (that’s one); but once we considered subsets missing one of the four, we could now see two pairs of multicollinear variables (that’s two and three) – not of equal severity to each other, and not as severely multicollinear as the full set.

When we looked at subsets of three variables, we found that the two smallest singular values were almost unchanged – about .15 and about .6 again.

And then we might have guessed that the pairs were X2 and X4, and X1 and X3.

It appears – from this one dataset! – that a change in the minimal singular values heralds a new multicollinearity. 0.03 was all four variables, .15 was X2 and X4, .6 was X1 and X3 – and we had about .15 whether we looked at {X1, X2, X4}, {X2, X3, X4}, or {X2, X4}… and we had about .6 whether we looked at {X1, X2, X3}, {X1, X3, X4}, or {X1, X3}.

An interesting pattern. Maybe we’ll see if it holds for other datasets. (I don’t know that I have another one with multiple multicollinearities.)

Next, I’m going to look at the centered Hald data. We have already seen that standardizing the data removed the small singular values for the pair X1, X4. I have looked at all subsets of the standardized data, and there’s nothing new to learn. (I could change my mind about that….)

But we have a shock in store when we look at the centered – not standardized – data. I wish it weren’t so, but I think the math is right.
