Regression 1 – Multicollinearity in subsets of the standardized Hald data

Edit: 8 Aug. Remarks added in “Regressions with no constant term”.

introduction

It would be fair to say that this post is primarily for my reference, but it does provide a second example of looking at all subsets of multicollinear data.

As we originally did for the raw data, so for the standardized data: we looked at multicollinearity for the three regressions of most interest – namely, the best 2- and 3-variable regressions, and the all-4-variable regression.

Now, as we did for the raw data, so for the standardized data: let’s look at all subsets (of columns) of the design matrix X. (In fact, it is easier to look at the equivalent: all subsets of rows of the transpose of the design matrix, X’.)

Nevertheless, let me summarize what we will find. You may not feel a need to look at the computations. I think at the very least you will want to look at the section on regressions without a constant term, and at the 3-variable subsets.

First, we will see that the standardized data has three distinct multicollinearities; from most severe to least severe, they are

  1. all four independent variables
  2. X2 and X4
  3. X1 and X3.

That should not be a surprise. We did, after all, see these when we looked at multicollinearity in selected regressions for the standardized data. What we’ll see today is that those are the only multicollinearities.

We did, however, see one difference between the raw data and the standardized data: the singular values are quite sensitive to scaling.

In the raw data, judged by the singular values, the set {CON, X1, X2} appeared to be multicollinear to some extent, in the sense of having an approximate 1-dimensional nullspace (the last column of X.v was 0 when we rounded to the nearest integer). That’s not an implausible rounding, given that the original independent variables were integers.

And yet the VIFs (Variance Inflation Factors – though what I print is the corresponding R^2) for each of X1 and X2 as a function of the other said that the two were not remotely related.

When we looked at the standardized data instead of the raw data, the singular value approach now showed no problem with X1 and X2.

That is, in the raw data the singular values suggested that X1 and X2 were related, but the VIFs said otherwise; in the standardized data, there is no disagreement between the singular values and the VIFs.

Second, we will see that the singular values agree with the VIFs for all subsets of the standardized data.

Third, we will see, as we did before, that the smallest singular value differs when it pinpoints a different multicollinearity – e.g. X2 and X4 versus all four, or X2 and X4 versus X1 and X3. We will also see that the condition numbers are a function of the specific multicollinearity or lack thereof.

Fourth, we will see that although the R^2 from the VIFs are pretty much the same for each multicollinearity, as are the smallest singular values and the condition numbers, the rounding required to zero out the last column of X.v will change. And it will be uncomfortably large in some cases.

Finally, we will see that the smallest singular values are much larger and the condition numbers much smaller than they were for the raw data – even though the multicollinearity, as reflected by the R^2 from the VIFs, is the same. That is, singular values and condition numbers are very sensitive to the scale of the data, whereas the R^2 are not.

Incidentally, we will see that the constant term does not enter into any of the regression equations. This meant that I had to modify my tools module to not add or assume a constant column of 1s in the matrix under investigation. (I now have two modules, tools and tools2, respectively called if there is, or is not, a constant term.)

setup

Let me get the Hald data:
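Something along these lines; the file name and the variable d0 are placeholders, and I’m assuming the file holds just the 13 rows of numbers, with columns X1 through X4 and then Y:

  d0 = Import["hald.csv"];   (* placeholder file name; 13 x 5 matrix, columns X1..X4, Y *)
  Dimensions[d0]             (* expect {13, 5} *)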

We haven’t seen this in a while, so here’s the data matrix:

It is convenient to have two sets of names, one for regressions (without the constant CON) and one for subsets (with):
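In outline, something like this (the particular variable names are just placeholders):

  names    = {"X1", "X2", "X3", "X4"};   (* for regressions: no CON *)
  subNames = Prepend[names, "CON"];      (* for subsets: CON included *)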

Standardizing data looks mighty simple; there’s a built-in command:
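Roughly this, assuming d0 is the raw 13×5 matrix from above:

  dS = Standardize[N[d0]];   (* subtract each column's mean, divide by its standard deviation *)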

But let’s make sure it did what I think it did. We can ask for the mean and variance of the new data matrix:
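For instance, continuing the sketch above:

  Chop[Mean[dS]]   (* five numbers, all 0 *)
  Variance[dS]     (* five numbers, all 1 *)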

The numbers are right: the means are 0 and the variances are 1. And since it gave me two lists of five numbers rather than two lists of thirteen, I infer that it computed the mean and variance of the five columns, not of the thirteen rows. We’re good to go.

Get what we believe to be the best k-variable regressions

Let me review those regressions. Backward selection agrees with forward selection on the 3-variable regression (and on the 4-variable regression), but they disagree on the 1- and 2- variable fits.

Looking at the Adjusted R^2, we see that forward selection shows X4 giving a better fit than X2, but backward selection shows {X1, X2} giving a better fit than {X1, X4}.

That is precisely what we saw for the raw data; and we also saw that between them, backward and forward selection did pick out the best 1-, 2- and 3-variable regressions.

But we don’t actually know that for the standardized data. Well, I have run all possible subsets of the standardized data, and we are in fact looking at the best possible 1-, 2- and 3-variable fits for the standardized data.
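If you want to reproduce that, here is one way to sketch it: the adjusted R^2 for every nonempty subset of the four predictors. Again the names are placeholders – dS and names are from the sketches above, and z1 through z4 are just formal variables:

  vars = {z1, z2, z3, z4};
  Table[With[{k = Length[s]},
     {names[[s]],
      LinearModelFit[dS[[All, Append[s, 5]]], Take[vars, k], Take[vars, k]]["AdjustedRSquared"]}],
    {s, Subsets[Range[4], {1, 4}]}] // TableForm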

Here are the best 2- and 3-variable regressions, and the 4-variable regression:

We see that in every case the constant term is zero. That’s because I standardized the dependent variable, too.

regressions with no constant term

Hang on. I need to see regressions without the constant term. That is, I’m always going to include it in my regressions, but I want to quickly compare a model with and without the constant term.

Edit: I am specifically interested in the situation when the included constant is zero; if it is not, then omitting it will have a significant deleterious effect on the fit. I just want to confirm that the R^2 is the same whether I include a zero constant or not. Because the R^2 is the same – see below – I don’t need to rerun my regressions without the constant.

Let’s consider X1 and X2 with and without the constant. Here are the two regressions. (Yes, I could have used bac[[2]] instead of rerunning it and calling it lm2, but this way I get to emphasize the difference between the commands.)
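Here is a sketch of the pair of commands; the names lmWith and lmWithout are placeholders rather than the notebook’s, and columns 1, 2, 5 of the standardized matrix dS are taken to be X1, X2, Y:

  d12       = dS[[All, {1, 2, 5}]];
  lmWith    = LinearModelFit[d12, {z1, z2}, {z1, z2}];                                 (* constant included *)
  lmWithout = LinearModelFit[d12, {z1, z2}, {z1, z2}, IncludeConstantBasis -> False];  (* no constant term *)
  {lmWith["BestFit"], lmWithout["BestFit"]}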

The fits are the same:

The RSquared are the same:

The Adjusted R^2 are different – because the number of variables is different:

As I have said of other things, “Here be dragons.” I’m not going to research all the possibilities here. The problem is that the definitions of R Squared and Adjusted R Squared – when there is no constant term – are apparently not universal.

It appears that Mathematica® uses the same error sum of squares ESS and total sum of squares TSS for both computations… and uses n = 13 and k = 2 or 3… but computes

1 - \frac{ESS/(n-k)}{TSS/(n-1)}

if there is a constant term… but

1 - \frac{ESS/(n-k)}{TSS/n}

i.e. dividing TSS by n instead of by n-1 if there is not a constant term. Maybe I will look at this someday, but not today. I’m happy for now just knowing what Mathematica is doing. As I said, I’m going to keep the constant term, even when it has a coefficient of 0.
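To check that reading of the formulas, here is a quick computation with the two fits sketched above; ESS and TSS are the same for both fits, since the fitted values coincide and the standardized y has mean zero:

  n   = 13;
  ess = Total[lmWith["FitResiduals"]^2];
  tss = Total[(d12[[All, 3]] - Mean[d12[[All, 3]]])^2];
  {1 - (ess/(n - 3))/(tss/(n - 1)), lmWith["AdjustedRSquared"]}      (* k = 3: CON, X1, X2 *)
  {1 - (ess/(n - 2))/(tss/n),       lmWithout["AdjustedRSquared"]}   (* k = 2: X1, X2 *)

If each pair agrees, that confirms the n versus n-1 reading.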

The variance inflation factors are the same:

The standard errors, and therefore the t-statistics, are slightly different, but overall those two regressions look almost perfectly matched:

Let’s look at the design matrices and their singular values. Without a constant:

With a constant:
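A sketch of both computations, with placeholder names: X12 holds the standardized X1 and X2 columns, and X12c prepends the column of 1s for CON:

  X12  = dS[[All, {1, 2}]];
  X12c = Transpose[Prepend[Transpose[X12], ConstantArray[1., 13]]];   (* add the CON column *)
  SingularValueList[X12]    (* two singular values, no constant *)
  SingularValueList[X12c]   (* three singular values, with CON *)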

Interesting. It was the middle singular value that dropped out; the smallest and largest remain. I wonder about this – not that I doubt it, I just wonder if this is the key to some insight.

Anyway, we’ve now seen one illustration that it doesn’t seem to matter if we leave the constant term in our regressions, precisely because its coefficient is zero in all the fits.

subsets

In what follows, in addition to printing the singular values, let me also print the condition numbers. Recall that the condition number is the ratio of the largest singular value to the smallest.
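In code, that is just the following (SingularValueList is the built-in; m stands for whatever design matrix is at hand):

  conditionNumber[m_] := With[{sv = SingularValueList[N[m]]}, Max[sv]/Min[sv]]
  conditionNumber[X12c]   (* the two-variable-plus-CON design from the sketch above *)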

Here it is for the entire design matrix. Oh, first I need

r=Range[5];

The first line is the computed singular values, to enough accuracy to let you check the condition number; the second line displays the names of the corresponding variables, and rounds off the singular values; the third line displays the names and the condition number.

I would emphasize that the condition number, at 37, is about 1/20 of what it was for the raw data. We learn, therefore, that there is no absolute division between “good” and “bad” condition numbers: they depend strongly on the scale of the data. (The standardized data, as we will see, seems every bit as multicollinear as the original, in terms of VIFs.)

The additional information is (using tools because there is a constant term):

The first output, as usual, is X.v rounded off; it is the new data referred to the v basis from the Singular Value Decomposition X = u w v’. We see that rounding to the nearest 0.2 zeroes out the last column of X.v, so the 1-dimensional space (line) spanned by the last column of v is approximately a null space for X… so we believe we have multicollinearity.
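The core of that first output amounts to the following sketch (not the tools code itself); XS is the full 13×5 standardized design matrix, CON column included:

  XS = Transpose[Prepend[Transpose[dS[[All, 1 ;; 4]]], ConstantArray[1., 13]]];   (* CON, X1..X4 *)
  {u, w, v} = SingularValueDecomposition[XS];
  Round[XS.v, 0.2]   (* the data in the v basis; rounding to the nearest 0.2 should zero the last column *)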

The second output is the R^2 corresponding to the VIFs. All four R^2 computed from the Variance Inflation Factors are very high, so we conclude that the multicollinearity involves all four variables X1, X2, X3, X4.
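That second output amounts to the R^2 of each standardized X column regressed on the other three; since VIF_i = 1/(1 - R_i^2), printing the R^2 is equivalent. Again a sketch, not the tools code:

  vifR2[j_] := Module[{others = Delete[Range[4], j]},
     LinearModelFit[dS[[All, Append[others, j]]], {z1, z2, z3}, {z1, z2, z3}]["RSquared"]];
  Table[vifR2[j], {j, 4}]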

The third output simply supplies the names of the variables in the subset under examination.

In other words, standardizing the data did not eliminate the most severe multicollinearity. It is no longer true that X1 + X2 + X3 + X4 = 98.5 (very nearly), but there is still a relationship between the four.

subsets of four variables

The smallest singular value (in the middle output) is still 0.14, but it’s for the subset without a constant – which also has the largest condition number (in the third output). That smallest singular value is confirming that dropping CON from the subset of 5 variables does not affect the multicollinearity.

We also see (we already knew) three different multicollinearities:

  1. subset #5 shows all four Xs, with 0.14
  2. subsets #2, #4 show X2 and X4 with ~ .53
  3. subsets #1, #3 show X1 and X3 with ~ 1.4

We should also note that there are two pairs of similar condition numbers (about 3 and about 9) associated with subsets containing X1, X3 and X2, X4 respectively. And the condition number for all four variables is still 37.

We don’t necessarily have to look at things in more detail: the two subsets #2 and #4 have X2 and X4 in common, so that ought to be the multicollinearity they both point to; and similarly for subsets #1 and #3.

Nevertheless, here is more detail on the fifth subset (this time calling tools2 because there is no constant term in this subset):

Rounding to the nearest .2 wipes out the last column, so we have an approximate null space; the R^2 from the VIFs shows that each variable is very well fitted by the other three. That it’s true for all four variables tells us that all of the other three are significant in each fit.

That’s just what we saw with all five variables, when the irrelevant constant term was included.

Now let’s look at the fourth subset, with a smallest singular value of .5:

The two significant R^2 confirm that the first and last variables are the multicollinear ones: that’s X2 and X4.

Here’s the second subset, with a smallest singular value of .57:

Again, only X2 and X4 are closely related.

The subset with the next smallest singular value is the third, at 1.33:

So, X1 and X3… but the rounding was 1.6 even though the R^2 are relatively large. There seems to be a scaling issue.

Still, this is worth noting. The R^2 from the VIFs show that we have a relationship between X1 and X3 – but the rounding off required to display our approximate null space is getting rather large. For the raw data, the roundings required to wipe out the last column of X.v for X1 and X3 were .6 and .8 .

Do you get the impression that I’m relying more on the R^2 than on the singular values? I am. The R^2 are very similar whether the data is raw or standardized – but the singular values and required rounding on X.v have changed drastically.

On the other hand, we have made the spread of the data smaller, and yet it takes a larger rounding to see the approximate null space. Maybe there are two effects at work: one due to scaling and one independent of it, so that we could get away with smaller rounding on the raw data because the sheer size of the data values contributed to the assessment by singular values.

I’m not sure about that. All I know is that X1 and X3 are multicollinear, but the rounding (to the nearest 1.6 above and 1.7 to follow) seems “large”.

Well, let me also suggest that an R^2 of .7 isn’t really very good, and maybe the singular value is telling us that this is weak as multicollinearity goes.

Here’s the first subset, with a smallest singular value of 1.43:

Again, X1 and X3, and the required rounding is to the nearest 1.7 .

Subsets of three

The last time we did this, we thereby eliminated the extraneous variables in the subsets of four, and saw clearly the relationship between X2 and X4 and between X1 and X3. That won’t happen this time – we’ll still have an extraneous variable. (The difference is because last time, the constant term was not extraneous.)

What do we have for the smallest singular values?
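One way to tabulate them, as a sketch (XS and subNames are from the sketches above; the subsets come out in Mathematica’s Subsets order, which appears to be the numbering used below):

  subs = Subsets[Range[5], {3}];
  Table[With[{sv = SingularValueList[XS[[All, s]]]},
      {subNames[[s]], Min[sv], Max[sv]/Min[sv]}],
    {s, subs}] // TableForm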

In the range .5 – .57

  • X2, X3, X4
  • CON, X2, X4
  • X1, X2, X4

In the range 1.33 – 1.45

  • CON, X1, X3
  • X1, X2, X3
  • X1, X3, X4.

What about the condition numbers? Three above 8: X2 and X4. Three about 3: X1 and X3. Four around 1: the unrelated variables.

In every one of those cases, however, there is one extraneous variable (e.g. X3 in the first line, CON in the second, X1 in the third.)

Then we have 4 subsets whose smallest singular value is 3.01 – 3.41: each subset involves the constant term and some pair of variables other than X2, X4 or X1, X3.

Let’s look at just a few of these. Subset 10 has a smallest singular value of .5:

Yes. That’s picking out X2 and X4 again.

Subset 9 has the smallest singular value in the second group (1.33):

Yes, X1 and X3.

Let’s look at one in the third group; subset 3 has the smallest of these smallest singular values (3.01); note that I call tools instead of tools2 because this subset includes the constant.

Yikes!

The matrix is pretty solidly 3D. Rounding it off to find a null space almost turns the matrix into a complete zero matrix.

Two things tell us that X1 and X4 are not collinear:

  1. the R^2 from the VIFs are close to 0
  2. the rounding required to wipe out the last column also wipes out the second column, and darn near wipes out the first.

subsets of two

We finally come to the end; with subsets of two variables we will at last see just X2 and X4, and X1 and X3.

We see one smallest singular value of .57 — that’s X2 and X4, of course. One smallest singular value of 1.45 — that’s X1 and X3, and the other eight smallest singular values are all around 3. It’s still true that the smallest singular value is approximately constant for a specific multicollinearity, regardless of the presence of extraneous variables.

The condition numbers are also approximately preserved: 8 for X2 and X4, 3 for X1 and X3, 1 for all the unrelated pairs.

Whew!

Now you might want to reread the summary in the introduction.

3 Responses to “Regression 1 – Multicollinearity in subsets of the standardized Hald data”

  1. rip Says:

    The middle singular value dropped out because it corresponded to the constant vector – and that’s the one I removed.

