Regression 1 – Multicollinearity in the standardized Hald data

Introduction and Review

In the previous post, we investigated multicollinearity in the Hald data as given. We used the singular value decomposition (SVD) of the design matrix X

X = u w v’.

In particular, we generalized from our experience with exact linear dependence… namely, the fact that if the columns of X are linearly dependent, then the rightmost columns of v are a basis for the null space of X… that is, the rightmost columns of X.v are 0. If, instead, we have merely near linear dependence, i.e. multicollinearity, then the rightmost columns of X.v are small rather than zero.

I will say a little more about X.v shortly.

In contrast to looking at all subsets of the columns of X, we investigated multicollinearity for four specific regressions: the best 1-, 2-, and 3-variable regressions, and for one additional regression which I had found interesting.

We learned that the most significant multicollinearity involved all four independent variables: their sum was very nearly constant, and slightly less than 100. Reading on my part informed me that these four variables were a subset of the original measurements — which were percentages adding up to 100.

It is noteworthy — even crucial — that the correlation matrix of the data does not reveal this multicollinearity.

The correlation matrix does, however, suggest that X2 and X4 are multicollinear, and that X1 and X3 are, too. As it happens, we found the multicollinearity between X2 and X4 when we looked at the best 3-variable regression; I never actually confirmed the multicollinearity between X1 and X3. (Not in public, anyway; I have since confirmed that the SVD finds it, and that it looks less severe than the multicollinearity between X2 and X4.)

In addition, we found multicollinearity between X1 and X4… but it was different from the previous cases. There is no even remotely close linear relationship between X1 and X4… and yet the SVD says the design matrix consisting of a constant (call it X0), X1, and X4 is multicollinear.

I pointed out that the three columns of that design matrix, viewed as points in 3-D space, lie exactly in the plane X0 = 1.

But linear subspaces of vector spaces must contain the origin… so two-dimensional vector subspaces of 3-D space are planes through the origin. The plane X0 = 1 does not go through the origin.

I suggested that the data was sufficiently spread out, however, that it was close to a plane through the origin, and that plane is what the singular value decomposition detected.

In such a case, we should consider transforming the data… and the most obvious transformation to try is to standardize the data.

the matrix product X.v

Before we do that, let me remind you that the matrix product X.v computes the new components of our data with respect to the basis consisting of columns of v.

Maybe that’s obvious to you. v is an orthogonal matrix, so its columns represent an orthonormal basis… and we know that the components of a vector (any row of X) with respect to an orthonormal basis can be computed by taking dot products. And that’s what the product X.v is doing, row by row of X, column by column of v.

Alternatively, we could simply write out definitions. We know that v is a transition matrix: by definition, its columns are the old components of the new basis vectors.

We know that for any other vector, with old components x and new components y, the transition matrix v maps new components to old:

x = v y.

In that equation, of course, x and y are column vectors. Now imagine that x is a row of X; we want to transpose that equation:

x’ = y’ v’.

Now post-multiply by v:

x’v = y’ v’v,

i.e.

x’v = y’ (because v is orthogonal: v’v = I).

But our matrix X consists of rows x’, so that vector equation becomes the matrix equation

Xv = Y,

which says that each row of Y is the new components of the corresponding row of X. (I apologize for the notation, but I think we’re stuck with X for the matrix but x’ for an individual row.)

Oh, I've said a few times that v is orthogonal. What does that mean geometrically? Answer: v is a rotation. But you might note that it's a rotation applied to the rows of X, so it's a rotation in 3D, 4D, or 5D for the regressions we looked at.
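Here is a quick numeric check of that claim on a small random matrix, just a sketch and not specific to the Hald data: since X = u w v' and v'v = I, the product X.v should equal u.w.

(* numeric check that X.v gives the new components, i.e. that X.v equals u.w *)
X = RandomReal[{-1, 1}, {5, 3}];
{u, w, v} = SingularValueDecomposition[X];
Chop[X.v - u.w]   (* a matrix of zeros, to machine precision *)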

Now let’s get the Hald data again, and standardize it.

standardizing the data and getting new regressions

Here's the Hald data; the first four columns are X1 through X4 and the last column is the response y:

\left(\begin{array}{ccccc} 7 & 26 & 6 & 60 & 78.5 \\ 1 & 29 & 15 & 52 & 74.3 \\ 11 & 56 & 8 & 20 & 104.3 \\ 11 & 31 & 8 & 47 & 87.6 \\ 7 & 52 & 6 & 33 & 95.9 \\ 11 & 55 & 9 & 22 & 109.2 \\ 3 & 71 & 17 & 6 & 102.7 \\ 1 & 31 & 22 & 44 & 72.5 \\ 2 & 54 & 18 & 22 & 93.1 \\ 21 & 47 & 4 & 26 & 115.9 \\ 1 & 40 & 23 & 34 & 83.8 \\ 11 & 66 & 9 & 12 & 113.3 \\ 10 & 68 & 8 & 12 & 109.4\end{array}\right)

The Mathematica® command I want is simply “Standardize”, but I include //N to get decimals instead of fractions:
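Something like this, as a minimal sketch (I'll call the data matrix hald and the standardized matrix std):

(* the Hald data from above: columns X1 through X4 and the response y *)
hald = {{7, 26, 6, 60, 78.5}, {1, 29, 15, 52, 74.3}, {11, 56, 8, 20, 104.3},
   {11, 31, 8, 47, 87.6}, {7, 52, 6, 33, 95.9}, {11, 55, 9, 22, 109.2},
   {3, 71, 17, 6, 102.7}, {1, 31, 22, 44, 72.5}, {2, 54, 18, 22, 93.1},
   {21, 47, 4, 26, 115.9}, {1, 40, 23, 34, 83.8}, {11, 66, 9, 12, 113.3},
   {10, 68, 8, 12, 109.4}};
(* Standardize rescales each column to mean 0 and variance 1; //N forces decimals *)
std = Standardize[hald] // N;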

It’s easy enough to check that each column has mean 0 and variance 1:
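For instance, with std as above:

Mean[std] // Chop       (* the column means: all 0 *)
Variance[std] // Chop   (* the column variances: all 1 *)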

Let’s run stepwise (once, so it’s really just forward selection) and backward selection:
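(The backward code itself is in the comments below.) As a minimal sketch of the kind of fit involved, here is the full 4-variable regression on the standardized data, with std as above and variable names of my choosing:

(* the full regression of standardized y on standardized X1..X4 *)
reg4 = LinearModelFit[std, {x1, x2, x3, x4}, {x1, x2, x3, x4}];
reg4["AdjustedRSquared"]
reg4["ParameterTable"]   (* coefficients, standard errors, t statistics *)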

NOTE that the adjusted R^2 values are the same as for the original data, as are the orders of selection. We get the same fits geometrically, although the coefficients β have changed. The standard errors have changed, too, but the t statistics have not. (And I didn't change the names of the variables.)

That is, we get the same best 1-, 2-, and 3-variable regressions, as far as t statistics and R^2 are concerned.

Here they are:

Note that the constant term is always 0… and note that the 4-variable regression tells us we have multicollinearity, because all but one t statistic fell to insignificance (and, strictly speaking, so did that 2.08 value for X1).

Hmm. I'm not actually happy with t statistics of 0 for the constant terms. But given that the coefficients of the constant terms are zero and the standard errors are not, the t statistics (one divided by the other) must be zero, too.

the 4-variable regression

Let’s do our thing. For the 4-variable regression, form the design matrix… get its SVD X = u w v’… display the singular values (the nonzero diagonal entries of w)… then compute X.v and see what rounding makes a column or columns zero:
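A minimal sketch of that computation, with std as above; I prepend a column of 1s for the constant term X0:

(* design matrix: constant plus the four standardized predictors *)
X = Map[Prepend[#, 1.] &, std[[All, 1 ;; 4]]];
{u, w, v} = SingularValueDecomposition[X];
Diagonal[w]                       (* the five singular values *)
Round[X.v, 0.2] // MatrixForm     (* round to the nearest 0.2; look for zero columns *)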

The smallest singular value is only one order of magnitude smaller than the second smallest, instead of two as it was for the raw data. If we round to the nearest 0.2, we see only zeros in column 5. Column 4 is also small, but it does not round to all zeros.

What about the VIFs (equivalently, the R^2 for regressions of the Xi on each other)? The VIFs are available:
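One standard way to get them (not necessarily the way I produced the numbers here) is the diagonal of the inverse of the predictors' correlation matrix:

(* VIFs as the diagonal of the inverse correlation matrix of X1..X4 *)
vifs = Diagonal[Inverse[Correlation[std[[All, 1 ;; 4]]]]]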

The resulting R^2 for the four nonzero numbers are:

(Recall that VIF = 1/(1 - R^2), so given a VIF I can compute R^2 = 1 - 1/VIF; that's all my function "rsq" does. The first number, for example, is the R^2 for a regression of X1 on X2, X3, and X4.)
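In code, that helper and its application to the VIFs above amount to no more than this:

rsq[vif_] := 1 - 1/vif   (* R^2 of one Xi regressed on the other X's *)
rsq /@ vifs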

Just as we got the same t statistics and Adjusted R^2 (and R^2) for y as a function of the Xi, we get the same t statistics and R^2 for the standardized variables as functions of each other.

Those four R^2 values are exactly what we got for the raw data.

The four variables Xi are multicollinear, and all four are involved. And the R^2’s for X2 and for X4 are higher than that for y as a function of all four. That comparison still holds, even though we had to do much more rounding to make a column of X.v round off to zero.

I take it that this is still the most serious multicollinearity… and we have a scale-independent suggestion that it is serious in some sense (namely the R^2 for the X's compared to the R^2 for y)… but in absolute terms the multicollinearity might not be as serious as it was. Let's just keep poking at the data.

the 3-variable regression

For the 3-variable regression, too, we find the same result as before.

The smallest singular value is not even one order of magnitude smaller than the second smallest. If we round to the nearest 0.7, we see only zeros in the last column of X.v. Our rounding has moved from 0.2 to 0.7, not all that large a change.

What about the VIFs (i.e. the R^2 for regressions of the Xi on each other)? The VIFs are available… and the resulting R^2 values are…
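Here is the corresponding sketch, taking the three variables to be X1, X2, and X4 (columns 1, 2, and 4 of std), with rsq as above:

(* the 3-variable case: constant plus X1, X2, X4 *)
X3v = Map[Prepend[#, 1.] &, std[[All, {1, 2, 4}]]];
{u3, w3, v3} = SingularValueDecomposition[X3v];
Diagonal[w3]                        (* the four singular values *)
Round[X3v.v3, 0.7] // MatrixForm    (* round to the nearest 0.7 *)
rsq /@ Diagonal[Inverse[Correlation[std[[All, {1, 2, 4}]]]]]   (* the three R^2 *)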

This is what we saw before: the last two variables are multicollinear — they are X2 and X4 — but the first variable is not involved.

the 2-variable regression

Let me look at the smallest regression, bac 2:
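The singular values come from the three-column design matrix (constant, X1, X4); a minimal sketch, with std as above:

(* the 2-variable case: constant, X1, X4 *)
X2v = Map[Prepend[#, 1.] &, std[[All, {1, 4}]]];
{u2, w2, v2} = SingularValueDecomposition[X2v];
Diagonal[w2]   (* the three singular values *)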

Whoa! The three singular values are very nearly equal. Without even looking at the VIFs, I have to conclude that this data (X0, X1, X4) is thoroughly 3-dimensional.

The multicollinearity we saw for this subset of the raw data really was an artifact of the scaling. Once we standardize the data, even though it still has a constant term (X0 = 1), the data now lies in a cube, roughly speaking, rather than being spread out over a plane.

(Let me freely confess that I knew this – I had seen this happen – before I published the previous post; that's why I could safely say that we were looking at an artifact of scaling. More to the point, it was this multicollinearity that delayed publication of last week's post – until I successfully eliminated it, and understood how it existed in the raw data without showing up as a fit of X4 on X1.)

Incidentally, whatever rounding makes the third column vanish… will wipe out all the columns.

Shout it from the rooftops: X0, X1, X4 is 3D, once we standardize X1 and X4.

In the previous post, the SVD said that X0, X1, X4 was somewhat multicollinear; but the VIF R^2 said that X1 and X4 weren’t remotely close to lying on a line. Standardizing the data has changed the verdict from the SVD: no multicollinearity.


2 Responses to “Regression 1 – Multicollinearity in the standardized Hald data”

  1. James Says:

    Where are the definitions of the "stepwise" and "backward" functions? I can't find them.

  2. rip Says:

    Hi James,

    I’m very sure that I never revealed my “stepwise” code — it’s ugly beyond belief. “backward” isn’t bad, so here it is. Maybe you can use it to write your own “stepwise”.

    First a little subroutine to figure out what variable to drop from a regression:

    (* pick the variable with the smallest |t| statistic, ignoring the constant
       term, and return its position among the variable names *)
    toDrop[reg_] := Module[{ts, min, pos},
      ts = Abs[reg["ParameterTStatistics"]];
      min = Min[Drop[ts, 1]];                  (* smallest |t| among the non-constant terms *)
      pos = Position[ts, min] - 1 // Flatten;  (* shift past the constant term *)
      pos
      ]

    Then the main function:

    (* backward elimination: start with all the variables, drop the least
       significant one at each step, and return the list of fits
       (reg[[1]] is the final 1-variable fit, reg[[len]] is the full fit) *)
    backward[data_, names_] := Module[{len, reg, n2 = names, i, p1},
      len = Length[names];
      reg = ConstantArray[1, len];                    (* placeholders for the fits *)
      reg[[-1]] = LinearModelFit[data, n2, names];    (* the full regression *)
      Print[n2, reg[[-1]]["AdjustedRSquared"]];
      p1 = toDrop[reg[[-1]]];
      For[i = 1, i < len, i++,
       n2 = Drop[n2, p1];                             (* remove the weakest variable *)
       reg[[-i - 1]] = LinearModelFit[data, n2, names];
       Print[n2, reg[[-i - 1]]["AdjustedRSquared"]];
       p1 = toDrop[reg[[-i - 1]]];
       ];
      reg
      ]
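
    A call looks something like this, using the data matrix (response in the last column) and whatever symbols you use for the variable names:

    regs = backward[std, {x1, x2, x3, x4}];
    (* prints the surviving variables and the adjusted R^2 at each step;
       regs[[4]] is the full fit, regs[[1]] the final 1-variable fit *)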

