The Hald data turns out to have been an excellent choice for investigating multicollinearity: it has at least four “near linear dependencies”. I’m about to show the details of three of them. (And in a subsequent post, I think I can eliminate all but three of them – but not the same three!)
We have already seen one of them: we know that the four independent variables have a nearly constant sum just under 100. (These four variables are, in fact, a subset of a larger set of variables – whose sum was 100%.)
We have seen two approaches to finding (exact) linear dependence of a set of variables:
- looking at the singular value decomposition (SVD) of subsets;
- looking at an orthonormal basis for the null space.
We used the singular value decomposition of the design matrix for all four variables (that is, our subset was the entire set) to discover that all four variables (with a constant, too) were multi-collinear. But I did not continue looking at all the subsets.
I think I will return to that approach, looking at all subsets… but not today. Instead, I want to look at an orthonormal basis for the closest thing we have to a null space. And I want to do it for four regressions which I decided were worth investigating.
I’m going to use one additional tool, the variance inflation factors – indirectly. You will see that I view them as one possible explanation for multicollinearity detected by the SVD. (And yes, that should be surprising.)
Read the rest of this entry »