## Introduction and Review

In the previous post, we investigated multicollinearity in the Hald data as given. We used the singular value decomposition (SVD) of the design matrix X

X = u w v’.

In particular, we generalized from our experience with linear dependence: if the columns of X are linearly dependent, then the rightmost columns of v form a basis for the null space of X, i.e. the rightmost columns of X.v are zero. If, instead, we have merely near linear dependence, i.e. multicollinearity, then the rightmost columns of X.v are small rather than zero.
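To make that concrete, here is a minimal sketch in NumPy. The data below are synthetic, not the actual Hald matrix: I simply construct four columns that are percentages summing almost exactly to 100, which mimics the situation described in this post. The norm of the rightmost column of X.v equals the smallest singular value, so "small but not zero" is exactly what we see.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Hald situation (NOT the real Hald data):
# four columns are percentages whose sum is very nearly constant at 100.
n = 13
base = rng.uniform(5, 40, size=(n, 4))
pct = 100 * base / base.sum(axis=1, keepdims=True)
pct += rng.normal(0, 0.1, size=pct.shape)  # so the sum is "nearly", not exactly, 100

# Design matrix with a constant column, as in the regression.
X = np.column_stack([np.ones(n), pct])

# SVD: X = u @ np.diag(w) @ v.T  (NumPy returns v transposed, as vt)
u, w, vt = np.linalg.svd(X, full_matrices=False)
v = vt.T

# The rightmost column of X.v is small but not zero:
# near linear dependence (100*const - x1 - x2 - x3 - x4 is nearly 0).
Xv = X @ v
print(w)                           # singular values, in descending order
print(np.linalg.norm(Xv[:, -1]))  # equals w[-1], small relative to w[0]
```

With exact linear dependence, the last singular value (and hence the last column of X.v) would be exactly zero; the small perturbation added above is what turns that into multicollinearity.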

I will say a little more about X.v shortly.

Rather than examine all subsets of the columns of X, we investigated multicollinearity for four specific regressions: the best 1-, 2-, and 3-variable regressions, plus one additional regression which I had found interesting.

We learned that the most significant multicollinearity involved all four independent variables: their sum was very nearly constant, just under 100. Further reading informed me that these four variables were a subset of the original measurements, which were percentages adding up to 100.

It is noteworthy — even crucial — that the correlation matrix of the data does not reveal this multicollinearity.
