## Happenings – 2010 Oct 9

I am still working pretty much exclusively on regression analysis. I’m even working on the post I planned, the so-called singular deletion statistics.

And I’ve seen some new ideas and information.

Let’s see. Although the residuals are not normal, each apparently is a singular normal… but since they don’t have the same variance, I think it doesn’t really matter: it seems silly to check the residuals for constant variance, and it seems almost as silly to check them for normality when we know they came from distributions with different variances (given our model assumptions). Conclusion: use the standardized residuals for tests of constant variance and of normality. Ah, no, I don’t yet know whether the standardized residuals are supposed to be normal if the disturbances are. But if it makes sense to test anything for normality, it would seem to be the standardized residuals.
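To make the "different variances" point concrete, here is a minimal sketch in Python with NumPy (my choice of tools for illustration, not something used above). Under the usual model assumptions the residual vector has covariance σ²(I − H), where H is the hat matrix, so the i-th residual has variance σ²(1 − h_ii). I am assuming that "standardized" means dividing each residual by its estimated standard deviation, s·sqrt(1 − h_ii); the function name and its arguments are hypothetical.

```python
import numpy as np

def standardized_residuals(X, y):
    """OLS residuals, leverages, and standardized residuals.

    Assumes X already contains a column of ones for the constant term
    and has full column rank.
    """
    n, p = X.shape
    # Hat matrix H = X (X'X)^{-1} X'; for full-rank X this equals X @ pinv(X).
    H = X @ np.linalg.pinv(X)
    h = np.diag(H)                # leverages h_ii
    e = y - H @ y                 # raw residuals
    s2 = e @ e / (n - p)          # estimate of the disturbance variance
    r = e / np.sqrt(s2 * (1.0 - h))
    return e, h, r
```

The leverages h_ii generally differ from one observation to the next, which is exactly why the raw residuals do not share a common variance even when the disturbances do.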

Another idea. Someone suggested that instead of looking at all possible subsets of variables, we could simply look at the best regression with each possible number of variables, i.e. the best with j variables, as j runs from 1 to k, the maximum number. That is, find the best regression with 1 variable, then the best regression with 2 variables, etc. This is different from stepwise (or forward selection) in that it doesn’t require that one of the two variables be the one from the j=1 regression. To put that another way, at each step, having a best regression with j variables, stepwise looks for the best variable to add to that regression. The new suggestion is: don’t bother requiring any relationship between the regression with j variables and the one with j+1 variables — just get the best, for each j.
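The difference between the two searches can be sketched as follows. This is only an illustration under assumptions of my own: "best" is taken to mean smallest residual sum of squares (equivalently, largest R² at a fixed number of variables), every regression includes a constant, and the function names are hypothetical. The stepwise side is shown as plain forward selection, with no removal step.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for OLS of y on a constant plus columns `cols` of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_of_each_size(X, y):
    """For each j = 1..k, search every j-variable subset and keep the best one."""
    k = X.shape[1]
    return {j: min(combinations(range(k), j), key=lambda c: rss(X, y, c))
            for j in range(1, k + 1)}

def forward_selection(X, y):
    """For each j, add the single best variable to the previous j-1 choices."""
    k = X.shape[1]
    chosen, path = [], {}
    for j in range(1, k + 1):
        remaining = [c for c in range(k) if c not in chosen]
        best = min(remaining, key=lambda c: rss(X, y, chosen + [c]))
        chosen.append(best)
        path[j] = tuple(sorted(chosen))
    return path
```

The two always agree at j = 1 and at j = k. In between, the subsets from `forward_selection` are nested by construction, while those from `best_of_each_size` need not be, and its residual sum of squares at each j can never be larger.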

The downside is that this simply amounts to editing the list of all possible subsets, displaying the best one at each level. We still have to run every possible regression. For k variables that’s 2^k regressions; for k = 20, we’re talking more than 1 million. This idea doesn’t save time, and it’s feasible only when we can run all possible subsets.

Given all possible regressions, this is nothing more than selecting a subset. On the other hand, if the number of variables is large, then we could run all possible subsets up to some feasible number of variables.
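The arithmetic behind that trade-off is easy to check. A small sketch, with hypothetical function names, counting the constant-only regression as one of the subsets so that the total is 2^k:

```python
from math import comb

def n_all_subsets(k):
    """Number of regressions when each of k variables is either in or out."""
    return 2 ** k

def n_subsets_up_to(k, m):
    """Number of regressions using at most m of the k variables."""
    return sum(comb(k, j) for j in range(m + 1))

print(n_all_subsets(20))       # 1048576
print(n_subsets_up_to(20, 3))  # 1351
```

So with 20 variables, stopping at 3 variables per regression cuts the work from about a million fits to 1,351.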

It won’t replace stepwise in my toolbox, but it will augment it: I can imagine using it to start a stepwise analysis with more than 1 variable instead of 1 as I usually do. In addition, it will give us a different way of looking at all possible subsets. (The Hald data only has 4 variables; there are only 16 possible regressions. Looking at all of them is not only feasible, it is almost mandatory. And we will do it.)

Outside of regression… I’ve continued looking at the Bayesian analysis book. It’s got a chapter on spectrum analysis, and I have to do more with that….