backward selection and stepwise: code

Just about a month ago, in a comment, I posted my code for backward selection. I provided virtually no description of the code. I also said that I was too embarrassed to post my code for forward selection.

Well, I decided to do something about that – I have rewritten and tested my code for forward selection. It is, of course, still called “stepwise”, because the name is too deeply ingrained in my head.

While I’m showing you my new code for stepwise, I might as well discuss the backward selection code, too.

In other words, this post is about Mathematica code rather than about mathematics. Still, you might be able to adapt this to another programming language. In addition, it will show you that I have made at least one relatively severe assumption.

Simply because I have already posted its code, let me discuss backward selection first.

Here’s a picture of a utility function named “toDrop”.
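In case the picture is hard to read, here it is as text, reassembled from the lines discussed below (so the spacing may differ from the screenshot):

toDrop[reg_] := Module[{ts, min, pos},
  ts = Abs[reg["ParameterTStatistics"]];
  min = Min[Drop[ts, 1]];
  pos = Position[ts, min] - 1 // Flatten;
  pos
  ]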

This utility function has one input, reg, which is a single LinearModelFit. (Perhaps I should have declared it as such, but I didn’t.) It returns pos, the index of the variable having the lowest (absolute value) t-statistic.

Every variable used in the module – all intended to be local – is declared inside the {} following the word Module:

{ts, min, pos}

The line

ts = Abs[reg["ParameterTStatistics"]];

extracts the absolute values of t-statistics from reg.

The line

min = Min[Drop[ts, 1]];

says exclude the t-statistic of the constant term from consideration, and get the minimum of all the other absolute values. That is, the first t-statistic is presumed to be that of the constant term.

NOTE: this means that my utility function cannot work properly if the constant term has been omitted from the regression. On the other hand, this utility is only called by backward selection — which was called with a list of variables, not with a design matrix. In other words, backward selection itself is never told to omit the constant.

NOTE also that if two t-statistics are equal (to within Mathematica’s representation of them), I will drop the first. The probability of two equal t-statistics seems too low to worry about.

The line

pos = Position[ts, min] - 1 // Flatten;

does three things: first, find the index of the minimum t-statistic in the list ts; second, subtract 1 from it, to get the index of the variable; third, I need to lose a spare set of { } around the answer.

Let me be clear. If there are, in my usual parlance, “3 variables” then there are 4 t-statistics – the first of which is for the constant term. One way to think of it is that I have t-statistics numbered 1-4, but I want variables numbered 0-3, the constant being variable 0.
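A tiny made-up example may help:

ts = {12.3, 0.7, 4.5, 2.1};       (* 4 t-statistics: the constant plus "3 variables" *)
min = Min[Drop[ts, 1]]            (* 0.7, the smallest after skipping the constant *)
Position[ts, min] - 1 // Flatten  (* {1}: the minimum is t-statistic number 2, i.e. variable number 1 *)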

The final line returns pos.

Now for backward selection itself.

Here’s a picture of the code:
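As before, in case the picture is hard to read, here is the module as text, reassembled from the lines discussed below:

backward[data_, names_] := Module[{len, reg, n2 = names, i, p1},
  len = Length[names];
  reg = ConstantArray[1, len];
  reg[[-1]] = LinearModelFit[data, n2, names];
  Print[n2, reg[[-1]]["AdjustedRSquared"]];
  p1 = toDrop[reg[[-1]]];
  For[i = 1, i < len, i++,
   n2 = Drop[n2, p1];
   reg[[-i - 1]] = LinearModelFit[data, n2, names];
   Print[n2, reg[[-i - 1]]["AdjustedRSquared"]];
   p1 = toDrop[reg[[-i - 1]]]
   ];
  reg
  ]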

It has two inputs (expected to be lists but not declared as such): data, names. Every local variable in the module is declared:

{len, reg, n2 = names, i, p1}.

Instead of the first line

len = Length[names];

I could have initialized len in the declaration, as I initialized n2 = names.

The line

reg = ConstantArray[1, len];

is intended to create reg as a list of the correct size, len. There’s nothing special about initializing every entry to 1; I could have initialized them to anything. I just required a list of the correct size.

The line

reg[[-1]] = LinearModelFit[data, n2, names];

runs the regression with all variables (RHS) and stores it in the last (-1) location of the list reg. I want it to correspond to the last regression from stepwise, so I record it in the last position even though it is run first.

The line

Print[n2, reg[[-1]]["AdjustedRSquared"]];

prints the list of names n2, all of which are in this regression, and the Adjusted R Squared for this regression. (For this regression only, n2 = names).

The next line

p1 = toDrop[reg[[-1]]];

sets p1 to the index of the variable with the lowest (absolute value) t-statistic.

Then we have a FOR loop, because I know exactly how many times it is to be executed. I’m not comfortable with the punctuation of FOR loops, so I tend to value working examples of them.

The initialization line

For[i = 1, i < len, i++,

has the required information: index i, runs from 1 to len - 1 (note the strict inequality), and is incremented by 1 on each pass. It might have been cleaner to let i start high and be decremented on each pass.

The line

n2 = Drop[n2, p1];

redefines n2 by dropping the variable name with index p1; we had gotten the index but not removed that name.
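For example (with made-up names): p1 comes back from toDrop as a one-element list such as {2}, and Drop with a list specification deletes exactly that element:

n2 = {x1, x2, x3};
p1 = {2};           (* the form toDrop returns *)
Drop[n2, p1]        (* {x1, x3} *)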

Next, run a regression (RHS) using the new list n2; save it in reg in the next slot from the back (LHS).

reg[[-i - 1]] = LinearModelFit[data, n2, names];

Note that the first and third parameters (data, names) in the call to LinearModelFit do not change. The second parameter specifies which subset of the variables is to be used; we do not have to change the dataset itself.

(Not there, anyway. I do have to change the input to the backward module if I want to remove a variable from consideration.)

The next line

Print[n2, reg[[-i - 1]]["AdjustedRSquared"]];

prints n2, i.e. the list of variables used, and the Adjusted R Squared.

The last line of the FOR loop determines which variable to drop from the regression we just ran.

p1 = toDrop[reg[[-i - 1]]];

Go do another one.

Finally, after we exit the FOR loop, the module returns reg, the list of regressions we ran.

Let me remind you of what the output usually looks like:

Hey, what happened to the regressions? They were suppressed by the semi-colon at the end of the command. Personally, I don't mind seeing them… but for publication, I'd just as soon cut things to essentials.

Let me be clear. The visible lines of data are the result of the Print commands in the backward module; they are not the contents of bac. Without that semicolon, we would see this:
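In other words, assuming the module is named backward and that data and names have been defined, the two forms of the command are:

bac = backward[data, names];   (* with the semicolon: Print lines only *)
bac = backward[data, names]    (* without it: Print lines, then the whole list bac *)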

OK, so what does my new stepwise look like?
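Here is the module as text, reassembled from the pieces discussed below:

stepwise[data_List, names_List] := Module[
  {table, max, pos, used = {}, avail = names, count, i, nvars = Length[names], reg},
  reg = ConstantArray[1, nvars];
  For[count = 1, count <= nvars, count++,
   table = Table[
     LinearModelFit[data, {used, avail[[i]]} // Flatten, names]["AdjustedRSquared"],
     {i, Length[avail]}];
   max = Max[table];
   pos = Position[table, max][[1, 1]];
   used = {used, avail[[pos]]} // Flatten;
   avail = Drop[avail, {pos}];
   Print[{used, table[[pos]]} // Flatten];
   reg[[count]] = LinearModelFit[data, used, names]
   ];
  reg
  ]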

It has two inputs, data and names, which this time are declared as Lists. (toDrop's input, recall, wasn't declared at all, though perhaps it should have been declared a LinearModelFit.)

As usual, I declare every local variable:

{table, max, pos, used = {}, avail = names, count, i, nvars = Length[names], reg}.

We see that used is initialized to an empty list, avail is initialized to the full list of names, and nvars is initialized to the length of the names list (the number of variables, and hence the number of regressions we will run).

As before, initialize reg so that it is a list (of 1s, but the value is irrelevant):

reg = ConstantArray[1, nvars];

Then we begin a FOR loop.

For[count = 1, count <= nvars, count++,

This time, the loop index is run nvars times ("<="); and, as before, it starts at 1 and is incremented by 1 on each pass.

The first line of the loop…

table = Table[LinearModelFit[data, {used, avail[[i]]} // Flatten, names]["AdjustedRSquared"], {i, Length[avail]}];

gets a list of Adjusted R Squared values. The first and third parameters to each regression call are the same: data and names. The second parameter, however, appends one available variable avail[[i]] to the list, used, of variables already being used. Again, the used list is common to every regression in this table; they differ only in which single variable, avail[[i]], they appended.
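To see what that second parameter looks like, take a made-up intermediate state:

used = {x2}; avail = {x1, x3};
{used, avail[[1]]} // Flatten   (* {x2, x1} *)
{used, avail[[2]]} // Flatten   (* {x2, x3} *)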

Find the largest Adjusted R Squared in the list, and its position in the list:

max = Max[table];
pos = Position[table, max][[1, 1]];

NOTE: if there are two or more Adjusted R Squared values which are the same, we will find only the first. I may decide to remove this restriction – I think it will come into play if we ever have a perfect fit using a subset of the variables… because after some point, all the Adjusted R Squared values will be 1. This is the relatively severe assumption I have made. It's pretty reasonable to think that two real numbers are unlikely to be exactly equal… but in the case of Adjusted R Squared, if it is ever 1, it remains 1 and we will have equality. (At least, I think that's my experience.)

Next, take the selected variable, at pos in avail, and append it to the end of the list used; and then drop it from the list avail:

used = {used, avail[[pos]]} // Flatten;
avail = Drop[avail, {pos}];
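Continuing the made-up example, if pos were 2, we would get:

used = {x2}; avail = {x1, x3}; pos = 2;
used = {used, avail[[pos]]} // Flatten   (* now {x2, x3} *)
avail = Drop[avail, {pos}]               (* now {x1} *)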

Now that we have added it to used, we should print used and the Adjusted R Squared:

Print[{used, table[[pos]]} // Flatten];

Now, we still have to run and save the regression itself; all we have kept so far is its Adjusted R Squared. We ran it earlier – along with the others – but the fit itself wasn't saved. Yes, it's wasteful to run it again, but it's simpler this way.

reg[[count]] = LinearModelFit[data, used, names];

Finally, after we exit the FOR loop, return the list reg of regressions.

As it was for backward, so for stepwise. The following command executes the Print command in the Module, and displays reg as it saves it:

Putting a semi-colon after the command…

does not affect the internal Print command… but it suppresses the printing of reg. Why not? We would soon be looking at them, or some of them, in more detail.
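Concretely, using a hypothetical name st for the saved result:

st = stepwise[data, names]    (* shows the Print lines, then displays the list st *)
st = stepwise[data, names];   (* shows the Print lines only *)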
