Regression 1: Single deletion statistics

Introduction

edited 2011 Sep 12 to correct the equation that leads to s(i).

This post continues the examination of regression properties available in Mathematica®. You might want to have read the previous “regression 1” posts, starting with the “Introduction”.

The following are all related to deletion of a single observation:

  • BetaDifferences,
  • CookDistances,
  • CovarianceRatios,
  • FitDifferences,
  • FVarianceRatios, and
  • SingleDeletionVariances.

I will discuss them in this post, along with

  • StandardizedResiduals and
  • StudentizedResiduals,

which are used in single-deletion analysis, but could certainly have been discussed apart from it. I will also introduce another form of residual.

My major reference for this underlying material is Belsley, Kuh, and Welsch – which is on my bibliographies page. I’m not positive, but I more than suspect they are responsible for the discovery of many of these statistics. In addition, I have been reading and will be using Atkinson’s “Plots, Transformations and Regression”, ISBN 0 19 853359 4, to see what these statistics can actually do for me.

The primary purpose of this post, however, is to show exactly how these statistics are computed. (For me, it’s not enough to have Mathematica® provide them; I have to know how they were computed.) I will, in fact, compute them for our Hald regression, but I won’t be doing a full-fledged analysis. That will come later, after I have more experience with them.

Here’s the key: It is a rather remarkable fact that we can deduce, from a given regression, what would happen to the fit if we removed exactly one observation.

Yes! I can figure out – without rerunning the regression – what the regression fit would be if I removed, say, the 6th observation from the data matrix. More usefully, I can make this determination for the hypothetical removal of each one of the observations… this means I can tell, easily, if the removal of any one observation would have significant impact on the final equation.

And that’s actually the key point. Rerunning a regression with one observation dropped is about as simple as it gets nowadays. But figuring out which observations would have significant impact if dropped? For that, it’s very handy to be able to assess all the individual potential impacts without actually running any more regressions.

I am still using the Hald data. Although I will introduce another regression equation shortly, our starting point will be the only equation we have fitted so far, using the 3 variables X1, X2, X4:
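Here, as a sketch, is how that fit might be set up in Mathematica. I'm assuming the Hald data sit in a 13×5 matrix I'll call hald, with columns X1 through X4 and then y; hald and data are my names, while reg is the fit used throughout this post.

  data = hald[[All, {1, 2, 4, 5}]];   (* keep columns X1, X2, X4 and y *)
  reg = LinearModelFit[data, {X1, X2, X4}, {X1, X2, X4}];
  reg["BestFit"]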

I’m also going to want the Catcher Matrix C (defined by \beta = C\ y\ ), and the diagonal – not the entire matrix – of (X'X)^{-1}\ , which I will denote XTXi:
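Something like the following should produce both, assuming I'm reading the CatcherMatrix property correctly; cmat is my name for C, since C itself is a protected symbol in Mathematica.

  X = reg["DesignMatrix"];
  y = data[[All, -1]];
  cmat = reg["CatcherMatrix"];                  (* so that beta = cmat . y *)
  XTXi = Diagonal[Inverse[Transpose[X] . X]];   (* just the diagonal of (X'X)^-1 *)
  Chop[cmat . y - reg["BestFitParameters"]]     (* sanity check: a list of zeros *)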

Next, I want all of the following definitions in one place. We have the residuals e, the diagonal h of the hat matrix H, the estimated standard deviation s, and the single deletion standard deviations si. (Recall that any standard deviation is the square root of the associated variance.)

The e, h, and si are vectors, and I will denote their ith terms by ei, hi, and sii in text… they are e[[i]], h[[i]], and si[[i]] in Mathematica® commands, where the double square brackets delimit subscripts.
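All four can be pulled straight from the fit; a minimal sketch, using the same names as the text:

  e = reg["FitResiduals"];
  h = reg["HatDiagonal"];
  s = Sqrt[reg["EstimatedVariance"]];
  si = Sqrt[reg["SingleDeletionVariances"]];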

It can be shown that the covariance matrix of the residuals is s^2 (1-H), so the variance of the ith residual ei is s^2 (1 – hi), where hi is the ith entry on the hat diagonal. This is one of many things we need the hat diagonal for.

If it seems strange that we associate a variance with each residual… All of the stochastic information in a regression comes from the initial assumption that the true model is

y = X B + u,

where the u are independently and identically distributed with mean 0 and variance \sigma^2\ ; and in fact, normally distributed. In particular, for 13 observations, the covariance matrix of the u is a 13×13 diagonal matrix, with \sigma^2\ on the diagonal.

By the way, I am distinguishing the true parameters B from the estimated parameters \beta\ , and the true disturbances u from the residuals e. The fit is still written

y = X\ \beta + e\ .

It is from that hypothetical covariance matrix for the u that we compute every other variance that crops up. In particular, the first thing we compute is the estimated variance s^2 of the disturbances u:

s^2 = e.e / (n-k),

where e.e is the sum of squares of the residuals – written as the dot product of e with itself – and n and k are the number of observations and the number of parameters in the fit, respectively. (No, that factor of n-k should probably not be obvious. I’m planning to put all of these proofs in one place. Later.)

(Recall that for the Hald data regression, we have n = 13 observations and k = 4 parameters.)
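Here's that computation as a sketch, checked against Mathematica's EstimatedVariance; n and k are my names.

  n = Length[e];                                     (* 13 observations *)
  k = Length[reg["BestFitParameters"]];              (* 4 parameters *)
  Chop[(e . e)/(n - k) - reg["EstimatedVariance"]]   (* should be 0 *)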

Then we compute the covariance matrix for the residuals e. I claim it’s

\sigma^2\ (1-H)\ ,

i.e. the covariance matrix of the disturbances times the identity minus the hat matrix, 1-H. Since we don’t know \sigma^2\ , we replace it by its estimate s^2:

s^2 (1-H).

Oh, before we get going, let me introduce some vague terminology. A strange point is one that looks different, for any reason… We have the following kinds of strange points: an outlier is a value which is far from the central values… a point of high leverage is one with a large value of h (the value of hi is often called the leverage of the ith observation)… and an influential point is one which has more effect than most on the fit or some of its properties.

The single deletion properties are a way of looking for strange observations.

residuals

Now, how do we standardize a variable? Subtract the mean and divide by the standard deviation. Well, the mean of the residuals is zero, and their standard deviation is the square root of their variance, so we have that the standardized residuals, in the usual sense of the word, would be:

es_i = \frac{e_i}{s\ \sqrt{1-h_i} }

(Rather than eyeball two vectors which look the same, I simply subtract them, “chopping” to get rid of the 10^-16 garbage that I get for the difference. That string of zeros says that the standardized residuals I computed match the ones that Mathematica gave me.)
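Here's that check written out; es is my name for the hand-computed standardized residuals.

  es = e/(s Sqrt[1 - h]);                   (* hand-computed standardized residuals *)
  Chop[es - reg["StandardizedResiduals"]]   (* a string of zeros if they agree *)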

If the residuals were normal, then the standardized residuals would behave, approximately, like draws from a standard normal distribution. That value of 1.79722 – call it 1.8 – for example, says that residual is 1.8 standard deviations from its mean. With all values between ±2, the standardized residuals look pretty good.

By standardizing the residuals, assuming that they are normally distributed with mean zero and variance 1 – which I have read but do not personally know to be true even if our assumptions on the true model are valid! – we get to use our usual cutoffs for a normal distribution: these numbers are now standard deviations, and more than 3 standard deviations is way out there. Whatever that really means.

But, since we are considering deleting exactly one observation – more precisely, we’re examining each of them for deletion – what happens to the estimated variance s^2 if we delete the ith observation? The results are called si^2, the single deletion variances.

Here’s how they are computed; since I wanted standard deviations, I solved the following equation for si (edit – I had an erroneous square root of 1-hi in the LaTeX; the screenshot was and is correct):

(n-k-1)\ si^2 = (n-k) s^2 - \frac{e_i^2}{1-h_i}
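Solving that for si^2 and checking against Mathematica, as a sketch (si2 is my name):

  si2 = ((n - k) s^2 - e^2/(1 - h))/(n - k - 1);   (* single deletion variances *)
  Chop[si2 - reg["SingleDeletionVariances"]]       (* and si = Sqrt[si2] *)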

To get what Mathematica calls the StudentizedResiduals, we start to compute the standardized residuals, but replace s by si. It is noteworthy that although si^2 has 1-h in its definition, we still retain the Sqrt[1-h] when we compute Studentized Residuals.

By the way, I know of no good reason to use si instead of s — oh, yes, it seems the Studentized residuals have a T distribution — but it is handy, and it apparently emphasizes outlying residuals.

Here I use the definition:

esi_i = \frac{e_i}{si_i\ \sqrt{1-h_i} }
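And the corresponding check; esi is my name for the hand-computed studentized residuals.

  esi = e/(si Sqrt[1 - h]);                 (* replace s by the deletion si *)
  Chop[esi - reg["StudentizedResiduals"]]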

These follow a t distribution (with n-k-1 degrees of freedom), so numbers near 2 are getting out there.

Let me close by remarking that the names of these residuals were not and may not ever be standardized – no pun intended, and I mean that. Some people call the standardized residuals the studentized residuals instead, and use standardized for ei / s. Some people call the standardized residuals the internally studentized residuals, and use externally studentized for what Mathematica calls studentized. The studentized residuals are also called the deletion residuals, and the standardized deletion residuals. (Don’t panic.)

It turns out that it is also handy to have something else, also called the deletion residuals (more about them later, and no, there’s no square root around 1-h):

ed_i = \frac{e_i}{1-h_i}

Given the variety of names that occur in the literature, it is essential to have the definitions handy. Given the residuals ei, it is possible to compute

  • ei / s (obsolete),
  • ei / s / Sqrt[1-hi] (standardized residuals),
  • ei / si / Sqrt[1-hi] (studentized residuals), and
  • ei / (1-hi) (deletion residuals).

The first is, justifiably I think, no longer in fashion; they used to be called standardized residuals, but we know better today. The middle two are what Mathematica calls standardized and studentized. The last one is what I call deletion residuals. You can’t be sure what these names mean in another document, so identify the quantities from their definitions rather than their names; but there’s no need to panic.

Oh, let me name and save the \beta\ : \beta\ = reg[“BestFitParameters”].

BETA

Although it might be simpler to start with something else, let me demonstrate that I really can compute the \beta\ which result from deleting any one observation. I will compute all of the changes in \beta\ from our given regression, and then I will confirm the result for one observation.

Just as we used si for the deletion standard deviations, we use \beta i\ for the set of \beta\ when the ith observation is deleted. We denote the change in \beta\ by DFBETA (presumably “delta fit beta”, and presumably written in upper case because the inventors cut their teeth on FORTRAN. Must I tell you that’s an ancient but still used programming language?):

DFBETA = \Delta\ \beta  = \beta - \beta i\ ,

which says

\beta i = \beta - DFBETA\ .

I claim that DFBETA is equal to…

DFBETA_{ij} = \frac{C_{ji}\ e_i}{1-h_i}

where the oddity is that I have transposed the DFBETA matrix.
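Here's that computation as a sketch, using the catcher matrix cmat from earlier; beta and dfbeta are my names, and row i of dfbeta is the change caused by dropping observation i.

  beta = reg["BestFitParameters"];        (* the beta of the original fit *)
  dfbeta = Transpose[cmat] (e/(1 - h));   (* dfbeta[[i, j]] = cmat[[j, i]] e[[i]]/(1 - h[[i]]) *)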

Let me emphasize that although I am about to demonstrate that these entries really are the changes in the \beta\ caused by dropping any one observation, the real point is to look at such a table and see what observations are influential.

In particular, for the 8th observation, we have a huge change in the constant term:

DFBETA[[8]] = {-11.1193, 0.0676566, 0.141667, 0.114313}

which says that if I drop the 8th observation, then \beta 8\ would be

\beta 8 = \beta - DFBETA[[8]] = \{82.7676,1.38428,0.274443,-0.350853\}\ .

Well, let’s just see if those are right.

Drop the 8th observation from the data…
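In code, that might look like this; data2 and reg2 are the names I'll use for the reduced data and the new fit.

  data2 = Drop[data, {8}];    (* the data without the 8th row *)
  reg2 = LinearModelFit[data2, {X1, X2, X4}, {X1, X2, X4}];
  reg2["BestFitParameters"]   (* compare with beta - dfbeta[[8]] *)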

Yes! (Yes, yes, yes! I love that result!)

We’re not quite done yet. Let’s consider DFBETA again. To standardize the DFBETA, they tell me to divide each DFBETA_{ij} by si_i and by the square root of the jth diagonal element of (X’X)^-1, that is, by Sqrt[XTXi_j]… and we append an S to the name, to denote “standardized”:

DFBETAS_{ij} = \frac{DFBETA_{ij}}{si_i\ \sqrt{XTXi_j}}

These are the BetaDifferences.
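As a check, here's that scaling in code; if I've written it correctly, the difference from Mathematica's BetaDifferences chops to a table of zeros.

  dfbetas = Map[#/Sqrt[XTXi] &, dfbeta/si];   (* row i divided by si[[i]], column j by Sqrt[XTXi[[j]]] *)
  Chop[dfbetas - reg["BetaDifferences"]]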

So: Mathematica’s BetaDifferences are the DFBETAS; but we know what the DFBETA represent: the numerical change in \beta\ if we drop any one observation.

I think that is way cool.

Oh, since the \beta\ generally follow a T-distribution, any values over 2 are suspect… None of them are that large. In fact, it is suggested that we use ±2/Sqrt[n] as a cutoff, because as the number of observations grows, the chance of any individual \beta\ being very large goes down. In this case, that cutoff is

2./Sqrt[n] = 0.5547

Rather than draw 4 plots, let me just round off:

\left(\begin{array}{cccc} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ -1 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & -1 & -1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0\end{array}\right)

OK: observations 8 and 11 (the 8th and 11th lines of that array) are relatively influential on the \beta\ . Hold that thought.

FIT

Now for a different idea. Suppose we’ve dropped an observation – which we did – and fitted an equation without it – which we also did.

What does the new equation predict for the observation we omitted?

Let’s do it. We have the fit called reg2…

reg2[“BestFit”] = 82.7676+1.38428 X1+0.274443 X2-0.350853 X4.

We got that equation by omitting observation 8. Now let’s apply that equation to observation 8. We compute

X[[8]] \beta 8\ = 77.222

Just as I use \beta 8\ for the \beta i\ with i = 8, I use yhat8 for this computed yhat.

For the original fit, we had yhat[[8]] = 75.5919, the 8th of the original fitted values…

… and we could compute the difference yhat[[8]] – yhat8 = -1.63014 .

Now, I claim that I could have computed all those differences (called DFFIT) as…

DFFIT_i = \frac{h_i\ e_i}{1-h_i}
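As a sketch (dffit and yhat are my names), the 8th entry should reproduce the -1.63014 we just computed:

  yhat = reg["PredictedResponse"];   (* fitted values from the original regression *)
  dffit = h e/(1 - h);               (* change in the ith fitted value if obs i is dropped *)
  dffit[[8]]                         (* should be -1.63014 *)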

As we did with DFBETA, moreover, we will standardize DFFIT, this time dividing by si and by Sqrt[hi], since the variance of the ith fitted value is \sigma^2\ h_i. The result is DFFITS, which is what Mathematica calls the FitDifferences.
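Here's that standardization as a sketch, assuming Mathematica's FitDifferences are the usual DFFITS:

  dffits = dffit/(si Sqrt[h]);
  Chop[dffits - reg["FitDifferences"]]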

A suggested cutoff is 2 Sqrt[k/n] = 1.1094.

Only observation 8 exceeds that cutoff: observation 8 is influential on the fitted value yhat8.

deletion residuals

We have yhati. We have y. In addition to the change in yhat (i.e. DFFIT), we could compute the residual y – yhati for each observation. These are not among Mathematica’s regression properties, but I think they’re worth knowing. Since

DFFIT = yhat – yhati = h ei / (1-hi),

the residual

ed = y – yhati = y – yhat + yhat – yhati

= (y – yhat) + (yhat – yhati)

= e + DFFIT

= e + ( e h / (1-h) )

= (e (1-h) + e h) / (1-h)

= (e – e h + e h) / (1-h),

which simplifies to

ed_i = \frac{e_i}{1-h_i}\ .

Pretty simple formula, huh? And as nice a display of the meaning of 1-h as I could wish for. As hi approaches 1, that associated deletion residual goes to infinity. (Too many of these properties combine things.)

I have seen these called deletion residuals and I like the name. If necessary, I would go with single deletion residuals… but, of course, ultimately I would rely on the definition for the meaning, rather than the name.

Oh, we can confirm that this formula gives us the residual for the 8th observation and the second equation. That formula says I want to compute e[[8]] / (1 – h[[8]]), and compare it with y[[8]] – yhat8, the residual of the 8th observation against the second equation.
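As a sketch; ed and yhat8 are my names, and y is the response column defined earlier.

  ed = e/(1 - h);                          (* deletion residuals *)
  yhat8 = X[[8]] . (beta - dfbeta[[8]]);   (* prediction for obs 8 from the fit without it *)
  Chop[ed[[8]] - (y[[8]] - yhat8)]         (* should be 0 *)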

I drew no cutoffs because I have no idea what they should be. The largest values are for observations 6 and 8.

COVRATIO

It is defined as the ratio of determinants of two parameter covariance matrices: the determinant of the parameter covariance matrix with the ith observation deleted, divided by the determinant of the parameter covariance matrix from the full regression. For our regression with observation 8 omitted, we get

Although its definition is more complicated, it can be computed as

COVRATIO_i = \left(\frac{si_i}{s}\right)^{2k}\ \frac{1}{1-h_i}
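As code, checked against Mathematica's CovarianceRatios:

  covratio = (si/s)^(2 k)/(1 - h);
  Chop[covratio - reg["CovarianceRatios"]]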

I have been told that plausible cutoffs are 1 ± 3k/n = {0.0769231,1.92308}

Interesting. It flags observations 1, 5, 10. Whatever it’s saying, it isn’t talking about observation 8.

FVARATIO

The FVARATIO can be computed as

FVARATIO_i = \left(\frac{si_i}{s}\right)^{2}\ \frac{1}{1-h_i}

That is, FVARATIO differs from COVRATIO in using (si/s)^2 rather than (si/s)^{2k}. They are Mathematica’s FVarianceRatios:
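The corresponding sketch:

  fvaratio = (si/s)^2/(1 - h);
  Chop[fvaratio - reg["FVarianceRatios"]]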


This time, only observations 1 and 10 stand out.

CooksD

I found this in Atkinson. As usual, its definition is more complicated, but it can be simplified to:

D_i = \frac{es_i^2\ h_i} {k\ (1-h_i)}
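As a sketch, using the standardized residuals es from earlier:

  cooks = es^2 h/(k (1 - h));
  Chop[cooks - reg["CookDistances"]]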

Interesting. It’s unambiguously singling out the 8th observation.

h and si

Now that we’ve seen that 1/(1-h) affects the deletion residuals, let’s look at h. First of all, h is between 0 and 1… I know it can be 1, but I don’t know whether it can be 0. Second, the sum of the h’s is equal to k, the number of parameters. Both of these are true because the hat matrix H is a projection operator of rank k.

Since the sum of the h’s is k, and there are n of them, the average value of the h’s is k/n. A plausible cutoff is twice that. Here’s a plot of h with that cutoff.
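Something like this would do for the plot; the gridline marks the 2k/n cutoff.

  ListPlot[h, PlotRange -> All, Filling -> Axis,
   GridLines -> {None, {2 k/n}}]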

We could just as well plot 1/(1-h); it will tell us the same thing, whatever that is. (The limit 2k/n on h translates to n / (n-2k) on 1/(1-h).)

So the plots of h and 1/(1-h) both flag the 10th observation. Interesting. Not the 8th.

It is also worth realizing that h depends only on the X matrix: it has nothing to do with the dependent variable y. People say that the leverage h reflects the potential to be influential, rather than actual influence.

It also occurs to me that since we have the regression reg2, with observation 8 omitted, we should look at its estimated variance:
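That check, in code; reg2 is the fit with observation 8 omitted.

  Chop[reg2["EstimatedVariance"] - si[[8]]^2]   (* should be 0 *)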

In any case, we see that the single deletion variance really is what it’s supposed to be.

Oh, since the ratio of variances, (si/s)^2, shows up in both COVRATIO and FVARATIO, let’s plot it.

Not a whole lot of variation there, except for the 6th and 8th observations.

Summary

I need to play with these. Now that I have summaries of them, I can do that… but it will be a while. I want to have played with them a fair bit before I talk about them.

What did we look at, anyway? Eight properties of a regression:

  • SingleDeletionVariances,
  • StandardizedResiduals and StudentizedResiduals,
  • BetaDifferences = DFBETAS,
  • FitDifferences = DFFITS,
  • CovarianceRatios = COVRATIO,
  • FVarianceRatios = FVARATIO, and
  • CookDistances = CooksD.

In addition, we looked at DFBETA and DFFIT, which in practice motivate the standardized DFBETAS and DFFITS which Mathematica provides. In theory, of course, they let us predict the new \beta\ and yhat for each dropped observation. I say “in theory” because we don’t really need to predict anything in practice.

Last but not least, we looked at the deletion residuals,

ed = e / (1-h),

which to my mind provide a nice explanation of 1-h. In addition, we could show that 1/(1-h) is the ratio of determinants of XTX matrices.

Finally, I won’t ask what they told us… but I will ask which observations they talked about.

Standardized residuals: no outliers
Studentized residuals: observations 6 and maybe 8.
BetaDifferences = DFBETAS: observations 8 and 11.
FitDifferences = DFFITS: observation 8.
CovarianceRatios = COVRATIO: observations 1,5,10.
FVarianceRatios = FVARATIO: observation 10 and maybe 1.
CookDistances = CooksD: observation 8.
deletion residuals: observations 8 and 6.
h: observation 10.
(si/s)^2: observations 6 and 8.

I suppose it’s worth remarking that the only reason I singled out the 8th observation was because I saw in DFBETA that it had such a large influence (11 out of 70) on the constant term.

Anyway, we see that although most of these statistics flag the 8th observation, not all of them do. Still, “most” suggests that we want to investigate dropping the 8th observation, i.e. to take a closer look at reg2, our model with the 8th observation omitted.

A few caveats before I stop. One, deleting observations because that improves the fit is not exactly good science. Heck, if we’re going to throw away data, why not go all the way, delete all the observations and present the theoretical equation?

Two, I have read that investigating single deletion statistics sequentially… drop one observation as indicated, then another, and so on… is not always fruitful. Apparently, it may be important to investigate dropping more than one observation at a time. I don’t know, but I owe you that warning.

Three, what we have seen here may be typical: we have a lot of statistics, and most but not all of them may single out a particular observation or set of observations. It seems typical that we are looking for consensus among these statistics, rather than a single definitive statistic. On the other hand, I think it likely that with practice I may have a better feeling for the differences among these statistics, and they may be flagging different kinds of influence when they flag different observations.

Well, that’s it for now on the single deletion statistics.

I still think it is way cool that we could calculate what we would get with any observation dropped, without re-fitting the model.
