Outliers

What is an outlier?

Outlier is a general term for a data point that doesn't follow the pattern set by the bulk of the data, once the model is taken into account.

Outlier Example One

Outlier Example Two

Outlier Example Three

Outlier Example Four

Outliers, leverage, influence

Outliers are points that don't fit the trend in the rest of the data.

High leverage points have the potential to have an unusually large influence on the fitted model.

Influential points are high leverage points whose presence causes a very different line to be fit than would be fit with that point removed.

Example of high leverage, high influence

We have data on the surface temperature and light intensity of 47 stars in the star cluster CYG OB1, near Cygnus.

Example of high leverage, low influence

Quantifying leverage: \(h_{ii}\)

We need a metric for the leverage of \(x_i\) that incorporates:

How far \(x_i\) lies from the bulk of the \(x\)'s.
The extent to which the fitted regression line is attracted by the given point.

\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^n(x_j - \bar{x})^2} \]
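
As a quick check of the formula, the sketch below computes \(h_{ii}\) by hand and compares it with the leverages R reports. The data are hypothetical toy values (not the star data), chosen so that one \(x\) sits far from the rest.

```r
# Hypothetical toy data: nine x-values close together plus one far to the right.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0)
n <- length(x)

# Leverage from the formula: 1/n plus the squared distance of x_i from the
# mean of x, scaled by the total sum of squares of x.
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Leverage as reported by R for the simple regression of y on x.
h_r <- unname(hatvalues(lm(y ~ x)))

all.equal(h_manual, h_r)   # the two computations should agree
which.max(h_manual)        # index of the point farthest from mean(x)
sum(h_manual)              # leverages sum to 2 in simple regression
```

Note that \(y\) never enters the hand computation: leverage depends only on where a point sits along the \(x\)-axis.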

\(h_{ii}\) values

What is "high" leverage?

Rule of thumb: in simple regression, a point has "high leverage" if \(h_{ii} > 4/n\), that is, more than twice the average leverage of \(2/n\).

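m1 <- lm(y ~ x)
h <- lm.influence(m1)$hat     # leverages h_ii, one per observation (reuses m1)
which(h > 4 / length(h))      # points above the rule-of-thumb cutoff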

From leverage to influence

Leverage measures the weight given to each point in determining the regression line.

Influence measures how different the regression line would be without a given point.
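
One way to see the distinction is to refit the line with and without a high leverage point. The sketch below uses hypothetical toy data (not the star data): the same far-right \(x\) value is paired once with a response that continues the trend and once with a response well below it. The helper `slope()` is a name introduced here for illustration.

```r
# Hypothetical toy data: nine points on a clear linear trend.
x <- 1:9
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0)

# Add one far-right point (x = 30) in two different ways.
x_all <- c(x, 30)
y_on  <- c(y, 60)   # roughly continues the trend of the first nine points
y_off <- c(y, 25)   # far below where the trend would put it

# Illustrative helper: fitted slope of a simple regression of yy on xx.
slope <- function(xx, yy) unname(coef(lm(yy ~ xx))[2])

slope_base <- slope(x, y)            # fit without the extra point
slope_on   <- slope(x_all, y_on)     # extra point on the trend
slope_off  <- slope(x_all, y_off)    # extra point off the trend

# Change in slope caused by including the tenth point.
c(on_trend = slope_on - slope_base, off_trend = slope_off - slope_base)
```

Both versions of the tenth point have identical leverage, since leverage depends only on \(x\). Only the off-trend version moves the fitted slope substantially.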

Cook's Distance

A widely-used measure of influence:

\[ D_i = \frac{\sum_{j = 1}^n (\hat{y}_{j(i)} - \hat{y}_j)^2}{2S^2} \]

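where \(\hat{y}_{j(i)}\) is the \(j^{th}\) fitted value based on the fit with the \(i^{th}\) case removed, and \(S^2\) is the estimate of the error variance from the full fit. The 2 in the denominator is the number of regression coefficients in simple regression (intercept and slope).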
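
The definition can be checked directly by leave-one-out refitting. The sketch below does so on hypothetical toy data (not the star data) and compares the result with R's `cooks.distance()`. It assumes \(S^2\) is the residual sum of squares divided by \(n - 2\).

```r
# Hypothetical toy data with one far-right, off-trend point (case 10).
dat <- data.frame(
  x = c(1:9, 30),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0)
)
n <- nrow(dat)

fit_full <- lm(y ~ x, data = dat)
yhat <- fitted(fit_full)
S2 <- sum(resid(fit_full)^2) / (n - 2)   # error variance estimate, full fit

# For each case i: refit without it, predict all n points from that fit,
# and sum the squared changes in the fitted values.
D_manual <- sapply(seq_len(n), function(i) {
  fit_i  <- lm(y ~ x, data = dat[-i, ])
  yhat_i <- predict(fit_i, newdata = dat)
  sum((yhat_i - yhat)^2) / (2 * S2)
})

D_r <- unname(cooks.distance(fit_full))
all.equal(D_manual, D_r)   # the two computations should agree
which.max(D_manual)        # the case whose removal moves the fit the most
```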

Cook's Distance

An alternate form:

\[ D_i = \frac{r_i^2}{2} \frac{h_{ii}}{1 - h_{ii}} \]

To be influential, a point must:

Have high leverage \(h_{ii}\), and
Have a large standardized residual \(r_i\) (in absolute value).
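
The alternate form needs no refitting, only the leverages and standardized residuals from the full fit. A minimal sketch on hypothetical toy data (not the star data), using `hatvalues()` for \(h_{ii}\) and `rstandard()` for \(r_i\):

```r
# Hypothetical toy data with one far-right, off-trend point (case 10).
x <- c(1:9, 30)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 25.0)

fit <- lm(y ~ x)
h <- hatvalues(fit)    # leverages h_ii
r <- rstandard(fit)    # standardized residuals r_i

# Cook's distance from the alternate form: (r_i^2 / 2) * h_ii / (1 - h_ii).
D_alt <- r^2 / 2 * h / (1 - h)

all.equal(unname(D_alt), unname(cooks.distance(fit)))   # should agree

# Side-by-side view of the two ingredients and the resulting distance.
round(cbind(leverage = h, std_resid = r, cooks_d = D_alt), 3)
```

The table makes the product structure visible: a case with a sizeable residual but little leverage gets a small \(D_i\), while the factor \(h_{ii}/(1 - h_{ii})\) grows quickly as leverage approaches 1.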

Cook's Distance in R

m1 <- lm(light ~ temp, data = star)
par(mfrow = c(1, 2))     # two panels side by side
plot(m1, which = 4:5)    # 4: Cook's distance, 5: residuals vs. leverage