Outliers are points that don't fit the trend in the rest of the data.
High leverage points have the potential to have an unusually large influence on the fitted model.
Influential points are high leverage points whose removal would substantially change the fitted line.
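To make these definitions concrete, here is a minimal sketch on simulated data (not the star data; the variable names, seed, and the added point are all assumptions chosen for illustration). One point is placed far from the other x values and far off the trend, and the line is fit with and without it.

```r
# Simulated data: a clean linear trend (all values here are illustrative)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)

# Add one point that is far from the other x values AND far off the trend
x_all <- c(x, 25)
y_all <- c(y, 2)

fit_without <- lm(y ~ x)          # fit to the clean data only
fit_with    <- lm(y_all ~ x_all)  # fit including the added point

# Compare intercept and slope of the two fits
coef(fit_without)
coef(fit_with)
```

If the added point is influential in the sense above, the two sets of coefficients should differ noticeably.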
We have data on the surface temperature and light intensity of 47 stars in the star cluster CYG OB1, near Cygnus.
We need a metric for the leverage of \(x_i\) that incorporates how far \(x_i\) lies from \(\bar{x}\), relative to the overall spread of the \(x\) values:
\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^n(x_j - \bar{x})^2} \]
Rule of Thumb: in simple regression, a point has "high leverage" if \(h_{ii} > 4/n\).
m1 <- lm(y ~ x)
h <- lm.influence(m1)$hat
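As a check on the formula for \(h_{ii}\), the sketch below computes leverage by hand and compares it with R's `hatvalues()`, then applies the \(4/n\) rule of thumb. The data are simulated for illustration (an assumption, not the star data), with the last x value deliberately placed far from the rest.

```r
# Simulated data: the last x value sits far from the others (illustrative)
set.seed(42)
n <- 20
x <- c(rnorm(n - 1, mean = 10, sd = 1), 20)
y <- 3 + 2 * x + rnorm(n)

m1 <- lm(y ~ x)

# Leverage computed directly from the formula for h_ii
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Leverage as reported by R (the same values as lm.influence(m1)$hat)
h <- hatvalues(m1)

# The two should agree up to floating-point error
all.equal(unname(h), h_manual)

# Indices of points flagged as high leverage by the 4/n rule of thumb
which(h > 4 / n)
```

In simple regression the leverages sum to 2, so the average leverage is \(2/n\); the rule of thumb flags points with more than twice the average.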
Leverage measures the weight given to each point in determining the regression line.
Influence measures how different the regression line would be without a given point.
A widely used measure of influence is Cook's distance:
\[ D_i = \frac{\sum_{j = 1}^n (\hat{y}_{j(i)} - \hat{y}_j)^2}{2S^2} \]
where \(\hat{y}_{j(i)}\) is the \(j^{th}\) fitted value based on the fit with the \(i^{th}\) case removed.
An alternate form, in terms of the standardized residual \(r_i\) and the leverage \(h_{ii}\):
\[ D_i = \frac{r_i^2}{2} \frac{h_{ii}}{1 - h_{ii}} \]
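The two forms of \(D_i\) can be checked against each other numerically. The sketch below, on simulated data (an assumption for illustration; the names `dat`, `d_loo`, and `d_alt` are hypothetical), refits the model with each case removed to apply the definition, then compares the result with the alternate form and with R's built-in `cooks.distance()`.

```r
# Simulated data: the last point has an extreme x and is pulled off the trend
set.seed(7)
n <- 25
x <- c(runif(n - 1, 0, 10), 18)
y <- 1 + 0.8 * x + rnorm(n)
y[n] <- y[n] - 8

dat  <- data.frame(x = x, y = y)
m1   <- lm(y ~ x, data = dat)
s2   <- summary(m1)$sigma^2  # S^2: residual variance estimate from the full fit
yhat <- fitted(m1)

# Definition: drop case i, refit, and compare all n fitted values
d_loo <- sapply(seq_len(n), function(i) {
  fit_i  <- lm(y ~ x, data = dat[-i, ])
  yhat_i <- predict(fit_i, newdata = dat)
  sum((yhat_i - yhat)^2) / (2 * s2)
})

# Alternate form: standardized residuals and leverage from the full fit only
r <- rstandard(m1)
h <- hatvalues(m1)
d_alt <- r^2 / 2 * h / (1 - h)

# Both should match each other and R's built-in Cook's distance
all.equal(d_loo, unname(d_alt))
all.equal(unname(d_alt), unname(cooks.distance(m1)))

# Case with the largest Cook's distance
which.max(d_alt)
```

The alternate form is the practical one: it needs only the full fit, whereas the definition requires \(n\) separate refits.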
To be influential a point must have both high leverage \(h_{ii}\) and a large standardized residual \(r_i\).
m1 <- lm(light ~ temp, data = star)
# Side-by-side diagnostics: Cook's distance (4) and residuals vs. leverage (5)
par(mfrow = c(1, 2))
plot(m1, 4:5)