A valid model is one that is a good fit to the data; i.e. where the assumptions made by the model are reasonable for the data set.
This is not the same thing as:
Both of these only matter if we're dealing with a valid model.
Today we'll talk about:
Outliers are points that don't fit the trend in the rest of the data.
High leverage points have the potential to have an unusually large influence on the fitted model.
Influential points are high leverage points whose removal would cause a very different line to be fit.
We need a metric for the leverage of \(x_i\) that incorporates
For simple regression:
\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^n(x_j - \bar{x})^2} \]
A general solution for multiple regression: calculate the hat matrix, pull out the diagonal elements.
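As a rough check that the two routes agree, the sketch below computes leverages three ways: from the simple-regression formula, from the diagonal of the hat matrix \(H = X(X'X)^{-1}X'\), and with R's built-in `hatvalues()`. The data are simulated and the names `x`, `y`, and `m` are made up for illustration; none of this comes from the data sets used in these notes.

```r
# Simulated data: purely illustrative, not the data used in these notes.
set.seed(42)
n <- 30
x <- c(rnorm(n - 1), 6)          # last point sits far from the other x's
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

# 1. Simple-regression formula for h_ii
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# 2. Diagonal of the hat matrix H = X (X'X)^{-1} X'
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h_matrix <- diag(H)

# 3. Built-in extractor (the same values as influence(m)$hat)
h_builtin <- hatvalues(m)

all.equal(h_formula, h_matrix, check.attributes = FALSE)
all.equal(h_formula, unname(h_builtin))
which.max(h_builtin)             # index of the highest-leverage observation
```

The matrix route carries over unchanged to multiple regression, since only the columns of `X` change.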
We will need to assess whether the assumption of errors with constant variance is reasonable.
What if the non-constant variance that we observe isn't a property of the errors, but instead comes from the non-constant variance of the predictions from which the residuals are calculated?
We can write the residuals as
\[ \hat{e} = (I - H) Y \]
and calculate the expected value and variance as
\[ E(\hat{e} | X) = 0 \]
\[ Var(\hat{e} | X) = \sigma^2 (I - H) \]
and for a particular \(\hat{e}_i\):
\[ Var(\hat{e}_i) = \sigma^2 (1 - h_{ii}) \]
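One way to see where these results come from, as a sketch assuming the usual model \(Y = X\beta + e\) with \(Var(e | X) = \sigma^2 I\) and hat matrix \(H = X(X'X)^{-1}X'\): since \(HX = X\), and \(I - H\) is symmetric and idempotent,

\[ E(\hat{e} | X) = (I - H) X \beta = 0 \]
\[ Var(\hat{e} | X) = (I - H) \sigma^2 I (I - H)' = \sigma^2 (I - H) \]

The \(i\)th diagonal element of \(I - H\) is \(1 - h_{ii}\), so residuals at high leverage points have smaller variance even when the errors themselves have constant variance.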
To be sure all of our residuals are assessed on equal footing, we divide each one by our estimate of its standard deviation.
\[ r_i = \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}} \]
where \(s\) is our usual estimate of \(\sigma\).
Observations with high standardized residuals can be considered outliers. Rule of thumb: \(|r_i| > 2\) for small data, \(|r_i| > 4\) for large.
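The standardization can be sketched in R by building \(r_i\) from its pieces and comparing against the built-in `rstandard()`. The data are simulated and the names `x`, `y`, and `m` are hypothetical; the cutoff of 2 is just the small-data rule of thumb.

```r
# Simulated data: purely illustrative, not the data used in these notes.
set.seed(1)
n <- 40
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)
m <- lm(y ~ x)

e_hat <- resid(m)                # raw residuals
h     <- hatvalues(m)            # leverages h_ii
s     <- summary(m)$sigma        # usual estimate of sigma

# Standardized residuals by hand, then with the built-in function
r_manual  <- e_hat / (s * sqrt(1 - h))
r_builtin <- rstandard(m)

all.equal(r_manual, r_builtin)
which(abs(r_builtin) > 2)        # observations flagged by the rule of thumb
```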
An alternate form of Cook's distance:
\[ D_i = \frac{r_i^2}{p + 1} \frac{h_{ii}}{1 - h_{ii}} \]
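This form of \(D_i\) (Cook's distance) can be checked numerically against R's `cooks.distance()`. The sketch below uses simulated data with one point placed at an extreme \(x\) value and off the trend; the names `x`, `y`, and `m` are made up for illustration, and \(p\) is taken to be the number of predictors, so \(p + 1\) counts the intercept as well.

```r
# Simulated data: purely illustrative, not the data used in these notes.
set.seed(7)
n <- 25
x <- c(runif(n - 1, 0, 10), 25)                # one extreme x value ...
y <- c(2 + 1.5 * x[-n] + rnorm(n - 1), 10)     # ... whose y is off the trend
m <- lm(y ~ x)

r <- rstandard(m)                # standardized residuals
h <- hatvalues(m)                # leverages h_ii
p <- length(coef(m)) - 1         # number of predictors (intercept excluded)

# Cook's distance from standardized residuals and leverages
D_manual  <- r^2 / (p + 1) * h / (1 - h)
D_builtin <- cooks.distance(m)

all.equal(D_manual, D_builtin)
which.max(D_builtin)             # index of the point with the largest D_i
```

The two factors make the trade-off visible: a large \(D_i\) needs both a sizeable standardized residual and a sizeable leverage.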
To be influential a point must:
In the data set LA, this scatterplot suggests two influential points but are they influential in a MLR model?
influence(m1)$hat