
A valid model is one that is a good fit to the data; i.e. where the assumptions made by the model are reasonable for the data set.
This is not the same thing as:
Both of these only matter if we're dealing with a valid model.
Today we'll talk about:
Outliers are points that don't fit the trend in the rest of the data.
High leverage points have the potential to have an unusually large influence on the fitted model.
Influential points are high leverage points that cause a very different line to be fit than would be fit with that point removed.
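To make the distinction between high leverage and influential concrete, here is a minimal sketch in base R. The data are made up for illustration (they are not from the lecture), and `slope()` is a hypothetical helper. The same far-out x value is added twice: once following the trend, once not.

```r
# Hypothetical toy data with a clear linear trend (slope close to 2).
x <- c(1, 2, 3, 4, 5)
y <- c(2.0, 4.1, 5.9, 8.2, 9.9)

# Helper (an assumption for this sketch): fitted slope of a simple regression.
slope <- function(x, y) unname(coef(lm(y ~ x))[2])

slope_base <- slope(x, y)

# High leverage point that follows the trend: x = 15 is far from the other x's,
# but y = 30 is about where the line would predict.
slope_on <- slope(c(x, 15), c(y, 30))

# High leverage point that does not follow the trend: same x, very different y.
slope_off <- slope(c(x, 15), c(y, 5))

# Compare the three fitted slopes.
c(base = slope_base, on_trend = slope_on, off_trend = slope_off)
```

Only the second added point changes the fitted line much, so under the definitions above only that one would be called influential.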
We need a metric for the leverage of $x_i$ that incorporates
For simple regression:
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$
A general solution for multiple regression: calculate the hat matrix, pull out the diagonal elements.
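As a rough check that the two routes agree, the sketch below computes leverage three ways in base R: with the simple regression formula, from the diagonal of the hat matrix $H = X(X^TX)^{-1}X^T$, and with the built-in `hatvalues()`. The data are hypothetical, chosen so that one x value sits far from the rest.

```r
# Hypothetical toy data (not from the lecture): the last x is far from the others.
x <- c(1, 2, 3, 4, 5, 15)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)
n <- length(x)

# Route 1: the simple-regression formula for h_ii.
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Route 2: build the hat matrix and pull out its diagonal.
X <- cbind(1, x)                          # design matrix with an intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)
h_matrix <- diag(H)

# Route 3: built-in helpers applied to a fitted model.
m <- lm(y ~ x)
h_builtin <- unname(hatvalues(m))
h_influence <- unname(influence(m)$hat)

all.equal(h_formula, h_matrix)     # formula and hat matrix should agree
all.equal(h_matrix, h_builtin)     # and both should match hatvalues()
sum(h_matrix)                      # leverages sum to the number of coefficients (2 here)
which.max(h_matrix)                # the observation with the largest leverage
```

Route 2 is the one that carries over unchanged to multiple regression: only the columns of `X` change.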
We will need to assess whether the assumption of errors with constant variance is reasonable.
What if the non-constant variance that we observe isn't a property of the errors, but instead comes from the non-constant variance of the predictions from which the residuals are calculated?
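We can write the residuals as
$$\hat{e} = (I - H)Y$$
and calculate the expected value and variance as
$$E(\hat{e} \mid X) = 0$$
$$\mathrm{Var}(\hat{e} \mid X) = \sigma^2 (I - H)$$
and for a particular $\hat{e}_i$:
$$\mathrm{Var}(\hat{e}_i) = \sigma^2 (1 - h_{ii})$$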
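The residual identity can be checked numerically. This is a minimal sketch on hypothetical data, assuming an ordinary least squares fit with an intercept. It confirms that $(I - H)Y$ reproduces the residuals from `lm()`, that $I - H$ is idempotent (the property behind the variance result), and that the per-residual variance factor is $1 - h_{ii}$.

```r
# Hypothetical toy data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 15)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)

X <- cbind(1, x)                          # design matrix with an intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
I <- diag(length(y))                      # identity matrix
M <- I - H

# Residuals written as (I - H) Y, compared with lm()'s residuals.
e_hat <- as.vector(M %*% y)
m <- lm(y ~ x)
all.equal(e_hat, unname(residuals(m)))

# I - H is idempotent: multiplying it by itself gives it back.
all.equal(M %*% M, M)

# Diagonal of I - H: each residual's variance as a multiple of sigma^2.
var_factor <- diag(M)
all.equal(var_factor, 1 - diag(H))
var_factor                                # smallest where leverage is largest
```

So even when the errors have constant variance, the residuals do not: residuals at high leverage points have smaller variance.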
To be sure all of our residuals are assessed on equal footing, we divide each one by our estimate of its standard deviation.
$$r_i = \frac{\hat{e}_i}{s\sqrt{1 - h_{ii}}}$$
where $s$ is our usual estimate of $\sigma$.
Observations with large standardized residuals can be considered outliers. Rule of thumb: $|r_i| > 2$ for small data sets, $|r_i| > 4$ for large ones.
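A short sketch of the standardization in base R, on hypothetical data. It builds $r_i$ by hand from the residuals, the leverages, and the residual standard error, then compares the result with the built-in `rstandard()`.

```r
# Hypothetical toy data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 15)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)
m <- lm(y ~ x)

e_hat <- residuals(m)       # raw residuals
h     <- hatvalues(m)       # leverages h_ii
s     <- summary(m)$sigma   # residual standard error, the usual estimate of sigma

# Standardized residuals: each residual divided by its estimated standard deviation.
r_manual <- e_hat / (s * sqrt(1 - h))

all.equal(r_manual, rstandard(m))   # should match R's built-in version

# Apply the small-data rule of thumb to flag potential outliers.
which(abs(r_manual) > 2)
```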
An alternate form:
$$D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}$$
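The $D_i$ expression has the same form as Cook's distance, which base R provides as `cooks.distance()`. The sketch below, on hypothetical data, computes $D_i$ from the standardized residuals and leverages and checks it against the built-in. Here `p` is taken to be the number of predictors, so the model has $p + 1$ coefficients.

```r
# Hypothetical toy data (not from the lecture).
x <- c(1, 2, 3, 4, 5, 15)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.0)
m <- lm(y ~ x)

r <- rstandard(m)             # standardized residuals r_i
h <- hatvalues(m)             # leverages h_ii
p <- length(coef(m)) - 1      # number of predictors (assumed meaning of p)

# D_i combines the size of the residual with the leverage of the point.
D_manual <- r^2 / (p + 1) * h / (1 - h)

all.equal(D_manual, cooks.distance(m))   # should match R's built-in version
round(D_manual, 3)
```

A point needs both a sizeable standardized residual and high leverage to get a large $D_i$; either factor near zero keeps it small.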
To be influential a point must:
In the data set LA, this scatterplot suggests two influential points, but are they influential in an MLR model?
influence(m1)$hat