Diagnostics I

Building (Vlad) Valid models

Building valid models

A valid model is one that is a good fit to the data; i.e. where the assumptions made by the model are reasonable for the data set.

This is not the same thing as:

  • A model with high explanatory power (e.g. high $R^2$)
  • A statistically significant model.

Both of these only matter if we're dealing with a valid model.

Building valid models

Today we'll talk about:

  1. Leverage
  2. Standardized residuals
  3. Influence

Leverage and Influence

Recall… the outlier

…the leverage

…the influence


Outliers, leverage, influence

Outliers are points that don't fit the trend in the rest of the data.

High leverage points have the potential to have an unusually large influence on the fitted model.

Influential points are high-leverage points whose presence causes a very different line to be fit than would be fit with the point removed.
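One way to see influence directly is to fit the line with and without a suspect point and compare the slopes. The sketch below uses simulated data (the values, the seed, and the added point at x = 30 are all hypothetical, not taken from the slides):

```r
# Simulated data with a clear linear trend (hypothetical values)
set.seed(4)
x <- runif(30, 0, 10)
y <- 1 + 2 * x + rnorm(30)

# Add one hypothetical point that is far out in x AND far off the trend
x_new <- c(x, 30)
y_new <- c(y, 10)

# Slope of the fitted line without and with the added point
slope_without <- coef(lm(y ~ x))[["x"]]
slope_with    <- coef(lm(y_new ~ x_new))[["x_new"]]

c(without = slope_without, with = slope_with)
```

If the two slopes differ substantially, the added point is influential in the sense described above.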

Quantifying leverage: $h_{ii}$

We need a metric for the leverage of $x_i$ that incorporates:

  1. How far $x_i$ is from the bulk of the $x$'s.
  2. The extent to which the fitted regression line is attracted by the given point.

For simple regression:

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$$

A general solution for multiple regression: calculate the hat matrix, pull out the diagonal elements.
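As a rough check that the simple-regression formula and the hat-matrix approach agree, the sketch below computes leverage three ways on simulated data. The data and variable names are assumptions made for illustration; the hat matrix is $H = X(X^TX)^{-1}X^T$, and `hatvalues()` is the base R extractor.

```r
# Simulated data (hypothetical); the last x is far from the bulk of the x's
set.seed(1)
n <- 20
x <- c(rnorm(n - 1), 6)
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

# 1. Simple-regression formula for h_ii
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# 2. General approach: diagonal of the hat matrix H = X (X'X)^{-1} X'
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h_matrix <- unname(diag(H))

# 3. Built-in extractor
h_builtin <- unname(hatvalues(m))

all.equal(h_formula, h_matrix)   # do the formula and the matrix agree?
all.equal(h_formula, h_builtin)  # does the built-in agree too?
which.max(h_builtin)             # index of the highest-leverage point
```

The matrix version is the one that carries over to multiple regression, where no simple closed form in terms of a single $x$ is available.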

Standardized residuals

Why standardize?

We will need to assess whether the assumption of errors with constant variance is reasonable.

What if the non-constant variance that we observe isn't a property of the errors, but instead comes from the non-constant variance of the predictions from which the residuals are calculated?

Properties of the residuals

We can write the residuals as

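$$\hat{e} = (I - H)Y$$

and calculate the expected value and variance as

$$E(\hat{e} \mid X) = 0$$

$$\mathrm{Var}(\hat{e} \mid X) = \sigma^2 (I - H)$$

and for a particular $\hat{e}_i$:

$$\mathrm{Var}(\hat{e}_i \mid X) = \sigma^2 (1 - h_{ii})$$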
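A short derivation of these two results, as a sketch assuming the usual linear model $Y = X\beta + e$ with $E(e \mid X) = 0$ and $\mathrm{Var}(e \mid X) = \sigma^2 I$ (the model is not restated on this slide, so that is an assumption here). It uses $HX = X$ and the fact that $H$ is symmetric and idempotent ($H^T = H$, $HH = H$):

$$
\begin{aligned}
E(\hat{e} \mid X) &= (I - H)\,E(Y \mid X) = (I - H)X\beta = X\beta - HX\beta = 0 \\
\mathrm{Var}(\hat{e} \mid X) &= (I - H)\,\mathrm{Var}(Y \mid X)\,(I - H)^T = \sigma^2 (I - H)(I - H) = \sigma^2 (I - H)
\end{aligned}
$$

The $i$-th diagonal entry of $I - H$ is $1 - h_{ii}$, which gives the variance of a single residual: points with high leverage have residuals with smaller variance.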

Standardized residuals

To be sure all of our residuals are assessed on equal footing, we divide each one by our estimate of its standard deviation:

$$r_i = \frac{\hat{e}_i}{s\sqrt{1 - h_{ii}}}$$

where $s$ is our usual estimate of $\sigma$.

Observations with high standardized residuals can be considered outliers. Rule of thumb: $|r_i| > 2$ for small data sets, $|r_i| > 4$ for large ones.
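The formula for $r_i$ can be checked against base R's `rstandard()`. This is a minimal sketch on simulated data; the data set, the seed, and the shifted observation 5 are hypothetical choices for illustration only.

```r
# Simulated data (hypothetical), with one observation pushed off the trend
set.seed(2)
n <- 30
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)
y[5] <- 3 + 0.5 * x[5] + 8   # hypothetical outlier: 8 units above the true line
m <- lm(y ~ x)

e_hat <- resid(m)           # raw residuals
h     <- hatvalues(m)       # leverages h_ii
s     <- summary(m)$sigma   # usual estimate of sigma

# Standardized residuals by hand, then with the built-in function
r_manual  <- e_hat / (s * sqrt(1 - h))
r_builtin <- rstandard(m)

all.equal(r_manual, r_builtin)  # do the two versions agree?
which(abs(r_builtin) > 2)       # points flagged by the small-data rule of thumb
```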

We've already been using these

[Figure: diagnostic plot that uses standardized residuals]

Influence

Cook's Distance

An alternate form:

$$D_i = \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}$$

To be influential a point must:

  1. Have high leverage $h_{ii}$ and
  2. Have a high standardized residual $r_i$
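The alternate form of Cook's distance can be computed from the standardized residuals and leverages, then compared with base R's `cooks.distance()`. The sketch below uses simulated data with one hypothetical point that has both high leverage and a large residual; here $p$ is taken to be the number of predictors, so the model has $p + 1$ coefficients.

```r
# Simulated data (hypothetical); the last point is far out in x and off the trend
set.seed(3)
n <- 25
x <- c(runif(n - 1, 0, 10), 25)
y <- 2 + 1.5 * x + rnorm(n)
y[n] <- 2 + 1.5 * x[n] - 15   # 15 units below the true line
m <- lm(y ~ x)

r <- rstandard(m)            # standardized residuals r_i
h <- hatvalues(m)            # leverages h_ii
p <- length(coef(m)) - 1     # number of predictors (p + 1 coefficients in total)

# Cook's distance from the alternate form, then from the built-in function
D_manual  <- r^2 / (p + 1) * h / (1 - h)
D_builtin <- cooks.distance(m)

all.equal(D_manual, D_builtin)  # do the two versions agree?
which.max(D_builtin)            # the most influential observation
```

A point with high leverage but a small $r_i$, or a large $r_i$ but low leverage, will have a much smaller $D_i$ because the two factors multiply.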

Activity

Activity 9 (Part I)

[Figure: scatterplot from the LA data set]

In the LA data set, this scatterplot suggests two influential points, but are they influential in an MLR model?

  1. Fit the model $\widehat{price} \sim sqrt + bed + city$.
  2. By the rules of thumb, are those two points high leverage? Outliers? (You can extract the hat values using `influence(m1)$hat`.)
  3. Calculate the Cook's distance of those two observations using the equation provided on the last slide.
  4. Generate the Cook's distance plot to double-check that the values you calculated in step 3 seem correct.