Building valid models

A valid model is one that is a good fit to the data; i.e. where the assumptions made by the model are reasonable for the data set.

This is not the same thing as:

  • A model with high explanatory power (e.g. high \(R^2\))
  • A statistically significant model.

Both of these only matter if we're dealing with a valid model.

Building valid models

Today we'll talk about:

  1. Leverage
  2. Standardized residuals
  3. Influence

Leverage and Influence

Recall… the outlier

[Plot: the outlier]

…the leverage

[Plot: the leverage]

…the influence

[Plots: the influence, shown across three slides]

Outliers, leverage, influence

Outliers are points that don't fit the trend in the rest of the data.

High leverage points have the potential to have an unusually large influence on the fitted model.

Influential points are high-leverage points whose inclusion causes a very different line to be fit than would be fit with the point removed.

Quantifying leverage: \(h_{ii}\)

We need a metric for the leverage of \(x_i\) that incorporates

  1. The distance \(x_i\) is away from the bulk of the \(x\)'s.
  2. The extent to which the fitted regression line is attracted by the given point.

For simple regression:

\[ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^n(x_j - \bar{x})^2} \]

A general solution for multiple regression: calculate the hat matrix \(H = X(X^T X)^{-1} X^T\) and pull out its diagonal elements \(h_{ii}\).
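A minimal sketch of both routes in R, using simulated data (the variable names and simulated values are illustrative assumptions, not course data). It computes the leverages from the simple-regression formula, from the diagonal of the hat matrix, and with the built-in `hatvalues()`, then checks that the three agree.

```r
# Simulated data for illustration only; the last x value sits far from the rest.
set.seed(1)
n <- 20
x <- c(rnorm(n - 1), 6)
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

# 1. Simple-regression formula for h_ii
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# 2. Diagonal of the hat matrix H = X (X'X)^{-1} X'
X <- model.matrix(m)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h_matrix <- diag(H)

# 3. Built-in extractor
h_builtin <- hatvalues(m)

# All three routes should give the same leverages
all.equal(h_formula, h_matrix, check.attributes = FALSE)
all.equal(h_formula, unname(h_builtin))

# The point far from the bulk of the x's has the largest leverage
which.max(h_formula)
```

The matrix route is the one that carries over unchanged to multiple regression, since `model.matrix()` returns whatever design matrix the fitted model uses.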

Standardized residuals

Why standardize?

We will need to assess whether the assumption of errors with constant variance is reasonable.

What if the non-constant variance that we observe is not a property of the errors, but instead comes from the non-constant variance of the predictions from which the residuals are calculated?

Properties of the residuals

We can write the residuals as

\[ \hat{e} = (I - H) Y \]

and calculate the expected value and variance as

\[ E(\hat{e} | X) = 0 \]

\[ Var(\hat{e} | X) = \sigma^2 (I - H) \]

and for a particular \(\hat{e}_i\):

\[ Var(\hat{e}_i | X) = \sigma^2 (1 - h_{ii}) \]
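
A short derivation of these properties, assuming the usual linear model \(Y = X\beta + e\) with \(E(e | X) = 0\) and \(Var(e | X) = \sigma^2 I\) (assumptions not restated on this slide). Because \(HX = X\), the residuals depend only on the errors:

\[ \hat{e} = (I - H)(X\beta + e) = (I - H) e \]

so \(E(\hat{e} | X) = (I - H) E(e | X) = 0\), and

\[ Var(\hat{e} | X) = (I - H) \, \sigma^2 I \, (I - H)^T = \sigma^2 (I - H) \]

The last step uses the fact that \(I - H\) is symmetric and idempotent. The \(i\)-th diagonal element of \(I - H\) is \(1 - h_{ii}\), so a residual at a high-leverage point has a smaller variance than one at a low-leverage point, even when the errors themselves have constant variance.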

Standardized residuals

To be sure all of our residuals are assessed on equal footing, we divide each one by our estimate of its standard deviation.

\[ r_i = \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}} \]

where \(s\) is our usual estimate of \(\sigma\).

Observations with standardized residuals that are large in absolute value can be considered outliers. Rule of thumb: \(|r_i| > 2\) for small data sets, \(|r_i| > 4\) for large ones.
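
The standardization can be sketched in R as follows. The data are simulated for illustration (an assumption, not course data); the sketch builds \(r_i\) by hand from the residuals, the leverages, and \(s\), and compares the result with the built-in `rstandard()`.

```r
# Simulated data for illustration only
set.seed(2)
n <- 30
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x + rnorm(n)
m <- lm(y ~ x)

e_hat <- resid(m)          # raw residuals
h     <- hatvalues(m)      # leverages h_ii
s     <- summary(m)$sigma  # residual standard error, the estimate of sigma

# Standardized residuals computed from the formula
r_manual <- e_hat / (s * sqrt(1 - h))

# Should agree with the built-in function
all.equal(r_manual, rstandard(m))

# Observations flagged by the small-data rule of thumb
which(abs(r_manual) > 2)
```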

We've already been using these

[Plot: a diagnostic plot that uses standardized residuals]

Influence

Cook's Distance

An alternate form:

\[ D_i = \frac{r_i^2}{p + 1} \frac{h_{ii}}{1 - h_{ii}} \]

To be influential a point must:

  1. Have high leverage \(h_{ii}\) and
  2. Have a high standardized residual \(r_i\)
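
A minimal sketch of this formula in R, again on simulated data (the variables `x1`, `x2`, and `y` are illustrative assumptions). Here \(p\) is taken to be the number of predictors, so \(p + 1\) counts the intercept as well. The hand-computed values are compared with the built-in `cooks.distance()`.

```r
# Simulated data for illustration only
set.seed(3)
n  <- 25
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 - x2 + rnorm(n)
m  <- lm(y ~ x1 + x2)

p <- length(coef(m)) - 1   # number of predictors, excluding the intercept
r <- rstandard(m)          # standardized residuals r_i
h <- hatvalues(m)          # leverages h_ii

# Cook's distance from the alternate form
D_manual <- r^2 / (p + 1) * h / (1 - h)

# Should agree with the built-in function
all.equal(D_manual, cooks.distance(m))

# Observations with the largest Cook's distance
head(sort(D_manual, decreasing = TRUE), 3)
```

For a fitted `lm` object, `plot(m, which = 4)` draws the Cook's distance plot, which gives a visual check on the computed values.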

Activity

Activity 9 (Part I)

[Plot: scatterplot from the LA data set]

In the data set LA, this scatterplot suggests two influential points, but are they influential in an MLR model?

  1. Fit the model \(\widehat{price} \sim sqft + bed + city\).
  2. By the rules of thumb, are those two points high leverage? Outliers? (you can extract the hat values using influence(m1)$hat.)
  3. Calculate the Cook's distance of those two observations using the equation provided on the last slide.
  4. Generate the Cook's distance plot to double check that the values that you calculated in 3 seem correct.