Building Valid Models

MLR Diagnostics

Last time:

  1. Influence: use the hat matrix and standardized residuals.
  2. Normality: use the qq plot of standardized residuals.

This time:

  1. The trouble with MLR residuals plots.
  2. Added variable plots.

Residual plots, SLR vs MLR

In simple linear regression we use residual plots to assess:

  1. Does the mean function appear linear?
  2. Is it reasonable to assume that the errors have constant variance?
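As a minimal sketch of such a plot (simulated data, not any data set from these slides):

```r
# Residuals vs fitted for an SLR fit: a structureless band around zero
# is what we hope to see.
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)          # linear mean, constant-variance errors
m <- lm(y ~ x)
plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```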

Residual plots in SLR

[Figure: residuals vs fitted values]

If this were an SLR model, we could conclude that the mean function looks fairly linear, but the errors appear to have increasing variance.

Residual plots in MLR

We fit the model:

\[ y \sim x_1 + x_2 \]

But this is synthetic data generated from a model with constant variance.

Whaaaaaa?

[Figure: residuals vs fitted values for the fit of \(y \sim x_1 + x_2\)]
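One way this can happen (a sketch with simulated data, not the exact data behind the figure): constant-variance errors plus an omitted term in the mean function can masquerade as increasing variance.

```r
# Errors have constant variance, but omitting a real interaction
# makes the residuals vs fitted plot appear to fan out.
set.seed(2)
n  <- 200
x1 <- rexp(n)                          # skewed predictors
x2 <- rexp(n)
y  <- x1 + x2 + x1 * x2 + rnorm(n)     # true mean has an interaction; errors are N(0, 1)
m  <- lm(y ~ x1 + x2)                  # fitted model omits the interaction
plot(fitted(m), resid(m))              # structure here reflects the missing term, not the errors
```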

Residual plots in MLR

In MLR, in general, you cannot conclude that the structure you see in the residuals vs fitted plot reflects the part of the model that was misspecified.

  • Non-constant variance in the residuals doesn't necessarily suggest non-constant variance in the errors.
  • Non-linear structure doesn't necessarily suggest a non-linear mean function.

The only conclusion you can draw is that something is misspecified.

Residual plots in MLR

So now what?

  • Although several types of invalid models can create non-constant variance in the residuals, a valid model will always be structureless.
  • If you can be sure you have a good mean function, then the residual plot is more informative.
  • Marginal Model Plots
  • Added Variable Plots

Added variable plots

The objective of constructing an added variable plot is to assess how much each variable adds to your model.

Consider the nyc restaurant data, where we'd like to build the model:

\[ Price \sim Food + Decor + Service + East \]

We can assess the isolated effect of each predictor on the response with a series of simple scatterplots…

[Figure: scatterplots of Price against each predictor]

pairs(Price ~ Food + Decor + Service + East, data = nyc)

[Figure: pairs plot of Price, Food, Decor, Service, and East]

Added variable plots

An added variable plot tells you how much a given predictor \(x_i\) can explain the response after the other predictors have been taken into account. It plots:

  • On the y-axis, the residuals from the model predicting the response without \(x_i\).
  • On the x-axis, the residuals from predicting \(x_i\) using those same predictors.

Added variable plot for Food

First, get the residuals from the model

\[ Price \sim Decor + Service + East \]

resY <- lm(Price ~ Decor + Service + East, data = nyc)$res

Second, get the residuals from the model

\[ Food \sim Decor + Service + East \]

resX <- lm(Food ~ Decor + Service + East, data = nyc)$res

Then plot them against each other…

plot(resY ~ resX)

[Figure: plot of resY against resX]

library(car)
m1 <- lm(Price ~ Food + Decor + Service + East, data = nyc)
avPlot(m1, variable = "Food")

[Figure: added variable plot for Food from avPlot()]

Something to notice…

If we fit a line through the AVP, the slope should look familiar…

AVPm1 <- lm(resY ~ resX)
AVPm1$coef
## (Intercept)        resX 
##   5.074e-17   1.538e+00
m1$coef
## (Intercept)        Food       Decor     Service        East 
##  -24.023800    1.538120    1.910087   -0.002727    2.068050
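This slope equivalence (the Frisch–Waugh–Lovell theorem) holds for any data set; a quick check on simulated data (not the nyc data):

```r
# Verify: the AVP slope for x1 equals the x1 coefficient from the full fit.
set.seed(3)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3)
resY <- resid(lm(y  ~ x2 + x3))   # residuals of the response on the other predictors
resX <- resid(lm(x1 ~ x2 + x3))   # residuals of x1 on the other predictors
avp  <- lm(resY ~ resX)

all.equal(unname(coef(avp)[2]), unname(coef(full)["x1"]))  # TRUE
```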

[Figure: added variable plot for Food with fitted line]

avPlots(m1)

[Figure: added variable plots for all four predictors]

How to use AVP

  1. AVPs can be used to assess whether it makes sense to include an additional variable in the model (similar to looking at the p-value of the predictor).
  2. They're a bit more informative, though, since they would also indicate if the relationship between that predictor and the response is linear in the context of the other variables.

Activity

Activity 9

[Figure: scatterplot of the LA housing data]

In the data set LA, this scatterplot suggests two influential points, but are they influential in a MLR model?

  1. Fit the model \(price \sim sqft + bed + city\).
  2. By the rules of thumb, are those two points high leverage? Outliers? (you can extract the hat values using influence(m1)$hat.)
  3. Calculate the Cook's distance of those two observations using the equation: \(D_i = \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}\).
  4. Generate the Cook's distance plot to double check that the values that you calculated in 3 seem correct.
  5. Now fit the more appropriate model, with \(logprice\) as the response and \(logsqft\) as a predictor, and construct added variable plots. What do you learn about the relative usefulness of \(logsqft\) and \(bed\) as predictors?
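A sketch of the Cook's distance formula from step 3, checked against R's built-in cooks.distance() on a built-in data set (mtcars, not the LA data):

```r
# D_i = (r_i^2 / (p + 1)) * (h_ii / (1 - h_ii)),
# with r_i the standardized residual, h_ii the hat value, p the number of predictors.
m <- lm(mpg ~ wt + hp, data = mtcars)
p <- 2                                   # two predictors
h <- hatvalues(m)                        # same values as influence(m)$hat
r <- rstandard(m)                        # standardized residuals
D <- (r^2 / (p + 1)) * (h / (1 - h))
all.equal(unname(D), unname(cooks.distance(m)))  # TRUE
```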