Last time

We learned that structure in a residual plot generally cannot be used to make specific inferences about where the model is misspecified. So we need other tools:

  1. \(Y\) versus \(\hat{Y}\): used to assess whether the mean function is being modeled well.
  2. Marginal model plots: used to assess whether the marginal relationship between each predictor and the response is being modeled well.

Naive model

m1 <- lm(price ~ sqft + bed + city, data = LA)
plot(LA$price ~ m1$fit, col = "steelblue")
abline(0, 1)

[Plot: price versus fitted values from m1, with the identity line]

Improved model

m2 <- lm(logprice ~ logsqft + bed + city, data = LA)
plot(LA$logprice ~ m2$fit, col = "steelblue")
abline(0, 1)

[Plot: logprice versus fitted values from m2, with the identity line]

\(Y\) versus \(\hat{Y}\)

Used to assess whether your model is doing a good job of modeling the response. If it is, you'll see points scattered along the identity line. If it's not, there will be non-linear structure; try to correct it by transforming the response, then assess the fit on a predictor-by-predictor basis using marginal model plots.

Marginal Model Plots

Defective widgets

defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt",
                      header = TRUE)
head(defects)
##   Case Temperature Density  Rate Defective
## 1    1        0.97   32.08 177.7       0.2
## 2    2        2.85   21.14 254.1      47.9
## 3    3        2.95   20.65 272.6      50.9
## 4    4        2.84   22.53 273.4      49.7
## 5    5        1.84   27.43 210.8      11.0
## 6    6        2.05   25.42 236.1      15.6

pairs(Defective ~ Temperature + Density + Rate, data = defects)

[Pairs plot of Defective against Temperature, Density, and Rate]

A first try

\[ \widehat{Defective} \sim Temperature + Density + Rate \]
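The slide shows only the coefficient table; the fitting call was presumably something like the following (later slides refer to this model as m1):

m1 <- lm(Defective ~ Temperature + Density + Rate, data = defects)
summary(m1)$coef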

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  10.3244    65.9265  0.1566  0.87677
## Temperature  16.0779     8.2941  1.9385  0.06349
## Density      -1.8273     1.4971 -1.2206  0.23320
## Rate          0.1167     0.1306  0.8936  0.37972

[Standardized residual plots for m1]

\(Y\) versus \(\hat{Y}\)

[Plot: Defective versus fitted values from m1, with the identity line]

Where to go from here?

The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.

Marginal relationships

[Scatterplots of Defective against each predictor]

Nonparametric methods

LOESS: locally-weighted scatterplot smoothing.
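A quick sketch of fitting a LOESS smooth with the base loess() function (the predictor chosen here and the plotting colors are arbitrary):

lo <- loess(Defective ~ Temperature, data = defects)  # default span = 0.75
ord <- order(defects$Temperature)
plot(Defective ~ Temperature, data = defects, col = "steelblue")
lines(defects$Temperature[ord], fitted(lo)[ord], col = "red")  # smooth of the data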

Marginal relationships

[Scatterplots of Defective against each predictor, with LOESS smooths added]

Marginal model plot

Used to assess the marginal relationship between each predictor and the response. It compares the fit from the model with the nonparametric fit to the scatterplot. If your model is well-specified, these two lines will be close to coincident.

You can build them by hand using loess() or use mmp() in the car package.
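For instance, assuming the first defects model is stored as m1, the plural version mmps() draws all of the panels at once:

library(car)
mmps(m1)  # one marginal model plot per predictor, plus one against the fitted values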

[Marginal model plots for m1]

scatterplotMatrix(~ Defective + Temperature + Density + Rate, data = defects)

[Scatterplot matrix of Defective, Temperature, Density, and Rate]

An alternative model

\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]

defects <- transform(defects, sqrtDefective = sqrt(Defective))
m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects)
summary(m2)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.5930    5.26401   1.062  0.29778
## Temperature   1.5652    0.66226   2.363  0.02587
## Density      -0.2917    0.11954  -2.440  0.02182
## Rate          0.0129    0.01043   1.237  0.22727


Comparing m1 and m2

[Plots comparing the fits of m1 and m2]

MMP vs AVP

  • Marginal model plots: useful for checking that you're doing a good job of modeling the marginal relationship between each predictor and the response.
  • Added variable plots: assess how much variation in the response can be explained by a given predictor after the other predictors have been taken into account (this is the relationship that the coefficient's t-test and p-value assess); see the sketch below.
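Added-variable plots are also available in the car package; a minimal sketch for the defects model (again assuming it is stored as m1):

library(car)
avPlots(m1)  # one added-variable plot per predictor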

Multicollinearity

Car seat position

library(faraway)
data(seatpos)
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3   -206.30
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9   -178.21
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0    -71.67
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0   -257.72
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9   -173.23
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4   -185.15

Modeling hipcenter

m1 <- lm(hipcenter ~ ., data = seatpos)
# the dot in the formula notation means 'all other variables'
summary(m1)$coef
##              Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 436.43213   166.5716  2.62009  0.01384
## Age           0.77572     0.5703  1.36012  0.18427
## Weight        0.02631     0.3310  0.07950  0.93718
## HtShoes      -2.69241     9.7530 -0.27606  0.78446
## Ht            0.60134    10.1299  0.05936  0.95307
## Seated        0.53375     3.7619  0.14188  0.88815
## Arm          -1.32807     3.9002 -0.34051  0.73592
## Thigh        -1.14312     2.6600 -0.42974  0.67056
## Leg          -6.43905     4.7139 -1.36598  0.18245


High multicollinearity

[Plots illustrating simulated data with highly correlated predictors]

Low multicollinearity

[Plots illustrating simulated data with nearly uncorrelated predictors]

Consequences of multicollinearity

When the predictors are highly correlated, the model has difficulty deciding which of them is responsible for the variability in the response. In one data set it will attribute the significant positive slope to \(X1\); in the next, it will attribute it to \(X2\).

Geometrically, the fitted plane is unstable and will vacillate wildly when fit to a new data set. Multicollinearity increases the variance of your slope estimates, so it becomes difficult to get good/stable/significant estimates.
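A small simulation sketch of this instability (the sample size, correlation, and coefficients below are arbitrary choices):

set.seed(1)
slope_x1 <- function(rho, n = 50) {
  x1 <- rnorm(n)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # cor(x1, x2) is roughly rho
  y  <- 1 + x1 + x2 + rnorm(n)
  coef(lm(y ~ x1 + x2))["x1"]                  # slope estimate for x1
}
sd(replicate(500, slope_x1(rho = 0.99)))  # highly correlated: estimates swing wildly
sd(replicate(500, slope_x1(rho = 0)))     # uncorrelated: estimates are much more stable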

Assessing multicollinearity

  • Pairs plot: look for strong linear relationships between predictors.
  • Correlation matrix: calculate the correlation between your predictors using cor().
  • Variance Inflation Factors (VIF): quantify how much the correlation among the predictors inflates the variance of each slope estimate; available via vif() in the car package.

Correlation matrix

cor(d)  # d contains the simulated response Y and predictors X1, X2
##         Y       X1       X2
## Y  1.0000  0.52470  0.76332
## X1 0.5247  1.00000 -0.03031
## X2 0.7633 -0.03031  1.00000

Correlations above 0.7 will often induce considerable variance in your slopes.

Variance Inflation Factor

The variance of a given slope can be written:

\[ Var(\hat{\beta}_j) = \frac{1}{1 - R^2_j} \times \frac{\sigma^2}{(n - 1) S^2_{x_j}} \]

Where \(R^2_j\) is the \(R^2\) from predicting \(x_j\) using the other predictors and \(S^2_{x_j}\) is the variance of \(x_j\).

The first term is called the VIF: \(1 / (1 - R^2_j)\)

Variance Inflation Factor

library(car)
vif(m1)  # here m1 is the model fit to the simulated data d, not the seatpos model
##    X1    X2 
## 1.001 1.001

Rule of thumb: VIFs greater than 5 should be addressed.

VIF for seatpos

m1 <- lm(hipcenter ~ ., data = seatpos)
vif(m1)
##     Age  Weight HtShoes      Ht  Seated     Arm   Thigh     Leg 
##   1.998   3.647 307.429 333.138   8.951   4.496   2.763   6.694
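As a sanity check, you can reproduce one of these by hand from the \(1 / (1 - R^2_j)\) definition (a sketch; the formula Ht ~ . - hipcenter regresses Ht on the other seven predictors):

r2_ht <- summary(lm(Ht ~ . - hipcenter, data = seatpos))$r.squared
1 / (1 - r2_ht)  # should reproduce vif(m1)["Ht"], about 333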

Addressing Multicollinearity

Multicollinearity suggests that you have multiple predictors conveying the same information, so you probably don't need all of them in the model.

Occam's Razor: if the model with all of the predictors doesn't add any descriptive power over a simpler one, we should favor the simpler model.

Best way to decide which to remove: subject area knowledge.

Simplifying seatpos

round(cor(seatpos), 2)
##             Age Weight HtShoes    Ht Seated   Arm Thigh   Leg hipcenter
## Age        1.00   0.08   -0.08 -0.09  -0.17  0.36  0.09 -0.04      0.21
## Weight     0.08   1.00    0.83  0.83   0.78  0.70  0.57  0.78     -0.64
## HtShoes   -0.08   0.83    1.00  1.00   0.93  0.75  0.72  0.91     -0.80
## Ht        -0.09   0.83    1.00  1.00   0.93  0.75  0.73  0.91     -0.80
## Seated    -0.17   0.78    0.93  0.93   1.00  0.63  0.61  0.81     -0.73
## Arm        0.36   0.70    0.75  0.75   0.63  1.00  0.67  0.75     -0.59
## Thigh      0.09   0.57    0.72  0.73   0.61  0.67  1.00  0.65     -0.59
## Leg       -0.04   0.78    0.91  0.91   0.81  0.75  0.65  1.00     -0.79
## hipcenter  0.21  -0.64   -0.80 -0.80  -0.73 -0.59 -0.59 -0.79      1.00

Since most of these variables measure some version of height, we can just use one of them.

Simplifying seatpos

m2 <- lm(hipcenter ~ Age + Weight + Ht, data = seatpos)
summary(m2)
## 
## Call:
## lm(formula = hipcenter ~ Age + Weight + Ht, data = seatpos)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.53 -23.00   2.16  24.95  53.98 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 528.29773  135.31295    3.90  0.00043 ***
## Age           0.51950    0.40804    1.27  0.21159    
## Weight        0.00427    0.31172    0.01  0.98915    
## Ht           -4.21190    0.99906   -4.22  0.00017 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.5 on 34 degrees of freedom
## Multiple R-squared:  0.656,  Adjusted R-squared:  0.626 
## F-statistic: 21.6 on 3 and 34 DF,  p-value: 5.13e-08
vif(m2)
##    Age Weight     Ht 
##  1.093  3.458  3.463

Diagnostics Summary

Before drawing any conclusions from a regression model, we must be confident it is a valid way to model the data. Our model assumes:

  • Linearity: the conditional mean of the response is a linear function of the predictors.
  • The errors have constant variance and are uncorrelated.
  • The errors are normally distributed with mean zero.

It's also sensible to build a model with:

  • No highly influential points.
  • Low multicollinearity.