Projects

Proposals recap

  • Many groups are merge()ing data, which is great. Just be sure the rows align on a shared key rather than on row order (see the sketch after this list).
  • Population, sample, and the observational unit.
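
A minimal sketch of a keyed merge(); the data frames and column names here are hypothetical, just to illustrate the idea:

listings <- data.frame(id = c(1, 2, 3), price = c(525000, 639000, 710000))
features <- data.frame(id = c(3, 1, 2), sqft = c(2100, 1450, 1800))

# merge() matches rows on the "by" key, so the differing row orders are handled correctly;
# relying on row order (e.g., cbind()) or on merge()'s default key guess can silently misalign rows
merged <- merge(listings, features, by = "id")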

This week

  • Data cleaning
  • Exploratory Data Analysis: use the tools we started this class with (mplot(), ggplot2, lattice) to explore the structure in your data: the distributions of your variables (histograms, density plots), predictors vs. the response, and relationships among the predictors. Use scatterplots, boxplots, etc. (a short ggplot2 sketch follows this list).
  • Work in R Markdown! Be sure all of your plots and data manipulation are completely transparent and reproducible.
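
As a sketch of the kind of plots we have in mind, here is a ggplot2 example using the LA housing data from the activity below; substitute your own variables:

library(ggplot2)
LA <- read.csv("http://andrewpbray.github.io/data/LA.csv")
ggplot(LA, aes(x = price)) + geom_histogram()          # distribution of one variable
ggplot(LA, aes(x = sqft, y = price)) + geom_point()    # predictor vs. response
ggplot(LA, aes(x = city, y = price)) + geom_boxplot()  # response across a categorical predictor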

Activity 9

  1. Fit the model \(price \sim sqft + bed + city\).
  2. By the rules of thumb, are those two points high leverage? Outliers? (You can extract the hat values with influence(m1)$hat.)
  3. Calculate the Cook's distance of those two observations using the equation \(D_i = \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}\).
  4. Generate the Cook's distance plot to double check that the values that you calculated in 3 seem correct.
  5. Now fit the more appropriate model, with \(logprice\) and \(logsqft\), and construct added variable plots. What do you learn about the relative usefulness of \(logsqft\) and \(bed\) as predictors?

High leverage?

LA <- read.csv("http://andrewpbray.github.io/data/LA.csv")
m1 <- lm(price ~ sqft + bed + city, data = LA)
levs <- influence(m1)$hat  # hat values (leverages) h_ii

High leverage?

tail(sort(levs))

##    1255    1289    1277    1293    1294    1292 
## 0.04203 0.04301 0.04827 0.07888 0.10424 0.15780

High residual?

e_hat <- m1$res
s <- sqrt(1/(nrow(LA) - length(m1$coef)) * sum(e_hat^2))  # estimate of sigma (residual standard error)
r <- e_hat/(s * sqrt(1 - levs))                           # standardized residuals
head(r)
##       1       2       3       4       5       6 
## -0.1251 -0.1450 -0.1156 -0.5685 -1.0501  0.2048
head(rstandard(m1))
##       1       2       3       4       5       6 
## -0.1251 -0.1450 -0.1156 -0.5685 -1.0501  0.2048

High residual?

hist(r)

tail(sort(r))
##   1193   1287   1288   1096   1294   1292 
##  3.579  3.664  3.664  3.745 14.660 27.696

Influence

cdist <- (r^2 / length(m1$coef)) * (levs/(1 - levs))  # Cook's distance: r_i^2/(p+1) * h_ii/(1 - h_ii)
tail(sort(cdist))
##     1254     1260     1291     1289     1294     1292 
##  0.05409  0.05848  0.08015  0.09034  4.16812 23.95309

Influence

plot(m1, 5)  # fifth lm diagnostic plot: residuals vs. leverage, with Cook's distance contours

AVPs

LA <- transform(LA, logprice = log(price), logsqft = log(sqft))
m2 <- lm(logprice ~ logsqft + bed + city, data = LA)
library(car)
avPlots(m2)

Diagnostics III

Last time

We learned that structure in a residual plot generally cannot be used to make specific inferences about where the model is misspecified. So we need other tools:

  1. Added Variable Plots: demonstrate the relationship between a single predictor and the remaining variation in the data after a model has been fit.
  2. \(Y\) versus \(\hat{Y}\): used to assess whether the mean function is being well modeled.
  3. Marginal model plots: used to assess whether the mean function between each predictor and the response is being well modeled.

Model 1

\[ \widehat{price} \sim sqft + bed + city \]

Model 2

\[ \widehat{logprice} \sim logsqft + bed + city \]

\(Y\) versus \(\hat{Y}\)

Used to assess whether your model is doing a good job of modeling the response. If it is, you'll see points along the identity line. If it's not, there will be non-linear structure that you can assess on a predictor-by-predictor basis using marginal model plots.
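
A minimal sketch of this plot for the log-log LA model from the earlier slides (this assumes LA and m2 are still in the workspace):

plot(fitted(m2), LA$logprice, xlab = "Fitted values", ylab = "log(price)")
abline(0, 1)  # identity line; points should scatter evenly around it if the mean function is well modeled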

Marginal Model Plots

Defective widgets

defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt",
                      header = TRUE)
head(defects)
##   Case Temperature Density  Rate Defective
## 1    1        0.97   32.08 177.7       0.2
## 2    2        2.85   21.14 254.1      47.9
## 3    3        2.95   20.65 272.6      50.9
## 4    4        2.84   22.53 273.4      49.7
## 5    5        1.84   27.43 210.8      11.0
## 6    6        2.05   25.42 236.1      15.6

pairs(Defective ~ Temperature + Density + Rate, data = defects)

A first try

\[ \widehat{Defective} \sim Temperature + Density + Rate \]

m1 <- lm(Defective ~ Temperature + Density + Rate, data = defects)
summary(m1)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  10.3244    65.9265  0.1566  0.87677
## Temperature  16.0779     8.2941  1.9385  0.06349
## Density      -1.8273     1.4971 -1.2206  0.23320
## Rate          0.1167     0.1306  0.8936  0.37972

(standardized residual plots and the plot of \(Y\) versus \(\hat{Y}\) for this model)

Where to go from here?

The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.

Marginal relationships

Nonparametric methods

LOESS: locally-weighted scatterplot smoothing.
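
A minimal sketch of fitting a LOESS smooth by hand and overlaying it on one of the marginal scatterplots (this assumes the defects data frame from the earlier slides):

plot(Defective ~ Temperature, data = defects)
lo <- loess(Defective ~ Temperature, data = defects)   # default span and degree
ord <- order(defects$Temperature)
lines(defects$Temperature[ord], fitted(lo)[ord], col = "blue", lwd = 2)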

Marginal relationships

Marginal model plot

Used to assess the marginal relationship between each predictor and the response. It compares the fit from the model with a nonparametric fit to the scatterplot. If your model is well-specified, these two curves will be close to coincident.

You can build them by hand using loess() or use mmp() in the car package.
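
A short sketch using car (this assumes the first-try defects model is named m1, as on the comparison slide below):

library(car)
# one panel per predictor (plus the fitted values): a smooth of the data overlaid
# with a smooth of the model's fitted values
mmps(m1)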

An alternative model

\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]

defects <- transform(defects, sqrtDefective = sqrt(Defective))
m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects)
summary(m2)$coef
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   5.5930    5.26401   1.062  0.29778
## Temperature   1.5652    0.66226   2.363  0.02587
## Density      -0.2917    0.11954  -2.440  0.02182
## Rate          0.0129    0.01043   1.237  0.22727

Comparing m1 and m2

AVP vs MMP

  • Marginal model plots: useful for checking that you are doing a good job of modeling the marginal relationship between a given predictor and the response.
  • Added variable plots: assess how much variation in the response can be explained by a given predictor after the other predictors have already been taken into account (this links to the p-values in the regression output). A quick side-by-side sketch follows.
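
As a quick way to put the two diagnostics side by side for the square-root model (assuming m2 from the earlier slides is still in the workspace):

library(car)
avPlots(m2)  # variation in sqrtDefective explained by each predictor, given the others
mmps(m2)     # marginal fit of the model vs. a data smooth, predictor by predictor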