merge()
ing data, which is great. Make sure the rows align!mplot()
, ggplot2
, lattice
) to explore the structure in your data. Distribution of your variables (histograms, density plots), predictors vs response, relationships between predictors. Use scatterplots, boxplots, etc.influence(m1)$hat
.)LA <- read.csv("http://andrewpbray.github.io/data/LA.csv") m1 <- lm(price ~ sqft + bed + city, data = LA) levs <- influence(m1)$hat
## 1255 1289 1277 1293 1294 1292 ## 0.04203 0.04301 0.04827 0.07888 0.10424 0.15780
e_hat <- m1$res s <- sqrt(1/(nrow(LA) - length(m1$coef)) * sum(e_hat^2)) r <- e_hat/(s * sqrt(1 - levs)) head(r)
## 1 2 3 4 5 6 ## -0.1251 -0.1450 -0.1156 -0.5685 -1.0501 0.2048
head(rstandard(m1))
## 1 2 3 4 5 6 ## -0.1251 -0.1450 -0.1156 -0.5685 -1.0501 0.2048
hist(r)
tail(sort(r))
## 1193 1287 1288 1096 1294 1292 ## 3.579 3.664 3.664 3.745 14.660 27.696
cdist <- (r^2 / length(m1$coef)) * (levs/(1 - levs)) tail(sort(cdist))
## 1254 1260 1291 1289 1294 1292 ## 0.05409 0.05848 0.08015 0.09034 4.16812 23.95309
plot(m1, 5)
LA <- transform(LA, logprice = log(price), logsqft = log(sqft)) m2 <- lm(logprice ~ logsqft + bed + city, data = LA) library(car) avPlots(m2)
We learned how structure in a residual plot can generally not be used to make specific inferences about where the model is misspecified. So we need other tools:
\[ \widehat{price} \sim sqft + bed + city \]
\[ \widehat{logprice} \sim logsqft + bed + city \]
Used to assess whether your model is doing a good job of modeling the response. If it is, you'll see points along the identity line. If it's not, there will be non-linear structure that you can assess on a predictor-by-predictor basis using marginal model plots.
defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt", header = TRUE) head(defects)
## Case Temperature Density Rate Defective ## 1 1 0.97 32.08 177.7 0.2 ## 2 2 2.85 21.14 254.1 47.9 ## 3 3 2.95 20.65 272.6 50.9 ## 4 4 2.84 22.53 273.4 49.7 ## 5 5 1.84 27.43 210.8 11.0 ## 6 6 2.05 25.42 236.1 15.6
pairs(Defective ~ Temperature + Density + Rate, data = defects)
\[ \widehat{Defective} \sim Temperature + Density + Rate \]
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.3244 65.9265 0.1566 0.87677 ## Temperature 16.0779 8.2941 1.9385 0.06349 ## Density -1.8273 1.4971 -1.2206 0.23320 ## Rate 0.1167 0.1306 0.8936 0.37972
The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.
LOESS: locally-weighted scatterplot smoothing.
Used to assess the marginal relationship between each predictor and the response. It compares the fit from the model with the nonparametric via to the scatterplot. If your model is well-specficied, these two lines will be close to coincident.
You can build them by hand using loess()
or use mmp()
in the car
package.
\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]
defects <- transform(defects, sqrtDefective = sqrt(Defective)) m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects) summary(m2)$coef
## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5.5930 5.26401 1.062 0.29778 ## Temperature 1.5652 0.66226 2.363 0.02587 ## Density -0.2917 0.11954 -2.440 0.02182 ## Rate 0.0129 0.01043 1.237 0.22727