Simple Linear Regression Quiz

Topics:

  • Visualizing data.
  • Fitting a linear model.
  • Assessing whether the assumptions are reasonable.
  • Inference (CIs, PIs)

The quiz will be posted this evening and due Friday evening.

You may utilize resources such as the book, the internet, notes, slides, etc. You must do the work yourself, however; no asking questions of other students, forums, etc. Please email me if any questions come up.

Consider models fitted to four different data sets:

summary(m1 <- lm(y1 ~ x1, data = anscombe))
## 
## Call:
## lm(formula = y1 ~ x1, data = anscombe)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9213 -0.4558 -0.0414  0.7094  1.8388 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.000      1.125    2.67   0.0257 * 
## x1             0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00217

summary(lm(y2 ~ x2, data = anscombe))
## 
## Call:
## lm(formula = y2 ~ x2, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.901 -0.761  0.129  0.949  1.269 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.001      1.125    2.67   0.0258 * 
## x2             0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(lm(y3 ~ x3, data = anscombe))
## 
## Call:
## lm(formula = y3 ~ x3, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.159 -0.615 -0.230  0.154  3.241 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x3             0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.629 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00218

summary(lm(y4 ~ x4, data = anscombe))
## 
## Call:
## lm(formula = y4 ~ x4, data = anscombe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.751 -0.831  0.000  0.809  1.839 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    3.002      1.124    2.67   0.0256 * 
## x4             0.500      0.118    4.24   0.0022 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.24 on 9 degrees of freedom
## Multiple R-squared:  0.667,  Adjusted R-squared:  0.63 
## F-statistic:   18 on 1 and 9 DF,  p-value: 0.00216
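The four near-identical summaries above can be reproduced compactly (a sketch looping over the built-in anscombe data frame):

```r
# Fit all four Anscombe regressions and collect their coefficients
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe)
})
round(sapply(fits, coef), 2)
# Every column is (Intercept) ~ 3.00, slope ~ 0.50, despite the very
# different shapes seen in the scatter plots.
```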

Anscombe's Quartet

[Figure: the four Anscombe data sets with their fitted regression lines]

For a valid model we need:

  1. The conditional mean of Y|X is a linear function of X.
  2. The variance of Y|X is the same for any X.
  3. The errors (and thus the Y|X) are independent of one another.
  4. The errors are normally distributed with mean zero.
  5*. No "outliers".

These can all be assessed with residual plots.

Basic residual plots

How to construct:

  1. Calculate \(\hat{e}_i = y_i - \hat{y}_i\) for each point in your data set (also available as m1$res).
  2. Create a scatter plot with the residuals on the y-axis. On the x-axis you can plot either the x-values or the fitted values \(\hat{y}_i\).
plot(anscombe$x1, m1$res) # x versus residuals
plot(m1$fit, m1$res) # fitted values versus residuals
plot(m1, 1) # built-in residuals-vs-fitted plot

[Figure: residual plots for m1 (x versus residuals, fitted versus residuals, and plot(m1, 1))]

Anscombe I

  1. The conditional mean of Y|X is a linear function of X. OK!

  2. The variance of Y|X is the same for any X. OK!

  3. The errors (and thus the Y|X) are independent of one another. OK!

  4. The errors are normally distributed with mean zero. Probably ok?

  5. No "outliers". OK!

[Figure: residual plot for the Anscombe II fit]

Anscombe II

  1. The conditional mean of Y|X is a linear function of X. No way! It looks quadratic.

  2. The variance of Y|X is the same for any X. OK!

  3. The errors (and thus the Y|X) are independent of one another. Could be an issue, but probably not.

  4. The errors are normally distributed with mean zero. Doesn't look like it.

  5. No "outliers". Mmm, 8 and 6 maybe?

[Figure: residual plot for the Anscombe III fit]

Anscombe III

  1. The conditional mean of Y|X is a linear function of X. Perfectly linear, but we've fit the wrong line!

  2. The variance of Y|X is the same for any X. Looks like a problem.

  3. The errors (and thus the Y|X) are independent of one another. Hard to tell.

  4. The errors are normally distributed with mean zero. Hard to tell.

  5. No "outliers". Whoops! Point 3 is a glaring problem.

[Figure: residual plot for the Anscombe IV fit]

Anscombe IV

  1. The conditional mean of Y|X is a linear function of X. Possibly, though the X doesn't appear to have much predictive power.

  2. The variance of Y|X is the same for any X. Hard to say.

  3. The errors (and thus the Y|X) are independent of one another. Hard to tell.

  4. The errors are normally distributed with mean zero. Looks more uniform.

  5. No "outliers". The slope of the line is being completely determined by one point!
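That one point's dominance can be confirmed numerically with leverage values (a sketch; hatvalues() returns the diagonal of the hat matrix, and a leverage of 1 means the fitted line must pass exactly through that point):

```r
# Leverage of each observation in the Anscombe IV fit
m4 <- lm(y4 ~ x4, data = anscombe)
round(hatvalues(m4), 2)
# Ten observations share x4 = 8 (leverage 0.1 each); the single point
# at x4 = 19 has leverage 1 and determines the slope by itself.
```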

Assessing Normality

We can check the assumption that the errors are normal by looking at the distribution of the residuals. This is difficult to do in a residual plot, so we use a QQ plot (short for quantile-quantile plot), also known as a normal probability plot.

Quantile: The \(j^{th}\) quantile, \(q_j\), is the value of a random variable \(X\) that fulfills:

\[ P(X \le q_j) = j \]

For the standard normal distribution, \(q_{.5} = 0\), \(q_{.025} = -1.96\), \(q_{.975} = 1.96\).
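These standard normal quantiles can be checked directly with qnorm():

```r
# Quantiles of the standard normal distribution
qnorm(0.5)    # 0
qnorm(0.025)  # -1.959964 (approximately -1.96)
qnorm(0.975)  #  1.959964 (approximately  1.96)
```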

Constructing a QQ plot

  1. Standardize your residuals. \[ \tilde{e}_i = \frac{\hat{e}_i - \bar{\hat{e}}}{s} \]

  2. If you have \(n\) standardized residuals, you can consider the lowest to be the \(1/n\) quantile, the second lowest, the \(2/n\) quantile, the median to be the \(.5\) quantile, etc.

  3. Look up these values for the standard normal distribution and find what the appropriate quantiles would be (this is what qnorm() does). These become your theoretical quantiles.

  4. Plot the theoretical quantiles against the standardized residuals.
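The four steps above can be carried out by hand (a sketch using the m1 fit from earlier; ppoints() supplies slightly offset quantile positions in place of the simple i/n rule, which is also what qqnorm() uses):

```r
m1 <- lm(y1 ~ x1, data = anscombe)
e <- resid(m1)                      # step 1: residuals
z <- (e - mean(e)) / sd(e)          # standardize
p <- ppoints(length(z))             # step 2: quantile positions (~ i/n)
theo <- qnorm(p)                    # step 3: theoretical quantiles
plot(theo, sort(z),                 # step 4: theoretical vs. observed
     xlab = "Theoretical Quantiles",
     ylab = "Standardized residuals")
abline(0, 1)
```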

plot(m1, 2)

[Figure: normal QQ plot for m1]

Interpreting a QQ plot

  • Perfectly normally distributed residuals would align along the identity line.
  • Short tails will veer off the line horizontally.
  • Long tails will veer off the line vertically.
  • Expect some variation, even from normal residuals!

Normal residuals

x <- rnorm(40)
qqnorm(x)
qqline(x)

[Figure: normal QQ plot of 40 standard normal draws]

Heavy tailed residuals

x <- rt(40, 1)
qqnorm(x)
qqline(x)

[Figure: normal QQ plot of 40 draws from a t distribution with 1 degree of freedom]
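The short-tailed case from the bullets above can be simulated the same way (an extra sketch using bounded uniform draws):

```r
x <- runif(40, -1, 1)  # bounded, hence short-tailed
qqnorm(x)
qqline(x)
# The extreme points bend horizontally away from the line.
```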

Returning to Anscombe I

[Figure: normal QQ plot for the Anscombe I fit]

Checking constant variance

Recall the quakes data:

[Figure: the quakes data]

[Figure: residual plot for the quakes fit]

plot(m1, 3)

[Figure: scale-location plot for m1]

Checking constant variance

  • The scale-location plot transforms the residuals to make non-constant variance (heteroscedasticity) more apparent.
  • The red line is a guide: flat means constant variance.
  • The basic residual plot can also be used, but it is harder to read (and its red line shows the trend in the residuals, not their spread).
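The transformation behind plot(m1, 3) can be sketched by hand (it uses internally studentized residuals, available via rstandard()):

```r
m1 <- lm(y1 ~ x1, data = anscombe)
r <- rstandard(m1)                  # internally studentized residuals
plot(fitted(m1), sqrt(abs(r)),
     xlab = "Fitted values",
     ylab = "sqrt(|standardized residuals|)")
# A flat trend in this plot indicates constant variance.
```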

Activity 4

  • Please work in pairs or trios.
  • Submit whatever you have by the end of class via Moodle, and include all group members' names in the file name.