Warm Up

Activity #7 Part I

  • Revisit the RailTrail data set from Activity 4.
  • Consider two models: (a) an SLR model predicting ridership from temperature; (b) the same model with an added quadratic term.
  • Discuss the relative merits of the two models.

Disclaimer

For the rest of the week, we won't talk at all about assessing model validity (looking at residual plots). That step is absolutely vital, but we're putting it on hold until next week.

Example: Textbooks

Consider a sample of 15 textbooks. How well can we predict weight by volume?

[plot: weight vs. volume]


[plot: weight vs. volume with fitted line]
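The summary below comes from an SLR fit named m1. As a minimal, self-contained sketch (the real data set is allbacks, which ships with the DAAG package; the four-book data frame here is made up):

```r
# Hypothetical mini data set standing in for allbacks (DAAG package)
books <- data.frame(volume = c(885, 1016, 953, 929),
                    weight = c(800, 950, 700, 650))

# Fit weight ~ volume by least squares
m1 <- lm(weight ~ volume, data = books)
coef(m1)  # intercept and slope
```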

summary(m1)
## 
## Call:
## lm(formula = weight ~ volume, data = allbacks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -190.0 -109.9   38.1  109.7  145.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 107.6793    88.3776    1.22     0.24    
## volume        0.7086     0.0975    7.27  6.3e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124 on 13 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.787 
## F-statistic: 52.9 on 1 and 13 DF,  p-value: 6.26e-06

Adding a categorical predictor

allbacks[c(1, 2, 9, 10), ]
##    volume area weight cover
## 1     885  382    800    hb
## 2    1016  468    950    hb
## 9     953    0    700    pb
## 10    929    0    650    pb

We should be able to better predict the weight if we use both the volume and knowledge of the type of cover.

Adding a categorical predictor

class(allbacks$cover)
## [1] "factor"
levels(allbacks$cover)
## [1] "hb" "pb"
m2 <- lm(weight ~ volume + cover, data = allbacks)
  • Categorical variables in R are of the class factor. Check with class(), coerce with as.factor().
  • Modeling a continuous response using both a continuous and a categorical predictor is known as Analysis of Covariance.
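To see the 0/1 dummy coding R builds for a factor, inspect the model matrix; a small sketch with made-up values:

```r
# A two-level factor; "hb" comes first alphabetically, so it is the reference
cover <- factor(c("hb", "hb", "pb", "pb"))
class(cover)    # "factor"
levels(cover)   # "hb" "pb"

# The design matrix contains the 0/1 indicator column coverpb
model.matrix(~ cover)
```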

How R thinks of this model

summary(m2)$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  197.963   59.19274   3.344 5.841e-03
## volume         0.718    0.06153  11.669 6.598e-08
## coverpb     -184.047   40.49420  -4.545 6.719e-04

\[ \widehat{weight} = 197.96 + 0.718 volume - 184.05 coverpb \]

  • Whichever level is first alphabetically (hb) becomes the reference level.
  • coverpb represents the average difference in weight between two books of the same volume but with different covers.
  • Again, every row is a t-test that that parameter is zero.
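The reference level is just a convention; if you'd rather report effects relative to pb, relevel the factor before refitting. A sketch:

```r
cover <- factor(c("hb", "hb", "pb", "pb"))
levels(cover)         # "hb" "pb" -- hb is the reference

# Make pb the reference level instead
cover_pb_ref <- relevel(cover, ref = "pb")
levels(cover_pb_ref)  # "pb" "hb"
```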

Simple linear regression

[plot: weight vs. volume with the single SLR line]

\[ \widehat{weight} = 107.70 + 0.71 volume \]

[plot: weight vs. volume with parallel lines for hb and pb]

\[ \widehat{weight}_{hb} = 197.96 + 0.72 volume \\ \widehat{weight}_{pb} = 13.91 + 0.72 volume \]
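The two parallel-line equations above are just arithmetic on the m2 coefficients: the pb line shifts the intercept by the coverpb estimate, while the slope is shared:

```r
# m2 coefficients copied from the summary above
b0    <- 197.963   # (Intercept): hb intercept
b_vol <- 0.718     # shared slope on volume
b_pb  <- -184.047  # shift for paperback books

hb_intercept <- b0         # 197.96
pb_intercept <- b0 + b_pb  # 13.916, matching the pb line above
```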

Comparing explanatory power

summary(m1)$r.squared
## [1] 0.8026
summary(m2)$r.squared
## [1] 0.9275
summary(m1)$adj.r.squared
## [1] 0.7875
summary(m2)$adj.r.squared
## [1] 0.9154
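Adjusted R² applies the penalty 1 − (1 − R²)(n − 1)/(n − p − 1) for p predictors; the values above can be reproduced from the plain R² values (n = 15 books):

```r
# Adjusted R^2 from R^2, sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2(0.8026, n = 15, p = 1)  # m1: ~0.787
adj_r2(0.9275, n = 15, p = 2)  # m2: ~0.915
```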

Adding more complexity?

We've established that these data are best modeled with two intercepts, but should the two lines have their own slopes as well?

\[ \widehat{weight} = \beta_0 + \beta_1 volume + \beta_2 coverpb + \beta_3 volume \times coverpb \]

[plot: weight vs. volume with separately fitted lines for hb and pb]

\[ \widehat{weight}_{hb} = 161.59 + 0.76 volume \\ \widehat{weight}_{pb} = 41.37 + 0.69 volume \]

m3 <- lm(weight ~ volume * cover, data = allbacks)
summary(m3)
## 
## Call:
## lm(formula = weight ~ volume * cover, data = allbacks)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -89.7  -32.1  -21.8   17.9  215.9 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     161.5865    86.5192    1.87    0.089 .  
## volume            0.7616     0.0972    7.84  7.9e-06 ***
## coverpb        -120.2141   115.6590   -1.04    0.321    
## volume:coverpb   -0.0757     0.1280   -0.59    0.566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 80.4 on 11 degrees of freedom
## Multiple R-squared:  0.93,   Adjusted R-squared:  0.911 
## F-statistic: 48.5 on 3 and 11 DF,  p-value: 1.24e-06
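The separate-slopes equations come directly from these coefficients: the pb line adds coverpb to the intercept and volume:coverpb to the slope:

```r
# Coefficients copied from summary(m3) above
b0    <- 161.5865   # hb intercept
b_vol <- 0.7616     # hb slope
b_pb  <- -120.2141  # intercept shift for pb
b_int <- -0.0757    # slope shift for pb (the interaction)

c(hb_intercept = b0,        hb_slope = b_vol)
c(pb_intercept = b0 + b_pb, pb_slope = b_vol + b_int)  # 41.37 and 0.69
```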

Which is better?

The interaction term is not significant, suggesting that the relationship between volume and weight may not differ between the two types of cover.

Note that adding the interaction also made the intercept term insignificant.
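A standard way to make this comparison formal is a nested-model F test via anova() on the additive and interaction fits. A self-contained sketch with simulated data (the real comparison would use m2 and m3 on allbacks):

```r
set.seed(1)
# Simulated stand-in for the textbook data
d <- data.frame(volume = runif(30, 400, 1500),
                cover  = factor(rep(c("hb", "pb"), each = 15)))
d$weight <- 200 + 0.7 * d$volume - 150 * (d$cover == "pb") + rnorm(30, sd = 50)

fit_add <- lm(weight ~ volume + cover, data = d)  # parallel lines
fit_int <- lm(weight ~ volume * cover, data = d)  # separate slopes

anova(fit_add, fit_int)  # F test: does the interaction improve the fit?
```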

Activity #7 Part II

  • Revisit the twins data set from the quiz.
  • Is there evidence that the relationship between IQs differs between the social status groups (intercepts or slopes)?