MLR: polynomials and R^2

Multiple Regression: polynomials

Multiple regression refers to the method of predicting one variable as a linear function of more than one predictor.

LA homes example:

\[ \widehat{price} = \beta_0 + \beta_1 sqft + \beta_2 bath \]

But we could also introduce, as the additional predictor, a polynomial term of the existing predictor.

\[ \widehat{price} = \beta_0 + \beta_1 sqft + \beta_2 sqft^2 \]
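A minimal sketch of how such a term can be specified in R. It uses simulated data rather than the LA homes data, and the names `sim`, `x`, `y`, and `fit_quad` are hypothetical. The `I()` wrapper is needed because `^` has a special meaning inside a model formula.

```r
# Simulated data (an assumption for illustration, not the LA homes data)
set.seed(1)
sim <- data.frame(x = runif(200, 0, 10))
sim$y <- 2 + 1.5 * sim$x + 0.3 * sim$x^2 + rnorm(200)

# I() makes R treat x^2 as arithmetic rather than formula syntax
fit_quad <- lm(y ~ x + I(x^2), data = sim)
coef(fit_quad)
```

The model is still linear in the coefficients, so `lm()` fits it the same way as any other multiple regression.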

LA homes

Recall what happened when we fit this model:

\[ \widehat{logprice} = \beta_0 + \beta_1 logsqft \]

Quadratic Model

We could consider adding a quadratic term to our model:

\[ \widehat{logprice} = \beta_0 + \beta_1 logsqft + \beta_2 logsqft^2 \]

m2 <- lm(log_price ~ log_sqft + I(log_sqft^2), data = LA)
summary(m2)$coef

##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    10.2474    1.17846   8.696 8.438e-18
## log_sqft       -0.5482    0.30914  -1.773 7.638e-02
## I(log_sqft^2)   0.1301    0.02017   6.449 1.490e-10

Comparing models

summary(m1)$coef

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    2.703    0.14369   18.81 1.972e-71
## log_sqft       1.442    0.01954   73.79 0.000e+00

summary(m2)$coef

##               Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    10.2474    1.17846   8.696 8.438e-18
## log_sqft       -0.5482    0.30914  -1.773 7.638e-02
## I(log_sqft^2)   0.1301    0.02017   6.449 1.490e-10
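The negative `log_sqft` coefficient in m2 can look alarming next to the positive one in m1, but in a quadratic model the slope depends on where you are on the curve. A rough sketch using the estimates printed above; the evaluation point `log_sqft = 7` is a hypothetical value chosen for illustration (about 1100 square feet, assuming natural logs), and `slope_at` and `vertex` are made-up names.

```r
# Coefficients copied from the summary(m2)$coef output above
b1 <- -0.5482
b2 <- 0.1301

# Slope of the fitted curve with respect to log_sqft: b1 + 2 * b2 * x
slope_at <- function(x) b1 + 2 * b2 * x

# Value of log_sqft where the fitted slope is zero (vertex of the parabola)
vertex <- -b1 / (2 * b2)
vertex

# Slope at the hypothetical value log_sqft = 7
slope_at(7)
```

If the observed values of `log_sqft` all lie well above the vertex, the fitted curve is increasing over the whole range of the data, just as the line from m1 is.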

Linear model

[Figure: residual plots for m1]

Linear model with quadratic

Model selection

The residual plots for the second (more complex) model seem slightly better, so we're inclined to use that model. We can also compare the explanatory power of the models by looking at adjusted $R^2$, which accounts for the extra term in the second model.

summary(m1)$adj.r.squared

## [1] 0.7736

summary(m2)$adj.r.squared

## [1] 0.7793

These two models are very similar: both are quite good in terms of validity and explanatory power, but the quadratic one edges out the simple linear one in adjusted $R^2$ (0.7793 versus 0.7736).
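For reference, the adjusted $R^2$ reported by `summary()` penalizes $R^2$ for the number of predictors. With $n$ observations and $p$ predictors (not counting the intercept), the standard definition is:

\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]

Adding a predictor raises adjusted $R^2$ only if the gain in $R^2$ outweighs the lost degree of freedom.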

Demonstration on $R^2$

m3 <- lm(log_price ~ log_sqft + rnorm(length(LA$log_price)), data = LA)
summary(m1)$r.squared

## [1] 0.7738

summary(m3)$r.squared

## [1] 0.7738

summary(m1)$adj.r.squared

## [1] 0.7736

summary(m3)$adj.r.squared

## [1] 0.7735
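The pattern above (an essentially unchanged $R^2$ and a slightly lower adjusted $R^2$ after adding pure noise) can be explored without the LA data. A minimal sketch on simulated data; the names `sim`, `base`, and `noisy` are hypothetical, and the exact numbers depend on the seed.

```r
# Simulated data (an assumption for illustration, not the LA homes data)
set.seed(42)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)
sim$noise <- rnorm(n)  # unrelated to y by construction

base  <- lm(y ~ x, data = sim)
noisy <- lm(y ~ x + noise, data = sim)

# R^2 cannot decrease when a predictor is added to a least-squares fit
c(base = summary(base)$r.squared, noisy = summary(noisy)$r.squared)

# Adjusted R^2 may go down, since it charges for the extra coefficient
c(base = summary(base)$adj.r.squared, noisy = summary(noisy)$adj.r.squared)
```

This is why plain $R^2$ is a poor tool for choosing between models of different sizes: it rewards any added predictor, useful or not.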

Activity #7 Part I

Revisit the RailTrail data set from Activity 4.
Consider two models: a) an SLR model to predict ridership from temperature, b) the same model with an added quadratic term.
Discuss the relative merits of the two models.
Don't submit this activity yet; it will be continued on Wednesday.