Matrix MLR - least squares and adding variables

Matrices

Restaurants in NYC

nyc <- read.csv("http://andrewpbray.github.io/data/nyc.csv")
dim(nyc)

## [1] 168   7

nyc[1:3,]

##   Case          Restaurant Price Food Decor Service East
## 1    1 Daniella Ristorante    43   22    18      20    0
## 2    2  Tello's Ristorante    32   20    19      19    0
## 3    3          Biricchino    34   21    13      18    0

What determines the price of a meal?

Let's look at the relationship between price, food rating, and decor rating.

\[ Price \sim Food + Decor \]

m1 <- lm(Price ~ Food + Decor, data = nyc)

Model 1: Food + Decor

summary(m1)

## 
## Call:
## lm(formula = Price ~ Food + Decor, data = nyc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.945  -3.766  -0.153   3.701  18.757 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.5002     4.7230  -5.187 6.19e-07 ***
## Food          1.6461     0.2615   6.294 2.68e-09 ***
## Decor         1.8820     0.1919   9.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.788 on 165 degrees of freedom
## Multiple R-squared:  0.6167, Adjusted R-squared:  0.6121 
## F-statistic: 132.7 on 2 and 165 DF,  p-value: < 2.2e-16
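
As a quick check on how to read these coefficients, the fitted mean function can be evaluated by hand. The sketch below copies the rounded estimates from the summary output rather than refitting the model; the helper `predict_m1` is a hypothetical name, and the example ratings are those of the first restaurant in the data.

```r
# Rounded coefficient estimates copied from summary(m1) above
b <- c(intercept = -24.5002, food = 1.6461, decor = 1.8820)

# Hypothetical helper: fitted mean price for given Food and Decor ratings
predict_m1 <- function(food, decor) {
  unname(b["intercept"] + b["food"] * food + b["decor"] * decor)
}

# First restaurant in the data: Food = 22, Decor = 18, observed Price = 43
fitted_price <- predict_m1(22, 18)
residual <- 43 - fitted_price
c(fitted = fitted_price, residual = residual)
```

The fitted price comes out at about 45.6, so this restaurant is a little cheaper than the model expects. With `m1` in the workspace, `predict(m1, newdata = data.frame(Food = 22, Decor = 18))` should agree up to rounding of the coefficients.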

The geometry of regression models

The mean function is . . .

A line when you have one continuous \(x\).
Parallel lines when you have one continuous \(x_1\) and one categorical \(x_2\).
Lines with different intercepts and different slopes when you have one continuous \(x_1\), one categorical \(x_2\), and an interaction term \(x_1 * x_2\).
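
Written out, these three mean functions are as follows. The 0/1 coding of \(x_2\) is an assumption here, chosen to match how East is stored in this data set.

\[ E(Y \mid x) = \beta_0 + \beta_1 x \]

\[ E(Y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

\[ E(Y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \]

In the second, both groups share the slope \(\beta_1\) and differ only in intercept (\(\beta_0\) versus \(\beta_0 + \beta_2\)). In the third, the \(x_2 = 1\) group has intercept \(\beta_0 + \beta_2\) and slope \(\beta_1 + \beta_3\), so the two lines need not be parallel.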

When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is . . .

a plane

3d plot
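
One way to draw such a plane is sketched below using base R graphics. It uses the rounded coefficients from the model 1 summary rather than the fitted object, and the 10-to-30 rating range of the grid is an assumption, not something read from the data.

```r
# Rounded coefficient estimates copied from summary(m1)
b0 <- -24.5002; b_food <- 1.6461; b_decor <- 1.8820

# Grid of ratings; the 10-to-30 range is an assumption, not taken from the data
food  <- seq(10, 30, length.out = 25)
decor <- seq(10, 30, length.out = 25)

# Fitted mean price at every (food, decor) grid point
price_hat <- outer(food, decor, function(f, d) b0 + b_food * f + b_decor * d)

# Draw the fitted plane as a wireframe surface
persp(food, decor, price_hat,
      xlab = "Food", ylab = "Decor", zlab = "Price",
      theta = 30, phi = 20, ticktype = "detailed")
```

This shows only the fitted surface; overlaying the observed restaurants as points would need a 3D scatterplot package or the transformation matrix that `persp()` returns.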

Location, location, location

Does the price depend on where the restaurant is located in Manhattan?

\[ Price \sim Food + Decor + East \]

Model 2: Food + Decor + East

m2 <- lm(Price ~ Food + Decor + East, data = nyc)
summary(m2)

## 
## Call:
## lm(formula = Price ~ Food + Decor + East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.0451  -3.8809   0.0389   3.3918  17.7557 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24.0269     4.6727  -5.142 7.67e-07 ***
## Food          1.5363     0.2632   5.838 2.76e-08 ***
## Decor         1.9094     0.1900  10.049  < 2e-16 ***
## East          2.0670     0.9318   2.218   0.0279 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.72 on 164 degrees of freedom
## Multiple R-squared:  0.6279, Adjusted R-squared:  0.6211 
## F-statistic: 92.24 on 3 and 164 DF,  p-value: < 2.2e-16

The geometry of regression models

When you have two continuous predictors \(x_1\), \(x_2\), then your mean function is a plane.
When you have two continuous predictors \(x_1\), \(x_2\), and a categorical predictor \(x_3\), then your mean function represents parallel planes.
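
To see why the planes are parallel, the sketch below evaluates the model 2 mean function for East and West restaurants over a small grid of ratings. The coefficients are the rounded values from the summary output; the helper `mean_price_m2` and the grid values are illustrative choices.

```r
# Rounded coefficient estimates copied from summary(m2)
b <- c(intercept = -24.0269, food = 1.5363, decor = 1.9094, east = 2.0670)

# Hypothetical helper: fitted mean price under model 2
mean_price_m2 <- function(food, decor, east) {
  unname(b["intercept"] + b["food"] * food + b["decor"] * decor + b["east"] * east)
}

# Compare East (1) and West (0) over a few rating combinations
grid <- expand.grid(food = c(18, 22, 25), decor = c(13, 18, 22))
gap <- mean_price_m2(grid$food, grid$decor, 1) -
  mean_price_m2(grid$food, grid$decor, 0)
gap  # the same value at every grid point: the East coefficient
```

The vertical gap between the two planes does not depend on Food or Decor, which is exactly what "parallel" means here.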

3d Plot

When you add an interaction between a continuous predictor and the categorical predictor, the planes are no longer parallel: each group gets its own slope in that direction, so the planes tilt relative to one another.

Model 3: Food + Decor + East + Decor:East

m3 <- lm(Price ~ Food + Decor + East + Decor:East, data = nyc)
summary(m3)

## 
## Call:
## lm(formula = Price ~ Food + Decor + East + Decor:East, data = nyc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7855  -3.6649   0.3785   3.7292  17.6358 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -29.3971     6.3770  -4.610 8.10e-06 ***
## Food          1.6634     0.2822   5.895 2.09e-08 ***
## Decor         2.0695     0.2298   9.006 5.42e-16 ***
## East          9.6616     6.2184   1.554    0.122    
## Decor:East   -0.4346     0.3518  -1.235    0.219    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.711 on 163 degrees of freedom
## Multiple R-squared:  0.6313, Adjusted R-squared:  0.6223 
## F-statistic: 69.78 on 4 and 163 DF,  p-value: < 2.2e-16
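
The interaction term can be read as giving each side of town its own intercept and its own Decor slope. The sketch below works these out from the rounded estimates in the summary output; the object names are illustrative.

```r
# Rounded coefficient estimates copied from summary(m3)
b <- c(intercept = -29.3971, food = 1.6634, decor = 2.0695,
       east = 9.6616, decor_east = -0.4346)

# Implied intercept and Decor slope for each side of town
west <- c(intercept = unname(b["intercept"]),
          decor_slope = unname(b["decor"]))
east <- c(intercept = unname(b["intercept"] + b["east"]),
          decor_slope = unname(b["decor"] + b["decor_east"]))
rbind(west, east)

# Decor rating at which the two planes cross (fitted East premium is zero)
crossing <- unname(-b["east"] / b["decor_east"])
crossing
```

The Food slope is shared by both planes because Food does not interact with East in this model. The fitted planes cross at a Decor rating of roughly 22, although both the East and Decor:East estimates are imprecise, so this crossing point should not be over-interpreted.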

3d plot

Comparing Models

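The East term was significant at the 5% level in model 2 (p = 0.028), suggesting that, for restaurants with the same Food and Decor ratings, there is a relationship between location and price.
In model 3, where the slope of Decor is allowed to vary with location, the East term is no longer significant (p = 0.122), and neither is the difference in slopes (p = 0.219). Note that the East coefficient in model 3 is the East-West difference at a Decor rating of 0, so it is not directly comparable to the East coefficient in model 2.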
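
Models 2 and 3 are nested, so they can also be compared with a partial F test. The sketch below computes an approximate version from the rounded R-squared values printed in the two summaries; because of that rounding, the result is only approximate.

```r
# Values copied from the summary output of models 2 and 3 (rounded)
r2_m2 <- 0.6279          # Multiple R-squared, model 2
r2_m3 <- 0.6313          # Multiple R-squared, model 3
df_resid_m3 <- 163       # residual degrees of freedom, model 3
q <- 1                   # number of terms added (Decor:East)

# Partial F statistic for adding Decor:East to model 2
f_stat <- ((r2_m3 - r2_m2) / q) / ((1 - r2_m3) / df_resid_m3)
p_value <- pf(f_stat, df1 = q, df2 = df_resid_m3, lower.tail = FALSE)

# With one added term, F should be close to the squared t value of that term
t_decor_east <- -1.235
c(f_stat = f_stat, t_squared = t_decor_east^2, p_value = p_value)
```

With both fitted models in the workspace, `anova(m2, m3)` carries out the same test from unrounded quantities.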

Activity 8

Load in the LA homes data set and fit the following model:

\[ logprice \sim logsqft + bed + city \]

What appears to be the reference level for city?
In the context of this problem, what is suggested by the sign of the coefficient for bed? Does this make sense to you?
Calculate the vector \(\hat{\beta}\) using the matrix formulation of the least squares estimates (useful functions: cbind(), rep(), matrix(), as.matrix(), t(), solve()). Do they agree with the estimates that come out of lm()?
See if you can plot your full model as geometric structures on a 3D scatterplot of the data.
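
As a warm-up for the matrix calculation, here is a minimal sketch of \(\hat{\beta} = (X^T X)^{-1} X^T y\). It uses R's built-in mtcars data as a stand-in, not the LA homes data, and the choice of mpg, wt, and hp is purely illustrative.

```r
# Stand-in data: the built-in mtcars data set, not the LA homes data
y <- mtcars$mpg

# Design matrix: a column of ones for the intercept, then the predictors
X <- cbind(1, mtcars$wt, mtcars$hp)
colnames(X) <- c("(Intercept)", "wt", "hp")

# Least squares estimates: solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Compare with lm(); the two columns should agree up to rounding error
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(matrix_formula = drop(beta_hat), lm = coef(fit))
```

Calling `solve(A, b)` solves the linear system directly instead of forming the inverse first; `solve(t(X) %*% X) %*% t(X) %*% y` follows the formula more literally and gives the same answer here. For the activity, a categorical predictor such as city needs one 0/1 indicator column in \(X\) for each non-reference level.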