Meet Robert Parker

The most influential critic in the world today happens to be a critic of wine. His name is Robert Parker.

Meet Clive Coates

Clive Coates, Master of Wine, is one of the world's leading wine authorities.

Parker is the wine writer who matters. Clive Coates is very serious and well-respected, but in terms of commercial impact, his influence is zero.

The data

wine <- read.csv("http://andrewpbray.github.io/data/Bordeaux.csv")
dim(wine)
## [1] 72  9
names(wine)
## [1] "Wine"             "Price"            "ParkerPoints"    
## [4] "CoatesPoints"     "P95andAbove"      "FirstGrowth"     
## [7] "CultWine"         "Pomerol"          "VintageSuperstar"

The mission

Develop a regression model to estimate the percentage effect on price of a 1% increase in ParkerPoints and of a 1% increase in CoatesPoints, using all seven predictors or a subset of them.

Exploratory Data Analysis

library(car)
scatterplotMatrix(wine[, 2:4])

[Figure: scatterplot matrix of Price, ParkerPoints, and CoatesPoints]

EDA

[Figure: EDA plots]

Fitting the simplest model

\[ Price \sim ParkerPoints + CoatesPoints + P95andAbove \\ + FirstGrowth + CultWine + Pomerol + VintageSuperstar \]

m1 <- lm(Price ~ . - Wine, data = wine)


\(Y\) versus \(\hat{Y}\)

plot(wine$Price ~ fitted(m1))
abline(0, 1)

[Figure: Price versus fitted values from m1, with the line y = x]

A second model

wine <- transform(wine, logPrice = log(Price),
                  logParkerPoints = log(ParkerPoints),
                  logCoatesPoints = log(CoatesPoints))
m2 <- lm(logPrice ~ . - Price - Wine - ParkerPoints - CoatesPoints,
         data = wine)

Recall: logging the predictors of interest as well as the response allows us to interpret the estimated slopes as the percentage increase in \(Y\) associated with a 1% increase in \(x\).
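
To spell out the standard log-log argument (a sketch; \(\hat{\beta}_1\) below stands for the fitted slope on \(\log(ParkerPoints)\) in m2):

\[ \widehat{\log Price} = \hat{\beta}_0 + \hat{\beta}_1 \log(ParkerPoints) + \cdots \]

If ParkerPoints increases by 1%, it is multiplied by 1.01, so \(\widehat{\log Price}\) increases by \(\hat{\beta}_1 \log(1.01) \approx 0.01\,\hat{\beta}_1\); hence \(\widehat{Price}\) is multiplied by roughly \(1 + 0.01\,\hat{\beta}_1\), i.e., it rises by about \(\hat{\beta}_1\) percent.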

SLR residual plots

[Figure: SLR residual plots]

MLR standardized residual plots

[Figure: MLR standardized residual plots]

MLR standardized residual plots (continued)

[Figure: MLR standardized residual plots, continued]

\(Y\) versus \(\hat{Y}\)

[Figure: logPrice versus fitted values from m2]

MMPs

[Figure: marginal model plots]

AVPs

[Figure: added-variable plots]
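
The MMPs and AVPs above were presumably produced with the car package (already loaded for scatterplotMatrix); a minimal sketch, assuming that is the case:

mmps(m2)     # marginal model plots: compare data and model-implied smooths
avPlots(m2)  # added-variable plots: each predictor's partial effect, adjusting for the others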

Multicollinearity

vif(m2)
##      P95andAbove      FirstGrowth         CultWine          Pomerol 
##            4.013            1.625            1.188            1.124 
## VintageSuperstar  logParkerPoints  logCoatesPoints 
##            1.139            5.825            1.410

Only logParkerPoints exceeds the traditional cutoff of 5, and not by much; multicollinearity is a minor issue here.
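
As a reminder of what a VIF measures, the flagged value can be reproduced by hand: regress logParkerPoints on the other predictors in m2 and compute \(1/(1 - R^2)\). A sketch (the name aux is just for illustration):

# auxiliary regression of logParkerPoints on the remaining m2 predictors
aux <- lm(logParkerPoints ~ P95andAbove + FirstGrowth + CultWine +
            Pomerol + VintageSuperstar + logCoatesPoints, data = wine)
1 / (1 - summary(aux)$r.squared)  # should match vif(m2)["logParkerPoints"]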

Model validity

We're confident we're working with a valid model; a sketch for reproducing the checks below appears after the lists.

  • Linearity: the conditional mean of the response is a linear function of the predictors. (See \(Y\) vs \(\hat{Y}\) and the MMPs.)
  • The errors have constant variance and are uncorrelated. (See the standardized residual plots vs the predictors.)
  • The errors are normally distributed with mean zero. (See the QQ plot.)

It's also sensible to build a model with:

  • No highly influential points. (See the Cook's distance plot.)
  • Low multicollinearity. (See the VIFs.)
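
A sketch (not the code behind the original slides) for reproducing these checks with base-R diagnostics on m2:

par(mfrow = c(2, 2))
plot(fitted(m2), rstandard(m2),
     xlab = "Fitted values", ylab = "Standardized residuals")  # constant variance
abline(h = 0, lty = 2)
qqnorm(rstandard(m2)); qqline(rstandard(m2))                   # normality of the errors
plot(cooks.distance(m2), type = "h", ylab = "Cook's distance") # influential points
plot(wine$logPrice ~ fitted(m2),
     xlab = "Fitted values", ylab = "logPrice")                # Y versus Y-hat
abline(0, 1)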

Model summary
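
summary(m2)$coef  # coefficient table for m2 (call inferred from the m3 slide below)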

##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      -51.1416    8.98557 -5.6915 3.389e-07
## P95andAbove        0.1006    0.13697  0.7341 4.656e-01
## FirstGrowth        0.8697    0.12524  6.9441 2.332e-09
## CultWine           1.3532    0.14569  9.2878 1.784e-13
## Pomerol            0.5364    0.09366  5.7275 2.946e-07
## VintageSuperstar   0.6159    0.22067  2.7910 6.918e-03
## logParkerPoints   11.5886    2.06763  5.6048 4.744e-07
## logCoatesPoints    1.6205    0.61154  2.6499 1.013e-02

Note that all of the predictors except P95andAbove are significant at the .05 level (another reason not to worry too much about the borderline VIF).

We could consider dropping it from the model to remove its redundancy with logParkerPoints.

Updated model summary

m3 <- update(m2, . ~ . - P95andAbove)
summary(m3)$coef
##                  Estimate Std. Error t value  Pr(>|t|)
## (Intercept)      -56.4755    5.26798 -10.721 5.205e-16
## FirstGrowth        0.8615    0.12430   6.931 2.299e-09
## CultWine           1.3360    0.14330   9.323 1.339e-13
## Pomerol            0.5362    0.09333   5.745 2.641e-07
## VintageSuperstar   0.5947    0.21800   2.728 8.186e-03
## logParkerPoints   12.7843    1.26915  10.073 6.662e-15
## logCoatesPoints    1.6045    0.60898   2.635 1.052e-02

Since the parameter estimates barely budged, this is essentially the same model as the previous one: no need to re-check diagnostic plots.
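
To close the loop on the original mission, the percentage effects can be read off the m3 estimates above; a quick worked calculation (not part of the original slides):

# approximate % change in Price for a 1% increase in each score (the log-log
# slope itself), plus the exact version from multiplying the score by 1.01
b <- coef(m3)[c("logParkerPoints", "logCoatesPoints")]
b                    # roughly 12.8 (Parker) and 1.6 (Coates) percent
100 * (1.01^b - 1)   # about 13.6 and 1.6 percent, exactly

By either reading, a 1% bump in Parker's score is associated with about eight times the price effect of the same bump in Coates's score, consistent with the opening claim about commercial impact.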