Model Selection

The Problem of Model Selection

A given data set could conceivably have been generated by uncountably many models. Identifying the true model is like finding a particular piece of hay in a haystack. Said another way, the model space is massive, and the criterion for what constitutes the "best" model is ill-defined.

Best strategy: Use domain knowledge to constrain the model space and/or build models that help you answer specific scientific questions.

Another common strategy:

  1. Pick a criterion for "best".
  2. Decide how to explore the model space.
  3. Select "best" model in search area.

Tread Carefully!!! The second strategy can lead to myopic analysis, overconfidence, and wrong-headed conclusions.

What do we mean by "best"?

While we'd like to find the "true" model, in practice we just hope we're doing a good job at:

  1. Prediction
  2. Description

Synthetic example

How smooth should our model be?
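
A minimal sketch of how such synthetic data might be simulated; the true curve, noise level, and sample size below are illustrative assumptions, not taken from the original example.

# hypothetical data-generating process: smooth nonlinear trend plus normal noise
set.seed(42)
n <- 50
x <- runif(n, min = 0, max = 10)
y <- sin(x) + x / 5 + rnorm(n, mean = 0, sd = 0.5)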

Four candidates

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))
m3 <- lm(y ~ x + I(x^2) + I(x^3))
m4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))

Likelihood

Def: the joint probability (more precisely, density) of all of the data given a particular model. If the \(Y\)s are independent of each other given the \(X\)s, then:

\[ P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = P(y_1 \mid x_1) \, P(y_2 \mid x_2) \cdots P(y_n \mid x_n) \]
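
With normal errors, each term \(P(y_i \mid x_i)\) is the normal density evaluated at the residual \(y_i - \hat{y}_i\), with standard deviation given by the estimated residual standard error. So the likelihood of m1 can be computed as: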

L1 <- prod(dnorm(m1$res, mean = 0, sd = summary(m1)$sigma))

Comparing Likelihoods

L1 <- prod(dnorm(m1$res, mean = 0, sd = summary(m1)$sigma))
L2 <- prod(dnorm(m2$res, mean = 0, sd = summary(m2)$sigma))
L3 <- prod(dnorm(m3$res, mean = 0, sd = summary(m3)$sigma))
L4 <- prod(dnorm(m4$res, mean = 0, sd = summary(m4)$sigma))
c(L1, L2, L3, L4)
## [1] 1.845701e-86 1.252416e-69 2.682451e-49 6.055980e-47

The observed data is most probable under the model with a quartic term. So that's the best model, right?

The BEST model!

mBEST <- lm(y ~ poly(x, 20))
LBEST <- prod(dnorm(mBEST$res, mean = 0, sd = summary(mBEST)$sigma))
c(L1, L2, L3, L4, LBEST)
## [1] 1.845701e-86 1.252416e-69 2.682451e-49 6.055980e-47 1.052673e-42

But surely that's not the best model…

Four Criteria

  1. \(R^2_{adj}\)
  2. \(AIC\)
  3. \(AIC_C\)
  4. \(BIC\)

There are many others…

\(R^2_{adj}\)

A measure of the explanatory power of a model:

\[ R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST} \]

But like the likelihood, \(R^2\) never decreases as predictors are added, so we add a penalty:

\[ R^2_{adj} = 1 - \frac{RSS/(n - (p + 1))}{SST/(n - 1)} \]

Nonetheless, choosing the model that has the highest \(R^2_{adj}\) can lead to overfitting.
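
As a quick check on the four candidate models, the adjusted \(R^2\) values can be extracted directly from summary():

sapply(list(m1, m2, m3, m4), function(m) summary(m)$adj.r.squared)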

\(AIC\)

Akaike Information Criterion: AIC balances goodness of fit against model complexity, motivated by information theory.

\[ AIC = 2\left[-\log(\textrm{likelihood of model}) + (p + 2)\right] \]

which, for linear regression with normal errors, simplifies (up to an additive constant) to

\[ AIC = n \log\left(\frac{RSS}{n}\right) + 2p \]

Smaller is better. AIC tends to overfit when the sample size is small.
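
For fitted lm objects, base R computes this with AIC(). (R uses the full log-likelihood, so the values differ from the simplified RSS formula by an additive constant, but the ranking of the models is unchanged.)

AIC(m1, m2, m3, m4)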

\(AIC_C\)

AIC Corrected: a bias-corrected version of AIC for use with small sample sizes.

\[ AIC_C = AIC + \frac{2(p + 2)(p + 3)}{n - (p + 1)} \]
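
Base R has no AICc function, so here is a small helper applying the correction exactly as written above (with \(p\) taken to be the number of predictors, excluding the intercept):

aicc <- function(model) {
  n <- length(residuals(model))
  p <- length(coef(model)) - 1  # number of predictors, excluding the intercept
  AIC(model) + 2 * (p + 2) * (p + 3) / (n - (p + 1))
}
sapply(list(m1, m2, m3, m4), aicc)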

\(BIC\)

Bayesian Information Criterion: BIC penalizes complexity more heavily than AIC for all but the very smallest data sets.

\[ BIC = -2\log(\textrm{likelihood of model}) + (p + 2)\log(n) \]
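
As with AIC, base R computes this directly for the fitted models:

BIC(m1, m2, m3, m4)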

Criteria compared
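
One way to put the four criteria side by side for the candidate models, using the aicc() helper sketched above (adjusted \(R^2\) favors larger values; the other three favor smaller values):

data.frame(
  model  = c("m1", "m2", "m3", "m4"),
  adj_r2 = sapply(list(m1, m2, m3, m4), function(m) summary(m)$adj.r.squared),
  AIC    = sapply(list(m1, m2, m3, m4), AIC),
  AICc   = sapply(list(m1, m2, m3, m4), aicc),
  BIC    = sapply(list(m1, m2, m3, m4), BIC)
)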