A given data set can conceivably have been generated from uncountably many models. Identifying the true model is like finding a piece of hay in a haystack. Said another way, the model space is massive and the criterion for what constitutes the "best" model is ill-defined.
Best strategy: Use domain knowledge to constrain the model space and/or build models that help you answer specific scientific questions.
Another common strategy: fit many candidate models and let some numerical criterion pick the "best" one.
Tread Carefully!!! The second strategy can lead to myopic analysis, overconfidence, and wrong-headed conclusions.
While we'd like to find the "true" model, in practice we just hope we're doing a good job at:
How smooth should our model be?
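The fits below assume a numeric predictor x and response y are already in the workspace. As a stand-in (purely an assumption, not the notes' actual data), one could simulate a generic data set so the code runs end to end:

# Assumed setup (not the notes' data): simulate a generic (x, y) pair
# so the polynomial fits below can be run end to end.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 0.5 * x - 0.1 * x^2 + rnorm(n, sd = 2)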
m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))
m3 <- lm(y ~ x + I(x^2) + I(x^3))
m4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
Likelihood (def.): the joint probability (actually: density) of all of the data given a particular model. If our \(Y\)s are independent of one another given their \(X\)s, then:
\[ P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = P(y_1 \mid x_1)\, P(y_2 \mid x_2) \cdots P(y_n \mid x_n) \]
L1 <- prod(dnorm(m1$res, mean = 0, sd = summary(m1)$sigma))
L2 <- prod(dnorm(m2$res, mean = 0, sd = summary(m2)$sigma))
L3 <- prod(dnorm(m3$res, mean = 0, sd = summary(m3)$sigma))
L4 <- prod(dnorm(m4$res, mean = 0, sd = summary(m4)$sigma))
c(L1, L2, L3, L4)
## [1] 1.845701e-86 1.252416e-69 2.682451e-49 6.055980e-47
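An equivalent, more idiomatic check is R's logLik(). For lm fits it plugs in the maximum-likelihood estimate of \(\sigma\) rather than summary()'s residual standard error, so exponentiating these values will not match the hand computation above exactly; a quick sketch:

# Log-likelihoods via logLik(); these use the ML estimate of sigma,
# so exp() of them differs slightly from the prod(dnorm(...)) values.
sapply(list(m1, m2, m3, m4), logLik)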
The observed data is most probable under the model with a quartic term. So that's the best model, right?
mBEST <- lm(y ~ poly(x, 20))
LBEST <- prod(dnorm(mBEST$res, mean = 0, sd = summary(mBEST)$sigma))
c(L1, L2, L3, L4, LBEST)
## [1] 1.845701e-86 1.252416e-69 2.682451e-49 6.055980e-47 1.052673e-42
But surely that's not the best model…
There are many others…
\(R^2\) is a measure of the explanatory power of a model:
\[ R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST} \]
But, like the likelihood, \(R^2\) never decreases as predictors are added, so we add a penalty for complexity.
\[ R^2_{adj} = 1 - \frac{RSS/(n - (p + 1))}{SST/(n - 1)} \]
Nonetheless, choosing the model that has the highest \(R^2_{adj}\) can lead to overfitting.
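Both quantities are available from summary() for the fits above; a quick comparison (a sketch, assuming the models m1 through m4 from earlier):

# R^2 never decreases with added terms; adjusted R^2 applies the penalty above
sapply(list(m1, m2, m3, m4), function(m) {
  s <- summary(m)
  c(R2 = s$r.squared, R2_adj = s$adj.r.squared)
})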
Akaike Information Criterion (AIC): a balance of goodness of fit and complexity, grounded in information theory.
\[ AIC = 2\left[-\log(\textrm{likelihood of model}) + (p + 2)\right] \]
which, for linear regression with normal errors, simplifies (up to an additive constant) to
\[ AIC = n \log\left(\frac{RSS}{n}\right) + 2p \]
Smaller = better, though AIC tends to overfit when the sample size is small.
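In R, base stats::AIC() compares fitted models directly (a sketch using the earlier fits; smaller values are preferred):

# AIC for each candidate fit; the df column counts estimated parameters
AIC(m1, m2, m3, m4)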
Corrected AIC (\(AIC_C\)): a bias-corrected version for use with small sample sizes.
\[ AIC_C = AIC + \frac{2(p + 2)(p + 3)}{n - (p + 1)} \]
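There is no \(AIC_C\) function in base R, but the correction is easy to apply by hand; a sketch for m2, where p is the number of predictors:

# Hand-rolled AIC_C for m2, following the correction above
p <- length(coef(m2)) - 1    # number of predictors, excluding the intercept
n <- length(residuals(m2))
AIC(m2) + 2 * (p + 2) * (p + 3) / (n - p - 1)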
Bayesian Information Criterion (BIC): for all but the very smallest data sets, imposes a heavier penalty for complexity than AIC.
\[ BIC = -2 \log(\textrm{likelihood of model}) + (p + 2) \log(n) \]
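Base R's BIC() applies the \(\log(n)\) penalty directly (again, smaller is better):

# BIC penalizes complexity more heavily than AIC once n is moderately large
BIC(m1, m2, m3, m4)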