Chapter 2 exercises

1. With lots of data, a more flexible method would likely be better. There less danger that you would end up modeling noise, i.e. overfitting is more difficult.
2. With lots of variables, there is a serious danger of overfitting, so we’ll want to use an inflexible method to help constrain all the possible models that we could fit.
3. With highly non-linear relationships, we’d need to allow for a certain degree of flexibility in order to capture that structure.
4. If the variance of the error term is high, there is a danger that a flexible method would model the random error (overfitting) instead of the systematic variation. Because of this, we’d want an inflexible method.
1. Regression, inference, $$n = 500$$, $$p = 3$$ (the response/output, CEO salary, is generally not included in $$p$$).
2. Classification, prediction, $$n = 20$$, $$p = 13$$.
3. Regression, prediction, $$n = 52$$ (weeks), $$p = 3$$.
1. Skip

2. Examples will vary widely.

3. Flexible approaches are useful when you have a highly non-linear $$f$$ as well as when you have lots of $$n$$ relative to $$p$$. In essence, if the true $$f$$ is complex, and you have a sufficient amount of data to resolve it, a flexible model will lead to better predictions. In situations where a large $$p$$ leads to a huge number of possible models, it can be helpful to restrict the model space with a less flexible method. Less flexible methods often lead to more easily interpretable models, which is very important if you’re interesting in inference.

4. Parametric approaches put constraints on the functional form of $$f$$ and thus help guard against overfitting. The parameter estimates are also often fairly interpretable, which is useful if you’re interested in inference. When the structure in the data is complex, however, parametric approaches can exhibit systematic bias (effectively underfitting).

Additional exercises

Input: $$X$$ is usually written as $$n$$ by $$p$$. Here, $$n = 10$$ and $$p = 64^2 = 4096$$, the total number of pixels in each photo.

$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,4096} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,4096} \\ \vdots & \vdots & \ddots & \vdots \\ x_{10,1} & x_{10,2} & \cdots & x_{10,4096} \end{pmatrix}$

Transformed Input: Most of the information in those 4096 pixels is likely encoded in a smaller number of features, $$m$$.

$X^{*} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{10,1} & x_{10,2} & \cdots & x_{10,m} \end{pmatrix}$

Output: A single vector of length $$n$$.

$y = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{10} \end{pmatrix}$

Model: One simple $$f$$ would be a linear model on the transformed input.

$y = \beta_0 + \beta_1 x_1^{*} + \beta_2 x_2^{*} + \ldots + \beta_m x_m^{*} + \epsilon$