If a symbol has an asterisk, it denotes a new, specific value that is not in the data set, e.g. \(x^*\).
What does \(Y^*\) mean?
What value would we predict for a new \(x^*\)?
\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 * x^* \]
How much uncertainty do we have in that prediction?
\[ \hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 * x^* \]
Two sources of uncertainty:

1. Uncertainty in \(\hat{\beta}_0\)
2. Uncertainty in \(\hat{\beta}_1\)
We can calculate \(SE(\hat{y}^*)\):
\[ SE(\hat{y}^*) = S \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}} \]
We know that \(\hat{y}^*\) will also be t-distributed, so we can form a CI:
\[ \hat{y}^* \pm t * SE(\hat{y}^*) \]
m1 <- lm(f_data ~ x)
x_star <- 24
m1$coef[1] + m1$coef[2] * x_star
## (Intercept)
##       28.46
predict(m1, data.frame(x = x_star), interval = "confidence")
##     fit   lwr   upr
## 1 28.46 27.35 29.56
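The CI that `predict()` returns can also be built by hand from the SE formula. A minimal sketch, assuming simulated data like the setup used in these slides (the seed is an assumption added for reproducibility, so the numbers will differ from the 28.46 above, which came from an unseeded draw):

```r
set.seed(42)  # assumption: the slides did not set a seed
n <- 60
x <- rnorm(n, mean = 20, sd = 3)
f_data <- 12 + 0.7 * x + rnorm(n, mean = 0, sd = 2)
m1 <- lm(f_data ~ x)

x_star <- 24
y_hat <- unname(m1$coef[1] + m1$coef[2] * x_star)

S   <- summary(m1)$sigma            # residual standard error
SXX <- sum((x - mean(x))^2)
se_fit <- S * sqrt(1 / n + (x_star - mean(x))^2 / SXX)

t_crit <- qt(0.975, df = n - 2)     # 95% CI, n - 2 df
manual_ci <- c(lwr = y_hat - t_crit * se_fit, upr = y_hat + t_crit * se_fit)
manual_ci
predict(m1, data.frame(x = x_star), interval = "confidence")  # should match
```

The by-hand endpoints and the `interval = "confidence"` output agree, which confirms that `predict()` is using exactly this SE.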
Consider the SE term:
\[ SE(\hat{y}^*) = S \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}} \]
For what values of \(x^*\) would you expect the interval for \(\hat{y}^*\) to be the narrowest?
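One way to answer this empirically is to compute the interval width over a grid of \(x^*\) values. A quick sketch, again on simulated data mirroring the slides' setup (seed and grid values are assumptions):

```r
set.seed(1)  # assumption for reproducibility
n <- 60
x <- rnorm(n, mean = 20, sd = 3)
f_data <- 12 + 0.7 * x + rnorm(n, mean = 0, sd = 2)
m1 <- lm(f_data ~ x)

grid <- data.frame(x = c(10, 15, mean(x), 25, 30))
ci <- predict(m1, grid, interval = "confidence")
widths <- ci[, "upr"] - ci[, "lwr"]
cbind(x_star = grid$x, width = widths)
# narrowest at x_star = mean(x), where (x* - xbar)^2 = 0
```

The \((x^* - \bar{x})^2\) term vanishes at \(x^* = \bar{x}\), so the interval is narrowest at the mean of the observed \(x\) values and widens as \(x^*\) moves away from it.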
Look familiar?
\(Y^*\) represents the actual value of \(y\) that you might observe at \(x^*\). It comes not from the estimated mean function:
\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 * x \]
But from the estimated data generating function:
\[ Y = \hat{\beta}_0 + \hat{\beta}_1 * x + e\]
Which has three sources of uncertainty:

1. Uncertainty in \(\hat{\beta}_0\)
2. Uncertainty in \(\hat{\beta}_1\)
3. Uncertainty from the error term \(e\)
The SE for the CI:
\[ SE(\hat{y}^*) = S \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}} \]
gains an extra term for the PI:
\[ SE(Y^*) = S \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SXX}} \]
What is the 95% prediction interval for \(x^* = 24\)?
\[ \hat{y}^* \pm t * SE(Y^*) \]
m1 <- lm(f_data ~ x)
x_star <- 24
m1$coef[1] + m1$coef[2] * x_star
## (Intercept)
##       28.46
predict(m1, data.frame(x = x_star), interval = "prediction")
##     fit   lwr  upr
## 1 28.46 23.81 33.1
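As with the CI, the prediction interval can be reproduced by hand; the only change is the extra "1 +" under the square root. A sketch under the same assumptions as before (simulated data, seed added for reproducibility, so the numbers differ from 28.46 above):

```r
set.seed(42)  # assumption: the slides did not set a seed
n <- 60
x <- rnorm(n, mean = 20, sd = 3)
f_data <- 12 + 0.7 * x + rnorm(n, mean = 0, sd = 2)
m1 <- lm(f_data ~ x)

x_star <- 24
y_hat <- unname(m1$coef[1] + m1$coef[2] * x_star)
S   <- summary(m1)$sigma
SXX <- sum((x - mean(x))^2)

# PI standard error: note the extra "1 +" relative to the CI version
se_pred <- S * sqrt(1 + 1 / n + (x_star - mean(x))^2 / SXX)
t_crit <- qt(0.975, df = n - 2)
manual_pi <- c(lwr = y_hat - t_crit * se_pred, upr = y_hat + t_crit * se_pred)
manual_pi
predict(m1, data.frame(x = x_star), interval = "prediction")  # should match
```

Because \(SE(Y^*) > SE(\hat{y}^*)\), the prediction interval is always wider than the confidence interval at the same \(x^*\).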
\(Y\) is related to \(x\) by a simple linear regression model. \[ E(Y|X) = \beta_0 + \beta_1 * x \]
The errors \(e_1, e_2, \ldots, e_n\) are independent of one another.
The errors have a common variance \(\sigma^2\).
The errors are normally distributed: \(e \sim N(0, \sigma^2)\)
Said another way…
\[ Y|X = x \sim N(\beta_0 + \beta_1 * x, \sigma^2) \]
Regression is a smooth, functional summary of the structure of the conditional distribution of \(Y|X\).
n <- 60
beta_0 <- 12
beta_1 <- .7
sigma <- 2
x <- rnorm(n, mean = 20, sd = 3)
f_mean <- beta_0 + beta_1 * x                      # mean function
f_data <- f_mean + rnorm(n, mean = 0, sd = sigma)  # data generating function
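Fitting the model to data simulated this way should recover estimates close to the true \(\beta_0 = 12\) and \(\beta_1 = 0.7\). A quick check (the seed is an assumption added here; the original code did not set one):

```r
set.seed(1)  # assumption: the slides' simulation was unseeded
n <- 60
beta_0 <- 12
beta_1 <- .7
sigma <- 2
x <- rnorm(n, mean = 20, sd = 3)
f_mean <- beta_0 + beta_1 * x                      # mean function
f_data <- f_mean + rnorm(n, mean = 0, sd = sigma)  # data generating function

m1 <- lm(f_data ~ x)
coef(m1)  # estimates should land near 12 and 0.7
```

With \(n = 60\) and \(\sigma = 2\) the sampling variability is modest, so the fitted coefficients land near the truth on most draws.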