Say you fit a linear model to data and find the residual plots look awful. One strategy to regain a valid model is to transform your data.
A building maintenance company is planning to submit a bid to clean corporate offices. How much should they bid? They'd like to be able to cover the job with a team of 4 crews, or a larger team of 16 crews, but they want to be sure. To make a good prediction, they collected data on the number of crews sent out and the number of rooms cleaned over a sample of 53 days.
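The coefficient table below comes from regressing rooms cleaned on the number of crews. A minimal sketch of the fit, assuming the data frame is cleaning with columns Crews and Rooms (as in the data printed later), and storing the model under the assumed name m1:

m1 <- lm(Rooms ~ Crews, data = cleaning)  # raw-scale fit
summary(m1)$coef                          # coefficient table shown below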
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    1.785     2.0965  0.8513 3.986e-01
## Crews          3.701     0.2118 17.4721 3.554e-23
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points; however, there is dramatically increasing variance.
Prediction intervals are particularly sensitive to model assumptions, so we have good reason to distrust intervals from this model.
The square root transform is often useful for reducing the increasing variance found in many types of count data.
\[ X_t = \sqrt{X} \]
Let's transform both \(X\) and \(Y\).
cleaning2 <- transform(cleaning, sqrtCrews = sqrt(Crews))
cleaning2 <- transform(cleaning2, sqrtRooms = sqrt(Rooms))
cleaning2[1:2, ]
##   Case Crews Rooms sqrtCrews sqrtRooms
## 1    1    16    51     4.000     7.141
## 2    2    10    37     3.162     6.083
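The model is then refit on the square-root scale, giving the output below. A sketch of the call, with the fit stored under the assumed name m2:

m2 <- lm(sqrtRooms ~ sqrtCrews, data = cleaning2)  # sqrt-scale fit
summary(m2)$coef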
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.2001     0.2758  0.7257 4.713e-01
## sqrtCrews     1.9016     0.0936 20.3158 4.203e-26
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points and the variance has been stabilized.
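Below, pi holds prediction intervals from the raw-scale model and pi2 from the square-root-scale model, each evaluated at 4 and 16 crews. A sketch of how they might be computed, reusing the assumed fits m1 and m2 from above:

# Prediction intervals at 4 and 16 crews; note that the name pi
# masks R's built-in constant, so this naming is only for illustration.
new_crews <- data.frame(Crews = c(4, 16), sqrtCrews = sqrt(c(4, 16)))
pi  <- predict(m1, newdata = new_crews, interval = "prediction")
pi2 <- predict(m2, newdata = new_crews, interval = "prediction")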
pi
##     fit    lwr   upr
## 1 16.59  1.589 31.59
## 2 61.00 45.810 76.19
pi2^2
##     fit    lwr   upr
## 1 16.03  7.784 27.21
## 2 60.94 43.327 81.55
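Squaring the endpoints of pi2 returns the intervals to the original scale of rooms. Notice that the transformed model's lower bound at 4 crews (7.8 rooms) is far more plausible than the raw model's (1.6 rooms), which is the payoff of stabilizing the variance.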
Can we use the age of a truck to predict what its price should be? Consider a random sample of 43 pickup trucks.
The very old truck would be a high-leverage point and is probably not of interest to model. Let's consider only trucks made in the last 20 years.
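A sketch of that filtering step and the raw-scale fit whose coefficients appear below, assuming the unfiltered data frame is pickups_all (a name assumed here) with a year column:

# Keep only trucks from the most recent 20 model years.
pickups <- subset(pickups_all, year >= max(year) - 20)
m_price <- lm(price ~ year, data = pickups)  # raw-scale fit (name assumed)
summary(m_price)$coef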
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -2278766   238325.7   -9.562 6.924e-12
## year            1143      119.1    9.597 6.238e-12
The normality assumption on the errors seems fine, but there appears to be a quadratic trend in the mean function.
One observation (44) should be investigated for its influence. There is evidence of increasing variance in the residuals.
pickups2 <- transform(pickups, log_price = log(price))
Variables that span multiple orders of magnitude often benefit from a natural log transformation.
\[ Y_t = \log_e(Y) \]
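The output below comes from regressing the log price on year. A sketch of the call, with the fit stored under the assumed name m_log:

m_log <- lm(log_price ~ year, data = pickups2)  # log-scale fit
summary(m_log)$coef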
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -258.9981   26.12294  -9.915 2.472e-12
## year           0.1339    0.01306  10.253 9.343e-13
The residuals from this model appear less normal, though the quadratic trend in the mean function is now less apparent.
There are no points flagged as influential and our variance has been stabilized.
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -258.9981   26.12294  -9.915 2.472e-12
## year           0.1339    0.01306  10.253 9.343e-13
\[ \widehat{\log(\text{price})} = -258.99 + 0.13 \times \text{year} \]
For each additional year the truck is newer, we would expect the log price of the truck to increase on average by 0.13 log dollars.
Which isn't very useful . . .
Two useful identities: \(\log(a) - \log(b) = \log(a/b)\) and \(e^{\log(x)} = x\).

The slope coefficient for the log-transformed model is 0.13, meaning the log price difference between trucks that are one year apart is predicted to be 0.13 log dollars.
\[
\begin{aligned}
\log(\text{price at year } x+1) - \log(\text{price at year } x) &= 0.13 \\
\log\left(\frac{\text{price at year } x+1}{\text{price at year } x}\right) &= 0.13 \\
e^{\log\left(\frac{\text{price at year } x+1}{\text{price at year } x}\right)} &= e^{0.13} \\
\frac{\text{price at year } x+1}{\text{price at year } x} &= 1.14
\end{aligned}
\]
For each additional year the truck is newer, we would expect its price to increase on average by a factor of 1.14.
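As a quick check of the back-transformation in R, exponentiating the fitted slope recovers the multiplicative factor:

exp(0.1339)  # about 1.14, using the slope from the output above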