Transformations

Say you fit a linear model to data and find the residual plots look awful. One strategy to regain a valid model is to transform your data.

Example 1: Cleaning crews

A building maintenance company is planning to submit a bid to clean corporate offices. How much should they bid? They'd like to be able to cover the job with a team of 4 crews or a team of 16 crews, but they want to be sure. To make a good prediction, they collected data on how many rooms were cleaned by crews of various sizes over a sample of 53 days.

[Figure: scatterplot of Rooms vs Crews]

Linear model?

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)    1.785     2.0965  0.8513 3.986e-01
## Crews          3.701     0.2118 17.4721 3.554e-23
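A coefficient table like the one above comes from a fitted `lm` object. A minimal sketch, using the built-in `cars` data as a stand-in since the `cleaning` data are not reproduced in these slides:

```r
# Fit a simple linear model and extract its coefficient table
# (cars is a built-in dataset standing in for the cleaning data)
mod <- lm(dist ~ speed, data = cars)
summary(mod)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
```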

[Figure: scatterplot with fitted regression line]

Linearity and normality

[Figure: residual plot and normal Q-Q plot]

The mean function appears to be linear and the residuals are well-approximated by the normal distribution.

Constant variance and influence

[Figure: residual spread and influence diagnostics]

There are no influential points; however, there is dramatically increasing variance.
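Diagnostic plots like these can be produced with the default plot method for `lm` objects. A sketch, again using the built-in `cars` data as a stand-in:

```r
# Standard diagnostic plots for a fitted linear model
mod <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(mod)             # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
```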

PIs on an invalid model

[Figure: fit with prediction intervals]

Prediction intervals are particularly sensitive to model assumptions, so we have good reason to distrust this one.

Square root transform

The square root transform is often useful for reducing the increasing variance found in many types of count data.

\[ X_t = \sqrt{X} \]

Let's transform both \(X\) and \(Y\).

cleaning2 <- transform(cleaning, sqrtCrews = sqrt(Crews))
cleaning2 <- transform(cleaning2, sqrtRooms = sqrt(Rooms))
cleaning2[1:2, ]
##   Case Crews Rooms sqrtCrews sqrtRooms
## 1    1    16    51     4.000     7.141
## 2    2    10    37     3.162     6.083

Transformed linear model?

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.2001     0.2758  0.7257 4.713e-01
## sqrtCrews     1.9016     0.0936 20.3158 4.203e-26

[Figure: sqrtRooms vs sqrtCrews with fitted regression line]

Linearity and normality

[Figure: residual plot and normal Q-Q plot]

The mean function appears to be linear and the residuals are well-approximated by the normal distribution.

Constant variance and influence

[Figure: residual spread and influence diagnostics]

There are no influential points and the variance has been stabilized.

PIs from a valid model

[Figure: transformed fit with prediction intervals]

Comparing PIs

pi
##     fit    lwr   upr
## 1 16.59  1.589 31.59
## 2 61.00 45.810 76.19
pi2^2
##     fit    lwr   upr
## 1 16.03  7.784 27.21
## 2 60.94 43.327 81.55
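The comparison above can be sketched end to end. The snippet below simulates count data whose variance grows with the mean (an assumption standing in for the real cleaning data), fits both models, and squares the transformed-scale interval to return it to the original units:

```r
# Simulated stand-in for the cleaning data: Poisson counts,
# so the variance of rooms grows with the mean
set.seed(1)
crews <- sample(2:16, 53, replace = TRUE)
rooms <- rpois(53, lambda = 3.7 * crews)

mod  <- lm(rooms ~ crews)              # raw-scale model (non-constant variance)
mod2 <- lm(sqrt(rooms) ~ sqrt(crews))  # square-root transformed model

newdata <- data.frame(crews = c(4, 16))
predict(mod,  newdata, interval = "prediction")    # PI in rooms
predict(mod2, newdata, interval = "prediction")^2  # PI back-transformed to rooms
```

Because `sqrt()` appears in the model formula, `predict()` applies the transformation to `newdata` automatically; squaring the result puts the interval back on the rooms scale.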

Log Transformations

Example 2: Truck prices

Can we use the age of a truck to predict what its price should be? Consider a random sample of 43 pickup trucks.

[Figure: scatterplot of price vs year]

Consider unusual observations

The very old truck will be a high-leverage point and may not be of interest to model. Let's consider only trucks made in the last 20 years.
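Restricting to recent trucks can be done with `subset()`. A toy sketch (the `pickups` data frame and the exact cutoff year are not shown in these slides, so both are illustrative):

```r
# Illustrative: drop a very old, high-leverage truck from a data frame
pickups <- data.frame(year  = c(1965, 1995, 2004, 2008),
                      price = c(15000, 2500, 6000, 12000))
recent  <- subset(pickups, year >= 1992)  # keep roughly the last 20 years
nrow(recent)                              # the 1965 truck is excluded
```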

[Figure: price vs year, trucks from the last 20 years]

Linear model?

##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -2278766   238325.7  -9.562 6.924e-12
## year            1143      119.1   9.597 6.238e-12

[Figure: scatterplot with fitted regression line]

Linearity and normality

[Figure: residual plot and normal Q-Q plot]

The normality assumption on the errors seems fine but there seems to be a quadratic trend in the mean function.

Constant variance and influence

[Figure: residual spread and influence diagnostics]

One observation (44) should be investigated for its influence. There is evidence of increasing variance in the residuals.

pickups2 <- transform(pickups, log_price = log(price))

[Figure: log_price vs year]

Variables that span multiple orders of magnitude often benefit from a natural log transformation.

\[ Y_t = \log_e(Y) \]

Log-transformed linear model

##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -258.9981   26.12294  -9.915 2.472e-12
## year           0.1339    0.01306  10.253 9.343e-13

[Figure: log_price vs year with fitted regression line]

Linearity and normality

[Figure: residual plot and normal Q-Q plot]

The residuals from this model appear less normal, though the quadratic trend in the mean function is now less apparent.

Constant variance and influence

[Figure: residual spread and influence diagnostics]

There are no points flagged as influential and our variance has been stabilized.

Model interpretation

##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -258.9981   26.12294  -9.915 2.472e-12
## year           0.1339    0.01306  10.253 9.343e-13

\[ \widehat{\log(price)} = -258.99 + 0.13 \times year \]

For each additional year the truck is newer, we would expect the log price to increase on average by 0.13 log-dollars.

Which isn't very useful . . .

Working with logs

Two useful identities:

  • \[ \log(a) - \log(b) = \log\left(\frac{a}{b}\right) \]
  • \[ e^{\log(x)} = x \]

The slope coefficient for the log-transformed model is 0.13, meaning the log price difference between trucks that are one year apart is predicted to be 0.13 log-dollars.

\[ \begin{eqnarray} \log(\text{price at year } x + 1) - \log(\text{price at year } x) &=& 0.13 \\ \log\left(\frac{\text{price at year } x + 1}{\text{price at year } x}\right) &=& 0.13 \\ e^{\log\left(\frac{\text{price at year } x + 1}{\text{price at year } x}\right)} &=& e^{0.13} \\ \frac{\text{price at year } x + 1}{\text{price at year } x} &=& 1.14 \end{eqnarray} \]

For each additional year the truck is newer, we would expect its price to increase on average by a factor of 1.14.
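The exponentiation step can be checked directly in R:

```r
# Slope from the log-price model; exponentiate to get the
# multiplicative change in price per additional year
b1 <- 0.1339
exp(b1)  # approximately 1.14
```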

Transformations summary

  • If a linear model fit to the raw data leads to questionable residual plots, consider transformations.
  • Count data and prices often benefit from transformations.
  • The natural log and the square root are the most common, but you can use any transformation you like.
  • Transformations may change model interpretations.
  • Non-constant variance is a serious problem but it can often be solved by transforming the response.
  • Transformations can also fix non-linearity, as can polynomials - coming soon!