Log Transformation
Log transformation is an effective transformation typically applied to variables with large scales, such as home prices. The fundamental idea, particularly in regression, is to reduce skew so that the transformed data is closer to a normal distribution than the original. This matters for linear regression, which makes assumptions about the distribution of the errors.
Let's look at an example using the Ames Housing dataset:
# loading data processing and visualization packages
library(tidyverse)
library(ggthemr)
# applying a visualization theme
ggthemr('fresh')
ames <- as_tibble(read.csv("../Datasets/ames.csv"))
head(ames)
| Order | PID | area | price | MS.SubClass | MS.Zoning | Lot.Frontage |
|---|---|---|---|---|---|---|
| 1 | 526301100 | 1,656 | 215,000 | 20 | RL | 141 |
| 2 | 526350040 | 896 | 105,000 | 20 | RH | 80 |
| 3 | 526351010 | 1,329 | 172,000 | 20 | RL | 81 |
| 4 | 526353030 | 2,110 | 244,000 | 20 | RL | 93 |
| 5 | 527105010 | 1,629 | 189,900 | 60 | RL | 74 |
| 6 | 527105030 | 1,604 | 195,500 | 60 | RL | 78 |
Let's visualize the response variable, sale price, first on its original scale and then on the log scale:
# histogram of sale price on the original scale
ames %>%
  ggplot(aes(x = price)) +
  geom_histogram(color = 'black', bins = 30) +
  ggtitle("Distribution of Sale Price")
# histogram of sale price on the log scale
ames %>%
  ggplot(aes(x = log(price))) +
  geom_histogram(color = 'black', bins = 30) +
  ggtitle("Distribution of Log Sale Price")
The log-transformed data is now much closer to a normal distribution. However, the log transformation has also introduced a left skew.
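One way to put a number on the skew seen in the two histograms is the sample skewness, the third standardized moment. The sketch below is a minimal base R illustration, not part of the original analysis: `sample_skewness` is a hypothetical helper, and the log-normal data is simulated as a stand-in for a right-skewed, price-like variable.

```r
# Hypothetical helper: moment-based sample skewness (third standardized moment).
# Positive values indicate a right skew, negative values a left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed data standing in for a price-like variable (assumption).
set.seed(42)
simulated_price <- rlnorm(2000, meanlog = 12, sdlog = 0.4)

skew_raw <- sample_skewness(simulated_price)       # clearly positive
skew_log <- sample_skewness(log(simulated_price))  # close to zero

c(raw = skew_raw, log = skew_log)
```

With the data loaded above, the same helper could be applied as `sample_skewness(ames$price)` and `sample_skewness(log(ames$price))`. A negative value on the log scale would correspond to the left skew visible in the second histogram.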
Box-Cox Transformation
An alternative is the Box-Cox transformation, which aims for a similar result to the log transformation: bringing the data closer to a normal distribution. It is particularly useful with non-normal data and is defined for strictly positive values. The formal definition is:
$$ Y(\lambda) = \begin{cases} \frac{Y^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \ln(Y) & \text{if } \lambda = 0 \end{cases} $$
where $\lambda$ is a parameter that is estimated from the data, determining the exact nature of the transformation. The Box-Cox transformation adjusts the data such that it approximates a normal distribution, which is beneficial for many statistical modeling techniques that assume normality of the input data.
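To make the piecewise definition concrete, here is a minimal base R sketch that writes it out directly. The name `box_cox` and the small vector `y` are illustrative assumptions; the function assumes strictly positive input with no missing values.

```r
# Box-Cox transform written directly from the piecewise definition.
# Assumes y is strictly positive; `box_cox` is an illustrative name.
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (lambda == 0) {
    log(y)
  } else {
    (y^lambda - 1) / lambda
  }
}

y <- c(1, 10, 100, 1000)

box_cox(y, 1)     # lambda = 1 only shifts the data: y - 1
box_cox(y, 0.5)   # square-root-like compression of large values
box_cox(y, 0)     # natural log
box_cox(y, 1e-6)  # nearly identical to log(y): the two cases join smoothly
```

The last call illustrates why the $\lambda = 0$ case is defined as $\ln(Y)$: it is the limit of the first case as $\lambda$ approaches zero, so the log transformation is a special case of Box-Cox.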
Let's see how to implement it. We first estimate the lambda parameter from the data and then transform the variable using the definition above.
library(forecast)
# estimating the lambda parameter from the data
boxcox_lambda <- BoxCox.lambda(ames$price)
boxcox_lambda
# histogram of sale price after the Box-Cox transformation
ames %>%
  ggplot(aes(x = BoxCox(price, boxcox_lambda))) +
  geom_histogram(color = 'black', bins = 30) +
  ggtitle("Distribution of Box-Cox Sale Price")
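To show what "estimating lambda from the data" can look like, the sketch below picks lambda by maximizing the Box-Cox profile log-likelihood in base R. This is one common approach and an assumption on my part, not necessarily what `BoxCox.lambda()` does: according to the forecast documentation its default is Guerrero's method, with `method = "loglik"` as an option. The helpers `box_cox`, `box_cox_loglik`, and `estimate_lambda` are hypothetical names, and the data is simulated.

```r
# Box-Cox transform from the piecewise definition (y must be strictly positive).
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for an intercept-only normal model,
# up to an additive constant:
#   -n/2 * log(variance of transformed data) + (lambda - 1) * sum(log(y))
box_cox_loglik <- function(lambda, y) {
  z <- box_cox(y, lambda)
  n <- length(y)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(y))
}

# Hypothetical helper: search lambda in [-1, 2], the default range
# used by BoxCox.lambda().
estimate_lambda <- function(y, lower = -1, upper = 2) {
  optimize(box_cox_loglik, interval = c(lower, upper), y = y,
           maximum = TRUE)$maximum
}

# Simulated log-normal data (assumption): the log is the natural transform
# here, so the estimate should land near lambda = 0.
set.seed(1)
y_sim <- rlnorm(5000, meanlog = 12, sdlog = 0.5)
lambda_hat <- estimate_lambda(y_sim)
lambda_hat
```

An estimate near 0 says the log transformation is already appropriate, while an estimate near 1 says the data needs little or no transformation. On the housing data, `estimate_lambda(ames$price)` could be compared against `boxcox_lambda`, keeping in mind that the two methods need not agree.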