Linear Regression

This note covers the fundamentals of linear regression, a core technique in predictive modeling. Our primary focus is applying linear regression models using the tidymodels and tidyverse packages in R.

Dataset: California Housing

For this exercise, we will be using the California Housing dataset, a comprehensive dataset that includes various features of residential homes in California. This dataset is widely used for practicing and understanding regression models due to its rich and detailed attributes.

library(tidyverse)  # data wrangling

housing <- read.csv('../Datasets/housing.csv')
housing <- as_tibble(housing)

# preview the first 10 rows
slice_head(housing, n = 10)
OUTPUT
# A tibble: 10 x 10
   longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
     -122.23    37.88                 41         880            129        322        126        8.3252             452600 NEAR BAY
     -122.22    37.86                 21        7099           1106       2401       1138        8.3014             358500 NEAR BAY
     -122.24    37.85                 52        1467            190        496        177        7.2574             352100 NEAR BAY
     -122.25    37.85                 52        1274            235        558        219        5.6431             341300 NEAR BAY
     -122.25    37.85                 52        1627            280        565        259        3.8462             342200 NEAR BAY
     -122.25    37.85                 52         919            213        413        193        4.0368             269700 NEAR BAY
     -122.25    37.84                 52        2535            489       1094        514        3.6591             299200 NEAR BAY
     -122.25    37.84                 52        3104            687       1157        647        3.1200             241400 NEAR BAY
     -122.26    37.84                 42        2555            665       1206        595        2.0804             226700 NEAR BAY
     -122.25    37.84                 52        3549            707       1551        714        3.6912             261100 NEAR BAY
glimpse(housing)
OUTPUT
Rows: 20,640
Columns: 10
$ longitude          <dbl> -122.23, -122.22, -122.24, -122.25, -122.25, -122.2~
$ latitude           <dbl> 37.88, 37.86, 37.85, 37.85, 37.85, 37.85, 37.84, 37~
$ housing_median_age <dbl> 41, 21, 52, 52, 52, 52, 52, 52, 42, 52, 52, 52, 52,~
$ total_rooms        <dbl> 880, 7099, 1467, 1274, 1627, 919, 2535, 3104, 2555,~
$ total_bedrooms     <dbl> 129, 1106, 190, 235, 280, 213, 489, 687, 665, 707, ~
$ population         <dbl> 322, 2401, 496, 558, 565, 413, 1094, 1157, 1206, 15~
$ households         <dbl> 126, 1138, 177, 219, 259, 193, 514, 647, 595, 714, ~
$ median_income      <dbl> 8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6~
$ median_house_value <dbl> 452600, 358500, 352100, 341300, 342200, 269700, 299~
$ ocean_proximity    <chr> "NEAR BAY", "NEAR BAY", "NEAR BAY", "NEAR BAY", "NE~
dim(housing)
OUTPUT
[1] 20640    10
names(housing)
OUTPUT
 [1] "longitude"          "latitude"
 [3] "housing_median_age" "total_rooms"
 [5] "total_bedrooms"     "population"
 [7] "households"         "median_income"
 [9] "median_house_value" "ocean_proximity"

The housing dataset contains $20640$ observations and $10$ variables.

Exploring the California Housing Dataset

Our first step in exploring the dataset is to look at the dependent variable `median_house_value`. The goal is to understand its structure and use that information to inform the modeling process.

1. Distribution of the Median House Value

To begin, we will visualize the distribution of the `median_house_value`. This will help us understand the range, central tendency, and spread of the house values in the dataset. Visualizing the distribution can also reveal any potential skewness or outliers that might affect our regression model. We can use histograms or density plots to achieve this.

library(ggthemr)
ggthemr('fresh')

options(repr.plot.width = 12, repr.plot.height = 10)

# visualize the distribution of the median_house_value
ggplot( data = housing, aes(x = median_house_value) ) + 
    geom_histogram(  col = 'black', binwidth = 5000) + 
    labs(title = "Distribution of Median House Value", x = "Median House Value", y = "Frequency") 
Distribution of Median House Value
ggplot( data = housing, aes(x = median_house_value) ) + 
    geom_histogram( bins = 40, col = 'black') +
    ggtitle('Distribution of log_10(Median House Value)') + scale_x_log10()
Distribution of Median House Value Log Scale

We notice that the log transformation has shifted the originally right-skewed distribution toward a left-skewed one. A choice can be made between using the original values or the log-transformed values, with an understanding of the impact this decision will have on model performance and interpretation.
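If we opt for the log scale, one approach is to add the transformed column up front so either response can be modeled later. This is a sketch; the column name `log_median_house_value` is our own choice, not part of the dataset.

```r
library(tidyverse)

# add a log10-transformed copy of the response (new column name is our own choice)
housing <- housing %>%
    mutate(log_median_house_value = log10(median_house_value))
```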

Exploring the Dependent Variable Against a Sample of Numerical Variables

There are a number of features that we can use to predict `median_house_value`. Let's look at the relationship between longitude and latitude and the price of a home.

ggplot(data = housing, aes( x = longitude, y = latitude, color = median_house_value)) +
    geom_point( size = 2.5, alpha = .4 ) +
    scale_color_viridis_c() +
    ggtitle('Relationship between Latitude and Longitude by Median House Price')
Relationship between Latitude and Longitude by Median House Price

General Observations

1. One useful observation is that `median_house_value` tends to be higher for homes located toward the west (lower `longitude` values), which corresponds to the coastal area. This suggests that `ocean_proximity` is an important factor in determining house values.

2. However, it is important to note that not all houses near the coast are more expensive than those further inland. Therefore, we should consider additional factors in our analysis to gain a more comprehensive understanding of house prices.
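To put a rough number on the coastal-proximity observation, we can compare the median of `median_house_value` within each `ocean_proximity` group. This is a quick diagnostic sketch, not part of the modeling pipeline:

```r
# median house value by ocean proximity, highest first
housing %>%
    group_by(ocean_proximity) %>%
    summarise(n = n(), median_value = median(median_house_value)) %>%
    arrange(desc(median_value))
```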

Relationship Between Population and Median House Value

Now we look at the relationship between population and median house value to determine whether population density influences the value of a home.

ggplot(housing, aes( x = population, y = median_house_value, color = median_house_value )) +
    geom_point(size = 2.5, alpha = .4 ) +
    scale_color_viridis_c() +
    ggtitle('Relationship between Population and Median House Price')
Population vs. Median House Prices

General Observation

1. There is no obvious relationship between the population of an area and the `median_house_value` variable.

2. We also observe a few outlier points with unusually high population counts.
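To inspect those high-population outliers directly, a quick `slice_max` on `population` surfaces the most extreme districts (the cutoff of five rows here is arbitrary):

```r
# the five districts with the largest populations
housing %>%
    slice_max(population, n = 5) %>%
    select(population, households, median_house_value, ocean_proximity)
```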

Relationship Between Median Income and Median House Value

Now we look at the relationship between Median Income and Median House Value to determine whether the value of a home increases with a rise in median income.

ggplot(housing, aes(x = median_income, y = median_house_value, color = median_house_value)) +
    geom_point(alpha = 0.4) +
    scale_color_viridis_c() +
    ggtitle('Median Income vs. Median House Value')
Scatter Plot Median Income vs. Median House Value
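To quantify what the scatter plot suggests, we can compute the Pearson correlation between the two variables with base R's `cor`:

```r
# linear correlation between median income and median house value
cor(housing$median_income, housing$median_house_value)
```

A value near 1 would indicate a strong positive linear relationship, supporting `median_income` as a predictor.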

Implementing Simple Linear Regression

The mathematical formula for simple linear regression is:

$$ y = \beta_0 + \beta_1 x + \epsilon $$

Where:
- $y$ is the dependent variable (response),
- $x$ is the independent variable (predictor),
- $\beta_0$ is the y-intercept of the regression line,
- $\beta_1$ is the slope of the regression line, representing the change in $y$ for a one-unit change in $x$,
- $\epsilon$ is the error term, accounting for the variation in $y$ that cannot be explained by the linear relationship with $x$.
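For a single predictor, the least-squares estimates have the closed form $\hat{\beta}_1 = \operatorname{Cov}(x, y)/\operatorname{Var}(x)$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. As a sketch, they can be computed directly in base R on the full dataset (the fitted model later uses only the training split, so its coefficients will differ slightly):

```r
# closed-form simple-linear-regression estimates on the full dataset
x <- housing$median_income
y <- housing$median_house_value
beta1 <- cov(x, y) / var(x)           # slope
beta0 <- mean(y) - beta1 * mean(x)    # intercept
c(beta0 = beta0, beta1 = beta1)
```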

In this section, we will build a simple linear regression model using the California Housing dataset. We will explore how the `tidymodels` and `tidyverse` packages in R can be used to fit the model, evaluate its performance, and interpret the results. Our primary focus will be on predicting the `median_house_value` using relevant predictor variables from the dataset.

Building a Model with tidymodels

To develop a model, we will use the tidymodels framework, which provides a cohesive set of packages for modeling and machine learning. You can learn more about tidymodels at tidymodels.org.

The tidymodels workflow involves a few key steps:

1. Specify the model: Define the type of model you want to build, such as regression or classification.

2. Specify the engine: Choose the function or algorithm to use for modeling, such as `lm` for linear regression or `glm` for generalized linear models.

3. Fit the model and estimate parameters: Apply the model to your data and estimate the parameters based on the specified engine.

Let's demonstrate how this works.

library(tidymodels)
OUTPUT
-- Attaching packages -------------------------------------- tidymodels 1.2.0 --
v broom        1.0.7     v rsample      1.2.1
v dials        1.2.1     v tune         1.2.1
v infer        1.0.7     v workflows    1.1.4
v modeldata    1.4.0     v workflowsets 1.1.0
v parsnip      1.2.1     v yardstick    1.3.1
v recipes      1.1.0

Splitting the Data into Train and Test

A fundamental aspect of statistical modeling is the process of splitting your data into training and testing sets. This allows you to build and validate your model effectively. The training set is used to fit the model, while the testing set is used to evaluate its performance on unseen data. This helps in assessing the model’s generalizability and ensures that it performs well on new data.

Splitting the data into train and test sets involves the following steps:

1. __Set a Seed for Reproducibility__: Setting a seed ensures that the split is reproducible.

2. __Use an Initial Split__: Create an initial split of the data, typically into 80% training and 20% testing sets.

3. __Extract Training and Testing Data__: Separate the data into training and testing sets based on the initial split.

Here is how you can implement this using the tidymodels framework:

# Split the data into training and testing sets
set.seed(503)  # For reproducibility
data_split <- initial_split(housing, prop = 0.8)

# extract training and testing data
train_data <- training(data_split)
test_data <- testing(data_split)

Defining Linear Regression Model with tidymodels

As noted earlier, defining a linear regression model with the tidymodels framework involves a few systematic steps. Below, we demonstrate these steps first separately and then collectively.

# Specifying the model
linear_reg()
OUTPUT
Linear Regression Model Specification (regression)

Computational engine: lm

Specifying the Engine

linear_reg() %>% set_engine("lm")  # uses lm: linear model

Fitting the SLR model

To fit the model, we simply add `fit()` to the model-and-engine pipeline. The model below fits a linear regression of the form:

$$ \text{median\_house\_value} = \beta_0 + \beta_1 \cdot \text{median\_income} + \epsilon $$

linear_reg() %>% 
   set_engine("lm") %>%
   fit(median_house_value ~ median_income, data = train_data)
OUTPUT
parsnip model object

Call:
stats::lm(formula = median_house_value ~ median_income, data = data)

Coefficients:
  (Intercept)  median_income
        45186          41824

We can now see the outcome of fitting the model: the parsnip model object is printed along with the estimated coefficients.

Storing the fit in an object lets us reuse it; `tidy()` (from the broom package) then returns the coefficient table as a tibble.

slr_model <- linear_reg() %>% 
    set_engine("lm") %>%
    fit(median_house_value ~ median_income, data = train_data)

tidy(slr_model)
OUTPUT
# A tibble: 2 x 5
  term          estimate std.error statistic      p.value
  (Intercept)   45185.70 1487.4435   30.3781 2.76119e-197
  median_income 41823.82  345.3143  121.1181  0.00000e+00

Summary of the Model

We can also retrieve the full model summary by plucking the underlying `fit` object and piping it to `summary()`, as shown below.

slr_model %>% pluck('fit') %>% summary()
OUTPUT
Call:
stats::lm(formula = median_house_value ~ median_income, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-457077  -55783  -16981   36750  433908

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    45185.7     1487.4   30.38   <2e-16 ***
median_income  41823.8      345.3  121.12   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 83850 on 16510 degrees of freedom
Multiple R-squared:  0.4705,    Adjusted R-squared:  0.4705
F-statistic: 1.467e+04 on 1 and 16510 DF,  p-value: < 2.2e-16

Model Statistics

Another useful function is `glance`, which extracts model-level statistics.

glance(slr_model)
OUTPUT
# A tibble: 1 x 12
  r.squared adj.r.squared  sigma statistic p.value    df   logLik     AIC     BIC deviance df.residual  nobs
      0.470         0.470 83850.    14670.       0     1 -211000. 421000. 421000.  1.16e14       16510 16512

Augmenting Model Fit

Another important function within the tidymodels ecosystem is `augment`, which returns additional information about the model fit. Specifically, `augment` adds columns to the original dataset containing fitted values, residuals, and other diagnostic metrics.

train_aug <- augment(slr_model$fit) 
train_aug %>% head()
OUTPUT
# A tibble: 6 x 8
  median_house_value median_income  .fitted      .resid         .hat   .sigma      .cooksd  .std.resid
              185400        3.2125 179544.7    5855.271 6.791192e-05 83856.33 1.655860e-07  0.06982951
              500000        5.1155 259135.5  240864.538 8.683360e-05 83835.38 3.582890e-04  2.87255902
              175400        3.4151 188018.2  -12618.235 6.408422e-05 83856.28 7.256522e-07 -0.15048381
              268400        6.1504 302418.9  -34018.935 1.486844e-04 83855.92 1.223941e-05 -0.40572357
              180400        3.7167 200632.3  -20232.300 6.096492e-05 83856.19 1.774794e-06 -0.24128800
              237000        5.6642 282084.2  -45084.193 1.151026e-04 83855.61 1.664018e-05 -0.53768328

Visualizing the Model

One of the ways to visualize our model is to plot the regression line against the observation data. This helps in understanding how well the model fits the data and in identifying any deviations or patterns that the model might not have captured.

ggplot(train_aug, aes(x = median_income, y = median_house_value)) +
    # scatterplot points
    geom_point(alpha = 0.5) +
    # pass the fitted coefficients to geom_abline
    geom_abline(slope = 41823.8, intercept = 45185.7, color = 'red', linewidth = 1.09) +
    labs(title = "Fitted Regression Line (Training Data)", x = "Median Income", y = "Median House Value")
Fitted Regression Line over the Training Data

Predicting Values on the Test Set

Now that we have a trained model, we can generate predictions on the test set. We use the `predict` function, just as in base R, and pass `test_data` via the `new_data` argument.

test_predictions <- predict(slr_model, new_data = test_data)
slice_head(test_predictions, n = 10)
OUTPUT
# A tibble: 10 x 1
      .pred
   206048.5
   199565.8
   157005.9
   133860.6
   145721.8
   148059.8
   120803.2
   125796.9
   127223.1
   120338.9
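A natural next step, sketched here, is to score these predictions with yardstick (loaded as part of tidymodels); its `metrics()` function reports RMSE, R-squared, and MAE by default for numeric outcomes:

```r
# bind predictions to the observed values and compute test-set metrics
test_results <- test_data %>%
    select(median_house_value) %>%
    bind_cols(test_predictions)

test_results %>%
    metrics(truth = median_house_value, estimate = .pred)
```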