Data Imputation with Recipes

There are many ways to deal with missing data. In some cases, it may be reasonable to drop observations with missing values entirely. In many other cases, imputation can preserve a substantial portion of the data that would otherwise be discarded.

In this section, we explore a few imputation techniques. Before we do that, we briefly introduce recipes.

Tidy Recipes

The recipes package is part of the tidymodels framework. It provides various preprocessing functions that integrate seamlessly with the framework's modeling facilities. Since these labs focus heavily on tidymodels, it is useful to provide a brief introduction here. In the next lab, we will introduce modeling with tidymodels more comprehensively.

library(recipes)

grep("impute_", ls("package:recipes"), value = TRUE)

OUTPUT
[1] "step_impute_bag"    "step_impute_knn"    "step_impute_linear"
[4] "step_impute_lower"  "step_impute_mean"   "step_impute_median"
[7] "step_impute_mode"   "step_impute_roll"

Defining a Recipe

To use recipes for imputation, we must first define a recipe. The most straightforward way to do this is the formula method: the variable on the left of the formula is recorded as the outcome, and the variables on the right are recorded as predictors.

library(tidyverse)

ames <- as_tibble(read.csv("Datasets/ames.csv"))    

# creating a recipe
ames_recipe <- recipe(price ~ ., data = ames)
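
To see what the formula method records, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in for ames (so it runs without the CSV file), and the name toy_recipe is purely illustrative.

```r
library(recipes)

# mtcars stands in for the ames data here (an assumption for illustration)
toy_recipe <- recipe(mpg ~ ., data = mtcars)

# summary() lists each variable together with its type, role, and source
roles <- summary(toy_recipe)

# Count how many variables were assigned to each role
table(roles$role)
```

With mpg ~ ., mpg becomes the single outcome and every other column becomes a predictor, which mirrors the "Number of variables by role" block printed for a recipe.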

Once we have the recipe defined, imputation is as easy as adding the specific step to the recipe.

Median Imputing

Median imputation replaces each missing value in a column with the median of that column's observed values. To do this, we simply add step_impute_median() to the recipe. The example below uses the variable Lot.Frontage.

ames_recipe %>% step_impute_median(Lot.Frontage)

OUTPUT
-- Recipe ----------------------------------------------------------------------

-- Inputs
Number of variables by role
outcome:    1
predictor: 81

-- Operations
* Median imputation for: Lot.Frontage
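
Conceptually, median imputation amounts to the following base R sketch. The vector lot_frontage holds hypothetical values made up for illustration, not values from the ames data, and this is not how recipes implements the step internally.

```r
# Hypothetical lot frontage values with two missing entries
lot_frontage <- c(60, NA, 80, 75, NA, 68)

# Median of the observed values only
med <- median(lot_frontage, na.rm = TRUE)

# Replace each missing entry with that median
imputed <- ifelse(is.na(lot_frontage), med, lot_frontage)

med      # 71.5
imputed  # 60.0 71.5 80.0 75.0 71.5 68.0
```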

K-Nearest Neighbor Imputation

K-nearest neighbor imputation is another effective method, and it works for both numeric and non-numeric predictors. To apply it to all predictors, pass the all_predictors() selector to step_impute_knn(). The example below imputes only Garage.Yr.Blt, using five neighbors.

ames_recipe %>% step_impute_knn(Garage.Yr.Blt, neighbors = 5)

OUTPUT
-- Recipe ----------------------------------------------------------------------

-- Inputs
Number of variables by role
outcome:    1
predictor: 81

-- Operations
* K-nearest neighbor imputation for: Garage.Yr.Blt
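
As a sketch of the all-predictors variant, the following uses mtcars with a few artificially introduced missing values in place of ames. The data set, the choice of missing positions, and the name knn_recipe are assumptions for illustration.

```r
library(recipes)

# Toy data: mtcars with a few values set to NA (illustrative only)
cars_na <- mtcars
cars_na$hp[c(3, 10)] <- NA
cars_na$wt[c(5, 15)] <- NA

# all_predictors() selects every predictor as a target for KNN imputation
knn_recipe <- recipe(mpg ~ ., data = cars_na) %>%
  step_impute_knn(all_predictors(), neighbors = 5)

knn_recipe
```

As with the single-column example, this only adds the step to the recipe; no values are imputed yet.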

Many more imputation methods exist, as the list of step_impute_* functions above shows, and you may wish to explore them.

Prep and Bake

Notice that the steps above only define the imputation technique; they do not change the data. To perform the imputation itself, we need two further steps: `prep` and `bake`.

The prep() function estimates the parameters needed for imputation from the training data (for example, the median of Lot.Frontage), and the bake() function applies the imputation to the data passed as new_data.

sample_recipe <- recipe(price ~ ., data = ames) %>% 
                   step_impute_median( Lot.Frontage ) 
                   
impute_rec <- prep(sample_recipe, training = ames)

imputed_data <- bake(impute_rec, new_data = ames)

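# vis_miss() is provided by the visdat package; load it if it is not already attached
library(visdat)

# Lot.Frontage should no longer show any missing values
vis_miss(imputed_data, cluster = TRUE)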
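
The split between prep() and bake() matters most when the recipe is applied to data it was not trained on. The sketch below assumes mtcars with artificially missing hp values and a hypothetical split into the first 24 rows (training) and the remaining 8 rows (new data). None of these choices come from the ames example.

```r
library(recipes)

# Toy data: mtcars with a few artificially missing hp values (illustrative only)
cars_na <- mtcars
cars_na$hp[c(3, 10, 20, 30)] <- NA

# Hypothetical split: first 24 rows for training, remaining 8 as "new" data
train <- cars_na[1:24, ]
new   <- cars_na[25:32, ]

rec <- recipe(mpg ~ ., data = train) %>%
  step_impute_median(hp)

# prep() estimates the median of hp from the training rows only
prepped <- prep(rec, training = train)

# bake() fills missing hp values in the new data with that training median
baked_new <- bake(prepped, new_data = new)

# Row 30 of mtcars is the 6th row of `new`; it should now hold the training median
train_median <- median(train$hp, na.rm = TRUE)
isTRUE(all.equal(baked_new$hp[6], train_median))
```

Estimating the median on the training rows and reusing it for new rows keeps information from the new data out of the preprocessing, which is the main reason the two steps are separate.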