Data Imputation with Recipes
There are many ways to deal with missing data. In some cases, it may be reasonable to do away completely with missing observations. In many cases, imputing techniques can greatly enrich the dataset, preserving a proportion of the data that would otherwise not be regarded.
In this section, we explore a few techniques that can provide imputing facilities. But before we do that, we
briefly have to introduce
Tidy Recipes
The
library(recipes)
grep("impute_", ls("package:recipes"), value = TRUE)
Defining a Recipe
In order to use recipes to perform imputation, we must first define the recipe. The most straight forward way to achieve this is using the formula method.
library(tidyverse)
ames <- as_tibble(read.csv("Datasets/ames.csv"))
# creating a recipe
ames_recipe <- recipe(price ~ ., data = ames)
Once we have the recipe defined, imputation is as easy as adding the specific step to the recipe.
Median Imputing
In the example below, we can impute using median value for missing value of a column. To do this, we simply add
the function
ames_recipe %>% step_impute_median(Lot.Frontage )
K-Nearest Neighbor Imputation
The K-nearest neighbor imputation is another effective impute function that can work for both numeric and
non-numeric predictions. To apply this on all predictors, you can simply add
ames_recipe %>% step_impute_knn( Garage.Yr.Blt, neighbors = 5 )
Many More imputation methods exist and you may wish to explore them.
Prep and Bake
Notice that the above imputation techniques only define the model or technique for imputation. In order to perform the imputation itself, we need two steps: `prep` and `bake`.
The prep step estimates the parameters necessary for imputation and the bake function implements the imputation.
sample_recipe <- recipe(price ~ ., data = ames) %>%
step_impute_median( Lot.Frontage )
impute_rec <- prep(sample_recipe, training = ames)
imputed_data <- bake(impute_rec, new_data = ames)
vis_miss(imputed_data, cluster = TRUE)