Dealing with Missing Values

Handling missing values is a common and crucial preprocessing step in data analytics. Gaps in the data can arise for various reasons, such as technical glitches or human error during data collection. Addressing them involves a trade-off: we want to retain the useful information in incomplete observations without introducing bias into the dataset. Handling missing values properly protects the integrity and reliability of the analysis and helps us draw accurate and meaningful insights.

First, here is how we can identify missing data:

library(tidyverse)

# Load the Ames housing data as a tibble
ames <- as_tibble(read.csv("Datasets/ames.csv"))

# Total number of missing cells across the whole dataset
sum(is.na(ames))

OUTPUT
13960

# Count missing values in each column (across() supersedes summarise_all())
ames %>% summarise(across(everything(), ~ sum(is.na(.x)))) %>% glimpse()

OUTPUT
Rows: 1
Columns: 82
$ Order 0
$ PID 0
$ area 0
$ price 0
$ MS.SubClass 0
$ MS.Zoning 0
$ Lot.Frontage 490
$ Lot.Area 0
$ Street 0
$ Alley 2732
$ Lot.Shape 0
$ Land.Contour 0
$ Utilities 0
$ Lot.Config 0
$ Land.Slope 0
$ Neighborhood 0
$ Condition.1 0
$ Condition.2 0
$ Bldg.Type 0
$ House.Style 0
$ Overall.Qual 0
$ Overall.Cond 0
$ Year.Built 0
...
$ Mo.Sold 0
$ Yr.Sold 0
$ Sale.Type 0
$ Sale.Condition 0
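With 82 columns, most of which have no missing values, the per-column counts are easier to read when filtered down to the affected columns and sorted. The following is a minimal sketch of that idea in base R. It assumes a small made-up data frame, toy, that only borrows three column names from ames; its values, and the names na_count, na_share and na_summary, are illustrative and do not come from the dataset.

```r
# Toy data frame standing in for ames (hypothetical values, not from the dataset)
toy <- data.frame(
  Lot.Frontage = c(80, NA, 65, NA),
  Alley        = c(NA, NA, NA, "Grvl"),
  Lot.Area     = c(9600, 11250, 8400, 10000)
)

# Count and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- colMeans(is.na(toy))

# Build a long summary table: one row per column
na_summary <- data.frame(
  column   = names(na_count),
  n_miss   = as.integer(na_count),
  pct_miss = round(100 * as.numeric(na_share), 1)
)

# Keep only columns with at least one NA, most affected first
na_summary <- na_summary[na_summary$n_miss > 0, ]
na_summary <- na_summary[order(-na_summary$n_miss), ]
print(na_summary)
```

Replacing toy with ames should give the same kind of table for the real data. Reporting the percentage alongside the count makes it easier to judge whether a column such as Alley is mostly empty or only occasionally missing.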

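We can also visualize missing values at the feature (column) level. The visdat package provides vis_miss(), which plots every cell as present or missing and so puts the missingness of each feature into context. Setting cluster = TRUE arranges rows with similar missingness patterns next to each other.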

library(visdat)
vis_miss(ames, cluster = TRUE)
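As a rough numeric counterpart to the plot, one can tabulate which combinations of columns are missing together in each row. This is a sketch only, not what vis_miss() does internally. It uses base R and the built-in airquality dataset as a stand-in so that it runs without ames.csv; the names miss, pattern and pattern_counts are illustrative.

```r
# airquality ships with base R; it is used here only as a stand-in for ames
miss <- is.na(airquality)

# Encode each row's missingness as a label such as "Ozone+Solar.R",
# or "complete" when the row has no missing values
pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(colnames(miss)[r], collapse = "+") else "complete"
})

# How many rows share each pattern, most common first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```

Rows that share a label here are the kind of rows that a clustered missingness plot tends to place together, so the table can help confirm what the picture suggests.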