Missing value treatment & imputation with predictive method

Author : Analytics Educator

Hello, today I am going to teach you an important topic that every data scientist (or aspiring data scientist) should know.

In most datasets we run into missing data, and we can’t simply leave those missing values in place. There are different ways to handle them. First, let’s take a sample dataset to illustrate.

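Install and load the required package

Install the package VIM (this only needs to be done once), then load it.

# install.packages("VIM")  # run once if VIM is not installed yet
library(VIM)

Once the package is loaded, we can use its built-in dataset named sleep.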

head(sleep)
##    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000   5712.0   NA    NA   3.3 38.6  645    3   5      3
## 2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
## 3    3.385     44.5   NA    NA  12.5 14.0   60    1   1      1
## 4    0.920      5.7   NA    NA  16.5   NA   25    5   2      3
## 5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
## 6   10.550    179.5  9.1   0.7   9.8 27.0  180    4   4      4

Here we can see that several variables contain missing values. Let’s find the percentage of rows that have at least one missing value.

mean(!complete.cases(sleep)) * 100
## [1] 32.25806
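
Note that complete.cases() works at the row level, so the figure above is the share of rows with at least one missing value, not the share of missing cells. A minimal sketch of the difference, using a small made-up data frame (toy is hypothetical and not part of the sleep data):

```r
# Hypothetical toy data: 4 rows, 3 columns, 3 missing cells in total
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 8),
  c = c(1, 2, 3, 4)
)

# Share of rows with at least one missing value
pct_incomplete_rows <- mean(!complete.cases(toy)) * 100

# Share of individual cells that are missing
pct_missing_cells <- mean(is.na(toy)) * 100

pct_incomplete_rows
pct_missing_cells
```

The same two expressions can be run on sleep. They answer different questions, so it helps to say which one is being reported.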

As a rule of thumb, if 5% or less of the data are missing and the missing values are scattered across the variables, you can simply delete all rows with missing values by executing the following code.

sleep <- na.omit(sleep)

However, here around 32% of the rows have missing values. We can’t delete all of them, since removing 32% of the data would have an adverse impact on our analysis.

Now let’s see the percentage of missing values for each variable.

colMeans(is.na(sleep)) * 100  # sleep is the dataset name
##   BodyWgt  BrainWgt      NonD     Dream     Sleep      Span      Gest 
##  0.000000  0.000000 22.580645 19.354839  6.451613  6.451613  6.451613 
##      Pred       Exp    Danger 
##  0.000000  0.000000  0.000000

If a variable has around 35% or more of its values missing and is not important to the analysis, you may simply drop that variable.
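
The 35% rule of thumb can be sketched in base R as follows. The data frame toy and the name threshold are made up for illustration, and the code only checks the share of missing values; whether a variable is important still has to be judged by hand.

```r
# Hypothetical toy data: column a is 25% missing, column b is 50% missing
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 8),
  c = c(1, 2, 3, 4)
)

threshold <- 0.35                   # rule-of-thumb cutoff from the text
na_share  <- colMeans(is.na(toy))   # proportion of missing values per column

# Keep only the columns below the cutoff
kept <- toy[, na_share < threshold, drop = FALSE]

names(kept)
```

Here only b crosses the cutoff, so it is the only column dropped.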

Here no variable has a very high percentage of missing values, yet the overall amount of missing data is too large for us to simply delete all affected rows.

We can also look at the patterns of missing values.

library(mice)
md.pattern(sleep)
##    BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD   
## 42       1        1    1   1      1     1    1    1     1    1  0
##  2       1        1    1   1      1     1    0    1     1    1  1
##  3       1        1    1   1      1     1    1    0     1    1  1
##  9       1        1    1   1      1     1    1    1     0    0  2
##  2       1        1    1   1      1     0    1    1     1    0  2
##  1       1        1    1   1      1     1    0    0     1    1  2
##  2       1        1    1   1      1     0    1    1     0    0  3
##  1       1        1    1   1      1     1    0    1     0    0  3
##          0        0    0   0      0     4    4    4    12   14 38

The 1s and 0s in the body of the table indicate the missing-value patterns, with a 0 indicating a missing value for a given column variable and a 1 indicating a non-missing value. The first row describes the pattern of “no missing values” (all elements are 1). The second row describes the pattern “no missing values except for Span.” The first column indicates the number of cases in each missing data pattern, and the last column indicates the number of variables with missing values in each pattern. The last row gives the number of missing values for each variable, with the overall total (38) in the bottom-right corner. Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone.
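
To see where the pattern counts come from, here is a rough base R sketch that counts rows per pattern on a made-up data frame (toy is hypothetical). It uses the same 1 = observed, 0 = missing coding, but it is not how md.pattern is implemented.

```r
# Hypothetical toy data with three distinct missingness patterns
toy <- data.frame(
  a = c(1, NA, 3, 4, 5),
  b = c(NA, NA, 6, 8, 9),
  c = c(1, 2, 3, 4, 5)
)

# 1 = observed, 0 = missing; one string per row, columns in order a, b, c
observed <- ifelse(is.na(toy), 0L, 1L)
pattern  <- apply(observed, 1, paste, collapse = "")

# Number of rows sharing each pattern
pattern_counts <- table(pattern)
pattern_counts
```

The pattern "111" corresponds to the complete cases, like the first row of the md.pattern table.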

## An easier way to visualize
aggr(sleep, prop=FALSE, numbers=TRUE,sortVars=TRUE)

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      NonD    14
##     Dream    12
##     Sleep     4
##      Span     4
##      Gest     4
##   BodyWgt     0
##  BrainWgt     0
##      Pred     0
##       Exp     0
##    Danger     0

You can see that the variable NonD has the largest number of missing values (14).

## The same plot with proportions instead of counts
aggr(sleep, prop = TRUE, numbers = TRUE, sortVars = TRUE)

## 
##  Variables sorted by number of missings: 
##  Variable      Count
##      NonD 0.22580645
##     Dream 0.19354839
##     Sleep 0.06451613
##      Span 0.06451613
##      Gest 0.06451613
##   BodyWgt 0.00000000
##  BrainWgt 0.00000000
##      Pred 0.00000000
##       Exp 0.00000000
##    Danger 0.00000000

prop=TRUE reports proportions instead of counts. The variable NonD has about 22.6% of its values missing.

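Now let’s replace the missing values by Multiple Imputation (MI).

Multiple imputation fills in the missing data by repeated simulation (Monte Carlo methods), producing several completed datasets. The mice package generates the imputations with Gibbs sampling, predicting each variable with missing values in turn from the other variables. Standard statistical methods can then be applied to each of the completed datasets. By default, predictive mean matching (pmm) is used to replace missing numeric data. The function returns an object containing several imputed datasets (the default is 5).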

replaced_Data <- mice(sleep, m = 5, maxit = 50, method = "pmm", seed = 500, printFlag = FALSE)

Here is an explanation of the parameters used:

m - the number of imputed datasets to generate (5 here)
maxit - the number of iterations used to impute the missing values
method - the imputation method; we used predictive mean matching (pmm)
seed - an integer that seeds the random number generator, so the results are reproducible
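
To make the idea behind predictive mean matching concrete, here is a simplified single-variable sketch in base R. It is not how mice implements pmm (mice also draws the regression parameters at random and cycles over all incomplete variables). The function name pmm_impute, the toy vectors x and y, and the choice of k = 3 donors are illustrative assumptions.

```r
# Simplified predictive mean matching for one incomplete variable y
# with one complete predictor x (illustration only, not mice's code).
pmm_impute <- function(y, x, k = 3) {
  obs <- !is.na(y)
  dat <- data.frame(x = x, y = y)

  # 1. Fit a regression on the observed cases and predict for every case
  fit  <- lm(y ~ x, data = dat[obs, ])
  pred <- predict(fit, newdata = dat)

  k <- min(k, sum(obs))
  for (i in which(!obs)) {
    # 2. Find the k observed cases whose predictions are closest
    gap    <- abs(pred[obs] - pred[i])
    donors <- which(obs)[order(gap)[seq_len(k)]]
    # 3. Borrow the observed value of one randomly chosen donor
    y[i] <- dat$y[donors[sample.int(k, 1)]]
  }
  y
}

# Hypothetical toy data: y is roughly linear in x, with three gaps
set.seed(500)
x <- 1:10
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.0, 18.2, NA)
imputed <- pmm_impute(y, x)
imputed
```

Because every imputed value is copied from a real observation, this approach never produces values outside the observed range.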

Since there are 5 imputed datasets, you can select any one of them using the complete() function.

# get the completed data (2nd of the 5 imputed datasets)
total_data2 <- complete(replaced_Data, 2)
head(total_data2)
##    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000   5712.0  3.2   0.0   3.3 38.6  645    3   5      3
## 2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1      3
## 3    3.385     44.5 11.0   1.5  12.5 14.0   60    1   1      1
## 4    0.920      5.7 15.2   1.3  16.5  3.0   25    5   2      3
## 5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
## 6   10.550    179.5  9.1   0.7   9.8 27.0  180    4   4      4
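
Picking a single completed dataset ignores the variation between the imputations. The usual practice in multiple imputation is to run the analysis on every completed dataset and combine the results with Rubin's rules (mice provides with() and pool() for this). Below is a minimal base R sketch of the combination step only. The helper name pool_rubin is hypothetical, and the numbers are made-up placeholders, not results from the sleep data.

```r
# Hypothetical per-dataset results (placeholders, m = 3 imputations)
estimates <- c(1, 2, 3)        # one coefficient estimate per imputed dataset
variances <- c(0.5, 0.5, 0.5)  # its squared standard error in each dataset

pool_rubin <- function(estimates, variances) {
  m       <- length(estimates)
  q_bar   <- mean(estimates)   # pooled point estimate
  within  <- mean(variances)   # average within-imputation variance
  between <- var(estimates)    # between-imputation variance
  total   <- within + (1 + 1 / m) * between
  list(estimate = q_bar, variance = total, se = sqrt(total))
}

pooled <- pool_rubin(estimates, variances)
pooled
```

The between-imputation term is what a single completed dataset cannot show: it grows when the imputations disagree with each other.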

The end!