Hello! Today I am going to cover an important topic that every data scientist (or aspiring data scientist) should know: handling missing data.
Most real datasets have missing values, and we can’t simply ignore them. There are several ways to handle them. First, let’s take a sample dataset to work with.
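Install the VIM package if you don’t have it yet, then load it.
# install.packages("VIM")  # run once if the package is not installed
library(VIM)
Once the package is loaded, its built-in dataset sleep is available. (It masks the unrelated sleep dataset that ships with base R.)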
head(sleep)
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3
## 2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
## 3 3.385 44.5 NA NA 12.5 14.0 60 1 1 1
## 4 0.920 5.7 NA NA 16.5 NA 25 5 2 3
## 5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
## 6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
Here we can see several missing values (NA) across the variables. Let’s find the percentage of rows that contain at least one missing value.
mean(!complete.cases(sleep))*100
## [1] 32.25806
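Note that this figure counts incomplete rows, not missing cells. As a minimal sketch (the variable names rows_with_na and pct_cells_missing are my own, not from any package), the two quantities can be compared like this:

```r
# Load the VIM version of the sleep data without attaching the package
data(sleep, package = "VIM")

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(sleep))
rows_with_na

# Percentage of individual cells that are NA (is.na() returns a logical matrix)
pct_cells_missing <- mean(is.na(sleep)) * 100
pct_cells_missing
```

The share of missing cells is much smaller than the share of incomplete rows, because most incomplete rows are missing only one or two values.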
As a rule of thumb, if 5% or less of the data is missing and the gaps are scattered across the variables, you can simply delete all rows with missing values using the following code. (The result is stored under a new name so that the original sleep stays intact for the steps below.)
sleep_complete <- na.omit(sleep)
However, here about 32% of the rows have missing values. We can’t delete them all, since removing 32% of the data would have an adverse impact on our analysis.
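To see exactly what listwise deletion would cost, here is a small sketch. It relies on the "na.action" attribute that na.omit() attaches to its result; the names kept and dropped are just illustrative.

```r
data(sleep, package = "VIM")

# Listwise deletion into a separate object, leaving sleep untouched
kept <- na.omit(sleep)

# Row counts before and after deletion
nrow(sleep)
nrow(kept)

# na.omit() records the indices of the removed rows in this attribute
dropped <- attr(kept, "na.action")
length(dropped)
```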
Now let’s see the percentage of missing values for each variable.
colMeans(is.na(sleep)) * 100  # sleep is the dataset name
## BodyWgt BrainWgt NonD Dream Sleep Span Gest
## 0.000000 0.000000 22.580645 19.354839 6.451613 6.451613 6.451613
## Pred Exp Danger
## 0.000000 0.000000 0.000000
If a variable has around 35% or more missing values and is not important for the analysis, you may simply drop that variable.
Here, no variable has a very high percentage of missing values, yet the missingness is not small enough to justify deleting every incomplete row.
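The variable-dropping rule can be sketched in a few lines. The 35% cutoff is only the rule of thumb mentioned above, and threshold, too_sparse and sleep_reduced are hypothetical names chosen for this example.

```r
data(sleep, package = "VIM")

# Rule-of-thumb cutoff in percent (an assumption, not a fixed standard)
threshold <- 35

# Percentage of missing values per variable
pct_missing <- colMeans(is.na(sleep)) * 100

# Variables at or above the cutoff
too_sparse <- names(pct_missing)[pct_missing >= threshold]
too_sparse

# Keep only the variables below the cutoff
sleep_reduced <- sleep[, !(names(sleep) %in% too_sparse), drop = FALSE]
ncol(sleep_reduced)
```

For sleep, no variable reaches the cutoff, so nothing is dropped. Whether a sparse variable is "not important" remains a judgment call that code cannot make for you.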
We can also tabulate the missing-value patterns with md.pattern() from the mice package.
library(mice)
md.pattern(sleep)
## BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
## 42 1 1 1 1 1 1 1 1 1 1 0
## 2 1 1 1 1 1 1 0 1 1 1 1
## 3 1 1 1 1 1 1 1 0 1 1 1
## 9 1 1 1 1 1 1 1 1 0 0 2
## 2 1 1 1 1 1 0 1 1 1 0 2
## 1 1 1 1 1 1 1 0 0 1 1 2
## 2 1 1 1 1 1 0 1 1 0 0 3
## 1 1 1 1 1 1 1 0 1 0 0 3
## 0 0 0 0 0 4 4 4 12 14 38
The 1s and 0s in the body of the table indicate the missing-value patterns: a 0 marks a missing value for a given column variable and a 1 marks a non-missing value. The first row describes the pattern of “no missing values” (all elements are 1). The second row describes the pattern “no missing values except for Span.” The first column gives the number of cases in each pattern, and the last column gives the number of variables with missing values in that pattern. The bottom row gives the number of missing values in each variable (38 in total). Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone.
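If you want to double-check a pattern count without reading the table, base R is enough. This is a minimal sketch; n_missing is a name I made up for the per-row NA count.

```r
data(sleep, package = "VIM")

# Number of missing values in each row
n_missing <- rowSums(is.na(sleep))

# Cases with no missing values (first row of the pattern table)
complete_cases <- sum(n_missing == 0)
complete_cases

# Cases missing Span and nothing else (second row of the pattern table)
span_only <- sum(is.na(sleep$Span) & n_missing == 1)
span_only

# How many rows have 0, 1, 2, ... missing values
table(n_missing)
```

The first two results should match the 42 and 2 reported in the md.pattern() output.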
## An easier way to visualize
aggr(sleep, prop = FALSE, numbers = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## NonD 14
## Dream 12
## Sleep 4
## Span 4
## Gest 4
## BodyWgt 0
## BrainWgt 0
## Pred 0
## Exp 0
## Danger 0
You can see that the variable NonD has the largest number of missing values (14).
## The same plot with proportions instead of counts
aggr(sleep, prop = TRUE, numbers = TRUE, sortVars = TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## NonD 0.22580645
## Dream 0.19354839
## Sleep 0.06451613
## Span 0.06451613
## Gest 0.06451613
## BodyWgt 0.00000000
## BrainWgt 0.00000000
## Pred 0.00000000
## Exp 0.00000000
## Danger 0.00000000
prop=TRUE reports proportions rather than counts. The variable NonD has about 22.6% missing values (0.2258).
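Since we can drop neither the rows nor the variables, we will fill in the missing values using multiple imputation with the mice package. Monte Carlo methods are used to fill in the missing data: mice imputes the missing values by Gibbs sampling, producing several completed datasets. Standard statistical methods can then be applied to each of these datasets. By default, predictive mean matching (pmm) is used to replace missing numeric data. The function returns an object containing several complete datasets (the default is 5).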
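To make the idea behind predictive mean matching concrete, here is a simplified base-R sketch for a single variable. It is an illustration under my own assumptions, not the mice implementation: pmm_sketch is a hypothetical helper, the predictors are assumed to be fully observed, and unlike mice it does not draw the regression coefficients from their posterior.

```r
data(sleep, package = "VIM")
set.seed(500)

# Simplified predictive mean matching for the response of `formula`.
# Assumes every predictor in the formula has no missing values.
pmm_sketch <- function(data, formula, donors = 5) {
  target <- all.vars(formula)[1]
  miss <- is.na(data[[target]])

  # 1. Fit a regression on the cases where the target is observed
  fit <- lm(formula, data = data[!miss, ])

  # 2. Predict the target for every row, observed or not
  pred <- predict(fit, newdata = data)

  obs_idx <- which(!miss)
  for (i in which(miss)) {
    # 3. Find the observed cases whose predictions are closest to this one
    nearest <- obs_idx[order(abs(pred[obs_idx] - pred[i]))[seq_len(donors)]]
    # 4. Borrow the observed value of one randomly chosen donor
    donor <- nearest[sample.int(length(nearest), 1)]
    data[[target]][i] <- data[[target]][donor]
  }
  data
}

filled <- pmm_sketch(sleep, Dream ~ log(BodyWgt) + Danger)
sum(is.na(filled$Dream))
```

Because every imputed value is borrowed from a real observation, pmm never produces impossible values such as negative sleep hours.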
replaced_Data <- mice(sleep, m = 5, maxit = 50, method = 'pmm', seed = 500, printFlag = FALSE)
Here is an explanation of the parameters used:
m - the number of imputed datasets (5 here)
maxit - the number of iterations used to impute the missing values
method - the imputation method; we used predictive mean matching (pmm)
seed - any integer; it fixes the random number generator so the results are reproducible
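Before picking one of the completed datasets, it can help to look at the imputed values themselves. The sketch below assumes the same mice() call as above; the imp component of the returned object holds, for each variable, one row per originally missing case and one column per imputed dataset.

```r
library(mice)
data(sleep, package = "VIM")

# Same settings as in the text
replaced_Data <- mice(sleep, m = 5, maxit = 50, method = "pmm",
                      seed = 500, printFlag = FALSE)

# Imputed values for NonD: rows are the cases that were missing,
# columns are the 5 imputed datasets
nond_imputations <- replaced_Data$imp$NonD
head(nond_imputations)
dim(nond_imputations)
```

If the five columns differ a lot for a given row, the imputation for that case is uncertain, which is exactly the information a single filled-in value would hide.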
Since there are 5 imputed datasets, you can select any of them using the complete() function.
# get the completed data (2nd of the 5)
total_data2 <- complete(replaced_Data, 2)
head(total_data2)
## BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
## 1 6654.000 5712.0 3.2 0.0 3.3 38.6 645 3 5 3
## 2 1.000 6.6 6.3 2.0 8.3 4.5 42 3 1 3
## 3 3.385 44.5 11.0 1.5 12.5 14.0 60 1 1 1
## 4 0.920 5.7 15.2 1.3 16.5 3.0 25 5 2 3
## 5 2547.000 4603.0 2.1 1.8 3.9 69.0 624 3 5 4
## 6 10.550 179.5 9.1 0.7 9.8 27.0 180 4 4 4
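Working with a single completed dataset ignores the differences between the 5 imputations. A common alternative is to fit the same model on each imputed dataset and pool the results, using with() and pool() from mice. The sketch below is only an example: the model Dream ~ Span + Gest is an arbitrary choice of mine, not something this tutorial prescribes.

```r
library(mice)
data(sleep, package = "VIM")

# Same imputation settings as in the text
replaced_Data <- mice(sleep, m = 5, maxit = 50, method = "pmm",
                      seed = 500, printFlag = FALSE)

# Fit the same linear model on each of the 5 imputed datasets
fit <- with(replaced_Data, lm(Dream ~ Span + Gest))

# Combine the 5 sets of estimates into one
pooled <- pool(fit)
pooled_summary <- summary(pooled)
pooled_summary
```

The pooled standard errors account for both the usual sampling variability and the extra uncertainty introduced by imputing the missing values.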