Missing Data Imputation

The most common issue faced during feature engineering is handling missing data. It is important to handle missing values because otherwise machine learning libraries such as Scikit-learn will not be able to work with your data. Before we look at the various ways to handle missing data, we first need to analyse its causes and patterns. The causes can range from issues with data storage or the data collection process to information that is simply impossible to collect. For example, if the data is collected through interviews, it is not possible to collect responses from deceased individuals.

Missing Data Patterns

Let’s look at the patterns of the missing data which will help us in identifying the right imputation strategy:
MCAR: Missing Completely at Random. Here the missing data does not follow any kind of pattern. Take the example of an optional field in a user form that some users fill in while others don't. If there is no pattern to who fills in this field, it can be categorised as MCAR. This missing data will not bias the overall outcome of the model.
MAR: Missing at Random. In MAR, a pattern can be established between the missing data and the other variables. In the previous user-form example, if the optional field is date-of-birth, one gender may be observed to skip it more often than the other. If analysing more data confirms this pattern, it can be categorised as MAR.
MNAR: Missing Not at Random. If the data is missing for a reason related to the value itself, it can be categorised as MNAR. For example, if the form field is the driving licence number, it will be missing for everyone under the permissible driving age.

Understanding the data source helps in classifying the missing data pattern. For example, if we are analysing user-form data, we first need to know which form fields are optional. A field may also have been optional initially and made mandatory later, so such discrepancies can appear in the data. The main idea is to study the data source and the collection process first, so that we are in a better position to understand the missing data.
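As a quick first step, you can quantify how much data is missing per column with pandas. Below is a minimal sketch; the file name and dataset are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical user-form dataset; the file name is a placeholder
df = pd.read_csv("user_forms.csv")

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
})
print(missing_summary.sort_values("missing_pct", ascending=False))
```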

Handling Missing Data

Dropping Records

The easiest option is to drop the records that contain missing values, as it requires no data transformation. This approach is also called Complete Case Analysis or Listwise Deletion. Before dropping records, make sure that a significant portion of the dataset is not being dropped. There could also be cases where the data is not missing at random, so it might be useful to flag records with missing data. Additionally, think about how the model would behave if the data were also missing in production at prediction time.

Now, assuming that the missing data has no specific significance and is missing at random, you can follow the steps below to do some due diligence before the cleanup (see the sketch after the list):

  • Compare the size of the remaining dataset with the full dataset. Continue if the overall reduction is less than your acceptable limit, for example within 5%.
  • If more than the acceptable number of records would be dropped, find the percentage of rows with missing values for each independent variable and only drop records for the variables that are under the limit.
  • Check the distribution of a few numerical variables using histogram plots to ensure that the before and after distributions are similar, confirming that the dataset has not changed its characteristics.
  • For categorical variables, check that the frequency distribution is comparable after the cleanup.
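The sketch below illustrates these checks with pandas; the dataset and the age and gender columns are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("user_forms.csv")          # placeholder dataset

# Complete case analysis: drop every row that contains a missing value
df_complete = df.dropna()

# Due diligence: how much of the dataset survives the drop?
retained_pct = len(df_complete) / len(df) * 100
print(f"Rows retained after dropna: {retained_pct:.1f}%")

# Per-variable missing percentage, to decide which columns are under the limit
print(df.isnull().mean() * 100)

# Compare a numerical variable's distribution before and after the drop
fig, ax = plt.subplots()
df["age"].plot.hist(ax=ax, alpha=0.5, label="before")
df_complete["age"].plot.hist(ax=ax, alpha=0.5, label="after")
ax.legend()
plt.show()

# For a categorical variable, compare the frequency distributions
print(df["gender"].value_counts(normalize=True))
print(df_complete["gender"].value_counts(normalize=True))
```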

Common Value Substitution

For numerical variables, the common value can be derived using either the mean or the median of the available data for a variable. For a normal distribution either can be used, but for skewed data or data with outliers it is better to go with the median.

One main advantage of this approach is that no records get dropped, so you have the full dataset to work with. It is also not very time consuming to perform this substitution.

On the other hand, one drawback is that it further skews the variable distribution towards the mean or median value. If a significant number of values are missing, it will distort the data distribution, both variance and covariance. You can compare the pre- and post-imputation distributions using a box plot: check the IQR and whisker lengths, and note that more data points will now be excluded as outliers because the Q1 and Q3 values move closer to the median. Evaluate the variance using the var() function and the covariance using the cov() function on the variable's series object to quantify the impact. To further analyse the impact on the distribution, visualise the pre- and post-imputation data with a Kernel Density Estimation (KDE) chart.
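The following sketch shows one way to quantify and visualise that impact with pandas; the income and age columns are hypothetical placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("user_forms.csv")          # placeholder dataset
before = df["income"]                       # hypothetical numerical column
after = before.fillna(before.median())      # median imputation

# Quantify the impact on variance and on covariance with another variable
print("variance before:", before.var(), " after:", after.var())
print("covariance with age before:", before.cov(df["age"]),
      " after:", after.cov(df["age"]))

# Box plot shows the shrinking IQR/whiskers; KDE shows the spike at the median
comparison = pd.DataFrame({"before": before, "after": after})
comparison.plot.box()
comparison.plot.kde()
plt.show()
```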

For categorical variables, we can replace missing values with the mode, i.e. the value with the highest frequency. If multiple values have the highest frequency or are in a similar range, we can prepare a list of such values and randomly assign values from it. The basic assumption while performing this imputation is that the data is missing at random, so we are replacing it with the most common or average values.

One point to note is that you should calculate the common or average values from the training set only, to avoid leaking information from the test set into the model.
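A minimal sketch with scikit-learn's SimpleImputer, fitting on the training split only; the dataset and the income and gender columns are hypothetical placeholders.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

df = pd.read_csv("user_forms.csv")                       # placeholder dataset
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, test = train.copy(), test.copy()

# Median for a numerical column, learned from the training split only
num_imputer = SimpleImputer(strategy="median")
train[["income"]] = num_imputer.fit_transform(train[["income"]])
test[["income"]] = num_imputer.transform(test[["income"]])

# Mode (most frequent value) for a categorical column
cat_imputer = SimpleImputer(strategy="most_frequent")
train[["gender"]] = cat_imputer.fit_transform(train[["gender"]])
test[["gender"]] = cat_imputer.transform(test[["gender"]])
```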

Fixed Value Substitution

When the data is not missing at random and you want to capture the knowledge that the observation was missing, instead of replacing the missing value with a common value we replace it with a fixed value that lies outside the range of the distribution, on either the positive or the negative side.

For numerical variables with a normal distribution, you can use the rule of three times the standard deviation to find the outlier boundary. For skewed data, use the IQR proximity rule. You can tweak the outlier boundary by trying various IQR multipliers: 1.5 is the standard multiplier, but a multiplier of 3 is also sometimes used to get extreme values.
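A sketch of both rules, assuming a hypothetical income column in placeholder training data:

```python
import pandas as pd

train = pd.read_csv("train.csv")            # placeholder training data
income = train["income"]                    # hypothetical numerical column

# Normal distribution: place the fixed value beyond mean + 3 standard deviations
fill_normal = income.mean() + 3 * income.std()

# Skewed distribution: IQR proximity rule (1.5 is standard, 3 for extreme values)
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
fill_skewed = q3 + 3 * iqr

# Replace missing values with the out-of-range fixed value
train["income"] = income.fillna(fill_skewed)
```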

Perform the due diligence steps mentioned in the Common Value Substitution section to analyse the impact of the imputation.

For categorical variables, we can replace missing values with a fixed string that acts as a new category value. Since it is a new category, it won't impact the frequency distribution of the other values of the variable. One point to note is that, for tree-based algorithms, the model can overfit to the missing-value category.
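For the categorical case, a minimal sketch that introduces a dedicated "Missing" category; the dataset and the occupation column are placeholders.

```python
import pandas as pd

train = pd.read_csv("train.csv")            # placeholder training data

# A fixed string outside the existing categories acts as a new category value
train["occupation"] = train["occupation"].fillna("Missing")
print(train["occupation"].value_counts())
```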

Random Value Substitution

The approach is similar for numerical and categorical variables. A value is chosen at random from the existing values each time a missing value needs to be replaced. Note that we are not assigning the same value to every missing value, but doing a separate random lookup for each one.

The implementation approach is simple, but make sure to use a seed in your random value generator to keep the randomness reproducible, as otherwise the predictions will vary between runs. Since the same seed will generate the same value for every missing value, to introduce variation you can use a different seed for different lookups, for example derived from the values of some other independent variable.

Also, only perform the lookup on the training set to avoid leaking information from the test set. One drawback of this approach is that you will also have to persist the training set data in production for use at prediction time.
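A sketch of a seeded random lookup that samples only from the non-null training values; the datasets and the city column are hypothetical placeholders.

```python
import pandas as pd

train = pd.read_csv("train.csv")            # placeholder datasets
test = pd.read_csv("test.csv")

def random_impute(target, source, seed=42):
    """Fill each missing value in `target` with a random draw from the
    non-null values of `source` (the training data)."""
    target = target.copy()
    n_missing = target.isnull().sum()
    if n_missing:
        draws = source.dropna().sample(n=n_missing, replace=True, random_state=seed)
        target.loc[target.isnull()] = draws.to_numpy()
    return target

# Lookup values come from the training set for both splits
train["city"] = random_impute(train["city"], train["city"])
test["city"] = random_impute(test["city"], train["city"])
```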

Multivariate Imputation

It is a more sophisticated imputation approach than the univariate approaches we have discussed so far. The variable with missing values is treated as the target variable, and using the other variables we train a model to predict the value to be used for imputation.

We can use any model to predict the missing value, but KNNImputer is a popular choice. It follows the principle that if observations are close to each other on the variables without missing values, i.e. they are neighbours, then the value of the variable with missing values will also be similar.

There are a few constraints to using the KNNImputer. Firstly, it uses the same number of neighbours (K) for all variables; if you need different values of K for different variables, you have to run it separately for each. Also, since the algorithm is based on the calculation of Euclidean distance, it can't be used directly for categorical variables.
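A minimal KNNImputer sketch over numerical columns only; the dataset and column names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.impute import KNNImputer

train = pd.read_csv("train.csv")                 # placeholder training data
num_cols = ["age", "income", "tenure"]           # hypothetical numerical columns

# Each missing value is imputed from its 5 nearest neighbours (Euclidean distance)
imputer = KNNImputer(n_neighbors=5)
train[num_cols] = imputer.fit_transform(train[num_cols])
```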

MICE, or Multivariate Imputation by Chained Equations, is an even more advanced form of multivariate imputation. Initially, all variables with missing values except one have their missing values imputed with a simple method. Then the missing values of the variable excluded in the last step are predicted from the other variables and replaced. A round-robin approach is followed to perform the same steps for every other variable with missing values. Note that in each cycle the variable selected for imputation has to be reverted to its original state with missing values, since we had substituted them earlier when it was used as an independent variable.
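Scikit-learn provides this round-robin approach through IterativeImputer; a minimal sketch, again with placeholder dataset and column names:

```python
import pandas as pd
# IterativeImputer is experimental, so it has to be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

train = pd.read_csv("train.csv")                 # placeholder training data
num_cols = ["age", "income", "tenure"]           # hypothetical numerical columns

# Chained-equations imputation: each column is modelled from the others in turn
mice = IterativeImputer(max_iter=10, random_state=42)
train[num_cols] = mice.fit_transform(train[num_cols])
```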

There is a special implementation of MICE, missForest, in which a Random Forest model is used. The key benefits of missForest are that it supports both categorical and numerical variables and that it can work with non-linear data.
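missForest itself originates as an R package, but a rough approximation in scikit-learn is IterativeImputer with a Random Forest estimator; the sketch below works under that assumption and, unlike missForest, covers numerical columns only. The dataset and column names are placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

train = pd.read_csv("train.csv")                 # placeholder training data
num_cols = ["age", "income", "tenure"]           # hypothetical numerical columns

# Chained-equations imputation with a Random Forest as the per-column model
rf_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=42),
    max_iter=10,
    random_state=42,
)
train[num_cols] = rf_imputer.fit_transform(train[num_cols])
```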

Make sure to compare the results of multivariate imputation with the univariate approaches. If there are no significant improvements in the results then it is better to stick to the univariate approaches.

Wrap Up

We have seen various approaches to handling missing values. For the substitution-based approaches, besides imputing the data we can also maintain a flag in a new variable indicating where missing data was substituted. This is important in cases where the data was not missing at random.
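A minimal sketch of such a flag variable, with a placeholder dataset and column name; scikit-learn's SimpleImputer offers the same behaviour through its add_indicator parameter.

```python
import pandas as pd

df = pd.read_csv("user_forms.csv")               # placeholder dataset

# Record which rows were originally missing before any substitution
df["income_was_missing"] = df["income"].isnull().astype(int)
df["income"] = df["income"].fillna(df["income"].median())
```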

It is important to highlight again that the due diligence checks should be performed after dropping or imputing data, as explained in the Dropping Records and Common Value Substitution sections.

For further reading go through the documentation of the following Scikit-learn classes: