Handling Outliers

Posted: 16/10/2021 | Under: Machine Learning | By: Vaibhav

Tags: feature engineering, machine learning, outlier

Outliers are values that differ extremely from the other values in the dataset. To work with outliers we have to answer two questions. Firstly, how do we define an outlier? Secondly, how do we handle the outliers once we have found them? Let’s take a look at the two questions separately.

Outlier Identification

Before handling the outliers, we first need to establish which data points we should consider as outliers. We can use one of the following techniques for this.

Z-Score

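The Z-score is the distance of a value from the mean, measured in units of standard deviation. It is given by the formula

z-score = (x - mean) / standard deviation

For example, a Z-score of +/-3 can be used as the outlier boundary, which flags any value more than three standard deviations away from the mean. This technique is recommended when the variable follows an approximately Gaussian distribution, because the mean and standard deviation are themselves sensitive to extreme values.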

mean_val = df['col_name'].mean()
std_val = df['col_name'].std()

df['z_score'] = (df['col_name'] - mean_val) / std_val

Using the z_score column, we can now filter the records whose absolute value exceeds the chosen threshold to get the outliers.
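The filtering step could look like the minimal sketch below. The toy data and the names threshold, outliers and inliers are assumptions made for illustration; only col_name and z_score come from the snippet above.

```python
import pandas as pd

# Hypothetical toy data: twenty typical values and one extreme value
df = pd.DataFrame({"col_name": [10, 11, 12, 13, 14] * 4 + [100]})

mean_val = df["col_name"].mean()
std_val = df["col_name"].std()
df["z_score"] = (df["col_name"] - mean_val) / std_val

# Assumed threshold of 3 standard deviations on either side of the mean
threshold = 3

# Use the absolute value so that both very low and very high values are caught
outliers = df[df["z_score"].abs() > threshold]
inliers = df[df["z_score"].abs() <= threshold]

print(outliers)
```

Note that on very small samples a single extreme value inflates the standard deviation so much that it may never reach a Z-score of 3, which is one more reason to check the distribution before relying on this method.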

It can also be calculated using the standardisation libraries which we will discuss in the next article on Feature Scaling.

Interquartile range (IQR)

The IQR is the difference between the 75th and the 25th percentile of the variable. It is used to determine the outlier boundaries with the formula,

Upper Boundary = 75th percentile + (1.5 * IQR)
Lower Boundary = 25th percentile - (1.5 * IQR)

The multiplication factor of 1.5 is commonly used, but it can be increased to a higher value (3 is a common choice) to flag only the extreme outliers. This strategy can also be used for skewed distributions.

iqr = df['col_name'].quantile(0.75) - df['col_name'].quantile(0.25)

up_limit = df['col_name'].quantile(0.75) + 1.5 * iqr
low_limit = df['col_name'].quantile(0.25) - 1.5 * iqr

Filter the dataframe using the upper and lower boundaries of the outliers.
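A minimal sketch of that filter, on made-up data, might look as follows. The names q1, q3, is_outlier, outliers and df_clean are assumptions introduced for this example.

```python
import pandas as pd

# Hypothetical toy data with one obviously extreme value
df = pd.DataFrame({"col_name": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})

q1 = df["col_name"].quantile(0.25)
q3 = df["col_name"].quantile(0.75)
iqr = q3 - q1

low_limit = q1 - 1.5 * iqr
up_limit = q3 + 1.5 * iqr

# between() is inclusive on both ends by default, so values exactly on a
# boundary are kept. Missing values would also be flagged by this mask.
is_outlier = ~df["col_name"].between(low_limit, up_limit)

outliers = df[is_outlier]
df_clean = df[~is_outlier]

print(low_limit, up_limit)
print(outliers)
```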

Quantiles

This approach is simpler than the IQR approach. Here the quantile values themselves are used directly as the boundaries, for example:

Upper Boundary = 95th percentile
Lower Boundary = 5th percentile

up_limit = df['col_name'].quantile(0.95)
low_limit = df['col_name'].quantile(0.05)
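One property of this approach is worth seeing in practice: because the boundaries are percentiles of the data itself, roughly the same share of records is flagged no matter how the data looks. The sketch below uses hypothetical evenly spread data with no real outliers at all; the outliers name is an assumption for this example.

```python
import pandas as pd

# Hypothetical data: the integers 1 to 100, evenly spread with no extreme values
df = pd.DataFrame({"col_name": range(1, 101)})

up_limit = df["col_name"].quantile(0.95)
low_limit = df["col_name"].quantile(0.05)

# Everything below the 5th or above the 95th percentile is flagged
outliers = df[(df["col_name"] < low_limit) | (df["col_name"] > up_limit)]

# About 10% of the records are flagged even though nothing here is extreme
print(len(outliers))
```

So with quantile boundaries it helps to inspect the flagged records before treating them as genuine outliers.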

Outlier Handling

Let’s compare the various options to handle the outlier observations:

Approach	Description
Drop	Drop the records that contain outliers. It is the simplest of all the approaches, but it also discards the information held in the other variables of those records. We also need to make sure that the number of dropped observations stays within an acceptable range.
Capping	Replace the outlier value with the upper or lower boundary that was used to identify the outlier. This approach doesn’t delete any data, but it will distort the data distribution.
Missing Data	Treat the outlier values as missing and fill them using one of the missing data imputation strategies. This approach will also distort the distribution.
Binning	Discretise the variable so that the outliers fall into the farthest bins on the edges of the distribution, which lie within the outlier boundary.
Constant	If business sense suggests it, we may decide to replace the outlier with a constant value.
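Three of these options (drop, capping and missing data) can be sketched in pandas as shown below. This is only an illustration under assumptions: the toy data, the IQR boundaries and the median imputation are choices made for the example, and names such as dropped, capped and imputed are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with one extreme value
df = pd.DataFrame({"col_name": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})

# Boundaries from the IQR method (any identification technique would do)
q1 = df["col_name"].quantile(0.25)
q3 = df["col_name"].quantile(0.75)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
up_limit = q3 + 1.5 * iqr

within_limits = df["col_name"].between(low_limit, up_limit)

# Drop: keep only the records inside the boundaries
dropped = df[within_limits]

# Capping: pull values outside the boundaries back onto the nearest boundary
capped = df["col_name"].clip(lower=low_limit, upper=up_limit)

# Missing data: turn outliers into NaN, then impute (median is an assumed choice)
as_missing = df["col_name"].where(within_limits)
imputed = as_missing.fillna(as_missing.median())

print(len(dropped), capped.iloc[-1], imputed.iloc[-1])
```

Comparing the mean and standard deviation of capped and imputed against the original column is a quick way to see how much each option shifts the distribution.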

Wrap Up

Keep the following points in mind during outlier handling.

Outlier handling can distort the distribution, so make sure you analyse the variable again afterwards.

If you are using the capping technique, use only the training set to define the outlier boundaries, and then apply those same boundaries to the test set.

Similarly, if you are using the missing data approach, the imputation strategy should be defined on the training set only. The same holds for binning, where the edge bins are defined using the training set.
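For capping, the idea of learning the boundaries on the training set only can be sketched as below. The train and test dataframes are hypothetical stand-ins for a real split, and the IQR method is an assumed choice of boundary.

```python
import pandas as pd

# Hypothetical train and test sets (in practice these come from your own split)
train = pd.DataFrame({"col_name": [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]})
test = pd.DataFrame({"col_name": [0, 5, 200]})

# Learn the boundaries from the training set only
q1 = train["col_name"].quantile(0.25)
q3 = train["col_name"].quantile(0.75)
iqr = q3 - q1
low_limit = q1 - 1.5 * iqr
up_limit = q3 + 1.5 * iqr

# Apply the same boundaries to both sets; nothing is recomputed on the test set
train_capped = train["col_name"].clip(lower=low_limit, upper=up_limit)
test_capped = test["col_name"].clip(lower=low_limit, upper=up_limit)

print(test_capped.tolist())
```

Recomputing the quantiles on the test set would leak information about it into the preprocessing, and the two sets would end up capped at different values.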

Lastly, in certain cases the outliers themselves are relevant. In fraud detection, for example, the outliers might help to identify anomalous transactions. So consider the business impact before handling the outliers.