Binning or Discretization

The process of converting a continuous variable into a discrete set of bins is called discretization. Unlike categorical variables, where a bin represents a unique value, for continuous data a bin represents an interval, and any data point falling within that interval is counted in that bin. This is why the process is also called data binning.

Binning is not limited to numerical data. You can group the events of a day into time periods by defining bins such as morning and night, group timestamps into hourly bins, or group dates into bins for months or years, as sketched below.
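
As a small illustration (the hour-of-day values below are made up), pd.cut, which is covered in more detail later, can map hours into named periods of the day:

>>> import pandas as pd

>>> hours = pd.Series([2, 9, 14, 22])   # hour of the day for four events
>>> pd.cut(hours, bins=[0, 6, 12, 18, 24],
...        labels=['night', 'morning', 'afternoon', 'evening'])
0        night
1      morning
2    afternoon
3      evening
dtype: category
Categories (4, object): ['night' < 'morning' < 'afternoon' < 'evening']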

In this article we will cover discretization techniques for numerical variables.

Uniform – Equal Width

As the name suggests, the width, i.e. the length of the interval, is the same for all bins. We decide on the number of bins ‘N’ we want to keep and divide the range of the data by ‘N’ to get the bin width.

bin_width = (max - min) / N

Although the width of every bin is the same, the data may not be distributed evenly across them. Some bins can have more hits than others, and in real-life scenarios you will even get bins with zero hits, which are called empty bins.
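
A quick sketch (with made-up values) shows this: if the data has a large gap, one of the equal-width bins receives no points at all.

>>> s = pd.Series([1, 2, 3, 100])
>>> pd.cut(s, bins=3).value_counts(sort=False)   # the middle bin is empty
(0.901, 34.0]     3
(34.0, 67.0]      0
(67.0, 100.0]     1
dtype: int64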

>>> import pandas as pd
>>> from sklearn.preprocessing import KBinsDiscretizer

>>> data = [[10],[20],[15],[25]]
>>> df = pd.DataFrame(data, columns = ['X'])

>>> est = KBinsDiscretizer(n_bins=2, encode='ordinal', strategy='uniform')
>>> est.fit(df)
KBinsDiscretizer(encode='ordinal', n_bins=2, strategy='uniform')

>>> est.transform(df)
array([[0.],
       [1.],
       [0.],
       [1.]])

In the above example the value of ‘N’ is 2, i.e. the data is grouped into 2 bins. The data points 10 and 15 are assigned to bin 0, whereas 20 and 25 are mapped to bin 1. The strategy ‘uniform’ means we want the same width for all bins.
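
The fitted boundaries can be inspected through the discretizer's bin_edges_ attribute, which for this data works out exactly as the formula above suggests: (25 - 10) / 2 = 7.5, giving edges at 10, 17.5 and 25.

>>> est.bin_edges_
array([array([10. , 17.5, 25. ])], dtype=object)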

The Pandas library also provides an easy binning function, pd.cut, as shown below.

>>> data = [[10],[20],[15],[25]]
>>> df = pd.DataFrame(data, columns = ['X'])

>>> pd.cut(df.X, bins=3)
0    (9.985, 15.0]
1     (15.0, 20.0]
2    (9.985, 15.0]
3     (20.0, 25.0]
Name: X, dtype: category
Categories (3, interval[float64]): [(9.985, 15.0] < (15.0, 20.0] < (20.0, 25.0]]

The data is divided into bins. To encode the bins as ordinal numbers, use the ‘labels’ parameter with labels=False. The returned values can be assigned directly to a new column of the dataframe, as shown after the example below.

>>> pd.cut(df.X, bins=3, labels=False)
0    0
1    1
2    0
3    2
Name: X, dtype: int64
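
For instance, the encoded bins can be stored next to the original values (the column name ‘X_binned’ below is just an example):

>>> df['X_binned'] = pd.cut(df.X, bins=3, labels=False)
>>> df
    X  X_binned
0  10         0
1  20         1
2  15         0
3  25         2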

If required, you can also assign labels to the bins instead of encoding them as numbers.

>>> pd.cut(df.X, bins=3, labels=['low','medium','high'])
0       low
1    medium
2       low
3      high
Name: X, dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
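
pd.cut also accepts an explicit list of edges if you want full control over the boundaries; the edges below are arbitrary and only meant as a sketch:

>>> pd.cut(df.X, bins=[0, 15, 30], labels=['low', 'high'])
0     low
1    high
2     low
3    high
Name: X, dtype: category
Categories (2, object): ['low' < 'high']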

Quantile – Equal Frequency

In this technique the data points are divided equally amongst the ‘N’ bins, which means the widths of the bins are generally not the same, and there are no empty bins.

>>> from sklearn.preprocessing import KBinsDiscretizer

>>> data = [[10], [20], [15], [25], [-10], [100]]
>>> df = pd.DataFrame(data, columns = ['X'])

>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
>>> est.fit(df)
KBinsDiscretizer(encode='ordinal', n_bins=3)

>>> est.transform(df)
array([[0.],
       [1.],
       [1.],
       [2.],
       [0.],
       [2.]])

Although we have passed ‘quantile’ as the binning strategy, the parameter can be skipped since ‘quantile’ is the default value (which is also why it does not appear in the fitted estimator shown above). In the above example there are a total of 3 bins and each bin contains 2 data points. The two highest values, 100 and 25, are in the last bin, i.e. bin number 2, and the two lowest values, 10 and -10, are in the first bin. So irrespective of the spread of the data, the data points are divided evenly.
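
A quick check (a small sketch using NumPy) confirms the equal-frequency property by counting how many points fell into each bin:

>>> import numpy as np

>>> np.bincount(est.transform(df).ravel().astype(int))   # two points per bin
array([2, 2, 2])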

The Pandas library also provides quantile-based binning through pd.qcut.

>>> pd.qcut(df.X, q=3, labels=False)
0    0
1    1
2    1
3    2
4    0
5    2
Name: X, dtype: int64

K-Means

This is a clustering technique more than a binning one: the data points are first grouped into ‘N’ clusters using k-means, and each cluster is then associated with a bin.

>>> from sklearn.preprocessing import KBinsDiscretizer

>>> data = [[10], [20], [15], [25], [-10], [100]]
>>> df = pd.DataFrame(data, columns = ['X'])

>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
>>> est.fit(df)
KBinsDiscretizer(encode='ordinal', n_bins=3, strategy='kmeans')

>>> est.transform(df)
array([[1.],
       [1.],
       [1.],
       [1.],
       [0.],
       [2.]])

Since k-means clustering is based on Euclidean distance, data points that are close to the same centroid are grouped together. Consequently 10, 15, 20 and 25 all end up in the same cluster, while -10 and 100 are each assigned their own cluster.
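
The learned boundaries are again available in bin_edges_. Assuming the clustering shown above (cluster centres at roughly -10, 17.5 and 100), the interior edges fall midway between neighbouring centres:

>>> est.bin_edges_
array([array([-10.  ,   3.75,  58.75, 100.  ])], dtype=object)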

One point to note is that this is the most computationally expensive strategy of the three. If the data points are fairly evenly spread out, it is better to use one of the previous strategies.

If you are interested in reading more about k-means clustering, refer to this blog.

Encoding Discrete Bins into Numerical Values

You will notice in our previous examples that we get back bin numbers, which can be used directly for model training. This is equivalent to ordinal encoding, where each discrete value is sequentially associated with a number. It happens because we pass the parameter ‘encode’ with the value ‘ordinal’ to scikit-learn’s KBinsDiscretizer.

If we want one-hot encoding instead, sparse or dense, we can specify that as shown below. Sparse one-hot is the default option.

>>> est = KBinsDiscretizer(n_bins=3, encode='onehot-dense')

>>> est.fit(df)
KBinsDiscretizer(encode='onehot-dense', n_bins=3)

>>> r = est.transform(df)
>>> print(r)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]]
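
With the default encode='onehot', the transform instead returns a SciPy sparse matrix, which can be densified with toarray() when needed; a short sketch:

>>> est = KBinsDiscretizer(n_bins=3)   # encode='onehot' (sparse) by default
>>> Xt = est.fit_transform(df)
>>> Xt.toarray()
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.]])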

Wrap Up

A few points should be considered when deciding on the number of bins. If there are too many bins, you can end up in a scenario where every data point occupies its own bin, i.e. the maximum height of a bin is 1. This dilutes the variable, as no useful insight can be derived from such a distribution. On the other hand, we also don’t want too few bins, with most of the data lying within just a couple of them, unless we are designing an anomaly detection system where the tall bins denote the normal behaviour.
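
If you want a data-driven starting point for the number of bins, NumPy's histogram_bin_edges supports common rules of thumb such as 'sturges' and 'fd'. The snippet below is only a sketch, and the value it returns is a suggestion rather than a hard rule:

>>> import numpy as np

>>> x = [-10, 10, 15, 20, 25, 100]
>>> len(np.histogram_bin_edges(x, bins='sturges')) - 1   # suggested number of bins
4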

Refer to the official documentation of KBinsDiscretizer and Pandas cut and qcut functions for more information.