Binning or Discretization

The process of converting continuous data into a discrete set of bins is called discretization. Unlike categorical variables, where a bin represents a unique value, for continuous data a bin represents an interval, and any data point falling within that interval is counted in that bin. Thus this process is also called data binning. Binning is not limited to numerical data. You can […]
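A minimal sketch of binning continuous data with pandas; the ages, bin edges, and labels below are illustrative assumptions, not taken from the article:

```python
import pandas as pd

# Hypothetical ages to discretize.
ages = pd.Series([5, 17, 23, 41, 68])

# Each bin is an interval; any age falling within an interval is
# assigned that interval's label.
bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
              labels=["child", "young", "adult", "senior"])

print(bins.tolist())  # ['child', 'child', 'young', 'adult', 'senior']
```

`pd.cut` uses right-closed intervals by default, so an age of exactly 18 would still fall in the first bin.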


Feature Scaling

The predictors in a dataset often have very different magnitudes. For example, in a ‘user’ dataset the ‘age’ feature will have positive values, normally in single or double digits, but if the same dataset also contains salary, its values can easily be in five or six figures. We will discuss some techniques to normalise the variables so that all features have the same or similar […]
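One common normalisation technique is min-max scaling, sketched below with NumPy; the ‘age’ and ‘salary’ values are made up for illustration:

```python
import numpy as np

# Features on very different scales, as in the 'user' dataset example.
age = np.array([22.0, 35.0, 58.0])
salary = np.array([30_000.0, 85_000.0, 120_000.0])

def min_max_scale(x):
    # Rescale to the [0, 1] range: (x - min) / (max - min).
    return (x - x.min()) / (x.max() - x.min())

scaled_age = min_max_scale(age)
scaled_salary = min_max_scale(salary)
print(scaled_age)     # all values now lie in [0, 1]
print(scaled_salary)  # same range as scaled_age, despite the raw magnitudes
```

After scaling, both features span the same range, so neither dominates a distance-based model purely because of its units.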


Handling Outliers

Outliers are values that differ extremely from the other values in the dataset. To work with outliers we have to answer two questions: first, how do we define an outlier, and second, how do we handle outliers? Let’s take a look at the two questions separately. Outlier Identification Before handling outliers it is first important to establish which data points […]
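One common identification rule (a sketch, not necessarily the article’s chosen method) flags points outside 1.5 times the interquartile range; the sample data here is invented:

```python
import numpy as np

# A mostly well-behaved sample with one extreme value (102).
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 13])

# Interquartile range (IQR) rule of thumb: anything outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is treated as an outlier.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [102]
```

The 1.5 multiplier is conventional; widening it to 3 flags only the most extreme points.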


Encoding Categorical Variables

Machine learning models cannot train on categorical variables, so these need to be encoded into a numerical format. In this article we will discuss different encoding techniques. One Hot Encoding In this technique we replace each categorical variable with multiple dummy variables, where the number of new variables depends on the cardinality of the categorical variable. The dummy variables have binary values where […]
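One-hot encoding can be sketched with pandas as below; the ‘colour’ column and its values are hypothetical examples:

```python
import pandas as pd

# A categorical variable with cardinality 3 (blue, green, red).
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One dummy variable per distinct category; each row has exactly
# one dummy set to true and the rest set to false.
dummies = pd.get_dummies(df["colour"], prefix="colour")
print(dummies.columns.tolist())  # ['colour_blue', 'colour_green', 'colour_red']
```

The number of new columns equals the cardinality of the original variable, which is why high-cardinality features can blow up the feature space under this scheme.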
