Encoding Categorical Variables

Machine learning models cannot train directly on categorical variables, so these variables need to be encoded into a numerical format. In this article we will discuss the most common encoding techniques.

One Hot Encoding

In this technique we replace each categorical variable with multiple dummy variables, where the number of new variables depends on the cardinality of the categorical variable. Each dummy variable is binary: it contains the value 1 for rows where the categorical value it represents is present and 0 for all other values.

from sklearn.preprocessing import OneHotEncoder

# Scikit-learn: fit on the training data, then transform
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train)
X_train_enc = enc.transform(X_train)

Encoded columns can also be generated directly in Pandas using the get_dummies function; the resulting columns can then be concatenated to the main dataframe.

# Pandas
enc_columns = pd.get_dummies(X_train['vehicle'])
X_train = pd.concat([X_train, enc_columns], axis=1)

For a categorical column ‘vehicle’, the dummy variables will be created in the following manner.

vehicle   car   bike   boat
car       1     0      0
bike      0     1      0
boat      0     0      1
car       1     0      0
The main drawback of this approach is that if the categorical variable has very high cardinality, it expands the feature space considerably, increasing the overall computing time and resource requirements of the model. A fix for that is to keep only the top N most frequent values and replace all rare values with a single constant, which generates only N+1 dummy variables.
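
A minimal sketch of the top-N approach with pandas; the column name 'vehicle', the cutoff N=10 and the constant 'Other' are illustrative assumptions:

N = 10  # keep the N most frequent categories (illustrative cutoff)

# Find the top N values by frequency in the training data
top_values = X_train['vehicle'].value_counts().nlargest(N).index

# Replace everything else with a single constant before one hot encoding
X_train['vehicle'] = X_train['vehicle'].where(
    X_train['vehicle'].isin(top_values), other='Other')

enc_columns = pd.get_dummies(X_train['vehicle'])  # at most N+1 dummy columns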

k-1 One Hot Encoding

In the previous example we can drop one of the dummy variables and still be able to distinguish between the values of the categorical variable. We can drop the first column by passing an additional parameter. The ‘car’ value will then be represented by all 0’s in the encoded columns.

# Scikit-learn
enc = OneHotEncoder(handle_unknown='ignore', drop='first')
enc.fit(X_train)
# Pandas
enc_columns = pd.get_dummies(X_train['vehicle'], drop_first=True)

vehicle   bike   boat
car       0      0
bike      1      0
boat      0      1
car       0      0

After dropping the first column there is no single dummy variable that identifies the ‘car’ vehicle; we need to check all the dummy variables to identify it. Algorithms that work at the level of a single variable may therefore be impacted, which is especially true for tree based algorithms. Also, individual variable specific operations like recursive feature elimination will not consider the dropped dummy variable.

Without New Dummy Variables

There are techniques that directly replace the categorical variable’s values with numerical values, avoiding the need to generate new features. The replacement values can be categorised into two types: those derived from the target variable and those that are not. Let’s look at both options one by one:

Based on Target Variable

In these techniques the categorical variable is replaced by one of the following values:

  • Mean value of the target variable grouped by the values of the categorical variable
  • Ordered integer where an appropriate metric, e.g. the mean, is computed for the target variable per categorical value, the values are ranked according to the computed metric, and the rank is used to replace the categorical variable’s values
  • Probability Ratio, which is applicable to classification problems or to numerical variables after a binning transformation. We replace the value with the ratio of the probability of True to the probability of False for the records with that value. For example, if 6 out of 10 records with a given value have a True target, the probability of True is 6/10 = 0.6, the probability of False is 4/10 = 0.4 (equivalently 1 - 0.6), and the ratio is 0.6/0.4 = 1.5.
    Note, if the probability of False is 0 for a value, the division by zero will fail, so this special scenario needs to be handled
  • Weight of Evidence (WOE), which indicates the predictive power of the independent variable. It is calculated as the natural log of the ratio of the percentage of Good to the percentage of Bad. Say the target variable has a total of 80 Good values and 20 Bad values, and for a particular categorical value the number of Good is 8 and the number of Bad is 4. The WOE is calculated as:
    ln((8/80) / (4/20)) = ln(0.1 / 0.2) = ln(0.5) ≈ -0.693
    A sketch of the probability ratio and WOE calculations follows this list.
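
A minimal sketch of how the probability ratio and WOE mappings could be computed with pandas, assuming a binary target available as a Series y_train aligned with X_train (1 = True/Good, 0 = False/Bad); the names df, y_train and 'vehicle' are illustrative:

import numpy as np

df = X_train.copy()
df['target'] = y_train  # assumed binary target

# Probability of True per category (the mean of a 0/1 target is a probability)
prob_true = df.groupby('vehicle')['target'].mean()
prob_false = 1 - prob_true

# Probability ratio; categories where prob_false is 0 would divide by zero
prob_ratio = prob_true / prob_false

# Weight of Evidence: ln(% of Good per category / % of Bad per category)
good = df.groupby('vehicle')['target'].sum()
bad = df.groupby('vehicle')['target'].count() - good
woe = np.log((good / good.sum()) / (bad / bad.sum()))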

To replace the values, first create a dictionary mapping the column values to the calculated value, e.g. the mean when using the first option. Then use the map() function on the dataframe.

X_train['vehicle'] = X_train['vehicle'].map(vehicle_dict)

Note, keep an ‘Other’ value in the mapping to handle any missing, rare or new values in the test dataset.
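
A minimal sketch of the mean option end to end, again assuming the target is available as a Series y_train aligned with X_train; the choice of the global target mean as the ‘Other’ fallback is illustrative:

# Mean of the target per category, stored as a mapping dictionary
vehicle_dict = y_train.groupby(X_train['vehicle']).mean().to_dict()

# Fallback for missing, rare or unseen values (here: the global target mean)
vehicle_dict['Other'] = y_train.mean()

X_train['vehicle'] = X_train['vehicle'].map(vehicle_dict)

# In the test set, map unseen values to the 'Other' fallback
X_test['vehicle'] = X_test['vehicle'].map(vehicle_dict).fillna(vehicle_dict['Other'])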

Independent of Target Variable

The following techniques are independent of the target variable:

  • Count Encoding where each value is replaced by the total count of observations with that value. One drawback of this approach is that if more than one value has the same count, they become indistinguishable after encoding, which may impact model performance
  • Frequency Encoding. It is similar to Count Encoding, but here the value is replaced by the count as a proportion of the total observations. So if there are 50 records in total and the count of a value is 5, the value will be replaced by 5/50 = 0.1, i.e. 10% of the total observations. It has the same drawback as Count Encoding (a sketch of both follows this list)
  • Label Encoding where the values of a categorical variable with cardinality N are replaced by integers from 0 to N-1. You can directly use the LabelEncoder from scikit-learn, which also provides an inverse_transform() method to convert the encoded values back to the original strings. This encoding is not appropriate for linear models because the arbitrary integer values imply an ordering and distances between categories that do not exist
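
Count and frequency encoding can be built with the same map() pattern; a minimal sketch using the 'vehicle' column, with the new column names chosen purely for illustration:

# Count encoding: replace each value with its number of occurrences
count_map = X_train['vehicle'].value_counts().to_dict()
X_train['vehicle_count'] = X_train['vehicle'].map(count_map)

# Frequency encoding: replace each value with its share of all observations
freq_map = (X_train['vehicle'].value_counts() / len(X_train)).to_dict()
X_train['vehicle_freq'] = X_train['vehicle'].map(freq_map)
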
# Label Encoder example
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
enc.fit(X_train['vehicle'])

X_train['vehicle'] = enc.transform(X_train['vehicle'])
# transform() raises an error for values not seen during fit
X_test['vehicle'] = enc.transform(X_test['vehicle'])

Wrap Up

The above techniques are not an exhaustive list, but they are the main techniques generally followed. Refer to the following link from Kaggle for a more comprehensive list of techniques for further reading.