Binning or Discretization

The process of converting continuous data into a discrete set of bins is called discretization. Unlike categorical variables, where a bin represents a unique value, for continuous data a bin represents an interval, and any data point falling within that interval is counted in that bin. This process is therefore also called data binning. Binning is not limited to numerical data. You can […]
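The interval idea can be sketched in a few lines of plain Python (the bin edges below are illustrative, not taken from the article):

```python
from bisect import bisect_right

def assign_bins(values, edges):
    """Map each continuous value to the interval (bin) it falls into.

    edges holds the interior cut points; a value v lands in bin i when
    edges[i-1] <= v < edges[i] (values below the first edge go to bin 0,
    values at or above the last edge go to the final bin).
    """
    return [bisect_right(edges, v) for v in values]

ages = [3, 17, 25, 42, 68, 90]   # continuous 'age' values
edges = [18, 40, 65]             # illustrative cut points
print(assign_bins(ages, edges))  # → [0, 0, 1, 2, 3, 3]
```

Every distinct age is replaced by the index of its interval, so downstream code sees a small discrete set of values instead of a continuum.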

Feature Scaling

The predictors in a dataset usually differ in magnitude. For example, in a ‘user’ dataset the ‘age’ feature will have positive values, normally in single or double digits, but if the same dataset also contains a salary feature, its values can easily run to five or six figures. We will discuss some techniques to normalise the variables so that all features have the same or similar […]
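One such technique is min-max scaling, which maps every feature to the [0, 1] range; a minimal pure-Python sketch (the sample ages and salaries are made up for illustration):

```python
def min_max_scale(values):
    """Rescale a feature to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]
salaries = [20000, 50000, 80000, 140000]
print(min_max_scale(ages))      # → [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(salaries))  # → [0.0, 0.25, 0.5, 1.0]
```

After scaling, the double-digit ages and six-figure salaries occupy the same range, so neither dominates a distance-based model by magnitude alone.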

Handling Outliers

Outliers are values that differ extremely from the other values in the dataset. To work with outliers we have to answer two questions: first, how do we define an outlier, and second, how do we handle outliers? Let’s take a look at the two questions separately. Outlier Identification Before handling outliers it is first important to establish which data points […]
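One common way to answer the identification question is the interquartile-range (IQR) rule, a.k.a. Tukey’s fences; a pure-Python sketch with a simple interpolated quantile (the sample data is made up):

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    s = sorted(values)
    n = len(s)

    def quantile(q):
        # linear interpolation between the two nearest order statistics
        pos = q * (n - 1)
        lo = int(pos)
        frac = pos - lo
        return s[lo] + (s[min(lo + 1, n - 1)] - s[lo]) * frac

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(data))  # → [102]
```

The multiplier `k=1.5` is the conventional default; widening it to 3 flags only extreme outliers.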

Encoding Categorical Variables

Your machine learning models cannot train on categorical variables, so these need to be encoded into a numerical format. In this article we will discuss different encoding techniques. One Hot Encoding In this technique we replace each categorical variable with multiple dummy variables, where the number of new variables depends on the cardinality of the categorical variable. The dummy variables have binary values where […]
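The one-hot idea can be sketched in plain Python (the colour column is an invented example; one binary dummy column is produced per category, so the number of columns equals the cardinality):

```python
def one_hot(values):
    """Replace a categorical column with one binary dummy column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

colors = ["red", "green", "blue", "green"]
encoded, cats = one_hot(colors)
print(cats)     # → ['blue', 'green', 'red']
print(encoded)  # → [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row now has exactly one 1, in the column for its category; libraries like pandas (`get_dummies`) and Scikit-learn (`OneHotEncoder`) do the same thing at scale.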

Missing Data Imputation

The most common issue faced during feature engineering is handling missing data. It is important to handle missing data because otherwise machine learning libraries like Scikit-learn will not be able to work with your data. Before we look at the various ways to handle missing data, we first need to analyse its causes and patterns. Causes can be several, ranging […]
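The simplest of those ways is mean imputation; a pure-Python sketch (using `None` as the missing marker, with made-up ages):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 35, 40, None]
print(impute_mean(ages))  # both gaps filled with the mean of 25, 35, 40
```

Mean imputation keeps the column’s average unchanged but shrinks its variance, which is why median or model-based imputation is often preferred for skewed or outlier-heavy features.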

Deep Learning at Scale References

Check out the following projects for Deep Learning at scale. TensorFlowOnSpark Developed by Yahoo, TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the deep learning framework TensorFlow with the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. https://github.com/yahoo/TensorFlowOnSpark BigDL: Distributed Deep Learning on Apache Spark Another distributed deep learning library to directly […]

Spark MLlib Data Types

Spark MLlib has special data types since in Machine Learning we normally have to deal with vectors and matrices. Note: Any sample code follows Scala syntax. Overview Local Vector A local vector has integer-typed, 0-based indices and double-typed values, stored on a single machine. MLlib supports two types of local vectors: dense and sparse. Dense Vector A dense vector has […]
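To illustrate the dense-versus-sparse distinction, here is a pure-Python sketch of the two representations (this mimics MLlib’s `(size, indices, values)` sparse form conceptually; it is not MLlib’s actual Scala API):

```python
def to_sparse(dense):
    """Represent a dense vector as (size, indices, values), keeping only non-zeros."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse form back into a full list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

v = [1.0, 0.0, 0.0, 3.0]
print(to_sparse(v))             # → (4, [0, 3], [1.0, 3.0])
print(to_dense(*to_sparse(v)))  # → [1.0, 0.0, 0.0, 3.0]
```

For mostly-zero feature vectors the sparse form stores only the non-zero entries, which is why MLlib offers both kinds of local vector.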

Artificial Neural Network

So far we have seen how basic calculations work in TensorFlow. The computational graph we have built actually resembles the biological neural networks of the human brain, which is why it is commonly known as an Artificial Neural Network, or simply a Neural Network. The neurons in a neural network are organized across three types of layers: Input Layer: This layer is used to feed the input data […]
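As an illustrative sketch of how data flows through those layers (pure Python, with made-up weights and a sigmoid activation; real networks learn the weights by training):

```python
import math

def sigmoid(x):
    """Standard logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully-connected layer: activation(w · x + b) for each neuron."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.0]                                           # input layer: 2 features
hidden = layer(x, [[0.4, 0.3], [-0.6, 0.9]], [0.1, 0.0])  # hidden layer: 2 neurons
output = layer(hidden, [[1.2, -0.8]], [0.05])             # output layer: 1 neuron
print(output)
```

Each neuron weighs every value from the previous layer, adds a bias, and applies an activation; stacking such layers is all a feed-forward network is.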

TensorFlow Introduction

TensorFlow is an open-source library by Google for Deep Learning, and a Tensor is a multi-dimensional data node having the following three parts: Name Shape Data type Tensor("Const:0", shape=(), dtype=string) TensorFlow Hello World example: First use the following command to install TensorFlow on Windows: pip3 install --upgrade tensorflow import tensorflow as tf hello = tf.constant('Hello World') print(hello) If you execute the above program, the text will not […]

Gridsearch

Grid search is good for tuning hyper-parameters. Hyper-parameters are parameters that are not directly learnt within estimators. We will compare SVM models for different C and gamma values using grid search. Refer to the blog on SVM if you want to learn more about Support Vector Machines. We will dive directly into a practical example of a breast cancer classification problem and optimize the classifier using […]
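The idea behind grid search can be sketched in a few lines of plain Python. The article itself uses Scikit-learn’s `GridSearchCV` on an SVM; here `fake_score` is a stand-in for cross-validated accuracy, invented so the sketch runs without a dataset:

```python
from itertools import product

def grid_search(param_grid, score):
    """Try every combination of hyper-parameter values and keep the best scorer."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Stand-in score: pretend C=10, gamma=0.01 is the optimum.
def fake_score(p):
    return -abs(p["C"] - 10) - abs(p["gamma"] - 0.01)

grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
print(grid_search(grid, fake_score))  # → ({'C': 10, 'gamma': 0.01}, 0.0)
```

The cost is combinatorial: the grid above already needs 4 × 3 fits, which is why `GridSearchCV` parallelises the search and why coarse-then-fine grids are common.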
