Blog

Binning or Discretization

The process of converting continuous data into a discrete set of bins is called discretization. Unlike categorical variables, where a bin represents a unique value, for continuous data a bin represents an interval, and any data point falling within that interval is counted in that bin. This process is therefore also called data binning. Binning is not limited to numerical data. You can […]
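As a minimal sketch of the idea in this excerpt, the snippet below assigns each value to one of a fixed number of equal-width intervals and counts the points per bin. Equal-width binning is only one possible strategy and is an assumption here, as are the helper name `equal_width_bins` and the sample `ages` list; the full post may use a different approach.

```python
from collections import Counter


def equal_width_bins(values, n_bins):
    """Return a bin index for each value using n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [0 for _ in values]
    # Clamp the index so the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical sample data, not taken from the post.
ages = [3, 17, 25, 31, 42, 58, 64, 79]
bin_ids = equal_width_bins(ages, n_bins=4)
counts = Counter(bin_ids)  # number of data points that fell into each bin
```

Each bin here covers an interval of the same width, so the bin index alone is enough to recover the interval a value belongs to.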

Posted: 18/10/2021
Under: Machine Learning


Feature Scaling

The predictors in a dataset are usually of different magnitudes. For example, in a ‘user’ dataset the ‘age’ feature has positive values, normally in single or double digits, but if the same dataset also contains salary, its values can easily run to five or six figures. We will discuss some techniques to normalise the variables so that all features have the same or similar […]
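To make the age versus salary example concrete, here is a small sketch of two widely used scaling techniques, min-max scaling and standardization. The excerpt does not say which techniques the post covers, so the choice of these two, the function names, and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Hypothetical sample data, not taken from the post.
ages = [22, 35, 47, 58]
salaries = [30000, 52000, 85000, 140000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
z_salaries = standardize(salaries)
```

After min-max scaling, both features lie between 0 and 1, so neither dominates purely because of its units.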

Posted: 17/10/2021
Under: Machine Learning


Handling Outliers

Outliers are values that differ extremely from the other values in the dataset. To work with outliers we have to answer two questions. First, how do we define an outlier? Second, how do we handle the outliers? Let’s take a look at the two questions separately. Outlier Identification: Before handling the outliers it is first important to establish which data points […]

Posted: 16/10/2021
Under: Machine Learning


Encoding Categorical Variables

Most machine learning models cannot train directly on categorical variables, so these need to be encoded into a numerical format. In this article we will discuss different encoding techniques. One Hot Encoding: In this technique we replace each categorical variable with multiple dummy variables, where the number of new variables depends on the cardinality of the categorical variable. The dummy variables have binary values where […]
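The one hot encoding idea described above can be sketched in a few lines of plain Python. This version creates one dummy variable per distinct category; some variants drop one of them, and the excerpt does not say which the post uses. The function name, the `is_` prefix, and the sample `colours` list are illustrative assumptions.

```python
def one_hot_encode(values):
    """Replace each categorical value with one binary dummy variable per category."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


# Hypothetical sample data, not taken from the post.
colours = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colours)
# Each row has exactly one dummy variable set to 1.
```

The number of new columns equals the cardinality of the variable, which is why this technique becomes expensive for features with many distinct values.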

Posted: 16/10/2021
Under: Machine Learning


Missing Data Imputation

The most common issue faced during feature engineering is the handling of missing data. It is important to handle missing data because otherwise machine learning libraries such as Scikit-learn will generally not be able to work with your data. Before we look at the various ways to handle missing data, we first need to analyse its causes and patterns. There can be several causes, ranging […]

Posted: 25/09/2021
Under: Machine Learning


Apache Airflow – An Ideal Workflow Manager

When I joined a data analytics project that gave top management a platform for taking data-driven decisions about development teams, we were primarily analysing only one data source: code repositories. Even so, the code repos were themselves multiple sources, with recent solutions on Stash but many older solutions still using legacy version control systems. This data pipeline was all […]

Posted: 05/09/2021
Under: Big Data


Apache Airflow Installation Steps for MacOS

This post outlines the steps needed to set up a local instance of Apache Airflow on your Mac. I performed this installation on macOS Big Sur. Installation and Environment Setup: First launch a terminal window and go to the directory where you want to set up Airflow. On my system I have created a folder ‘airflow’.
(base) … % mkdir airflow
(base) … % cd airflow
(base) … airflow […]

Posted: 05/08/2021
Under: Big Data


SQL Server Execution Plan Visualization

Recently I had to support a team I was not part of in troubleshooting some performance issues while querying data from SQL Server. Being an external person, I also did not have any access to the project infrastructure. Not coming from a Microsoft background, I had not worked on SQL Server in a long time. So, I applied my experience of working with Oracle and asked for […]

Posted: 01/12/2020
Under: Other Topics


Plotly – An Interactive Visualizations Library

Python developers generally use the Matplotlib data visualization library to generate basic visualizations. Another popular option is Seaborn, which is also based on Matplotlib but provides more polished visualizations. Developers working on Pandas dataframes at times use the default visualization provided within Pandas. But all these data visualization libraries share a limitation: the visualizations they generate are static images. This is […]

Posted: 05/11/2020
Under: Data Visualization


How the stock markets story is scripted – Part 2

In last week’s blog we discussed the role of institutional investors in the price movement of the stock market, using the subprime mortgage crisis as the reference time period. We will continue our analysis, but with respect to the current market situation. Before we start drawing parallels between the current stock market crash and the 2008 crash, we should understand that no two events are the same. How […]

Posted: 04/05/2020
Under: Other Topics

