Apache Airflow – An Ideal Workflow Manager

When I joined a data analytics project that gave top management a platform for data-driven decisions about development teams, we were analysing only one data source: code repositories. Even that source was really several, since recent solutions were on Stash but many older ones still used legacy version control systems. This data pipeline was all […]

Apache Airflow Installation Steps for MacOS

This post outlines the steps needed to set up a local instance of Apache Airflow on your Mac. I performed this installation on macOS Big Sur. Installation and Environment Setup: first launch a terminal window and go to the directory where you want to set up Airflow. On my system I created a folder 'airflow'.
(base) … % mkdir airflow
(base) … % cd airflow
(base) … airflow […]

Splunk Overview

Splunk is a tool which ingests any data generated by your IT systems and helps you generate insights from it in the form of reports, dashboards and alerts. This blog will provide an overview of Splunk while covering some of its capabilities for searching through log events. The following are the three main functions performed by Splunk: data input, parsing and indexing. Parsing: analyses data and […]

Word Count MapReduce Program in Java

The Word Count program is like the Hello World program of Big Data: we read an input text and count the number of occurrences of each word. In this sample program we will read input from a file uploaded to HDFS, and the final word count result will again be saved to HDFS. A Hadoop setup is the basic prerequisite for this […]
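The post implements this in Java as a Hadoop MapReduce job; purely to illustrate the two phases, here is the same logic sketched in plain Python (function names are illustrative, not from the post):

```python
from collections import defaultdict

def map_phase(text):
    # Mapper: emit a (word, 1) pair for every word in the input
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Reducer: sum the emitted counts for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = reduce_phase(map_phase("to be or not to be"))
# counts["to"] == 2, counts["be"] == 2, counts["or"] == 1
```

In the real job, Hadoop shuffles the mapper's (word, 1) pairs so that all pairs for a given word reach the same reducer.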

Kafka – Spark Streaming Integration

Spark Streaming is a distributed stream processing engine which can ingest data from various sources. One of the most popular sources is Apache Kafka, a distributed streaming platform providing the publish and subscribe features of an enterprise messaging system while also supporting data stream processing. In this blog we will create a real-time streaming pipeline for ingesting credit card data and finding Merchants […]

Setup Standalone Apache Kafka Instance

Apache Kafka is a distributed streaming platform providing the publish and subscribe features of an enterprise messaging system while also supporting data stream processing. In this blog we will set up a standalone Kafka instance on a local machine running Windows. Please note: consider this setup a Hello World application, as it is not meant for production use. Software versions used in […]
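For orientation, a minimal single-broker setup typically touches only a few keys in Kafka's `config/server.properties`; the values below are the stock defaults, not taken from the post (on Windows `log.dirs` would point at a local Windows path instead):

```properties
# Unique id of this broker in the cluster
broker.id=0
# Address the broker listens on for clients
listeners=PLAINTEXT://localhost:9092
# Directory where Kafka stores its log segments
log.dirs=/tmp/kafka-logs
# ZooKeeper connection string for this standalone setup
zookeeper.connect=localhost:2181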

Apache Hive with MongoDB Integration

Apache Hive is a tool from the Apache Hadoop ecosystem that converts SQL-like queries into Hadoop jobs for data summarization, querying and analysis. In this blog post we will see how data stored in MongoDB can be imported into a Hive table. The data from the Hive table is then processed and the result is stored in another Hive table. We will use a 1 minute […]

Spark Streaming with MongoDB

Spark Streaming enables us to do real-time processing of data streams. In this blog post we will see how a data stream coming to Spark over a TCP socket can be processed and the result saved into MongoDB. You can extrapolate this example to your own applications where you use MongoDB as the data sink after processing by Spark. We will use the word count example […]
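The post uses Spark's socket stream and a MongoDB connector; purely to illustrate the per-batch logic, here is a plain-Python stand-in where a dict plays the role of the MongoDB collection (names are illustrative, not the post's code):

```python
def process_batch(lines, collection):
    # Count the words in one micro-batch and upsert the running
    # totals, the way each streaming interval would update MongoDB
    for line in lines:
        for word in line.split():
            collection[word] = collection.get(word, 0) + 1
    return collection

store = {}                  # stand-in for a MongoDB collection
process_batch(["spark streaming", "spark mongodb"], store)
# store["spark"] == 2
```

In the real pipeline, each DStream micro-batch triggers an update like this against the MongoDB sink instead of a dict.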

Deep Learning at Scale References

Check out the following projects for Deep Learning at scale. TensorFlowOnSpark: developed by Yahoo, TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features of the deep learning framework TensorFlow with the big-data frameworks Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. https://github.com/yahoo/TensorFlowOnSpark BigDL: Distributed Deep Learning on Apache Spark. Another distributed deep learning library to directly […]

Spark MLlib Data Types

Spark MLlib has special data types, since in machine learning we normally deal with vectors and matrices. Note: all sample code follows Scala syntax. Overview: Local Vector. A local vector has integer-typed, 0-based indices and double-typed values, and is stored on a single machine. MLlib supports two types of local vectors: dense and sparse. Dense Vector: a dense vector has […]
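The post's samples use Scala (MLlib's `Vectors.dense` / `Vectors.sparse`); to show the idea behind the sparse format in plain Python, a sparse vector is just the size plus the indices and values of the non-zero entries (a sketch, not MLlib's API):

```python
def to_sparse(dense):
    # Keep only the non-zero entries, as MLlib's sparse format does:
    # (size, indices of non-zeros, corresponding values)
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)

to_sparse([1.0, 0.0, 3.0])
# -> (3, [0, 2], [1.0, 3.0])
```

The sparse form pays off when most entries are zero, which is common for feature vectors in machine learning.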
