Kafka – Spark Streaming Integration

Spark Streaming is a distributed stream processing engine that can ingest data from various sources. One of the most popular sources is Apache Kafka, a distributed streaming platform that provides the publish and subscribe features of an enterprise messaging system while also supporting data stream processing. In this blog we will create a real-time streaming pipeline for ingesting credit card data and finding merchants […]

Setup Standalone Apache Kafka Instance

Apache Kafka is a distributed streaming platform that provides the publish and subscribe features of an enterprise messaging system while also supporting data stream processing. In this blog we will set up a standalone Kafka topic on a local machine running the Windows operating system. Please treat this setup as a Hello World application; it is not meant for production use. Software versions used in […]

Apache Hive with MongoDB Integration

Apache Hive is a tool from the Apache Hadoop ecosystem that converts SQL-like queries into Hadoop jobs for data summarization, querying, and analysis. In this blog post we will see how data stored in MongoDB can be imported into a Hive table. The data from the Hive table is then processed, and the result is stored in another Hive table. We will use a 1 minute […]
