Apache Airflow – An Ideal Workflow Manager

When I joined a data analytics project that gave top management a platform for making data-driven decisions about development teams, we were analysing only one data source: code repositories. Even that single source was really several, with recent solutions hosted on Stash but many older ones still living in legacy version control systems. The entire pipeline was controlled and executed by shell scripts, and the data gathering was done by a third-party analyzer.

For new requirements we had a team of Java developers, so any new batch jobs were developed in Java and managed by the Spring Batch framework. Since the core of the existing solution was the third-party analyzer driven by shell scripts, we kept that in place.

Over the course of a year we added many other data sources, and our tech stack grew with them. We now also had a good number of Python scripts handling data gathering, transformation and analysis tasks.

We had reached a stage with a lot of moving parts but no optimal way of managing all the tasks. We wanted a solution that could put the orchestration of all workflows in a single place while providing better transparency on execution status and easy handles for recovering from failures. We also expected the complexity of our workflows to increase further in the coming months as we expanded the analytical platform to more use cases.

Apache Airflow, being a popular and proven product, was not a difficult choice. We did a quick POC, and it delivered everything we expected from a workflow manager.

A User-Friendly Product

For starters, it was very convenient to set up an instance for the POC. At the project level we set up the instance on a Linux system, but I also did a quick setup on a Mac for some personal experiments. I have shared the steps for the Mac setup in a separate article.

When you launch Airflow for the first time, you will notice that the dashboard already contains sample DAGs covering almost all the key concepts of Airflow. These will help you get up to speed with Airflow constructs quickly.
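
Once you have learned from them, the samples can be switched off so they no longer clutter the dashboard; assuming a default installation, the relevant flag lives in airflow.cfg:

    [core]
    # Set to False to hide the bundled example DAGs from the dashboard
    load_examples = False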

The dashboard is very intuitive, and you will get used to it in no time. The state of your DAGs (Directed Acyclic Graphs), the status of your tasks, access to task-specific logs, and the ability to replay any task are all right there.

Points to Consider

I will also share some key learnings from my journey, which might help you avoid a few pitfalls and make better use of the product.

Tasks need an executor to allocate resources and run them, but the default executor you get after installation is the SequentialExecutor, which, as the name suggests, can only run one task at a time. Generally you will set up Airflow with distributed processing, in which case you will use one of the remote executors. But even if you are setting up Airflow on a standalone machine, you should change the executor to the LocalExecutor, which lets you execute multiple tasks in parallel.
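
As an illustration, here is roughly what that change looks like in airflow.cfg (a minimal sketch; the connection string is a placeholder, and in newer Airflow versions sql_alchemy_conn sits under a separate [database] section instead of [core]):

    [core]
    # Run multiple tasks in parallel on a single machine
    executor = LocalExecutor

    # The LocalExecutor needs a database that supports concurrent
    # connections; the default SQLite backend only works with the
    # SequentialExecutor. Placeholder credentials below.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow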

Secondly, if you have limited infrastructure, which was our situation, you need to find a balance between the number of tasks and the degree of transparency in the workflow. It is always enticing to go all out during refactoring and break your workflow steps down to the most granular level possible, but that can lead to serious task-management overhead: every extra task adds scheduling delay, which may increase the overall execution time of your workflow. Keep this in mind while designing your task boundaries.

Another infrastructure-related point is having the right settings to optimise CPU utilisation. High scheduler CPU usage is a known issue. We followed the steps mentioned in this link, which drastically reduced CPU consumption for us.
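
To give a flavour of the kind of tuning involved, these are scheduler settings that are commonly adjusted in airflow.cfg (a sketch with illustrative values; the exact keys vary between Airflow versions, so check the documentation for yours):

    [scheduler]
    # How often (in seconds) DAG files are re-parsed; raising this
    # reduces the constant file-processing load on the scheduler
    min_file_process_interval = 60

    # Seconds between scheduler heartbeats
    scheduler_heartbeat_sec = 10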

If you still end up with a lot of tasks, the Graph View on the dashboard will be overwhelmed: it will take longer to load, and the graph becomes very difficult to comprehend and navigate. It is better to group related subsets of tasks into Task Groups so that you expand only the particular group you want to analyse while the rest stay collapsed.
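
Here is a minimal sketch of what that looks like in a DAG definition (the DAG, group, and task names are made up for illustration):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.utils.task_group import TaskGroup

    with DAG(
        dag_id="analytics_pipeline",          # hypothetical DAG
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Each TaskGroup renders as a single collapsible node in Graph View
        with TaskGroup(group_id="gather") as gather:
            pull_repos = BashOperator(task_id="pull_repos", bash_command="echo pull repos")
            pull_tickets = BashOperator(task_id="pull_tickets", bash_command="echo pull tickets")

        with TaskGroup(group_id="transform") as transform:
            clean = BashOperator(task_id="clean", bash_command="echo clean")
            aggregate = BashOperator(task_id="aggregate", bash_command="echo aggregate")
            clean >> aggregate

        gather >> transform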

Another key aspect of workflow management is alerting. Airflow does give you the EmailOperator, but reserve that for conditional notifications that are part of the workflow itself. For notifications on any task failure, use the email_on_failure argument instead, typically set for every task through the DAG's default_args. The failure notification you receive even contains a link to the logs of the failed task, as well as a link to clear its status so that the operation can be executed again.
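
A minimal sketch of failure alerting via default_args (the recipient address and DAG are hypothetical, and an SMTP server must be configured in airflow.cfg for mails to actually go out):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "email": ["data-team@example.com"],  # hypothetical recipient
        "email_on_failure": True,            # mail on any task failure
        "email_on_retry": False,
    }

    with DAG(
        dag_id="nightly_analysis",           # hypothetical DAG
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args=default_args,           # applied to every task
    ) as dag:
        analyze = BashOperator(task_id="analyze", bash_command="echo analyse")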

Summary

To conclude, if you are in a similar situation, I would highly recommend introducing a workflow manager, and Airflow definitely deserves to be on your shortlist. It has improved the overall transparency of our batch jobs, given us easy interfaces for analysing task failures and replaying jobs from any point in their life cycle, strengthened our troubleshooting capability, and provided a central point for orchestrating all our workflows; as a consequence, the overall maintainability of our solution has improved.