Airflow In Python


Airflow is an open-source workflow management platform, written in Python, for scheduling and monitoring your data pipelines. You can use it to build machine learning pipelines, ETL pipelines, or, more generally, to schedule any recurring jobs. Because Airflow pipelines are defined in Python code, they can be generated dynamically: you can write code that instantiates pipelines at runtime.

Basic Airflow concepts

  • Task: a defined unit of work (these are called operators in Airflow)
  • Task instance: an individual run of a single task. Task instances also have an indicative state, which could be “running”, “success”, “failed”, “skipped”, “up for retry”, etc.
  • DAG: directed acyclic graph, a set of tasks with an explicit execution order, a beginning, and an end
  • DAG run: individual execution/run of a DAG
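
To make these terms concrete, here is a minimal sketch of a DAG with a single task, assuming Airflow 2.x; the dag_id, task_id, and command are illustrative, not from the article:

```python
# Minimal sketch tying the terms above to code (all names are illustrative).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# The DAG: a named graph of tasks with a schedule.
with DAG(
    dag_id="concepts_demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # A task: one unit of work, created by instantiating an operator.
    say_hello = BashOperator(task_id="say_hello", bash_command="echo hello")

# Each scheduled execution of the DAG is a DAG run; each execution of
# say_hello within a DAG run is a task instance, with a state such as
# "running", "success", or "failed".
```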

Demystifying the DAG

In a DAG, both the vertices (the nodes) and the edges (the arrows linking the nodes) have an order and direction associated with them.

Each node in a DAG corresponds to a task, which in turn represents some piece of data processing. For example:

Node A could be the code for pulling data from an API, node B could be the code for anonymizing that data, node C could be the code for checking that there are no duplicate records, and so on.
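
Here is a hypothetical sketch of that three-node pipeline as an Airflow DAG, again assuming Airflow 2.x; the task ids and callables are placeholders:

```python
# A hypothetical sketch of the A -> B -> C pipeline described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_from_api():
    print("node A: pulling data from the API")


def anonymize():
    print("node B: anonymizing the data")


def check_duplicates():
    print("node C: checking for duplicate records")


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    node_a = PythonOperator(task_id="pull_from_api", python_callable=pull_from_api)
    node_b = PythonOperator(task_id="anonymize", python_callable=anonymize)
    node_c = PythonOperator(task_id="check_duplicates", python_callable=check_duplicates)

    # Each >> is a directed edge: the left task must finish
    # before the right one starts.
    node_a >> node_b >> node_c
```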


These ‘pipelines’ are acyclic: the graph must not contain any loops, so every run has a definite point of completion.


Each of the edges has a particular direction that shows the relationship between nodes. For example, we can only anonymize the data once it has been pulled from the API.
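
In code, these directed edges are declared as dependencies between tasks. Reusing the hypothetical node_a and node_b operators from the sketch above, the following lines are equivalent ways to state that anonymization runs only after the API pull:

```python
node_a >> node_b               # bitshift syntax: node_a runs before node_b
node_b << node_a               # the same edge, written right to left
node_a.set_downstream(node_b)  # the equivalent explicit method call
```

Whichever form you use, Airflow refuses to load a DAG whose dependencies form a cycle, which is what keeps the graph acyclic.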