Setting Up Airflow


Airflow allows you to author workflows by creating tasks in a Directed Acyclic Graph (DAG). The Airflow scheduler then executes the tasks in these DAGs on a configured array of workers (executors).

This section will guide you through the prerequisites for the workshop. Please make sure to install the libraries before the workshop, as the conference WiFi can get quite slow when too many people are downloading and installing things at the same time.

Make sure to follow all the steps as detailed here, especially 🐍 PyCon attendees, as there are specific details for the PyCon setup that need to be done in advance.

Python 3.x

Python 3.7 is preferred.

We will be using Python. Installing all of Python’s packages individually can be a bit difficult, so we recommend using Anaconda, which provides a variety of useful packages/tools.

To download Anaconda, follow the link https://www.anaconda.com/download/ and select Python 3. Following the download, run the installer as per usual on your machine.

If you prefer not using Anaconda then this tutorial can help you with the installation and setup.

If you already have Python installed but not via Anaconda, do not worry. Make sure to have either venv or pipenv installed. Then follow the instructions to set up your virtual environment further down.

Git

Git is a version control software that records changes to a file or set of files. Git is especially helpful for software developers as it allows changes to be tracked (including who and when) when working on a project.

To download Git, go to the following link and choose the correct version for your operating system: https://git-scm.com/downloads.

Windows

Download the Git for Windows installer. Make sure to select “Use Git from the Windows command prompt”; this will ensure that Git is permanently added to your PATH.

Also keep “Checkout Windows-style, commit Unix-style line endings” selected and click on “Next”.

This will provide you with both Git and Git Bash. We will use the command line quite a lot during the workshop, so using Git Bash is a good option.

GitHub

GitHub is a web-based service for version control using Git. You will need to set up an account at https://github.com. Basic GitHub accounts are free and you can now also have private repositories.

Text Editors/IDEs

Text editors are tools with powerful features designed to optimize writing code. There are several text editors that you can choose from. Here are some we recommend:

  • VS Code: this is your facilitator’s favourite 💜 and it is worth trying if you have not checked it out yet

We suggest trying several editors before settling on one.

If you decide to go for VS Code, make sure to also have the Python extension installed. This will make your life so much easier (and it comes with a lot of nifty features 😎).

Microsoft Azure

You will need to get an Azure account as we will be using this to deploy the Airflow instance.

Note

If you are doing this tutorial live at PyCon US then your facilitator will provide you with specific instructions to set up your Azure subscription. If you have not received these please let your facilitator know ASAP.

Follow this link to get an Azure free subscription. This will give you 150 dollars in credit so you can get started experimenting with Azure and Airflow.

MySQL

MySQL is one of the most popular databases in use. We need MySQL to follow along with the tutorial, so make sure to install it beforehand.

Mac users

Warning

There are some known issues with MySQL on Mac, so we recommend using this approach to install and set MySQL up: https://gist.github.com/nrollr/3f57fc15ded7dddddcc4e82fe137b58e.

Also, note that you will need to make sure that OpenSSL is on your PATH. To add it accordingly, if using zsh:
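The snippet that originally accompanied this step is not reproduced here; a minimal sketch, assuming OpenSSL was installed via Homebrew into /usr/local/opt/openssl, would be:

```bash
# Append Homebrew's OpenSSL binaries to the PATH for zsh (adjust the path to your install)
echo 'export PATH="/usr/local/opt/openssl/bin:$PATH"' >> ~/.zshrc
```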

If using bash:
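The equivalent for bash, under the same assumption about the Homebrew install location:

```bash
# Append Homebrew's OpenSSL binaries to the PATH for bash
echo 'export PATH="/usr/local/opt/openssl/bin:$PATH"' >> ~/.bashrc
```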

Make sure to reload your shell configuration using source ~/.bashrc or source ~/.zshrc.

Troubleshooting

Later on, during the setup, you will be installing mysqlclient. If during the process you get compilation errors, try the following:
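The original snippet is not preserved here; a commonly used workaround (an assumption, not necessarily the tutorial's exact fix) is to point the compiler at Homebrew's OpenSSL while installing:

```bash
# Build mysqlclient against Homebrew's OpenSSL (adjust the paths to your install location)
LDFLAGS="-L/usr/local/opt/openssl/lib" CPPFLAGS="-I/usr/local/opt/openssl/include" pip install mysqlclient
```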

If you want to be safe, before installing the library we recommend you set the following environment variables:
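For example (the exact values are an assumption based on a default Homebrew OpenSSL install):

```bash
export LDFLAGS="-L/usr/local/opt/openssl/lib"
export CPPFLAGS="-I/usr/local/opt/openssl/include"
```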

Windows users

Download and install MySQL from the official website https://dev.mysql.com/downloads/installer/ and execute it. For additional configuration and pre-requisites make sure to visit the official MySQL docs.

Linux users

You can install the Python and MySQL headers and libraries like so:

Debian/Ubuntu:
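A sketch of the usual packages (names may vary slightly between releases):

```bash
# Python headers plus the MySQL client development files needed to build mysqlclient
sudo apt-get install python3-dev default-libmysqlclient-dev build-essential
```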

Red Hat / CentOS:
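For example (again, package names may vary by release):

```bash
sudo yum install python3-devel mysql-devel
```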

After installation you need to start the service with:
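On a systemd-based distribution this is typically (the service is usually called mysql on Debian/Ubuntu and mysqld on Red Hat/CentOS):

```bash
sudo systemctl start mysql    # use mysqld on Red Hat/CentOS
```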

To ensure that the database launches after a reboot:
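For example:

```bash
sudo systemctl enable mysql   # use mysqld on Red Hat/CentOS
```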

You should now be able to start the mysql shell through /usr/bin/mysql -u root -p; you will be asked for the password you set during installation.

Creating a virtual environment

You will need to create a virtual environment to make sure that you have the right packages and setup needed to follow along with the tutorial. Follow the instructions that best suit your installation.

Anaconda

If you are using Anaconda, first you will need to make a directory for the tutorial, for example mkdir airflow-tutorial. Once created, make sure to change into it using cd airflow-tutorial.

Next, make a copy of this environment.yaml and install the dependencies via conda env create -f environment.yml. Once all the dependencies are installed you can activate your environment through the following commands:
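Assuming the environment defined in the YAML file is named airflow-tutorial (check the name: field to be sure), activation looks like:

```bash
conda activate airflow-tutorial    # older conda versions: source activate airflow-tutorial
```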

To exit the environment you can use:
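On recent conda releases the command is conda deactivate (older versions use source deactivate):

```bash
conda deactivate
```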

pipenv

Create a directory for the tutorial, for example mkdir airflow-tutorial, and change your working directory to this newly created one: cd airflow-tutorial.

Then make a copy of this Pipfile in your new directory and install the dependencies via pipenv install. This might take a while, so you can make yourself a brew in the meantime.

Once all the dependencies are installed you can run pipenv shell, which will start a session with the correct virtual environment activated. To exit the shell session, use exit.

virtualenv

Create a directory for the tutorial, for example mkdir airflow-tutorial, and change directories into it (cd airflow-tutorial). Now you need to create a virtual environment with venv:
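A sketch of the venv command, using the env/airflow location mentioned just below:

```bash
python3 -m venv env/airflow
```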

This will create a virtual Python environment in the env/airflow folder. Before installing the required packages you need to activate your virtual environment:
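For example (on Windows the activation script lives under Scripts instead of bin):

```bash
source env/airflow/bin/activate    # Windows: env\airflow\Scripts\activate
```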

Make a copy of this requirements file in your new directory. Now you can install the packages via pip: pip install -r requirements.txt.

To leave the virtual environment run deactivate

Twitter and Twitter developer account

This tutorial uses the Twitter API for some examples and to build some of the pipelines included.

Please make sure to follow the next steps to get you all set up.

  1. Create an account at https://twitter.com/.

  2. Next, you will need to apply for a developer account, head to https://developer.twitter.com/en/apply.

    You will need to provide detailed information about what you want to use the API for. Make sure to complete all the steps and confirm your email address so that you can be notified about the status of your application.

    Warning

    Before completing the application, read the Twitter developer app part of the 🐍 PyCon attendees section below ⬇️.

  3. Once your application has been approved you will need to go to https://developer.twitter.com/en/apps and log in with your details (they should be the same as your Twitter account ones).

  4. On your app dashboard, click on the Create an app button.

    Make sure to give it a descriptive name, something like airflow-tutorial.

  5. Once you complete the details and create your new app, you should be able to access it via the main app dashboard. Click on the Details button next to the app name and head over to Permissions. We only need read permissions for the tutorial.

  6. Now if you click on Keys and tokens you will be able to see an API key, an API secret, an access token, and an access token secret.

    They are only valid for the permissions you specified before. Keep a record of these in a safe place as we will need them for the Airflow pipelines.

Docker

We are going to use Docker for some bits of the tutorial (this will make it easier to have a local Airflow instance).

Follow the instructions at https://docs.docker.com/v17.12/install/ and make sure to read the prerequisites carefully before starting the installation.


🐍 PyCon attendees

Twitter developer app

The Twitter team will be expediting your applications to make sure you are all set up for the day 😎.

When filling in your application make sure to add the following details (as written here) to make sure this is processed.

In the “What are you planning to use the developer account for?” section:

Azure Pass account

As a PyCon attendee, you will be issued an Azure Pass worth 200 dollars with 90 days’ validity. You will not need to add credit card details to activate it, but you will need to follow this process to redeem your credits.

1. Send an email to your facilitator at trallard@bitsandchips.me with the subject line AirflowPyCon-AzurePass; they will send you an email with a unique code to redeem. Please do not share it with anyone, as this is a single-use pass and once activated it will be invalid.

2. Go to this site to redeem your pass. We recommend doing this in a private/incognito window. You can then click Start and attach your new pass to your existing account.

3. If you see the following error (see image), you can go to this site to register the email and proceed.

4. Confirm your email address. You will then be asked to add the promo code that you were sent by your instructor. Do not close or refresh the window until you have received a confirmation that this has been successful.

5. Activate your subscription: click on the Activate button and fill in your personal details.

Again, once completed, do not refresh the window until you see the confirmation image.

At this point your subscription will be ready; click on Get started to go to your Azure portal.


Airflow is a scheduler for workflows such as data pipelines, similar to Luigi and Oozie. It's written in Python and we at GoDataDriven have been contributing to it in the last few months.

This tutorial is loosely based on the Airflow tutorial in the official documentation. It will walk you through the basics of setting up Airflow and creating an Airflow workflow, and it will give you some practical tips. A (possibly) more up-to-date version of this blog can be found in my git repo.

1. Setup

Setting up a basic configuration of Airflow is pretty straightforward. After installing the Python package, we'll need a database to store some data and start the core Airflow services.

You can skip this section if Airflow is already set up. Make sure that you can run airflow commands, know where to put your DAGs and have access to the web UI.

Install Airflow

Airflow is installable with pip via a simple pip install apache-airflow. Either use a separate Python virtual environment or install it in your default python environment.

To use the conda virtual environment as defined in environment.yml from my git repo (the commands for these steps are sketched after the list):

  • Install miniconda.
  • Make sure that conda is on your path:
  • Create the virtual environment from environment.yml:
  • Activate the virtual environment:
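The original commands are not reproduced here; a sketch of those steps, assuming a default Miniconda install in ~/miniconda3 and an environment named airflow-tutorial in environment.yml, might be:

```bash
export PATH="$HOME/miniconda3/bin:$PATH"   # adjust to wherever you installed Miniconda
conda env create -f environment.yml        # creates the environment defined in the repo
source activate airflow-tutorial           # the name comes from environment.yml; newer conda: conda activate
```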

You should now have an (almost) working Airflow installation.

Alternatively, install Airflow yourself by running:
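In its simplest form this is just the pip command mentioned above:

```bash
pip install apache-airflow
```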

Airflow used to be packaged as airflow but is packaged as apache-airflow since version 1.8.1. Make sure that you install any extra packages with the right Python package: e.g. use pip install apache-airflow[dask] if you've installed apache-airflow and do not use pip install airflow[dask]. Leaving out the prefix apache- will install an old version of Airflow next to your current version, leading to a world of hurt.

You may run into problems if you don't have the right binaries or Python packages installed for certain backends or operators. When specifying support for e.g. PostgreSQL when installing extra Airflow packages, make sure the database is installed; do a brew install postgresql or apt-get install postgresql before the pip install apache-airflow[postgres]. Similarly, when running into HiveOperator errors, do a pip install apache-airflow[hive] and make sure you can use Hive.

Run Airflow

Before you can use Airflow you have to initialize its database. The database contains information about historical & running workflows, connections to external data sources, user management, etc. Once the database is set up, Airflow's UI can be accessed by running a web server and workflows can be started.

The default database is a SQLite database, which is fine for this tutorial. In a production setting you'll probably be using something like MySQL or PostgreSQL. You'll probably want to back it up as this database stores the state of everything related to Airflow.

Airflow will use the directory set in the environment variable AIRFLOW_HOME to store its configuration and our SQLite database. This directory will be used after your first Airflow command. If you don't set the environment variable AIRFLOW_HOME, Airflow will create the directory ~/airflow/ to put its files in.

Set environment variable AIRFLOW_HOME to e.g. your current directory $(pwd):
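In a bash-compatible shell that is:

```bash
export AIRFLOW_HOME="$(pwd)"
```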

or any other suitable directory.

Next, initialize the database:
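For the Airflow 1.x releases this post targets, the command is:

```bash
airflow initdb        # Airflow 2.x renamed this to: airflow db init
```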

Now start the web server and go to localhost:8080 to check out the UI:
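For example, on the default port:

```bash
airflow webserver --port 8080
```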

It should look something like this:

With the web server running workflows can be started from a new terminal window. Open a new terminal, activate the virtual environment and set the environment variable AIRFLOW_HOME for this terminal as well:
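A sketch of that terminal setup, assuming the conda environment name used earlier (swap in whichever virtual environment you created):

```bash
source activate airflow-tutorial
export AIRFLOW_HOME="$(pwd)"
```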

Make sure that you're in the same directory as before when using $(pwd).

Run a supplied example:
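One of the example DAGs that ships with Airflow can be run for a given date, for instance (the DAG, task, and date here are illustrative):

```bash
airflow run example_bash_operator runme_0 2017-07-01
```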

And check in the web UI that it has run by going to Browse -> Task Instances.

This concludes all the setting up that you need for this tutorial.

Tips

  • Both Python 2 and 3 are supported by Airflow. However, some of the lesser used parts (e.g. operators in contrib) might not support Python 3.
  • For more information on configuration check the sections on Configuration and Security of the Airflow documentation.
  • Check the Airflow repository for upstart and systemd templates.
  • Airflow logs extensively, so pick your log folder carefully.
  • Set the timezone of your production machine to UTC: Airflow assumes it's UTC.


2. Workflows

We'll create a workflow by specifying actions as a Directed Acyclic Graph (DAG) in Python. The tasks of a workflow make up a Graph; the graph is Directed because the tasks are ordered; and we don't want to get stuck in an eternal loop so the graph also has to be Acyclic.


The figure below shows an example of a DAG:

The DAG of this tutorial is a bit easier. It will consist of the following tasks:

  • print 'hello'
  • wait 5 seconds
  • print 'world'


and we'll plan daily execution of this workflow.

Create a DAG file

Go to the folder that you've designated to be your AIRFLOW_HOME and find the DAGs folder located in subfolder dags/ (if you cannot find it, check the setting dags_folder in $AIRFLOW_HOME/airflow.cfg). Create a Python file with the name airflow_tutorial.py that will contain your DAG. Your workflow will automatically be picked up and scheduled to run.

Configure common settings

First we'll configure settings that are shared by all our tasks. Settings for tasks can be passed as arguments when creating them, but we can also pass a dictionary with default values to the DAG. This allows us to share default arguments for all the tasks in our DAG and is the best place to set e.g. the owner and start date of our DAG.

Add the following import and dictionary to airflow_tutorial.py to specify the owner, start time, and retry settings that are shared by our tasks:
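A sketch of such a dictionary, consistent with the settings described in the next paragraph (the exact values are assumptions based on that description):

```python
import datetime as dt

default_args = {
    'owner': 'me',
    'start_date': dt.datetime(2017, 6, 1),   # the workflow is valid from June 1st, 2017
    'email_on_failure': False,               # do not send emails on failure
    'retries': 1,                            # retry a failed task once...
    'retry_delay': dt.timedelta(minutes=5),  # ...after waiting 5 minutes
}
```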

These settings tell Airflow that this workflow is owned by 'me', that the workflow is valid since June 1st of 2017, that it should not send emails, and that it is allowed to retry the workflow once if it fails, with a delay of 5 minutes. Other common default arguments are email settings on failure and the end time.

Create the DAG

We'll now create a DAG object that will contain our tasks.

Name it airflow_tutorial_v01 and pass default_args:
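A sketch of the DAG definition, continuing from the default_args above and using the cron schedule discussed right after (the context-manager form requires Airflow 1.8+):

```python
from airflow import DAG

with DAG('airflow_tutorial_v01',
         default_args=default_args,
         schedule_interval='0 0 * * *',
         ) as dag:
    pass  # the tasks defined in the next section go here, indented
```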

With schedule_interval='0 0 * * *' we've specified a run at every hour 0; the DAG will run each day at 00:00. See crontab.guru for help deciphering cron schedule expressions. Alternatively, you can use strings like '@daily' and '@hourly'.

We've used a context manager to create a DAG (new since 1.8). All the tasks for the DAG should be indented to indicate that they are part of this DAG. Without this context manager you'd have to set the dag parameter for each of your tasks.

Airflow will generate DAG runs from the start_date with the specified schedule_interval. Once a DAG is active, Airflow continuously checks in the database if all the DAG runs have successfully run since the start_date. Any missing DAG runs are automatically scheduled. When you initialize on 2016-01-04 a DAG with a start_date at 2016-01-01 and a daily schedule_interval, Airflow will schedule DAG runs for all the days between 2016-01-01 and 2016-01-04.

A run starts after the time for the run has passed. The time for which the workflow runs is called the execution_date. The daily workflow for 2016-06-02 runs after 2016-06-02 23:59 and the hourly workflow for 2016-07-03 01:00 starts after 2016-07-03 01:59.

From the ETL viewpoint this makes sense: you can only process the daily data for a day after it has passed. This can, however, require some juggling with dates for other workflows. If, for Machine Learning models, you want to use all the data up to a given date, you'll have to add the schedule_interval to your execution_date somewhere in the workflow logic.

Because Airflow saves all the (scheduled) DAG runs in its database, you should not change the start_date and schedule_interval of a DAG. Instead, up the version number of the DAG (e.g. airflow_tutorial_v02) and avoid running unnecessary tasks by using the web interface or command line tools.

Timezones and especially daylight savings can mean trouble when scheduling things, so keep your Airflow machine in UTC. You don't want to skip an hour because daylight savings kicks in (or out).

Create the tasks

Tasks are represented by operators that either perform an action, transfer data, or sense if something has been done. Examples of actions are running a bash script or calling a Python function; of transfers are copying tables between databases or uploading a file; and of sensors are checking if a file exists or data has been added to a database.

We'll create a workflow consisting of three tasks: we'll print 'hello', wait for 5 seconds and finally print 'world'. The first two are done with the BashOperator and the latter with the PythonOperator. Give each operator a unique task ID and something to do:
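A sketch of those three operators, using the Airflow 1.x import paths; they belong (indented) inside the with DAG(...) as dag: block created above, and the complete file is shown further below:

```python
from airflow.operators.bash_operator import BashOperator      # Airflow 1.x import path
from airflow.operators.python_operator import PythonOperator


def print_world():
    print('world')


# Inside the `with DAG(...) as dag:` block:
print_hello = BashOperator(task_id='print_hello',
                           bash_command='echo "hello"')

sleep = BashOperator(task_id='sleep',
                     bash_command='sleep 5')

print_world = PythonOperator(task_id='print_world',
                             python_callable=print_world)  # rebinds the name to the task
```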

Note how we can pass bash commands in the BashOperator and that the PythonOperator asks for a Python function that can be called.

Dependencies in tasks are added by setting other actions as upstream (or downstream). Link the operations in a chain so that sleep will be run after print_hello and is followed by print_world; print_hello -> sleep -> print_world:
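Using Airflow's bitshift composition, that chain is a single line (placed with the tasks inside the DAG block):

```python
print_hello >> sleep >> print_world
```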

After rearranging the code your final DAG should look something like:
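A complete sketch of airflow_tutorial.py under the assumptions above (Airflow 1.x imports, a 5-second sleep):

```python
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def print_world():
    print('world')


default_args = {
    'owner': 'me',
    'start_date': dt.datetime(2017, 6, 1),
    'email_on_failure': False,
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}

with DAG('airflow_tutorial_v01',
         default_args=default_args,
         schedule_interval='0 0 * * *',
         ) as dag:

    print_hello = BashOperator(task_id='print_hello',
                               bash_command='echo "hello"')

    sleep = BashOperator(task_id='sleep',
                         bash_command='sleep 5')

    print_world = PythonOperator(task_id='print_world',
                                 python_callable=print_world)

    print_hello >> sleep >> print_world
```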

Test the DAG

First check that the DAG file contains valid Python code by executing the file with Python:
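For example:

```bash
python airflow_tutorial.py
```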

You can manually test a single task for a given execution_date with airflow test:
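With the DAG and task IDs used above that would be (Airflow 2.x uses airflow tasks test instead):

```bash
airflow test airflow_tutorial_v01 print_world 2017-07-01
```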

This runs the task locally as if it was for 2017-07-01, ignoring other tasks and without communicating to the database.

Activate the DAG

Now that you're confident that your dag works, let's set it to run automatically! To do so, the scheduler needs to be turned on; the scheduler monitors all tasks and all DAGs and triggers the task instances whose dependencies have been met. Open a new terminal, activate the virtual environment and set the environment variable AIRFLOW_HOME for this terminal, and type
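The scheduler is started with:

```bash
airflow scheduler
```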


Once the scheduler is up and running, refresh the DAGs page in the web UI. You should see airflow_tutorial_v01 in the list of DAGs with an on/off switch next to it. Turn on the DAG in the web UI and sit back while Airflow starts backfilling the dag runs!

Tips

  • Make your DAGs idempotent: rerunning them should give the same results.
  • Use the cron notation for schedule_interval instead of @daily and @hourly. @daily and @hourly always run after respectively midnight and the full hour, regardless of the hour/minute specified.
  • Manage your connections and secrets with Connections and/or Variables.

3. Exercises

You now know the basics of setting up Airflow, creating a DAG and turning it on; time to go deeper!

  • Change the interval to every 30 minutes.
  • Use a sensor to add a delay of 5 minutes before starting.
  • Implement templating for the BashOperator: print the execution_date instead of 'hello' (check out the original tutorial and the example DAG).
  • Use templating for the PythonOperator: print the execution_date with one hour added in the function print_world() (check out the documentation of the PythonOperator).

4. Resources

  • The official Airflow tutorial: showing a bit more in-depth templating magic.
  • ETL best practices with Airflow: good best practices to follow when using Airflow.
  • Airflow: Tips, Tricks, and Pitfalls: more explanations to help you grok Airflow.
  • Whirl: Fast iterative local development and testing of Apache Airflow workflows
