GCS to BigQuery with Airflow


In the previous blog posts (part 1, part 2, part 3, and part 4) in this series, we talked about why we decided to build a marketing data warehouse. That endeavor started by figuring out how to deal with the first part: building the data lake. In this more technical post, I'll give some insights into how we're leveraging Apache Airflow to build the more complicated data pipelines, and I'll share some tips on how to get started.

This blog post is part of a series of five? (maybe more, you never know), in which we dive into the details of why we wanted to create a data warehouse, how we created the data lake, and how we used the data lake to create a data warehouse. It is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.

Getting Started with Cloud Composer

Google Cloud BigQuery Operators: BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. It is serverless Software as a Service (SaaS) that doesn't need a database administrator, which lets users focus on analyzing data to find meaningful insights using familiar SQL. The BigQuery operators share a few common parameters: bigquery_conn_id (a reference to a specific BigQuery hook/connection), delegate_to (the account to impersonate, if any; for this to work, the service account making the request must have domain-wide delegation enabled), and labels (a dictionary containing labels for the job/query, passed to BigQuery).
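As a hedged illustration (not the exact code from our pipelines), this is roughly how those parameters appear on the Airflow 1.x contrib BigQueryOperator; the project, dataset, table, and connection names are made-up placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

# Sketch only: project, dataset, table, and connection names are hypothetical.
with DAG("bq_operator_example", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    aggregate_sessions = BigQueryOperator(
        task_id="aggregate_sessions",
        sql="""
            SELECT date, SUM(sessions) AS sessions
            FROM `my_project.marketing.ga_sessions`
            GROUP BY date
        """,
        destination_dataset_table="my_project.marketing.sessions_per_day",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
        bigquery_conn_id="bigquery_default",            # reference to a specific BigQuery connection/hook
        labels={"team": "marketing", "source": "ga"},   # labels passed through to the BigQuery job
    )
```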

Cloud Composer is part of Google Cloud Platform and brings you most of the upside of using Apache Airflow (open source) with barely any of the downsides (setup, maintenance, etc.). Or, to quote their main USP: "A fully managed workflow orchestration service built on Apache Airflow." While we had worked with Airflow before, we weren't looking forward to spending time worrying about its management, as we planned to spend most of our time setting up and maintaining the data pipelines. With Cloud Composer, in the end, all you have to do is stick to creating pipelines (DAGs).

What is it suitable for?

You want to load data from the Google Analytics API, store it locally, translate some values into something new, and have it available in Google BigQuery. However you would build it, it's multiple tasks and functions that depend on each other. You wouldn't want to load the data into BigQuery before it has been cleaned (trash in, trash out, sound familiar?). With Airflow, the next task is only processed if the previous step was successful.
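A minimal sketch of what that could look like as a DAG, assuming three hypothetical Python callables; the point is simply that each task only runs once the previous one has succeeded:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract_ga_data(**context):
    """Placeholder: pull data from the Google Analytics API and store it locally/in GCS."""


def clean_ga_data(**context):
    """Placeholder: translate values to something new before the data is allowed into BigQuery."""


def load_into_bigquery(**context):
    """Placeholder: load the cleaned file into Google BigQuery."""


with DAG("ga_to_bigquery", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    extract = PythonOperator(task_id="extract_ga_data", python_callable=extract_ga_data, provide_context=True)
    clean = PythonOperator(task_id="clean_ga_data", python_callable=clean_ga_data, provide_context=True)
    load = PythonOperator(task_id="load_into_bigquery", python_callable=load_into_bigquery, provide_context=True)

    # load only runs if clean succeeded, and clean only runs if extract succeeded
    extract >> clean >> load
```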

Tasks

GCSToBigQueryOperator (a subclass of airflow.models.BaseOperator) loads files from Google Cloud Storage into BigQuery. The schema to be used for the BigQuery table may be specified in one of two ways: you may either pass the schema fields in directly, or you may point the operator to a Google Cloud Storage object name. In the latter case, the object in Google Cloud Storage must be a JSON file with the schema fields in it.
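To make the two schema options concrete, a hedged sketch using the Airflow 1.x contrib name GoogleCloudStorageToBigQueryOperator (newer Airflow versions ship it as GCSToBigQueryOperator in the Google provider package); the bucket, object, and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG("gcs_to_bq_example", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    # Option 1: pass the schema fields in directly.
    load_with_inline_schema = GoogleCloudStorageToBigQueryOperator(
        task_id="load_with_inline_schema",
        bucket="my-marketing-data-lake",
        source_objects=["ga/sessions.csv"],
        destination_project_dataset_table="my_project.marketing.ga_sessions",
        schema_fields=[
            {"name": "date", "type": "DATE", "mode": "REQUIRED"},
            {"name": "sessions", "type": "INTEGER", "mode": "NULLABLE"},
        ],
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )

    # Option 2: point the operator at a JSON schema file stored in the same bucket.
    load_with_schema_object = GoogleCloudStorageToBigQueryOperator(
        task_id="load_with_schema_object",
        bucket="my-marketing-data-lake",
        source_objects=["ga/sessions.csv"],
        destination_project_dataset_table="my_project.marketing.ga_sessions",
        schema_object="schemas/ga_sessions.json",  # JSON file with the schema fields in it
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )
```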


Tasks are, in almost every case, just one thing: get data from BigQuery, upload a file from GCS into BigQuery, download a file from Cloud Storage to local disk, process data. What makes Airflow very efficient to work with is that the majority of data processing tasks already have pre-built operators. The tasks I listed here map to operators (such as GoogleCloudStorageDownloadOperator and GoogleCloudStorageToBigQueryOperator) that you use like functions.
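For example, downloading a file from Cloud Storage to the local worker is a single pre-built operator call; this is a sketch with placeholder names, using the Airflow 1.x contrib import:

```python
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.gcs_download_operator import GoogleCloudStorageDownloadOperator

with DAG("gcs_download_example", start_date=datetime(2020, 1, 1), schedule_interval=None) as dag:
    download_export = GoogleCloudStorageDownloadOperator(
        task_id="download_export",
        bucket="my-marketing-data-lake",                  # placeholder bucket
        object="ga/sessions.csv",                         # placeholder object path
        filename="/home/airflow/gcs/data/sessions.csv",   # local path on the Composer worker
    )
```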

Versus Google Cloud Functions

If you mainly run very simple 'pipelines' that consist of only one function, or you have only a handful of use cases, it is likely overkill to leverage Cloud Composer; the costs might be too high, and you still have the overhead of writing DAGs. In that case, you might be better off with Google Cloud Functions, as you can write similar scripts and trigger them with Google Cloud Scheduler to run at a specific time.
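For comparison, a single-purpose job like that could be one small HTTP-triggered Cloud Function that Cloud Scheduler calls on a cron schedule. This is only a rough sketch; the project, dataset, and query are made up:

```python
# main.py – a minimal HTTP-triggered Cloud Function; Cloud Scheduler can hit its URL on a schedule.
from google.cloud import bigquery


def refresh_sessions_table(request):
    """Run a single scheduled query in BigQuery and return a simple status."""
    client = bigquery.Client()
    query = (
        "CREATE OR REPLACE TABLE `my_project.marketing.sessions_per_day` AS "
        "SELECT date, SUM(sessions) AS sessions "
        "FROM `my_project.marketing.ga_sessions` "
        "GROUP BY date"
    )
    client.query(query).result()  # wait for the job to finish
    return "ok", 200
```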


Costs

The costs for Google Cloud Composer are doable: for a basic setup it's around 450 dollars a month (if you run the instances 24 hours a day, 7 days a week), as you leverage multiple (a minimum of three) small instances. For more information on the costs, I would point you to this pricing example.

Building Pipelines

See above an example data pipeline: in typical Airflow fashion, every task depends on the previous task. In other words, notify_slack_channel would not run if any of the previous tasks failed. All tasks happen in a particular order, from left to right. In most cases, data pipelines become more complicated, as you can have multiple flows going on at the same time and combine them at the end.
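A hedged sketch of that shape (the task names are placeholders, with DummyOperator standing in for the real work): two flows run in parallel, get combined, and notify_slack_channel only runs once everything upstream has succeeded.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG("marketing_pipeline", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    load_ga = DummyOperator(task_id="load_ga")      # placeholder for a Google Analytics flow
    load_ads = DummyOperator(task_id="load_ads")    # placeholder for an ads flow
    combine = DummyOperator(task_id="combine")      # joins both flows
    notify_slack_channel = DummyOperator(task_id="notify_slack_channel")

    # Both flows run at the same time; combine waits for both, and
    # notify_slack_channel never fires if anything upstream failed.
    [load_ga, load_ads] >> combine >> notify_slack_channel
```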

Tips & Tricks

Google Cloud Build, Repositories

The files for Google Cloud Composer are saved in Google Cloud Storage. That is smart in itself, but at the same time you want them to live in a Git repository so you can efficiently work on them together. By leveraging this blog post, you're able to connect the Cloud Storage bucket to a repository and set up a sync between the two. This basically helps you build a deployment pipeline and makes sure that only production-ready code from your master branch ends up in GCS.
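As a rough sketch of what that sync can look like (the bucket name is a placeholder, and the exact trigger setup follows the linked blog post), a Cloud Build configuration on the master branch could rsync the repository's dags/ folder into the Composer bucket:

```yaml
# cloudbuild.yaml – hypothetical sketch: push the repo's dags/ folder to the Composer bucket.
steps:
  - name: gcr.io/cloud-builders/gsutil
    args: ["-m", "rsync", "-r", "-d", "dags/", "gs://my-composer-bucket/dags/"]
```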


Managing Dependencies

After working with it for a few months now, I'm still not sure whether managing dependencies through Google Cloud Composer is a good or a bad thing, as it creates some obstacles when you want to run a deployment and add some Python libraries (your servers could be down for 10-30 minutes at a time). In other setups, this is usually a bit smoother and creates less downtime.
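For reference, adding libraries typically goes through a command like the one below (the environment name and location are placeholders); Composer then rebuilds the workers while the update is applied, which is where the downtime comes from:

```
gcloud composer environments update my-composer-env \
    --location europe-west1 \
    --update-pypi-packages-from-file requirements.txt
```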

Sendgrid for Email Alerts

One of the upsides of Apache Airflow is that it sends alerts upon the failure of tasks. Make sure to set up the SendGrid notifications while you're setting up Google Cloud Composer. This is the most straightforward way of receiving email alerts (for free, as in most cases you shouldn't get too many failure emails).
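Once SendGrid is configured on the environment, the alerting itself is plain Airflow configuration; a minimal sketch with a placeholder address:

```python
from datetime import datetime

from airflow import DAG

default_args = {
    "email": ["data-alerts@example.com"],  # placeholder address
    "email_on_failure": True,              # send an email whenever a task fails
    "email_on_retry": False,
}

with DAG(
    "ga_to_bigquery",
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:
    ...  # tasks defined here inherit the alerting settings
```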

README

Document the crap out of your setup and DAGs. When I took over some of the pipelines that were used at Postmates for XML sitemap generation, it was a nightmare: they were hard to read, the code didn't make a lot of sense, and we had to refactor certain things just because of that. As pipelines (just like regular code) can be left untouched/unviewed for months (they sometimes literally only have one job), you want to make sure that when you come back you understand what happens inside the tasks.
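Airflow also lets you attach documentation to the DAG itself, so it shows up in the web UI; a small sketch of that (the description text is obviously made up):

```python
from datetime import datetime

from airflow import DAG

with DAG("ga_to_bigquery", start_date=datetime(2020, 1, 1), schedule_interval="@daily") as dag:
    dag.doc_md = """
    #### GA to BigQuery
    Pulls yesterday's Google Analytics data, cleans it, and loads it into the
    marketing dataset. If this fails, check the extract task's logs first.
    """
```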

Again… this blog post is written with the help of @RickDronkers and @Hussain / MarketLytics, whom we've worked alongside during this (ongoing) project.
