Airflow Setup

Airflow Server Setup

Spin up an EC2 instance.  This time let’s do it with a base Ubuntu install; I found it easier to install all the components on Ubuntu as opposed to the Amazon Linux AMI.  If you need instructions on this, please see the Setup source data server article.  Just pick an Ubuntu AMI instead of Amazon Linux in the 3rd screenshot.

SSH into the EC2 instance and run the wget commands at the end of the Setup source data server article.

Note: the user is now ‘ubuntu’.  Your SSH command will look like the following, with 36.183.155.80 replaced by the IP of your server:

ssh ubuntu@36.183.155.80 -i MyPOCKeyPair.pem.txt

In our previous example we pulled these source files from another server.  Here we will just manually download them to the Airflow server.  Our script to load these into S3 could easily be updated with a step to scp or sftp them from another server, but I’m going to skip that step in the interest of saving time.

Let’s update some firewall rules so that we can navigate to this EC2 instance.  Go to the EC2 dashboard, select the instance you just created, and under the ‘Description’ tab select your security group.

AF - UPDATE SECURITY GROUPS1

From the security group page click ‘Edit’ under the ‘Inbound’ tab.

AF - update firewall - edit

Add your machine’s IP with /32 (CIDR block range) for the following ports: 22, 80, 443, and 8080.  I have two sets below because I access this instance both from home and from IP addresses I use on the road.  (If you’d rather script these rules than click through the console, there is a boto3 sketch after the screenshot.)

af - update firewall rules
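For those who prefer scripting the security group rules over setting them in the console, a rough boto3 equivalent is sketched below.  The security group ID and IP address are placeholders, and this assumes boto3 is installed wherever you run it, with AWS credentials and a default region configured.

import boto3

# Placeholder values: substitute your own security group ID and public IP
SECURITY_GROUP_ID = 'sg-0123456789abcdef0'
MY_IP = '203.0.113.10/32'  # /32 restricts the rule to a single address

ec2 = boto3.client('ec2')

# Open the same ports we added in the console: SSH, HTTP, HTTPS, and the Airflow web UI
for port in (22, 80, 443, 8080):
    ec2.authorize_security_group_ingress(
        GroupId=SECURITY_GROUP_ID,
        IpPermissions=[{
            'IpProtocol': 'tcp',
            'FromPort': port,
            'ToPort': port,
            'IpRanges': [{'CidrIp': MY_IP}],
        }],
    )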

You will also want to set up a Snowflake account for this walkthrough.  Instructions can be found in my Snowflake Setup article.

Alright, let’s set up our Airflow server.  First we will install a small utility for managing Personal Package Archives (PPAs), enable the universe repository, and update our package lists.

Then install pip.  Pip is a package manager used to install and manage packages written in Python.

>sudo apt-get install software-properties-common 

>sudo apt-add-repository universe 

>sudo apt-get update 

>sudo apt-get install python-pip

Now let’s install, create, and activate a Python virtual environment.  This will allow us to install and update packages without affecting the core machine’s Python libraries.

>sudo pip install virtualenv

>virtualenv venv

>source venv/bin/activate

Install the Apache Airflow server with S3, all databases, and JDBC support.

(venv)>pip install "apache-airflow[s3,all_dbs,jdbc]"

Initialize the Airflow database.  In practice you will want to set up a real database for the backend; the Airflow documentation recommends MySQL or Postgres.  More configuration details can be found here (Airflow Configuration Documentation), including details on configuring Airflow to encrypt passwords.

(venv)>airflow initdb
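If you do point Airflow at Postgres later, the change lives in airflow.cfg (under ~/airflow by default).  A minimal sketch, with placeholder host and credentials:

[core]
# SQLAlchemy connection string for the metadata database (placeholder values)
sql_alchemy_conn = postgresql+psycopg2://airflow_user:airflow_pass@your-db-host:5432/airflow
# SQLite only supports the SequentialExecutor; a real backend lets you use LocalExecutor
executor = LocalExecutor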

Launch the scheduler and the webserver.

(venv)>airflow scheduler -D

(venv)>airflow webserver -D

Note: when restarting the webserver and/or scheduler, I simply ran a command to look for the running processes, issued a kill command, and then restarted the process with the commands above.  There are fancier ways of doing this if you want to google them; I’m trying to keep the Linux stuff simple here.

(venv)>ps -ef | grep airflow

(venv)>kill <pid>

We are going to install boto3, a Python library for interacting with AWS, as well as the Snowflake connector.  Note: Airflow has S3 support, but I ran into an issue when trying to use it.  It’s easy enough to script in Python, so I went ahead and did that.

(venv)>pip install boto3==1.3.0
(venv)>pip install snowflake-connector-python
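As a preview of how these two libraries get used, here is a minimal sketch.  The bucket name, file name, and Snowflake credentials are placeholders, and it assumes AWS credentials are already configured on the instance.

import boto3
import snowflake.connector

# Placeholder names: swap in your own bucket, file, and Snowflake account details
BUCKET = 'my-airflow-poc-bucket'
LOCAL_FILE = 'my_source_file.csv'

# Upload a local file to S3
s3 = boto3.client('s3')
with open(LOCAL_FILE, 'rb') as f:
    s3.put_object(Bucket=BUCKET, Key=LOCAL_FILE, Body=f)

# Open a Snowflake connection and run a quick sanity-check query
conn = snowflake.connector.connect(
    user='MY_USER',
    password='MY_PASSWORD',
    account='my_account',
)
cur = conn.cursor()
cur.execute('SELECT CURRENT_VERSION()')
print(cur.fetchone())
cur.close()
conn.close()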

Now I wanted to install some sort of editor for Python, so I did not have to edit everything in vi, or do it locally and then scp up to my EC2 instance.  I went with Jupyter Notebook, which runs through your browser.  I used port forwarding so I could use the browser on my local machine.

Install the notebook:

(venv)>sudo apt-get -y install ipython ipython-notebook

(venv)>sudo -H pip install jupyter

I updated the jupyter notebook config to automatically save a .py file when saving the notebook.

If the .jupyter folder (in your home directory) does not contain a jupyter_notebook_config.py file, you need to generate it with the following command:

>jupyter notebook --generate-config

Add this snippet to your jupyter_notebook_config.py file.

import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)

c.FileContentsManager.post_save_hook = post_save

af - jupyter_note_book_config

 

Now let’s port forward from the local machine to our EC2 instance, then launch jupyter notebook on EC2.  Note that after running the notebook on EC2, the output may give you a link to go to and/or a token to use for login; be sure to note the port and token from that link.  Next we will browse to localhost:8000 on our local machine, which will forward the traffic to port 8888 on the EC2 instance.  If you have trouble with this step, make sure the notebook is running on port 8888; if you start it multiple times it may be running on port 8889.

Local-Machine$ ssh -L 8000:localhost:8888 ubuntu@111.111.111.111 -i MyPOCKeyPair.pem.txt

ec2-instance> jupyter notebook

af - jnotebook exec
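If the notebook ends up on a different port, or you want to stop it from trying to open a browser on the headless server, you can be explicit with Jupyter’s standard flags:

ec2-instance> jupyter notebook --no-browser --port=8888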

Navigate to localhost:8000 on your local web browser.

af - localhost8000

And you should see the jupyter notebook homepage.

af - jnotebook

Now simply navigate to the public IP of your Airflow EC2 server on port 8080.

ie. -> http://54.176.153.84:8080

And now you can see the Airflow admin page.

af - aiflow web landing

Up next: configuring our first DAG and some Python scripts to move data to S3.