Stage Data in Snowflake

Now let’s load a staging table in Snowflake.  I like to think of staging in the old school Kimble way as just writing data to disk.  In my first stop to the database, I try to alter the data as little as possible. I feel this makes it easier to troubleshoot issues and validate data later on.

Let’s create a new folder under ‘default’ to hold all the jobs that will write our files from s3 to Snowflake.  Right click ‘default’ and select ‘Create Folder’.

stage to snowflake - create dim folder

Name the folder ‘Stage Data to Snowflake’.

stage to snowflake - name orch job

Right click this new folder and select ‘Add Orchestration Job’.

stage to snowflake - add 2nd orch job

Name the job ‘s3_stg_Title_Basics’.  I generally like the name to include src and target descriptions (if possible) and the name of the target data set.

stage to snowflake - title basics

For this exercise we are going to use the ‘Create/Replace Table’ stage, to have Matillion build the table for us, if it does not exist.  And we are going to use the ‘S3 Load’ stage to load data from S3.

Search for ‘table’ in the Components pain.  Grab the ‘Create/Replace Table’ stage from the ‘DDL’ folder, and drag it on the canvas.

stage to snowflake - table

Now search for ‘s3’.  Grab the ‘S3 Load’ stage and drag it onto the canvas.

stage to snowflake - s3 load

I’m also going to add a ‘Truncate Table’ stage.  So our job will truncate the data before each load.

Connect the stages like so.

stage to snowflake - add trunc stage

Select the ‘Create/Replace Table 0’ stage.  Set the properties as shown below. The column information is the screenshot below the properties.  Columns must be populated before you can select primary and clustering keys. I’m not sure why there is a clustering key option here, I feel that is a carry over from Redshift, but I filled it out anyway.

stage to snowflake - create prop

Here is a list of the column properties.  Use the ‘+’ at the bottom to add another line.  Add Column Name, Data Type, and Size for all columns.  Set the ‘tconst’ column ‘Not Null’=’Yes’ and ‘Unique’=’Yes’.  Tconst is the title key and should not be null. To find details on the data set look at the downloaded files and on imdb’s site here https://www.imdb.com/interfaces/.

stage to snowflake - create tbl columns

Select ‘Truncate Table 0’.  Name it ‘Truncate Title_Basics’ and add the table name ‘TITLE_BASICS’.  Remember Table names are case sensitive.

stage to snowflake - trunc tbl prop

To Add a table that does not exist in the DB yet, click the ‘+’ sign.

stage to snowflake - trunc tbl add name

Select ‘Load 0’.  Fill in the following properties.  Most of what set below is self explanatory.  The fixed delimiter of ‘/t’ lets Matillion know that our data is tab delimited.  Skip Header of 1, tells it to skip the first line of our file.

stage to snowflake - load properties

Right click on ‘Start 0’ icon and select ‘Run From Component (Dev)’.

stage to snowflake - start job

You should see something like the following.

stage to snowflake - success run

Once this runs successfully, copy this job and create them for the rest of the files.  

stage to snowflake - copy jobs

Below is what I used for table definitions.  

NOTE: for Title_Crew I set the ‘S3 Load Title Crew’ stage ‘Truncate Columns’ Parameter to ‘True’.  Some Directors and Writers columns were exceeding the 5000 character limit. For the purposes of this course, we will just truncate and move on.  In real life we would want to investigate this issue and put logic in place to prevent this from happening in the future. There is probably a new line missing and several rows are getting concatenated into this one field.  So if we were going to build this out as a true system, we would create a preprocessing script to identify where this is happening and add that newline in. Or we could simply flag the row and send it to an error table. Some of these concepts will be covered later.

stage to snowflake - name basics

stage to snowflake - title akas

stage to snowflake - title basics tbl def

stage to snowflake - title crew

stage to snowflake - title principals

stage to snowflake - title rankings

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s