Now let’s load a staging table in Snowflake. I like to think of staging in the old-school Kimball way: as simply writing data to disk. On this first stop in the database, I try to alter the data as little as possible; I find this makes it easier to troubleshoot issues and validate the data later on.
Let’s create a new folder under ‘default’ to hold all the jobs that will write our files from S3 to Snowflake. Right click ‘default’ and select ‘Create Folder’.
Name the folder ‘Stage Data to Snowflake’.
Right click this new folder and select ‘Add Orchestration Job’.
Name the job ‘s3_stg_Title_Basics’. I generally like the name to include source and target descriptions (if possible) along with the name of the target data set.
For this exercise we are going to use the ‘Create/Replace Table’ stage to have Matillion build the table for us if it does not exist, and the ‘S3 Load’ stage to load the data from S3.
Search for ‘table’ in the Components pane. Grab the ‘Create/Replace Table’ stage from the ‘DDL’ folder and drag it onto the canvas.
Now search for ‘s3’. Grab the ‘S3 Load’ stage and drag it onto the canvas.
I’m also going to add a ‘Truncate Table’ stage, so our job will truncate the data before each load.
Connect the stages like so.
Select the ‘Create/Replace Table 0’ stage and set the properties as shown below; the column information is in the screenshot below the properties. Columns must be populated before you can select primary and clustering keys. I’m not sure why there is a clustering key option here (I suspect it is a carryover from Redshift), but I filled it out anyway.
Here is a list of the column properties. Use the ‘+’ at the bottom to add another line. Add the Column Name, Data Type, and Size for each column. Set the ‘tconst’ column’s ‘Not Null’ and ‘Unique’ properties to ‘Yes’; tconst is the title key and should never be null. For details on the data set, look at the downloaded files and at IMDb’s documentation here: https://www.imdb.com/interfaces/.
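For reference, the DDL this stage generates should look roughly like the following (the column sizes here are illustrative; use whatever sizes you entered in the properties above). Everything is staged as text, in keeping with altering the data as little as possible:

CREATE OR REPLACE TABLE "TITLE_BASICS" (
    "tconst"         VARCHAR(20) NOT NULL UNIQUE, -- title key
    "titleType"      VARCHAR(100),
    "primaryTitle"   VARCHAR(1000),
    "originalTitle"  VARCHAR(1000),
    "isAdult"        VARCHAR(10),
    "startYear"      VARCHAR(10),
    "endYear"        VARCHAR(10),
    "runtimeMinutes" VARCHAR(10),
    "genres"         VARCHAR(500)
);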
Select ‘Truncate Table 0’. Name it ‘Truncate Title_Basics’ and add the table name ‘TITLE_BASICS’. Remember, table names are case-sensitive.
To add a table that does not yet exist in the database, click the ‘+’ sign.
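Under the hood, this stage issues a one-line statement against the table we just defined:

TRUNCATE TABLE "TITLE_BASICS";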
Select ‘Load 0’ and fill in the following properties. Most of what is set below is self-explanatory. Setting the delimiter to ‘\t’ tells Matillion that our data is tab-delimited, and a ‘Skip Header’ value of 1 tells it to skip the first line of each file.
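Behind the scenes, the ‘S3 Load’ stage runs a Snowflake COPY INTO statement along these lines (the bucket path is a placeholder for wherever you landed the files, and authentication details are omitted):

COPY INTO "TITLE_BASICS"
FROM 's3://your-bucket/imdb/title.basics.tsv.gz' -- placeholder path
FILE_FORMAT = (
    TYPE = CSV
    FIELD_DELIMITER = '\t' -- tab-delimited
    SKIP_HEADER = 1        -- skip the header row
    COMPRESSION = AUTO
);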
Right click on the ‘Start 0’ icon and select ‘Run From Component (Dev)’.
You should see something like the following.
Once this runs successfully, copy this job and create one for each of the remaining files.
Below are the table definitions I used.
NOTE: for Title_Crew I set the ‘S3 Load Title Crew’ stage’s ‘Truncate Columns’ parameter to ‘True’. Some Directors and Writers values were exceeding the 5000-character limit. For the purposes of this course, we will just truncate and move on. In real life we would want to investigate the issue and put logic in place to prevent it from happening in the future. There is probably a newline missing somewhere, and several rows are getting concatenated into this one field. So if we were going to build this out as a true system, we would create a preprocessing script to identify where this happens and add the newline back in, or we could simply flag the row and send it to an error table. Some of these concepts will be covered later.
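In COPY INTO terms, ‘Truncate Columns’ maps to Snowflake’s TRUNCATECOLUMNS copy option, roughly like this (placeholder path again):

COPY INTO "TITLE_CREW"
FROM 's3://your-bucket/imdb/title.crew.tsv.gz' -- placeholder path
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '\t' SKIP_HEADER = 1)
TRUNCATECOLUMNS = TRUE; -- silently truncate values longer than the column size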