Load Dimensions with Matillion

Now we will build out our star schema.  Here we will build some simple destructively loaded dimensions and a fact table.  In future sections I will get into more advanced ETL operations such as incremental loads, change tracking, and stub records.  For the purposes of this tutorial we will keep it relatively simple.  That said, simple is good: if I can get away with a destructive load, I’ll do it, because it keeps everything much simpler.

I’m going to use an MD5 hashing algorithm to create Surrogate IDs (SIDs).  Surrogate IDs are just single-field unique identifiers for dimensions.  Historically they are populated with an integer incremented sequentially, either by the database or by the ETL system.  If query performance is not an issue (hashed keys are wider than integers), I like to use a hashing algorithm because it simplifies the ETL design.  It also helps load performance tremendously: with a hash we do not need to join to a dimension to find its SID value, we just hash the join key on the fact table.  This reduces processing in the fact load and lets us load the facts and dimensions at the same time.  Also, if SID values somehow get out of sync, we can easily re-hash them.
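
Here is the idea in plain Snowflake SQL (a sketch for illustration; TITLE_PRINCIPALS is an assumed fact-side source, not a table we load in this tutorial):

-- Dimension side: hash the natural key to create the SID.
SELECT MD5("nconst") AS "NConstSID", "primaryName"
FROM NAME_BASICS;

-- Fact side: hash the same natural key. No join to the dimension is
-- needed; the two MD5 values line up by construction.
SELECT MD5("nconst") AS "NConstSID", COUNT(*) AS "titleCount"
FROM TITLE_PRINCIPALS
GROUP BY MD5("nconst");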

First up, Person_D.

Let’s create a Dimensions folder in our project.  Right click ‘default’ and select ‘Create Folder’.

stage to snowflake - create dim folder

Enter a name for this folder.  I used ‘Dimensions’.

Right click ‘Dimensions’ and select ‘Add Transformation Job’.  The transformation jobs here are more like point-and-click SQL, whereas the Orchestration jobs have more DDL functions and connectors for moving data around.  I think of Orchestration jobs like a traditional ETL tool’s sequencer jobs or job scheduling software; Transformation jobs are more like ELT or graphical SQL jobs.  In either case I normally like to work from the inside out: I create the core transformation logic, then work on the DDL, and finally the sequencing.

Dimension - rename folder name dim

For this transformation job we will use three components.

Table Input -> Accesses our source data set, the data we will use as the basis for our dimension build.

Calculator -> Generates our SID as well as a deceased flag.

Table Output -> Inserts our source data plus our two new fields into our target dimension table.
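
Taken together, the job amounts to roughly this SQL (a sketch for orientation; Matillion generates and runs its own SQL under the hood, and we will build each expression step by step below):

INSERT INTO PERSON_D
SELECT
    MD5("nconst") AS "NConstSID",                        -- surrogate ID
    "nconst", "primaryName", "birthYear",
    "deathYear", "primaryProfession",
    CASE WHEN "deathYear" IS NULL THEN 'N' ELSE 'Y' END AS "deceasedFlag",
    current_timestamp()::timestamp_ntz AS "loadDttm"
FROM NAME_BASICS;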

dimension - add transformation job

Name the job.  I named mine ‘Person_D’.
dimension - name transformation

From ‘Components’ search for ‘table’.  Grab ‘Table Input’ and ‘Table Output’ and add them to the canvas.

dimension - add table components

Search for ‘Calc’, then select ‘Calculator’ and drag it to the canvas.

dimension - add calc transform

Below you can see my canvas after renaming and connecting the components.

dimension - connect stages

Select the table input component ‘Properties’ tab.

dimension - tbl input properties

Change the ‘Name’ property. In the ‘Table’ properties select NAME_BASICS.  This will be the source data of our dimension.

dimension - table name

Hit the column names ‘Edit Properties’ selector.  Highlight all columns except knownForTitles (we are not using that field), and click the right arrow.  Click OK.

dimension - src select fields

Select the ‘Calculator’ component.

dimension - select calc stage

Under properties select the ‘…’ (‘Edit Properties’ menu) for Calculations.

dimension - open calc properties

In the Edit Properties window select the plus sign under expressions to add a new calculated field.

dimension - add calc

In the top right side of the Edit Properties window, add the name for our new field, ‘NConstSID’.  In the text box underneath that, add the SQL formula we want to use.  Here we will use the MD5 function, passing it the field ‘nconst’ -> ‘MD5("nconst")’.  Since nconst is a unique identifier this will work.  If we had a composite key, or multiple fields that uniquely identify a row in the table, I would concatenate the values from all the fields and pass the result to the MD5 function, as sketched below.
dimension - NconstSID
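
For a composite key, the concatenation would look something like this (a sketch; TITLE_PRINCIPALS and its "tconst" and "ordering" columns are assumptions for illustration, and the delimiter keeps values like 'ab'+'c' from colliding with 'a'+'bc'):

SELECT MD5("tconst" || '|' || "ordering") AS "TitlePrincipalSID"
FROM TITLE_PRINCIPALS;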

Click the plus sign under Expressions again to add another field, ‘deceasedFlag’, with the following code block:

case when "deathYear" is NULL
THEN 'N'
ELSE 'Y'
END

dimension - deceased flag

Now let’s add a Load Timestamp.  This is part of what I would call the load subrecord: a set of fields that you tack on to every table of the warehouse, populated by the load process.  These contain what would be called “process metadata”.  You may want to add other things like data source, delete flag, update date time, etc.
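
As an illustration, a dimension carrying a fuller load subrecord might be declared like this (a sketch only; every column past loadDttm is an assumption, not something we build in this tutorial):

CREATE OR REPLACE TABLE EXAMPLE_D (
    "exampleSID"  VARCHAR(32),    -- MD5 surrogate ID
    "exampleAttr" VARCHAR,        -- business attributes go here
    -- load subrecord ("process metadata"):
    "loadDttm"    TIMESTAMP_NTZ,  -- when the row was loaded
    "updateDttm"  TIMESTAMP_NTZ,  -- when the row was last changed
    "dataSource"  VARCHAR,        -- source system or feed name
    "deleteFlag"  VARCHAR(1)      -- logical delete indicator
);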

Click the ‘+’ sign again, add the field ‘loadDttm’, go to the Functions pane, select the ‘Date functions’ folder, and double click ‘CURRENT_TIMESTAMP’.  Up in the code window remove anything inside the () and add ::timestamp_ntz to remove the timezone nonsense.  It should look like this: current_timestamp()::timestamp_ntz.  Click OK.

dimension - loaddttm 2

Click ‘OK’ at the bottom of the ‘Edit Properties’ dialogue box.

Now there are two ways to create the target table.  We can write a DDL create statement and run it on the database, or we can have the tool do it for us.  Normally I would just create the DDL from the data model, but since we are working in the ETL tool, let’s do it through the tool.

We will do this with the following steps:

  1. Replace the ‘Table Output’ component with the ‘Rewrite Table’ component.  This component will drop and rebuild the table every time.  We’ll swap it back out after the first run.
  2. Create an Orchestration job that will use the ‘Create/Replace Table’ component and then call our Transformation job.
  3. After a successful run, once we like our data structure, we will go back and update our Transformation job to use the ‘Table Output’ component.

Replace the ‘Table Output’ component with the ‘Rewrite Table’ component.  Under Properties, update the ‘Table’ name with the value ‘Person_D’.
dimension - select rewrite table

Right click the ‘Dimensions’ folder and select ‘Add Orchestration Job’.

dimension - add orch job

Name it something meaningful.  I use ‘SEQ_Person_D’ out of habit.

dimension - name seq d

Add the ‘DDL->Create/Replace Table’ and ‘Transform->Run Transformation’ components to the canvas and rename them appropriately.

dimension - add stages to orch

Select ‘Create/Replace Person_D’ and update the properties to match below.

dimension - orch run select run transform

Columns:
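
The column list amounts to a table definition along these lines (a sketch; the column names come from the insert statement later in this post, while the types are assumptions to check against your source data):

CREATE OR REPLACE TABLE PERSON_D (
    "NConstSID"         VARCHAR(32),   -- MD5 hash surrogate ID
    "nconst"            VARCHAR,       -- natural key from NAME_BASICS
    "primaryName"       VARCHAR,
    "birthYear"         VARCHAR,
    "deathYear"         VARCHAR,
    "primaryProfession" VARCHAR,
    "deceasedFlag"      VARCHAR(1),    -- 'Y'/'N' from the Calculator
    "loadDttm"          TIMESTAMP_NTZ  -- load timestamp
);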

Select the ‘Run Transformation’ component, now named ‘Person_D’, and add ‘Person_D’ as the Transformation Job value.
dimension - orch run select run transform

Now let’s run the job.  This should build and populate our table.

Right click Start and select ‘Run From Component (Dev)’.

execute orch job

The log should look something like this.

Dimension - log output

Work out any issues, and then we’ll go back and replace the ‘Rewrite Table’ component in our Transformation job with a ‘Table Output’.

Right click ‘Rewrite Table’ stage and select ‘Delete Component’.

dimension - remove rewrite

Add the ‘Table Output’ stage.  Update the name, and select ‘PERSON_D’ from the drop down box that appears in the Target Table Edit Properties box.

dimension - table output target name

Open the Column Mapping Edit Properties box.  Select the input and output columns. In this case the names will be the same.  This is what is called a direct mapping. The input columns come from our Calculator stage and the Output Columns come from the database.

dimension - column mapping

Change the ‘Truncate’ property value from ‘Append’ to ‘Truncate’.

dimension - table output truncate

Run SEQ_Person_D again, and make sure there are no more errors.

Now let’s add a default (Not Available) row.  Many ETL systems add default rows to handle cases where a dimension value does not exist.  There are many ways to do this, and I will not cover them all here.

NOTE:  I’m doing this somewhat for sake of best practices.  With this model we do not really need this row.

PRO TIP:  This can also be done in the BI tool, by using outer joins and wrapping every dimension field returned in an NVL-type statement, as sketched below.
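
Something along these lines (a sketch; FACT_TITLE is an assumed fact table name for illustration only):

SELECT NVL(p."primaryName", 'Not Available') AS "primaryName",
       f.*
FROM   FACT_TITLE f
LEFT OUTER JOIN PERSON_D p
       ON f."NConstSID" = p."NConstSID";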

Click on SEQ_Person_D tab.  Drag the “DDL->SQL Script” component to the canvas.  Rename the component by updating “Name” field to “Insert NA Row” or something else you like.

dimension - add sql script for na

Add the following SQL:

Insert into PERSON_D ("NConstSID","nconst","primaryName", "birthYear","deathYear", "primaryProfession","deceasedFlag","loadDttm") 
values ('0','-','Not Available','1753','9999','NA','N',CURRENT_TIMESTAMP::TIMESTAMP_NTZ) ;

dimension - insert statement for na

Run again for good measure.

Check the counts in Snowflake to make sure everything jibes.  The count(*) - 1 accounts for the NA row we just inserted.

select 'Person_D',count(*)-1 from person_d
union all
select 'Name_Basics',count(*) from name_basics;

PRO TIP:  You can add this SQL to a job.  Then send the results out via email or log them to a table and display them in a dashboard.  You could even add controls to abort your load if something is off.
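
For instance, logging the reconciliation to a table could be as simple as this (a sketch; LOAD_AUDIT and its columns are assumptions, not objects built in this tutorial):

INSERT INTO LOAD_AUDIT ("tableName", "rowCount", "auditDttm")
SELECT 'PERSON_D', COUNT(*) - 1, current_timestamp()::timestamp_ntz
FROM PERSON_D;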

Copy this job and update the others.  Below are the table DDLs and Calculations for each dimension.

PRO TIP:  When you can template jobs like this, think about which items you would need to change going from copy to copy.  Those items should be parameterized, or the code should be automated.  This is outside the scope of this course, but think about reusable components the way you would think of classes or functions in any other high-level programming language.
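
As a taste of that, Matillion job variables can stand in for the pieces that change between copies (a sketch; assumes a job variable named natural_key, set to the appropriate key field for each dimension):

-- Calculator expression with the hard-coded key swapped for a variable:
MD5("${natural_key}")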

Region_R

dimension - region_r

dimension - region_r

dimension - md5 region_R

dimension - loaddttm region_r

Title_D

dimension - seq title_d

dimension - title_D

dimension - md5 region_R

dimension - loaddttm region_r
