Load Fact with Matillion

Now we will build our fact table.  The Fact table will hold our metrics and will be the center of the data model.

Create a ‘Facts’ Folder under Default.  Create a new ‘Transformation’ Job. Name it ‘Cast_F’.

fact - create folder facts

Add 2 ‘Read->Table Input’ components and a ‘Join->Join’ component.  Connect them by dragging the arrow from the Table input to the Join stage.  Rename the stages as shown in the screenshot below. In the ‘TITLE_PRINCIPALS’ Table Input set the Table to ‘TITLE_PRINCIPALS’.  Open the Edit Properties window and select the columns shown.

Now let’s update the TITLE_BASICS Stage. Set the title name to ‘TITLE_BASICS’.  Select the columns shown below.

fact - title principals

Now select the ‘Join’ component.  Set the Name to ‘Title_Name_x’. Set the ‘Main Table’ to TITLE PRINCIPALS.  That table contains the set of data that we will add to and build on. Another way of thinking about it, is that TITLE_PRINCIPLES is the grain of the fact table.

Set ‘Main Table Alias’ to ‘TP’.

fact - title name

Open the Joins Edit Propertie box.  Click the plus sign to add a row. Then add  Join Table -> TITLE_BASICS, Join Alias -> TB, and Join Type -> Left.  This means that we will keep all the data elements on the left side of the join expression below.

fact - title basic joins

Open the ‘Join Expressions’ Edit Properties box.  Add the following join condition to join our data sets by tconst.

“TP”.”tconst”=”TB”.”tconst”

fact - tb left join

Open the ‘Output Columns’ Edit Properties box.  Click the plus sign and add each row shown below, or click ‘Add All’.

fact - join expressions add all

Now let’s run through that process again.  To add another lookup for TITLE_RATINGS. Add another ‘Join’ component and ‘Table Input’ component’ to the canvas.  Rename them as shown below. Update the TITLE_RANKINGS table input table to TITLE_RANKINGS.

fact - title rating

Add the following ‘Column Names’.

fact - column names

Update the ‘Join’ stage as shown below.  Add the Main table -> Title_Name_x. This will use the output of our previous join as the source data for this new join.  Main Table Alias -> TN.

fact - title name x2

Add  Join Table -> TITLE_RATINGS, Join Alias -> TR, and Join Type -> Left.

fact - title ratings joins

Open the ‘Join Expressions’ Edit Properties box.  Add the following join condition to join our data sets by tconst.

“TP”.”tconst”=”TB”.”tconst”

fact - title name x2 join exp

Add the following Output Columns or click ‘Add All’.

fact - title name x2 join output col

Now let’s add a calculator stage to add our SID values to the fact table.  In the calculation for these values will add a default value in the case one of these values is NULL.  That really should happen here, because these are key values but it is good practice anyway. If one of these happened to be NULL it would get converted to ‘0’, which would then join to the NA row of our dimensions.  This allows equal joins in our BI tool or queries on our star.

The Expressions are:

nconstSID 
nvl(MD5("nconst"),'0')

consisted
nvl(MD5("tconst"),'0')

fact - calc stage

Finally lets add our Rewrite table component.  Name it Cast_F, and input CAST_F as the table.

fact - cast f

Now let’s create an Orchestration job ‘SEQ_CAST_F’.  This will be very similar to the ones create for our dimensions.

fact - create table

fact - create table columns

Now let’s run this and check for errors.

If there are no errors, let’s do a simple count check.

select 'Cast_F',count(*) from Cast_F
union all
select 'Title Principals',count(*) from title_principals;

Now we can run some queries against our star.

NOTE:  Matilion Quotes fields when creating the DDL to build a table.  This seems to translate literally to Snowflake. Meaning if you build your tables through Matillion, Snowflake will not recognize your fields without “” around them.  In real life one would normally just build tables with a DDL script in Snowflake directly and not really use Matillion for this anyway. So, I don’t see this as that big a deal.

If this bugs you at much as me, go back and copy the DDL’s generated by Matillion in the ‘Create/Replace Table’ stages under the SQL tab. Remove the quotes and run in Snowflake.  Then remove the ‘Create/Replace Table’ stage or set the ‘Create if not exists’ option. Go to the table output and update the ‘Output Column’ section of the Column Mappings property.  The fields should now all be upper case. You will also have to remove the quotes around the SQL in the NA row script.

Now you can run queries against the data mart.  Here is a query to find the average age of the actors in movies that had an average rating above 9.

select
x.startYear as Year
,avg(x.age) as AverageAge
from
(
select
primaryName
,birthYear
,deathYear
,case when deathYear is not null Then deathYear-birthYear
else date_part(year,current_date) - birthYear end as age
,averageRating
,startYear
from
cast_f c
left join PERSON_D p 
on c.nConstSID=p.NConstSID
where isAdult=0
and startYear between 2000 and 2020
and averageRating>=9
and birthYear is not null
and primaryProfession like '%actor%'
) x
group by x.startYear
order by x.startYear;

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s