Load Source Data to S3

In this section we will move data from our source system to AWS S3.

Let’s first create an S3 bucket to stage our movie data files.  Navigate to Services->S3.

[Screenshot: S3 console]

Select ‘Create bucket’.

[Screenshot: Create bucket]

Give the bucket a name.  Note: Bucket names must be unique across all of AWS.

[Screenshot: Name the bucket]

Check to make sure your bucket exists.

[Screenshot: Bucket exists]
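If you prefer to script this step, here is a minimal sketch of the same thing using boto3.  The bucket name and region are placeholders; use your own (globally unique) name.

```python
# Minimal sketch of creating and verifying the staging bucket with boto3.
# "my-movie-data-poc" and "us-west-2" are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")

# Create the staging bucket; outside us-east-1 a LocationConstraint is required.
s3.create_bucket(
    Bucket="my-movie-data-poc",
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)

# head_bucket raises an exception if the bucket does not exist.
s3.head_bucket(Bucket="my-movie-data-poc")
print("Bucket exists")
```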

Let’s create an Orchestration job in Matillion to load the S3 bucket.  Right-click on ‘default’ and select ‘Add Orchestration job’.

[Screenshot: Add Orchestration job]

Name the job and click ‘OK’.

[Screenshot: Name the orchestration job]

Under Components select Load/Unload->S3 Put Object and drag it onto the canvas.

[Screenshot: Load/Unload components]

[Screenshot: S3 Put Object component]

Attach ‘Start 0’ to ‘S3 Put Object 0’ by clicking once on the circle to the right of ‘Start 0’, moving your cursor on top of ‘S3 Put Object 0’, and clicking again.

[Screenshot: Connecting Start 0 to S3 Put Object 0]

Select the ‘S3 Put Object 0’ stage by clicking once on it, and fill out the Properties in the panel below.

Input data -> SFTP

Input data URL -> Set this to the URL of the EC2 instance holding your source data files.  Format like so -> sftp://[Private IP]:22[Directory][Filename]

Example -> sftp://172.31.239.117:22/home/ec2-user/title.basics.tsv.gz

Output object name -> The name the file will have in S3

Username -> ec2-user

SFTP Key -> Copy and paste the full contents of MyPOCKeyPair.pem.txt

S3 path -> s3://[bucket name]

[Screenshot: S3 Put Object properties]
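Under the hood, the component is doing the equivalent of an SFTP download followed by an S3 upload.  Here is a rough Python sketch of that flow using paramiko and boto3; the IP, key file, and bucket name are the placeholder values from the example above.

```python
# Rough sketch of what S3 Put Object does behind the scenes: an SFTP download
# followed by an S3 upload.  IP, key file, and bucket name are placeholders.
import boto3
import paramiko

key = paramiko.RSAKey.from_private_key_file("MyPOCKeyPair.pem.txt")

transport = paramiko.Transport(("172.31.239.117", 22))
transport.connect(username="ec2-user", pkey=key)
sftp = paramiko.SFTPClient.from_transport(transport)

# Pull the source file down, then stage it in S3 under the same name.
sftp.get("/home/ec2-user/title.basics.tsv.gz", "title.basics.tsv.gz")
sftp.close()
transport.close()

boto3.client("s3").upload_file(
    "title.basics.tsv.gz", "my-movie-data-poc", "title.basics.tsv.gz"
)
```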

Now, since this is SFTP, we need to open a port to the EC2 server holding our source data.

Navigate to the EC2 console.  Select our t2.medium server that is running Matillion.  Note the private IP address.

[Screenshot: EC2 console]

Now select the t2.micro that is our source system, and click the security group link.

[Screenshot: Security groups]

Click the Edit button on the Inbound tab.

[Screenshot: Security group edit]

Click ‘Add Rule’.  Then type in port range ’22’.  Add the private IP of the Matillion server.  Be sure to add the CIDR block ‘/32’ at the end of the IP.  Click Save.

[Screenshot: Security group add rule]
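For reference, the same inbound rule can be added with boto3.  The security group ID and the Matillion server’s private IP below are placeholders.

```python
# The same inbound rule added via boto3 instead of the console.
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # security group of the source t2.micro (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        # The /32 CIDR block restricts access to the single Matillion server.
        "IpRanges": [{"CidrIp": "172.31.200.10/32", "Description": "Matillion SFTP"}],
    }],
)
```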

Run the Matillion job.  Right-click on the Start stage and select ‘Run From Component (Dev)’.

[Screenshot: Run From Component]

Check the task tab for a completed status.  

[Screenshot: Successful run]

If the job failed, click the arrow-in-a-box icon at the far right of the run status to expand that window (the icon is found under the task tab, just to the right of the ‘Completed’ time).  Most likely the issue is that the SFTP address is wrong, the RSA key was not fully copied, or, like the error below, the port is not open.

[Screenshot: Port not open error]
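A quick way to test for the port-not-open case is a plain TCP connection from the Matillion server to the source instance; the IP below is a placeholder.

```python
# Quick port check: try to open a TCP connection to port 22 on the source
# instance (IP is a placeholder).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect(("172.31.239.117", 22))
    print("Port 22 is open")
except OSError as e:
    print(f"Port 22 is not reachable: {e}")
finally:
    sock.close()
```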

After a successful run, check S3 to verify the file made it.

[Screenshot: Check S3 for the file]
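You can also verify from code: this sketch lists what landed in the staging bucket (the bucket name is a placeholder).

```python
# List the objects staged in the bucket to confirm the transfer.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-movie-data-poc")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```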

Matillion does not have user prompting for variable values at this time.  That would have been nice, because then we could have built one job and prompted for the file name.  Let’s clean this up and add a few variables anyway.

Right-click on the ‘default’ folder in the explorer, and select ‘Create Folder’.

[Screenshot: Create Folder]

Add a name.

[Screenshot: Folder Src_to_S3]

Right-click on the job and select ‘Manage Job’.

[Screenshot: Manage Job]

Enter the new name.

[Screenshot: Job name Src_S3_Title_Basics]

Drag the newly named job ‘Src_S3_Title_Basics’ to the folder ‘Src_to_S3’.

Now let’s add a project-level variable for the server IP address.  Click the Project drop-down and select ‘Manage Environment Variables’.

[Screenshot: Manage Environment Variables]

Click the plus sign at the bottom and add the variable name, data type, and default value.

[Screenshot: Variable src_server_ip]

Single-click ‘S3 Put Object 0’.  Select the Export tab, then Manage Variables -> Manage Variables.

[Screenshot: Export tab, Manage Variables]

Click the plus sign at the bottom of the Manage Job Variables window.  Add the variable name ‘filename’, with ‘title.basics.tsv.gz’ as the value.  Select ‘Copied’ as the behavior.  This means that the variable value will be passed down any branches.  The other option, ‘Shared’, will share this value with all nodes.

[Screenshot: Variable filename]

Now add the variables to the properties.  Be sure to use the ${variable} format.  In this example the variable names would be ${src_server_ip} and ${filename}.

[Screenshot: Variables in properties]
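To illustrate how the ${variable} placeholders expand at run time: Python’s string.Template happens to use the same syntax, so this sketch shows the substitution with the values defined above (the IP is a placeholder).

```python
# Matillion expands ${...} placeholders at run time; string.Template uses the
# same syntax, so this demonstrates the substitution.
from string import Template

url = Template("sftp://${src_server_ip}:22/home/ec2-user/${filename}")
print(url.substitute(src_server_ip="172.31.239.117", filename="title.basics.tsv.gz"))
# -> sftp://172.31.239.117:22/home/ec2-user/title.basics.tsv.gz
```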

Run the job again to make sure it still works.  Then copy, rename, and create a job for each file.

 
