In this section we will create an AWS EC2 instance to act as our source data server, and load data sets from imdb.com onto it.
Start in the AWS console: Services -> EC2, then launch a new instance.
Select Amazon Linux.
Choose t2.micro. This will be free tier eligible.
Keep the default network and subnet. Set Auto-assign Public IP to ‘Enable’. This gives the server a public (external) IP address, which makes it easy to connect to it from outside the VPC (Virtual Private Cloud).
Increase the storage size a little above the default here, because we are going to load some data onto this server.
Keep ‘Assign a security group’ at its default, ‘Create a new security group’. In the rule table below, select My IP as the Source. This allows SSH traffic on port 22 into this server from our IP address only, so we can open an SSH terminal session on the server.
Create a new key pair and give it a name. Download the key pair. DO NOT lose this file: it cannot be downloaded again, and we will use it to SSH into this machine. Click Launch.
Click ‘View running instances’ at the bottom of the page. Alternatively, go to Services -> EC2, then click ‘Running Instances’ on the right, or ‘Instances’ under EC2 Dashboard.
Copy your Public DNS.
Open a terminal window on Mac and chmod the pem file (key pair) downloaded earlier to limit access to it.
>chmod 500 MyPOCKeyPair.pem.txt
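ssh will refuse a private key that group or other users can access, so it can be worth confirming the mode after running chmod. The following is a minimal sketch, not part of the original steps. The helper names key_mode and key_is_private are hypothetical, and it assumes `ls -l` prints the usual mode string in its first column.

```shell
# Print the 9-character permission string of a file, e.g. r-x------
# (hypothetical helper; skips the leading file-type character).
key_mode() {
    ls -l "$1" | cut -c2-10
}

# Succeed only if group and others have no access to the key file.
# The first three characters (owner bits) may be anything.
key_is_private() {
    case "$(key_mode "$1")" in
        ???------) return 0 ;;
        *)         return 1 ;;
    esac
}

# Usage (file name taken from the example in this section):
#   key_is_private MyPOCKeyPair.pem.txt && echo "key permissions look fine"
```

After the chmod 500 step above, key_mode would report r-x------ for the key file, which passes the check.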
Now use a terminal window on Mac, or PuTTY on Windows, to SSH to our EC2 instance.
Type the command -> ssh -i [/Path/KeyPair] ec2-user@[Public DNS]
Example (my key pair file is in my ‘~/Downloads’ directory).
>cd ~/Downloads
>ssh -i MyPOCKeyPair.pem.txt ec2-user@[Public DNS]
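The order of the arguments matters here: the key file follows -i, and the login name is joined to the host name with @. As a small sketch (the function name ssh_command is hypothetical; ec2-user is the default login on Amazon Linux), the command can be assembled from the two values you have at this point:

```shell
# Print the ssh command for a given key file and public DNS name.
# Hypothetical helper: it only prints the command, it does not connect.
ssh_command() {
    key_file=$1     # path to the downloaded key pair file
    public_dns=$2   # Public DNS copied from the EC2 console
    printf 'ssh -i %s ec2-user@%s\n' "$key_file" "$public_dns"
}

# Usage: paste your own Public DNS in place of the placeholder.
#   ssh_command MyPOCKeyPair.pem.txt "[Public DNS]"
```

Printing the command first makes it easy to check that the key path and host name ended up in the right places before connecting.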
Now download the imdb data sets using the following commands:
> wget https://datasets.imdbws.com/name.basics.tsv.gz
> wget https://datasets.imdbws.com/title.akas.tsv.gz
> wget https://datasets.imdbws.com/title.basics.tsv.gz
> wget https://datasets.imdbws.com/title.crew.tsv.gz
> wget https://datasets.imdbws.com/title.principals.tsv.gz
> wget https://datasets.imdbws.com/title.ratings.tsv.gz
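The same six downloads can also be scripted in a loop. This is a sketch under assumptions: the names BASE_URL, DATASETS, dataset_urls and download_all are made up for illustration, and the `gzip -t` integrity check is an optional extra that is not part of the original steps. It assumes a bash or sh session on the EC2 instance, where the unquoted $DATASETS is split on spaces.

```shell
# Base URL and data set names, taken from the wget commands above.
BASE_URL="https://datasets.imdbws.com"
DATASETS="name.basics title.akas title.basics title.crew title.principals title.ratings"

# Print one download URL per data set.
dataset_urls() {
    for name in $DATASETS; do
        printf '%s/%s.tsv.gz\n' "$BASE_URL" "$name"
    done
}

# Download every file, then test each archive for corruption.
# wget -c resumes a partial download; gzip -t checks the archive
# without unpacking it. Stops at the first failure.
download_all() {
    for url in $(dataset_urls); do
        wget -c "$url" || return 1
        gzip -t "$(basename "$url")" || return 1
    done
}

# Usage, on the EC2 instance:
#   download_all && echo "all data sets downloaded"
```

Keeping the URL list separate from the download step lets you run dataset_urls on its own to review what will be fetched.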