Building AWS Redshift Data Warehouse with Matillion and Tableau
Building a Data Warehouse in 2 Hours using Amazon Redshift
Lessons learned from our hands-on workshop
Our primary goal was to showcase the power and ease of building a data warehouse using AWS Redshift. To load source data from AWS S3 efficiently, we used an AWS Marketplace Partner product (Matillion ETL for Redshift) as our data load tool. To complete a typical enterprise business scenario, we used another AWS Marketplace Partner product (Tableau) to generate data visualizations in the form of a dashboard. Another goal was to build a reference use case using AWS best practices, such as an IAM user with least-privilege permissions and an AWS VPC for the solution components. The image above shows our scenario.
The Workshop Team at AWS re:Invent
At left is a picture of the great team I worked with at AWS re:Invent. It included contractor Kim Schmidt, AWS team members, and vendors from the AWS Marketplace Partners Matillion and Tableau. Kim has recorded a series of YouTube screencasts related to this blog post at this location:
The Business Problem and Dashboard Goal
As with all successful data warehouse projects, we started with the source data and related it to the business questions of interest. Our source data revolved around flights and weather, so we expected our solution to enable us to answer questions and to display the results, such as the following:
• “Which airline carriers had the most delays per year?”
• “Which airports had the greatest percentage of flight delays based on weather conditions (such as rain)?”
• “Which airplane types had the most weather-related delays?”
For our scenario, we used two public sample data sets. The first was US airport flight information from 1995 to 2008; this flight data set included every flight to or from a US airport (and whether it left on time or not). The second data set was public weather data, taken from NOAA, including the daily weather readings for each US airport.
Our solution dashboard (using Tableau) is shown below.
How to Use this Blog Post
There are 3 different approaches you could take when reading this blog post, depending on your time, level of expertise and the depth of knowledge you want to gain. They are listed below:
• Approach 1 – Read the post for information and (optionally) watch the short, included screencasts
• Approach 2 – Use the pre-built artifacts (scripts, jobs, etc…), open, explore and then run them on an
AWS Environment that you set up.
• Approach 3 – Build everything from scratch (including using your own data if inclined). NOTE: You will
have to modify your setup steps based on the size and complexity of your source data.
To that end we’ll detail the steps you’ll need to take, and we’ll add reference scripts and artifacts as we go. For Approach 1, simply read this entire post. For Approach 2, read “The Workshop Exercises” section so that you understand what steps to take in your own environment. To set up your AWS environment, you can either click to set up via the AWS console or use our AWS CLI script.
For Approach 3, read the same material as for Approach 2, but also read everything under the header “Building your own Data Warehouse on AWS.” We provide step-by-step instructions to build the Matillion ETL jobs and also the Tableau workbook.
Shown below is our reference architecture.
re:Invent AWS Data Warehouse Workshop Architecture
The Workshop Exercises
Exercise 0 – Environment
At re:Invent we pre-provisioned one workshop environment for each student team to use; each environment
included these services and data:
• One Amazon Redshift cluster using a 1 x dc1.large instance launched in an AWS VPC
• The public flights data and weather data in the following bucket s3://mtln-flight-data
• One EC2 instance (launched from an AWS Marketplace AMI) running Matillion ETL for Redshift.
• The Matillion instance used an AWS Elastic IP Address
• The Matillion instance ran in the same AWS VPC as the Redshift instance
• The Matillion jobs solution file to load and process the data (“FinalSolution.json”) on each desktop
• One data analytics tool, Tableau (installed on each desktop), including the JDBC driver for Redshift
• A unique IAM user login for each team, assigned appropriate permissions
Exercise 1 – Review Environment
After seeing a demo of the AWS Console (Redshift) environment and the AWS Marketplace (Matillion and Tableau) environments, the first exercise was to have the student teams log in and explore their pre-provisioned environments. They reviewed the following aspects of their setup and took notes on new features they saw in each of the following:
1. IAM Users (Best Practice)
2. VPC (Best Practice)
3. Redshift single-node
4. Matillion via EC2 -> Matillion Browser Interface
Exercise 2 - Open, Review and Run Matillion Load (Orchestration) Jobs
After seeing a demo of the data scenario and source files, the student teams next saw a demo of the Matillion load jobs. The instructor imported the Matillion load jobs and showed how to examine the load flow. In this exercise, students then imported two Matillion load (orchestration) jobs (“Import Flight Data” and “Import Weather Data”) that were on their desktops, reviewed them, and then ran them. They examined the output in Matillion during job processing and also in the Redshift console (“Query Execution” and “Load” tabs). The data load takes approximately 5 minutes, so during the load the instructor demonstrated additional Matillion capabilities while the students waited for their data to finish loading.
Exercise 3 - Open, Review and Run Matillion Transformation Job
After seeing a demo of the data transformation job, in this exercise the student teams imported and ran their
own transformation job. The Instructor imported the Matillion transformation job and reviewed the job steps in
detail while the student jobs completed. Students examined the output in Matillion during job processing and
also in the Redshift console (“Query Execution” and “Load” tabs).
Exercise 4 – Connect to Tableau and Visualize the Results
After seeing a demo of how to connect a desktop installation of Tableau to their Redshift cluster, in this exercise the student teams connected to Redshift from Tableau (on their desktops). The instructor then demonstrated how to implement joins in Tableau over the source Redshift data, and the student teams performed the joins. In the final step, the instructor demonstrated a Tableau visualization using the data, and the student teams then created one or more visualizations based on their data.
Building your own Data Warehouse on AWS
To start, you will want to download the workshop script and sample files from GitHub -
https://github.com/lynnlangit/AWS-Redshift-Matillion-Workshop
First let’s get started with a copy of Matillion ETL for Redshift. The AMI is available for a 14-day free trial from the Marketplace. Please follow our getting-started instructions; for the impatient, the key points are:
● Attach an IAM Role to the instance that has access to Redshift (AmazonRedshiftReadOnlyAccess), S3 (AmazonS3FullAccess) and optionally SNS. If you omit this, it is possible to enter credentials later.
● Run in an AWS VPC. When in a VPC, ensure the instance has internet access for connecting to S3.
● Once started, connect to the instance with a web browser on http://<server_name_or_ip>/
Connecting to AWS Redshift from within Matillion
Once the instance is started and the software has launched, you should see the following screen. Fill in the details for your cluster.
If the test succeeds then Matillion ETL can talk to Redshift and we are ready to start building ETL jobs.
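As an optional extra check (not part of the workshop itself), you can point any SQL client at the same cluster endpoint and run a couple of simple queries to confirm the cluster is reachable and the schema is visible. A minimal sketch:

```sql
-- Optional sanity checks, run from any SQL client (psql, SQL Workbench/J, etc.)
-- using the same host, port, database and credentials entered in Matillion.
SELECT version();   -- should return a PostgreSQL/Redshift version string

-- List the tables currently visible in the public schema
-- (this will be empty until the orchestration jobs below have run).
SELECT DISTINCT tablename
FROM pg_table_def
WHERE schemaname = 'public';
```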
Orientation
This orientation diagram shows the key elements of the tool on a main Transformation canvas. We will drill into
the detail further as we go. Also see this video that gives a good product overview.
Loading our Weather Data
The paradigm we follow in Matillion ETL for Redshift is to first load our untransformed source data into Amazon
Redshift, using an orchestration job (ready for transformation by a transformation job). So, that’s what we’ll do
now.
When you first start a project a sample orchestration job and transformation job are created to help new users
orient themselves in the tool. You can keep these for reference if you wish, or remove them.
Our first task is to create a new orchestration job that will load our weather data from the S3 bucket.
1. From the Project menu select Add orchestration job.
2. Name the job Import Weather Data and click OK
3. Now we have a blank canvas. In the Components panel expand Orchestration -> Flow and drag the Start component onto the canvas. This component simply indicates where the orchestration job starts, and it will have green borders (i.e. it has validated OK).
4. Next we create a table to hold our list of weather stations. Select the Orchestration -> DDL -> Create/Replace Table component and drag it onto the canvas. This will require some input in the Properties area. The important properties to set here are the New Table Name and the metadata.
5. The Table Metadata should be set up as follows; this matches the column format of the text input data.
6. The Distribution Style is set to “All”, meaning this data will be copied to all nodes in the Redshift cluster. We choose All because this is a small table and this is the most efficient, performant way to store it on Redshift.
7. We define the “USAF” column as the Sort Key, as this is the table’s natural key and we will use it for joining later.
8. Finally, in order to validate, our component needs an input connection. To do this, select the Start component on the canvas and click on the small gray circle to its right. This allows you to draw a connector to the Create/Replace Table component, indicating that this step happens next.
9. Now that our component is valid, we can run it ad hoc and create the table. To do this, right-click on the component and select Run Component. The table will be created in Redshift (the equivalent SQL is sketched after this walkthrough).
10. Next we will load some data into our table from S3. Drag on a new component: Orchestration -> Load/Unload -> S3 Load.
11. The S3 Load component has a long list of properties, but most can be left at their defaults. The important ones are shown in the table below.
Property | Value | Notes
Target Table Name | station list | If your create component worked correctly you should be able to simply select your table from the list.
Load Columns | Choose All |
S3 URL Location | s3://mtln-flight-data/weather | This is the public bucket where the data is kept.
S3 Object Prefix | ish-history.csv | This is the name of the file (or object) in the bucket. In this case it’s a single file, but you can use a prefix and process multiple files.
Data File Type | CSV | It’s a comma-separated file.
CSV Quoter | " | Elements are quoted.
Region | eu-west-1 | The region of the S3 bucket where the data is loaded from.
12. The rest can be left as default.
13. Once again, right-click and select Run Component to load the data into the table. Note that the task panel will show the number of rows transferred.

14. Now we repeat the steps above (from step 4) to create a table for the main weather data, called raw_weather. This table holds a lot of data, so we will use the “Even” distribution style, which spreads the data evenly across the cluster. The weather data columns look like this:
Column Name | Data Type | Size | Decimal Places
STN | Numeric | 6 | 0
WBAN | Numeric | 5 | 0
YEAR | Numeric | 4 | 0
MODA | Numeric | 4 | 0
TEMP | Numeric | 6 | 1
PRCP | Numeric | 5 | 2
VISIB | Numeric | 6 | 1
WDSP | Numeric | 6 | 1
15. The weather data is a delimited file, so for this type we use the following settings
Property | Value | Notes
Target Table Name | raw_weather | If your create component worked correctly you should be able to simply select your table from the list.
Load Columns | Choose All |
S3 URL Location | s3://mtln-flight-data/weather | This is the public bucket where the data is kept.
S3 Object Prefix | weather_simple | This is the name of the file (or object) in the bucket. In this case it’s a prefix, so Redshift will automatically load all matching files. Multiple files are great for large quantities of data because the data can be loaded in parallel by the cluster.
Data File Type | Delimited | It’s a delimited file.
Delimiter | , | It’s a comma-separated file.
Compression Method | Gzip | The data is compressed with GZIP.
Region | eu-west-1 | The region of the S3 bucket where the data is loaded from.
16. Now, finally, we can complete the orchestration job.
If we run this job now, we should have both tables populated.
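For anyone curious what these components do under the hood, the statements Matillion issues are roughly equivalent to the SQL below. This is a sketch rather than the exact generated code: the station-list column definitions are illustrative (use the Table Metadata you set in the component), the table name should match your New Table Name, and the IAM role ARN is a placeholder for your own credentials.

```sql
-- Station list: a small lookup table, so DISTSTYLE ALL with USAF as the sort key.
-- Column definitions here are illustrative; use the component's Table Metadata.
CREATE TABLE stationlist (
    usaf    VARCHAR(10),
    wban    VARCHAR(10),
    name    VARCHAR(100),
    country VARCHAR(50)
)
DISTSTYLE ALL
SORTKEY (usaf);

COPY stationlist
FROM 's3://mtln-flight-data/weather/ish-history.csv'
IAM_ROLE 'arn:aws:iam::<your-account>:role/<your-redshift-role>'  -- placeholder
REGION 'eu-west-1'
CSV QUOTE '"';

-- Main weather data: a large table, so spread the rows evenly across the cluster.
CREATE TABLE raw_weather (
    stn   NUMERIC(6,0),
    wban  NUMERIC(5,0),
    year  NUMERIC(4,0),
    moda  NUMERIC(4,0),
    temp  NUMERIC(6,1),
    prcp  NUMERIC(5,2),
    visib NUMERIC(6,1),
    wdsp  NUMERIC(6,1)
)
DISTSTYLE EVEN;

-- The weather_simple prefix matches multiple gzipped files, which Redshift
-- loads in parallel across the cluster.
COPY raw_weather
FROM 's3://mtln-flight-data/weather/weather_simple'
IAM_ROLE 'arn:aws:iam::<your-account>:role/<your-redshift-role>'  -- placeholder
REGION 'eu-west-1'
DELIMITER ','
GZIP;
```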
Importing the Flight data
Next it’s time to import the flight data; however, to avoid too much repetition, we will use a pre-built job. This demonstrates how jobs can be reused using the Export/Import functionality in Matillion ETL for Redshift.
1. Download the Flight Data.json from here.
2. Select Project -> Import Jobs
3. Click Browse..., choose the file you downloaded, then select the “Import Flight Data” job and click OK.
4. Now open the orchestration job by double-clicking Orchestration -> Import Flight Data.
5. Review the job and when you are happy with what it is doing, re-validate and then run it.
6. The job will take a few minutes to load all the data
Note: The Flight Data job also pre-creates the output table “Flights Analysis” that we will use for analysis of the
result of our transformation.
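To give a rough idea of what that pre-created table might look like, here is a sketch only; the imported job defines the real table, and the actual column names, types, sort key and compression encodings may well differ.

```sql
-- Illustrative only: the imported job creates the real "Flights Analysis"
-- table, and its exact columns and encodings may differ from this sketch.
CREATE TABLE "Flights Analysis" (
    delay_date           DATE        SORTKEY,
    carrier              VARCHAR(10) ENCODE lzo,
    origin               VARCHAR(5)  ENCODE lzo,
    dest                 VARCHAR(5)  ENCODE lzo,
    depdelay             INTEGER     ENCODE delta,
    is_departure_delayed VARCHAR(3)  ENCODE bytedict,
    is_long_delay        VARCHAR(3)  ENCODE bytedict,
    is_flight_diverted   VARCHAR(3)  ENCODE bytedict
)
DISTSTYLE EVEN;
```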
Creating the Transformation Job
We have done the E and the L (Extract, from S3, and Load, into Redshift), so we are ready to start the fun bit: the T (Transform).
In the next section we will build a transformation job that will join our Flights and Weather Data and output it
into a new table designed for easy analysis.
We will join the two data sets using the airport code, which exists in both. After that we’ll add some simple calculations. These are all the sorts of things you’ll do, at scale, in a real-life business scenario.
So let’s get started.
Our first challenge is that our flights data doesn’t contain much friendly airplane information, other than the tail number (tailnum). People doing data analysis will want that information available, so let’s start with a simple join.
1. First we need to import a partially completed Transformation Job using Project -> Import Jobs and
import the file called Transform Weather and Flight Data.json
2. Once imported double click to open the Job and let us begin by adding our flights data flow to the
existing data flows.
3. Right-click on the job and choose Revalidate Job to build all the views.
Note: Before we start adding new components, take a moment to look at what the existing components are doing. You will notice we have a transformation for the weather data with joins, calculators, filters and aggregates. You can click through these and see how they are configured as we go.
4. Remove the note that says “Build flight data flow here” by right-clicking it and selecting Delete Note.
5. In Components, add Data -> Read -> Table Input to the job.
6. Select the component and set the table name to raw_flights; for simplicity’s sake, select all column names.
7. Repeat the above steps for the table raw_plane_info. Now we have something to join.
8. Add a Data -> Join -> Join and wire up like so.
9. Now let us configure the join as follows
Property | Value | Notes
Name | Join Plane Info |
Main Table | raw_flights | This is the main flow that we will be joining to. Note that the join can support multiple flows, not just two.
Main Table Alias | flights | This is used later when we specify the join condition.
Joins | Join Table 1: raw_plane_info; Join Alias 1: planes; Join Type 1: Inner | This describes that we will inner join to the plane info table. Note: you can do multiple joins and join flows to themselves if you require.
Join Expression 1 | f_inner_a: "flights"."tailnum" = "planes"."tailnum" | This describes our join. This time it’s simply joining on the tail number, but it could be much more complex if needed. We will look at the calculation editor in more detail later.
Output Columns | All available | These are the columns of data that flow out of the component. Sometimes it’s possible to get an error here. A useful trick is to delete all the columns and allow the component to re-validate (by clicking OK). This will automatically re-add all valid output columns.
10. If the component is now valid, this is a good time to stop and explore what is going on.
Matillion ETL for Redshift has created a view for each of the three components created so far, and Amazon Redshift has ensured that each view is valid, i.e. the SQL syntax is correct and the view is physically allowed to exist. Let’s look at what Matillion ETL for Redshift can tell us about our data flow so far.
The Sample tab will allow us to look at the output of our flow so far and also indicate the number of rows
involved at this step.
The Metadata tab shows us the data types and columns involved in the output.
The SQL tab shows the SQL of the generated view; this allows you to see exactly what the tool is doing at each step.
The Plan tab will tell you how Redshift will tackle the query.
And finally, the Help tab provides context-sensitive help for the component.
11. Next we add a Data -> Transform -> Filter component to remove all privately owned planes from the dataset. It’s OK to filter the data later down the flow, as the query optimiser will usually improve this when it performs the actual query. Set it up like this:
Property | Value | Notes
Name | Filter Private Planes |
Filter Conditions | Input Column: type; Qualifier: Not; Comparator: Equal to; Value: Individual | Note again how we are relying on the output of the previous component.
Combine Conditions | AND |
12. Next in the flow we add a Data -> Transform -> Calculator component. The calculator is a powerful component that allows us to do in-flow calculations across a row of data.
13. The main element of the Calculator component is the Calculations editor.
14. For this calculator we need four expressions.
Name | Expression | Notes
delay_date | TO_DATE("year" || '-' || "month" || '-' || "dayofmonth", 'YYYY-MM-DD') | Since the source data has no actual date column, we construct one from the year, month and dayofmonth fields.
Is Departure Delayed | CASE "depdelay" > 0 WHEN true THEN 'Yes' ELSE 'No' END | Sets a simple flag for the departure delay. This sort of field makes life easier for analysts.
Is Long Delay | CASE "depdelay" > "airtime" * 0.2 WHEN true THEN 'Yes' ELSE 'No' END | Another flag, this time identifying flights that are “Long Delayed”, i.e. the delay was more than 20% of the overall flight time.
Is Flight Diverted | CASE WHEN "diverted" = 1 THEN 'Yes' ELSE 'No' END | Here we convert a 1 or 0 flag to a more user-friendly Yes or No.
15. Our flow now looks like this.
16. Now we can add our output table for analysis. Add a Data -> Write -> Table Output component and set it up as below:
Property | Value | Notes
Name | Analysis Flights |
Target Table Name | Analysis Flights | This table was created when we ran the “Import Flight Data” orchestration job. The columns are already set up to correctly compress your data.
Fix Data Type Mismatches | No | Not needed here as the types are correct, but sometimes it can be useful to allow Matillion to attempt to map data types.
Column Mapping | see below | This maps the column names in your flow to the physical columns in the table.
Truncate | Truncate | Means that every time we add data, this table will be truncated first.
The column mappings are set up like this:
17. Now our flights analysis flow is complete. The whole job looks like this. To run everything from end to end, right-click and select Run Job.
18. You can watch the execution in your task list.
19. This will leave us with 5 analysis tables populated and ready to work with an analysis tool such as Tableau (the equivalent SQL for the flights transformation is sketched below).
So we are done. Our data is neatly prepared and ready for analysis. Our jobs can now be versioned and scheduled, so the data can be updated regularly if required.
Note: Don’t forget the collaboration features in the tool. Send the URL of the job you happen to be working on to a colleague, and then work on the job together, collaboratively, in real time - just like in Google Docs.
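Although everything above was built graphically, the flights flow boils down to SQL that runs inside Redshift. A rough hand-written equivalent is shown below; it reuses the calculator expressions from the workshop, while the output column aliases are illustrative and may not match the workshop tables exactly.

```sql
-- Approximate shape of the generated flights flow: join to plane info,
-- filter out privately owned planes, then add the calculated columns.
-- The Table Output component then maps these columns into "Analysis Flights".
SELECT
    "flights".*,
    TO_DATE("flights"."year" || '-' || "flights"."month" || '-' || "flights"."dayofmonth", 'YYYY-MM-DD') AS "delay_date",
    CASE "flights"."depdelay" > 0 WHEN true THEN 'Yes' ELSE 'No' END                                     AS "is_departure_delayed",
    CASE "flights"."depdelay" > "flights"."airtime" * 0.2 WHEN true THEN 'Yes' ELSE 'No' END             AS "is_long_delay",
    CASE WHEN "flights"."diverted" = 1 THEN 'Yes' ELSE 'No' END                                          AS "is_flight_diverted"
FROM raw_flights AS "flights"
INNER JOIN raw_plane_info AS "planes"
    ON "flights"."tailnum" = "planes"."tailnum"
WHERE "planes"."type" <> 'Individual';
```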
About Matillion
Front end - Matillion ETL for Redshift is an entirely browser based tool, launched as an AMI from the AWS
Marketplace. As such it runs inside your existing AWS account and can be up and running in a few minutes.
Matillion has been designed specifically for Redshift.
Back End - Matillion ETL for Redshift uses ELT Architecture, pushing down data transformations to Amazon
Redshift. The tool takes advantage of Redshift’s ability to layer many views, whilst still optimizing the
execution plan accordingly. Each transformation component generates a corresponding view in Redshift and
Matillion ETL for Redshift keeps these views in sync. This approach has some significant real world
advantages.
● ‘ELT’ is several orders of magnitude faster than ‘ETL’. This is because the data remains in the database, which understands its structure and how to transform it most efficiently. This is as opposed to ‘ETL’, where the data has to be expensively extracted from the database before being transformed in memory and then reloaded.
● Amazon Redshift robustly validates the views as they are created, so the user can be confident that if the view is successfully created, that part of the job will work. This avoids time wasted debugging an ETL job after it has been created.
● Matillion ETL for Redshift allows you to ‘sample’ the data at any point in your flow. This can be extremely useful when debugging and understanding complex data flows.
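To make the layered-view idea concrete, here is a minimal sketch (with made-up view names, not the names Matillion actually generates) of the pattern: each component’s output is simply a view over the previous one, so Redshift validates every layer as it is created yet still plans the whole stack as a single query when it runs.

```sql
-- Minimal illustration of the layered-view ELT pattern (names are made up).
CREATE VIEW v_join_plane_info AS
SELECT f.*, p.type AS plane_type
FROM raw_flights f
JOIN raw_plane_info p ON f.tailnum = p.tailnum;

CREATE VIEW v_filter_private_planes AS
SELECT *
FROM v_join_plane_info
WHERE plane_type <> 'Individual';

-- The final step queries the top view; Redshift optimizes the whole stack
-- as one query at execution time.
SELECT COUNT(*) FROM v_filter_private_planes;
```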
Conclusion
As we can see with this relatively simple data set, the key to good analysis is good data preparation... and the fastest way to do that is using an ELT-based tool such as Matillion, running over an MPP columnar database like Amazon Redshift.
Of course, this is a very simple job, and real-world applications are much more complex. That is when the advantage of a tool like this really comes alive. The graphical job development, the collaborative nature of Matillion, versioning support, built-in scheduling and so on all serve to make your ETL jobs far more enjoyable to create and far more valuable once created.
Tableau for re:Invent Workshop
0a. Connect to data
• Connect to the Redshift data source and select the public schema, using the correct port.
0b. Join data – use LEFT joins!
• Drag the Analysis Flights table onto the connection window.
• Drag Analysis Carriers onto the connection window (it will automatically join on carrier code). Turn it into a left join.
• Drag Analysis Weather onto the connection window and turn it into a left join as well.
• Drag Analysis Airports onto the connection window, also a left join.
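If it helps to see what that Tableau data source amounts to in SQL, a rough left-join sketch is below; the join keys (carrier code, airport and date) are illustrative guesses based on the workshop description, and the real analysis tables may use different column names.

```sql
-- Rough SQL equivalent of the Tableau data source joins.
-- Column names are illustrative; check the actual analysis tables.
SELECT f.*, c.carrier_name, w.weather_delay, a.city
FROM "Analysis Flights" f
LEFT JOIN "Analysis Carriers" c ON f.carrier = c.carrier_code
LEFT JOIN "Analysis Weather"  w ON f.origin = w.airport
                               AND f.delay_date = w.weather_date
LEFT JOIN "Analysis Airports" a ON f.origin = a.airport_code;
```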
1. Examine how many flights each carrier had
• New sheet (name it Flights by Carrier).
• Double-click on Number of Records (at the bottom of Measures).
• Right-click, hold, and drag Date from Analysis Flights onto Columns. You will see a number of different ways to display dates; select Week(Date) towards the bottom (the green one…).
• Drag Carrier Name from Analysis Carriers onto Color.
• On the Marks card, click the drop-down that says “Automatic” and select “Area”.
• OPTIONAL: Click the drop-down on the color legend, select “Cyclic” and click Assign Palette.
2. Where are the weather delays?
• New sheet (name it Avg Weather Delays).
• Double-click on City.
• Drag Weather Delay onto Color. Change it to an average by clicking the drop-down on the green “pill” and selecting Measure -> Average.
• Drag Number of Records onto Size.
• Click the drop-down on the color legend and change it to red.
3. What is the total delay by carrier?
• New sheet (name it Total Delay by Carrier).
• Drag Carrier onto Rows.
• Drag Arrival Delay onto Columns.
• Change Arrival Delay to an average.
• Sort descending (the button on the right…).
• Click Color and change it to grey.
4. Create a dashboard
• Click New Dashboard.
• Double-click Flights by Carrier (or Sheet 1).
• Drag and drop Avg Weather Delays below that.
• Drag and drop Total Delay by Carrier to the left of both.
• Change Fit to “Entire View” for Total Delay by Carrier.
• Click the drop-down on Total Delay by Carrier and select “Use as Filter”.
• Click on one of the carriers to see the number of flights they have had, where their flights are going, and which airports are the most delayed by weather.