www.matillion.com
© 2017 Matillion. All rights reserved.
Presented by:
Copyright © 2017 Matillion. All rights reserved. Trademarks, registered trademarks, and
service marks are property of their respective owners.
Matillion Webinar
Tuesday, August 29, 2017
Getting Started with Amazon Redshift
Harpreet Singh & Nick Tierney
• Data integration tool built specifically for
AWS and Amazon Redshift
• Push-down ELT architecture
• Intuitive browser-based UX
• Powerful feature set
• Retail-like acquisition through the AWS
Marketplace
Matillion ETL for Amazon Redshift
ELT
Architecture
KEY BENEFITS
• Simplified infrastructure
• Fast performance, scalable
• Increases development productivity
Intuitive UX
KEY BENEFITS
• Low skills on-ramp
• Powerful
• Reduces cost of data integration development
Built for AWS
KEY BENEFITS
• Seamless integration with AWS services
• Designed for the Cloud
• No compromise
Retail-like
purchasing
KEY BENEFITS
• Five minutes to stand up
• Cheap, utility-based pricing
• Available exclusively on AWS Marketplace
Benefits of Matillion ETL for Amazon Redshift
Agenda
● An overview of petabyte scale data warehouses, the architecture, and use cases.
● An introduction to Amazon Redshift parallel processing, columnar, and scaled out
architecture.
● Learn how to configure your data warehouse cluster, optimize your schema, and
quickly load your data.
● An overview of all the latest features of Amazon Redshift.
● How Matillion ETL works with Redshift and can help you load massive amounts of
data in minutes.
Overview
Amazon Redshift is a massively parallel, relational data warehouse based on industry-
standard PostgreSQL, so most existing SQL client applications work with only minimal
changes.
● Petabyte scale
● Fully managed
● Zero admin
● SSD and HDD platforms
● As low as $1,000/TB/year
Amazon Redshift cost comparison
DW1 (HDD): price per hour for a DW1.XL single node, and effective annual price per TB
DW2 (SSD): price per hour for a DW2.L single node, and effective annual price per TB
Amazon Redshift is priced by the amount of data you store and by the number of nodes, and the number of nodes is expandable. Depending on the
volume of data you need to manage, your data team can set up anything from a single node -- with 160 GB (0.16 TB) of solid-state disk space -- to
a 128-node cluster whose nodes each hold 16 TB on hard disk drives.
There are a couple of other caveats to keep in mind. Before we get to those, it is important to understand that Amazon separates its node
definitions into two families: Dense Compute and Dense Storage. Each sub-type within these classes stores a maximum amount of
compressed data.
Dense Compute: Recommended for less than 500GB of data
The smallest in the Dense Compute class is the dc1.large, with 2 virtual cores and 0.16 TB of SSD storage per node. You can grow a dc1.large
cluster from one node to 32 nodes, which expands the SSD capacity to 5.12 TB.
Meanwhile, the dc1.8xlarge runs 32 virtual cores per node and scales from a cluster of 2 to 128 nodes, allowing a maximum of 326 TB of SSD storage space.
Dense Storage: Recommended for cost effective scalability for over 500GB of data
For data management using hard disk drives and a larger number of virtual cores, Redshift has two options. The ds2.xlarge can start as a
single node with up to 2 TB of capacity, and can grow to a maximum cluster of 32 nodes.
Need a larger cluster instead? You can switch to the ds2.8xlarge option, with 36 virtual cores per node, starting at 2 nodes and expandable to a
128-node cluster in which each node holds 16 TB of magnetic HDD space.
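The node families above can be captured in a small sizing helper. This is an illustrative sketch, not an AWS API: NODE_SPECS and cluster_capacity_tb are made-up names, and the capacities are the per-node figures quoted in this deck.

```python
# Sketch only: per-node storage and cluster limits for the Redshift node
# types discussed above (figures as quoted in this deck, not an AWS API).
NODE_SPECS = {
    # node type: (TB of compressed storage per node, min nodes, max nodes)
    "dc1.large":   (0.16, 1, 32),
    "dc1.8xlarge": (2.56, 2, 128),
    "ds2.xlarge":  (2.0,  1, 32),
    "ds2.8xlarge": (16.0, 2, 128),
}

def cluster_capacity_tb(node_type: str, node_count: int) -> float:
    """Total compressed storage for a cluster of node_count nodes."""
    per_node_tb, min_nodes, max_nodes = NODE_SPECS[node_type]
    if not min_nodes <= node_count <= max_nodes:
        raise ValueError(f"{node_type} clusters need {min_nodes}-{max_nodes} nodes")
    return per_node_tb * node_count

# A full 32-node dc1.large cluster gives the 5.12 TB ceiling quoted above.
print(round(cluster_capacity_tb("dc1.large", 32), 2))  # 5.12
```

A helper like this is handy for checking whether a planned cluster stays inside a node family's limits before you launch anything.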
Redshift has no upfront costs (with one exception, described below). However, the price per hour depends heavily on your region. For
example, say your region is US West (Northern California) and you're running a small startup with a single node: Redshift's current pricing
structure has you pay $0.33 per hour for that 0.16 TB SSD node. On the other hand, running a single node in the US East (Northern Virginia) region costs
$0.25 per hour.
These rates are Redshift's "On-Demand" pricing. If you want "Reserved Instance" pricing instead, a longer-term commitment of either one or
three years is available.
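The "effective annual price per TB" figures discussed in this deck follow from simple arithmetic: hourly rate, times hours per year, divided by node capacity. A quick check using an illustrative helper (the function name is ours, not AWS's):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760

def effective_price_per_tb_year(hourly_rate_usd: float, tb_per_node: float) -> float:
    """Annualize a per-node hourly rate into an effective $/TB/year."""
    return hourly_rate_usd * HOURS_PER_YEAR / tb_per_node

# A reserved dc1.large at an effective $0.10/hour over 0.16 TB of storage:
print(round(effective_price_per_tb_year(0.10, 0.16)))  # 5475, i.e. roughly $5,500/TB/year
```

The same arithmetic reproduces the other per-TB figures in this deck from their hourly rates.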
A few things to keep in mind regarding infrastructure and pricing for Redshift:
• You choose either Dense Compute or Dense Storage nodes; they cannot be mixed and matched (at least not currently).
• If you decide that Reserved Instance pricing is the way to go, a one-year commitment has no mandatory upfront cost, but you can choose to pay
partially or fully upfront. A three-year contract likewise offers a choice between full and partial upfront payment.
Redshift Architecture
Client applications
Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools, and with business intelligence (BI) reporting, data
mining, and analytics tools.
Clusters
The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node
coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node. The
compute nodes are transparent to external applications.
Leader node
The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution
plans to carry out database operations, in particular, the series of steps necessary to obtain results for complex queries. Based on the execution
plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute
node.
Compute nodes
The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute
nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
Node slices
A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of
the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other
database operations to the slices. The slices then work in parallel to complete the operation.
The number of slices per node is determined by the node size of the cluster.
When you create a table, you can optionally specify one column as the distribution key. When the table is loaded with data, the rows are
distributed to the node slices according to the distribution key that is defined for a table. Choosing a good distribution key enables Amazon
Redshift to use parallel processing to load data and execute queries efficiently.
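To make the distribution-key idea concrete: conceptually, Redshift hashes the distribution key value and uses the hash to pick a slice, so rows with equal keys always land together. A toy illustration follows; Redshift's actual hash function is internal, so this only shows the idea, not the real algorithm:

```python
import hashlib

def slice_for_row(dist_key_value: str, total_slices: int) -> int:
    """Toy model of KEY distribution: hash the distribution-key value and
    map it onto one of the cluster's slices. (Illustrative only; Redshift's
    real hash function is internal and not exposed.)"""
    digest = int(hashlib.md5(dist_key_value.encode()).hexdigest(), 16)
    return digest % total_slices

# Two dc1.large nodes with 2 slices each -> 4 slices in the cluster.
# The same key always maps to the same slice, which is what lets
# joins on the distribution key run without redistributing data.
assert slice_for_row("customer_42", 4) == slice_for_row("customer_42", 4)
```

This is also why a skewed distribution key (one very common value) is harmful: all of its rows hash to one slice, and that slice becomes the bottleneck.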
Databases
A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which
in turn coordinates query execution with the compute nodes.
Use Cases
Traditional Enterprise DW
● Reduce costs by extending the
DW rather than adding hardware
● Migrate completely from
existing DW systems
● Respond faster to the business
Companies with Big Data
● Improve performance by
order of magnitude
● Make more data available for
analysis
● Access business data via
standard reporting tools
SaaS Companies
● Add analytics functionality to
applications
● Scale DW capacity as demand
grows
● Reduce HW & SW costs by an
order of magnitude
Redshift features
Petabyte Scale
With a few clicks in the console or a simple API call, you can
easily change the number or type of nodes in your data
warehouse and scale up all the way to a petabyte or more of
compressed user data. Dense Storage (DS) nodes allow you
to create very large data warehouses using hard disk drives
(HDDs) at a very low price point. Dense Compute (DC) nodes
allow you to create very high-performance data warehouses
using fast CPUs, large amounts of RAM, and solid-state disks
(SSDs). While resizing, Amazon Redshift allows you to
continue to query your data warehouse in read-only mode
until the new cluster is fully provisioned and ready for use.
Optimized for Data Warehousing
Amazon Redshift uses a variety of innovations to obtain very
high query performance on datasets ranging in size from a
hundred gigabytes to an exabyte or more. For petabyte-scale
local data, it uses columnar storage, data compression, and zone
maps to reduce the amount of I/O needed to perform queries.
Amazon Redshift has a massively parallel processing (MPP) data
warehouse architecture, parallelizing and distributing SQL
operations to take advantage of all available resources. The
underlying hardware is designed for high performance data
processing, using local attached storage to maximize throughput
between the CPUs and drives, and a 10GigE mesh network to
maximize throughput between nodes. For exabyte-scale data in
Amazon S3, Amazon Redshift generates an optimal query plan
that minimizes the amount of data scanned and delegates the
query execution to a pool of Redshift Spectrum instances that
scales automatically, so queries run quickly regardless of data
size.
Query your Amazon S3 “data lake”
Redshift Spectrum enables you to run queries against
exabytes of unstructured data in Amazon S3, with no loading
or ETL required. When you issue a query, it goes to the
Amazon Redshift SQL endpoint, which generates and
optimizes a query plan. Amazon Redshift determines what
data is local and what is in Amazon S3, generates a plan to
minimize the amount of Amazon S3 data that needs to be
read, requests Amazon Redshift Spectrum workers out of a
shared resource pool to read and process data from Amazon
S3, and pulls results back into your Amazon Redshift cluster
for any remaining processing.
No Up-Front Costs
You pay only for the resources you provision. You can choose On-
Demand pricing with no up-front costs or long-term commitments,
or obtain significantly discounted rates with Reserved Instance
pricing. On-Demand pricing starts at just $0.25/hour per 160GB
DC1.Large node or $0.85/hour per 2TB DS2.XLarge node. With
Partial Upfront Reserved Instances, you can lower your effective
price to $0.10/hour per DC1.Large node ($5,500/TB/year) or
$0.228/hour per DS2.XLarge node ($999/TB/year). Redshift
Spectrum queries are priced at $5/TB scanned from S3. For more
information, see the Amazon Redshift Pricing page.
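The $5/TB Spectrum rate above is easy to turn into a per-query estimate. A minimal sketch, ignoring billing details such as per-query minimums (the function name is illustrative):

```python
SPECTRUM_USD_PER_TB_SCANNED = 5.0

def spectrum_query_cost(bytes_scanned: int) -> float:
    """Estimate the cost of one Redshift Spectrum query from the bytes it
    scanned in S3, at the $5/TB rate quoted above (1 TB = 10**12 bytes)."""
    return bytes_scanned / 10**12 * SPECTRUM_USD_PER_TB_SCANNED

# Scanning 2 TB of S3 data costs about $10 at this rate.
print(spectrum_query_cost(2 * 10**12))  # 10.0
```

This is also why columnar formats and partitioning matter for Spectrum: the less data a query has to scan, the less it costs.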
Fault Tolerant
Amazon Redshift has multiple features that enhance the reliability
of your data warehouse cluster. All data written to a node in your
cluster is automatically replicated to other nodes within the cluster
and all data is continuously backed up to Amazon S3. Amazon
Redshift continuously monitors the health of the cluster and
automatically re-replicates data from failed drives and replaces
nodes as necessary.
Redshift features
Automated Backups
Amazon Redshift automatically and continuously backs up new data
to Amazon S3. It stores your snapshots for a user-defined period
from 1 up to 35 days. You can take your own snapshots at any time,
and they are retained until you explicitly delete them. Amazon
Redshift can also asynchronously replicate your snapshots to S3 in
another region for disaster recovery. Once you delete a cluster,
your system snapshots are removed, but your user snapshots are
available until you explicitly delete them.
Fast Restores
You can use any system or user snapshot to restore your cluster
using the AWS Management Console or the Amazon Redshift APIs.
Your cluster is available as soon as the system metadata has been
restored and you can start running queries while user data is
spooled down in the background.
Encryption
With just a couple of parameter settings, you can set up Amazon
Redshift to use SSL to secure data in transit and hardware-
accelerated AES-256 encryption for data at rest. If you choose to
enable encryption of data at rest, all data written to disk will be
encrypted as well as any backups. By default, Amazon Redshift
takes care of key management but you can choose to manage your
keys using your own hardware security modules (HSMs), AWS
CloudHSM, or AWS Key Management Service.
Network Isolation
Amazon Redshift enables you to configure firewall rules to control
network access to your data warehouse cluster. You can run
Amazon Redshift inside Amazon VPC to isolate your data
warehouse cluster in your own virtual network and connect it to
your existing IT infrastructure using industry-standard encrypted
IPsec VPN.
Audit and Compliance
Amazon Redshift integrates with AWS CloudTrail to enable you to
audit all Redshift API calls. Amazon Redshift also logs all SQL
operations, including connection attempts, queries and changes to
your database. You can access these logs using SQL queries against
system tables or choose to have them downloaded to a secure
location on Amazon S3. Amazon Redshift is compliant with SOC1,
SOC2, SOC3 and PCI DSS Level 1 requirements. For more details,
please visit AWS Cloud Compliance.
Setting up a Redshift Cluster
● Create an IAM role
● Launch a cluster
● Authorize cluster access
● Connect to the cluster
● Load data (with Matillion)
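For the "connect to the cluster" step, any PostgreSQL-compatible SQL client can talk to the leader-node endpoint that the console shows after launch. A sketch of building a libpq-style connection string; the hostname and credentials below are placeholders, not real values:

```python
def redshift_dsn(host: str, port: int, database: str, user: str) -> str:
    """Build a libpq-style connection string for a SQL client. Redshift
    listens on port 5439 by default; all values are supplied by the caller."""
    return f"host={host} port={port} dbname={database} user={user} sslmode=require"

# Placeholder endpoint -- substitute the hostname the console shows for your cluster.
print(redshift_dsn("snowplow.example.us-east-1.redshift.amazonaws.com",
                   5439, "pbz", "admin"))
```

The same string works with psql or psycopg2; JDBC clients use an equivalent `jdbc:redshift://host:port/dbname` URL instead.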
Launch a Redshift Cluster
Go into the Amazon Web Services console and select "Redshift" from the list of services.
Enter suitable values for the cluster identifier, database name (e.g. 'snowplow'), port, username, and password, then click the "Continue"
button.
We now need to configure the cluster size. Select the values most appropriate to your situation. We generally recommend starting with a single-node cluster,
e.g. a dw1.xlarge or dw2.large node, and then adding nodes as your data volumes grow.
You now have the opportunity to encrypt the database and set the availability zone if you wish. Select your preferences and click "Continue".
Amazon summarises your cluster information. Click "Launch Cluster" to fire up your Redshift instance. This takes a few minutes to complete.
Alternatively, you could use the AWS CLI to launch a new cluster. The outcome of the above steps can be achieved with the following command:
$ aws redshift create-cluster \
    --node-type dc1.large \
    --cluster-type single-node \
    --cluster-identifier snowplow \
    --db-name pbz \
    --master-username admin \
    --master-user-password TopSecret1
Matillion as an ETL Tool
[Architecture diagram: external data and on-premises databases connect over a VPN into a VPC, where the Matillion ETL server -- designed through the browser -- transfers data to Amazon Redshift, RDS, and DynamoDB.]
Matillion Data Sources
ELT - Extract, Load, then Transform
MATILLION EXTRACT, LOAD &
TRANSFORM APPROACH
TRADITIONAL EXTRACT, TRANSFORM &
LOAD APPROACH
Thank You
Harpreet Singh & Nick Tierney
Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...
Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...
Matillion
 
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
Matillion
 
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift SpectrumWebinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Matillion
 
Webinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift SpectrumWebinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift Spectrum
Matillion
 

More from Matillion (16)

Lets Talk Google BigQuery
Lets Talk Google BigQueryLets Talk Google BigQuery
Lets Talk Google BigQuery
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
ELT is Better. Here's Why.
ELT is Better. Here's Why. ELT is Better. Here's Why.
ELT is Better. Here's Why.
 
Pick a Winner: How to Choose a Data Warehouse
Pick a Winner: How to Choose a Data WarehousePick a Winner: How to Choose a Data Warehouse
Pick a Winner: How to Choose a Data Warehouse
 
Dive Into Data Lakes
Dive Into Data LakesDive Into Data Lakes
Dive Into Data Lakes
 
Using ELT to load 1 Billion Rows of Data in 15 Minutes
Using ELT to load 1 Billion Rows of Data in 15 MinutesUsing ELT to load 1 Billion Rows of Data in 15 Minutes
Using ELT to load 1 Billion Rows of Data in 15 Minutes
 
Reach New Heights with Amazon Redshift
Reach New Heights with Amazon RedshiftReach New Heights with Amazon Redshift
Reach New Heights with Amazon Redshift
 
Get Savvy with Snowflake
Get Savvy with SnowflakeGet Savvy with Snowflake
Get Savvy with Snowflake
 
Google BigQuery Best Practices
Google BigQuery Best PracticesGoogle BigQuery Best Practices
Google BigQuery Best Practices
 
ELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it mattersELT vs. ETL - How they’re different and why it matters
ELT vs. ETL - How they’re different and why it matters
 
How to Choose a Data Warehouse
How to Choose a Data WarehouseHow to Choose a Data Warehouse
How to Choose a Data Warehouse
 
Kickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon RedshiftKickstart your data strategy for 2018: Getting started with Amazon Redshift
Kickstart your data strategy for 2018: Getting started with Amazon Redshift
 
Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...
Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...
Simplifying Your Journey to the Cloud: The Benefits of a Cloud-Based Data War...
 
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
Using Google Cloud for Marketing Analytics: How the7stars, the UK’s largest i...
 
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift SpectrumWebinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
Webinar | Accessing Your Data Lake Assets from Amazon Redshift Spectrum
 
Webinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift SpectrumWebinar | Getting Started With Amazon Redshift Spectrum
Webinar | Getting Started With Amazon Redshift Spectrum
 

Recently uploaded

What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
pavan998932
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
Green Software Development
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Łukasz Chruściel
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
timtebeek1
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
Deuglo Infosystem Pvt Ltd
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Neo4j
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
Quickdice ERP
 

Recently uploaded (20)

What is Augmented Reality Image Tracking
What is Augmented Reality Image TrackingWhat is Augmented Reality Image Tracking
What is Augmented Reality Image Tracking
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, FactsALGIT - Assembly Line for Green IT - Numbers, Data, Facts
ALGIT - Assembly Line for Green IT - Numbers, Data, Facts
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Need for Speed: Removing speed bumps from your Symfony projects ⚡️
Need for Speed: Removing speed bumps from your Symfony projects ⚡️
 
Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdfAutomated software refactoring with OpenRewrite and Generative AI.pptx.pdf
Automated software refactoring with OpenRewrite and Generative AI.pptx.pdf
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Empowering Growth with Best Software Development Company in Noida - Deuglo
Empowering Growth with Best Software  Development Company in Noida - DeugloEmpowering Growth with Best Software  Development Company in Noida - Deuglo
Empowering Growth with Best Software Development Company in Noida - Deuglo
 
Atelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissancesAtelier - Innover avec l’IA Générative et les graphes de connaissances
Atelier - Innover avec l’IA Générative et les graphes de connaissances
 
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian CompaniesE-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
E-Invoicing Implementation: A Step-by-Step Guide for Saudi Arabian Companies
 

Getting Started With Amazon Redshift

  • 1. 1 www.matillion.com © 2017 Matillion. All rights reserved. Matillion Webinar, Tuesday, August 29, 2017: Getting Started with Amazon Redshift. Presented by: Harpreet Singh & Nick Tierney. Matillion trademarks, registered trademarks or service marks are property of their respective owners.
  • 2. 2 www.matillion.com © 2017 Matillion. All rights reserved. Matillion ETL for Amazon Redshift • Data integration tool built specifically for AWS and Amazon Redshift • Push-down ELT architecture • Intuitive browser-based UX • Powerful feature set • Retail-like acquisition through AWS Marketplace
  • 3. 3 www.matillion.com © 2017 Matillion. All rights reserved. Benefits of Matillion ETL for Amazon Redshift. ELT Architecture (simplified infrastructure; fast, scalable performance; increases development productivity). Intuitive UX (low skills on-ramp; powerful; reduces the cost of data integration development). Built for AWS (tight integration with AWS services; designed for the cloud; no compromise). Retail-like purchasing (5 minutes to stand up; inexpensive, utility-based pricing; available exclusively on AWS Marketplace).
  • 4. 4 www.matillion.com © 2017 Matillion. All rights reserved. Agenda ● An overview of petabyte-scale data warehouses, their architecture, and use cases. ● An introduction to Amazon Redshift's parallel processing, columnar storage, and scale-out architecture. ● Learn how to configure your data warehouse cluster, optimize your schema, and quickly load your data. ● An overview of the latest features of Amazon Redshift. ● How Matillion ETL works with Redshift and can help you load massive amounts of data in minutes.
  • 5. 5 www.matillion.com © 2017 Matillion. All rights reserved. Overview Amazon Redshift is a massively parallel relational data warehouse based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes. ● Petabyte scale ● Fully managed, zero admin ● SSD & HDD platforms ● As low as $1,000/TB/year
  • 6. 6 www.matillion.com © 2017 Matillion. All rights reserved. Amazon Redshift cost comparison (table): price per hour for a single node and the effective annual price per TB, shown for the DW1.XL (HDD) and DW2.L (SSD) node types.
  • 7. 7 www.matillion.com © 2017 Matillion. All rights reserved. Amazon Redshift is priced by the amount of data you store and by the number of nodes, and the number of nodes is expandable. Depending on the volume of data you need to manage, your data team can set up anything from a single node with 160 GB (0.16 TB) of solid-state disk space to a 128-node cluster of hard-disk-drive nodes holding 16 TB each. There are a couple of other caveats to keep in mind. But before we present those, it's important to understand that Amazon separates its node definitions into two families: Dense Compute and Dense Storage. Each of the sub-types within these classifications stores a maximum amount of compressed data. Dense Compute: recommended for less than 500 GB of data. The smallest in the Dense Compute class is the dc1.large, with 2 virtual cores and 0.16 TB of SSD storage capacity. You can grow dc1.large from one node to a cluster of 32 nodes, which expands the SSD capacity to 5.12 TB. Meanwhile, the dc1.8xlarge runs 32 virtual cores and is scalable from a cluster of 2 to 128 nodes, which allows a maximum of 326 TB of SSD storage space. Dense Storage: recommended for cost-effective scalability beyond 500 GB of data. For data management using hard disk drives and a larger number of virtual cores, Redshift has two options. The ds2.xlarge can start as a single node with up to 2 TB of capacity, and can be grown to a maximum cluster of 32 nodes. Need a larger cluster instead? You can switch to the ds2.8xlarge option with 36 virtual cores, starting at 2 nodes and expandable to a 128-node cluster, with 16 TB of magnetic HDD space per node. Redshift has no upfront costs (with one exception, described below). However, the price per hour depends largely on your region.
For example, let’s say your region is US West (Northern California) and you're running a small startup with a single node. Redshift’s current pricing structure states that you’ll pay $0.33 per hour for that 0.16 TB SSD node. On the other hand, running a single node in the US East (Northern Virginia) region costs you $0.25 per hour. These rates are Redshift’s “On Demand” pricing. If you want “Reserved Instance” pricing, a longer-term commitment of either one or three years is available. A few things to keep in mind regarding infrastructure and pricing for Redshift: • You choose either Dense Compute or Dense Storage nodes; they cannot be mixed and matched (at least not currently). • If you decide that Reserved Instance pricing is the way to go, there are no upfront costs for the one-year term commitment, but you can choose to pay upfront in part or in full. With a three-year contract, you also have a choice between full and partial upfront payments.
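The regional rates above are easy to sanity-check. A quick sketch of the arithmetic, using the 2017 on-demand rates quoted in this deck (current prices differ):

```python
# Annual on-demand cost of one dc1.large node at the 2017 rates quoted
# above (US West $0.33/hr, US East $0.25/hr); current prices differ.
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_cost(hourly_rate: float) -> float:
    """Hourly on-demand rate -> cost of running one node for a year."""
    return hourly_rate * HOURS_PER_YEAR

us_west = annual_cost(0.33)  # approximately $2,890.80 per year
us_east = annual_cost(0.25)  # $2,190.00 per year
# Effective annual price per TB for the 0.16 TB dc1.large node:
print(round(us_east / 0.16, 2))  # → 13687.5
```

Storing small volumes on Dense Compute SSD nodes is therefore far pricier per TB than the headline "$1,000/TB/year", which applies to fully loaded Dense Storage reserved nodes.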
  • 8. 8 www.matillion.com © 2017 Matillion. All rights reserved. Redshift Architecture. Client applications: Amazon Redshift integrates with various data loading and ETL (extract, transform, and load) tools, as well as business intelligence (BI) reporting, data mining, and analytics tools. Clusters: The core infrastructure component of an Amazon Redshift data warehouse is a cluster. A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Your client application interacts directly only with the leader node; the compute nodes are transparent to external applications. Leader node: The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations, in particular the series of steps necessary to obtain results for complex queries. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node. Compute nodes: The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation. Node slices: A compute node is partitioned into slices. Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation. The number of slices per node is determined by the node size of the cluster. When you create a table, you can optionally specify one column as the distribution key.
When the table is loaded with data, the rows are distributed to the node slices according to the distribution key that is defined for the table. Choosing a good distribution key enables Amazon Redshift to use parallel processing to load data and execute queries efficiently. Databases: A cluster contains one or more databases. User data is stored on the compute nodes. Your SQL client communicates with the leader node, which in turn coordinates query execution with the compute nodes.
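The distribution-key placement described above can be sketched in a few lines. This is illustrative only: Redshift's real hash function is internal, and the slice count, key values, and md5 stand-in here are all made up.

```python
# Illustrative sketch of distribution-key placement. Redshift's real
# hash is internal; NUM_SLICES, the key values, and the md5 stand-in
# are assumptions for demonstration.
import hashlib

NUM_SLICES = 4  # e.g. a 2-node cluster with 2 slices per node

def slice_for(dist_key_value: str) -> int:
    """Map a distribution-key value to a node slice."""
    digest = hashlib.md5(dist_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SLICES

rows = ["customer-1", "customer-2", "customer-1", "customer-3"]
placement = [slice_for(r) for r in rows]
# Equal key values always hash to the same slice, so a join or GROUP BY
# on the distribution key needs no data movement between nodes.
assert placement[0] == placement[2]
```

This is why the slide stresses choosing a "good" distribution key: a low-cardinality or skewed key sends most rows to a few slices, losing the parallelism.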
  • 9. 9 www.matillion.com © 2017 Matillion. All rights reserved. Use Cases. Traditional enterprise DW: ● Reduce costs by extending the DW rather than adding hardware ● Migrate completely from existing DW systems ● Respond faster to the business. Companies with big data: ● Improve performance by an order of magnitude ● Make more data available for analysis ● Access business data via standard reporting tools. SaaS companies: ● Add analytics functionality to applications ● Scale DW capacity as demand grows ● Reduce HW & SW costs by an order of magnitude.
  • 10. 10 www.matillion.com © 2017 Matillion. All rights reserved. Petabyte Scale: With a few clicks in the console or a simple API call, you can easily change the number or type of nodes in your data warehouse and scale up all the way to a petabyte or more of compressed user data. Dense Storage (DS) nodes allow you to create very large data warehouses using hard disk drives (HDDs) at a very low price point. Dense Compute (DC) nodes allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM, and solid-state disks (SSDs). While resizing, Amazon Redshift allows you to continue to query your data warehouse in read-only mode until the new cluster is fully provisioned and ready for use. Redshift features. Optimized for Data Warehousing: Amazon Redshift uses a variety of innovations to obtain very high query performance on datasets ranging in size from a hundred gigabytes to an exabyte or more. For petabyte-scale local data, it uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. Amazon Redshift has a massively parallel processing (MPP) data warehouse architecture, parallelizing and distributing SQL operations to take advantage of all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the CPUs and drives, and a 10 GigE mesh network to maximize throughput between nodes. For exabyte-scale data in Amazon S3, Amazon Redshift generates an optimal query plan that minimizes the amount of data scanned and delegates query execution to a pool of Redshift Spectrum instances that scales automatically, so queries run quickly regardless of data size. Query your Amazon S3 “data lake”: Redshift Spectrum enables you to run queries against exabytes of unstructured data in Amazon S3, with no loading or ETL required.
When you issue a query, it goes to the Amazon Redshift SQL endpoint, which generates and optimizes a query plan. Amazon Redshift determines what data is local and what is in Amazon S3, generates a plan to minimize the amount of Amazon S3 data that needs to be read, requests Amazon Redshift Spectrum workers out of a shared resource pool to read and process data from Amazon S3, and pulls results back into your Amazon Redshift cluster for any remaining processing. No Up-Front Costs: You pay only for the resources you provision. You can choose On-Demand pricing with no up-front costs or long-term commitments, or obtain significantly discounted rates with Reserved Instance pricing. On-Demand pricing starts at just $0.25/hour per 160 GB DC1.Large node or $0.85/hour per 2 TB DS2.XLarge node. With Partial Upfront Reserved Instances, you can lower your effective price to $0.10/hour per DC1.Large node ($5,500/TB/year) or $0.228/hour per DS2.XLarge node ($999/TB/year). Redshift Spectrum queries are priced at $5/TB scanned from S3. For more information, see the Amazon Redshift Pricing page. Fault Tolerant: Amazon Redshift has multiple features that enhance the reliability of your data warehouse cluster. All data written to a node in your cluster is automatically replicated to other nodes within the cluster, and all data is continuously backed up to Amazon S3. Amazon Redshift continuously monitors the health of the cluster and automatically re-replicates data from failed drives and replaces nodes as necessary.
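The zone maps mentioned on this slide are per-block min/max values that let the engine skip blocks a predicate cannot match. A minimal sketch of the idea (the block boundaries and dates are invented; real blocks are 1 MB and maintained automatically):

```python
# Minimal sketch of zone-map pruning over a date column sorted by date.
# The three "blocks" and their min/max ranges are made-up examples;
# Redshift maintains these statistics itself per 1 MB block.
blocks = [
    {"id": 0, "min": "2017-01-01", "max": "2017-03-31"},
    {"id": 1, "min": "2017-04-01", "max": "2017-06-30"},
    {"id": 2, "min": "2017-07-01", "max": "2017-09-30"},
]

def blocks_to_scan(lo: str, hi: str) -> list:
    """Ids of blocks whose [min, max] range overlaps the predicate [lo, hi]."""
    return [b["id"] for b in blocks if b["max"] >= lo and b["min"] <= hi]

# A query filtering on May 2017 touches one block instead of three,
# which is the I/O reduction the slide describes.
print(blocks_to_scan("2017-05-01", "2017-05-31"))  # → [1]
```

Combined with columnar storage (only the referenced columns are read at all), this is why sort-key choice matters: zone maps prune best when the filtered column is sorted.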
  • 11. 11 www.matillion.com © 2017 Matillion. All rights reserved. Redshift features. Automated Backups: Amazon Redshift automatically and continuously backs up new data to Amazon S3. It stores your snapshots for a user-defined period from 1 to 35 days. You can take your own snapshots at any time, and they are retained until you explicitly delete them. Amazon Redshift can also asynchronously replicate your snapshots to S3 in another region for disaster recovery. Once you delete a cluster, your system snapshots are removed, but your user snapshots remain available until you explicitly delete them. Fast Restores: You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Amazon Redshift APIs. Your cluster is available as soon as the system metadata has been restored, and you can start running queries while user data is spooled down in the background. Encryption: With just a couple of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit and hardware-accelerated AES-256 encryption for data at rest. If you choose to enable encryption of data at rest, all data written to disk is encrypted, as are any backups. By default, Amazon Redshift takes care of key management, but you can choose to manage your keys using your own hardware security modules (HSMs), AWS CloudHSM, or AWS Key Management Service. Network Isolation: Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster. You can run Amazon Redshift inside Amazon VPC to isolate your data warehouse cluster in your own virtual network and connect it to your existing IT infrastructure using industry-standard encrypted IPsec VPN. Audit and Compliance: Amazon Redshift integrates with AWS CloudTrail to enable you to audit all Redshift API calls. Amazon Redshift also logs all SQL operations, including connection attempts, queries, and changes to your database.
You can access these logs using SQL queries against system tables or choose to have them downloaded to a secure location on Amazon S3. Amazon Redshift is compliant with SOC1, SOC2, SOC3 and PCI DSS Level 1 requirements. For more details, please visit AWS Cloud Compliance.
  • 12. 12 www.matillion.com © 2017 Matillion. All rights reserved. Setting up a Redshift Cluster ● Create an IAM role ● Launch a cluster ● Authorize cluster access ● Connect to the cluster ● Load data (Matillion)
  • 13. 13 www.matillion.com © 2017 Matillion. All rights reserved. Launch a Redshift Cluster. Go into the Amazon Web Services console and select "Redshift" from the list of services.
  • 14. 14 www.matillion.com © 2017 Matillion. All rights reserved. Enter suitable values for the cluster identifier, database name (e.g. 'snowplow'), port, username and password. Click the "Continue" button.
  • 15. 15 www.matillion.com © 2017 Matillion. All rights reserved. We now need to configure the cluster size. Select the values that are most appropriate to your situation. We generally recommend starting with a single-node cluster of a dw1.xlarge or dw2.large node type, and then adding nodes as your data volumes grow. You now have the opportunity to encrypt the database and set the availability zone if you wish. Select your preferences and click "Continue".
  • 16. 16 www.matillion.com © 2017 Matillion. All rights reserved. Amazon summarises your cluster information. Click "Launch Cluster" to fire your Redshift instance up. This will take a few minutes to complete. Alternatively, you could use AWS CLI to launch a new cluster. The outcome of the above steps could be achieved with the following command. $ aws redshift create-cluster --node-type dc1.large --cluster-type single-node --cluster-identifier snowplow --db-name pbz --master-username admin --master-user-password TopSecret1
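Once the cluster launched by the command above reaches the "available" state, `aws redshift describe-clusters` (or the equivalent API call) returns its connection endpoint. A hypothetical helper that builds a JDBC URL from a describe-clusters-style response; the sample response is hand-written to mimic that shape, and the address in it is made up:

```python
# Hypothetical helper: read the endpoint out of a describe-clusters
# style response and build a JDBC URL. The sample response mimics the
# AWS response shape; the address is invented.
def endpoint_url(response: dict, database: str) -> str:
    """Build a JDBC URL from a describe-clusters style response."""
    endpoint = response["Clusters"][0]["Endpoint"]  # present once available
    return f"jdbc:redshift://{endpoint['Address']}:{endpoint['Port']}/{database}"

sample = {
    "Clusters": [{
        "ClusterStatus": "available",
        "Endpoint": {
            "Address": "snowplow.abc123.us-east-1.redshift.amazonaws.com",
            "Port": 5439,
        },
    }]
}
print(endpoint_url(sample, "pbz"))
# → jdbc:redshift://snowplow.abc123.us-east-1.redshift.amazonaws.com:5439/pbz
```

This is the URL you would paste into a SQL client, or into Matillion's connection settings, in the next step.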
  • 17. 17 www.matillion.com © 2017 Matillion. All rights reserved. Matillion as an ETL tool (architecture diagram): a browser-based design UI drives the Matillion ETL server running inside your VPC, which transfers data into Redshift from external data sources, on-premises databases (over a VPN), RDS, and DynamoDB.
  • 18. 18 www.matillion.com © 2017 Matillion. All rights reserved. Matillion Data Sources
  • 19. 19 www.matillion.com © 2017 Matillion. All rights reserved. ELT - Extract, Load, then Transform (diagram): the Matillion extract, load & transform approach compared with the traditional extract, transform & load approach.
  • 21. 21 www.matillion.com © 2017 Matillion. All rights reserved. Thank You. Presented by: Harpreet Singh & Nick Tierney. Matillion trademarks, registered trademarks or service marks are property of their respective owners.

Editor's Notes

  1. So what’s Matillion? Benefits of an ETL tool (ease, speed of dev, skills) albeit re-booted for 2017 …but, using an ELT architecture … powerful features (mention data load, inc Teradata, GBQ, sources) … delivered in a retail like commercial model
  2. Benefits for customers – why they buy it Bring these to life