Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data for a fraction of the cost of traditional data warehouses.
In this webinar, you will learn how to easily migrate your data from other data warehouses into Amazon Redshift, efficiently load your data with Amazon Redshift's massively parallel processing (MPP) capabilities, and automate data loading with AWS Lambda and AWS Data Pipeline. You will also learn about ETL tools from our partners to extract, transform, and prepare data from disparate data sources before loading it into Amazon Redshift.
Learning Objectives:
Understand common patterns for migrating your data to Amazon Redshift
See live examples of the COPY command, which fully parallelizes data ingestion
Learn how to automate the load process using AWS Lambda & AWS Data Pipeline
Learn techniques for real-time data loading
Explore ETL tool options from our partners
2. Amazon Redshift – Resources
Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
5. Common Migration Patterns
Data from a variety of relational OLTP systems: the structure lends itself to SQL schemas
Data from logs, devices, sensors, and similar sources: the data is less structured
6. Structured Data Loading
Data is often already being loaded into another warehouse by an existing ETL process.
The temptation is to ‘lift & shift’ that workload.
Resist the temptation. Instead consider:
What do I really want to do?
What do I need?
7. Ingesting Less Structured Data
Some data does not lend itself to a relational schema
A common pattern is to use Amazon EMR to:
impose structure
import into Amazon Redshift
Other solutions are often home-grown scripting applications.
8. Loading Data
There are two basic approaches:
Load into an empty Amazon Redshift database (truncate and load)
Load the changes captured in the source system into Amazon Redshift
9. Truncate and Load
This is by far the easiest option:
Move the data to Amazon Simple Storage Service (Amazon S3) via:
multipart upload
AWS Import/Export
AWS Direct Connect
COPY the data into Amazon Redshift, one table at a time (a sketch follows).
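A minimal sketch of the COPY step, assuming a hypothetical orders table; the bucket name and IAM role ARN are placeholders:

-- Hypothetical example: load one table from pipe-delimited files staged in S3.
COPY orders
FROM 's3://my-migration-bucket/orders/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|';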
10. Load Changes
Identify changes in source systems
Move data to Amazon S3
Load changes
‘Upsert process’
Partner ETL tools
11. Partner ETL
Amazon Redshift is supported by a variety of ETL vendors
Many simplify the process of data loading
Visit http://aws.amazon.com/redshift/partners
There are a variety of vendors offering a free trial of their products, allowing you to evaluate and choose the one that suits your needs.
12. Upsert
The goal is to insert new rows and update changed rows in Amazon Redshift:
Load data into a temporary staging table
Join the staging table with the production table and delete the matching rows from production
Copy the new data from staging into the production table
See Updating and Inserting New Data in the developer’s guide; a sketch of the pattern follows.
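A minimal sketch of the upsert pattern, assuming a hypothetical orders table keyed on order_id; bucket and role names are placeholders:

BEGIN TRANSACTION;

-- Stage the incoming changes in a temporary table with the same structure.
CREATE TEMP TABLE stage_orders (LIKE orders);

COPY stage_orders
FROM 's3://my-migration-bucket/orders-changes/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|';

-- Delete the production rows that are being replaced.
DELETE FROM orders
USING stage_orders
WHERE orders.order_id = stage_orders.order_id;

-- Insert the new and changed rows.
INSERT INTO orders
SELECT * FROM stage_orders;

END TRANSACTION;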
13. Checkpoint
We’ve talked about common migration patterns
Sources of data and data structure
Methods of getting data to AWS
Options for loading data
15. Amazon Redshift Architecture
Leader Node
• SQL endpoint, JDBC/ODBC
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via Amazon S3
• Load from Amazon DynamoDB or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
[Architecture diagram: clients connect to the leader node over JDBC/ODBC; compute nodes communicate over a 10 GigE (HPC) network; ingestion, backup, and restore flow through Amazon S3]
16. A Closer Look
Each node is split into slices
• One slice per core
Each slice is allocated memory, CPU, and disk space
Each slice processes a piece of the workload in parallel (a sample query follows)
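To check how many slices your cluster has, you can query the STV_SLICES system table; a minimal sketch:

-- Count the slices on each compute node.
SELECT node, COUNT(*) AS slices
FROM stv_slices
GROUP BY node
ORDER BY node;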
17. COPY command
Use the COPY command.
Each slice can load one file at a time.
Partition input files so every slice can load in parallel.
Use a manifest file.
Use COMPUPDATE ON when loading into an empty table (a sketch follows).
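A sketch of an initial load into an empty table with compression analysis made explicit; the table, bucket, and role names are placeholders:

-- Hypothetical example: first load into an empty table, letting COPY sample
-- the data and choose column encodings.
COPY customers
FROM 's3://my-migration-bucket/customers/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|'
COMPUPDATE ON;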
18. Use multiple input files to maximize throughput
Use the COPY command
Each slice can load one file at a time
A single input file means only one slice is ingesting data
Instead of 100MB/s, you’re only getting 6.25MB/s
19. Use multiple input files to maximize throughput
Use the COPY command
You need at least as many input files as you have slices
With 16 input files, all slices are working so you maximize throughput
Get 100MB/s per node; scale linearly as you add nodes (a sketch follows)
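A sketch of loading multiple input files in one COPY, assuming hypothetical files orders.txt.1 through orders.txt.16 that share a key prefix; bucket and role names are placeholders:

-- All objects under the prefix 'orders.txt.' are loaded by a single COPY,
-- one file per slice, in parallel.
COPY orders
FROM 's3://my-migration-bucket/split/orders.txt.'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|';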
20. Primary keys and manifest files
Amazon Redshift doesn’t enforce primary key constraints
• If you load data multiple times, Amazon Redshift won’t complain
• If you declare primary keys in your DDL, the optimizer will expect the data to be unique
Use manifest files to control exactly what is loaded and how to respond if input files are missing
• Define a JSON manifest on Amazon S3
• Ensures the cluster loads exactly what you want (a sketch follows)
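A sketch of a manifest-driven load; the manifest path, bucket, and role names are placeholders, and the JSON shown in the comments is an illustrative manifest body:

-- The manifest is a JSON file stored in S3, for example:
-- {
--   "entries": [
--     {"url": "s3://my-migration-bucket/load/orders.txt.1", "mandatory": true},
--     {"url": "s3://my-migration-bucket/load/orders.txt.2", "mandatory": true}
--   ]
-- }
-- The MANIFEST keyword tells COPY to load exactly the files listed, and to
-- fail if a mandatory file is missing.
COPY orders
FROM 's3://my-migration-bucket/load/orders.manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|'
MANIFEST;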
21. Analyze sort/dist key columns after every load
Amazon Redshift’s query optimizer relies on up-to-date statistics
Maximize performance by updating stats on sort/dist key columns after every load (a sketch follows)
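A minimal sketch, assuming a hypothetical orders table distributed on customer_id and sorted on order_date:

-- Refresh statistics on just the dist key and sort key columns after a load.
ANALYZE orders (customer_id, order_date);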
22. Automatic compression
Better performance, lower costs
COPY samples data automatically when loading into an empty table
• Samples up to 100,000 rows and picks optimal encoding
If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression
• Use ANALYZE COMPRESSION to determine the right encodings
• Bake those encodings into your DDL (a sketch follows)
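A sketch of that workflow, assuming a hypothetical orders table; the encodings shown are only illustrative, so use whatever ANALYZE COMPRESSION actually recommends:

-- Run once against representative data to get recommended encodings.
ANALYZE COMPRESSION orders;

-- Bake the recommendations into the staging table DDL so routine COPY runs
-- can use COMPUPDATE OFF.
CREATE TABLE orders_staging (
  order_id    BIGINT        ENCODE delta,
  customer_id INTEGER       ENCODE lzo,
  order_date  DATE          ENCODE delta32k,
  amount      DECIMAL(12,2) ENCODE lzo
);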
23. Checking STL_LOAD_COMMITS
SELECT query, trim(filename) as filename, curtime, status
FROM stl_load_commits
WHERE filename LIKE '%table name%'
ORDER BY query;
After the load operation is complete, query the STL_LOAD_COMMITS system table to verify that the expected files were loaded.
24. COPY and 18 inserts
COPY country FROM 's3://…country.txt' CREDENTIALS …
1.57s

insert into country (country_name) values ('Slovakia'),('Slovenia'),('South Africa'),('South Korea'),('Spain');
5.44s

[Screenshot: timing comparison of INSERT vs. COPY, with commit information]
25. COPY best practice
Use it.
Avoid inserts, which will not run in parallel.
If you are moving data from one table to another, use the deep copy features:
1. Use the original CREATE TABLE DDL and then INSERT INTO … SELECT
2. CREATE TABLE AS
3. CREATE TABLE LIKE
4. Create a temporary table and truncate the original.
A sketch of option 3 follows.
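A minimal sketch of option 3, using hypothetical table names:

-- Deep copy: recreate the table with the same DDL, copy the rows, and swap names.
CREATE TABLE orders_copy (LIKE orders);

INSERT INTO orders_copy
SELECT * FROM orders;

DROP TABLE orders;
ALTER TABLE orders_copy RENAME TO orders;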
27. Automating Data Ingestion
Many customers run custom scripts on EC2 instances to load data into Amazon Redshift.
Another option is the AWS Data Pipeline automation tool.
AWS Lambda-based Amazon Redshift Loader
32. Using the Lambda-based Redshift Loader
Offers the ability to drop files into S3 and load them into any number of database tables in multiple Amazon Redshift clusters automatically, with no servers to maintain.
33. Configure the sample loader
johnlou$ ./configureSample.sh more.ohno.us-east-1.redshift.amazonaws.com 8192 mydb johnlou us-east-1
Password for user johnlou:
create user test_lambda_load_user password 'Change-me1!';
CREATE USER
create table lambda_redshift_sample(
column_a int,
column_b int,
column_c int
);
CREATE TABLE
Enter the Region for the Redshift Load Configuration > us-east-1
Enter the S3 Bucket to use for the Sample Input > johnlou-ohno/loader-demo-data
Enter the Access Key used by Redshift to get data from S3 > nope
Enter the Secret Key used by Redshift to get data from S3 > nope
Creating Tables in Dynamo DB if Required
Configuration for johnlou-ohno/loader-demo-data/input successfully written in us-east-1
36. Micro-batch loading
Ideal for time series data
Balance input files
Pre-configure column encoding
Reduce the frequency of statistics calculation (see the sketch after this list)
Load in sort key order
Use SSD instances
Consider using the ‘Load Stream’ architecture HasOffers developed.
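A sketch of a single micro-batch COPY under those assumptions (encodings pre-set in the DDL, statistics refreshed on a separate schedule); table, bucket, and role names are placeholders:

-- Skip compression analysis and statistics updates on each small, frequent load.
COPY events
FROM 's3://my-migration-bucket/microbatch/2015-07-22-10-05/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftLoadRole'
DELIMITER '|'
COMPUPDATE OFF
STATUPDATE OFF;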
38. Data Loading Options
Parallel upload to Amazon S3
AWS Direct Connect
AWS Import/Export
Amazon Kinesis
Data integration and systems integrator partners
39. Resources on the AWS Big Data Blog
Best Practices for Micro-Batch Loading on Amazon Redshift
Using Attunity Cloudbeam at UMUC to Replicate Data to Amazon RDS and Amazon Redshift
A Zero-Administration Amazon Redshift Database Loader
40. Best Practices References
Best Practices for Designing Tables
Best Practices for Designing Queries
Best Practices for Loading Data