
Deploying your Data Warehouse on AWS


Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using your existing business intelligence tools.


Deploying your Data Warehouse on AWS

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pratim Das 20th of September 2017 Deploying your Data Warehouse on AWS
  2. 2. Deploying your Data Warehouse on AWS [decision-flow diagram: Ingest (batch or streaming via AWS Lambda, Glue and/or Kinesis Firehose; OLTP databases via your own code; existing DWHs via the SCT Migration Agent) → Store (S3) → Prepare for Analytics (Glue ETL; query-optimized and ready for self-service) → Store (Redshift data warehouse, DWH and data marts) → Analyze (Athena query service for ad-hoc analysis, Redshift Spectrum, BI & visualization, predictive analytics)]
  3. 3. Redshift Architecture
  4. 4. Managed, massively parallel, petabyte-scale data warehouse. Streaming backup/restore to S3; load data from S3, DynamoDB and EMR; extensive security features; scale online from 160 GB to 2 PB. Fast, compatible, secure, elastic, simple, cost-efficient.
  5. 5. Amazon Redshift Cluster Architecture: massively parallel, shared nothing. Leader node: SQL endpoint (JDBC/ODBC from SQL clients/BI tools); stores metadata; coordinates parallel SQL processing. Compute nodes: local, columnar storage; execute queries in parallel; handle load, backup and restore (ingestion from S3 / EMR / DynamoDB / SSH, backup/restore to S3); 2, 16 or 32 slices each; e.g. 128 GB RAM, 16 TB disk and 16 cores per node, connected over a 10 GigE (HPC) interconnect.
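As a quick way to see the node/slice layout described above, a small sketch against the STV_SLICES system view (a real Redshift view; the cluster it runs on is assumed):

    -- Show how query slices map onto compute nodes in the current cluster
    SELECT node, slice
    FROM stv_slices
    ORDER BY node, slice;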
  6. 6. Common Redshift Use Cases
  7. 7. Use Case: Traditional Data Warehousing. Business reporting, advanced pipelines and queries, bulk loads and updates. Easy Migration – point & click using AWS Database Migration Service. Secure & Compliant – end-to-end encryption; SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant. Large Ecosystem – a variety of cloud and on-premises BI and ETL tools. Customer examples: a Japanese mobile phone provider, a company powering 100 marketplaces in 50 countries, and the world's largest children's book publisher.
  8. 8. Use Case: Log Analysis. Log & machine (IoT) data, clickstream events, time-series data. Cheap – analyze large volumes of data cost-effectively. Fast – massively parallel processing (MPP) and columnar architecture for fast queries and parallel loads. Near real-time – micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics. Customer examples: interactive data analysis and recommendation engines, ride analytics for pricing and product development, ad prediction and on-demand analytics.
  9. 9. Use Case: Business Applications. Multi-tenant BI applications, back-end services, analytics as a service. Fully Managed – provisioning, backups, upgrades, security and compression all come built in, so you can focus on your business applications. Ease of Chargeback – pay as you go and add clusters as needed: a few big common clusters, several data marts. Service-Oriented Architecture – integrated with other AWS services and easy to plug into your pipeline. Customer examples: Infosys Information Platform (IIP), analytics-as-a-service offerings, product and consumer analytics.
  10. 10. Redshift Customers
  11. 11. Selected Amazon Redshift customers
  12. 12. NTT Docomo: Japan's largest mobile service provider. 68 million customers; tens of TBs per day of data across a mobile network; 6 PB of total data (uncompressed); data science for marketing operations, logistics, and so on. Previously Greenplum on-premises: scaling challenges, performance issues, need for the same level of security, need for a hybrid environment.
  13. 13. NTT Docomo (continued): 125-node DS2.8XL cluster; 4,500 vCPUs, 30 TB RAM; 2 PB compressed; 10x faster analytic queries; 50% reduction in time for new BI application deployment; significantly less operations overhead. [architecture diagram: data sources → ETL → AWS Direct Connect → client forwarder → loader with state management → Amazon Redshift, S3 and a sandbox]
  14. 14. Nasdaq: powering 100 marketplaces in 50 countries. Orders, quotes, trade executions and market "tick" data from 7 exchanges; 7 billion rows/day; analyze market share, client activity, surveillance, billing, and so on. Previously Microsoft SQL Server on-premises: expensive legacy DW ($1.16 M/yr.); limited capacity (1 yr. of data online); needed lower TCO; must satisfy multiple security and regulatory requirements with similar performance.
  15. 15. Nasdaq (continued): 23-node DS2.8XL cluster; 828 vCPUs, 5 TB RAM; 368 TB compressed; 2.7 T rows, 900 B derived; 8 tables with 100 B rows; 7 man-month migration; ¼ the cost, 2x the storage, room to grow; faster performance, very secure.
  16. 16. Tuning Redshift for Performance
  17. 17. Design for Queryability • Do an equal amount of work on each slice • Do the minimum amount of work on each slice • Use just enough cluster resources to complete as quickly as possible
  18. 18. Do an Equal Amount of Work on Each Slice
  19. 19. Choose the Best Table Distribution Style. ALL: all data replicated to every node. KEY: rows with the same key value stored in the same location (slice). EVEN: round-robin distribution across slices. (A DDL sketch follows below.)
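A hedged DDL sketch of the three styles (table and column names are hypothetical):

    -- ALL: replicate a small dimension table to every node so joins stay local
    CREATE TABLE dim_region (
        region_id   INT,
        region_name VARCHAR(64)
    ) DISTSTYLE ALL;

    -- KEY: co-locate fact rows that join on the same key
    CREATE TABLE fact_sales (
        region_id INT,
        amount    DECIMAL(12,2)
    ) DISTSTYLE KEY DISTKEY (region_id);

    -- EVEN: round-robin when there is no dominant join key
    CREATE TABLE staging_events (
        event_id BIGINT,
        payload  VARCHAR(1000)
    ) DISTSTYLE EVEN;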
  20. 20. Do the Minimum Amount of Work on Each Slice
  21. 21. Columnar storage + large data block sizes + data compression + zone maps + direct-attached storage: reduced I/O = enhanced performance. Example output of analyze compression listing;
      Table   | Column         | Encoding
      --------+----------------+----------
      listing | listid         | delta
      listing | sellerid       | delta32k
      listing | eventid        | delta32k
      listing | dateid         | bytedict
      listing | numtickets     | bytedict
      listing | priceperticket | delta32k
      listing | totalprice     | mostly32
      listing | listtime       | raw
      [zone map illustration: per-block min/max values let Redshift skip blocks that cannot match a predicate]
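A minimal sketch of declaring those encodings in DDL, assuming the recommendations above (in practice, ANALYZE COMPRESSION or automatic compression during COPY can pick them for you):

    -- Apply column compression encodings explicitly at table creation
    CREATE TABLE listing_demo (
        listid         INT          ENCODE delta,
        sellerid       INT          ENCODE delta32k,
        eventid        INT          ENCODE delta32k,
        dateid         SMALLINT     ENCODE bytedict,
        numtickets     SMALLINT     ENCODE bytedict,
        priceperticket DECIMAL(8,2) ENCODE delta32k,
        totalprice     DECIMAL(8,2) ENCODE mostly32,
        listtime       TIMESTAMP    ENCODE raw
    );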
  22. 22. Use Cluster Resources Efficiently to Complete as Quickly as Possible
  23. 23. Amazon Redshift Workload Management: queries from BI tools, SQL clients and analytics tools wait in queues until a slot is free. Example configuration: a query queue with 80% of memory and 4 slots (80/4 = 20% of memory per slot) alongside an ETL queue with 20% of memory and 2 slots (20/2 = 10% per slot).
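At the session level, a single heavy statement can claim several of those slots; a small sketch using the wlm_query_slot_count parameter (the table name is hypothetical):

    -- Borrow extra queue slots (and their memory) for one heavy operation
    SET wlm_query_slot_count TO 2;
    VACUUM FULL big_fact_table;
    SET wlm_query_slot_count TO 1;  -- return to the default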
  24. 24. Redshift Performance Tuning
  25. 25. Redshift Playbook Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability amzn.to/2quChdM
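Parts 2 and 3 of that playbook come down to DDL choices; a hedged sketch of the two sort-key flavors (tables are hypothetical):

    -- Compound sort key: best when filters hit a known prefix of columns
    CREATE TABLE events_compound (
        event_date DATE,
        user_id    BIGINT,
        action     VARCHAR(32)
    ) COMPOUND SORTKEY (event_date, user_id);

    -- Interleaved sort key: equal weight per column, for varied filter patterns
    CREATE TABLE events_interleaved (
        event_date DATE,
        user_id    BIGINT,
        action     VARCHAR(32)
    ) INTERLEAVED SORTKEY (event_date, user_id);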
  26. 26. Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool amzn.to/2sTYow1
  27. 27. Amazon Redshift Spectrum
  28. 28. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast at exabyte scale; elastic and highly available; on-demand, pay-per-query; high concurrency (multiple clusters can access the same data); no ETL (query data in place using open file formats); full Amazon Redshift SQL support.
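A minimal sketch of wiring Spectrum up, matching the s3.… external schema used in the demo queries later in the deck (the catalog database, IAM role, bucket and column layout are hypothetical):

    -- Register an external schema backed by the Glue/Hive Data Catalog
    CREATE EXTERNAL SCHEMA s3
    FROM DATA CATALOG DATABASE 'spectrumdb'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    -- Define an external table over open-format files in S3 (no data movement)
    CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
        asin      VARCHAR(16),
        quantity  INT,
        our_price DECIMAL(10,2),
        order_day DATE
    )
    STORED AS PARQUET
    LOCATION 's3://my-demo-bucket/order-details/';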
  29. 29. Life of a query, step 1: a query (SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …) arrives at the leader node over JDBC/ODBC. [Steps 1-9 all refer to the same diagram: an Amazon Redshift cluster with compute nodes 1…N, Amazon S3 exabyte-scale object storage, and a Data Catalog / Apache Hive Metastore.]
  30. 30. Life of a query, step 2: the query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
  31. 31. Life of a query, step 3: the query plan is sent to all compute nodes.
  32. 32. Life of a query, step 4: compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
  33. 33. Life of a query, step 5: each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
  34. 34. Life of a query, step 6: Amazon Redshift Spectrum nodes scan your S3 data.
  35. 35. Life of a query, step 7: Amazon Redshift Spectrum projects, filters, joins and aggregates.
  36. 36. Life of a query, step 8: final aggregations and joins with local Amazon Redshift tables are done in-cluster.
  37. 37. Life of a query, step 9: the result is sent back to the client.
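Step 4's dynamic partition pruning presupposes a partitioned external table; a hedged sketch of how that might look (names hypothetical, continuing the example above):

    -- Declare a partitioned external table, one partition per order day
    CREATE EXTERNAL TABLE s3.orders_partitioned (
        asin      VARCHAR(16),
        quantity  INT,
        our_price DECIMAL(10,2)
    )
    PARTITIONED BY (order_day DATE)
    STORED AS PARQUET
    LOCATION 's3://my-demo-bucket/orders/';

    -- Register a partition so the Data Catalog can prune by order_day
    ALTER TABLE s3.orders_partitioned
    ADD PARTITION (order_day = '2017-08-01')
    LOCATION 's3://my-demo-bucket/orders/2017-08-01/';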
  38. 38. Demo: Running an analytic query over an exabyte in S3
  39. 39. Let's build an analytic query - #1 An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. 1 Table, 2 Filters: SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%POTTER%' AND P.AUTHOR = 'J. K. Rowling'
  40. 40. Let's build an analytic query - #2 Let's compute the sales of the prior books she's written in this series and return the top 20 values. 2 Tables (1 S3, 1 local), 2 Filters, 1 Join, 2 Group By columns, 1 Order By, 1 Limit, 1 Aggregation: SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
  41. 41. Let's build an analytic query - #3 Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. 3 Tables (1 S3, 2 local), 5 Filters, 2 Joins, 3 Group By columns, 1 Order By, 1 Limit, 1 Aggregation, 1 Function, 2 Casts: SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  42. 42. Let's build an analytic query - #4 Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. 4 Tables (1 S3, 3 local), 8 Filters, 3 Joins, 4 Group By columns, 1 Order By, 1 Limit, 1 Aggregation, 1 Function, 2 Casts: SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  43. 43. Now let's run that query over an exabyte of data in S3. Roughly 140 TB of customer item order detail records for each day over the past 20 years; 190 million files across 15,000 partitions in S3 (one partition per day, for USA and rest of world). We need a billion-fold reduction in data processed; running this query using a 1,000-node Hive cluster would take over 5 years.*
      • Compression: 5X
      • Columnar file format: 10X
      • Scanning with 2,500 nodes: 2,500X
      • Static partition elimination: 2X
      • Dynamic partition elimination: 350X
      • Redshift's query optimizer: 40X
      Total reduction: 3.5B X
      * Estimated using a 20-node Hive cluster and 1.4 TB, assuming linear scaling. Query used a 20-node DC1.8XLarge Amazon Redshift cluster. Not actual sales data: generated for this demo based on the data format used by Amazon Retail.
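As a sanity check on those figures (my arithmetic, not from the deck): 140 TB/day × 365 days × 20 years ≈ 1,022,000 TB, i.e. about one exabyte; and the listed factors multiply out to 5 × 10 × 2,500 × 2 × 350 × 40 = 3.5 × 10⁹, matching the claimed 3.5-billion-fold reduction.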
  44. 44. Is Amazon Redshift Spectrum useful if I don't have an exabyte? Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles their data each year. Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake. Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
  45. 45. Ingestion, ETL & BI
  46. 46. Getting data to Redshift using AWS Database Migration Service (DMS): simple to use; minimal downtime; supports the most widely used databases; low cost; fast and easy to set up; reliable. Source = OLTP.
  47. 47. Getting data to Redshift using AWS Schema Conversion Tool (SCT) Source=OLAP
  48. 48. Loading data from S3 • Splitting Your Data into Multiple Files • Uploading Files to Amazon S3 • Using the COPY Command to Load from Amazon S3
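A minimal sketch of that three-step load (bucket, IAM role and table names are hypothetical; splitting the data into multiple files lets every slice load in parallel):

    -- COPY loads all objects under the prefix in parallel across slices
    COPY listing
    FROM 's3://my-demo-bucket/listing/part-'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    DELIMITER '|'
    GZIP;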
  49. 49. ETL on Redshift
  50. 50. Upsert into Amazon Redshift using AWS Glue and SneaQL http://amzn.to/2woj3gB
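Redshift has no single-statement UPSERT (as of this deck), so tools like SneaQL wrap the standard staged-merge pattern; a hedged sketch with hypothetical table names:

    -- Classic Redshift upsert: stage new data, delete matches, insert all
    BEGIN;
    CREATE TEMP TABLE stage (LIKE target);
    COPY stage
    FROM 's3://my-demo-bucket/updates/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole';
    DELETE FROM target
    USING stage
    WHERE target.id = stage.id;   -- drop rows being replaced
    INSERT INTO target SELECT * FROM stage;
    COMMIT;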
  51. 51. Partner ETL solutions Data integration
  52. 52. BI and Visualization
  53. 53. QuickSight for BI on Redshift
  54. 54. Partner BI solutions
  55. 55. Advanced Analytics • Approximate functions • User defined functions • Machine learning • Data science
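Two of those capabilities expressed in SQL; a small sketch (table, column and function names are hypothetical):

    -- Approximate function: HyperLogLog-based distinct count for big tables
    SELECT APPROXIMATE COUNT(DISTINCT customer_id) FROM fact_sales;

    -- Scalar user-defined function in Python (Redshift supports plpythonu)
    CREATE FUNCTION f_fahrenheit_to_celsius (f FLOAT)
    RETURNS FLOAT
    STABLE
    AS $$
        return (f - 32.0) * 5.0 / 9.0
    $$ LANGUAGE plpythonu;

    SELECT f_fahrenheit_to_celsius(98.6);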
  56. 56. Redshift Partner Ecosystem
  57. 57. 4 types of partners • Load and transform your data with Data Integration Partners • Analyze data and share insights across your organization with Business Intelligence Partners • Architect and implement your analytics platform with System Integration and Consulting Partners • Query, explore and model your data using tools and utilities from Query and Data Modeling Partners aws.amazon.com/redshift/partners/
  58. 58. AWS Named as a Leader in The Forrester Wave™: Big Data Warehouse, Q2 2017 http://bit.ly/2w1TAEy On June 15, Forrester published The Forrester Wave™: Big Data Warehouse, Q2 2017, in which AWS is positioned as a Leader. According to Forrester, "With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud." AWS received the highest score possible, 5/5, for customer base, market awareness, ability to execute, road map, support, and partners. "AWS's key strengths lie in its dynamic scale, automated administration, flexibility of database offerings, good security, and high availability (HA) capabilities, which make it a preferred choice for customers."
  59. 59. Migrations
  60. 60. Extending your DWH (or Migrations) to Redshift http://amzn.to/2vN3UBO Oracle to Redshift
  61. 61. Extending your DWH (or Migrations) to Redshift http://amzn.to/2wZy7OA Teradata to Redshift
  62. 62. Extending your DWH (or Migrations) to Redshift http://amzn.to/2hbKwYd Converge Silos to Redshift
  63. 63. London Redshift Meetup http://bit.ly/2x6ZdGA Next Session on the 26th of October. RSVP opens 6th of October
  64. 64. Questions?
