Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data at a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
2. Overview
• History of Data Warehouse, Hadoop
• Amazon Redshift in the Modern Data Architecture
• Amazon Redshift Spectrum Overview
• Getting Started
• Q&A
4. Evolution of Data Architectures
1985: Data Warehouse Appliances
Benefits
• Consolidated multiple decision-support environments (i.e., databases) into a single architecture
• Best performance available at the time, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
[Diagram: shared storage tier (NAS appliance) feeding four compute nodes]
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor, with a per-node-per-year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
5. Legacy Architecture Models = No Growth
[Chart: from 2010 to 2025, available data volume grows far faster than the data actually analyzed for benefit; the cost of legacy architectures rises while the investment value of analytics stays flat]
• Very expensive
• Lock-in
• Proprietary
• Inflexible licensing
6. Evolution of Data Architectures
2006: Hadoop Clusters
[Diagram: Hadoop master node and data nodes, each combining CPU, memory, and local HDFS storage]
Improvements
• Open-source software license
• Commodity white-box servers
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3x replication to protect against node failure gets expensive at scale: a 500 TB data set requires a 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General-purpose, monolithic cluster with many different apps on the same hardware
• Still a data silo
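The 3x-replication arithmetic above is worth making concrete. A minimal sketch (illustrative numbers only, not a sizing tool):

```python
def hdfs_raw_capacity(dataset_tb, replication_factor=3):
    """Raw cluster capacity needed to hold a dataset under HDFS replication."""
    return dataset_tb * replication_factor

# A 500 TB data set under HDFS's default 3x replication:
raw_tb = hdfs_raw_capacity(500)
print(raw_tb)  # 1500 TB, i.e., a 1.5 PB cluster just for durability
```

And because storage and compute are coupled, every additional terabyte of data also forces you to buy the CPU and memory bolted to those disks.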
7. Evolution of Data Architectures
2009: Decoupled EMR Architecture
[Diagram: Hadoop master and compute nodes carrying CPU and memory only, with S3 serving as HDFS]
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently, both up & down
• Only pay for the 500 TB data set (not 3x)
• Multi-physical-facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, e.g., Spark on memory-intensive nodes, Hive on CPU-intensive nodes, HBase on I/O-intensive nodes
Constraints
• Still have a cluster to provision and manage
• Must expose the EMR cluster to SQL users via Hive, Presto, etc.
8. Evolution of Data Architectures
2012: Hello Amazon Redshift
Since launch in 2012 (as of February 2017):
• > 100 significant patches
• > 140 significant features
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP columnar relational database
• $1,000 / TB / year
• Accessible to any ODBC or JDBC BI tool
9. Evolution of Data Architectures
2016: Clusterless
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives
Constraints
• Limited to SQL, Hive, and Spark jobs today. More frameworks to come!
[Diagram: Athena provides a SQL interface and Glue provides Spark & Hive ETL, both via web-browser interfaces, over an S3 data lake]
10. AWS Big Data Portfolio
Collect: Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon Snowball
Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon CloudSearch, Amazon Elasticsearch Service
Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, Amazon Athena, Amazon Kinesis Analytics
Move & transform: AWS Data Pipeline, AWS Database Migration Service, AWS Glue
11. Scaling up your analytics systems (With AWS vs. Traditional IT*)
• Get a new BI server: 20 minutes vs. 3 months
• Upgrade your analytics server to the newest Intel processors and add 16 GB of memory: 15 minutes vs. 2 months
• Add 500 TB of storage: instant vs. 2 months
• Grow a DWH cluster from 8 GB to 1 PB: 1 hour vs. 8 months
• Build a 1024-node Hadoop cluster: 30 minutes vs. unlikely
• Roll out a multi-region production environment: hours vs. months
* actual provisioning times in a well-organized IT division
Speed Matters
14. Amazon Redshift is easy to use
• Provisioning in minutes
• Automatic patching
• SQL data loading
• Backups, security, and compression are built-in
15. Amazon Redshift is available everywhere AWS is
Dublin
Frankfurt
London
Seoul
Sydney
Tokyo
Singapore
Beijing
Mumbai
São Paulo
US East - Virginia
US West - Oregon
US West – Northern California
GovCloud
Columbus Ohio
Montreal
Currently Available
Coming soon
16. Traditional Data Warehousing
Use cases: business reporting; complex pipelines and queries; secure and compliant workloads; bulk loads and updates
• Easy Migration: point & click using AWS Database Migration Service
• Secure & Compliant: end-to-end encryption; SOC 1/2/3, PCI-DSS, HIPAA, and FedRAMP compliant
• Large Ecosystem: variety of cloud and on-premises BI and ETL tools
Customer examples: a Japanese mobile phone provider; a company powering 100 marketplaces in 50 countries; the world's largest children's book publisher
17. Business Applications
Use cases: multi-tenant BI applications; back-end services; analytics as a service
• Fully Managed: provisioning, backups, upgrades, security, and compression all come built-in so you can focus on your business applications
• Ease of Chargeback: pay as you go, add clusters as needed; a few big common clusters, several data marts
• Service-Oriented Architecture: integrated with other AWS services; easy to plug into your pipeline
Customer examples: Infosys Information Platform (IIP); analytics-as-a-service providers; product and consumer analytics
18. Log Analysis
Use cases: log & machine/IoT data; clickstream events data; time-series data
• Cheap: analyze large volumes of data cost-effectively
• Fast: Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
• Near real-time: micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics
Customer examples: interactive data analysis and recommendation engines; ride analytics for pricing and product development; ad prediction and on-demand analytics
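The micro-batch loading pattern mentioned above boils down to accumulating streamed records and flushing them in small groups. A minimal sketch (the batch size and flush logic are illustrative, not Firehose's actual buffering rules):

```python
def micro_batches(records, batch_size=3):
    """Group a stream of records into small batches for periodic bulk loads."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
batches = list(micro_batches(events, batch_size=3))
# Each batch would then be written to S3 and bulk-loaded (e.g., via COPY)
# rather than issuing one INSERT per event.
```

Loading in batches matters for Redshift because one bulk COPY per batch is far cheaper than per-row inserts.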
19. Redshift is used for mission-critical workloads
• Financial and management reporting
• Payments to suppliers and billing workflows
• Web/mobile clickstream and event analysis
• Recommendation and predictive analytics
20. Amazon Redshift has a large ecosystem
Data Integration, Systems Integrators, Business Intelligence
25. The tyranny of “OR”
Amazon EMR:
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
Amazon Redshift:
• Optimized for data warehousing
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
26. But I don’t want to choose.
I shouldn’t have to choose
I want “all of the above”
27. I want
sophisticated query optimization and scale-out processing
super fast performance and support for open formats
the throughput of local disk and the scale of S3
28. I want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
29. We’re told “you have to choose”
• Pick small clusters for joins or large ones for scans; shuffles are expensive
• Open formats can't collocate data for joins; they have to deal with variable cluster sizes
• Query optimization requires statistics; you can't determine these for external data
30. Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast at exabyte scale • Elastic & highly available • On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
31. Amazon Redshift Spectrum is fast
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
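To see why pushing projections and filters down to the scan layer matters, here is a toy simulation (not Spectrum's actual execution engine; column names and the 8-bytes-per-value assumption are illustrative) comparing bytes read with and without pushdown:

```python
# Toy table of 1,000 rows with four columns; assume 8 bytes per value.
rows = [{"id": i, "region": "us" if i % 2 else "eu",
         "amount": i * 10, "notes": "x"} for i in range(1000)]
BYTES_PER_VALUE = 8

def scan_no_pushdown(rows):
    # Without pushdown: every column of every row is read and shipped.
    return len(rows) * len(rows[0]) * BYTES_PER_VALUE

def scan_with_pushdown(rows, columns, predicate):
    # With pushdown: only matching rows and requested columns are read.
    matching = [r for r in rows if predicate(r)]
    return len(matching) * len(columns) * BYTES_PER_VALUE

full = scan_no_pushdown(rows)                      # all 4 columns, all rows
pushed = scan_with_pushdown(rows, ["id", "amount"],
                            lambda r: r["region"] == "us")
```

Here the filter keeps half the rows and the projection keeps half the columns, so the pushed-down scan reads a quarter of the bytes; real columnar scans with selective filters routinely do far better.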
32. Amazon Redshift Spectrum is cost-effective
• You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
• Each query can leverage 1000s of Amazon Redshift Spectrum nodes
• You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
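The three levers above interact multiplicatively with the $5/TB-scanned price. A back-of-the-envelope calculator (the fractions below are illustrative assumptions, not measured numbers):

```python
PRICE_PER_TB_SCANNED = 5.00  # Spectrum's quoted $5 per TB scanned from S3

def query_cost(dataset_tb, column_fraction=1.0, partition_fraction=1.0,
               compression_ratio=1.0):
    """Estimate scan cost after columnar projection, partition pruning,
    and compression shrink the bytes actually read from S3."""
    scanned_tb = (dataset_tb * column_fraction * partition_fraction
                  / compression_ratio)
    return scanned_tb * PRICE_PER_TB_SCANNED

# Full scan of 10 TB in an uncompressed row format:
baseline = query_cost(10)  # $50 per query
# Same query reading 10% of columns, 1 of 12 monthly partitions, 3x compression:
optimized = query_cost(10, column_fraction=0.10,
                       partition_fraction=1 / 12, compression_ratio=3)
```

Under these assumed fractions the per-query cost drops from $50 to roughly $0.14, which is why partitioned, compressed, columnar formats like Parquet are the standard recommendation.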
33. Amazon Redshift Spectrum is secure
• End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
• Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Spectrum nodes in a private VPC, storing no state
• Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
• Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
• Certifications & compliance: PCI-DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
34. Amazon Redshift Spectrum uses standard SQL
• Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for the same JDBC and ODBC drivers
• Support for complex joins, nested queries & window functions
• Support for data partitioned in S3 by any key
Date, Time and any other custom keys
e.g., Year, Month, Day, Hour
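Date-based partitioning like the above typically maps to Hive-style S3 key prefixes (`year=/month=/...`), which is what lets a query's date filter skip whole prefixes. A sketch with a hypothetical bucket and table name:

```python
def partition_prefix(bucket, table, year, month):
    """Build the Hive-style S3 prefix for one (year, month) partition."""
    return f"s3://{bucket}/{table}/year={year}/month={month:02d}/"

def prune(partitions, year, month):
    """Keep only the partitions a query filtered on year and month must scan."""
    return [p for p in partitions if p == (year, month)]

# Twelve monthly partitions for 2017; a February-only query needs just one.
all_partitions = [(2017, m) for m in range(1, 13)]
needed = prune(all_partitions, 2017, 2)
prefixes = [partition_prefix("my-data-lake", "clickstream", y, m)
            for y, m in needed]
```

Pruning 11 of 12 partitions before any bytes are read is exactly the "dynamic partition pruning" and reduced-TB-scanned benefit described earlier.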
35. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
• On average, data warehousing volumes grow 10x every 5 years
• The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
• Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
• Run multiple Amazon Redshift clusters against common data
• Isolate jobs with tight SLAs from ad hoc analysis
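The growth figures quoted above compound quickly; a quick projection under those stated rates (starting volume is an arbitrary example):

```python
def project_growth(initial_tb, annual_factor, years):
    """Project data volume under a constant annual growth factor."""
    return initial_tb * annual_factor ** years

# Industry average: 10x every 5 years, i.e., ~1.58x per year.
industry = project_growth(100, 10 ** (1 / 5), 5)   # back to 10x after 5 years
# The average Amazon Redshift customer doubles data each year:
redshift_customer = project_growth(100, 2, 5)      # 32x after 5 years
```

Doubling yearly means a 100 TB warehouse reaches 3.2 PB in five years, which is the argument for planning around exabyte-capable tools even if today's footprint is modest.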
36. Customers love Amazon Redshift Spectrum
“Redshift Spectrum will let us expand the universe of the data we analyze to 100s of petabytes over time. This is truly a game changer, and we can think of no other system in the world that can get us there.”
“Multiple teams can now query the same Amazon S3 data sets using both Amazon Redshift and Amazon EMR.”
“Redshift Spectrum will help us scale yet further while also lowering our costs.”
“Redshift Spectrum’s fast performance across massive data sets is unprecedented.”
“Redshift Spectrum enables us to directly operate on our data in its native format in Amazon S3 with no preprocessing or transformation.”
“Our data science team using Amazon EMR can now collaborate with our marketing and product teams using Redshift Spectrum to analyze the same Amazon S3 data sets.”
37. Migrations can be accelerated with DMS & SCT
“AWS Database Migration Service is the most impressive migration service we’ve seen.”
Migrate: over 1,000 unique migrations to Amazon Redshift using DMS
38. Amazon Redshift Spectrum
Fast, simple, exabyte-scale data warehousing for less than $1,000/TB/year
Available now:
• Queue hopping
• 10x VACUUM performance improvement
• Node fault tolerance
• Enhanced VPC routing
• IAM support for LOAD/UNLOAD
• Auto compression for CTAS
• TIMESTAMPTZ data type
• Query monitoring rules
Coming soon:
• Automatic and incremental background VACUUM
• Short query bias
• Power start
• IAM authentication for DB users
• Auto compression for new tables
• Enhanced JSON & Avro ingestion performance