Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data at a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
2. Overview
• History of Data Warehouse, Hadoop
• Amazon Redshift in the Modern Data Architecture
• Amazon Redshift Spectrum Overview
• Getting Started
• Q&A
4. Evolution of Data Architectures
1985: Data Warehouse Appliances
Benefits
• Consolidated multiple decision-support environments (i.e., databases) into a single architecture
• Best performance available at the time, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
[Diagram: shared storage tier (NAS appliance) feeding four compute nodes]
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor, with a per-node-per-year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
5. Legacy Architecture Models = No Growth
[Chart: from 2010 to 2025, available data volume grows far faster than the data actually analyzed for benefit; the cost of legacy architectures rises while the investment value of analytics stays flat]
• Very expensive
• Lock-in
• Proprietary
• Inflexible licensing
6. Evolution of Data Architectures
2006: Hadoop Clusters
[Diagram: Hadoop master node and data nodes, each combining CPU, memory, and local HDFS storage]
Improvements
• Open-source software license
• Commodity white-box servers
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3x replication to protect against node failure gets expensive at scale: a 500 TB data set requires a 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General-purpose, monolithic cluster with many different apps on the same hardware
• Still a data silo
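The 3x-replication arithmetic above is worth making concrete. A minimal sketch (illustrative numbers only, not a sizing tool):

```python
def hdfs_raw_capacity(dataset_tb, replication_factor=3):
    """Raw cluster capacity needed to hold a dataset under HDFS replication."""
    return dataset_tb * replication_factor

# A 500 TB data set under HDFS's default 3x replication:
raw_tb = hdfs_raw_capacity(500)
print(raw_tb)  # 1500 TB, i.e., a 1.5 PB cluster just for durability
```

And because storage and compute are coupled, every additional terabyte of data also forces you to buy the CPU and memory bolted to those disks.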
7. Evolution of Data Architectures
2009: Decoupled EMR Architecture
[Diagram: Hadoop master and compute nodes carrying CPU and memory only, with S3 serving as HDFS]
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently, both up & down
• Only pay for the 500 TB data set (not 3x)
• Multi-physical-facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, e.g., Spark on memory-intensive nodes, Hive on CPU-intensive nodes, HBase on I/O-intensive nodes
Constraints
• Still have a cluster to provision and manage
• Must expose the EMR cluster to SQL users via Hive, Presto, etc.
8. Evolution of Data Architectures
2012: Hello Amazon Redshift
Since launch in 2012 (as of February 2017):
• > 100 significant patches
• > 140 significant features
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP columnar relational database
• $1,000 / TB / year
• Accessible to any ODBC or JDBC BI tool
9. Evolution of Data Architectures
2016: Clusterless
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives
Constraints
• Limited to SQL, Hive, and Spark jobs today. More frameworks to come!
[Diagram: Athena provides a SQL interface and Glue provides Spark & Hive ETL, both via web-browser interfaces, over an S3 data lake]
10. AWS Big Data Portfolio
Collect: Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon Snowball
Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon CloudSearch, Amazon Elasticsearch Service
Analyze: Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, Amazon Athena, Amazon Kinesis Analytics
Move & transform: AWS Data Pipeline, AWS Database Migration Service, AWS Glue
11. Scaling up your analytics systems (With AWS vs. Traditional IT*)
• Get a new BI server: 20 minutes vs. 3 months
• Upgrade your analytics server to the newest Intel processors and add 16 GB of memory: 15 minutes vs. 2 months
• Add 500 TB of storage: instant vs. 2 months
• Grow a DWH cluster from 8 GB to 1 PB: 1 hour vs. 8 months
• Build a 1024-node Hadoop cluster: 30 minutes vs. unlikely
• Roll out a multi-region production environment: hours vs. months
* actual provisioning times in a well-organized IT division
Speed Matters
14. Amazon Redshift is easy to use
• Provisioning in minutes
• Automatic patching
• SQL data loading
• Backups, security, and compression are built-in
15. Amazon Redshift is available everywhere AWS is
Dublin
Frankfurt
London
Seoul
Sydney
Tokyo
Singapore
Beijing
Mumbai
São Paulo
US East - Virginia
US West - Oregon
US West – Northern California
GovCloud
Columbus Ohio
Montreal
Currently Available
Coming soon
16. Traditional Data Warehousing
Use cases: business reporting; complex pipelines and queries; secure and compliant workloads; bulk loads and updates
• Easy Migration: point & click using AWS Database Migration Service
• Secure & Compliant: end-to-end encryption; SOC 1/2/3, PCI-DSS, HIPAA, and FedRAMP compliant
• Large Ecosystem: variety of cloud and on-premises BI and ETL tools
Customer examples: a Japanese mobile phone provider; a company powering 100 marketplaces in 50 countries; the world's largest children's book publisher
17. Business Applications
Use cases: multi-tenant BI applications; back-end services; analytics as a service
• Fully Managed: provisioning, backups, upgrades, security, and compression all come built-in so you can focus on your business applications
• Ease of Chargeback: pay as you go, add clusters as needed; a few big common clusters, several data marts
• Service-Oriented Architecture: integrated with other AWS services; easy to plug into your pipeline
Customer examples: Infosys Information Platform (IIP); analytics-as-a-service providers; product and consumer analytics
18. Log Analysis
Use cases: log & machine/IoT data; clickstream events data; time-series data
• Cheap: analyze large volumes of data cost-effectively
• Fast: Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
• Near real-time: micro-batch loading and Amazon Kinesis Firehose for near-real-time analytics
Customer examples: interactive data analysis and recommendation engines; ride analytics for pricing and product development; ad prediction and on-demand analytics
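The micro-batch loading pattern mentioned above boils down to accumulating streamed records and flushing them in small groups. A minimal sketch (the batch size and flush logic are illustrative, not Firehose's actual buffering rules):

```python
def micro_batches(records, batch_size=3):
    """Group a stream of records into small batches for periodic bulk loads."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
batches = list(micro_batches(events, batch_size=3))
# Each batch would then be written to S3 and bulk-loaded (e.g., via COPY)
# rather than issuing one INSERT per event.
```

Loading in batches matters for Redshift because one bulk COPY per batch is far cheaper than per-row inserts.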
19. Redshift is used for mission-critical workloads
• Financial and management reporting
• Payments to suppliers and billing workflows
• Web/mobile clickstream and event analysis
• Recommendation and predictive analytics
20. Amazon Redshift has a large ecosystem
Data Integration, Systems Integrators, Business Intelligence
25. The tyranny of “OR”
Amazon EMR:
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
Amazon Redshift:
• Optimized for data warehousing
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
26. But I don’t want to choose.
I shouldn’t have to choose
I want “all of the above”
27. I want
sophisticated query optimization and scale-out processing
super fast performance and support for open formats
the throughput of local disk and the scale of S3
28. I want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
29. We’re told “you have to choose”
• Pick small clusters for joins or large ones for scans; shuffles are expensive
• Open formats can't collocate data for joins; they have to deal with variable cluster sizes
• Query optimization requires statistics; you can't determine these for external data
30. Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast at exabyte scale • Elastic & highly available • On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
31. Amazon Redshift Spectrum is fast
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against S3 data
• Efficient join processing within the Amazon Redshift cluster
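To see why pushing projections and filters down to the scan layer matters, here is a toy simulation (not Spectrum's actual execution engine; column names and the 8-bytes-per-value assumption are illustrative) comparing bytes read with and without pushdown:

```python
# Toy table of 1,000 rows with four columns; assume 8 bytes per value.
rows = [{"id": i, "region": "us" if i % 2 else "eu",
         "amount": i * 10, "notes": "x"} for i in range(1000)]
BYTES_PER_VALUE = 8

def scan_no_pushdown(rows):
    # Without pushdown: every column of every row is read and shipped.
    return len(rows) * len(rows[0]) * BYTES_PER_VALUE

def scan_with_pushdown(rows, columns, predicate):
    # With pushdown: only matching rows and requested columns are read.
    matching = [r for r in rows if predicate(r)]
    return len(matching) * len(columns) * BYTES_PER_VALUE

full = scan_no_pushdown(rows)                      # all 4 columns, all rows
pushed = scan_with_pushdown(rows, ["id", "amount"],
                            lambda r: r["region"] == "us")
```

Here the filter keeps half the rows and the projection keeps half the columns, so the pushed-down scan reads a quarter of the bytes; real columnar scans with selective filters routinely do far better.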
32. Amazon Redshift Spectrum is cost-effective
• You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
• Each query can leverage 1000s of Amazon Redshift Spectrum nodes
• You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
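The three levers above interact multiplicatively with the $5/TB-scanned price. A back-of-the-envelope calculator (the fractions below are illustrative assumptions, not measured numbers):

```python
PRICE_PER_TB_SCANNED = 5.00  # Spectrum's quoted $5 per TB scanned from S3

def query_cost(dataset_tb, column_fraction=1.0, partition_fraction=1.0,
               compression_ratio=1.0):
    """Estimate scan cost after columnar projection, partition pruning,
    and compression shrink the bytes actually read from S3."""
    scanned_tb = (dataset_tb * column_fraction * partition_fraction
                  / compression_ratio)
    return scanned_tb * PRICE_PER_TB_SCANNED

# Full scan of 10 TB in an uncompressed row format:
baseline = query_cost(10)  # $50 per query
# Same query reading 10% of columns, 1 of 12 monthly partitions, 3x compression:
optimized = query_cost(10, column_fraction=0.10,
                       partition_fraction=1 / 12, compression_ratio=3)
```

Under these assumed fractions the per-query cost drops from $50 to roughly $0.14, which is why partitioned, compressed, columnar formats like Parquet are the standard recommendation.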
33. Amazon Redshift Spectrum is secure
• End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
• Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Spectrum nodes in a private VPC, storing no state
• Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
• Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
• Certifications & compliance: PCI-DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
34. Amazon Redshift Spectrum uses standard SQL
• Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
• Support for the same JDBC and ODBC drivers
• Support for complex joins, nested queries & window functions
• Support for data partitioned in S3 by any key
Date, Time and any other custom keys
e.g., Year, Month, Day, Hour
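Date-based partitioning like the above typically maps to Hive-style S3 key prefixes (`year=/month=/...`), which is what lets a query's date filter skip whole prefixes. A sketch with a hypothetical bucket and table name:

```python
def partition_prefix(bucket, table, year, month):
    """Build the Hive-style S3 prefix for one (year, month) partition."""
    return f"s3://{bucket}/{table}/year={year}/month={month:02d}/"

def prune(partitions, year, month):
    """Keep only the partitions a query filtered on year and month must scan."""
    return [p for p in partitions if p == (year, month)]

# Twelve monthly partitions for 2017; a February-only query needs just one.
all_partitions = [(2017, m) for m in range(1, 13)]
needed = prune(all_partitions, 2017, 2)
prefixes = [partition_prefix("my-data-lake", "clickstream", y, m)
            for y, m in needed]
```

Pruning 11 of 12 partitions before any bytes are read is exactly the "dynamic partition pruning" and reduced-TB-scanned benefit described earlier.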
35. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
• On average, data warehousing volumes grow 10x every 5 years
• The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
• Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
• Run multiple Amazon Redshift clusters against common data
• Isolate jobs with tight SLAs from ad hoc analysis
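The growth figures quoted above compound quickly; a quick projection under those stated rates (starting volume is an arbitrary example):

```python
def project_growth(initial_tb, annual_factor, years):
    """Project data volume under a constant annual growth factor."""
    return initial_tb * annual_factor ** years

# Industry average: 10x every 5 years, i.e., ~1.58x per year.
industry = project_growth(100, 10 ** (1 / 5), 5)   # back to 10x after 5 years
# The average Amazon Redshift customer doubles data each year:
redshift_customer = project_growth(100, 2, 5)      # 32x after 5 years
```

Doubling yearly means a 100 TB warehouse reaches 3.2 PB in five years, which is the argument for planning around exabyte-capable tools even if today's footprint is modest.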
36. Customers love Amazon Redshift Spectrum
“Redshift Spectrum will let us expand the universe of the data we analyze to 100s of petabytes over time. This is truly a game changer, and we can think of no other system in the world that can get us there.”
“Multiple teams can now query the same Amazon S3 data sets using both Amazon Redshift and Amazon EMR.”
“Redshift Spectrum will help us scale yet further while also lowering our costs.”
“Redshift Spectrum’s fast performance across massive data sets is unprecedented.”
“Redshift Spectrum enables us to directly operate on our data in its native format in Amazon S3 with no preprocessing or transformation.”
“Our data science team using Amazon EMR can now collaborate with our marketing and product teams using Redshift Spectrum to analyze the same Amazon S3 data sets.”
37. Migrations can be accelerated with DMS & SCT
“AWS Database Migration Service is the most impressive migration service we’ve seen.”
Migrate: over 1,000 unique migrations to Amazon Redshift using DMS
38. Amazon Redshift Spectrum
Fast, simple, exabyte-scale data warehousing for less than $1,000/TB/year
Available now:
• Queue hopping
• 10x VACUUM performance improvement
• Node fault tolerance
• Enhanced VPC routing
• IAM support for LOAD/UNLOAD
• Auto compression for CTAS
• TIMESTAMPTZ data type
• Query monitoring rules
Coming soon:
• Automatic and incremental background VACUUM
• Short query bias
• Power start
• IAM authentication for DB users
• Auto compression for new tables
• Enhanced JSON & Avro ingestion performance