Agilisium's insights on reference architecture patterns based on Amazon Redshift Spectrum, a new technology that lets MPP warehouse SQL queries run against exabytes of data in a backing object store.
1. 1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06
7. 7
The Emerging Analytics Architecture (AWS)
Categories: Storage, Serverless Compute, Data Processing
• Amazon S3: Datalake Storage
• AWS Glue Data Catalog: Hive-compatible Metastore
• Amazon Kinesis: Streaming
• Amazon Redshift Spectrum: Warehouse-Datalake Bridge
• AWS Lambda: Triggered Code
• Amazon Redshift: PB-scale MPP Warehouse
• Amazon Athena: SQL as a Service
• Amazon EMR: Hadoop as a Service
• AWS Glue: ETL
8. 8
The Emerging Analytics Architecture (AWS)
• Amazon S3: Datalake Storage
• Amazon Redshift Spectrum: Warehouse-Datalake Bridge
• Amazon Redshift: PB-scale MPP Warehouse
• Amazon EMR: Hadoop as a Service
9. 9
Pick one …
Hadoop (e.g. EMR)
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
SQL-Based Warehousing (e.g. Amazon Redshift)
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows
11. 11
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages Amazon Redshift's advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automated parallelization of query execution against S3 data
• Efficient join processing with the Amazon Redshift cluster
[Diagram: App / SQL Client → JDBC/ODBC → Amazon Redshift (MPP Warehouse) → Spectrum Bridge → HTTP → Amazon S3 Storage (Data Lake Object Storage)]
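The "dynamic partition pruning" bullet above can be illustrated with a small Python sketch. It is not how Redshift Spectrum is implemented internally; it only shows the idea that a filter on the partition key lets the engine skip whole S3 prefixes. The `Partition` class, the `example-bucket` paths, and the byte sizes are all hypothetical values made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """One partition of a hypothetical external table, keyed by event year."""
    year: int
    s3_prefix: str
    size_bytes: int


def prune_partitions(partitions, min_year, max_year):
    """Keep only partitions whose key can satisfy min_year <= year <= max_year."""
    return [p for p in partitions if min_year <= p.year <= max_year]


# Hypothetical layout: bucket name and sizes are invented for this sketch.
PARTITIONS = [
    Partition(1994, "s3://example-bucket/events/year=1994/", 400),
    Partition(1999, "s3://example-bucket/events/year=1999/", 600),
    Partition(2013, "s3://example-bucket/events/year=2013/", 900),
    Partition(2015, "s3://example-bucket/events/year=2015/", 1100),
]

# A query filtered to the 1990s only needs to read two of the four prefixes.
selected = prune_partitions(PARTITIONS, min_year=1990, max_year=1999)
scanned = sum(p.size_bytes for p in selected)
total = sum(p.size_bytes for p in PARTITIONS)
print(f"scanning {scanned} of {total} bytes across {len(selected)} partitions")
```

The less data a query touches in S3, the less work the Spectrum layer has to do before handing results to the Redshift cluster for the join.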
12. 12
Amazon Redshift Spectrum
Run SQL queries against S3
[Diagram: App / SQL Client → JDBC/ODBC → The Enormous Virtual Warehouse]
22. 22
LastFM Music Streaming Events
Horizontal Partitioning: streaming events are split by age into RECENT and ARCHIVE (colder) tiers.

User Profile
Datetime  User_ID  Country
2007      Mike     USA
2008      Jack     Finland

Streaming Events (RECENT)
Datetime      User_ID  Track         Artist
2015 5:00pm   Alice    Songbird      Kenny G
2013 11:14pm  Mike     Suit and Tie  Justin Timberlake

Streaming Events (ARCHIVE)
Datetime      User_ID  Track         Artist
1999 5:15pm   Mike     Ice Ice Baby  Vanilla Ice
1994 4:48pm   Mike     Wannabe       Spice Girls
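A minimal Python sketch of the horizontal partitioning shown on this slide: events are routed to a RECENT or ARCHIVE tier by age. The rows are the sample rows from the slide; the `split_by_age` helper and the cutoff year of 2010 are assumptions chosen only so the sample data lands in the same tiers as on the slide.

```python
# Sample rows from the slide: (year, time, user_id, track, artist)
EVENTS = [
    (2015, "5:00pm", "Alice", "Songbird", "Kenny G"),
    (2013, "11:14pm", "Mike", "Suit and Tie", "Justin Timberlake"),
    (1999, "5:15pm", "Mike", "Ice Ice Baby", "Vanilla Ice"),
    (1994, "4:48pm", "Mike", "Wannabe", "Spice Girls"),
]


def split_by_age(events, cutoff_year):
    """Split events into (recent, archive) around a hypothetical cutoff year."""
    recent = [e for e in events if e[0] >= cutoff_year]
    archive = [e for e in events if e[0] < cutoff_year]
    return recent, archive


# Assumed cutoff: anything before 2010 is treated as cold, archived data.
recent, archive = split_by_age(EVENTS, cutoff_year=2010)
print("RECENT :", [e[3] for e in recent])
print("ARCHIVE:", [e[3] for e in archive])
```

In the architecture described in this deck, the recent tier would live in local Redshift tables and the archive tier in S3.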
26. 26
Query Redshift ONLINE Data
SELECT
    u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
GROUP BY
    u.country
28. 28
Query Redshift ONLINE + ARCHIVED S3 Data
-- Local Redshift Tables
SELECT …
FROM
    lastfm_users u,
    lastfm_music_streaming_events s
WHERE
    u.userid = s.userid
…
UNION
-- External S3 Data
SELECT …
FROM
    lastfm_users u,
    datalake.lastfm_music_streaming_events dl
WHERE
    u.userid = dl.userid
…
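The shape of this combined query can be tried locally with Python's built-in `sqlite3` module. This is only a stand-in, not Redshift: an attached in-memory database named `datalake` plays the role of the external S3 schema, the column definitions are guesses based on the sample slide data, and the `'S3'` source label and the `ORDER BY` are assumptions, since the slide elides the select lists.

```python
import sqlite3

# Main database stands in for local Redshift tables; the attached
# "datalake" database stands in for the external S3-backed schema.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS datalake")

conn.executescript("""
CREATE TABLE lastfm_users (userid TEXT PRIMARY KEY, country TEXT);
CREATE TABLE lastfm_music_streaming_events (userid TEXT, track TEXT, artist TEXT);
CREATE TABLE datalake.lastfm_music_streaming_events (userid TEXT, track TEXT, artist TEXT);

INSERT INTO lastfm_users VALUES ('Mike', 'USA'), ('Jack', 'Finland');
INSERT INTO lastfm_music_streaming_events VALUES
    ('Alice', 'Songbird', 'Kenny G'),
    ('Mike', 'Suit and Tie', 'Justin Timberlake');
INSERT INTO datalake.lastfm_music_streaming_events VALUES
    ('Mike', 'Ice Ice Baby', 'Vanilla Ice'),
    ('Mike', 'Wannabe', 'Spice Girls');
""")

# Same join shape as the slide, once per tier, combined with UNION.
rows = conn.execute("""
SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM lastfm_users u, lastfm_music_streaming_events s
WHERE u.userid = s.userid
GROUP BY u.country
UNION
SELECT u.country, COUNT(*) AS plays, 'S3' AS source
FROM lastfm_users u, datalake.lastfm_music_streaming_events dl
WHERE u.userid = dl.userid
GROUP BY u.country
ORDER BY source
""").fetchall()

for country, plays, source in rows:
    print(country, plays, source)
```

The point of the sketch is that the client issues one SQL statement and the only visible difference between the two tiers is the schema prefix on the events table.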
31. 31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared between:
    • Hadoop workloads (EMR)
    • Serverless SQL as a Service (Athena)
    • SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, and throughput via Amazon Redshift Spectrum