Abstract:- The data marts and warehouses we work with often require us to think about how to scope our analytic questions based on the finite amount of storage allocated to these enterprise components. With new innovations in the cloud space, we can leverage the near-infinite storage capacities of Data Lake object storage and use this as foundational source that can be combined with online data in the warehouse. In this talk we present reference architecture patterns based on Amazon Redshift Spectrum - a new technology enabling you to run MPP Warehouse SQL queries against exabytes of data in a backing object store. With Redshift Spectrum, customers can extend the analytic reach of their SQL interactions to push beyond data stored on local disks in the data warehouse to query vast amounts of unstructured data in the Amazon S3 Data Lake-- without having to load or transform any data.
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
1. 1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06
7. 7
The Emerging Analytics Architecture (AWS)
Storage
Serverless
Compute
Data
Processing
Amazon S3
Datalake Storage
AWS Glue Data Catalog
Hive compatible Metastore
Amazon Kinesis
Streaming
Amazon Redshift Spectrum
Warehouse-Datalake Bridge
AWS Lambda
Triggered Code
Amazon Redshift
PB-scale MPP Warehouse
Amazon Athena
SQL as a Service
Amazon EMR
Hadoop as a Service
AWS Glue
ETL
8. 8
The Emerging Analytics Architecture (AWS)
Amazon S3
Datalake Storage
Amazon Redshift Spectrum
Warehouse-Datalake Bridge
Amazon Redshift
PB-scale MPP Warehouse
Amazon EMR
Hadoop as a Service
9. 9
Pick one …
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows
Hadoop (e.g. EMR) SQL-Based Warehousing
(e.g. Amazon Redshift)
10.
11. 11
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages Amazon Redshift advanced
cost-based optimizer
• Pushes down projections, filters,
aggregations and join reduction
• Dynamic partition pruning to minimize
data processed
• Automated parallelization of query
execution against S3 data
• Efficient join processing with the Amazon
Redshift cluster
App
Data Lake
Object Storage
Amazon
Redshift
SQL
Client
Amazon
S3 Storage
SpectrumBridge
MPP
Warehouse
HTTP
JDBC/ODBC
12. 12
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages Amazon Redshift advanced
cost-based optimizer
• Pushes down projections, filters,
aggregations and join reduction
• Dynamic partition pruning to minimize
data processed
• Automated parallelization of query
execution against S3 data
• Efficient join processing with the Amazon
Redshift cluster
App
SQL
Client
JDBC/ODBC
The Enormous
Virtual Warehouse
22. 22LastFM Music Streaming Events
Horizontal Partitioning Datetime User_ID Country
2007 Mike USA
2008 Jack Finland
Datetime User_ID Track Artist
2015 5:00pm Alice Songbird Kenny G
2013 11:14pm Mike Suit and Tie Justin Timberlake
Datetime User_ID Track Artist
1999 5:15pm Mike Ice Ice Baby Vanilla Ice
1994 4:48pm Mike Wannabe Spice Girls
Colder
User Profile
Streaming
Events
(RECENT)
Streaming
Events
(ARCHIVE)
26. 26
SELECT
u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
lastfm_users u,
lastfm_music_streaming_events s
WHERE
u.userid = s.userid
GROUP BY
u.country
Query Redshift ONLINE Data
28. 28
SELECT ….
FROM
lastfm_users u,
lastfm_music_streaming_events s
WHERE
u.userid = s.userid
…
UNION
SELECT …
FROM
lastfm_users u,
datalake.lastfm_music_streaming_events dl
WHERE
u.userid = dl.userid
…
Query Redshift ONLINE + ARCHIVED S3 Data
Local Redshift Tables
External S3 Data
31. 31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared
between
• Hadoop workloads (EMR)
• Serverless SQL as a Service (Athena)
• SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, throughput via
Amazon Redshift Spectrum