1
Extending Analytic Reach:
From The Warehouse to The Data Lake
Mike Limcaco | CTO
2017 Big Data Day LA
University of Southern California | 2017-08-06
2
(Most) Data is Dark
http://bit.ly/2k4fDJQ
3
4
Big Data
Enormous Data
5
Warehouse (e.g. Amazon Redshift)
Vast Object Storage Domain (e.g. Data Lake on Amazon S3)
6
Warehouse (e.g. Amazon Redshift)
Vast Object Storage Domain (e.g. Data Lake on Amazon S3)
The Virtual Warehouse
7
The Emerging Analytics Architecture (AWS)
Layers: Storage | Serverless Compute | Data Processing
Amazon S3: Datalake Storage
AWS Glue Data Catalog: Hive-compatible Metastore
Amazon Kinesis: Streaming
Amazon Redshift Spectrum: Warehouse-Datalake Bridge
AWS Lambda: Triggered Code
Amazon Redshift: PB-scale MPP Warehouse
Amazon Athena: SQL as a Service
Amazon EMR: Hadoop as a Service
AWS Glue: ETL
8
The Emerging Analytics Architecture (AWS)
Amazon S3: Datalake Storage
Amazon Redshift Spectrum: Warehouse-Datalake Bridge
Amazon Redshift: PB-scale MPP Warehouse
Amazon EMR: Hadoop as a Service
9
Pick one …
Hadoop (e.g. EMR)
• Direct access to object store (S3)
• Scale out to thousands of nodes
• Open Data Formats
• Popular big data frameworks
• Developer-friendly
SQL-Based Warehousing (e.g. Amazon Redshift)
• Fast local disk performance
• Sophisticated query optimization
• Join-optimized
• Familiar DW/BI workflows
11
Amazon Redshift Spectrum
Run SQL queries against S3
• Leverages the Amazon Redshift advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automated parallelization of query execution against S3 data
• Efficient join processing with the Amazon Redshift cluster
Diagram labels: App; SQL Client; JDBC/ODBC; Amazon Redshift (MPP Warehouse); Spectrum Bridge; HTTP; Amazon S3 Storage (Data Lake Object Storage)
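The partition pruning bullet above can be illustrated with a small sketch. This is not how Redshift Spectrum is implemented; it only shows the idea that a filter on a partition column lets the engine skip whole S3 prefixes. The `year=` key layout, the object names, and the helper functions `parse_partition` and `prune` are assumptions made up for this example.

```python
from typing import Dict, List, Set


def parse_partition(key: str) -> Dict[str, str]:
    """Extract Hive-style partition values (name=value path segments) from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys: List[str], wanted: Dict[str, Set[str]]) -> List[str]:
    """Keep only the keys whose partition values satisfy every filter in `wanted`."""
    kept = []
    for key in keys:
        parts = parse_partition(key)
        if all(parts.get(col) in allowed for col, allowed in wanted.items()):
            kept.append(key)
    return kept


# Hypothetical object keys; the year= partition layout is an assumption.
keys = [
    "lastfm/parquet/events/year=1994/part-0.parquet",
    "lastfm/parquet/events/year=1999/part-0.parquet",
    "lastfm/parquet/events/year=2013/part-0.parquet",
    "lastfm/parquet/events/year=2015/part-0.parquet",
]

# A filter such as WHERE year IN ('1994', '1999') needs only two of the four objects.
to_scan = prune(keys, {"year": {"1994", "1999"}})
print(to_scan)
```

The fewer objects that survive pruning, the less data is read from S3, which is the point of the "minimize data processed" bullet.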
12
Amazon Redshift Spectrum
Run SQL queries against S3
Diagram labels: App; SQL Client; JDBC/ODBC; The Enormous Virtual Warehouse
13
Query Flow
Diagram labels (repeated on slides 13 to 19): Amazon Redshift (JDBC/ODBC); nodes 1 2 3 4 ... N; Spectrum (HTTP); Amazon S3 (exabyte-scale object storage); Data Catalog (Apache Hive Metastore)
14
Query Flow: step 1
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
15
Query Flow: step 2
Query optimized & compiled. Plan sent to all Compute Nodes
16
Query Flow: step 3
Compute nodes dynamically prune partitions based on Catalog info
17
Query Flow: step 4
Spectrum nodes scan S3, then project, filter, and aggregate
18
Query Flow: step 5
Final aggregations and joins on local tables done in-cluster
19
Query Flow: step 6
Results sent back to client
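Steps 4 and 5 of the query flow describe a two-stage aggregation: the scan layer pre-aggregates each slice of S3 data, and the cluster merges those partial results and joins them to local tables. The sketch below is a simplified model of that split, not the actual Redshift execution engine. The function names, the two "splits", and Alice's country are assumptions for illustration; the other rows reuse sample values from the deck.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str]  # (userid, track_name); hypothetical row shape


def scan_split(events: Iterable[Event]) -> Counter:
    """Step 4 (scan layer): project userid only and pre-aggregate plays per user."""
    return Counter(userid for userid, _track in events)


def finish_in_cluster(partials: List[Counter], users: Dict[str, str]) -> Counter:
    """Step 5 (in-cluster): merge partials, join to the local users table, aggregate by country."""
    plays_by_user: Counter = Counter()
    for partial in partials:
        plays_by_user.update(partial)  # Counter.update adds counts

    plays_by_country: Counter = Counter()
    for userid, plays in plays_by_user.items():
        country = users.get(userid)
        if country is not None:  # inner-join semantics: drop users with no profile
            plays_by_country[country] += plays
    return plays_by_country


# Local dimension table; Alice's country is an assumption (not given in the deck).
users = {"Mike": "USA", "Jack": "Finland", "Alice": "Canada"}

# Two hypothetical slices of the external events data.
splits = [
    [("Mike", "Ice Ice Baby"), ("Mike", "Wannabe")],
    [("Alice", "Songbird"), ("Mike", "Suit and Tie")],
]

partials = [scan_split(s) for s in splits]
result = finish_in_cluster(partials, users)
print(dict(result))
```

Only the small per-user counters cross from the scan layer to the cluster, not the raw event rows, which is why pushing aggregation down reduces the data the warehouse has to handle.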
Demo
21
Domain Model
Diagram labels: Dimensions; Facts (Online); Data Pond; Data Lake; Data (RAW); Facts (Archive); Data (Other)
22
LastFM Music Streaming Events
Horizontal Partitioning

User Profile
Datetime  User_ID  Country
2007      Mike     USA
2008      Jack     Finland

Streaming Events (RECENT)
Datetime      User_ID  Track         Artist
2015 5:00pm   Alice    Songbird      Kenny G
2013 11:14pm  Mike     Suit and Tie  Justin Timberlake

Streaming Events (ARCHIVE)
Datetime     User_ID  Track         Artist
1999 5:15pm  Mike     Ice Ice Baby  Vanilla Ice
1994 4:48pm  Mike     Wannabe       Spice Girls

Colder
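The horizontal partitioning on this slide splits one logical events table by row age: recent rows stay online, older ("colder") rows move to the archive. A minimal sketch of that routing follows, using the sample rows from the slide. The cutoff year of 2010 and the function name `split_by_temperature` are assumptions; the deck does not state where the boundary sits.

```python
from typing import List, Tuple

Event = Tuple[int, str, str, str]  # (year, user_id, track, artist)


def split_by_temperature(
    events: List[Event], cutoff_year: int
) -> Tuple[List[Event], List[Event]]:
    """Route each row to the online (recent) or archive (colder) partition by year."""
    online: List[Event] = []
    archive: List[Event] = []
    for event in events:
        if event[0] >= cutoff_year:
            online.append(event)
        else:
            archive.append(event)
    return online, archive


events = [
    (2015, "Alice", "Songbird", "Kenny G"),
    (2013, "Mike", "Suit and Tie", "Justin Timberlake"),
    (1999, "Mike", "Ice Ice Baby", "Vanilla Ice"),
    (1994, "Mike", "Wannabe", "Spice Girls"),
]

# cutoff_year=2010 is an assumed boundary chosen only for this example.
online, archive = split_by_temperature(events, cutoff_year=2010)
print(len(online), "online rows;", len(archive), "archive rows")
```

Both partitions keep the same columns, which is what later allows the online and archived rows to be combined in a single query.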
23
Colder
24
Dimensions
FACTS (Online)
Facts (ARCHIVE)
Amazon S3
Redshift Spectrum
Glue Catalog
Athena
25
Register EXTERNAL S3 Table
create external table lastfm_music_streaming_events
(
  userid string,
  datetime timestamp,
  artist_id string,
  artist_name string,
  track_id string,
  track_name string
)
stored as parquet
location 's3://my-archived-facts/lastfm/parquet/events/';
26
Query Redshift ONLINE Data
SELECT
  u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM
  lastfm_users u,
  lastfm_music_streaming_events s
WHERE
  u.userid = s.userid
GROUP BY
  u.country
27
28
Query Redshift ONLINE + ARCHIVED S3 Data
-- Local Redshift Tables
SELECT …
FROM
  lastfm_users u,
  lastfm_music_streaming_events s
WHERE
  u.userid = s.userid
…
UNION
-- External S3 Data
SELECT …
FROM
  lastfm_users u,
  datalake.lastfm_music_streaming_events dl
WHERE
  u.userid = dl.userid
…
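The shape of this combined query can be tried locally without any AWS resources. The sketch below uses Python's built-in sqlite3 module as a stand-in: an attached in-memory database named `datalake` plays the role of the external schema, so the schema-qualified table name resolves the same way. This is an emulation of the query pattern only, not of Redshift Spectrum itself. The select lists elided on the slide are not known; the ones used here (including the `'S3'` source label) are assumptions modeled on the online-only query from the earlier slide, and the sample rows are reduced to two columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second in-memory database stands in for the external schema "datalake".
conn.execute("ATTACH DATABASE ':memory:' AS datalake")

conn.executescript("""
CREATE TABLE lastfm_users (userid TEXT, country TEXT);
CREATE TABLE lastfm_music_streaming_events (userid TEXT, track_name TEXT);
CREATE TABLE datalake.lastfm_music_streaming_events (userid TEXT, track_name TEXT);

INSERT INTO lastfm_users VALUES ('Mike', 'USA'), ('Jack', 'Finland');
INSERT INTO lastfm_music_streaming_events VALUES ('Mike', 'Suit and Tie');
INSERT INTO datalake.lastfm_music_streaming_events VALUES
    ('Mike', 'Ice Ice Baby'), ('Mike', 'Wannabe');
""")

# Assumed select lists; the slide elides them.
query = """
SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
FROM lastfm_users u, lastfm_music_streaming_events s
WHERE u.userid = s.userid
GROUP BY u.country
UNION
SELECT u.country, COUNT(*) AS plays, 'S3' AS source
FROM lastfm_users u, datalake.lastfm_music_streaming_events dl
WHERE u.userid = dl.userid
GROUP BY u.country
ORDER BY source
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```

Each branch of the UNION joins the same local users table to a different events table, so the client sees one result set that spans online and archived data.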
29
TL;DR
31
Summary
• Online warehousing can participate in extended data lake operations
• External tables in Internet-scale object storage (S3) can be shared between:
  • Hadoop workloads (EMR)
  • Serverless SQL as a Service (Athena)
  • SQL-based MPP Warehousing (Redshift)
• You can readily tap extra capacity, concurrency, and throughput via Amazon Redshift Spectrum
mike@agilisium.com
2629 Townsgate Road Suite 235
Westlake Village, CA 91361
Thank You
contact@agilisium.com
careers@agilisium.com

Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 

Extending Analytic Reach

  • 1. Extending Analytic Reach: From The Warehouse to The Data Lake. Mike Limcaco | CTO. 2017 Big Data Day LA, University of Southern California | 2017-08-06
  • 2. (Most) Data is Dark http://bit.ly/2k4fDJQ
  • 3.
  • 5. Warehouse (e.g. Amazon Redshift); Vast Object Storage Domain (e.g. Data Lake on Amazon S3)
  • 6. Warehouse (e.g. Amazon Redshift); Vast Object Storage Domain (e.g. Data Lake on Amazon S3); The Virtual Warehouse
  • 7. The Emerging Analytics Architecture (AWS)
      Layers: Storage, Serverless Compute, Data Processing
      • Amazon S3: Datalake Storage
      • AWS Glue Data Catalog: Hive compatible Metastore
      • Amazon Kinesis: Streaming
      • Amazon Redshift Spectrum: Warehouse-Datalake Bridge
      • AWS Lambda: Triggered Code
      • Amazon Redshift: PB-scale MPP Warehouse
      • Amazon Athena: SQL as a Service
      • Amazon EMR: Hadoop as a Service
      • AWS Glue: ETL
  • 8. The Emerging Analytics Architecture (AWS)
      • Amazon S3: Datalake Storage
      • Amazon Redshift Spectrum: Warehouse-Datalake Bridge
      • Amazon Redshift: PB-scale MPP Warehouse
      • Amazon EMR: Hadoop as a Service
  • 9. Pick one …
      Hadoop (e.g. EMR)
      • Direct access to object store (S3)
      • Scale out to thousands of nodes
      • Open Data Formats
      • Popular big data frameworks
      • Developer-friendly
      SQL-Based Warehousing (e.g. Amazon Redshift)
      • Fast local disk performance
      • Sophisticated query optimization
      • Join-optimized
      • Familiar DW/BI workflows
  • 10.
  • 11. Amazon Redshift Spectrum: Run SQL queries against S3
      • Leverages Amazon Redshift advanced cost-based optimizer
      • Pushes down projections, filters, aggregations and join reduction
      • Dynamic partition pruning to minimize data processed
      • Automated parallelization of query execution against S3 data
      • Efficient join processing with the Amazon Redshift cluster
      Diagram labels: App; SQL Client; JDBC/ODBC; Amazon Redshift (MPP Warehouse); Spectrum (Bridge); HTTP; Amazon S3 Storage (Data Lake Object Storage)
  • 12. Amazon Redshift Spectrum: Run SQL queries against S3 (same bullets as slide 11)
      Diagram labels: App; SQL Client; JDBC/ODBC; The Enormous Virtual Warehouse
  • 13. Query Flow. Diagram labels (repeated on slides 14 to 19): JDBC/ODBC; Amazon Redshift (nodes 1, 2, 3, 4, ... N); Spectrum; HTTP; Amazon S3 (Exabyte-scale object storage); Data Catalog (Apache Hive Metastore)
  • 14. Query Flow, step 1: Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
  • 15. Query Flow, step 2: Query is optimized and compiled. Plan is sent to all Compute Nodes
  • 16. Query Flow, step 3: Compute nodes dynamically prune partitions based on Catalog info
  • 17. Query Flow, step 4: Spectrum nodes scan S3 data, then project, filter, and aggregate
  • 18. Query Flow, step 5: Final aggregations and joins on local tables are done in-cluster
  • 19. Query Flow, step 6: Results are sent back to the client
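The six query-flow steps on slides 14 to 19 can be illustrated with a toy simulation. This is a minimal sketch, not how Redshift Spectrum is actually implemented: the PARTITIONS dictionary, the function names, and the year-based filter are all hypothetical, and plain Python stands in for the catalog, the Spectrum scan layer, and the cluster.

```python
from collections import Counter

# Hypothetical S3 layout: one list of event rows per partition (here, per year).
# In a real deployment the partition list would come from the data catalog.
PARTITIONS = {
    1994: [{"userid": "Mike", "artist": "Spice Girls"}],
    1999: [{"userid": "Mike", "artist": "Vanilla Ice"}],
    2013: [{"userid": "Mike", "artist": "Justin Timberlake"}],
    2015: [{"userid": "Alice", "artist": "Kenny G"}],
}


def prune_partitions(partitions, min_year):
    """Step 3: keep only the partitions that can satisfy the year filter."""
    return {year: rows for year, rows in partitions.items() if year >= min_year}


def scan_and_aggregate(rows):
    """Step 4: scan one partition and return a partial COUNT(*) per userid."""
    return Counter(row["userid"] for row in rows)


def run_query(partitions, min_year):
    """Steps 3 to 5: prune, aggregate each partition, then merge the partials."""
    surviving = prune_partitions(partitions, min_year)
    partials = [scan_and_aggregate(rows) for rows in surviving.values()]
    final = Counter()
    for partial in partials:
        final.update(partial)  # Step 5: final aggregation of partial counts
    return dict(final)


# Step 6: results go back to the caller.
print(run_query(PARTITIONS, min_year=1999))
```

The point of the sketch is the division of labor: partitions that cannot match are never scanned, each surviving partition is reduced to a small partial result, and only those partials are merged at the end.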
  • 20. Demo
  • 21. Domain Model. Diagram labels: Dimensions; Facts (Online); Data Pond; Data Lake; Data (RAW); Facts (Archive); Data (Other)
  • 22. LastFM Music Streaming Events: Horizontal Partitioning
      User Profile
          Datetime   User_ID   Country
          2007       Mike      USA
          2008       Jack      Finland
      Streaming Events (RECENT)
          Datetime       User_ID   Track          Artist
          2015 5:00pm    Alice     Songbird       Kenny G
          2013 11:14pm   Mike      Suit and Tie   Justin Timberlake
      Streaming Events (ARCHIVE), labeled "Colder"
          Datetime       User_ID   Track          Artist
          1999 5:15pm    Mike      Ice Ice Baby   Vanilla Ice
          1994 4:48pm    Mike      Wannabe        Spice Girls
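The horizontal partitioning on slide 22 amounts to routing each event row to a "recent" or an "archive" store by age. A minimal sketch, assuming a hypothetical cutoff year (the slide does not state one) and using only the year of each sample row:

```python
# Hypothetical cutoff: rows older than this year are treated as colder archive data.
ARCHIVE_CUTOFF_YEAR = 2010

# Sample rows taken from slide 22 (year only; times of day are omitted).
EVENTS = [
    {"year": 2015, "userid": "Alice", "track": "Songbird", "artist": "Kenny G"},
    {"year": 2013, "userid": "Mike", "track": "Suit and Tie", "artist": "Justin Timberlake"},
    {"year": 1999, "userid": "Mike", "track": "Ice Ice Baby", "artist": "Vanilla Ice"},
    {"year": 1994, "userid": "Mike", "track": "Wannabe", "artist": "Spice Girls"},
]


def split_by_age(events, cutoff_year):
    """Return (recent, archive): same columns, rows divided by event year."""
    recent = [e for e in events if e["year"] >= cutoff_year]
    archive = [e for e in events if e["year"] < cutoff_year]
    return recent, archive


recent, archive = split_by_age(EVENTS, ARCHIVE_CUTOFF_YEAR)
print(len(recent), "recent rows;", len(archive), "archive rows")
```

Both halves keep the same schema, which is what lets a later query recombine them.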
  • 25. Register EXTERNAL S3 Table
      create external table lastfm_music_streaming_events (
          userid      string,
          datetime    timestamp,
          artist_id   string,
          artist_name string,
          track_id    string,
          track_name  string
      )
      stored as parquet
      location 's3://my-archived-facts/lastfm/parquet/events/';
  • 26. Query Redshift ONLINE Data
      SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
      FROM lastfm_users u, lastfm_music_streaming_events s
      WHERE u.userid = s.userid
      GROUP BY u.country
  • 27.
  • 28. Query Redshift ONLINE + ARCHIVED S3 Data
      -- Local Redshift Tables
      SELECT …
      FROM lastfm_users u, lastfm_music_streaming_events s
      WHERE u.userid = s.userid …
      UNION
      -- External S3 Data
      SELECT …
      FROM lastfm_users u, datalake.lastfm_music_streaming_events dl
      WHERE u.userid = dl.userid …
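The shape of the combined query on slide 28 can be tried locally. The sketch below uses Python's built-in sqlite3 as a stand-in for Redshift, so it shows only the SQL pattern, not Spectrum itself. The select lists on the slide are elided; the ones here are modeled on slide 26, and the 'S3' source label, the sample rows, and the datalake_ table-name prefix (SQLite has no external schemas) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lastfm_users (userid TEXT, country TEXT);
    -- Stand-in for the local (online) Redshift fact table.
    CREATE TABLE lastfm_music_streaming_events (userid TEXT, track_name TEXT);
    -- Stand-in for the external S3 table datalake.lastfm_music_streaming_events.
    CREATE TABLE datalake_lastfm_music_streaming_events (userid TEXT, track_name TEXT);

    INSERT INTO lastfm_users VALUES ('Mike', 'USA'), ('Jack', 'Finland');
    INSERT INTO lastfm_music_streaming_events VALUES ('Mike', 'Suit and Tie');
    INSERT INTO datalake_lastfm_music_streaming_events VALUES
        ('Mike', 'Ice Ice Baby'), ('Mike', 'Wannabe');
""")

# One result set: plays per country from the online table and from the archive,
# each row tagged with where it came from.
QUERY = """
    SELECT u.country, COUNT(*) AS plays, 'REDSHIFT' AS source
    FROM lastfm_users u, lastfm_music_streaming_events s
    WHERE u.userid = s.userid
    GROUP BY u.country
    UNION
    SELECT u.country, COUNT(*) AS plays, 'S3' AS source
    FROM lastfm_users u, datalake_lastfm_music_streaming_events dl
    WHERE u.userid = dl.userid
    GROUP BY u.country
    ORDER BY source
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Because each half carries a different source literal, no rows collapse under UNION here; UNION ALL would return the same rows without the deduplication pass.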
  • 29.
  • 30. TL;DR
  • 31. Summary
      • Online warehousing can participate in extended data lake operations
      • External tables in Internet-scale object storage (S3) can be shared among:
          • Hadoop workloads (EMR)
          • Serverless SQL as a Service (Athena)
          • SQL-based MPP Warehousing (Redshift)
      • You can readily tap extra capacity, concurrency, and throughput via Amazon Redshift Spectrum
  • 32. Thank You. mike@agilisium.com | contact@agilisium.com | careers@agilisium.com | 2629 Townsgate Road, Suite 235, Westlake Village, CA 91361