Scalable data warehousing
State of the art
Matthieu Lamairesse Hortonworks
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Scalable Data Warehousing on Hadoop
Overview
Solution Architecture
Demo
Roadmap
BI on Hadoop : What people think
COMPLEX
SLOW
SLOW
MR
BATCH
SQL
MDX
EXTRACT
Reality : Differentiated data consumption
LLAP
Druid
DATA LAYER
Analyst
Spark
HiveACID
Data scientist
Admin & DBA
Relational
OLAP
Data Science
Business Intelligence
BI : A Tool for each use case
Slow but robust
→ Optimized MR
→ Batch Processing (SQL)
→ ETL / ELT
→ Static reports
LLAP
Interactive Ad Hoc / Exploratory Structured Interactive
Druid
→ In-memory cache
→ Ad hoc / exploratory
→ Interactive dashboard
⇒ No data transformation
⇒ Interactive performance
⇒ No security setting changes
→ OLAP cubes
→ Fast dashboards
Tech
Query speed
Model
Use cases
Relational
Full joins / ACID
De-normalized, no joins
Hortonworks EDW Solution Components
Hadoop
Scalable Storage and Compute
Hive LLAP
High Performance SQL Data Mart
Druid
Dynamic OLAP Cubes for Higher Performance
Fast, scalable SQL analytics
Intelligent in-memory caching
Define OLAP cubes for 10x faster queries
Unified semantic layer for all BI tools
Pre-aggregated data
... Or, full-fidelity data
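The difference between pre-aggregated and full-fidelity data can be sketched in a few lines of Python. This is only an illustration, not how Druid implements it: the event fields (ts, payment, fare), the hourly granularity and the rollup helper are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical full-fidelity events: one row per taxi trip.
events = [
    {"ts": "2013-01-01T10:05:00", "payment": "card", "fare": 12.5},
    {"ts": "2013-01-01T10:40:00", "payment": "card", "fare": 7.0},
    {"ts": "2013-01-01T10:55:00", "payment": "cash", "fare": 9.0},
    {"ts": "2013-01-01T11:10:00", "payment": "card", "fare": 20.0},
]

def rollup(rows):
    """Pre-aggregate rows by (hour, payment), keeping a count and a fare sum."""
    cube = defaultdict(lambda: {"trips": 0, "fare_sum": 0.0})
    for row in rows:
        # Truncate the timestamp to the hour: this is the rollup granularity.
        hour = datetime.fromisoformat(row["ts"]).replace(minute=0, second=0)
        cell = cube[(hour.isoformat(), row["payment"])]
        cell["trips"] += 1
        cell["fare_sum"] += row["fare"]
    return dict(cube)

cube = rollup(events)
```

Here four events collapse into three cells. Queries grouped by hour or payment type can read the smaller cube, while anything that needs per-trip detail still has to go to the full-fidelity rows.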
Hive : At the heart of STRUCTURED data access on Hadoop
→ SQL Standard ACID MERGE now available in Hive.
→ Efficiently perform record-level inserts, updates and deletes.
→ Delivers real data management in Hadoop, massively simplifying updates, data restatements and change data capture.
ACID MERGE Makes Data Maintenance Simple
X Y
1 X1
5 X2
7 X3
A B
1 X1
2 Y2
3 Y3
4 Y4
5 X2
6 Y6
7 X3
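The effect of a MERGE on the two tables above can be sketched in Python. The sketch assumes the first table is the source, keyed on its first column, and the second table is the target after the merge. The target's earlier values for the matched keys (Y1, Y5, Y7) are invented for illustration, and merge is a hypothetical helper, not a Hive API.

```python
def merge(target, source):
    """Upsert: update rows whose key matches, insert rows whose key is new."""
    merged = dict(target)          # leave the caller's target untouched
    for key, value in source.items():
        merged[key] = value        # matched key -> update, new key -> insert
    return merged

# Source rows from the slide (key -> value).
source = {1: "X1", 5: "X2", 7: "X3"}

# Assumed target before the merge; the values for keys 1, 5 and 7 are hypothetical.
target_before = {1: "Y1", 2: "Y2", 3: "Y3", 4: "Y4", 5: "Y5", 6: "Y6", 7: "Y7"}

target_after = merge(target_before, source)
```

In Hive the same intent is written as one MERGE INTO ... USING ... ON ... statement with WHEN MATCHED and WHEN NOT MATCHED clauses, run against a transactional table.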
Hive LLAP -- MPP Performance at Hadoop Scale
→ Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries)
– Concurrent queries without specialized YARN queue setup
– Multi-threaded execution of vectorized operator pipelines
→ Asynchronous IO and efficient in-memory caching
→ How does it work?
– Hive decides where query fragments run (LLAP, container, AM) based on configuration, data size, format, etc.
– Each query is coordinated independently by a Tez AM
– Number of concurrent queries is throttled by the number of active AMs
– Hive operators are used for processing
– Tez runtime is used for data transfer
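The throttling point above, where concurrency is bounded by the number of active coordinators, can be illustrated with a semaphore. This is a generic Python sketch, not LLAP code: MAX_ACTIVE_AMS, run_query and the sleep standing in for query execution are all hypothetical.

```python
import threading
import time

MAX_ACTIVE_AMS = 2   # hypothetical limit on concurrently active coordinators

slots = threading.BoundedSemaphore(MAX_ACTIVE_AMS)
lock = threading.Lock()
stats = {"active": 0, "peak": 0}
finished = []

def run_query(name):
    """Hold one coordinator slot for the duration of the query."""
    with slots:                      # blocks while all coordinators are busy
        with lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)             # stand-in for real query execution
        with lock:
            stats["active"] -= 1
            finished.append(name)

threads = [threading.Thread(target=run_query, args=(f"q{i}",)) for i in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Six queries are submitted, but at most two run at any moment; the rest wait for a free slot.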
[Diagram: LLAP architecture. SQL queries arrive over ODBC / JDBC at HiveServer2 (query endpoint). Query coordinators dispatch work to query executors inside LLAP daemons running on a YARN cluster. The daemons keep an in-memory cache (shared across all users) over deep storage: HDFS and compatible stores (S3, WASB, Isilon).]
HIVE : Continuously evolving SQL performance on Hadoop
Speed-up labels shown on the slide: 100X, 20X
Hive + MR (Hive 1.0): Batch SQL
• ETL
• Reporting
• Data Mining
• Deep Analytics
Hive + Tez (Hive 1.3): Analytics SQL
• Reporting
• BI Tools: Microstrategy, Cognos
Hive + Tez + LLAP (Hive 2.0): Interactive SQL
• Ad-Hoc
• Drill-Down
• Agile BI Tools: Tableau, Power BI
Druid High Level Architecture
Many nodes, based on
data size and #queries
HDFS or S3
Pivot or Dashboards,
BI Tools
Synthesize real-time and
historical information
Columnar format + Dictionaries + Inverted Indexes = Speed
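The three ingredients in this slide title can be shown together in a small Python sketch. It is a toy model under stated assumptions, not Druid's storage format: the two columns, their values and the helper names are hypothetical, and real systems typically use compressed bitmaps rather than Python sets for the inverted index.

```python
# Hypothetical columns stored separately (columnar layout), one value per row.
payment_column = ["card", "cash", "card", "card", "cash", "voucher"]
fare_column = [12.5, 9.0, 7.0, 20.0, 4.5, 11.0]

def dictionary_encode(values):
    """Map each distinct string to a small integer id and encode the column."""
    dictionary = {}
    encoded = []
    for value in values:
        code = dictionary.setdefault(value, len(dictionary))
        encoded.append(code)
    return dictionary, encoded

def build_inverted_index(encoded):
    """For each id, record the set of row numbers where it occurs."""
    index = {}
    for row, code in enumerate(encoded):
        index.setdefault(code, set()).add(row)
    return index

dictionary, encoded = dictionary_encode(payment_column)
index = build_inverted_index(encoded)

def sum_fares_where(payment):
    """Answer SUM(fare) WHERE payment = ? by touching only matching rows."""
    rows = index.get(dictionary.get(payment), set())
    return sum(fare_column[row] for row in rows)
```

A filter on payment type becomes a lookup in the index followed by reads of only the matching fare values, without scanning the string column at all.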
Hive / Druid Integration
→ Build OLAP materialized views over Hive data using Hive SQL.
→ Merge historical and real-time data.
→ Query using Hive and any tool that supports Hive SQL.
Key Attributes:
OLAP Models
SQL Analytics
Streams and Archives (HDFS / Kafka)
Transformation Pipelines
CRM ERP
Real-time
Events
Derived Cubes Derived Cubes
Facts and Historical
Events
(Hive LLAP)
Updatable
Dimensions
(Hive LLAP)
Full Fidelity Events
(Druid)
Aggregated Events
(Druid)
Dashboards
Demo – New York taxi trips, 2012–2013
The dataset includes, among others, the following fields:
→ Trip timestamps (start / end)
→ Geographic coordinates (start / end)
→ Distance
→ Payment type
→ Trip costs (total, tip, taxes, …)
The dataset describes all New York taxi trips over 2 years
Infrastructure:
- 5 data nodes (8 vCPU, 15 GB RAM)
- 1 master
- 1 gateway for Knox
Volume:
- De-normalized table of 351,646,964 rows
Roadmap : End of Phase 1 – Current
UI
Core Platform
S3 or HDFS
HiveServer2
MDX
Unified SQL and MDX Layer
SQL BI Tools MDX Tools
Hive
Realtime Feeds
(Kafka, Storm, etc.)
Druid
OLAP Indexes
HiveServer2
Hive SQL
Thrift Server
SparkSQL
Fast SQL MDX
Superset UI
Fast Exploration
Builder UI
SmartSense
Ranger
Atlas
Ambari
Management
Roadmap : Target ( HDP 3.X )
UI
Core Platform
S3 or HDFS
HiveServer2
MDX
Unified SQL and MDX Layer
SQL BI Tools MDX Tools
Hive
Realtime Feeds
(Kafka, Storm, etc.)
Druid
OLAP Indexes
HiveServer2
Hive SQL
Thrift Server
SparkSQL
Fast SQL MDX
Superset UI
Fast Exploration
Builder UI
SmartSense
Ranger
Atlas
Ambari
Management
Hive + Druid = Insight When You Need It
OLAP Cubes SQL Tables
Streaming Data Historical Data
Unified SQL Layer
Pre-Aggregate ACID MERGE
Easily ingest event
data into OLAP cubes
Keep data up-to-date
with Hive MERGE
Build OLAP Cubes from Hive
Archive data to Hive for history
Run OLAP queries in real-time
or Deep Analytics over all history
Deep Analytics
Real-Time Query
Materialized View
Navigation:
Transparent re-write of
queries to use the OLAP
index when possible
CALCITE-173
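The idea of materialized view navigation can be sketched as a simple coverage check. This is a deliberately simplified Python illustration, not the optimizer's actual rewrite logic: the CUBE description, the table and column names and the choose_source helper are all hypothetical.

```python
# Hypothetical description of one OLAP cube (materialized view).
CUBE = {
    "name": "trips_by_hour_payment",
    "dimensions": {"hour", "payment"},
    "measures": {"trips", "fare_sum"},
}

def choose_source(group_by, measures, cube=CUBE, base_table="trips"):
    """Return the cube name if it can answer the query, else the base table."""
    covered = (
        set(group_by) <= cube["dimensions"]      # every grouping column is in the cube
        and set(measures) <= cube["measures"]    # every requested measure is stored
    )
    return cube["name"] if covered else base_table

fast = choose_source(["payment"], ["trips"])
slow = choose_source(["pickup_zone"], ["trips"])
```

A query grouped by payment type is routed to the cube, while one grouped by a column the cube does not hold falls back to the base table. The check only makes sense for measures that can be re-aggregated, such as counts and sums.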
Scalable data warehousing state of the art

  • 1. Scalable data warehousing: State of the art. Matthieu Lamairesse, Hortonworks.
  • 2. © Hortonworks Inc. 2011 – 2016. All Rights Reserved. Agenda (Scalable Data Warehousing on Hadoop): Overview, Solution Architecture, Demo, Roadmap.
  • 3. BI on Hadoop: what people think. COMPLEX, SLOW, MR, BATCH, SQL, MDX, EXTRACT.
  • 4. Reality: differentiated data consumption. Diagram labels: LLAP; Druid; Data Layer; Analyst; Spark; Hive ACID; Data scientist; Admin & DBA; Relational; OLAP; Data Science; Business Intelligence.
  • 5. BI: a tool for each use case. Slow but robust: → Optimized MR → Batch processing (SQL) → ETL / ELT → Static reports. LLAP; Interactive; Ad hoc / exploratory; Structured; Interactive; Druid. → In-memory cache → Ad hoc / exploratory → Interactive dashboard ⇒ No data transformation ⇒ Interactive performance ⇒ No security setting changes → OLAP cubes → Fast dashboards. Row labels: Tech, Query speed, Model, Use cases. Model: relational, full joins / ACID; de-normalized, no joins.
  • 6. Agenda (Scalable Data Warehousing on Hadoop): Overview, Solution Architecture, Demo, Roadmap.
  • 7. Hortonworks EDW Solution Components. Hadoop: scalable storage and compute. Hive LLAP: high-performance SQL data mart; fast, scalable SQL analytics; intelligent in-memory caching. Druid: dynamic OLAP cubes for higher performance; define OLAP cubes for 10x faster queries; unified semantic layer for all BI tools; pre-aggregated data ... or, full-fidelity data.
  • 8. Hive: at the heart of structured data access on Hadoop. ACID MERGE makes data maintenance simple. → SQL-standard ACID MERGE now available in Hive. → Efficiently perform record-level inserts, updates and deletes. → Delivers real data management in Hadoop, massively simplifying updates, data restatements and change data capture. Diagram: rows (X, Y) = (1, X1), (5, X2), (7, X3) are merged into table (A, B) = (1, X1), (2, Y2), (3, Y3), (4, Y4), (5, X2), (6, Y6), (7, X3).
  • 9. Hive LLAP: MPP performance at Hadoop scale. → Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries): concurrent queries without specialized YARN queue setup; multi-threaded execution of vectorized operator pipelines. → Asynchronous IO and efficient in-memory caching. → How does it work? Hive decides where query fragments run (LLAP, container, AM) based on configuration, data size, format, etc.; each query is coordinated independently by a Tez AM; the number of concurrent queries is throttled by the number of active AMs; Hive operators are used for processing; the Tez runtime is used for data transfer. Diagram labels: ODBC / JDBC SQL Queries; HiveServer2 (Query Endpoint); Query Coordinators; YARN Cluster; LLAP Daemons (Query Executors); In-Memory Cache (Shared Across All Users); Deep Storage: HDFS and compatible (S3, WASB, Isilon).
  • 10. Hive: continuously evolving SQL performance on Hadoop. Three stages: Hive 1.0 (Hive + MR, batch SQL); Hive 1.3 (Hive + Tez, analytics SQL); Hive 2.0 (Hive + Tez + LLAP, interactive SQL). Speed-up labels on the chart: 20X and 100X. Use cases listed: ETL; Reporting; Data Mining; Deep Analytics; Reporting; BI Tools: Microstrategy, Cognos; Ad-Hoc; Drill-Down; Agile BI Tools: Tableau, Power BI.
  • 11. Druid high-level architecture. Many nodes, based on data size and number of queries. HDFS or S3. Pivot or dashboards, BI tools. Synthesize real-time and historical information.
  • 12. Columnar format + Dictionaries + Inverted Indexes = Speed.
  • 13. Hive / Druid Integration. Key attributes: → Build OLAP materialized views over Hive data using Hive SQL. → Merge historical and real-time data. → Query using Hive and any tool that supports Hive SQL. Diagram labels: OLAP Models; SQL Analytics; Streams and Archives (HDFS / Kafka); Transformation Pipelines; CRM; ERP; Real-time Events; Derived Cubes; Facts and Historical Events (Hive LLAP); Updatable Dimensions (Hive LLAP); Full Fidelity Events (Druid); Aggregated Events (Druid); Dashboards.
  • 14. Agenda (Scalable Data Warehousing on Hadoop): Overview, Solution Architecture, Demo, Roadmap.
  • 15. Demo: New York taxi trips, 2012-2013. The dataset describes every New York taxi trip over 2 years. It includes, among others, the following indicators: → Trip timestamps (start / end) → Geographic coordinates (start / end) → Distance → Payment type → Trip costs (total, tip, taxes, …). Infrastructure: 5 data nodes (8 vCPU, 15 GB RAM); 1 master; 1 gateway for Knox. Volume: de-normalized table of 351,646,964 rows.
  • 16. Agenda (Scalable Data Warehousing on Hadoop): Overview, Solution Architecture, Demo, Roadmap.
  • 17. Roadmap: End of Phase 1 (current). Diagram labels: UI; Core Platform; S3 or HDFS; HiveServer2; MDX; Unified SQL and MDX Layer; SQL BI Tools; MDX Tools; Hive; Realtime Feeds (Kafka, Storm, etc.); Druid OLAP Indexes; HiveServer2; Hive SQL; Thrift Server; SparkSQL; Fast SQL; MDX; Superset UI; Fast Exploration; Builder UI; SmartSense; Ranger; Atlas; Ambari; Management.
  • 18. Roadmap: Target (HDP 3.X). Diagram labels: UI; Core Platform; S3 or HDFS; HiveServer2; MDX; Unified SQL and MDX Layer; SQL BI Tools; MDX Tools; Hive; Realtime Feeds (Kafka, Storm, etc.); Druid OLAP Indexes; HiveServer2; Hive SQL; Thrift Server; SparkSQL; Fast SQL; MDX; Superset UI; Fast Exploration; Builder UI; SmartSense; Ranger; Atlas; Ambari; Management.
  • 19. Hive + Druid = Insight When You Need It. Diagram labels: OLAP Cubes; SQL Tables; Streaming Data; Historical Data; Unified SQL Layer; Pre-Aggregate; ACID MERGE; Real-Time Query; Deep Analytics. Easily ingest event data into OLAP cubes. Keep data up-to-date with Hive MERGE. Build OLAP cubes from Hive. Archive data to Hive for history. Run OLAP queries in real time or deep analytics over all history. Materialized View Navigation: transparent rewrite of queries to use the OLAP index when possible (CALCITE-173).
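
Slide 8 describes ACID MERGE as record-level inserts and updates applied from one table to another. The sketch below only illustrates that matched/not-matched logic in plain Python; it is not Hive code. The function name `merge_upsert` is hypothetical, and the starting contents of the target table are an assumption, since the slide shows only the source rows and the merged result.

```python
def merge_upsert(target, source):
    """Apply source rows to target: update matched keys, insert the rest."""
    merged = dict(target)  # work on a copy so the input table is untouched
    updated, inserted = 0, 0
    for key, value in source.items():
        if key in merged:
            updated += 1   # corresponds to a WHEN MATCHED ... UPDATE branch
        else:
            inserted += 1  # corresponds to a WHEN NOT MATCHED ... INSERT branch
        merged[key] = value
    return dict(sorted(merged.items())), updated, inserted


# Source rows (X, Y) as shown on the slide.
source = {1: "X1", 5: "X2", 7: "X3"}
# ASSUMPTION: starting state of the target (A, B); the slide shows only the result.
target = {1: "Y1", 2: "Y2", 3: "Y3", 4: "Y4", 6: "Y6", 7: "Y7"}

result, updated, inserted = merge_upsert(target, source)
print(result)
print(f"updated={updated}, inserted={inserted}")
```

Under that assumed starting state, keys 1 and 7 are updated and key 5 is inserted, which reproduces the merged table drawn on the slide.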
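
Slide 12 summarizes Druid's storage as "Columnar format + Dictionaries + Inverted Indexes = Speed". A minimal Python sketch of that idea follows, assuming a toy taxi table with a payment-type column and a tip column. The column names, the sample values and the helper `build_column_index` are all hypothetical, and Python sets stand in for the compressed bitmaps a real engine would use.

```python
from collections import defaultdict


def build_column_index(values):
    """Dictionary-encode one column and build an inverted index over it."""
    dictionary = {}              # distinct value -> small integer code
    encoded = []                 # the column stored as integer codes
    inverted = defaultdict(set)  # code -> set of row ids holding that value
    for row_id, value in enumerate(values):
        code = dictionary.setdefault(value, len(dictionary))
        encoded.append(code)
        inverted[code].add(row_id)
    return dictionary, encoded, inverted


# Columnar layout: each column is its own list (illustrative sample data).
payment_type = ["card", "cash", "card", "card", "cash"]
tip = [2.0, 0.0, 1.5, 3.0, 0.5]

dictionary, encoded, inverted = build_column_index(payment_type)

# Filter "payment_type = 'card'" through the index: no scan of the string column.
card_rows = sorted(inverted[dictionary["card"]])
# Aggregate by reading only the one other column the query needs.
total_card_tip = sum(tip[r] for r in card_rows)
print(card_rows, total_card_tip)
```

The filter touches only the index entry for one dictionary code, and the aggregation reads a single numeric column, which is the source of the speed-up the slide claims.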
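
Slides 7 and 19 contrast pre-aggregated data with full-fidelity data. The sketch below shows one way such a pre-aggregation (roll-up) can work: raw events are collapsed into one cell per hour and dimension value, so a dashboard query reads a few cells instead of every event. The event layout, the sample values and the function name `rollup` are assumptions for illustration, not the deck's implementation.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events):
    """Pre-aggregate (timestamp, payment_type, amount) events per hour and payment type."""
    cube = defaultdict(lambda: {"trips": 0, "total": 0.0})
    for ts, payment_type, amount in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        cell = cube[(hour, payment_type)]
        cell["trips"] += 1
        cell["total"] += amount
    return dict(cube)


# Illustrative full-fidelity events (values are made up).
events = [
    (datetime(2013, 1, 1, 8, 5), "card", 10.0),
    (datetime(2013, 1, 1, 8, 40), "card", 5.5),
    (datetime(2013, 1, 1, 8, 59), "cash", 7.0),
    (datetime(2013, 1, 1, 9, 10), "card", 4.0),
]

cube = rollup(events)
for (hour, payment_type), cell in sorted(cube.items()):
    print(hour.isoformat(), payment_type, cell["trips"], cell["total"])
```

The trade-off is the one the slides name: the roll-up is smaller and faster to query, while individual events can no longer be recovered from it, so full-fidelity data must be kept separately when it is needed.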