Scalable data warehousing state of the art

Scalable data warehousing
State of the art
Matthieu Lamairesse Hortonworks

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Scalable Data Warehousing on Hadoop
Overview
Solution Architecture
Demo
Roadmap

BI on Hadoop : What people think
COMPLEX
SLOW
MR
BATCH
SQL
MDX
EXTRACT

Reality : Differentiated data consumption
LLAP
Druid
DATA LAYER
Analyst
Spark
HiveACID
Data scientist
Admin & DBA
Relational
OLAP
Data Science
Business Intelligence

BI : A Tool for each use case
Slow but robust
• Optimized MR
• Batch Processing (SQL)
• ETL / ELT
• Static reports
LLAP
Interactive Ad Hoc / Exploratory Structured Interactive
Druid
• In-Memory Cache
• Ad Hoc / Exploratory
• Interactive Dashboard
→ No data transformation
→ Interactive performance
→ No security setting changes
• OLAP cubes
• Fast dashboards
Tech
Query speed
Model
Use cases
Relational
Full Joins / ACID
De-normalized
no Joins

Hortonworks EDW Solution Components
Hadoop
Scalable Storage and Compute
Hive LLAP
High Performance SQL Data Mart
Druid
Dynamic OLAP Cubes for Higher Performance
Fast, scalable SQL analytics
Intelligent in-memory caching
Define OLAP cubes for 10x faster queries
Unified semantic layer for all BI tools
Pre-aggregated data
... Or, full-fidelity data

Hive : At the heart of STRUCTURED data access on Hadoop
• SQL Standard ACID MERGE now available in Hive.
• Efficiently perform record-level inserts, updates and deletes.
• Delivers real Data Management in Hadoop, massively simplifying updates, data restatements and change data capture.
ACID MERGE Makes Data Maintenance Simple
X Y
1 X1
5 X2
7 X3
A B
1 X1
2 Y2
3 Y3
4 Y4
5 X2
6 Y6
7 X3
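The two tables above can be read as a MERGE source (X, Y) and the resulting target (A, B). The following is a minimal Python sketch of MERGE-style upsert semantics, not Hive's implementation. The starting state of the target table is an assumption (it is not shown on the slide), and the function name `merge` is hypothetical.

```python
def merge(target, source):
    """Upsert source rows into target, keyed on the first column.

    Returns the merged table plus counts of updated and inserted rows.
    """
    merged = dict(target)              # leave the input table untouched
    updated = inserted = 0
    for key, value in source.items():
        if key in merged:
            updated += 1               # WHEN MATCHED: update the existing row
        else:
            inserted += 1              # WHEN NOT MATCHED: insert a new row
        merged[key] = value
    return merged, updated, inserted


# Assumed starting state of the target table (A, B); not shown on the slide.
target = {1: "Y1", 2: "Y2", 3: "Y3", 4: "Y4", 5: "Y5", 6: "Y6", 7: "Y7"}
# Source table (X, Y) as listed on the slide.
source = {1: "X1", 5: "X2", 7: "X3"}

result, updated, inserted = merge(target, source)
for key in sorted(result):
    print(key, result[key])
```

Under that assumed starting state, the three source keys all match existing rows, so the run performs three updates and no inserts, and the printed rows match the (A, B) table on the slide.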

Hive LLAP -- MPP Performance at Hadoop Scale
• Hybrid model combining daemons and containers for fast, concurrent execution of analytical workloads (e.g. Hive SQL queries)
– Concurrent queries without specialized YARN queue setup
– Multi-threaded execution of vectorized operator pipelines
• Asynchronous IO and efficient in-memory caching
• How does it work?
– Hive decides where query fragments run (LLAP, Container, AM) based on configuration, data size, format, etc.
– Each query is coordinated independently by a Tez AM
– Number of concurrent queries throttled by number of active AMs
– Hive operators used for processing
– Tez Runtime used for data transfer
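The throttling point above (concurrency bounded by the number of active AMs) can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not LLAP or Tez code: the class `CoordinatorPool`, the pool size of 3, and the sleep standing in for query execution are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class CoordinatorPool:
    """Toy model: each running query holds one coordinator slot."""

    def __init__(self, size):
        self._slots = threading.Semaphore(size)   # one permit per coordinator
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0                      # highest concurrency observed

    def run(self, query_id):
        with self._slots:                         # waits while every slot is busy
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            time.sleep(0.01)                      # stand-in for query execution
            with self._lock:
                self._active -= 1
        return query_id


pool = CoordinatorPool(size=3)                    # hypothetical number of coordinators
with ThreadPoolExecutor(max_workers=10) as clients:
    finished = list(clients.map(pool.run, range(10)))

print("queries finished:", len(finished))
print("peak concurrency within limit:", pool.peak_active <= 3)
```

Ten client threads submit queries at once, but the semaphore lets at most three run at any moment; the rest wait for a slot to free up.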
[Architecture diagram. Labels: SQL queries via ODBC / JDBC; HiveServer2 (query endpoint); query coordinators; YARN cluster of LLAP daemons with query executors; in-memory cache (shared across all users); deep storage on HDFS and compatible stores (S3, WASB, Isilon).]

HIVE : Continuously evolving SQL performance on Hadoop
= 100X
Hive + MR Hive + Tez Hive + Tez + LLAP
= 20X
Hive 1.0 Hive 1.3 Hive 2.0
Batch SQL
Analytics SQL
Interactive SQL
• ETL
• Reporting
• Data Mining
• Deep Analytics
• Reporting
• BI Tools: MicroStrategy, Cognos
• Ad-Hoc
• Drill-Down
• Agile BI Tools: Tableau, Power BI

Druid High Level Architecture
Many nodes, based on data size and #queries
HDFS or S3
Pivot or Dashboards, BI Tools
Synthesize real-time and historical information

Columnar format + Dictionaries + Inverted Indexes = Speed
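A minimal Python sketch of why this combination is fast for filter queries. It is illustrative only and is not Druid's storage code: the column names and values are hypothetical, and plain lists of row ids stand in for the compressed bitmaps a real engine would typically use.

```python
def build_column_index(values):
    """Dictionary-encode one string column and build its inverted index."""
    dictionary = {}   # distinct value -> small integer code
    encoded = []      # the column itself, stored as codes
    inverted = {}     # code -> row ids that hold that value
    for row_id, value in enumerate(values):
        code = dictionary.setdefault(value, len(dictionary))
        encoded.append(code)
        inverted.setdefault(code, []).append(row_id)
    return dictionary, encoded, inverted


def rows_matching(dictionary, inverted, value):
    """Look up the matching row ids without scanning the column."""
    code = dictionary.get(value)
    return [] if code is None else inverted[code]


# Two columns stored separately (columnar layout); values are made up.
payment = ["card", "cash", "card", "card", "cash"]
fare = [12.5, 7.0, 30.0, 9.5, 4.0]

dictionary, encoded, inverted = build_column_index(payment)
card_rows = rows_matching(dictionary, inverted, "card")
card_total = sum(fare[i] for i in card_rows)   # touches only the needed rows of one column

print(encoded, card_rows, card_total)
```

The filter `payment = 'card'` is answered from the inverted index alone, and the aggregate then reads only the `fare` column for the matching rows.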

Hive / Druid Integration
• Build OLAP materialized views over Hive data using Hive SQL.
• Merge historical and real-time data.
• Query using Hive and any tool that supports Hive SQL.
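To make the idea of an OLAP view built over detailed data concrete, here is a small Python sketch of pre-aggregation (rollup) by hour and dimension. It is an illustration under assumptions, not the Hive or Druid mechanism: the event fields, the hourly granularity, and the function name `rollup` are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions):
    """Pre-aggregate events by (hour, dimension values)."""
    cube = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for event in events:
        # Truncate the timestamp to the hour: the assumed rollup granularity.
        hour = event["ts"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[d] for d in dimensions)
        cube[key]["count"] += 1
        cube[key]["amount"] += event["amount"]
    return dict(cube)


# Made-up full-fidelity events.
events = [
    {"ts": datetime(2013, 1, 1, 8, 5), "payment": "card", "amount": 10.0},
    {"ts": datetime(2013, 1, 1, 8, 40), "payment": "card", "amount": 5.0},
    {"ts": datetime(2013, 1, 1, 8, 50), "payment": "cash", "amount": 7.0},
    {"ts": datetime(2013, 1, 1, 9, 10), "payment": "card", "amount": 3.0},
]

cube = rollup(events, ["payment"])
for key in sorted(cube):
    print(key, cube[key])
```

Four detailed events collapse into three aggregated rows; queries at hourly granularity or coarser can be answered from the smaller cube, while the full-fidelity events remain available for deeper analysis.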
Key Attributes:
OLAP Models
SQL Analytics
Streams and Archives (HDFS / Kafka)
Transformation Pipelines
CRM, ERP
Real-time Events
Derived Cubes
Facts and Historical Events (Hive LLAP)
Updatable Dimensions (Hive LLAP)
Full Fidelity Events (Druid)
Aggregated Events (Druid)
Dashboards

Demo – New York taxi trips, 2012-2013
The dataset includes, among others, the following fields:
• Trip timestamps (start / end)
• Geographic coordinates (start / end)
• Distance
• Payment type
• Trip costs (total, tip, taxes, …)
The dataset describes every New York taxi trip over 2 years.
Infrastructure:
- 5 data nodes (8 vCPU, 15 GB RAM)
- 1 master
- 1 gateway for KNOX
Volume:
- De-normalized table of 351,646,964 rows

Roadmap : End of Phase 1 – Current
UI
Core Platform
S3 or HDFS
HiveServer2
MDX
Unified SQL and MDX Layer
SQL BI Tools MDX Tools
Hive
Realtime Feeds
(Kafka, Storm, etc.)
Druid
OLAP Indexes
HiveServer2
Hive SQL
Thrift Server
SparkSQL
Fast SQL MDX
Superset UI
Fast Exploration
Builder UI
SmartSense
Ranger
Atlas
Ambari
Management

Roadmap : Target ( HDP 3.X )
UI
Core Platform
S3 or HDFS
HiveServer2
MDX
Unified SQL and MDX Layer
SQL BI Tools MDX Tools
Hive
Realtime Feeds
(Kafka, Storm, etc.)
Druid
OLAP Indexes
HiveServer2
Hive SQL
Thrift Server
SparkSQL
Fast SQL MDX
Superset UI
Fast Exploration
Builder UI
SmartSense
Ranger
Atlas
Ambari
Management

Hive + Druid = Insight When You Need It
OLAP Cubes SQL Tables
Streaming Data Historical Data
Unified SQL Layer
Pre-Aggregate ACID MERGE
Easily ingest event data into OLAP cubes
Keep data up-to-date with Hive MERGE
Build OLAP Cubes from Hive
Archive data to Hive for history
Run OLAP queries in real-time or Deep Analytics over all history
Deep Analytics
Real-Time Query
Materialized View Navigation: Transparent re-write of queries to use the OLAP index when possible
CALCITE-173
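The transparent re-write idea can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm used by Calcite or Hive: the tables, the single `sum(amount)` measure, and the function names are hypothetical. The rule shown is that a query is answered from the pre-aggregated cube when its grouping columns are a subset of the cube's dimensions, and from the base table otherwise.

```python
from collections import defaultdict

# Made-up base table (full detail).
BASE = [
    {"day": "2013-01-01", "payment": "card", "zone": "A", "amount": 10.0},
    {"day": "2013-01-01", "payment": "cash", "zone": "B", "amount": 7.0},
    {"day": "2013-01-02", "payment": "card", "zone": "A", "amount": 5.0},
]
CUBE_DIMS = ("day", "payment")   # dimensions kept in the materialized view


def aggregate(rows, group_by, measure):
    """sum(measure) grouped by the given columns."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in group_by)] += row[measure]
    return dict(totals)


# Materialized view: sum(amount) by (day, payment), stored as rows.
CUBE = [
    dict(zip(CUBE_DIMS, key), amount=total)
    for key, total in aggregate(BASE, CUBE_DIMS, "amount").items()
]


def run(group_by):
    """Route the query to the cube when it can answer it, else to the base table."""
    if set(group_by) <= set(CUBE_DIMS):
        # Sums can be re-aggregated from the cube's partial sums.
        return "cube", aggregate(CUBE, group_by, "amount")
    return "base", aggregate(BASE, group_by, "amount")


source_a, by_payment = run(["payment"])   # covered by the cube
source_b, by_zone = run(["zone"])         # "zone" is not in the cube

print(source_a, by_payment)
print(source_b, by_zone)
```

The caller writes the same query either way; only the routing decision changes, which is what makes the re-write transparent. Re-aggregating from the cube works here because sums are additive; other measures would need their own handling.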
