SlideShare a Scribd company logo

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive

Xu Jiang
Xu Jiang

Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets. If you want to do multi-dimension analysis on large data sets (billion+ rows) with low query latency (sub-seconds), Kylin is a good option. Kylin also provides seamless integration with existing BI tools (e.g Tableau).

1 of 36
Download to read offline
1 
Kylin: Hadoop OLAP Engine - Tech Deep Dive 
Jiang Xu, Architect of Kylin 
October 2014
Kylin Overview 
Kylin 
- n. (in Chinese art) a mythical animal of composite form 
2 
http://kylin.io 
Kylin is an open source Distributed Analytics Engine from 
eBay Inc. that provides SQL interface and multi-dimensional 
analysis (OLAP) on Hadoop supporting extremely large 
datasets
What Is Kylin? 
• Extremely Fast OLAP Engine at Scale 
Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data 
• ANSI-SQL Interface on Hadoop 
Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions 
• Interactive Query Capability 
Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same 
dataset 
• MOLAP Cube 
User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records 
• Seamless Integration with BI Tools 
Kylin currently offers integration capability with BI Tools like Tableau. Integration with Microstrategy and Excel 
is coming soon 
3
What Is Kylin? - Other Highlights 
• Job Management and Monitoring 
• Compression and Encoding Support 
• Incremental Refresh of Cubes 
• Leverage HBase Coprocessor for query latency 
• Approximate Query Capability for distinct Count (HyperLogLog) 
• Easy Web interface to manage, build, monitor and query cubes 
• Security capability to set ACL at Cube/Project Level 
• Support LDAP Integration 
4
Glance of SQL-on-Hadoop Ecosystem … 
• SQL translated to MapReduce jobs 
– Hive 
– Stinger without Tez 
• SQL processed by a MPP Engine 
– Impala 
– Drill 
– Presto 
– Spark + Shark 
• SQL process by a existing SQL Engine + HDFS 
– EMC Greenplum (postgres) 
– Taobao Garude (mysql) 
• OLAP on Hadoop in other Companies 
– Adobe: HBase Cube 
– LinkedIn: Avatara 
– Salesforce.com: Phoenix 
5
Why Do We Build Kylin? 
• Why existing SQL-on-Hadoop solutions fall short? 
The existing SQL-on-Hadoop needs to scan partial or whole data set to answer a user query. Moreover, table join may trigger 
the data transfer across host. Due to large scan range and network traffic latency, many queries are very slow (minute+ 
latency). 
• What is MOLAP/ROLAP? 
– MOLAP (Multi-dimensional OLAP) is to pre-compute data along different dimensions of interest and store resultant 
values in the cube. MOLAP is much faster but is inflexible. Kylin is more like MOLAP. 
– ROLAP (Relational-OLAP) is to use star or snow-flake schema to do runtime aggregation. ROLAP is flexible but much 
slower. All existing SQL-on-Hadoop is kind of ROLAP. 
• How does Kylin support ROLAP/MOLAP? 
Kylin builds data cube (MOLAP) from hive table (ROLAP) according to the metadata definition. 
– If the query can be fulfilled by data cube, Kylin will route the query to data cube that is MOLAP. 
– If the query can’t be fulfilled by data cube, Kylin will route the query to hive table that is ROLAP. 
– Basically, you can think Kylin as HOLAP on top of MOLAP and ROLAP. 
6

Recommended

Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataShi Shao Feng
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting startedShubham Shirude
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConamarsri
 
Apache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataApache Kylin Extreme OLAP Engine for Big Data
Apache Kylin Extreme OLAP Engine for Big DataLuke Han
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylininovex GmbH
 
Apache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopApache Kylin – Cubes on Hadoop
Apache Kylin – Cubes on HadoopDataWorks Summit
 

More Related Content

What's hot

Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecYang Li
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering PrinciplesXu Jiang
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming hongbin ma
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine TourLuke Han
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015Debashis Saha
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesYang Li
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Seshu Adunuthula
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopTony Ng
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache KylinYang Li
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)qhzhou
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingLuke Han
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeDataWorks Summit
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanLuke Han
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupLuke Han
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillDataWorks Summit
 

What's hot (20)

Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 

Viewers also liked

Low Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBaseLow Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBaseDataWorks Summit
 
IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?DataWorks Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidTony Ng
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborSvetlin Nakov
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engineAmanpreet Singh
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Dennis Deacon
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandSearch Engine Land
 
Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...Mihir Pai
 
Tk03 Google App Engine Fr
Tk03 Google App Engine FrTk03 Google App Engine Fr
Tk03 Google App Engine FrValtech
 
ppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engineppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engineharshid panchal
 
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...JRibbeck
 
I.C.ENGINE PPT
I.C.ENGINE PPTI.C.ENGINE PPT
I.C.ENGINE PPT8695
 
CAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de ContenidoCAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de ContenidoGary Briceño
 
App engine
App engineApp engine
App engineThirdWay
 
CAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords ResearchCAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords ResearchGary Briceño
 
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]Websec México, S.C.
 
Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Javier Laguens Garcia
 

Viewers also liked (20)

Low Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBaseLow Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBase
 
IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engine
 
RoomCloud Booking Engine
RoomCloud Booking EngineRoomCloud Booking Engine
RoomCloud Booking Engine
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
 
Ic engine
Ic engineIc engine
Ic engine
 
Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...
 
Tk03 Google App Engine Fr
Tk03 Google App Engine FrTk03 Google App Engine Fr
Tk03 Google App Engine Fr
 
ppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engineppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engine
 
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
 
I.C.ENGINE PPT
I.C.ENGINE PPTI.C.ENGINE PPT
I.C.ENGINE PPT
 
CAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de ContenidoCAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de Contenido
 
App engine
App engineApp engine
App engine
 
CAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords ResearchCAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords Research
 
El SEOy la geolocalización
El SEOy la geolocalizaciónEl SEOy la geolocalización
El SEOy la geolocalización
 
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
 
Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10
 

Similar to Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive

ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartEvans Ye
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinTyler Wishnoff
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeDatabricks
 
ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015Luke Han
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin IntroductionLuke Han
 
ScaleDB Technical Presentation
ScaleDB Technical PresentationScaleDB Technical Presentation
ScaleDB Technical PresentationIvan Zoratti
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Lucas Jellema
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummitamarsri
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersLucidworks
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDenny Lee
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Nicolas Poggi
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQMariaDB plc
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionDmitry Anoshin
 

Similar to Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive (20)

ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin Introduction
 
ScaleDB Technical Presentation
ScaleDB Technical PresentationScaleDB Technical Presentation
ScaleDB Technical Presentation
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummit
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQ
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 

Recently uploaded

Issues affecting LGBT as they grow older.pptx
Issues affecting LGBT as they grow older.pptxIssues affecting LGBT as they grow older.pptx
Issues affecting LGBT as they grow older.pptxbill846304
 
Monthly HSE Report March for overall HSE
Monthly HSE Report March for overall HSEMonthly HSE Report March for overall HSE
Monthly HSE Report March for overall HSEOlgaOliveaJohn
 
IE Application: Express Yourself - Sofia Merizalde
IE Application: Express Yourself - Sofia MerizaldeIE Application: Express Yourself - Sofia Merizalde
IE Application: Express Yourself - Sofia Merizaldesofiamerizaldev
 
Auditorium Session 3 - Resilience - Financial Resilience and Collaboration
Auditorium Session 3 - Resilience - Financial Resilience and CollaborationAuditorium Session 3 - Resilience - Financial Resilience and Collaboration
Auditorium Session 3 - Resilience - Financial Resilience and CollaborationMuseums Galleries Scotland
 
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdf
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdfInstructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdf
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdfDr. Cherinet Aytenfsu Weldearegay
 
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptxGarcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx0461620
 
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptx
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptxTeams Nation 2024 - #Copilot & Teams or Just Premium.pptx
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptxKai Stenberg
 
Auditorium Session 2 - Workforce - Diversity/Skills & Confidence
Auditorium Session 2 - Workforce - Diversity/Skills & ConfidenceAuditorium Session 2 - Workforce - Diversity/Skills & Confidence
Auditorium Session 2 - Workforce - Diversity/Skills & ConfidenceMuseums Galleries Scotland
 
TheSimpsons_Fandom_Assignment_4.5pc.pptx
TheSimpsons_Fandom_Assignment_4.5pc.pptxTheSimpsons_Fandom_Assignment_4.5pc.pptx
TheSimpsons_Fandom_Assignment_4.5pc.pptxStevenLuker3
 
Freeman_Abigail Personal Brand Exploration
Freeman_Abigail Personal Brand ExplorationFreeman_Abigail Personal Brand Exploration
Freeman_Abigail Personal Brand Explorationabbytoliver
 

Recently uploaded (11)

Issues affecting LGBT as they grow older.pptx
Issues affecting LGBT as they grow older.pptxIssues affecting LGBT as they grow older.pptx
Issues affecting LGBT as they grow older.pptx
 
Monthly HSE Report March for overall HSE
Monthly HSE Report March for overall HSEMonthly HSE Report March for overall HSE
Monthly HSE Report March for overall HSE
 
IE Application: Express Yourself - Sofia Merizalde
IE Application: Express Yourself - Sofia MerizaldeIE Application: Express Yourself - Sofia Merizalde
IE Application: Express Yourself - Sofia Merizalde
 
Auditorium Session 1 - Connection - Inclusion
Auditorium Session 1 - Connection - InclusionAuditorium Session 1 - Connection - Inclusion
Auditorium Session 1 - Connection - Inclusion
 
Auditorium Session 3 - Resilience - Financial Resilience and Collaboration
Auditorium Session 3 - Resilience - Financial Resilience and CollaborationAuditorium Session 3 - Resilience - Financial Resilience and Collaboration
Auditorium Session 3 - Resilience - Financial Resilience and Collaboration
 
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdf
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdfInstructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdf
Instructional Supervision - By Dr. Cherinet Aytenfsu Weldearegay.pdf
 
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptxGarcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx
Garcia_RobertDaniel_SPCSTA_PB1_2024-02.pptx
 
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptx
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptxTeams Nation 2024 - #Copilot & Teams or Just Premium.pptx
Teams Nation 2024 - #Copilot & Teams or Just Premium.pptx
 
Auditorium Session 2 - Workforce - Diversity/Skills & Confidence
Auditorium Session 2 - Workforce - Diversity/Skills & ConfidenceAuditorium Session 2 - Workforce - Diversity/Skills & Confidence
Auditorium Session 2 - Workforce - Diversity/Skills & Confidence
 
TheSimpsons_Fandom_Assignment_4.5pc.pptx
TheSimpsons_Fandom_Assignment_4.5pc.pptxTheSimpsons_Fandom_Assignment_4.5pc.pptx
TheSimpsons_Fandom_Assignment_4.5pc.pptx
 
Freeman_Abigail Personal Brand Exploration
Freeman_Abigail Personal Brand ExplorationFreeman_Abigail Personal Brand Exploration
Freeman_Abigail Personal Brand Exploration
 

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive

  • 1. 1 Kylin: Hadoop OLAP Engine - Tech Deep Dive Jiang Xu, Architect of Kylin October 2014
  • 2. Kylin Overview Kylin - n. (in Chinese art) a mythical animal of composite form 2 http://kylin.io Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
  • 3. What Is Kylin? • Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data • ANSI-SQL Interface on Hadoop Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions • Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset • MOLAP Cube User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records • Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau. Integration with Microstrategy and Excel is coming soon 3
  • 4. What Is Kylin? - Other Highlights • Job Management and Monitoring • Compression and Encoding Support • Incremental Refresh of Cubes • Leverage HBase Coprocessor for query latency • Approximate Query Capability for distinct Count (HyperLogLog) • Easy Web interface to manage, build, monitor and query cubes • Security capability to set ACL at Cube/Project Level • Support LDAP Integration 4
  • 5. Glance of SQL-on-Hadoop Ecosystem … • SQL translated to MapReduce jobs – Hive – Stinger without Tez • SQL processed by a MPP Engine – Impala – Drill – Presto – Spark + Shark • SQL process by a existing SQL Engine + HDFS – EMC Greenplum (postgres) – Taobao Garude (mysql) • OLAP on Hadoop in other Companies – Adobe: HBase Cube – LinkedIn: Avatara – Salesforce.com: Phoenix 5
  • 6. Why Do We Build Kylin? • Why existing SQL-on-Hadoop solutions fall short? The existing SQL-on-Hadoop needs to scan partial or whole data set to answer a user query. Moreover, table join may trigger the data transfer across host. Due to large scan range and network traffic latency, many queries are very slow (minute+ latency). • What is MOLAP/ROLAP? – MOLAP (Multi-dimensional OLAP) is to pre-compute data along different dimensions of interest and store resultant values in the cube. MOLAP is much faster but is inflexible. Kylin is more like MOLAP. – ROLAP (Relational-OLAP) is to use star or snow-flake schema to do runtime aggregation. ROLAP is flexible but much slower. All existing SQL-on-Hadoop is kind of ROLAP. • How does Kylin support ROLAP/MOLAP? Kylin builds data cube (MOLAP) from hive table (ROLAP) according to the metadata definition. – If the query can be fulfilled by data cube, Kylin will route the query to data cube that is MOLAP. – If the query can’t be fulfilled by data cube, Kylin will route the query to hive table that is ROLAP. – Basically, you can think Kylin as HOLAP on top of MOLAP and ROLAP. 6
  • 8. Analytics Query Taxonomy 8 High Level Aggregation • Very High Level, e.g GMV by site by vertical by weeks Analysis Query • Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail • Detail Level (SSA Table) Low Level Aggregation • Seller ID Transaction Level • Transaction ID 80+% Analytics Kylin is designed to accelerate Analytics queries!
  • 9. Query Performance -- Compare to Hive 9 200 100 0 SQL #1 SQL #2 SQL #3 # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Hive Kylin Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A
  • 10. Query Performance – Latency & Throughput 10
  • 12. Component Design 12 Query Engine (Calcite) Metadata Manager JDBC Driver Storage Engine Data Cube (HBase) Job Engine (MapReduce) Star Schema (Hive) SQL RESTful Server Dictionary & Cube JDBC Driver ODBC Driver HBase Coprossor
  • 13. Background – Cube & Cuboid 13 • Cuboid = one combination of dimensions • Cube = all combination of dimensions (all cuboids) time, item time item location supplier time, item, location item, location time, location, supplier time, item, location, supplier time, location Time, supplier item, supplier location, supplier time, item, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
  • 14. Metadata- Overview 14 Cube: … Fact Table: … Dimensions: … Measures: … Row Key: … HBase Mapping: … Dim Fact Dim Dim Source Star Schema row A row B row C Val 1 Val 2 Val 3 Column Family Row Key Column Target HBase Storage Cube Metadata
  • 15. Metadata – Dimension • Dimension – Normal – Mandatory – Hierarchy – Derived 15
  • 16. Metadata – Measure • Measure – Sum – Count – Max – Min – Average – Distinct Count (based on HyperLogLog) 16
  • 17. Query Engine - Overview • Query engine is based on Calcite • Query Execution Plan • Optiq Plug-ins in Query Engine 17
  • 18. Query Engine – Calcite • Calcite (http://incubator.apache.org/projects/calcite.html) is an extensible open source SQL engine that is also used in Stinger/Drill/Cascading. 18
  • 19. Query Engine – Explain Plan 19 SELECT test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name, sum(test_kylin_fact.price) as total_price, count(*) as total_count FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category_groupings ON test_kylin_fact.leaf_categ_id = test_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category_groupings.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], LV1_CATEG=[$1], LVL2_CATEG=[$2], LVL3_CATEG=[$3], FORMAT_NAME=[$4], SITE_NAME=[$5], TOTAL_PRICE=[CASE(=($7, 0), null, $6)], TOTAL_COUNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], LV1_CATEG=[$21], LVL2_CATEG=[$15], LVL3_CATEG=[$14], FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ‘New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, TEST_CATEGORY_GROUPINGS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
  • 20. Query Engine – Calcite Plugin-ins • Metadata SPI – Provide table schema from kylin metadata • Optimize Rule – Translate the logic operator into kylin operator • Relational Operator – Find right cube – Translate SQL into storage engine api call – Generate physical execute plan by linq4j java implementation • Result Enumerator – Translate storage engine result into java implementation result. • SQL Function – Add HyperLogLog for distinct count – Implement date time related functions (i.e. Quarter) 20
  • 21. Storage Engine - Overview • Provide cube query for query engine – Common iterator interface for storage engine – Isolate query engine from underline storage • HBase Storage – HBase schema – Pre-join & pre-aggregation – Dictionary – HBase coprocessor 21
  • 22. Storage Engine – HBase Schema 22
  • 23. Storage Engine – Pre-join & pre-aggregation • Pre-join (Hive) – Kylin will generate pre-join HQL based on metadata that will join the fact table with dimension tables – Kylin will use Hive to execute the pre-join HQL that will generate the pre-joined flat table • Pre-aggregation (MapReduce) – Kylin will generate a minimum spanning tree of cuboid from cube lattice graph. – Kylin will generate MapReduce job base on the minimum spanning tree – MapReduce job will do pre-aggregation to generate all cuboids from N dimensions to 1 dimension. • MOLAP Cube (HBase) – The pre-join & pre-aggregation results (i.e. MOLAP Cube) is stored in HBase 23
  • 24. Storage Engine – Dictionary • Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. • Metadata can define whether one dimension use “dictionary” or not • Dictionary is based on Trie 24
  • 25. Storage Engine – HBase Coprocessor • HBase coprocessor can reduce network traffic and parallelize scan logic • HBase client – Serialize the filter + dimension + metrics into bytes – Send the encoded bytes to corprocessor • HBase Corprocessor – Deserialize the bytes to filter + dimension + metrics – Iterate all rows from each scan range • Filter unmatched rows • Aggregate the matched rows by dimensions in an cache • Send back the aggregated rows from cache 25
  • 26. Cube Build - Overview • Multi Staged Build • Map Reduce Job Flow • Cube Build Steps 26
  • 27. Cube Build – Multi Staged Build 27
  • 28. Cube Build – Map Reduce Job Flow 28
  • 29. Cube Build – Steps • Build dictionary from dimension tables on local disk. And copy dictionary to HDFS • Run Hive query to build a joined flatten table • Run map reduce job to build cuboids in HDFS sequence files from tier 1 to tier N • Calculate the key distribution of HDFS sequence files. And evenly split the key space into K regions. • • Translate HDFS sequence files into HBase Hfile • Bulk load the HFile into HBase 29
  • 30. Cube Optimization - Overview • “Curse of dimensionality”: N dimension cube has 2N cuboid – Full Cube vs. Partial Cube • Hugh data volume – Incremental Build • Slow Table Scan – TopN Query on High Cardinality Dimension – Bitmap inverted index – Time range partition – In-memory parallel scan: block cache + endpoint coprocessor 30
  • 31. Cube Optimization – Full Cube vs. Partial Cube • Full Cube – Pre-aggregate all dimension combinations – “Curse of dimensionality”: N dimension cube has 2N cuboid. • Partial Cube – To avoid dimension explosion, we divide the dimensions into different aggregation groups – For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands – Tradeoff between online aggregation and offline pre-aggregation 31
  • 32. Cube Optimization – Partial Cube 32
  • 33. Cube Optimization – Incremental Build 33
  • 34. Cube Optimization – TopN Query on High Cardinality Dimension 34 • Bitmap inverted index • Separate high cardinality dimension from low cardinality dimension • Time range partition • In-memory (block cache) • Parallel scan (endpoint coprocessor)
  • 35. Kylin Resources • Web Site http://kylin.io • Google Groups https://groups.google.com/forum/#!forum/kylin-olap • Source Code https://github.com/KylinOLAP/Kylin 35
  • 36. Q & A 36