SlideShare a Scribd company logo
1 
Kylin: Hadoop OLAP Engine - Tech Deep Dive 
Jiang Xu, Architect of Kylin 
October 2014
Kylin Overview 
Kylin 
- n. (in Chinese art) a mythical animal of composite form 
2 
http://kylin.io 
Kylin is an open source Distributed Analytics Engine from 
eBay Inc. that provides SQL interface and multi-dimensional 
analysis (OLAP) on Hadoop supporting extremely large 
datasets
What Is Kylin? 
• Extremely Fast OLAP Engine at Scale 
Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data 
• ANSI-SQL Interface on Hadoop 
Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions 
• Interactive Query Capability 
Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same 
dataset 
• MOLAP Cube 
User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records 
• Seamless Integration with BI Tools 
Kylin currently offers integration capability with BI Tools like Tableau. Integration with Microstrategy and Excel 
is coming soon 
3
What Is Kylin? - Other Highlights 
• Job Management and Monitoring 
• Compression and Encoding Support 
• Incremental Refresh of Cubes 
• Leverage HBase Coprocessor for query latency 
• Approximate Query Capability for distinct Count (HyperLogLog) 
• Easy Web interface to manage, build, monitor and query cubes 
• Security capability to set ACL at Cube/Project Level 
• Support LDAP Integration 
4
Glance of SQL-on-Hadoop Ecosystem … 
• SQL translated to MapReduce jobs 
– Hive 
– Stinger without Tez 
• SQL processed by a MPP Engine 
– Impala 
– Drill 
– Presto 
– Spark + Shark 
• SQL process by a existing SQL Engine + HDFS 
– EMC Greenplum (postgres) 
– Taobao Garude (mysql) 
• OLAP on Hadoop in other Companies 
– Adobe: HBase Cube 
– LinkedIn: Avatara 
– Salesforce.com: Phoenix 
5
Why Do We Build Kylin? 
• Why existing SQL-on-Hadoop solutions fall short? 
The existing SQL-on-Hadoop needs to scan partial or whole data set to answer a user query. Moreover, table join may trigger 
the data transfer across host. Due to large scan range and network traffic latency, many queries are very slow (minute+ 
latency). 
• What is MOLAP/ROLAP? 
– MOLAP (Multi-dimensional OLAP) is to pre-compute data along different dimensions of interest and store resultant 
values in the cube. MOLAP is much faster but is inflexible. Kylin is more like MOLAP. 
– ROLAP (Relational-OLAP) is to use star or snow-flake schema to do runtime aggregation. ROLAP is flexible but much 
slower. All existing SQL-on-Hadoop is kind of ROLAP. 
• How does Kylin support ROLAP/MOLAP? 
Kylin builds data cube (MOLAP) from hive table (ROLAP) according to the metadata definition. 
– If the query can be fulfilled by data cube, Kylin will route the query to data cube that is MOLAP. 
– If the query can’t be fulfilled by data cube, Kylin will route the query to hive table that is ROLAP. 
– Basically, you can think Kylin as HOLAP on top of MOLAP and ROLAP. 
6
Architecture Overview 
7
Analytics Query Taxonomy 
8 
High Level 
Aggregation 
• Very High Level, e.g GMV by 
site by vertical by weeks 
Analysis 
Query 
• Middle level, e.g GMV by site by vertical, by 
category (level x) past 12 weeks 
Drill Down to 
Detail • Detail Level (SSA Table) 
Low Level 
Aggregation • Seller ID 
Transaction 
Level • Transaction ID 
80+% 
Analytics 
Kylin is designed to accelerate 
Analytics queries!
Query Performance -- Compare to Hive 
9 
200 
100 
0 
SQL #1 SQL #2 SQL #3 
# Query Type Return 
Dataset 
Query 
On Kylin (s) 
Query 
On Hive (s) 
Hive 
Kylin 
Comments 
1 High Level Aggregation 4 0.129 157.437 1,217 times 
2 Analysis Query 22,669 1.615 109.206 68 times 
3 Drill Down to Detail 325,029 12.058 113.123 9 times 
4 Drill Down to Detail 524,780 22.42 6383.21 278 times 
5 Data Dump 972,002 49.054 N/A
Query Performance – Latency & Throughput 
10
Tech Highlights 
11
Component Design 
12 
Query Engine 
(Calcite) 
Metadata Manager 
JDBC Driver 
Storage Engine 
Data Cube 
(HBase) 
Job Engine 
(MapReduce) 
Star Schema 
(Hive) 
SQL 
RESTful Server 
Dictionary 
& Cube 
JDBC Driver ODBC Driver 
HBase 
Coprossor
Background – Cube & Cuboid 
13 
• Cuboid = one combination of dimensions 
• Cube = all combination of dimensions (all cuboids) 
time, item 
time item location supplier 
time, item, location 
item, location 
time, location, supplier 
time, item, location, supplier 
time, location 
Time, supplier 
item, supplier 
location, supplier 
time, item, supplier 
item, location, supplier 
0-D(apex) cuboid 
1-D cuboids 
2-D cuboids 
3-D cuboids 
4-D(base) cuboid
Metadata- Overview 
14 
Cube: … 
Fact Table: … 
Dimensions: … 
Measures: … 
Row Key: … 
HBase Mapping: 
… 
Dim 
Fact 
Dim Dim 
Source 
Star Schema 
row A 
row B 
row C 
Val 1 
Val 2 
Val 3 
Column 
Family 
Row Key 
Column 
Target 
HBase Storage 
Cube Metadata
Metadata – Dimension 
• Dimension 
– Normal 
– Mandatory 
– Hierarchy 
– Derived 
15
Metadata – Measure 
• Measure 
– Sum 
– Count 
– Max 
– Min 
– Average 
– Distinct Count (based on HyperLogLog) 
16
Query Engine - Overview 
• Query engine is based on Calcite 
• Query Execution Plan 
• Optiq Plug-ins in Query Engine 
17
Query Engine – Calcite 
• Calcite (http://incubator.apache.org/projects/calcite.html) is an extensible open source SQL 
engine that is also used in Stinger/Drill/Cascading. 
18
Query Engine – Explain Plan 
19 
SELECT test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, 
test_sites.site_name, sum(test_kylin_fact.price) as total_price, count(*) as total_count 
FROM test_kylin_fact 
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt 
LEFT JOIN test_category_groupings ON test_kylin_fact.leaf_categ_id = test_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = 
test_category_groupings.site_id 
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id 
WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = ’New' 
GROUP BY test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, 
test_sites.site_name 
OLAPToEnumerableConverter 
OLAPProjectRel(WEEK_BEG_DT=[$0], LV1_CATEG=[$1], LVL2_CATEG=[$2], LVL3_CATEG=[$3], FORMAT_NAME=[$4], 
SITE_NAME=[$5], TOTAL_PRICE=[CASE(=($7, 0), null, $6)], TOTAL_COUNT=[$8]) 
OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) 
OLAPProjectRel(WEEK_BEG_DT=[$13], LV1_CATEG=[$21], LVL2_CATEG=[$15], LVL3_CATEG=[$14], FORMAT_NAME=[$5], 
SITE_NAME=[$23], PRICE=[$0]) 
OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ‘New'))]) 
OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) 
OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) 
OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) 
OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) 
OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) 
OLAPTableScan(table=[[DEFAULT, TEST_CATEGORY_GROUPINGS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) 
OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
Query Engine – Calcite Plugin-ins 
• Metadata SPI 
– Provide table schema from kylin metadata 
• Optimize Rule 
– Translate the logic operator into kylin operator 
• Relational Operator 
– Find right cube 
– Translate SQL into storage engine api call 
– Generate physical execute plan by linq4j java implementation 
• Result Enumerator 
– Translate storage engine result into java implementation result. 
• SQL Function 
– Add HyperLogLog for distinct count 
– Implement date time related functions (i.e. Quarter) 
20
Storage Engine - Overview 
• Provide cube query for query engine 
– Common iterator interface for storage engine 
– Isolate query engine from underline storage 
• HBase Storage 
– HBase schema 
– Pre-join & pre-aggregation 
– Dictionary 
– HBase coprocessor 
21
Storage Engine – HBase Schema 
22
Storage Engine – Pre-join & pre-aggregation 
• Pre-join (Hive) 
– Kylin will generate pre-join HQL based on metadata that will join the fact table with dimension tables 
– Kylin will use Hive to execute the pre-join HQL that will generate the pre-joined flat table 
• Pre-aggregation (MapReduce) 
– Kylin will generate a minimum spanning tree of cuboid from cube lattice graph. 
– Kylin will generate MapReduce job base on the minimum spanning tree 
– MapReduce job will do pre-aggregation to generate all cuboids from N dimensions to 1 dimension. 
• MOLAP Cube (HBase) 
– The pre-join & pre-aggregation results (i.e. MOLAP Cube) is stored in HBase 
23
Storage Engine – Dictionary 
• Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. 
• Metadata can define whether one dimension use “dictionary” or not 
• Dictionary is based on Trie 
24
Storage Engine – HBase Coprocessor 
• HBase coprocessor can reduce network traffic and parallelize scan logic 
• HBase client 
– Serialize the filter + dimension + metrics into bytes 
– Send the encoded bytes to corprocessor 
• HBase Corprocessor 
– Deserialize the bytes to filter + dimension + metrics 
– Iterate all rows from each scan range 
• Filter unmatched rows 
• Aggregate the matched rows by dimensions in an cache 
• Send back the aggregated rows from cache 
25
Cube Build - Overview 
• Multi Staged Build 
• Map Reduce Job Flow 
• Cube Build Steps 
26
Cube Build – Multi Staged Build 
27
Cube Build – Map Reduce Job Flow 
28
Cube Build – Steps 
• Build dictionary from dimension tables on local disk. And copy dictionary to HDFS 
• Run Hive query to build a joined flatten table 
• Run map reduce job to build cuboids in HDFS sequence files from tier 1 to tier N 
• Calculate the key distribution of HDFS sequence files. And evenly split the key space into K regions. 
• 
• Translate HDFS sequence files into HBase Hfile 
• Bulk load the HFile into HBase 
29
Cube Optimization - Overview 
• “Curse of dimensionality”: N dimension cube has 2N cuboid 
– Full Cube vs. Partial Cube 
• Hugh data volume 
– Incremental Build 
• Slow Table Scan – TopN Query on High Cardinality Dimension 
– Bitmap inverted index 
– Time range partition 
– In-memory parallel scan: block cache + endpoint coprocessor 
30
Cube Optimization – Full Cube vs. Partial Cube 
• Full Cube 
– Pre-aggregate all dimension combinations 
– “Curse of dimensionality”: N dimension cube has 2N cuboid. 
• Partial Cube 
– To avoid dimension explosion, we divide the dimensions into different aggregation groups 
– For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will 
reduce from 1 Billion to 3 Thousands 
– Tradeoff between online aggregation and offline pre-aggregation 
31
Cube Optimization – Partial Cube 
32
Cube Optimization – Incremental Build 
33
Cube Optimization – TopN Query on High Cardinality Dimension 
34 
• Bitmap inverted index 
• Separate high cardinality 
dimension from low cardinality 
dimension 
• Time range partition 
• In-memory (block cache) 
• Parallel scan (endpoint 
coprocessor)
Kylin Resources 
• Web Site 
http://kylin.io 
• Google Groups 
https://groups.google.com/forum/#!forum/kylin-olap 
• Source Code 
https://github.com/KylinOLAP/Kylin 
35
Q & A 
36

More Related Content

What's hot

Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Yang Li
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
Xu Jiang
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
DataWorks Summit/Hadoop Summit
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
hongbin ma
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
Yang Li
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
HBaseCon
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
Luke Han
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Debashis Saha
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
Yang Li
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
Seshu Adunuthula
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
Tony Ng
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
Yang Li
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
DataWorks Summit
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
qhzhou
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
Luke Han
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
DataWorks Summit
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
Luke Han
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
Luke Han
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
DataWorks Summit
 

What's hot (20)

Apache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 DecApache Kylin: Hadoop OLAP Engine, 2014 Dec
Apache Kylin: Hadoop OLAP Engine, 2014 Dec
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Apache Kylin Streaming
Apache Kylin Streaming Apache Kylin Streaming
Apache Kylin Streaming
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Kylin OLAP Engine Tour
Kylin OLAP Engine TourKylin OLAP Engine Tour
Kylin OLAP Engine Tour
 
Apache Kylin - Balance between space and time - Hadoop Summit 2015
Apache Kylin -  Balance between space and time - Hadoop Summit 2015Apache Kylin -  Balance between space and time - Hadoop Summit 2015
Apache Kylin - Balance between space and time - Hadoop Summit 2015
 
Apache Kylin 1.5 Updates
Apache Kylin 1.5 UpdatesApache Kylin 1.5 Updates
Apache Kylin 1.5 Updates
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
eBay Experimentation Platform on Hadoop
eBay Experimentation Platform on HadoopeBay Experimentation Platform on Hadoop
eBay Experimentation Platform on Hadoop
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)Apache kylin (china hadoop summit 2015 shanghai)
Apache kylin (china hadoop summit 2015 shanghai)
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 

Viewers also liked

Low Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBaseLow Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBase
DataWorks Summit
 
IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?
DataWorks Summit
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Tony Ng
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Svetlin Nakov
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engine
Amanpreet Singh
 
RoomCloud Booking Engine
RoomCloud Booking EngineRoomCloud Booking Engine
RoomCloud Booking Engine
Sylvain Diverchy
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)
Dennis Deacon
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Search Engine Land
 
Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...
Mihir Pai
 
Tk03 Google App Engine Fr
Tk03 Google App Engine FrTk03 Google App Engine Fr
Tk03 Google App Engine Fr
Valtech
 
ppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engineppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engine
harshid panchal
 
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
JRibbeck
 
I.C.ENGINE PPT
I.C.ENGINE PPTI.C.ENGINE PPT
I.C.ENGINE PPT
8695
 
CAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de ContenidoCAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de Contenido
Gary Briceño
 
App engine
App engineApp engine
App engine
ThirdWay
 
CAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords ResearchCAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords Research
Gary Briceño
 
El SEOy la geolocalización
El SEOy la geolocalizaciónEl SEOy la geolocalización
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Websec México, S.C.
 
Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Javier Laguens Garcia
 

Viewers also liked (20)

Low Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBaseLow Latency OLAP with Hadoop and HBase
Low Latency OLAP with Hadoop and HBase
 
IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?IS OLAP DEAD IN THE AGE OF BIG DATA?
IS OLAP DEAD IN THE AGE OF BIG DATA?
 
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and DruidPulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
Pulsar: Real-time Analytics at Scale with Kafka, Kylin and Druid
 
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarborCloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
Cloud for Developers: Azure vs. Google App Engine vs. Amazon vs. AppHarbor
 
Combustion in diesel engine
Combustion in diesel engineCombustion in diesel engine
Combustion in diesel engine
 
RoomCloud Booking Engine
RoomCloud Booking EngineRoomCloud Booking Engine
RoomCloud Booking Engine
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)
 
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLandPeriodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
Periodic Table of SEO Success Factors & Guide to SEO by SearchEngineLand
 
Ic engine
Ic engineIc engine
Ic engine
 
Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...Internal Combustion Engines - Construction and Working (All you need to know,...
Internal Combustion Engines - Construction and Working (All you need to know,...
 
Tk03 Google App Engine Fr
Tk03 Google App Engine FrTk03 Google App Engine Fr
Tk03 Google App Engine Fr
 
ppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engineppt on 2 stroke and 4 stroke petrol engine
ppt on 2 stroke and 4 stroke petrol engine
 
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
DNUG2015 Frühjahrskonferenz: Brücken bauen, Grenzen überwinden: Domino im Dia...
 
I.C.ENGINE PPT
I.C.ENGINE PPTI.C.ENGINE PPT
I.C.ENGINE PPT
 
CAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de ContenidoCAP 4: SEO - Optimizacion de Contenido
CAP 4: SEO - Optimizacion de Contenido
 
App engine
App engineApp engine
App engine
 
CAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords ResearchCAP 3: SEO - Keywords Research
CAP 3: SEO - Keywords Research
 
El SEOy la geolocalización
El SEOy la geolocalizaciónEl SEOy la geolocalización
El SEOy la geolocalización
 
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
Desarrollando para Nmap Scripting Engine (NSE) [GuadalajaraCON 2013]
 
Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10Presentacion introduccion ibm file net p8 v10
Presentacion introduccion ibm file net p8 v10
 

Similar to Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive

ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
Evans Ye
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
Tyler Wishnoff
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
Stratebi
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
Databricks
 
ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015
Luke Han
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin Introduction
Luke Han
 
ScaleDB Technical Presentation
ScaleDB Technical PresentationScaleDB Technical Presentation
ScaleDB Technical Presentation
Ivan Zoratti
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Lucas Jellema
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummit
amarsri
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Denny Lee
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
Nicolas Poggi
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQ
MariaDB plc
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Data Con LA
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Dmitry Anoshin
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 

Similar to Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive (20)

ONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smartONE FOR ALL! Using Apache Calcite to make SQL smart
ONE FOR ALL! Using Apache Calcite to make SQL smart
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015
 
Apache Kylin Introduction
Apache Kylin IntroductionApache Kylin Introduction
Apache Kylin Introduction
 
ScaleDB Technical Presentation
ScaleDB Technical PresentationScaleDB Technical Presentation
ScaleDB Technical Presentation
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
Oracle OpenWorld 2016 Review - Focus on Data, BigData, Streaming Data, Machin...
 
Oow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BIOow2016 review-db-dev-bigdata-BI
Oow2016 review-db-dev-bigdata-BI
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummit
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Designing, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons LearnedDesigning, Building, and Maintaining Large Cubes using Lessons Learned
Designing, Building, and Maintaining Large Cubes using Lessons Learned
 
Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)Using BigBench to compare Hive and Spark (Long version)
Using BigBench to compare Hive and Spark (Long version)
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
How we switched to columnar at SpendHQ
How we switched to columnar at SpendHQHow we switched to columnar at SpendHQ
How we switched to columnar at SpendHQ
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical SolutionEnterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
Enterprise Data World 2018 - Building Cloud Self-Service Analytical Solution
 
Cheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduceCheetah:Data Warehouse on Top of MapReduce
Cheetah:Data Warehouse on Top of MapReduce
 

Recently uploaded

0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
OWASP Beja
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
OECD Directorate for Financial and Enterprise Affairs
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
Faculty of Medicine And Health Sciences
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
IP ServerOne
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
Howard Spence
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
Sebastiano Panichella
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Orkestra
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
Vladimir Samoylov
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Sebastiano Panichella
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Sebastiano Panichella
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
Access Innovations, Inc.
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Matjaž Lipuš
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
khadija278284
 

Recently uploaded (13)

0x01 - Newton's Third Law: Static vs. Dynamic Abusers
0x01 - Newton's Third Law:  Static vs. Dynamic Abusers0x01 - Newton's Third Law:  Static vs. Dynamic Abusers
0x01 - Newton's Third Law: Static vs. Dynamic Abusers
 
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
Competition and Regulation in Professional Services – KLEINER – June 2024 OEC...
 
Obesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditionsObesity causes and management and associated medical conditions
Obesity causes and management and associated medical conditions
 
Acorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutesAcorn Recovery: Restore IT infra within minutes
Acorn Recovery: Restore IT infra within minutes
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
 
International Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software TestingInternational Workshop on Artificial Intelligence in Software Testing
International Workshop on Artificial Intelligence in Software Testing
 
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...
 
Getting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control TowerGetting started with Amazon Bedrock Studio and Control Tower
Getting started with Amazon Bedrock Studio and Control Tower
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
 
Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...Announcement of 18th IEEE International Conference on Software Testing, Verif...
Announcement of 18th IEEE International Conference on Software Testing, Verif...
 
Eureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 PresentationEureka, I found it! - Special Libraries Association 2021 Presentation
Eureka, I found it! - Special Libraries Association 2021 Presentation
 
Bitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXOBitcoin Lightning wallet and tic-tac-toe game XOXO
Bitcoin Lightning wallet and tic-tac-toe game XOXO
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
 

Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive

  • 1. 1 Kylin: Hadoop OLAP Engine - Tech Deep Dive Jiang Xu, Architect of Kylin October 2014
  • 2. Kylin Overview Kylin - n. (in Chinese art) a mythical animal of composite form 2 http://kylin.io Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets
  • 3. What Is Kylin? • Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data • ANSI-SQL Interface on Hadoop Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions • Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset • MOLAP Cube User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records • Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau. Integration with Microstrategy and Excel is coming soon 3
  • 4. What Is Kylin? - Other Highlights • Job Management and Monitoring • Compression and Encoding Support • Incremental Refresh of Cubes • Leverage HBase Coprocessor for query latency • Approximate Query Capability for distinct Count (HyperLogLog) • Easy Web interface to manage, build, monitor and query cubes • Security capability to set ACL at Cube/Project Level • Support LDAP Integration 4
  • 5. Glance of SQL-on-Hadoop Ecosystem … • SQL translated to MapReduce jobs – Hive – Stinger without Tez • SQL processed by a MPP Engine – Impala – Drill – Presto – Spark + Shark • SQL process by a existing SQL Engine + HDFS – EMC Greenplum (postgres) – Taobao Garude (mysql) • OLAP on Hadoop in other Companies – Adobe: HBase Cube – LinkedIn: Avatara – Salesforce.com: Phoenix 5
  • 6. Why Do We Build Kylin? • Why existing SQL-on-Hadoop solutions fall short? The existing SQL-on-Hadoop needs to scan partial or whole data set to answer a user query. Moreover, table join may trigger the data transfer across host. Due to large scan range and network traffic latency, many queries are very slow (minute+ latency). • What is MOLAP/ROLAP? – MOLAP (Multi-dimensional OLAP) is to pre-compute data along different dimensions of interest and store resultant values in the cube. MOLAP is much faster but is inflexible. Kylin is more like MOLAP. – ROLAP (Relational-OLAP) is to use star or snow-flake schema to do runtime aggregation. ROLAP is flexible but much slower. All existing SQL-on-Hadoop is kind of ROLAP. • How does Kylin support ROLAP/MOLAP? Kylin builds data cube (MOLAP) from hive table (ROLAP) according to the metadata definition. – If the query can be fulfilled by data cube, Kylin will route the query to data cube that is MOLAP. – If the query can’t be fulfilled by data cube, Kylin will route the query to hive table that is ROLAP. – Basically, you can think Kylin as HOLAP on top of MOLAP and ROLAP. 6
  • 8. Analytics Query Taxonomy 8 High Level Aggregation • Very High Level, e.g GMV by site by vertical by weeks Analysis Query • Middle level, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail • Detail Level (SSA Table) Low Level Aggregation • Seller ID Transaction Level • Transaction ID 80+% Analytics Kylin is designed to accelerate Analytics queries!
  • 9. Query Performance -- Compare to Hive 9 200 100 0 SQL #1 SQL #2 SQL #3 # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Hive Kylin Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A
  • 10. Query Performance – Latency & Throughput 10
  • 12. Component Design 12 Query Engine (Calcite) Metadata Manager JDBC Driver Storage Engine Data Cube (HBase) Job Engine (MapReduce) Star Schema (Hive) SQL RESTful Server Dictionary & Cube JDBC Driver ODBC Driver HBase Coprossor
  • 13. Background – Cube & Cuboid 13 • Cuboid = one combination of dimensions • Cube = all combination of dimensions (all cuboids) time, item time item location supplier time, item, location item, location time, location, supplier time, item, location, supplier time, location Time, supplier item, supplier location, supplier time, item, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid
  • 14. Metadata- Overview 14 Cube: … Fact Table: … Dimensions: … Measures: … Row Key: … HBase Mapping: … Dim Fact Dim Dim Source Star Schema row A row B row C Val 1 Val 2 Val 3 Column Family Row Key Column Target HBase Storage Cube Metadata
  • 15. Metadata – Dimension • Dimension – Normal – Mandatory – Hierarchy – Derived 15
  • 16. Metadata – Measure • Measure – Sum – Count – Max – Min – Average – Distinct Count (based on HyperLogLog) 16
  • 17. Query Engine - Overview • Query engine is based on Calcite • Query Execution Plan • Optiq Plug-ins in Query Engine 17
  • 18. Query Engine – Calcite • Calcite (http://incubator.apache.org/projects/calcite.html) is an extensible open source SQL engine that is also used in Stinger/Drill/Cascading. 18
  • 19. Query Engine – Explain Plan 19 SELECT test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name, sum(test_kylin_fact.price) as total_price, count(*) as total_count FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category_groupings ON test_kylin_fact.leaf_categ_id = test_category_groupings.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category_groupings.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.lv1_categ, test_category.lv2_categ, test_category.lv3_categ, test_kylin_fact.format_name, test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], LV1_CATEG=[$1], LVL2_CATEG=[$2], LVL3_CATEG=[$3], FORMAT_NAME=[$4], SITE_NAME=[$5], TOTAL_PRICE=[CASE(=($7, 0), null, $6)], TOTAL_COUNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], LV1_CATEG=[$21], LVL2_CATEG=[$15], LVL3_CATEG=[$14], FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ‘New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, TEST_CATEGORY_GROUPINGS]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
  • 20. Query Engine – Calcite Plugin-ins • Metadata SPI – Provide table schema from kylin metadata • Optimize Rule – Translate the logic operator into kylin operator • Relational Operator – Find right cube – Translate SQL into storage engine api call – Generate physical execute plan by linq4j java implementation • Result Enumerator – Translate storage engine result into java implementation result. • SQL Function – Add HyperLogLog for distinct count – Implement date time related functions (i.e. Quarter) 20
  • 21. Storage Engine - Overview • Provide cube query for query engine – Common iterator interface for storage engine – Isolate query engine from underline storage • HBase Storage – HBase schema – Pre-join & pre-aggregation – Dictionary – HBase coprocessor 21
  • 22. Storage Engine – HBase Schema 22
  • 23. Storage Engine – Pre-join & pre-aggregation • Pre-join (Hive) – Kylin will generate pre-join HQL based on metadata that will join the fact table with dimension tables – Kylin will use Hive to execute the pre-join HQL that will generate the pre-joined flat table • Pre-aggregation (MapReduce) – Kylin will generate a minimum spanning tree of cuboid from cube lattice graph. – Kylin will generate MapReduce job base on the minimum spanning tree – MapReduce job will do pre-aggregation to generate all cuboids from N dimensions to 1 dimension. • MOLAP Cube (HBase) – The pre-join & pre-aggregation results (i.e. MOLAP Cube) is stored in HBase 23
  • 24. Storage Engine – Dictionary • Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. • Metadata can define whether one dimension use “dictionary” or not • Dictionary is based on Trie 24
  • 25. Storage Engine – HBase Coprocessor • HBase coprocessor can reduce network traffic and parallelize scan logic • HBase client – Serialize the filter + dimension + metrics into bytes – Send the encoded bytes to corprocessor • HBase Corprocessor – Deserialize the bytes to filter + dimension + metrics – Iterate all rows from each scan range • Filter unmatched rows • Aggregate the matched rows by dimensions in an cache • Send back the aggregated rows from cache 25
  • 26. Cube Build - Overview • Multi Staged Build • Map Reduce Job Flow • Cube Build Steps 26
  • 27. Cube Build – Multi Staged Build 27
  • 28. Cube Build – Map Reduce Job Flow 28
  • 29. Cube Build – Steps • Build dictionary from dimension tables on local disk. And copy dictionary to HDFS • Run Hive query to build a joined flatten table • Run map reduce job to build cuboids in HDFS sequence files from tier 1 to tier N • Calculate the key distribution of HDFS sequence files. And evenly split the key space into K regions. • • Translate HDFS sequence files into HBase Hfile • Bulk load the HFile into HBase 29
  • 30. Cube Optimization - Overview • “Curse of dimensionality”: N dimension cube has 2N cuboid – Full Cube vs. Partial Cube • Hugh data volume – Incremental Build • Slow Table Scan – TopN Query on High Cardinality Dimension – Bitmap inverted index – Time range partition – In-memory parallel scan: block cache + endpoint coprocessor 30
  • 31. Cube Optimization – Full Cube vs. Partial Cube • Full Cube – Pre-aggregate all dimension combinations – “Curse of dimensionality”: N dimension cube has 2N cuboid. • Partial Cube – To avoid dimension explosion, we divide the dimensions into different aggregation groups – For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands – Tradeoff between online aggregation and offline pre-aggregation 31
  • 32. Cube Optimization – Partial Cube 32
  • 33. Cube Optimization – Incremental Build 33
  • 34. Cube Optimization – TopN Query on High Cardinality Dimension 34 • Bitmap inverted index • Separate high cardinality dimension from low cardinality dimension • Time range partition • In-memory (block cache) • Parallel scan (endpoint coprocessor)
  • 35. Kylin Resources • Web Site http://kylin.io • Google Groups https://groups.google.com/forum/#!forum/kylin-olap • Source Code https://github.com/KylinOLAP/Kylin 35
  • 36. Q & A 36