© 2014 MapR Technologies 1© 2014 MapR Technologies
© 2014 MapR Technologies 2
Who I am
Ted Dunning, Chief Applications Architect, MapR Technologies
Email tdunning@mapr.com tdunning@apache.org
Twitter @Ted_Dunning
VP Incubator
Email tdunning@apache.org
Twitter @ApacheMahout @ApacheDrill
Credit for slides to Luke Han and the Kylin dev team
© 2014 MapR Technologies 3
Kylin Committers
ankur Ankur Bansal
jiangxu Jiang Xu
liyang Li Yang
lukehan Luke Han*
mahongbin Hongbin Ma
xduo Xiaodong Duo
yisong George Song
jhyde Julian Hyde
The real deal
Calcite plenipotentiary
© 2014 MapR Technologies 4
Agenda
• What is Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
© 2014 MapR Technologies 5
What is Kylin?
Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from (originally from eBay) that
provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop for
extremely large datasets
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open Sourced on Oct 1st, 2014
• Accepted into incubation November, 2014
• Preparing for first Apache release
© 2014 MapR Technologies 6
Big Data Obligatory Slide
• More and more data becoming available on Hadoop
• Limitations in existing Business Intelligence (BI) Tools
– Limited support for Hadoop
– Data size growing exponentially
– High latency of interactive queries
– Scale-Up architecture
• Challenges to adopt Hadoop as interactive analysis system
– Majority of analyst groups are SQL savvy
– No mature SQL interface on Hadoop
– OLAP capability on Hadoop ecosystem not ready yet
© 2014 MapR Technologies 7
Goals
• Sub-second query latency on billions of rows
• ANSI SQL for both analysts and engineers
• Full OLAP capability to offer advanced functionality
• Seamless Integration with BI Tools
• Support for high cardinality and dimensionality
• High concurrency – thousands of end users
• Distributed and scale out architecture for large data volume
© 2014 MapR Technologies 8
Possible Strategies
• Build from scratch
– A grand tradition
– Large-scale SQL support is much harder than it looks
– Huge level of distraction
• Patch Hive
– Not feasible due to design assumptions in Hive
– Weak optimizer
– Hive isn’t standard SQL anyway
– (but isn’t Hive moving to Calcite?)
© 2014 MapR Technologies 9
Kylin’s Strategy
• Use Calcite as SQL core
– Real SQL
– Real cost-based optimizer
– Already in Apache
– Provides linkage to Apache Drill and future of Hive
• Build cubes externally
– Don’t care which tools, currently Hive, soon Spark
• Use Calcite’s Rex interpreter
– Assumes final aggregations fit on one machine
• Possibly integrate with Drill at some point for parallel execution
© 2014 MapR Technologies 10
Transaction
Operation
Strategy
Analytics Query Taxonomy
High Level
Aggregation
• Very High Level, e.g GMV
by site by vertical by weeks
Analysis
Query
• Mid-level, e.g GMV by site
by vertical, by category
(level x) past 12 weeks
Drill Down
to Detail • Detail Level (Summary Table)
Low Level
Aggregation
• First Level
Aggregation
Transaction
Level
• Transaction
Data
OLAP
Kylin is designed to accelerate 80+% of analytics queries on Hadoop
OLTP
© 2014 MapR Technologies 11
Technical Challenges
• Huge volume data
– Table scan
• Big table joins
– Data shuffling
• Analysis on different granularity
– Runtime aggregation expensive
• Map Reduce job
– Batch processing
© 2014 MapR Technologies 12
How Cubes Work
• Start with a simple table
– revenue,time,item,location,supplier
• Build a table of aggregates for every combination of fields
select sum(revenue), max(revenue), supplier from tbl group by time,item,location;
select sum(revenue), max(revenue), location,supplier from tbl group by time,item;
select sum(revenue), max(revenue), location from tbl group by time,item,supplier;
…
• Then transform queries using appropriate magic
select sum(revenue), city from tbl join location_details
where state = ‘MN’ group by city
 select … from (select sum(),location from cube) join location_details
where state = ‘MN’ group by city
© 2014 MapR Technologies 13
How Cubes Don’t Work
• Total number of cubes is exponential in columns
• High cardinality can result in large cubes
• Skewed data can make cubes larger as original data
• Magic may be insufficient to recognize cubable queries
• Keeping cubes up to date can be hard
• Forget OLTP thoughts like pervasive transactions
© 2014 MapR Technologies 15
OLAP Cube – Balance between Space and Time
Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
Cuboid = one combination of dimensions
Cube = all combinations of dimensions
1111
0111 1011 1101 1110
0011 0101 0110 1001 1010 1100
0001 0010 0100 1000
0000
© 2014 MapR Technologies 16
From Relational to Key-Value
© 2014 MapR Technologies 17
Kylin Architecture Overview
17
Cube Build Engine
(MapReduce…)
SQL
Low Latency - SecondsMid Latency - Minutes
Routing
3rd Party App
(Web App, Mobile…)
Metadata
SQL Tools
(BI Tools: Tableau…)
Query Engine
Hadoop
Hive
REST API JDBC/ODBC
 Online Analysis Data Flow
 Offline Data Flow
 Clients/Users interactive with Kylin via
SQL
 OLAP Cube is transparent to users
Star Schema Data Key Value Data
Data CubeOLAP
Cube
(HBase)
SQL
REST Server
© 2014 MapR Technologies 18
Kylin Depends on Hadoop Eco-system
• Hive
– Input source, pre-join star schema during cube building
• MapReduce
– Aggregate metrics during cube building
• HDFS
– Store intermediate files during cube building
• HBase
– Store and query data cubes
• Calcite
– SQL parsing, code generation, optimization
© 2014 MapR Technologies 19
Agenda
• What is Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
© 2014 MapR Technologies 20
Kylin Highlights
• Extremely Fast OLAP Engine at Scale
Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data to seconds
• ANSI SQL Interface on Hadoop
Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
• Seamless Integration with BI Tools
Kylin currently offers integration capability with BI Tools like Tableau.
• Interactive Query Capability
Users can interact with Hadoop data via Kylin at sub-second latency
• MOLAP Cube
User can define a data model and pre-build in Kylin with more than 10+ billions of raw data
records
© 2014 MapR Technologies 21
More Highlights
• Compression and Encoding Support
• Incremental Refresh of Cubes
• Approximate Query Capability for distinct Count (HyperLogLog)
• Leverage HBase Coprocessor for query latency
• Job Management and Monitoring
• Easy Web interface to manage, build, monitor and query cubes
• Security capability to set ACL at Cube/Project Level
• Support LDAP Integration
© 2014 MapR Technologies 22
Cube Designer
© 2014 MapR Technologies 23
Job Management
© 2014 MapR Technologies 24
Query and Visualization
© 2014 MapR Technologies 25
Tableau Integration
© 2014 MapR Technologies 26
Data Modeling Points of View
Cube: …
Fact Table: …
Dimensions: …
Measures: …
Storage(HBase): …
Fact
Dim Dim
Dim
Source
Star Schema
row A
row B
row C
Column Family
Val 1
Val 2
Val 3
Row Key Column
Target
HBase Storage
Mapping
Cube Metadata
End User Cube Modeler Admin
© 2014 MapR Technologies 27
Process Flow
Source Joined
tables
Build
dict
Dimension
dictionaries
Hive
© 2014 MapR Technologies 28
Process Flow
Joined n cuboid
n-1 cuboids
Apex cuboid
MR MR
Dimension
dictionaries
MR
© 2014 MapR Technologies 29
Process flow
n cuboid n-1 cuboids Apex cuboid
MR
H-files
HBase
© 2014 MapR Technologies 30
How To Store Cube? – HBase Schema
© 2014 MapR Technologies 31
SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,
test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New'
GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
test_kylin_fact.lstg_format_name,test_sites.site_name
OLAPToEnumerableConverter
OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4],
SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5],
SITE_NAME=[$23], PRICE=[$0])
OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))])
OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
Query Engine – Kylin Explain Plan
© 2014 MapR Technologies 32
Now Let’s Make it Really Work
• Full Cube
– Pre-aggregate all dimension combinations
– “Curse of dimensionality”: N dimension cube has 2N cuboid.
• Partial Cube
– To avoid dimension explosion, we divide the dimensions into different
aggregation groups
• 2N+M+L  2N + 2M + 2L
– For cube with 30 dimensions, if we divide these dimensions into 3
group, the cuboid count is reduced from 1 Billion to 3 thousand
• 230  210 + 210 + 210
– Tradeoff between online aggregation and offline pre-aggregation
© 2014 MapR Technologies 34
Incremental Cube Building
© 2014 MapR Technologies 36
Agenda
• What is Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
© 2014 MapR Technologies 37
# Query Type Return Dataset Query
On Kylin (s)
Query
On Hive (s)
Comments
1 High Level
Aggregation
4 0.129 157.437 1,217 times
2 Analysis Query 22,669 1.615 109.206 68 times
3 Drill Down to Detail 325,029 12.058 113.123 9 times
4 Drill Down to Detail 524,780 22.42 6383.21 278 times
5 Data Dump 972,002 49.054 N/A
0
50
100
150
200
SQL #1 SQL #2 SQL #3
Hive
Kylin
High
Level
Aggregati
on
Analysis
Query
Drill Down
to Detail
Low Level
Aggregati
on
Transactio
n Level
Based on 12+B records
Kylin vs. Hive
© 2014 MapR Technologies 38
Performance Scaleout
Linear scale out with more nodes
© 2014 MapR Technologies 39
Performance - Query Latency
99 %-ile
95 %-ile
© 2014 MapR Technologies 40
Agenda
• What is Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
© 2014 MapR Technologies 41
201520142013
Initial
Prototype
for MOLAP
• Basic end to end
POC
MOLAP
• Incremental
Refresh
• ANSI SQL
• ODBC Driver
• Web GUI
• ACL
• Open Source
HOLAP
• Streaming OLAP
• JDBC Driver
• New UI
• Excel Support
• … more
Next Gen
• Automation
• Capacity
Management
• In-Memory Analysis
(TBD)
• Spark (TBD)
• … more
TBD
Future…
Sep, 2013
Jan, 2014
Sep, 2014
Q1, 2015
Kylin History and Roadmap
© 2014 MapR Technologies 42
Kylin Ecosystem
• Kylin Core
– Fundamental framework of Kylin
OLAP Engine
• Extension
– Plugins to support for additional
functions and features
• Integration
– Lifecycle Management Support to
integrate with other applications
• Interface
– Allows for third party users to build
more features via user-interface atop
Kylin core
• Driver
– ODBC and JDBC Drivers
Kylin OLAP
Core
Extension
 Security
 Redis Storage
 Spark Engine
 Docker
Interface
 Web Console
 Customized BI
 Ambari/Hue Plugin
Integration
 ODBC Driver
 ETL
 Drill
 SparkSQL
© 2014 MapR Technologies 44
If you want to go fast, go alone.
If you want to go far, go together.
--African Proverb
© 2014 MapR Technologies 45
Agenda
• What is Apache Kylin?
• Features & Tech Highlights
• Performance
• Roadmap
• Q & A
© 2014 MapR Technologies 46
Q&A
@mapr maprtech
tdunning@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies

Apache Kylin - OLAP Cubes for SQL on Hadoop

  • 1.
    © 2014 MapRTechnologies 1© 2014 MapR Technologies
  • 2.
    © 2014 MapRTechnologies 2 Who I am Ted Dunning, Chief Applications Architect, MapR Technologies Email tdunning@mapr.com tdunning@apache.org Twitter @Ted_Dunning VP Incubator Email tdunning@apache.org Twitter @ApacheMahout @ApacheDrill Credit for slides to Luke Han and the Kylin dev team
  • 3.
    © 2014 MapRTechnologies 3 Kylin Committers ankur Ankur Bansal jiangxu Jiang Xu liyang Li Yang lukehan Luke Han* mahongbin Hongbin Ma xduo Xiaodong Duo yisong George Song jhyde Julian Hyde The real deal Calcite plenipotentiary
  • 4.
    © 2014 MapRTechnologies 4 Agenda • What is Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 5.
    © 2014 MapRTechnologies 5 What is Kylin? Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from (originally from eBay) that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Open Sourced on Oct 1st, 2014 • Accepted into incubation November, 2014 • Preparing for first Apache release
  • 6.
    © 2014 MapRTechnologies 6 Big Data Obligatory Slide • More and more data becoming available on Hadoop • Limitations in existing Business Intelligence (BI) Tools – Limited support for Hadoop – Data size growing exponentially – High latency of interactive queries – Scale-Up architecture • Challenges to adopt Hadoop as interactive analysis system – Majority of analyst groups are SQL savvy – No mature SQL interface on Hadoop – OLAP capability on Hadoop ecosystem not ready yet
  • 7.
    © 2014 MapRTechnologies 7 Goals • Sub-second query latency on billions of rows • ANSI SQL for both analysts and engineers • Full OLAP capability to offer advanced functionality • Seamless Integration with BI Tools • Support for high cardinality and dimensionality • High concurrency – thousands of end users • Distributed and scale out architecture for large data volume
  • 8.
    © 2014 MapRTechnologies 8 Possible Strategies • Build from scratch – A grand tradition – Large-scale SQL support is much harder than it looks – Huge level of distraction • Patch Hive – Not feasible due to design assumptions in Hive – Weak optimizer – Hive isn’t standard SQL anyway – (but isn’t Hive moving to Calcite?)
  • 9.
    © 2014 MapRTechnologies 9 Kylin’s Strategy • Use Calcite as SQL core – Real SQL – Real cost-based optimizer – Already in Apache – Provides linkage to Apache Drill and future of Hive • Build cubes externally – Don’t care which tools, currently Hive, soon Spark • Use Calcite’s Rex interpreter – Assumes final aggregations fit on one machine • Possibly integrate with Drill at some point for parallel execution
  • 10.
    © 2014 MapRTechnologies 10 Transaction Operation Strategy Analytics Query Taxonomy High Level Aggregation • Very High Level, e.g GMV by site by vertical by weeks Analysis Query • Mid-level, e.g GMV by site by vertical, by category (level x) past 12 weeks Drill Down to Detail • Detail Level (Summary Table) Low Level Aggregation • First Level Aggregation Transaction Level • Transaction Data OLAP Kylin is designed to accelerate 80+% of analytics queries on Hadoop OLTP
  • 11.
    © 2014 MapRTechnologies 11 Technical Challenges • Huge volume data – Table scan • Big table joins – Data shuffling • Analysis on different granularity – Runtime aggregation expensive • Map Reduce job – Batch processing
  • 12.
    © 2014 MapRTechnologies 12 How Cubes Work • Start with a simple table – revenue,time,item,location,supplier • Build a table of aggregates for every combination of fields select sum(revenue), max(revenue), supplier from tbl group by time,item,location; select sum(revenue), max(revenue), location,supplier from tbl group by time,item; select sum(revenue), max(revenue), location from tbl group by time,item,supplier; … • Then transform queries using appropriate magic select sum(revenue), city from tbl join location_details where state = ‘MN’ group by city  select … from (select sum(),location from cube) join location_details where state = ‘MN’ group by city
  • 13.
    © 2014 MapRTechnologies 13 How Cubes Don’t Work • Total number of cubes is exponential in columns • High cardinality can result in large cubes • Skewed data can make cubes larger as original data • Magic may be insufficient to recognize cubable queries • Keeping cubes up to date can be hard • Forget OLTP thoughts like pervasive transactions
  • 14.
    © 2014 MapRTechnologies 15 OLAP Cube – Balance between Space and Time Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item> Cuboid = one combination of dimensions Cube = all combinations of dimensions 1111 0111 1011 1101 1110 0011 0101 0110 1001 1010 1100 0001 0010 0100 1000 0000
  • 15.
    © 2014 MapRTechnologies 16 From Relational to Key-Value
  • 16.
    © 2014 MapRTechnologies 17 Kylin Architecture Overview 17 Cube Build Engine (MapReduce…) SQL Low Latency - SecondsMid Latency - Minutes Routing 3rd Party App (Web App, Mobile…) Metadata SQL Tools (BI Tools: Tableau…) Query Engine Hadoop Hive REST API JDBC/ODBC  Online Analysis Data Flow  Offline Data Flow  Clients/Users interactive with Kylin via SQL  OLAP Cube is transparent to users Star Schema Data Key Value Data Data CubeOLAP Cube (HBase) SQL REST Server
  • 17.
    © 2014 MapRTechnologies 18 Kylin Depends on Hadoop Eco-system • Hive – Input source, pre-join star schema during cube building • MapReduce – Aggregate metrics during cube building • HDFS – Store intermediate files during cube building • HBase – Store and query data cubes • Calcite – SQL parsing, code generation, optimization
  • 18.
    © 2014 MapRTechnologies 19 Agenda • What is Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 19.
    © 2014 MapRTechnologies 20 Kylin Highlights • Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data to seconds • ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions • Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau. • Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency • MOLAP Cube User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records
  • 20.
    © 2014 MapRTechnologies 21 More Highlights • Compression and Encoding Support • Incremental Refresh of Cubes • Approximate Query Capability for distinct Count (HyperLogLog) • Leverage HBase Coprocessor for query latency • Job Management and Monitoring • Easy Web interface to manage, build, monitor and query cubes • Security capability to set ACL at Cube/Project Level • Support LDAP Integration
  • 21.
    © 2014 MapRTechnologies 22 Cube Designer
  • 22.
    © 2014 MapRTechnologies 23 Job Management
  • 23.
    © 2014 MapRTechnologies 24 Query and Visualization
  • 24.
    © 2014 MapRTechnologies 25 Tableau Integration
  • 25.
    © 2014 MapRTechnologies 26 Data Modeling Points of View Cube: … Fact Table: … Dimensions: … Measures: … Storage(HBase): … Fact Dim Dim Dim Source Star Schema row A row B row C Column Family Val 1 Val 2 Val 3 Row Key Column Target HBase Storage Mapping Cube Metadata End User Cube Modeler Admin
  • 26.
    © 2014 MapRTechnologies 27 Process Flow Source Joined tables Build dict Dimension dictionaries Hive
  • 27.
    © 2014 MapRTechnologies 28 Process Flow Joined n cuboid n-1 cuboids Apex cuboid MR MR Dimension dictionaries MR
  • 28.
    © 2014 MapRTechnologies 29 Process flow n cuboid n-1 cuboids Apex cuboid MR H-files HBase
  • 29.
    © 2014 MapRTechnologies 30 How To Store Cube? – HBase Schema
  • 30.
    © 2014 MapRTechnologies 31 SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]]) Query Engine – Kylin Explain Plan
  • 31.
    © 2014 MapRTechnologies 32 Now Let’s Make it Really Work • Full Cube – Pre-aggregate all dimension combinations – “Curse of dimensionality”: N dimension cube has 2N cuboid. • Partial Cube – To avoid dimension explosion, we divide the dimensions into different aggregation groups • 2N+M+L  2N + 2M + 2L – For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid count is reduced from 1 Billion to 3 thousand • 230  210 + 210 + 210 – Tradeoff between online aggregation and offline pre-aggregation
  • 32.
    © 2014 MapRTechnologies 34 Incremental Cube Building
  • 33.
    © 2014 MapRTechnologies 36 Agenda • What is Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 34.
    © 2014 MapRTechnologies 37 # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A 0 50 100 150 200 SQL #1 SQL #2 SQL #3 Hive Kylin High Level Aggregati on Analysis Query Drill Down to Detail Low Level Aggregati on Transactio n Level Based on 12+B records Kylin vs. Hive
  • 35.
    © 2014 MapRTechnologies 38 Performance Scaleout Linear scale out with more nodes
  • 36.
    © 2014 MapRTechnologies 39 Performance - Query Latency 99 %-ile 95 %-ile
  • 37.
    © 2014 MapRTechnologies 40 Agenda • What is Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 38.
    © 2014 MapRTechnologies 41 201520142013 Initial Prototype for MOLAP • Basic end to end POC MOLAP • Incremental Refresh • ANSI SQL • ODBC Driver • Web GUI • ACL • Open Source HOLAP • Streaming OLAP • JDBC Driver • New UI • Excel Support • … more Next Gen • Automation • Capacity Management • In-Memory Analysis (TBD) • Spark (TBD) • … more TBD Future… Sep, 2013 Jan, 2014 Sep, 2014 Q1, 2015 Kylin History and Roadmap
  • 39.
    © 2014 MapRTechnologies 42 Kylin Ecosystem • Kylin Core – Fundamental framework of Kylin OLAP Engine • Extension – Plugins to support for additional functions and features • Integration – Lifecycle Management Support to integrate with other applications • Interface – Allows for third party users to build more features via user-interface atop Kylin core • Driver – ODBC and JDBC Drivers Kylin OLAP Core Extension  Security  Redis Storage  Spark Engine  Docker Interface  Web Console  Customized BI  Ambari/Hue Plugin Integration  ODBC Driver  ETL  Drill  SparkSQL
  • 40.
    © 2014 MapRTechnologies 44 If you want to go fast, go alone. If you want to go far, go together. --African Proverb
  • 41.
    © 2014 MapRTechnologies 45 Agenda • What is Apache Kylin? • Features & Tech Highlights • Performance • Roadmap • Q & A
  • 42.
    © 2014 MapRTechnologies 46 Q&A @mapr maprtech tdunning@mapr.com Engage with us! MapR maprtech mapr-technologies