SlideShare a Scribd company logo
1 of 38
Apache Kylin
Balance between Space and Time
2015中国 Hadoop 技术峰会
周千昊
2015-07-23
@ApacheKylin | http://kylin.io
http://kylin.io
Agenda
 What’s Apache Kylin?
 Performance
 Tech Highlights
 Roadmap
 Q & A
Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that
provides SQL interface and multi-dimensional analysis (OLAP) on
Hadoop supporting extremely large datasets
What’s Kylin
kylin / ˈkiːˈlɪn / 麒麟
--n. (in Chinese art) a mythical animal of composite form
• Open Sourced on Oct 1st, 2014
• Be accepted as Apache Incubator Project on Nov 25th, 2014
• First release on Jun 2015
Big Data Era
 More and more data becoming available on Hadoop
 Limitations in existing Business Intelligence (BI) Tools
 Limited support for Hadoop
 Data size growing exponentially
 High latency of interactive queries
 Scale-Up architecture
 Challenges to adopt Hadoop as interactive analysis system
 Majority of analyst groups are SQL savvy
 No mature SQL interface on Hadoop
 OLAP capability on Hadoop ecosystem not ready yet
5
Why not
Build an engine from scratch?
 Extreme Scale OLAP Engine
Kylin is designed to query 10+ billions of rows on Hadoop
 ANSI SQL Interface on Hadoop
Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions
 Seamless Integration with BI Tools
Kylin currently offers integration capability with BI Tools like Tableau.
 Interactive Query Capability
Users can interact with Hive tables at sub-second latency
 MOLAP Cube
Define a data model from Hive tables and pre-build in Kylin
 Scale Out Architecture
Query server cluster supports thousands concurrent users and provide high availability
 Support of Streaming data
Streaming cubing is introduce in v0.8 to support minute level data latency
Features Highlights
 Compression and Encoding Support
 Incremental Refresh of Cubes
 Approximate Query Capability for distinct count (HyperLogLog)
 Leverage HBase Coprocessor for query performance
 Job Management and Monitoring
 Easy Web interface to manage, build, monitor and query cubes
 Security capability to set ACL at Cube/Project Level
 Support LDAP Integration
More Features Highlights…
Cube Designer
Job Management
Query and
Visualization
Tableau Integration
 eBay
 95% query < 5 seconds
 Baidu
 Baidu Map internal analysis
 Many other Proof of Concepts
 Bloomberg Law, British GAS, JD, Microsoft, StubHub, Tableau …
Who are using Kylin
Case Cube
Size
Raw Records
Session Analysis 14 TB 75+ billion rows
Traffic Analysis 21 TB 20+ billion rows
Transaction Analysis 560 GB 1.2+ billion rows
http://kylin.io
Agenda
 What’s Apache Kylin?
 Performance
 Tech Highlights
 Roadmap
 Q & A
Kylin vs. Hive
# Query
Type
Return
Dataset
Query
On Kylin (s)
Query
On Hive (s)
Comments
1 High Level
Aggregation
4 0.129 157.437 1,217 times
2 Analysis
Query
22,669 1.615 109.206 68 times
3 Drill Down to
Detail
325,029 12.058 113.123 9 times
4 Drill Down to
Detail
524,780 22.42 6383.21 278 times
5 Data Dump 972,002 49.054 N/A
0
50
100
150
200
SQL #1 SQL #2 SQL #3
Hive
Kylin
High
Level
Aggregati
on
Analysis
Query
Drill Down
to Detail
Low Level
Aggregati
on
Transactio
n Level
Based on 12+B records case
Performance -- Concurrency
Linear scale out with more nodes
http://kylin.io
Agenda
 What’s Apache Kylin?
 Performance
 Tech Highlights
 Roadmap
 Q & A
Kylin Architecture Overview
17
Cube Build Engine
(MapReduce…)
SQL
Low Latency -
Seconds
Mid Latency -
Minutes Routing
3rd Party App
(Web App, Mobile…)
Metadata
SQL-Based
Tool
(BI Tools: Tableau…)
Query Engine
Hadoop
Hive
REST API JDBC/ODBC
 Online Analysis Data Flow
 Offline Data Flow
 Clients/Users interactive with
Kylin via SQL
 OLAP Cube is transparent to
users
Star Schema Data Key Value Data
Data
Cube
OLAP
Cube
(HBase)
SQL
REST Server
OLAP Cube – Balance between Space and Time
time, item
time, item, location
time, item, location, supplier
time item location supplier
time, location
Time, supplier
item, location
item, supplier
location, supplier
time, item, supplier
time, location, supplier
item, location, supplier
0-D(apex) cuboid
1-D cuboids
2-D cuboids
3-D cuboids
4-D(base) cuboid
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combination of dimensions (all cuboids)
Data Modeling
Cube: …
Fact Table: …
Dimensions: …
Measures: …
Storage(HBase):
…
Fact
Dim Dim
Dim
Source
Star Schema
row A
row B
row C
Column
Family
Val 1
Val 2
Val 3
Row Key Column
Target
HBase Storage
Mapping
Cube Metadata
End User Cube
Modeler
Admin
How To Store Cube? – HBase Schema
Cube Build Job Flow
• Metadata SPI
– Provide table schema from Kylin metadata
• Optimize Rule
– Translate the logic operator into Kylin operator
• Relational Operator
– Find right cube
– Translate SQL into storage engine API call
– Generate physical execute plan by linq4j java implementation
• Result Enumerator
– Translate storage engine result into java implementation result.
• SQL Function
– Add HyperLogLog for distinct count
– Implement date time related functions (i.e. Quarter)
How to Query Cube?
Kylin Extensions on Calcite
Query Engine – Kylin Explain Plan
SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id =
test_category.site_id
LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New'
GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
test_kylin_fact.lstg_format_name,test_sites.site_name
OLAPToEnumerableConverter
OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3],
LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15],
CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))])
OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
 Plugin-able storage engine
 Common iterator interface for storage engine
 Isolate query engine from underline storage
 Translate cube query into HBase table scan
 Columns, Groups  Cuboid ID
 Filters -> Scan Range (Row Key)
 Aggregations -> Measure Columns (Row Values)
 Scan HBase table and translate HBase result into cube result
 HBase Result (key + value) -> Cube Result (dimensions +
measures)
How to Query Cube?
Extension of Storage Engine
 Data explosion
 Data duplication
 Data refreshment
 Data latency
How to Optimize Cube?
Optimization
 Full Cube
 Pre-aggregate all dimension combinations
 “Curse of dimensionality”: N dimension cube has 2N cuboid.
 Partial Cube
 To avoid dimension explosion, we divide the dimensions into
different aggregation groups
 2N+M+L  2N + 2M + 2L
 For cube with 30 dimensions, if we divide these dimensions
into 3 group, the cuboid number will reduce from 1 Billion to 3
Thousands
 230  210 + 210 + 210
 Tradeoff between online aggregation and offline pre-
aggregation
How to Optimize Cube?
Full Cube vs. Partial Cube
How to Optimize Cube?
Partial Cube
 Data cube has lots of duplicated dimension values
 Dictionary maps dimension values into IDs that will reduce the memory and
storage footprint.
 Dictionary is based on Trie
How to Optimize Cube?
Dictionary Encoding
How to Optimize Cube?
Incremental Build
 The current algorithm
 Many MRs, the number of
dimensions
 Huge shuffles, aggregation
at reduce side, 100x of total
cube size
Cube by Layer(Old)
Full Data
0-D Cuboid
1-D Cuboid
2-D Cuboid
3-D Cuboid
4-D Cuboid
MR
MR
MR
MR
MR
 The to-be algorithm,
30%-50% faster
 1 round MR
 Reduced shuffles, map side
aggregation, 20x total cube
size
 Hourly incremental build
done in tens of minutes
Cube by Segments(New)
Data Split
Cube
Segment
Data Split
Cube
Segment
Data Split
Cube
Segment
……
Final Cube
Merge Sort
(Shuffle)
mapper mapper mapper
 Cube is great w.r.t. query speed, but
 There is a large data latency because cube
building based on MR job is naturally slow
 Our customers needs real time analysis
badly
Kylin Streaming: ongoing effort
 Streaming data processing & failover
 In-memory cube building
 Cube Auto merge
New toolkits under the hood
Stream data
consuming
Cube Auto Merge
In-Memory
Cube
Building
Auto Cube
Merge with MR
http://kylin.io
Agenda
 What’s Apache Kylin?
 Performance
 Tech Highlights
 Roadmap
 Q & A
Kylin Evolution Roadmap
We are hiring
 Email:
 qianzhou@ebay.com
 Kylin Site:
 http://kylin.io
 Twitter/微博:
 @ApacheKylin
 微信公众号
 ApacheKylin

More Related Content

What's hot

Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupLuke Han
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache KylinYang Li
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting startedShubham Shirude
 
Apache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 BeijingApache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 BeijingLuke Han
 
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @ShanghaiLuke Han
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...Luke Han
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopTed Dunning
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanLuke Han
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataShi Shao Feng
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConamarsri
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingLuke Han
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylininovex GmbH
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering PrinciplesXu Jiang
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanLuke Han
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Seshu Adunuthula
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveXu Jiang
 
The Apache Way - Building Open Source Community in China - Luke Han
The Apache Way - Building Open Source Community in China - Luke HanThe Apache Way - Building Open Source Community in China - Luke Han
The Apache Way - Building Open Source Community in China - Luke HanLuke Han
 

What's hot (20)

Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Adding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark MeetupAdding Spark support to Kylin at Bay Area Spark Meetup
Adding Spark support to Kylin at Bay Area Spark Meetup
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Kylin olap part 1- getting started
Kylin olap   part 1- getting startedKylin olap   part 1- getting started
Kylin olap part 1- getting started
 
Apache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 BeijingApache Kylin Open Source Journey for QCon2015 Beijing
Apache Kylin Open Source Journey for QCon2015 Beijing
 
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
6. Apache Kylin Roadmap and Community - Apache Kylin Meetup @Shanghai
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
1. Apache Kylin Deep Dive - Streaming and Plugin Architecture - Apache Kylin ...
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
The Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke HanThe Evolution of Apache Kylin by Luke Han
The Evolution of Apache Kylin by Luke Han
 
Apache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big dataApache Kylin on HBase: Extreme OLAP engine for big data
Apache Kylin on HBase: Extreme OLAP engine for big data
 
Datacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheConDatacubes in Apache Hive at ApacheCon
Datacubes in Apache Hive at ApacheCon
 
Apache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 BeijingApache kylin - Big Data Technology Conference 2014 Beijing
Apache kylin - Big Data Technology Conference 2014 Beijing
 
Big Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache KylinBig Data MDX with Mondrian and Apache Kylin
Big Data MDX with Mondrian and Apache Kylin
 
Kylin Engineering Principles
Kylin Engineering PrinciplesKylin Engineering Principles
Kylin Engineering Principles
 
Apache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and JapanApache Kylin Use Cases in China and Japan
Apache Kylin Use Cases in China and Japan
 
The Evolution of Apache Kylin
The Evolution of Apache KylinThe Evolution of Apache Kylin
The Evolution of Apache Kylin
 
Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015Apache Kylin @ Big Data Europe 2015
Apache Kylin @ Big Data Europe 2015
 
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
 
The Apache Way - Building Open Source Community in China - Luke Han
The Apache Way - Building Open Source Community in China - Luke HanThe Apache Way - Building Open Source Community in China - Luke Han
The Apache Way - Building Open Source Community in China - Luke Han
 

Viewers also liked

Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Suzanne Carbonaro
 
Product Development Communications Executive Pat Scherschel 2011
Product Development Communications Executive Pat Scherschel 2011Product Development Communications Executive Pat Scherschel 2011
Product Development Communications Executive Pat Scherschel 2011patriciaindy2011
 
Technological applications and innovations
Technological applications and innovationsTechnological applications and innovations
Technological applications and innovationsrbulalakaw
 
Cultural diversity and unlocking the human potential
Cultural diversity and unlocking the human potentialCultural diversity and unlocking the human potential
Cultural diversity and unlocking the human potentialpitra.hutomo
 
Dispositivos de entrada
Dispositivos de entradaDispositivos de entrada
Dispositivos de entradapatiluki
 
In A Clean City, A Healthy Life Project
In A Clean City, A Healthy Life ProjectIn A Clean City, A Healthy Life Project
In A Clean City, A Healthy Life ProjectOlga Morozan
 
Jini new technology for a networked world
Jini new technology for a networked worldJini new technology for a networked world
Jini new technology for a networked worldSajan Sahu
 
Covestro y Ercros. tarragona
Covestro y Ercros. tarragonaCovestro y Ercros. tarragona
Covestro y Ercros. tarragonaoblanca
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Sajib Hossain Akash
 
Compass Fi Treasury Pp July2008
Compass Fi Treasury Pp July2008Compass Fi Treasury Pp July2008
Compass Fi Treasury Pp July2008ntrung
 
Basic Object Oriented Concepts
Basic Object Oriented ConceptsBasic Object Oriented Concepts
Basic Object Oriented ConceptsScott Lee
 
Latihan korelasi ganda
Latihan korelasi gandaLatihan korelasi ganda
Latihan korelasi gandapindp
 

Viewers also liked (20)

Aquamacs Manual
Aquamacs ManualAquamacs Manual
Aquamacs Manual
 
Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1Proactive Professionalism Networking Today Nj 1
Proactive Professionalism Networking Today Nj 1
 
Product Development Communications Executive Pat Scherschel 2011
Product Development Communications Executive Pat Scherschel 2011Product Development Communications Executive Pat Scherschel 2011
Product Development Communications Executive Pat Scherschel 2011
 
Technological applications and innovations
Technological applications and innovationsTechnological applications and innovations
Technological applications and innovations
 
Cultural diversity and unlocking the human potential
Cultural diversity and unlocking the human potentialCultural diversity and unlocking the human potential
Cultural diversity and unlocking the human potential
 
Dispositivos de entrada
Dispositivos de entradaDispositivos de entrada
Dispositivos de entrada
 
In A Clean City, A Healthy Life Project
In A Clean City, A Healthy Life ProjectIn A Clean City, A Healthy Life Project
In A Clean City, A Healthy Life Project
 
Makalah dematitis
Makalah dematitisMakalah dematitis
Makalah dematitis
 
Managic presentation english
Managic presentation englishManagic presentation english
Managic presentation english
 
Jini new technology for a networked world
Jini new technology for a networked worldJini new technology for a networked world
Jini new technology for a networked world
 
Covestro y Ercros. tarragona
Covestro y Ercros. tarragonaCovestro y Ercros. tarragona
Covestro y Ercros. tarragona
 
Pemeriksaaan thoraks
Pemeriksaaan thoraksPemeriksaaan thoraks
Pemeriksaaan thoraks
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.
 
Geohash
GeohashGeohash
Geohash
 
Compass Fi Treasury Pp July2008
Compass Fi Treasury Pp July2008Compass Fi Treasury Pp July2008
Compass Fi Treasury Pp July2008
 
Modul 5 kb 1
Modul 5   kb 1Modul 5   kb 1
Modul 5 kb 1
 
Basic Object Oriented Concepts
Basic Object Oriented ConceptsBasic Object Oriented Concepts
Basic Object Oriented Concepts
 
The Anatomy Of The Idea
The Anatomy Of The IdeaThe Anatomy Of The Idea
The Anatomy Of The Idea
 
Rxpay-sepa-germany-01
Rxpay-sepa-germany-01Rxpay-sepa-germany-01
Rxpay-sepa-germany-01
 
Latihan korelasi ganda
Latihan korelasi gandaLatihan korelasi ganda
Latihan korelasi ganda
 

Similar to Apache kylin (china hadoop summit 2015 shanghai)

ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015Luke Han
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for HadoopHBaseCon
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeDataWorks Summit
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentationargonauts007
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeDatabricks
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinTyler Wishnoff
 
When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?DataWorks Summit
 
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on AzureBuild 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on AzureWindows Developer
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQLYu Ishikawa
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummitamarsri
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptSanket Shikhar
 
HQL over Tiered Data Warehouse
HQL over Tiered Data WarehouseHQL over Tiered Data Warehouse
HQL over Tiered Data WarehouseDataWorks Summit
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAmazon Web Services
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for ExperimentationGleb Kanterov
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with SupersetDataWorks Summit
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)Stratebi
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Edwin Poot
 

Similar to Apache kylin (china hadoop summit 2015 shanghai) (20)

ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015ApacheKylin_HBaseCon2015
ApacheKylin_HBaseCon2015
 
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for HadoopHBaseCon 2015: Apache Kylin - Extreme OLAP  Engine for Hadoop
HBaseCon 2015: Apache Kylin - Extreme OLAP Engine for Hadoop
 
Apache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and TimeApache Kylin - Balance Between Space and Time
Apache Kylin - Balance Between Space and Time
 
Kylin and Druid Presentation
Kylin and Druid PresentationKylin and Druid Presentation
Kylin and Druid Presentation
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Cloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data LakeCloud-native Semantic Layer on Data Lake
Cloud-native Semantic Layer on Data Lake
 
Accelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache KylinAccelerating Big Data Analytics with Apache Kylin
Accelerating Big Data Analytics with Apache Kylin
 
When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?When OLAP Meets Real-Time, What Happens in eBay?
When OLAP Meets Real-Time, What Happens in eBay?
 
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on AzureBuild 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
Build 2017 - P4002 - Speedup Interactive Analytics on Petabytes of Data on Azure
 
2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL2017 09-27 democratize data products with SQL
2017 09-27 democratize data products with SQL
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Grill at HadoopSummit
Grill at HadoopSummitGrill at HadoopSummit
Grill at HadoopSummit
 
A Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.pptA Hands-on Intro to Data Science and R Presentation.ppt
A Hands-on Intro to Data Science and R Presentation.ppt
 
HQL over Tiered Data Warehouse
HQL over Tiered Data WarehouseHQL over Tiered Data Warehouse
HQL over Tiered Data Warehouse
 
mod 2.pdf
mod 2.pdfmod 2.pdf
mod 2.pdf
 
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek SinhaAWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
AWS Summit 2013 | India - Big Data Analytics, Abhishek Sinha
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Querying Druid in SQL with Superset
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with Superset
 
PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)PCM18 (Big Data Analytics)
PCM18 (Big Data Analytics)
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceIES VE
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Recently uploaded (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Apache kylin (china hadoop summit 2015 shanghai)

  • 1. Apache Kylin Balance between Space and Time 2015中国 Hadoop 技术峰会 周千昊 2015-07-23 @ApacheKylin | http://kylin.io
  • 2. http://kylin.io Agenda  What’s Apache Kylin?  Performance  Tech Highlights  Roadmap  Q & A
  • 3. Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from eBay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets What’s Kylin kylin / ˈkiːˈlɪn / 麒麟 --n. (in Chinese art) a mythical animal of composite form • Open Sourced on Oct 1st, 2014 • Be accepted as Apache Incubator Project on Nov 25th, 2014 • First release on Jun 2015
  • 4. Big Data Era  More and more data becoming available on Hadoop  Limitations in existing Business Intelligence (BI) Tools  Limited support for Hadoop  Data size growing exponentially  High latency of interactive queries  Scale-Up architecture  Challenges to adopt Hadoop as interactive analysis system  Majority of analyst groups are SQL savvy  No mature SQL interface on Hadoop  OLAP capability on Hadoop ecosystem not ready yet
  • 5. 5 Why not Build an engine from scratch?
  • 6.  Extreme Scale OLAP Engine Kylin is designed to query 10+ billions of rows on Hadoop  ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions  Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau.  Interactive Query Capability Users can interact with Hive tables at sub-second latency  MOLAP Cube Define a data model from Hive tables and pre-build in Kylin  Scale Out Architecture Query server cluster supports thousands concurrent users and provide high availability  Support of Streaming data Streaming cubing is introduce in v0.8 to support minute level data latency Features Highlights
  • 7.  Compression and Encoding Support  Incremental Refresh of Cubes  Approximate Query Capability for distinct count (HyperLogLog)  Leverage HBase Coprocessor for query performance  Job Management and Monitoring  Easy Web interface to manage, build, monitor and query cubes  Security capability to set ACL at Cube/Project Level  Support LDAP Integration More Features Highlights…
  • 12.  eBay  95% query < 5 seconds  Baidu  Baidu Map internal analysis  Many other Proof of Concepts  Bloomberg Law, British GAS, JD, Microsoft, StubHub, Tableau … Who are using Kylin Case Cube Size Raw Records Session Analysis 14 TB 75+ billion rows Traffic Analysis 21 TB 20+ billion rows Transaction Analysis 560 GB 1.2+ billion rows
  • 13. http://kylin.io Agenda  What’s Apache Kylin?  Performance  Tech Highlights  Roadmap  Q & A
  • 14. Kylin vs. Hive # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level Aggregation 4 0.129 157.437 1,217 times 2 Analysis Query 22,669 1.615 109.206 68 times 3 Drill Down to Detail 325,029 12.058 113.123 9 times 4 Drill Down to Detail 524,780 22.42 6383.21 278 times 5 Data Dump 972,002 49.054 N/A 0 50 100 150 200 SQL #1 SQL #2 SQL #3 Hive Kylin High Level Aggregati on Analysis Query Drill Down to Detail Low Level Aggregati on Transactio n Level Based on 12+B records case
  • 15. Performance -- Concurrency Linear scale out with more nodes
  • 16. http://kylin.io Agenda  What’s Apache Kylin?  Performance  Tech Highlights  Roadmap  Q & A
  • 17. Kylin Architecture Overview 17 Cube Build Engine (MapReduce…) SQL Low Latency - Seconds Mid Latency - Minutes Routing 3rd Party App (Web App, Mobile…) Metadata SQL-Based Tool (BI Tools: Tableau…) Query Engine Hadoop Hive REST API JDBC/ODBC  Online Analysis Data Flow  Offline Data Flow  Clients/Users interactive with Kylin via SQL  OLAP Cube is transparent to users Star Schema Data Key Value Data Data Cube OLAP Cube (HBase) SQL REST Server
  • 18. OLAP Cube – Balance between Space and Time time, item time, item, location time, item, location, supplier time item location supplier time, location Time, supplier item, location item, supplier location, supplier time, item, supplier time, location, supplier item, location, supplier 0-D(apex) cuboid 1-D cuboids 2-D cuboids 3-D cuboids 4-D(base) cuboid • Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item> • Cuboid = one combination of dimensions • Cube = all combination of dimensions (all cuboids)
  • 19. Data Modeling Cube: … Fact Table: … Dimensions: … Measures: … Storage(HBase): … Fact Dim Dim Dim Source Star Schema row A row B row C Column Family Val 1 Val 2 Val 3 Row Key Column Target HBase Storage Mapping Cube Metadata End User Cube Modeler Admin
  • 20. How To Store Cube? – HBase Schema
  • 22. • Metadata SPI – Provide table schema from Kylin metadata • Optimize Rule – Translate the logic operator into Kylin operator • Relational Operator – Find right cube – Translate SQL into storage engine API call – Generate physical execute plan by linq4j java implementation • Result Enumerator – Translate storage engine result into java implementation result. • SQL Function – Add HyperLogLog for distinct count – Implement date time related functions (i.e. Quarter) How to Query Cube? Kylin Extensions on Calcite
  • 23. Query Engine – Kylin Explain Plan SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = ’New' GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condition=[OR(=($3, 123456), =($5, ’New'))]) OLAPJoinRel(condition=[=($2, $25)], joinType=[left]) OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left]) OLAPJoinRel(condition=[=($4, $12)], joinType=[left]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])
  • 24.  Plugin-able storage engine  Common iterator interface for storage engine  Isolate query engine from underline storage  Translate cube query into HBase table scan  Columns, Groups  Cuboid ID  Filters -> Scan Range (Row Key)  Aggregations -> Measure Columns (Row Values)  Scan HBase table and translate HBase result into cube result  HBase Result (key + value) -> Cube Result (dimensions + measures) How to Query Cube? Extension of Storage Engine
  • 25.  Data explosion  Data duplication  Data refreshment  Data latency How to Optimize Cube? Optimization
  • 26.  Full Cube  Pre-aggregate all dimension combinations  “Curse of dimensionality”: N dimension cube has 2N cuboid.  Partial Cube  To avoid dimension explosion, we divide the dimensions into different aggregation groups  2N+M+L  2N + 2M + 2L  For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands  230  210 + 210 + 210  Tradeoff between online aggregation and offline pre- aggregation How to Optimize Cube? Full Cube vs. Partial Cube
  • 27. How to Optimize Cube? Partial Cube
  • 28.  Data cube has lots of duplicated dimension values  Dictionary maps dimension values into IDs that will reduce the memory and storage footprint.  Dictionary is based on Trie How to Optimize Cube? Dictionary Encoding
  • 29. How to Optimize Cube? Incremental Build
  • 30.  The current algorithm  Many MRs, the number of dimensions  Huge shuffles, aggregation at reduce side, 100x of total cube size Cube by Layer(Old) Full Data 0-D Cuboid 1-D Cuboid 2-D Cuboid 3-D Cuboid 4-D Cuboid MR MR MR MR MR
  • 31.  The to-be algorithm, 30%-50% faster  1 round MR  Reduced shuffles, map side aggregation, 20x total cube size  Hourly incremental build done in tens of minutes Cube by Segments(New) Data Split Cube Segment Data Split Cube Segment Data Split Cube Segment …… Final Cube Merge Sort (Shuffle) mapper mapper mapper
  • 32.  Cube is great w.r.t. query speed, but  There is a large data latency because cube building based on MR job is naturally slow  Our customers needs real time analysis badly Kylin Streaming: ongoing effort
  • 33.  Streaming data processing & failover  In-memory cube building  Cube Auto merge New toolkits under the hood
  • 36. http://kylin.io Agenda  What’s Apache Kylin?  Performance  Tech Highlights  Roadmap  Q & A
  • 38. We are hiring  Email:  qianzhou@ebay.com  Kylin Site:  http://kylin.io  Twitter/微博:  @ApacheKylin  微信公众号  ApacheKylin

Editor's Notes

  1. eCS 管理eBay全球EDW/Hadoop/BI/ETL等分析平台,10000+节点Hadoop,几十PB数据
  2. Olap Big data Vs ubuntu kylin Ebay 第一个贡献到apache的开源项目,也是完整由中国团队贡献到Apache的第一个项目
  3. Evaluation on Impala,Stinger, Phoenix
  4. 介绍query 1台机器4个tomcat instanc可以达到300左右的QPS
  5. 这页可以去掉,不需要帮Calcite做广告
  6. Xin, Dong, et al. "Star-cubing: Computing iceberg cubes by top-down and bottom-up integration." Proceedings of the 29th international conference on Very large data bases-Volume 29. VLDB Endowment, 2003.