Apache Kylin: Balance Between Space and Time for Big Data Analytics

Apache Kylin
Balance between Space and Time
Debashis Saha | Luke Han
2015-06-09
http://kylin.io | @ApacheKylin

About us
§ Debashis Saha (@debashis_saha)
- VP, eBay Cloud Services (Platform, Infrastructure, Data)
§ Luke Han (@lukehq)
- Sr. Product Manager, Analytics Data Infrastructure
- Committer & PMC Member of Apache Kylin

Agenda
§ About Apache Kylin
§ Feature Highlights
§ Tech Highlights
§ Roadmap
§ Q&A

What
kylin /ˈkiːˈlɪn/ 麒麟
n. (in Chinese art) a mythical animal of composite form

Extreme OLAP Engine for Big Data
Kylin is an open source distributed analytics engine, contributed by eBay Inc., that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop for extremely large datasets.
• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014
• http://kylin.io (http://kylin.incubator.apache.org)
@ApacheKylin

Who is using Kylin?
§ External
  - 25+ contributors in community
  - Adoption:
    - On Production: Baidu Map
    - Evaluation: Huawei, Bloomberg Law, British Gas, JD.com, Microsoft, Tableau…
§ eBay Internal Cases
  - 90th percentile query latency < 5 seconds
  Case                     Cube Size   Raw Records
  User Session Analysis    26 TB       28+ billion rows
  Traffic Analysis         21 TB       20+ billion rows
  Behavior Analysis        560 GB      1.2+ billion rows
(from mailing list)

Why
[Figure: chart with the labels Happiness, Latency, 10s, and size]

Balance Between Space and Time
OLAP Cube
[Figure: cuboid lattice for the dimensions time, item, location, supplier]
0-D (apex) cuboid: ()
1-D cuboids: (time); (item); (location); (supplier)
2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells
1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
2. (9/15, milk, Urbana, *) - <time, item, location>
3. (*, milk, Urbana, *) - <item, location>
4. (*, milk, Chicago, *) - <item, location>
5. (*, milk, *, *) - <item>
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
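
The cuboid lattice above can be reproduced in a few lines. The following is a minimal Python sketch, not Kylin code: the helper names `enumerate_cuboids` and `roll_up` are hypothetical, and only the dimension names and the example cell come from the slide.

```python
from itertools import combinations

# Dimension names taken from the slide's example.
DIMENSIONS = ("time", "item", "location", "supplier")


def enumerate_cuboids(dimensions):
    """Group every combination of dimensions by how many dimensions it keeps."""
    return {
        size: list(combinations(dimensions, size))
        for size in range(len(dimensions) + 1)
    }


def roll_up(cell, dimensions, kept):
    """Replace the value of every dimension not in `kept` with '*'."""
    return tuple(
        value if name in kept else "*"
        for name, value in zip(dimensions, cell)
    )


cuboids = enumerate_cuboids(DIMENSIONS)
for size, combos in cuboids.items():
    print(f"{size}-D: {len(combos)} cuboid(s)")

# Roll the base cell from the slide up to the <item, location> cuboid.
base_cell = ("9/15", "milk", "Urbana", "Dairy_land")
print(roll_up(base_cell, DIMENSIONS, {"item", "location"}))
```

With four dimensions the layers hold 1, 4, 6, 4, and 1 cuboids, 16 in total, and the rolled-up cell matches example 3 in the list above.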

How
[Figure: diagram with the labels BI Tools, Web App…; ANSI SQL; Kylin; Map Reduce]

Feature Highlights
• Extremely Fast OLAP Engine at scale
• ANSI SQL Interface on Hadoop
• Seamless Integration with BI Tools, like Tableau
• Interactive Query Capability
• MOLAP Cube
• Incremental Build of Cubes
• Approximate Query Capability for Distinct Count (HyperLogLog)
• Leverages HBase Coprocessor to reduce query latency
• Job Management and Monitoring
• User-friendly Web GUI to manage, build, monitor, and query cubes
• Security capability to set ACL at Cube/Project Level
• Support for LDAP Integration

Define Data Model

Explore the Data

Interact with BI Tool - Tableau

Kylin Architecture Overview
[Figure: architecture diagram; the recoverable labels are listed below]
- Clients: 3rd Party App (Web App, Mobile…); SQL-Based Tool (BI Tools: Tableau…)
- Access: REST API, JDBC/ODBC, SQL
- Kylin: REST Server, Query Engine, Routing, Metadata, Cube Build Engine (MapReduce…)
- Storage: Hadoop Hive (Star Schema Data); OLAP Cube (HBase, Key Value Data)
- Latency labels: Low Latency - Seconds; Mid Latency - Minutes
➢ Online Analysis Data Flow
➢ Offline Data Flow
➢ Clients/Users interact with Kylin via SQL
➢ OLAP Cube is transparent to users

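Data Modeling
[Figure: cube metadata maps a source star schema to target HBase storage; the recoverable labels are listed below]
- Source Star Schema: Fact table with Dim tables
- Cube Metadata: Cube: …; Fact Table: …; Dimensions: …; Measures: …; Storage(HBase): …
- Target HBase Storage: Row Key (row A, row B, row C); Column Family with Column values (Val 1, Val 2, Val 3)
- Roles: End User, Cube Modeler, Admin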

Cube Build Job Flow

How to Store Cube - HBase Schema

Kylin Query Engine - Explain Plan
SELECT test_cal_dt.week_beg_dt,
       test_category.category_name,
       test_category.lvl2_name,
       test_kylin_fact.lstg_format_name,
       test_sites.site_name,
       SUM(test_kylin_fact.price) AS GMV,
       COUNT(*) AS TRANS_CNT
FROM test_kylin_fact
LEFT JOIN test_cal_dt
  ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
LEFT JOIN test_category
  ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
 AND test_kylin_fact.lstg_site_id = test_category.site_id
LEFT JOIN test_sites
  ON test_kylin_fact.lstg_site_id = test_sites.site_id
WHERE test_kylin_fact.seller_id = 123456
   OR test_kylin_fact.lstg_format_name = 'New'
GROUP BY test_cal_dt.week_beg_dt,
         test_category.category_name,
         test_category.lvl2_name,
         test_kylin_fact.lstg_format_name,
         test_sites.site_name
OLAPToEnumerableConverter
  OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
    OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
      OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
        OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
          OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
            OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
              OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                … TEST_CAL_DT]], fields=[[0, 1]])
              … test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
            … TEST_SITES]], fields=[[0, 1, 2]])

Cube Optimization
§ Curse of Dimensionality
  - N dimension cube has 2^N cuboids
  - Full Cube vs. Partial Cube
§ Huge Data Volume
  - Dictionary Encoding
  - Incremental Building
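
The slide names dictionary encoding without describing it, so the following is only a generic Python sketch of the idea: each distinct dimension value is mapped to a small integer ID, so stored keys hold compact IDs instead of repeated strings. The function names and sample values are hypothetical, and the sketch does not claim to match Kylin's own dictionary implementation.

```python
def build_dictionary(values):
    """Map each distinct value to an integer ID, assigned in sorted order."""
    distinct = sorted(set(values))
    encode = {value: idx for idx, value in enumerate(distinct)}
    decode = distinct  # decode[idx] returns the original value
    return encode, decode


def bytes_needed(cardinality):
    """Bytes required to store the largest ID for the given cardinality."""
    largest_id = max(cardinality - 1, 0)
    return max(1, (largest_id.bit_length() + 7) // 8)


# Made-up sample column with repeated values.
locations = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago", "Peoria"]
encode, decode = build_dictionary(locations)
encoded_column = [encode[value] for value in locations]

print(encoded_column)
print([decode[idx] for idx in encoded_column] == locations)
print(bytes_needed(len(decode)))
```

Assigning IDs in sorted order keeps the ID order consistent with the value order, which is one reason sorted dictionaries are a common choice for range filters on encoded keys.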

Full Cube vs. Partial Cube
§ Full Cube
- Pre-aggregate all dimension combinations
- "Curse of dimensionality": N dimension cube has 2^N cuboids.
§ Partial Cube
- To avoid dimension explosion, we divide the dimensions into different aggregation groups
- 2^(N+M+L) → 2^N + 2^M + 2^L
- For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand
- 2^30 → 2^10 + 2^10 + 2^10
- Tradeoff between online aggregation and offline pre-aggregation
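
The arithmetic behind the 30-dimension example can be checked directly. This is a small Python sketch of the slide's counting formula only; the function names are hypothetical, and, like the slide, the partial count simply sums 2^size over the groups.

```python
def full_cube_cuboids(num_dimensions):
    """A full cube pre-aggregates every combination of dimensions."""
    return 2 ** num_dimensions


def partial_cube_cuboids(group_sizes):
    """With aggregation groups, combinations are built inside each group only."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])

print(f"full cube:    {full:,} cuboids")
print(f"partial cube: {partial:,} cuboids")
print(f"reduction:    {full // partial:,}x fewer cuboids")
```

The saving is paid for at query time: a query that combines dimensions from different groups cannot be answered from a single pre-built cuboid and needs online aggregation, which is the tradeoff the slide mentions.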

Incremental Build

What's Next
§ Improve cube algorithm
  - Cube by segments, 30%-50% faster
  - Build delay down to tens of minutes
§ Streaming cubing
  - Analyze real-time data
  - Build delay down to seconds
§ Spark

Cube by Layer
§ The current algorithm
- Many MR rounds, growing with the number of dimensions
- Huge shuffles: aggregation happens on the reduce side, 100x of total cube size
[Figure: Full Data followed by the 0-D, 1-D, 2-D, 3-D, and 4-D cuboids, connected by five MR steps]
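
To make the layer-by-layer idea concrete, here is a minimal in-memory Python sketch. It assumes the first round aggregates the raw rows into the base cuboid and each later round derives one layer from the layer with one more dimension; that ordering, the function names, and the sample rows are assumptions for illustration, not Kylin's MapReduce code.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(records, positions):
    """Sum the measures of (key, measure) records, grouped by the key fields at `positions`."""
    totals = defaultdict(int)
    for key, measure in records:
        totals[tuple(key[p] for p in positions)] += measure
    return dict(totals)


def build_cube_by_layer(rows, dimensions):
    """Return (cube, rounds); cube maps each cuboid (tuple of names) to its cells."""
    n = len(dimensions)
    base = tuple(dimensions)
    cube = {base: aggregate(rows, range(n))}  # round 1: raw rows -> base cuboid
    rounds = 1
    parents = [base]
    for size in range(n - 1, -1, -1):  # one further round per remaining layer
        layer = list(combinations(dimensions, size))
        for child in layer:
            # Any cuboid in the previous layer that contains the child's dimensions works.
            parent = next(p for p in parents if set(child) <= set(p))
            positions = [parent.index(d) for d in child]
            cube[child] = aggregate(cube[parent].items(), positions)
        parents = layer
        rounds += 1
    return cube, rounds


# Made-up sample rows: (time, item, location, supplier) and a numeric measure.
rows = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
    (("9/16", "bread", "Urbana", "Bake_co"), 5),
]
cube, rounds = build_cube_by_layer(rows, ("time", "item", "location", "supplier"))
print(rounds)
print(cube[("item",)])
print(cube[()])
```

In this sketch four dimensions need five rounds, and every round re-reads the output of the previous one, which is where the repeated shuffling cost on the slide comes from.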

Cube by Segments
§ The to-be algorithm, 30%-50% faster
- 1 round of MR
- Reduced shuffles: map-side aggregation, 20x of total cube size
- Hourly incremental build done in tens of minutes
[Figure: each mapper turns a Data Split into a Cube Segment; the Cube Segments pass through Merge Sort (Shuffle) into the Final Cube]
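
The single-round approach can be sketched the same way. In this hypothetical Python example, each "mapper" aggregates its own data split into a complete cube segment in memory, and a merge step sums cells that share a key. The names, the sample rows, and the use of `Counter` are illustrative assumptions, not Kylin's implementation.

```python
from collections import Counter
from itertools import combinations


def build_segment(split, dimensions):
    """Map side: aggregate one data split into every cuboid, all in memory."""
    n = len(dimensions)
    segment = Counter()
    for key, measure in split:
        for size in range(n + 1):
            for positions in combinations(range(n), size):
                cuboid = tuple(dimensions[p] for p in positions)
                cell = tuple(key[p] for p in positions)
                segment[(cuboid, cell)] += measure
    return segment


def merge_segments(segments):
    """Merge step: sum the cells that share the same (cuboid, cell) key."""
    final = Counter()
    for segment in segments:
        final.update(segment)  # Counter.update adds values for existing keys
    return final


dimensions = ("time", "item", "location", "supplier")
# Made-up data, divided into two splits as if handled by two mappers.
split_a = [
    (("9/15", "milk", "Urbana", "Dairy_land"), 3),
    (("9/15", "milk", "Chicago", "Dairy_land"), 2),
]
split_b = [
    (("9/16", "bread", "Urbana", "Bake_co"), 5),
    (("9/15", "milk", "Urbana", "Dairy_land"), 1),
]
final_cube = merge_segments(build_segment(s, dimensions) for s in (split_a, split_b))
print(final_cube[(("item",), ("milk",))])
print(final_cube[((), ())])
```

Because each split is pre-aggregated before anything is shuffled, only already-combined cells cross the network, which is the intuition behind the reduced shuffle volume on the slide.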

Streaming Cubing
§ Cube is great but…
- Cube takes time to build, how about real-time analysis?
- Sometimes we want to drill down to row level information
§ Streaming cubing
- Build micro cube segments from streaming data
- Use an inverted index to capture last minute data

Kylin Lambda Architecture
[Figure: Hybrid Storage behind the Query Engine's ANSI SQL Interface; the recoverable labels are listed below]
- Streaming: seconds delay
- Last Hour: Inverted Index
- Before Last Hour: Cube
- minutes delay
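
One way to read the hybrid storage idea is that a query's time range is split at the "last hour" boundary, with older data served from the cube and recent data from the inverted index. The Python sketch below illustrates that routing under this assumption; the function name, the store labels, and the one-hour default are hypothetical and not taken from Kylin's code.

```python
from datetime import datetime, timedelta


def plan_hybrid_query(start, end, now, cube_lag=timedelta(hours=1)):
    """Split the half-open range [start, end) between the cube and the inverted index."""
    boundary = now - cube_lag  # data older than this is assumed to be in the cube
    parts = []
    cube_end = min(end, boundary)
    if start < cube_end:
        parts.append(("cube", start, cube_end))
    index_start = max(start, boundary)
    if index_start < end:
        parts.append(("inverted_index", index_start, end))
    return parts


now = datetime(2015, 6, 9, 12, 0)
for store, part_start, part_end in plan_hybrid_query(
    datetime(2015, 6, 9, 10, 0), datetime(2015, 6, 9, 12, 0), now
):
    print(store, part_start.time(), part_end.time())
```

A query engine following this plan would run each part against its store and combine the partial aggregates, so the user still sees a single SQL interface.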

Adding Spark Support
§ Cubing Efficiency
  § MR is not the optimal framework
  § Spark Cubing Engine
§ Source from SparkSQL
  § Read data from SparkSQL instead of Hive
§ Route to SparkSQL
  § Unsupported queries are covered by SparkSQL

Kylin Evolution Roadmap
[Figure: timeline spanning 2013, 2014, 2015, 2016, and Future; dates marked: Sep, 2013; Jan, 2014; Oct, 2014; H1, 2015; TBD]
Initial Prototype for MOLAP
• Basic end to end POC
MOLAP
• Incremental Refresh
• ANSI SQL
• ODBC Driver
• Web GUI
• Tableau
• ACL
• Open Source
StreamingOLAP
• Streaming OLAP
• JDBC Driver
• New UI
• Excel
• SparkSQL
• … more
HybridOLAP
• Lambda Arch
• Automation
• Capacity Management
• Spark
• … more
Next Gen
• Adv OLAP Functions
• In-Memory Analysis (TBD)
• Mobile (TBD)
• … more

Kylin Ecosystem
■ Kylin Core
  ■ Fundamental framework of Kylin OLAP Engine
■ Extension
  ■ Plugins that support additional functions and features
■ Integration
  ■ Lifecycle Management Support to integrate with other applications
■ Interface
  ■ Allows third-party users to build more features via user interfaces atop Kylin core
[Figure: Kylin OLAP Core surrounded by Extension, Interface, and Integration]
Extension
→ Security
→ Redis Storage
→ Spark Engine
→ Docker
Interface
→ Web Console
→ Customized BI
→ Ambari/Hue Plugin
Integration
→ ODBC Driver
→ ETL
→ Drill
→ SparkSQL

If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
dev@kylin.incubator.apache.org
