This document describes an interactive batch query system for game analytics based on Apache Drill. It addresses the problem of answering common ad-hoc queries over large volumes of log data by using a columnar data model and optimizing query plans. The system utilizes Drill's schema-free data model and vectorized query processing. It further improves performance by merging similar queries, reusing intermediate results, and pushing execution downwards to utilize multi-core CPUs. This provides a unified solution for both ad-hoc and scheduled batch analytics workloads at large scale.
This document discusses RabbitMQ, an open-source message broker. It provides instructions for installing RabbitMQ on Debian/Ubuntu systems and accessing the RabbitMQ management interface. It also lists some common RabbitMQ concepts like virtual hosts, exchanges, queues, and bindings and provides links to RabbitMQ tutorials and examples in different programming languages.
Embracing Clojure: a journey into Clojure adoption - Luca Grulla
What happens when a small team of very experienced developers with no real functional programming experience decides to use Clojure to run a core system architecture component?
This is the story of my team's two-year journey with Clojure, sharing learnings, epiphanies, and successes, as well as some of the challenges we encountered.
The document discusses functional programming concepts in Clojure including immutable data structures like lists, vectors, and maps. It compares the functional style of Clojure to the object-oriented style of Java, showing how Clojure allows data to flow through transformations without side effects. Key points covered include Clojure's homoiconic nature, pure functions, and use of transformations and composition over iterative steps.
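The functional-versus-imperative contrast the summary describes can be sketched outside Clojure as well. A minimal Python analogue (illustrative only; the talk itself uses Clojure's sequence functions):

```python
from functools import reduce

orders = [
    {"item": "book", "qty": 2, "price": 12.0},
    {"item": "pen", "qty": 10, "price": 1.5},
    {"item": "desk", "qty": 1, "price": 90.0},
]

# Imperative style: a mutable accumulator updated step by step.
total = 0.0
for o in orders:
    if o["qty"] * o["price"] > 10:
        total += o["qty"] * o["price"]

# Functional style: data flows through pure transformations;
# nothing is mutated and there is no intermediate state to track.
line_totals = map(lambda o: o["qty"] * o["price"], orders)
big_lines = filter(lambda t: t > 10, line_totals)
total_fp = reduce(lambda a, b: a + b, big_lines, 0.0)

assert total == total_fp
print(total_fp)  # 129.0
```

Both styles compute the same answer; the functional version composes small transformations instead of threading state through a loop, which is the point the talk makes about Clojure.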
The document describes the main components of Production Planning and Control (PCP), including the development of a strategic business plan, a tactical production plan, a master production schedule, material requirements planning, capacity management, and online execution control, to help organizations improve performance and efficiency.
How to write a Neutron Plugin - if you really need to - salv_orlando
Slides for the talk by Salvatore Orlando and Armando Migliaccio at the OpenStack Summit - Fall 2013 in Hong Kong
Talk abstract: http://openstacksummitnovember2013.sched.org/event/c6478ecf54d639de3b8b9958bfe9d450#.UnLEI5ROpU0
Use Neutron instead of nova-network
● neutron_url = http://neutron:9696
● neutron_auth_strategy = keystone
● neutron_admin_auth_url = http://keystone:35357/v2.0
● neutron_admin_username = neutron
● neutron_admin_tenant_name = service
● neutron_admin_password = password
Nova interaction with Neutron
1. Create network, subnet, router, etc. via the Neutron API
2. Boot VM, pass network info to Neutron
3. Attach ports, floating IP via Neutron
4. On delete,
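The interaction steps above can be sketched as a toy mock in Python. The class and function names here are hypothetical stand-ins for illustration, not the real Nova or python-neutronclient APIs (real Nova talks to the Neutron REST endpoint configured as neutron_url):

```python
class FakeNeutron:
    """Stands in for the Neutron API: tracks networks and ports."""
    def __init__(self):
        self.networks = {}
        self.ports = {}
        self._next_id = 0

    def _new_id(self, prefix):
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    def create_network(self, name):          # step 1: networks made via Neutron API
        net_id = self._new_id("net")
        self.networks[net_id] = name
        return net_id

    def create_port(self, net_id, device):   # steps 2-3: on boot, Nova passes
        port_id = self._new_id("port")       # network info and attaches a port
        self.ports[port_id] = (net_id, device)
        return port_id

    def delete_port(self, port_id):          # step 4: cleanup when the VM goes away
        del self.ports[port_id]


def boot_vm(neutron, net_id, vm_name):
    """Nova-side sketch: ask Neutron for a port on the given network."""
    return neutron.create_port(net_id, vm_name)


neutron = FakeNeutron()
net = neutron.create_network("private")
port = boot_vm(neutron, net, "vm-1")
assert port in neutron.ports
neutron.delete_port(port)                    # VM deleted -> port released
assert port not in neutron.ports
```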
Clojure: Towards The Essence Of Programming (What's Next? Conference, May 2011) - Howard Lewis Ship
The document discusses Clojure and its approach to programming. It begins by defining the essence of programming as the intrinsic nature that determines a language's character. It then discusses how Clojure focuses on the essential aspects of programming by removing ceremony and focusing on data rather than objects or classes. The document uses examples to illustrate how Clojure allows for concise yet expressive code through its emphasis on data, immutability, and functional programming.
Ring provides a common abstraction for building web applications in Clojure. It defines handlers as functions that take HTTP requests as maps and return responses as maps. Adapters run handlers on web servers, and middleware can augment handlers. This allows writing web apps in an idiomatic way and sharing code across frameworks that target the Ring spec.
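Ring's handler-and-middleware model can be illustrated with a small Python analogue. The real spec is Clojure maps; the dict keys below mirror Ring's, but the code is only a sketch, not the actual spec:

```python
# Ring's core idea: a handler is a pure function from a request map
# to a response map, and middleware wraps a handler to make a new one.

def hello_handler(request):
    """Handler: request dict in, response dict out."""
    return {"status": 200,
            "headers": {"Content-Type": "text/plain"},
            "body": f"Hello, {request.get('remote-addr', 'world')}"}

def wrap_logging(handler):
    """Middleware: augments a handler without changing its shape."""
    def wrapped(request):
        response = handler(request)
        print(request["uri"], "->", response["status"])
        return response
    return wrapped

app = wrap_logging(hello_handler)
resp = app({"uri": "/", "request-method": "get", "remote-addr": "1.2.3.4"})
assert resp["status"] == 200
assert resp["body"] == "Hello, 1.2.3.4"
```

Because handlers and middleware are just functions over plain data, any adapter that produces the request map can run the same app, which is how Ring lets code be shared across frameworks.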
This document provides a summary of an introduction to the Clojure programming language. It discusses what Clojure is, its timeline and adoption, functional programming concepts, concurrency features using Software Transactional Memory, the Lisp ideology it is based on including homoiconicity and its macro system. It also provides an overview of getting started with Clojure including using the REPL, basic syntax like symbols and keywords, data types, sequences, functions, and Java interoperability. Resources for learning more about Clojure are also listed.
Using Clojure, NoSQL Databases and Functional-Style JavaScript to Write Next-... - Stefan Richter
The document is a presentation about building next-generation HTML5 apps using Clojure, NoSQL databases, and functional JavaScript. It discusses how the presenter built an HTML5 client for their software company's app using these technologies that runs on all modern browsers and mobile platforms like iPhone and Android, avoiding the need to build separate native apps. The presentation focuses on their use of functional programming principles in JavaScript to structure the client code.
The document discusses how abstraction is central to programming and how Clojure is a good language for creating abstractions, noting that Clojure provides primitive expressions, means of combination through functions, and means of abstraction through functions, records, multimethods and protocols to build complex programs from simple ideas.
DAMA Webinar - Big and Little Data Quality - DATAVERSITY
While technological innovation brings constant change to the data landscape, many organizations still struggle with the basics: ensuring they have reliable, high quality data. In health care, the promise of insight to be gained through analytics is dependent on ensuring the interactions between providers and patients are recorded accurately and completely. While traditional health care data is dependent on person-to-person contact, new technologies are emerging that change how health care is delivered and how health care data is captured, stored, accessed and used. Using health care as a lens through which to understand the emergence of big data, this presentation will ask the audience to think about data in old and new ways in order to gain insight about how to improve the quality of data, regardless of size.
This document discusses visualizing data with code and provides information on tools and techniques for data visualization. It lists relevant fields like information design, data science, and cartography. It also lists example visualization tools and techniques like D3, Processing, network graphs, and mapping. Finally, it outlines a process for developing data visualizations that involves looking at the data, creating initial visualizations, asking questions, getting inspiration, refining ideas, and publishing visualizations.
When working with big data or complex algorithms, we often look to parallelize our code to optimize runtime. By taking advantage of a GPU's 1000+ cores, a data scientist can quickly scale out solutions inexpensively and sometimes more quickly than with traditional CPU cluster computing. In this webinar, we will present ways to incorporate GPU computing to complete computationally intensive tasks in both Python and R.
See the full presentation here: 👉 https://vimeo.com/153290051
Learn more about the Domino data science platform: https://www.dominodatalab.com
An immersive workshop at General Assembly, SF. I typically teach this workshop at General Assembly, San Francisco. To see a list of my upcoming classes, visit https://generalassemb.ly/instructors/seth-familian/4813
I also teach this workshop as a private lunch-and-learn or half-day immersive session for corporate clients. To learn more about pricing and availability, please contact me at http://familian1.com
Hello Everyone !
"Salesforce Apex Hours" is a recurring event to talk about salesforce ! Some times we'd like to meet on one location and some time online. This time we are planning one online session on "Big Object" with Jigar Shah.
Agenda :-
1. Need for Big Objects
2. Consideration for Big Objects Usage
3. Demo
6. Limitations with using Big Objects
7. Q&A
8. Additional References
Speakers: Jigar Shah, Amit Chaudhary
Date: Saturday, Jan 27, 2018, 10:00 AM EST
Link: https://www.meetup.com/Farmington-Hills-Salesforce-Developer-Meetup/events/246658024/
Thanks
Amit Chaudhary @amit_sfdc
Email :- amit.salesforce21@gmail.com
To view a recording of this webinar, please use the URL below:
http://wso2.com/library/webinars/2016/06/analytics-in-your-enterprise/
Big data spans many fields and brings together technologies like distributed systems, machine learning, statistics and Internet of Things (IoT). It has now become a multi-billion dollar industry with use cases ranging from targeted advertising and fraud detection to product recommendations and market surveys.
Some use cases, such as urban planning, can be slower (done in batch mode), while others, such as the stock market, need results in milliseconds (done in a streaming fashion). Different technologies are used for each case: MapReduce for batch analytics, complex event processing for real-time analytics, and machine learning for predictive analytics. Furthermore, the type of analysis ranges from basic statistics to complicated prediction models.
This webinar will discuss the big data landscape, including:
Concepts, use cases and technologies
Capabilities and applications of the WSO2 analytics platform
WSO2 Data Analytics Server
WSO2 Complex Event Processor
WSO2 Machine Learner
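The batch-versus-streaming distinction described above can be made concrete with a tiny sketch (plain Python, no WSO2 components; the price values are made up):

```python
from collections import deque

# Batch vs streaming: a batch job computes over the whole data set at
# once, while a streaming job keeps an incremental answer up to date
# over a sliding window of recent events.

prices = [100, 101, 103, 99, 98, 102, 105]

# Batch: one pass over everything, result available only at the end.
batch_avg = sum(prices) / len(prices)

# Streaming: emit a moving average after every event (window of 3),
# so an answer is available in (near) real time as events arrive.
window = deque(maxlen=3)
moving = []
for p in prices:
    window.append(p)
    moving.append(sum(window) / len(window))

print(round(batch_avg, 2), [round(m, 2) for m in moving])
```

The batch average summarizes history; the streaming averages track the latest window, which is the trade-off behind choosing MapReduce-style versus CEP-style processing.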
Building the BI system and analytics capabilities at the company based on Rea... - GameCamp
How we built the BI system at the company, grew its capabilities, and developed our analytics capabilities: what was easy, what was more difficult, and the preparation and process, step by step. Presentation from the GameCamp webinar: http://www.gamecamp.io/events/webinar-growing-measurement-capabilities-of-gaming-and-apps-company/
The document describes Krist Wongsuphasawat's background and work in data visualization. It notes that he has a PhD in Computer Science from the University of Maryland, where he studied information visualization. He currently works as a data visualization scientist at Twitter, where he builds internal tools to analyze log data and monitor changes over time. Some of his projects include Scribe Radar, which allows users to search through and visualize client event data in order to find patterns and monitor effects of product changes. The document provides details on his approaches for dealing with large log datasets and visualizing user activity sequences.
Real-Time Analytics and Visualization of Streaming Big Data with JReport & Sc... - Mia Yuan Cao
Learn how to extract real-time insight from Big Data. JReport and ScaleDB’s combined solution delivers business value by ingesting Big Data at stunning velocity (millions of rows/second), then provides powerful visualizations, filtering and data analysis that enable you to draw quick conclusions to make agile business decisions. JReport's seamless connection to ScaleDB enables technical or non-technical users to build and modify their own reports and dashboards to visualize these vast data stores. Join us to see how.
This document summarizes Jon Hyman's presentation on using MongoDB for analytics at the NY MongoDB User Group. It discusses Appboy's use of pre-aggregated analytics documents to track time series data like app opens over time with breakdowns by dimension. It also covers Appboy's technique for quickly estimating the size of user segments by sampling random subsets of documents and extrapolating the results.
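The sampling technique described can be sketched as follows. The data, predicate, and sizes here are invented for illustration and are not Appboy's actual implementation:

```python
import random

# Estimate how many users fall in a segment by checking a random
# subset of user documents and extrapolating, instead of scanning
# every document.

random.seed(42)
N = 100_000
# Fake user documents: app_opens uniform in 0..9, so ~20% of users
# (those with 8 or 9 opens) satisfy the predicate below.
users = [{"app_opens": random.randint(0, 9)} for _ in range(N)]

def in_segment(user):
    return user["app_opens"] >= 8   # the segment predicate

SAMPLE = 2_000
sample = random.sample(users, SAMPLE)
hits = sum(1 for u in sample if in_segment(u))
estimate = hits / SAMPLE * N        # extrapolate from the sample

exact = sum(1 for u in users if in_segment(u))
print(estimate, exact)  # the estimate should land near the exact count
assert abs(estimate - exact) / exact < 0.15
```

A 2% sample answers the "how big is this segment?" question in a fraction of the work of a full scan, at the cost of a small, quantifiable error.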
[@IndeedEng] Large scale interactive analytics with Imhotep - indeedeng
Link to video: https://www.youtube.com/watch?v=IZ-kC6ut1Lg
In a previous talk, we explained how we developed Imhotep, a distributed system for building decision trees for machine learning. We went on to describe how we build large scale interactive analytics tools using the same platform. This has kept our engineering and product organizations focused on key metrics by analyzing test results. It also gives our marketing organization timely and accurate insight into our data - allowing us to identify opportunities, spot trends, and learn about our job seekers. In this talk, Zak Cocos, who leads our Marketing Sciences team, and Product Manager Tom Bergman will discuss and provide examples of the valuable insights that can be gained by using Imhotep with almost any data set.
This document discusses Badoo's use of MicroStrategy for business intelligence and analytics. It describes how MicroStrategy helped Badoo overcome challenges with their previous BI tool by providing dimensional modeling, self-service reports, and weekly releases. It highlights how MicroStrategy enabled data discovery, analysis delivery, and reporting for over 90 users across various teams. The document also provides examples of query optimizations in MicroStrategy that improved performance. Finally, it discusses how MicroStrategy has enabled Badoo to empower users through visual insights, transaction services, command manager automation, and streamlined web deployments.
Brian Greig gave a presentation on visualizing data in realtime using WebSockets and D3. He discussed collecting and consuming data from various sources, performing data analytics and visualizations using the DADA loop, using WebSockets for bidirectional data transmission, manipulating the DOM with D3 for data visualization, and presented a case study on building a simulation.
On Open Day, we share our activities of the month with each other and the community. It's when we take a step back and see where we stand. Here's our Open Day for August 2018.
Alter Way Big Data Seminar - Elasticsearch - October 2014 - ALTER WAY
This document discusses Elasticsearch and how it can be used to search, analyze, and make sense of large amounts of data. It provides examples of how Elasticsearch is being used by large companies to handle petabytes of data and gain insights. Implementations in France are highlighted. The document concludes by demonstrating how easily Elasticsearch can be deployed and used to ingest and search sample data.
Using Visualizations to Monitor Changes and Harvest Insights from a Global-sc... - Krist Wongsuphasawat
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services.
Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization.
This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights.
In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types.
Two interactive visualizations were developed for these purposes; we discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
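Funnel analysis of the kind described in scenario (2) can be sketched in a few lines. The event names and funnel below are hypothetical, not Twitter's actual log schema:

```python
from collections import Counter

# For each user, count how far along an ordered funnel of events
# they got, then tally how many users reached each step.

funnel = ["home", "search", "job_view", "apply"]

# Per-user event streams, in time order.
logs = {
    "u1": ["home", "search", "job_view", "apply"],
    "u2": ["home", "search"],
    "u3": ["home", "job_view"],       # skipped "search": stops at step 1
    "u4": ["search", "home", "search", "job_view"],
}

def funnel_depth(events, steps):
    """Return how many funnel steps the user completed, in order."""
    depth = 0
    for e in events:
        if depth < len(steps) and e == steps[depth]:
            depth += 1
    return depth

counts = Counter(funnel_depth(ev, funnel) for ev in logs.values())
# Users reaching at least step k, for each step of the funnel:
reached = [sum(v for d, v in counts.items() if d >= k)
           for k in range(1, len(funnel) + 1)]
print(reached)  # [4, 3, 2, 1]
```

The real challenge the paper addresses is doing this interactively when there are tens of thousands of event types and billions of events, not four users; the counting logic, though, is essentially this.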
Machine learning with Spark: the road to production - Andrea Baita
1) The document discusses best practices for implementing machine learning models in production using Apache Spark, including prototyping models in notebooks but implementing them properly in Spark, testing models using BDD, and monitoring models and business metrics once deployed.
2) It provides an example case study of implementing a machine learning model to predict clicks for an advertising campaign, including the required architecture, model deployment approaches, and defining relevant metrics to monitor.
3) Releasing and deploying machine learning models to production requires tools for continuous delivery, monitoring failures and business metrics over time to ensure model quality and adapt the model based on new data.
Elasticsearch: breakfast briefing of March 13, 2014 - ALTER WAY
Elasticsearch is a very powerful open-source search engine based on Apache Lucene. It enables the indexing of millions of records and their search and analysis in real time. Elasticsearch tools are already used by leading companies such as FourSquare, GitHub, OpenDataSoft, and Dailymotion. Alter Way and Elasticsearch invite you to come discover the Elasticsearch suite, now available in version 1.0 and ready for production!
This document summarizes a presentation given in September 2013 by Archana Joshi, a senior manager at Cognizant, and Zaheer Abbas Contractor, head of AgileNext at Wipro Technologies. The presentation covered Agile basics such as the primary goal of Agile development being working software, critical items to start a Scrum project, and the correct sequence of events in the Scrum framework. It also discussed concepts like what a product backlog item, sprint burn-down charts, and the product owner's role. The document provided examples and explanations to build understanding of foundational Agile and Scrum terminology and practices.
2-1 Remember the Help Desk with AFCU - Jared Flanders, Final - Jared Flanders
Jared Flanders, a Systems Monitoring Engineer at America First Credit Union, presented on the credit union's ITSM journey and their experience implementing the HPE Service Anywhere platform. Some key points:
- America First Credit Union previously used HP Service Desk and Service Manager but wanted to avoid constant SM upgrades. They implemented Service Anywhere in 2015 after a proof of concept showed how it could meet their needs.
- Implementation took around 8 weeks and initially focused on help desk, IT support, and integrations with UCMDB and Connect-It. Additional groups like DBAs and computer operations were onboarded later.
- In the past year, they have added over 200 knowledge articles, automated a
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics - Amazon Web Services
Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data with existing BI tools for a fraction of the cost of traditional data warehouses.
This webinar will familiarize you with reporting, visualization, and business intelligence options for your Amazon Redshift data warehouse. You will learn how to effectively use existing BI tools and SQL clients with your Amazon Redshift data warehouse, as well as techniques for performing advanced analytics.
Learning Objectives:
Options for processing, analyzing, and visualizing data in Amazon Redshift
Extending the Amazon Redshift SQL query capabilities
Optimizing query performance with Redshift ODBC / JDBC driver
Overview of BI solutions from our partners
Before vs After: Redesigning a Website to be Useful and Informative for Devel...Teresa Giacomini
There are so many fun challenges in creating a useful website for a developer audience today: you’ve got to empathize with your audience, nail the voice, understand the “jobs” your site’s visitors are trying to accomplish, make sure you anticipate (and answer!) the questions people are likely to have. In this quick lightning talk, I’ll share some before vs. after pics of a recent Citus Data site redesign—and will share some of the best practices we used, based on my years as a developer, software engineering manager, product manager, and now, as a marketer.
Similar to 穆黎森:Interactive batch query at scale (20)
詹剑锋:Big databench—benchmarking big data systemshdhappy001
This document discusses BigDataBench, an open source project for big data benchmarking. BigDataBench includes six real-world data sets and 19 workloads that cover common big data applications and preserve the four V's of big data. The workloads were chosen to represent typical application domains like search engines, social networks, and e-commerce. BigDataBench aims to provide a standardized benchmark for evaluating big data systems, architectures, and software stacks. It has been used in several case studies for workload characterization and performance evaluation of different hardware platforms for big data workloads.
The document discusses big data visualization and visual analysis, focusing on the challenges and opportunities. It begins with an overview of visualization and then discusses several challenges in big data visualization, including integrating heterogeneous data from different sources and scales, dealing with data and task complexity, limited interaction capabilities for large data, scalability for both data and users, and the need for domain and development libraries/tools. It then provides examples of visualizing taxi GPS data and traffic patterns in Beijing to identify traffic jams.
Spark is an open source cluster computing framework originally developed at UC Berkeley. Intel has made many contributions to Spark's development through code commits, patches, and collaborating with the Spark community. Spark is widely used by companies like Alibaba, Baidu, and Youku for large-scale data analytics and machine learning tasks. It allows for faster iterative jobs than Hadoop through its in-memory computing model and supports multiple workloads including streaming, SQL, and graph processing.
刘诚忠:Running cloudera impala on postgre sqlhdhappy001
This document summarizes a presentation about running Cloudera Impala on PostgreSQL to enable SQL queries on large datasets. Key points:
- The company processes 3 billion daily ad impressions and 20TB of daily report data, requiring a scalable SQL solution.
- Impala was chosen for its fast performance from in-memory processing and code generation. The architecture runs Impala coordinators and executors across clusters.
- The author hacked Impala to also scan data from PostgreSQL for mixed workloads. This involved adding new scan node types and metadata.
- Tests on a 150 million row dataset showed Impala with PostgreSQL achieving 20 million rows scanned per second per core.
Introducing BoxLang : A new JVM language for productivity and modularity!Ortus Solutions, Corp
Just like life, our code must adapt to the ever changing world we live in. From one day coding for the web, to the next for our tablets or APIs or for running serverless applications. Multi-runtime development is the future of coding, the future is to be dynamic. Let us introduce you to BoxLang.
Dynamic. Modular. Productive.
BoxLang redefines development with its dynamic nature, empowering developers to craft expressive and functional code effortlessly. Its modular architecture prioritizes flexibility, allowing for seamless integration into existing ecosystems.
Interoperability at its Core
With 100% interoperability with Java, BoxLang seamlessly bridges the gap between traditional and modern development paradigms, unlocking new possibilities for innovation and collaboration.
Multi-Runtime
From the tiny 2m operating system binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, Web Assembly, Android and more. BoxLang has been designed to enhance and adapt according to it's runnable runtime.
The Fusion of Modernity and Tradition
Experience the fusion of modern features inspired by CFML, Node, Ruby, Kotlin, Java, and Clojure, combined with the familiarity of Java bytecode compilation, making BoxLang a language of choice for forward-thinking developers.
Empowering Transition with Transpiler Support
Transitioning from CFML to BoxLang is seamless with our JIT transpiler, facilitating smooth migration and preserving existing code investments.
Unlocking Creativity with IDE Tools
Unleash your creativity with powerful IDE tools tailored for BoxLang, providing an intuitive development experience and streamlining your workflow. Join us as we embark on a journey to redefine JVM development. Welcome to the era of BoxLang.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge - - Capture & Transfer
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Northern Engraving | Nameplate Manufacturing Process - 2024Northern Engraving
Manufacturing custom quality metal nameplates and badges involves several standard operations. Processes include sheet prep, lithography, screening, coating, punch press and inspection. All decoration is completed in the flat sheet with adhesive and tooling operations following. The possibilities for creating unique durable nameplates are endless. How will you create your brand identity? We can help!
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
6. The Problem
• How many logins today?
• How many individual users this week?
• Total income today?
• Paid user amount this month?
• …
7. The Problem: Facts
• How many X during time period of Y?

Fact Table:
user id  | event | amount | timestamp
user_001 | login | -      | 1383729081
user_002 | login | -      | 1383729082
user_001 | pay   | 4.99   | 1383729084
user_003 | login | -      | 1383729090
8. The Problem: Facts
• How many logins today?
• How many individual users this week?
• Total income today?
• Paid user amount this month?
• …
9. The Problem: Facts
• How many logins today?

(Fact Table as on slide 7)

select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
10. The Problem: Facts
• How many individual users this week?

(Fact Table as on slide 7)

select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
11. The Problem: Facts
• Total income today?

(Fact Table as on slide 7)

select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
12. The Problem: Facts
• Paid user amount this month?

(Fact Table as on slide 7)

select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
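The four fact-table queries on slides 9-12 can be run end to end on a small stand-in. A minimal sketch, using an in-memory SQLite database instead of Drill, with table and column names (`fact`, `uid`, `ts`) adapted from the slides' example rows:

```python
# Sketch: the slides' fact-table queries on an in-memory SQLite stand-in.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, ts INTEGER)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay",   4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
])
window = (1383729081, 1383729091)  # [start, end) of the time period

# How many logins in the window?
logins = con.execute(
    "SELECT count(*) FROM fact WHERE event='login' AND ts>=? AND ts<?",
    window).fetchone()[0]

# How many individual users logged in?
users = con.execute(
    "SELECT count(DISTINCT uid) FROM fact WHERE event='login' AND ts>=? AND ts<?",
    window).fetchone()[0]

# Total income in the window?
income = con.execute(
    "SELECT sum(amount) FROM fact WHERE event='pay' AND ts>=? AND ts<?",
    window).fetchone()[0]

print(logins, users, income)  # 3 3 4.99
```

Each question is a different aggregate over the same scan and filter shape, which is what the later "merge similar scans" optimization exploits.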
13. The Problem: Dimensions
• How many logins today from China?
• How many individual users of each server this week?
• Total income today by new user?
• Paid user amount this month from Adwords?
• …
14. The Problem: Dimensions
• The user X's property Y is of value Z

Dimension Table:
user id  | reg_time | language | refer
user_001 | 20100612 | en       | adwords
user_002 | 20110927 | cn       | facebook
user_003 | 20121010 | fr       | admob
user_004 | 20130522 | it       | tapjoy
…
15. Fact & Dimension
• Aggregation on Join

Fact:
user id  | event | amount | timestamp
user_001 | login | -      | 1383729081
user_002 | login | -      | 1383729082
user_001 | pay   | 4.99   | 1383729084
user_003 | login | -      | 1383729090

Dimension:
user id  | reg_time | language | refer
user_001 | 20100612 | en       | adwords
user_002 | 20110927 | cn       | facebook
user_003 | 20121010 | fr       | admob
user_004 | 20130522 | it       | tapjoy
…
16. Fact & Dimension
• How many logins today from China?
• How many individual users of each server this week?
• Total income today by new user?
• Paid user amount this month from adwords?
• …
17. Fact & Dimension
SELECT COUNT DISTINCT (on uid)
JOIN (1 fact, n dimension, on uid)
WHERE (filter by value of dimensions/facts)
GROUP BY (value of dimension)
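The generic COUNT DISTINCT / JOIN / WHERE / GROUP BY shape above can be exercised on the slides' sample tables. A sketch, again using SQLite as an illustrative stand-in for Drill, grouping distinct login users by one dimension value (language):

```python
# Sketch: aggregation on a fact-dimension join (SQLite stand-in for Drill).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, ts INTEGER)")
con.execute("CREATE TABLE dim (uid TEXT, reg_time TEXT, language TEXT, refer TEXT)")
con.executemany("INSERT INTO fact VALUES (?,?,?,?)", [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay",   4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
])
con.executemany("INSERT INTO dim VALUES (?,?,?,?)", [
    ("user_001", "20100612", "en", "adwords"),
    ("user_002", "20110927", "cn", "facebook"),
    ("user_003", "20121010", "fr", "admob"),
    ("user_004", "20130522", "it", "tapjoy"),
])

# SELECT COUNT DISTINCT (on uid) / JOIN (1 fact, 1 dimension, on uid)
# / WHERE (filter on fact) / GROUP BY (value of dimension):
rows = con.execute("""
    SELECT d.language, count(DISTINCT f.uid)
    FROM fact f JOIN dim d ON f.uid = d.uid
    WHERE f.event = 'login'
    GROUP BY d.language
    ORDER BY d.language
""").fetchall()
print(rows)  # [('cn', 1), ('en', 1), ('fr', 1)]
```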
18. Fact & Dimension
• SQL -> Syntax tree -> Logical Plan -> Physical Plan

agg
└─ Join
   ├─ Join
   │  ├─ filter
   │  │  └─ scan: Dimension
   │  └─ filter
   │     └─ scan: Dimension
   └─ filter
      └─ scan: Fact
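The plan tree on this slide can be represented as a small operator tree. A toy sketch with a hypothetical `Op` node class (Drill's real plans are operator DAGs serialized as JSON, not these classes):

```python
# Toy operator tree mirroring the slide's physical plan.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    children: list = field(default_factory=list)

plan = Op("agg", [
    Op("Join", [
        Op("Join", [
            Op("filter", [Op("scan: Dimension")]),
            Op("filter", [Op("scan: Dimension")]),
        ]),
        Op("filter", [Op("scan: Fact")]),
    ]),
])

def leaves(op):
    # Collect the scan operators at the bottom of the plan.
    return [op.name] if not op.children else [n for c in op.children for n in leaves(c)]

print(leaves(plan))  # ['scan: Dimension', 'scan: Dimension', 'scan: Fact']
```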
32. about Space Efficiency
• Compact data representation
  • Java object overhead: high
• JVM friendly (GC)
  • Simpler object graph
  • Less tenured space, less full GC
33. about Time Efficiency
• Cache friendly
  • data access locality
• Superscalar: pipeline friendly
  • the inner loop problem
• SIMD friendly
  • opportunity to operate on a vector of values
• JVM friendly (JNI)
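The inner-loop point above can be shown with a toy columnar layout: one contiguous array per column, so the filter-and-sum loop touches primitives sequentially instead of chasing per-row object pointers. This is only an illustration; Drill's value vectors live off-heap, not in Python lists:

```python
# Toy columnar ("value vector") layout: one array per column.
from array import array

event  = ["login", "login", "pay", "login"]        # event column
amount = array("d", [0.0, 0.0, 4.99, 0.0])         # amount column, packed doubles

# Tight inner loop over the columns: cache friendly, easy to pipeline.
total = 0.0
for i in range(len(event)):
    if event[i] == "pay":
        total += amount[i]

print(total)  # 4.99
```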
38. Review the Considerations
• Cache friendly
• Superscalar: pipeline friendly
• SIMD friendly
• Compact data representation
• JVM friendly (GC)
• JVM friendly (JNI)

(figure: example value vectors: name:VarChar "ice cream…", price.basic:float 4.99, price.coupon:boolean T…)
43. Adhoc batch query

Fact:
user id | event | time
user_13 | login | 2013-07-26
user_13 | login | 2013-07-26
user_76 | pay   | 2013-07-27

Dimension:
user id | nation
user_13 | cn
user_76 | en

DAU:
   | 2013-07-26 | 2013-07-27
en | 576        | 491
cn | 361        | 945
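The scheduled "DAU by nation" batch result above is one join-plus-group-by over the fact and dimension tables. A sketch on the slide's (tiny) sample rows, using SQLite as a stand-in; the real counts in the DAU table come from far more data:

```python
# Sketch: DAU by nation and day as a single join + group-by (SQLite stand-in).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (uid TEXT, event TEXT, day TEXT)")
con.execute("CREATE TABLE dim (uid TEXT, nation TEXT)")
con.executemany("INSERT INTO fact VALUES (?,?,?)", [
    ("user_13", "login", "2013-07-26"),
    ("user_13", "login", "2013-07-26"),
    ("user_76", "pay",   "2013-07-27"),
])
con.executemany("INSERT INTO dim VALUES (?,?)", [
    ("user_13", "cn"),
    ("user_76", "en"),
])

dau = con.execute("""
    SELECT d.nation, f.day, count(DISTINCT f.uid)
    FROM fact f JOIN dim d ON f.uid = d.uid
    GROUP BY d.nation, f.day
    ORDER BY d.nation, f.day
""").fetchall()
print(dau)  # [('cn', '2013-07-26', 1), ('en', '2013-07-27', 1)]
```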
58. Jobs vs Predictions
• Offline job
  • becomes predictions of what data the user may be interested in
  • by merging more queries together
  • daily predictions & hourly predictions
61. Utilising Multi-core
• Now:
  • Push data from Leaf
  • Data driven upwards
  • Pooled execution

agg
└─ Join
   ├─ filter nation='en'
   │  └─ scan: Dimension
   └─ filter date='2013-07-26'
      └─ scan: Fact
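The push-from-leaf idea above can be caricatured in a few lines: partition the fact data, let pooled workers run the leaf scan+filter, and merge partial results upwards. This is only a toy with made-up partitions; Drill schedules operator fragments, not Python threads:

```python
# Toy "push data from leaf, pooled execution": partitioned scan + merge.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of the fact data (uid, event, day).
partitions = [
    [("user_001", "login", "2013-07-26")] * 3,
    [("user_002", "login", "2013-07-26")] * 2,
    [("user_003", "pay",   "2013-07-27")] * 4,
]

def scan_and_filter(part):
    # Leaf operator: scan one partition, apply the filter, emit a partial count.
    return sum(1 for _, event, day in part
               if event == "login" and day == "2013-07-26")

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(scan_and_filter, partitions))

print(sum(partials))  # 5
```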
62. Adhoc batch query
• Benefits
  • Reduce the same Scans
  • Merge similar Scans
  • Merge intermediate operators
  • Unified process for adhoc & batch process
  • Multi-core process of single Plan
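The "merge similar scans" benefit can be sketched directly: several queries that share the same scan are answered in one pass over the data instead of one pass each. A toy illustration over the slides' row format (in the real system this merging happens at the plan level):

```python
# Toy merged scan: three aggregates computed in a single pass over the facts.
rows = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay",   4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]

login_count = 0      # "how many logins?"
login_users = set()  # "how many individual users?"
income = 0.0         # "total income?"

for uid, event, amount, ts in rows:  # one shared scan of the fact data
    if event == "login":
        login_count += 1
        login_users.add(uid)
    elif event == "pay":
        income += amount

print(login_count, len(login_users), income)  # 3 3 4.99
```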
64. About Xingcloud
• Now
  • 2 billion insert/update daily
  • 200k+ aggregation data/day, 6k sec in total
  • query response time: <1 sec to 100 sec, ~10 sec on average
  • http://a.xingcloud.com
• Future
  • Plan Merge
  • Unified process for batch, adhoc & stream process, SQL oriented
  • SQL(t): Plan with time window
65. About Drill
• Now
  • on Parquet/ORCFile on HDFS
  • Distributed Join
  • Write interface of storage engines
• Future
  • 1.0 M2: December 2013
  • 1.0 GA: Early 2014
  • more detail on https://issues.apache.org/jira/browse/DRILL