Interactive Batch Query At Scale
Ad hoc query system for game analytics based on Drill
immars@gmail.com

穆黎森:Interactive batch query at scale


Published on

BDTC 2013 Beijing China

Published in: Technology
Transcript of "穆黎森:Interactive batch query at scale"

  1. Interactive Batch Query At Scale
     Ad hoc query system for game analytics based on Drill
     immars@gmail.com
  2. Related Topics
     • Java Programming
     • Relational Algebra
     • Distributed Database
     • Hadoop Ecosystem
  3. About Us
     • Elex-tech
     • Game Development, Game Publishing
     • SNS Games, Web Games, Mobile Games, Apps
     • Global Market
  4. • The Problem (this section)
     • Brief on Drill
     • Design Considerations
     • Enhancement from Xingcloud
     • Now & Future
  5. The Problem
  6. The Problem
     • How many logins today?
     • How many individual users this week?
     • Total income today?
     • Paid user amount this month?
     • …
  7. The Problem: Facts
     • How many X during time period of Y

     Fact Table
     user id    event   amount   timestamp
     user_001   login   -        1383729081
     user_002   login   -        1383729082
     user_001   pay     4.99     1383729084
     user_003   login   -        1383729090
  8. The Problem: Facts
     • How many logins today?
     • How many individual users this week?
     • Total income today?
     • Paid user amount this month?
     • …
  9. The Problem: Facts
     • How many logins today?
     (Fact table as on slide 7)
     select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
  10. The Problem: Facts
      • How many individual users this week?
      (Fact table as on slide 7)
      select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
  11. The Problem: Facts
      • Total income today?
      (Fact table as on slide 7)
      select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
  12. The Problem: Facts
      • Paid user amount this month?
      (Fact table as on slide 7)
      select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
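The four fact queries above share one shape: a single aggregate over rows filtered by event and a half-open time range. The following is a minimal sketch of that shape using Python's built-in sqlite3 module rather than Drill. The table layout mirrors the slides; the helper name `fact_metric` and the `START`/`END` bounds are assumptions made for illustration.

```python
import sqlite3

# In-memory table mirroring the Fact table on the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("user_001", "login", None, 1383729081),
        ("user_002", "login", None, 1383729082),
        ("user_001", "pay", 4.99, 1383729084),
        ("user_003", "login", None, 1383729090),
    ],
)


def fact_metric(select_expr, event, start, end):
    """'How many X during period Y': one aggregate over [start, end)."""
    # select_expr is a trusted constant here, never user input.
    sql = (
        f"SELECT {select_expr} FROM fact "
        "WHERE event = ? AND timestamp >= ? AND timestamp < ?"
    )
    return conn.execute(sql, (event, start, end)).fetchone()[0]


# Assumed period that happens to cover all four sample rows.
START, END = 1383729000, 1383730000

logins = fact_metric("COUNT(*)", "login", START, END)
active_users = fact_metric("COUNT(DISTINCT uid)", "login", START, END)
income = fact_metric("SUM(amount)", "pay", START, END)
paying_users = fact_metric("COUNT(DISTINCT uid)", "pay", START, END)
```

Only the aggregate expression and the event name change between the four questions, which is what makes them candidates for batching later in the talk.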
  13. The Problem: Dimensions
      • How many logins today from China?
      • How many individual users of each server this week?
      • Total income today by new user?
      • Paid user amount this month from Adwords?
      • …
  14. The Problem: Dimensions
      • The user X's property Y is of value Z

      Dimension Table
      user id    reg_time   language   refer      …
      user_001   20100612   en         adwords
      user_002   20110927   cn         facebook
      user_003   20121010   fr         admob
      user_004   20130522   it         tapjoy
  15. Fact & Dimension
      • Aggregation on Join

      user id    event   amount   timestamp
      user_001   login   -        1383729081
      user_002   login   -        1383729082
      user_001   pay     4.99     1383729084
      user_003   login   -        1383729090

      user id    reg_time   language   refer      …
      user_001   20100612   en         adwords
      user_002   20110927   cn         facebook
      user_003   20121010   fr         admob
      user_004   20130522   it         tapjoy
  16. Fact & Dimension
      • How many logins today from China?
      • How many individual users of each server this week?
      • Total income today by new user?
      • Paid user amount this month from adwords?
      • …
  17. Fact & Dimension
      SELECT COUNT DISTINCT (on uid)
      JOIN (1 fact, n dimension, on uid)
      WHERE (filter by value of dimensions/facts)
      GROUP BY (value of dimension)
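The template on slide 17 can be made concrete with the sample tables from slide 15. This is a sketch in sqlite3, not Drill; the table names `fact` and `dim` and the time bounds are assumptions, and only one dimension table is joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER);
    CREATE TABLE dim  (uid TEXT PRIMARY KEY, reg_time TEXT,
                       language TEXT, refer TEXT);
    INSERT INTO fact VALUES
      ('user_001', 'login', NULL, 1383729081),
      ('user_002', 'login', NULL, 1383729082),
      ('user_001', 'pay',   4.99, 1383729084),
      ('user_003', 'login', NULL, 1383729090);
    INSERT INTO dim VALUES
      ('user_001', '20100612', 'en', 'adwords'),
      ('user_002', '20110927', 'cn', 'facebook'),
      ('user_003', '20121010', 'fr', 'admob'),
      ('user_004', '20130522', 'it', 'tapjoy');
    """
)

# COUNT DISTINCT on uid, JOIN fact to dimension on uid,
# WHERE on fact values, GROUP BY a dimension value.
rows = conn.execute(
    """
    SELECT d.language, COUNT(DISTINCT f.uid)
    FROM fact AS f
    JOIN dim AS d ON d.uid = f.uid
    WHERE f.event = 'login' AND f.timestamp >= ? AND f.timestamp < ?
    GROUP BY d.language
    ORDER BY d.language
    """,
    (1383729000, 1383730000),  # assumed period covering the sample rows
).fetchall()
```

Each group in `rows` corresponds to one row of a report, with the grouping dimension as the row label.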
  18. Fact & Dimension
      • SQL
      • -> Syntax tree
      • -> Logical Plan
      • -> Physical Plan
      [Diagram: operator tree with agg at the root, two Joins, three filters, and scans of two Dimension tables and one Fact table]
  19. pre-aggregation?
  20. [Figure; no text]
  21. Combinatorial Explosion!
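A rough way to see why pre-aggregating every report cell explodes: if each dimension can either be pinned to one of its values or left unfiltered, the number of cells is the product of (cardinality + 1) over all dimensions, multiplied by the number of metrics and time periods. The counting model and every cardinality below are assumptions for illustration; none of these numbers come from the slides.

```python
from math import prod


def preaggregated_cells(cardinalities, metrics=1, periods=1):
    """Count result cells if every dimension combination were materialised.

    Each dimension contributes its distinct values plus one 'all' bucket.
    """
    return metrics * periods * prod(c + 1 for c in cardinalities.values())


# Hypothetical cardinalities for four user properties.
dims = {"language": 20, "refer": 50, "server": 100, "reg_date": 365}

# Four metrics (logins, users, income, paying users) over 30 daily periods.
total = preaggregated_cells(dims, metrics=4, periods=30)
print(f"{total:,} cells to precompute")
```

Adding one more dimension multiplies the total again, which is the motivation for computing aggregates on demand instead.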
  22. Access Pattern
                   Write            Read by
      Facts        Append           date, event
      Dimensions   Insert, update   user id, prop value, full table
  23. Volume
      • 200GB new Facts
      • 50GB Dimension updates
  24. Architecture
      [Diagram: Query layer: Drill, with a MySQL StorageEngine and an HBase StorageEngine. Storage layer: Data Loader, MySQL, HBase.]
  25. • The Problem
      • Brief on Drill (this section)
      • Design Considerations
      • Our work
      • Now & Future
  26. http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
  27. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  28. • The Problem
      • Brief on Drill
      • Design Considerations (this section)
      • Our work
      • Now & Future
  29. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  30. Data Model
      {
        name: "icecream",
        price: {
          basic: 4.99,
          coupon: true
        }
      }
      • Various types
      • Nested values
        • price.basic
      • Schema-free
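The `price.basic` notation on slide 30 addresses a nested value by its dotted path. A small sketch of that addressing, assuming records arrive as nested dictionaries; the `flatten` helper is hypothetical and is not part of Drill.

```python
def flatten(record, prefix=""):
    """Map a nested record to {dotted.path: leaf value}."""
    out = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested values, extending the path.
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


record = {"name": "icecream", "price": {"basic": 4.99, "coupon": True}}
columns = flatten(record)
```

No schema is declared up front: the set of paths is discovered from the data, and each leaf keeps its own type.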
  31. Design Considerations
      • As fast as possible
        • Space efficient
        • Time efficient
  32. About Space Efficiency
      • Compact data representation
        • Java object overhead: high
      • JVM friendly (GC)
        • Simpler object graph
        • Less tenured space, less full GC
  33. About Time Efficiency
      • Cache friendly
        • data access locality
      • Superscalar: pipeline friendly
        • the inner loop problem
      • SIMD friendly
        • opportunity to operate on a vector of values
      • JVM friendly (JNI)
  34. ValueVector & RecordBatch
      [Figure: ValueVector]
  35. ValueVector & RecordBatch
      • ValueVector
        • small memory overhead
        • backed by DirectByteBuffer
        • further encoding
        • continuous access/random access
  36. ValueVector & RecordBatch
      {
        name: "icecream",
        price: {
          basic: 4.99,
          coupon: true
        }
      }
      [Diagram: the record stored as one RecordBatch of three ValueVectors]
      name:VarChar           i c e c r e a m …
      price.basic:float      4.99 …
      price.coupon:boolean   T …
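The layout on slide 36 stores each field as its own contiguous vector, and a batch is the set of vectors for a group of records. The sketch below imitates that idea with Python's `array` and `bytearray`. It is not Drill's ValueVector implementation (which the slides say is backed by DirectByteBuffer): the class and function names are hypothetical, the float column uses 8-byte doubles, and the second record is made up.

```python
from array import array


class VarCharVector:
    """Variable-width column: one contiguous byte buffer plus offsets."""

    def __init__(self):
        self.data = bytearray()
        # Value i occupies data[offsets[i]:offsets[i + 1]].
        self.offsets = array("i", [0])

    def append(self, text):
        self.data += text.encode("utf-8")
        self.offsets.append(len(self.data))

    def get(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")

    def __len__(self):
        return len(self.offsets) - 1


def to_record_batch(records):
    """Turn row-shaped records into one vector per dotted field path."""
    name = VarCharVector()
    basic = array("d")    # fixed-width numeric column
    coupon = bytearray()  # one byte per boolean, kept simple on purpose
    for r in records:
        name.append(r["name"])
        basic.append(r["price"]["basic"])
        coupon.append(1 if r["price"]["coupon"] else 0)
    return {"name": name, "price.basic": basic, "price.coupon": coupon}


batch = to_record_batch(
    [
        {"name": "icecream", "price": {"basic": 4.99, "coupon": True}},
        {"name": "cake", "price": {"basic": 2.5, "coupon": False}},  # made up
    ]
)
```

Two records cost three buffers instead of a graph of per-row objects, which is the "compact data representation" and "simpler object graph" point from the space-efficiency slide.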
  37. ValueVector & RecordBatch
      • Data passed in RecordBatch
      • Inner loop: next() vs for
      [Plan: agg <- Join <- (filter <- scan: Dimension; filter <- scan: Fact)]
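The "next() vs for" bullet contrasts two inner loops: pulling one record at a time through an iterator, versus receiving a whole batch and looping over a column. A sketch of both for a filter operator follows; the function names and the sample rows are assumptions, and Python will not show the cache or SIMD benefits, only the difference in structure.

```python
def filter_rows(rows, predicate):
    """Row-at-a-time: the consumer calls next() once per record."""
    for row in rows:
        if predicate(row):
            yield row


def filter_batch(batch, column, wanted):
    """Batch-at-a-time: one call per batch, a plain loop over one column."""
    values = batch[column]
    keep = [i for i in range(len(values)) if values[i] == wanted]
    # Apply the selected positions to every column of the batch.
    return {name: [col[i] for i in keep] for name, col in batch.items()}


rows = [
    ("user_001", "login"),
    ("user_002", "login"),
    ("user_001", "pay"),
    ("user_003", "login"),
]
batch = {"uid": [r[0] for r in rows], "event": [r[1] for r in rows]}

row_uids = [uid for uid, _ in filter_rows(rows, lambda r: r[1] == "login")]
batch_uids = filter_batch(batch, "event", "login")["uid"]
```

In the batch version the per-record work is a tight loop over contiguous values with no call per record, which is the property the time-efficiency slide asks for.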
  38. Review the Considerations
      • Cache friendly
      • Superscalar: pipeline friendly
      • SIMD friendly
      • Compact data representation
      • JVM friendly (GC)
      • JVM friendly (JNI)
      [Diagram: the name:VarChar, price.basic:float, and price.coupon:boolean ValueVectors from slide 36]
  39. • The Problem
      • Brief on Drill
      • Design Considerations
      • Our work (this section)
      • Now & Future
  40. Our work, primarily
      • Ad hoc batch query
  41. Reports: generally 2-dimensional tables
  42. Ad hoc batch query
      DailyActiveUser   2013-07-26   2013-07-27
      en                576          491
      cn                361          945
  43. Ad hoc batch query
      Fact
      user id   event   time
      user_13   login   2013-07-26
      user_13   login   2013-07-26
      user_76   pay     2013-07-27

      Dimension
      user id   nation
      user_13   cn
      user_76   en

      DAU   2013-07-26   2013-07-27
      en    576          491
      cn    361          945
  44. Ad hoc batch query
      DAU   2013-07-26   2013-07-27
      en    576          491
      cn    361          945
  45. Ad hoc batch query
      [Diagram: the DAU report next to four separate query plans, one per cell. Each plan is scan: Fact -> filter (date) and scan: Dimension -> filter (nation), then Join -> agg. Cells: en/2013-07-26 = 576, en/2013-07-27 = 491, cn/2013-07-26 = 361, cn/2013-07-27 = 945.]
  46. [Diagram: the same four plans, without the report]
  47. [Diagram: the four plans merged so that a single scan: Fact and a single scan: Dimension feed the date filters, the nation filters, the four Joins, and the four aggs]
  48. Ad hoc batch query
      • Benefits
        • Reduce the same Scans
        • Merge similar Scans
      • Possibility
        • SQL usually parses into a tree, while
        • the LogicalPlan in Drill is a DAG
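The first benefit, fewer scans, can be illustrated with the toy tables from slide 43. In the sketch below the unmerged version scans the fact data once per report cell, while the merged version fills every cell in a single pass. Counting any event as activity, looking the dimension up through a dictionary, and the function names are all assumptions; the real system merges Drill plans rather than Python loops.

```python
from collections import defaultdict

FACT = [
    ("user_13", "login", "2013-07-26"),
    ("user_13", "login", "2013-07-26"),
    ("user_76", "pay", "2013-07-27"),
]
DIM = {"user_13": "cn", "user_76": "en"}  # user id -> nation


def dau_cell(date, nation, stats):
    """Unmerged: every report cell runs its own scan of the fact data."""
    stats["fact_scans"] += 1
    users = {
        uid for uid, _event, day in FACT
        if day == date and DIM.get(uid) == nation
    }
    return len(users)


def dau_report(dates, nations, stats):
    """Merged: one scan feeds the aggregates of all cells."""
    stats["fact_scans"] += 1
    users = defaultdict(set)
    for uid, _event, day in FACT:
        nation = DIM.get(uid)
        if day in dates and nation in nations:
            users[(nation, day)].add(uid)
    return {(n, d): len(users[(n, d)]) for n in nations for d in dates}


DATES = ["2013-07-26", "2013-07-27"]
NATIONS = ["en", "cn"]

unmerged_stats = {"fact_scans": 0}
unmerged = {
    (n, d): dau_cell(d, n, unmerged_stats) for n in NATIONS for d in DATES
}

merged_stats = {"fact_scans": 0}
merged = dau_report(DATES, NATIONS, merged_stats)
```

Both versions produce the same report; the difference is four scans against one, and the gap grows with the number of cells.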
  49. More Benefits: intermediate result reuse
  50. Ad hoc batch query
      [Diagram: the merged plan from slide 47, with one scan: Fact and one scan: Dimension but still one filter per cell]
  51. Ad hoc batch query
      [Diagram: the nation filters are merged into a single Filter nation='en' and a single Filter nation='cn', each shared by two Joins]
  52. Ad hoc batch query
      [Diagram: the date filters are merged as well, leaving one Filter per date and one Filter per nation feeding four Joins and four aggs]
  53. More Benefits: more batched, more offline
  54. [Figure: Single Query]
  55. [Figure: Batched 3 Queries]
  56. [Figure: Batched Query, from a report]
  57. [Figure: Batched Query, from tens of reports, with 1k+ operators]
  58. Jobs vs Predictions
      • Offline job
        • becomes a prediction of what data users may be interested in
        • by merging more queries together
        • daily predictions & hourly predictions
  59. More Benefits: utilising multi-core
  60. Utilising Multi-core
      • Original:
        • Pull data from root
        • Downwards recursively
      [Plan: agg <- Join <- (filter nation='en' <- scan: Dimension; filter date='2013-07-26' <- scan: Fact)]
  61. Utilising Multi-core
      • Now:
        • Push data from Leaf
        • Data driven upwards
        • Pooled execution
      [Plan: agg <- Join <- (filter nation='en' <- scan: Dimension; filter date='2013-07-26' <- scan: Fact)]
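The change from pull to push can be sketched on a scan -> filter -> agg pipeline. In the pull version the root asks for each record and the request recurses down to the scan; in the push version each scanned batch is handed to a worker pool and partial results flow up to the aggregate. The function names and sample data are assumptions, and CPython threads do not give true CPU parallelism, so this only shows the control flow, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def pull_count(rows, predicate):
    """Pull model: the root drives, one next() at a time, on one thread."""
    scan = iter(rows)
    filtered = (row for row in scan if predicate(row))
    return sum(1 for _ in filtered)


def push_count(batches, predicate, workers=4):
    """Push model: leaves emit batches, pooled tasks process them."""

    def filter_and_count(batch):
        return sum(1 for row in batch if predicate(row))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each batch becomes an independent task; the final sum plays
        # the role of the agg operator combining partial results.
        return sum(pool.map(filter_and_count, batches))


rows = [("user_%03d" % i, "login" if i % 3 else "pay") for i in range(100)]
batches = [rows[i:i + 25] for i in range(0, len(rows), 25)]


def is_login(row):
    return row[1] == "login"


pulled = pull_count(rows, is_login)
pushed = push_count(batches, is_login)
```

Because batches are independent units of work, a single plan can keep several cores busy, which a strictly recursive next() chain cannot.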
  62. Ad hoc batch query
      • Benefits
        • Reduce the same Scans
        • Merge similar Scans
        • Merge intermediate operators
        • Unified process for ad hoc & batch processing
        • Multi-core processing of a single Plan
  63. • The Problem
      • Brief on Drill
      • Design Considerations
      • Our work
      • Now & Future (this section)
  64. About Xingcloud
      • Now
        • 2 billion inserts/updates daily
        • 200k+ aggregation data/day, 6k sec in total
        • http://a.xingcloud.com
        • query response time: <1 sec to 100 sec, 10 sec on average
      • Future
        • Plan Merge
        • Unified process for batch, ad hoc & stream processing, SQL oriented
        • SQL(t): Plan with time window
  65. About Drill
      • Now
        • on Parquet/ORCFile on HDFS
        • Distributed Join
        • Write interface of storage engines
      • Future
        • 1.0 M2: December 2013
        • 1.0 GA: Early 2014
        • more detail on https://issues.apache.org/jira/browse/DRILL
  66. References
      • http://incubator.apache.org/drill/index.html#resources
      • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
      • http://prezi.com/j43vb1umlgqv/timothy-chen/
      • http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memoryefficient-java-tutorial.pdf
      • http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
  67. Q&A