穆黎森:Interactive batch query at scale

BDTC 2013 Beijing China

    Presentation Transcript

    • Interactive Batch Query at Scale • Ad hoc query system for game analytics, based on Drill • immars@gmail.com
    • Related Topics • Java Programming • Relational Algebra • Distributed Database • Hadoop Ecosystem
    • About Us • Elex-tech • Game Development, Game Publishing • SNS Games, Web Games, Mobile Games, Apps • Global Market
    • Outline • The Problem (current section) • Brief on Drill • Design Considerations • Enhancement from Xingcloud • Now & Future
    • The Problem
    • The Problem • How many logins today? • How many unique users this week? • Total income today? • Number of paying users this month? • …
    • The Problem: Facts • How many X during time period Y? • Fact Table:
        user id (uid)  event  amount  timestamp
        user_001       login  -       1383729081
        user_002       login  -       1383729082
        user_001       pay    4.99    1383729084
        user_003       login  -       1383729090
    • The Problem: Facts • How many logins today? • How many unique users this week? • Total income today? • Number of paying users this month? • …
    • The Problem: Facts • How many logins today? • (same Fact Table)
        select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
    • The Problem: Facts • How many unique users this week? • (same Fact Table)
        select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
    • The Problem: Facts • Total income today? • (same Fact Table)
        select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
    • The Problem: Facts • Number of paying users this month? • (same Fact Table)
        select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
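The four fact queries on these slides can be tried against the sample Fact Table. The following is a minimal sketch using Python's built-in sqlite3, not the Drill-based system from the talk. The column types, the helper name `scalar`, and the day boundaries are assumptions: the sample timestamps are Unix epoch seconds that fall on 2013-11-06 (UTC), so that day stands in for "today" and the same window is reused for every question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table fact (uid text, event text, amount real, timestamp integer)"
)
# Rows copied from the Fact Table slide; '-' is stored as NULL.
conn.executemany(
    "insert into fact values (?, ?, ?, ?)",
    [
        ("user_001", "login", None, 1383729081),
        ("user_002", "login", None, 1383729082),
        ("user_001", "pay", 4.99, 1383729084),
        ("user_003", "login", None, 1383729090),
    ],
)

# Assumed time window: 2013-11-06 00:00:00 UTC up to the next midnight.
DAY_START, DAY_END = 1383696000, 1383782400


def scalar(sql):
    """Run a single-value query over the assumed time window."""
    return conn.execute(sql, (DAY_START, DAY_END)).fetchone()[0]


WINDOW = " and timestamp >= ? and timestamp < ?"
logins = scalar("select count(*) from fact where event='login'" + WINDOW)
unique_users = scalar(
    "select count(distinct uid) from fact where event='login'" + WINDOW
)
income = scalar("select sum(amount) from fact where event='pay'" + WINDOW)
paying_users = scalar(
    "select count(distinct uid) from fact where event='pay'" + WINDOW
)

print(logins, unique_users, income, paying_users)
```

All four questions share one shape: an aggregate over the fact rows that survive an event filter and a time-range filter.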
    • The Problem: Dimensions • How many logins today from China? • How many unique users on each server this week? • Total income today from new users? • Number of paying users this month from AdWords? • …
    • The Problem: Dimensions • User X's property Y has value Z • Dimension Table:
        user id   reg_time  language  refer     …
        user_001  20100612  en        adwords
        user_002  20110927  cn        facebook
        user_003  20121010  fr        admob
        user_004  20130522  it        tapjoy
    • Fact & Dimension • Aggregation on Join • Fact Table:
        user id   event  amount  timestamp
        user_001  login  -       1383729081
        user_002  login  -       1383729082
        user_001  pay    4.99    1383729084
        user_003  login  -       1383729090
      Dimension Table:
        user id   reg_time  language  refer     …
        user_001  20100612  en        adwords
        user_002  20110927  cn        facebook
        user_003  20121010  fr        admob
        user_004  20130522  it        tapjoy
    • Fact & Dimension • How many logins today from China? • How many unique users on each server this week? • Total income today from new users? • Number of paying users this month from AdWords? • …
    • Fact & Dimension • SELECT COUNT DISTINCT (on uid) • JOIN (1 fact, n dimensions, on uid) • WHERE (filter by values of dimensions/facts) • GROUP BY (value of a dimension)
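The SELECT / JOIN / WHERE / GROUP BY template above can be made concrete with the two sample tables. This is a sketch in Python's sqlite3, assuming a single dimension table named `dimension` and the 2013-11-06 (UTC) window that the sample timestamps fall in; treating `language` as the grouping dimension is also an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table fact (uid text, event text, amount real, timestamp integer);
    create table dimension (uid text primary key, reg_time text,
                            language text, refer text);
    """
)
conn.executemany(
    "insert into fact values (?, ?, ?, ?)",
    [
        ("user_001", "login", None, 1383729081),
        ("user_002", "login", None, 1383729082),
        ("user_001", "pay", 4.99, 1383729084),
        ("user_003", "login", None, 1383729090),
    ],
)
conn.executemany(
    "insert into dimension values (?, ?, ?, ?)",
    [
        ("user_001", "20100612", "en", "adwords"),
        ("user_002", "20110927", "cn", "facebook"),
        ("user_003", "20121010", "fr", "admob"),
        ("user_004", "20130522", "it", "tapjoy"),
    ],
)

DAY_START, DAY_END = 1383696000, 1383782400  # assumed window: 2013-11-06 UTC

# COUNT DISTINCT on uid, JOIN fact with one dimension, filter, then group.
users_by_language = conn.execute(
    """
    select d.language, count(distinct f.uid)
    from fact f join dimension d on f.uid = d.uid
    where f.event = 'login' and f.timestamp >= ? and f.timestamp < ?
    group by d.language
    order by d.language
    """,
    (DAY_START, DAY_END),
).fetchall()

# Filtering on a dimension value instead of grouping by it.
adwords_income = conn.execute(
    """
    select sum(f.amount)
    from fact f join dimension d on f.uid = d.uid
    where f.event = 'pay' and d.refer = 'adwords'
    """
).fetchone()[0]

print(users_by_language, adwords_income)
```

user_004 never appears in the fact table, so the inner join drops it and no `it` group is produced.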
    • Fact & Dimension • SQL -> Syntax tree -> Logical Plan -> Physical Plan • [Diagram: plan tree with agg at the root, above Joins, above filters, above the leaves scan: Dimension, scan: Dimension and scan: Fact]
    • Pre-aggregation?
    • Combinatorial Explosion!
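A rough count shows why pre-aggregating every possible answer does not scale. The sketch below assumes that each dimension is either pinned to one of its values or left unconstrained, so a set of dimensions with cardinalities c1, c2, ... needs (c1 + 1) * (c2 + 1) * ... pre-computed cells per metric and time bucket. The dimension names and cardinalities are hypothetical and do not come from the talk.

```python
from math import prod

# Hypothetical dimension cardinalities, for illustration only.
cardinalities = {"language": 50, "refer": 20, "server": 100, "reg_day": 365}


def preaggregated_cells(cards):
    """Cells needed if every dimension is a fixed value or 'any'."""
    return prod(count + 1 for count in cards.values())


cells = preaggregated_cells(cardinalities)
print(f"{cells:,} cells per metric and time bucket")
```

Each additional dimension multiplies the total, which is the growth the slide calls a combinatorial explosion.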
    • Access Pattern • Write: Facts are appended; Dimensions are inserted and updated • Read by: date, event, user id, prop value, full table
    • Volume • 200GB new Facts • 50GB Dimension updates
    • Architecture • Query: Drill, with MySQL StorageEngine and HBase StorageEngine • Storage: MySQL, HBase • Data Loader
    • Outline • The Problem • Brief on Drill (current section) • Design Considerations • Our work • Now & Future
    • http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
    • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
    • Outline • The Problem • Brief on Drill • Design Considerations (current section) • Our work • Now & Future
    • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
    • Data Model • Various types • Nested values (e.g. price.basic) • Schema-free • Example record: { name: "icecream", price: { basic: 4.99, coupon: true } }
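The dotted path `price.basic` on the Data Model slide suggests that nested fields are addressed by their full path. Here is a small sketch of that idea in Python, assuming records arrive as nested dictionaries; the function name `flatten` is hypothetical and this is not Drill's own code.

```python
def flatten(record, prefix=""):
    """Map a nested record to {dotted.path: leaf value}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested values, extending the path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = {"name": "icecream", "price": {"basic": 4.99, "coupon": True}}
columns = flatten(record)
print(columns)
```

No schema is declared up front: the set of paths is discovered from the record itself, which is one way to read "schema-free".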
    • Design Considerations • As fast as possible • Space efficient • Time efficient
    • About Space Efficiency • Compact data representation • Java object overhead: high • JVM friendly (GC) • Simpler object graph • Less tenured space, fewer full GCs
    • About Time Efficiency • Cache friendly: data access locality • Superscalar, pipeline friendly: the inner loop problem • SIMD friendly: opportunity to operate on a vector of values • JVM friendly (JNI)
    • ValueVector & RecordBatch • ValueVector
    • ValueVector & RecordBatch • ValueVector • small memory overhead • backed by DirectByteBuffer • further encoding • continuous access/random access
    • ValueVector & RecordBatch • Record: { name: "icecream", price: { basic: 4.99, coupon: true } } • RecordBatch, one ValueVector per field:
        name: VarChar          i c e c r e a m …
        price.basic: float     4.99 …
        price.coupon: boolean  T …
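The column-per-field layout in the RecordBatch figure can be imitated with Python's `array` and `bytearray` types. This is only an analogy: Drill's ValueVectors are Java classes backed by DirectByteBuffer, and the class name `VarCharVector`, the offsets scheme, and the second record ("cake") are assumptions added for illustration.

```python
from array import array


class VarCharVector:
    """Variable-width strings as one byte buffer plus an offsets array."""

    def __init__(self):
        self.data = bytearray()
        self.offsets = array("i", [0])  # offsets[i]..offsets[i+1] bounds value i

    def append(self, text):
        self.data += text.encode("utf-8")
        self.offsets.append(len(self.data))

    def get(self, index):
        start, end = self.offsets[index], self.offsets[index + 1]
        return self.data[start:end].decode("utf-8")

    def __len__(self):
        return len(self.offsets) - 1


# Flattened records; the second one is hypothetical.
records = [
    {"name": "icecream", "price.basic": 4.99, "price.coupon": True},
    {"name": "cake", "price.basic": 2.5, "price.coupon": False},
]

names = VarCharVector()
basic = array("d")    # fixed-width floats, stored contiguously
coupon = bytearray()  # one byte per boolean

for rec in records:
    names.append(rec["name"])
    basic.append(rec["price.basic"])
    coupon.append(1 if rec["price.coupon"] else 0)

# A "record batch" here is simply one vector per field path.
batch = {"name": names, "price.basic": basic, "price.coupon": coupon}
print(bytes(names.data), list(names.offsets), list(basic), list(coupon))
```

Storing values of one field side by side avoids one object per value, which is the compact, GC-friendly representation the space-efficiency slide asks for.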
    • ValueVector & RecordBatch • Data passed between operators in RecordBatches • Inner loop: next() vs for • [Diagram: agg above Join, above the filters, above scan: Dimension and scan: Fact]
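"Inner loop: next() vs for" contrasts two ways of moving data through operators. The sketch below, with hypothetical amounts and function names, computes the same filtered sum both ways: record at a time, where every value passes through a chain of `next()` calls, and batch at a time, where each operator call handles a whole vector in a plain `for` loop.

```python
from array import array

amounts = array("d", [0.99, 4.99, 9.99, 1.99])  # hypothetical values


def record_at_a_time(values, threshold):
    """Each value travels through scan -> filter via one next() per operator."""

    def scan():
        for value in values:
            yield value

    def keep_large(child):
        for value in child:
            if value >= threshold:
                yield value

    total = 0.0
    for value in keep_large(scan()):
        total += value
    return total


def batch_at_a_time(values, threshold, batch_size=2):
    """One call per batch; the inner loop is a tight for over a vector."""
    total = 0.0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        for value in batch:
            if value >= threshold:
                total += value
    return total


record_total = record_at_a_time(amounts, 2.0)
batch_total = batch_at_a_time(amounts, 2.0)
print(record_total, batch_total)
```

In Python the difference is only structural, but on the JVM the batch form keeps the hot loop free of per-record virtual calls, which is what makes it friendlier to pipelining and SIMD.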
    • Review the Considerations • Cache friendly • Superscalar: pipeline friendly • SIMD friendly • Compact data representation • JVM friendly (GC) • JVM friendly (JNI) • [Diagram: RecordBatch with the vectors name: VarChar, price.basic: float and price.coupon: boolean]
    • Outline • The Problem • Brief on Drill • Design Considerations • Our work (current section) • Now & Future
    • Our work, primarily • Ad hoc batch query
    • Reports: generally 2-dimensional tables
    • Ad hoc batch query • DailyActiveUser report:
        DailyActiveUser  2013-07-26  2013-07-27
        en               576         491
        cn               361         945
    • Ad hoc batch query • Fact:
        user id  event  time
        user_13  login  2013-07-26
        user_13  login  2013-07-26
        user_76  pay    2013-07-27
      Dimension:
        user id  nation
        user_13  cn
        user_76  en
      DAU report:
        DAU  2013-07-26  2013-07-27
        en   576         491
        cn   361         945
    • Ad hoc batch query • DAU report:
        DAU  2013-07-26  2013-07-27
        en   576         491
        cn   361         945
    • Ad hoc batch query • [Diagram: each of the four DAU cells has its own plan. In every plan, scan: Fact -> filter (date) and scan: Dimension -> filter (nation) feed Join -> agg. The filters are date='2013-07-26' or date='2013-07-27', combined with nation='en' or nation='cn'; the results are 576, 491, 361 and 945.]
    • [Diagram: the same four plans side by side, each with its own scan: Fact and scan: Dimension]
    • [Diagram: merged plan. A single scan: Fact feeds the four date filters and a single scan: Dimension feeds the four nation filters; four Join -> agg branches remain.]
    • Ad hoc batch query • Benefits • Eliminate identical Scans • Merge similar Scans • Possibility • SQL usually parses into a tree, while • LogicalPlan in Drill is a DAG
    • More Benefits: Intermediate result reuse
    • Ad hoc batch query • [Diagram: merged plan with shared scans. scan: Fact feeds four date filters, scan: Dimension feeds four nation filters, and four Join -> agg branches follow.]
    • Ad hoc batch query • [Diagram: the nation filters are merged. One Filter nation='en' and one Filter nation='cn' are shared by the Joins; the four date filters are still separate.]
    • Ad hoc batch query • [Diagram: the date filters are merged as well. Two date Filters and two nation Filters feed the four Join -> agg branches.]
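The merging shown in these diagrams can be sketched by treating each operator as a value and keeping structurally identical operators only once. The code below is an illustration, not Drill's LogicalPlan API: operators are plain tuples, and `cell_plan`, `tree_size` and `merge` are hypothetical helpers. It also merges scans and filters in a single pass, whereas the slides show the steps separately.

```python
from itertools import product


def cell_plan(date, nation):
    """Plan tree for one report cell, as nested tuples."""
    fact = ("filter", f"date='{date}'", ("scan", "Fact"))
    dim = ("filter", f"nation='{nation}'", ("scan", "Dimension"))
    return ("agg", ("join", fact, dim))


def children(node):
    # Child operators are the tuple-valued parts of a node.
    return [part for part in node[1:] if isinstance(part, tuple)]


def tree_size(node):
    return 1 + sum(tree_size(child) for child in children(node))


def merge(plans):
    """Build a DAG: each distinct operator appears once, keyed by structure."""
    dag = {}

    def visit(node):
        if node not in dag:
            dag[node] = children(node)
            for child in dag[node]:
                visit(child)

    for plan in plans:
        visit(plan)
    return dag


dates = ["2013-07-26", "2013-07-27"]
nations = ["en", "cn"]
plans = [cell_plan(d, n) for d, n in product(dates, nations)]

separate_operators = sum(tree_size(plan) for plan in plans)
dag = merge(plans)
shared_scans = [node for node in dag if node[0] == "scan"]

print(separate_operators, len(dag), len(shared_scans))
```

For the four DAU cells, 24 operators in separate trees collapse to 14 in the DAG: 2 scans, 4 filters, 4 joins and 4 aggregations. A shared operator runs once and its output feeds every parent, which is the intermediate result reuse the slide refers to.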
    • More Benefits: More Batched, More Offline
    • Single Query
    • 3 Batched Queries
    • Batched Query, from a report
    • Batched Query, from tens of reports, with 1k+ operators
    • Jobs vs Predictions • An offline job becomes a prediction of what data users may be interested in • by merging more queries together • Daily predictions & hourly predictions
    • More Benefits: Utilising multi-core
    • Utilising Multi-core • Original: • Pull data from the root • Downwards, recursively • [Diagram: agg above Join, above filter nation='en' (on scan: Dimension) and filter date='2013-07-26' (on scan: Fact)]
    • Utilising Multi-core • Now: • Push data from the leaves • Data driven, upwards • Pooled execution • [Diagram: agg above Join, above filter nation='en' (on scan: Dimension) and filter date='2013-07-26' (on scan: Fact)]
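The two execution styles can be contrasted on a cut-down plan (scan: Fact -> filter by date -> count distinct users). This is a sketch under assumptions: the Join with the Dimension table is left out for brevity, and the class and function names are hypothetical rather than taken from Drill or the Xingcloud code. In the pull version the root drives its children through iterators; in the push version the scan submits batches to a thread pool and each batch flows upwards through the operators.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

# Sample rows from the Fact table on the DAU slide: (user id, event, time).
FACT = [
    ("user_13", "login", "2013-07-26"),
    ("user_13", "login", "2013-07-26"),
    ("user_76", "pay", "2013-07-27"),
]


# Pull model: the root asks its child for rows, recursively downwards.
def scan(rows):
    yield from rows


def filter_date(child, date):
    for row in child:
        if row[2] == date:
            yield row


def count_distinct_users(child):
    return len({row[0] for row in child})


pulled = count_distinct_users(filter_date(scan(FACT), "2013-07-26"))


# Push model: leaves push batches to their parent operator.
class CountDistinctUsers:
    def __init__(self):
        self.seen = set()
        self.lock = Lock()  # batches may arrive from several worker threads

    def push(self, batch):
        with self.lock:
            self.seen.update(row[0] for row in batch)

    def result(self):
        return len(self.seen)


class FilterDate:
    def __init__(self, date, parent):
        self.date = date
        self.parent = parent

    def push(self, batch):
        self.parent.push([row for row in batch if row[2] == self.date])


def scan_and_push(rows, parent, pool, batch_size=2):
    """Submit each batch as a pooled task, then wait for all of them."""
    futures = [
        pool.submit(parent.push, rows[start:start + batch_size])
        for start in range(0, len(rows), batch_size)
    ]
    for future in futures:
        future.result()


sink = CountDistinctUsers()
with ThreadPoolExecutor(max_workers=2) as pool:
    scan_and_push(FACT, FilterDate("2013-07-26", sink), pool)
pushed = sink.result()

print(pulled, pushed)
```

With push, several batches of a single plan can be in flight on different cores at once, while a pull chain is driven by whichever thread calls the root.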
    • Ad hoc batch query • Benefits • Eliminate identical Scans • Merge similar Scans • Merge intermediate operators • Unified processing for ad hoc & batch queries • Multi-core processing of a single Plan
    • Outline • The Problem • Brief on Drill • Design Considerations • Our work • Now & Future
    • About Xingcloud • http://a.xingcloud.com • Now • 2 billion inserts/updates daily • 200k+ aggregations per day, 6k sec in total • Query response time: <1 sec to 100 sec, 10 sec on average • Future • Plan merge • Unified, SQL-oriented processing for batch, ad hoc & stream • SQL(t): Plan with time window
    • About Drill • Now • on Parquet/ORCFile on HDFS • Distributed Join • Write interface of storage engines • Future • 1.0 M2: December 2013 • 1.0 GA: early 2014 • More detail at https://issues.apache.org/jira/browse/DRILL
    • References • http://incubator.apache.org/drill/index.html#resources • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913 • http://prezi.com/j43vb1umlgqv/timothy-chen/ • http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memoryefficient-java-tutorial.pdf • http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
    • Q&A