Mu Lisen (穆黎森): Interactive batch query at scale


BDTC 2013 Beijing China


Transcript

  • 1. Interactive Batch Query at Scale: an ad hoc query system for game analytics based on Drill • immars@gmail.com
  • 2. Related Topics • Java Programming • Relational Algebra • Distributed Database • Hadoop Ecosystem
  • 3. About Us • Elex-tech • Game Development, Game Publishing • SNS Games, Web Games, Mobile Games, Apps • Global Market
  • 4. • The Problem • Brief on Drill • Design Considerations • Enhancement from Xingcloud • Now & Future
  • 5. The Problem
  • 6. The Problem • How many logins today? • How many distinct users this week? • Total income today? • How many paying users this month? • …
  • 7. The Problem: Facts • How many X during time period Y?
      Fact Table
      user id    event   amount   timestamp
      user_001   login   -        1383729081
      user_002   login   -        1383729082
      user_001   pay     4.99     1383729084
      user_003   login   -        1383729090
  • 8. The Problem: Facts • How many logins today? • How many distinct users this week? • Total income today? • How many paying users this month? • …
  • 9. The Problem: Facts • How many logins today? • (fact table repeated from slide 7) • select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
  • 10. The Problem: Facts • How many distinct users this week? • (fact table repeated from slide 7) • select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
  • 11. The Problem: Facts • Total income today? • (fact table repeated from slide 7) • select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
  • 12. The Problem: Facts • How many paying users this month? • (fact table repeated from slide 7) • select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
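The four fact queries on slides 9 to 12 can be tried against the sample rows with a small script. This is a minimal sketch, not the system from the talk: it assumes SQLite as a stand-in engine, a column named uid for "user id", and a hypothetical time window (the UTC day 2013-11-06, which happens to contain all four sample timestamps) in place of the '?' placeholders and the date(timestamp) filter on the slides.

```python
import sqlite3

# Rows copied from the Fact Table slide: (uid, event, amount, timestamp).
FACT_ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]

# Assumed window: the UTC day 2013-11-06, which contains every sample timestamp.
DAY_START, DAY_END = 1383696000, 1383782400


def build_fact_table():
    """Load the sample rows into an in-memory SQLite table named fact."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACT_ROWS)
    return conn


def run_fact_queries(conn, start, end):
    """Run the four slide queries over the half-open window [start, end)."""
    window = "timestamp >= ? AND timestamp < ?"
    queries = {
        "logins": f"SELECT COUNT(*) FROM fact WHERE event = 'login' AND {window}",
        "distinct_users": (
            f"SELECT COUNT(DISTINCT uid) FROM fact WHERE event = 'login' AND {window}"
        ),
        "income": f"SELECT SUM(amount) FROM fact WHERE event = 'pay' AND {window}",
        "paying_users": (
            f"SELECT COUNT(DISTINCT uid) FROM fact WHERE event = 'pay' AND {window}"
        ),
    }
    return {
        name: conn.execute(sql, (start, end)).fetchone()[0]
        for name, sql in queries.items()
    }
```

Calling run_fact_queries(build_fact_table(), DAY_START, DAY_END) returns one number per question. All four queries scan the same table with different filters, which is the pattern the later slides exploit.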
  • 13. The Problem: Dimensions • How many logins today from China? • How many distinct users on each server this week? • Total income today from new users? • How many paying users this month from Adwords? • …
  • 14. The Problem: Dimensions • User X's property Y has value Z
      Dimension Table
      user id    reg_time   language   refer      …
      user_001   20100612   en         adwords
      user_002   20110927   cn         facebook
      user_003   20121010   fr         admob
      user_004   20130522   it         tapjoy
  • 15. Fact & Dimension • Aggregation on Join • Fact table (slide 7) joined with Dimension table (slide 14) on user id
  • 16. Fact & Dimension • How many logins today from China? • How many distinct users on each server this week? • Total income today from new users? • How many paying users this month from Adwords? • …
  • 17. Fact & Dimension
      SELECT COUNT DISTINCT (on uid)
      JOIN (1 fact, n dimension, on uid)
      WHERE (filter by value of dimensions/facts)
      GROUP BY (value of dimension)
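The query template on slide 17 can be made concrete with the sample fact and dimension rows from the earlier slides. The sketch below is an illustration under assumptions: SQLite stands in for the query engine, the tables are named fact and dimension, "user id" becomes a uid column, and the chosen question (distinct login users per language) is just one hypothetical instance of the template.

```python
import sqlite3

# Sample rows from the Fact Table and Dimension Table slides.
FACT_ROWS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]
DIMENSION_ROWS = [
    ("user_001", "20100612", "en", "adwords"),
    ("user_002", "20110927", "cn", "facebook"),
    ("user_003", "20121010", "fr", "admob"),
    ("user_004", "20130522", "it", "tapjoy"),
]


def build_tables():
    """Create in-memory fact and dimension tables holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.execute(
        "CREATE TABLE dimension (uid TEXT, reg_time TEXT, language TEXT, refer TEXT)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACT_ROWS)
    conn.executemany("INSERT INTO dimension VALUES (?, ?, ?, ?)", DIMENSION_ROWS)
    return conn


def login_users_by_language(conn):
    """COUNT DISTINCT uid, JOIN on uid, WHERE on a fact, GROUP BY a dimension."""
    sql = """
        SELECT d.language, COUNT(DISTINCT f.uid)
        FROM fact AS f
        JOIN dimension AS d ON f.uid = d.uid
        WHERE f.event = 'login'
        GROUP BY d.language
        ORDER BY d.language
    """
    return conn.execute(sql).fetchall()
```

Swapping the WHERE clause or the GROUP BY column gives the other questions on slide 16 without changing the shape of the query.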
  • 18. Fact & Dimension • SQL -> Syntax tree -> Logical Plan -> Physical Plan • [Diagram: plan tree with one agg, two Joins, and three filters over scan: Dimension, scan: Dimension, and scan: Fact]
  • 19. Pre-aggregation?
  • 20.
  • 21. Combinatorial Explosion!
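A rough count shows why pre-aggregating every combination blows up. The sketch below is an assumption-laden model, not a figure from the talk: it supposes that each dimension is either pinned to one of its values or left unconstrained (hence n + 1 choices per dimension), and that a result is stored per event type and per day. The function name and parameters are hypothetical.

```python
from math import prod


def preaggregated_cells(cardinalities, events, days):
    """Count the result cells needed to precompute every combination.

    cardinalities maps a dimension name to its number of distinct values.
    Each dimension contributes (n + 1) choices: one per value, plus "any".
    """
    return events * days * prod(n + 1 for n in cardinalities.values())
```

Every extra dimension multiplies the total by its cardinality plus one, so the count grows as a product rather than a sum.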
  • 22. Access Pattern • Write: Facts are appended; Dimensions are inserted and updated • Read by: date, event, user id, prop value, full table
  • 23. Volume • 200GB new Facts • 50GB Dimension updates
  • 24. Architecture • Query: Drill, with MySQL StorageEngine and HBase StorageEngine • Storage: MySQL, HBase • Data Loader
  • 25. • The Problem • Brief on Drill • Design Considerations • Our work • Now & Future
  • 26. http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
  • 27. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  • 28. • The Problem • Brief on Drill • Design Considerations • Our work • Now & Future
  • 29. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  • 30. Data Model • Various types • Nested values (e.g. price.basic) • Schema-free • Example: { name: "icecream", price: { basic: 4.99, coupon: true } }
  • 31. Design Considerations • As fast as possible • Space efficient • Time efficient
  • 32. About Space Efficiency • Compact data representation: Java object overhead is high • JVM friendly (GC): simpler object graph; less tenured space, fewer full GCs
  • 33. About Time Efficiency • Cache friendly: data access locality • Superscalar, pipeline friendly: the inner loop problem • SIMD friendly: opportunity to operate on a vector of values • JVM friendly (JNI)
  • 34. ValueVector & RecordBatch • ValueVector
  • 35. ValueVector & RecordBatch • ValueVector • small memory overhead • backed by DirectByteBuffer • further encoding • continuous access/random access
  • 36. ValueVector & RecordBatch • Record: { name: "icecream", price: { basic: 4.99, coupon: true } } • RecordBatch, one ValueVector per field:
      name:VarChar           i c e c r e a m …
      price.basic:float      4.99 …
      price.coupon:boolean   T …
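The layout on slide 36 can be sketched as follows: nested records are flattened into one contiguous column per leaf field. This is only an illustration of the idea, under assumptions. The slides say Drill's ValueVector is backed by DirectByteBuffer; here Python's array and bytearray stand in for those buffers, and the function names and dictionary keys are hypothetical.

```python
from array import array


def to_record_batch(records):
    """Flatten nested records into one column per leaf field."""
    name_bytes = bytearray()          # all names back to back, UTF-8 encoded
    name_offsets = array("i", [0])    # name i spans offsets[i]..offsets[i + 1]
    price_basic = array("d")          # fixed-width floating point column
    price_coupon = bytearray()        # one byte per boolean value
    for record in records:
        name_bytes += record["name"].encode("utf-8")
        name_offsets.append(len(name_bytes))
        price_basic.append(record["price"]["basic"])
        price_coupon.append(1 if record["price"]["coupon"] else 0)
    return {
        "name.bytes": name_bytes,
        "name.offsets": name_offsets,
        "price.basic": price_basic,
        "price.coupon": price_coupon,
    }


def read_name(batch, i):
    """Random access to the i-th variable-length value via the offsets column."""
    start = batch["name.offsets"][i]
    end = batch["name.offsets"][i + 1]
    return batch["name.bytes"][start:end].decode("utf-8")
```

Each column is a single flat buffer rather than a graph of per-record objects, which is the property the space and time efficiency slides are after.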
  • 37. ValueVector & RecordBatch • Data passed in RecordBatch • Inner loop: next() vs for • [Diagram: agg over Join, whose inputs are a filter over scan: Dimension and a filter over scan: Fact]
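The "next() vs for" contrast on slide 37 can be illustrated with two versions of the same filter. This is a structural sketch with hypothetical function names: Python will not show the cache or pipeline gains the talk is about, but it shows where the per-record call disappears once data moves in batches.

```python
def filter_per_record(records, keep):
    """Record-at-a-time: one next() call per record."""
    source = iter(records)
    out = []
    while True:
        try:
            record = next(source)
        except StopIteration:
            return out
        if keep(record):
            out.append(record)


def to_batches(values, size):
    """Split a list of values into consecutive batches of at most size items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def filter_per_batch(batches, keep):
    """Batch-at-a-time: one step per batch, then a plain for loop inside it."""
    out = []
    for batch in batches:      # few outer steps, one per batch
        for value in batch:    # tight inner loop over one column's values
            if keep(value):
                out.append(value)
    return out
```

Both functions return the same result; only the granularity at which operators hand data to each other differs.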
  • 38. Review the Considerations • Cache friendly • Superscalar: pipeline friendly • SIMD friendly • Compact data representation • JVM friendly (GC) • JVM friendly (JNI) • [Diagram: the RecordBatch layout from slide 36]
  • 39. • The Problem • Brief on Drill • Design Considerations • Our work • Now & Future
  • 40. Our work, primarily • Ad hoc batch query
  • 41. Reports: generally 2-dimensional tables
  • 42. Ad hoc batch query
      DailyActiveUser   2013-07-26   2013-07-27
      en                576          491
      cn                361          945
  • 43. Ad hoc batch query
      Fact
      user id   event   time
      user_13   login   2013-07-26
      user_13   login   2013-07-26
      user_76   pay     2013-07-27
      Dimension
      user id   nation
      user_13   cn
      user_76   en
      DAU   2013-07-26   2013-07-27
      en    576          491
      cn    361          945
  • 44. Ad hoc batch query • (DAU table repeated from slide 42)
  • 45. Ad hoc batch query • [Diagram: one plan per DAU cell, each with scan: Fact, a date filter, scan: Dimension, a nation filter, a Join, and an agg: date='2013-07-26', nation='en' gives 576; date='2013-07-27', nation='en' gives 491; date='2013-07-26', nation='cn' gives 361; date='2013-07-27', nation='cn' gives 945]
  • 46. [Diagram: the four plans from slide 45, without the DAU table]
  • 47. [Diagram: the four plans sharing a single scan: Fact and a single scan: Dimension; each plan keeps its own date filter, nation filter, Join, and agg]
  • 48. Ad hoc batch query • Benefits: reduce the same Scans; merge similar Scans • Possibility: SQL usually parses into a tree, while the LogicalPlan in Drill is a DAG
  • 49. More Benefits: Intermediate result reuse
  • 50. Ad hoc batch query • [Diagram: one scan: Fact and one scan: Dimension feeding four date filters, four nation filters, four Joins, and four aggs]
  • 51. Ad hoc batch query • [Diagram: the nation filters merged into one Filter nation='en' and one Filter nation='cn'; four date filters, four Joins, and four aggs remain]
  • 52. Ad hoc batch query • [Diagram: the date filters merged as well: one Filter per date and one Filter per nation feed four Joins and four aggs]
  • 53. More Benefits: More Batched, More Offline
  • 54. Single Query
  • 55. Batched 3 Queries
  • 56. Batched Query, from a report
  • 57. Batched Query, from tens of reports, with 1k+ operators
  • 58. Jobs vs Predictions • An offline job becomes a prediction of what data users may be interested in, by merging more queries together • Daily predictions & hourly predictions
  • 59. More Benefits: Utilising multi-core
  • 60. Utilising Multi-core • Original: pull data from the root, downwards recursively • [Diagram: agg over Join, whose inputs are filter nation='en' over scan: Dimension and filter date='2013-07-26' over scan: Fact]
  • 61. Utilising Multi-core • Now: push data from the leaves • Data driven upwards • Pooled execution • [Diagram: same plan as slide 60]
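The pull and push models on slides 60 and 61 can be contrasted with a small sketch. It is a simplification under stated assumptions: the Join is left out so the plan is just scan, filter, and a counting agg; Python generators stand in for the pull-style operators; and a ThreadPoolExecutor stands in for "pooled execution". All class and function names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


# Pull model: the root asks its child for rows, recursively down to the scan.
def pull_scan(rows):
    yield from rows


def pull_filter(child, keep):
    for row in child:
        if keep(row):
            yield row


def pull_count(child):
    return sum(1 for _ in child)


# Push model: leaves hand batches upwards; any pool worker can carry a batch.
class CountSink:
    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def push(self, batch):
        with self._lock:  # several workers may reach the sink concurrently
            self.count += len(batch)


class PushFilter:
    def __init__(self, keep, parent):
        self.keep = keep
        self.parent = parent

    def push(self, batch):
        kept = [row for row in batch if self.keep(row)]
        if kept:
            self.parent.push(kept)


def run_push(batches, leaf, workers=4):
    """Feed each batch to the leaf operator on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(leaf.push, batches))  # list() surfaces worker exceptions
```

In the pull version a single call chain drives the whole plan, so one thread does all the work. In the push version each batch is an independent unit of work, which is what lets a single plan spread over several cores.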
  • 62. Ad hoc batch query • Benefits • Reduce the same Scans • Merge similar Scans • Merge intermediate operators • Unified process for ad hoc & batch processing • Multi-core processing of a single Plan
  • 63. • The Problem • Brief on Drill • Design Considerations • Our work • Now & Future
  • 64. About Xingcloud • Now: http://a.xingcloud.com • 2 billion inserts/updates daily • 200k+ aggregation data/day, 6k sec in total • Query response time: <1 sec to 100 sec, 10 sec on avg. • Future: Plan Merge • Unified process for batch, ad hoc & stream processing, SQL oriented • SQL(t): Plan with time window
  • 65. About Drill • Now: Distributed Join • on Parquet/ORCFile on HDFS • Write interface of storage engines • Future: 1.0 M2: December 2013 • 1.0 GA: Early 2014 • More detail on https://issues.apache.org/jira/browse/DRILL
  • 66. References • http://incubator.apache.org/drill/index.html#resources • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913 • http://prezi.com/j43vb1umlgqv/timothy-chen/ • http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memoryefficient-java-tutorial.pdf • http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
  • 67. Q&A