穆黎森:Interactive batch query at scale

BDTC 2013 Beijing China

穆黎森:Interactive batch query at scale Presentation Transcript

  • 1. Interactive Batch Query At Scale: ad hoc query system for game analytics based on Drill. immars@gmail.com
  • 2. Related Topics
    • Java Programming
    • Relational Algebra
    • Distributed Database
    • Hadoop Ecosystem
  • 3. About Us
    • Elex-tech
    • Game Development, Game Publishing
    • SNS Games, Web Games, Mobile Games, Apps
    • Global Market
  • 4. Outline
    • The Problem
    • Brief on Drill
    • Design Considerations
    • Enhancement from Xingcloud
    • Now & Future
  • 5. The Problem
  • 6. The Problem
    • How many logins today?
    • How many individual users this week?
    • Total income today?
    • Paid user amount this month?
    • …
  • 7. The Problem: Facts
    • How many X during time period Y?
    Fact Table:
      user id   event  amount  timestamp
      user_001  login  -       1383729081
      user_002  login  -       1383729082
      user_001  pay    4.99    1383729084
      user_003  login  -       1383729090
  • 8. The Problem: Facts
    • How many logins today?
    • How many individual users this week?
    • Total income today?
    • Paid user amount this month?
    • …
  • 9. The Problem: Facts
    • How many logins today? (Fact Table as on slide 7)
    select count(*) from fact where event='login' and date(timestamp)='2013-12-06';
  • 10. The Problem: Facts
    • How many individual users this week? (Fact Table as on slide 7)
    select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';
  • 11. The Problem: Facts
    • Total income today? (Fact Table as on slide 7)
    select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';
  • 12. The Problem: Facts
    • Number of paying users this month? (Fact Table as on slide 7)
    select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
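The four fact queries above can be tried on the sample rows from the Fact Table slide. This is a minimal sketch using SQLite rather than Drill; the column name `uid` follows the slides' SQL, while the helper name `run_fact_queries` and the use of numeric bounds for the `?` placeholders are assumptions made for illustration.

```python
import sqlite3

# Rows copied from the Fact Table slide: (uid, event, amount, timestamp).
FACTS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]


def run_fact_queries(start, end):
    """Run the four slide queries for the half-open window [start, end)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACTS)

    def scalar(select_expr, event):
        sql = (
            f"SELECT {select_expr} FROM fact "
            "WHERE event = ? AND timestamp >= ? AND timestamp < ?"
        )
        return conn.execute(sql, (event, start, end)).fetchone()[0]

    results = {
        "logins": scalar("COUNT(*)", "login"),
        "login_users": scalar("COUNT(DISTINCT uid)", "login"),
        "income": scalar("SUM(amount)", "pay"),
        "paying_users": scalar("COUNT(DISTINCT uid)", "pay"),
    }
    conn.close()
    return results


if __name__ == "__main__":
    print(run_fact_queries(1383729000, 1383730000))
```

Every query has the same shape (one aggregate, one event filter, one time window), which is what makes the later plan merging possible.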
  • 13. The Problem: Dimensions
    • How many logins today from China?
    • How many individual users of each server this week?
    • Total income today by new user?
    • Paid user amount this month from Adwords?
    • …
  • 14. The Problem: Dimensions
    • The user X's property Y is of value Z
    Dimension Table:
      user id   reg_time  language  refer     …
      user_001  20100612  en        adwords
      user_002  20110927  cn        facebook
      user_003  20121010  fr        admob
      user_004  20130522  it        tapjoy
  • 15. Fact & Dimension
    • Aggregation on Join
    Fact Table:
      user id   event  amount  timestamp
      user_001  login  -       1383729081
      user_002  login  -       1383729082
      user_001  pay    4.99    1383729084
      user_003  login  -       1383729090
    Dimension Table:
      user id   reg_time  language  refer     …
      user_001  20100612  en        adwords
      user_002  20110927  cn        facebook
      user_003  20121010  fr        admob
      user_004  20130522  it        tapjoy
  • 16. Fact & Dimension
    • How many logins today from China?
    • How many individual users of each server this week?
    • Total income today by new user?
    • Paid user amount this month from adwords?
    • …
  • 17. Fact & Dimension
    SELECT COUNT DISTINCT (on uid)
    JOIN (1 fact, n dimension, on uid)
    WHERE (filter by value of dimensions/facts)
    GROUP BY (value of dimension)
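The query shape on slide 17 (count distinct, join fact to dimension on uid, filter, group by a dimension value) can be sketched as follows. SQLite stands in for Drill here, the table name `dim` and the function name `login_users_by_language` are hypothetical, and the rows are the samples from the Fact and Dimension Table slides.

```python
import sqlite3

# Sample rows from the slides.
FACTS = [
    ("user_001", "login", None, 1383729081),
    ("user_002", "login", None, 1383729082),
    ("user_001", "pay", 4.99, 1383729084),
    ("user_003", "login", None, 1383729090),
]
DIMENSIONS = [
    ("user_001", "20100612", "en", "adwords"),
    ("user_002", "20110927", "cn", "facebook"),
    ("user_003", "20121010", "fr", "admob"),
    ("user_004", "20130522", "it", "tapjoy"),
]


def login_users_by_language():
    """Distinct users who logged in, grouped by the language dimension."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
    )
    conn.execute(
        "CREATE TABLE dim (uid TEXT, reg_time TEXT, language TEXT, refer TEXT)"
    )
    conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", FACTS)
    conn.executemany("INSERT INTO dim VALUES (?, ?, ?, ?)", DIMENSIONS)
    rows = conn.execute(
        """
        SELECT d.language, COUNT(DISTINCT f.uid)
        FROM fact AS f
        JOIN dim AS d ON f.uid = d.uid
        WHERE f.event = 'login'
        GROUP BY d.language
        ORDER BY d.language
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(login_users_by_language())
```

user_004 never appears in the fact rows, so the inner join drops it and no "it" group is produced.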
  • 18. Fact & Dimension
    • SQL -> Syntax tree -> Logical Plan -> Physical Plan
    Plan diagram: agg; Join; Join; filter, filter, filter; scan: Dimension, scan: Dimension, scan: Fact
  • 19. Pre-aggregation?
  • 20. (slide without text)
  • 21. Combinatorial Explosion!
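A rough way to see why pre-aggregation explodes: if each dimension can either be fixed to one of its values or left unfiltered, the number of cells to precompute is the product of (cardinality + 1) over all dimensions. The function name `preagg_cells` and this counting model are assumptions for illustration, not figures from the talk.

```python
from math import prod


def preagg_cells(cardinalities):
    """Cells to precompute when every dimension is either fixed to one of
    its values or left unfiltered ("all"), for a single metric and day."""
    return prod(c + 1 for c in cardinalities)


if __name__ == "__main__":
    # language and refer each show 4 distinct values on the Dimension Table slide.
    print(preagg_cells([4, 4]))
    # Adding a hypothetical third dimension with 100 values.
    print(preagg_cells([4, 4, 100]))
```

The count multiplies again by the number of metrics and time windows, which is the motivation for answering queries ad hoc instead.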
  • 22. Access Pattern (Facts / Dimensions)
    • Write: Append (Facts); Insert, update (Dimensions)
    • Read by: date, event, user id, prop value, full table
  • 23. Volume
    • 200GB new Facts
    • 50GB Dimension updates
  • 24. Architecture (diagram)
    • Query
    • Drill, with MySQL StorageEngine and HBase StorageEngine
    • Storage: MySQL, HBase
    • Data Loader
  • 25. Outline
    • The Problem
    • Brief on Drill
    • Design Considerations
    • Our work
    • Now & Future
  • 26. http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac
  • 27. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  • 28. Outline
    • The Problem
    • Brief on Drill
    • Design Considerations
    • Our work
    • Now & Future
  • 29. http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
  • 30. Data Model
    • Various types
    • Nested values, addressed by path, e.g. price.basic
    • Schema-free
    Example record: { name: "icecream", price: { basic: 4.99, coupon: true } }
  • 31. Design Considerations
    • As fast as possible
    • Space efficient
    • Time efficient
  • 32. About Space Efficiency
    • Compact data representation
        • Java object overhead: high
    • JVM friendly (GC)
        • Simpler object graph
        • Less tenured space, less full GC
  • 33. About Time Efficiency
    • Cache friendly
        • data access locality
    • Superscalar: pipeline friendly
        • the inner loop problem
    • SIMD friendly
        • opportunity to operate on a vector of values
    • JVM friendly (JNI)
  • 34. ValueVector & RecordBatch
  • 35. ValueVector & RecordBatch
    • ValueVector
        • small memory overhead
        • backed by DirectByteBuffer
        • further encoding
        • continuous access/random access
  • 36. ValueVector & RecordBatch
    Record: { name: "icecream", price: { basic: 4.99, coupon: true } }
    RecordBatch, one ValueVector per field:
      name:VarChar          i c e c r e a m …
      price.basic:float     4.99 …
      price.coupon:boolean  T …
  • 37. ValueVector & RecordBatch
    • Data passed in RecordBatch
    • Inner loop: next() vs for
    Plan diagram: scan: Fact, scan: Dimension; filter, filter; Join; agg
  • 38. Review the Considerations
    • Cache friendly
    • Superscalar: pipeline friendly
    • SIMD friendly
    • Compact data representation
    • JVM friendly (GC)
    • JVM friendly (JNI)
    (RecordBatch diagram from slide 36 repeated)
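The ValueVector and RecordBatch slides describe storing each field of a batch of records in its own contiguous vector. The sketch below illustrates that layout in Python; the class and function names are hypothetical, and it does not reproduce Drill's Java classes, which the slides say are backed by DirectByteBuffer.

```python
from array import array


class FloatVector:
    """Fixed-width column: all values sit in one contiguous buffer."""

    def __init__(self, values):
        self.data = array("d", values)

    def __len__(self):
        return len(self.data)

    def get(self, index):
        return self.data[index]


class VarCharVector:
    """Variable-width column: one byte buffer plus an offsets array."""

    def __init__(self, strings):
        self.offsets = array("i", [0])
        self.data = bytearray()
        for text in strings:
            self.data.extend(text.encode("utf-8"))
            self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def get(self, index):
        start, end = self.offsets[index], self.offsets[index + 1]
        return bytes(self.data[start:end]).decode("utf-8")


def to_record_batch(records):
    """Flatten nested records into one vector per field path."""
    return {
        "name": VarCharVector([r["name"] for r in records]),
        "price.basic": FloatVector([r["price"]["basic"] for r in records]),
        "price.coupon": array("b", [1 if r["price"]["coupon"] else 0 for r in records]),
    }


def sum_column(vector):
    """Inner loop over a single contiguous column: a plain for loop,
    with no per-row next() call and no per-row object."""
    total = 0.0
    for value in vector.data:
        total += value
    return total


if __name__ == "__main__":
    batch = to_record_batch(
        [
            {"name": "icecream", "price": {"basic": 4.99, "coupon": True}},
            {"name": "cake", "price": {"basic": 2.5, "coupon": False}},
        ]
    )
    print(batch["name"].get(0), sum_column(batch["price.basic"]))
```

An aggregate over price.basic only touches that one buffer, which is the data access locality the Time Efficiency slide asks for.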
  • 39. Outline
    • The Problem
    • Brief on Drill
    • Design Considerations
    • Our work
    • Now & Future
  • 40. Our work, primarily
    • Adhoc batch query
  • 41. Reports: generally 2-dimensional tables
  • 42. Adhoc batch query
      DailyActiveUser  2013-07-26  2013-07-27
      en               576         491
      cn               361         945
  • 43. Adhoc batch query
    Fact:
      user id  event  time
      user_13  login  2013-07-26
      user_13  login  2013-07-26
      user_76  pay    2013-07-27
    Dimension:
      user id  nation
      user_13  cn
      user_76  en
    Report:
      DAU  2013-07-26  2013-07-27
      en   576         491
      cn   361         945
  • 44. Adhoc batch query
      DAU  2013-07-26  2013-07-27
      en   576         491
      cn   361         945
  • 45. Adhoc batch query: one plan per cell of the DAU report
    • en, 2013-07-26 (576): agg <- Join <- [filter date='2013-07-26' <- scan: Fact; filter nation='en' <- scan: Dimension]
    • en, 2013-07-27 (491): agg <- Join <- [filter date='2013-07-27' <- scan: Fact; filter nation='en' <- scan: Dimension]
    • cn, 2013-07-26 (361): agg <- Join <- [filter date='2013-07-26' <- scan: Fact; filter nation='cn' <- scan: Dimension]
    • cn, 2013-07-27 (945): agg <- Join <- [filter date='2013-07-27' <- scan: Fact; filter nation='cn' <- scan: Dimension]
  • 46. (diagram) The same four plans shown without the report: four scans of Fact, four scans of Dimension, eight filters, four Joins, four aggs
  • 47. (diagram) Scans merged: a single scan: Fact and a single scan: Dimension feed all four plans; each plan keeps its own date filter, nation filter, Join and agg
  • 48. Adhoc batch query
    • Benefits
        • Reduce the same Scans
        • Merge similar Scans
    • Possibility
        • SQL usually parses into a tree, while
        • LogicalPlan in Drill is a DAG
  • 49. More Benefits: intermediate result reuse
  • 50. Adhoc batch query (diagram): a single scan: Fact and a single scan: Dimension; four date filters (two for '2013-07-26', two for '2013-07-27'), four nation filters (two for 'en', two for 'cn'), four Joins, four aggs
  • 51. Adhoc batch query (diagram): the nation filters are merged into one Filter nation='en' and one Filter nation='cn'; the four date filters remain
  • 52. Adhoc batch query (diagram): the date filters are merged as well, leaving one Filter date='2013-07-26', one Filter date='2013-07-27', one Filter nation='en', one Filter nation='cn', four Joins and four aggs over the two scans
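The merging shown on slides 47 to 52 can be sketched as operator interning: when a plan asks for an operator with the same kind, argument and inputs as one that already exists, the existing node is reused, so the four per-cell trees collapse into one DAG. The names `PlanBuilder` and `build_dau_report` are hypothetical, and this is an assumed simplification, not Xingcloud's actual merge algorithm.

```python
class PlanBuilder:
    """Builds a plan DAG, reusing any operator already created with the
    same kind, argument and input nodes."""

    def __init__(self):
        self._ids = {}   # (kind, arg, input ids) -> node id
        self.nodes = []  # node id -> (kind, arg, input ids)

    def op(self, kind, arg=None, *inputs):
        key = (kind, arg, inputs)
        if key not in self._ids:
            self._ids[key] = len(self.nodes)
            self.nodes.append(key)
        return self._ids[key]

    def count(self, kind):
        return sum(1 for node in self.nodes if node[0] == kind)


def build_dau_report(builder, dates, nations):
    """Add one agg <- Join <- (filtered Fact, filtered Dimension) plan per
    report cell and return the root node id of each cell."""
    roots = {}
    for nation in nations:
        for date in dates:
            fact = builder.op("filter", f"date='{date}'", builder.op("scan", "Fact"))
            dim = builder.op(
                "filter", f"nation='{nation}'", builder.op("scan", "Dimension")
            )
            join = builder.op("join", "uid", fact, dim)
            roots[(nation, date)] = builder.op("agg", "count distinct uid", join)
    return roots


if __name__ == "__main__":
    builder = PlanBuilder()
    build_dau_report(builder, ["2013-07-26", "2013-07-27"], ["en", "cn"])
    # Four unmerged plans would need 4 * 6 = 24 operators.
    print(len(builder.nodes), builder.count("scan"), builder.count("filter"))
```

For the 2x2 DAU report this leaves 2 scans, 4 filters, 4 Joins and 4 aggs, the same operator counts as the diagram on slide 52.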
  • 53. More Benefits: More Batched, More Offline
  • 54. Single Query
  • 55. Batched 3 Queries
  • 56. Batched Query, from a report
  • 57. Batched Query, from tens of reports, with 1k+ operators
  • 58. Jobs vs Predictions
    • Offline job
        • becomes predictions of what data the user may be interested in
        • by merging more queries together
        • daily predictions & hourly predictions
  • 59. More Benefits: Utilising multi-core
  • 60. Utilising Multi-core
    • Original:
        • Pull data from root
        • Downwards recursively
    Plan diagram: agg <- Join <- [filter nation='en' <- scan: Dimension; filter date='2013-07-26' <- scan: Fact]
  • 61. Utilising Multi-core
    • Now:
        • Push data from leaf
        • Data driven upwards
        • Pooled execution
    Plan diagram: agg <- Join <- [filter nation='en' <- scan: Dimension; filter date='2013-07-26' <- scan: Fact]
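The difference between the two execution styles on slides 60 and 61 can be sketched with a scan, a filter and a count. In the pull version the root asks its child for the next batch and the request recurses down to the scan; in the push version the scan drives the pipeline by handing each batch to its consumer. All names are hypothetical, and the pooled, multi-threaded part of the push model is left out of this single-threaded sketch.

```python
# Sample batches of (user id, event, time) rows from the Adhoc batch query slide.
BATCHES = [
    [("user_13", "login", "2013-07-26"), ("user_13", "login", "2013-07-26")],
    [("user_76", "pay", "2013-07-27")],
]


def is_july_26(row):
    return row[2] == "2013-07-26"


# Pull model: the root iterates, and each request recurses down to the scan.
def pull_scan(batches):
    for batch in batches:
        yield batch


def pull_filter(source, predicate):
    for batch in source:
        yield [row for row in batch if predicate(row)]


def pull_count(source):
    return sum(len(batch) for batch in source)


# Push model: the leaf drives execution and data flows upwards.
class PushCount:
    def __init__(self):
        self.total = 0

    def push(self, batch):
        self.total += len(batch)


class PushFilter:
    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch):
        self.downstream.push([row for row in batch if self.predicate(row)])


def push_scan(batches, downstream):
    for batch in batches:
        downstream.push(batch)


def count_pull():
    return pull_count(pull_filter(pull_scan(BATCHES), is_july_26))


def count_push():
    sink = PushCount()
    push_scan(BATCHES, PushFilter(is_july_26, sink))
    return sink.total


if __name__ == "__main__":
    print(count_pull(), count_push())
```

Because a push operator only reacts to incoming batches, independent leaves of a merged plan could be handed to a worker pool; doing that safely would need synchronisation in shared operators such as the counter, which this sketch omits.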
  • 62. Adhoc batch query
    • Benefits
        • Reduce the same Scans
        • Merge similar Scans
        • Merge intermediate operators
        • Unified process for adhoc & batch process
        • Multi-core process of single Plan
  • 63. Outline
    • The Problem
    • Brief on Drill
    • Design Considerations
    • Our work
    • Now & Future
  • 64. About Xingcloud
    • Now
        • http://a.xingcloud.com
        • 2 billion insert/update daily
        • 200k+ aggregation data/day, 6k sec in total
        • query response time: <1 sec - 100 sec, 10 sec on avg.
    • Future
        • Plan Merge
        • Unified process for batch, adhoc & stream process, SQL oriented
        • SQL(t): Plan with time window
  • 65. About Drill
    • Now
        • on Parquet/ORCFile on HDFS
        • Distributed Join
        • Write interface of storage engines
    • Future
        • 1.0 M2: December 2013
        • 1.0 GA: Early 2014
        • more detail on https://issues.apache.org/jira/browse/DRILL
  • 66. References
    • http://incubator.apache.org/drill/index.html#resources
    • http://www.slideshare.net/jasonfrantz/drill-architecture-20120913
    • http://prezi.com/j43vb1umlgqv/timothy-chen/
    • http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memoryefficient-java-tutorial.pdf
    • http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
  • 67. Q&A