Apache Drill: An Interactive, Ad-hoc Query System for Large-scale Data Sets
Apache Drill: An Interactive, Ad-hoc Query System for Large-scale Data Sets, given by Michael Hausenblas, Chief Data Engineer EMEA at MapR, at the Big Data User Group in Stuttgart, 2013-05-16.

Transcript

  • 1. Apache Drill: an interactive, ad-hoc query system for large-scale datasets
      Michael Hausenblas, Chief Data Engineer EMEA, MapR
      Big Data User Group Stuttgart, 2013-05-16
  • 2. Which workloads do you encounter in your environment?
      (Image: http://www.flickr.com/photos/kevinomara/2866648330/ licensed under CC BY-NC-ND 2.0)
  • 3. Batch processing
      … for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing → for the batch layer in Lambda architecture
  • 4. OLTP
      … user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc. → for the serving layer in Lambda architecture
  • 5. Stream processing
      … in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.) → for the speed layer in Lambda architecture
  • 6. Search/Information Retrieval
      … retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
  • 7. But what about interactive ad-hoc query at scale?
      (Image: http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0)
  • 8. Interactive Query (?)
      Impala, low-latency
  • 9. Use Case: Marketing Campaign
      • Jane, a marketing analyst
      • Determine target segments
      • Data from different sources
  • 10. Use Case: Logistics
      • Supplier tracking and performance
      • Queries
          – Shipments from supplier ‘ACM’ in last 24h
          – Shipments in region ‘US’ not from ‘ACM’

      SUPPLIER_ID | NAME                | REGION
      ACM         | ACME Corp           | US
      GAL         | GotALot Inc         | US
      BAP         | Bits and Pieces Ltd | Europe
      ZUP         | Zu Pli              | Asia

      { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
      { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
      …
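The two logistics queries on this slide can be sketched in plain Python over the supplier table and the two shipment records shown. This is only an illustration of what the queries ask for, not Drill syntax or Drill's execution model; the function names and the choice to pass "now" explicitly are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Supplier table and shipment records copied from the slide.
SUPPLIERS = [
    {"supplier_id": "ACM", "name": "ACME Corp", "region": "US"},
    {"supplier_id": "GAL", "name": "GotALot Inc", "region": "US"},
    {"supplier_id": "BAP", "name": "Bits and Pieces Ltd", "region": "Europe"},
    {"supplier_id": "ZUP", "name": "Zu Pli", "region": "Asia"},
]
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]

# Lookup used to join shipments (JSON) against suppliers (table).
REGION_BY_SUPPLIER = {s["supplier_id"]: s["region"] for s in SUPPLIERS}


def shipments_from(supplier_id, now, window=timedelta(hours=24)):
    """Hypothetical helper: shipment ids from one supplier within `window` before `now`."""
    result = []
    for s in SHIPMENTS:
        ts = datetime.strptime(s["timestamp"], "%Y-%m-%d")
        if s["supplier"] == supplier_id and now - window <= ts <= now:
            result.append(s["shipment"])
    return result


def shipments_in_region_excluding(region, excluded_supplier):
    """Hypothetical helper: shipment ids in `region` that are not from `excluded_supplier`."""
    return [
        s["shipment"]
        for s in SHIPMENTS
        if REGION_BY_SUPPLIER.get(s["supplier"]) == region
        and s["supplier"] != excluded_supplier
    ]
```

With only the two records visible on the slide, the second query for region 'US' returns nothing, because the only US shipment comes from 'ACM'. The point of the use case is that the join spans a relational table and a JSON document source.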
  • 11. Use Case: Crime Detection
      • Online purchases
      • Fraud, bilking, etc.
      • Batch-generated overview
      • Modes
          – Explorative
          – Alerts
  • 12. Requirements
      • Support for different data sources
      • Support for different query interfaces
      • Low-latency/real-time
      • Ad-hoc queries
      • Scalable, reliable
  • 13. And now for something completely different …
  • 14. Google’s Dremel
      http://research.google.com/pubs/pub36632.html
      Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
      “Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …”
  • 15. Google’s Dremel
      • multi-level execution trees
      • columnar data layout
  • 16. Google’s Dremel
      • nested data + schema
      • column-striped representation
      • map nested data to tables
  • 17. Google’s Dremel
      • experiments: datasets & query performance
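As a rough illustration of what "column-striped representation" means for nested data, the sketch below groups the leaf values of nested records by their field path, so each path becomes one column. It is a deliberate simplification: Dremel additionally stores repetition and definition levels with each value so that records can be reassembled, which this sketch omits. The function name `stripe` is made up for the example, and the sample record uses only the fields visible in the donut record of the demo slide.

```python
from collections import defaultdict


def stripe(records):
    """Group leaf values of nested records by dotted field path (one list per column)."""
    columns = defaultdict(list)

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + (key,))
        elif isinstance(value, list):
            # Repeated field: every element contributes to the same column.
            for child in value:
                walk(child, path)
        else:
            columns[".".join(path)].append(value)

    for record in records:
        walk(record, ())
    return dict(columns)


donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {
        "batter": [
            {"id": "1001", "type": "Regular"},
            {"id": "1002", "type": "Chocolate"},
        ]
    },
}
columns = stripe([donut])
```

A query that only touches `ppu` then needs to read a single column rather than whole records, which is the main motivation for a columnar layout.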
  • 18. Back to Apache Drill …
  • 19. Apache Drill: key facts
      • Inspired by Google’s Dremel
      • Standard SQL 2003 support
      • Pluggable data sources
      • Nested data is a first-class citizen
      • Schema is optional
      • Community driven, open, hundreds involved
  • 20. High-level Architecture
  • 21. Principled Query Execution
      Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
      (Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; Topology, CF, etc.)
      query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" },
  • 22. Wire-level Architecture
      • Each node: Drillbit - maximize data locality
      • Co-ordination, query planning, execution, etc. are distributed
      • Any node can act as endpoint for a query (the foreman)
      (Diagram: four nodes, each with a Drillbit and a storage process)
  • 23. Wire-level Architecture
      • Curator/Zookeeper for ephemeral cluster membership info
      • Distributed cache (Hazelcast) for metadata, locality information, etc.
      (Diagram: Curator/Zk; four nodes, each with a Drillbit, a distributed cache, and a storage process)
  • 24. Wire-level Architecture
      • Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.
      • Streaming data communication avoiding SerDe
      (Diagram: Curator/Zk; four nodes, each with a Drillbit, a distributed cache, and a storage process)
  • 25. Wire-level Architecture
      Foreman turns into root of the multi-level execution tree, leaves activate their storage engine interface.
      (Diagram: Curator/Zk; three nodes)
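The idea of a multi-level execution tree, with the foreman as root and leaves reading from local storage, can be sketched as a tree of partial aggregates. This is a toy model under stated assumptions: the `Node` class, its fields, and the use of an average as the aggregate are invented for illustration and do not reflect Drill's actual classes or wire protocol.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical tree node: leaves hold local rows, inner nodes hold children."""
    rows: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)

    def partial(self) -> Tuple[int, float]:
        """Return (count, total) for this subtree by merging the children's partial results."""
        count, total = len(self.rows), sum(self.rows)
        for child in self.children:
            c, t = child.partial()
            count += c
            total += t
        return count, total


def average(root: Node) -> Optional[float]:
    """The root (foreman) finalizes the aggregate from the merged partial results."""
    count, total = root.partial()
    return total / count if count else None


# Three leaves with local data, one intermediate node, and the root.
leaf_a = Node(rows=[1.0, 2.0])
leaf_b = Node(rows=[3.0])
leaf_c = Node(rows=[4.0, 5.0])
root = Node(children=[Node(children=[leaf_a, leaf_b]), leaf_c])
```

Each level only passes a small (count, total) pair upward instead of raw rows, which is why this shape suits aggregation queries over large data sets.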
  • 26. Key features
      • Full SQL – ANSI SQL 2003
      • Nested data as first-class citizen
      • Optional schema
      • Extensibility points …
  • 27. Extensibility Points
      • Source query → parser API
      • Custom operators, UDF → logical plan
      • Serving tree, CF, topology → physical plan/optimizer
      • Data sources & formats → scanner API
      Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
  • 28. … and Hadoop?
      • HDFS can be a data source
      • Complementary use cases*
      • … use Apache Drill
          – Find record with specified condition
          – Aggregation under dynamic conditions
      • … use MapReduce
          – Data mining with multiple iterations
          – ETL
      *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • 29. Basic Demo
      https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
      data source: donuts.json
      { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
      logical plan: simple_plan.json
      query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
      result: out.json
      { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
      { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
      { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
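To make the "sequence of scan and filter" shape of the logical plan concrete, here is a minimal sketch of an interpreter for such a plan in Python. It is not Drill's reference interpreter: the step format is simplified (the filter is a structured field/comparison/value triple rather than an expression string such as "donuts.ppu < 2.00"), and the names `run_sequence`, `OPS`, and `SOURCES` are assumptions. The second input record is hypothetical and exists only so that the filter has something to reject.

```python
import operator

# Comparison operators supported by this toy filter step.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def run_sequence(steps, sources):
    """Run the steps in order; each step consumes the rows produced by the previous one."""
    rows = []
    for step in steps:
        if step["op"] == "scan":
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            compare = OPS[step["cmp"]]
            rows = [r for r in rows if compare(r[step["field"]], step["value"])]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return rows


SOURCES = {
    "donuts": [
        {"id": "0001", "type": "donut", "ppu": 0.55},  # from the slide
        {"id": "9999", "type": "donut", "ppu": 2.50},  # hypothetical record, illustration only
    ]
}

PLAN = [
    {"op": "scan", "source": "donuts"},
    {"op": "filter", "field": "ppu", "cmp": "<", "value": 2.00},
]

result = run_sequence(PLAN, SOURCES)
```

Because every step maps rows to rows, further operators can be added as extra branches without changing the driver loop.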
  • 30. BE A PART OF IT!
  • 31. Status
      • Heavy development by multiple organizations
      • Available
          – Logical plan (ADSP)
          – Reference interpreter
          – Basic SQL parser
          – Basic demo
  • 32. Status May 2013
      • Full SQL support (+JDBC)
      • Physical plan
      • In-memory compressed data interfaces
      • Distributed execution
  • 33. Status May 2013
      • HBase and MySQL storage engine
      • WebUI client
  • 34. Contributing
      Contributions appreciated (not only code drops) …
      • Test data & test queries
      • Use case scenarios (textual/SQL queries)
      • Documentation
      • Further schedule
          – Alpha Q2
          – Beta Q3
  • 35. Kudos to …
      • Julian Hyde, Pentaho
      • Lisen Mu, XingCloud
      • Tim Chen, Microsoft
      • Chris Merrick, RJMetrics
      • David Alves, UT Austin
      • Sree Vaadi, SSS/NGData
      • Jacques Nadeau, MapR
      • Ted Dunning, MapR
  • 36. Engage!
      • Follow @ApacheDrill on Twitter
      • Sign up at mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
      • Standing G+ hangouts every Tuesday at 5pm GMT: http://j.mp/apache-drill-hangouts
      • Keep an eye on http://drill-user.org/