Your SlideShare is downloading. ×
0
Apache	
  Drill	
  
a	
  interac.ve,	
  ad-­‐hoc	
  query	
  system	
  for	
  large-­‐scale	
  datasets	
  
Michael	
  Hau...
Which	
  
workloads	
  do	
  
you	
  
encounter	
  in	
  
your	
  
environment?	
  
h>p://www.flickr.com/photos/kevinomara/...
Batch	
  processing	
  
…	
  for	
  recurring	
  tasks	
  such	
  as	
  large-­‐scale	
  data	
  mining,	
  ETL	
  
offloadi...
OLTP	
  
…	
  user-­‐facing	
  eCommerce	
  transac[ons,	
  real-­‐[me	
  messaging	
  at	
  
scale	
  (FB),	
  [me-­‐seri...
Stream	
  processing	
  
…	
  in	
  order	
  to	
  handle	
  stream	
  sources	
  such	
  as	
  social	
  media	
  feeds	
...
Search/Informa[on	
  Retrieval	
  
…	
  retrieval	
  of	
  items	
  from	
  unstructured	
  documents	
  (plain	
  
text,	...
h>p://www.flickr.com/photos/9479603@N02/4144121838/	
  	
  licensed	
  under	
  CC	
  BY-­‐NC-­‐ND	
  2.0	
  
But	
  what	
...
Impala
Interac[ve	
  Query	
  (?)	
  
low-­‐latency	
  
Use	
  Case:	
  Marke[ng	
  Campaign	
  
•  Jane,	
  a	
  marke[ng	
  analyst	
  
•  Determine	
  target	
  segments	
  
•...
Use	
  Case:	
  Logis[cs	
  
•  Supplier	
  tracking	
  and	
  performance	
  
•  Queries	
  
– Shipments	
  from	
  suppl...
Use	
  Case:	
  Crime	
  Detec[on	
  
•  Online	
  purchases	
  
•  Fraud,	
  bilking,	
  etc.	
  
•  Batch-­‐generated	
 ...
Requirements	
  
•  Support	
  for	
  different	
  data	
  sources	
  
•  Support	
  for	
  different	
  query	
  interfaces...
And now for something completely different …
Google’s	
  Dremel	
  
h>p://research.google.com/pubs/pub36632.html	
  	
  
	
  
Sergey	
  Melnik,	
  Andrey	
  Gubarev,	
...
Google’s	
  Dremel	
  
multi-level execution trees	
  
columnar data layout	
  
Google’s	
  Dremel	
  
nested data + schema	
   column-striped representation	
  
map nested data to tables	
  
Google’s	
  Dremel	
  
experiments:
datasets & query performance	
  
Back to Apache Drill …
Apache	
  Drill–key	
  facts	
  
•  Inspired	
  by	
  Google’s	
  Dremel	
  
•  Standard	
  SQL	
  2003	
  support	
  
•  ...
High-­‐level	
  Architecture	
  
Principled	
  Query	
  Execu[on	
  
Source	
  
Query	
   Parser	
  
Logical	
  
Plan	
   Op[mizer	
  
Physical	
  
Plan	
 ...
Wire-­‐level	
  Architecture	
  
•  Each	
  node:	
  Drillbit	
  -­‐	
  maximize	
  data	
  locality	
  
•  Co-­‐ordina[on...
Wire-­‐level	
  Architecture	
  
•  Curator/Zookeeper	
  for	
  ephemeral	
  cluster	
  membership	
  info	
  
•  Distribu...
Wire-­‐level	
  Architecture	
  
•  Origina[ng	
  Drillbit	
  acts	
  as	
  foreman:	
  manages	
  query	
  execu[on,	
  
...
Wire-­‐level	
  Architecture	
  
Foreman	
  turns	
  into	
  
root	
  of	
  the	
  mul[-­‐level	
  
execu[on	
  tree,	
  l...
Key	
  features	
  
•  Full	
  SQL	
  –	
  ANSI	
  SQL	
  2003	
  
•  Nested	
  Data	
  as	
  first	
  class	
  ci[zen	
  
...
Extensibility	
  Points	
  
•  Source	
  query	
  à	
  parser	
  API	
  
•  Custom	
  operators,	
  UDF	
  à	
  logical	...
…	
  and	
  Hadoop?	
  
•  HDFS	
  can	
  be	
  a	
  data	
  source	
  
•  Complementary	
  use	
  cases*	
  
•  …	
  use	...
Basic	
  Demo	
  
h>ps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo	
  	
  
{
"id": "0001",
"type": "donut",
”ppu...
BE	
  A	
  PART	
  OF	
  IT!	
  
Status	
  
•  Heavy	
  development	
  by	
  mul[ple	
  organiza[ons	
  
•  Available	
  
– Logical	
  plan	
  (ADSP)	
  
–...
Status	
  
May	
  2013	
  
	
  
•  Full	
  SQL	
  support	
  (+JDBC)	
  
•  Physical	
  plan	
  
•  In-­‐memory	
  compres...
Status	
  
May	
  2013	
  
	
  
•  HBase	
  and	
  MySQL	
  storage	
  engine	
  
•  WebUI	
  client	
  
Contribu[ng	
  
Contribu[ons	
  appreciated	
  (not	
  only	
  code	
  drops)	
  …	
  
	
  
•  Test	
  data	
  &	
  test	
...
Kudos	
  to	
  …	
  
•  Julian	
  Hyde,	
  Pentaho	
  	
  
•  Lisen	
  Mu,	
  XingCloud	
  
•  Tim	
  Chen,	
  Microsow	
 ...
Engage!	
  
•  Follow	
  @ApacheDrill	
  on	
  Twi>er	
  
•  Sign	
  up	
  at	
  mailing	
  lists	
  (user	
  |	
  dev)	
 ...
Upcoming SlideShare
Loading in...5
×

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

2,180

Published on

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets given by MapR Chief Data Engineer EMEA . Big Data User Group in Stuttgart 2013-05-16

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
2,180
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
14
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Transcript of "Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets"

  1. 1. Apache  Drill   a  interac.ve,  ad-­‐hoc  query  system  for  large-­‐scale  datasets   Michael  Hausenblas,  Chief  Data  Engineer  EMEA,  MapR   Big  Data  User  Group  Stu>gart,  2013-­‐05-­‐16  
  2. 2. Which   workloads  do   you   encounter  in   your   environment?   h>p://www.flickr.com/photos/kevinomara/2866648330/  licensed  under  CC  BY-­‐NC-­‐ND  2.0  
  3. 3. Batch  processing   …  for  recurring  tasks  such  as  large-­‐scale  data  mining,  ETL   offloading/data-­‐warehousing  à  for  the  batch  layer  in  Lambda   architecture  
  4. 4. OLTP   …  user-­‐facing  eCommerce  transac[ons,  real-­‐[me  messaging  at   scale  (FB),  [me-­‐series  processing,  etc.  à  for  the  serving  layer  in   Lambda  architecture  
  5. 5. Stream  processing   …  in  order  to  handle  stream  sources  such  as  social  media  feeds   or  sensor  data  (mobile  phones,  RFID,  weather  sta[ons,  etc.)  à   for  the  speed  layer  in  Lambda  architecture    
  6. 6. Search/Informa[on  Retrieval   …  retrieval  of  items  from  unstructured  documents  (plain   text,  etc.),  semi-­‐structured  data  formats  (JSON,  etc.),  as   well  as  data  stores  (MongoDB,  CouchDB,  etc.)  
  7. 7. h>p://www.flickr.com/photos/9479603@N02/4144121838/    licensed  under  CC  BY-­‐NC-­‐ND  2.0   But  what  about   interac.ve   ad-­‐hoc  query     at  scale?  
  8. 8. Impala Interac[ve  Query  (?)   low-­‐latency  
  9. 9. Use  Case:  Marke[ng  Campaign   •  Jane,  a  marke[ng  analyst   •  Determine  target  segments   •  Data  from  different  sources    
  10. 10. Use  Case:  Logis[cs   •  Supplier  tracking  and  performance   •  Queries   – Shipments  from  supplier  ‘ACM’  in  last  24h   – Shipments  in  region  ‘US’  not  from  ‘ACM’   SUPPLIER_ID   NAME   REGION   ACM   ACME  Corp   US   GAL   GotALot  Inc   US   BAP   Bits  and  Pieces  Ltd   Europe   ZUP   Zu  Pli   Asia   { "shipment": 100123, "supplier": "ACM", “timestamp": "2013-02-01", "description": ”first delivery today” }, { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it” } …
  11. 11. Use  Case:  Crime  Detec[on   •  Online  purchases   •  Fraud,  bilking,  etc.   •  Batch-­‐generated  overview   •  Modes   – Explora[ve   – Alerts  
  12. 12. Requirements   •  Support  for  different  data  sources   •  Support  for  different  query  interfaces   •  Low-­‐latency/real-­‐[me   •  Ad-­‐hoc  queries   •  Scalable,  reliable  
  13. 13. And now for something completely different …
  14. 14. Google’s  Dremel   h>p://research.google.com/pubs/pub36632.html       Sergey  Melnik,  Andrey  Gubarev,  Jing  Jing  Long,  Geoffrey  Romer,  Shiva  Shivakumar,  Ma@  Tolton,   Theo  Vassilakis,  Proc.  of  the  36th  Int'l  Conf  on  Very  Large  Data  Bases  (2010),  pp.  330-­‐339   Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. … “ “ Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …
  15. 15. Google’s  Dremel   multi-level execution trees   columnar data layout  
  16. 16. Google’s  Dremel   nested data + schema   column-striped representation   map nested data to tables  
  17. 17. Google’s  Dremel   experiments: datasets & query performance  
  18. 18. Back to Apache Drill …
  19. 19. Apache  Drill–key  facts   •  Inspired  by  Google’s  Dremel   •  Standard  SQL  2003  support   •  Plug-­‐able  data  sources   •  Nested  data  is  a  first-­‐class  ci[zen   •  Schema  is  op.onal   •  Community  driven,  open,  100’s  involved  
  20. 20. High-­‐level  Architecture  
  21. 21. Principled  Query  Execu[on   Source   Query   Parser   Logical   Plan   Op[mizer   Physical   Plan   Execu[on   SQL  2003     DrQL   MongoQL   DSL   scanner  API  Topology   CF   etc.   query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” }, parser  API  
  22. 22. Wire-­‐level  Architecture   •  Each  node:  Drillbit  -­‐  maximize  data  locality   •  Co-­‐ordina[on,  query  planning,  execu[on,  etc,  are  distributed   •  Any  node  can  act  as  endpoint  for  a  query—foreman   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node  
  23. 23. Wire-­‐level  Architecture   •  Curator/Zookeeper  for  ephemeral  cluster  membership  info   •  Distributed  cache  (Hazelcast)  for  metadata,  locality   informa[on,  etc.   Curator/Zk   Distributed  Cache   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Distributed  Cache   Distributed  Cache   Distributed  Cache  
  24. 24. Wire-­‐level  Architecture   •  Origina[ng  Drillbit  acts  as  foreman:  manages  query  execu[on,   scheduling,  locality  informa[on,  etc.   •  Streaming  data  communica.on  avoiding  SerDe   Curator/Zk   Distributed  Cache   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Storage   Process   Drillbit   node   Distributed  Cache   Distributed  Cache   Distributed  Cache  
  25. 25. Wire-­‐level  Architecture   Foreman  turns  into   root  of  the  mul[-­‐level   execu[on  tree,  leafs   ac[vate  their  storage   engine  interface.   node   node   node   Curator/Zk  
  26. 26. Key  features   •  Full  SQL  –  ANSI  SQL  2003   •  Nested  Data  as  first  class  ci[zen   •  Op[onal  Schema   •  Extensibility  Points  …  
  27. 27. Extensibility  Points   •  Source  query  à  parser  API   •  Custom  operators,  UDF  à  logical  plan   •  Serving  tree,  CF,  topology  à  physical  plan/op[mizer   •  Data  sources  &formats  à  scanner  API   Source   Query   Parser   Logical   Plan   Op[mizer   Physical   Plan   Execu[on  
  28. 28. …  and  Hadoop?   •  HDFS  can  be  a  data  source   •  Complementary  use  cases*   •  …  use  Apache  Drill   –  Find  record  with  specified  condi[on   –  Aggrega[on  under  dynamic  condi[ons   •  …  use  MapReduce   –  Data  mining  with  mul[ple  itera[ons   –  ETL   *)  h>ps://cloud.google.com/files/BigQueryTechnicalWP.pdf    
  29. 29. Basic  Demo   h>ps://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo     { "id": "0001", "type": "donut", ”ppu": 0.55, "batters": { "batter”: [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, … data  source:  donuts.json   query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, … logical  plan:  simple_plan.json   result:  out.json   { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 } { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 } { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  30. 30. BE  A  PART  OF  IT!  
  31. 31. Status   •  Heavy  development  by  mul[ple  organiza[ons   •  Available   – Logical  plan  (ADSP)   – Reference  interpreter   – Basic  SQL  parser     – Basic  demo  
  32. 32. Status   May  2013     •  Full  SQL  support  (+JDBC)   •  Physical  plan   •  In-­‐memory  compressed  data  interfaces   •  Distributed  execu[on  
  33. 33. Status   May  2013     •  HBase  and  MySQL  storage  engine   •  WebUI  client  
  34. 34. Contribu[ng   Contribu[ons  appreciated  (not  only  code  drops)  …     •  Test  data  &  test  queries   •  Use  case  scenarios  (textual/SQL  queries)   •  Documenta[on   •  Further  schedule   –  Alpha  Q2   –  Beta  Q3  
  35. 35. Kudos  to  …   •  Julian  Hyde,  Pentaho     •  Lisen  Mu,  XingCloud   •  Tim  Chen,  Microsow   •  Chris  Merrick,  RJMetrics     •  David  Alves,  UT  Aus[n   •  Sree  Vaadi,  SSS/NGData   •  Jacques  Nadeau,  MapR   •  Ted  Dunning,  MapR  
  36. 36. Engage!   •  Follow  @ApacheDrill  on  Twi>er   •  Sign  up  at  mailing  lists  (user  |  dev)     h>p://incubator.apache.org/drill/mailing-­‐lists.html       •  Standing  G+  hangouts  every  Tuesday  at  5pm  GMT   h>p://j.mp/apache-­‐drill-­‐hangouts     •  Keep  an  eye  on  h>p://drill-­‐user.org/    
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×