Hadoop User Group - Status Apache Drill

An Apache Drill status update given by Michael Hausenblas, MapR's Chief Data Engineer EMEA (2013-04-19)


Transcript of "Hadoop User Group - Status Apache Drill"

  1. Apache Drill status
     Michael Hausenblas, Chief Data Engineer EMEA, MapR
     HUG Munich, 2013-04-19
  2. Kudos to http://cmx.io/
  3. Workloads
     • Batch processing (MapReduce)
     • Light-weight OLTP (HBase, Cassandra, etc.)
     • Stream processing (Storm, S4)
     • Search (Solr, Elasticsearch)
     • Interactive, ad-hoc query and analysis (?)
  4. Impala
     Interactive Query at Scale
     low-latency
  5. Use Case I
     • Jane, a marketing analyst
     • Determine target segments
     • Data from different sources
  6. Use Case II
     • Logistics – supplier status
     • Queries
       – How many shipments from supplier X?
       – How many shipments in region Y?

     SUPPLIER_ID  NAME                 REGION
     ACM          ACME Corp            US
     GAL          GotALot Inc          US
     BAP          Bits and Pieces Ltd  Europe
     ZUP          Zu Pli               Asia

     { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
     { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
     …
  7. Today's Solutions
     • RDBMS-focused
       – ETL data from MongoDB and Hadoop
       – Query data using SQL
     • MapReduce-focused
       – ETL from RDBMS and MongoDB
       – Use Hive, etc.
  8. Requirements
     • Support for different data sources
     • Support for different query interfaces
     • Low-latency/real-time
     • Ad-hoc queries
     • Scalable, reliable
  9. Google's Dremel*
     *) http://research.google.com/pubs/pub36632.html
  10. Apache Drill Overview
     • Inspired by Google's Dremel
     • Standard SQL 2003 support
     • Other query languages possible
     • Pluggable data sources
     • Support for nested data
     • Schema is optional
     • Community driven, open, 100s involved
  11. High-level Architecture
  12. High-level Architecture
     • Each node: Drillbit - maximize data locality
     • Co-ordination, query planning, execution, etc., are distributed
     • By default Drillbits hold all roles
     • Any node can act as endpoint for a query
     [Diagram: four nodes, each with Storage, Process, and a Drillbit]
  13. High-level Architecture
     • Zookeeper for ephemeral cluster membership info
     • Distributed cache (Hazelcast) for metadata, locality information, etc.
     [Diagram: Curator/Zk alongside four nodes, each with Storage, Process, a Drillbit, and a Distributed Cache]
  14. High-level Architecture
     • Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
     • Streaming data communication avoiding SerDe
     [Diagram: Curator/Zk alongside four nodes, each with Storage, Process, a Drillbit, and a Distributed Cache]
  15. Principled Query Execution
     [Diagram stages: Source Query, Parser, Logical Plan, Optimizer, Physical Plan, Execution]
     [Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; topology]
     query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" },
  16. Drillbit Modules
     [Diagram modules: DFS Engine, HBase Engine, Mongo, Storage Engine Interface, RPC Endpoint, SQL, HiveQL, Pig, Parser, Distributed Cache, Logical Plan, Physical Plan, Optimizer, Scheduler, Foreman, Operators]
  17. Key Features
     • Full SQL 2003
     • Nested data
     • Optional schema
     • Extensibility points
  18. Full SQL – ANSI SQL 2003
     • SQL-like is often not enough
     • Integration with existing tools
       – Datameer, Tableau, Excel, SAP Crystal Reports
       – Use standard ODBC/JDBC driver
  19. Nested Data
     • Nested data becoming prevalent
       – JSON/BSON, XML, ProtoBuf, Avro
       – Some data sources support it natively (MongoDB, etc.)
     • Flattening nested data is error-prone
     • Extension to ANSI SQL 2003
  20. Optional Schema
     • Many data sources don't have rigid schemas
       – Schema changes rapidly
       – Different schema per record (e.g. HBase)
     • Supports queries against unknown schema
     • Schema can be defined by the user or obtained via discovery
  21. Extensibility Points
     • Source query → parser API
     • Custom operators, UDF → logical plan
     • Serving tree, CF, topology → physical plan/optimizer
     • Data sources & formats → scanner API
     [Diagram stages: Source Query, Parser, Logical Plan, Optimizer, Physical Plan, Execution]
  22. … and Hadoop?
     • HDFS can be a data source
     • Complementary use cases*
     • … use Apache Drill
       – Find record with specified condition
       – Aggregation under dynamic conditions
     • … use MapReduce
       – Data mining with multiple iterations
       – ETL
     *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  23. Example
     https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

     data source: donuts.json
     { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …

     logical plan: simple_plan.json
     query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …

     result: out.json
     { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
     { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
     { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  24. Status
     • Heavy development by multiple organizations
     • Available
       – Logical plan (ADSP)
       – Reference interpreter
       – Basic SQL parser
       – Basic demo
       – Basic HBase back-end
  25. Status
     April 2013
     • Extend SQL syntax
     • Physical plan
     • In-memory compressed data interfaces
     • Distributed execution
  26. Contributing
     • Learn where and how to contribute
       https://cwiki.apache.org/confluence/display/DRILL/Contributing
     • Jira, Git, Apache build and test tools
     • Preparing for dependencies
       – Hazelcast
       – Netflix Curator
  27. Contributing
     General contributions appreciated:
     • Supersonic (?)
     • Test data & test queries
     • Use case scenarios (textual desc./SQL queries)
     • Documentation
  28. Contributing
     • Dremel-inspired columnar format
       – Twitter's Parquet
       – Hive's ORC file
     • Integration with Hive metastore (?)
     • DRILL-13 Storage Engine: Define Java Interface
     • DRILL-15 Build HBase storage engine implementation
  29. Contributing
     • DRILL-48 RPC interface for query submission and physical plan execution
     • DRILL-53 Setup cluster configuration and membership mgmt system
     • Further schedule
       – Alpha Q2
       – Beta Q3
  30. Kudos to …
     • Julian Hyde, Pentaho
     • Lisen Mu
     • Tim Chen, Microsoft
     • Chris Merrick, RJMetrics
     • David Alves, UT Austin
     • Sree Vaadi, SSS/NGData
     • Jacques Nadeau, MapR
     • Ted Dunning, MapR
  31. Engage!
     • Follow @ApacheDrill on Twitter
     • Sign up at mailing lists (user | dev)
       http://incubator.apache.org/drill/mailing-lists.html
     • Standing G+ hangouts every Tuesday at 18:00 CET
     • Keep an eye on http://drill-user.org/
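
The two questions on slide 6 (Use Case II) combine a relational supplier table with JSON shipment documents. The following is a minimal Python sketch of what answering them involves, not Drill itself: the function names are hypothetical, and the data is limited to the four suppliers and the two shipment records actually shown on the slide.

```python
import json
from collections import Counter

# Supplier table from slide 6 (the relational side).
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# The two shipment documents shown on slide 6 (the document-store side).
SHIPMENTS_JSON = """
[
  {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
   "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
   "description": "hope you enjoy it"}
]
"""


def shipments_per_supplier(shipments):
    """Count shipments per supplier id (hypothetical helper)."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Join each shipment to its supplier row, then count per region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


shipments = json.loads(SHIPMENTS_JSON)
by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, SUPPLIERS)

print("Shipments from supplier ACM:", by_supplier["ACM"])
print("Shipments in region Europe:", by_region["Europe"])
```

The point of the sketch is the join across two differently shaped sources. With the "today's solutions" from slide 7, one side would first have to be moved by ETL into the other system before such a query could run.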
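
The logical plan fragments on slides 15 and 23 describe a sequence of operators (scan, then filter) applied to a data source. The toy interpreter below illustrates that idea in Python under loud assumptions: it is not Drill's reference interpreter, the `run` function and the `SOURCES` registry are hypothetical, the plan is simplified from the slide (the `selection` field is dropped and the source is named after the file), only `<` and `>` comparisons against a number are handled, and the second record is invented purely so the filter has something to reject.

```python
import operator
import re

# Hypothetical in-memory data source registry.
SOURCES = {
    "donuts.json": [
        # Abridged from the record shown on slide 23.
        {"id": "0001", "type": "donut", "ppu": 0.55},
        # Invented row (not in the slides), only to exercise the filter.
        {"id": "9999", "type": "example", "ppu": 2.50},
    ]
}

# Simplified version of the plan on slide 23: scan, then filter.
PLAN = [
    {"op": "scan", "ref": "donuts", "source": "donuts.json"},
    {"op": "filter", "expr": "donuts.ppu < 2.00"},
]

COMPARATORS = {"<": operator.lt, ">": operator.gt}
# Matches expressions of the form "<ref>.<field> <op> <number>".
EXPR = re.compile(r"^(\w+)\.(\w+)\s*([<>])\s*([\d.]+)$")


def run(plan, sources):
    """Apply each operator of the plan in order and return the surviving rows."""
    rows = []
    for step in plan:
        if step["op"] == "scan":
            # Wrap every record under its reference name, e.g. {"donuts": {...}}.
            ref = step["ref"]
            rows = [{ref: record} for record in sources[step["source"]]]
        elif step["op"] == "filter":
            match = EXPR.match(step["expr"])
            if match is None:
                raise ValueError("unsupported expression: " + step["expr"])
            ref, field, symbol, literal = match.groups()
            compare = COMPARATORS[symbol]
            threshold = float(literal)
            rows = [r for r in rows if compare(r[ref][field], threshold)]
        else:
            raise ValueError("unsupported operator: " + step["op"])
    return rows


result = run(PLAN, SOURCES)
for row in result:
    print(row)
```

Keeping the plan as plain data separate from the code that executes it is what lets different front ends (SQL 2003, DrQL, MongoQL on slide 15) target the same execution machinery.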