Hadoop User Group - Status Apache Drill
 

An Apache Drill status update given by Michael Hausenblas, MapR's Chief Data Engineer EMEA (2013-04-19)

    Hadoop User Group - Status Apache Drill: Presentation Transcript

    • Apache Drill status
      Michael Hausenblas, Chief Data Engineer EMEA, MapR
      HUG Munich, 2013-04-19
    • Kudos to http://cmx.io/
    • Workloads
      • Batch processing (MapReduce)
      • Light-weight OLTP (HBase, Cassandra, etc.)
      • Stream processing (Storm, S4)
      • Search (Solr, Elasticsearch)
      • Interactive, ad-hoc query and analysis (?)
    • Impala: Interactive Query at Scale (low-latency)
    • Use Case I
      • Jane, a marketing analyst
      • Determine target segments
      • Data from different sources
    • Use Case II
      • Logistics – supplier status
      • Queries
        – How many shipments from supplier X?
        – How many shipments in region Y?

      SUPPLIER_ID  NAME                 REGION
      ACM          ACME Corp            US
      GAL          GotALot Inc          US
      BAP          Bits and Pieces Ltd  Europe
      ZUP          Zu Pli               Asia

      { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
      { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
      …
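The two questions on this slide amount to a count over the shipment documents and a join against the supplier table. A minimal Python sketch of that logic follows, using only the rows shown on the slide; the function names and the in-memory layout are assumptions made for illustration and are not Drill APIs.

```python
# Supplier table from the slide, keyed by SUPPLIER_ID.
suppliers = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Shipment documents from the slide (the slide elides further records).
shipments = [
    {"shipment": 100123, "supplier": "ACM",
     "timestamp": "2013-02-01", "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP",
     "timestamp": "2013-02-02", "description": "hope you enjoy it"},
]


def shipments_from_supplier(supplier_id):
    """Count shipments whose 'supplier' field matches the given ID."""
    return sum(1 for s in shipments if s["supplier"] == supplier_id)


def shipments_in_region(region):
    """Count shipments by joining each document to its supplier's region."""
    return sum(
        1 for s in shipments
        if suppliers[s["supplier"]]["region"] == region
    )


print(shipments_from_supplier("ACM"))
print(shipments_in_region("Europe"))
```

The second function is the interesting one: it has to combine a relational table with JSON documents, which is the cross-source situation the use case describes.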
    • Today's Solutions
      • RDBMS-focused
        – ETL data from MongoDB and Hadoop
        – Query data using SQL
      • MapReduce-focused
        – ETL from RDBMS and MongoDB
        – Use Hive, etc.
    • Requirements
      • Support for different data sources
      • Support for different query interfaces
      • Low-latency/real-time
      • Ad-hoc queries
      • Scalable, reliable
    • Google's Dremel*
      *) http://research.google.com/pubs/pub36632.html
    • Apache Drill Overview
      • Inspired by Google's Dremel
      • Standard SQL 2003 support
      • Other query languages possible
      • Pluggable data sources
      • Support for nested data
      • Schema is optional
      • Community driven, open, hundreds involved
    • High-level Architecture
    • High-level Architecture
      • Each node: Drillbit - maximize data locality
      • Co-ordination, query planning, execution, etc., are distributed
      • By default Drillbits hold all roles
      • Any node can act as endpoint for a query
      [Diagram: four nodes, each with a storage process and a Drillbit]
    • High-level Architecture
      • ZooKeeper for ephemeral cluster membership info
      • Distributed cache (Hazelcast) for metadata, locality information, etc.
      [Diagram: Curator/Zk alongside four nodes, each with a storage process, a Drillbit, and a distributed cache]
    • High-level Architecture
      • Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
      • Streaming data communication avoiding SerDe
      [Diagram: Curator/Zk alongside four nodes, each with a storage process, a Drillbit, and a distributed cache]
    • Principled Query Execution
      Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
      Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; topology
      Logical plan excerpt:
      query: [ { @id: "log", op: "sequence", do: [
        { op: "scan", source: "logs" },
        { op: "filter", condition: "x > 3" },
    • Drillbit Modules
      [Diagram labels: RPC Endpoint; SQL, HiveQL, Pig; Parser; Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Distributed Cache; Storage Engine Interface; DFS Engine, HBase Engine, Mongo]
    • Key Features
      • Full SQL 2003
      • Nested data
      • Optional schema
      • Extensibility points
    • Full SQL – ANSI SQL 2003
      • SQL-like is often not enough
      • Integration with existing tools
        – Datameer, Tableau, Excel, SAP Crystal Reports
        – Use standard ODBC/JDBC driver
    • Nested Data
      • Nested data becoming prevalent
        – JSON/BSON, XML, ProtoBuf, Avro
        – Some data sources support it natively (MongoDB, etc.)
      • Flattening nested data is error-prone
      • Extension to ANSI SQL 2003
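One way flattening goes wrong can be shown with the donut record from the Example slide later in this deck. The sketch below is an illustration in Python, not Drill behavior; the `flatten` helper and the flattened column names are assumptions. Flattening repeats the parent's fields once per nested element, so a naive aggregate over the flat rows over-counts, and a parent with no nested elements disappears entirely.

```python
# Record taken from the donuts.json excerpt on the Example slide
# (the slide elides further batters and fields).
record = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {"batter": [
        {"id": "1001", "type": "Regular"},
        {"id": "1002", "type": "Chocolate"},
    ]},
}


def flatten(rec):
    """Emit one flat row per nested batter, repeating the parent fields."""
    return [
        {"id": rec["id"], "type": rec["type"], "ppu": rec["ppu"],
         "batter_id": b["id"], "batter_type": b["type"]}
        for b in rec["batters"]["batter"]
    ]


rows = flatten(record)

# Summing ppu over the flat rows counts the single donut twice.
naive_total = sum(r["ppu"] for r in rows)
correct_total = record["ppu"]

# A donut without batters yields no rows at all after flattening.
no_batters = dict(record, batters={"batter": []})
lost_rows = flatten(no_batters)

print(len(rows), naive_total, correct_total, len(lost_rows))
```

Querying the nested structure directly avoids both problems, which is the motivation the slide gives for extending ANSI SQL 2003.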
    • Optional Schema
      • Many data sources don't have rigid schemas
        – Schema changes rapidly
        – Different schema per record (e.g. HBase)
      • Supports queries against unknown schema
      • Schema can be defined by the user or obtained via discovery
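Schema discovery can be pictured as scanning records and collecting the fields and value types actually observed. The following Python sketch shows that idea only; it is not how Drill implements discovery, and the sample records (apart from the two shipment IDs and supplier codes borrowed from Use Case II) are hypothetical.

```python
def discover_schema(records):
    """Map each field name to the set of Python type names seen for it."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Records with differing fields, as in a source without a rigid schema.
# The 'priority' and 'note' fields and the third record are invented.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "note": "id arrived as text"},
]

schema = discover_schema(records)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

A field that maps to more than one type (here `shipment`) or that is absent from some records is exactly the case a fixed, up-front schema handles poorly.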
    • Extensibility Points
      • Source query → parser API
      • Custom operators, UDF → logical plan
      • Serving tree, CF, topology → physical plan/optimizer
      • Data sources & formats → scanner API
      Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
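The scanner extension point can be imagined as a registry in which each data format contributes a function that turns raw input into records. The Python sketch below illustrates that plug-in pattern under stated assumptions: `register_scanner`, `scan`, and the format names are hypothetical and do not reflect Drill's actual Java scanner API.

```python
import csv
import io
import json

# Hypothetical registry: format name -> function yielding dict records.
SCANNERS = {}


def register_scanner(fmt):
    """Decorator that registers a scanner function under a format name."""
    def decorator(fn):
        SCANNERS[fmt] = fn
        return fn
    return decorator


@register_scanner("jsonl")
def scan_json_lines(text):
    """Yield one record per non-empty line of JSON."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


@register_scanner("csv")
def scan_csv(text):
    """Yield one record per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def scan(fmt, text):
    """Look up the scanner for a format and materialize its records."""
    return list(SCANNERS[fmt](text))


print(scan("jsonl", '{"supplier": "ACM"}\n{"supplier": "BAP"}\n'))
print(scan("csv", "supplier,region\nACM,US\n"))
```

Supporting a new format then means registering one more function; the code that consumes records does not change.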
    • … and Hadoop?
      • HDFS can be a data source
      • Complementary use cases*
      • … use Apache Drill
        – Find record with specified condition
        – Aggregation under dynamic conditions
      • … use MapReduce
        – Data mining with multiple iterations
        – ETL
      *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
    • Example
      https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

      data source: donuts.json
      { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …

      logical plan: simple_plan.json
      query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …

      result: out.json
      { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
      { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
      { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
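To make the scan and filter steps of the plan excerpt concrete, here is a toy Python interpreter for those two operators. It is a sketch under assumptions, not Drill's reference interpreter: the plan is written as Python dicts, the `selection` key and the elided later steps are left out, the expression parser handles only `ref.field <op> number`, and the second input record is invented so that the filter has something to reject.

```python
import operator
import re

# Comparison operators supported by this toy expression parser.
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def parse_expr(expr):
    """Parse 'ref.field <op> number' into (ref, field, compare, literal)."""
    m = re.fullmatch(r"\s*(\w+)\.(\w+)\s*(<|>|==)\s*([\d.]+)\s*", expr)
    if m is None:
        raise ValueError(f"unsupported expression: {expr!r}")
    ref, field, op, literal = m.groups()
    return ref, field, OPS[op], float(literal)


def run_plan(plan, sources):
    """Execute a sequence of scan/filter steps over in-memory sources."""
    rows = []
    for step in plan:
        if step["op"] == "scan":
            # Bind each scanned record to the step's reference name.
            rows = [{step["ref"]: rec} for rec in sources[step["source"]]]
        elif step["op"] == "filter":
            ref, field, compare, literal = parse_expr(step["expr"])
            rows = [r for r in rows if compare(r[ref][field], literal)]
        else:
            raise ValueError(f"unsupported operator: {step['op']!r}")
    return rows


plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "filter", "expr": "donuts.ppu < 2.00"},
]
sources = {"local-logs": [
    {"id": "0001", "type": "donut", "ppu": 0.55},  # from the slide
    {"id": "9999", "type": "donut", "ppu": 2.50},  # hypothetical record
]}

result = run_plan(plan, sources)
print(result)
```

The results shown on the slide (`sales`, `typeCount`, `quantity`) come from later plan steps that the slide elides, so this sketch does not attempt to reproduce them.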
    • Status
      • Heavy development by multiple organizations
      • Available
        – Logical plan (ADSP)
        – Reference interpreter
        – Basic SQL parser
        – Basic demo
        – Basic HBase back-end
    • Status
      April 2013
      • Extend SQL syntax
      • Physical plan
      • In-memory compressed data interfaces
      • Distributed execution
    • Contributing
      • Learn where and how to contribute
        https://cwiki.apache.org/confluence/display/DRILL/Contributing
      • Jira, Git, Apache build and test tools
      • Preparing for dependencies
        – Hazelcast
        – Netflix Curator
    • Contributing
      General contributions appreciated:
      • Supersonic (?)
      • Test data & test queries
      • Use case scenarios (textual desc./SQL queries)
      • Documentation
    • Contributing
      • Dremel-inspired columnar format
        – Twitter's Parquet
        – Hive's ORC file
      • Integration with Hive metastore (?)
      • DRILL-13 Storage Engine: Define Java Interface
      • DRILL-15 Build HBase storage engine implementation
    • Contributing
      • DRILL-48 RPC interface for query submission and physical plan execution
      • DRILL-53 Setup cluster configuration and membership management system
      • Further schedule
        – Alpha Q2
        – Beta Q3
    • Kudos to …
      • Julian Hyde, Pentaho
      • Lisen Mu
      • Tim Chen, Microsoft
      • Chris Merrick, RJMetrics
      • David Alves, UT Austin
      • Sree Vaadi, SSS/NGData
      • Jacques Nadeau, MapR
      • Ted Dunning, MapR
    • Engage!
      • Follow @ApacheDrill on Twitter
      • Sign up at mailing lists (user | dev)
        http://incubator.apache.org/drill/mailing-lists.html
      • Standing G+ hangouts every Tuesday at 18:00 CET
      • Keep an eye on http://drill-user.org/