Swiss Big Data User Group - Introduction to Apache Drill
 

An introduction to Apache Drill given by Michael Hausenblas, Chief Data Engineer EMEA at MapR, at the 6th Swiss Big Data User Group Meeting in Zurich on 2013-03-25.



    Swiss Big Data User Group - Introduction to Apache Drill: Presentation Transcript

    • 1   Introduction to Apache Drill
        Michael Hausenblas, Chief Data Engineer EMEA, MapR
        6th Swiss Big Data User Group Meeting, Zurich, 2013-03-25
    • 2   Kudos to http://cmx.io/
    • 3   Workloads
        • Batch processing (MapReduce)
        • Light-weight OLTP (HBase, Cassandra, etc.)
        • Stream processing (Storm, S4)
        • Search (Solr, Elasticsearch)
        • Interactive, ad-hoc query and analysis (?)
    • 4   Impala   Interactive Query at Scale   low-latency
    • 5   Use Case I
        • Jane, a marketing analyst
        • Determine target segments
        • Data from different sources
    • 6   Use Case II
        • Logistics – supplier status
        • Queries
            – How many shipments from supplier X?
            – How many shipments in region Y?

        SUPPLIER_ID   NAME                  REGION
        ACM           ACME Corp             US
        GAL           GotALot Inc           US
        BAP           Bits and Pieces Ltd   Europe
        ZUP           Zu Pli                Asia

        { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
        { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" }
        …
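The two questions on this slide amount to a count over the shipment records and a count after joining them to the supplier table. A minimal Python sketch of that logic, using only the two shipment records visible on the slide (the remaining records are elided there) and hypothetical helper names; it illustrates the queries themselves, not how Drill would run them:

```python
from collections import Counter

# Supplier reference table, as listed on the slide.
suppliers = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Only the two shipment records shown on the slide; the rest are elided there.
shipments = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_per_supplier(records):
    """Count shipments grouped by supplier id (hypothetical helper)."""
    return Counter(r["supplier"] for r in records)


def shipments_per_region(records, supplier_table):
    """Join each shipment to its supplier and count by region (hypothetical helper)."""
    return Counter(supplier_table[r["supplier"]]["region"] for r in records)


by_supplier = shipments_per_supplier(shipments)
by_region = shipments_per_region(shipments, suppliers)
print(by_supplier["ACM"], by_region["Europe"])
```

The second query needs data from both sources at once, which is the situation the following slides address.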
    • 7   Today’s Solutions
        • RDBMS-focused
            – ETL data from MongoDB and Hadoop
            – Query data using SQL
        • MapReduce-focused
            – ETL from RDBMS and MongoDB
            – Use Hive, etc.
    • 8   Requirements
        • Support for different data sources
        • Support for different query interfaces
        • Low-latency/real-time
        • Ad-hoc queries
        • Scalable, reliable
    • 9   Google’s Dremel
        http://research.google.com/pubs/pub36632.html
    • 10   Apache Drill Overview
        • Inspired by Google’s Dremel
        • Standard SQL 2003 support
        • Other QL possible
        • Pluggable data sources
        • Support for nested data
        • Schema is optional
        • Community driven, open, 100s involved
    • 11   Apache  Drill  Overview  
    • 12   High-­‐level  Architecture  
    • 13   High-level Architecture
        • Each node: Drillbit - maximize data locality
        • Coordination, query planning, execution, etc. are distributed
        • By default Drillbits hold all roles
        • Any node can act as endpoint for a query
        [Diagram: four nodes, each running a Drillbit next to its storage process]
    • 14   High-level Architecture
        • Zookeeper for ephemeral cluster membership info
        • Distributed cache (Hazelcast) for metadata, locality information, etc.
        [Diagram: Zookeeper plus four nodes, each with a storage process, a Drillbit, and a distributed cache]
    • 15   High-level Architecture
        • Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
        • Streaming data communication avoiding SerDe
        [Diagram: Zookeeper plus four nodes, each with a storage process, a Drillbit, and a distributed cache]
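The foreman idea from the architecture slides can be pictured with a toy, single-process model. The sketch below is an assumption-laden illustration: the `Drillbit` class, its methods, and the sample rows are all hypothetical, and Zookeeper, the distributed cache, RPC, and real scheduling are left out entirely. It only shows that whichever node receives the query coordinates it, while each node scans its own local data.

```python
class Drillbit:
    """Toy stand-in for a Drillbit; not Drill's actual class or API."""

    def __init__(self, name, local_data):
        self.name = name
        self.local_data = local_data  # rows stored on this node

    def scan_local(self, predicate):
        """Scan only the rows held on this node (data locality)."""
        return [row for row in self.local_data if predicate(row)]

    def run_as_foreman(self, cluster, predicate):
        """The node that receives the query coordinates it across all nodes."""
        rows = []
        for bit in cluster:
            rows.extend(bit.scan_local(predicate))
        return {"foreman": self.name, "rows": rows}


# Hypothetical three-node cluster with made-up rows.
cluster = [
    Drillbit("node-1", [{"x": 1}, {"x": 5}]),
    Drillbit("node-2", [{"x": 7}]),
    Drillbit("node-3", [{"x": 2}]),
]

# Any node can be the entry point; the answer is the same either way.
answer_a = cluster[0].run_as_foreman(cluster, lambda r: r["x"] > 3)
answer_b = cluster[2].run_as_foreman(cluster, lambda r: r["x"] > 3)
print(answer_a["foreman"], answer_b["foreman"], answer_a["rows"])
```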
    • 16   Principled Query Execution
        Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
        Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; topology
        query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" },
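To make the logical-plan fragment on this slide concrete, here is a small Python sketch that walks a `sequence` of `scan` and `filter` steps over in-memory rows. It is not Drill's reference interpreter: the `run_sequence` and `parse_condition` names, the restriction to conditions of the form `field op number`, and the sample `logs` rows are all assumptions made for illustration.

```python
import operator

# Comparison operators this toy filter understands (an assumption,
# not Drill's expression language).
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
}


def parse_condition(condition):
    """Parse a condition such as 'x > 3' into (field, comparison, number)."""
    field, op, value = condition.split()
    return field, OPS[op], float(value)


def run_sequence(steps, sources):
    """Apply each step of a 'sequence' operator to the rows produced so far."""
    rows = []
    for step in steps:
        if step["op"] == "scan":
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            field, compare, value = parse_condition(step["condition"])
            rows = [r for r in rows if field in r and compare(r[field], value)]
        else:
            raise ValueError(f"unsupported operator: {step['op']}")
    return rows


# The two steps visible on the slide, written as plain Python dicts.
plan = {"op": "sequence", "do": [
    {"op": "scan", "source": "logs"},
    {"op": "filter", "condition": "x > 3"},
]}

# Hypothetical contents for the "logs" source.
sources = {"logs": [{"x": 1}, {"x": 5}, {"x": 7}, {"y": 2}]}

result = run_sequence(plan["do"], sources)
print(result)
```

Because the plan is plain data, any front end that can emit such a structure could feed the same executor, which is the point of separating the parser from the logical plan.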
    • 17   Drillbit Modules
        Diagram labels: DFS Engine, HBase Engine, RPC Endpoint, SQL, HiveQL, Pig, Parser, Distributed Cache, Logical Plan, Physical Plan, Optimizer, Storage Engine Interface, Scheduler, Foreman, Operators, Mongo
    • 18   Key Features
        • Full SQL 2003
        • Nested data
        • Optional schema
        • Extensibility points
    • 19   Full SQL – ANSI SQL 2003
        • SQL-like is often not enough
        • Integration with existing tools
            – Datameer, Tableau, Excel, SAP Crystal Reports
            – Use standard ODBC/JDBC driver
    • 20   Nested Data
        • Nested data becoming prevalent
            – JSON/BSON, XML, ProtoBuf, Avro
            – Some data sources support it natively (MongoDB, etc.)
        • Flattening nested data is error-prone
        • Extension to ANSI SQL 2003
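One way to see why flattening is error-prone: turning a nested record into one row per nested element repeats the parent fields, so aggregates over the flat rows can silently double-count, and records with an empty nested list vanish. The Python sketch below uses the donut record from the talk's example slide (only the two batters visible there; the rest are elided) and a hypothetical `flatten` helper.

```python
# Donut record as shown on the example slide; further batters are elided there.
donut = {
    "id": "0001",
    "type": "donut",
    "ppu": 0.55,
    "batters": {"batter": [
        {"id": "1001", "type": "Regular"},
        {"id": "1002", "type": "Chocolate"},
    ]},
}


def flatten(record):
    """Emit one flat row per nested batter, repeating the parent fields."""
    return [
        {"id": record["id"], "ppu": record["ppu"],
         "batter_id": b["id"], "batter_type": b["type"]}
        for b in record["batters"]["batter"]
    ]


rows = flatten(donut)

# Pitfall 1: the parent's ppu now appears once per batter, so a naive sum
# over the flat rows counts it twice.
naive_total = sum(r["ppu"] for r in rows)
nested_total = donut["ppu"]

# Pitfall 2: a record with no batters produces no rows at all.
# (Hypothetical record, not from the slides.)
plain = {"id": "0002", "type": "donut", "ppu": 1.0, "batters": {"batter": []}}
lost = flatten(plain)

print(len(rows), naive_total, nested_total, lost)
```

Querying the nested structure directly avoids both pitfalls, since the parent record is never duplicated or dropped.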
    • 21   Optional Schema
        • Many data sources don’t have rigid schemas
            – Schema changes rapidly
            – Different schema per record (e.g. HBase)
        • Supports queries against unknown schema
        • User can define schema or via discovery
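Schema discovery over records that do not share a fixed layout can be sketched as collecting, per field, the set of value types actually seen. The Python below is a simplified illustration under stated assumptions: `discover_schema` is a hypothetical helper, and the `priority` field and the third record are invented to show per-record variation; nothing here reflects how Drill implements discovery.

```python
def discover_schema(records):
    """Map each field name to the set of value type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Records with differing shapes. The "priority" field and the third record
# are made up for illustration.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125"},
]

schema = discover_schema(records)
print(schema)
```

A field that maps to more than one type, or that is missing from some records, is exactly the case a rigid up-front schema would reject.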
    • 22   Extensibility Points
        • Source query – parser API
        • Custom operators, UDF – logical plan
        • Optimizer
        • Data sources and formats – scanner API
        Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
    • 23   … and Hadoop?
        • HDFS can be a data source
        • Complementary use cases …
        • … use Apache Drill
            – Find record with specified condition
            – Aggregation under dynamic conditions
        • … use MapReduce
            – Data mining with multiple iterations
            – ETL
        https://cloud.google.com/files/BigQueryTechnicalWP.pdf
    • 24   Example
        https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
        data source: donuts.json
            { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
        logical plan: simple_plan.json
            query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
        result: out.json
            { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
            { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
            { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
    • 25   Status
        • Heavy development by multiple organizations
        • Available
            – Logical plan (ADSP)
            – Reference interpreter
            – Basic SQL parser
            – Basic demo
            – Basic HBase back-end
    • 26   Status: March/April
        • Larger SQL syntax
        • Physical plan
        • In-memory compressed data interfaces
        • Distributed execution focused on large-cluster, high-performance sort, aggregation and join
    • 27   Contributing
        • Dremel-inspired columnar format: Twitter’s Parquet and Hive’s ORC file
        • Integration with Hive metastore (?)
        • DRILL-13 Storage Engine: Define Java Interface
        • DRILL-15 Build HBase storage engine implementation
    • 28   Contributing
        • DRILL-48 RPC interface for query submission and physical plan execution
        • DRILL-53 Setup cluster configuration and membership mgmt system
            – ZK for coordination
            – Helix for partition and resource assignment (?)
        • Further schedule
            – Alpha Q2
            – Beta Q3
    • 29   Kudos to …
        • Julian Hyde, Pentaho
        • Timothy Chen, Microsoft
        • Chris Merrick, RJMetrics
        • David Alves, UT Austin
        • Sree Vaadi, SSS/NGData
        • Jacques Nadeau, MapR
        • Ted Dunning, MapR
    • 30   Engage!
        • Follow @ApacheDrill on Twitter
        • Sign up at mailing lists (user | dev)
          http://incubator.apache.org/drill/mailing-lists.html
        • Learn where and how to contribute
          https://cwiki.apache.org/confluence/display/DRILL/Contributing
        • Keep an eye on http://drill-user.org/