SlideShare a Scribd company logo
Introduc)on	
  to	
  Apache	
  Drill	
  

          February	
  27,	
  2013	
  
            Tomer	
  Shiran	
  
Who	
  Am	
  I?	
  
•      Tomer	
  Shiran	
  
•      tshiran@maprtech.com	
  
•      Founding	
  Member	
  and	
  CommiGer,	
  Apache	
  Drill	
  
•      Director	
  of	
  Product	
  Management,	
  MapR	
  
       Technologies	
  
	
  
Agenda	
  

•    Apache	
  Drill	
  overview	
  
•    Key	
  features	
  
•    Status	
  and	
  progress	
  
•    How	
  to	
  get	
  involved	
  
Big	
  Data	
  Workloads	
  
•    ETL	
  
•    Data	
  mining	
  
•    Index	
  and	
  model	
  genera)on	
  
•    Clustering,	
  anomaly	
  detec)on	
  and	
  classifica)on	
  
•    Blob	
  store	
  
•    Lightweight	
  OLTP	
  on	
  large	
  datasets	
  
•    Web	
  crawling	
  
•    Stream	
  processing	
  
•    Interac)ve	
  analysis	
  
Interac)ve	
  Queries	
  and	
  Hadoop	
  
                                                        Exis)ng	
  Solu)ons	
  


                                Compile	
  SQL	
  (HiveQL)	
  to	
                Export	
  MapReduce	
  results	
  to	
  
                                     MapReduce	
                                   RDBMS	
  and	
  query	
  RDBMS	
  


                                               Emerging	
  Technologies	
  
                                                                                                        Stinger/Tez                    Impala
     Real-­‐)me	
  	
         PostgreSQL-­‐based	
                    PostgreSQL-­‐based	
           Hive	
  performance	
             Real-­‐)me	
  	
  
interac)ve	
  analysis	
      Hadoop	
  analy)cs	
                    Hadoop	
  analy)cs	
            improvements	
              interac)ve	
  analysis	
  


                             HAWQ                             Phoenix	
                                                        Cascading Lingual
       PostgreSQL-­‐based	
                                                                                                     Compile	
  ANSI	
  SQL	
  to	
  
       Hadoop	
  analy)cs	
                   SQL	
  layer	
  for	
  HBase	
           SQL	
  layer	
  for	
  HBase	
              MapReduce	
  
Example	
  Problem	
  
       •  Jane	
  works	
  as	
  an	
          Transac.on	
  
          analyst	
  at	
  an	
  e-­‐          informa.on	
  
          commerce	
  company	
  
       •  How	
  does	
  she	
  figure	
  
                                                 User	
  	
  
          out	
  good	
  targe)ng	
             profiles	
  
          segments	
  for	
  the	
  next	
  
          marke)ng	
  campaign?	
  
       •  She	
  has	
  some	
  ideas	
           Access	
  
          and	
  lots	
  of	
  data	
              logs	
  
Solving	
  the	
  Problem	
  with	
  Tradi)onal	
  Systems	
  

•  Use	
  an	
  RDBMS	
  
     –  ETL	
  the	
  data	
  from	
  MongoDB	
  and	
  Hadoop	
  into	
  the	
  RDBMS	
  
            •  MongoDB	
  data	
  must	
  be	
  flaGened,	
  schema)zed,	
  filtered	
  and	
  aggregated	
  
            •  Hadoop	
  data	
  must	
  be	
  filtered	
  and	
  aggregated	
  
     –  Query	
  the	
  data	
  using	
  any	
  SQL-­‐based	
  tool	
  
•  Use	
  MapReduce	
  
     –  ETL	
  the	
  data	
  from	
  Oracle	
  and	
  MongoDB	
  into	
  Hadoop	
  
     –  Work	
  with	
  the	
  MapReduce	
  team	
  to	
  generate	
  the	
  desired	
  analyses	
  
•  Use	
  Hive	
  
     –  ETL	
  the	
  data	
  from	
  Oracle	
  and	
  MongoDB	
  into	
  Hadoop	
  
            •  MongoDB	
  data	
  must	
  be	
  flaGened	
  and	
  schema)zed	
  
     –  But	
  HiveQL	
  is	
  limited,	
  queries	
  take	
  too	
  long	
  and	
  BI	
  tool	
  support	
  is	
  
        limited	
  
WWGD	
  

                 Distributed	
                       Interac.ve	
       Batch	
  
                                       NoSQL	
  
                 File	
  System	
                      analysis	
     processing	
  


                      GFS	
           BigTable	
       Dremel	
       MapReduce	
  


                                                                       Hadoop	
  
                     HDFS	
            HBase	
           ???	
  
                                                                      MapReduce	
  




  Build	
  Apache	
  Drill	
  to	
  provide	
  a	
  true	
  open	
  source	
  
     solu)on	
  to	
  interac)ve	
  analysis	
  of	
  Big	
  Data	
  
Apache	
  Drill	
  Overview	
  
•  Interac)ve	
  analysis	
  of	
  Big	
  Data	
  using	
  standard	
  SQL	
  
•  Fast	
  
      –  Low	
  latency	
  
      –  Columnar	
  execu)on	
  
             •  Inspired	
  by	
  Google	
  Dremel/BigQuery	
                                      Interac)ve	
  queries	
  
      –  Complement	
  na)ve	
  interfaces	
  and	
  MapReduce/              Apache	
  Drill	
     Data	
  analyst	
  
         Hive/Pig	
                                                                                Repor)ng	
  
                                                                                                   100	
  ms-­‐20	
  min	
  
•  Open	
  
      –  Community	
  driven	
  open	
  source	
  project	
  
      –  Under	
  Apache	
  Socware	
  Founda)on	
  
•  Modern	
  
      –    Standard	
  ANSI	
  SQL:2003	
  (select/into)	
                                         Data	
  mining	
  
      –    Nested/hierarchical	
  data	
  support	
                          MapReduce	
           Modeling	
  
                                                                                  Hive	
  
      –    Schema	
  is	
  op)onal	
                                               Pig	
  
                                                                                                   Large	
  ETL	
  
                                                                                                   20	
  min-­‐20	
  hr	
  
      –    Supports	
  RDBMS,	
  Hadoop	
  and	
  NoSQL	
  
How	
  Does	
  It	
  Work?	
  
•  Drillbits	
  run	
  on	
  each	
  node,	
  designed	
  to	
  
   maximize	
  data	
  locality	
  
•  Processing	
  is	
  done	
  outside	
  MapReduce	
  
                                                                   SELECT	
  *	
  FROM	
  
   paradigm	
  (YARN	
  is	
  supported)	
                         oracle.transac)ons,	
  
•  Queries	
  can	
  be	
  fed	
  to	
  any	
  Drillbit	
          mongo.users,	
  
                                                                   hdfs.events	
  
•  Coordina)on,	
  query	
  planning,	
  op)miza)on,	
             LIMIT	
  1	
  
   scheduling,	
  and	
  execu)on	
  are	
  distributed	
  
Key	
  Features	
  

•    Full	
  SQL	
  (ANSI	
  SQL:2003)	
  
•    Nested	
  data	
  
•    Schema	
  is	
  op)onal	
  
•    Flexible	
  and	
  extensible	
  architecture	
  
Full	
  SQL	
  (ANSI	
  SQL:2003)	
  
•  Drill	
  supports	
  standard	
  ANSI	
  SQL:2003	
  
     –  Correlated	
  subqueries,	
  analy)c	
  func)ons,	
  …	
  
     –  SQL-­‐like	
  is	
  not	
  enough	
  
•  Use	
  any	
  SQL-­‐based	
  tool	
  with	
  Apache	
  Drill	
  
     –  Tableau,	
  Microstrategy,	
  Excel,	
  SAP	
  Crystal	
  Reports,	
  Toad,	
  SQuirreL,	
  …	
  
     –  Standard	
  ODBC	
  and	
  JDBC	
  drivers	
  
                               Client

                 Tableau

                                                                           Drillbit
               MicroStrategy
                               Drill%ODBC%                    SQL%Query%               Query%
                                             Driver                                              Drillbits
                                 Driver                         Parser                Planner   Drill%Worker
                                                                                                 Drill%Worker
                   Excel


                SAP%Crystal%
                 Reports
Nested	
  Data	
  
                                                                                                         JSON	
  
•  Nested	
  data	
  is	
  becoming	
  prevalent	
                                        {	
  
                                                                                          	
  	
  "name":	
  "Homer",	
  
     –  JSON,	
  BSON,	
  XML,	
  Protocol	
  Buffers,	
  Avro,	
  …	
                     	
  	
  "gender":	
  "Male",	
  
                                                                                          	
  	
  "followers":	
  100	
  
     –  The	
  data	
  source	
  may	
  or	
  may	
  not	
  be	
  aware	
                 	
  	
  children:	
  [	
  
           •  MongoDB	
  supports	
  nested	
  data	
  na)vely	
                          	
  	
  	
  	
  {name:	
  "Bart"},	
  
                                                                                          	
  	
  	
  	
  {name:	
  "Lisa”}	
  
           •  A	
  single	
  HBase	
  value	
  could	
  be	
  a	
  JSON	
  document	
     	
  	
  ]	
  
              (compound	
  nested	
  type)	
                                              }	
  
     –  Google	
  Dremel’s	
  innova)on	
  was	
  efficient	
  columnar	
  
        storage	
  and	
  querying	
  of	
  nested	
  data	
  
•  FlaGening	
  nested	
  data	
  is	
  error-­‐prone	
  and	
  ocen	
                                    Avro	
  
                                                                                          enum	
  Gender	
  {	
  
   impossible	
                                                                           	
  	
  MALE,	
  FEMALE	
  
     –  Think	
  about	
  repeated	
  and	
  op)onal	
  fields	
  at	
  every	
            }	
  
                                                                                          	
  
        level…	
                                                                          record	
  User	
  {	
  

•  Apache	
  Drill	
  supports	
  nested	
  data	
  
                                                                                          	
  	
  string	
  name;	
  
                                                                                          	
  	
  Gender	
  gender;	
  
     –  Extensions	
  to	
  ANSI	
  SQL:2003	
                                            	
  	
  long	
  followers;	
  
                                                                                          }	
  
Schema	
  is	
  Op)onal	
  
•  Many	
  data	
  sources	
  do	
  not	
  have	
  rigid	
  schemas	
  
         –  Schemas	
  change	
  rapidly	
  
         –  Each	
  record	
  may	
  have	
  a	
  different	
  schema	
  
                 •  Sparse	
  and	
  wide	
  rows	
  in	
  HBase	
  and	
  Cassandra,	
  MongoDB	
  
•  Apache	
  Drill	
  supports	
  querying	
  against	
  unknown	
  schemas	
  
         –  Query	
  any	
  HBase,	
  Cassandra	
  or	
  MongoDB	
  table	
  
•  User	
  can	
  define	
  the	
  schema	
  or	
  let	
  the	
  system	
  discover	
  it	
  automa)cally	
  
         –  System	
  of	
  record	
  may	
  already	
  have	
  schema	
  informa)on	
  
                 •  Why	
  manage	
  it	
  in	
  a	
  separate	
  system?	
  
         –  No	
  need	
  to	
  manage	
  schema	
  evolu)on	
  

Row	
  Key	
                       CF	
  contents	
                             CF	
  anchor	
  
"com.cnn.www"	
                    contents:html	
  =	
  "<html>…"	
            anchor:my.look.ca	
  =	
  "CNN.com"	
  
                                                                                anchor:cnnsi.com	
  =	
  "CNN"	
  
"com.foxnews.www"	
                contents:html	
  =	
  "<html>…"	
            anchor:en.wikipedia.org	
  =	
  "Fox	
  News"	
  
                                                                                	
  
…	
                                …	
                                          …	
  
Flexible	
  and	
  Extensible	
  Architecture	
  
•  Apache	
  Drill	
  is	
  designed	
  for	
  extensibility	
  
      –  Well-­‐documented	
  APIs	
  and	
  interfaces	
  
•  Data	
  sources	
  and	
  file	
  formats	
  
      –  Implement	
  a	
  custom	
  scanner	
  to	
  support	
  a	
  new	
  data	
  source	
  or	
  file	
  format	
  
•  Query	
  languages	
  
      –  SQL:2003	
  is	
  the	
  primary	
  language	
  
      –  Implement	
  a	
  custom	
  Parser	
  to	
  support	
  a	
  Domain	
  Specific	
  Language	
  
      –  UDFs	
  and	
  UDTFs	
  
•  Op)mizers	
  
      –  Drill	
  will	
  have	
  a	
  cost-­‐based	
  op)mizer	
  
      –  Clear	
  surrounding	
  APIs	
  support	
  easy	
  op)mizer	
  explora)on	
  
•  Operators	
  
      –  Custom	
  operators	
  can	
  be	
  implemented	
  
             •  Special	
  operators	
  for	
  Mahout	
  (k-­‐means)	
  being	
  designed	
  
      –  Operator	
  push-­‐down	
  to	
  data	
  source	
  (RDBMS)	
  
What	
  About	
  Other	
  SQL-­‐on-­‐Hadoop	
  
Systems?	
  
•  Strengths	
  
    –  Code	
  is	
  more	
  mature	
  than	
  Apache	
  Drill	
  
           •  Already	
  in	
  beta	
  (~2	
  quarters	
  ahead)	
  
    –  Faster	
  than	
  Hive	
  on	
  some	
  queries	
  

•  Weaknesses	
  
    –    Proprietary	
  or	
  semi-­‐open	
  source	
  
    –    Query	
  results	
  must	
  fit	
  in	
  memory	
  (no	
  spooling)	
  
    –    Early	
  row	
  materializa)on	
  (no	
  columnar	
  execu)on)	
  
    –    Some	
  are	
  SQL-­‐like	
  (not	
  SQL)	
  
    –    No	
  support	
  for	
  nested	
  data	
  
    –    Rigid	
  schema	
  is	
  required	
  
    –    Limited	
  flexibility	
  and	
  extensibility	
  
    –    Only	
  support	
  Hadoop	
  and	
  HBase	
  (and	
  no	
  other	
  NoSQL	
  or	
  RDBMS)	
  
Status:	
  In	
  Progress	
  
•    Heavy	
  ac)ve	
  development	
  
      –  6-­‐7	
  companies	
  are	
  contribu)ng	
  
•    Available	
  
      –  Logical	
  plan	
  syntax	
  and	
  interpreter	
  
      –  Reference	
  execu)on	
  engine	
  
•    In	
  progress	
  
      –  SQL	
  interpreter	
  
      –  Storage	
  engine	
  implementa)ons	
  for	
  Accumulo,	
  Cassandra,	
  HBase	
  and	
  various	
  file	
  formats	
  
•    Significant	
  community	
  momentum	
  
      –    Over	
  250	
  people	
  on	
  the	
  Drill	
  mailing	
  list	
  
      –    Over	
  250	
  members	
  of	
  the	
  Bay	
  Area	
  Drill	
  User	
  Group	
  
      –    Drill	
  meetups	
  across	
  the	
  US	
  and	
  Europe	
  
      –    OpenDremel	
  team	
  joined	
  Apache	
  Drill	
  
•    An)cipated	
  schedule:	
  
      –  Prototype:	
  Q1	
  
      –  Alpha:	
  Q2	
  
      –  Beta:	
  Q3	
  
Why	
  Apache	
  Drill	
  Will	
  Be	
  Successful	
  
Resources	
                             Community	
                               Architecture	
  
•  Contributors	
  have	
  strong	
     •  Development	
  done	
  in	
  the	
     •  Full	
  SQL	
  
   backgrounds	
  from	
                   open	
                                 •  New	
  data	
  support	
  
   companies	
  like	
  Oracle,	
       •  Ac)ve	
  contributors	
  from	
        •  Extensible	
  APIs	
  
   IBM	
  Netezza,	
  Informa)ca,	
        mul)ple	
  companies	
                 •  Full	
  Columnar	
  Execu)on	
  
   Clustrix	
  and	
  Pentaho	
         •  Rapidly	
  growing	
                   •  Beyond	
  Hadoop	
  
Interested	
  in	
  Apache	
  Drill?	
  

•  Many	
  op)ons	
  to	
  contribute	
  
   –  Become	
  a	
  full	
  )me	
  Drill	
  engineer	
  @	
  MapR	
  
        •  Email	
  tshiran@maprtech.com	
  	
  
   –  Join	
  the	
  Drill	
  mailing	
  list	
  and	
  start	
  contribu)ng	
  
        •  JIRAs,	
  code,	
  unit	
  tests,	
  documenta)on,	
  …	
  
   –  Shoot	
  me	
  an	
  email	
  and	
  we	
  can	
  discuss	
  
        •  Email	
  tshiran@maprtech.com	
  
QUESTIONS?	
  
Why	
  Not	
  Leverage	
  MapReduce?	
  
•  Scheduling	
  Model	
  
    –  Coarse	
  resource	
  model	
  reduces	
  hardware	
  u)liza)on	
  
    –  Acquisi)on	
  of	
  resources	
  typically	
  takes	
  100’s	
  of	
  millis	
  to	
  seconds	
  
•  Barriers	
  
    –  Map	
  comple)on	
  required	
  before	
  shuffle/reduce	
  
       commencement	
  
    –  All	
  maps	
  must	
  complete	
  before	
  reduce	
  can	
  start	
  
    –  In	
  chained	
  jobs,	
  one	
  job	
  must	
  finish	
  en)rely	
  before	
  the	
  next	
  one	
  
       can	
  start	
  
•  Persistence	
  and	
  Recoverability	
  
    –  Data	
  is	
  persisted	
  to	
  disk	
  between	
  each	
  barrier	
  
    –  Serializa)on	
  and	
  deserializa)on	
  are	
  required	
  between	
  execu)on	
  
       phase	
  

More Related Content

What's hot

Document Database
Document DatabaseDocument Database
Document Database
Heman Hosainpana
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
David Groozman
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
Rohit Agrawal
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Dr. C.V. Suresh Babu
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Stanley Wang
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
Prashant Gupta
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
sunera pathan
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Simplilearn
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
Calum Miller
 
Big Data Integration
Big Data IntegrationBig Data Integration
Big Data Integration
Hadi Fadlallah
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
ateeq ateeq
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
Fabio Fumarola
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
DataWorks Summit
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
Serkan Özal
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
Dan Gunter
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
markgrover
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
Taro L. Saito
 

What's hot (20)

Document Database
Document DatabaseDocument Database
Document Database
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1Introduction to Big Data & Hadoop Architecture - Module 1
Introduction to Big Data & Hadoop Architecture - Module 1
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Apache HBase™
Apache HBase™Apache HBase™
Apache HBase™
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
 
Data Vault Vs Data Lake
Data Vault Vs Data LakeData Vault Vs Data Lake
Data Vault Vs Data Lake
 
Big Data Integration
Big Data IntegrationBig Data Integration
Big Data Integration
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
9. Document Oriented Databases
9. Document Oriented Databases9. Document Oriented Databases
9. Document Oriented Databases
 
Accelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache HiveAccelerating query processing with materialized views in Apache Hive
Accelerating query processing with materialized views in Apache Hive
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Introduction to Impala
Introduction to ImpalaIntroduction to Impala
Introduction to Impala
 
Introduction to Presto at Treasure Data
Introduction to Presto at Treasure DataIntroduction to Presto at Treasure Data
Introduction to Presto at Treasure Data
 

Viewers also liked

Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
MapR Technologies
 
JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)
plarsen67
 
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
inovex GmbH
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
Charles Givre
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
MapR Technologies
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
DataWorks Summit/Hadoop Summit
 
Data virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss TeiidData virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss Teiid
Anil Allewar
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Charles Givre
 
4×4: Big Data in der Cloud
4×4: Big Data in der Cloud4×4: Big Data in der Cloud
4×4: Big Data in der Cloud
Danny Linden
 
Red Hat JBOSS Data Virtualization
Red Hat JBOSS Data VirtualizationRed Hat JBOSS Data Virtualization
Red Hat JBOSS Data Virtualization
DLT Solutions
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Slim Baltagi
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
Lynn Langit
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Slim Baltagi
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by Denodo
Justo Hidalgo
 

Viewers also liked (14)

Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)JBoss Enterprise Data Services (Data Virtualization)
JBoss Enterprise Data Services (Data Virtualization)
 
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
Wer gewinnt das SQL-Rennen auf der Hadoop-Strecke?
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 
Data virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss TeiidData virtualization, Data Federation & IaaS with Jboss Teiid
Data virtualization, Data Federation & IaaS with Jboss Teiid
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
4×4: Big Data in der Cloud
4×4: Big Data in der Cloud4×4: Big Data in der Cloud
4×4: Big Data in der Cloud
 
Red Hat JBOSS Data Virtualization
Red Hat JBOSS Data VirtualizationRed Hat JBOSS Data Virtualization
Red Hat JBOSS Data Virtualization
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Why Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by DenodoWhy Data Virtualization? An Introduction by Denodo
Why Data Virtualization? An Introduction by Denodo
 

Similar to An introduction to apache drill presentation

Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
lucenerevolution
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
Great Wide Open
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
cwensel
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
Geoff Hendrey
 
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Krishnan Parasuraman
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
Tugdual Grall
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchMapR Technologies
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summitOpen Analytics
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Yahoo Developer Network
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
Michael Rys
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DBHeriyadi Janwar
 

Similar to An introduction to apache drill presentation (20)

Drill njhug -19 feb2013
Drill njhug -19 feb2013Drill njhug -19 feb2013
Drill njhug -19 feb2013
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Processing Big Data
Processing Big DataProcessing Big Data
Processing Big Data
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Apache drill
Apache drillApache drill
Apache drill
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
Hadoop World 2011: Building Scalable Data Platforms ; Hadoop & Netezza Deploy...
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Big Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQLBig Data Paris : Hadoop and NoSQL
Big Data Paris : Hadoop and NoSQL
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Hadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 MarchHadoop Summit - Hausenblas 20 March
Hadoop Summit - Hausenblas 20 March
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBaseOct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
Oct 2012 HUG: Project Panthera: Better Analytics with SQL, MapReduce, and HBase
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DB
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
MapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
MapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
MapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
MapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
MapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
MapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
MapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
MapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
MapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
MapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
MapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
MapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
MapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
MapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
MapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

An introduction to apache drill presentation

  • 1. Introduc)on  to  Apache  Drill   February  27,  2013   Tomer  Shiran  
  • 2. Who  Am  I?   •  Tomer  Shiran   •  tshiran@maprtech.com   •  Founding  Member  and  CommiGer,  Apache  Drill   •  Director  of  Product  Management,  MapR   Technologies    
  • 3. Agenda   •  Apache  Drill  overview   •  Key  features   •  Status  and  progress   •  How  to  get  involved  
  • 4. Big  Data  Workloads   •  ETL   •  Data  mining   •  Index  and  model  genera)on   •  Clustering,  anomaly  detec)on  and  classifica)on   •  Blob  store   •  Lightweight  OLTP  on  large  datasets   •  Web  crawling   •  Stream  processing   •  Interac)ve  analysis  
  • 5. Interac)ve  Queries  and  Hadoop   Exis)ng  Solu)ons   Compile  SQL  (HiveQL)  to   Export  MapReduce  results  to   MapReduce   RDBMS  and  query  RDBMS   Emerging  Technologies   Stinger/Tez Impala Real-­‐)me     PostgreSQL-­‐based   PostgreSQL-­‐based   Hive  performance   Real-­‐)me     interac)ve  analysis   Hadoop  analy)cs   Hadoop  analy)cs   improvements   interac)ve  analysis   HAWQ Phoenix   Cascading Lingual PostgreSQL-­‐based   Compile  ANSI  SQL  to   Hadoop  analy)cs   SQL  layer  for  HBase   SQL  layer  for  HBase   MapReduce  
  • 6. Example  Problem   •  Jane  works  as  an   Transac.on   analyst  at  an  e-­‐ informa.on   commerce  company   •  How  does  she  figure   User     out  good  targe)ng   profiles   segments  for  the  next   marke)ng  campaign?   •  She  has  some  ideas   Access   and  lots  of  data   logs  
  • 7. Solving  the  Problem  with  Tradi)onal  Systems   •  Use  an  RDBMS   –  ETL  the  data  from  MongoDB  and  Hadoop  into  the  RDBMS   •  MongoDB  data  must  be  flaGened,  schema)zed,  filtered  and  aggregated   •  Hadoop  data  must  be  filtered  and  aggregated   –  Query  the  data  using  any  SQL-­‐based  tool   •  Use  MapReduce   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   –  Work  with  the  MapReduce  team  to  generate  the  desired  analyses   •  Use  Hive   –  ETL  the  data  from  Oracle  and  MongoDB  into  Hadoop   •  MongoDB  data  must  be  flaGened  and  schema)zed   –  But  HiveQL  is  limited,  queries  take  too  long  and  BI  tool  support  is   limited  
  • 8. WWGD   Distributed   Interac.ve   Batch   NoSQL   File  System   analysis   processing   GFS   BigTable   Dremel   MapReduce   Hadoop   HDFS   HBase   ???   MapReduce   Build  Apache  Drill  to  provide  a  true  open  source   solu)on  to  interac)ve  analysis  of  Big  Data  
  • 9. Apache  Drill  Overview   •  Interac)ve  analysis  of  Big  Data  using  standard  SQL   •  Fast   –  Low  latency   –  Columnar  execu)on   •  Inspired  by  Google  Dremel/BigQuery   Interac)ve  queries   –  Complement  na)ve  interfaces  and  MapReduce/ Apache  Drill   Data  analyst   Hive/Pig   Repor)ng   100  ms-­‐20  min   •  Open   –  Community  driven  open  source  project   –  Under  Apache  Socware  Founda)on   •  Modern   –  Standard  ANSI  SQL:2003  (select/into)   Data  mining   –  Nested/hierarchical  data  support   MapReduce   Modeling   Hive   –  Schema  is  op)onal   Pig   Large  ETL   20  min-­‐20  hr   –  Supports  RDBMS,  Hadoop  and  NoSQL  
  • 10. How  Does  It  Work?   •  Drillbits  run  on  each  node,  designed  to   maximize  data  locality   •  Processing  is  done  outside  MapReduce   SELECT  *  FROM   paradigm  (YARN  is  supported)   oracle.transac)ons,   •  Queries  can  be  fed  to  any  Drillbit   mongo.users,   hdfs.events   •  Coordina)on,  query  planning,  op)miza)on,   LIMIT  1   scheduling,  and  execu)on  are  distributed  
  • 11. Key  Features   •  Full  SQL  (ANSI  SQL:2003)   •  Nested  data   •  Schema  is  op)onal   •  Flexible  and  extensible  architecture  
  • 12. Full  SQL  (ANSI  SQL:2003)   •  Drill  supports  standard  ANSI  SQL:2003   –  Correlated  subqueries,  analy)c  func)ons,  …   –  SQL-­‐like  is  not  enough   •  Use  any  SQL-­‐based  tool  with  Apache  Drill   –  Tableau,  Microstrategy,  Excel,  SAP  Crystal  Reports,  Toad,  SQuirreL,  …   –  Standard  ODBC  and  JDBC  drivers   Client Tableau Drillbit MicroStrategy Drill%ODBC% SQL%Query% Query% Driver Drillbits Driver Parser Planner Drill%Worker Drill%Worker Excel SAP%Crystal% Reports
  • 13. Nested  Data   JSON   •  Nested  data  is  becoming  prevalent   {      "name":  "Homer",   –  JSON,  BSON,  XML,  Protocol  Buffers,  Avro,  …      "gender":  "Male",      "followers":  100   –  The  data  source  may  or  may  not  be  aware      children:  [   •  MongoDB  supports  nested  data  na)vely          {name:  "Bart"},          {name:  "Lisa”}   •  A  single  HBase  value  could  be  a  JSON  document      ]   (compound  nested  type)   }   –  Google  Dremel’s  innova)on  was  efficient  columnar   storage  and  querying  of  nested  data   •  FlaGening  nested  data  is  error-­‐prone  and  ocen   Avro   enum  Gender  {   impossible      MALE,  FEMALE   –  Think  about  repeated  and  op)onal  fields  at  every   }     level…   record  User  {   •  Apache  Drill  supports  nested  data      string  name;      Gender  gender;   –  Extensions  to  ANSI  SQL:2003      long  followers;   }  
  • 14. Schema  is  Op)onal   •  Many  data  sources  do  not  have  rigid  schemas   –  Schemas  change  rapidly   –  Each  record  may  have  a  different  schema   •  Sparse  and  wide  rows  in  HBase  and  Cassandra,  MongoDB   •  Apache  Drill  supports  querying  against  unknown  schemas   –  Query  any  HBase,  Cassandra  or  MongoDB  table   •  User  can  define  the  schema  or  let  the  system  discover  it  automa)cally   –  System  of  record  may  already  have  schema  informa)on   •  Why  manage  it  in  a  separate  system?   –  No  need  to  manage  schema  evolu)on   Row  Key   CF  contents   CF  anchor   "com.cnn.www"   contents:html  =  "<html>…"   anchor:my.look.ca  =  "CNN.com"   anchor:cnnsi.com  =  "CNN"   "com.foxnews.www"   contents:html  =  "<html>…"   anchor:en.wikipedia.org  =  "Fox  News"     …   …   …  
  • 15. Flexible  and  Extensible  Architecture   •  Apache  Drill  is  designed  for  extensibility   –  Well-­‐documented  APIs  and  interfaces   •  Data  sources  and  file  formats   –  Implement  a  custom  scanner  to  support  a  new  data  source  or  file  format   •  Query  languages   –  SQL:2003  is  the  primary  language   –  Implement  a  custom  Parser  to  support  a  Domain  Specific  Language   –  UDFs  and  UDTFs   •  Op)mizers   –  Drill  will  have  a  cost-­‐based  op)mizer   –  Clear  surrounding  APIs  support  easy  op)mizer  explora)on   •  Operators   –  Custom  operators  can  be  implemented   •  Special  operators  for  Mahout  (k-­‐means)  being  designed   –  Operator  push-­‐down  to  data  source  (RDBMS)  
  • 16. What  About  Other  SQL-­‐on-­‐Hadoop   Systems?   •  Strengths   –  Code  is  more  mature  than  Apache  Drill   •  Already  in  beta  (~2  quarters  ahead)   –  Faster  than  Hive  on  some  queries   •  Weaknesses   –  Proprietary  or  semi-­‐open  source   –  Query  results  must  fit  in  memory  (no  spooling)   –  Early  row  materializa)on  (no  columnar  execu)on)   –  Some  are  SQL-­‐like  (not  SQL)   –  No  support  for  nested  data   –  Rigid  schema  is  required   –  Limited  flexibility  and  extensibility   –  Only  support  Hadoop  and  HBase  (and  no  other  NoSQL  or  RDBMS)  
  • 17. Status:  In  Progress   •  Heavy  ac)ve  development   –  6-­‐7  companies  are  contribu)ng   •  Available   –  Logical  plan  syntax  and  interpreter   –  Reference  execu)on  engine   •  In  progress   –  SQL  interpreter   –  Storage  engine  implementa)ons  for  Accumulo,  Cassandra,  HBase  and  various  file  formats   •  Significant  community  momentum   –  Over  250  people  on  the  Drill  mailing  list   –  Over  250  members  of  the  Bay  Area  Drill  User  Group   –  Drill  meetups  across  the  US  and  Europe   –  OpenDremel  team  joined  Apache  Drill   •  An)cipated  schedule:   –  Prototype:  Q1   –  Alpha:  Q2   –  Beta:  Q3  
  • 18. Why  Apache  Drill  Will  Be  Successful   Resources   Community   Architecture   •  Contributors  have  strong   •  Development  done  in  the   •  Full  SQL   backgrounds  from   open   •  New  data  support   companies  like  Oracle,   •  Ac)ve  contributors  from   •  Extensible  APIs   IBM  Netezza,  Informa)ca,   mul)ple  companies   •  Full  Columnar  Execu)on   Clustrix  and  Pentaho   •  Rapidly  growing   •  Beyond  Hadoop  
  • 19. Interested  in  Apache  Drill?   •  Many  op)ons  to  contribute   –  Become  a  full  )me  Drill  engineer  @  MapR   •  Email  tshiran@maprtech.com     –  Join  the  Drill  mailing  list  and  start  contribu)ng   •  JIRAs,  code,  unit  tests,  documenta)on,  …   –  Shoot  me  an  email  and  we  can  discuss   •  Email  tshiran@maprtech.com  
  • 21. Why  Not  Leverage  MapReduce?   •  Scheduling  Model   –  Coarse  resource  model  reduces  hardware  u)liza)on   –  Acquisi)on  of  resources  typically  takes  100’s  of  millis  to  seconds   •  Barriers   –  Map  comple)on  required  before  shuffle/reduce   commencement   –  All  maps  must  complete  before  reduce  can  start   –  In  chained  jobs,  one  job  must  finish  en)rely  before  the  next  one   can  start   •  Persistence  and  Recoverability   –  Data  is  persisted  to  disk  between  each  barrier   –  Serializa)on  and  deserializa)on  are  required  between  execu)on   phase