Cloudera Impala

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.

As presented to Portland Big Data User Group on July 23rd 2014.
http://www.meetup.com/Hadoop-Portland/events/194930422/

Usage Rights

© All Rights Reserved

Cloudera Impala: Presentation Transcript

  • Cloudera Impala
    Portland Big Data User Group, July 2014
    Alex Moundalexis, @technmsg
  • Thirty Seconds About Alex
    • Solutions Architect
    • aka consultant
    • government
    • infrastructure
    • former coder of Perl
    • former administrator
    • fan of Portland
  • What Does Cloudera Do?
    • product
    • distribution of Hadoop components, Apache licensed
    • enterprise tooling
    • support
    • training
    • services (aka consulting)
    • community
  • Disclaimer
    • Cloudera builds software
    • most donated to Apache
    • some closed-source
    • Cloudera “products” I reference are open source
    • Apache Licensed
    • source code is on GitHub
    • https://github.com/cloudera
  • What This Talk Isn’t About
    • deploying
    • Puppet, Chef, Ansible, homegrown scripts, intern labor
    • sizing & tuning
    • depends heavily on data and workload
    • coding
    • unless you count XML or CSV or SQL
    • algorithms
  • [Image: Public Domain, IFCAR]
  • [Image: CC BY-SA, Lilian De Cassai]
  • cloud·e·ra im·pal·a
    /kloudˈi(ə)rə imˈpalə/
    noun
    a modern, open source, MPP SQL query engine for Apache Hadoop.
    “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  • The Apache Hadoop Ecosystem
    Quick and dirty, for context.
  • Why “Ecosystem?”
    • In the beginning, just Hadoop
    • HDFS
    • MapReduce
    • Today, dozens of interrelated components
    • I/O
    • Processing
    • Specialty Applications
    • Configuration
    • Workflow
  • HDFS
    • Distributed, highly fault-tolerant filesystem
    • Optimized for large streaming access to data
    • Based on Google File System
    • http://research.google.com/archive/gfs.html
  • Lots of Commodity Machines
    Image: Yahoo! Hadoop cluster [OSCON ’07]
  • MapReduce (MR)
    • Programming paradigm
    • Batch oriented, not realtime
    • Works well with distributed computing
    • Lots of Java, but other languages supported
    • Based on Google’s paper
    • http://research.google.com/archive/mapreduce.html
  • Under the Covers
  • You specify map() and reduce() functions. The framework does the rest.
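To make the map()/reduce() split concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are illustrative names, and `run_job` stands in for the framework by grouping the mapped pairs by key before reducing.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """Combine all the counts that were emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Toy stand-in for the framework: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["Hadoop stores data", "Impala queries data"]))
```

Only `map_fn` and `reduce_fn` carry the job's logic; in a real cluster the grouping step, scheduling, and retries would be the framework's responsibility.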
  • Apache Hive
    • Abstraction of Hadoop’s Java API
    • HiveQL “compiles” down to MR
    • a “SQL-like” language
    • Eases analysis using MapReduce
  • Apache Hive Metastore
    • Maps HDFS files to DB-like resources
    • Databases
    • Tables
    • Column/field names, data types
    • Roles/users
    • InputFormat/OutputFormat
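The idea of mapping plain files to table-like resources can be sketched with a toy catalog. This is an assumption-laden illustration, not the metastore's actual data model or API: `CATALOG`, `read_table`, and the file path are hypothetical, and the raw text stands in for the contents of a file in HDFS.

```python
import csv
import io

# Hypothetical catalog entry: table name -> file location plus column names and types.
CATALOG = {
    "people": {
        "location": "/data/people.csv",  # illustrative path, not from the talk
        "columns": [("name", str), ("state", str), ("employer", str), ("year", int)],
    }
}

# Stand-in for the raw bytes stored in the file; the file itself carries no schema.
RAW = "Alex,Maryland,Cloudera,2013\nJoey,Maryland,Cloudera,2011\n"


def read_table(table, raw_text):
    """Apply the catalog's column names and types to raw delimited text."""
    schema = CATALOG[table]["columns"]
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({col: cast(value) for (col, cast), value in zip(schema, fields)})
    return rows


if __name__ == "__main__":
    for row in read_table("people", RAW):
        print(row)
```

The point of the sketch is that the schema lives in the catalog rather than in the data file, which is why more than one engine can share the same table definitions.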
  • But wait… WHY DO WE NEED THIS?
  • Super Shady SQL Supplement
    I am not a SQL wizard by any means…
  • A Simple Relational Database
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    >
  • Interacting with Relational Data
    > SELECT * FROM people;
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
  • Requesting Specific Fields
    > SELECT name, state FROM people;
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
  • Requesting Specific Rows
    > SELECT name, state FROM people WHERE year < 2012;
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
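The three queries on these slides can be tried without a cluster. The sketch below loads the people table into an in-memory SQLite database using Python's standard sqlite3 module. SQLite is a stand-in chosen for convenience, not what the talk ran, and `build_people_db` is an illustrative helper name.

```python
import sqlite3


def build_people_db():
    """Create an in-memory database holding the people table from the slides."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    return conn


conn = build_people_db()

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Only two of the four columns.
names_states = conn.execute("SELECT name, state FROM people").fetchall()

# Only the rows whose year is earlier than 2012.
before_2012 = conn.execute(
    "SELECT name, state FROM people WHERE year < 2012"
).fetchall()

if __name__ == "__main__":
    for label, rows in [("all", all_rows), ("fields", names_states), ("rows", before_2012)]:
        print(label, rows)
```

The column list narrows the result horizontally and the WHERE clause narrows it vertically, which is the distinction the slides build up one step at a time.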
  • Two Simple Tables
    owner  species  name
    Alex   Cactus   Marvin
    Joey   Cat      Brain
    Sean   None
    Paris  Unknown

    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    >
  • Joining Two Tables
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner

    owner  species  name
    Alex   Cactus   Marvin
    Joey   Cat      Brain
    Sean   None
    Paris  Unknown

    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
  • Joining Two Tables
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner

    owner  state     pet
    Alex   Maryland  Marvin
    Joey   Maryland  Brain
    Sean   Texas
    Paris  Maryland
  • Varying Implementation of JOIN
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner

    owner  state     pet
    Alex   Maryland  Marvin
    Joey   Maryland  Brain
    Sean   Texas     ?
    Paris  Maryland  ?
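The LEFT JOIN from these slides can be reproduced with Python's standard sqlite3 module. This is a sketch under assumptions: SQLite stands in for the engine used in the talk, the pets rows for Sean and Paris are loaded with a NULL name (the slides show the cell as empty), and the "?" on the last slide is read here as "how a missing pet name is displayed depends on the tool".

```python
import sqlite3


def left_join_pets():
    """Load both tables from the slides and run the LEFT JOIN against them."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)")
    conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    conn.executemany(
        "INSERT INTO pets VALUES (?, ?, ?)",
        [
            ("Alex", "Cactus", "Marvin"),
            ("Joey", "Cat", "Brain"),
            ("Sean", "None", None),      # assumption: empty name cell stored as NULL
            ("Paris", "Unknown", None),  # assumption: empty name cell stored as NULL
        ],
    )
    return conn.execute(
        "SELECT people.name AS owner, people.state AS state, pets.name AS pet "
        "FROM people LEFT JOIN pets ON people.name = pets.owner"
    ).fetchall()


if __name__ == "__main__":
    for owner, state, pet in left_join_pets():
        # sqlite3 hands NULL back as None; other clients may print it differently.
        print(owner, state, pet if pet is not None else "NULL")
```

A LEFT JOIN keeps every row of people even when no pet name is available, so the result always has four rows and the open question is only how the missing value is rendered.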
  • Cloudera Impala
    Familiar interface, but more powerful.
  • Cloudera Impala
    • Interactive query on Hadoop
    • think seconds, not minutes
    • Nearly ANSI-92 standard SQL
    • compatible with HiveQL
    • Native MPP query engine
    • built for low-latency queries
  • Cloudera Impala – Design Choices
    • Native daemons, written in C/C++
    • No JVM, no MapReduce
    • Saturate disks on reads
    • Uses in-memory HDFS caching
    • Re-uses Hive metastore
    • Not as fault-tolerant as MapReduce
  • Cloudera Impala – Architecture
    • Impala Daemon
    • runs on every node
    • handles client requests
    • handles query planning & execution
    • State Store Daemon
    • provides name service
    • metadata distribution
    • used for finding data
  • Impala Query Execution
    [Diagram: a SQL app connects via ODBC; shared services are the Hive Metastore, HDFS NN, and Statestore; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]
    1) Request arrives via ODBC/JDBC/HUE/Shell
    2) Planner turns request into collections of plan fragments
    3) Coordinator initiates execution on impalad(s) local to data
    4) Intermediate results are streamed between impalad(s)
    5) Query results are streamed back to client
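The five steps above follow a scatter-gather pattern, which can be sketched as a single-process toy. Nothing here is Impala's code or API: `NODE_DATA`, `plan`, `execute_fragment`, `coordinate`, and `run_query` are hypothetical names, the data placement is invented for illustration, and ordinary function calls stand in for the network streaming between daemons.

```python
from collections import Counter

# Hypothetical data placement: each node holds some rows (name, state) of a table.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(nodes):
    """Steps 2-3: produce one plan fragment per node that holds data locally."""
    return [{"node": node, "op": "count_by_state"} for node in nodes]


def execute_fragment(fragment):
    """Executor: run the fragment against that node's local rows only."""
    rows = NODE_DATA[fragment["node"]]
    return Counter(state for _name, state in rows)


def coordinate(fragments):
    """Steps 4-5: merge the partial results into the answer sent to the client."""
    total = Counter()
    for fragment in fragments:
        total.update(execute_fragment(fragment))
    return dict(total)


def run_query():
    """Toy equivalent of: SELECT state, COUNT(*) FROM people GROUP BY state."""
    return coordinate(plan(NODE_DATA))


if __name__ == "__main__":
    print(run_query())
```

Each fragment touches only the rows stored on its own node, so the partial counts are small compared with the raw data and only those partial results need to travel to the coordinator.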
  • Cloudera Impala – Results
    • Allows for fast iteration/discovery
    • How much faster?
    • 3-4x faster on I/O bound workloads
    • up to 45x faster on multi-MR queries
    • up to 90x faster on in-memory cache
  • Demo
    Hold onto something, folks.
  • What’s Next?
    • Download Hadoop!
    • CDH available at www.cloudera.com
    • Already done that? Contribute…
    • Cloudera provides pre-loaded VMs
    • http://tiny.cloudera.com/quickstartvm
    • Clone our repos!
    • https://github.com/cloudera
  • Special thanks: PORTLAND
  • Questions?
    Preferably related to the talk… or not.
  • Thank You!
    Alex Moundalexis, @technmsg
    We’re hiring, kids! Well, not kids.