Introduction to Cloudera Impala

Cloudera Impala provides a fast, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Learn the design choices and architecture behind Impala, and how to use near-ubiquitous SQL to explore your own data at scale.

As presented to Charm City Linux on March 25th 2014.
http://www.meetup.com/CharmCityLinux/events/168288632/

Transcript

  • 1. Cloudera Impala
    Charm City Linux, March 2014
    Alex Moundalexis
    alexm+ccl@clouderagovt.com
    @technmsg
  • 2. Thirty Seconds About Alex
    • Solutions Architect
    • aka consultant
    • government
    • infrastructure
    • former coder of Perl
    • former administrator
    • likes shiny objects
  • 3. What Does Cloudera Do?
    • product
    • distribution of Hadoop components, Apache licensed
    • enterprise tooling
    • support
    • training
    • services (aka consulting)
    • community
  • 4. Disclaimer
    • Cloudera builds software
    • most donated to Apache
    • some closed-source
    • Cloudera “products” I reference are open source
    • Apache Licensed
    • source code is on GitHub
    • https://github.com/cloudera
  • 5. What This Talk Isn’t About
    • deploying
    • Puppet, Chef, Ansible, homegrown scripts, intern labor
    • sizing & tuning
    • depends heavily on data and workload
    • coding
    • unless you count XML or CSV or SQL
    • algorithms
  • 6. The Apache Hadoop Ecosystem
    Quick and dirty, for context.
  • 7. Why “Ecosystem?”
    • In the beginning, just Hadoop
    • HDFS
    • MapReduce
    • Today, dozens of interrelated components
    • I/O
    • Processing
    • Specialty Applications
    • Configuration
    • Workflow
  • 8. HDFS
    • Distributed, highly fault-tolerant filesystem
    • Optimized for large streaming access to data
    • Based on Google File System
    • http://research.google.com/archive/gfs.html
  • 9. Lots of Commodity Machines
    Image: Yahoo! Hadoop cluster [OSCON ’07]
  • 10. MapReduce (MR)
    • Programming paradigm
    • Batch oriented, not realtime
    • Works well with distributed computing
    • Lots of Java, but other languages supported
    • Based on Google’s paper
    • http://research.google.com/archive/mapreduce.html
  • 11. Under the Covers
  • 12. You specify map() and reduce() functions. The framework does the rest.
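To make the map()/reduce() split concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` merely stands in for what the framework does across a cluster (run the mappers, group values by key, run the reducers).

```python
from collections import defaultdict


def map_fn(line):
    """User-supplied map step: emit a (word, 1) pair for each word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """User-supplied reduce step: combine every count emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Toy stand-in for the framework: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```

Only `map_fn` and `reduce_fn` carry application logic; everything in `run_job` is the part a real framework would distribute, schedule, and retry on failure.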
  • 13. Apache Hive
    • Abstraction of Hadoop’s Java API
    • HiveQL “compiles” down to MR
    • a “SQL-like” language
    • Eases analysis using MapReduce
  • 14. Apache Hive Metastore
    • Maps HDFS files to DB-like resources
    • Databases
    • Tables
    • Column/field names, data types
    • Roles/users
    • InputFormat/OutputFormat
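The idea of mapping plain files to table-like resources can be sketched in a few lines. This is an illustration only, not the metastore's real schema or API: `TableDef`, `read_row`, the `/user/demo/people` path, and the comma delimiter are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TableDef:
    """Hypothetical, simplified stand-in for one metastore table entry."""
    database: str
    name: str
    location: str          # directory that holds the table's data files
    columns: list          # ordered (column name, type name) pairs
    delimiter: str = ","   # how fields are separated inside each line


# Type names mapped to Python casts; only the two needed here.
CASTS = {"string": str, "int": int}


def read_row(table, raw_line):
    """Apply the table's column names and types to one raw text line."""
    fields = raw_line.rstrip("\n").split(table.delimiter)
    return {
        col_name: CASTS[type_name](value)
        for (col_name, type_name), value in zip(table.columns, fields)
    }


people = TableDef(
    database="default",
    name="people",
    location="/user/demo/people",  # hypothetical path
    columns=[("name", "string"), ("state", "string"),
             ("employer", "string"), ("year", "int")],
)

if __name__ == "__main__":
    print(read_row(people, "Alex,Maryland,Cloudera,2013\n"))
```

The data file itself stays an ordinary text file; the names and types live in the table definition and are applied when a row is read.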
  • 15. But wait…
    WHY DO WE NEED THIS?
  • 16.
  • 17. Super Shady SQL Supplement
    I am not a SQL wizard by any means…
  • 18. A Simple Relational Database
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    >
  • 19-20. Interacting with Relational Data
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    > SELECT * FROM people;
  • 21-22. Requesting Specific Fields
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    > SELECT name, state FROM people;
  • 23-24. Requesting Specific Rows
    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    > SELECT name, state FROM people WHERE year < 2012;
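The three queries above can be tried without a cluster. The sketch below loads the slide's `people` table into an in-memory SQLite database using Python's standard `sqlite3` module; SQLite stands in for Impala here, and the helper names `load_people` and `query` are made up for the example.

```python
import sqlite3

# Rows copied from the "people" table on the slides.
PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]


def load_people():
    """Create an in-memory database holding the people table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", PEOPLE)
    return conn


def query(conn, sql):
    """Run one statement and return every result row as a list of tuples."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = load_people()
    print(query(conn, "SELECT * FROM people"))              # every column, every row
    print(query(conn, "SELECT name, state FROM people"))    # two columns only
    print(query(conn, "SELECT name, state FROM people WHERE year < 2012 ORDER BY name"))
```

The `ORDER BY name` in the last query is an addition for repeatable output; SQL does not guarantee row order without it.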
  • 25. Two Simple Tables
    owner  species  name
    Alex   Cactus   Marvin
    Joey   Cat      Brain
    Sean   None
    Paris  Unknown

    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    >
  • 26-28. Joining Two Tables
    owner  species  name
    Alex   Cactus   Marvin
    Joey   Cat      Brain
    Sean   None
    Paris  Unknown

    name   state     employer  year
    Alex   Maryland  Cloudera  2013
    Joey   Maryland  Cloudera  2011
    Sean   Texas     Cloudera  2013
    Paris  Maryland  AOL       2011
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner
  • 29. Joining Two Tables
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner
    owner  state     pet
    Alex   Maryland  Marvin
    Joey   Maryland  Brain
    Sean   Texas
    Paris  Maryland
  • 30. Varying Implementation of JOIN
    > SELECT people.name AS owner, people.state AS state, pets.name AS pet
      FROM people LEFT JOIN pets ON people.name = pets.owner
    owner  state     pet
    Alex   Maryland  Marvin
    Joey   Maryland  Brain
    Sean   Texas     ?
    Paris  Maryland  ?
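The join can be reproduced with the same in-memory SQLite approach. This is a sketch with assumptions: SQLite stands in for Impala, the blank pet-name cells on the slides are stored as NULL, an `ORDER BY` is added for repeatable output, and `build_db` and `owners_and_pets` are hypothetical helper names.

```python
import sqlite3

PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]

# Assumption: blank "name" cells in the pets table are NULL (None in Python).
PETS = [
    ("Alex", "Cactus", "Marvin"),
    ("Joey", "Cat", "Brain"),
    ("Sean", "None", None),
    ("Paris", "Unknown", None),
]

# The slide's query, plus ORDER BY for deterministic row order.
JOIN_SQL = """
SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner
ORDER BY people.name
"""


def build_db():
    """Create an in-memory database holding both tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", PEOPLE)
    conn.executemany("INSERT INTO pets VALUES (?, ?, ?)", PETS)
    return conn


def owners_and_pets(conn):
    """Return (owner, state, pet) rows; pet is None where no name exists."""
    return conn.execute(JOIN_SQL).fetchall()


if __name__ == "__main__":
    for row in owners_and_pets(build_db()):
        print(row)
```

A LEFT JOIN keeps every row from `people` even when `pets` has no matching owner, filling the pet column with NULL. How a client displays NULL (blank, `NULL`, `None`) varies by tool.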
  • 31. Cloudera Impala
    Familiar interface, but more powerful.
  • 32. Cloudera Impala
    • Interactive query on Hadoop
    • think seconds, not minutes
    • Nearly ANSI-92 standard SQL
    • compatible with HiveQL
    • Native MPP query engine
    • built for low-latency queries
  • 33. Cloudera Impala – Design Choices
    • Native daemons, written in C/C++
    • No JVM, no MapReduce
    • Saturate disks on reads
    • Uses in-memory HDFS caching
    • Re-uses Hive metastore
    • Not as fault-tolerant as MapReduce
  • 34. Cloudera Impala – Architecture
    • Impala Daemon
    • runs on every node
    • handles client requests
    • handles query planning & execution
    • State Store Daemon
    • provides name service
    • metadata distribution
    • used for finding data
  • 35. Impala Query Execution
    [Diagram: a SQL app connects over ODBC; shared services are the Hive Metastore, HDFS NN, and Statestore; each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]
    1) Request arrives via ODBC/JDBC/HUE/Shell
  • 36. Impala Query Execution
    [Same diagram as slide 35.]
    2) Planner turns request into collections of plan fragments
    3) Coordinator initiates execution on impalad(s) local to data
  • 37. Impala Query Execution
    [Same diagram as slide 35.]
    4) Intermediate results are streamed between impalad(s)
    5) Query results are streamed back to client
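Steps 1 through 5 follow a scatter-gather pattern, which the toy below imitates in one process. It is not Impala code: the `NODES` layout, the function names, and the choice of a "count rows per value" query are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical cluster: node name -> rows stored on that node's local disks.
NODES = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(column_index):
    """Planner: split the request into one fragment per node that holds data."""
    return [(node, column_index) for node in NODES]


def execute_fragment(fragment):
    """Executor: count values of one column using only that node's local rows."""
    node, column_index = fragment
    return Counter(row[column_index] for row in NODES[node])


def coordinate(column_index):
    """Coordinator: run every fragment, merge partial results, return to client."""
    total = Counter()
    for fragment in plan(column_index):
        total.update(execute_fragment(fragment))  # adds the partial counts
    return dict(total)


if __name__ == "__main__":
    # Roughly: SELECT state, COUNT(*) FROM people GROUP BY state
    print(coordinate(1))
```

Each fragment touches only the rows on its own node, so only small partial results travel between nodes. In this sketch the fragments run one after another; a real engine would run them in parallel.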
  • 38. Cloudera Impala – Results
    • Allows for fast iteration/discovery
    • How much faster?
    • 3-4x faster on I/O bound workloads
    • up to 45x faster on multi-MR queries
    • up to 90x faster on in-memory cache
  • 39. Demo
    Hold onto something, folks.
  • 40. What’s Next?
    • Download Hadoop!
    • CDH available at www.cloudera.com
    • Already done that? Contribute…
    • Cloudera provides pre-loaded VMs
    • http://tiny.cloudera.com/quickstartvm
    • Clone our repos!
    • https://github.com/cloudera
  • 41. Special thanks:
    PARIS
  • 42. Questions?
    Preferably related to the talk… or not.
  • 43. Thank You!
    Alex Moundalexis
    alexm+ccl@clouderagovt.com
    @technmsg
    We’re hiring, kids! Well, not kids.