
Cloudera Impala



James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video:

Published in: Technology


  1. Cloudera Impala: Real Time Query for HDFS and HBase
  2. Beyond Batch, What is Impala, Capability, Architecture, Demo
  3. Beyond Batch
       • For some things MapReduce is just too slow
       • Apache Hive: MapReduce execution engine; high-latency, low throughput; high runtime overhead
       • Google realized this early on: analysts wanted fast, interactive results
  4. Dremel
       • Google paper (2010)
       • "scalable, ad-hoc query system for analysis of read-only nested data"
       • Columnar storage format
       • Distributed scalable aggregation
       • "capable of running aggregation queries over trillion-row tables in seconds"
       • http://
  5. Impala: Goals
       • General-purpose SQL query engine for Hadoop
       • For analytical and transactional workloads
       • Support queries that take μs to hours
       • Run directly with Hadoop: collocated daemons; same file formats; same storage managers (NN, metastore)
  6. Impala: Goals
       • High performance: C++; runtime code generation (LLVM); direct access to data (no MapReduce)
       • Retain user experience: easy for Hive users to migrate
       • 100% open-source
  7. Impala: Capability
       • HiveQL (subset of SQL92): select, project, join, union, subqueries, aggregation, insert, order by (with limit)
       • DDL
       • Directly queries data in HDFS & HBase: text files (compressed); sequence files (snappy/gzip); Avro & Trevni
       • GA features
  8. Impala: Capability
       • Familiar and unified platform
       • Uses Hive's metastore
       • Submit queries via ODBC | Beeswax Thrift API
       • Query is distributed to nodes with relevant data
       • Process-to-process data exchange
       • Kerberos authentication
       • No fault tolerance
  9. Impala: Performance
       • Greater disk throughput: ~100MB/sec/disk
       • I/O-bound workloads faster by 3-4x
       • Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x)
       • Queries that run against in-memory cached data see a significant speedup (up to 90x)
  10. Impala: Architecture
       • impalad: runs on every node; handles client requests (ODBC, Thrift); handles query planning & execution
       • statestored: provides name service; metadata distribution; used for finding data
  11. Impala: Architecture
  12. Impala: Architecture
  13. Impala: Architecture
  14. Impala: Architecture
  15. Current limitations
       • Public Beta (available since 24 Oct 2012)
       • No SerDes
       • No User Defined Functions (UDFs)
       • Joins are done in memory space no larger than that of the smallest node
       • impalad daemons only read statestored metadata at startup
  16. Futures
       • GA Q1 2013
       • DDL support (CREATE, ALTER)
       • Rudimentary cost-based optimizer (CBO)
       • Joins done in aggregate memory
       • Metadata distribution through statestored
       • Doug Cutting's Trevni: columnar storage format like Dremel's
       • Impala + Trevni = Dremel superset
  17. Demo
       • impala-
       • @jrkinley
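
Slide 9 quotes roughly 100 MB/sec of throughput per disk. As a rough back-of-the-envelope illustration of what that figure implies for a full scan, the sketch below assumes every disk in the cluster streams in parallel at that rate with no other bottleneck. The function name, the cluster size, and the 1 TB data size are hypothetical and are not taken from the slides.

```python
def scan_seconds(data_bytes: float, nodes: int, disks_per_node: int,
                 mb_per_sec_per_disk: float = 100.0) -> float:
    """Estimate the time to scan data_bytes if all disks stream in parallel.

    Assumption: perfect parallelism and no CPU or network bottleneck.
    """
    if nodes <= 0 or disks_per_node <= 0:
        raise ValueError("nodes and disks_per_node must be positive")
    # Aggregate throughput of the whole cluster, in bytes per second.
    aggregate = nodes * disks_per_node * mb_per_sec_per_disk * 1_000_000
    return data_bytes / aggregate


# Hypothetical cluster: 10 nodes with 12 disks each, scanning 1 TB (10**12 bytes).
estimate = scan_seconds(1e12, nodes=10, disks_per_node=12)
print(f"{estimate:.1f} s")  # prints "83.3 s"
```

Real scans would likely be slower than this idealized estimate, since the calculation ignores decompression, skew between nodes, and competing workloads.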
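
Slides 15 and 16 contrast the beta's join limit (memory no larger than that of the smallest node) with the planned behaviour (joins done in aggregate memory). The sketch below only illustrates that difference as a capacity check. It is not Impala's planner logic, the function names and memory figures are hypothetical, and it assumes for simplicity that all of a node's memory is available to the join.

```python
from typing import List

GB = 1024 ** 3


def join_fits_beta(join_bytes: int, node_memory_bytes: List[int]) -> bool:
    """Beta limit (slide 15): the join must fit in the smallest node's memory."""
    return join_bytes <= min(node_memory_bytes)


def join_fits_aggregate(join_bytes: int, node_memory_bytes: List[int]) -> bool:
    """Planned limit (slide 16): the join may use the cluster's combined memory."""
    return join_bytes <= sum(node_memory_bytes)


# Hypothetical three-node cluster and a join that needs 40 GB of memory.
cluster = [32 * GB, 48 * GB, 64 * GB]
join_size = 40 * GB

print(join_fits_beta(join_size, cluster))       # False: 40 GB exceeds the 32 GB node
print(join_fits_aggregate(join_size, cluster))  # True: 40 GB is under the 144 GB total
```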
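
Slide 8 states that a query is distributed to nodes with relevant data. The slides do not describe how that placement is decided, so the following is only a toy sketch of the general idea of locality-aware assignment: each data block is handed to one of the nodes that already holds a replica of it, preferring the least-loaded one. The block IDs, node names, and the `assign_blocks` helper are all hypothetical.

```python
from typing import Dict, List


def assign_blocks(block_replicas: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Assign each block to a node that stores a replica, balancing block counts."""
    assignment: Dict[str, List[str]] = {}
    for block in sorted(block_replicas):
        hosts = block_replicas[block]
        if not hosts:
            raise ValueError(f"block {block!r} has no replicas")
        # Choose the replica host with the fewest blocks so far;
        # ties go to the host name that sorts first.
        target = min(hosts, key=lambda h: (len(assignment.get(h, [])), h))
        assignment.setdefault(target, []).append(block)
    return assignment


# Hypothetical replica map: block ID -> nodes holding a copy.
replicas = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-b", "node-c"],
}
print(assign_blocks(replicas))
# {'node-a': ['blk_1'], 'node-c': ['blk_2'], 'node-b': ['blk_3']}
```

Every block is read on a node that stores it locally, which is the property the slide describes; a real scheduler would also have to weigh block sizes and current node load.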