Cloudera Impala: Real Time Query for HDFS and HBase
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of

Published in: Technology

  1. Cloudera Impala: Real Time Query for HDFS and HBase
  2. Agenda: Beyond Batch; What is Impala; Capability; Architecture; Demo
  3. Beyond Batch. For some things MapReduce is just too slow. Apache Hive: MapReduce execution engine; high latency, low throughput; high runtime overhead. Google realized this early on: analysts wanted fast, interactive results.
  4. Dremel. Google paper (2010): "scalable, interactive ad-hoc query system for analysis of read-only nested data". Columnar storage format. Distributed scalable aggregation: "capable of running aggregation queries over trillion-row tables in seconds". http://research.google.com/pubs/pub36632.html
  5. Impala: Goals. General-purpose SQL query engine for Hadoop: for analytical and transactional workloads; support queries that take μs to hours. Run directly with Hadoop: collocated daemons; same file formats; same storage managers (NN, metastore).
  6. Impala: Goals. High performance: C++; runtime code generation (LLVM); direct access to data (no MapReduce). Retain user experience: easy for Hive users to migrate. 100% open-source.
  7. Impala: Capability. HiveQL (subset of SQL92): select, project, join, union, subqueries, aggregation, insert, order by (with limit); DDL. Directly queries data in HDFS & HBase: text files (compressed); sequence files (snappy/gzip); Avro & Trevni. GA features.
  8. Impala: Capability. Familiar and unified platform: uses Hive's metastore; submit queries via ODBC | Beeswax Thrift API. Query is distributed to nodes with relevant data. Process-to-process data exchange. Kerberos authentication. No fault tolerance.
  9. Impala: Performance. Greater disk throughput: ~100MB/sec/disk; I/O-bound workloads faster by 3-4x. Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x). Queries that run against in-memory cached data see a significant speedup (up to 90x).
  10. Impala: Architecture. impalad: runs on every node; handles client requests (ODBC, Thrift); handles query planning & execution. statestored: provides name service; metadata distribution; used for finding data.
  11. Impala: Architecture
  12. Impala: Architecture
  13. Impala: Architecture
  14. Impala: Architecture
  15. Current limitations. Public Beta (available since 24 Oct 2012): no SerDes; no User Defined Functions (UDFs); joins are done in memory space no larger than that of the smallest node; impalads only read statestored metadata at startup.
  16. Futures. GA Q1 2013: DDL support (CREATE, ALTER); rudimentary cost-based optimizer (CBO); joins done in aggregate memory; metadata distribution through statestored. Doug Cutting's Trevni: columnar storage format like Dremel's; Impala + Trevni = Dremel superset.
  17. Demo. impala-user@cloudera.com; kinley@cloudera.com; @jrkinley
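
The query shapes listed under "Impala: Capability" (select, join, aggregation, order by with limit) can be illustrated with a small runnable sketch. This is not Impala: an in-memory SQLite database stands in for the engine so the example runs anywhere, and the `customers` and `orders` tables, their columns, and their rows are hypothetical. The `CREATE TABLE` statements are SQLite setup only; the slides list DDL as a GA feature rather than something the public beta supports.

```python
import sqlite3

# Stand-in engine: SQLite is used only so the sketch is runnable without a cluster.
conn = sqlite3.connect(":memory:")

# Hypothetical tables and rows, created here purely as test data.
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
    INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5);
""")

# Join + aggregation + ORDER BY paired with LIMIT, matching the
# "order by (with limit)" note on the Capability slide.
QUERY = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY total DESC
    LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)
```

The point of the sketch is the shape of the query text; against a real cluster the same kind of statement would be submitted through ODBC or the Beeswax Thrift API, as the slides describe.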

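The ~100MB/sec/disk figure under "Impala: Performance" lends itself to a back-of-envelope estimate of how long a full scan takes when a query is purely I/O-bound. The sketch below is an assumption-laden lower bound, not a benchmark: the cluster size (10 nodes, 10 disks each) and the 1 TB table are hypothetical, 1 MB is taken as 10^6 bytes, and all disks are assumed to read in parallel at the quoted rate.

```python
def scan_seconds(data_bytes, nodes, disks_per_node, mb_per_sec_per_disk=100.0):
    """Lower-bound scan time for a purely I/O-bound query.

    Assumes every disk in the cluster streams at the same rate in parallel
    and that 1 MB = 10**6 bytes. CPU, network, and skew are ignored.
    """
    aggregate_bytes_per_sec = nodes * disks_per_node * mb_per_sec_per_disk * 1e6
    return data_bytes / aggregate_bytes_per_sec


# Hypothetical cluster: 10 nodes x 10 disks, scanning a 1 TB table.
estimate = scan_seconds(1e12, nodes=10, disks_per_node=10)
print(f"estimated scan time: {estimate:.0f} s")
```

Under these assumptions the aggregate throughput is 10 GB/sec, so the scan is bounded below by 100 seconds; real queries would add planning, CPU, and data-exchange costs on top.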