Cloudera impala

James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of

Published in: Technology

Transcript

  • 1. Cloudera Impala: Real-Time Query for HDFS and HBase
  • 2. Beyond Batch; What is Impala; Capability; Architecture; Demo
  • 3. Beyond Batch
      • For some things MapReduce is just too slow
      • Apache Hive:
          • MapReduce execution engine
          • High-latency, low throughput
          • High runtime overhead
      • Google realized this early on
          • Analysts wanted fast, interactive results
  • 4. Dremel
      • Google paper (2010)
      • “scalable, interactive ad-hoc query system for analysis of read-only nested data”
      • Columnar storage format
      • Distributed scalable aggregation
      • “capable of running aggregation queries over trillion-row tables in seconds”
      • http://research.google.com/pubs/pub36632.html
  • 5. Impala: Goals
      • General-purpose SQL query engine for Hadoop
          • For analytical and transactional workloads
          • Support queries that take μs to hours
      • Run directly with Hadoop
          • Collocated daemons
          • Same file formats
          • Same storage managers (NN, metastore)
  • 6. Impala: Goals
      • High performance
          • C++
          • Runtime code generation (LLVM)
          • Direct access to data (no MapReduce)
      • Retain user experience
          • Easy for Hive users to migrate
      • 100% open-source
  • 7. Impala: Capability
      • HiveQL (subset of SQL92)
          • select, project, join, union, subqueries, aggregation, insert, order by (with limit)
      • DDL
      • Directly queries data in HDFS & HBase
          • Text files (compressed)
          • Sequence files (snappy/gzip)
          • Avro & Trevni
      • GA features
  • 8. Impala: Capability
      • Familiar and unified platform
          • Uses Hive’s metastore
          • Submit queries via ODBC | Beeswax Thrift API
      • Query is distributed to nodes with relevant data
      • Process-to-process data exchange
      • Kerberos authentication
      • No fault tolerance
  • 9. Impala: Performance
      • Greater disk throughput
          • ~100MB/sec/disk
          • I/O-bound workloads faster by 3-4x
      • Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x)
      • Queries that run against in-memory cached data see a significant speedup (up to 90x)
  • 10. Impala: Architecture
      • impalad
          • Runs on every node
          • Handles client requests (ODBC, Thrift)
          • Handles query planning & execution
      • statestored
          • Provides name service
          • Metadata distribution
          • Used for finding data
  • 11. Impala: Architecture
  • 12. Impala: Architecture
  • 13. Impala: Architecture
  • 14. Impala: Architecture
  • 15. Current limitations
      • Public Beta (available since 24 Oct 2012)
      • No SerDes
      • No User Defined Functions (UDFs)
      • Joins are done in memory space no larger than that of smallest node
      • impalad daemons only read statestored metadata at startup
  • 16. Futures
      • GA Q1 2013
      • DDL support (CREATE, ALTER)
      • Rudimentary cost-based optimizer (CBO)
      • Joins done in aggregate memory
      • Metadata distribution through statestored
      • Doug Cutting’s Trevni
          • Columnar storage format like Dremel’s
          • Impala + Trevni = Dremel superset
  • 17. Demo
      • impala-user@cloudera.com
      • kinley@cloudera.com
      • @jrkinley
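
Slide 7 lists the query shapes the beta supports (select, project, join, aggregation, order by with limit). As a rough illustration of that class of query, the sketch below runs one against Python's built-in sqlite3 module. sqlite3 is only a local stand-in so the example can run; it is not Impala, and nothing here uses Impala's ODBC or Beeswax interfaces. The table names (`customers`, `orders`), columns, and rows are invented for the example.

```python
import sqlite3

# sqlite3 stands in for Impala so the query can run locally (assumption).
# Tables, columns, and data below are hypothetical.
QUERY = """
SELECT c.region, COUNT(*) AS num_orders, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region
ORDER BY total DESC
LIMIT 2
"""


def run_demo():
    """Load a tiny data set and run a join + aggregation + order-by-with-limit."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'AMER');
        INSERT INTO orders VALUES (1, 100), (1, 50), (2, 70), (3, 10);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

The ORDER BY is paired with a LIMIT because the slide lists order by only "with limit".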

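
Slides 15 and 16 contrast two join limits: in the beta, a join runs in a memory space no larger than that of the smallest node, while the GA plan is to join in aggregate memory. The arithmetic can be sketched as below. The function names, the per-node memory figures, and the simplification that a join needs one fixed amount of memory are all assumptions for illustration, not Impala's actual memory accounting.

```python
def beta_join_limit_gb(node_memory_gb):
    """Beta limit as stated on slide 15: the smallest node's memory."""
    return min(node_memory_gb)


def ga_join_limit_gb(node_memory_gb):
    """Planned GA limit per slide 16: the cluster's aggregate memory."""
    return sum(node_memory_gb)


def join_fits(required_gb, limit_gb):
    """True if the (hypothetical) memory a join needs is within the limit."""
    return required_gb <= limit_gb


if __name__ == "__main__":
    cluster = [48, 48, 32, 64]  # hypothetical memory per node, in GB
    required = 100              # hypothetical memory the join needs, in GB

    beta = beta_join_limit_gb(cluster)
    ga = ga_join_limit_gb(cluster)
    print("beta limit:", beta, "GB, fits:", join_fits(required, beta))
    print("GA limit:", ga, "GB, fits:", join_fits(required, ga))
```

Under these made-up numbers the same join exceeds the beta limit but fits once joins can use aggregate memory.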