Impala: Real-time Queries in Hadoop

Learn how Cloudera Impala empowers you to:

- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage

Published in: Technology

Impala: Real-time Queries in Hadoop

  1. Cloudera Impala. Justin Erickson | Product Manager. November 2012
  2. Why Data Scientists Love Hadoop
     • Massive volumes of data
     • Data preparation & analytics in 1 environment
     • Highly flexible environment for creating & testing machine learning models
     • 10% the cost/TB under management
     ©2012 Cloudera, Inc. All Rights Reserved.
  3. Hadoop Use Cases Moving to Real-Time
     • Already query Hadoop using Hive
     • Already load data into CDH every 90 mins or less
     • Already use HBase for real-time data access
     Source: Cloudera customer survey, August 2012
  4. But Hadoop Isn’t Fast Enough
     • Need faster queries on Hadoop data
     • Move data from Hadoop to RDBMS for interactive SQL
     • See value today in consolidating to a single platform
     Source: Cloudera customer survey, August 2012
  5. Beyond Batch – The Next Stage for Hadoop
     HADOOP TODAY IS TOO SLOW
     • MapReduce is batch
     • Simple queries can take minutes / tens of minutes
     CURRENT DATA MANAGEMENT IS TOO COMPLEX
     • Optimized for rigid schemas & special-purpose applications
     • Redundant data storage & processes
     • Very expensive systems: $20K-150K / TB
  6. Cloudera Enterprise RTQ: Real-Time Query for Data Stored in Hadoop
     Powered by Cloudera Impala.
     • Supports Hive SQL
     • 4-30X faster than Hive over MapReduce
     • Supports multiple storage engines & file formats
     • Uses existing drivers, integrates with existing metastore, works with leading BI tools
     • Flexible, cost-effective, no lock-in
     • Deploy & operate with Cloudera Manager
  7. Cloudera Now Powered by Impala
     [Diagram: "Before Impala" vs. "With Impala", with layers labeled User Interface, Batch Processing, and Real-Time Access]
     • Unified Storage:
       – Supports HDFS and HBase
       – Flexible file formats
     • Unified Metastore
     • Unified Security
     • Unified Client Interfaces: ODBC, SQL syntax, Hue Beeswax
     • With Impala:
       – Real-time SQL queries
       – Native distributed query engine
       – Optimized for low latency
     • Provides:
       – Answers as fast as you can ask
       – The ability for everyone to ask questions of all data
       – Big data storage and analytics together
  8. Impala beta features
     Today (Cloudera Impala 0.1):
     • Nearly all of Hive's SQL, including insert, join, and subqueries
     • Query results 4-30X faster than Hive
     • Same open Hive metadata model => easy to create & change schema
     • Support for HDFS and HBase storage
     • HDFS file formats: TextFile, SequenceFile
     • HDFS compression: Snappy, GZIP, BZIP
     • Common ODBC driver and Hue Beeswax with Hive
     • Separate CLI from Hive
     Next few months:
     • Support for Avro, RCFile & LZO-compressed text
     • Additional OS support
     • Trevni columnar format
     • JDBC driver
     • DDL
     • Straggler handling
     • Increased join performance
  9. Impala v0.1 SQL (HiveQL)
     • Select
       – Boolean, tinyint, smallint, int, bigint, float, double, timestamp, string
       – All, distinct
       – Subqueries (in from clause)
       – Where, group by, having
       – Order by (with limit initially)
       – Joins (left, right, full, outer), multi-table, subquery
       – Union all
       – Limit
       – External tables
       – Relational, arithmetic, logical operators
       – Math, collection, cast, date, conditional, string, timestamp built-ins (e.g. count, sum, cast, case, like, in, between, coalesce)
     • Insert into
  10. Cloudera Impala Details
      [Architecture diagram. Labels: SQL App, ODBC, Hive Metastore, YARN, HDFS NN, State Store; three nodes, each with a Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, and HBase.]
      Callouts: Common Hive SQL and interface; Unified metadata and scheduler; Fully MPP Distributed; Local Direct Reads
  11. Cloudera Impala Details (same diagram; callout: Common Hive SQL and interface; label: SQL Request)
  12. Cloudera Impala Details (same diagram; callout: Unified metadata and scheduler)
  13. Cloudera Impala Details (same diagram; callout: Fully MPP Distributed)
  14. Cloudera Impala Details (same diagram; callout: Local Direct Reads)
  15. Cloudera Impala Details (same diagram; callout: In Memory Transfers; label: SQL Results)
  16. Impala and Hive
      • Shared with Hive:
        – Metadata (table definitions)
        – ODBC driver
        – Hue Beeswax
        – SQL syntax (HiveQL)
        – Flexible file formats
        – Machine pool
      • Improvements:
        – Purpose-built query engine direct on HDFS and HBase
        – No JVM and MapReduce
        – In-memory data transfers
        – Low-latency scheduler
        – Native distributed relational query engine
        – Trevni columnar format (after v0.1)
  17. Advantages of Our Approach
      • No high-latency MapReduce batch processing
      • Local processing avoids network bottlenecks
      • No costly data format conversion overhead
      • All data immediately query-able
      • Single machine pool to scale
      • All machines available to both Impala and MapReduce
      • Single, open, and unified metadata and scheduler
      [Diagram: alternative approaches labeled MapReduce, Remote Query, and Side Storage]
  18. Google Dremel and Impala
      • What is Dremel:
        – Columnar storage for data with nested structures
        – Distributed scalable aggregation on top of that
      • Columnar storage in Hadoop: Trevni
        – New columnar format created by Doug Cutting
        – Stores data in appropriate native/binary types
        – Will also store nested structures similar to Dremel's ColumnIO
      • Distributed aggregation: Impala
      • Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)
  19. Benefits of Cloudera Impala: Real-Time Query for Data Stored in Hadoop
      • Get answers as fast as you can ask questions
      • Interactive analytics directly on source data
      • No jumping between data silos
      • Reduce duplicate storage with EDW
      • Reduce data movement for interactive analysis
      • Leverage existing tools and employee skills
      • Ask questions of all your data
      • No information loss from aggregation or conforming to relational schemas for analysis
      • Single metadata store from origination through analysis
      • No need to hunt through multiple data silos
  20. Validated Beta Partners
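
Slide 9 lists the SQL constructs available in Impala v0.1. The sketch below shows what a single query combining several of them might look like: a subquery in the FROM clause, BETWEEN, a left outer join, COALESCE, GROUP BY, HAVING, and ORDER BY with LIMIT. The table and column names are hypothetical, and SQLite is used purely as a local stand-in so the example can be run without a cluster; it is not Impala, and exact HiveQL behavior in Impala 0.1 may differ.

```python
import sqlite3

# Hypothetical tables. SQLite stands in for Impala only so the sketch runs locally.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, url TEXT, duration DOUBLE);
    CREATE TABLE users (user_id INTEGER, country TEXT);
    INSERT INTO page_views VALUES
        (1, '/home', 2.5), (1, '/docs', 10.0), (2, '/home', 1.0),
        (3, '/docs', 4.0), (3, '/docs/sql', 6.0);
    INSERT INTO users VALUES (1, 'US'), (2, 'DE');
""")

# Uses only constructs named on the slide: subquery in FROM, BETWEEN,
# LEFT OUTER JOIN, COALESCE, GROUP BY, HAVING, ORDER BY with LIMIT.
QUERY = """
SELECT COALESCE(u.country, 'unknown') AS country,
       COUNT(*) AS views,
       SUM(v.duration) AS total_duration
FROM (SELECT user_id, url, duration
      FROM page_views
      WHERE duration BETWEEN 1.0 AND 10.0) v
LEFT OUTER JOIN users u ON v.user_id = u.user_id
GROUP BY COALESCE(u.country, 'unknown')
HAVING COUNT(*) >= 1
ORDER BY views DESC, country
LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The ORDER BY is paired with a LIMIT because the slide notes that ordering is supported "with limit initially".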

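Slides 10 to 15 describe a query flowing from a coordinator to per-node execution engines that read local data and return results through in-memory transfers. As a rough mental model only, the toy sketch below has each "node" aggregate its own rows and a coordinator merge the partial results. Every name in it (`NODE_LOCAL_ROWS`, `exec_fragment`, `coordinate`) is hypothetical, threads stand in for machines, and nothing here reflects Impala's actual implementation.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for rows stored on one data node.
NODE_LOCAL_ROWS = [
    [("US", 3), ("DE", 1)],
    [("US", 2), ("FR", 5)],
    [("DE", 4)],
]


def exec_fragment(local_rows):
    """Exec-engine stand-in: aggregate only the rows local to this node."""
    partial = Counter()
    for country, clicks in local_rows:
        partial[country] += clicks
    return partial


def coordinate(all_node_rows):
    """Coordinator stand-in: fan out one fragment per node, then merge partials."""
    with ThreadPoolExecutor(max_workers=len(all_node_rows)) as pool:
        partials = list(pool.map(exec_fragment, all_node_rows))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return dict(total)


result = coordinate(NODE_LOCAL_ROWS)
print(result)
```

Only the small per-node partial sums cross node boundaries in this model, which is the intuition behind the slide 17 claim that local processing avoids network bottlenecks.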
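
Slide 18 describes Trevni as a columnar format. The general idea of columnar storage can be sketched as below: the same records are held once as rows and once as per-column lists, and an aggregate over one field only has to touch that field's column. This illustrates the concept only; the record fields and the `to_columns` helper are hypothetical, and the Trevni file format itself is not modeled.

```python
# Hypothetical records; field names are illustrative only.
rows = [
    {"user_id": 1, "country": "US", "duration": 2.5},
    {"user_id": 2, "country": "DE", "duration": 1.0},
    {"user_id": 3, "country": "US", "duration": 4.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in records[0]}
    for record in records:
        for name, value in record.items():
            columns[name].append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited as a whole to sum a single field.
row_total = sum(record["duration"] for record in rows)

# Column layout: only the 'duration' column is read.
col_total = sum(columns["duration"])

print(row_total, col_total)
```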