Impala: Real-time Queries in Hadoop
 

Learn how Cloudera Impala empowers you to:

- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage

Presentation Transcript

  • Cloudera Impala
    Justin Erickson | Product Manager
    November 2012
  • Why Data Scientists Love Hadoop
    - Massive volumes of data
    - Data preparation & analytics in 1 environment
    - Highly flexible environment for creating & testing machine learning models
    - 10% the cost/TB under management
    ©2012 Cloudera, Inc. All Rights Reserved.
  • Hadoop Use Cases Moving to Real-Time
    - Already query Hadoop using Hive
    - Already load data into CDH every 90 mins or less
    - Already use HBase for real-time data access
    Source: Cloudera customer survey August 2012
  • But Hadoop Isn't Fast Enough
    - Need faster queries on Hadoop data
    - Move data from Hadoop to RDBMS for interactive SQL
    - See value today in consolidating to a single platform
    Source: Cloudera customer survey August 2012
  • Beyond Batch – The Next Stage for Hadoop
    HADOOP TODAY IS TOO SLOW
    - MapReduce is batch
    - Simple queries can take minutes / tens of minutes
    CURRENT DATA MANAGEMENT IS TOO COMPLEX
    - Optimized for rigid schemas & special purpose applications
    - Redundant data storage & processes
    - Very expensive systems: $20K-150K / TB
  • Cloudera Enterprise RTQ
    Real-Time Query for Data Stored in Hadoop. Powered by Cloudera Impala.
    - Supports Hive SQL
    - 4-30X faster than Hive over MapReduce
    - Supports multiple storage engines & file formats
    - Uses existing drivers, integrates with existing metastore, works with leading BI tools
    - Flexible, cost-effective, no lock-in
    - Deploy & operate with Cloudera Manager
  • Cloudera Now Powered by Impala
    [Diagram: BEFORE IMPALA vs. WITH IMPALA; layers: USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS]
    - Unified Storage: supports HDFS and HBase; flexible file formats
    - Unified Metastore
    - Unified Security
    - Unified Client Interfaces: ODBC, SQL syntax, Hue Beeswax
    - With Impala: real-time SQL queries; native distributed query engine; optimized for low-latency
    - Provides: answers as fast as you can ask; everyone to ask questions for all data; big data storage and analytics together
  • Impala beta features
    Today (Cloudera Impala 0.1):
    - Nearly all of Hive's SQL, including insert, join, and subqueries
    - Query results 4-30X faster than Hive
    - Same open Hive metadata model => easy to create & change schema
    - Support for HDFS and HBase storage
    - HDFS file formats: TextFile, SequenceFile
    - HDFS compression: Snappy, GZIP, BZIP
    - Common ODBC driver and Hue Beeswax with Hive
    - Separate CLI from Hive
    Next few months:
    - Support for Avro, RCFile & LZO compressed text
    - Additional OS support
    - Trevni columnar format
    - JDBC driver
    - DDL
    - Straggler handling
    - Increased join perf
  • Impala v0.1 SQL (HiveQL)
    - Select
      - Boolean, tinyint, smallint, int, bigint, float, double, timestamp, string
      - All, distinct
      - Subqueries (in from clause)
      - Where, group by, having
      - Order by (with limit initially)
      - Joins (left, right, full, outer), multi-table, subquery
      - Union all
      - Limit
      - External tables
      - Relational, arithmetic, logical operators
      - Math, collection, cast, date, conditional, string, timestamp built-ins (e.g. count, sum, cast, case, like, in, between, coalesce)
    - Insert into
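To make the feature list above concrete, here is a minimal sketch of one query that combines several of the listed constructs: a subquery in the FROM clause, a join, WHERE, GROUP BY, HAVING, and ORDER BY with LIMIT. It runs against Python's built-in `sqlite3` module purely as a local stand-in; it does not connect to Impala, and the `customers` and `orders` tables and their columns are hypothetical names invented for the illustration.

```python
import sqlite3

# sqlite3 is only a local stand-in so the query can be executed here;
# it is not Impala, and the tables below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 10.0), (1, 15.0), (2, 40.0), (3, 5.0);
""")

# Subquery in FROM, join, WHERE, GROUP BY, HAVING, ORDER BY with LIMIT.
QUERY = """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM customers c
    JOIN (SELECT customer_id, amount FROM orders WHERE amount > 1) o
      ON c.id = o.customer_id
    GROUP BY c.region
    HAVING SUM(o.amount) > 20
    ORDER BY total DESC
    LIMIT 10
"""

rows = conn.execute(QUERY).fetchall()
for region, n_orders, total in rows:
    print(region, n_orders, total)
```

The ORDER BY is paired with a LIMIT on purpose, since the slide notes that ordering is supported "with limit initially".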
  • Cloudera Impala Details (architecture diagram, overview)
    - Callouts: Common Hive SQL and interface; Unified metadata and scheduler; Fully MPP Distributed; Local Direct Reads
    - Client: SQL App, via ODBC
    - Shared services: Hive Metastore, YARN, HDFS NN, State Store
    - Three nodes, each running: Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, HBase
  • Cloudera Impala Details (same diagram; callouts: Common Hive SQL and interface, SQL Request)
  • Cloudera Impala Details (same diagram; callout: Unified metadata and scheduler)
  • Cloudera Impala Details (same diagram; callout: Fully MPP Distributed)
  • Cloudera Impala Details (same diagram; callout: Local Direct Reads)
  • Cloudera Impala Details (same diagram; callouts: SQL Results, In Memory Transfers)
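The query flow in the slides above (a coordinator fans work out to every node, each exec engine reads only its local data, and partial results come back in memory) can be sketched as a scatter-gather aggregation. This is an illustrative toy under stated assumptions, not Impala's implementation: the node names, the `NODE_DATA` layout, and the functions `execute_fragment` and `coordinate` are all hypothetical, and threads stand in for separate machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds the (region, amount) rows stored locally on it.
NODE_DATA = {
    "node1": [("east", 10.0), ("west", 40.0)],
    "node2": [("east", 15.0)],
    "node3": [("east", 5.0), ("west", 2.5)],
}

def execute_fragment(local_rows):
    """Exec-engine step: aggregate only the rows stored on this node."""
    partial = defaultdict(float)
    for region, amount in local_rows:
        partial[region] += amount
    return dict(partial)

def coordinate(node_data):
    """Coordinator step: fan fragments out, then merge partial sums in memory."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(execute_fragment, node_data.values()))
    merged = defaultdict(float)
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] += subtotal
    return dict(merged)

result = coordinate(NODE_DATA)
print(result)
```

Only the small per-node partial sums travel to the coordinator, which is the point of pairing local reads with distributed execution.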
  • Impala and Hive
    - Shared with Hive:
      - Metadata (table definitions)
      - ODBC driver
      - Hue Beeswax
      - SQL syntax (HiveQL)
      - Flexible file formats
      - Machine pool
    - Improvements:
      - Purpose-built query engine direct on HDFS and HBase
      - No JVM and MapReduce
      - In-memory data transfers
      - Low-latency scheduler
      - Native distributed relational query engine
      - Trevni columnar format (after v0.1)
  • Advantages of Our Approach
    - No high-latency MapReduce batch processing
    - Local processing avoids network bottlenecks
    - No costly data format conversion overhead
    - All data immediately query-able
    - Single machine pool to scale
    - All machines available to both Impala and MapReduce
    - Single, open, and unified metadata and scheduler
    [Diagram: alternative approaches labeled MapReduce (Hive over MR), Remote Query, and Side Storage, drawn over HDFS NN and DN nodes]
  • Google Dremel and Impala
    - What is Dremel:
      - Columnar storage for data with nested structures
      - Distributed scalable aggregation on top of that
    - Columnar storage in Hadoop: Trevni
      - New columnar format created by Doug Cutting
      - Stores data in appropriate native/binary types
      - Will also store nested structures similar to Dremel's ColumnIO
    - Distributed aggregation: Impala
    - Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)
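As a rough illustration of why a columnar layout helps aggregation, the sketch below pivots row-oriented records into one list per column, so that a sum over a single field touches only that column's values. It is a toy in-memory model, not the Trevni file format; the sample records and the `to_columns` helper are hypothetical, and it assumes every record has the same flat set of fields (no nested structures).

```python
# Hypothetical row-oriented records, as they might appear in a text file.
records = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]

def to_columns(rows):
    """Pivot a list of flat records into a dict of column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(records)

# The aggregate reads one contiguous column and never touches "user" or "country".
total_bytes = sum(columns["bytes"])
print(total_bytes)
```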
  • Benefits of Cloudera Impala
    Real-Time Query for Data Stored in Hadoop
    - Get answers as fast as you can ask questions
    - Interactive analytics directly on source data
    - No jumping between data silos
    - Reduce duplicate storage with EDW
    - Reduce data movement for interactive analysis
    - Leverage existing tools and employee skills
    - Ask questions of all your data
    - No information loss from aggregation or conforming to relational schemas for analysis
    - Single metadata store from origination through analysis
    - No need to hunt through multiple data silos
  • Validated Beta Partners