Cloudera Impala Technical Overview

In this slidecast, Justin Erickson from Cloudera presents a technical overview of Cloudera Impala, an SQL-on-Hadoop solution that enables users to do real-time queries of data stored in Hadoop clusters.

"To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. The result is order-of-magnitude faster performance than Hive, depending on the type of query and configuration."

Watch the presentation video at:
http://inside-bigdata.com/?p=2977
or listen to the MP3:
http://archive.org/download/ClouderaImpalaPodcast/Cloudera%20Impala%20Podcast.mp3

Transcript

  • 1. Cloudera Impala. Justin Erickson | Senior Product Manager. May 2013
  • 2. Agenda
    • Why Impala?
    • Architectural Overview
    • Real-World Use Cases
    • Alternative Approaches
    • The Platform for Big Data
    ©2013 Cloudera, Inc. All Rights Reserved.
  • 3. Why Hadoop?
    • Scalability
        • Simply scales just by adding nodes
        • Local processing to avoid network bottlenecks
    • Flexibility
        • All kinds of data (blobs, documents, records, etc.)
        • In all forms (structured, semi-structured, unstructured)
        • Store anything then later analyze what you need
    • Efficiency
        • Cost efficiency (<$1k/TB) on commodity hardware
        • Unified storage, metadata, security (no duplication or synchronization)
  • 4. What's Impala?
    • Interactive SQL
        • Typically 5-65x faster than Hive (observed up to 100x faster)
        • Responses in seconds instead of minutes (sometimes sub-second)
    • Nearly ANSI-92 standard SQL queries with Hive SQL
        • Compatible SQL interface for existing Hadoop/CDH applications
        • Based on industry standard SQL
    • Natively on Hadoop/HBase storage and metadata
        • Flexibility, scale, and cost advantages of Hadoop
        • No duplication/synchronization of data and metadata
        • Local processing to avoid network bottlenecks
    • Separate runtime from MapReduce
        • MapReduce is designed and great for batch
        • Impala is purpose-built for low-latency SQL queries on Hadoop
  • 5. Benefits of Impala
    • More & Faster Value from "Big Data"
        • BI tools impractical on Hadoop before Impala
        • Move from 10s of Hadoop users per cluster to 100s of SQL users
        • No delays from data migration
    • Flexibility
        • Query across existing data
        • Select best-fit file formats (Parquet, Avro, etc.)
        • Run multiple frameworks on the same data at the same time
    • Cost Efficiency
        • Reduce movement, duplicate storage & compute
        • 10% to 1% the cost of analytic DBMS
    • Full Fidelity Analysis
        • No loss from aggregations or fixed schemas
  • 6. Impala Query Execution
    [Diagram: a SQL App connects via ODBC; Hive Metastore, HDFS NN, and Statestore sit above three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. Arrow label: SQL request]
    1) Request arrives via ODBC/JDBC/Beeswax/Shell
  • 7. Impala Query Execution
    [Diagram: SQL App via ODBC; Hive Metastore, HDFS NN, Statestore; three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase]
    2) Planner turns request into collections of plan fragments
    3) Coordinator initiates execution on impalad(s) local to data
  • 8. Impala Query Execution
    [Diagram: SQL App via ODBC; Hive Metastore, HDFS NN, Statestore; three nodes, each with a Query Planner, Query Coordinator, Query Executor, HDFS DN, and HBase. Arrow label: Query results]
    4) Intermediate results are streamed between impalad(s)
    5) Query results are streamed back to client
  • 9. Impala and Hive
    • Shares Everything Client-Facing
        • Metadata (table definitions)
        • ODBC/JDBC drivers
        • SQL syntax (Hive SQL)
        • Flexible file formats
        • Machine pool
        • Hue GUI
    • But Built for Different Purposes
        • Hive: runs on MapReduce and ideal for batch processing
        • Impala: native MPP query engine ideal for interactive SQL
    [Diagram labels: Storage Integration Resource Management Metadata; HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.); HBase (RECORDS); Hive SQL Syntax on the MapReduce Compute Framework (Batch Processing); Impala SQL Syntax + Compute Framework (Interactive SQL)]
  • 10. Impala Use Cases
    Cost-effective, ad hoc query environment that offloads the data warehouse for:
    • Interactive BI/analytics on more data
    • Asking new questions
    • Query-able archive w/ full fidelity
    • Data processing with tight SLAs
  • 11. Global Financial Services Company
    Saving 90% on incremental EDW spend & improving performance by 5x
    • Offload data warehouse for query-able archive
    • Store decades of data cost-effectively
    • Process & analyze on the same system
    • Improve capabilities through interactive query on more data
  • 12. Six3 Systems
    Boosting performance by 20X for mission-critical, real-time cyber security
    • Analyze unstructured data with flexibility & real-time response
    • Integrate with existing desktop & BI tools
    • Deploy in minutes with Cloudera Manager
  • 13. Expedia
    Implementing self-service BI on big data, reducing data latency by 50%
    • Offload data warehouse for archiving, ETL & analytics
    • Unify IT environment
    • Continuously ingest & analyze at scale
    • Drive greater usability & adoption of big data stack
  • 14. Our Design Strategy
    An Integrated Part of the Hadoop System
    • One pool of data
    • One metadata model
    • One security framework
    • One set of system resources
    [Diagram labels: Storage Integration Resource Management Metadata; Engines: Batch Processing (MAPREDUCE, HIVE & PIG …), Interactive SQL (IMPALA), Machine Learning (MAHOUT, DATAFU); HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.); HBase (RECORDS)]
  • 15. Not All SQL on Hadoop is Created Equal
    • Batch MapReduce: Make MapReduce faster. Slow, still batch
    • Remote Query: Pull data from HDFS over the network to the DW compute layer. Slow, expensive
    • Siloed DBMS: Load data into a proprietary database file. Rigid, siloed data, slow ETL
    • Impala: Native MPP query engine that's integrated into Hadoop. Fast, flexible, cost-effective
  • 16. The Impala Advantage
    BI Partners: Building on the Enterprise Standard
    [Partner logos; badge: POWERED BY IMPALA]
  • 17. It's Not Just About SQL on Hadoop
    The Platform for Big Data
    • Single platform for processing & analytics
    • Scales to '000s of servers
    • No upfront schema
    • 10% the cost per TB
    • Open source platform
    [Diagram labels: Storage Integration Resource Management Metadata; Engines: Batch Processing (MAPREDUCE, HIVE & PIG …), Interactive SQL (IMPALA), Machine Learning (MAHOUT, DATAFU); HDFS (TEXT, RCFILE, PARQUET, AVRO…); HBase (RECORDS); Management | Support]
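
Slides 6 to 8 describe query execution in five steps: a request arrives, the planner splits it into plan fragments, the coordinator starts those fragments on the impalad(s) local to the data, intermediate results are streamed between nodes, and the final result is streamed back to the client. The following is a minimal single-process sketch of that flow in Python, not Impala's actual implementation or API. The node names, the toy rows, and the functions `plan`, `execute_fragment`, and `coordinate` are all hypothetical, and the query being imitated is assumed to be a simple grouped sum, along the lines of `SELECT country, SUM(amount) ... GROUP BY country`.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical toy data: the (country, amount) rows stored on each node's
# local DataNode. A real cluster would hold HDFS blocks or HBase records.
BLOCKS_BY_NODE = {
    "node1": [("US", 10), ("DE", 5)],
    "node2": [("US", 7), ("FR", 3)],
    "node3": [("DE", 1), ("FR", 4)],
}


@dataclass
class PlanFragment:
    """One unit of work, assigned to the node that holds the data it scans."""
    node: str


def plan(blocks_by_node):
    """Step 2: turn the request into one scan-and-aggregate fragment per node."""
    return [PlanFragment(node=name) for name in blocks_by_node]


def execute_fragment(fragment, blocks_by_node):
    """Steps 3-4: aggregate the node's local rows and stream out partial sums."""
    partial = defaultdict(int)
    for country, amount in blocks_by_node[fragment.node]:
        partial[country] += amount
    # A generator stands in for streaming intermediate results between nodes.
    yield from partial.items()


def coordinate(blocks_by_node):
    """Steps 3-5: dispatch fragments, merge streamed partials, return the result."""
    totals = defaultdict(int)
    for fragment in plan(blocks_by_node):
        for country, partial_sum in execute_fragment(fragment, blocks_by_node):
            totals[country] += partial_sum
    return dict(sorted(totals.items()))


result = coordinate(BLOCKS_BY_NODE)
print(result)
```

Each fragment only reads the rows held by its own node, which is the "local processing to avoid network bottlenecks" idea from slides 3 and 4: only the small partial sums cross node boundaries, not the raw rows.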