• Save
Cloudera Impala: A modern SQL Query Engine for Hadoop
 

Cloudera Impala: A modern SQL Query Engine for Hadoop

on

  • 5,442 views

Justin Erickson explains how you can achieve real-time query results with Big Data by using Impala with Hadoop.

Justin Erickson explains how you can achieve real-time query results with Big Data by using Impala with Hadoop.

Statistics

Views

Total Views
5,442
Views on SlideShare
5,124
Embed Views
318

Actions

Likes
22
Downloads
0
Comments
0

4 Embeds 318

http://www.cloudera.com 301
http://www.bigdatacloud.com 6
http://cloudera.com 6
http://author01.mtv.cloudera.com 5

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Cloudera Impala: A modern SQL Query Engine for Hadoop Cloudera Impala: A modern SQL Query Engine for Hadoop Presentation Transcript

  • Cloudera  Impala:  A  Modern  SQL  USE  PUBLICLY   DO  NOT  Query  Engine  for  Hadoop   PRIOR  TO  10/23/12  Headline  Goes  Here  JusJn  Erickson  |    Product  Manager   Speaker  Name  or  Subhead  Goes  Here  January  2013  
  • Agenda   •  Intro  to  Impala   •  Impala’s  Architecture   •  Comparisons   Confidential. ©2013 Cloudera, Inc. All2 Rights Reserved.
  • Why  Hadoop?  •  Scalability   •  Simply  scales  just  by  adding  nodes   •  Local  processing  to  avoid  network  boTlenecks  •  Flexibility   •  All  kinds  of  data  (blobs,  documents,  records,  etc)   •  In  all  forms  (structured,  semi-­‐structured,  unstructured)   •  Store  anything  then  later  analyze  what  you  need  •  Efficiency   •  Cost  efficiency  (<$1k/TB)  on  commodity  hardware   •  Unified  storage,  metadata,  security  (no  duplicaJon  or  synchronizaJon)   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • What’s  Impala?   •  Interac<ve  SQL   •  Typically  4-­‐35x  faster  than  Hive  (observed  up  to  100x  faster)   •  Responses  in  seconds  instead  of  minutes  (someJmes  sub-­‐second)   •  Nearly  ANSI-­‐92  standard  SQL  queries  with  HiveQL   •  CompaJble  SQL  interface  for  exisJng  Hadoop/CDH  applicaJons   •  Based  on  industry  standard  SQL   •  Na<vely  on  Hadoop/HBase  storage  and  metadata   •  Flexibility,  scale,  and  cost  advantages  of  Hadoop   •  No  duplicaJon/synchronizaJon  of  data  and  metadata   •  Local  processing  to  avoid  network  boTlenecks   •  Separate  run<me  from  MapReduce   •  MapReduce  is  designed  and  great  for  batch   •  Impala  is  purpose-­‐built  for  low-­‐latency  SQL  queries  on  Hadoop   Confidential. ©2013 Cloudera, Inc. All4 Rights Reserved.
  • So  what?   •  Interac<ve  BI/analy<cs   •  BI  tools  impracJcal  on  Hadoop  before  Impala   •  Move  from  10s  of  Hadoop  users  per  cluster  to  100s  of  SQL  users   •  More  and  faster  value  from  “big  data”   •  ELT/data  processing  with  <ght  SLAs   •  Sub-­‐minute  SLAs  now  possible   •  Cost  efficiency   •  Fewer  nodes  to  meet  response  Jme  SLAs   Confidential. ©2013 Cloudera, Inc. All5 Rights Reserved.
  • Impala  Architecture  •  Two  binaries:  impalad  and  statestored  •  Impala  daemon  (impalad)   •  one  Impala  daemon  on  each  node  with  data   •  handles  external  client  requests  and  all  internal  requests   related  to  query  execuJon  •  State  store  daemon  (statestored)   •  provides  name  service  and  metadata  distribuJon   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Architecture:  Query  ExecuJon  Phases  •  Request  arrives  via  ODBC/JDBC/Beeswax/Shell  •  Planner  turns  request  into  collecJons  of  plan  fragments  •  Coordinator  iniJates  execuJon  on  impalads  local  to  data  •  During  execuJon:   •  intermediate  results  are  streamed  between  executors   •  query  results  are  streamed  back  to  client   •  subject  to  limitaJons  imposed  to  blocking  operators  (top-­‐n,   aggregaJon)   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Architecture:  Planner  •  Example:  query  with  join  and  aggregaJon   SELECT  state,  SUM(revenue)   FROM  HdfsTbl  h  JOIN  HbaseTbl  b  ON  (...)   GROUP  BY  1  ORDER  BY  2  desc  LIMIT  10   TopN   Agg   TopN   Agg   Hash   Hash   Agg   Join   Join   Hdfs   Hbase   Exch   Exch   Hdfs   Hbase   Scan   Scan   at  coordinator   at  DataNodes   at  region  servers   Scan   Scan   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Architecture:  Query  ExecuJon  •  Request  arrives  via  ODBC/JDBC/Beeswax/Shell   SQL  App   Hive   Metastore   HDFS  NN   Statestore   ODBC   SQL   request   Query  Planner   Query  Planner   Query  Planner   Query  Coordinator   Query  Coordinator   Query  Coordinator   Query  Executor   Query  Executor   Query  Executor   HDFS  DN   HBase   HDFS  DN   HBase   HDFS  DN   HBase   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Architecture:  Query  ExecuJon  •  Planner  turns  request  into  collecJons  of  plan  fragments  •  Coordinator  iniJates  execuJon  on  impalads  local  to  data   SQL  App   Hive   Metastore   HDFS  NN   Statestore   ODBC   Query  Planner   Query  Planner   Query  Planner   Query  Coordinator   Query  Coordinator   Query  Coordinator   Query  Executor   Query  Executor   Query  Executor   HDFS  DN   HBase   HDFS  DN   HBase   HDFS  DN   HBase   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Architecture:  Query  ExecuJon  •  Intermediate  results  are  streamed  between  impalad’s  •  Query  results  are  streamed  back  to  client   SQL  App   Hive   Metastore   HDFS  NN   Statestore   ODBC   query   results   Query  Planner   Query  Planner   Query  Planner   Query  Coordinator   Query  Coordinator   Query  Coordinator   Query  Executor   Query  Executor   Query  Executor   HDFS  DN   HBase   HDFS  DN   HBase   HDFS  DN   HBase   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  and  Hive  •  Shared  with  Hive:   •  Metadata  (table  definiJons)   •  ODBC/JDBC  drivers   •  Hue  Beeswax   •  SQL  syntax  (HiveQL)   •  Flexible  file  formats   •  Machine  pool  •  Improvements:   •  Purpose-­‐built  query  engine  direct  on  HDFS  and  HBase   •  No  JVM  startup  and  no  MapReduce   •  In-­‐memory  data  transfers   •  NaJve  distributed  relaJonal  query  engine   Confidential. ©2012 Cloudera, Inc. All Rights Reserved.
  • What  about  an  EDW/RDBMS?   •  “Right  tool  for  the  right  job”   •  EDW/RDBMS  great  for:   •  OLTP’s  complex  transacJons   •  Highly  planned  and  opJmized  known  workloads   •  Opera4onal  reports  and  drill  into  repeated  known  queries   •  Impala’s  great  for:   •  Exploratory  analy4cs  with  new  previously-­‐unknown  queries   •  Queries  on  big  and  growing  data  sets   •  EDW/RDBMS  can’t:   •  Dump  in  raw  data  then  later  define  schema  and  query  what  you  want   •  Evolve  schemas  without  an  expensive  schema  upgrade  planning  process   •  Simply  scales  just  by  adding  nodes   •  Store  at  <  $1k/TB  instead  of  $10-­‐150k/TB   Confidential. ©2013 Cloudera, Inc. All13 Rights Reserved.
  • AlternaJve  Hadoop  Query  Approaches   MapReduce   Remote  Query   Side  Storage   Query   Query   Query   Query   Node   Node   Node   Node   Query   MR   Hive   Engine   MR   OR   MR   DN   NN   DN   HDFS   DN   DN   DN  High-­‐latency  MR   Network  boTleneck   Query  subset  of  data        Separate  nodes  for  SQL/MR   Separate  nodes  for  SQL/MR   RDBMS  rigid  schema        Duplicate  metadata,   Duplicate  metadata,   Duplicate  storage,  security,  SQL,  MR,  etc.   security,  SQL,  MR,  etc.   metadata,  security,  SQL,       etc.     Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Comparing  Impala  to  Dremel  •  What  is  Dremel:   •  columnar  storage  for  data  with  nested  structures   •  distributed  scalable  aggregaJon  on  top  of  that  •  Columnar  storage  in  Hadoop:  joint  project  between  Cloudera  and  TwiTer   •  new  columnar  format,  derived  from  Doug  Culngs  Trevni   •  stores  data  in  appropriate  naJve/binary  types   •  can  also  store  nested  structures  similar  to  Dremels  ColumnIO  •  Distributed  aggregaJon:  Impala  •  Impala  plus  columnar  format:  a  superset  of  the  published  version  of  Dremel   (which  didnt  support  joins  and  mulJple  file  formats)   Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
  • Impala  Roadmap   •  GA  (target  April  2013)   •  All  CDH4  OSes:  RHEL/CentOS,  Ubuntu,  Debian,  SLES   •  JDBC  driver   •  More  formats:  Avro,  LZO-­‐compressed   •  Columnar  format   •  MR/Impala  resource  isolaJon   •  Perf  (joins,  aggregaJons,  SQL  features)   •  AutomaJc  metadata  distribuJon   •  Post-­‐GA  top  requests:   •  UDFs   •  Memory  caching   •  Nested  data   •  Window  funcJons   Confidential. ©2013 Cloudera, Inc. All16 Rights Reserved.
  • Validated  Beta  Partners   POWERED BY IMPALA Confidential. ©2013 Cloudera, Inc. All Rights Reserved.