Cloudera Impala - San Diego Big Data Meetup August 13th 2014
Cloudera Impala presentation to San Diego Big Data Meetup (http://www.meetup.com/sdbigdata/events/189420582/)

Presentation Transcript

  • Cloudera Impala
    SD Big Data Monthly Meetup #2
    August 13th 2014
    Maxime Dumas, Systems Engineer
  • Thirty Seconds About Max
      • Systems Engineer
      • aka Sales Engineer
      • SoCal, AZ, NV
      • former coder of PHP
      • teaches meditation + yoga
      • from Montreal, Canada
  • What Does Cloudera Do?
      • product
          • distribution of Hadoop components, Apache licensed
          • enterprise tooling
      • support
      • training
      • services (aka consulting)
      • community
  • What This Talk Isn’t About
      • deploying
          • Puppet, Chef, Ansible, homegrown scripts, intern labor
      • sizing & tuning
          • depends heavily on data and workload
      • coding
          • unless you count XML or CSV or SQL
      • algorithms
  • [Image; credit: IFCAR, public domain]
  • What is Cloudera Impala?
  • cloud·e·ra im·pal·a
    /kloudˈi(ə)rə imˈpalə/
    noun
    a modern, open source, MPP SQL query engine for Apache Hadoop.
    “Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”
  • The Apache Hadoop Ecosystem
    Quick and dirty, for context.
  • Why “Ecosystem?”
      • In the beginning, just Hadoop
          • HDFS
          • MapReduce
      • Today, dozens of interrelated components
          • I/O
          • Processing
          • Specialty Applications
          • Configuration
          • Workflow
  • HDFS
      • Distributed, highly fault-tolerant filesystem
      • Optimized for large streaming access to data
      • Based on Google File System
          • http://research.google.com/archive/gfs.html
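To make "distributed, highly fault-tolerant" a little more concrete, here is a minimal sketch of how a file might be cut into blocks and each block copied to several machines. The 128 MB block size and replication factor of 3 are common HDFS defaults rather than figures from the slides, and the round-robin placement in `place_blocks` is a toy stand-in for HDFS's real placement policy; the datanode names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumption: a common HDFS default (128 MB)
REPLICATION = 3                 # assumption: a common HDFS default


def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE,
                 replication=REPLICATION):
    """Toy placement: split a file into blocks and assign each block's
    replicas to distinct datanodes in round-robin order.

    Requires replication <= len(datanodes) so replicas stay distinct.
    """
    num_blocks = math.ceil(file_size / block_size)
    placement = []
    for block in range(num_blocks):
        replicas = [datanodes[(block + r) % len(datanodes)]
                    for r in range(replication)]
        placement.append(replicas)
    return placement


# A hypothetical 300 MB file on a four-node cluster.
layout = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
for index, replicas in enumerate(layout):
    print(f"block {index}: {replicas}")
```

Losing any single node in this layout still leaves two copies of every block, which is the idea behind the fault-tolerance claim.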
  • Lots of Commodity Machines
    [Image: Yahoo! Hadoop cluster, OSCON ’07]
  • MapReduce (MR)
      • Programming paradigm
      • Batch oriented, not realtime
      • Works well with distributed computing
      • Lots of Java, but other languages supported
      • Based on Google’s paper
          • http://research.google.com/archive/mapreduce.html
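The paradigm is easiest to see in the classic word-count example. The sketch below runs the map, shuffle, and reduce phases in a single Python process; it illustrates the programming model only and is not Hadoop's Java API. All function names are made up for the illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

On a real cluster the mappers and reducers run on many machines and the shuffle moves data over the network, which is part of why the model is batch oriented rather than realtime.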
  • Apache Hive
      • Abstraction of Hadoop’s Java API
      • HiveQL “compiles” down to MR
          • a “SQL-like” language
      • Eases analysis using MapReduce
  • Apache Hive Metastore
      • Maps HDFS files to DB-like resources
          • Databases
          • Tables
          • Column/field names, data types
          • Roles/users
          • InputFormat/OutputFormat
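As a rough mental model, the metastore can be pictured as a catalog keyed by database and table name, where each entry records the HDFS location, the columns, and the file format. The classes below are a hypothetical illustration of that mapping, not the Hive Metastore's actual schema or API; the table name, path, and format labels are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    database: str
    name: str
    location: str        # HDFS directory holding the table's files
    input_format: str    # how files are read (illustrative label)
    output_format: str   # how files are written (illustrative label)
    columns: list = field(default_factory=list)


class ToyMetastore:
    """In-memory catalog mapping (database, table) to table metadata."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[(table.database, table.name)] = table

    def lookup(self, database, name):
        return self._tables[(database, name)]


metastore = ToyMetastore()
metastore.register(Table(
    database="web",
    name="logs",
    location="/data/web/logs",  # hypothetical path
    input_format="text",
    output_format="text",
    columns=[Column("ts", "timestamp"), Column("url", "string")],
))

logs = metastore.lookup("web", "logs")
print(logs.location, [column.name for column in logs.columns])
```

Because the files stay in HDFS and only the description lives in the catalog, several engines can read the same table definition.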
  • Sqoop
      • SQL to Hadoop
      • Tool to import/export any JDBC-supported database into Hadoop
      • Transfer data between Hadoop and external databases or EDW
      • High performance connectors for some RDBMS
          • Oracle, Teradata, Netezza
      • Developed at Cloudera
    ©2011 Cloudera, Inc. All Rights Reserved.
  • Cloudera Impala
    Familiar interface, but more powerful.
  • Cloudera Impala
      • Interactive SQL for Hadoop
          • Responses in seconds
          • Nearly ANSI-92 standard SQL with Hive SQL
      • Native MPP Query Engine
          • Purpose-built for low-latency queries
          • Separate runtime from MapReduce
          • Designed as part of the Hadoop ecosystem
      • Open Source
          • Apache-licensed
  • Benefits of Impala
      • More & Faster Value from “Big Data”
          • Interactive BI/Analytics experience via SQL
          • No delays from data migration
      • Flexibility
          • Query across existing data
          • Select best-fit file formats (Parquet, Avro, etc.)
          • Run multiple frameworks on the same data at the same time
      • Cost Efficiency
          • Reduce movement, duplicate storage & compute
          • 10% to 1% the cost of analytic DBMS
      • Full Fidelity Analysis
          • No loss from aggregations or fixed schemas
  • Impala Use Cases
    Cost-effective, ad hoc query environment that offloads the data warehouse for:
      • Interactive BI/analytics on more data
      • Asking new questions: exploration, ML
      • Data processing with tight SLAs
      • Query-able archive w/full fidelity
  • Our Design Strategy
    An Integrated Part of the Hadoop System
      • One pool of (open) data
      • One metadata model
      • One security framework
      • One set of system resources
    [Diagram: platform layers]
      • Engines: Batch Processing (MapReduce, Hive & Pig); In-Memory Processing & Streaming (Spark); Interactive SQL (Cloudera Impala); Interactive Search (Cloudera Search); Machine Learning (Mahout, ClouderaML, Oryx); Math & Statistics (SAS, R); …
      • Storage: HDFS (text, RCFile, Parquet, Avro, etc.); HBase (records)
      • Shared services: Integration, Resource Management, Metadata, Security
  • Impala Key Features
      • Fast
          • In-memory data transfers
          • Partitioned joins
          • Fully distributed aggregations
      • Flexible
          • Query data in HDFS & HBase
          • Supports multiple file formats & compression algorithms
          • Java & Native UDFs, UDAFs
      • Secure
          • Integrated with Hadoop security
          • Kerberos authentication
          • Authorization (Sentry)
      • Easy to Implement
          • Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
          • Open source
      • Easy to Use
          • Interact with data via SQL
          • Certified with leading BI tools
      • Simple to Manage
          • Deploy, configure & monitor with Cloudera Manager
          • Integrated with Hadoop resource management
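The "partitioned joins" bullet refers to a general MPP technique: rows from both tables are routed to nodes by a hash of the join key, so each node can join its share without seeing the rest. The sketch below shows that idea in plain Python; it is a toy model under that assumption, not Impala's implementation, and the table contents are invented.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Route each row to the node chosen by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_hash_join(left, right, key):
    """Hash join on one node: build a table on the right input,
    then probe it with each row of the left input."""
    table = defaultdict(list)
    for row in right:
        table[row[key]].append(row)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def partitioned_join(left, right, key, num_nodes=3):
    """Partition both inputs the same way, then join node by node."""
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        joined.extend(local_hash_join(left_part, right_part, key))
    return joined


orders = [
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 5},
    {"cust_id": 1, "amount": 12},
]
customers = [
    {"cust_id": 1, "name": "Ann"},
    {"cust_id": 2, "name": "Bob"},
    {"cust_id": 3, "name": "Cy"},
]

joined = partitioned_join(orders, customers, "cust_id")
print(sorted((row["name"], row["amount"]) for row in joined))
```

Matching keys always land on the same node because both inputs use the same hash function, so no node needs a full copy of either table.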
  • What’s Coming?*
    In the Near Term:
      • SQL 2003-Compliant Analytic Window Functions
      • Additional Authentication Mechanisms
      • User Defined Table Functions
      • Intra-node Parallelized Aggregations & Joins
      • Nested Data
      • Enhanced YARN-Integrated Resource Manager
      • Dynamic Partition Pruning
    *On the roadmap… no guarantees
  • Impala Plays Well with Others
    Building on the Enterprise Standard
    BI Partners: [partner logos]
    POWERED BY IMPALA
  • Not All SQL On Hadoop Is Created Equal
      • Batch MapReduce: make MapReduce faster. Slow, still batch.
      • Remote Query: pull data from HDFS over the network to the DW compute layer. Slow, expensive.
      • Siloed DBMS: load data into a proprietary database file. Rigid, siloed data, slow ETL.
      • Impala: native MPP query engine that’s integrated into Hadoop. Fast, flexible, cost-effective.
  • More Detail On Alternative Approaches
      • Batch MapReduce
          • Batch-oriented
          • High latency
      • Remote Query (DBMS compute layer reading from Hadoop HDFS storage)
          • Network bottleneck
          • 2x the hardware
          • Duplicate metadata, security, SQL, etc.
      • Siloed DBMS (proprietary DBMS stack alongside the standard, shared Hadoop stack)
          • RDBMS rigidity
          • Query subset of data
          • Duplicate storage, metadata, security, SQL, etc.
    [Diagram: the Hadoop stack (engines: MapReduce, Hive, Pig, Impala, etc.; HDFS and HBase storage; integration, resource management, Hadoop metadata, security) shown next to a DBMS with its own compute, storage, metadata, and security]
  • Other Sexy New Big Data MPP Tools
      • Presto: purpose-built MPP engine; similar architecture to Impala; few performance comparisons, but Impala anecdotally 5x-10x faster
      • Shark: Hive-compatible data warehouse for Spark; great performance until required to go to disk, at which point Impala is better; with HDFS caching Impala will perform on par from a memory perspective
      • Drill: open source version of Dremel; another MPP engine; multiple data formats and sources
      • Phoenix (sort of): SQL skin over HBase (and only HBase); subset of SQL standard
  • What About an EDW/RDBMS?
    “Right Tool for the Right Job”
      • EDW/RDBMS Great For:
          • OLTP’s complex transactions
          • Highly planned and optimized known workloads
          • Operational reports and repeated known queries
      • Impala Great For:
          • Exploratory analytics with previously-unknown queries
          • Queries on big and growing data sets
      • EDW/RDBMS Can’t:
          • Dump in raw data then later define schema and query what you want
          • Evolve schemas without an expensive schema upgrade planning process
          • Simply scale just by adding industry-standard servers
          • Store at < $1k/TB instead of $10-150k/TB
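The storage figures on this slide line up with the "10% to 1% the cost of analytic DBMS" claim on the Benefits slide, as a quick calculation shows. The per-terabyte prices come from the slide (using $1k/TB as the upper bound of "< $1k/TB"); the 100 TB dataset size is a made-up example.

```python
HADOOP_COST_PER_TB = 1_000                 # slide: "< $1k/TB" (upper bound)
EDW_COST_PER_TB_RANGE = (10_000, 150_000)  # slide: "$10-150k/TB"


def storage_cost(terabytes, cost_per_tb):
    """Total storage cost in dollars for a dataset of the given size."""
    return terabytes * cost_per_tb


def cost_ratio(cost_per_tb, baseline_per_tb):
    """Fraction of the baseline price that the cheaper option costs."""
    return cost_per_tb / baseline_per_tb


dataset_tb = 100  # hypothetical dataset size

hadoop_total = storage_cost(dataset_tb, HADOOP_COST_PER_TB)
edw_low, edw_high = (storage_cost(dataset_tb, price)
                     for price in EDW_COST_PER_TB_RANGE)

ratio_vs_cheapest_edw = cost_ratio(HADOOP_COST_PER_TB, EDW_COST_PER_TB_RANGE[0])
ratio_vs_priciest_edw = cost_ratio(HADOOP_COST_PER_TB, EDW_COST_PER_TB_RANGE[1])

print(f"Hadoop: ${hadoop_total:,}")
print(f"EDW:    ${edw_low:,} to ${edw_high:,}")
print(f"Ratio:  {ratio_vs_cheapest_edw:.1%} down to {ratio_vs_priciest_edw:.1%}")
```

At these prices the Hadoop option costs 10% of the cheapest EDW figure and under 1% of the most expensive one, which matches the range quoted earlier in the deck.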
  • Impala Technical Details
  • The Impala Advantage
    Where does the Performance Come From?
      • No MapReduce; No JVM; All Native
      • In-Memory Data Transfers
      • Saturate Disks on Reads
      • Optimized File Format (e.g. Parquet)
      • In-Memory HDFS Caching
      • Cost-Based Join Order Optimization: frees the user from having to guess the correct join order
  • Impala and Hive
      • Shares Everything Client-Facing
          • Metadata (table definitions)
          • ODBC/JDBC drivers
          • SQL syntax (Hive SQL)
          • Flexible file formats
          • Machine pool
          • Hue GUI
      • But Built for Different Purposes
          • Hive: runs on MapReduce and ideal for batch processing
          • Impala: native MPP query engine ideal for interactive SQL
    [Diagram: Hive (SQL syntax + MapReduce compute framework, batch processing) and Impala (SQL syntax + compute framework, interactive SQL) on top of shared metadata, resource management, integration, and storage: HDFS (text, RCFile, Parquet, Avro, etc.) and HBase (records)]
  • Impala Query Execution
    [Diagram: a SQL app connects over ODBC to one of three impalad nodes; each node runs a Query Planner, Query Coordinator, and Query Executor next to an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and Statestore sit alongside]
    1) Request arrives via ODBC/JDBC/HUE/Shell
  • Impala Query Execution
    2) Planner turns request into collections of plan fragments
    3) Coordinator initiates execution on impalad(s) local to data
  • Impala Query Execution
    4) Intermediate results are streamed between impalad(s)
    5) Query results are streamed back to client
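The five steps above can be walked through with a small single-process simulation. It is a deliberately simplified sketch: the cluster contents, the `partial_sum` operation, and every function name are hypothetical, and a real deployment runs fragments in parallel in native code on separate machines.

```python
from collections import Counter

# Hypothetical cluster: each impalad holds some (state, amount) rows locally.
CLUSTER = {
    "node1": [("ca", 3), ("nv", 1)],
    "node2": [("ca", 2), ("az", 4)],
    "node3": [("nv", 5)],
}


def plan(cluster):
    """Step 2: turn the request into one plan fragment per node with data."""
    return [{"node": node, "op": "partial_sum"} for node in cluster]


def execute_fragment(fragment, cluster):
    """Steps 3-4: run on the node local to the data and stream back
    intermediate (state, subtotal) pairs."""
    partial = Counter()
    for state, amount in cluster[fragment["node"]]:
        partial[state] += amount
    yield from partial.items()


def coordinate(cluster):
    """Steps 1 and 5: accept the request, merge the streamed partial
    results, and return the final answer to the client."""
    final = Counter()
    for fragment in plan(cluster):
        for state, subtotal in execute_fragment(fragment, cluster):
            final[state] += subtotal
    return dict(final)


# Roughly: SELECT state, SUM(amount) ... GROUP BY state
result = coordinate(CLUSTER)
print(result)
```

Each node only scans its own rows and ships small subtotals, which is the point of initiating execution on the impalad(s) local to the data.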
  • Parquet File Format
    Open source, columnar Hadoop file format developed by Cloudera & Twitter
      • Limits the IO to only the data that is needed
      • Supports storing each column in a separate file
      • Saves space: columnar layout compresses better
      • Enables better scans: load only the columns that are needed
      • Supports index pages for fast lookup
      • Extensible value encodings
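Two of these points, reading only the needed columns and better compression, follow directly from storing values column by column. The sketch below shows both with in-memory lists; it illustrates the columnar idea only and is not Parquet's on-disk format. The sample rows are invented, and run-length encoding stands in for one kind of value encoding.

```python
rows = [
    {"user": "a", "state": "CA", "amount": 10},
    {"user": "b", "state": "CA", "amount": 20},
    {"user": "c", "state": "NV", "amount": 5},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# A query like SUM(amount) touches one column and skips the other two.
total = sum(columns["amount"])

# Similar values sit next to each other, so simple encodings work well.
state_rle = run_length_encode(columns["state"])

print(total, state_rle)
```

In a row-oriented layout the same sum would have to read every field of every record, and the repeated state values would be interleaved with unrelated data.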
  • Impala Performance Results
  • Impala Performance Results
      • Impala’s Milestone in Jan 2014:
          • Comparable commercial MPP DBMS speed
          • Natively on Hadoop
      • Three Result Sets:
          • Impala vs Hive 0.12 (Impala 6-70x faster)
          • Impala vs “DBMS-Y” (Impala average of 2x faster)
          • Impala scalability (Impala achieves linear scale)
      • Background
          • 20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
          • Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
          • Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
          • Methodical testing (multiple runs, reviewed fairness for competition, etc)
      • Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/
  • Enough slides… DEMO TIME!
  • So What is Cloudera Impala?
  • What’s Next?
      • Download Hadoop!
          • CDH available at www.cloudera.com
          • Try it online: Cloudera Live
          • Cloudera provides pre-loaded VMs
              • http://tiny.cloudera.com/quickstartvm
      • Ride Impala!
          • http://impala.io/
  • Special thanks: SAN DIEGO BIG DATA
  • Questions?
    Preferably related to the talk… or not.
  • Thank You!
    Maxime Dumas
    mdumas@cloudera.com
    We’re hiring.