Cloudera Impala


Some Observations from Cloudera Impala Usage.

Published in: Data & Analytics

  1. Cloudera Impala: Some Observations
  2. Strengths
     • Excellent for analytical queries
       – Queries that scan large amounts of data
     • Parquet with the Snappy codec provides a good trade-off
       – Fast data load
       – Good query performance
  3. Strengths (continued)
     • SQL compliance
       – Most queries from other systems work as-is
     • impala-shell: a handy interface
       – Easy to use
     • Hadoop integration
       – Uses the Hive Metastore for metadata and HDFS for storage
       – Uses its own daemons to execute queries
  4. Weaknesses
     • Random access is slow
     • No fault tolerance
       – If a node fails, all queries running on that node fail. The only option is to retry the query.
     • Updating or cleaning data is tedious
       – No direct updates are supported
       – Multi-step process
         • For example, to remove rows from an existing table: create a temp table by selecting the rows to keep from the source, drop the source table, then rename the temp table to the source table's name.
  5. Weaknesses (continued)
     • Updating stats is a manual process
       – After loading a significant amount of data, stats must be updated manually. Some queries will perform poorly or fail if this is not done.
     • Memory consumption
       – Impala queries need a significant amount of RAM to run fast.
       – Spilling to disk is possible, but it slows queries down.
  6. Conclusion
     • Impala makes SQL a first-class citizen in the Hadoop ecosystem
     • Great for workloads where data is immutable
     • Excellent query performance for analytical queries
     • Not suitable for workloads that involve frequent data updates
     • Queries are not fault-tolerant
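
The row-removal workaround and the manual stats refresh listed under Weaknesses can be sketched as a small Python helper that builds the Impala SQL statements in order. This is a minimal sketch, not the slide author's procedure: the function name `remove_rows_statements`, the `_tmp` suffix, the `events` table, and the choice to store the temp table as Parquet are all assumptions made for illustration. It only produces the statement strings; running them (for example through impala-shell) is left to the reader.

```python
def remove_rows_statements(table, delete_condition, temp_suffix="_tmp"):
    """Build Impala SQL that removes rows by rewriting the table.

    Hypothetical helper: `table` and `delete_condition` are trusted,
    hand-written SQL fragments, not user input.
    """
    temp = f"{table}{temp_suffix}"
    return [
        # 1. Copy only the rows to keep into a temporary table.
        #    Caveat: rows where the condition evaluates to NULL are
        #    not copied either, so they are removed as well.
        f"CREATE TABLE {temp} STORED AS PARQUET AS "
        f"SELECT * FROM {table} WHERE NOT ({delete_condition})",
        # 2. Drop the original table.
        f"DROP TABLE {table}",
        # 3. Give the temporary table the original name.
        f"ALTER TABLE {temp} RENAME TO {table}",
        # 4. Stats are not refreshed automatically, so recompute them.
        f"COMPUTE STATS {table}",
    ]


if __name__ == "__main__":
    # Example: drop all rows older than 2014 from a hypothetical table.
    for stmt in remove_rows_statements("events", "event_date < '2014-01-01'"):
        print(stmt + ";")
```

Between steps 2 and 3 the table does not exist under its original name, so any query that references it in that window fails; this is part of why the deck calls the process tedious.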