Evaluation of Cloudera Impala 1.1
I evaluated Impala 1.1 on our cluster environment.

  • Hi Carter, thank you for your comment. First, I used datasets of about 17 GB and 800 MB, as described on the slide. Of course, I understand that I should use larger datasets. Second, if I have a chance to evaluate the next version of Impala, I will try Hive 11 and ORCFile.
  • Hi Yukinori thanks for the nice writeup.

    From slide 12 I wasn't clear on the data set size. As written it looks like you used about 17 gigabytes of data in the test. Am I reading that right or was it actually 17 terabytes of data?

    I also noticed you are planning to do some TPC benchmarks. If so, you should really try Hive 11 and ORCFile for that; a lot of work has been done in Hive that benefits TPC-style queries.

Evaluation of Cloudera Impala 1.1: Presentation Transcript

  • 1. Evaluation of Cloudera Impala 1.1. Aug 7, 2013. CELLANT Corp. R&D Strategy Division, Yukinori SUDA (@sudabon). Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/
  • 2. Cloudera Impala 1.1 was released!
      • Sentry support:
          – Fine-grained authorization
          – Role-based authorization
      • Support for views
      • Performance improvements
          – Parquet columnar performance
          – More efficient metadata refresh for larger installations
      • Additional SQL
          – SQL-89 joins (in addition to existing SQL-92)
          – LOAD function
          – REFRESH command for JDBC/ODBC
      • Improved HBase support:
          – Binary types
          – Caching configuration
      • Fixed many bugs
  • 3. Check compatibility of "VIEW"
      • Hive ⇒ Impala
          – Can the Impala shell read data through a "VIEW" that was created via a Hive command?
      • Impala ⇒ Hive
          – Can the Hive shell read data through a "VIEW" that was created via an Impala command?
      • Result: the two kinds of "VIEW" are compatible with each other
  • 4. Check performance (Hive on Cluster1). Avg. job latency [sec]:
      • TextFile, no compression: 222.039
      • SequenceFile, Gzip: 244.67
      • SequenceFile, Snappy: 239.182
      • RCFile, Gzip: 228.801
      • RCFile, Snappy: 230.327
    Note: these results are not valid as a performance evaluation because some data may have been read remotely. See the slide "Check performance (Hive on Cluster2)".
  • 5. Check performance (Impala on Cluster1). Avg. job latency [sec]:
      • TextFile, no compression: 23.518
      • SequenceFile, Gzip: 32.155
      • SequenceFile, Snappy: 28.617
      • RCFile, Gzip: 20.774
      • RCFile, Snappy: 12.654
      • Parquet File, Snappy: 13.146
    Note: these results are not valid as a performance evaluation because some data may have been read remotely. See the slide "Check performance (Impala on Cluster2)".
  • 6. Check performance (Hive on Cluster2). Avg. job latency [sec]:
      • TextFile, no compression: 272.176
      • SequenceFile, Gzip: 249.531
      • SequenceFile, Snappy: 245.009
      • RCFile, Gzip: 230.034
      • RCFile, Snappy: 216.802
  • 7. Check performance (Impala on Cluster2). Avg. job latency [sec]:
      • TextFile, no compression: 32.528
      • SequenceFile, Gzip: 28.73
      • SequenceFile, Snappy: 21.173
      • RCFile, Gzip: 24.794
      • RCFile, Snappy: 14.308
      • Parquet File, Snappy: 19.814
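The Cluster2 numbers for Hive and Impala can be compared directly. The following is a small sketch that divides each Hive latency by the Impala latency for the same storage format and compression. The pairing of each value with a format and codec follows the order of the chart labels, which is an assumption about how the flattened charts should be read; the dictionary and function names are illustrative only.

```python
# Average job latencies in seconds, copied from the two Cluster2 slides.
# Assumption: values are paired with formats/codecs in chart-label order.
HIVE_CLUSTER2 = {
    ("TextFile", "None"): 272.176,
    ("SequenceFile", "Gzip"): 249.531,
    ("SequenceFile", "Snappy"): 245.009,
    ("RCFile", "Gzip"): 230.034,
    ("RCFile", "Snappy"): 216.802,
}
IMPALA_CLUSTER2 = {
    ("TextFile", "None"): 32.528,
    ("SequenceFile", "Gzip"): 28.73,
    ("SequenceFile", "Snappy"): 21.173,
    ("RCFile", "Gzip"): 24.794,
    ("RCFile", "Snappy"): 14.308,
    ("Parquet", "Snappy"): 19.814,  # no Hive counterpart in the slides
}


def speedups(hive, impala):
    """Return Hive latency / Impala latency for configurations measured on both."""
    return {key: hive[key] / impala[key] for key in hive if key in impala}


if __name__ == "__main__":
    for (fmt, codec), ratio in sorted(speedups(HIVE_CLUSTER2, IMPALA_CLUSTER2).items()):
        print(f"{fmt:<13}{codec:<7}{ratio:5.1f}x")
```

With these inputs the ratios fall roughly between 8x and 15x, with RCFile+Snappy showing the largest gap.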
  • 8. Check fixed bug
      • IMPALA-357
          – Insert into Parquet exceeds mem-limit
      • Problem
          – Even with the mem_limit setting in place, memory consumption was not limited when creating a ParquetFile table with partitions.
          – Eventually, impalad crashed due to memory shortage.
      • Result: the CREATE command failed due to the memory limit
  • 9. Summary
      • Thanks to the dev. team, Impala is also going from "Good to Great"
      • Both "VIEW" and "Parquet" are ready to use
      • Performance
          – RCFile+Snappy is the fastest on both Cluster1 and Cluster2
          – With larger tables, Parquet+Snappy may be the fastest
      • Hopes for future extensions
          – Support for Structure Types
          – Support for UDF/UDTF, etc.
  • 10. Appendix. Benchmark Details
  • 11. Our System Environment (Cluster1)
      • Installed using Cloudera Manager Free Edition 4.6.0
      • All servers are connected with 1Gbps Ethernet through an L2 switch
      • Master (3 servers): Active NameNode, Stand-by NameNode, JobTracker, statestored
      • Slave (14 servers)
          – 10 servers: DataNode, TaskTracker, Impalad
          – 4 servers: DataNode only
  • 12. Our System Environment (Cluster2)
      • Installed using Cloudera Manager Free Edition 4.6.0
      • All servers are connected with 1Gbps Ethernet through an L2 switch
      • Master (3 servers): Active NameNode, Stand-by NameNode, JobTracker, statestored
      • Slave (10 servers)
          – 10 servers: DataNode, TaskTracker, Impalad
          – The 4 DataNode-only servers are decommissioned
  • 13. Our Server Specification
      • CPU
          – Intel Core 2 Duo 2.13 GHz with Hyper Threading
      • Memory
          – 8GB: NameNodes only
          – 4GB: others
      • Disk
          – 7,200 rpm SATA mechanical hard disk drive × 1
      • OS
          – CentOS 6.3
  • 14. Benchmark
      • Used CDH4.3.0 + Impala 1.1
      • Used hivebench from the open-source benchmark tool "HiBench"
          – https://github.com/hibench
      • Scaled the datasets down to 1/10
          – The default configuration generates a table with 1 billion rows
      • Modified the query
          – Deleted "INSERT INTO TABLE …" to evaluate read-only performance
      • Combined several storage formats with several compression methods
          – TextFile, SequenceFile, RCFile, ParquetFile
          – No compression, Gzip, Snappy
      • Compared by query job latency
      • Averaged job latency over 5 measurements
      • Benchmarked on both Cluster1 and Cluster2
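The benchmark reports the average job latency over 5 measurements. A minimal sketch of such a measurement loop is shown here, assuming a hypothetical `run_query` callable that submits the query and blocks until it finishes. The slides do not say how the queries were actually timed, so the function names and the injectable `clock` parameter are illustrative assumptions.

```python
import statistics
import time
from typing import Callable, List


def measure_latency(run_query: Callable[[], None], runs: int = 5,
                    clock: Callable[[], float] = time.perf_counter) -> List[float]:
    """Run the query `runs` times and return each wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = clock()
        run_query()  # hypothetical: submit the query and wait for completion
        latencies.append(clock() - start)
    return latencies


def average_latency(latencies: List[float]) -> float:
    """Arithmetic mean of the measured latencies."""
    return statistics.mean(latencies)


if __name__ == "__main__":
    # Stand-in for a real query submission; replace with an actual client call.
    def fake_query() -> None:
        time.sleep(0.01)

    samples = measure_latency(fake_query)
    print(f"avg latency over {len(samples)} runs: {average_latency(samples):.3f} s")
```

Keeping the individual samples, rather than only the mean, makes it easy to spot an outlier run such as a cold cache on the first measurement.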
  • 15. Modified Datasets
      • Uservisits table
          – 100 million rows
          – 16,895 MB as TextFile
          – Table definitions: sourceIP string, destURL string, visitDate string, adRevenue double, userAgent string, countryCode string, languageCode string, searchWord string, duration int
      • Rankings table
          – 12 million rows
          – 744 MB as TextFile
          – Table definitions: pageURL string, pageRank int, avgDuration int
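As a quick sanity check on the dataset sizes above, the average row size of each table can be estimated from the reported row counts and TextFile sizes. This sketch assumes that "MB" means 2^20 bytes, which the slides do not state; with 10^6 bytes the results would be about 5% smaller. The function name is illustrative.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: the slides' "MB" is 2**20 bytes


def bytes_per_row(size_mb: float, rows: int) -> float:
    """Average on-disk bytes per row for a table stored as TextFile."""
    return size_mb * BYTES_PER_MB / rows


if __name__ == "__main__":
    # Row counts and sizes as reported on the "Modified Datasets" slide.
    print(f"uservisits: {bytes_per_row(16_895, 100_000_000):.0f} bytes/row")
    print(f"rankings:   {bytes_per_row(744, 12_000_000):.0f} bytes/row")
```

Under that assumption the uservisits rows average about 177 bytes and the rankings rows about 65 bytes.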
  • 16. Modified Query
    SELECT
      sourceIP,
      sum(adRevenue) as totalRevenue,
      avg(pageRank)
    FROM
      rankings_t R
    JOIN [BROADCAST] (
      SELECT
        sourceIP,
        destURL,
        adRevenue
      FROM
        uservisits_t UV
      WHERE
        (datediff(UV.visitDate, '1999-01-01')>=0
        AND
        datediff(UV.visitDate, '2000-01-01')<=0)
    ) NUV
    ON
      (R.pageURL = NUV.destURL)
    group by sourceIP
    order by totalRevenue DESC
    limit 1;
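To make the semantics of the modified query concrete, here is a single-process Python sketch that computes the same result on in-memory rows. It is not how Impala executes the query. It only mirrors the logic: filter uservisits by date (both bounds inclusive, matching the two datediff conditions), build a hash table on destURL from the filtered rows, probe it with rankings rows, then aggregate per sourceIP and keep the top row by total revenue. Building the hash table from the right-hand input is an assumption about what the [BROADCAST] hint implies, and the function name and tuple layouts are illustrative.

```python
from collections import defaultdict
from datetime import date


def top_revenue_source(rankings, uservisits,
                       start=date(1999, 1, 1), end=date(2000, 1, 1)):
    """Emulate the modified query on in-memory rows.

    rankings:   iterable of (pageURL, pageRank)
    uservisits: iterable of (sourceIP, destURL, visitDate as 'YYYY-MM-DD', adRevenue)
    Returns (sourceIP, totalRevenue, avgPageRank), or None if nothing joins.
    """
    # Build side: filtered visits keyed by destURL (the subquery NUV).
    visits_by_url = defaultdict(list)
    for source_ip, dest_url, visit_date, ad_revenue in uservisits:
        if start <= date.fromisoformat(visit_date) <= end:
            visits_by_url[dest_url].append((source_ip, ad_revenue))

    # Probe side: each rankings row joins with every visit to the same URL.
    totals = defaultdict(lambda: [0.0, 0, 0])  # revenue sum, pageRank sum, row count
    for page_url, page_rank in rankings:
        for source_ip, ad_revenue in visits_by_url.get(page_url, ()):
            acc = totals[source_ip]
            acc[0] += ad_revenue
            acc[1] += page_rank
            acc[2] += 1

    if not totals:
        return None
    # ORDER BY totalRevenue DESC LIMIT 1
    source_ip, (revenue, rank_sum, count) = max(totals.items(), key=lambda kv: kv[1][0])
    return source_ip, revenue, rank_sum / count


if __name__ == "__main__":
    sample_rankings = [("a.com", 10), ("b.com", 20)]
    sample_visits = [
        ("1.1.1.1", "a.com", "1999-05-01", 1.5),
        ("1.1.1.1", "b.com", "1999-06-01", 2.0),
        ("2.2.2.2", "a.com", "1999-07-01", 3.0),
        ("2.2.2.2", "b.com", "2001-01-01", 100.0),  # outside the date range
    ]
    print(top_revenue_source(sample_rankings, sample_visits))
```

On the sample rows, the visit from 2001 is filtered out before the join, so its large adRevenue does not affect the ranking.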
  • 17. Thanks! I want to use TPC in the next evaluation…