Cloudera Impala Performance Evaluation
(with Comparison to Hive)
 
               Dec. 8, 2012
     CELLANT Corp. R&D Strategy Division
              Yukinori SUDA
                @sudabon
About Cloudera Impala

•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel
   and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip, or
   Bzip
•  Accesses the data directly through a specialized distributed
   query engine
Architecture

•  The State Store runs as the impala-state-store (statestored) daemon
•  The Query Planner, Query Coordinator, and Query Exec Engine run inside the
   impalad daemon
System Environment

•  Installed via Cloudera Manager Free Edition
•  Master (1 server)
   o  HDFS: NameNode, SecondaryNameNode
   o  MapReduceV1: JobTracker
   o  Impala: impalad, impala-state-store (statestored)
•  Slaves (13 servers)
   o  HDFS: DataNode
   o  MapReduceV1: TaskTracker
   o  Impala: impalad
•  All servers are connected with 1 Gbps Ethernet through an L2 switch
Server Specification
 

•  CPU
   o  Intel Core 2 Duo 2.13 GHz with Hyper Threading

•  Memory
   o  4GB

•  Disk
   o  7,200 rpm SATA mechanical Hard Disk Drive

•  OS
   o  CentOS 6.2
Benchmark

•  Uses CDH4.1 + Impala versions 0.2 and 0.3
•  Uses the hivebench workload from the open-source benchmark tool
   "HiBench"
   o  https://github.com/hibench
•  Datasets reduced to 1/10 scale
   o  The default configuration generates a table with 1 billion rows
•  Query modified
   o  Removed "INSERT INTO TABLE …" to evaluate read-only performance
   o  Removed the "datediff" function (I mistakenly believed it was not supported)
•  Combines several Hive storage formats with several
   compression methods
   o  TextFile, SequenceFile, RCFile
   o  No compression, Gzip, Snappy
•  Compared by query latency
   o  Average latency over 5 measurements
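The "average latency over 5 measurements" metric can be sketched as follows. This is a minimal illustration, not the harness actually used for these slides: `average_latency`, `run_query`, and the injectable `clock` parameter are hypothetical names introduced here, and `run_query` stands for whatever callable submits the query and blocks until it finishes.

```python
import time
from statistics import mean


def average_latency(run_query, repeats=5, clock=time.perf_counter):
    """Run `run_query` `repeats` times and return the mean wall-clock latency.

    `run_query` is a hypothetical zero-argument callable that submits the
    query and returns only when the result is available.
    `clock` is injectable so the function can be checked without real timing.
    """
    latencies = []
    for _ in range(repeats):
        start = clock()
        run_query()
        latencies.append(clock() - start)
    return mean(latencies)
```

In practice `run_query` might wrap a call to the Hive or Impala command-line shell; that wiring is left out because the slides do not describe it.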
Modified Datasets

•  Uservisits table
   o  100 million rows
   o  Schema
        •  sourceIP      string
        •  destURL       string
        •  visitDate     string
        •  adRevenue     double
        •  userAgent     string
        •  countryCode   string
        •  languageCode  string
        •  searchWord    string
        •  duration      int
•  Rankings table
   o  12 million rows
   o  Schema
        •  pageURL       string
        •  pageRank      int
        •  avgDuration   int
Modified Query

SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings R
JOIN (
  SELECT
     sourceIP,
     destURL,
     adRevenue
  FROM
     uservisits UV
  WHERE
     UV.visitDate >= '1999-01-01'
     AND UV.visitDate <= '2001-01-01'
  ) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
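To make concrete what the benchmark query computes, the sketch below runs the same join on a few toy rows. It uses Python's built-in SQLite as a stand-in, not Hive or Impala, so it says nothing about performance. The rows, the `top_source_ip` helper, and the trimmed four-column `uservisits` table are all invented for illustration, and the date column is spelled `visitDate` as in the schema slide.

```python
import sqlite3

# Toy rows invented for illustration; they are not from the benchmark dataset.
RANKINGS = [("a.com", 10, 5), ("b.com", 30, 7)]
USERVISITS = [
    ("1.1.1.1", "a.com", "2000-05-01", 1.5),
    ("1.1.1.1", "b.com", "2000-06-01", 2.0),
    ("2.2.2.2", "a.com", "2000-07-01", 3.0),
    ("2.2.2.2", "b.com", "2005-01-01", 9.0),  # outside the date range, filtered out
]

QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings R
JOIN (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits UV
  WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def top_source_ip():
    """Return (sourceIP, totalRevenue, avg pageRank) for the top-earning IP."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    # Only the four uservisits columns the query touches are modeled here.
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", RANKINGS)
    conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", USERVISITS)
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row
```

With these toy rows, `1.1.1.1` wins with a total revenue of 3.5 and an average page rank of 20.0, because the 2005 visit falls outside the date filter.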
Benchmark Result (Hive)
Benchmark Result (Impala 0.2)
Benchmark Result (Impala 0.3)
Conclusion

•  Impala is over 10 times faster than MR + Hive
   o  Impala 0.3
        •  SequenceFile compressed with Snappy: 14.337 seconds
   o  Impala 0.2
        •  SequenceFile compressed with Gzip: 19.733 seconds
   o  Hive
        •  RCFile compressed with Snappy: 164.161 seconds

•  Hope that Impala 1.0, to be included in CDH5,
   will be even faster
   o  Support for RCFile and the Trevni columnar format
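The speedup implied by the three latencies on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the dictionary keys and the `speedup_over_hive` helper are names introduced here for illustration.

```python
# Latencies in seconds, copied from the Conclusion slide.
LATENCY_SECONDS = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup_over_hive(latencies, baseline="Hive (RCFile + Snappy)"):
    """Return baseline latency divided by each other entry's latency."""
    base = latencies[baseline]
    return {
        name: base / seconds
        for name, seconds in latencies.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, factor in speedup_over_hive(LATENCY_SECONDS).items():
        print(f"{name}: {factor:.2f}x faster than Hive")
```

On these figures Impala 0.3 comes out at roughly 11.5 times faster than Hive and Impala 0.2 at roughly 8.3 times, so the "over 10 times" claim holds for version 0.3.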
Thank you
