Cloudera Impala 0.6 beta
Performance Evaluation
(with Comparison to Hive)

Mar. 6, 2013
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon

Copyright © CELLANT Corp. All Rights Reserved. http://www.cellant.jp/

Cloudera Impala 0.6 beta

• Changes since 0.5 beta
  – Cloudera Manager 4.5 and CDH 4.2 support Impala 0.6.
  – Support for the RCFile file format.
  – Added support for Impala on SUSE and Debian/Ubuntu.
    • RHEL 5.7/6.2 and CentOS 5.7/6.2
    • SUSE 11 with Service Pack 1 or later
    • Ubuntu 10.04/12.04 and Debian 6.03

System Environment

• Installed via Cloudera Manager Free Edition 4.5.0
• Master: 3 servers
  – Active NameNode
  – Stand-by NameNode
  – JobTracker and statestored
• Slave: 11 servers
  – Each runs DataNode, TaskTracker, and Impalad
• All servers are connected with 1 Gbps Ethernet through an L2 switch

Server Specification

• CPU
  – Intel Core 2 Duo 2.13 GHz with Hyper-Threading
• Memory
  – 4 GB
• Disk
  – 7,200 rpm SATA mechanical hard disk drive
• OS
  – CentOS 6.2

Benchmark

• Used CDH 4.2.0 + Impala 0.6 beta
• Used "hivebench" from the open-source benchmark tool "HiBench"
  – https://github.com/hibench
• Scaled the datasets down to 1/10
  – The default configuration generates a table with 1 billion rows
• Modified the query
  – Removed "INSERT INTO TABLE …" to evaluate read-only performance
• Combined several Hive storage formats with several compression methods
  – TextFile, SequenceFile, RCFile
  – No compression, Gzip, Snappy
• Compared query (job) latency
• Each reported latency is the average over 5 measurements

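The averaging step can be sketched as follows. This is a minimal illustration and not the harness used in this evaluation: `average_latency` and `run_query` are hypothetical names, and `run_query` stands in for whatever submits the query to Hive or Impala and blocks until the result comes back.

```python
import time
from statistics import mean
from typing import Callable, List, Tuple


def average_latency(run_query: Callable[[], None],
                    repeats: int = 5) -> Tuple[float, List[float]]:
    """Call `run_query` `repeats` times.

    Returns (mean latency, list of per-run latencies), both in seconds.
    """
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical: submit the query and wait for the result
        latencies.append(time.perf_counter() - start)
    return mean(latencies), latencies


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a cluster.
    avg, runs = average_latency(lambda: time.sleep(0.01), repeats=5)
    print("runs:", [round(r, 3) for r in runs], "avg:", round(avg, 3), "sec")
```

Keeping the per-run latencies alongside the mean makes it possible to see whether the first run behaves differently from the later ones.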
Modified Datasets

• Uservisits table
  – 100 million rows
  – Table definition
    • sourceIP       string
    • destURL        string
    • visitDate      string
    • adRevenue      double
    • userAgent      string
    • countryCode    string
    • languageCode   string
    • searchWord     string
    • duration       int

• Rankings table
  – 12 million rows
  – Table definition
    • pageURL        string
    • pageRank       int
    • avgDuration    int

Modified Query

        SELECT
          sourceIP,
          sum(adRevenue) as totalRevenue,
          avg(pageRank)
        FROM
          rankings_t R
        JOIN (
          SELECT
            sourceIP,
            destURL,
            adRevenue
          FROM
            uservisits_t UV
          WHERE
            (datediff(UV.visitDate, '1999-01-01')>=0
            AND
            datediff(UV.visitDate, '2000-01-01')<=0)
          ) NUV
        ON
          (R.pageURL = NUV.destURL)
        group by sourceIP
        order by totalRevenue DESC
        limit 1;

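To make the query's semantics concrete, here is a small sketch that runs an equivalent statement against an in-memory SQLite database instead of Hive or Impala. Several things here are assumptions for illustration only: the toy rows are made up, only the columns the query touches are created, and Hive's `datediff(a, b)` is replaced by `julianday(a) - julianday(b)` because SQLite has no `datediff`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings_t (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER);
    CREATE TABLE uservisits_t (sourceIP TEXT, destURL TEXT,
                               visitDate TEXT, adRevenue REAL);
""")

# Made-up toy data, not taken from the benchmark datasets.
conn.executemany("INSERT INTO rankings_t VALUES (?, ?, ?)", [
    ("a.example", 10, 1),
    ("b.example", 20, 1),
])
conn.executemany("INSERT INTO uservisits_t VALUES (?, ?, ?, ?)", [
    ("1.1.1.1", "a.example", "1999-03-01", 1.5),
    ("1.1.1.1", "b.example", "1999-06-01", 2.0),
    ("2.2.2.2", "a.example", "1999-07-01", 3.0),
    ("2.2.2.2", "b.example", "2001-01-01", 100.0),  # outside the date range
])

# Same shape as the modified query; julianday() differences stand in for datediff().
QUERY = """
    SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
    FROM rankings_t R
    JOIN (
        SELECT sourceIP, destURL, adRevenue
        FROM uservisits_t UV
        WHERE julianday(UV.visitDate) - julianday('1999-01-01') >= 0
          AND julianday(UV.visitDate) - julianday('2000-01-01') <= 0
    ) NUV
    ON (R.pageURL = NUV.destURL)
    GROUP BY sourceIP
    ORDER BY totalRevenue DESC
    LIMIT 1
"""

top_row = conn.execute(QUERY).fetchone()
print(top_row)
```

The visit dated 2001 is dropped by the date filter before the join, so the source IP with the largest in-range revenue wins even though the other IP has a larger raw total.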
Benchmark Result (Hive)

Avg. Job Latency [sec]

  Format         Compression    Latency
  TextFile       No Comp.       235.843
  SequenceFile   Gzip           227.883
  SequenceFile   Snappy         213.616
  RCFile         Gzip           234.289
  RCFile         Snappy         197.894

Benchmark Result (Impala)

Avg. Job Latency [sec]

  Format         Compression    Latency
  TextFile       No Comp.        32.776
  SequenceFile   Gzip            21.250
  SequenceFile   Snappy          17.725
  RCFile         Gzip            17.030
  RCFile         Snappy          16.059

Block Location Cache effect?

Impala job latency per run [sec]

  Job       TextFile    SequenceFile   SequenceFile   RCFile    RCFile
            No Comp.    Gzip           Snappy         Gzip      Snappy
  1st       50.256      23.692         22.085         18.475    20.042
  2nd       34.905      20.710         19.733         16.690    18.859
  3rd       30.752      20.604         15.608         16.620    16.642
  4th       26.848      20.625         15.602         16.617    12.148
  5th       21.121      20.620         15.597         16.747    12.606
  Average   32.776      21.250         17.725         17.030    16.059

• The 1st job is always the slowest, and the fastest job is always one of the later runs.
  Is this due to a Block Location Cache effect?

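The per-run figures can be checked with a few lines of Python. The sketch below recomputes each average from the five runs and compares the first run with the fastest one. The numbers are copied from this slide; the names `runs`, `reported_avg`, and `summary` are made up for the example.

```python
# Impala latency per run [sec], copied from the slide.
runs = {
    ("TextFile", "No Comp."):    [50.256, 34.905, 30.752, 26.848, 21.121],
    ("SequenceFile", "Gzip"):    [23.692, 20.710, 20.604, 20.625, 20.620],
    ("SequenceFile", "Snappy"):  [22.085, 19.733, 15.608, 15.602, 15.597],
    ("RCFile", "Gzip"):          [18.475, 16.690, 16.620, 16.617, 16.747],
    ("RCFile", "Snappy"):        [20.042, 18.859, 16.642, 12.148, 12.606],
}

# Averages as reported on the slide.
reported_avg = {
    ("TextFile", "No Comp."): 32.776,
    ("SequenceFile", "Gzip"): 21.250,
    ("SequenceFile", "Snappy"): 17.725,
    ("RCFile", "Gzip"): 17.030,
    ("RCFile", "Snappy"): 16.059,
}

summary = {}
for key, latencies in runs.items():
    avg = sum(latencies) / len(latencies)
    summary[key] = {
        "avg": avg,
        "matches_reported": abs(avg - reported_avg[key]) < 0.001,
        "first_is_slowest": latencies[0] == max(latencies),
        "ratio": latencies[0] / min(latencies),  # first run vs. fastest run
    }

for (fmt, comp), s in summary.items():
    print(f"{fmt:13s} {comp:9s} avg={s['avg']:.3f} "
          f"first/fastest={s['ratio']:.2f} first_is_slowest={s['first_is_slowest']}")
```

The first-to-fastest ratio differs a lot between formats, which is worth keeping in mind when reading the 5-run averages.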
Conclusion

• Impala is over 10 times faster than MRv1 + Hive
• Specifically,
  – Impala 0.6 beta
    • RCFile compressed with Snappy: 16.059 sec
  – MRv1 + Hive 0.10
    • RCFile compressed with Snappy: 197.894 sec
• Hope that the Impala GA release included in CDH5 will be even faster
  – Support for the Trevni columnar format
  – Optimized query planner

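The speedup behind this conclusion is simply the Hive latency divided by the Impala latency. The sketch below works it out for every format and compression pair from the two "Benchmark Result" slides. One assumption to note: the Hive values are matched to formats by reading the Hive bar chart in the same order as the Impala chart. The names `hive`, `impala`, and `speedup` are illustrative.

```python
# Average job latency [sec] from the two "Benchmark Result" slides.
hive = {
    ("TextFile", "No Comp."): 235.843,
    ("SequenceFile", "Gzip"): 227.883,
    ("SequenceFile", "Snappy"): 213.616,
    ("RCFile", "Gzip"): 234.289,
    ("RCFile", "Snappy"): 197.894,
}
impala = {
    ("TextFile", "No Comp."): 32.776,
    ("SequenceFile", "Gzip"): 21.250,
    ("SequenceFile", "Snappy"): 17.725,
    ("RCFile", "Gzip"): 17.030,
    ("RCFile", "Snappy"): 16.059,
}

# Speedup = how many times faster Impala answered the same query.
speedup = {key: hive[key] / impala[key] for key in hive}

for (fmt, comp), ratio in sorted(speedup.items(), key=lambda kv: -kv[1]):
    print(f"{fmt:13s} {comp:9s} {ratio:5.2f}x")
```

With these inputs the compressed formats all come out above 10x, while uncompressed TextFile is closer to 7x, so the "over 10 times" figure is best read as applying to the compressed configurations.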
Thanks.




