Inside Hive (for beginners)
Takeshi NAKANO / Recruit Co. Ltd.

Why?
Hive is a good tool for non-specialists!
The number of M/R jobs drives Hive's processing time.
  ↓
How can we reduce that number?
What can we do about it when writing HiveQL?
  ↓
How does Hive convert HiveQL into M/R jobs?
Which optimizations are applied during that conversion?
7/6/2011  HIVE - A warehouse solution over Map Reduce Framework

Don't you have...
This paper from fb (Facebook) has a lot of information!
But it is a little old.

Component Level Analysis

Hive Architecture / Exec Flow
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]

Hive Workflow
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]
Hive has operators, which are its minimum processing units.
Each operator's work is carried out as an HDFS operation or as M/R jobs.
The compiler converts HiveQL into a set of operators.

Hive Workflow
Operators

Hive Workflow
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]
For M/R processing, Hive uses ExecMapper and ExecReducer.
There are two processing modes:
  1. Local processing mode
  2. Distributed processing mode

Hive Workflow
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]
In mode 1 (local mode):
  Hive forks a process with the hadoop command.
  The plan.xml is created on just one node, and that single node processes it.
In mode 2 (distributed mode):
  Hive sends the job to the existing JobTracker.
  The information is placed in the DistributedCache and processed on multiple nodes.

Compiler: How to Process HiveQL
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]

“Plumbing” of HIVE compiler

Compiler Overview
Stages: Parser → Semantic Analyzer → Logical Plan Gen. → Logical Optimizer → Physical Plan Gen. → Physical Optimizer

HiveQL
  → Parser              → AST
  → Semantic Analyzer   → QB
  → Logical Plan Gen.   → Operator Tree
  → Logical Optimizer   → Operator Tree
  → Physical Plan Gen.  → Task Tree
  → Physical Optimizer  → Task Tree

Parser: HiveQL → AST
HiveQL:
  INSERT OVERWRITE TABLE access_log_temp2
  SELECT a.user, a.prono, p.maker, p.price
  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
AST:
  TOK_QUERY
    + TOK_FROM                                  (1)
        + TOK_JOIN
            + TOK_TABREF
                + TOK_TABNAME
                    + "access_log_hbase"
                + "a"
            + TOK_TABREF
                + TOK_TABNAME
                    + "product_hbase"
                + "p"
            + "="
                + "."
                    + TOK_TABLE_OR_COL
                        + "a"
                    + "prono"
                + "."
                    + TOK_TABLE_OR_COL
                        + "p"
                    + "prono"
    + TOK_INSERT
        + TOK_DESTINATION                       (2)
            + TOK_TAB
                + TOK_TABNAME
                    + "access_log_temp2"
        + TOK_SELECT                            (3)
            + TOK_SELEXPR
                + "."
                    + TOK_TABLE_OR_COL
                        + "a"
                    + "user"
            + TOK_SELEXPR
                + "."
                    + TOK_TABLE_OR_COL
                        + "a"
                    + "prono"
            + TOK_SELEXPR
                + "."
                    + TOK_TABLE_OR_COL
                        + "p"
                    + "maker"
            + TOK_SELEXPR
                + "."
                    + TOK_TABLE_OR_COL
                        + "p"
                    + "price"
The numbers (1) to (3) mark the three parts of the AST: TOK_FROM, TOK_DESTINATION, and TOK_SELECT.

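The indented AST on the parser slide and the parenthesised AST that EXPLAIN prints are two renderings of the same tree. As a rough illustration, the sketch below models a tree as nested Python tuples and prints it in the parenthesised form. The tuple layout and the helper names `to_sexpr` and `col` are assumptions made for this example; they are not Hive's parser classes.

```python
# Toy model of a parse tree: (token, child, child, ...) tuples with plain
# strings as leaves. This is an illustration only, not Hive's own AST class.

def to_sexpr(node):
    """Render a nested-tuple tree in the parenthesised form EXPLAIN prints."""
    if isinstance(node, str):
        return node
    token, *children = node
    return "(" + " ".join([token] + [to_sexpr(c) for c in children]) + ")"


def col(alias, name):
    """Build the subtree for a qualified column reference such as a.prono."""
    return (".", ("TOK_TABLE_OR_COL", alias), name)


# The TOK_JOIN subtree of the example query.
join_ast = (
    "TOK_JOIN",
    ("TOK_TABREF", ("TOK_TABNAME", "access_log_hbase"), "a"),
    ("TOK_TABREF", ("TOK_TABNAME", "product_hbase"), "p"),
    ("=", col("a", "prono"), col("p", "prono")),
)

print(to_sexpr(join_ast))
```

The printed string can be compared by eye with the TOK_JOIN portion of the EXPLAIN output in the appendix.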
Semantic Analyzer (1/2): AST → QB
AST (part 1):
  + TOK_FROM
      + TOK_JOIN
          + TOK_TABREF
              + TOK_TABNAME
                  + "access_log_hbase"
              + "a"
          + TOK_TABREF
              + TOK_TABNAME
                  + "product_hbase"
              + "p"
          + "="
              + "."
                  + TOK_TABLE_OR_COL
                      + "a"
                  + "prono"
              + "."
                  + TOK_TABLE_OR_COL
                      + "p"
                  + "prono"
QB:
  MetaData
    Alias To Table Info
      "a" = Table Info("access_log_hbase")
      "p" = Table Info("product_hbase")
  ParseInfo
    Join Node
      + TOK_JOIN
          + TOK_TABREF …
          + TOK_TABREF …
          + "=" …

Semantic Analyzer (2/2): AST → QB
AST (part 2):
  + TOK_DESTINATION
      + TOK_TAB
          + TOK_TABNAME
              + "access_log_temp2"
QB:
  ParseInfo
    Name To Destination Node
      + TOK_TAB
          + TOK_TABNAME
              + "access_log_temp2"

Semantic Analyzer (2/2): AST → QB
AST (part 3):
  + TOK_SELECT
      + TOK_SELEXPR
          + "."
              + TOK_TABLE_OR_COL
                  + "a"
              + "user"
      + TOK_SELEXPR
          + "."
              + TOK_TABLE_OR_COL
                  + "a"
              + "prono"
      + TOK_SELEXPR
          + "."
              + TOK_TABLE_OR_COL
                  + "p"
              + "maker"
      + TOK_SELEXPR
          + "."
              + TOK_TABLE_OR_COL
                  + "p"
              + "price"
QB:
  ParseInfo
    Name To Select Node
      + TOK_SELECT
          + TOK_SELEXPR …
          + TOK_SELEXPR …
          + TOK_SELEXPR …
          + TOK_SELEXPR …

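The three Semantic Analyzer slides can be read as one walk over the AST that files each part into a QB structure. The following is a minimal sketch of that idea using nested tuples and plain dictionaries. The function names (`child`, `analyze`), the dictionary keys, and the use of "insclause-0" as the key for the select node are assumptions for illustration; Hive's real QB classes are considerably richer.

```python
# Toy AST as nested tuples: (token, child, ...). Illustration only.

def col(alias, name):
    return (".", ("TOK_TABLE_OR_COL", alias), name)


query_ast = (
    "TOK_QUERY",
    ("TOK_FROM",
     ("TOK_JOIN",
      ("TOK_TABREF", ("TOK_TABNAME", "access_log_hbase"), "a"),
      ("TOK_TABREF", ("TOK_TABNAME", "product_hbase"), "p"),
      ("=", col("a", "prono"), col("p", "prono")))),
    ("TOK_INSERT",
     ("TOK_DESTINATION", ("TOK_TAB", ("TOK_TABNAME", "access_log_temp2"))),
     ("TOK_SELECT",
      ("TOK_SELEXPR", col("a", "user")),
      ("TOK_SELEXPR", col("a", "prono")),
      ("TOK_SELEXPR", col("p", "maker")),
      ("TOK_SELEXPR", col("p", "price")))),
)


def child(node, token):
    """Return the first direct child of `node` whose token is `token`."""
    for c in node[1:]:
        if isinstance(c, tuple) and c[0] == token:
            return c
    raise KeyError(token)


def analyze(ast):
    """Build a QB-like dictionary from the three parts of the AST."""
    join_node = child(child(ast, "TOK_FROM"), "TOK_JOIN")
    insert_node = child(ast, "TOK_INSERT")

    # Part 1: map each alias to its table name.
    alias_to_table = {}
    for c in join_node[1:]:
        if c[0] == "TOK_TABREF":
            alias_to_table[c[2]] = child(c, "TOK_TABNAME")[1]

    # Parts 2 and 3: keep the destination and select subtrees by clause name.
    dest = child(child(insert_node, "TOK_DESTINATION"), "TOK_TAB")
    select = child(insert_node, "TOK_SELECT")
    return {
        "meta_data": {"alias_to_table": alias_to_table},
        "parse_info": {
            "join": join_node,
            "dest": {"insclause-0": dest},
            "select": {"insclause-0": select},  # key name is an assumption
        },
    }


qb = analyze(query_ast)
print(qb["meta_data"]["alias_to_table"])
```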
Logical Plan Generator (1/4): QB → OPTree
QB:
  MetaData
    Alias To Table Info
      "a" = Table Info("access_log_hbase")
      "p" = Table Info("product_hbase")
OPTree:
  TableScanOperator("access_log_hbase")
  TableScanOperator("product_hbase")

Logical Plan Generator (2/4): QB → OPTree
QB:
  ParseInfo
    + TOK_JOIN
        + TOK_TABREF …
        + TOK_TABREF …
        + "=" …
OPTree:
  ReduceSinkOperator("access_log_hbase")
  ReduceSinkOperator("product_hbase")
  JoinOperator

Logical Plan Generator (3/4): QB → OPTree
QB:
  ParseInfo
    Name To Select Node
      + TOK_SELECT
          + TOK_SELEXPR …
          + TOK_SELEXPR …
          + TOK_SELEXPR …
          + TOK_SELEXPR …
OPTree:
  SelectOperator

Logical Plan Generator (4/4): QB → OPTree
QB:
  MetaData
    Name To Destination Table Info
      "insclause-0" = Table Info("access_log_temp2")
OPTree:
  FileSinkOperator

Logical Plan Generator (result)
OPTree:
  TableScanOperator TS_1, TableScanOperator TS_0
  ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
  JoinOperator JOIN_4
  SelectOperator SEL_5
  FileSinkOperator FS_6

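The suffixes in the result (TS_0, TS_1, RS_2, RS_3, JOIN_4, SEL_5, FS_6) look like they come from a single counter that is incremented each time an operator is created. The sketch below reproduces that numbering under this assumption. `PlanBuilder` and `build_join_plan` are hypothetical names, and the pairing of TS_0 with RS_2 and TS_1 with RS_3 is also assumed, since the slide does not show the edges.

```python
import itertools


class PlanBuilder:
    """Names operators from one shared counter, in creation order."""

    def __init__(self):
        self._seq = itertools.count()
        self.edges = {}  # operator name -> list of child operator names

    def add(self, kind, *parents):
        name = f"{kind}_{next(self._seq)}"
        self.edges[name] = []
        for parent in parents:
            self.edges[parent].append(name)
        return name


def build_join_plan():
    """Create operators in the order of the four Logical Plan Generator steps."""
    b = PlanBuilder()
    ts_a = b.add("TS")                # 1/4: scan of access_log_hbase (alias a)
    ts_p = b.add("TS")                # 1/4: scan of product_hbase (alias p)
    rs_a = b.add("RS", ts_a)          # 2/4: reduce sinks feed the join
    rs_p = b.add("RS", ts_p)
    join = b.add("JOIN", rs_a, rs_p)
    sel = b.add("SEL", join)          # 3/4: projection
    b.add("FS", sel)                  # 4/4: write to the destination table
    return b.edges


plan = build_join_plan()
for op, children in plan.items():
    print(op, "->", children)
```

Under the same assumption, adding a WHERE clause creates a filter between the join and the select, which would shift the later names by one (FIL_5, SEL_6, FS_7).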
Logical Optimizer

Logical Optimizer (Predicate Push Down)
Query without a filter:
  INSERT OVERWRITE TABLE access_log_temp2
  SELECT a.user, a.prono, p.maker, p.price
  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
Query with a filter:
  INSERT OVERWRITE TABLE access_log_temp2
  SELECT a.user, a.prono, p.maker, p.price
  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono)
  WHERE p.maker = 'honda';

Logical Optimizer (Predicate Push Down)
Operator tree for the query with WHERE p.maker = 'honda', before push down:
  TableScanOperator TS_1, TableScanOperator TS_0
  ReduceSinkOperator RS_3, ReduceSinkOperator RS_2
  JoinOperator JOIN_4
  FilterOperator FIL_5 (_col8 = 'honda')
  SelectOperator SEL_6
  FileSinkOperator FS_7

Logical Optimizer (Predicate Push Down)
Operator tree for the query with WHERE p.maker = 'honda', after push down.
FilterOperator FIL_8 is added below the table scan, ahead of the ReduceSinkOperators, so rows are filtered before the join:
  TableScanOperator TS_1, TableScanOperator TS_0
  FilterOperator FIL_8 (maker = 'honda')
  ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
  JoinOperator JOIN_4
  FilterOperator FIL_5 (_col8 = 'honda')
  SelectOperator SEL_6
  FileSinkOperator FS_7

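The push-down step can be sketched as a small tree rewrite: find the table scan whose alias the predicate refers to and splice a copy of the filter directly beneath it. The code below is a simplified model under several assumptions: the `Op` class and `push_down` function are hypothetical, TS_1 is assumed to be the scan of alias p, and both filters carry the same predicate tuple, whereas the slide shows the original filter on the join output column `_col8`. As on the slide, the original FIL_5 is left in place.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Op:
    op_id: str
    alias: Optional[str] = None                       # set on table scans only
    predicate: Optional[Tuple[str, str, str]] = None  # (alias, column, value) on filters
    children: List["Op"] = field(default_factory=list)


def build_plan():
    """Operator tree for the filtered query, before push down."""
    fs = Op("FS_7")
    sel = Op("SEL_6", children=[fs])
    fil = Op("FIL_5", predicate=("p", "maker", "honda"), children=[sel])
    join = Op("JOIN_4", children=[fil])
    ts_a = Op("TS_0", alias="a", children=[Op("RS_2", children=[join])])
    ts_p = Op("TS_1", alias="p", children=[Op("RS_3", children=[join])])
    return [ts_a, ts_p], fil


def push_down(scans, filter_op, new_id):
    """Insert a copy of the filter right after the scan of the alias it refers to."""
    alias = filter_op.predicate[0]
    for scan in scans:
        if scan.alias == alias:
            pushed = Op(new_id, predicate=filter_op.predicate, children=scan.children)
            scan.children = [pushed]
            return pushed
    return None  # predicate does not refer to any scanned alias


def chain(op):
    """Follow first children from `op` down to the sink and list operator ids."""
    ids = [op.op_id]
    while op.children:
        op = op.children[0]
        ids.append(op.op_id)
    return ids


scans, fil_5 = build_plan()
push_down(scans, fil_5, "FIL_8")
print(chain(scans[0]))
print(chain(scans[1]))
```

This rewrite is shown for an inner join only. With outer joins, moving a filter below the join is not always valid, so a real optimizer has to check the join type first.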
Physical Plan Generator: OPTree → TaskTree
TaskTree:
  MapRedTask (Stage-1/root)
    OPTree:
      TableScanOperator (TS_0)
      TableScanOperator (TS_1)
      ReduceSinkOperator (RS_2)
      ReduceSinkOperator (RS_3)
      JoinOperator (JOIN_4)
      SelectOperator (SEL_5)
      FileSinkOperator (FS_6)
  MoveTask (Stage-0)
    LoadTableDesc
  StatsTask (Stage-2)

Physical Plan Generator (result): OPTree → TaskTree
MapRedTask (Stage-1/root)
  Mapper:
    TableScanOperator TS_1, TableScanOperator TS_0
    ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
  Reducer:
    JoinOperator JOIN_4
    SelectOperator SEL_5
    FileSinkOperator FS_6

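In the result above, the boundary between Mapper and Reducer falls at the ReduceSinkOperators. As a rough sketch of that rule for a single-job plan, the code below walks the operator tree from the table scans and switches from the map side to the reduce side after each reduce sink. The adjacency dictionary, the function name `split_at_reduce_sink`, and the TS-to-RS pairing are assumptions for illustration; plans that need several M/R jobs are not handled.

```python
# Operator tree as an adjacency mapping: operator -> list of children.
PLAN = {
    "TS_0": ["RS_2"],
    "TS_1": ["RS_3"],
    "RS_2": ["JOIN_4"],
    "RS_3": ["JOIN_4"],
    "JOIN_4": ["SEL_5"],
    "SEL_5": ["FS_6"],
    "FS_6": [],
}


def split_at_reduce_sink(plan, roots):
    """Assign each operator to the map side or the reduce side of one M/R job."""
    mapper, reducer = [], []

    def visit(op, side):
        if op in mapper or op in reducer:
            return  # already placed (JOIN_4 is reachable from both branches)
        (mapper if side == "map" else reducer).append(op)
        # Everything downstream of a reduce sink runs in the reducer.
        next_side = "reduce" if op.startswith("RS_") else side
        for child_op in plan[op]:
            visit(child_op, next_side)

    for root in roots:
        visit(root, "map")
    return mapper, reducer


mapper_ops, reducer_ops = split_at_reduce_sink(PLAN, ["TS_0", "TS_1"])
print("Mapper: ", mapper_ops)
print("Reducer:", reducer_ops)
```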
Physical Optimizer: TaskTree → TaskTree
See the classes under java/org/apache/hadoop/hive/ql/optimizer/physical/

Physical Optimizer (MapJoinResolver): TaskTree → TaskTree
Before:
  MapRedTask (Stage-1)
    Mapper:
      TableScanOperator TS_1, TableScanOperator TS_0
      MapJoinOperator MAPJOIN_7
      SelectOperator SEL_8
      SelectOperator SEL_5
      FileSinkOperator FS_6
After:
  MapredLocalTask (Stage-7)
    TableScanOperator TS_0
    HashTableSinkOperator HASHTABLESINK_11
  MapRedTask (Stage-1)
    Mapper:
      TableScanOperator TS_1
      MapJoinOperator MAPJOIN_7
      SelectOperator SEL_8
      SelectOperator SEL_5
      FileSinkOperator FS_6

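The split into a local task that ends in a hash table sink and a map-only task that performs the join corresponds to the usual two phases of a hash join. The sketch below shows those two phases on in-memory rows. The function names and the sample rows are made up for illustration, and treating the product table as the small, hashed side is an assumption; the slides do not say which table TS_0 scans.

```python
def build_hash_table(small_rows, key):
    """Phase 1 (local task in the slide): load the small table into a hash table."""
    table = {}
    for row in small_rows:
        table.setdefault(row[key], []).append(row)
    return table


def map_join(big_rows, hash_table, key):
    """Phase 2 (mapper in the slide): stream the big table and probe the hash table."""
    for row in big_rows:
        for match in hash_table.get(row[key], []):
            yield {**row, **match}


# Made-up sample data, not taken from the presentation.
access_log = [
    {"user": "u1", "prono": 1},
    {"user": "u2", "prono": 2},
    {"user": "u3", "prono": 9},  # no matching product, dropped by the inner join
]
products = [
    {"prono": 1, "maker": "honda", "price": 100},
    {"prono": 2, "maker": "toyota", "price": 200},
]

hash_table = build_hash_table(products, "prono")
joined = list(map_join(access_log, hash_table, "prono"))
for row in joined:
    print(row)
```

Because the join completes on the map side, no reduce phase is needed for it, which fits the MapRedTask on the slide listing only a Mapper.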
In the end
[Diagram: Client, Driver, Compiler, Metastore, Hadoop]

HiveQL
  → Parser              → AST
  → Semantic Analyzer   → QB
  → Logical Plan Gen.   → Operator Tree
  → Logical Optimizer   → Operator Tree
  → Physical Plan Gen.  → Task Tree
  → Physical Optimizer  → Task Tree

End

Appendix: What does Explain show?

hive> explain INSERT OVERWRITE TABLE access_log_temp2
    >  SELECT a.user, a.prono, p.maker, p.price
    >  FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a) (TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (. (TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: prono
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: prono
                    type: int
              tag: 0
              value expressions:
                    expr: user
                    type: string
                    expr: prono
                    type: int
        p
          TableScan
            alias: p
            Reduce Output Operator
              key expressions:
                    expr: prono
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: prono
                    type: int
              tag: 1
              value expressions:
                    expr: maker
                    type: string
                    expr: price
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col2}
            1 {VALUE._col1} {VALUE._col2}
          handleSkewJoin: false
          outputColumnNames: _col0, _col2, _col6, _col7
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col2
                  type: int
                  expr: _col6
                  type: string
                  expr: _col7
                  type: int
            outputColumnNames: _col0, _col1, _col2, _col3
            File Output Operator
              compressed: false
              GlobalTableId: 1
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: default.access_log_temp2

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.access_log_temp2

  Stage: Stage-2
    Stats-Aggr Operator

Time taken: 0.1 seconds
hive>

Appendix: What does Explain show?
EXPLAIN output (abbreviated):
  ABSTRACT SYNTAX TREE:
  STAGE DEPENDENCIES:
    Stage-1 is a root stage
    Stage-0 depends on stages: Stage-1
    Stage-2 depends on stages: Stage-0
  STAGE PLANS:
    Stage: Stage-1
      Map Reduce
        Map Operator Tree:
          TableScan
              Reduce Output Operator
          TableScan
              Reduce Output Operator
        Reduce Operator Tree:
          Join Operator
            Select Operator
              File Output Operator
    Stage: Stage-0
      Move Operator
    Stage: Stage-2
      Stats-Aggr Operator

≒ Task tree:
  MapRedTask (Stage-1/root)
    Mapper:
      TableScanOperator TS_1, TableScanOperator TS_0
      ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
    Reducer:
      JoinOperator JOIN_4
      SelectOperator SEL_5
      FileSinkOperator FS_6
  MoveTask (Stage-0)
  Stats Task (Stage-2)

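The STAGE DEPENDENCIES section is regular enough to read mechanically. As a small sketch, the code below parses the three dependency lines from the EXPLAIN output into a dictionary and derives an order in which the stages could run. The function names are hypothetical, and the line patterns are based only on the output shown in this appendix, so other Hive versions may print this section differently.

```python
import re

# The dependency section copied from the EXPLAIN output in this appendix.
EXPLAIN_DEPENDENCIES = """\
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0
"""


def parse_stage_dependencies(text):
    """Return {stage: [stages it depends on]} from a STAGE DEPENDENCIES block."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        m = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        if m:
            deps[m.group(1)] = []
            continue
        m = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if m:
            deps[m.group(1)] = [s.strip() for s in m.group(2).split(",")]
    return deps


def execution_order(deps):
    """List stages so that every stage comes after the stages it depends on."""
    order, done = [], set()

    def visit(stage):
        if stage in done:
            return
        for parent in deps[stage]:
            visit(parent)
        done.add(stage)
        order.append(stage)

    for stage in deps:
        visit(stage)
    return order


deps = parse_stage_dependencies(EXPLAIN_DEPENDENCIES)
print(deps)
print(execution_order(deps))
```

For this query the order matches the task tree: the MapRedTask (Stage-1) runs first, then the MoveTask (Stage-0), then the Stats Task (Stage-2).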

Internal Hive

  • 1.
    Inside Hive (forbeginners)1Takeshi NAKANO / Recruit Co. Ltd.
  • 2.
    Why?Hive is goodtool for non-specialist!The number of M/R controls the Hive processing time.↓How can we reduce the number?What can we do for this on writing HiveQL?↓How does Hive convert HiveQLto M/R jobs?On this, what optimizing processes are adopted?7/6/2011HIVE - A warehouse solution over Map Reduce Framework2
  • 3.
    Don’t you have..Thisfb’s paper has a lot of information!But this is a little old..7/6/2011HIVE - A warehouse solution over Map Reduce Framework3
  • 4.
    Component Level Analysis7/6/2011HIVE- A warehouse solution over Map Reduce Framework4
  • 5.
    Hive Architecture /Exec Flow7/6/2011HIVE - A warehouse solution over Map Reduce Framework5ClientHadoopMetastoreDriverCompiler
  • 6.
    ClientHadoopDriverCompilerHive WorkflowHive hasthe operators which are minimum processing units.The process of each operator is done with HDFS operation or M/R jobs.The compiler converts HiveQL to the sets of operators.7/6/2011HIVE - A warehouse solution over Map Reduce Framework6Metastore
  • 7.
    Hive WorkflowOperators7/6/2011HIVE -A warehouse solution over Map Reduce Framework7
  • 8.
    ClientHadoopMetastoreDriverCompilerHive WorkflowFor M/Rprocessing, Hiveuses ExecMaper and ExecReducer.On processing, we have 2 modes.Local processing modeDistributed processing mode7/6/2011HIVE - A warehouse solution over Map Reduce Framework8
  • 9.
    ClientHadoopMetastoreDriverCompilerHive WorkflowOn 1(Localmode)Hive fork the process with hadoop command.The plan.xml is made just on 1 and the single node processes this.On 2(Distributed mode).Hive send the process to exsistingJobTracker.The information is housed on DistributedCacheand processed on multi nodes.7/6/2011HIVE - A warehouse solution over Map Reduce Framework9
  • 10.
    Compiler : Howto Process HiveQL7/6/2011HIVE - A warehouse solution over Map Reduce Framework10ClientHadoopMetastoreDriverCompiler
  • 11.
    “Plumbing” of HIVEcompiler7/6/201111HIVE - A warehouse solution over Map Reduce Framework
  • 12.
    “Plumbing” of HIVEcompiler7/6/201112HIVE - A warehouse solution over Map Reduce Framework
  • 13.
  • 14.
    Compiler Overview14HiveQLParserASTSemanticAnalyzerQBLogicalPlan Gen.OperatorTreeLogicalOptimizerOperator TreePhysicalPlan Gen.Task TreePhysicalOptimizerTask Tree
  • 15.
    ParserHiveQLASTINSERT OVERWRITE TABLEaccess_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);HiveQLTOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“AST + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 16.
    ParserSQLASTINSERT OVERWRITE TABLEaccess_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);SQLTOK_QUERY + TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ + TOK_INSERT + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2" + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"AST123SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 17.
    17Semantic Analyzer (1/2)ASTQB+TOK_FROM + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“AST1QBMetaDataAliasTo Table Info“a”=Table Info(“access_log_hbase”)“p”=Table Info(“product_hbase”)ParseInfoJoin Node+ TOK_JOIN + TOK_TABREF … + TOK_TABREF … + “=” …SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser17
  • 18.
    18Semantic Analyzer (2/2)ASTQB + TOK_DESTINATION + TOK_TAB + TOK_TABNAME + "access_log_temp2”AST2QBParseInfoNameTo Destination Node+ TOK_TAB + TOK_TABNAME +"access_log_temp2”SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser1818
  • 19.
    19Semantic Analyzer (2/2)ASTQB + TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"ASTQBParseInfo3Name To Select Node+ TOK_SELECT + TOK_SELEXPR … + TOK_SELEXPR … + TOK_SELEXPR … + TOK_SELEXPR …SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser1919
  • 20.
    20Logical Plan Generator(1/4)QBOPTreeQBMetaDataAliasTo Table Info“a”=Table Info(“access_log_hbase”)“p”=Table Info(“product_hbase”)OPTreeTableScanOperator(“access_log_hbase”)TableScanOperator(“product_hbase”)SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser2020
  • 21.
    21Logical Plan Generator(2/4)QBOPTreeQBParseInfo + TOK_JOIN + TOK_TABREF + TOK_TABNAME + "access_log_hbase" + a + TOK_TABREF + TOK_TABNAME + "product_hbase" + "p" + "=" + "." + TOK_TABLE_OR_COL + "a" + "access_log_hbase" + "." + TOK_TABLE_OR_COL + "p" + "prono“ReduceSinkOperator(“access_log_hbase”)ReduceSinkOperator(“product_hbase”)OPTreeJoinOperatorSemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 22.
    22Logical Plan Generator(3/4)QBOPTreeQBParseInfoName To Select Node+ TOK_SELECT + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "user" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "a" + "prono" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "maker" + TOK_SELEXPR + "." + TOK_TABLE_OR_COL + "p" + "price"OPTreeSelectOperatorSemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 23.
    23Logical Plan Generator(4/4)QBOPTreeQBMetaDataName To Destination Table Info“insclause-0”= Table Info(“access_log_temp2”)OPTreeFileSinkOperatorSemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 24.
    Logical Plan Generator(result)24LCF OPTreeTableScanOperatorTS_1TableScanOperatorTS_0ReduceSinkOperatorRS_2ReduceSinkOperatorRS_3JoinOperatorJOIN_4SelectOperatorSEL_5FileSinkOperatorFS_6SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser
  • 25.
  • 26.
    Logical Optimizer (PredicatePush Down)INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser2626
  • 27.
    Logical Optimizer (PredicatePush Down)TableScanOperatorTS_1TableScanOperatorTS_0INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);ReduceSinkOperatorRS_3ReduceSinkOperatorRS_2JoinOperatorJOIN_4INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';SelectOperatorSEL_6FileSinkOperatorFS_7SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser2727
  • 28.
    INSERT OVERWRITE TABLEaccess_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';Logical Optimizer (Predicate Push Down)TableScanOperatorTS_1TableScanOperatorTS_0ReduceSinkOperatorRS_3ReduceSinkOperatorRS_2JoinOperatorJOIN_4FilterOperatorFIL_5(_col8 = 'honda')SelectOperatorSEL_6FileSinkOperatorFS_7SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser2828
  • 29.
    Logical Optimizer (PredicatePush Down)TableScanOperatorTS_1TableScanOperatorTS_0INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);FilterOperatorFIL_8(maker = 'honda')ReduceSinkOperatorRS_2ReduceSinkOperatorRS_3JoinOperatorJOIN_4INSERT OVERWRITE TABLE access_log_temp2 SELECT a.user, a.prono, p.maker, p.price FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono) WHERE p.maker = 'honda';FilterOperatorFIL_5(_col8 = 'honda')SelectOperatorSEL_6FileSinkOperatorFS_7SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser2929
  • 30.
  • 31.
    OPTreeTaskTreeMapRedTask (Stage-1/root)TableScanOperator(TS_0)Physical PlanGenerator (result)31LCF MapperTableScanOperatorTS_1TableScanOperatorTS_0TableScanOperator(TS_1)ReduceSinkOperatorRS_2ReduceSinkOperatorRS_3ReduceSinkOperator(RS_2)MapRedTask(Stage-1/root)ReduceSinkOperator(RS_3)ReducerJoinOperatorJOIN_4JoinOperator(JOIN_4)SelectOperatorSEL_5SelectOperator(SEL_5)FileSinkOperatorFS_6SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser313131
  • 32.
  • 33.
    33Physical Optimizer (MapJoinResolver)TaskTreeTaskTreeMapRedTask(Stage-1)MapperTableScanOperatorTS_1TableScanOperatorTS_0MapJoinOperatorMAPJOIN_7SelectOperatorSEL_8SelectOperatorSEL_5FileSinkOperatorFS_6SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser33
  • 34.
    34Physical Optimizer (MapJoinResolver)TaskTreeTaskTreeMapredLocalTask(Stage-7)MapRedTask(Stage-1)TableScanOperatorTS_0MapperTableScanOperatorTS_1TableScanOperatorTS_0HashTableSinkOperatorHASHTABLESINK_11MapJoinOperatorMAPJOIN_7MapRedTask (Stage-1)SelectOperatorSEL_8MapperTableScanOperatorTS_1SelectOperatorSEL_5MapJoinOperatorMAPJOIN_7FileSinkOperatorFS_6SelectOperatorSEL_8SelectOperatorSEL_5FileSinkOperatorFS_6SemanticAnalyzerLogicalPlan Gen.LogicalOptimizerPhysicalPlan Gen.PhysicalOptimizerParser34
  • 35.
    In the end7/6/2011HIVE- A warehouse solution over Map Reduce Framework35ClientHadoopMetastoreDriverCompiler
  • 36.
    In the end36HiveQLParserASTSemanticAnalyzerQBLogicalPlanGen.Operator TreeLogicalOptimizerOperator TreePhysicalPlan Gen.Task TreePhysicalOptimizerTask Tree
  • 37.
  • 38.
    Appendix: What doesExplain show?7/6/2011HIVE - A warehouse solution over Map Reduce Framework38
  • 39.
    Appendix: What does Explain show?
    hive> explain INSERT OVERWRITE TABLE access_log_temp2
        > SELECT a.user, a.prono, p.maker, p.price
        > FROM access_log_hbase a JOIN product_hbase p ON (a.prono = p.prono);
    OK
    ABSTRACT SYNTAX TREE:
      (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME access_log_hbase) a) (TOK_TABREF (TOK_TABNAME product_hbase) p) (= (. (TOK_TABLE_OR_COL a) prono) (. (TOK_TABLE_OR_COL p) prono)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME access_log_temp2))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) user)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) prono)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) maker)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL p) price)))))
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
      Stage-2 depends on stages: Stage-0
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Alias -> Map Operator Tree:
            a
              TableScan
                alias: a
                Reduce Output Operator
                  key expressions:
                    expr: prono
                    type: int
                  sort order: +
                  Map-reduce partition columns:
                    expr: prono
                    type: int
                  tag: 0
                  value expressions:
                    expr: user
                    type: string
                    expr: prono
                    type: int
            p
              TableScan
                alias: p
                Reduce Output Operator
                  key expressions:
                    expr: prono
                    type: int
                  sort order: +
                  Map-reduce partition columns:
                    expr: prono
                    type: int
                  tag: 1
                  value expressions:
                    expr: maker
                    type: string
                    expr: price
                    type: int
          Reduce Operator Tree:
            Join Operator
              condition map:
                Inner Join 0 to 1
              condition expressions:
                0 {VALUE._col0} {VALUE._col2}
                1 {VALUE._col1} {VALUE._col2}
              handleSkewJoin: false
              outputColumnNames: _col0, _col2, _col6, _col7
              Select Operator
                expressions:
                  expr: _col0
                  type: string
                  expr: _col2
                  type: int
                  expr: _col6
                  type: string
                  expr: _col7
                  type: int
                outputColumnNames: _col0, _col1, _col2, _col3
                File Output Operator
                  compressed: false
                  GlobalTableId: 1
                  table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: default.access_log_temp2
      Stage: Stage-0
        Move Operator
          tables:
            replace: true
            table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.access_log_temp2
      Stage: Stage-2
        Stats-Aggr Operator
    Time taken: 0.1 seconds
    hive>
  • 40.
    Appendix: What does Explain show?
    The same Explain output, reduced to its skeleton:
    ABSTRACT SYNTAX TREE:
    STAGE DEPENDENCIES:
      Stage-1 is a root stage
      Stage-0 depends on stages: Stage-1
      Stage-2 depends on stages: Stage-0
    STAGE PLANS:
      Stage: Stage-1
        Map Reduce
          Map Operator Tree:
            TableScan
              Reduce Output Operator
            TableScan
              Reduce Output Operator
          Reduce Operator Tree:
            Join Operator
              Select Operator
                File Output Operator
      Stage: Stage-0
        Move Operator
      Stage: Stage-2
        Stats-Aggr Operator
  • 41.
    Appendix: What does Explain show?
    The STAGE PLANS part of the Explain output ≒ the Task Tree:
      Explain output                      Task Tree
      Stage: Stage-1, Map Reduce          MapRedTask (Stage-1/root)
        Map Operator Tree:                  Mapper
          TableScan (x2)                      TableScanOperator TS_1, TableScanOperator TS_0
          Reduce Output Operator (x2)         ReduceSinkOperator RS_2, ReduceSinkOperator RS_3
        Reduce Operator Tree:               Reducer
          Join Operator                       JoinOperator JOIN_4
          Select Operator                     SelectOperator SEL_5
          File Output Operator                FileSinkOperator FS_6
      Stage: Stage-0, Move Operator       MoveTask (Stage-0)
      Stage: Stage-2, Stats-Aggr Operator Stats Task (Stage-2)
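The STAGE DEPENDENCIES lines of the Explain output already describe the order in which the tasks run. As a small illustration, the Python sketch below reads those three lines and derives the execution order. The `parse_dependencies` helper is hypothetical and only handles the two line shapes that appear in this output. Other Hive versions or queries may print other shapes.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, Set

# The STAGE DEPENDENCIES block from the Explain output above.
DEPENDENCIES = """\
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
"""


def parse_dependencies(text: str) -> Dict[str, Set[str]]:
    """Map each stage to the set of stages it depends on."""
    graph: Dict[str, Set[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        root = re.fullmatch(r"(Stage-\d+) is a root stage", line)
        dep = re.fullmatch(r"(Stage-\d+) depends on stages: (.+)", line)
        if root:
            graph[root.group(1)] = set()
        elif dep:
            graph[dep.group(1)] = {s.strip() for s in dep.group(2).split(",")}
    return graph


graph = parse_dependencies(DEPENDENCIES)
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['Stage-1', 'Stage-0', 'Stage-2']
```

The order matches the task tree: the MapRedTask runs first, then the MoveTask, then the Stats Task.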