Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Hive sourcecodereading


Published on

Published in: Technology
  • Dating for everyone is here: ❤❤❤ ❤❤❤
    Are you sure you want to  Yes  No
    Your message goes here
  • Dating direct: ❤❤❤ ❤❤❤
    Are you sure you want to  Yes  No
    Your message goes here
  • My cat couldn't make it through the Hive source code either!
    Are you sure you want to  Yes  No
    Your message goes here

Hive sourcecodereading

  1. 1. Hive Source Code Reading Hadoop Source Code Reading Vol.9(5/30) @wyukawa2012年5月31日木曜日
  2. 2. Overview Motivation What component did I read? Set up Development Environment Useful document for Code Reading Let’s try! My understanding Where does Hive tune performance? My impression2012年5月31日木曜日
  3. 3. Motivation Curiosity How does Hive convert HiveQL to MapReduce jobs? What optimizing processes are adopted? select item, sum(value) from item_tbl group by item; map reduce2012年5月31日木曜日
  4. 4. What component did I read?年5月31日木曜日
  5. 5. Set up Development Environment See questions/171/hive Setting up is troublesome...(-_-;) I read r1342841.2012年5月31日木曜日
  6. 6. Useful document Hive Anatomy anatomy Internal Hive internal-hive2012年5月31日木曜日
  7. 7. Let’s try!年5月31日木曜日
  8. 8. a few minutes later年5月31日木曜日
  9. 9. a few hours later年5月31日木曜日
  10. 10. It’s difficult... Complex Method length is long年5月31日木曜日
  11. 11. And there is this ticket...年5月31日木曜日
  12. 12. 2012年5月31日木曜日
  13. 13. Terrible...It’s not good...年5月31日木曜日
  14. 14. anyway...年5月31日木曜日
  15. 15. My understanding HiveQL Parser AST Semantic Analyzer Operator Logical Plan Generation QB Logical Optimizer Operator Physical Plan Generation Prepare Execution Task Physical Optimizer Task Start MapReduce job2012年5月31日木曜日
  16. 16. Operator Example FetchOperator SelectOperator FilterOperator GroupByOperator TableScanOperator ReduceSinkOperator JoinOperator2012年5月31日木曜日
  17. 17. Explain Example 1 hive> explain select * from aaa; ... STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -12012年5月31日木曜日
  18. 18. Explain Example 22012年5月31日木曜日
  19. 19. Optimizer Rule-based Optimizer is a set of Transformation rules on the operator tree The transformation rules are specified by a regexp pattern on the tree. The rule with minimum cost are executed. The cost is regexp pattern match length.2012年5月31日木曜日
  20. 20. Cost Calculation Example HiveQL is “select key from src;” Optimizer is ColumnPruner. SelectOperator is expressed as “SEL%”. Regexp pattern of ColumnPrunerSelectProc rule is expressed as “SEL%”. The cost is 4. Pattern.compile(“SEL%”).matcher(“SEL%”).group ().length()2012年5月31日木曜日
  21. 21. Logical Optimizer Example ColumnPruner PredicatePushDown PartitionPruner GroupByOptimizer UnionProcessor JoinReorder2012年5月31日木曜日
  22. 22. Physical Optimizer Example SkewJoinResolver CommonJoinResolver If, resolver start. IndexWhereResolver MetadataOnlyOptimizer2012年5月31日木曜日
  23. 23. Break年5月31日木曜日
  24. 24. Where does Hive tune performance? Logical Plan Generation Physical Optimizer hive.exec.parallel Prepare Execution2012年5月31日木曜日
  25. 25. default true Logical Plan Generation map side aggregation(in-mapper combining)年5月31日木曜日
  26. 26. default false Physical Optimizer If, map join start. optimization-in-apache-hive/4706679289192012年5月31日木曜日
  27. 27. hive.exec.parallel default false Prepare Execution If hive.exec.parallel=true, parallel launch(like parallel query in Oracle) start. It is effective in multi table insert. 2012-04-03 19:34:09,346 Stage-4 map = 0%, reduce = 0% 2012-04-03 19:34:16,576 Stage-5 map = 0%, reduce = 0% 2012-04-03 19:34:47,630 Stage-4 map = 100%, reduce = 0% 2012-04-03 19:34:57,716 Stage-5 map = 100%, reduce = 0% 2012-04-03 19:35:40,147 Stage-4 map = 100%, reduce = 67% 2012-04-03 19:35:46,430 Stage-4 map = 100%, reduce = 100% 2012-04-03 19:35:46,443 Stage-5 map = 100%, reduce = 100%2012年5月31日木曜日
  28. 28. My impression There are a few choices of performance tuning. (, hive.exec.parallel) Data model, ETL flow is more important than performance tuning. Current optimization in Hive is rule-based. If index and statistics and cost-based optimization are more developed, there will be more choices of performance tuning.2012年5月31日木曜日