Programming Hive Reading #3

  1. Programming Hive Reading #3 @just_do_neet
  2. Chapter 10. Tuning
     • Using EXPLAIN / EXPLAIN EXTENDED
     • Optimized Join
     • Local Mode
     • Parallel Execution
     • Strict Mode
     • Tuning the Number of Map/Reduce
     • JVM, etc.
  3. Using EXPLAIN
     • Getting by without EXPLAIN is only acceptable until elementary school (you know the rest)
     • Output contents:
       • Abstract Syntax Tree (AST)
       • Dependencies
       • Stage Plans
  4. Using EXPLAIN
     • org.apache.hadoop.hive.ql.exec.ExplainTask
       http://grepcode.com/file/repository.cloudera.com$content$repositories$releases@org.apache.hadoop.hive$hive-exec@0.7.1-cdh3u1@org$apache$hadoop$hive$ql$exec$ExplainTask.java
  5. Using EXPLAIN
     • AST (abstract syntax tree)
       • TOK_FROM: the input source (TOK_TABREF = table)
       • TOK_INSERT: the output destination
       • TOK_SELECT: the SELECT clause
  6. Using EXPLAIN
     • Dependencies
       • MapReduce Job / Sampling Stage / Merge Stage / Limit Stage / etc.
  7. Using EXPLAIN
     • Stage Plans
  8. Using EXPLAIN
     • Stage Plans: Operators
       http://hive.apache.org/docs/r0.7.1/api/org/apache/hadoop/hive/ql/exec/Operator.html
     • "EXPLAIN EXTENDED" prints more detailed information (for example, where temporary files are written).
  9. Optimized Join
     • Order the join expression according to the number of records in each table (e.g. when stocks > dividends).
     • The rightmost table in the join is streamed (at reduce); all the others are buffered. Placing the largest table last therefore keeps the buffered data small.
  10. Optimized Join
     • The stream table can also be specified explicitly with the hint "STREAMTABLE(tbl_name)".
  11. Optimized Join
     • Verification
       a : 1,000,000,000 records
       b : 100,000,000 records
       $ SELECT a.hoge, b.fuga FROM a JOIN b on (a.id = b.id)
         121.384 s
       $ SELECT a.hoge, b.fuga FROM b JOIN a on (b.id = a.id)
         122.339 s
       $ SELECT /*+ streamtable(a) */ a.hoge, b.fuga FROM b JOIN a on (b.id = a.id)
         120.298 s
  12. Map Side Join
     • Shown again (repeated from an earlier session)
  13. Map Side Join
     • Shown again (repeated from an earlier session)
  14. Local Mode
     • When the data size is small, Local Mode can reduce overhead and is faster in some cases.
       $ set mapred.job.tracker=local;
       $ set mapred.tmp.dir=/tmp/masashi/sada;
       $ SELECT * FROM hoge WHERE id = 'fuga';
       ..........
       Job running in-process (local Hadoop)
       ..........
  15. Local Mode
     • When the data size is small, Local Mode can reduce overhead and is faster in some cases.
     • e.g. a table of about 30,000 records
       normal mode : 27 s
       local mode : 10 s
     • e.g. a table of about 100,000,000 records
       normal mode : 40 s
       local mode : 532 s
  16. Local Mode
     • To have Hive switch to Local Mode automatically, set "hive.exec.mode.local.auto=true".
     • Local Mode is used when all of the following conditions hold:
       • The total input size of the job is lower than "hive.exec.mode.local.auto.inputbytes.max" (128MB by default).
       • The total number of map tasks is less than "hive.exec.mode.local.auto.tasks.max" (4 by default).
       • The total number of reduce tasks required is 1 or 0.
  17. Strict Mode
     • Tuning?
     • When enabled, query checking becomes stricter.
       "hive.mapred.mode=strict"
  18. Tuning M/R Number
     • hive.exec.reducers.bytes.per.reducer = <number>
     • hive.exec.reducers.max = <number>
     • mapred.reduce.tasks = <number>
  19. JVM Reuse
     • The number of Map/Reduce tasks that run in a single JVM can be configured (in "mapred-site.xml").
     • A value of -1 means no limit.
  20. Dynamic Partition Tuning
     • Limits on the use of Dynamic Partitions can be configured.
  21. Single MR Multi Group By
     • Reference: https://issues.apache.org/jira/browse/HIVE-2056
       From table T
       insert overwrite table test1 select col1, count(distinct colx) group by col1
       insert overwrite table test2 select col1, col2, count(distinct colx) group by col1, col2;
     • For a query like the one above, "hive.multigroupby.singlemr=true" is reportedly faster.
  22. Virtual Columns
     • Tuning?
     • The following information can be retrieved with HiveQL and can also be used in conditions:
       • INPUT__FILE__NAME
       • BLOCK__OFFSET__INSIDE__FILE
       • ROW__OFFSET__INSIDE__BLOCK (requires "hive.exec.rowoffset=true")
  23. Virtual Columns
     • Example
       https://cwiki.apache.org/Hive/languagemanual-virtualcolumns.html
  24. Conclusion
     • For real-world performance tuning, I think improving the data structure has a bigger effect than anything described above.
     • I have very high hopes for whoever is covering Chapter 11 and Chapter 15!!!
  25. Thank you for your attention.
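
The three conditions on the Local Mode slide can be read as a simple predicate. The following is a minimal sketch in Python of that rule as the slide states it, not Hive's actual implementation. The names JobEstimate and eligible_for_local_mode are hypothetical and exist only for illustration; the default thresholds (128MB and 4) are the ones quoted on the slide.

```python
from dataclasses import dataclass

# Defaults quoted on the Local Mode slide. The constant names are illustrative.
DEFAULT_INPUT_BYTES_MAX = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
DEFAULT_TASKS_MAX = 4                        # hive.exec.mode.local.auto.tasks.max


@dataclass
class JobEstimate:
    """Hypothetical summary of a job's size; this is not a Hive class."""
    input_bytes: int
    map_tasks: int
    reduce_tasks: int


def eligible_for_local_mode(job: JobEstimate,
                            input_bytes_max: int = DEFAULT_INPUT_BYTES_MAX,
                            tasks_max: int = DEFAULT_TASKS_MAX) -> bool:
    """Return True only when all three conditions from the slide hold."""
    return (
        job.input_bytes < input_bytes_max   # total input size below the limit
        and job.map_tasks < tasks_max       # fewer map tasks than the limit
        and job.reduce_tasks in (0, 1)      # at most one reduce task
    )


if __name__ == "__main__":
    small = JobEstimate(input_bytes=10 * 1024 * 1024, map_tasks=1, reduce_tasks=1)
    large = JobEstimate(input_bytes=10 * 1024 ** 3, map_tasks=80, reduce_tasks=10)
    print(eligible_for_local_mode(small))
    print(eligible_for_local_mode(large))
```

Under these assumptions a small lookup-style query qualifies while a large scan does not, which matches the timings on slide 15: local mode helps on small tables and hurts badly on large ones.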
