Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Join optimization in hive

Join optimization in hive

Related Audiobooks

Free with a 30 day trial from Scribd

See all
  • Be the first to comment

Join optimization in hive

  1. 1. Join Optimization in Hive<br />Liyin Tang<br />
  2. 2. Outline<br />Map Join Optimization<br />Previous Common Join and Map Join<br />Optimized Map Join<br />JDBM<br />Performance Evaluation<br />Convert Join to Map Join Automatically<br />How it works<br />Performance Evaluation<br />
  3. 3. Common Join<br />Task A<br />Table X<br />Table Y<br />Common Join Task<br />Mapper<br />Mapper<br />Mapper<br />…<br />Mapper<br />…<br />Mapper<br />…<br />Mapper<br />Shuffle<br />Reducer<br />
  4. 4. Previous Map Join<br />Task A<br />Small Table Data<br />MapJoin Task<br />Mapper<br />…<br />…<br />Record<br />Mapper<br />…<br />…<br />…<br />Big Table <br />Data<br />Record<br />Mapper<br />…<br />…<br />…<br />Record<br />Record<br />Record<br />Task C<br />…<br />…<br />
  5. 5. Optimized Map Join<br />Small Table Data<br />Small Table Data<br />Small Table Data<br />Task A<br />Upload files to DC<br />HashTable Files<br />HashTable Files<br />HashTable Files<br />MapReduce Local Task<br />Distributed Cache<br />MapJoin Task<br />Mapper<br />…<br />Record<br />Mapper<br />…<br />Record<br />Mapper<br />…<br />Big Table <br />Data<br />Record<br />Record<br />…<br />…<br />Task C<br />
  6. 6. JDBM<br />JDBM is too heavy weight for Map Join<br />Take more than 70% CPU time<br />Generate very large file<br />No need to use persistent hashtable for map join<br />
  7. 7. Performance Evaluation I<br />
  8. 8. Converting Common Join into Map Join<br />Task A<br />Task A<br />Conditional Task<br />a<br />CommonJoinTask<br />MapJoinLocalTask<br />MapJoinLocalTask<br />MapJoinLocalTask<br />b<br />CommonJoinTask<br />. . . . . <br />Task C<br />c<br />MapJoinTask<br />MapJoinTask<br />MapJoinTask<br />Previous Execution Flow <br />Task C<br />Optimized Execution Flow<br />
  9. 9. Compile Time<br />SELECT * FROM <br />SRC1 x JOIN SRC2 y <br />ON x.key = y.key;<br />Task A<br />a<br />Conditional Task<br />Assume TABLE x is the big table<br />Assume TABLE y is the big table<br />MapJoinLocalTask<br />MapJoinLocalTask<br />CommonJoinTask<br />MapJoinTask<br />MapJoinTask<br />Task C<br />
  10. 10. Execution Time<br />SELECT * FROM <br />SRC1 x JOIN SRC2 y <br />ON x.key = y.key;<br />Task A<br />Both tables are too big for map join<br />Table X is the big table<br />a<br />Conditional Task<br />MapJoinLocalTask<br />CommonJoinTask<br />MapJoinTask<br />Task C<br />
  11. 11. Backup Task<br />Task A<br />Conditional Task<br />Memory Bound<br />MapJoinLocalTask<br />Run as a Backup Task<br />CommonJoinTask<br />MapJoinTask<br />Task C<br />
  12. 12. Performance Bottleneck<br />Distributed Cache is the potential performance bottleneck<br />Large hashtable file will slow down the propagation of Distributed Cache<br />Mappers are waiting for the hashtables file from Distributed Cache<br />Compress and archive all the hashtable file into a tar file.<br />
  13. 13. Compress and Archive <br />Small Table Data<br />Small Table Data<br />Small Table Data<br />Task A<br />Compressed & Archived<br />a<br />HashTable Files<br />HashTable Files<br />HashTable Files<br />MapReduce Local Task<br />Distributed Cache<br />Mapper<br />…<br />Record<br />Mapper<br />…<br />Record<br />Mapper<br />…<br />Big Table <br />Data<br />Record<br />b<br />Record<br />MapJoin Task<br />…<br />…<br />Task C<br />
  14. 14. Performance Evaluation II<br />
  15. 15. Performance Evaluation III<br />
  16. 16. Future Work<br />Audit how many join will be converted into map join in the cluster.<br />Set hashtable file replica number based on the number of Mappers<br />Tune the limit of small table data size by sampling<br />Increase the in-memory hashtable capacity.<br />
  17. 17. Thank you<br />Liyin Tang<br />

    Be the first to comment

    Login to see the comments

  • restartr

    Nov. 16, 2011
  • davidginzburg

    Jan. 9, 2012
  • YinHuai

    Oct. 19, 2012
  • chaoh

    Jan. 5, 2013
  • SteveHwang1

    Jan. 22, 2013
  • tasyblue

    Feb. 21, 2013
  • idealone

    Oct. 21, 2014
  • ahsandmkm

    Nov. 15, 2014
  • passfield2003

    Nov. 9, 2015
  • juanRojasTovar

    Apr. 20, 2016
  • pedrambashiri

    Aug. 9, 2017
  • ssusere598c2

    Nov. 13, 2017
  • uriyoung2

    Dec. 20, 2017

Join optimization in hive


Total views


On Slideshare


From embeds


Number of embeds