MapReduce Over Lustre

Integrate Hadoop with Lustre


  1. MapReduce over Lustre: report
     David Luan, Simon Huang, GaoShengGong, 2008.10~2009.6
  2. Outline
     • Early research & analysis
     • Platform design & improvement
     • Test cases & test process design
     • Result analysis
     • Related work (GFS-like redundancy)
     • White paper & conclusion
  3. Early research & analysis
     • HDFS and Lustre overall benchmark tests: IOZone, IOR?, WebDAV (an indirect
       way to mount HDFS) ★
     • Hadoop platform overview: MapReduce, the three kinds of Hadoop I/O,
       shortcomings & bottlenecks
     • Lustre platform: module analysis, shortcomings
  4. Early research & analysis: overall benchmark tests
  5. Early research & analysis: MapReduce flow
     • Split the input into key-value pairs and call Map on each pair; each Map
       produces a new set of key-value pairs.
     • The framework sorts the intermediate pairs by key.
     • For each distinct key, call Reduce(K, V[...]); it produces one key-value
       pair per distinct key, and the output is written as a set of key-value pairs.
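To make that flow concrete, here is a small self-contained Java sketch that walks the same map, sort/group, and reduce steps in memory for a word-count style input. It only illustrates the data flow; it is not Hadoop code, and all names in it are invented for this example.

```java
import java.util.*;

// Minimal in-memory illustration of the map -> sort/group -> reduce flow.
// Not Hadoop code: names and structure are invented for this sketch.
public class MapReduceFlowSketch {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("lustre hadoop", "hadoop mapreduce", "lustre");

        // Map: split each input record into (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Sort/group: collect all values for the same key (the "shuffle" in Hadoop).
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }

        // Reduce: one output pair per distinct key (here, the sum of its values).
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            System.out.println(e.getKey() + "\t" + sum);
        }
    }
}
```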
  6. Early research & analysis: Hadoop I/O phases
     • Map Read, local read/write, HTTP shuffle, Reduce Write
  7. Early research & analysis: platform comparison
     • Hadoop + HDFS
       - Job/task-level parallelism
       - Compute and storage tightly coupled
       - HDFS prefers huge files
       - Limited application range (jobs can be hard to split):
         distributed grep, distributed sort, log processing, data warehousing
     • Lustre
       - I/O-level parallelism
       - Compute and storage loosely coupled
       - POSIX compatible
       - Applications: supercomputing
  8. Early research & analysis: shortcomings comparison
     • HDFS shortcomings
       - Metadata design
       - No parallel I/O
       - Not general purpose (designed for MapReduce)
     • Lustre shortcomings
       - Inadequate reliability
       - Inadequate stability
       - No native redundancy
  9. Outline
     • Early research & analysis
     • Platform design & improvement
     • Test cases & test process design
     • Result analysis
     • Related work (GFS-like redundancy)
     • White paper & conclusion
  10. Platform design & improvement
     • Two approaches:
     • (1) Java wrapper around liblustre (no Lustre client)
       - Motivation: merge the two systems by implementing Hadoop's FileSystem
         interface with a Java wrapper, so that MapReduce can run without a
         Lustre client
       - This approach reached an impasse
     • (2) Use the Lustre client
       - Design
       - Improvement
  11. Platform design & improvement
     • The Java wrapper reached an impasse
     • JNI calls into liblustre.so fail:
       - Java's JNI mis-links any function whose name matches a system call
         (e.g. mount, read, write)
       - Calling the static library (liblustre.a) from a C program compiled into
         a standalone executable works fine
     • Other liblustre problems:
       - The Lustre wiki does not recommend using liblustre
       - If it is used, liblustre.a should be preferred over liblustre.so
       - liblustre is sensitive to the gcc version
  12. Platform design & improvement
     • Platform design (1): advantages
     • Advantages for each task (with Lustre)
       - Decentralized I/O
       - Lustre supports parallel writes
       - Lustre is a general-purpose file system
       - Well suited to non-splittable jobs
  13. Platform design & improvement
     • Platform design (2): modules
  14. Platform design & improvement
     • Platform design (3): read/write
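One common way to realize the "use the Lustre client" design is to mount Lustre on every node and let Hadoop treat it as a local POSIX file system. The sketch below shows that idea with Hadoop's Configuration and FileSystem APIs; the mount point /mnt/lustre and the chosen property values are assumptions for illustration, not necessarily the exact settings used in this work (those are in the white paper).

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: access a mounted Lustre file system through Hadoop's FileSystem API.
// Assumes the Lustre client is mounted at /mnt/lustre on every node; the mount
// point and property values are illustrative only.
public class LustreMountSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // With the Lustre client mounted, Hadoop can treat it as a local (POSIX) file system.
        conf.set("fs.default.name", "file:///");                    // pre-0.20 style key
        conf.set("mapred.system.dir", "/mnt/lustre/hadoop/system"); // shared dir visible to all nodes

        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/mnt/lustre/hadoop/demo.txt");
        try (FSDataOutputStream os = fs.create(out, true)) {
            os.writeUTF("hello from Hadoop on a Lustre mount");
        }
        System.out.println("exists: " + fs.exists(out));
    }
}
```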
  15. Platform design & improvement: platform improvement 1
     • Use hardlinks instead of the HTTP shuffle before a ReduceTask starts [1]
       - Decentralizes network bandwidth usage
       - Delays the ReduceTask's actual read/write
     • Use Lustre block (stripe) location info to distribute tasks [2]
       - "Move the computation to its data"
       - Saves network bandwidth
       - A Java child thread runs a shell command to fetch the location info
         (details in the white paper; see the sketch below)
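A minimal sketch of the "child thread runs a shell command" idea from the last bullet: spawn lfs getstripe for an input file and collect the OST indexes that hold its objects. The parsing logic and class names are assumptions; the exact output format depends on the Lustre version, and the actual implementation is described in the white paper.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: fetch Lustre stripe/object location info for a file by shelling out
// to "lfs getstripe". The parsing below is a rough assumption; the output
// format varies across Lustre versions.
public class LustreLocationSketch {
    public static List<Integer> ostIndexesFor(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("lfs", "getstripe", path).redirectErrorStream(true).start();
        List<Integer> osts = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                // Object lines typically begin with an OST index in the first column.
                String[] cols = line.trim().split("\\s+");
                if (cols.length > 0 && cols[0].matches("\\d+")) {
                    osts.add(Integer.parseInt(cols[0]));
                }
            }
        }
        p.waitFor();
        return osts;
    }

    public static void main(String[] args) throws Exception {
        // Path is illustrative; any file on a Lustre mount would do.
        System.out.println(ostIndexesFor("/mnt/lustre/hadoop/input/part-0000"));
    }
}
```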
  16. Platform design & improvement: platform improvement 2
     • Add the location info as a scheduling parameter
     • Use hardlinks to delay the shuffle phase (see the sketch below)
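For illustration only, the hardlink trick can be sketched as linking a finished map output file into a reducer-side directory on the shared Lustre mount, so no data is copied over HTTP. The paths and method names are invented, and the sketch uses the Java 7 java.nio.file API, whereas the 2009-era implementation would more likely have shelled out to ln.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch: replace the HTTP copy of map output with a hardlink on a shared
// Lustre mount. Paths are illustrative. Requires Java 7+ (java.nio.file);
// an equivalent on Java 6 would execute "ln <src> <dst>".
public class HardlinkShuffleSketch {
    public static void linkMapOutput(String mapOutput, String reduceInputDir, String taskId)
            throws IOException {
        Path src = Paths.get(mapOutput);
        Path dst = Paths.get(reduceInputDir, taskId + ".out");
        Files.createDirectories(dst.getParent());
        // Both names now point at the same Lustre objects: no data is copied,
        // and the reducer reads it only when it actually starts.
        Files.createLink(dst, src);
    }

    public static void main(String[] args) throws IOException {
        linkMapOutput("/mnt/lustre/hadoop/map_42/file.out",
                      "/mnt/lustre/hadoop/reduce_7/input", "map_42");
    }
}
```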
  17. Outline
     • Early research & analysis
     • Platform design & improvement
     • Test cases & test process design
     • Result analysis
     • Related work
     • White paper & conclusion
  18. 18. <ul><li>Test cases design (Two kinds apps) </li></ul><ul><li>Apps of statistics (search, log processing, etc.) </li></ul><ul><ul><li>Little grained tasks (job  tasks) </li></ul></ul><ul><ul><li>MapTask intermediate result is small </li></ul></ul><ul><li>Apps of no-good splitable & highly complex </li></ul><ul><ul><li>large grained tasks (job  tasks) </li></ul></ul><ul><ul><li>MapTask intermediate result is big </li></ul></ul><ul><ul><li>Each task is highly compute </li></ul></ul><ul><ul><li>Each task needs big I/O </li></ul></ul>Test cases, test process design
  19. Test cases & test process design
     • Applications that are highly complex and split poorly
       - Intermediate results are big
       - Each task is compute-intensive
  20. Test cases & test process design
     • Test cases:
     • Statistics application: WordCount
       - Reads text files and counts the occurrences of each word. The output is
         one line per word, with the word and its count separated by a tab.
     • Poorly splittable application: BigMapOutput
       - A map/reduce program that works on a very big non-splittable file; the
         map and reduce tasks simply read the input and write out the same data
         unchanged.
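For reference, a WordCount mapper and reducer look roughly like the sketch below (written against the org.apache.hadoop.mapreduce API). The tests used Hadoop's bundled example, so treat this as an illustration rather than the exact code that was run.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a WordCount mapper and reducer; illustrative, not the exact code
// used in the tests.
public class WordCountSketch {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));    // word \t count
        }
    }
}
```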
  21. Test cases & test process design
     • Test results to collect:
       - Overall execution time
       - Time of each phase: Map Read (the most time-consuming phase on Lustre),
         local read/write plus HTTP shuffle, and Reduce Write
  22. Test cases & test process design
     • Test scenarios:
       - No optimization
       - With hardlinks
       - With hardlinks and location info
       - Lustre tuning: stripe size = ?, stripe count = ? (see the sketch below)
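Stripe tuning itself is done with the lfs tool; below is a hedged sketch of applying a stripe size and count to the job's input directory before loading the test data. The flag spellings (-s for stripe size, -c for stripe count) match 1.x-era lfs (newer releases use -S for size), and the values are placeholders, since the slide leaves them as open questions.

```java
import java.io.IOException;

// Sketch: set striping on the job input directory before writing test data,
// so new files inherit the chosen stripe size/count. Flags and values are
// illustrative; check "lfs setstripe --help" for the installed version.
public class StripeTuningSketch {
    public static void setStripe(String dir, String stripeSize, int stripeCount)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(
                "lfs", "setstripe", "-s", stripeSize, "-c", Integer.toString(stripeCount), dir)
                .inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("lfs setstripe failed for " + dir);
        }
    }

    public static void main(String[] args) throws Exception {
        // Values are examples only; the test matrix left them as open questions.
        setStripe("/mnt/lustre/hadoop/input", "4M", 4);
    }
}
```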
  23. Outline
     • Early research & analysis
     • Platform design & improvement
     • Test cases & test process design
     • Result analysis
     • Related work (GFS-like redundancy)
     • White paper & conclusion
  24. Result analysis
     • Result analysis
     • Conclusion
  25. Result analysis
     • Test 1: WordCount with one big file
       - Processes one big text file (6 GB)
       - Block size = 32 MB
       - Reduce tasks = 0.95 (or 1.75) x 2 x 7 = 13, following Hadoop's usual
         sizing rule of factor x nodes x reduce slots per node
  26. Result analysis
     • Test 2: WordCount with many small files
       - Processes a large number of small files (10,000)
       - Reduce tasks = 0.95 x 2 x 7 = 13
  27. Result analysis
     • Test 3: BigMapOutput with one big file
       - Result 1
       - Result 2 (with fresh memory, i.e. caches cleared before the run)
       - Result 3 (with mapred.local.dir set back to its default value)
  28. Result analysis
     • Test 4: BigMapOutput with hardlinks
     • Test 5: BigMapOutput with hardlinks & location information
  29. Result analysis
     • Test 6: BigMapOutput, Map Read phase
     • Conclusion
       - Map Read is the most time-consuming part ★
  30. Result analysis
     • Conclusion 1: Hadoop + HDFS
       - I/O phases: Map Read, local read/write, HTTP shuffle, Reduce Write
  31. Result analysis
     • Conclusion 2: Hadoop + Lustre
       - HDFS block location info fits Hadoop's task-distribution algorithm better
         than Lustre stripe info does
       - This is what makes Map Read the most time-consuming phase
  32. Result analysis
     • Digging into the logs: per-task execution time (map read)
  33. Outline
     • Early research & analysis
     • Platform design & improvement
     • Test cases & test process design
     • Result analysis
     • Related work (GFS-like redundancy)
     • White paper & conclusion
  34. Related work
     • GFS-like redundancy design
     • Motivation:
       - Lustre has no native redundancy
       - RAID is expensive
       - It would be a new feature for Lustre
     • Code analysis & HLD design
     • Design challenges
  35. Related work
     • Lustre's inode layout
       - inode (*, *, *, ..., {obj1, obj2, obj3, ...})
  36. Related work
     • Raw HLD thinking 1
     • Modified inode structure
       - Make the inode contain three object arrays:
         inode (*, *, *, ..., {obj11, obj12, obj13, ...}, {obj21, obj22, obj23, ...},
         {obj31, obj32, obj33, ...})
     • File read: the client reads the first object group; if it is damaged, it
       falls back to the second, and so on (see the sketch below)
     • File write: the client writes the three object arrays one after another
     • File consistency: handled entirely by the client
     • Streaming replication
       - Client -> {OST, ...} -> {OST, ...} -> {OST, ...}, like a write chain
       - Some of the work shifts to the OSTs
     • Automatic recovery
       - If one object group of a file is damaged, the system automatically
         rebuilds it from the other replicas
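Purely as a conceptual illustration of the read-with-fallback step above (this is not Lustre code; every type here is invented), the client-side logic over the three object groups could look like:

```java
import java.io.IOException;
import java.util.List;

// Conceptual sketch of the proposed 3-replica read path: try object group 1,
// fall back to group 2, then group 3. All interfaces and types are invented
// for illustration; real Lustre inode and OST I/O works nothing like this.
public class ReplicaReadSketch {

    /** Stand-in for "read the extent [offset, offset+len) from one object group". */
    interface ObjectGroup {
        byte[] read(long offset, int len) throws IOException;
    }

    static byte[] readWithFallback(List<ObjectGroup> groups, long offset, int len)
            throws IOException {
        IOException last = null;
        for (ObjectGroup g : groups) {           // groups = the 3 arrays in the modified inode
            try {
                return g.read(offset, len);      // first healthy replica wins
            } catch (IOException damaged) {
                last = damaged;                  // group damaged: fall back to the next one
            }
        }
        throw new IOException("all replica groups damaged", last);
    }
}
```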
  37. Related work
     • Raw HLD thinking 2: challenges
       - No rack information (where should replicas be placed?)
       - An OST cannot write to another OST (how to build a streaming replication chain?)
       - File consistency
       - Lustre is changing fast (pools, etc.)
       - The internship is time-limited
  38. 38. <ul><li>Early research, analysis </li></ul><ul><li>Platform design & improvement </li></ul><ul><li>Test cases, test process design </li></ul><ul><li>Result analysis </li></ul><ul><li>Related jobs </li></ul><ul><li>White paper & conclusion </li></ul>Outline
  39. White paper & conclusion
     • White paper: hadoop_lustre_wp_v0.4.2.pdf
     • Thanks to:
  40. White paper & conclusion
     • Many thanks to our mentors and manager
       - Mentors: WangDi, HuangHua
       - Manager: Nirant Puntambekar
  41. White paper & conclusion
     • Q&A
     • Email: yjluan@gmail.com
