Performance Issues on Hadoop Clusters


Published on

The MapReduce model has become an important parallel processing model for large- scale data-intensive applications like data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely applied to support cluster computing jobs requiring low response time. The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most map tasks can quickly access their local data. Network delays due to data movement during running time have been ignored in the recent Hadoop research. Unfortunately, both the homogeneity and data locality assumptions in Hadoop are optimistic at best and unachievable at worst, potentially introducing performance problems in virtualized data centers. We show in this dissertation that ignoring the data-locality issue in heterogeneous cluster computing environments can noticeably reduce the performance of Hadoop. Without considering the network delays, the performance of Hadoop clusters would be significatly downgraded. In this dissertation, we address the problem of how to place data across nodes in a way that each node has a balanced data processing load. Apart from the data placement issue, we also design a prefetching and predictive scheduling mechanism to help Hadoop in loading data from local or remote disks into main memory. To avoid network congestions, we propose a preshuffling algorithm to preprocess intermediate data between the map and reduce stages, thereby increasing the throughput of Hadoop clusters. Given a data-intensive application running on a Hadoop cluster, our data placement, prefetching, and preshuffling schemes adaptively balance the tasks and amount of data to achieve improved data-processing performance. Experimental results on real data-intensive applications show that our design can noticeably improve the performance of Hadoop clusters. In summary, this dissertation describes three practical approaches to improving the performance of Hadoop clusters, and explores the idea of integrating prefetching and preshuffling in the native Hadoop system.

Published in: Technology
1 Comment
No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Comment: three colors.
  • 1. 將要執行的 MapReduce 程式複製到 Master 與每一臺 Worker 機器中。 2.Master 決定 Map 程式與 Reduce 程式,分別由哪些 Worker 機器執行。 3. 將所有的資料區塊,分配到執行 Map 程式的 Worker 機器中進行 Map 。 4. 將 Map 後的結果存入 Worker 機器的本地磁碟。 5. 執行 Reduce 程式的 Worker 機器,遠端讀取每一份 Map 結果,進行彙整與排序,同時執行 Reduce 程式。 6. 將使用者需要的運算結果輸出。 Hadoop MapReduce Single master node, many worker nodes Client submits a job to master node Master splits each job into tasks (map/reduce), and assigns tasks to worker nodes
  • Hadoop MapReduce Single master node, many worker nodes Client submits a job to master node Master splits each job into tasks (map/reduce), and assigns tasks to worker nodes Hadoop Distributed File System (HDFS) Single name node, many data nodes Files stored as large, fixed-size (e.g. 64MB) blocks HDFS typically holds map input and reduce output
  • Comment: no “want to”
  • 90
  • 强调
  • Comments: differences between shuffling overhead and data transfer time.
  • Comment: what is computing ratio.
  • Comment 1: make step descriptions short. Comment 2: animations map to steps.
  • blocks
  • More io feature
  • Higher resolution
  • Comment: Titles of all the slides must use the same format Up to 33.1% abd 10.2% with average of 17.3% and 7.1%
  • Up to 33.1% abd 10.2% with average of 17.3% and 7.1%
  • Up to 33.1% abd 10.2% with average of 17.3% and 7.1%
  • 对于第二部分,最好有一个任务流和数据流的滑动窗口,而这个调度窗口大小是性能的关键,这样才能给你的预测调度提供理论基础。它的目标是减少每个任务数据同步或任务之间的同步间隙,实现流水线的作业处理。 你的第一部分比较清晰,第二部分比较乱, slide31 的目标和途径比较明确, slide41 的结论也清晰,但这两者之间的线索不突出你的贡献部分 第二部分的那个动画也不够典型。
  • which data block should be loaded and where is it? how to synchronize computation process with prefetching data. When the data is loaded to the cache ahead the requirement for a long time, it wastes the resource. On the contrary, if the data arrives later than the computation, it is useless. how to optimize the prefetching rate. What is the best percentage of data pre-loading of each machine? The MapReduce can maximal benefit from the best prefetching rate while minimizing the prefetching overhead. When to prefetch is to control how early to trigger the prefetching action of which node. In our previous research, we observed that one node process the same block size data in a fix time. Before the last block finishes, the next block starts loading data. In this design, we estimate the execution time of each block in each node. The execution time for the same application in various node may be different. we collect and adapt the average the running time for a block in each node. What to prefech is to determine which block load to the memory. Initially, the predictive scheduler assign two tasks to the each node. when the preloading work was trigger, it can access the data according to the required information from second task. How much to prefech is decide the amount of prepared data preload to the memory. According to the scheduler, when one task in running, one or more tasks are waiting in the node, the meta data is loaded in the memory of this node. When the prefetching action is triggered, the preloading work automatically load data form disk to memory. In view of large size of HDFS blocks, the number should be not so aggressive. Here we set the number of task is two, the prefetching block is one.
  • Comment: add animations.
  • Comment: add animations.
  • Figure 4.3: Comparing the excution time of Grep in native and prefetching Mapreduce Grep 9.5% 1G 8.5% 2G WordCount 8.9% 8.1% 2G
  • Above single large file 9% Small files 24% Increase the number of computing node we can get the same improvement through psp
  • 第三部分动画比较典型,但 slide44-slide53 的关系还需要提炼,最好能有所加强。
  • One map task for each block of the input file Applies user-defined map function to each record in the block Record = <key, value> User-defined number of reduce tasks Each reduce task is assigned a set of record groups Record group = all records with same key For each group, apply user-defined reduce function to the record values in that group Reduce tasks read from every map task Each read returns the record groups for that reduce task
  • To reduce the amount of transfer data, the map worker apply combiner functions to their local outputs before storing or transferring the intermediate data so as to reduce the traffic utilization. The combining strategy can minimize the amount of data that needs to be transferred to the reducers and speed up the execution time of the overall job. To reduce the communication between nodes, intermediate results are pipelined between fixed mappers and reducers. The reducer do not need to ask for each map node in the cluster. Moreover, the reducers begin processing data as soon as it is produced by mappers. 57 To use the overlapping between data movement and map computation. Our experiment shows that shuffle is always much longer than map computation time. Especially, when the network is saturated, overlapping time will disappear consequently. Secondly, reducing latency in the network also improve throughput but also increase the possibility of network conflict. Pipeline idea can be lend to hide the overlapping. To limit the communication between map and reduce. In MapReduce’s parallel execu- tion, a node’s execution includes map computation, waiting for shuffle, and reduce compu- tation, and the critical path is the slowest node. To improve performance, the critical path has to be shortened. However, the natural overlap of one node’s shuffle wait with another node’s execution does not shorten the critical path. Instead, the operations within a single node need to be overlapped. The reduce check all the avail data from all the map node in the cluster. If some reduce node and map node can group with particular key-value pairs. the network communicate costs can be shorten.
  • The execution time of native Hadoop for 1GB data WordCount Star. Figure 5.4.2 illustrates the average response time for 1GB data WordCount of native Hadoop. The Reduce does not launch the progress until the all the map tasks finish. Figure 5.1(b) illustrates the average response time for 1GB data WordCount of preshuffling Hadoop. The Reduce task receive map outputs a little earlier and can begin sorting earlier so as to reduce the time to required for the final merge can achieve a better cluster utilization by 14%. In the native Hadoop, Hadoop can finish all the map tasks earlier than preshuffling hadoop because preshuffling allow reduce tasks to begin executing more quickly; the reduce tasks prepare for longing required resources, causing the map phase take longer.
  • Figure 5.2 reports the improvement of Preshuffling compared with native Hadoop for WordCount. With the increase of block size, Preshuffling can achieve better improvement. when the block size is less than 16MB, the Preshuffling algorithm did not gain a significant performance improvement. The same results can be found in Figure 5.3. When the block size increases to 128MB or more, the improvement also arrive to a fix number. Preshuffling Hadoop can gain average 14% for WordCount and 12.5% for Sort. Figure 5.2: The improvement of Preshuffling with different block size for WordCount. Figure 5.2: The improvement of Preshuffling with different block size for Sort.
  • title
  • Average 14% improvement
  • Hadoop Archive Hadoop Archive 或者 HAR ,是一个高效地将小文件放入 HDFS 块中的文件存档工具,它能够将多个小文件打包成一个 HAR 文件,这样在减少 namenode 内存使用的同时,仍然允许对文件进行透明的访问。 对某个目录 /foo/bar 下的所有小文件存档成 /outputdir/ zoo.har : hadoop archive -archiveName zoo.har -p /foo/bar /outputdir 当然,也可以指定 HAR 的大小 ( 使用 -Dhar.block.size) 。 HAR 是在 Hadoop file system 之上的一个文件系统,因此所有 fs shell 命令对 HAR 文件均可用,只不过是文件路径格式不一样, HAR 的访问路径可以是以下两种格式: Sequence file sequence file 由一系列的二进制 key/value 组成,如果为 key 小文件名, value 为文件内容,则可以将大批小文件合并成一个大文件。 Hadoop-0.21.0 中提供了 SequenceFile ,包括 Writer , Reader 和 SequenceFileSorter 类进行写,读和排序操作。如果 hadoop 版本低于 0.21.0 的版本,实现方法可参见 [3] 。 CombineFileInputFormat CombineFileInputFormat 是一种新的 inputformat ,用于将多个文件合并成一个单独的 split ,另外,它会考虑数据的存储位置。
  • Performance Issues on Hadoop Clusters

    1. 1. Performance Issues on Hadoop Clusters Jiong Xie Advisor: Dr. Xiao Qin Committee Members: Dr. Cheryl Seals Dr. Dean Hendrix University Reader: Dr. Fa Foster Dai05/08/12 1
    2. 2. Overview of My Research Data locality Data movement Data shuffling Data Placement Prefetching Data Reduce networkon Heterogeneous from Disk congest Cluster to Memory [To Be Submitted] [HCW 10] [Submit to IPDPS]05/08/12 2
    3. 3. Data-Intensive Applications05/08/12 3
    4. 4. Data-Intensive Applications (cont.)05/08/12 4
    5. 5. Background• MapReduce programming model is growing in popularity• Hadoop is used by Yahoo, Facebook, Amazon.05/08/12 5
    6. 6. Hadoop Overview --Mapreduce Running System (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150)05/08/12 6
    7. 7. Hadoop Distributed File System ( 7
    8. 8. Motivations• MapReduce provides – Automatic parallelization & distribution – Fault tolerance – I/O scheduling – Monitoring & status updates05/08/12 8
    9. 9. Existing Hadoop Clusters• Observation 1: Cluster nodes are dedicated – Data locality issues – Data transfer time• Observation 2: The number of nodes is increased d Scalability issues e Shuffling overhead goes up05/08/12 9
    10. 10. Proposed Solutions Input Map Reduce Map Output Map Reduce Map Reduce Map P1: Data placement P2: Prefetching P3: Preshuffling05/08/12 10
    11. 11. Solutions P1: Data placement Offline, distributed data, heterogeneous node P2: Prefetching P3: Preshuffling Online, data preloading Intermediate data movement, reducing traffic05/08/12 11
    12. 12. Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters05/08/12 12
    13. 13. Motivational ExampleNode A 1 task/min(fast)Node B 2x slower(slow)Node C 3x slower(slowest) Time (min) 05/08/12 13
    14. 14. The Native StrategyNode A 6 tasksNode B 3 tasksNode C 2 tasks Time (min) Loading Transferring Processing 05/08/12 14
    15. 15. Our Solution --Reducing data transfer timeNode A 6 tasksNode A’Node B’ 3 tasksNode C’ 2 tasks Time (min) Loading Transferring Processing 05/08/12 15
    16. 16. Challenges• Does distribution strategy depend on applications?• Initialization of data distribution• The data skew problems – New data arrival – Data deletion – Data updating – New joining nodes05/08/12 16
    17. 17. Measure Computing Ratios• Computing ratio• Fast machines process large data sets 1 task/min Node A Node B 2x slower Node C 3x slower Time05/08/12 17
    18. 18. Measuring Computing Ratios1. Run an application, collect response time2. Set ratio of a node offering the shortest response time as 13. Normalize ratios of other nodes4. Calculate the least common multiple of these ratios5. Determine the amount of data processed by each node Node Response Ratio # of File Speed time(s) Fragments Node A 10 1 6 Fastest Node B 20 2 3 Average Node C 30 3 2 Slowest05/08/12 18
    19. 19. Initialize Data Distribution Portions 3:2:1• Input files split into 64MB Namenode File1 blocks 1 2 4 5 3 6• Round-Robin data distribution algorithm A B C 7 a 8 c b 9 Datanodes05/08/12 19
    20. 20. Data Redistribution 4 3 2 11.Get network topology, Namenode ratio, and utilization L1 A C2.Build and sort two lists: under-utilized node list  L1 L2 B over-utilized node list  L23. Select the source and A B C Portion destination node from 3:2:1 the lists. 1 4 7 a 64.Transfer data 2 5 8 b 9 c5.Repeat step 3, 4 until the 3 list is empty.05/08/12 20
    21. 21. Experimental Environment Node CPU Model CPU(Hz) L1 Cache(KB) Node A Intel core 2 Duo 2*1G=2G 204 Node B Intel Celeron 2.8G 256 Node C Intel Pentium 3 1.2G 256 Node D Intel Pentium 3 1.2G 256 Node E Intel Pentium 3 1.2G 256 Five nodes in a Hadoop heterogeneous cluster05/08/12 21
    22. 22. Benckmarks• Grep: a tool searching for a regular expression in a text file• WordCount: a program used to count words in a text file• Sort: a program used to list the inputs in sorted order.05/08/12 22
    23. 23. Response Time of Grep and Wordcount in Each Node Application dependenceComputing ratio is Data size independence05/08/12 23
    24. 24. Computing Ratio for Two Applications Computing Node Ratios for Grep Ratios for Wordcount Node A 1 1 Node B 2 2 Node C 3.3 5 Node D 3.3 5 Node E 3.3 5 Computing ratio of the five nodes with respective of Grep and Wordcount applications05/08/12 24
    25. 25. Six Data Placement Decisions05/08/12 25
    26. 26. Impact of data placement on performance of Grep05/08/12 26
    27. 27. Impact of data placement on performance of WordCount05/08/12 27
    28. 28. Summary of Data PlacementP1: Data Placement Strategy• Motivation: Fast machines process large data sets• Problem: Data locality issue in heterogeneous clusters• Contributions: Distribute data according to computing capability – Measure computing ratio – Initialize data placement – Redistribution05/08/12 28
    29. 29. Predictive Scheduling and Prefetching for Hadoop clusters05/08/12 29
    30. 30. Prefetching• Goal: Improving performance• Approach – Best effort to guarantee data locality. – Keeping data close to computing nodes – Reducing the CPU stall time05/08/12 30
    31. 31. Challenges• What to prefetch?• How to prefetch?• What is the size of blocks to be prefetched?05/08/12 31
    32. 32. Dataflow in Hadoop 1.Submit job t ea k rtb tas 2.Schedule ea xt 5.h Ne 6. 3.Read Input map Local reduce FS Block 1 HDFS Block 2 Local map FS reduce 7.Read new file 4. Run map05/08/12 32
    33. 33. Dataflow in Hadoop 1.Submit job t 2.Schedule ea k rtb + more task tas ea + meta xt 5.h Ne data 6. 3.Read Input map Local reduce FS Block 1 HDFS Block 2 Local map FS reduce 5.1.Read new file 4. Run map05/08/12 33
    34. 34. Prefetching Processing 6 7 805/08/12 34
    35. 35. Software Architecture05/08/12 35
    36. 36. Grep Performance 9.5% 1G 8.5% 2G05/08/12 36
    37. 37. WordCount Performance 8.9% 1G 8.1% 2G05/08/12 37
    38. 38. Large/Small file in a node 9.1% Grep 18% Grep 8.3% WordCount 24% WordCount05/08/12 38
    39. 39. Experiment Setting05/08/12 39
    40. 40. Large/Small file in cluster05/08/12 40
    41. 41. SummaryP2: Predictive Scheduler and Prefetching• Goal: Moving data before task assigns• Problem: Synchronization task and data• Contributions: Preloading the required data early than the task assigned – Predictive scheduler – Prefetching mechanism – Worker thread05/08/12 41
    42. 42. Adaptive Preshuffling in Hadoop clusters05/08/12 42
    43. 43. Preshuffling• Observation 1: Too much data move from Map worker to Reduce worker – Solution1: Map nodes apply pre-shuffling functions to their local output• Observation 2: No reduce can start until a map is complete. – Solution2: Intermediate data is pipelined between mappers and reducers.05/08/12 43
    44. 44. Preshuffling• Goal : Minimize data shuffle during Reduce• Approach – Pipeline – Overlap between map and data movement – Group map and reduce• Challenges – Synchronize map and reduce – Data locality05/08/12 44
    45. 45. Dataflow in Hadoop 1.Submit job 2.Schedule 2. at Ne e rtb k w tas tas ea xt 5.h k Ne 6.3.Read Input 5.Write data map Local FS reduce Block 1 HTTP GET HDFS HDFS Block 2 Local map FS reduce 3. Request data 4. Run map 4. Send data 05/08/12 45
    46. 46. PreShuffle map reduce map reduce Data request05/08/12 46
    47. 47. In-memory buffer05/08/12 47
    48. 48. Pipelining – A new design map reduce Block 1HDFS HDFS Block 2 map reduce05/08/12 48
    49. 49. WordCount Performance 230 seconds vs 180 seconds05/08/12 49
    50. 50. WordCount Performance05/08/12 50
    51. 51. Sort Performace05/08/12 51
    52. 52. SummaryP3: Preshuffling• Goal: Minimize data shuffling during the Reduce• Problem: task distribution and synchronization• Contributions: preshuffling agorithm – Push data instead of tradition pull – In-memory buffer – Pipeline05/08/12 52
    53. 53. Conclusion Input P1: Data placement Offline, distributed data, heterogeneous node Map Map Map Map Map P2: Prefetching Online, data preloading, single node P3: Preshuffling Intermediate data movement, reducing Reduce Reduce Reduce traffic Output05/08/12 53
    54. 54. Future Work• Extend Pipelining – Implement the pipelining design• Small files issue – Har file – Sequence file – CombineFileInputFormat• Extend Data placement05/08/12 54
    55. 55. Thanks! And Questions?55
    56. 56. Run Time affected by Network Condition Experiment result conducted by Yixian Yang05/08/12 56
    57. 57. Traffic Volume affected by Network Condition Experiment result conducted by Yixian Yang05/08/12 57
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.