The MapReduce model has become an important parallel processing model for large- scale data-intensive applications like data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely applied to support cluster computing jobs requiring low response time. The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most map tasks can quickly access their local data. Network delays due to data movement during running time have been ignored in the recent Hadoop research. Unfortunately, both the homogeneity and data locality assumptions in Hadoop are optimistic at best and unachievable at worst, potentially introducing performance problems in virtualized data centers. We show in this dissertation that ignoring the data-locality issue in heterogeneous cluster computing environments can noticeably reduce the performance of Hadoop. Without considering the network delays, the performance of Hadoop clusters would be significatly downgraded. In this dissertation, we address the problem of how to place data across nodes in a way that each node has a balanced data processing load. Apart from the data placement issue, we also design a prefetching and predictive scheduling mechanism to help Hadoop in loading data from local or remote disks into main memory. To avoid network congestions, we propose a preshuffling algorithm to preprocess intermediate data between the map and reduce stages, thereby increasing the throughput of Hadoop clusters. Given a data-intensive application running on a Hadoop cluster, our data placement, prefetching, and preshuffling schemes adaptively balance the tasks and amount of data to achieve improved data-processing performance. Experimental results on real data-intensive applications show that our design can noticeably improve the performance of Hadoop clusters. In summary, this dissertation describes three practical approaches to improving the performance of Hadoop clusters, and explores the idea of integrating prefetching and preshuffling in the native Hadoop system.
Clipping is a handy way to collect important slides you want to go back to later.