Characterization of Hadoop Jobs using Unsupervised Learning



2nd IEEE International Conference on Cloud Computing Technology and Science

Characterization of Hadoop Jobs using Unsupervised Learning

Sonali Aggarwal (Stanford University), Shashank Phadke (Yahoo! Inc.), Milind Bhandarkar (Yahoo! Inc.)

978-0-7695-4302-4/10 $26.00 © 2010 IEEE. DOI 10.1109/CloudCom.2010.20

Abstract

The MapReduce programming paradigm [4] and its open source implementation, Apache Hadoop [3], are increasingly being used for data-intensive applications in cloud computing environments. An understanding of the characteristics of workloads running in MapReduce environments benefits both the cloud service providers and their users. This work characterizes Hadoop jobs running on production clusters at Yahoo! using unsupervised learning [6]. Unsupervised clustering techniques have been applied to many important problems, ranging from social network analysis to biomedical research. We use these techniques to cluster Hadoop MapReduce jobs that are similar in characteristics. The Hadoop framework generates metrics for every MapReduce job, such as the number of map and reduce tasks and the number of bytes read/written to the local file system and HDFS. We use these metrics, together with job configuration features such as the format of the input/output files and the type of compression used, to find similarity among Hadoop jobs. We study the centroids and densities of these job clusters. We also perform a comparative analysis of the real production workload and the workload emulated by our benchmark tool, GridMix, by comparing the job clusters of both workloads.

Keywords: performance benchmark, workload characterization

1. Introduction

Apache Hadoop [3] is an open-source framework for reliable, scalable, distributed computing. It primarily consists of HDFS, a distributed file system that provides high-throughput access to application data, and the MapReduce programming framework for processing large data sets. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!. Yahoo! has the world's largest Hadoop production clusters, with some clusters as large as 4000 machines. With the increasing size of Hadoop clusters and the jobs being run on them, keeping track of the performance of Hadoop clusters is critical. In this paper, we propose a methodology to characterize jobs running on Hadoop clusters, which can be used to measure the performance of the Hadoop environment.

The paper is organized as follows. We discuss the background of Hadoop's existing performance benchmark, GridMix, and its enhancements in Section 2. In Section 3, we describe our data set and its features. In Section 4, we present our clustering methodology and introduce the characteristic jobs running on the Hadoop cluster at Yahoo!. In Section 5, we present a comparative analysis of a 3-hour trace of production jobs and benchmark jobs. In Section 6, we conclude with a summary and some directions for future research.

2. Background and Objectives

Hadoop clusters at Yahoo! run several thousand jobs every day, and each of these jobs executes hundreds of map and reduce tasks and processes several terabytes of data. With such large-scale usage, it is essential to track metrics on throughput, utilization, and, most importantly, perceived job latencies on these clusters. Performance evaluation and benchmarking of Hadoop software is critical, not only to optimize the full range of job execution on these clusters, but also to reproduce load-related bottlenecks.

The previous work on Hadoop performance benchmarking, named GridMix, has undergone several enhancements over the years. GridMix aims to represent real application workloads on Hadoop clusters, and has been used to verify and quantify optimizations across different Hadoop releases. The first two versions of GridMix defined representative Hadoop workloads through a fixed set of micro-benchmarks, and did not model the diverse mix of jobs running on Yahoo!'s production cluster well. GridMix3 [5] is the latest enhancement over previous versions; it accepts a timestamped stream (trace) of job descriptions. For each job in the trace, the GridMix3 client submits a corresponding synthetic job to the target cluster at the same rate as in the original trace, and thus tries to model the diverse mix of jobs running in Hadoop's environment.

We propose a new way to benchmark the performance of a Hadoop cluster, by learning the characteristic jobs being run on it. We use unsupervised learning to cluster the real production workload, and determine the centroids and densities of these job clusters. The centroid jobs reflect the representative jobs among the real workload. We use the K-means clustering algorithm for this purpose. Running these representative jobs, and computing a weighted sum of their performance, where the weights correspond to the sizes of the job clusters, gives us a measure of Hadoop cluster performance within a small margin of error of the measure computed by GridMix3.

3. Hadoop Job Features

Our input data set comprises metrics generated by the Hadoop MapReduce framework, collected by the JobTracker while the job is executing. After job execution ends, these metrics are stored on the Hadoop cluster in job history files in the form of per-job and per-task counters. Job counters keep track of application progress in both the map and reduce stages of processing. By default, the Hadoop MapReduce framework emits a number of standard counters, such as Map input records and Map output records, which we use as features in our data set. Please see Table 1 and Table 2 for more information on the features.

Table 1. Task History Statistics

  Metric                 Description
  HDFS Bytes             Bytes read/written to HDFS
  File Bytes             Bytes read/written to local disk
  Combine Records Ratio  Ratio of combiner output records to input records
  Shuffle Bytes          The number of bytes shuffled after the map phase
                         (reduce tasks only)

Table 2. Job History Statistics

  Metric                 Description
  Number of Maps         Number of tasks in the map phase
  Number of Reduces      Number of tasks in the reduce phase
  Input format           Format of the input file, which is parsed to
                         generate the key-value pairs (categorical feature)
  Output format          Format of the output file (categorical feature)
  Type of output         Compression for the output of the application
  compression            (categorical feature)
  Map Phase slots        The number of map slots occupied by each map task
                         in the Hadoop cluster
  Reduce Phase slots     The number of reduce slots occupied by each reduce
                         task in the Hadoop cluster

The various parameters used to measure the performance of a job are divided into two levels: job level and task level. A MapReduce job usually splits the input data set into independent chunks which are processed by the map and reduce tasks in parallel. For the task-level parameters, we use statistical descriptors such as the mean, standard deviation, and range of the counters over all tasks in the map and reduce phases respectively. We also include job-specific configuration features, such as the type of data compression used and the formats of input and output files, as job features.

We use a non-correlated feature set from the counters, since we did not want to give increased weight to any of the features. Also, we did not use features which depend on the Hadoop cluster hardware configuration, such as the time taken to execute the job. Cluster-specific features would differ when the same MapReduce job is executed on different Hadoop clusters; thus, considering absolute CPU or wall time as a job feature would not allow us to correlate jobs executed on different clusters.

4. Clustering Methodology

We used the statistical package R [7] for clustering. R is an open-source language and environment for statistical computing and graphics. We implemented the traditional K-means algorithm for our clustering purposes. We estimated the K in K-means using the within-groups sum of squares. To find the initial seeds, we randomly picked sqrt(n) jobs from the entire collection and ran Hierarchical Agglomerative Clustering on them; we then used these results as the initial seeds for the K-means algorithm.

4.1. Data Collection

The job metrics we collected spanned 24 hours on one of Yahoo!'s production Hadoop clusters, comprising 11,686 jobs. We did not take into account jobs which failed on the Hadoop cluster. By the nature of production Hadoop jobs at Yahoo!, these jobs are executed repeatedly with specific periodicity, on different data partitions, as they become available. We parsed the JobTracker logs to obtain the feature vector set mentioned in Tables 1 and 2, using a modified version of Hadoop Vaidya [1]. Vaidya performs a post-execution analysis of MapReduce jobs by parsing and collecting execution statistics through job history and job configuration files. We generated our initial input using Vaidya, before normalizing it for clustering.

4.2. Pre-processing

Prior to clustering, we rescaled the variables for comparability: we standardized the data to have a mean of 0 and a standard deviation of 1. Since we use Euclidean distance to compute per-feature similarity between different jobs, the clusters would otherwise be influenced strongly by the magnitudes of the variables, especially by outliers; normalizing all features to have the same mean and standard deviation removes this bias. Numeric variables were standardized, and nominal attributes were converted into binary. We made scatter plots and calculated covariance to check dependencies between the features, in order to remove heavily correlated variables that tend to artificially bias the clusters toward natural groupings of those variables. For example, we observed that the format of a job's input files was strongly related to the format of its output files.

4.3. Estimating the Number of Clusters

The heuristic we used to estimate the number of clusters in our data set is to take the number of clusters where we see the largest drop in the within-groups sum of squares. We iterate through multiple numbers of clusters and observe the within-groups sum of squares. A plot of the within-groups sum of squares by number of clusters extracted helped us determine the appropriate number of clusters. The plot is shown in Figure 1; we look for a bend in the plot. There is very little variation in the within-groups sum of squares after 8 clusters, which suggests there are at most 8 clusters in the data. For two arbitrarily chosen coordinates (i.e., features), the centroids of these clusters are shown in Figure 2.

[Figure 1. Estimating the number of clusters: a plot of the within-groups sum of squares against the number of clusters.]

4.4. K-Means Algorithm

We then estimated the initial seeds by Hierarchical Agglomerative Clustering and performed the K-means algorithm with the chosen seeds. We used Euclidean distance as the distance metric: the total distance between two jobs is the square root of the sum of the squared individual feature distances.

4.5. Results

We obtained 8 clusters from the K-means clustering algorithm. Table 4 and Table 5 describe the centroids of these clusters. The task-level features listed are obtained by taking the mean of each feature metric over all the map or reduce tasks of these jobs. These centroids are the characteristic jobs running on the Hadoop cluster. Table 3 and Figure 3 show the densities of these clusters. Figure 2 shows the distances between the centers after they have been scaled to two dimensions; we used multidimensional scaling (MDS) to map our high-dimensional centers into two-dimensional vectors, preserving all the relevant distances. These 8 clusters differ significantly in the number of map and reduce tasks and in the bytes being read/written and processed on HDFS. Most of the jobs on the Hadoop cluster (about 90%) can be modeled as having close to 79 map tasks and 28 reduce tasks. A few jobs (approx. 0.3%) have as many as 2487 map tasks. Most jobs tend to have significantly fewer reduce tasks than map tasks. These centroid jobs, each run as many times as the size of its cluster, represent the jobs being run on the Hadoop clusters within a small margin of error.
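The pipeline of Section 4 (standardize features, seed with Hierarchical Agglomerative Clustering, then run K-means with Euclidean distance and compare the within-groups sum of squares across values of K) can be sketched in plain Python. The paper's analysis was done in R, so this is only an illustrative re-implementation on a made-up six-job data set of (map tasks, reduce tasks) pairs, not the authors' code.

```python
import math

def standardize(rows):
    """Rescale every feature to mean 0 and standard deviation 1 (Section 4.2)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]

def dist(a, b):
    """Euclidean distance: square root of the sum of squared feature distances."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    return [sum(c) / len(c) for c in zip(*points)]

def hac_seeds(points, k):
    """Agglomerative clustering down to k groups; the group means become the
    K-means seeds. On real data the paper runs this step on a random sample
    of sqrt(n) jobs rather than the full collection."""
    groups = [[p] for p in points]
    while len(groups) > k:
        pairs = [(dist(mean_point(groups[i]), mean_point(groups[j])), i, j)
                 for i in range(len(groups)) for j in range(i + 1, len(groups))]
        _, i, j = min(pairs)          # merge the two closest groups
        groups[i] += groups.pop(j)
    return [mean_point(g) for g in groups]

def kmeans(rows, seeds, iters=20):
    """Lloyd's K-means: assign each job to its nearest center, recompute."""
    centers = [list(s) for s in seeds]
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for r in rows:
            nearest = min(range(len(centers)), key=lambda i: dist(r, centers[i]))
            buckets[nearest].append(r)
        centers = [mean_point(b) if b else c for b, c in zip(buckets, centers)]
    return centers, buckets

def wss(centers, buckets):
    """Within-groups sum of squares -- the quantity plotted to choose K."""
    return sum(dist(r, c) ** 2 for c, b in zip(centers, buckets) for r in b)

# Hypothetical data set: four "typical" jobs (~79 maps, ~28 reduces, like the
# dominant cluster in this paper) and two rare very large jobs.
jobs = [[79, 28], [81, 27], [78, 30], [80, 29], [2480, 70], [2495, 65]]
data = standardize(jobs)
for k in (1, 2, 3):
    centers, buckets = kmeans(data, hac_seeds(data, k))
    print(k, round(wss(centers, buckets), 3))
```

With this toy data the printed within-groups sum of squares drops sharply from K=1 to K=2 and barely changes afterwards, which is exactly the "bend" the paper's elbow heuristic looks for in Figure 1.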
Table 4. Centroids of Job Clusters (Means of Features in Map Phase over all Map Tasks)

  Cluster  Number   Map HDFS    Map HDFS       Map File    Map File       Map Input  Map Output
           of Maps  Bytes Read  Bytes Written  Bytes Read  Bytes Written  Records    Records
  1          456    63 MB       0.22 MB        80.8 MB     166 MB           214,116     312,037
  2          863    478.69 MB   84 B           721.5 MB    1403 MB          387,661     387,661
  3          572    100.5 MB    0.2 MB         71 MB       65.19 MB         936,015   1,600,945
  4          191    90 MB       85 B           25.78 MB    44.26 MB       1,040,112   1,946,530
  5         1080    86.6 MB     85 B           81.3 MB     81.24 MB         595,183     512,653
  6           79    44.82 MB    42 MB          22 MB       39.414 MB        334,144     813,425
  7         2487    122 MB      84 B           226 MB      319 MB           958,604   1,155,927
  8          316    169.6 MB    86 B           210 MB      434.25 MB        513,999     513,913

Table 5. Centroids of Job Clusters (Means of Features in Reduce Phase over all Reduce Tasks)

  Cluster  Number of  Reduce HDFS  Reduce HDFS    Reduce File  Reduce File    Reduce Input  Reduce Output
           Reduces    Bytes Read   Bytes Written  Bytes Read   Bytes Written  Records       Records
  1          20       335 KB       9.64 GB        1.2886 GB    1.2 GB           16,330,242      9,815,480
  2          29       6 B          2.09 GB        5.421 GB     5.42 GB           2,395,949      2,395,955
  3          73       419 B        650.2 MB       390 MB       384.5 MB          5,815,554      3,226,467
  4          71       494 KB       65 MB          70.5 MB      70.5 MB           5,087,889      1,630,275
  5          62       330 B        586.6 MB       759 MB       759 MB           10,919,233      5,754,906
  6          28       57 KB        54.85 MB       17.7 MB      17.6 MB             505,995        235,897
  7          67.5     31.6 MB      2.04 GB        3.2 GB       3.2 GB           154,633,619     19,670,261
  8          14.5     6 B          1.9 GB         5.57 GB      5.57 GB           14,113,424     14,113,430

Table 3. Size of Job Clusters

  Cluster  Number of jobs in cluster
  1           205
  2            21
  3           195
  4          1143
  5            82
  6          9991
  7            35
  8            14

[Figure 2. The centers of each cluster scaled into two dimensions. Coordinates 1 and 2 are the two-dimensional coordinates of the cluster centroids after MDS; the number above each point is the cluster number.]

[Figure 3. The logarithm of the number of jobs in each cluster.]

5. Comparison of Production Cluster Jobs with GridMix3 Jobs

GridMix3 attempts to emulate the workload of real jobs on the Hadoop cluster by generating a synthetic workload. Our objective in this study was to validate whether the synthetic workload generated by GridMix3 accurately represents the real workload it tries to emulate. We used a three-hour trace from the daily run of the production cluster for this purpose; the trace consisted of 1203 jobs. We processed the same three-hour trace using GridMix3, which generated a synthetic workload, and executed it in a controlled environment. We used only quantitative features in our analysis, since GridMix3 does not emulate categorical features such as compression type. We parsed the job counter logs using Rumen [2], which processes job history logs to produce job and detailed task information in the form of a JSON file.

We obtained 5 clusters, with similar distributions and centers, in both the actual production jobs and the GridMix jobs. This reflects that GridMix3 models the actual production jobs effectively, and that our clustering study has been effective in understanding the clustering of these jobs.

6. Summary and Future Work

We obtained the characteristics of the jobs running on Yahoo!'s Hadoop production clusters. We identified 8 groups of jobs, and found their centroids, to obtain our characteristic jobs, and their densities, to determine the weight that should be given to each representative job. In this way, we present a new methodology for developing a performance benchmark for Hadoop: instead of emulating all the jobs in the workload of a real Hadoop cluster, we emulate only the representative jobs, denoted by the centroids of the job clusters. We also performed a comparative analysis of actual production jobs and the equivalent synthetic GridMix jobs. We obtained a remarkable similarity in the clusters, with the centroids of both coinciding and a similar distribution of clusters. This suggests that GridMix is effective in emulating the mix of jobs being run on the Hadoop clusters.

We see several other uses of the clustering methodology we have applied. We intend to extend the work to learn how the jobs change over time, by studying the distribution of these job clusters across time periods. In addition, we would like to compare the jobs being run on the production cluster to the ad-hoc jobs being run on the clusters used for data mining and modelling; this analysis would help us identify whether there exists any underlying pattern in the jobs being run on the experimental clusters. We also plan to expand the set of features being considered, such as the language used to develop the actual map and reduce tasks, the use of metadata server(s), and the number of input files.

7. Acknowledgments

We would like to extend our thanks to Ryota Egashira, Rajesh Balamohan, and Srigurunath Chakravarthi for their help with data collection. We would also like to thank Lihong Li for his input on clustering methodology.

References

[1] Apache Software Foundation. Hadoop Vaidya. current/vaidya.html.
[2] Apache Software Foundation. Rumen - A Tool to Extract Job Characterization Data from Job Tracker Logs. MAPREDUCE-751.
[3] Apache Software Foundation. Welcome to Apache Hadoop!
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[5] C. Douglas and H. Tang. GridMix3: Emulating Production Workload for Apache Hadoop. http://developer. gridmix3_emulating_production/.
[6] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, USA, 2001.
[7] The R Foundation for Statistical Computing. The R Project for Statistical Computing. http://www.r-project.org.