benchmarks, and did not model the diverse mix of jobs running on Yahoo!'s production cluster well. GridMix3 is the latest enhancement over previous versions; it accepts a timestamped stream (trace) of job descriptions. For each job in the trace, the GridMix3 client submits a corresponding synthetic job to the target cluster at the same rate as in the original trace, and thereby tries to model the diverse mix of jobs running in Hadoop's environment.

We propose a new way to benchmark the performance of a Hadoop cluster: by learning the characteristic jobs being run on it. We use unsupervised learning to cluster the real production workload, and determine the centroids and densities of these job clusters. The centroid jobs reflect the representative jobs among the real workload. We use the K-means clustering algorithm for this purpose. Running these representative jobs, and computing a weighted sum of their performance, where the weights correspond to the sizes of the job clusters, gives us a measure of Hadoop cluster performance within a small margin of error of the measure computed by GridMix3.

Table 1. Task History Statistics

  Metric                  Description
  HDFS Bytes              Bytes read/written to HDFS
  File Bytes              Bytes read/written to local disk
  Combine Records Ratio   Ratio of Combiner output records to input records
  Shuffle Bytes           The number of bytes shuffled after the map phase
                          (reduce tasks only)

Table 2. Job History Statistics

  Metric                  Description
  Number of Maps          Number of tasks in the Map phase
  Number of Reduces       Number of tasks in the Reduce phase
  Input format            The format of the input file, which is parsed to
                          generate the key-value pairs. This is a
                          categorical feature
  Output format           The format of the output file. This is a
                          categorical feature
  Type of output          The compression used for the output of the
  compression             application. This is a categorical feature
  Map Phase slots         The number of map slots occupied by each map
                          task in the Hadoop cluster
  Reduce Phase slots      The number of reduce slots occupied by each
                          reduce task in the Hadoop cluster

3. Hadoop Job Features

Our input data set comprised metrics generated by the Hadoop MapReduce framework, collected by the JobTracker while each job is executing. After a job finishes, these metrics are stored on the Hadoop cluster in job history files in the form of per-job and per-task counters. Job counters keep track of application progress in both the map and reduce stages of processing. By default, the Hadoop MapReduce framework emits a number of standard counters, such as Map input records and Map output records, which we use as features in our dataset. See Table 1 and Table 2 for more information on the features.

The various parameters used to measure the performance of a job are divided into two levels: job level and task level. A MapReduce job usually splits the input data set into independent chunks which are processed by the map and reduce tasks in parallel. For the task-level parameters, we use statistical descriptors such as the mean, standard deviation, and range of the counters over all tasks in the map and reduce phases respectively. We also include job-specific configuration features, such as the type of data compression used and the formats of the input and output files, as job features in our input data set.

We use a non-correlated feature set from the counters, since we did not want to give increased weight to any of the features. Also, we did not use features which depend on the Hadoop cluster hardware configuration, such as the time taken to execute the job. Such cluster-specific features would differ when the same MapReduce job is executed on different Hadoop clusters; considering absolute CPU or wall-clock time as a job feature would therefore not allow us to correlate jobs executed on different clusters.

4. Clustering Methodology

We used the statistical package R for clustering. R is an open source language and environment for statistical computing and graphics.

We implemented the traditional K-means algorithm for our clustering purpose. We estimated the K in K-means using the within-groups sum of squares. To find the initial seeds, we randomly picked sqrt(n) jobs from the entire collection and ran Hierarchical Agglomerative Clustering on them. We then used these results as the initial seeds for the K-means algorithm.

4.1. Data Collection

The job metrics we collected spanned 24 hours from one of Yahoo!'s production Hadoop clusters, comprising
11,686 jobs. We did not take into account jobs which failed on the Hadoop cluster. By the nature of production Hadoop jobs at Yahoo!, these jobs are executed repeatedly with a specific periodicity, on different data partitions, as they become available. We parsed the JobTracker logs to obtain the feature vector set mentioned in Tables 1 and 2, using a modified version of Hadoop Vaidya. Vaidya performs a post-execution analysis of MapReduce jobs by parsing and collecting execution statistics through the job history and job configuration files. We generated our initial input using Vaidya, before normalizing it for clustering.

4.2. Pre-processing

Prior to clustering, we rescaled the variables for comparability, standardizing the data to have a mean of 0 and a standard deviation of 1. Since we use Euclidean distance to compute per-feature similarity between different jobs, the clusters would otherwise be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing all features to the same mean and standard deviation removes this bias. Numeric variables were standardized, and nominal attributes were converted into binary variables. We made scatter plots and calculated covariances to check dependencies between the features, in order to remove heavily correlated variables that tend to artificially bias the clusters toward natural groupings of those variables. For example, we observed that the format of a job's input files was strongly related to the format of its output files.

4.3. Estimating the number of clusters

The heuristic we used to estimate the number of clusters in our dataset is to take the number of clusters at which we see the largest drop in the within-groups sum of squares. We iterate over multiple numbers of clusters and observe the within-groups sum of squares for each. A plot of the within-groups sum of squares against the number of clusters extracted helped us determine the appropriate number of clusters. The plot is shown in Figure 1; we looked for a bend in it. There is very little variation in the within-groups sum of squares after 8 clusters, which suggests there are at most 8 clusters in the data. For two arbitrarily chosen coordinates (i.e. features), the centroids of these clusters are shown in Figure 2.

Figure 1. Estimating the number of clusters: a plot of the within-groups sum of squares against the number of clusters.

4.4. K-Means Algorithm

We estimated the initial seeds by Hierarchical Agglomerative Clustering and performed the K-means algorithm with the chosen seeds. We used Euclidean distance as the distance metric: the total distance between two jobs is the square root of the sum of the squared individual feature distances.

4.5. Results

We obtained 8 clusters from the K-means clustering algorithm. Table 4 and Table 5 describe the centroids of these clusters. The task-level features listed are obtained by taking the mean of each feature metric over all the Map or Reduce tasks of these jobs. These centroids are the characteristic jobs running on the Hadoop cluster. Table 3 and Figure 3 show the densities of these clusters. Figure 2 shows the distances between the centers after they have been scaled to two dimensions; we used Multidimensional Scaling (MDS) to map our high-dimensional centers into two-dimensional vectors while preserving the relevant distances. These 8 clusters differ significantly in the number of map and reduce tasks and in the bytes being read/written and processed on HDFS. Most of the jobs on the Hadoop cluster (90%) can be modeled as having close to 79 Map tasks and 28 Reduce tasks. A few jobs (approx. 0.003%) have as many as 2487 Map tasks. Most jobs tend to have significantly fewer Reduce tasks than Map tasks. These centroid jobs, each run as many times as the size of its cluster, represent the jobs being run on the Hadoop cluster within a small margin of error.
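The standardization step of Section 4.2 can be sketched as follows. This is an illustrative Python sketch (the study itself used R), and the toy feature matrix is a made-up placeholder, not values from our dataset.

```python
import math

def standardize(rows):
    """Z-score each column: subtract the column mean and divide by the
    (population) standard deviation, so every feature has mean 0, sd 1."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
           for j in range(d)]  # 'or 1.0' guards constant columns
    return [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]

# Toy job-feature matrix: [num_maps, num_reduces] for three jobs.
jobs = [[456.0, 20.0], [863.0, 29.0], [79.0, 28.0]]
z = standardize(jobs)
```

After this rescaling, no single counter dominates the Euclidean distances used by the clustering, which is exactly the bias Section 4.2 sets out to remove.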
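Sections 4.3 and 4.4 combine three ingredients: Hierarchical Agglomerative Clustering over a sqrt(n)-sized sample to pick seeds, Lloyd's K-means iteration from those seeds, and the within-groups sum of squares (WSS) for choosing K. The following is a minimal self-contained sketch in Python rather than R, run on two synthetic, well-separated "job" blobs that are purely illustrative; the deterministic sample stands in for the paper's random draw.

```python
import math
import random

def dist(a, b):
    # Euclidean distance over an (already standardized) feature vector.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hac_seeds(sample, k):
    """Naive average-linkage agglomerative clustering on a small sample;
    merge until k clusters remain and return their means as seeds."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = (sum(dist(a, b) for a in clusters[i] for b in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    dim = len(sample[0])
    return [[sum(p[t] for p in c) / len(c) for t in range(dim)]
            for c in clusters]

def kmeans(points, seeds, iters=100):
    """Lloyd's algorithm from the given seeds.
    Returns (centroids, labels, within-groups sum of squares)."""
    cents = [list(c) for c in seeds]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(cents)), key=lambda c: dist(p, cents[c]))
                  for p in points]
        new = []
        for c in range(len(cents)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append([sum(v) / len(members) for v in zip(*members)]
                       if members else cents[c])  # keep empty-cluster seed
        if new == cents:
            break
        cents = new
    wss = sum(dist(p, cents[lab]) ** 2 for p, lab in zip(points, labels))
    return cents, labels, wss

# Two synthetic, well-separated blobs in a 2-feature space.
random.seed(0)
pts = ([[random.gauss(0, 0.3), random.gauss(0, 0.3)] for _ in range(30)] +
       [[random.gauss(5, 0.3), random.gauss(5, 0.3)] for _ in range(30)])
# Deterministic stand-in for the paper's random sqrt(n)-sized sample.
sample = pts[::8]
wss_by_k = {k: kmeans(pts, hac_seeds(sample, k))[2] for k in (1, 2, 3)}
# The elbow heuristic keeps the k after the largest drop in WSS.
```

On this toy data the drop in WSS from k=1 to k=2 dwarfs the drop from k=2 to k=3, so the elbow heuristic selects 2 clusters, mirroring how Figure 1 flattens after 8 clusters on the real workload.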
Table 4. Centroids of Job Clusters (Means of Features in Map Phase over all Map Tasks)

  Cluster  Number of  Map HDFS    Map HDFS       Map File    Map File       Map Input  Map Output
  Number   Maps       Bytes Read  Bytes Written  Bytes Read  Bytes Written  Records    Records
  1        456        63 MB       0.22 MB        80.8 MB     166 MB         214,116    312,037
  2        863        478.69 MB   84 B           721.5 MB    1403 MB        387,661    387,661
  3        572        100.5 MB    0.2 MB         71 MB       65.19 MB       936,015    1,600,945
  4        191        90 MB       85 B           25.78 MB    44.26 MB       1,040,112  1,946,530
  5        1080       86.6 MB     85 B           81.3 MB     81.24 MB       595,183    512,653
  6        79         44.82 MB    42 MB          22 MB       39.414 MB      334,144    813,425
  7        2487       122 MB      84 B           226 MB      319 MB         958,604    1,155,927
  8        316        169.6 MB    86 B           210 MB      434.25 MB      513,999    513,913

Table 5. Centroids of Job Clusters (Means of Features in Reduce Phase over all Reduce Tasks)

  Cluster  Number of  Reduce HDFS  Reduce HDFS    Reduce File  Reduce File    Reduce Input  Reduce Output
  Number   Reduces    Bytes Read   Bytes Written  Bytes Read   Bytes Written  Records       Records
  1        20         335 KB       9.64 GB        1.2886 GB    1.2 GB         16,330,242    9,815,480
  2        29         6 B          2.09 GB        5.421 GB     5.42 GB        2,395,949     2,395,955
  3        73         419 B        650.2 MB       390 MB       384.5 MB       5,815,554     3,226,467
  4        71         494 KB       65 MB          70.5 MB      70.5 MB        5,087,889     1,630,275
  5        62         330 B        586.6 MB       759 MB       759 MB         10,919,233    5,754,906
  6        28         57 KB        54.85 MB       17.7 MB      17.6 MB        505,995       235,897
  7        67.5       31.6 MB      2.04 GB        3.2 GB       3.2 GB         154,633,619   19,670,261
  8        14.5       6 B          1.9 GB         5.57 GB      5.57 GB        14,113,424    14,113,430

Table 3. Size of Job Clusters

  Cluster Number  Number of jobs in cluster
  1               205
  2               21
  3               195
  4               1143
  5               82
  6               9991
  7               35
  8               14

5. Comparison of Production cluster jobs with GridMix3 jobs

GridMix3 attempts to emulate the workload of real jobs in the Hadoop cluster by generating a synthetic workload. Our objective in this study was to validate whether the synthetic workload generated by GridMix3 accurately represents the real workload it tries to emulate. We used a three-hour trace from the daily run of the production cluster for this purpose; the trace consisted of 1203 jobs. We processed the same three-hour trace using GridMix3, which generated a synthetic workload, and executed it in a controlled environment. We used only quantitative features in our analysis, since GridMix3 does not emulate categorical features such as the compression type. We parsed the job counter logs using Rumen. Rumen processes job history logs to produce job and detailed task information in the form of a JSON file.

We obtained 5 clusters, with similar distributions and centers, in both the actual production jobs and the GridMix jobs. This indicates both that GridMix3 models the actual production jobs effectively and that our clustering study has been effective in understanding the clustering of these jobs.
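The benchmark measure proposed in Section 2 (run each centroid job and weight its performance by its cluster's share of the workload) reduces to a weighted average. The sketch below uses the cluster sizes from Table 3; the runtimes are invented placeholder numbers, not measurements from this study, and the paper does not prescribe a particular performance metric.

```python
def benchmark_score(perf, cluster_sizes):
    """Weighted sum of centroid-job performance, with each representative
    job weighted by the fraction of the workload its cluster covers."""
    total = sum(cluster_sizes.values())
    return sum(perf[c] * cluster_sizes[c] / total for c in perf)

# Cluster sizes from Table 3 (note they sum to the 11,686 jobs in the trace).
sizes = {1: 205, 2: 21, 3: 195, 4: 1143, 5: 82, 6: 9991, 7: 35, 8: 14}
# Hypothetical per-centroid-job runtimes in seconds (illustration only).
runtimes = {c: 100.0 for c in sizes}
score = benchmark_score(runtimes, sizes)
```

Because cluster 6 holds 9991 of the 11,686 jobs, its centroid job dominates the score, which is exactly the intent: the benchmark emphasizes the jobs that dominate the real workload.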
Figure 2. The cluster centers scaled to two dimensions. Coordinates 1 and 2 are the two dimensions of the centroids after MDS; the number above each point is the cluster number.

Figure 3. The logarithm of the number of jobs in each cluster.

6. Summary and Future Work

We obtained the characteristics of the jobs running on Yahoo!'s Hadoop production clusters. We identified 8 groups of jobs, and found their centroids, to obtain our characteristic jobs, and their densities, to determine the weight that should be given to each representative job. In this way, we present a new methodology for developing a performance benchmark for Hadoop: instead of emulating all the jobs in the workload of a real Hadoop cluster, we emulate only the representative jobs, denoted by the centroids of the job clusters. We also did a comparative analysis of actual production jobs and the equivalent synthetic GridMix jobs. We observe a remarkable similarity in the clusters, with the centroids of both coinciding and a similar distribution of clusters. This suggests that GridMix is effective in emulating the mix of jobs being run on the Hadoop clusters.

We see several other uses for the clustering methodology we have applied. We intend to extend this work to learn how the jobs change over time, by studying the distribution of these job clusters across workloads from various time periods. In addition, we would like to compare the jobs being run on the production cluster to the ad-hoc jobs being run on the clusters used for data mining and modelling; this analysis would help us identify whether there is any underlying pattern to the jobs being run on the experimental clusters. In our future study, we also plan to expand the set of features being considered, such as the language used to develop the actual map and reduce tasks, the use of metadata server(s), and the number of input files.

7. Acknowledgments

We would like to extend our thanks to Ryota Egashira, Rajesh Balamohan, and Srigurunath Chakravarthi for their help with data collection. We would also like to thank Lihong Li for his input on the clustering methodology.

References

Apache Software Foundation. Hadoop Vaidya. http://hadoop.apache.org/common/docs/current/vaidya.html.

Apache Software Foundation. Rumen - A Tool to Extract Job Characterization Data from Job Tracker Logs. http://issues.apache.org/jira/browse/MAPREDUCE-751.

Apache Software Foundation. Welcome to Apache Hadoop! http://hadoop.apache.org.

J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.

C. Douglas and H. Tang. GridMix3 - Emulating Production Workload for Apache Hadoop. http://developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, USA, 2001.

The R Foundation for Statistical Computing. The R Project for Statistical Computing. http://www.r-project.org.