ClusterW 2012 vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration

2012 IEEE International Conference on Cluster Computing Workshops


  1. vHadoop: A Scalable Hadoop Virtual Cluster Platform for MapReduce-Based Parallel Machine Learning with Performance Consideration
     Kejiang Ye, Xiaohong Jiang, Yanzhang He, Xiang Li, Haiming Yan, Peng Huang
     CCNT Lab, College of Computer Science, Zhejiang University, China
     Cluster 2012 Workshop: PQoSCom’12, Sep. 28, 2012, Beijing, China
  2. Outline
     • Motivations
     • vHadoop Platform
        • System Architecture & Flow
        • Platform Design & Implementation
     • Performance Analysis of vHadoop
        • Static Performance Analysis
        • Dynamic Performance Analysis
     • Parallel Machine Learning on vHadoop
        • MapReduce-based Clustering Algorithms
        • Clustering on “Synthetic Control Chart Time Series” Data Set
        • Visualizing Sample Clustering
     • Related Work
     • Conclusion & Future Work
  4. Motivations
     • Big data processing is becoming increasingly important due to the continuous growth of the amount of data generated in fields such as particle physics, human genomics, and earth observation.
     • However, the efficiency of processing large-scale data on modern virtual infrastructure, especially on virtualized cloud computing infrastructure, is not clear.
  5. Motivations
     As cloud computing matures, big data processing on virtual infrastructure will become more and more common:
     • Processing big data with high efficiency is a major challenge; the work needs to be executed in parallel on distributed platforms.
     • In the cloud era, resource virtualization is a typical feature, so most tasks will be executed on virtual infrastructure.
     • Virtualization holds many other benefits, such as rapid startup, dynamic configuration, and high scalability.
     • Moving data to computing resources is more expensive than moving computing resources (such as VMs) to data, because transferring large amounts of data incurs high overheads.
  6. Contributions
     • Propose vHadoop, a scalable Hadoop virtual cluster platform for large-scale MapReduce-based parallel data processing with performance consideration.
     • Perform a series of experiments to investigate the static and dynamic performance of vHadoop.
     • Use the vHadoop platform to process several typical parallel clustering tasks, including Canopy, Dirichlet, Fuzzy k-Means, MeanShift, and MinHash, on two datasets.
  8. vHadoop Platform
     • System Architecture & Flow
  9. vHadoop Platform
     • Platform Design & Implementation
        • Virtualization Module
        • Hadoop Module
        • Machine Learning Algorithm Library
        • Nmon Monitor
        • MapReduce Tuner
  11. Performance Analysis of vHadoop
      • Experimental Configuration
         • Hadoop Virtual Cluster Configuration
            • Dell T710 server with two quad-core 64-bit Xeon processors and 32 GB DRAM
            • CentOS 5.6 with kernel version 2.6.18-238.12.1.el5xen in Domain 0, and Xen 3.3.1 as the hypervisor
            • VM (guest OS) with Ubuntu 8.10, 1 VCPU and 1024 MB vMemory
            • Hadoop version 0.20.2
            • Mahout version 0.6
            • All VM images are stored on a separate NFS storage server
         • MapReduce-based Benchmarks
            • Wordcount
            • MRBench
            • TeraSort
            • TestDFSIO
         • Live Migration Benchmark
            • Virt-LM [Huang et al., ICPE’11]
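For readers unfamiliar with the Wordcount benchmark listed above, the following is a minimal single-process sketch of the map, shuffle and reduce steps it exercises. It is not the Hadoop example job and does not touch HDFS; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce in memory over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, not data from the experiments.
    print(word_count(["big data on Hadoop", "Hadoop on Xen"]))
```

On a real cluster the shuffle step moves intermediate pairs between nodes over the network, which is the kind of traffic the slides identify as a bottleneck.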
  12. Performance Analysis of vHadoop
      • Static Performance Analysis
         • Figure panels: Wordcount, MRBench, TeraSort, DFSIO
         • Network communication overheads become the main bottleneck in the cross-domain …
  13. Performance Analysis of vHadoop
      • Dynamic Performance Analysis
         • Live migration of a Hadoop virtual cluster incurs some overhead, especially downtime.
  15. Parallel Machine Learning on vHadoop
      • MapReduce-based Clustering Algorithms
         • Canopy Clustering is a very simple, fast and accurate method for grouping objects into clusters. All objects are represented as points in a multidimensional feature space. Canopy Clustering is often used as an initial step in more rigorous clustering techniques, such as k-Means Clustering.
         • k-Means Clustering is a rather simple but well-known algorithm for grouping objects. All objects need to be represented as a set of numerical features. In addition, the user has to specify the number of groups (referred to as k) to identify.
         • Fuzzy k-Means Clustering is an extension of k-Means, the popular simple clustering technique. While k-Means discovers hard clusters (a point belongs to only one cluster), Fuzzy k-Means is a more statistically formalized method and discovers soft clusters, where a particular point can belong to more than one cluster with a certain probability.
         • Mean Shift Clustering produces arbitrarily shaped clusters depending upon the topology of the data, without a priori knowledge of the number of clusters (as required in k-Means).
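The relationship between these algorithms can be sketched in a few lines of single-process Python. This is a hedged illustration only, not the Mahout MapReduce implementation used on vHadoop: canopy clustering picks rough centers, those centers seed k-Means (hard assignment), and a standard fuzzy c-means membership formula shows what a soft assignment looks like. The thresholds `t1` and `t2`, the fuzziness parameter `m`, all function names and the toy points are assumptions chosen for the example.

```python
import math


def canopy_centers(points, t1, t2):
    """Canopy clustering sketch: t1 is the loose threshold, t2 the tight one (t1 > t2).

    Returns a list of (center, members) pairs. Points within t2 of a center
    are removed from the candidate list; points within t1 join the canopy.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]
        members = [p for p in points if math.dist(center, p) <= t1]
        canopies.append((center, members))
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return canopies


def assign(points, centers):
    """Hard assignment: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]


def kmeans(points, centers, iterations=10):
    """Plain k-Means: k is simply the number of seed centers supplied."""
    centers = list(centers)
    for _ in range(iterations):
        labels = assign(points, centers)
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign(points, centers)


def fuzzy_memberships(point, centers, m=2.0):
    """Soft assignment: degrees of membership that sum to 1 (fuzzy c-means formula)."""
    dists = [math.dist(point, c) for c in centers]
    zeros = [d == 0 for d in dists]
    if any(zeros):  # the point coincides with one or more centers
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]


if __name__ == "__main__":
    # Two hypothetical, well-separated groups of 2-D points.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    seeds = [center for center, _ in canopy_centers(data, t1=3.0, t2=2.0)]
    final_centers, labels = kmeans(data, seeds)
    print(final_centers, labels)
    print(fuzzy_memberships((5, 6), final_centers))
```

Using the canopy centers as seeds removes the need to guess k up front, which is one reason canopy is commonly run as a preprocessing step before k-Means.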
  16. Parallel Machine Learning on vHadoop
      • Clustering on “Synthetic Control Chart Time Series” Data Set
         • 1 namenode + 1 datanode
  17. Parallel Machine Learning on vHadoop
      • Visualizing Sample Clustering
         • Figure panels: Canopy, Dirichlet, Fuzzy k-Means, k-Means, MeanShift, MinHash
  18. Parallel Machine Learning on vHadoop
      • Visualizing Sample Clustering
         • Figure panels: Sample Data, Canopy, Dirichlet, Fuzzy k-Means, MeanShift, k-Means
  20. Related Work
      • Virtualization technology
         • Performance characterization of virtualization, including performance evaluation [Cherkasova et al., USENIX’05; Ye et al., IJNAM’12], performance modeling [Tickoo et al., SIGMETRICS’10; Kundu et al., HPCA’10; Ye et al., HPCC’10], and performance optimization [Menon et al., USENIX’06; Ongaro et al., VEE’08]
         • Server consolidation [Apparao et al., VEE’08], live migration [Voorsluys et al., CloudCom’09]
      • MapReduce technology
         • Performance of Hadoop [Kambatla et al., HotCloud’09]
         • MapReduce in VM [Ibrahim et al., ICPP’11; Zaharia et al., USENIX’08]
      However, these works do not address dynamic performance, i.e., live migration of a Hadoop virtual cluster. Further, they do not address the problem of parallel machine learning on the Hadoop virtual cluster, which is becoming increasingly important in the big data …
  22. Conclusion
      • We proposed vHadoop, a scalable Hadoop virtual cluster platform for parallel machine learning with performance consideration.
      • We investigated both the static and dynamic performance of vHadoop.
      • We also verified the performance and efficiency of running MapReduce-based parallel machine learning applications on the vHadoop platform.
  23. Conclusion
      Experimental results show that:
      • Network I/O and NFS disk I/O are the two main bottlenecks of the vHadoop platform, due to shared resource contention and interference. The poor I/O performance of the virtualization system and the heavy network communication in Hadoop make the network the main performance bottleneck.
      • Performance degrades when the data size or cluster scale increases. The cross-domain distribution of the Hadoop virtual cluster also affects the communication performance of vHadoop.
      • vHadoop can perform live migration of the Hadoop virtual cluster successfully. Although the service is unavailable during the downtime, the Hadoop fault tolerance mechanism will re-run the job or restore from other available backup data.
      • The vHadoop platform is efficient enough to run the MapReduce-based parallel machine learning algorithms on rea…
  24. Future Work
      • Integrate the vHadoop platform into an open source cloud computing system to provide a scalable, on-demand computation service for processing data-intensive (or big data) applications with parallel machine learning algorithms.
  25. Q&A
      Thank you!
