
Map Reduce in the Clouds (http://salsahpc.indiana.edu/mapreduceroles4azure/)

Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
Pervasive Technology Institute, Indiana University.
http://salsahpc.indiana.edu/mapreduceroles4azure/

Published in: Technology

  1. MapReduce in the Clouds for Science
     Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox
     {tgunarat, taklwu, xqiu, gcf}@indiana.edu
     CloudCom 2010
     Nov 30 – Dec 3, 2010
  2. Introduction
     Cloud computing combined with cloud infrastructure services
     A very viable alternative for scientists
     MapReduce frameworks
     Scalability
     Excellent fault tolerance features
     Ease of use
     Several options for using MapReduce in cloud environments
     MapReduce as a service
     Setting up a MapReduce cluster on cloud instances
     Specialized cloud MapReduce runtimes
     Take advantage of cloud infrastructure services
  3. Introduction
     Analyze the performance and viability of performing two types of bioinformatics computations using MapReduce in cloud environments
     Sequence alignment
     Sequence assembly
     AzureMapReduce
     Provides a decentralized, on-demand MapReduce framework
     Leverages the high-latency, eventually consistent, yet highly scalable Azure infrastructure services
     Sustained performance of clouds
  4. Platforms
     Apache Hadoop
     On bare metal
     On EC2
     Amazon Web Services
     Elastic MapReduce
     Microsoft Azure
     AzureMapReduce
  5. Challenges for MapReduce in the clouds
     Data storage
     Reliability
     Master node
     Metadata storage
     Performance consistency
     Communication consistency and scalability
     CPU performance
     Choosing suitable instance types
     Logging
  6. AzureMapReduce
     Built using Azure cloud services
     Distributed, highly scalable & highly available services
     Minimal management / maintenance overhead
     Reduced footprint
     Co-exists with the eventual consistency & high latency of cloud services
     Decentralized control
  7. AzureMapReduce Features
     Ability to dynamically scale up/down
     Familiar programming model
     Fault tolerance
     Easy testing and deployment
     Combiner step
     Web-based monitoring console
  8. AzureMapReduce Architecture
  9. AzureMapReduce Architecture
  10. AzureMapReduce Architecture
  11. AzureMapReduce Architecture
  12. AzureMapReduce Architecture
  13. AzureMapReduce Architecture
  14. AzureMapReduce Architecture
  15. AzureMapReduce Architecture
     The Sort & Reduce phases of a reduce task start when:
     all the map tasks are finished, and
     the reduce task has finished downloading all of its intermediate data products
     There is no guarantee of when all the intermediate data will appear in the task tables
     Each map task stores the number of intermediate data products it generated for each reduce task
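The start condition on slide 15 can be sketched in a few lines. This is a minimal illustration, not AzureMapReduce's actual code: the `MapTaskRecord` type, its fields, and both function names are hypothetical stand-ins for rows in the task tables. The idea is that a reduce task sums the per-map product counts to learn how many intermediate products to expect, so it need not rely on eventually consistent table listings being complete.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MapTaskRecord:
    """Hypothetical row describing one map task (not the real table schema)."""
    finished: bool = False
    # reduce task id -> number of intermediate products this map task wrote for it
    products_per_reducer: Dict[int, int] = field(default_factory=dict)


def expected_products(map_tasks: List[MapTaskRecord], reduce_id: int) -> int:
    """Total intermediate products the given reduce task should wait for."""
    return sum(t.products_per_reducer.get(reduce_id, 0) for t in map_tasks)


def can_start_reduce(map_tasks: List[MapTaskRecord], reduce_id: int,
                     downloaded: int) -> bool:
    """True once all map tasks are finished and every expected product is downloaded."""
    if not all(t.finished for t in map_tasks):
        return False
    return downloaded >= expected_products(map_tasks, reduce_id)
```

A reduce worker would poll `can_start_reduce` while it downloads, and begin sorting only once it returns `True`.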
  16. Performance
     Parallel efficiency
     AzureMapReduce
     Azure small instances – Single core (1.7 GB memory)
     Hadoop Bare Metal – IBM iDataplex cluster
     Two quad-core CPUs (Xeon 2.33 GHz), 16 GB memory, Gigabit Ethernet per node
     EMR & Hadoop on EC2
     Cap3 – HighCPU Extra Large instances (8 cores, 20 CU, 7 GB memory per instance)
     SWG – Extra Large instances (4 cores, 8 CU, 15 GB memory per instance)
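Slide 16 names parallel efficiency as the metric but the transcript does not define it. The sketch below assumes the common definition, efficiency = T(1) / (p × T(p)), where T(1) is the run time on one core and T(p) the run time on p cores. The function name and parameters are illustrative only.

```python
def parallel_efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency under the assumed definition T(1) / (p * T(p)).

    t1: run time on a single core
    tp: run time on p cores
    p:  number of cores used for the parallel run
    """
    if p <= 0 or tp <= 0:
        raise ValueError("p and tp must be positive")
    return t1 / (p * tp)
```

For example, a job that takes 120 s on one core and 20 s on 8 cores has an efficiency of 120 / (8 × 20) = 0.75 under this definition.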
  17. Sequence Alignment
     Smith-Waterman-Gotoh (SWG) to calculate all-pairs dissimilarity
     OutFile1
     OutFile2
     OutFile3
     OutFile4
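One way to spread an all-pairs dissimilarity computation over map tasks is to split the pair matrix into blocks and give each block to one task. The sketch below assumes such a block decomposition and assumes the dissimilarity is symmetric, so only blocks on or above the diagonal are computed; the slide transcript confirms neither detail. `toy_dissimilarity` is a deliberately simple stand-in and is not Smith-Waterman-Gotoh. All names here are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple


def block_tasks(n_sequences: int, block_size: int) -> List[Tuple[int, int]]:
    """List (row_block, col_block) map tasks covering the upper triangle."""
    n_blocks = (n_sequences + block_size - 1) // block_size
    return list(combinations_with_replacement(range(n_blocks), 2))


def toy_dissimilarity(a: str, b: str) -> float:
    """Stand-in measure (NOT SWG): 1 - fraction of matching positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 1.0 - matches / longest


def map_block(sequences: List[str], block_size: int,
              task: Tuple[int, int]) -> Dict[Tuple[int, int], float]:
    """Map task: compute every pair (i, j) that falls inside one block."""
    row_block, col_block = task
    n = len(sequences)
    rows = range(row_block * block_size, min((row_block + 1) * block_size, n))
    cols = range(col_block * block_size, min((col_block + 1) * block_size, n))
    return {(i, j): toy_dissimilarity(sequences[i], sequences[j])
            for i in rows for j in cols}
```

With 4 sequences and a block size of 2, `block_tasks(4, 2)` yields three tasks instead of four, because the block below the diagonal mirrors the one above it.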
  18. Sequence Alignment Performance
  19. Sequence Assembly
     Assemble sequences using Cap3
     Pleasingly parallel
     Map only
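A map-only, pleasingly parallel job applies one function to each input independently and has no reduce step. The sketch below shows that shape with a local thread pool. It is an illustration only, not how Hadoop, EMR, or AzureMapReduce schedule tasks. `run_map_only` and `map_fn` are hypothetical names; in the Cap3 case, `map_fn` would invoke the assembler on one input file.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_map_only(inputs: List[str], map_fn: Callable[[str], str],
                 workers: int = 4) -> Dict[str, str]:
    """Apply map_fn to every input independently; no shuffle or reduce step."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each input with its result
        return dict(zip(inputs, pool.map(map_fn, inputs)))
```

Because no task depends on another, the inputs can be processed in any order and on any number of workers.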
  20. Sequence Assembly Performance
  21. Sustained performance of clouds
  22. Conclusion
     MapReduce on cloud infrastructure provides an easy-to-use, economical option for performing loosely coupled scientific computations.
     Cloud infrastructure services can successfully be leveraged to build distributed parallel systems with acceptable performance and consistency.
     For non-I/O-intensive workloads, cloud performance was sustained well.
  23. Thanks
     http://salsahpc.indiana.edu/azuremapreduce/
  24. Acknowledgements
     All the SALSA group members for their support
     Microsoft for their technical support on Azure
     This work was made possible by the compute use grant provided by Amazon Web Services, titled "Proof of concepts linking FutureGrid users to AWS".
     This work is partially funded by the Microsoft "CRMC" grant and NIH Grant Number RC2HG005806-02.
