Architecture and Performance of Runtime Environments for Data Intensive Scalable Computing

SC09 Doctoral Symposium, Portland, 11/18/2009
Student: Jaliya Ekanayake
Advisor: Prof. Geoffrey Fox
Community Grids Laboratory, Digital Science Center
Pervasive Technology Institute
Indiana University

Cloud Runtimes for Data/Compute Intensive Applications

Cloud Runtimes
  MapReduce
  Dryad/DryadLINQ
  Sector/Sphere
  Moving computation to data
  Simple communication topologies
    MapReduce
    Directed Acyclic Graphs (DAGs)
  Distributed file systems
  Fault tolerance

Data/Compute Intensive Applications
  Represented as filter pipelines
  Parallelizable filters

Applications using Hadoop and DryadLINQ (1)

CAP3 [1]: Expressed Sequence Tag assembly to reconstruct full-length mRNA
[Figure: input files (FASTA) -> parallel CAP3 instances (DryadLINQ) -> output files]
“Map only” operation in Hadoop
Single “Select” operation in DryadLINQ

[1] X. Huang, A. Madan, “CAP3: A DNA Sequence Assembly Program,” Genome Research, vol. 9, no. 9, pp. 868-877, 1999.
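The "map only" pattern described for CAP3 can be sketched in Python as follows. This is a minimal illustration, not the Hadoop or DryadLINQ implementation from the talk: `assemble` is a hypothetical stand-in that only counts FASTA records instead of running CAP3, and `map_only` and `demo` are names invented for this example. The point it shows is that each input file is processed independently and no reduce step follows.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def assemble(fasta_path: Path) -> Path:
    """Hypothetical stand-in for one CAP3 run: counts FASTA records
    instead of assembling them, and writes one output file per input."""
    records = sum(1 for line in fasta_path.read_text().splitlines()
                  if line.startswith(">"))
    out_path = fasta_path.with_suffix(".out")
    out_path.write_text(f"{fasta_path.name}: {records} sequences\n")
    return out_path


def map_only(inputs, worker, max_workers=4):
    """Apply worker to every input independently; there is no reduce step."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(worker, inputs))


def demo():
    """Create three small FASTA files, process them, return the output text."""
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for i in range(3):
            path = Path(tmp) / f"sample{i}.fasta"
            path.write_text("".join(f">seq{j}\nACGT\n" for j in range(i + 1)))
            files.append(path)
        outputs = map_only(files, assemble)
        return [out.read_text().strip() for out in outputs]
```

Because the tasks share nothing, the same structure maps onto a Hadoop job with zero reducers or a single DryadLINQ "Select" over the list of input files.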

Applications using Hadoop and DryadLINQ (2)

PhyloD [1] project from Microsoft Research
Derive associations between HLA alleles and HIV codons, and between codons themselves
DryadLINQ implementation

[1] Microsoft Computational Biology Web Tools, http://research.microsoft.com/en-us/um/redmond/projects/MSCompBio/

Applications using Hadoop and DryadLINQ (3)

Calculate Pairwise Distances (Smith Waterman Gotoh)
Calculate pairwise distances for a collection of genes (used for clustering, MDS)
Fine grained tasks in MPI
Coarse grained tasks in DryadLINQ
Performed on 768 cores (Tempest Cluster)
125 million distances: 4 hours and 46 minutes
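One way to picture the coarse grained decomposition is to split the distance matrix into blocks and treat each block as one task. The sketch below is an assumption about how such a decomposition could look, not the code used in the talk: `toy_distance` is a placeholder and is not Smith Waterman Gotoh, and all function names are invented for this example. It computes only the blocks on or above the diagonal and fills the rest by symmetry.

```python
from itertools import combinations_with_replacement


def toy_distance(a: str, b: str) -> int:
    """Placeholder distance (NOT Smith Waterman Gotoh):
    positional mismatches plus the difference in length."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def block_ranges(n, block_size):
    """Split indices 0..n-1 into consecutive (start, stop) ranges."""
    return [(s, min(s + block_size, n)) for s in range(0, n, block_size)]


def compute_block(seqs, rows, cols):
    """One coarse grained task: every distance in a rows x cols block."""
    return {(i, j): toy_distance(seqs[i], seqs[j])
            for i in range(*rows) for j in range(*cols)}


def pairwise_distances(seqs, block_size=2):
    """Assemble the full symmetric matrix from upper-triangle blocks."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    ranges = block_ranges(n, block_size)
    # Only blocks on or above the diagonal are computed; symmetry fills the rest.
    for rows, cols in combinations_with_replacement(ranges, 2):
        for (i, j), d in compute_block(seqs, rows, cols).items():
            dist[i][j] = dist[j][i] = d
    return dist
```

Each call to `compute_block` is independent of the others, so in a distributed setting the blocks could be handed to separate workers, with a larger `block_size` giving fewer, coarser tasks.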

Applications using Hadoop and DryadLINQ (4)

High Energy Physics (HEP)
Multi-Dimensional Scaling (MDS)

MapReduce for Iterative Computations

Classic MapReduce runtimes
  Google, Apache Hadoop, Sector/Sphere, DryadLINQ (DAG based)
  Focus on single-step MapReduce computations only
  Intermediate data is stored and accessed via file systems
  Better fault tolerance support
  Higher latencies
Iterative MapReduce computations use new maps/reduces in each iteration
  Fixed data is loaded again and again
  Inefficient for many iterative computations to which the MapReduce technique could be applied
Solution: i-MapReduce

Applications & Different Interconnection Patterns

[Figure labels: Input, map, reduce, iterations, Output, Pij, MPI; Domain of MapReduce and Iterative Extensions]

i-MapReduce

In-memory MapReduce
Distinction between static data and variable data (data flow vs. δ flow)
Cacheable map/reduce tasks (long running tasks)
Support for fast intermediate data transfers

[Figure labels: User Program, Static data, Configure(), Iterate, δ flow, Map(Key, Value), Reduce(Key, List<Value>), Combine(Key, List<Value>), Close()]

Different synchronization and intercommunication mechanisms used by the parallel runtimes

i-MapReduce Programming Model

User program’s process space:
  configureMaps()
  configureReduce()
  while(condition){
    runMapReduce()
    updateCondition()
  } //end while
  close()

Worker nodes (iterations):
  Cacheable map/reduce tasks
  Map()
  Reduce()
  Combine() operation
  Local disk

Can send <Key,Value> pairs directly
Communications/data transfers via the pub-sub broker network
Two configuration options:
  Using local disks (only for maps)
  Using pub-sub bus
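The loop structure of the programming model can be sketched in Python as below. This is a single-process illustration under stated assumptions, not the i-MapReduce API: the class `IterativeMapReduce`, its methods, and the one-dimensional k-means used as the workload are all invented for this example. It shows the static data being cached once in a configure step, while only the small variable data (the δ flow, here the cluster centers) is passed into each iteration.

```python
class IterativeMapReduce:
    """Hypothetical in-memory runtime: static data is cached once,
    only the variable data (delta) is sent on each iteration."""

    def __init__(self, map_fn, reduce_fn, combine_fn):
        self.map_fn, self.reduce_fn, self.combine_fn = map_fn, reduce_fn, combine_fn
        self.partitions = None
        self.loads = 0  # how many times static data was loaded

    def configure_maps(self, partitions):
        self.partitions = [list(p) for p in partitions]
        self.loads += 1

    def run_map_reduce(self, delta):
        groups = {}
        for part in self.partitions:                    # one cached map task per partition
            for key, value in self.map_fn(part, delta):
                groups.setdefault(key, []).append(value)
        reduced = {k: self.reduce_fn(k, vs) for k, vs in groups.items()}
        return self.combine_fn(reduced, delta)          # single result back to the driver


def kmeans_map(points, centers):
    """Emit (center index, (partial sum, partial count)) for one partition."""
    partial = {}
    for x in points:
        idx = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
        s, n = partial.get(idx, (0.0, 0))
        partial[idx] = (s + x, n + 1)
    return partial.items()


def kmeans_reduce(key, values):
    """Merge partial sums and counts into the new center for this key."""
    return sum(s for s, _ in values) / sum(n for _, n in values)


def kmeans_combine(reduced, old_centers):
    """Collect all new centers; keep the old one for an empty cluster."""
    return [reduced.get(i, c) for i, c in enumerate(old_centers)]


def kmeans(partitions, centers, tol=1e-9, max_iter=100):
    runtime = IterativeMapReduce(kmeans_map, kmeans_reduce, kmeans_combine)
    runtime.configure_maps(partitions)                  # static data loaded once
    for iteration in range(1, max_iter + 1):            # while(condition)
        new_centers = runtime.run_map_reduce(centers)
        converged = max(abs(a - b) for a, b in zip(new_centers, centers)) < tol
        centers = new_centers                           # updateCondition()
        if converged:
            break
    return centers, iteration, runtime.loads
```

For example, `kmeans([[1.0, 2.0, 9.0], [10.0, 3.0, 11.0]], [0.0, 5.0])` runs several iterations while `loads` stays at 1, which is the contrast with a classic runtime that would reread the points on every iteration.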

i-MapReduce Architecture

[Figure labels: User Program, MR Driver, Pub/Sub Broker Network, Worker Nodes, MR Daemon (D), Map Worker (M), Reduce Worker (R), File System, Data Split; edge types: Communication, Data Read/Write]

Streaming based communication
Eliminates file based communication
