Jaliya Ekanayake and Geoffrey Fox
           School of Informatics and Computing
               Indiana University Bloomington
Cloud Computing and Software Services: Theory and Techniques
                         July, 2010
                                                     Presented by:
                                                   Inderjeet Singh
   Introduction
   Problem
   Data Analysis Applications
   Evaluations and Analysis
   Performance of MPI on Clouds
   Benchmarks and Results
   Conclusions and Future Work
   Critique
   Apache Hadoop (open-source implementation of Google MapReduce)
   DryadLINQ (Microsoft API for Dryad)
   CGL-MapReduce (Iterative version of MapReduce)

Cloud technologies/Parallel Runtimes/Cloud Runtimes
   On demand provisioning of resources
   Customizable Virtual Machines (VM)
   Root privileges
   Provisioning is very fast (within minutes)
   You pay only for what you use
   Better resource utilization
Cloud Technologies
 Moving computation to data
 Better Quality of Service (QoS)
 Simple communication topologies
 Distributed file systems (HDFS, GFS)


Most HPC applications are based upon MPI
 Many fine-grained communication topologies
 Use of fast networks
Software framework to support distributed computing
    on large datasets on clusters of computers

   Map step - The master node takes the input, partitions it
    into smaller sub-problems, and distributes them to
    worker nodes. A worker node may do this again in
    turn, leading to a multi-level tree structure. The worker
    node processes the smaller problem and passes the
    answer back to its master node

   Reduce step - The master node collects the answers to all
    the sub-problems and combines them in some way to
    form the output or answer
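A minimal, single-process Python sketch of these two steps, using word counting
as the sub-problem (the function names and sample input are illustrative only;
a real framework such as Hadoop distributes the map calls across worker nodes
and shuffles the intermediate pairs by key):

    from collections import defaultdict

    def map_step(document):
        # Each worker emits (word, 1) pairs for its sub-problem.
        return [(word, 1) for word in document.split()]

    def reduce_step(pairs):
        # Combine the per-key answers into the final output.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    documents = ["the cat sat", "the dog sat"]
    intermediate = [pair for doc in documents for pair in map_step(doc)]
    print(reduce_step(intermediate))  # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}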
Large data/compute-intensive applications
Traditional approach
 Execution on clusters/grids/supercomputers
 Moving both the application and the data to the available
  computational power
 Efficiency decreases with large datasets


Better approach
 Execution with Cloud technologies
 Moving computations to data to perform processing
 A more data-centric approach
Comparisons of features supported by different
cloud technologies and MPI
   What applications are best handled by cloud
    technologies?
   What overheads do they introduce?
   Can traditional parallel runtimes such as MPI
    be used in the cloud?
   If so, what overheads do they have?
Types of Applications (Based upon
    communication)

   Map only (Cap3)
   Map Reduce (HEP)
   Iterative/Complex style (Matrix Multiplication and
    K-Means Clustering)
   Cap3 - Sequence assembly program that operates
    on a collection of gene sequence files to produce
    several outputs

   HEP - High Energy Physics data analysis application

   K-Means clustering - Iteratively refines a set of clusters

   Matrix Multiplication – Cannon’s algorithm
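For reference, here is an illustrative serial NumPy simulation of Cannon's
algorithm (a sketch only, not the paper's MPI implementation): A and B are
split into a q x q grid of blocks, the blocks are skewed, and each of the q
steps multiplies the local blocks, then shifts A-blocks left and B-blocks up.

    import numpy as np

    def cannon_matmul(A, B, q):
        # q x q "process" grid, simulated serially; assumes q divides the matrix size.
        n = A.shape[0]
        b = n // q
        Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
        Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
        # Initial skew: row i of A shifted left by i, column j of B shifted up by j.
        Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
        for _ in range(q):
            for i in range(q):
                for j in range(q):
                    Cb[i][j] += Ab[i][j] @ Bb[i][j]
            # After each step, shift A-blocks one position left and B-blocks one up.
            Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
        return np.block(Cb)

    A, B = np.random.rand(6, 6), np.random.rand(6, 6)
    assert np.allclose(cannon_matmul(A, B, 3), A @ B)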
   MapReduce does not support iterative/complex style
    applications, so [Fox] built CGL-MapReduce
   CGL-MapReduce – Supports long-running tasks and retains
    static data in memory across invocations
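A minimal Python sketch of that idea for K-means (not the actual CGL-MapReduce
API): the data partitions are loaded once and kept in memory, and each
iteration reruns map and reduce with only the small centroid list changing.
One-dimensional points keep the sketch short.

    import random

    def map_phase(partition, centroids):
        # Assign each point in one in-memory partition to its nearest centroid
        # and emit partial (sum, count) pairs per centroid index.
        out = {}
        for x in partition:
            i = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
            s, n = out.get(i, (0.0, 0))
            out[i] = (s + x, n + 1)
        return out

    def reduce_phase(partials, k):
        # Combine the partial sums from all partitions into new centroids.
        sums, counts = [0.0] * k, [0] * k
        for part in partials:
            for i, (s, n) in part.items():
                sums[i] += s
                counts[i] += n
        return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(k)]

    # Static data: loaded once, retained in memory across all iterations.
    partitions = [[random.uniform(0, 10) for _ in range(1000)] for _ in range(4)]
    centroids = [1.0, 5.0, 9.0]
    for _ in range(10):  # a fixed iteration count keeps the sketch simple
        centroids = reduce_phase([map_phase(p, centroids) for p in partitions],
                                 len(centroids))
    print(centroids)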
   Performance (average running time)
   Overhead = [P * T(P) – T(1)] / T(1)
    P = number of processes, T(P) = running time with P processes,
    T(1) = sequential (single-process) running time
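In code, the overhead definition reads as follows (the timings are made-up
values for illustration):

    def overhead(p, t_p, t_1):
        # [P * T(P) - T(1)] / T(1): the fraction of total CPU time not spent
        # on useful (sequential-equivalent) work.
        return (p * t_p - t_1) / t_1

    # e.g. 8 processes, 150 s parallel time, 1000 s sequential time
    print(overhead(8, 150.0, 1000.0))  # 0.2, i.e. 20% overhead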




Figure: Benchmark results for DryadLINQ, Hadoop, CGL-MapReduce, and MPI
   CAP3 (map-only) and HEP (MapReduce) perform well
    with cloud runtimes
   K-means clustering (iterative) and matrix
    multiplication (iterative) show high overheads with
    cloud runtimes compared to the MPI runtime
   CGL-MapReduce also gives lower overhead for large
    datasets
Goals
   Overhead of virtual machines (VMs) on parallel
    MPI applications
   How do applications with different
    communication/computation (c/c) ratios perform on
    the cloud?
   Effect of different strategies for assigning CPU cores
    to VMs on the performance of these MPI applications
Three MPI applications with different c/c
    ratio requirements

   Matrix multiplication (Cannon’s algorithm)
   K-Means clustering
   Concurrent wave solver
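For the concurrent wave solver, a serial sketch of the 1-D finite-difference
update it performs (an assumption about the solver's exact form; the MPI
version distributes the points across processes and exchanges only boundary
values each step, which is why its c/c ratio falls as O(1/n)):

    import math

    def wave_step(u_prev, u_curr, c2=0.25):
        # One time step of the discretized 1-D wave equation; the endpoints
        # are held fixed as boundary conditions.
        u_next = u_curr[:]
        for i in range(1, len(u_curr) - 1):
            u_next[i] = (2 * u_curr[i] - u_prev[i]
                         + c2 * (u_curr[i + 1] - 2 * u_curr[i] + u_curr[i - 1]))
        return u_next

    n = 16
    u0 = [math.sin(2 * math.pi * i / (n - 1)) for i in range(n)]
    u1 = u0[:]                      # zero initial velocity
    for _ in range(100):
        u0, u1 = u1, wave_step(u0, u1)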
Table: Computation and communication complexities of the
different MPI applications used
   Eucalyptus and Xen based cloud infrastructure
      16 nodes, each with 2 quad-core Intel Xeon processors and
       32 GB of memory
      Nodes connected with a 1 Gigabit Ethernet connection
   Same software configuration for both bare-metal
    nodes and VMs
     • OS - Red Hat Enterprise Linux Server release 5.2
     • OpenMPI version 1.3.2
Different strategies for assigning CPU cores to virtual machines


Invariant used to select the number of MPI processes:
 Number of MPI processes = Number of CPU cores used
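A tiny check of this invariant, assuming mpi4py is available (illustrative
only; launch with e.g. mpirun -np 8 python check.py so that one MPI process
runs per core on an 8-core node):

    from mpi4py import MPI
    import os

    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        # The benchmarks keep: number of MPI processes == number of CPU cores used.
        print("MPI processes:", comm.Get_size(), "| cores on this node:", os.cpu_count())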
Figure: Matrix multiplication performance on 64 CPU cores; speedup for a fixed
matrix size (5184 × 5184)

 ◦ Speedup decreases by 34% between bare metal and 8 VMs per node
   at 81 MPI processes
 ◦ Caused by the exchange of large messages and increased communication
Figure: K-means clustering performance on 128 CPU cores; total overhead with
128 MPI processes
    ◦ Communication is much smaller than computation
    ◦ Communication here depends on the number of clusters formed
    ◦ Overhead is large for small data sizes, so lower speedup is
      observed
Figure: Concurrent wave solver performance on 128 CPU cores; total overhead
with 128 MPI processes

   ◦ The amount of communication is fixed; data transfer rates are lower
   ◦ The low c/c ratio of O(1/n) makes the application sensitive to latency,
     lowering performance on VMs
   ◦ 8 VMs per node show 7% more overhead than bare-metal nodes
Figure: Communication between dom0 and domUs when 1 VM per node is deployed
(top) and when 8 VMs per node are deployed (bottom)

◦ In multi-VM configurations, I/O operations of the
  DomUs (user domains) are scheduled via Dom0 (the
  privileged domain)
Figure: LAM vs. OpenMPI in different VM configurations


   When using multiple VMs on multi-core CPUs, it is better to
    use runtimes that support in-node communication
    (OpenMPI vs. LAM-MPI)
   Cloud runtimes work well for pleasingly parallel (map-only
    and MapReduce) applications with large
    datasets
   Overheads of cloud runtimes are high for parallel
    applications that require iterative/complex
    communication patterns (MPI-based applications)
   Work is needed on finding cloud-friendly algorithms
    for these applications
   CGL-MapReduce is efficient for iterative-style
    MapReduce applications (e.g., K-means)
   Overheads for MPI applications increase as the number
    of VMs per node increases (22-50% performance degradation)
   In-node communication is important
   MapReduce applications (not susceptible to
    latencies) may perform well on VMs deployed on
    clouds
   Integration of MapReduce and MPI (biological DNA
    sequencing application)
   No results for MPI implementations of the pleasingly
    parallel applications (Cap3, HEP), so MPI vs. cloud-runtime
    timing comparisons are missing
   No evaluation of the HPC applications implemented
    with cloud runtimes on the private cloud, which is
    critical to show the effect of multi-VM/multi-core
    configurations on the performance of these
    applications
   Different memory sizes (16 GB vs. 32 GB) for the clusters
    running different operating systems; this could bias the results
   Jaliya Ekanayake and Geoffrey Fox, High Performance Parallel
    Computing with Clouds and Cloud Technologies, Lecture Notes of
    the Institute for Computer Sciences, Social Informatics and
    Telecommunications Engineering, Volume 34, 2010, 20 pages

   High Performance Parallel Computing with Clouds and Cloud Technologies.
    http://www.slideshare.net/jaliyae/high-performance-parallel-computing-with-clouds-and-cloud-technologies

   Map Reduce, Wikipedia: http://en.wikipedia.org/wiki/MapReduce