High Performance Computing
          Cloud point of view


                     Alexey Ragozin
         alexey.ragozin@gmail.com
                          Apr 2012
Massively parallel computing

 I/O bound workload
  • Data mining / machine learning / indexing
  • Focus: Do not move data, process in place
 CPU bound
  • complex simulations / complex math models
  • Focus: Keep all cores busy
 Latency bound
  • Physical process simulations
    (e.g. weather forecast)
  • Focus: Minimize communication latencies
CPU bound tasks

 Stream-like workload
  • Independent tasks
  • Random continuous stream of tasks
  • E.g. video conversion, crawling
 Structured batch jobs
  •   A single batch is split into subtasks for parallel execution
  •   Tasks may have data dependencies on each other (see the sketch below)
  •   Tasks may be generated during batch execution
  •   E.g. portfolio risk calculation
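
A minimal sketch of such a batch, assuming a plain JDK executor in place of a real grid scheduler; the subtask names and the math are hypothetical stand-ins:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RiskBatchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Independent subtasks of the batch run in parallel
        CompletableFuture<Double> marketData =
                CompletableFuture.supplyAsync(RiskBatchSketch::loadMarketData, pool);
        CompletableFuture<Double> positions =
                CompletableFuture.supplyAsync(RiskBatchSketch::loadPositions, pool);

        // This subtask has a data dependency on both results above
        CompletableFuture<Double> risk =
                marketData.thenCombineAsync(positions, RiskBatchSketch::computeRisk, pool);

        System.out.println("Portfolio risk: " + risk.get());
        pool.shutdown();
    }

    // Hypothetical stand-ins for real data loading and risk models
    static double loadMarketData() { return 1.02; }
    static double loadPositions() { return 1000.0; }
    static double computeRisk(double md, double pos) { return md * pos; }
}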
Handling a task stream in the cloud

[Diagram: incoming tasks flow into a task queue; workers in the pool poll tasks from the queue; a controller monitors queue metrics and adjusts the worker pool size accordingly]

  Simple pattern, exploiting the “elasticity” of the cloud. Cost effective.
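
A minimal controller sketch; TaskQueue and WorkerPool are hypothetical stand-ins for a cloud queue's metrics API and the provider's pool-resize call:

import java.util.concurrent.atomic.AtomicInteger;

public class PoolController {
    // Hypothetical views of the cloud queue and worker pool; real code would
    // call the provider's metrics and autoscaling APIs here.
    interface TaskQueue { int depth(); }
    interface WorkerPool { int size(); void resize(int workers); }

    private final TaskQueue queue;
    private final WorkerPool pool;
    private final int tasksPerWorker; // target backlog per worker

    PoolController(TaskQueue queue, WorkerPool pool, int tasksPerWorker) {
        this.queue = queue;
        this.pool = pool;
        this.tasksPerWorker = tasksPerWorker;
    }

    void adjust() {
        // Size the pool so each worker has roughly tasksPerWorker queued tasks
        int desired = Math.max(1, queue.depth() / tasksPerWorker);
        if (desired != pool.size()) {
            pool.resize(desired);
        }
    }

    public static void main(String[] args) {
        AtomicInteger workers = new AtomicInteger(1);
        TaskQueue queue = () -> 37; // pretend 37 tasks are currently queued
        WorkerPool pool = new WorkerPool() {
            public int size() { return workers.get(); }
            public void resize(int w) { workers.set(w); System.out.println("pool -> " + w); }
        };
        // A real controller would call adjust() periodically from a timer loop
        new PoolController(queue, pool, 5).adjust(); // prints "pool -> 7"
    }
}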
Structured batch jobs in the cloud

Batches are usually more sporadic
 e.g. end of day risk calculations
Tasks may have cross-dependencies
 scheduler should be “cloud-aware”
Supplying tasks with data
 data delivery delay is critical
 worker pool is generally very large
  data sets can also be very large
Data delivery strategy
Push approach
 scheduler controls data delivery
 worker expects data to be available locally
 more opportunities for optimization
 complex
Pull approach
 worker pulls required data from a central service
 scheduler is unaware of data sets
 requires scalable data service
 much simpler
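
A minimal sketch of the pull approach, with a hypothetical DataService keyed by data-set id; the point is only that the worker, not the scheduler, resolves its own inputs:

public class PullWorker {
    // Hypothetical central data service and task shape
    interface DataService { byte[] fetch(String dataSetId); }
    interface Task { String requiredDataSet(); void run(byte[] input); }

    private final DataService dataService;

    PullWorker(DataService dataService) { this.dataService = dataService; }

    void execute(Task task) {
        // The worker resolves its own inputs; the scheduler only ships
        // the task, never the data
        byte[] input = dataService.fetch(task.requiredDataSet());
        task.run(input);
    }

    public static void main(String[] args) {
        DataService service = id -> ("data for " + id).getBytes();
        PullWorker worker = new PullWorker(service);
        worker.execute(new Task() {
            public String requiredDataSet() { return "positions-2012-04-01"; }
            public void run(byte[] input) { System.out.println(new String(input)); }
        });
    }
}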
What kind of data do we have?

 Working set
 • the working set is divided between jobs
 • each portion of the working set is processed by a single job
 • jobs often produce the working set for the next
   computation stage
 Reference data
 • exactly the same data is shared by multiple/all jobs
 • usually a static data set
Data distribution problem

Working set
• Spiky workload – especially at the start
• Hard to predict where a given piece of data will be required
• Caching is ineffective
Reference data set
• A naïve approach produces a huge volume of
  redundant transfers – smart caching is required
• Spiky workload
Private grid practice

[Diagram: HPC grid workers sit on top of a data grid layer, which in turn loads from and writes back to an RDBMS or data warehouse]
Data grid, what is it?
• Key/Value storage
• Data distributed across cluster of servers
• RAM is usually used as storage
• Redundant copies provide a degree of fault tolerance /
  durability
• No single point of failure
• Automatic rebalancing of data when servers are
  added to / removed from the grid
• Capacity and throughput scale linearly
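
A minimal sketch of how a data grid might place a key, assuming plain hash partitioning with a fixed partition count and one backup copy; real data grid products use more elaborate schemes:

import java.util.ArrayList;
import java.util.List;

public class GridPlacement {
    static final int PARTITIONS = 271;   // fixed partition count (assumed)
    static final int REPLICAS = 2;       // one primary + one backup copy

    // Map a key to its partition
    static int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // Map a partition to the servers holding its copies
    static List<Integer> ownersOf(int partition, int serverCount) {
        List<Integer> owners = new ArrayList<>();
        for (int i = 0; i < REPLICAS && i < serverCount; i++) {
            owners.add((partition + i) % serverCount); // primary, then backup
        }
        return owners;
    }

    public static void main(String[] args) {
        int p = partitionOf("trade-42");
        // When servers are added/removed, only the partition->server mapping
        // changes, and the grid rebalances those partitions automatically
        System.out.println("partition " + p + " owned by " + ownersOf(p, 10));
    }
}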
Data service for cloud HPC

• Block storage service
  Azure Drive / Amazon EBS
  – Lack of shared access to data
• Key / Value storage
  Azure Tables / Amazon SimpleDB
  – Pricing: volume + usage
• Blob store
  Azure Blobs / Amazon S3
  – Pricing: volume + transactions
  – Good read scalability
Use case for caching

 Avoid storage of data in the cloud
  • Upload data once per batch and cache in cloud
 Reduce storage cost by reducing the number of
  operations
 Save IO bandwidth for shared data
  • Edge caching
  • Routing overlays
Routing overlays
• Each node knows (communicates with) only a subset
  of the network.
• A request to an unknown node is routed via the
  neighbor closest to the destination.
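
A minimal greedy-routing sketch, assuming numeric node ids and an XOR distance metric in the spirit of Kademlia; each hop forwards to the known node closest to the destination:

import java.util.ArrayList;
import java.util.List;

public class OverlayNode {
    final long id;
    final List<OverlayNode> neighbors = new ArrayList<>(); // known subset of the network

    OverlayNode(long id) { this.id = id; }

    // XOR distance, as in Kademlia-style overlays (an assumption here)
    static long distance(long a, long b) { return a ^ b; }

    // One routing hop: forward to the known node closest to the destination
    OverlayNode nextHop(long destination) {
        OverlayNode best = this;
        for (OverlayNode n : neighbors) {
            if (distance(n.id, destination) < distance(best.id, destination)) {
                best = n;
            }
        }
        return best; // returning 'this' means no neighbor is closer
    }

    public static void main(String[] args) {
        OverlayNode a = new OverlayNode(0b0001);
        OverlayNode b = new OverlayNode(0b0100);
        OverlayNode c = new OverlayNode(0b0111);
        a.neighbors.add(b);                     // a knows only b...
        b.neighbors.add(c);                     // ...and b knows c
        OverlayNode hop = a.nextHop(0b0110);    // route toward id 6
        System.out.println("a forwards to " + hop.id); // 4, i.e. node b
    }
}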
Task stealing

Task stealing – an alternative scheduling approach
Task stealing is widely used for in-process multi-core concurrency

Why use it for cluster task scheduling?
• Stochastic and adaptive
• Can use cost models accounting for internal cloud
  topology
• Decently solves the problem of data delivery, without
  additional caching
• Unproven for cluster computation, so far
Task stealing
 Work backlog is organized as a stack
 Tasks are generated recursively
 Top of stack – fine-grained tasks
 Bottom of stack – coarse-grained tasks
 Execution – from the top of the stack
 Stealing – from the bottom of the stack

[Diagram: Worker 1 recursively forks tasks onto its stack and processes them from the top]
Task stealing

[Diagram: Worker 2 steals a coarse-grained task from the bottom of Worker 1's stack; both workers then fork and process their own tasks until done]
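
A minimal sketch of the per-worker backlog, using a JDK deque with coarse locking for clarity; the owner forks and takes at the top, thieves steal from the bottom:

import java.util.ArrayDeque;
import java.util.Deque;

public class WorkBacklog {
    private final Deque<String> stack = new ArrayDeque<>();

    // Owner side: fork pushes onto the top of the stack
    synchronized void fork(String task) { stack.addFirst(task); }

    // Owner side: execute from the top (fine-grained, recently forked)
    synchronized String take() { return stack.pollFirst(); }

    // Thief side: steal from the bottom (coarse-grained, worth the transfer)
    synchronized String steal() { return stack.pollLast(); }

    public static void main(String[] args) {
        WorkBacklog backlog = new WorkBacklog();
        backlog.fork("whole portfolio");   // coarse-grained, forked first
        backlog.fork("book A");
        backlog.fork("trade A-1");         // fine-grained, forked last

        System.out.println("owner runs:   " + backlog.take());  // trade A-1
        System.out.println("thief steals: " + backlog.steal()); // whole portfolio
    }
}

In-process schedulers such as java.util.concurrent.ForkJoinPool apply the same top/bottom discipline with lock-free deques; the open question above is whether it transfers to cluster-level scheduling.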
IO bound workload in the cloud

Dawn of Map/Reduce
- high bandwidth interconnects are expensive
- network storage is expensive
- cheap servers and local processing to keep costs
  low
“Cloud” reality
- network bandwidth is cheap
- disks are already “networked”
- RAM is abundant
Hadoop is cloud-unfriendly

Assume I have a 50-node Hadoop cluster in the cloud
What will I gain by adding another 50 nodes?
- Not much, until they are populated with data.
What if I shut these 50 down afterward?
- Effort to populate them with data will be wasted.

Hadoop couples execution and storage services
  together – you have to pay for both even if you use only one.
How should cloud M/R look?
• Use a cloud storage service as persistent storage
• Streaming M/R processing
• Aggressive use of memory for intermediate data

Peregrine – storeless M/R framework
  http://peregrine_mapreduce.bitbucket.org/
Spark – in-memory M/R framework
  http://www.spark-project.org/
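
A minimal in-memory map/reduce sketch of the style these frameworks generalize – a word count whose intermediate pairs stay in a HashMap instead of being spilled to disk; it illustrates the “aggressive use of memory” point, not any specific framework API:

import java.util.HashMap;
import java.util.Map;

public class InMemoryWordCount {
    public static void main(String[] args) {
        String[] records = { "to be or not to be", "to see or not" };

        // Map + shuffle + reduce: intermediate (word, count) pairs stay in RAM
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {
            for (String word : record.split(" ")) {   // map phase: emit words
                counts.merge(word, 1, Integer::sum);  // reduce phase: sum counts
            }
        }
        System.out.println(counts);
    }
}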
Looking into future

Highly anticipated features
 Scheduler as a Service
  Azure HPC
 Simple middleware for organizing caches and
  routing overlays
  Existing solutions are far from simple
 Cloud-friendly map/reduce frameworks
Thank you
http://aragozin.blogspot.com
- my articles


                              Alexey Ragozin
                   alexey.ragozin@gmail.com
