High Performance Computing
          Cloud point of view


                     Alexey Ragozin
         alexey.ragozin@gmail.com
                          Sep 2012
Massively parallel computing

 I/O bound workload
  • Data mining / machine learning / indexing
  • Focus: Do not move data, process in place
 CPU bound
  • complex simulations / complex math models
  • Focus: Keep all cores busy
 Latency bound
  • Physical process simulations
    (e.g. weather forecast)

  • Focus: Minimize communication latencies
CPU bound tasks

 Stream of independent tasks
  • Independent tasks
  • Random continuous stream of tasks
  • E.g. video conversion, crawling
 Structured batch jobs
  •   Single batch is split into subtasks for parallel execution
  •   Tasks may have data dependencies on each other
  •   Tasks may be generated during batch execution
  •   E.g. portfolio risk calculation
Handling task stream in Cloud

   incoming tasks --> Task queue <-- polling -- Worker pool
                          |                         ^
                    queue metrics                   |
                          v                         |
                      Controller -- adjusts pool size based on queue metrics

  Simple pattern. Exploiting “elasticity” of cloud. Cost effective.
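
The control loop above can be sketched as a single sizing function. This is a minimal illustration, not any cloud provider's API: the function name, the metrics it takes (backlog, arrival rate, average task time) and the target drain time are all assumptions made for the example.

```python
import math


def desired_pool_size(backlog, arrival_rate, avg_task_seconds,
                      target_drain_seconds, min_workers=1, max_workers=100):
    """Hypothetical sizing rule: workers needed to keep up and drain the backlog.

    backlog               tasks currently waiting in the queue
    arrival_rate          incoming tasks per second
    avg_task_seconds      average processing time of one task
    target_drain_seconds  how quickly the existing backlog should be cleared
    """
    # Workers needed just to keep up with arrivals (rate * service time).
    steady = arrival_rate * avg_task_seconds
    # Extra workers needed to clear the waiting tasks within the target time.
    drain = backlog * avg_task_seconds / target_drain_seconds
    wanted = math.ceil(steady + drain)
    # Clamp to the limits the controller is allowed to use.
    return max(min_workers, min(max_workers, wanted))
```

A controller would call this periodically with fresh queue metrics and start or stop workers to match the result. The clamping keeps one worker alive on an empty queue and caps cost during a spike.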
Structured batch jobs in cloud

Batches are usually more sporadic
 e.g. end of day risk calculations
Tasks may have cross dependencies
 scheduler should be “cloud-aware”
Supplying tasks with data
 data delivery delay is critical
 worker pool is generally very large
 data sets can also be very large
Data delivery strategy
Push approach
 scheduler controls data delivery
 worker expects data to be available locally
 more opportunities for optimization
 complex
Pull approach
 worker pulls required data from central service
 scheduler is unaware of data sets
 requires scalable data service
 much simpler
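
One possible shape of the push approach is a scheduler that prefers workers already holding the required data set and records the transfers it has to push otherwise. The sketch below is an assumption-laden illustration: `assign_tasks`, its inputs and the "locality first, then least loaded" rule are hypothetical, not taken from a specific scheduler.

```python
def assign_tasks(tasks, worker_data):
    """Assign tasks to workers, preferring workers that hold the data locally.

    tasks        list of (task_id, dataset) pairs
    worker_data  dict: worker name -> set of data sets already on that worker
    Returns (assignments, transfers): task_id -> worker, and the list of
    (dataset, worker) deliveries the scheduler must push before execution.
    """
    load = {worker: 0 for worker in worker_data}
    held = {worker: set(datasets) for worker, datasets in worker_data.items()}
    assignments, transfers = {}, []
    for task_id, dataset in tasks:
        # Workers that already have the data; fall back to all workers.
        local = [w for w in held if dataset in held[w]]
        candidates = local or list(held)
        worker = min(candidates, key=lambda w: load[w])  # least loaded candidate
        if dataset not in held[worker]:
            transfers.append((dataset, worker))  # scheduler-controlled delivery
            held[worker].add(dataset)
        assignments[task_id] = worker
        load[worker] += 1
    return assignments, transfers
```

In the pull approach none of this bookkeeping exists: the scheduler hands out tasks blindly and each worker fetches what it needs from the central data service.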
What kind of data do we have?

 Working set
 • working set is divided between jobs
 • each portion of working set is processed by a single job
 • jobs often produce the working set for the next
   computation stage
 Reference data
 • exactly same data shared by multiple/all jobs
 • usually static data set
Data distribution problem

Working set
• Spiky work load – especially at the start
• Hard to predict where a piece of data will be required
• Caching is ineffective
Reference data set
• Naïve approach will produce huge volume of
  redundant transfers – smart caching required
• Spiky work load
Private grid practice

  [Diagram: HPC Grid, Data grid, and RDBMS or Data Warehouse]
Data grid, what is it?

• Key/Value storage
• Data distributed across cluster of servers
• RAM is usually used as storage
• Redundant copies provide a level of fault tolerance /
  durability
• No single point of failure
• Automatic rebalancing of data when servers are
  added to / removed from the grid
• Capacity and throughput scale linearly
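
The placement of keys and redundant copies listed above can be illustrated with rendezvous (highest random weight) hashing. This is one common technique chosen for the sketch, not necessarily what any particular data grid product uses; the `owners` function and the server names are hypothetical.

```python
import hashlib


def owners(key, servers, copies=2):
    """Return the servers holding `key`: primary first, then backup copies."""
    def score(server):
        # Every (server, key) pair gets a stable pseudo-random score.
        digest = hashlib.sha256(f"{server}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # The highest-scoring servers own the key.
    return sorted(servers, key=score, reverse=True)[:copies]


servers = ["s1", "s2", "s3", "s4"]
primary, backup = owners("trade:42", servers)
# If the primary leaves the grid, the backup becomes the new primary.
survivors = [s for s in servers if s != primary]
assert owners("trade:42", survivors)[0] == backup
```

With this scheme, removing a server only moves the keys that server owned, which is the property that makes automatic rebalancing cheap.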
Data service for cloud HPC

• Block storage service
  Azure Drive / Amazon EBS
  – Lack of shared access to data
• Key / Value storage
  Azure Tables / Amazon SimpleDB
  – Pricing: volume + usage
• Blob store
  Azure Blob storage / Amazon S3
  – Pricing: volume + transactions
  – Good read scalability
Use case for caching

 Avoid storage of data in cloud
  • Upload data once per batch and cache in cloud
 Reduce storage cost by reducing number of
  operations
 Save IO bandwidth for shared data
  • Edge caching
  • Routing overlays
Distribution tree /
  Routing overlays

  [Diagram: three tiers labeled Storage, Proxy and Clients, laid out across three switches]
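
The effect of a proxy tier on storage operations can be shown with a toy model. Everything here is an assumption for illustration: the class names, the idea of one caching proxy per switch, and the in-memory stand-in for the blob store.

```python
class Storage:
    """Stands in for the cloud blob store; counts the reads that reach it."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.blobs[key]


class CachingProxy:
    """Middle tier: fetches each key from upstream once, then serves its copy."""

    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.upstream.get(key)
        return self.cache[key]


def storage_reads(n_proxies, clients_per_proxy, key="reference-data"):
    """Count storage reads when every client reads `key` through its proxy."""
    storage = Storage({key: b"shared data set"})
    proxies = [CachingProxy(storage) for _ in range(n_proxies)]
    for proxy in proxies:
        for _ in range(clients_per_proxy):
            proxy.get(key)
    return storage.reads
```

In this model the number of billed storage reads depends on the number of proxies, not on the number of clients, which is the point of both the cost and the bandwidth arguments on the caching slide.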
Task stealing

Task stealing – alternative scheduling approach
Task stealing is widely used for in-process multi-core concurrency

Why use it for cluster task scheduling?
• Stochastic and adaptive
• Can use cost models that account for internal cloud
  topology
• Decently solves problem of data delivery,
  without additional caching
• Unproven for cluster computation, so far
Task stealing

  Work backlog is organized in the form of a stack
  Tasks are generated recursively
  Top of stack – fine grained tasks
  Bottom of stack – coarse grained tasks
  Execution – from top of stack
  Stealing – from bottom of stack

  [Diagram: Worker 1 stack built by successive forks; the task on top is processing]
Task stealing

  [Diagram: Worker 2 steals a task from Worker 1; both workers then fork and process tasks on their own stacks until done]
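
The stack discipline described on these slides can be simulated in a few lines. This is a single-threaded sketch under stated assumptions: workers take turns in a loop instead of running concurrently, a task is a hypothetical `(lo, hi)` range to be summed, and an idle worker steals from the longest backlog.

```python
from collections import deque


def run(n_items, n_workers=2, leaf_size=4):
    """Sum range(n_items) with simulated workers; returns (total, steals)."""
    stacks = [deque() for _ in range(n_workers)]
    stacks[0].append((0, n_items))  # the whole batch starts on worker 0
    total, steals = 0, 0
    while any(stacks):
        for stack in stacks:
            if stack:
                lo, hi = stack.pop()  # own work: taken from the top of the stack
            else:
                victims = [s for s in stacks if s]
                if not victims:
                    continue  # nothing to steal on this turn
                # Steal the coarsest task: the bottom of the victim's stack.
                lo, hi = max(victims, key=len).popleft()
                steals += 1
            if hi - lo > leaf_size:
                mid = (lo + hi) // 2
                stack.append((mid, hi))  # fork: the halves go on top of own stack
                stack.append((lo, mid))
            else:
                total += sum(range(lo, hi))  # fine grained task: just process it
    return total, steals
```

Because a thief takes the coarsest task available, one steal moves a large share of the remaining work, so steals stay rare compared to the number of tasks processed.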
IO bound workload in cloud

Dawn of Map/Reduce
- high bandwidth interconnects are expensive
- network storage is expensive (due to network cost)
- cheap servers and local processing for keeping costs low
- price – very complex computation model
“Cloud” reality
- network bandwidth is cheap
- disks are already “networked”
- RAM is abundant
Hadoop is cloud unfriendly

Assume I have a 50-node Hadoop cluster in the cloud
What will I gain by adding another 50 nodes?
- Not much, until they are populated with data.
What if I shut these 50 down afterward?
- The effort to populate them with data will be wasted.

Hadoop couples execution and storage services
together: you have to pay for both even if you use only one.
How should cloud M/R look?

• Use cloud storage service as persistent storage
• Streaming M/R processing
• Aggressive use of memory for intermediate data

Peregrine – storeless M/R framework
  http://peregrine_mapreduce.bitbucket.org/
Spark – in-memory M/R framework
  http://www.spark-project.org/
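
The streaming, memory-resident style listed above can be illustrated with a toy map/reduce. This is not the API of Peregrine or Spark; `map_reduce` and `word_count` are hypothetical names, and the sketch assumes the intermediate data fits in the memory of one process.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Stream records through mapper, group in memory, then reduce per key."""
    groups = defaultdict(list)
    # `records` can be any iterator, e.g. lines streamed from a blob store,
    # so input never has to be copied onto local disks first.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # intermediate data stays in RAM
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count(lines):
    """Classic example: count occurrences of each word."""
    return map_reduce(
        lines,
        mapper=lambda line: ((word, 1) for word in line.split()),
        reducer=lambda key, values: sum(values),
    )
```

Since the workers hold no persistent data, adding or removing them does not require repopulating storage, which is the property the Hadoop slide finds missing.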
Looking into future

Highly anticipated features
 Scheduler as a Service
  Azure HPC / Amazon SWF
 Simple middleware for organizing caches and
  routing overlays
  Existing solutions are far from simple
 Cloud friendly map/reduce frameworks
  Cloud providers work hard to offer effective Hadoop
Thank you
http://blog.ragozin.info
- my articles


                                 Alexey Ragozin
                     alexey.ragozin@gmail.com
