Data Aware Routing In The Cloud
Large-scale HPC infrastructures such as Sun Grid Engine sooner or later face data access issues, usually caused by the centralized nature of data sources. This talk discusses an approach to mitigate those issues by employing in-memory data grids, such as Oracle Coherence. By collocating compute and data grid nodes on the same hosts, one can achieve not only better resource utilization but also enjoy performance gains through data aware routing. Grid Dynamics will present a reference architecture for building a data aware routing layer on top of SGE and Oracle Coherence along with a demo that highlights the performance benefits of data aware routing.




Data Aware Routing In The Cloud: Presentation Transcript

  • Data Aware Routing in the Cloud
    Eugene Steinberg
    CTO, Grid Dynamics
  • Traditional HPC approach: HPC cluster with centralized data access
  • Data-Driven Scalability Challenges in the Traditional HPC Approach
    • Data is far away
    • Latency of remote connection
    • Latency of data travelling through pipes
    • Chatty random data access algorithms are expensive
    • Data is centralized
    • Centralized HW resources are always limited
    • Centralized RAM is limited: disk I/O is inevitable
    • Connections are limited
    • Highly concurrent data access doesn’t scale well
  • Usual solution: Data Grids
    • Classic Data Grid
    • Data is partitioned
    • Partitions are stored in memory
    • Data grid is deployed near the compute grid
    • Search is parallelized over partitions
    • Built-in replication, persistence, coherence, failover
    • What is Achieved?
    • Reduced latency and data moving cost
    • Improved connection scalability
    • Reduced data contention
    • No Disk I/O – 100% memory speed
    • Is this the best we can do?
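The partitioning idea behind the slide above can be sketched in a few lines. This is a toy illustration, not the Coherence API: all names (`partition_for`, `DataGrid`) are invented for clarity, and real partitioned caches add replication, failover, and ownership rebalancing on top.

```python
# Toy sketch of a partitioned in-memory data grid: each key hashes to a
# fixed partition, and partitions (not whole datasets) are the unit that
# gets spread across hosts and searched in parallel.

PARTITION_COUNT = 257  # partitioned caches commonly use a prime count

def partition_for(key: str) -> int:
    """Map a cache key to a fixed partition id."""
    return hash(key) % PARTITION_COUNT

class DataGrid:
    """Minimal partitioned in-memory store."""
    def __init__(self):
        self.partitions = {p: {} for p in range(PARTITION_COUNT)}

    def put(self, key, value):
        self.partitions[partition_for(key)][key] = value

    def get(self, key):
        return self.partitions[partition_for(key)].get(key)

grid = DataGrid()
grid.put("ORCL", [101.2, 99.8])
assert grid.get("ORCL") == [101.2, 99.8]
```

Because every entry lives in exactly one partition, a parallel search is just one sub-query per partition owner, all running at memory speed.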
  • Limitations of Compute Grid + Data Grid
    • Two separate grid environments
    • Hardware, footprint and management costs of dual infrastructure
    • Segregated infrastructures cannot share resources
    • Sub-optimal resource utilization
    • Compute grid is CPU-bound, not RAM-bound
    • Data grid is RAM-bound, not CPU-bound
    • Still sub-optimal performance
    • Still paying for remote network calls and data movement
  • Better answer: Compute-Data Grid
    • Shared hardware between compute & data grid
    • Data grid utilizes RAM of host machines
    • Compute grid runs HPC jobs on the same host machines
    • Opportunity for data aware routing
    • Many applications support compute-data affinity
    • Reduced overhead on remote calls and data movement
    • Opportunity to scale
    • As the HPC application scales in and out, data partitions are spread over a larger or smaller pool of hosts
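The scale-in/scale-out point above can be sketched as follows; the round-robin assignment and host names are illustrative only (real grids use smarter, movement-minimizing rebalancing), but the core idea holds: the partition set is fixed, and only its spread over hosts changes.

```python
# Sketch: a fixed set of partitions re-spread over a changing host pool.
# Scaling out means each host owns fewer partitions (more RAM and CPU
# per partition); scaling in concentrates partitions on fewer hosts.

def assign_partitions(partition_count, hosts):
    """Round-robin partitions over the current host pool."""
    return {p: hosts[p % len(hosts)] for p in range(partition_count)}

small = assign_partitions(8, ["host-a", "host-b"])
large = assign_partitions(8, ["host-a", "host-b", "host-c", "host-d"])

# Each host owns half as many partitions after doubling the pool.
assert sum(1 for h in small.values() if h == "host-a") == 4
assert sum(1 for h in large.values() if h == "host-a") == 2
```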
  • Data aware routing in Compute-Data grid
    • What is data aware routing?
    • Route HPC tasks to machine hosting relevant data grid partition
    • Relevant partition is identified by data affinity key
    • Why data aware routing?
    • Tasks access data via loopback – reduced latency
    • Data doesn’t travel through pipes and switches – bandwidth savings
    • Data aware routing architecture goals
    • Pluggable: lightweight core + adapters for variety of grid products
    • Non-intrusive: neither the compute grid nor the data grid is aware of being coordinated
  • Data aware routing architecture
  • Components
    • Core Components
    • Data Grid Adapter: service responsible for knowing Data Grid’s topology and state
    • Data Aware Wrapper: client side library extending Compute Grid’s scheduling API to support data-aware job scheduling
    • Variation on Configuration
    • Data Grid Adapter can be deployed as a remote service or embedded as a library
  • Data aware routing workflow
    Client submits task + affinity key to Wrapper
    Wrapper requests host hint from Data Grid Adapter
    Data Grid Adapter gives host hint using affinity key
    Wrapper submits task with host hint using HPC API
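The four workflow steps above can be sketched end to end. `DataGridAdapter` and `DataAwareWrapper` mirror the components named in the architecture, but the code is a hypothetical illustration: the real adapter would query Coherence for partition ownership, and `submit_with_host_hint` stands in for an HPC scheduling call such as an SGE submission with a host resource request.

```python
# Sketch of the data-aware routing workflow:
#   1. client submits task + affinity key to the Wrapper
#   2. Wrapper asks the Data Grid Adapter for a host hint
#   3. Adapter resolves the hint from the affinity key
#   4. Wrapper submits the task with that host hint via the HPC API

class DataGridAdapter:
    """Knows the data grid topology: which host owns which partition."""
    def __init__(self, partition_count, partition_owners):
        self.partition_count = partition_count
        self.partition_owners = partition_owners  # partition id -> host

    def host_hint(self, affinity_key):
        return self.partition_owners[hash(affinity_key) % self.partition_count]

class DataAwareWrapper:
    """Client-side wrapper extending the compute grid's scheduling API."""
    def __init__(self, adapter, submit_with_host_hint):
        self.adapter = adapter
        self.submit = submit_with_host_hint

    def submit_task(self, task, affinity_key):
        hint = self.adapter.host_hint(affinity_key)  # steps 2 and 3
        return self.submit(task, hint)               # step 4

# Step 1: the client hands the wrapper a task plus an affinity key.
owners = {p: f"host-{p % 2}" for p in range(4)}
adapter = DataGridAdapter(4, owners)
wrapper = DataAwareWrapper(adapter, lambda task, host: (task, host))
task, host = wrapper.submit_task("evaluate-ticker", "ORCL")
assert host == adapter.host_hint("ORCL")
```

Note how neither grid changes: the adapter only reads topology, and the wrapper only decorates an existing submission call, which is what makes the layer non-intrusive.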
  • Cloud Factor: opportunities and challenges
    Emerging Cloud Computing technologies
    • Offer benefits for HPC architectures
    • Automated cluster provisioning and deployment
    • Ability to set up large clusters on short notice for once-in-a-blue-moon jobs
    • Ability to scale cluster up and down on demand
    • Easy to spawn isolated development/testing environments
    • … as well as some new challenges
    • Networking limitations (e.g. absence of multicast)
    • No strong promises on CPU, I/O and bandwidth shares
    • No real perimeter firewall – security considerations
  • Demo: Simplistic Trading Analytics
    • Application characteristics
    • Data-intensive application: N*10K trade objects (2 KB payload) for 4*N stock tickers
    • Processing algorithm
    • Job: to evaluate all tickers
    • Task: to process ticker trades with chatty random-access algorithm
    • Data source and task scheduling control
    • Data sources: RDBMS or IMDG
    • Scheduling mode: neutral and data-aware
    • Task and job latency measurement
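The two scheduling modes compared in the demo can be sketched as a small simulation. Everything here is illustrative (host names, the `owner` function, counting remote accesses as a stand-in for the latency the demo actually measures): "neutral" scheduling ignores data placement, while data-aware scheduling routes each ticker's task to the host owning its partition.

```python
# Sketch: neutral vs data-aware scheduling. A task landing on a host
# other than its data's owner must fetch that data over the network;
# we count those remote accesses as a proxy for added latency.

import random

HOSTS = ["host-0", "host-1", "host-2", "host-3"]

def owner(ticker):
    """Host owning the data grid partition for this ticker."""
    return HOSTS[hash(ticker) % len(HOSTS)]

def remote_accesses(tickers, mode):
    remote = 0
    for t in tickers:
        if mode == "neutral":
            run_on = random.choice(HOSTS)  # data placement ignored
        else:
            run_on = owner(t)              # data-aware routing
        if run_on != owner(t):
            remote += 1                    # data must cross the network
    return remote

tickers = [f"TICK{i}" for i in range(1000)]
assert remote_accesses(tickers, "data-aware") == 0
# Neutral scheduling hits a remote partition roughly 3 times out of 4
# with 4 hosts; data-aware routing never does.
```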
  • SGE + Oracle Coherence on AWS
    • AWS provisioning considerations
    • Customization of popular images is faster than spinning up a custom AMI
    • Make sure sshd is running on freshly launched machines
    • SGE considerations
    • Reverse DNS lookup: /etc/hosts must contain all hostnames
    • Password-less SSH access
    • Oracle Coherence considerations
    • No multicast: unicast + well-known addresses
    • Many short-lived clients: do not join the TCMP cluster; use Coherence*Extend
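The unicast plus well-known-addresses setup above is configured in a Coherence operational override file (`tangosol-coherence-override.xml` in Coherence 3.x); a minimal sketch, with the IP addresses and port purely illustrative:

```xml
<?xml version="1.0"?>
<coherence>
  <cluster-config>
    <unicast-listener>
      <!-- With multicast unavailable on AWS, new members discover the
           cluster by contacting these well-known addresses directly. -->
      <well-known-addresses>
        <socket-address id="1">
          <address>10.0.0.11</address>
          <port>8088</port>
        </socket-address>
        <socket-address id="2">
          <address>10.0.0.12</address>
          <port>8088</port>
        </socket-address>
      </well-known-addresses>
    </unicast-listener>
  </cluster-config>
</coherence>
```

Only the listed hosts need stable addresses; other cluster members can come and go, which fits the elastic host pool described earlier.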
  • Data aware routing in the cloud
    Eugene Steinberg