Data Aware Routing In The Cloud

Large-scale HPC infrastructures built on schedulers such as Sun Grid Engine sooner or later face data access issues, usually caused by the centralized nature of data sources. This talk discusses an approach to mitigating those issues by employing in-memory data grids such as Oracle Coherence. By collocating compute and data grid nodes on the same hosts, one can achieve not only better resource utilization but also performance gains through data aware routing. Grid Dynamics will present a reference architecture for building a data aware routing layer on top of SGE and Oracle Coherence, along with a demo that highlights the performance benefits of data aware routing.


Data Aware Routing in the Cloud
Eugene Steinberg
CTO, Grid Dynamics

Traditional HPC approach: HPC cluster with centralized data access

Data Driven Scalability Challenges in the traditional HPC approach
- Data is far away
  - Latency of the remote connection
  - Latency of data travelling through pipes
  - Chatty, random data access algorithms are expensive
- Data is centralized
  - Centralized HW resources are always limited
  - Centralized RAM is limited: disk I/O is inevitable
  - Connections are limited
  - Highly concurrent data access doesn't scale well

Usual solution: Data Grids
- Classic Data Grid
  - Data is partitioned
  - Partitions are stored in memory
  - Data grid is deployed close to the compute grid
  - Search is parallelized over partitions (see the sketch after this list)
  - Built-in replication, persistence, coherence, failover
- What is achieved?
  - Reduced latency and data-moving cost
  - Improved connection scalability
  - Reduced data contention
  - No disk I/O – 100% memory speed
- Is this the best we can do?
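
The classic data grid pattern above can be made concrete with a small Java sketch against the Oracle Coherence API used later in this deck. It is only an illustration of partitioned, in-memory storage with a query that the grid fans out across partitions; the cache name "trades" and the Trade class are hypothetical, not part of the original demo.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.filter.EqualsFilter;

import java.io.Serializable;
import java.util.Set;

public class ClassicDataGridSketch {

    // Hypothetical value object; a real deployment would also consider POF serialization.
    public static class Trade implements Serializable {
        private final String ticker;
        private final int quantity;

        public Trade(String ticker, int quantity) {
            this.ticker = ticker;
            this.quantity = quantity;
        }

        public String getTicker() { return ticker; }
        public int getQuantity() { return quantity; }
    }

    public static void main(String[] args) {
        // Entries are hashed into partitions that live in the RAM of the storage nodes.
        NamedCache trades = CacheFactory.getCache("trades");
        trades.put("trade-1", new Trade("ORCL", 100));

        // A filter-based query is executed in parallel by every partition owner,
        // so the search scales with the number of data grid nodes.
        Set results = trades.entrySet(new EqualsFilter("getTicker", "ORCL"));
        System.out.println("Matching trades: " + results.size());

        CacheFactory.shutdown();
    }
}
```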

Limitations of Compute Grid + Data Grid
- Two separate grid environments
  - Hardware, footprint and management costs of a dual infrastructure
  - Segregated infrastructures cannot share resources
- Sub-optimal resource utilization
  - Compute grid is CPU-bound, not RAM-bound
  - Data grid is RAM-bound, not CPU-bound
- Still sub-optimal performance
  - Still paying for remote network calls and data movement

Better answer: Compute-Data Grid
- Shared hardware between compute & data grid
  - Data grid utilizes the RAM of the host machines
  - Compute grid runs HPC jobs on the same host machines
- Opportunity for data aware routing
  - Many applications support compute-data affinity
  - Reduced overhead of remote calls and data movement
- Opportunity to scale
  - As the HPC application scales in and out, data partitions are spread over a larger or smaller pool of hosts

Data aware routing in the Compute-Data Grid
- What is data aware routing?
  - Route each HPC task to the machine hosting the relevant data grid partition
  - The relevant partition is identified by a data affinity key
- Why data aware routing?
  - Tasks access data via loopback – reduced latency
  - Data doesn't travel through pipes and switches – bandwidth savings
- Data aware routing architecture goals
  - Pluggable: a lightweight core + adapters for a variety of grid products (see the adapter sketch after the diagram below)
  - Non-intrusive: neither the compute grid nor the data grid is aware of being coordinated

Data aware routing architecture (architecture diagram in the original slides)
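
The deck shows the architecture as a diagram; as a textual stand-in, the "pluggable, non-intrusive" goals can be sketched as a minimal Java contract plus one adapter. The names DataGridAdapter, hostFor and CoherenceDataGridAdapter are hypothetical illustrations, not the actual Grid Dynamics API; the Coherence calls (PartitionedService.getKeyOwner) are simply the standard way to ask which cluster member owns a key.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.Member;
import com.tangosol.net.NamedCache;
import com.tangosol.net.PartitionedService;

/** Hypothetical adapter contract: the lightweight core only asks "which host owns this key?". */
interface DataGridAdapter {
    /** Returns the hostname of the data grid node that currently owns the given affinity key. */
    String hostFor(String cacheName, Object affinityKey);
}

/** Hypothetical Coherence-backed adapter; it reads topology without changing either grid. */
class CoherenceDataGridAdapter implements DataGridAdapter {
    @Override
    public String hostFor(String cacheName, Object affinityKey) {
        NamedCache cache = CacheFactory.getCache(cacheName);
        // For a distributed (partitioned) cache, the cache service can report
        // the cluster member that currently owns a given key.
        PartitionedService service = (PartitionedService) cache.getCacheService();
        Member owner = service.getKeyOwner(affinityKey);
        return owner.getAddress().getHostName();
    }
}
```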

Components
- Core components
  - Data Grid Adapter: a service responsible for knowing the data grid's topology and state
  - Data Aware Wrapper: a client-side library extending the compute grid's scheduling API to support data-aware job scheduling
- Variation on configuration
  - The Data Grid Adapter can be deployed as a remote service or embedded as a library

Data aware routing workflow (a sketch of this flow follows the list)
1. The client submits a task + affinity key to the Wrapper
2. The Wrapper requests a host hint from the Data Grid Adapter
3. The Data Grid Adapter derives the host hint from the affinity key
4. The Wrapper submits the task with the host hint using the HPC API
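
Below is a minimal sketch of how a Data Aware Wrapper could implement this workflow, assuming the hypothetical DataGridAdapter interface from the previous sketch and assuming the host hint is passed to SGE as a hard resource request (qsub -l hostname=...); the actual reference architecture may integrate with SGE differently, for example through DRMAA.

```java
import java.io.IOException;

/**
 * Hypothetical Data Aware Wrapper: it only coordinates the two grids and changes
 * neither of them, matching the "non-intrusive" goal.
 * DataGridAdapter is the hypothetical interface sketched earlier.
 */
class DataAwareWrapper {
    private final DataGridAdapter adapter;

    DataAwareWrapper(DataGridAdapter adapter) {
        this.adapter = adapter;
    }

    /** Submits an HPC task so that it runs on the host owning the task's data. */
    void submit(String taskScript, String cacheName, Object affinityKey) throws IOException {
        // Workflow steps 2-3: ask the adapter which host owns the affinity key.
        String hostHint = adapter.hostFor(cacheName, affinityKey);

        // Workflow step 4: submit through the compute grid's own API; here the
        // hint is expressed as an SGE hard resource request.
        new ProcessBuilder("qsub", "-l", "hostname=" + hostHint, taskScript)
                .inheritIO()
                .start();
    }
}
```

A client would then call something like wrapper.submit("process_ticker.sh", "trades", "ORCL"), with the stock ticker acting as the affinity key (all names here are placeholders).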

Cloud Factor: opportunities and challenges
Emerging cloud computing technologies...
- ...offer benefits for HPC architectures
  - Automated cluster provisioning and deployment
  - Ability to set up large clusters in a timely manner for once-in-a-blue-moon jobs
  - Ability to scale the cluster up and down on demand
  - Easy to spawn isolated development/testing environments
- ...as well as some new challenges
  - Networking limitations (e.g. absence of multicast)
  - No strong promises on CPU, I/O and bandwidth shares
  - No real perimeter firewall – security considerations

Demo: Simplistic Trading Analytics
- Application characteristics
  - Data-intensive application: N*10K trade objects (2 KB payload each) for 4*N stock tickers
- Processing algorithm
  - Job: evaluate all tickers
  - Task: process one ticker's trades with a chatty, random-access algorithm (see the affinity sketch after this list)
- Data source and task scheduling control
  - Data sources: RDBMS or IMDG
  - Scheduling modes: neutral and data-aware
  - Task and job latency measurement
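
The slides do not show the demo's data model, but the ticker-based affinity it relies on can be expressed with Coherence's KeyAssociation interface, as in the hypothetical sketch below: every trade keyed by a TradeKey lands in the partition that owns its ticker, and the ticker string is what the wrapper would pass as the affinity key when routing the per-ticker task.

```java
import com.tangosol.net.cache.KeyAssociation;

import java.io.Serializable;

/**
 * Hypothetical cache key for the demo's trade objects. Implementing KeyAssociation
 * co-locates all trades of a ticker in the ticker's partition, so a data-aware routed
 * task finds all of its data on the local host.
 */
public class TradeKey implements KeyAssociation, Serializable {
    private final String ticker;
    private final long tradeId;

    public TradeKey(String ticker, long tradeId) {
        this.ticker = ticker;
        this.tradeId = tradeId;
    }

    /** The ticker is the affinity key: the same value is handed to the Data Grid Adapter. */
    @Override
    public Object getAssociatedKey() {
        return ticker;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TradeKey)) {
            return false;
        }
        TradeKey other = (TradeKey) o;
        return tradeId == other.tradeId && ticker.equals(other.ticker);
    }

    @Override
    public int hashCode() {
        return 31 * ticker.hashCode() + (int) (tradeId ^ (tradeId >>> 32));
    }
}
```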

SGE + Oracle Coherence on AWS
- AWS provisioning considerations
  - Customizing a popular image is faster than spinning up a custom AMI
  - Make sure sshd is started on a freshly started machine
- SGE considerations
  - Reverse DNS lookup: /etc/hosts must contain all hostnames
  - Password-less SSH access
- Oracle Coherence considerations
  - No multicast: unicast + well-known addresses (see the configuration sketch after this list)
  - Many short-lived clients: do not join the TCMP cluster, use Coherence*Extend
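
A minimal Java sketch of the two Coherence adjustments above, using the standard tangosol.coherence.* system properties (normally passed as -D JVM flags); the addresses, the cache name and the client configuration file name are placeholders, and the referenced file is assumed to define a remote-cache-scheme pointing at a Coherence*Extend proxy.

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class CoherenceOnAwsSketch {

    /** Storage/compute host JVMs: no multicast on EC2, so discovery uses well-known addresses. */
    static void configureClusterNode() {
        System.setProperty("tangosol.coherence.wka", "10.0.0.11");        // placeholder well-known address
        System.setProperty("tangosol.coherence.localhost", "10.0.0.12");  // placeholder local bind address
    }

    /** Short-lived client JVMs: connect through Coherence*Extend instead of joining the TCMP cluster. */
    static void configureExtendClient() {
        System.setProperty("tangosol.coherence.cacheconfig", "extend-client-config.xml"); // placeholder file
    }

    public static void main(String[] args) {
        configureExtendClient();
        NamedCache trades = CacheFactory.getCache("trades"); // hypothetical cache name
        System.out.println("Connected, cache size = " + trades.size());
        CacheFactory.shutdown();
    }
}
```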

Data Aware Routing in the Cloud
Eugene Steinberg
esteinberg@griddynamics.com
