Using Grid Technologies in the Cloud for High Scalability

An unstated assumption is that clouds are scalable. But are they? Stick thousands upon thousands of machines together and there are a lot of potential bottlenecks just waiting to choke off your scalability supply. And if the cloud is scalable, what are the chances that your application is really linearly scalable? At 10 machines all may be well. Even at 50 machines the seas look calm. But at 100, 200, or 500 machines all hell might break loose. How do you know?

You know through real-life testing. These kinds of tests are brutally hard and complicated, so who wants to do all the incredibly precise and difficult work of producing cloud scalability tests? Grid Dynamics has stepped up to the challenge and has just released its Cloud Performance Reports.

  1. Victoria Livschitz, CEO Grid Dynamics [email_address] September 17th, 2008 Using Grid Technologies on the Cloud for High Scalability A Practitioner Report for Cloud User Group
  2. A word about Grid Dynamics <ul><li>Who we are: global leader in scalability engineering </li></ul><ul><ul><li>Mission: enable adoption of scalable applications and networks through design patterns, best practices and engineering excellence </li></ul></ul><ul><ul><li>Value proposition: fusion of innovation with best practices </li></ul></ul><ul><ul><li>Focused on “physics”, “economics” and “engineering” of extreme scale </li></ul></ul><ul><ul><li>Founded in 2006, 30 people and growing, HQ in Silicon Valley </li></ul></ul><ul><li>Services </li></ul><ul><ul><li>Technology consulting </li></ul></ul><ul><ul><li>Application & systems architecture, design, development </li></ul></ul><ul><li>Customers </li></ul><ul><ul><li>Users of scalable applications: eBay, Bank of America, web start-ups </li></ul></ul><ul><ul><li>Makers of scalable middleware: GigaSpaces, Sun, Microsoft </li></ul></ul><ul><ul><li>Partners: GridGain, GigaSpaces, Terracotta, Data Synapse, Sun, MS </li></ul></ul>
  3. Why am I speaking here tonight? <ul><li>We do scalability engineering for a living </li></ul><ul><li>Cloud computing is new, very exciting and terribly over-hyped </li></ul><ul><ul><li>Not a lot of solid data on performance, scalability, usability, stability… </li></ul></ul><ul><li>Many of our customers are early adopters or enablers </li></ul><ul><ul><li>Their pains, discoveries and lessons are worth sharing </li></ul></ul><ul><li>The practitioner perspective </li></ul><ul><ul><li>Recently completed 3 benchmark projects that we can make public </li></ul></ul><ul><ul><li>Results are presented here tonight </li></ul></ul>
  4. Exploring Scalability thru Benchmarking <table><tr><th>Benchmark</th><th>Cloud</th><th>Vendor</th><th>Middleware</th><th>Application</th></tr><tr><td>1. Test scalability of EC2 on the simplest map-reduce problem</td><td>Public commercial cloud, EC2</td><td>Amazon</td><td>GridGain</td><td>Monte-Carlo</td></tr><tr><td>2. Test scalability of data-driven HPC applications, similar to those used in practice</td><td>Public commercial cloud, EC2</td><td>Amazon</td><td>GigaSpaces</td><td>Risk Management</td></tr><tr><td>3. Explore performance implications of data “in the cloud” vs. “outside the cloud”</td><td>Incubator compute cloud for academic use, CompFin</td><td>Microsoft</td><td>Windows HPC Server, Velocity</td><td>Data-intensive Analytics</td></tr></table>
  5. Benchmark #1: Scalability of Simple Map/Reduce Application on EC2
  6. Basic Scalability of Simple Map/Reduce <ul><li>Goal: Establish upper limit on scalability of Monte-Carlo simulations performed on EC2 using GridGain </li></ul><ul><li>Why Monte-Carlo: simple, widely-used, perfectly scalable problem </li></ul><ul><li>Why EC2: most popular public cloud </li></ul><ul><li>Why GridGain: simple, open-source map-reduce middleware </li></ul><ul><li>Intended Claims: </li></ul><ul><ul><li>EC2 scales linearly as grid execution platform </li></ul></ul><ul><ul><li>GridGain scales linearly as map-reduce middleware </li></ul></ul><ul><ul><li>Businesses can run their existing Monte-Carlo simulations on EC2 today using open-source technologies </li></ul></ul>
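As a rough illustration of the benchmark's shape, here is a minimal sketch of a Monte-Carlo map/reduce split (here estimating pi): each map job runs an independent batch of random trials, and the reduce step aggregates the partial counts. Function names are illustrative, not GridGain's actual API.

```javascript
// Each "map" job: run an independent batch of random trials.
// Here: count darts landing inside the unit quarter-circle.
function monteCarloJob(iterations) {
  let hits = 0;
  for (let i = 0; i < iterations; i++) {
    const x = Math.random(), y = Math.random();
    if (x * x + y * y <= 1) hits++;
  }
  return { hits, iterations };
}

// "Reduce": aggregate partial results from all worker nodes.
function reducePartials(partials) {
  let hits = 0, total = 0;
  for (const p of partials) { hits += p.hits; total += p.iterations; }
  return (4 * hits) / total; // pi estimate
}

// Simulate 8 worker nodes, 5000 iterations each (the benchmark's per-node count)
const partials = Array.from({ length: 8 }, () => monteCarloJob(5000));
console.log("pi estimate:", reducePartials(partials));
```

Because each map job is independent, doubling nodes and iterations together keeps per-node work constant, which is exactly the weak-scaling setup measured in this benchmark.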
  7. Other Goals <ul><li>Understand “process bottlenecks” of EC2 platform </li></ul><ul><ul><li>Changes to the programming, deployment, management model </li></ul></ul><ul><ul><li>Ease of use </li></ul></ul><ul><ul><li>Security </li></ul></ul><ul><ul><li>Metering and payment </li></ul></ul><ul><li>Identify scalability bottlenecks at any level in the stack </li></ul><ul><ul><li>EC2 </li></ul></ul><ul><ul><li>GridGain </li></ul></ul><ul><ul><li>Glueware </li></ul></ul><ul><li>Robustness </li></ul><ul><ul><li>Stability </li></ul></ul><ul><ul><li>Predictability </li></ul></ul>
  8. Architecture (diagram): a Head Node inside the Amazon EC2 cloud, running the OpenMQ server and an HTTP server with the configuration & task repository, handles discovery & task assignment over JMS and manages worker nodes and tasks; Worker Nodes perform job execution, with spare EC2 instances held as spare capacity; a Grid Console on the corporate intranet controls grid operation. <ul><li>Technology Stack: </li></ul><ul><ul><li>EC2 </li></ul></ul><ul><ul><li>GridGain </li></ul></ul><ul><ul><li>Typica </li></ul></ul><ul><ul><li>OpenMQ </li></ul></ul>
  9. Performance Methodology & Results <ul><li>Same algorithm exercised on wide range of nodes </li></ul><ul><ul><li>2, 4, 8, 16, …, 256, 512. Limited by Amazon permission of 550 nodes </li></ul></ul><ul><ul><li>Simultaneously double the amount of computations and nodes </li></ul></ul><ul><ul><li>Measure completion time </li></ul></ul><ul><ul><li>Repeat several times to get statistical averages </li></ul></ul><ul><li>Conclusions </li></ul><ul><ul><li>Total degradation from 13 to 16 seconds, or 20% </li></ul></ul><ul><ul><li>Discarding first 8 nodes, near perfect scale up to 128 </li></ul></ul><ul><ul><li>Slight degradation from 128 to 256 (3%), from 256 to 512 (7%) </li></ul></ul>=> Proves the point of near-linear scalability end-to-end
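The degradation figures above can be expressed as weak-scaling efficiency: since work grows with node count, ideal completion time stays flat, and efficiency is the baseline time divided by the measured time. A minimal sketch; only the 13 s and 16 s endpoints come from the benchmark, the intermediate timing is an assumed value for illustration.

```javascript
// Weak-scaling efficiency: 1.0 means perfect linear scaling
// (completion time unchanged as work and nodes double together).
function weakScalingEfficiency(baselineSeconds, measuredSeconds) {
  return baselineSeconds / measuredSeconds;
}

const runs = [
  { nodes: 2, seconds: 13.0 },   // measured baseline
  { nodes: 128, seconds: 13.4 }, // illustrative intermediate value (assumption)
  { nodes: 512, seconds: 16.0 }, // measured endpoint
];
for (const r of runs) {
  const eff = weakScalingEfficiency(runs[0].seconds, r.seconds);
  console.log(r.nodes + " nodes: efficiency " + eff.toFixed(2));
}
```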
  10. Simple scaling script <ul><li>var itersPerNode = 5000; </li></ul><ul><li>var cnode = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]; </li></ul><ul><li>for (var i in cnode) { </li></ul><ul><li>var n = cnode[i]; </li></ul><ul><li>grid.growEC2Grid(n, true); </li></ul><ul><li>grid.waitForGridInstances(n); </li></ul><ul><li>runTask(itersPerNode * n, n, 3); </li></ul><ul><li>} </li></ul>
  11. Observations <ul><li>Deployment considerations </li></ul><ul><ul><li>Start-up for whole grid in different configurations is 0.5 - 3 min </li></ul></ul><ul><ul><li>2-step deployment process </li></ul></ul><ul><ul><ul><li>First, bring up one EC2 node as controller </li></ul></ul></ul><ul><ul><ul><li>Next, use the controller on-the-inside to coordinate bootstrapping </li></ul></ul></ul><ul><ul><li>Some EC2 nodes don’t finish bootstrapping successfully </li></ul></ul><ul><ul><ul><li>An average of 0.5% of nodes come up in an incomplete state </li></ul></ul></ul><ul><ul><ul><li>The nature of the problem is not clear </li></ul></ul></ul><ul><ul><ul><li>If the exact processing power is essential, start the nodes, then kill off the sick ones and bring up a few new ones before starting computation </li></ul></ul></ul><ul><ul><li>IP address deadlock issue </li></ul></ul><ul><ul><ul><li>IP addresses of the nodes are needed to start & configure the grid </li></ul></ul></ul><ul><ul><ul><li>IP addresses are not available until the grid is up & configured </li></ul></ul></ul><ul><ul><ul><li>Need to carefully choreograph bootstrapping and pass IPs as parameters into controlling scripts </li></ul></ul></ul>
  12. Observations <ul><li>Monitoring considerations </li></ul><ul><ul><li>Connection to each node from outside is possible, but not efficient </li></ul></ul><ul><ul><li>Check heartbeat from the internal management nodes </li></ul></ul><ul><ul><li>Local scripts must be stored on S3 or passed back before termination </li></ul></ul><ul><li>Programming model considerations </li></ul><ul><ul><li>EC2 does not support IP multicast </li></ul></ul><ul><ul><ul><li>Switched to JMS instead </li></ul></ul></ul><ul><ul><ul><li>Luckily, GridGain supported multiple protocols </li></ul></ul></ul><ul><ul><li>Typica: 3rd-party connectivity library that uses the EC2 query interface </li></ul></ul><ul><ul><ul><li>Undocumented limit on URL length is hit with 100s of nodes </li></ul></ul></ul><ul><ul><ul><li>Amazon just disconnects on improper URLs without specifying the error, so debugging was hard </li></ul></ul></ul><ul><ul><ul><li>Workaround: rewrote some parts of our framework to enquire about individual running nodes. Works, but less efficient </li></ul></ul></ul>
  13. Observations <ul><li>Metering and payment </li></ul><ul><ul><li>Amazon sets a limit on concurrent VMs </li></ul></ul><ul><ul><ul><li>Eventually got approval for 550 VMs after some due diligence from Amazon </li></ul></ul></ul><ul><ul><li>Amazon charges by full or partial VM/hours </li></ul></ul><ul><ul><li>Sometimes, short usage of VMs is not metered </li></ul></ul><ul><ul><ul><li>Not clear why </li></ul></ul></ul><ul><ul><ul><li>One hypothesis: metering “sweeps” happen every so often </li></ul></ul></ul><ul><ul><li>Be careful with usage bills for testing </li></ul></ul><ul><ul><ul><li>A test may need to be run multiple times </li></ul></ul></ul><ul><ul><ul><li>Beware of rogue scripts </li></ul></ul></ul><ul><ul><ul><li>Test everything on smaller configurations first </li></ul></ul></ul><ul><ul><ul><li>Scale gradually, or you will miss the bottlenecks </li></ul></ul></ul>
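Because Amazon bills each VM by full or partial hour, a short run on many nodes is billed as many full VM-hours, which is why repeated large-scale test runs get expensive fast. A sketch of the arithmetic; the hourly rate here is an illustrative assumption, not a quoted price.

```javascript
// Per-run cost estimate: each VM is billed by full or partial hour,
// so partial hours round up via Math.ceil.
function runCostUSD(nodes, runMinutes, ratePerVmHour) {
  const billedHours = nodes * Math.ceil(runMinutes / 60);
  return billedHours * ratePerVmHour;
}

// A 10-minute run on 512 nodes still bills as 512 full VM-hours
// (rate below is an assumed example, not Amazon's actual pricing)
console.log("one run: $" + runCostUSD(512, 10, 0.10).toFixed(2));
```

Running the same test several times for statistical averages multiplies this cost, which is the practical reason to test on smaller configurations first.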
  14. Achieving scalability <ul><li>Software breaks at scale. Including the glueware </li></ul><ul><ul><li>Barrier #1 was hit at 100 nodes because of ActiveMQ scalability </li></ul></ul><ul><ul><ul><li>Correction: Switched ActiveMQ for OpenMQ </li></ul></ul></ul><ul><ul><ul><li>Comment: some users report better ActiveMQ scalability with 5.x </li></ul></ul></ul><ul><ul><li>Barrier #2 was hit at 300 nodes because of Typica URL length limit </li></ul></ul><ul><ul><ul><li>Correction: Changed our use of the API </li></ul></ul></ul><ul><li>Security considerations </li></ul><ul><ul><li>EC2 credentials are passed to the Head Node </li></ul></ul><ul><ul><li>3rd-party GridGain tasks can access them </li></ul></ul><ul><ul><li>Sounds like a potential vulnerability </li></ul></ul>
  15. What have we learned? <ul><li>EC2 is ready for production usage on large-scale stateless computations </li></ul><ul><ul><li>Price/performance </li></ul></ul><ul><ul><li>Strong linear scale curve </li></ul></ul><ul><li>GridGain showed itself very well </li></ul><ul><ul><li>Scale, stability, ease-of-use, pluggability </li></ul></ul><ul><ul><li>Solid open-source choice of map-reduce middleware </li></ul></ul><ul><li>Some level of effort is required to “port” a grid system to EC2 </li></ul><ul><ul><li>Deployment, monitoring, programming model, metering, security </li></ul></ul><ul><li>What’s next? </li></ul><ul><ul><li>Can we go higher than 512? </li></ul></ul><ul><ul><li>What is the behavior of more complex applications? </li></ul></ul>
  16. Benchmark #2: Scalability of Data-Driven Risk Management Application on EC2
  17. Data-driven Risk Management on EC2 <ul><li>Goal: Investigate scalability of a prototypical Risk Management application that uses a significant amount of cached data to support large-scale Monte-Carlo simulations executed on EC2 using GigaSpaces </li></ul><ul><li>Why risk management: class of problems widely used in financial services </li></ul><ul><li>Why GigaSpaces: leading middleware platform for compute & data grids </li></ul><ul><li>Intended Claims: </li></ul><ul><ul><li>EC2 scales linearly for data-driven HPC applications </li></ul></ul><ul><ul><li>GigaSpaces scales well as both compute and data grid middleware </li></ul></ul><ul><ul><li>Businesses can run their existing risk management (and similar) applications on EC2 today using off-the-shelf technologies </li></ul></ul>
  18. Architecture (diagram): the user manages the grid with ec2-gdc-tools via the Grid Console; on the Amazon EC2 grid, the Master writes tasks into the data grid and waits for results, while workers in the compute grid take tasks, perform calculations, and write results back. Components: Master, Compute Grid, Data Grid, Service Grid Manager, Grid Console.
  19. Performance methodology & results <ul><li>Same algorithm exercised on wide range of nodes </li></ul><ul><ul><li>16, 32, 128, 256, 512. Still limited by Amazon permission of 550 </li></ul></ul><ul><ul><li>Constant size of data grid (4 large EC2 nodes) </li></ul></ul><ul><ul><li>Double the nodes with constant amount of work </li></ul></ul><ul><ul><li>Measure completion time (strive for linear time reduction) </li></ul></ul><ul><li>Conclusions </li></ul><ul><ul><li>Near perfect scale from 16 to 256 nodes </li></ul></ul><ul><ul><li>28% degradation from 256 to 512 since the data cache becomes a bottleneck </li></ul></ul>
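The 256-to-512 degradation is consistent with a fixed-capacity data grid adding a floor to completion time under strong scaling: the parallel part shrinks as workers double, but the cache-bound part does not. A toy model; the split between parallel and cache-bound work below is an illustrative assumption, not measured data.

```javascript
// Strong-scaling toy model: total time = parallel work spread across
// workers + a fixed cache-bound component (the constant-size data grid).
function completionTime(workers, parallelWork, cacheBoundWork) {
  return parallelWork / workers + cacheBoundWork;
}

// With assumed work units, the cache-bound term barely matters at 16
// workers but dominates the shrinking parallel share at 512.
for (const n of [16, 32, 128, 256, 512]) {
  console.log(n + " workers: " + completionTime(n, 1600, 1).toFixed(1) + " time units");
}
```

The same shape explains why "What's next?" on the following slide asks how the data grid itself scales: growing the cache shrinks the fixed term.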
  20. What have we learned? <ul><li>EC2 is ready for production usage for classes of large-scale data-driven HPC applications, common to Risk Management </li></ul><ul><li>GigaSpaces showed itself very well </li></ul><ul><ul><li>Compute - data grid scales well in master-worker pattern </li></ul></ul><ul><li>Some level of effort is required to “port” a grid system to EC2 </li></ul><ul><ul><li>Deployment, monitoring, programming model, metering, security </li></ul></ul><ul><ul><li>Bootstrapping this system is far more complex than GridGain’s. For more details, contact me offline </li></ul></ul><ul><li>What’s next? </li></ul><ul><ul><li>How does the data grid scale? </li></ul></ul><ul><ul><li>What about more complex applications? </li></ul></ul><ul><ul><li>What’s the scalability of a co-located compute-data grid configuration? </li></ul></ul>
  21. Benchmark #3: Performance implications of data “in the cloud” vs. “outside the cloud” for data-intensive analytics applications
  22. Data-intensive Analytics on MS cloud <ul><li>Goal: Investigate performance improvements from data “in the cloud” vs. “outside the cloud” for complex data-intensive Analytical applications in the context of the HPC CompFin++ Labs environment using Velocity </li></ul><ul><li>What is CompFin++ Labs: MS-funded “incubator” compute cloud for exploration of modern compute & data challenges on massive scale </li></ul><ul><li>What is Velocity: MS’s new in-memory data grid middleware, still in CTP1 </li></ul><ul><li>The Model: Computes correlation between stock prices over time. Algorithms use a significant amount of data which could be cached. Maximum cache hit ratio for the model is around 90%. </li></ul><ul><li>Intended Claims: </li></ul><ul><ul><li>Measure impact of data “closeness” to the computation on the cloud </li></ul></ul>
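The benefit of keeping data near the computation can be sketched from the model's stated ~90% maximum cache hit ratio: average access time is a hit-ratio-weighted blend of fast cached reads and slow remote reads. The latency figures below are illustrative assumptions, not benchmark measurements.

```javascript
// Average data-access latency as a blend of cache hits and remote misses.
function avgAccessMs(hitRatio, cacheMs, remoteMs) {
  return hitRatio * cacheMs + (1 - hitRatio) * remoteMs;
}

// Assumed latencies: 1 ms for a cached read, 100 ms for a read from
// outside the cloud. With a 90% hit ratio, misses still dominate.
const outsideCloud = avgAccessMs(0.0, 1, 100); // no cache: every read is remote
const inCloud = avgAccessMs(0.9, 1, 100);      // model's maximum hit ratio
console.log("speedup on data access: " + (outsideCloud / inCloud).toFixed(1) + "x");
```

Even this crude model shows why the hit ratio caps the achievable gain: the 10% of misses sets a floor on average access time no matter how fast the cache is.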
  23. Architecture: CompFin
  24. Architecture: Anticipated Bottlenecks
  25. Architecture: CompFin + Velocity
  26. Benchmarked configurations <ul><li>Same analytical model with complex queries </li></ul><ul><ul><li>Perfect linear scale curve (baseline) </li></ul></ul><ul><ul><li>Original CompFin </li></ul></ul><ul><ul><li>Distributed cache (original CompFin + Velocity distributed cache for financial data) </li></ul></ul><ul><ul><li>Local cache (original CompFin + Velocity distributed cache for financial data + near cache with data-aware routing) </li></ul></ul>
  27. Test methodology <ul><li>3 ways of measuring scalability were used </li></ul><ul><ul><li>Fixed amount of computations, increasing amount of data </li></ul></ul><ul><ul><li>Fixed amount of data, increasing amount of computations </li></ul></ul><ul><ul><li>Proportional increase of computations and nodes </li></ul></ul><ul><ul><li>“Node” = 1 core </li></ul></ul><ul><ul><li>“Data unit” = 32 million records or 512 megabytes of tick data </li></ul></ul><table><tr><th></th><th colspan="3">Test 1</th><th colspan="3">Test 2</th><th colspan="3">Test 3</th></tr><tr><th>Test #</th><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>7</td><td>8</td><td>9</td></tr><tr><th>Nodes</th><td>8</td><td>32</td><td>32</td><td>32</td><td>32</td><td>32</td><td>64</td><td>128</td><td>200</td></tr><tr><th>Data Units</th><td>1</td><td>1</td><td>1</td><td>6</td><td>12</td><td>12</td><td>24</td><td>48</td><td>69</td></tr></table>
  28. Performance results
  29. Performance results
  30. Conclusions <ul><li>Data “on the cloud” definitely matters! </li></ul><ul><ul><li>Performance improvements up to 31 times over “outside the cloud” </li></ul></ul><ul><li>Velocity distributed cache has some scalability challenges: </li></ul><ul><ul><li>Failure on a 50-node cluster with 200 concurrent clients </li></ul></ul><ul><ul><li>Good news: it’s a very young product and MS is actively improving it </li></ul></ul><ul><li>Compute-data affinity matters too! </li></ul><ul><ul><li>Significant performance gain of local cache over distributed cache </li></ul></ul><ul><ul><li>Local cache resolved the distributed cache scalability issue by reducing its load </li></ul></ul>
  31. Final Remarks <ul><li>Clouds are proving themselves out </li></ul><ul><ul><li>Early adopters are there already </li></ul></ul><ul><ul><li>The rest of the real world will join soon </li></ul></ul><ul><li>There are still significant adoption challenges </li></ul><ul><ul><li>Technology immaturity </li></ul></ul><ul><ul><li>Lack of real data, best practices, robust design patterns </li></ul></ul><ul><ul><li>“Fitting” of application middleware to cloud platforms is just starting </li></ul></ul><ul><li>Amazon is the leading commercial cloud provider, but is not the only game in town </li></ul><ul><ul><li>Companies are building public, private, dedicated and special-purpose clouds </li></ul></ul>
  32. Victoria Livschitz [email_address] Thank You!
