Best Practices In Building Scalable Cloud Ready Service Based


Published in: Technology
  • Best Practices In Building Scalable Cloud Ready Service Based

    1. Best Practices in building scalable cloud-ready service-based systems: Discussion. Igor Moochnick, IgorShare Consulting, [email_address], Blog:
    2. Is this system scalable?
    3. What is scalability?
         • Increase resources -> performance increases in proportion to the resources added
         • Increase performance -> more units of work
    4. Scale-up And Scale-out
         [Diagram: Scale Up ($500 machine, $1000 machine, $10,000 machine) versus Scale Out (several $500 machines behind DNS / WWW); labels: Volume, # Machines]
    5. Is this system scalable?
    6. Is this system scalable?
    7. Is this system scalable?
    8. Is this system a scalable MADNESS?
    9. Here is the gun. Go kill yourself!
    10. Some Useful Definitions

         Consistency Levels
         Consistency Level    Changes are Visible     Example
         Strong               Now                     Missile Launch
         Eventual             In the Future           Address Change
         Optimistic           Maybe in the Future     Stock Ticker

         Message Assurances
         Assurance            Message Delivery        Example
         Exactly Once         No loss, no duplicates  Bank Transfer
         At Least Once        No loss, duplicates     Email
         At Most Once         Loss, no duplicates     Streaming Video
         Best Effort          Loss, duplicates        Stock Ticker
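
The "At Least Once" row in the table above means a receiver may see the same message more than once, so the receiver has to make duplicates harmless. A minimal sketch of that idea, assuming every message carries a unique ID chosen by the sender; the `Message`, `IdempotentConsumer` and `deliver_at_least_once` names are hypothetical and not part of the talk:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: str   # hypothetical unique ID assigned by the sender
    amount: int


@dataclass
class IdempotentConsumer:
    """Applies each message once, even if the channel delivers it repeatedly."""
    balance: int = 0
    seen: set = field(default_factory=set)

    def handle(self, msg: Message) -> bool:
        if msg.msg_id in self.seen:
            return False              # duplicate delivery: ignore it
        self.seen.add(msg.msg_id)
        self.balance += msg.amount
        return True


def deliver_at_least_once(consumer, messages, redeliveries=1):
    """Simulate a channel that never loses a message but may repeat it."""
    for msg in messages:
        for _ in range(1 + redeliveries):
            consumer.handle(msg)
```

With this shape, an at-least-once channel plus an idempotent handler behaves like exactly-once processing from the application's point of view, at the cost of remembering the IDs already seen.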
    11. The Infrastructure Developer's Experience

         Where did you start?             Where did you end up?
         Shared State                     Partitioned, Replicated State
         ACID Transactions                Eventual Consistency
         Exactly Once Messaging           Best Effort Messaging
         Machine Loss is a Catastrophe    Machine Loss is Business As Usual
         Keep Processes Running           Recovery-Oriented Computing
    12. The law
         • The least scalable component of your system becomes a bottleneck for the whole system
    13. Recipe ingredients (Amazon guidelines)
         • Autonomy
         • Asynchrony
         • Controlled concurrency
         • Controlled parallelism
         • Decentralize
         • Decompose into small well-understood building blocks
         • Failure tolerant
         • Local responsibility
         • Recovery Built-in
         • Simplicity
         • Symmetry
    14. Key principles
         • Things fail all the time!
         • Machines
             • Disposable
             • Nameless
             • Self assembled
         • State management
             • Caching
             • Loose consistency
             • Relax isolation
         • Redundancy
         • Partitioning
         • Loosely coupled messaging
         • Best effort
         • Message loss
         • Retries
         • Self monitors
         • Self heals
         • Designed to expect failures
         • Continue to work seamlessly during the failure
    15. Application Development Patterns
         • Architecture
             • Choose a high-level framework
             • Keep service and hosting code separate
             • Partition
         • Design
             • Use loose coupling
             • Use caches and stale data
             • Have just a few simple recovery paths
             • Be topology-independent
             • Be hardware-independent
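
One way to read "Use caches and stale data" is a cache that expires entries but falls back to the expired value when the backend cannot be reached. The sketch below assumes a caller-supplied `loader` function and an injectable `clock`; the class name and parameters are illustrative, not taken from the slides:

```python
import time


class StaleTolerantCache:
    """Expiration-based cache that serves stale data if a refresh fails."""

    def __init__(self, loader, ttl, clock=time.monotonic):
        self.loader = loader      # hypothetical function: key -> value
        self.ttl = ttl            # seconds an entry counts as fresh
        self.clock = clock
        self.entries = {}         # key -> (value, fetched_at)

    def get(self, key):
        now = self.clock()
        cached = self.entries.get(key)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]      # fresh hit, no backend call
        try:
            value = self.loader(key)
        except Exception:
            if cached is not None:
                return cached[0]  # backend failed: serve the stale value
            raise                 # nothing cached, so surface the failure
        self.entries[key] = (value, now)
        return value
```

Whether stale answers are acceptable depends on the consistency level the data needs, which is why the deck asks for a hard look at consistency requirements first.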
    16. Challenges Of Scalability
         • How do I ensure incoming requests are processed at the right location?
             • Partition on service-specific input
             • Dynamically route to correct node
             • Fail over seamlessly
         • How do I manage state inside my service?
             • Take a hard look at consistency requirements
             • Aggressively cache and use transient data
             • Partition the Storage Tier
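
The first question above (partition on an input, route to the right node, fail over) can be sketched with rendezvous hashing, which is one possible technique and not necessarily the one the speaker had in mind. The node names and the `healthy` set are assumptions made for the example:

```python
import hashlib


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(key: str, nodes: list, healthy: set) -> str:
    """Return the healthy node with the highest score for this partition key."""
    candidates = [n for n in nodes if n in healthy]
    if not candidates:
        raise RuntimeError("no healthy nodes available")
    return max(candidates, key=lambda n: _score(n, key))
```

Because each key ranks all nodes independently, removing a failed node only moves the keys that were routed to it; every other key keeps its node, which keeps failover cheap.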
    17. ACID vs. BASE
         • ACID
             • Atomic
             • Consistent
             • Isolated
             • Durable
         • Modern BASE-based systems
             • Basically Available
             • Soft-state (or scalable)
             • Eventually consistent
    18. What is the problem?
         • Only two of three:
             • Strong Consistency
                 • All clients see the same view during updates
             • High Availability
                 • Some data replica is always available despite failures
             • Partition tolerance
                 • All the properties hold even if partitioned
    19. Techniques (each labeled with the two of Consistency, Availability and Partition tolerance that it keeps)
         • Expiration-based caching: AP
         • Quorum / majority algorithms: PC
         • Two-phase commit: AC
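
To make the quorum entry concrete: with N replicas, writing to W of them and reading from R of them with R + W > N guarantees that every read set overlaps the latest write set. The sketch below is a single-process simulation under simplifying assumptions (one coordinator hands out version numbers, replicas are plain objects); `Replica` and `QuorumStore` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Replica:
    version: int = 0
    value: object = None
    up: bool = True


class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        if r + w <= n:
            raise ValueError("need R + W > N so read and write sets overlap")
        self.replicas = [Replica() for _ in range(n)]
        self.w, self.r = w, r
        self._version = 0   # assumption: a single coordinator orders writes

    def _live(self, needed):
        live = [rep for rep in self.replicas if rep.up]
        if len(live) < needed:
            raise RuntimeError("quorum not reachable")
        return live

    def write(self, value):
        targets = self._live(self.w)[: self.w]
        self._version += 1
        for rep in targets:
            rep.version, rep.value = self._version, value

    def read(self):
        contacted = self._live(self.r)[: self.r]
        # The overlap guarantee means one contacted replica holds the newest write.
        return max(contacted, key=lambda rep: rep.version).value
```

When too few replicas are reachable the store refuses to answer, which matches the slide's PC label: consistency is kept by giving up availability.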
    20. Scaling data in 3 steps
         • Partitioning
         • Routing
         • State management
    21. Solving the data congestion
         • Throttling (especially on startup after failure)
         • Denormalization
         • Scale vs. Performance
         • Fault Tolerance and recoverability
         • Geo-distribution
         • Content distribution providers (like Akamai)
    22. Fault tolerance
         • Throttling incoming traffic
         • Limit retries
         • Server failover
         • Data center failover
         • Consider using queues
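
The first two bullets above, throttling incoming traffic and limiting retries, can be sketched as a token bucket and a bounded retry helper with jittered backoff. This is one common way to do it, not something the slides prescribe; all names, defaults and the injectable `clock` and `sleep` parameters are assumptions for illustration:

```python
import random
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        refill = (now - self.last) * self.rate
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False              # over the limit: caller should reject or queue


def call_with_retries(operation, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on any exception up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise             # retry budget exhausted: surface the failure
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))   # jitter spreads out retry storms
```

Capping the attempts matters as much as the backoff: unlimited retries from many clients can keep a recovering service overloaded.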
    23. Monitoring
         • Monitor data about what the user sees; this is what matters most
         • Do not overdo it: excessive monitoring can kill the components you rely on
         • Be frugal
             • Build in counters and monitor the trends; this can help you predict spikes and allocate extra resources on demand
    24. Monitoring
         • Availability
         • Performance
         • Alerts
         • Auto throttling
         • Capacity thresholds
         • Load
         • Transactions
         • Should measure realistic/relevant actions and behavior!
         (Diagram label: Importance)
    25. Diagnosing & Logging
         • Non-blocking
         • Asynchronous
         • Size: logs can grow too big (there is “too much of a good thing”)
             • Have control over “what” and “how much”
         • Performance hit (“do no harm”)
         • Should not become a bottleneck
         • Be careful what you log
             • Horizontally
             • Vertically
         • Should be able to replay logs and correlate the requests
             • <time><correlate-id><node-id><action><data><result>
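
A sketch of non-blocking, asynchronous logging with the record layout from the last bullet, using the standard library's `QueueHandler` and `QueueListener`: the request path only enqueues a record, and a background thread does the actual I/O. The helper names and the space-separated field format are assumptions; the slide only lists the fields:

```python
import logging
import logging.handlers
import queue

# <time> <correlate-id> <node-id> <action> <data> <result>
LOG_FORMAT = ("%(asctime)s %(correlation_id)s %(node_id)s "
              "%(action)s %(data)s %(result)s")


def make_async_logger(name, sink):
    """Return (logger, listener); `sink` is any logging.Handler doing the I/O."""
    log_queue = queue.Queue()                  # unbounded, so enqueueing never blocks
    sink.setFormatter(logging.Formatter(LOG_FORMAT))
    listener = logging.handlers.QueueListener(log_queue, sink)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False                   # keep records off the root logger
    listener.start()                           # background thread drains the queue
    return logger, listener


def log_request(logger, correlation_id, node_id, action, data, result):
    """Emit one replayable record tagged with the request's correlation ID."""
    logger.info("request", extra={
        "correlation_id": correlation_id, "node_id": node_id,
        "action": action, "data": data, "result": result,
    })
```

Calling `listener.stop()` at shutdown flushes whatever is still queued. Passing the same correlation ID through every node that touches a request is what makes the logs correlatable afterwards.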
    26. Troubleshooting distributed systems
         • Decoupling
         • Role isolation
         • Single box
         • Allow functionality to be separated from the rest of the system
         • Allow everything to run from a single box
         • Have stubs and simulators
         • Be able to “replay” the logs
    27. Deployments
         • Deployment packaging
         • Rolling out gradually or atomically
         • Automatic deployments
         • Staging environment
         • Building confidence with real customer data
         • Rolling back
         • Security trumps feature
         • Load balancing
         • Consider linear scale
         • Keep IT in mind
         • Upgradability
    28. Deployment
         • It’s hard
         • It’s hard to get it right
         • Automate everything: automation makes deployments repeatable
         • Keep versions forward and backward compatible
         • Rolling upgrade and rollback
         • Be nice to your friends
         • Know and manage your environments
         • Compensate for gradual system recovery
         • Clean queues
    29. Resources
         • Availability & Consistency presentation by Amazon CTO Dr. Werner Vogels
         • Microsoft PDC’08 Presentations
    30. Q&A
    31. Thank you!