Transcript

  • 1. Scalability & Availability Paul Greenfield
  • 2. Building Real Systems
    • Scalable
      • Handle expected load with acceptable levels of performance
      • Grow easily when load grows
    • Available
      • Available enough of the time
    • Performance and availability cost
      • Aim for ‘enough’ of each but not more
      • Have to be ‘architected’ in… not added
  • 3. Scalable
    • Scale-up or…
      • Use bigger and faster systems
    • … Scale-out
      • Systems working together to handle load
        • Server farms
        • Clusters
    • Implications for application design
      • Especially state management
      • And availability as well
  • 4. Available
    • Goal is 100% availability
      • 24x7 operations
      • Including time for maintenance
    • Redundancy is the key to availability
      • No single points of failure
      • Spare everything
        • Disks, disk channels, processors, power supplies, fans, memory, ..
        • Applications, databases, …
          • Hot standby, quick changeover on failure
  • 5. Performance
    • How fast is this system?
      • Not the same as scalability but related
      • Measured by response time and throughput
    • How scalable is this system?
      • Scalability is concerned with the upper limits to performance
      • How big can it grow?
      • How does it grow? (evenly, lumpily?)
  • 6. Performance Measures
    • Response time
      • What delay does the user see?
      • Instantaneous is good
        • 95% under 2 seconds is acceptable?
        • Consistency is important psychologically
      • Response time varies with ‘heaviness’ of transactions
        • Fast read-only transactions
        • Slower update transactions
        • Effects of resource/database contention
  • 7. Response Time
    • Each transaction takes…
      • Processor time
        • Application, system services, database, …
        • Shared amongst competing processes
      • I/O time
        • Largely disk reads/writes
        • Large DB caches reduce # of I/Os
          • 2TB in IBM’s top TPC-C entry
      • Wait time for shared resources
        • Locks, shared structures, …
  • 8. Response Times
  • 9. Response Times
  • 10. Response Times
  • 11. Throughput
    • How many transactions can be handled in some period of time
      • Transactions/second or tpm, tph or tpd
      • A measure of overall capacity
      • Not simply the inverse of response time: concurrency links the two (Little’s law)
    • Transaction Processing Performance Council
      • Standard benchmarks for TP systems
      • www.tpc.org
      • TPC-C models typical transaction system
        • Current record is 4,092,799 tpmC (HP)
      • TPC-E approved as TPC-C replacement (2/07)
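Strictly, throughput is the inverse of response time only for a single client issuing one request at a time; with concurrent users the two are tied together by Little’s law (concurrency = throughput × response time). A minimal Python sketch with made-up numbers, just to show the arithmetic:

```python
# Little's law: concurrent_users = throughput * response_time.
# The figures below are invented for illustration, not benchmark data.

def throughput(concurrent_users, response_time_s):
    """Transactions/second sustained, by Little's law."""
    return concurrent_users / response_time_s

# 100 concurrent users each seeing a 2-second response time:
tps = throughput(100, 2.0)   # 50 transactions/second
tpm = tps * 60               # 3000 transactions/minute
print(tps, tpm)
```

Doubling the user count at the same response time doubles the required throughput, which is why capacity planning needs both numbers, not just one.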
  • 12. Throughput
    • Increases until resource saturation
      • Start waiting for resources
        • Processor, disk & network bandwidth
        • Increasing response time with load
      • Slowly decreases with contention
        • Overheads of sharing, interference
      • Some resources share/overload badly
        • Contention for shared locks
        • Ethernet network performance degrades
        • Disk degrades with sharing
  • 13. Throughput
  • 14. System Capacity?
    • How many clients can you support?
      • Name an acceptable response time
      • ‘95% of responses under 2 secs’ is a common target
        • But measured how, and over what period?
      • Plot response time vs # of clients
    • Great if you can run benchmarks
      • Reason for prototyping and proving proposed architectures before leaping into full-scale implementation
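Given measured response times from such a benchmark run, the ‘95% under 2 seconds’ criterion is a percentile check. A sketch using a simple nearest-rank percentile and invented sample data (not figures from the slides):

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of response times."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical response times (seconds) from one benchmark run:
times = [0.4, 0.6, 0.7, 0.9, 1.1, 1.2, 1.4, 1.6, 1.9, 3.5]
p95 = percentile(times, 95)
print(f"95th percentile: {p95}s, meets 2s target: {p95 < 2.0}")
```

Note how one slow outlier can fail the percentile test even when the mean looks healthy, which is the point of stating the target as a percentile rather than an average.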
  • 15. System Capacity
  • 16. Scaling Out
    • More boxes at every level
      • Web servers (handling user interface)
      • App servers (running business logic)
      • Database servers (perhaps… a bit tricky?)
      • Just add more boxes to handle more load
    • Spread load out across boxes
      • Load balancing at every level
      • Partitioning or replication for database?
      • Impact on application design?
      • Impact on system management
      • All have impacts on architecture & operations
  • 17. Scaling Out
  • 18. ‘Load Balancing’
    • A few different but related meanings
      • Distributing client bindings across servers or processes
        • Needed for stateful systems
        • Static allocation of client to server
      • Balancing requests across server systems or processes
        • Dynamically allocating requests to servers
        • Normally only done for stateless systems
  • 19. Static Load Balancing [diagram: server processes advertise their services to a name server; a client requests a server reference, the name server returns one, and the client then calls the server object’s methods directly, with load balanced across application process instances within a server]
  • 20. Load Balancing in CORBA
    • Client calls on name server to find the location of a suitable server
      • CORBA terminology for object directory
    • Name server can spread client objects across multiple servers
      • Often ‘round robin’
    • Client is bound to server and stays bound forever
      • Can lead to performance problems if server loads are unbalanced
  • 21. Name Servers
    • Server processes call name server as part of their initialisation
      • Advertising their services/objects
    • Clients call name server to find the location of a server process/object
      • Up to the name server to match clients to servers
    • Client then directly calls server process to create or link to objects
      • Client-object binding usually static
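The advertise/lookup protocol above can be sketched as a toy name server that spreads clients round-robin. The class and addresses below are hypothetical stand-ins, not CORBA API calls:

```python
# A minimal name-server sketch: servers advertise themselves at
# start-up, clients ask for a reference, and the name server hands
# out servers round-robin. Binding is static once a client has its
# reference. All names here are invented for illustration.
from itertools import cycle

class NameServer:
    def __init__(self):
        self._servers = []
        self._rr = None

    def advertise(self, server_address):
        """Called by a server process during its initialisation."""
        self._servers.append(server_address)
        self._rr = cycle(self._servers)   # restart rotation over all servers

    def lookup(self):
        """Called by a client; the returned binding is then static."""
        return next(self._rr)

ns = NameServer()
ns.advertise("hostA:9000")
ns.advertise("hostB:9000")
print([ns.lookup() for _ in range(4)])   # alternates between the two hosts
```

Because the binding is static, a long-lived client stays on its original server even if that server later becomes the most heavily loaded, which is exactly the imbalance problem noted on slide 20.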
  • 22. Dynamic Stateful?
    • Dynamic load balancing with stateful servers/objects?
      • Clients can throw away server objects and get new ones every now and again
        • In application code or middleware
        • Have to save & restore state
      • Or object replication in middleware
        • Identical copies of objects on all servers
        • Replication of changes between servers
        • Clients have references to all copies
  • 23. BEA WLS Load Balancing [diagram: clients call an EJB cluster of server instances spread across machines A and B, in front of a shared DBMS; cluster members exchange a heartbeat via a multicast backbone]
  • 24. Threaded Servers
    • No need for load-balancing within a single system
      • Multithreaded server process
        • Thread pool servicing requests
      • All objects live in a single process space
      • Any request can be picked up by any thread
    • Used by modern app servers
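The thread-pool model can be sketched with Python’s standard thread pool: whichever thread is free picks up the next request, and all threads share one process’s objects (the counter below stands in for the shared object space):

```python
# Thread-pool sketch: no load balancing needed inside one system;
# any request can be serviced by any thread in the pool, and all
# threads see the same in-process objects.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

shared_counter = {"hits": 0}   # objects live in a single process space
lock = Lock()

def handle_request(n):
    """Work item picked up by whichever pool thread is free."""
    with lock:                 # guard the shared structure
        shared_counter["hits"] += 1
    return n * 2

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, range(10)))
print(shared_counter["hits"])  # all 10 requests handled across 4 threads
```

The lock is the in-process analogue of the shared-resource wait time mentioned on slide 7: sharing one object space is convenient, but contended structures still serialize.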
  • 25. Threaded Servers [diagram: clients call a single COM+ process whose thread pool services requests against a shared object space and the application code; COM+ uses thread pools rather than load balancing within a single system]
  • 26. Dynamic Load Balancing
    • Dynamically balance load across servers
      • Requests from a client can go to any server
    • Requests dynamically routed
      • Often used for Web Server farms
      • IP sprayer (Cisco etc)
      • Network Load Balancer etc
    • Routing decision has to be fast & reliable
      • Routing in main processing path
    • Applications normally stateless
  • 27. Web Server Farms
    • Web servers are highly scalable
      • Web applications are normally stateless
        • Next request can go to any Web server
        • State comes from client or database
      • Just need to spread incoming requests
        • IP sprayers (hardware, software)
        • Or >1 Web server looking at same IP address with some coordination
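Because every request can go to any Web server, the sprayer’s routing decision can be as simple as ‘fewest active connections’. A hypothetical in-process sketch (real IP sprayers such as Cisco’s do this in hardware or in the network stack, not in application code):

```python
# Least-connections request router sketch for a stateless server farm.
# Server names are invented for illustration.
class Sprayer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}   # active requests per server

    def route(self):
        """Pick the server with the fewest in-flight requests."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def done(self, server):
        """Called when a request completes."""
        self.active[server] -= 1

sp = Sprayer(["web1", "web2", "web3"])
first = sp.route()    # all idle: picks web1
second = sp.route()   # web1 now busy: picks web2
print(first, second)
```

The decision is a dictionary scan, which reflects the slide’s point that routing sits in the main processing path and therefore has to be fast.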
  • 28. Clusters
    • A group of independent computers acting like a single system
      • Shared disks
      • Single IP address
      • Single set of services
      • Fail-over to other members of cluster
      • Load sharing within the cluster
      • DEC, IBM, MS, …
  • 29. Clusters [diagram: client PCs connect to servers A and B, each attached to disk cabinets A and B and joined by a heartbeat link for cluster management]
  • 30. Clusters
    • Address scalability
      • Add more boxes to the cluster
      • Replication or shared storage
    • Address availability
      • Fail-over
      • Add & remove boxes from the cluster for upgrades and maintenance
    • Can be used as one element of a highly-available system
  • 31. Scaling State Stores?
    • Scaling stateless logic is easy
    • … but how are state stores scaled?
    • Bigger, faster box (if this helps at all)
      • Could hit lock contention or I/O limits
    • Replication
      • Multiple copies of shared data
      • Apps access their own state stores
      • Change anywhere & send to everyone
  • 32. Scaling State Stores
    • Partitioning
      • Multiple servers, each looking after a part of the state store
        • Separate customers A-M & N-Z
        • Split customers according to state
      • Preferably transparent to apps
        • e.g. SQL/Server partitioned views
    • Or combination of these approaches
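The A-M / N-Z split can be hidden behind a single routing function so applications stay unaware of the partitioning. A sketch with hypothetical server names:

```python
# Range-partitioning sketch: each database server owns a slice of the
# customer key space. Server names and ranges are invented examples.
PARTITIONS = {
    "db1": ("A", "M"),   # customers A-M
    "db2": ("N", "Z"),   # customers N-Z
}

def partition_for(customer_name):
    """Route a customer to the server owning its key range."""
    first = customer_name[0].upper()
    for server, (lo, hi) in PARTITIONS.items():
        if lo <= first <= hi:
            return server
    raise ValueError(f"no partition covers {customer_name!r}")

print(partition_for("Greenfield"))   # db1
print(partition_for("Smith"))        # db2
```

Products can push this routing below the application entirely, which is what the slide’s SQL Server partitioned-views example refers to.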
  • 33. Scaling Out Summary [diagram: UI tier is a Web server farm using Network Load Balancing; business tier is an application farm using Component Load Balancing; data tier is database servers using Cluster Services and partitioning, e.g. districts 1-10 and districts 11-20]
  • 34. Scale-up
    • No need for load-balancing
      • Just use a bigger box
      • Add processors, memory, ….
      • SMP (symmetric multiprocessing)
      • May not fix problem!
    • Runs into limits eventually
    • Could be less available
      • What happens on failures? Redundancy?
    • Could be easier to manage
  • 35. Scale-up
    • eBay example
      • Server farm of Windows boxes (scale-out)
      • Single database server (scale-up)
        • 64-processor SUN box (max at time)
      • More capacity needed?
        • Easily add more boxes to Web farm
        • Faster DB box? (not available)
          • More processors? (not possible)
          • Split DB load across multiple DB servers?
      • See eBay presentation…
  • 36. Available System [diagram: Web clients reach a Web server farm load-balanced using WLB, which calls an app server farm using COM+ LB, with the database installed on a cluster for high availability]
  • 37. Availability
    • How much downtime?
      • 99% → 87.6 hours a year
      • 99.9% → 8.76 hours a year
      • 99.99% → 0.876 hours a year
    • Need to consider operations as well
      • Not just faults and recovery time
      • Maintenance, software upgrades, backups, application changes
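The downtime figures above are simple arithmetic on the availability percentage, as this sketch shows:

```python
# Annual downtime implied by an availability percentage.
HOURS_PER_YEAR = 24 * 365   # 8760

def downtime_hours(availability_pct):
    """Hours per year the system may be unavailable."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_hours(pct):.3f} hours/year")
```

Each extra ‘nine’ cuts the budget by a factor of ten, and that budget has to cover maintenance, upgrades, and backups as well as faults.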
  • 38. Availability
    • Often a question of application design
      • Stateful vs stateless
        • What happens if a server fails?
        • Can requests go to any server?
      • Synchronous method calls or asynchronous messaging?
        • Reduce dependency between components
        • Failure tolerant designs
      • And manageability decisions to consider
  • 39. Redundancy=Availability
    • Passive or active standby systems
      • Re-route requests on failure
      • Continuous service (almost)
        • Recover failed system while alternative handles workload
        • May be some hand-over time (db recovery?)
        • Active standby & log shipping reduce this
          • At the expense of 2x system cost…
    • What happens to in-flight work?
      • State recovers by aborting in-flight ops & doing db recovery but …
  • 40. Transaction Recovery
    • Could be handled by middleware
      • Persistent queues of accepted requests
      • Still a failure window though
    • Large role for client apps/users
      • Did the request get lost on failure?
      • Retry on error?
    • Large role for server apps
      • What to do with duplicate requests?
      • Try for idempotency (repeated txns OK)
      • Or track and reject duplicates
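The ‘track and reject duplicates’ option can be sketched as a server that remembers the request IDs it has already processed, so a client retry after a failure cannot apply the same transaction twice. The request IDs and balance logic below are invented for illustration:

```python
# Duplicate-rejection sketch: a non-idempotent operation (adding to a
# balance) made safe to retry by remembering processed request IDs.
class DedupingServer:
    def __init__(self):
        self._seen = {}   # request_id -> result already returned

    def handle(self, request_id, amount, balance):
        if request_id in self._seen:
            return self._seen[request_id]   # duplicate: replay old result
        balance += amount                    # the real, non-idempotent work
        self._seen[request_id] = balance
        return balance

srv = DedupingServer()
b = srv.handle("req-1", 50, 100)   # applied: balance 150
b = srv.handle("req-1", 50, b)     # client retry: still 150, not 200
print(b)
```

A production version would persist the seen-ID table transactionally with the update and expire old entries; the point here is only that replaying the stored result makes the client’s ‘retry on error’ policy safe.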
  • 41. Fragility
    • Large, distributed, synchronous systems are not robust
      • Many independent systems & links…
        • Everything always has to be working
      • Rationale for Asynchronous Messaging
        • Loosen ‘coupling’ between components
        • Rely on guaranteed delivery instead
        • May just defer error handling though
          • Could be much harder to handle later
        • To be discussed next time…
  • 42. Example
