1. RESILIENT SYSTEM DESIGN
June 2013
Risk & Compliance Engineering, PayPal
Pradeep Ballal
Staale Nerboe
Greg Berry
This deck contains generic architecture information, and does not
reflect the exact details of current or planned systems.
2. PROBLEM DEFINITION AND SOLUTION
Problem
In a distributed, virtualized environment, system failures are inevitable.
Solution
Isolate functionality to enable independent implementation of appropriate availability
patterns and increase velocity/flexibility of fixes.
Use asynchronous reconciliation to resolve failures without affecting overall customer
experience.
2 Confidential and Proprietary
3. HIGH LEVEL ARCHITECTURE
[Diagram labels: Clients, Request, Circuit Breakers, PPaaS, Service Container, Orchestration/Response Consolidation, Component Container, Functional Component, Dependency]
Functional Component (FC): Isolated set
of functionality that can be developed,
deployed and executed independently.
• Fits well into the Agile Development
methodology
• Fallback behavior defined
Service Container (SC): Contains infrastructure to orchestrate FCs, consolidate responses, and initiate reconciliation on failure.
• Component-based model (e.g. OSGi), including support for hot deployment of FCs without service downtime
• Malfunctioning FCs surface quickly and can be handled dynamically through properties or real-time deployments
• Provides a meaningful response back to clients
4. SERVICE CONTAINER
Service Container (SC): Contains infrastructure to orchestrate FCs, consolidate responses, and initiate reconciliation on failure.
• Built on top of PayPal Platform as a Service (PPaaS)
• Component-based model (e.g. OSGi), including support for hot deployment of FCs without service downtime
• Enforces the concept of coarse-grained services
• Malfunctioning FCs surface quickly and can be handled dynamically through properties or real-time deployments
• Provides a meaningful response back to clients
• Non-intrusive on the clients
Functional Component (FC): Isolated set of functionality that can be developed,
deployed and executed independently.
• Fits well into the Agile Development methodology
• FCs can fail independently
• Fallback behavior defined
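A minimal sketch of how a Service Container might orchestrate FCs, consolidate their responses, and queue failures for later reconciliation. All names here (`FunctionalComponent`, `ServiceContainer`, `reconciliation_queue`, the `limits` and `risk` components) are illustrative assumptions, not part of PPaaS or any real framework, and the FCs run sequentially to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FunctionalComponent:
    """Hypothetical FC: a name, its normal behavior, and a defined fallback."""
    name: str
    run: Callable[[dict], dict]
    fallback: Callable[[dict], dict]


@dataclass
class ServiceContainer:
    """Hypothetical SC: orchestrates FCs and consolidates their responses."""
    components: List[FunctionalComponent]
    # Stand-in for a feed consumed by an asynchronous reconciliation system.
    reconciliation_queue: List[dict] = field(default_factory=list)

    def handle(self, request: dict) -> Dict[str, dict]:
        consolidated = {}
        for fc in self.components:
            try:
                consolidated[fc.name] = fc.run(request)
            except Exception as exc:
                # One FC failing does not fail the request: use its fallback
                # and record the failure so it can be reconciled later.
                consolidated[fc.name] = fc.fallback(request)
                self.reconciliation_queue.append(
                    {"fc": fc.name, "request": request, "error": repr(exc)}
                )
        return consolidated


def risk_check(request: dict) -> dict:
    # Simulates an FC whose dependency is unavailable.
    raise TimeoutError("risk dependency unavailable")


container = ServiceContainer([
    FunctionalComponent("limits",
                        lambda r: {"ok": r["amount"] <= 1000},
                        lambda r: {"ok": False}),
    FunctionalComponent("risk",
                        risk_check,
                        lambda r: {"decision": "review_later"}),
])
response = container.handle({"amount": 250})
```

The client still receives a meaningful consolidated response even though the `risk` FC failed, and the failure is left in the queue for reconciliation.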
5. FALLBACK
To create a resilient system each Functional Component and Dependency SHOULD fail
gracefully and have Fallback Behavior. This can be achieved by utilizing a framework
that enforces normalized behavior across the platform.
PS: Fallback Behavior should not be an
afterthought but should be detailed
out in the design in conjunction with
your business partners.
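One way a framework could enforce normalized fallback behavior is to wrap each behavior in an ordered chain, where the last resort is to log and return an error message. The sketch below is an assumption about how that might look; `with_fallbacks`, `live_score`, `cached_score`, and `CACHE` are hypothetical names.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("fallback")


def with_fallbacks(behaviors: Sequence[Callable[[dict], dict]]) -> Callable[[dict], dict]:
    """Try each behavior in order; if all fail, return a generic error."""
    def call(request: dict) -> dict:
        for behavior in behaviors:
            try:
                return behavior(request)
            except Exception:
                logger.warning("%s failed; trying next fallback", behavior.__name__)
        # Last resort: every layer failed, so report an error to the client.
        return {"status": "error", "message": "service temporarily unavailable"}
    return call


# Hypothetical behaviors for illustration only.
CACHE = {"alice": 0.12}


def live_score(request: dict) -> dict:
    raise ConnectionError("scoring dependency down")


def cached_score(request: dict) -> dict:
    # Raises KeyError when the user has no cached score.
    return {"status": "ok", "score": CACHE[request["user"]], "source": "cache"}


score = with_fallbacks([live_score, cached_score])
result_cached = score({"user": "alice"})  # normal behavior fails, cache answers
result_error = score({"user": "bob"})     # both layers fail, generic error
```

Which fallback is acceptable for a given failure is a business decision, which is why the chain is passed in explicitly rather than hard-coded.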
[Diagram labels: Request, Functional Component / Dependency, Circuit Breakers (Local / Global), Logging / Monitoring, Normal Behavior, Fallback Behavior, FAILURE]
6. CIRCUIT BREAKERS*
Circuit Breakers (CBs) serve these purposes:
• They protect clients from slow or broken FCs
• They protect services from demand in excess of capacity
• Most importantly, they protect the Business from malfunctioning code by tracking negative actions (such as declined payments) and shutting down the FC if abnormal behavior is found
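The first purpose, protecting clients from a broken FC, can be sketched as a small state machine that fails fast once a failure threshold is reached and allows a trial call after a cool-down. This is a simplified illustration, not the Hystrix implementation; the class name, thresholds, and the injectable `clock` parameter are assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without invoking the FC."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so the example can fake time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock for a deterministic demonstration
breaker = CircuitBreaker(failure_threshold=2, reset_after_s=10.0, clock=lambda: now[0])


def broken():
    raise ConnectionError("FC is down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(broken)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has passed
outcomes.append(breaker.call(lambda: "ok"))
```

After two real failures the third call is rejected without touching the FC, and a successful trial call after the cool-down closes the circuit again.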
*Concept first discussed in the excellent book Release It! by Michael Nygard.
Example open source implementation by Netflix: https://github.com/Netflix/Hystrix/wiki
CBs are named after their counterparts in the physical world.
Local CBs: Track the health of services
Global CBs: Track negative behavior that impacts the Business or the health of the overall system
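A Global CB that tracks negative business actions could be sketched as a rolling-window counter that trips when the share of negative outcomes (for example, declined payments) becomes abnormal. The class name, the 5-minute window, the 50% ratio, and the minimum sample size below are assumptions chosen for illustration; a production version would keep this state in shared storage rather than in process memory.

```python
import time
from collections import deque


class GlobalCircuitBreaker:
    def __init__(self, window_s=300.0, max_negative_ratio=0.5,
                 min_samples=10, clock=time.monotonic):
        self.window_s = window_s
        self.max_negative_ratio = max_negative_ratio
        self.min_samples = min_samples
        self.clock = clock
        self.events = deque()   # (timestamp, is_negative), oldest first
        self.negatives = 0

    def record(self, is_negative: bool) -> None:
        now = self.clock()
        self.events.append((now, is_negative))
        self.negatives += int(is_negative)
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the rolling window.
        while self.events and now - self.events[0][0] > self.window_s:
            _, was_negative = self.events.popleft()
            self.negatives -= int(was_negative)

    def is_tripped(self) -> bool:
        self._evict(self.clock())
        total = len(self.events)
        # Require a minimum sample size so a few early declines do not trip it.
        return total >= self.min_samples and self.negatives / total > self.max_negative_ratio


now = [0.0]  # fake clock for a deterministic demonstration
cb = GlobalCircuitBreaker(clock=lambda: now[0])
for declined in [False] * 4 + [True] * 8:
    cb.record(declined)
tripped_during_spike = cb.is_tripped()   # 8 of 12 outcomes were declines

now[0] = 301.0                           # the whole spike has aged out
tripped_after_window = cb.is_tripped()
```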
[Diagram labels: Service Container (shown twice), Request, Functional Component, Dependency, Orchestration/Response Consolidation, Circuit Breakers, Circuit Breakers (Global), Config]
7. DATA ACCESS – NEED MORE
Globally Distributed
• You can’t have a single system of record that contains all data
• Latency matters (you can’t go faster than the speed of light)
• There must be a way to partition data and processing
Always Available
• Everything needs to be redundant (or dispensable)
• Can’t have a single point of failure
Shares Nothing
• Systems must be able to run completely
independently
[Diagram labels: Clients, Read Service, Life Cycle (CRUD) Service, SoR, Journal, Latency Bridge, Replay, Read Replicas]
8. EVENTUALLY CONSISTENT*
CAP theorem: States that of three properties of distributed data systems (data consistency, system availability, and tolerance to network partition) only two can be achieved at any given time.
To account for this, a reconciliation system is required to identify issues and try to correct them automatically. A Manual Review should be conducted only as a last resort.
Design considerations:
• Limited DB table scanning: The system should not rely on heavy DB table scans or heavy queries. If required, this SHOULD be done in a DW or on a Hadoop cluster and fed back into the real-time system.
• Non-intrusive: Listens only to events from other systems; SHOULD NOT touch code in other parts of the system (and hence does not need to get on their road map).
Types of reconciliation:
• Stateless: Depends only on the data in the request.
• Stateful: Depends on the business process and its state when the failure occurred. Hence, when the system failed may affect the outcome of the reconciliation.
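The difference between the two reconciliation types can be illustrated with a short sketch. The event shape, the state names ("PENDING", "COMPLETED"), and the outcome labels are hypothetical and chosen only to show that stateful reconciliation needs the process state at failure time while stateless reconciliation does not.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailureEvent:
    """Hypothetical event emitted when an FC or dependency fails."""
    request: dict
    state_at_failure: Optional[str] = None  # None for stateless cases


def reconcile_stateless(event: FailureEvent, check: Callable[[dict], bool]) -> str:
    # Uses only the data carried in the original request.
    return "resolved" if check(event.request) else "manual_review"


def reconcile_stateful(event: FailureEvent, current_state: str) -> str:
    # The outcome depends on where the business process was when it failed.
    if current_state == "COMPLETED":
        return "resolved"        # the process finished despite the failure
    if event.state_at_failure == "PENDING" and current_state == "PENDING":
        return "retry"           # nothing was committed, so retrying is safe
    return "manual_review"       # last resort


stateless_result = reconcile_stateless(
    FailureEvent({"amount": 250}), check=lambda r: r["amount"] <= 1000
)
retry_result = reconcile_stateful(
    FailureEvent({"amount": 250}, state_at_failure="PENDING"), current_state="PENDING"
)
review_result = reconcile_stateful(
    FailureEvent({"amount": 250}, state_at_failure="AUTHORIZED"), current_state="AUTHORIZED"
)
```

Both functions fall back to manual review only when no automatic correction applies.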
*See excellent paper “Eventually Consistent” by Werner Vogels, CTO Amazon
10. WE ARE HIRING
If you are interested in helping us solve
these problems, you can contact us at:
dwilfred@paypal.com
http://www.ebaycareers.com
Editor's Notes
Mr. Pradeep Ballal works as a Senior Architect in the Core Service Product Development organization, with a specific focus on Compliance and Risk products, at PayPal Singapore. Mr. Ballal is a software generalist with 13 years of technology experience and has a special interest in decision management, business rules, enterprise software and architectures.
Mr. Staale Nerboe (snerboe@paypal.com) works as a Senior Architect in the Core Service Product Development organization at PayPal Singapore. Mr. Nerboe has 15+ years of Technology Consulting and Software Architecture experience for large global companies worldwide.
Mr. Greg Berry (gberry@paypal.com) works as a Principal Architect at PayPal in the Core Services organization. Greg has been an architect in the payments industry for more than 15 years.
In a complex system you will see multiple levels of fallback behavior, like an onion. A fallback behavior can itself have a fallback, e.g. as a last resort, only log and return an error message to the client.
CBs can be implemented in various ways, including Complex Event Processing (CEP), a database, a global cache, or any other fast storage medium. The implementation needs to support fast reads and writes, and also handle rolling windows such as the last 5 minutes, 1 hour, or 24 hours. This gets complex in an environment where the volume of service invocations is high (e.g. with large number of invocations or