PayPal Resilient System Design

RESILIENT SYSTEM DESIGN
June 2013
Risk & Compliance Engineering, PayPal
Pradeep Ballal
Staale Nerboe
Greg Berry
This deck contains generic architecture information, and does not
reflect the exact details of current or planned systems.

PROBLEM DEFINITION AND SOLUTION
Problem
In a distributed, virtualized environment, system failures are inevitable.
Solution
Isolate functionality to enable independent implementation of appropriate availability
patterns and increase velocity/flexibility of fixes.
Use asynchronous reconciliation to resolve failures without affecting overall customer
experience.
Confidential and Proprietary

HIGH LEVEL ARCHITECTURE
[Diagram. Labels: Clients, Request, PPaaS, Service Container, Orchestration/Response Consolidation, Component Container, Functional Component, Circuit Breakers, Dependency]
Functional Component (FC): Isolated set of functionality that can be developed, deployed and executed independently.
• Fits well into the Agile Development methodology
• Fallback behavior is defined
Service Container (SC): Contains infrastructure to orchestrate FCs, handle response consolidation, and initiate reconciliation during failure.
• Component-based model (e.g. OSGi), including support for hot deploy of FCs without service downtime
• Malfunctioning FCs surface quickly and can be handled dynamically through properties or real-time deployments
• Provides a meaningful response back to clients

SERVICE CONTAINER
Service Container (SC): Contains infrastructure to orchestrate FCs, handle response consolidation, and initiate reconciliation during failure.
• Built on top of PayPal Platform as a Service (PPaaS)
• Component-based model (e.g. OSGi), including support for hot deploy of FCs without service downtime
• Enforces the concept of coarse-grained services
• Malfunctioning FCs surface quickly and can be handled dynamically through properties or real-time deployments
• Provides a meaningful response back to clients
• Non-intrusive for clients
Functional Component (FC): Isolated set of functionality that can be developed, deployed and executed independently.
• Fits well into the Agile Development methodology
• FCs can fail independently
• Fallback behavior is defined
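A minimal sketch of how an SC might orchestrate FCs and consolidate their responses, assuming each FC is a plain callable with a predefined fallback value. The names `FunctionalComponent`, `ServiceContainer`, `handle` and `to_reconcile` are hypothetical and are not taken from PPaaS or OSGi.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FunctionalComponent:
    """Hypothetical FC: an isolated unit with a predefined fallback."""
    name: str
    run: Callable[[dict], Any]
    fallback: Any


@dataclass
class ServiceContainer:
    """Hypothetical SC: orchestrates FCs and consolidates their responses."""
    components: List[FunctionalComponent]
    to_reconcile: List[dict] = field(default_factory=list)

    def handle(self, request: dict) -> Dict[str, Any]:
        response: Dict[str, Any] = {}
        for fc in self.components:
            try:
                response[fc.name] = fc.run(request)
            except Exception as exc:
                # The FC failed on its own: use its fallback so the client
                # still gets a meaningful response, and record the failure
                # for asynchronous reconciliation.
                response[fc.name] = fc.fallback
                self.to_reconcile.append(
                    {"fc": fc.name, "request": request, "error": repr(exc)}
                )
        return response


def broken_limits(request: dict) -> int:
    # Simulates an FC whose dependency is unavailable.
    raise TimeoutError("limits dependency timed out")


container = ServiceContainer([
    FunctionalComponent("risk_score", lambda req: 0.1, fallback=0.5),
    FunctionalComponent("limits", broken_limits, fallback=0),
])
result = container.handle({"txn": "t-1"})
```

In this sketch one failing FC does not affect the others, and the client does not need to know that a fallback was used.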

FALLBACK
To create a resilient system, each Functional Component and Dependency SHOULD fail gracefully and have Fallback Behavior. This can be achieved by using a framework that enforces normalized behavior across the platform.
PS: Fallback Behavior should not be an afterthought; it should be detailed in the design in conjunction with your business partners.
[Diagram. Labels: Clients, Request, Functional Component / Dependency, Circuit Breakers (Local / Global), Logging / Monitoring, Normal Behavior, Fallback Behavior, FAILURE]
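One way a framework could enforce normalized fallback behavior is a wrapper that every FC or Dependency call goes through. The sketch below assumes a decorator-based design. `with_fallback`, `fetch_limit` and `default_limit` are hypothetical names, and the default value stands in for a rule agreed on with business partners.

```python
import functools
import logging
from typing import Any, Callable

log = logging.getLogger("fallback")


def with_fallback(fallback: Callable[..., Any]) -> Callable:
    """Wrap a call so every failure is logged and routed to `fallback`."""
    def decorator(normal: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(normal)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return normal(*args, **kwargs)      # Normal Behavior
            except Exception:
                # Logging / Monitoring: record the failure with traceback.
                log.exception("%s failed; using fallback", normal.__name__)
                return fallback(*args, **kwargs)    # Fallback Behavior
        return wrapper
    return decorator


def default_limit(account_id: str) -> int:
    # Hypothetical business rule decided at design time.
    return 100


@with_fallback(default_limit)
def fetch_limit(account_id: str) -> int:
    # Simulated dependency call that fails for some accounts.
    if account_id.startswith("bad"):
        raise ConnectionError("limit service unreachable")
    return 5000
```

Because the fallback is passed in where the call is declared, it cannot be left out and is visible at design review.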

CIRCUIT BREAKERS*
Circuit Breakers (CBs) serve these purposes:
• Protect clients from slow or broken FCs
• Protect services from demand in excess of capacity
• Most importantly, protect the Business from malfunctioning code by tracking negative actions (such as declined payments) and shutting down the FC if abnormal behavior is found
*Concept first discussed in the excellent book Release It! by Michael Nygard.
Example open source implementation by Netflix: https://github.com/Netflix/Hystrix/wiki
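A minimal sketch of a breaker that tracks the health of a single call path, assuming a consecutive-failure threshold and a fixed cool-down. This is not the Hystrix API; `CircuitBreaker`, `CircuitOpenError`, `max_failures` and `reset_after` are hypothetical names.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], Any]) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast so clients do not wait on a broken FC.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The injectable `clock` is a design choice for testability; callers would normally catch `CircuitOpenError` and use the Fallback Behavior.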
CBs are named after their counterparts in the physical world.
Local CBs: Track the health of services
Global CBs: Track negative behavior that impacts the Business or the health of the overall system
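A Global CB could be sketched as a sliding window over business outcomes, for example the share of declined payments among recent decisions. The window size, threshold, minimum sample count and the name `GlobalCircuitBreaker` are assumptions for illustration, not details from the deck.

```python
from collections import deque


class GlobalCircuitBreaker:
    """Hypothetical global CB: trips when the share of negative business
    outcomes in a sliding window of recent results exceeds a threshold."""

    def __init__(self, window: int = 100, max_negative_ratio: float = 0.5,
                 min_samples: int = 20) -> None:
        self.outcomes = deque(maxlen=window)  # True = negative outcome
        self.max_negative_ratio = max_negative_ratio
        self.min_samples = min_samples
        self.tripped = False

    def record(self, negative: bool) -> None:
        self.outcomes.append(negative)
        if len(self.outcomes) < self.min_samples:
            return  # too little data to call the behavior abnormal
        ratio = sum(self.outcomes) / len(self.outcomes)
        if ratio > self.max_negative_ratio:
            self.tripped = True  # stays tripped until reset() is called

    def allow(self) -> bool:
        """Return True while the FC may keep serving requests."""
        return not self.tripped

    def reset(self) -> None:
        self.outcomes.clear()
        self.tripped = False
```

Unlike a breaker on technical failures, this one can trip while every call succeeds technically, which is the case of malfunctioning code that declines too many payments.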
[Diagram. Labels: Clients, Request, Service Container (x2), Orchestration/Response Consolidation, Functional Component, Dependency, Circuit Breakers, Circuit Breakers (Global), Circuit Breakers Config]

DATA ACCESS – NEED MORE
Globally Distributed
• You can’t have a single system of record that contains all data
• Latency matters (you can’t go faster than the speed of light)
• There must be a way to partition data and processing
Always Available
• Everything needs to be redundant (or dispensable)
• Can’t have a single point of failure
Shares Nothing
• Systems must be able to run completely independently
[Diagram. Labels: Clients, Read Service, Life Cycle (CRUD) Service, Read Replicas (x2), SoR, Journal, Latency Bridge, Replay]
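One possible reading of the diagram labels is that the Life Cycle (CRUD) Service writes to the SoR and appends each change to a Journal, which is later replayed into Read Replicas. The sketch below illustrates only that reading. The deck does not describe the mechanism, the Latency Bridge is not modeled, and all class and method names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Entry = Tuple[str, str, Any]  # (operation, key, value)


class LifeCycleService:
    """Hypothetical write path: updates the SoR and appends to a journal."""

    def __init__(self) -> None:
        self.sor: Dict[str, Any] = {}
        self.journal: List[Entry] = []

    def put(self, key: str, value: Any) -> None:
        self.sor[key] = value
        self.journal.append(("put", key, value))

    def delete(self, key: str) -> None:
        self.sor.pop(key, None)
        self.journal.append(("delete", key, None))


class ReadReplica:
    """Hypothetical read path: catches up by replaying journal entries."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.offset = 0  # index of the next journal entry to apply

    def replay(self, journal: List[Entry]) -> int:
        """Apply entries not yet seen; return how many were applied."""
        applied = 0
        for op, key, value in journal[self.offset:]:
            if op == "put":
                self.data[key] = value
            else:
                self.data.pop(key, None)
            applied += 1
        self.offset += applied
        return applied

    def get(self, key: str) -> Optional[Any]:
        # May return stale data until the next replay.
        return self.data.get(key)
```

Under this reading a replica can serve reads without contacting the SoR, at the cost of being behind by whatever has not yet been replayed.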

EVENTUALLY CONSISTENT*
CAP theorem: States that of three properties of distributed data systems (data consistency, system availability, and tolerance to network partition), only two can be achieved at any given time.
To account for this, a reconciliation system is required to identify issues and try to correct them automatically. Only as a last resort should a Manual Review be conducted.
Design considerations:
• Limited DB table scanning: The system should not rely on heavy DB table scans or heavy queries. If these are required, they SHOULD be run in a DW or on a Hadoop cluster, with the results fed back into the real-time system.
• Non-intrusive: The system listens only to events from other systems and SHOULD NOT touch code in other parts of the system (and hence does not need to get on their road maps).
Types of reconciliation:
• Stateless: Depends only on the data in the request.
• Stateful: Depends on business processes and on the state when the failure occurred. Hence, when the system failed may matter to the outcome of the reconciliation.
*See excellent paper “Eventually Consistent” by Werner Vogels, CTO Amazon
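The difference between the two types of reconciliation can be sketched as two functions: one that needs only the request, and one that also needs the state the business process was in when it failed. The fee rule, the state names and the returned actions are hypothetical examples, not PayPal rules.

```python
from typing import Dict


def reconcile_stateless(request: Dict[str, float]) -> Dict[str, float]:
    """Stateless: the correction is derived from the request alone."""
    # Hypothetical rule: the fee is always 3% of the amount.
    return {"fee": round(request["amount"] * 0.03, 2)}


def reconcile_stateful(request: Dict[str, float], state_at_failure: str) -> str:
    """Stateful: the action depends on where the process failed."""
    # Hypothetical states of a payment flow.
    if state_at_failure == "AUTHORIZED":
        return "void_authorization"   # nothing was captured yet
    if state_at_failure == "CAPTURED":
        return "refund"               # money moved, so reverse it
    if state_at_failure == "SETTLED":
        return "manual_review"        # last resort
    return "no_action"
```

The same request leads to different actions in the stateful case, which is why the state at the time of failure has to be recorded.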

DETAILED DESIGN
[Diagram. Labels: Clients, Request, Service Container (x2), Orchestration/Response Consolidation, Dependency, Circuit Breakers, Circuit Breakers (Global), Circuit Breakers Config, SoR, Events, Queue, Reconciliation & Actions, Reconcile, Reports (Manual)]
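A minimal sketch of the reconciliation path suggested by the diagram labels (Events, Queue, Reconciliation & Actions, Reports (Manual)), assuming failure events arrive on an in-process queue and each event type has at most one automatic fixer. `run_reconciliation`, `fix_missing_record` and the event fields are hypothetical.

```python
import queue
from typing import Callable, Dict, List

Fixer = Callable[[dict, Dict[str, str]], bool]


def run_reconciliation(events: "queue.Queue[dict]",
                       sor: Dict[str, str],
                       fixers: Dict[str, Fixer],
                       manual_report: List[dict]) -> int:
    """Drain the event queue, try an automatic fix per event, and put
    anything that cannot be fixed on the manual report. Returns the
    number of events fixed automatically."""
    fixed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return fixed
        fixer = fixers.get(event["type"])
        if fixer is not None and fixer(event, sor):
            fixed += 1
        else:
            manual_report.append(event)  # last resort: manual review


def fix_missing_record(event: dict, sor: Dict[str, str]) -> bool:
    # Hypothetical automatic action: write the missing record to the SoR,
    # then confirm the SoR holds the expected value.
    sor.setdefault(event["id"], event["expected"])
    return sor[event["id"]] == event["expected"]
```

The reconciler only consumes events, so in this sketch the systems that emit them need no code changes.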

WE ARE HIRING
If you are interested in helping us solve
these problems, you can contact us at:
dwilfred@paypal.com
http://www.ebaycareers.com
