PayPal Risk Platform
High Performance Practice
Ling ZhiJun (Brian Ling)
2017 Software Architecture Summit
AGENDA
PayPal & PayPal Risk (Platform)
Risk DAL Service Challenge
Async Solution
Async Future Plan
PayPal Overview
PayPal operates one of the largest online payment platforms in the world.
• TPV/day: ~1 billion
• Payments/year: 6.1 billion
• Computation/day: ~20 billion
• Active customer accounts: 210M
• Petabytes of data: 105
• Queries/day: 250 billion
• Loss rate: 0.32%
The power of our platform
Our technology transformation enables us to:
• Process payments at tremendous scale (200+ countries & 25 currencies supported)
• Accelerate the innovation of new products
• Engage world-class developers & technologists
PayPal Risk KPI
Payments transactions
• TPV: +354 billion
• Payments/year: 6.1 billion
• Payments/second at peak: 1.8B
• Active customer accounts: 210M
• Petabytes of data: 73
• Database calls/quarter: 4.5T
• Loss rate: 0.32%
Requirement for Risk Platform
• Accuracy vs. Latency
• Low Latency + Hardware Investment vs. Large Throughput
PayPal Risk Platform Architecture
[Diagram: online and offline components of the Risk Platform]
• Online: Gateway Service, Decision Service, Model + Variable Computation Service, Variable Rollup Service, DAL Service (Real-time Compute Data, Offline Generated Data)
• Paths: Read Path, Write Path
• Logging System / ETL
• Offline: Offline Generated Data, Simulated Real-time Data, Offline Variable Simulation Platform, Model Training Platform, Offline Variable Aggregation Service
PayPal Risk Platform Architecture (cont.)
[Diagram: same components as the previous slide, with the online Variable Rollup Service labeled Variable Aggregation Service]
DAL Service Ultimate Questions
JVM-Based High Performance & ATB DAL Service
• <100ms P99.99 Latency?
• 20k-30k Peak TPS on a Single Instance?
• 99.99% Availability-To-Business (ATB)?
DAL Service Technical Challenges
Budget Cost
• Hardware investment increases exponentially to keep pace with traffic
Performance Issue
• P99 latency differs significantly from average latency
• Too many latency spikes under traffic
• Storage cluster unavailability impacts latency
Customer Requirement
• Adopt new use cases
• Access behavior differs per colo
• Flexibility for fast-evolving use cases
• Replication
• Traffic strategy
Operational Cost
• Too many clients with multiple versions to maintain
• Releases too frequent and tied to business cases
• Standby storage cluster switch-over
[Quadrant labels: Req, Tech, Value, Cost]
Async Original Benefit
• More Efficient Thread Scheduling
• Non-blocking Call
• Event-Driven Callback
• Less Context Switch
• Fault Isolation
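The non-blocking call and event-driven callback bullets can be illustrated with a minimal Java sketch. It is not code from the talk: AsyncCallSketch, readAsync, and handle are hypothetical names, and the "storage read" is simulated on a small thread pool rather than a real client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical stand-in for an async storage client; not an API from the talk. */
final class AsyncCallSketch {
    // A small daemon pool standing in for the storage client's I/O threads.
    private static final ExecutorService IO = Executors.newFixedThreadPool(2, r -> {
        Thread t = new Thread(r, "io");
        t.setDaemon(true);
        return t;
    });

    /** Non-blocking call: returns at once; the future completes later on an I/O thread. */
    static CompletableFuture<String> readAsync(String key) {
        return CompletableFuture.supplyAsync(() -> "value-of-" + key, IO);
    }

    /** Event-driven callback: the response is built when the read completes. */
    static CompletableFuture<String> handle(String key) {
        return readAsync(key)
                .thenApply(v -> "200 " + v)
                // Fault isolation: a failed read turns into an error for this request only.
                .exceptionally(e -> "500 " + e.getMessage());
    }

    public static void main(String[] args) {
        System.out.println(handle("acct:42").join());
    }
}
```

The caller of handle never parks a thread waiting for storage; join() appears only in main to print the result.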
Reactor Pattern Threading Model
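As a rough illustration of the reactor idea, the sketch below runs a single dispatcher loop over a java.nio Selector and hands each ready event to the handler attached to its channel. It is an assumption-laden toy, not the threading model shown on the slide: it uses an in-process Pipe instead of sockets, one thread instead of boss and worker threads, and newline-delimited ASCII messages. ReactorSketch and demo are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class ReactorSketch {
    /** Feeds messages through a pipe and returns what the registered handler saw. */
    @SuppressWarnings("unchecked")
    static List<String> demo(String... messages) {
        List<String> seen = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            pipe.source().configureBlocking(false);
            // Register interest in READ events and attach the handler to the key.
            Consumer<String> handler = seen::add;
            pipe.source().register(selector, SelectionKey.OP_READ, handler);

            // Producer side: write newline-delimited requests, then close (EOF).
            for (String m : messages) {
                pipe.sink().write(ByteBuffer.wrap((m + "\n").getBytes(StandardCharsets.US_ASCII)));
            }
            pipe.sink().close();

            ByteBuffer buf = ByteBuffer.allocate(256);
            StringBuilder pending = new StringBuilder();
            boolean open = true;
            while (open) {
                selector.select(); // wait for readiness notification, no polling
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    Pipe.SourceChannel ch = (Pipe.SourceChannel) key.channel();
                    buf.clear();
                    if (ch.read(buf) < 0) { // EOF: stop the loop
                        key.cancel();
                        ch.close();
                        open = false;
                        continue;
                    }
                    buf.flip();
                    pending.append(StandardCharsets.US_ASCII.decode(buf));
                    int nl;
                    while ((nl = pending.indexOf("\n")) >= 0) {
                        // Dispatch each complete message to the attached handler.
                        ((Consumer<String>) key.attachment()).accept(pending.substring(0, nl));
                        pending.delete(0, nl + 1);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }
}
```

In a boss/worker arrangement the dispatch step would hand the message to a worker's handler instead of calling it inline.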
Async DAL Service KPI Comparison
• Low Latency
• ~10-35% Reduction (Average/P99)
[Figure: Latency vs. Throughput (4-core VM). E2E Client-Service-Aerospike benchmark, 50% read / 50% write. X axis: throughput (requests per second), 2,000 to 17,000. Y axis: latency (microseconds), 0 to 120,000. Series: average, 99th, 99.9th, and 99.99th percentile latency, each for read and update.]
Async DAL Service KPI Comparison – Cont.
• High Throughput
• 3-10X Increase (Single Instance Comparison)
Async DAL Service KPI Comparison – Cont.
• Less CPU Usage
• 50% CPU Usage Reduction
• 66%+ Reduction for Context Switch & System Interrupts
Async DAL Service KPI Comparison – Cont.
• Less Thread Pool
• 90% Reduction for Thread pool number
Thread Number Comparison
Thread type           Async   Sync
Server RPC Thread         9    200
Operation Thread          0     14
Replication Thread        0     40
Management Thread         2      2
Async DAL Service KPI Comparison – Cont.
• Memory Friendly
• 20% Reduction for Memory Allocation
• 100+MB Young Generation after Young GC
• 130+MB Pooled Off-heap
[Figure: GC Time / Total Time, Sync vs. Async (Y axis: 0.00% to 0.07%)]
[Figure: GC Count, Sync vs. Async (Y axis: 0 to 350)]
We Have ONE Async Dream
• Reform the application's character from CPU-bound to IO-bound
• Traffic throughput grows (non-)linearly with CPU usage
• While guaranteeing low latency, handle 20-30K TPS with a 500MB JVM heap (after young GC)
• Cloud Friendly Application
• Less Hardware Investment
• Low Operational Cost
• Easy Capacity Estimation
High Performance Design
E2E Async
• Non-blocking Pipeline: Async RPC + Async DataAccess
Less is More
• Shared ThreadPool over Separate ThreadPool
• Inline Execution over Execution across Multiple Thread Pools
Autonomous Memory Management
• Use Off-Heap as much as possible (inbound/outbound & [de]serialization)
• Release Inbound Memory at an earlier stage (submitRequest)
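The "inline execution over execution across multiple thread pools" point can be shown with standard CompletableFuture semantics. This is only a sketch under assumptions: InlineExecutionSketch, the thread names, and the two single-thread pools are made up for illustration and say nothing about how the DAL service is actually built.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class InlineExecutionSketch {
    private static ExecutorService named(String name) {
        return Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, name);
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns {thread that ran the inline stage, thread that ran the handed-off stage}. */
    static String[] demo() {
        ExecutorService io = named("io-thread");
        ExecutorService worker = named("worker-thread");
        try {
            CompletableFuture<byte[]> response = new CompletableFuture<>();
            // Registered before completion, so this stage runs inline on the
            // thread that completes `response` (here: the I/O thread).
            CompletableFuture<String> inline =
                    response.thenApply(b -> Thread.currentThread().getName());
            // Explicit hand-off to a second pool: an extra queueing step and
            // context switch for the same piece of work.
            CompletableFuture<String> hopped =
                    response.thenApplyAsync(b -> Thread.currentThread().getName(), worker);
            io.execute(() -> response.complete(new byte[0]));
            return new String[] { inline.join(), hopped.join() };
        } finally {
            io.shutdown();
            worker.shutdown();
        }
    }
}
```

Inline stages are only safe when they are short and non-blocking; anything that can block would stall the I/O thread it runs on.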
High Performance Good Practice
KPI Sign-Off
• Performance Test as Critical Path for Each Commit
• [Mandatory] Continuous Performance Test for Each Commit
Inbound/Outbound Management
• Batch Consolidation
• Order Management
• Timeout Management
• Retry Only Happens on the Client Side
Programming Habit
• Fail Fast rather than Cascading Thrown Exceptions
• Logging & Monitoring Matter
• Thread-safe Write Operations in the Control Plane; Exception-safe Read Operations in the Data Plane
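Timeout management, fail-fast, and client-side-only retry can be combined in a small sketch: the service bounds each storage call and returns a quick status instead of retrying or letting an exception cascade. TimeoutSketch, withDeadline, and the "TIMEOUT"/"ERROR" strings are hypothetical; orTimeout is the standard CompletableFuture method available from Java 9.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutSketch {
    /**
     * Bounds a storage call. On timeout or failure the caller gets a status
     * value right away; any retry decision is left to the client.
     */
    static CompletableFuture<String> withDeadline(CompletableFuture<String> storageCall,
                                                  long millis) {
        return storageCall
                .orTimeout(millis, TimeUnit.MILLISECONDS) // Java 9+
                .exceptionally(e -> {
                    // Unwrap CompletionException if an upstream stage added one.
                    Throwable cause = (e instanceof CompletionException && e.getCause() != null)
                            ? e.getCause() : e;
                    return cause instanceof TimeoutException ? "TIMEOUT" : "ERROR";
                });
    }
}
```

A call such as withDeadline(neverCompletingFuture, 50) resolves to "TIMEOUT" after roughly 50 ms, while an already completed future passes its value through unchanged.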
Async High Level Architecture
[Diagram: Real Time Data Service]
• Client: Data Set Clients (Data Set 1 Client ... Data Set N Client), Data Set Schema, Biz logic, Data Access API, Metadata API, Generic Configuration API, KV Store API, HTTP(s) RPC Client
• Server: HTTP(s) RPC Server, Generic logic, Schema-less, KV Store API
• KV Store (Store/Cache): Metadata namespace, Data set namespace, Configuration namespace
• Labels: Read, Direct access, Service access
Async DAL Service Hierarchy
Async Data Access Maturity
DAL Service Feature
• Client & Server RoR Identification
• Biz-schema aware on Client Side
• Schema-less on Server Side
• Traffic Sharding & Routing
• Active-Active/Active-Standby
• Auto-Failover
• Multi-Tenancy
• ACL
• Direct/Service-To-Service Replication
• …
Metadata Driven
• Source-of-Truth for Online Guideline & Offline Inventory
• Centralized Configuration
• Zero Restart/Auto-Refresh
Data Access Mapping
• DataSet => KV Mapping
• Logical => Physical DataSet Mapping
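One way to picture the two mappings is a lookup table that resolves a logical data set name to a physical location and then derives the key-value key from it. Everything in this sketch is an assumption made for illustration: the class name, the "cluster/namespace/set" string format, and the in-memory map (the deck suggests such metadata lives in a metadata namespace of the store).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class MetadataMappingSketch {
    // Logical data set -> physical "cluster/namespace/set" (illustrative format).
    // Entries can be replaced at run time, so a remap needs no restart.
    private final Map<String, String> logicalToPhysical = new ConcurrentHashMap<>();

    /** Logical => physical data set mapping. */
    void bind(String logicalDataSet, String physicalLocation) {
        logicalToPhysical.put(logicalDataSet, physicalLocation);
    }

    /** DataSet => KV mapping: resolve the logical name, then append the record key. */
    String kvKey(String logicalDataSet, String recordKey) {
        String physical = logicalToPhysical.get(logicalDataSet);
        if (physical == null) {
            throw new IllegalArgumentException("unknown data set: " + logicalDataSet);
        }
        return physical + "/" + recordKey;
    }
}
```

Because callers only ever name the logical data set, moving it to another cluster is a metadata update rather than a client release.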
Async RPC Control Plane Abstraction
Async RPC Maturity
High Flexibility Configuration
• Configurable Execution Chain per URL
  • Customize protobuf / json encoder
  • Inject Monitoring Module
• Execution Resource Configuration
  • Threadpool size / netty option (tcp_nodelay)
  • Sharable or not
RPC Resource Management
• Service Listener Registry
• Server Container Life Cycle Management
  • Graceful Shutdown
  • Partial Shutdown Given Container
• Auto Rebuild RPC Client Channel
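A configurable execution chain per URL can be sketched as a table from URL to an ordered list of stages. This is a simplified, synchronous illustration with hypothetical names (ExecutionChainSketch, the /read and /write URLs, and the stage lambdas); it does not use Netty and is not the framework's actual configuration model.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

final class ExecutionChainSketch {
    // URL -> ordered stages; in a real system this table would come from configuration.
    private final Map<String, List<UnaryOperator<String>>> chains;

    ExecutionChainSketch(Map<String, List<UnaryOperator<String>>> chains) {
        this.chains = chains;
    }

    /** Runs the payload through the chain configured for the URL. */
    String dispatch(String url, String payload) {
        List<UnaryOperator<String>> chain = chains.get(url);
        if (chain == null) {
            return "404";
        }
        String out = payload;
        for (UnaryOperator<String> stage : chain) {
            out = stage.apply(out);
        }
        return out;
    }

    /** Example wiring: the read chain includes a monitoring stage, the write chain does not. */
    static ExecutionChainSketch example() {
        UnaryOperator<String> decode = String::trim;   // stands in for a protobuf/json decoder
        UnaryOperator<String> monitor = s -> s;        // placeholder for an injected monitoring module
        UnaryOperator<String> read = s -> "value(" + s + ")";
        UnaryOperator<String> write = s -> "stored(" + s + ")";
        return new ExecutionChainSketch(Map.of(
                "/read", List.of(decode, monitor, read),
                "/write", List.of(decode, write)));
    }
}
```

Adding a stage to one URL changes only that entry in the table, which is the flexibility the slide's "per URL" wording points at.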
Async RPC Embrace Async DataAccess
Async Core Value
High Performance
• Low Latency + High Throughput
• Low System Load
• SLA Isolation
• Understand Performance Contribution More
Easy Adoption
• Zero Code Change + Zero Release (new case on-board)
• Minimize new DB Storage Integration Effort
• Lego-Style Customization
• Highly Reusable Functionality
Cost Saving
• Less Hardware Investment
• Loose Constraint for Hardware/VM SKU
High Flexibility Configuration
• Execution Chain per URL (RPC)
• DataAccess Storage & Option [consistency & ttl]
• Traffic Routing Strategy
• Replication Strategy
Async Family
[Diagram: members of the Async family]
• Data Access: In-Memory, Aerospike, HBase
• RPC (Server/Client): Netty
• Messaging (pub-sub): Kafka, ActiveMQ
• Workflow
Future Plan
Continuous Performance Tuning Deep Dive
• Shared Eventloop
• Netty Option (IO Ratio)
• NIO vs Epoll SocketChannel
• JDK SSL vs OpenSSL
• Protobuf vs Msgpack
• Sync Client vs Async Client
• With/without Monitoring/Replication features
Async DataAccess
• Compute Operation Support
• DB Server-side UDF Adoption
• Smart Client for Direct & Service Access
• Async HBase Integration
Async RPC
• Finer Granularity Monitoring & Throttling
• Error Handling Injection
• Client Side Multiplexing
• Server Push Partial Response + RPC Client Consolidate Response
Async+Sync Hybrid Workflow Execution
Open Source in Year 2019

Editor's Notes

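  • #8 DAL Service: controls the connection pool. Centralized control and high reusability (easy storage migration, including non-backward-compatible migration, plus throttling and ACL control) => minimizes client upgrade and integration effort. Seamless storage switch and upgrade.
  • #9 Controls the connection pool. Centralized control and high reusability (easy storage migration, including non-backward-compatible migration, plus throttling and ACL control). Minimizes client upgrade and integration effort.
  • #11 GC issues. Lock contention (non-blocking). Thread switches and context switches. IO blocking. Cache line refresh / cache misses. IPC => instructions per cycle.
  • #12 Use cases: TTL/timeout, ACL, replication, traffic strategy.
  • #14 Leverage OS support for event-driven notification: Windows IOCP, Linux epoll, and macOS kqueue. Spend CPU cycles only on inbound and outbound handling. Use short-lived thread tasks for better thread usage. The client thread is not blocked waiting for the downstream storage response, so client system resource usage is less affected. Epoll does not perform the IO itself: it only tells you that a channel is currently readable or writable and fills the protocol read/write buffers, leaving the actual reads and writes under user control, so many additional operations are possible at that point. IOCP, by contrast, completes the read/write operations on the IO channel before notifying the user, so we have no control when blocking or similar conditions occur inside the channel.
  • #15 Reactor pattern: the Boss Thread synchronously takes incoming request events and uses a multiplexing dispatch strategy to hand them quickly to the corresponding Worker Thread handler. Callback events from the underlying data store trigger the subsequent response processing. Asynchronous operation: completion is signaled by notification, with no need to poll. Non-blocking: whether the caller waits for the result (that is, whether a value is returned immediately). The callback event triggers the subsequent RPC channel flush, which returns the result to the client.
  • #18 Under the same throughput.
  • #22 Async applies at the platform and framework level; for business logic, the async pattern is not easy to adopt. Use off-heap: schema-less for inbound & outbound. Release request memory: retry won't happen in the DAL service.
  • #24 Aerospike: high write performance and specific optimization for SSD => 1M TPS with P99 <1ms. DRAM/SSD hybrid solution. High ATB & scalability. Local replication & XDR. Aerospike VLDB 2016 paper.
  • #25 Batch & retry. Traffic routing & HA. ACL & multi-tenancy.
  • #30 A performance-oriented, reliable, end-to-end asynchronous service access framework that flexibly supports enterprise-grade requirements: data access, configurable, high performance, asynchronous RPC access.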