Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

PayPal Risk Platform High Performance Practice

118 views

Published on

PayPal Risk Async Framework Targeting for High Performance & Low Latency

  • Be the first to comment

  • Be the first to like this

PayPal Risk Platform High Performance Practice

  1. 1. PayPal Risk Platform High Performance Practice Ling ZhiJun (Brian Ling)
  2. 2. 2017 Software Architecture Summit AGENDA PayPal & PayPal Risk (Platform) Risk DAL Service Challenge Async Solution Async Future Plan
  3. 3. 2017 Software Architecture Summit AGENDA PayPal & PayPal Risk (Platform) Risk DAL Service Challenge Async Solution Async Future Plan
  4. 4. 2017 Software Architecture Summit TPV/day ~1 BILLIONpayments/year 6.1 BILLIO N Computation/day ~20 Billion Active Customer Accounts 210M petabytes of data 105 Queries/ day 250 Billion PayPal operates one of the largest Online Payment in the world 0.32% Loss Rate The power of our platform Our technology transformation enables us to: • Process payments at tremendous scale (200+ countries & 25currencies supported) • Accelerate the innovation of new products • Engage world-class developers & technologists PayPal Overview
  5. 5. 2017 Software Architecture Summit TPV +35 4 BILLION payments/year 6.1 BILLIO N payments/ second at peak 1.8B active customer accounts 210M petabytes of data 73 database calls/ quarter 4.5T PayPal operates one of the largest Online Payment in the world 0.32% Loss Rate The power of our platform Our technology transformation enables us to: • Process payments at tremendous scale (200+ countries & 25currencies supported) • Accelerate the innovation of new products • Engage world-class developers & technologists PayPal Risk KPI Payments transactions
  6. 6. Requirement for Risk Platform Accuracy vs Latency Low Latency + Hardware Investment Vs Large Throughput
  7. 7. 2017 Software Architecture Summit PayPal Risk Platform Architecture Online Offline DAL Service Real-time Compute Data Offline Generated Data Model + Variable Computation Service Decision Service Variable Rollup Service Logging System/ ETL Read Path Write Path Gateway Service Offline Generated Data Simulated Real-time Data Offline Variable Simulation PlatformModel Training Platform Offline Variable Aggregation Service
  8. 8. 2017 Software Architecture Summit PayPal Risk Platform Architecture Online Offline DAL Service Offline Generated Data Real-time Compute Data Model + Variable Computation Service Decision Service Variable Aggregation Service Logging System/ ETL Read Path Write Path Gateway Service Offline Generated Data Simulated Real-time Data Offline Variable Simulation PlatformModel Training Platform Offline Variable Aggregation Service
  9. 9. 2017 Software Architecture Summit AGENDA PayPal & PayPal Risk (Platform) Risk DAL Service Challenge Async Solution Async Future Plan
  10. 10. DAL Service Ultimate Questions JVM-Based High Performance & ATB DAL Service <100ms P99.99 Latency ?? For single instance, 20k-30k Peak TPS ?? • 99.99% Availability-To-Business??
  11. 11. DAL Service Technical Challenges Budget Cost • Align with traffic, Hardware investment Exponential Increase Performance Issue • P99 Latency Significantly differentiate Avg latency • Too Many Latency Spike under Traffic • Storage Cluster Unavailability Impact Latency Customer Requirement • Adopt New Use Case • Access behavior Differentiate per Colo • Flexibility & Fast-evolving Use Case • Replication • Traffic Strategy Operational Cost • Maintain too many Client with multiple versions • Too Frequent Release tie to Biz Case • Standby Storage Cluster switch- over Req Tech Value Cost
  12. 12. 2017 Software Architecture Summit AGENDA PayPal & PayPal Risk (Platform) Risk DAL Service Challenge Async Solution Async Future Plan
  13. 13. 2017 Software Architecture Summit Async Original Benefit • More Efficient Thread Scheduling • Non-blocking Call • Event-Driven Callback • Less Context Switch • Fault Isolation
  14. 14. 2017 Software Architecture Summit Reactor Pattern Threading Model
  15. 15. 2017 Software Architecture Summit Async DAL Service KPI Comparison • Low Latency • ~10-35% Reduction (Average/P99) 0 20000 40000 60000 80000 100000 120000 200030004000500060007000800090001000011000120001300014000150001600017000 LATENCY(INMICROSECONDS) THROUGHPUT (REQUESTS PER SEC) E2E Client-Service-Aerospike Benchmark: Read 50% Write 50% Latency vs. Throughput (4-core VM) 99thPercentileLatency_update 99thPercentileLatency_read AvgLatency_read AvgLatency_update 99.9thPercentileLatency_read 99.9thPercentileLatency_update 99.99thPercentileLatency_read 99.99thPercentileLatency_update
  16. 16. 2017 Software Architecture Summit Async DAL Service KPI Comparison – Cont. • High Throughput • 3-10X Increase (Single Instance Comparison)
  17. 17. 2017 Software Architecture Summit Async DAL Service KPI Comparison – Cont. • Less CPU Usage • 50% CPU Usage Reduction • 66%+ Reduction for Context Switch & System Interrupts
  18. 18. 2017 Software Architecture Summit Async DAL Service KPI Comparison – Cont. • Less Thread Pool • 90% Reduction for Thread pool number 0 20 40 60 80 100 120 140 160 180 200 Server RPC Thread Operation Thread Replication Thread Management Thread 9 0 0 2 200 14 40 2 Thread Number Comparison Async Sync
  19. 19. Async DAL Service KPI Comparison – Cont. • Memory Friendly • 20% Reduction for Memory Allocation • 100+MB Young Generation after Young GC • 130+MB Pooled Off-heap 0.00% 0.01% 0.02% 0.03% 0.04% 0.05% 0.06% 0.07% Sync Async GC Time / Total Time GC Time / Total Time 0 50 100 150 200 250 300 350 Sync Async GC Count GC Count
  20. 20. We Have ONE Async Dream • Reform Application Charter from CPU-bound Charter to IO- bound • Traffic Throughput (non-)linear growth with CPU Usage • By guarantee Low Latency, Taking 20-30K TPS with 500MB JVM Heap (After young GC) • Cloud Friendly Application • Less Hardware Investment • Low Operational Cost • Easy Capacity Estimation
  21. 21. High Performance Design E2E Async • Non-blocking Pipeline: Async RPC + Async DataAccess Less is More • Shared ThreadPool OVER Separate ThreadPool • Inline Execution over Execution cross Multiple Thread Pool Autonomous Memory Management • Use Off-Heap as much as possible (inbound/outbound & [de]serialization) • Release Inbound Memory At earlier stage (submitRequest)
  22. 22. High Performance Good Practice • Performance Test as Critical Path for Each Commit • [Mandatory] Continuous Performance Test for Each Commit Inbound/Outbound Management • Batch Consolidation • Order Management • Timeout Management • Retry Only Happen in Client Side Programming Habit • Fast Fail over Exception Thrown Cascading • Logging & Monitoring Matters • Thread-safe Write Operation In Control Plan while Exception-safe Read Operation In Data Plane KPI Sign-Off
  23. 23. Async High Level Architecture Real Time Data Service Data Set Clients Data Set 1 Client Data Set N Client Data Set Schema Data Access API Metadata API Generic Configuration API KV Store APIClient Server Biz logic HTTP(s) RPC Client HTTP(s) RPC Server KV Store API Generic logic Schema-less Read KV Store Metadata namespace Data set namespace Configuration namespace Direct access Service access Store/Cache
  24. 24. Async DAL Service Hierarchy
  25. 25. Async Data Access Maturity • Client& Server RoR Identification • biz-schema aware on Client Side • Schema-less on Sever Side • Traffic Sharding & Routing • Active-Active/Active-Standby • Auto-Failover • Multi-Tenancy • ACL • Direct/Service-To-Service Replication … .... • Source-of-Truth for Online Guideline & Offline Inventory • Centralized Configuration • Zero Restart/Auto-Fresh DAL Service Feature Metadata Driven Data Access Mapping DataSet => KV Mapping Logical => Physical DataSet Mapping
  26. 26. 2017 Software Architecture Summit Async RPC Control Plane Abstraction
  27. 27. 2017 Software Architecture Summit Async RPC Maturity • Configurable Execution Chain per URL • Customize protobuf / json encoder • Inject Monitoring Module • Execution Resource Configuration • Threadpool size / netty option (tcp_nodelay) • Sharable or not • Service Listener Registry • Server Container Life Cycle Management • Graceful Shutdown • Partial Shutdown Given Container • Auto Rebuild RPC Client Channel High Flexibility Configuration RPC Resource Management
  28. 28. Async RPC Embrace Async DataAccess
  29. 29. Async Core Value • Low Latency + High Throughput • Low System Load • SLA Isolation • Understand Performance Contribution More • Zero Code Change + Zero Release (new case on-board) • Minimize new DB Storage Integration Effort • Lego-Style Customization • Highly Reusable Functionality High Performance Easy Adoption Cost Saving • Less Hardware Investment • Loose Constraint for Hardware/VM SKU High Flexibility Configuration • Execution Chain per URL (RPC) • DataAccess Storage & Option [consistency & ttl] • Traffic Routing Strategy • Replication Strategy
  30. 30. 2017 Software Architecture Summit Async Family Async Data Access RPC (Server/ Client) In-Memory Aerospike Workflow Messaging (pub-sub) Kafka ActiveMQ Netty HBase
  31. 31. 2017 Software Architecture Summit AGENDA PayPal & PayPal Risk (Platform) Risk DAL Service Challenge Async Solution Async Future Plan
  32. 32. Future Plan • Shared Eventloop • Netty Option (IO Ratio) • NIO vs Epoll SocketChannel • JDK SSL vs OpenSSL • Protobuf vs Msgpack • Sync Client vs Async Client • W/- Monitoring/Replication features Async DataAccess • Compute Operation Support • DB Server-side UDF Adoption • Smart Client for Direct & Service Access • Async HBase Integration Async RPC • Finer Granularity Monitoring & Throttling • Error Handling Injection • Client Side Multiplexing • Server Push Partial Response + RPC Client Consolidate Response Async+Sync Hybrid Workflow Execution Continuous Performance Tuning Deep Dive Open Source in Year 2019
  33. 33. 2017 Software Architecture Summit

×