How Orange Financial combat financial frauds over 50M transactions a day using Apache Pulsar

You will learn how Orange Financial combats financial fraud across more than 50M transactions a day using Apache Pulsar. The presentation was given at the Strata Data Conference in New York, US, in September 2019.

Published in: Software

  1. 1. How Orange Financial combat financial frauds over 50M transactions a day using Apache Pulsar Vincent Xie (Bestpay), Jia Zhai (StreamNative)
  2. 2. About us Vincent (Weisheng) Xie ❏ Current Director @ Orange Financial ❏ Previous Tech lead of ML engineering team @ Intel Jia Zhai ❏ Co-Founder of StreamNative ❏ Apache Pulsar PMC Member ❏ Apache BookKeeper PMC Member
  3. 3. Agenda ❏ Background ❏ Apache Pulsar ❏ Unified Data Processing ❏ Our Practices ❏ Q & A
  4. 4. Background Intro
  5. 5. Orange Financial Orange Financial Services Group (Chinese: 甜橙金融), formerly known as Bestpay, is an affiliate company of China Telecom. It reached 1.13 trillion CNY ($18.37 Billion) transaction volume in 2018, with 500 million registered users and 41.9 million active users. Subsidiaries: Bestpay (a mobile wallet and payment app); Jieqian (a consumer loan service); Orange Wealth; Orange Insurance; Orange Credit; Orange Financial Cloud
  6. 6. Source: iiMedia Research Inc.
  7. 7. High Industry Penetration Rate Source: China Unionpay
  8. 8. Source: RSA
  9. 9. Challenges ❏ High concurrency ❏ > 50M transactions, 1 billion events a day (peak: 35K/s) ❏ Low latency demand ❏ response < 200ms ❏ Large number of batch jobs and streaming jobs
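To put the load figures on slide 9 in perspective, the short calculation below derives the average event rate and compares it with the quoted peak. It is only back-of-the-envelope arithmetic on the slide's numbers; the slide says "> 50M" transactions, so the results are approximate lower bounds, and the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic on the figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60

transactions_per_day = 50_000_000   # slide: "> 50M transactions" (lower bound)
events_per_day = 1_000_000_000      # slide: "1 billion events a day"
peak_events_per_sec = 35_000        # slide: "peak: 35K/s"

# Average rate if events were spread evenly over the day.
avg_events_per_sec = events_per_day / SECONDS_PER_DAY

# How many events each transaction generates on average.
events_per_transaction = events_per_day / transactions_per_day

# How much headroom the system needs above the daily average.
peak_to_avg = peak_events_per_sec / avg_events_per_sec

print(f"average events/s:       {avg_events_per_sec:,.0f}")
print(f"events per transaction: {events_per_transaction:.0f}")
print(f"peak / average ratio:   {peak_to_avg:.2f}")
```

The quoted peak is roughly three times the daily average rate, which is the burst the pipeline has to absorb while still answering in under 200 ms.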
  10. 10. “A merchant’s total transaction volume ($) within the past month (30 days) (current transaction included)” = sum($past_29days) [batch] + sum($today_upto_current) [streaming]
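The feature on slide 10 splits into a part that a batch job can precompute (the past 29 full days) and a part that only a streaming job can know (today, up to and including the current transaction). A minimal sketch of that split follows; the function name, the per-day totals dictionary, and the use of integer amounts are assumptions made for illustration, not the presenters' implementation.

```python
from datetime import date, timedelta

def merchant_30day_volume(daily_totals, todays_amounts, today):
    """Combine a batch-side and a streaming-side aggregate.

    daily_totals   -- dict mapping date -> that day's total (batch output)
    todays_amounts -- amounts seen so far today, current transaction
                      included (streaming state)
    today          -- the current date
    """
    # Batch part: the 29 complete days before today.
    past_29_days = sum(
        daily_totals.get(today - timedelta(days=d), 0) for d in range(1, 30)
    )
    # Streaming part: everything observed today up to now.
    today_so_far = sum(todays_amounts)
    return past_29_days + today_so_far

# Hypothetical data: 100 units per day for the last 40 days, two
# transactions so far today.
today = date(2019, 9, 25)
daily_totals = {today - timedelta(days=d): 100 for d in range(1, 41)}
volume = merchant_30day_volume(daily_totals, [10, 5], today)
print(volume)
```

In a Lambda-style setup the two sums come from different systems and have to be stitched together at serving time, which is the duplication the later slides set out to remove.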
  11. 11. Architecture V1 API Gateway
  12. 12. Batch Layer Speed/Streaming Layer Architecture V1 - Lambda API Gateway Serving Layer
  13. 13. Drawbacks ❏ Software stack complexity ❏ Realtime / Offline / Serving stacks ❏ Multiple clusters to maintain (Kafka / Hive / Spark / Flink) ❏ Different skill sets required (Scala / Java / SQL) ❏ Segmented logic ❏ Historical/Current ❏ Data redundancy ❏ Multiple copies of the data to move around
  14. 14. Introduce Apache Pulsar
  15. 15. What is Apache Pulsar?
  16. 16. “Flexible Pub/Sub Messaging Backed by durable log storage”
  17. 17. Pulsar - A cloud-native architecture Stateless Serving Durable Storage
  18. 18. Pulsar - Segment Centric Storage ❏ Topic Partition (Managed Ledger) ❏ The storage layer for a single topic partition ❏ Segment (Ledger) ❏ Single writer, append-only ❏ Replicated to multiple bookies
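The relationship between a topic partition and its segments on slide 18 can be pictured with a toy in-memory model. This is not how Pulsar or BookKeeper is implemented; the class names, the round-robin bookie choice, and the tiny roll-over threshold are all assumptions chosen to keep the sketch short. It only illustrates the three properties listed: segments are append-only, one segment is open for writing at a time, and each segment is replicated to several bookies.

```python
from dataclasses import dataclass, field

@dataclass
class Ledger:
    """One segment: an append-only list of entries stored on some bookies."""
    ledger_id: int
    bookies: tuple
    entries: list = field(default_factory=list)
    closed: bool = False

    def append(self, payload):
        if self.closed:
            raise ValueError("ledger is sealed; closed segments are immutable")
        self.entries.append(payload)
        return len(self.entries) - 1  # entry id within this ledger

class ManagedLedger:
    """A topic partition: an ordered list of ledgers, only the last one open."""

    def __init__(self, bookies, replicas=2, max_entries_per_ledger=3):
        self.bookies = list(bookies)
        self.replicas = replicas
        self.max_entries = max_entries_per_ledger
        self.ledgers = []
        self._roll_over()

    def _roll_over(self):
        # Seal the current segment and open a new one on a (toy) round-robin
        # selection of bookies.
        if self.ledgers:
            self.ledgers[-1].closed = True
        start = len(self.ledgers) % len(self.bookies)
        chosen = tuple(
            self.bookies[(start + i) % len(self.bookies)]
            for i in range(self.replicas)
        )
        self.ledgers.append(Ledger(len(self.ledgers), chosen))

    def append(self, payload):
        if len(self.ledgers[-1].entries) >= self.max_entries:
            self._roll_over()
        current = self.ledgers[-1]
        return (current.ledger_id, current.append(payload))

partition = ManagedLedger(["bookie-1", "bookie-2", "bookie-3"])
message_ids = [partition.append(f"msg-{i}") for i in range(7)]
for ledger in partition.ledgers:
    print(ledger.ledger_id, ledger.bookies, ledger.entries, ledger.closed)
```

Because old segments are sealed and spread over different bookies, storage can grow by adding bookies without rebalancing whole partitions, which is the point the "segment centric" slides make.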
  19. 19. Pulsar - Pub/Sub
  20. 20. Pulsar - Topic Partitions
  21. 21. Pulsar - Segments
  22. 22. Pulsar - Stream
  23. 23. Pulsar - Stream as a unified view on data
  24. 24. Pulsar - Two levels of reading API ❏ Pub/Sub (Streaming) ❏ Read data from brokers ❏ Consume / Seek / Receive ❏ Subscription Mode - Failover, Shared, Key_Shared ❏ Reprocessing data by rewinding (seeking) the cursors ❏ Segment (Batch) ❏ Read data from storage (bookkeeper or tiered storage) ❏ Fine-grained Parallelism ❏ Predicate pushdown (publish timestamp)
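The two read paths on slide 24 can be contrasted with a small pure-Python model. It does not use the Pulsar client; `Subscription` and `segment_scan` are hypothetical names, and publish times are plain integers. The first class mimics a pub/sub cursor that can be rewound for reprocessing; the function mimics a batch reader that uses each segment's publish-time range to skip segments entirely, which is the idea behind the publish-timestamp pushdown.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Message:
    publish_time: int  # hypothetical integer timestamp
    payload: str

class Subscription:
    """Stream-style read: a cursor over the ordered log that can be rewound."""

    def __init__(self, log: List[Message]):
        self._log = log
        self._cursor = 0

    def receive(self) -> Message:
        msg = self._log[self._cursor]
        self._cursor += 1
        return msg

    def seek(self, position: int) -> None:
        # Rewinding the cursor makes already-consumed messages readable again.
        self._cursor = position

def segment_scan(
    segments: List[List[Message]], start: int, end: int
) -> Tuple[List[Message], int]:
    """Batch-style read: visit only segments overlapping [start, end].

    Assumes messages inside a segment are ordered by publish time.
    Returns the matching messages and the number of segments actually read.
    """
    hits, segments_read = [], 0
    for seg in segments:
        if not seg or seg[-1].publish_time < start or seg[0].publish_time > end:
            continue  # pushdown: this segment cannot contain matches
        segments_read += 1
        hits.extend(m for m in seg if start <= m.publish_time <= end)
    return hits, segments_read

segments = [
    [Message(1, "a"), Message(2, "b")],
    [Message(3, "c"), Message(4, "d")],
    [Message(5, "e")],
]
log = [m for seg in segments for m in seg]

sub = Subscription(log)
first_pass = [sub.receive().payload for _ in range(3)]
sub.seek(0)                      # reprocess from the beginning
replayed = sub.receive().payload

matches, segments_read = segment_scan(segments, start=3, end=4)
print(first_pass, replayed, [m.payload for m in matches], segments_read)
```

In the model, each segment can also be scanned independently, which is what makes the fine-grained parallelism on the slide possible.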
  25. 25. Unified Data Processing on Pulsar
  26. 26. Architecture V2 API Gateway Spark Structured Streaming Spark SQL
  27. 27. Architecture V2 API Gateway Spark Structured Streaming Spark SQL ❏ Single Data Store (Pulsar) ❏ Single Computing Engine (Spark) ❏ Unified API
  28. 28. Pulsar-Spark ❏ Deeply integrated with Pulsar schema ❏ Pulsar topics as Structured Streams ❏ Pulsar Connectors for Spark Structured Streaming ❏ Pulsar Connectors for Spark SQL https://github.com/streamnative/pulsar-spark
  29. 29. Pulsar-Spark / Streaming Queries https://github.com/streamnative/pulsar-spark
  30. 30. Pulsar-Spark / Batch Queries https://github.com/streamnative/pulsar-spark
  31. 31. Pulsar-Spark / Write Results to Pulsar https://github.com/streamnative/pulsar-spark
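The code shown on slides 29 to 31 is not preserved in this transcript. As a stand-in, the sketch below shows what the three operations might look like from PySpark. It assumes the pulsar-spark connector is on the Spark classpath and registers the `pulsar` data source, and that it accepts the option names `service.url`, `admin.url`, `topic`, `startingOffsets` and `endingOffsets`, as described in the connector's README; option names may differ between connector versions. The URLs and helper function names are placeholders.

```python
# Sketch only: requires a SparkSession with the pulsar-spark connector on
# the classpath. URLs and function names are placeholders.

def pulsar_options(topic,
                   service_url="pulsar://localhost:6650",
                   admin_url="http://localhost:8080"):
    """Connection options assumed to be understood by the connector."""
    return {"service.url": service_url, "admin.url": admin_url, "topic": topic}

def _with_options(builder, options):
    # Apply options one by one so dotted keys are passed through unchanged.
    for key, value in options.items():
        builder = builder.option(key, value)
    return builder

def streaming_query(spark, topic):
    """Treat a Pulsar topic as an unbounded Structured Streaming DataFrame."""
    reader = spark.readStream.format("pulsar")
    return _with_options(reader, pulsar_options(topic)).load()

def batch_query(spark, topic):
    """Read the data currently stored in the topic as a bounded DataFrame."""
    reader = spark.read.format("pulsar")
    reader = _with_options(reader, pulsar_options(topic))
    return (reader.option("startingOffsets", "earliest")
                  .option("endingOffsets", "latest")
                  .load())

def write_results(df, topic):
    """Write a batch DataFrame back to a Pulsar topic."""
    writer = df.write.format("pulsar")
    _with_options(writer, pulsar_options(topic)).save()
```

With a real session, `streaming_query(spark, "transactions")` and `batch_query(spark, "transactions")` would return DataFrames over the same topic (the topic name here is made up), so the same Spark SQL logic can run on both current and historical data.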
  32. 32. PoC at Bestpay ❏ Ingest data to Pulsar ❏ Realtime Data ❏ pulsar-io-kafka: move Kafka messages (JSON) into Pulsar and store them in Avro format with schema information ❏ Historical Data ❏ pulsar-spark: query the Hive table and insert the Hive rows into Pulsar as Avro messages ❏ Data Processing ❏ Spark Structured Streaming: for stream processing ❏ Spark SQL: for batch processing and interactive queries
  33. 33. Benefits ❏ Complexity drops 33% (number of clusters from 6 down to 4) ❏ Storage savings of 8.7% (expected to reach 28%) ❏ Time to production improves 11x (backed by streaming SQL) ❏ Higher stability (expected)
  34. 34. Summary ❏ Apache Pulsar is a cloud-native messaging and streaming system ❏ Multi-layered architecture ❏ Segment-centric storage ❏ Two levels of reading API: Pub/Sub + Segment ❏ Apache Pulsar provides a unified view of data ❏ Pulsar + Spark for simple, unified data processing
  35. 35. References ❏ pulsar-io-kafka: https://github.com/streamnative/pulsar-io-kafka ❏ pulsar-spark: https://github.com/streamnative/pulsar-spark ❏ Apache Pulsar as One Storage System for Both Real-time and Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
  36. 36. Community ❏ Pulsar Website: https://pulsar.apache.org ❏ Twitter: @apache_pulsar / @streamnativeio ❏ Slack: https://apache-pulsar.herokuapp.com ❏ Mailing Lists dev@pulsar.apache.org, users@pulsar.apache.org ❏ Github https://github.com/apache/pulsar ❏ Medium https://medium.com/streamnative
  37. 37. Thanks!
