
Realtime Risk Management Using Kafka, Python, and Spark Streaming by Nick Evans

Spark Summit East Talk

  • Comment: Can you please share the configuration (how many Kafka brokers and partitions) that achieved the 14k events/sec?

  1. Realtime Risk Management USING KAFKA, PYTHON, AND SPARK STREAMING
  2. UNDERWRITING CREDIT CARD TRANSACTIONS IS RISKY
  3. WE NEED TO BE QUICK AT UNDERWRITING
  4. WE ALSO NEED TO AVOID LOSING MONEY
  5. Some Numbers: $12Bn cumulative processed · 200k+ merchants · 14k events/second · 7 risk analysts
  6. Risk Analysts
  7. WE NEEDED TO BUILD SOMETHING THAT STOPS THE HOPE FACTOR
  8.
  9. SPARK STREAMING ALLOWS US TO DO REAL-TIME DATA PROCESSING. WE CAN DECIDE WHICH EVENTS NEED A CLOSER LOOK
  10. Intro to Kafka, Zookeeper, and Spark Streaming
  11. Apache Kafka (image taken from Kafka docs)
  12. Apache Zookeeper (image taken from Zookeeper docs)
  13. Apache Spark Streaming (image taken from mapr.com)
  14. OLD WAY: RECEIVERS
  15. [Diagram: events flow from Kafka into a Receiver feeding the Spark Engine; at each batch boundary (t0, t1, t2, t3) an RDD is built with the events from t0 to t1, then from t1 to t2, and each RDD is then processed]
  16. Problems with Receivers
      • The only way to get at-least-once delivery makes it hard to deploy new code
      • Zookeeper is updated with which offsets to start from when data is received, not when it is processed
      • We're actually duplicating Kafka
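For reference, this is roughly what the receiver-based approach looks like in the Python API of Spark 1.x. It is a minimal sketch, not the talk's code: the Zookeeper quorum, consumer group, and topic name are placeholders.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="receiver-based-example")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Receiver-based stream: Spark runs a long-lived Kafka consumer, and the
# consumer group's offsets are committed to Zookeeper when data is
# *received*, not when the batch is processed -- the problem above.
stream = KafkaUtils.createStream(
    ssc,
    "zk1:2181",              # Zookeeper quorum (placeholder)
    "risk-consumer-group",   # consumer group id (placeholder)
    {"events": 1},           # topic -> number of consumer threads
)

stream.pprint()
ssc.start()
ssc.awaitTermination()
```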
  17. NEW WAY: RECEIVERLESS
  18. [Diagram: the Spark Engine consumes from Kafka directly, with no receiver in between; it processes an RDD with the events from t0 to t1, then an RDD with the events from t1 to t2]
  19. General Structure (sketched in code below)
      • Load Kafka offsets from Zookeeper
      • Tell Spark Streaming to create a DStream that consumes from Kafka, starting at the specified offsets
      • Define your processing step (e.g. filter out non-risky events)
      • Define your output step (e.g. POST the data to the case management software)
      • Save Kafka offset of most recently processed event to Zookeeper
      • Start your streaming application, and grab some popcorn!
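A minimal sketch of that structure, assuming Spark 1.4+ (where the Python direct Kafka API exists; `rdd.offsetRanges()` was exposed to Python in Spark 1.5) and the kazoo Zookeeper client. The topic name, broker addresses, Zookeeper path, and the `is_risky` / `post_to_case_management` helpers are hypothetical placeholders, not from the talk.

```python
import json
import requests
from kazoo.client import KazooClient
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

ZK_OFFSET_PATH = "/risk-app/offsets/events/0"  # hypothetical offset path

def is_risky(event):
    # Hypothetical stand-in for the real filtering logic.
    return event.get("title") in {"hair extensions", "gucci", "cannabis"}

def post_to_case_management(event):
    # Hypothetical stand-in for the case-management API endpoint.
    requests.post("https://cases.example.com/api/cases", json=event)

# 1. Load the offset of the most recently processed event from Zookeeper.
zk = KazooClient(hosts="zk1:2181")
zk.start()
zk.ensure_path(ZK_OFFSET_PATH)
data, _ = zk.get(ZK_OFFSET_PATH)
start_offset = int(data) if data else 0

sc = SparkContext(appName="risk-stream")
ssc = StreamingContext(sc, batchDuration=10)

# 2. Create a DStream that consumes directly from Kafka at that offset.
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "kafka1:9092"},
    fromOffsets={TopicAndPartition("events", 0): start_offset})

def handle_batch(rdd):
    offset_ranges = rdd.offsetRanges()  # rdd is still the KafkaRDD here
    # 3 + 4. Processing and output: keep risky events, POST each one.
    risky = rdd.map(lambda kv: json.loads(kv[1])).filter(is_risky)
    for event in risky.collect():  # collect() keeps the sketch simple
        post_to_case_management(event)
    # 5. Only after processing, record the offsets: at-least-once delivery.
    for o in offset_ranges:
        zk.set(ZK_OFFSET_PATH, str(o.untilOffset).encode("utf-8"))

stream.foreachRDD(handle_batch)

# 6. Start the streaming application, and grab some popcorn!
ssc.start()
ssc.awaitTermination()
```

Because the offsets are written to Zookeeper only after the batch's output step finishes, a crash before the write simply replays that batch on restart, which is what makes redeploys safe compared with the receiver approach.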
  20. Example Filtering: Risky Products
      hair extensions, pharmacy, vaporizer, gateway card, wifi pineapple, iPhone, gucci, cannabis, travel package
  21. [Diagram: the RDD for Time 0 holds product titles (hair extensions, gucci, cannabis, sweet bag, nice shoes, taylor swift t-shirt); a Filter keeps the risky ones (hair extensions, gucci, cannabis), a Map turns each into JSON such as {"title": "gucci", …}, and an HTTPS POST sends them to the case management software]
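A sketch of that filter → map → POST pipeline, assuming `events` is a DStream of parsed product events (as in the sketch after slide 19), that the requests library is available on the workers, and that the case-management URL is a placeholder:

```python
import json
import requests

RISKY_TERMS = {"hair extensions", "gucci", "cannabis"}  # from the slide

def post_partition(payloads):
    # Reuse one HTTPS session per partition rather than one per event.
    session = requests.Session()
    for payload in payloads:
        # Placeholder URL for the case-management software's API.
        session.post("https://cases.example.com/api/cases", data=payload)

def handle_rdd(rdd):
    (rdd
        .filter(lambda e: e["title"] in RISKY_TERMS)  # Filter: keep risky titles
        .map(json.dumps)                              # Map: serialize to JSON
        .foreachPartition(post_partition))            # HTTPS POST step

events.foreachRDD(handle_rdd)
```

Using foreachPartition keeps the POSTs on the workers instead of collecting every event back to the driver, which matters at 14k events/second.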
  22.
  23. TIME-WINDOWED AGGREGATIONS
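A sketch of what a time-windowed aggregation could look like here, e.g. counting events per merchant over a sliding window. The `events` DStream, the "merchant_id" field, the checkpoint path, and the alert threshold are all hypothetical; the inverse-reduce form of reduceByKeyAndWindow requires checkpointing to be enabled.

```python
# Count events per merchant over the last 10 minutes, updated every 30s.
ssc.checkpoint("hdfs:///tmp/risk-checkpoints")  # required for windowed state

counts = (events
    .map(lambda e: (e["merchant_id"], 1))
    .reduceByKeyAndWindow(
        lambda a, b: a + b,   # fold new events into the window
        lambda a, b: a - b,   # remove events that slide out of the window
        600,                  # window length: 10 minutes
        30))                  # slide interval: 30 seconds

# Flag merchants with an unusually high event rate (threshold is made up).
counts.filter(lambda kv: kv[1] > 1000).pprint()
```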
  24. The Future
      • Time-windowed functions – a necessity for most of the non-trivial jobs
      • Performance tweaks – we haven't spent any time on this, so lots of potential for gains
      • Machine learning – we could use the risk analyst decisions to build an ML model
      • Improved monitoring – we are only monitoring the basics right now
      • Apache Cassandra – others use it as a fast key/value store for their jobs
      • Improved receiverless API – an API to access Kafka / Zookeeper without hard work
  25. Icon Credits: Credit Card by Rediffusion from the Noun Project; Money Bag by icon 54 from the Noun Project
  26. WE'RE HIRING: SHOPIFY.COM/CAREERS
  27. THANK YOU! @n_e_evans
