Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

1,809 views

Published on

Fraudsters attempt to pay for goods, flights, hotels – you name it – using stolen credit cards. This hurts both the trust of card holders and the business of vendors around the world. We built a Real-Time Fraud Prevention Engine using Open Source (Big Data) Software: Spark, Spark ML, H2O, Hive, Esper. In my talk I will highlight both the business and the technical challenges that we’ve faced and dealt with.

Published in: Data & Analytics
  • Be the first to comment

Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

  1. 1. Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software Kees Jan de Vries Booking.com
  2. 2. Who am I? About me • Physics PhD Imperial College • Data Scientist at Booking.com for 1.5 year ▸ Security Department • linkedin.com/in/kees-jan-de-vries-93767240 About Booking.com • World leader in connecting travellers with the widest variety of great places to stay • Part of The Priceline Group, the world’s 3rd largest e-commerce company (by market capitalisation) • Employing 14,000 people in 180 countries • Each day, over 1,200,000 room nights are reserved on Booking.com
  3. 3. Contents. • Motivation • Running Example ▸ Probability to Book • Real Time Prediction Engine: Lessons Learnt ▸ Aggregate Features ▸ Models Training and Deployment ▸ Interpretation of Individual Scores
  4. 4. Motivation
  5. 5. Motivation. booking.com
  6. 6. Motivation. booking.com Booking.com • Empower people to experience the world Security • Provide secure reservation system (facilitating hundreds of thousands of reservations on a daily basis) Fraud Prevention • Protect customers and property partners, ensuring the best possible experience
  7. 7. Motivation for this Talk. Train awesome Model Serve in Real Time at Low Latency using aggregate features
  8. 8. Running Example Probability to Book
  9. 9. Disclaimer: although the running example presented in these slides may seem realistic, it is only intended to highlight some lessons learnt about building a real time prediction engine.
  10. 10. Probability to Book. Let’s book a hotel for the Spark Summit East 2017 in Boston
  11. 11. Probability to Book. OK. Let’s click on the first hit.
  12. 12. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  13. 13. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice. Let’s help the customer make the best choice for them!
  14. 14. Probability to Book. We’ll calculate the probability of booking when the Customer clicks on an accommodation Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  15. 15. Behind the Scenes. User id Page id ... Action PredictionEngine Click!
  16. 16. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Click!
  17. 17. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Click!
  18. 18. Behind the Scenes. Features Model Action User id ... Click! Action
  19. 19. Behind the Scenes. Features Model Action User id ... Click! Aggregate features “ ” Action
  20. 20. Aggregate Features
  21. 21. Aggregate Features. Time Requests Data Streams Click Booking
  22. 22. Aggregate Features. Time Requests Data Streams Click Booking Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  23. 23. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests)
  24. 24. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant
  25. 25. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant Small time window, so limited amount of data
  26. 26. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... In memory Complex Event Processing • No lag: instant aggregation • Scalability: see Esper website • Not persistent
  27. 27. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...From http://espertech.com/esper/faq_esper.php#streamprocessing “Complex Event Processing and Esper are standing queries and latency to the answer is usually below 10us with more than 99% predictability.” “The first component of scaling is the throughput that can be achieved running single-threaded. For Esper we think this number is very high and likely between 10k to 200k events per second.” In memory Complex Event Processing
  28. 28. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  29. 29. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Very High Cardinality: features for every user who made at least one booking
  30. 30. Aggregate Features High Cardinality Features with Cassandra • Very scalable: reads & writes • None-instant aggregations ▸ Consistency fundamentally bound by gossip protocol • Persistent ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  31. 31. Aggregate Features High Cardinality Features with Cassandra ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... From http://cassandra.apache.org/ ● Proven ● Fault Tolerant ● Performant ● Scalable ● Elastic ● ... “Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), …”
  32. 32. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  33. 33. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels
  34. 34. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels Larger lag is ok with time window of 3 months and large number of events per page.
  35. 35. Aggregate Features SELECT page_id , COUNT(*) AS res_count FROM reservations GROUP BY page_id Low to Medium Cardinality • No need to go “fancy” • Batch processing Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  36. 36. Models Training and deployment
  37. 37. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  38. 38. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  39. 39. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id
  40. 40. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Warning: stragglers ahead! Distribute wisely!
  41. 41. Aggregate Features for Training. SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Scalable Technologies • Simple SQL ▸ Hive • Facing stragglers ▸ Spark
  42. 42. Model Training & Iteration y X Train Model Predict YES Predict NO Actual YES TP FN Actual NO FP TN Investigate Evaluate • Features ▸ Engineering ▸ Selection • Hyper parameters Improve Model
  43. 43. Sorry: H2O? The benchmark shown on this slide can be found in: http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html More benchmarks: https://github.com/szilard/benchm-ml
  44. 44. Time to Deploy model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/")
  45. 45. Time to Deploy. model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/") AWESOME!!! POJO: Plain Old Java Object. Model export in plain Java code for real- time scoring on the JVM. Very fast!
  46. 46. Model Interpretation
  47. 47. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 Example: p3 = p0 + (p1 -p0 ) + (p3 -p1 ) p0 : bias (p1 -p0 ): feature 15 contr. (p3 -p1 ): feature 2 contr.
  48. 48. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 We hacked this interpretation into Spark ML during a Hackathon at Booking:) H2O does not offer it (yet?).
  49. 49. Probability to Book. The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  50. 50. Probability to Book. Awesome! This app really helps me book! The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  51. 51. Summary
  52. 52. Summary. • How to make Aggregate Features available? ▸ Long/short time windows ▸ High and Low cardinality dimensions • Model Training and Deployment ▸ H2O POJO for fast Real-Time scoring • Interpretation ▸ Contributions per Feature per Score
  53. 53. Thank You. Questions? We’re hiring! Kees Jan de Vries from Booking.com linkedin.com/in/kees-jan-de-vries-93767240

×