Active Learning for Fraud Prevention

Published in: Technology

  1. 1. Active Learning for Fraud Prevention
  2. 2. Agenda: Introduction, Fraud Prevention, Algorithm, Experiments, Conclusion. © 2016 PayPal Inc. Confidential and proprietary.
  3. 3. INTRODUCTION
  4. 4. About Me • Software Engineer / Data Scientist / ML Researcher • Ph.D. in Computer Science • Research in face recognition, phishing/spam, and fraud prevention
  5. 5. About PayPal. The power of our platform: 2.5+ million developers; 4.9 billion payments/year; ~300 payments/second at peak; 184 M active customer accounts; 42 petabytes of data; 4.5 T database calls/quarter. PayPal operates one of the largest private clouds in the world. We have transformed core business processes into robust service-based platforms. Our technology transformation enables us to: • Process payments at tremendous scale • Accelerate the innovation of new products • Engage world-class developers & technologists
  6. 6. FRAUD PREVENTION
  7. 7. Fraud Prevention @ PayPal: robust feature engineering, machine learning, and statistical models; highly scalable, multi-layered infrastructure software; a superior team of data scientists, researchers, and financial and intelligence analysts.
  8. 8. Fraud Prevention @ PayPal • Transaction level: advanced machine learning and statistical models flag fraudulent behavior up front; more sophisticated algorithms run after the transaction is complete • Account level: account-level activity is monitored to identify abusive behavior; abusive patterns include frequent payments and suspicious profile changes • Network level: account-to-account interaction is monitored, for example frequent transfers of money from several accounts to one central account
  9. 9. Fraud Prevention – What are we up against? Fraudsters are becoming increasingly smart and adaptive. We need cost-effective solutions that can model complex attack patterns not previously observed. We need scalable and computationally efficient prediction models.
  10. 10. Fraud Prevention – What are we up against? • It is much harder to get performance lift on our flagship models • We need to re-look at all aspects of traditional model building • We need out-of-the-box thinking. [Figure annotation: "Area we are missing" (AUC 0.96)]
  11. 11. Fraud Prevention – What can we do to build better models? [Diagram: a data matrix with rows d1 … dM, columns feature1 … featureN, and a Target (Label) column, annotated with: better features, better labeling, advanced ML algorithms, bigger and better data.]
  12. 12. ALGORITHM – ACTIVE LEARNING
  13. 13. Active Learning – What is it? • Supervised learning algorithms require labeled data • Labelling is difficult, time-consuming, and expensive: active learning to the rescue • Idea: an ML algorithm can achieve better accuracy if it is allowed to “choose the data” from which it learns* • Overcome the labelling bottleneck by posing queries (unlabeled instances) to be labeled by a human. [Diagram components: Unlabeled Data, Labeled Data, Human Annotator, Machine Learning Model, (Re)Build Model, Select Queries.] *Source: Burr Settles
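The select-query, label, rebuild cycle on this slide can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: the 1-D threshold "model", the `oracle` function standing in for the human annotator, and the boundary value 0.6 are all assumptions chosen to keep the example self-contained.

```python
import random

def fit_threshold(labeled):
    """(Re)build the model: midpoint between the largest labeled
    negative and the smallest labeled positive."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    lo = max(negatives) if negatives else 0.0
    hi = min(positives) if positives else 1.0
    return (lo + hi) / 2.0

def active_learning_loop(pool, oracle, n_queries, seed_points):
    """Pool-based loop: query the instance closest to the current
    decision threshold, ask the oracle for its label, then refit."""
    labeled = [(x, oracle(x)) for x in seed_points]
    unlabeled = [x for x in pool if x not in seed_points]
    threshold = fit_threshold(labeled)
    for _ in range(n_queries):
        if not unlabeled:
            break
        # Select query: the instance the model is least sure about.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))  # human annotator step
        threshold = fit_threshold(labeled)      # (re)build model
    return threshold, labeled

rng = random.Random(0)
pool = [rng.random() for _ in range(400)]
TRUE_BOUNDARY = 0.6  # hypothetical ground truth, hidden from the learner

def oracle(x):
    """Stand-in for the human annotator."""
    return int(x > TRUE_BOUNDARY)

threshold, labeled = active_learning_loop(
    pool, oracle, n_queries=10, seed_points=[min(pool), max(pool)]
)
print(f"labels used: {len(labeled)}, learned threshold: {threshold:.3f}")
```

Because every query lands next to the current threshold, the loop behaves like a binary search over the pool, which is why a dozen labels can be enough to locate the boundary among 400 instances.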
  14. 14. Active Learning – What is it? Scenarios: • Membership Query Synthesis: request labels for ‘any’ unlabeled instance in the input space • Stream-based Selective Sampling: unlabeled instances are drawn one at a time and the learner decides whether to discard or query each one • Pool-based Sampling: instances are queried from a pool according to an informativeness measure
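The difference between the stream-based and pool-based scenarios can be made concrete with a small sketch. The `uncertainty` measure, the `min_uncertainty` cutoff, and the identity `predict_proba` used in the example are illustrative assumptions, not details from the slides.

```python
def uncertainty(p):
    """Uncertainty of a binary prediction: highest at p = 0.5,
    lowest at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p - 1.0)

def stream_based_select(stream, predict_proba, min_uncertainty):
    """Stream-based selective sampling: see one instance at a time
    and decide on the spot whether to query or discard it."""
    for x in stream:
        if uncertainty(predict_proba(x)) >= min_uncertainty:
            yield x

def pool_based_select(pool, predict_proba, batch_size):
    """Pool-based sampling: rank the whole pool by informativeness
    and query the top batch_size instances."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical scores; predict_proba is the identity for illustration.
scores = [0.05, 0.45, 0.9, 0.55, 0.7]
from_stream = list(stream_based_select(scores, lambda s: s, min_uncertainty=0.8))
from_pool = pool_based_select(scores, lambda s: s, batch_size=2)
print(from_stream, from_pool)
```

The stream variant needs a fixed cutoff because it never sees the full data, while the pool variant only needs a labelling budget.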
  15. 15. Active Learning – What is it? Query Strategy Frameworks: • Uncertainty Sampling • Query-By-Committee • Expected Model Change • Expected Error Reduction • Variance Reduction • Density-Weighted Methods
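For uncertainty sampling, three measures commonly used in the active learning literature are least confidence, margin, and entropy. The sketch below shows them for a single predicted class distribution; the slides do not say which measure, if any, was used.

```python
import math

def least_confidence(probs):
    """1 minus the probability of the most likely class;
    larger means more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes;
    smaller means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def entropy(probs):
    """Shannon entropy in nats; larger means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

confident = [0.9, 0.1]
unsure = [0.5, 0.5]
for name, dist in (("confident", confident), ("unsure", unsure)):
    print(name, least_confidence(dist), margin(dist), entropy(dist))
```

For binary problems such as fraud versus non-fraud, all three measures rank instances identically; they only diverge with three or more classes.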
  16. 16. Active Learning – Toy Example. Toy data: 400 instances. Model using random sampling: 70% accuracy. Model using active learning (uncertainty sampling): 90% accuracy.
  17. 17. Active Learning For Fraud Prevention – Why is it unique? • Data is unbalanced • Fraud labelling requires trained experts and can’t be outsourced • Fraud labelling is time-consuming • Fraud labelling requires more than just individual instances; it requires the transactions before and after • Fraud labelling requires data from other entities (e.g., IP address) • Fraud labelling requires aggregate data • Fraud tags are not instantaneous and mature at different times (e.g., chargebacks)
  18. 18. Active Learning For Fraud Prevention – High Level Framework. [Diagram components: Labeled Data, Create Bags, Deep Learning Model, GBT Model, (Re)Build Models, Unlabeled Data, Predict, Query By Committee, Human Expert, Create Statistics, Active Feature Engineering, Simulate Features.]
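The framework pairs a deep learning model with a GBT model and selects queries by committee. One way such a step could look is sketched below. The slides do not state how disagreement is measured, so the score spread used here is an assumption, and `model_a`, `model_b`, and the transaction fields are hypothetical stand-ins for the two trained models.

```python
def committee_disagreement(scores):
    """Spread between the committee members' fraud scores for one
    transaction (assumed measure; larger means more disagreement)."""
    return max(scores) - min(scores)

def query_by_committee(candidates, committee, batch_size):
    """Send the transactions the committee disagrees on most
    to the human expert."""
    scored = [
        (committee_disagreement([model(x) for model in committee]), i)
        for i, x in enumerate(candidates)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidates[i] for _, i in scored[:batch_size]]

# Hypothetical stand-ins for the deep learning and GBT models.
def model_a(txn):
    return min(1.0, txn["amount"] / 1000)

def model_b(txn):
    return 0.9 if txn["new_device"] else 0.1

transactions = [
    {"id": 1, "amount": 50, "new_device": False},   # both say low risk
    {"id": 2, "amount": 900, "new_device": False},  # models disagree
    {"id": 3, "amount": 100, "new_device": True},   # models disagree
    {"id": 4, "amount": 950, "new_device": True},   # both say high risk
]
to_label = query_by_committee(transactions, [model_a, model_b], batch_size=2)
print([txn["id"] for txn in to_label])
```

Transactions on which both models agree, whether low or high risk, are left out of the labelling queue, which concentrates expert time on the cases where a label is most likely to change a model.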
  19. 19. Modeling Algorithm – Deep Learning. [Diagram: Input Layer, Hidden Layers, Output Layer.] • If a network has many layers of non-linearity, it is “deep” • Needs a scalable platform • Needs lots of training data
  20. 20. Modeling Algorithm – Deep Learning • Network topology: feed-forward • Key parameters: number of hidden layers, number of neurons in each hidden layer, regularization, activation function
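To show how the key parameters map onto a feed-forward network, here is a minimal forward pass in plain Python. The layer sizes, the ReLU activation, and the random initialization are illustrative assumptions; training and regularization are omitted, and this is unrelated to the H2O implementation named later in the deck.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, hidden_sizes, seed=0):
    """One (weights, biases) pair per layer. len(hidden_sizes) is the
    number of hidden layers; each entry is that layer's neuron count."""
    rng = random.Random(seed)
    sizes = [n_inputs] + list(hidden_sizes) + [1]  # single output score
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers

def forward(layers, x, activation=relu):
    """Feed-forward pass: hidden layers use `activation`,
    the output layer uses a sigmoid to produce a score in (0, 1)."""
    a = list(x)
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, a)) + b
             for row, b in zip(weights, biases)]
        is_output = idx == len(layers) - 1
        a = [sigmoid(v) if is_output else activation(v) for v in z]
    return a[0]

layers = init_network(n_inputs=4, hidden_sizes=[8, 8])
score = forward(layers, [0.2, -1.0, 0.5, 3.0])
print(f"{len(layers) - 1} hidden layers, score = {score:.3f}")
```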
  21. 21. Modeling Algorithm – Gradient Boosting Trees • GBT = Gradient Descent + Boosting • Fit an additive (ensemble) model in a forward stage-wise manner • In each stage, introduce a new model to compensate for the shortcomings of the existing models
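The forward stage-wise idea can be sketched with depth-1 trees (stumps) and squared loss, where the negative gradient is simply the residual. This is a toy illustration under those assumptions, on a single feature, and not the GBM implementation used in the experiments.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree: pick the split that minimizes squared
    error and predict the mean target on each side."""
    best = None
    for split in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def fit_gbt(xs, ys, n_trees, learning_rate):
    """Each stage fits a stump to the current residuals (the negative
    gradient of squared loss) and adds a shrunken copy to the ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(10)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
one_tree = fit_gbt(xs, ys, n_trees=1, learning_rate=0.3)
many_trees = fit_gbt(xs, ys, n_trees=20, learning_rate=0.3)
print(mse(one_tree, xs, ys), mse(many_trees, xs, ys))
```

The number of trees and the learning rate trade off against each other: a smaller learning rate needs more stages to reach the same training error.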
  22. 22. Modeling Algorithm – Gradient Boosting Trees • Strengths: no pre-processing required, robust, scalable • Weaknesses: overfits (need to find a proper stopping point), sensitive to noise • Key parameters: number of trees, max depth, max observations, learning rate
  23. 23. EXPERIMENTS
  24. 24. Datasets • Training data: 1 year, 11 million transactions (1 million for active labelling) • Test data: 4 months, 4 million transactions • Number of features: 500–600
  25. 25. Tools • H2O: open source, scalable, robust, Deep Learning & GBM implementations • R: open source, active learning package
  26. 26. Early Results – Active Learning Shows Promise…
      # of instances queried    AUC (*weighted)
      0                         0.960
      1000                      0.961
      10000                     0.963
      50000                     0.971
      100000                    0.975
      500000                    0.977
      1000000                   0.979
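The slide does not explain what "weighted" means for the AUC. One possible reading, assumed here purely for illustration, is that each transaction carries a weight (for example a dollar amount) and each positive-negative pair counts in proportion to the product of its weights. A small pairwise sketch under that assumption:

```python
def weighted_auc(labels, scores, weights=None):
    """Probability that a random positive outscores a random negative,
    with each pair weighted by the product of its instance weights.
    Ties count as half. Quadratic in the data size, so only suitable
    for small examples."""
    if weights is None:
        weights = [1.0] * len(labels)
    rows = list(zip(labels, scores, weights))
    positives = [(s, w) for y, s, w in rows if y == 1]
    negatives = [(s, w) for y, s, w in rows if y == 0]
    won = 0.0
    total = 0.0
    for pos_score, pos_weight in positives:
        for neg_score, neg_weight in negatives:
            pair_weight = pos_weight * neg_weight
            total += pair_weight
            if pos_score > neg_score:
                won += pair_weight
            elif pos_score == neg_score:
                won += 0.5 * pair_weight
    return won / total

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
plain = weighted_auc(labels, scores)
weighted = weighted_auc(labels, scores, weights=[1, 1, 3, 1])
print(plain, weighted)
```

In the weighted call, the fraud case scored 0.35 counts three times as much, so its mis-ranking against the 0.4 non-fraud case pulls the metric down further than in the unweighted call.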
  27. 27. CONCLUSIONS
  28. 28. Conclusions • Deep learning & GBT have shown tremendous performance for fraud detection • Active learning shows promise in improving the performance of these champion models • Active learning also significantly reduces our labelling cost
