
Semi-Supervised Learning In An Adversarial Environment


Unlike a typical e-commerce marketplace, Uber’s marketplace is real-time, so stopping fraud has to happen in real time too. In this talk, we will dive into Account Takeover (ATO) attacks and how we built a near real-time semi-supervised learning system to keep your Uber accounts safe. ATO attacks evolve very fast and may last only for a short time. Thus the traditional approach of training a model once a week or once a day does not help us defend against new attack vectors. This led us to develop a semi-supervised learning system that is built on top of Apache Spark and uses clustering techniques and feedback signals from our ATO challenges to detect and stop new attack vectors. We will discuss our results from online clustering using streaming k-means and from batch clustering using a more complex clustering algorithm, DBSCAN. We will also share some lessons on feature selection and hyperparameter tuning for clustering algorithms, both of which play a crucial part in performance.

Published in: Technology

Semi-Supervised Learning In An Adversarial Environment

  1. 1. Uber Risk Team Semi-supervised Learning In An Adversarial Environment Karthik Ramasamy Gaurav Agarwal
  2. 2. About us: Karthik Ramasamy • Data Science Manager and founding member of Account Security and Privacy • Previously: ‐ 2 years at my own startup ‐ 4.5 years at LinkedIn (founding member of the security team) • Interests: ‐ Applying ML, recently deep learning, to security and fraud problems ‐ Building scalable infra for ML systems
  3. 3. About us: Gaurav Agarwal • Senior engineer and founding member of the Account Security team • 2 years at Uber working on fraud problems • Previously: ‐ 4 years at Microsoft working on NLP question-answering systems • Interests: ‐ Intersection of distributed systems and ML ‐ Natural language processing
  4. 4. Focus of the talk • DS algorithms and process • Features ‐ Not covered in this talk ‐ Arbitrary names will be used to describe features • Engineering architecture • Engineering challenges
  5. 5. Semi-Supervised Learning Classification Clustering
  6. 6. Adversarial Problem Not-Hotdog Classifier Fake Account Classifier
  7. 7. Adversarial Problem Recall for a specific model % of null features
  8. 8. Account Takeovers (ATO) Easy to detect: Single IP, IPs in the same class C/B Hard to detect: Targeted Attacks on specific accounts, Phishing and Malware Attacks, Massive Botnet with >100K IPs, Proxy IPs across the world
  9. 9. Clustering: K-Means Credits:
  10. 10. Clustering: DBSCAN Credits:
  11. 11. Clustering Algorithm: DBSCAN Semi-Supervised Clustering Approach 1. Method 1 - use labels to tune hyperparameters a. Solves the hyperparameter tuning issue 2. Method 2 - use labels as constraints in the clustering algorithm a. Challenges i. Working with weak labels ii. Scalability
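Method 1 on slide 11 (using labels to tune hyperparameters) can be illustrated with a small sketch. The slides do not show how the tuning was done, so everything here is an assumption: a minimal pure-Python DBSCAN stands in for the sklearn implementation named later in the deck, `label_purity` is a hypothetical scoring function, and the points, labels and candidate `eps` values are made up. The idea is only to pick the `eps` whose clusters agree best with the few labels available.

```python
from collections import Counter, defaultdict
from math import dist


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: one cluster id per point, -1 for noise."""
    n = len(points)
    # A point counts as its own neighbour, as in common implementations.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep expanding
    return labels


def label_purity(cluster_ids, known):
    """Fraction of labeled points that sit in a cluster and match its majority label."""
    by_cluster = defaultdict(Counter)
    for idx, y in known.items():
        if cluster_ids[idx] != -1:  # labeled points left as noise count as misses
            by_cluster[cluster_ids[idx]][y] += 1
    agree = sum(max(c.values()) for c in by_cluster.values())
    return agree / len(known)


def tune_eps(points, known, candidates, min_samples=3):
    """Return the candidate eps whose clustering best matches the known labels."""
    return max(candidates,
               key=lambda e: label_purity(dbscan(points, e, min_samples), known))


# Made-up data: two tight groups of login attempts, four of them labeled.
points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (5, 5), (5.2, 5.1), (5.1, 5.3), (5.3, 5.2)]
known = {0: "good", 1: "good", 4: "bad", 5: "bad"}
best_eps = tune_eps(points, known, candidates=[0.05, 0.5, 10.0])
```

With this toy data, a very small `eps` leaves every point as noise and a very large one merges good and bad attempts into one cluster, so the middle candidate scores best.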
  12. 12. Login Clusters 2016 *PCA used for visualization
  13. 13. Login Clusters 2017 *PCA used for visualization
  14. 14. ML Challenges • Feature Selection ‐ Manual feature selection ‐ Aggregate features are better ‐ Feature normalization is very important ▪ Raw features like trip fare and #trips are bad features ▪ % of UberX trips for a user is a good feature ▪ Keep all features on the same scale, e.g. % of X • Feature evolution in adversarial environment • Scalability ‐ DBSCAN on a large dataset (tens of millions of points) takes a long time to fit • No online DBSCAN
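The normalization advice on slide 14 (prefer "% of X" features over raw counts) can be sketched as follows. The function name, the product names and the counts are hypothetical, chosen only to show why share-based features put light and heavy users on the same scale.

```python
def share_features(trip_counts):
    """Convert raw per-product trip counts into fractions of the user's total."""
    total = sum(trip_counts.values())
    if total == 0:
        # No trips yet: every share is zero rather than undefined.
        return {f"pct_{name}": 0.0 for name in trip_counts}
    return {f"pct_{name}": count / total for name, count in trip_counts.items()}


# Two made-up users whose raw counts differ by 100x but whose mix is identical.
light_user = share_features({"uberx": 3, "pool": 1})
heavy_user = share_features({"uberx": 300, "pool": 100})
```

Both users map to the same feature vector, so a distance-based algorithm such as k-means or DBSCAN compares behaviour rather than volume.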
  15. 15. Engineering Architecture
  16. 16. Goals ● Minimum human input for unknown anomalies ● Automatic action support ● Low latency for sync scenarios, e.g. login ● Easily reusable for other use cases, e.g. bot signups ● Good ML library support
  17. 17. Query Flow Login Client Feature Gathering ML Models Actions Risk Gateway Rule Engine Other Stores Error code Challenge thrown Async call Parallel calls Sync call Challenges: ● 2FA ● Captcha ● ... Streaming Counters Clustering Features
  18. 18. Feature Computation Feature Normalization Categorical Feature Transformation Spark Mllib (k-means) Spark Clustering Features (Cassandra) Offline Features Attempt Thresholding DBSCAN (sklearn) Parameter Tuning Login Attempts (Kafka) Challenge Feedback Signals (Kafka) Streaming pipeline Hourly job pipeline
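The streaming pipeline on slide 18 feeds login attempts into Spark MLlib k-means. The slides do not show the update step, so the following is only a sketch of a decayed mini-batch centroid update in the spirit of streaming k-means, written in plain Python rather than Spark. The function name, the `decay` value and the data are assumptions.

```python
from math import dist


def update_centers(centers, weights, batch, decay=0.5):
    """One decayed mini-batch update; returns new (centers, weights)."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    # Assign each new point to its nearest current center.
    for p in batch:
        k = min(range(len(centers)), key=lambda i: dist(centers[i], p))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    new_centers, new_weights = [], []
    for c, w, s, m in zip(centers, weights, sums, counts):
        old = w * decay  # older data is down-weighted on every batch
        if old + m == 0:
            new_centers.append(list(c))  # nothing to average: keep the center
            new_weights.append(0.0)
            continue
        # Weighted average of the old center and the points in this batch.
        new_centers.append([(c[d] * old + s[d]) / (old + m) for d in range(dim)])
        new_weights.append(old + m)
    return new_centers, new_weights


# Made-up example: both new points fall near the first center.
centers, weights = update_centers(
    centers=[[0.0, 0.0], [10.0, 10.0]],
    weights=[2.0, 2.0],
    batch=[(3.0, 3.0), (3.0, 3.0)],
    decay=0.5,
)
```

A decay below 1 lets centers follow a fast-moving attack pattern, at the cost of forgetting older traffic more quickly.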
  19. 19. Engineering Challenges ● Python vs Scala performance for the streaming case ● DBSCAN limitations in Spark ● Window aggregations limited to 30 minutes ● Joins with feedback signals in real time
  20. 20. Production Setup ● Batch: 7 days' worth of data, run DBSCAN hourly ● Streaming: 60-minute moving window, run streaming k-means ● Used feedback signal success ratios to mark clusters as good, bad or unknown ● Bad clusters: always throw a challenge ● Good clusters: challenge a small % of attempts ● Unknown clusters: challenge X% of attempts
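The cluster-marking policy on slide 20 can be sketched as below. The slides give no thresholds or challenge rates (the rate for unknown clusters is left as "X%"), so every number here is a placeholder, and the reading of "success ratio" as the share of challenges that users passed is an assumption. Function and parameter names are hypothetical.

```python
import random


def label_cluster(passed, failed, good_at, bad_at, min_feedback):
    """Mark a cluster from its challenge feedback counts."""
    total = passed + failed
    if total < min_feedback:
        return "unknown"  # too little feedback to decide either way
    ratio = passed / total
    if ratio >= good_at:
        return "good"
    if ratio <= bad_at:
        return "bad"
    return "unknown"


def should_challenge(label, good_rate, unknown_rate, rng=random):
    """Bad clusters are always challenged; the others are sampled."""
    rate = {"bad": 1.0, "good": good_rate, "unknown": unknown_rate}[label]
    return rng.random() < rate


# Placeholder thresholds, not values from the talk.
THRESHOLDS = dict(good_at=0.9, bad_at=0.2, min_feedback=20)
mostly_passed = label_cluster(95, 5, **THRESHOLDS)
mostly_failed = label_cluster(2, 98, **THRESHOLDS)
```

Sampling a small share of attempts from good clusters keeps feedback flowing, so a cluster can be re-marked if an attacker later starts to blend into it.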
  21. 21. Results Good Clusters Bad Clusters Unknown Clusters
  22. 22. GPU DBSCAN using Faiss from FB
  23. 23. Thank you!