
HBaseCon 2015: Running ML Infrastructure on HBase

Sift Science uses online, large-scale machine learning to detect fraud for thousands of sites and hundreds of millions of users in real-time. This talk describes how we leverage HBase to power an ML infrastructure including how we train and build models, store and update model parameters online, and provide real-time predictions. The central pieces of the machine learning infrastructure and the tradeoffs we made to maximize performance will also be covered.



  1. Running ML Infrastructure on HBase (andrey@siftscience.com)
  2. About Sift Science • Fraud detection using machine learning • Real-time • Billions of purchases scored • Hundreds of millions of users
  3. Outline: I. Data at Sift Science II. Quick ML overview • Inputs: customer data • Outputs: probability of fraud III. Batch training the model IV. Online learning and scoring
  4. Data at Sift: customers stream events to us • Page views (JavaScript) • Purchases (API) • Labels (API or console) • Time series view of the user, read via scans
  5. Supervised ML
  6. Start with your data: you have examples of GOOD and BAD users, and a set of signals that you think are predictive of fraud.
  7. Train: build a model from existing data. Train a statistical model with examples of GOOD and BAD users. The model will learn signal values common to each user type.
  8. Predict: find patterns in new data. Apply the model to currently active customers. Predict which are fraud and which aren't.
  9. Act: turn insights into action. Intelligently segment your customers with a probability of risk.
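The train/predict loop on slides 6-9 can be sketched in a few lines. This is a toy logistic regression on made-up signals, purely illustrative; it is not Sift's actual model or feature set:

```python
import math

def train(examples, lr=0.1, epochs=200):
    """Fit a tiny logistic regression by gradient descent.
    examples: list of (feature_vector, label) with label 1 = BAD (fraud), 0 = GOOD."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of log loss w.r.t. the pre-sigmoid score
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    """Probability that user x is fraudulent."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Hypothetical signals: [num_failed_payments, account_age_days / 100]
labeled = [([3, 0.1], 1), ([4, 0.2], 1), ([0, 5.0], 0), ([1, 4.0], 0)]
model = train(labeled)
```

The "Act" step then thresholds or buckets `predict(...)` to segment customers by risk.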
  10. Batch Training
  11. Production snapshots: move the data from the online (production) cluster to the batch/experiment cluster
  12. Batch training pipeline (driven via MapReduce): select training events (read from HFiles) → feature extraction (time series via scans) → feature transformations (set cardinalities) → model training (write to HFiles)
  13. Time series of events, read with a single scan: Signup → Add CC → Add item(s) to cart → Purchase 1 → Change CC → Change billing → Add item(s) to cart → Purchase 2
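Reading a user's whole time series with one scan works because HBase stores rows in lexicographic key order. A minimal sketch, assuming a hypothetical row-key layout of `customer:user:timestamp` (the talk does not specify the actual key schema), with a sorted dict standing in for an HBase table:

```python
from bisect import bisect_left

def row_key(customer, user, ts):
    """Hypothetical row key: zero-padded timestamp keeps lexicographic
    order equal to chronological order within one (customer, user)."""
    return f"{customer}:{user}:{ts:013d}"

def scan(table, start, stop):
    """Emulate an HBase range scan: yield rows with start <= key < stop."""
    keys = sorted(table)
    i = bisect_left(keys, start)
    while i < len(keys) and keys[i] < stop:
        yield keys[i], table[keys[i]]
        i += 1

table = {
    row_key("cust1", "alice", 1000): "signup",
    row_key("cust1", "alice", 2000): "add_cc",
    row_key("cust1", "alice", 3000): "purchase",
    row_key("cust1", "bob", 1500): "signup",
}
# '~' sorts after all digits, so this stop key bounds alice's rows only.
events = [v for _, v in scan(table, "cust1:alice:", "cust1:alice:~")]
```

One scan returns alice's events in time order without touching bob's rows.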
  14. Data transformation: a time series of events becomes a feature vector • { Device ID features } • { Number of emails } • { NLP features } • { Address features } • { Custom fields } • … over 1K features in total
  15. Data Transformations
  16. Sparse feature (email): a@ (num_fraud=1), b@ (num_fraud=3), c@ (num_fraud=3), d@ (num_fraud=1) → dense feature (email): 1, 3, …
  17. Sparse feature densification • Sparse fields: device IDs, cookies, custom fields, etc. • Mapping to a dense space based on set cardinality • Dual-table implementation: a slower sets table (up to 8K items per set; > 100M sets) and a faster counts table (batching, coalescing) • Global and customer states
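The densification idea from slides 16-17 can be sketched with in-memory dicts standing in for the two HBase tables. The 8K cap is from the talk; the function names and table shapes are illustrative, not Sift's actual schema:

```python
MAX_SET_SIZE = 8000  # per the talk: up to 8K items per set

sets_table = {}    # slower "sets" table: sparse value -> set of fraud user ids
counts_table = {}  # faster "counts" table: sparse value -> set cardinality

def observe_fraud(value, user):
    """Record that `user` was fraudulent with this sparse value (e.g. an email),
    keeping the fast counts table in sync with the set's cardinality."""
    s = sets_table.setdefault(value, set())
    if user not in s and len(s) < MAX_SET_SIZE:
        s.add(user)
        counts_table[value] = len(s)

def densify(value):
    """Dense feature: how many distinct fraud users shared this sparse value.
    Reads only the fast counts table, never the big sets table."""
    return counts_table.get(value, 0)

for u in ["u1", "u2", "u3"]:
    observe_fraud("b@example.com", u)   # like slide 16's b@ with num_fraud=3
observe_fraud("a@example.com", "u9")    # like slide 16's a@ with num_fraud=1
```

The model then consumes the small integer `densify(value)` instead of the unbounded sparse value itself.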
  18. Sparse feature densification • The most talked-to table pair (counts, sets) • Memcache + HBase • "Approximately consistent" • Throughput/latency vs. consistency tradeoff • Higher noise tolerance in the ML feature space
  19. Densification in production: 95% cache hit rate • 50-100 batches/sec • 50-200 rows/batch • latency: 5 ms at the 75th percentile, 100 ms at the 99th
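The "approximately consistent" counts path trades freshness for throughput: reads go through a cache, and increments are coalesced per row and flushed in batches. A sketch of that design; class and method names are illustrative, not Sift's actual code:

```python
class ApproxCounter:
    """Cache-aside counter with coalesced, batched writes.
    `store` stands in for HBase, `cache` for memcache; readers may briefly
    see stale values, which the ML feature space tolerates as noise."""

    def __init__(self, flush_every=50):
        self.store = {}      # durable backing store (HBase stand-in)
        self.cache = {}      # fast read path (memcache stand-in)
        self.pending = {}    # coalesced increments awaiting a batch flush
        self.flush_every = flush_every

    def incr(self, key, n=1):
        # Coalesce: many increments to one row collapse into one pending delta.
        self.pending[key] = self.pending.get(key, 0) + n
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self):
        # Batch: one pass applies every pending delta and refreshes the cache.
        for key, n in self.pending.items():
            self.store[key] = self.store.get(key, 0) + n
            self.cache[key] = self.store[key]
        self.pending.clear()

    def get(self, key):
        if key in self.cache:            # cache hit: may lag pending writes
            return self.cache[key]
        val = self.store.get(key, 0)     # cache miss: read through the store
        self.cache[key] = val
        return val

c = ApproxCounter(flush_every=2)
c.incr("clicks")           # buffered, not yet visible to readers
stale = c.get("clicks")    # 0: the read path lags the pending increment
c.incr("views")            # second pending row reaches the batch size, flushes
fresh = c.get("clicks")    # 1: flush made the increment visible
```

This is the throughput/latency-versus-consistency tradeoff from slide 18 in miniature.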
  20. Batch training • Events are moved to the batch cluster via snapshots • User time series • Transformations on feature vectors • Model parameters (global and customer-specific) • Models are shipped back to production (snapshot again) • Every 2-3 weeks
  21. Online learning: Time series → Features → Score (update) • Updates to sparse feature state • Update model parameters
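The online path scores an event and, when a label arrives, folds it straight into the parameters instead of waiting for the next batch retrain. A minimal sketch using a single SGD step on a logistic score; this is the generic online-learning pattern, not Sift's specific update rule:

```python
import math

def score(w, x):
    """Probability of fraud from the current parameters."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def online_update(w, x, label, lr=0.5):
    """One SGD step: nudge the parameters toward the observed label."""
    g = score(w, x) - label
    return [wi - lr * g * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]           # model parameters, stored per the talk in HBase
x_fraud = [1.0, 2.0]     # hypothetical feature vector for one event
before = score(w, x_fraud)            # 0.5: untrained model is uncertain
for _ in range(20):
    w = online_update(w, x_fraud, 1)  # label arrives: this user was fraud
after = score(w, x_fraud)             # now close to 1 for similar events
```

In production the parameter read-modify-write would hit the model tables in HBase rather than a local list.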
  22. Why HBase? • Scans for batch operations, row operations online • Higher-level atomic operations and batching • Block caching (and other forms of caching) • Snapshots • Drives the console and front end
  23. Lessons learned • Coalescing + batching • Fast-table/slow-table pattern • Append operations for batch learning, row operations for online learning • Hashing on (customer, user) to avoid hot regions • Pre-splitting
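The last two lessons, hashing on (customer, user) and pre-splitting, go together: a hash prefix spreads one busy customer's writes across regions, and pre-split boundaries spread load before any data arrives. A sketch; the salt width, split count, and key layout here are illustrative assumptions, not the talk's actual values:

```python
import hashlib

NUM_SPLITS = 16  # illustrative pre-split count

def salted_key(customer, user, ts):
    """Prefix the row key with a short hash of (customer, user) so sequential
    writes for hot customers land on different regions instead of one."""
    salt = hashlib.md5(f"{customer}:{user}".encode()).hexdigest()[:2]
    return f"{salt}:{customer}:{user}:{ts:013d}"

# Split keys to create the table with, one boundary per region after the
# first, evenly spaced over the 2-hex-digit salt space ("10", "20", ... "f0").
split_keys = [format(i * 256 // NUM_SPLITS, "02x") for i in range(1, NUM_SPLITS)]
```

Because every event for one (customer, user) pair shares the same salt, a single scan bounded by that prefix still reads the whole time series in order.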
  24. Questions? andrey@siftscience.com
