Real-Time Machine Learning at Industrial scale (University of Oxford, 9th Oct 2012)


Right now, in institutions around the world, some of the greatest minds in computer science and statistics are coming up with amazing new algorithms and mathematically beautiful solutions. However, it's entirely possible that the solutions they conceive will be impracticable in industry. The reason is simple: "the best answer is useless if it arrives too late to do anything with it". The key principle here is the compromise between 'accuracy' and 'latency'. In this talk I will describe examples where this holds true, and how I am using real-time machine learning models to solve challenges in eCommerce, Financial Services and Media companies.

Published in: Technology

  1. Real-Time Machine Learning at Industrial scale ... the battle of accuracy vs latency
     @tumra, 9th October 2012
     TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
     Michael Cutler @cotdp
  2. $ whoami
     Michael Cutler (@cotdp)
     ● Previously at British Sky Broadcasting
       ○ Last 7 years in R&D
       ○ Created several patented systems & algorithms
       ○ Kicked off ‘Big Data’ initiative at Sky in 2008
     ● Co-founder & CTO @ TUMRA in March 12
       ○ Real-time big data science platform
       ○ Alpha-testing with selected clients
  3. Agenda
     ● Background
     ● Real-Time vs Batch processing
     ● Accuracy vs Latency
     ● Use Cases
       ○ eCommerce
       ○ Financial Services
       ○ Media
     ● Questions
  4. Background
     Big Data is "in vogue", but what does it mean:
     ● Distributed processing
     ● Massively scalable
     ● Commodity
     Apache Hadoop is the "Kernel" of the Big Data OS:
     ● Distributed Filesystem (HDFS)
     ● Parallel Processing (Map/Reduce, YARN)
  5. Background (contd)
     Solving problems with Big Data is hard:
     ● Tools are all low-level (Pig, Hive etc.)
     ● Skills are hard to find
     What is "Data Science":
     ● Understanding data & solving problems
     ● Applies the following skills:
       ○ Statistical Analysis
       ○ Machine Learning
       ○ Communicating Results
  6. Real-Time vs Batch processing
  7. Batch - Hoppers, Bins, Buckets
     Credit:
  8. Real-Time - Flows & Streams
     Credit:
  9. Real-Time vs Batch processing
     Similarities to the Industrial Revolution:
     ● From handicraft to Batch & Real-Time
     ● Complexity increases
     Need for "Real-Time":
     ● Wherever the variation can change faster than you can retrain models
     ● When you can't pre-compute everything ahead of time
  10. Accuracy vs Latency
  11. Accuracy vs Latency
      Netflix Prize winning entry:
      ● Ensemble of 100s of models
      ● Massively compute intensive solution
      ● Marginally better than much simpler models
      IBM won the KDD Cup 2009 (Orange):
      ● IBM Watson team won by sheer brute force
      ● Used a "one of everything" approach generating hundreds of models
  12. Accuracy vs Latency (contd)
      Mathematical navel-gazing:
      ● Often the factor we're optimising for isn't the thing we measure improvement in:
        ○ User ratings vs. customer longevity/value
        ○ Overfitting outliers vs. missing clear Fraud
      Given the choice between a "best guess" now, and a "marginally better" answer later, I'd take the "best guess" every time.
  13. However, that doesn't mean...
  14. Accuracy vs Latency (contd)
      It's a trade-off:
      ● Sometimes "best guess" is good enough,
      ● Other times we can wait for the accuracy,
      ● And of course, occasionally we want both!
      Key objective:
      ● Most appropriate solution for the use-case
      ● Hybrid solutions: part batch, part real-time
  15. Use Case: eCommerce
  16. Use Case - eCommerce
      Objective - Increase profits
      How:
      ● Match potential customers to the right products
      ● Personalise user experience on web & email
      ● Customer lifecycle management
      Method:
      ● Ensemble of real-time models
      ● Collect lots of implicit feedback data
  17. Use Case - eCommerce (contd)
      Detail:
      ● Clustering - behavior, demogs
      ● Simple predictors - keywords to products
      ● Bayesian Bandit - blend the output
      Requirements:
      ● Predictions in < 50 ms
      ● Online learning models
      ● Occasional batch updates are OK
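The "Bayesian Bandit - blend the output" bullet on slide 17 is not spelled out in the deck. One common reading is Thompson sampling over the upstream models: each model is an arm, and implicit feedback (a click or no click) updates a Beta posterior online. The sketch below is a minimal illustration under that assumption; the class name, arm names and click rates are hypothetical and are not taken from TUMRA's system.

```python
import random


class BayesianBandit:
    """Thompson-sampling bandit over a fixed set of named arms.

    Each arm stands for one upstream model (for example a cluster-based
    or keyword-based recommender). All names here are hypothetical.
    """

    def __init__(self, arms, seed=None):
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self.params = {arm: [1, 1] for arm in arms}
        self.rng = random.Random(seed)

    def choose(self):
        """Sample a plausible click rate per arm and pick the highest."""
        draws = {
            arm: self.rng.betavariate(a, b)
            for arm, (a, b) in self.params.items()
        }
        return max(draws, key=draws.get)

    def update(self, arm, clicked):
        """Online update from one piece of implicit feedback."""
        if clicked:
            self.params[arm][0] += 1
        else:
            self.params[arm][1] += 1

    def expected_rate(self, arm):
        """Posterior mean click rate for an arm."""
        a, b = self.params[arm]
        return a / (a + b)


def simulate(true_rates, rounds=5000, seed=0):
    """Run the bandit against simulated users with fixed click rates."""
    bandit = BayesianBandit(list(true_rates), seed=seed)
    users = random.Random(seed + 1)
    pulls = {arm: 0 for arm in true_rates}
    for _ in range(rounds):
        arm = bandit.choose()
        pulls[arm] += 1
        bandit.update(arm, users.random() < true_rates[arm])
    return bandit, pulls
```

Both `choose` and `update` are constant-time per arm, which is what makes this style of blending compatible with a budget such as "predictions in < 50 ms"; an occasional batch job could still reset or re-seed the priors.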
  18. When eCommerce #FAILs
  19. I've only ever bought Cat food...
  20. ... wait, there's more, no Cat food
  21. Even Amazon can #FAIL
  22. Use Case: Financial Services
  23. Use Case - Financial Services
      Objective - Reduce Fraud
      How:
      ● Compute patterns/predictors for individuals
      ● Cluster individuals and recompute for clusters
      ● Compute baselines across all data
      Method:
      ● Hybrid and Hierarchical Clustering models
      ● Simple predictors for individuals, clusters & baseline
  24. Use Case - Financial Services
      Detail:
      ● CHEAT!!! ... Cluster to nearest centroid
        ○ will degrade over time (Hunchback Clusters)
      ● Use simple metrics to alert (stddev)
      Requirements:
      ● Ability to alert/intervene near real-time < 1 second
      ● Adapt to rapid changes (within baseline & clusters)
      ● Periodic batch processing to recompute clusters
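The "cheat" on slide 24 (assign each event to the nearest existing centroid, then alert on a simple standard-deviation test) can be sketched roughly as follows. This is an illustrative sketch, not the speaker's implementation: the function names, the use of a transaction amount as the monitored metric, and the threshold of three standard deviations are all assumptions. The running statistics use Welford's algorithm so that nothing has to be recomputed per event.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the closest centroid (Euclidean distance).

    This is the cheap real-time step: no re-clustering, just a lookup
    against centroids produced by an earlier batch job.
    """
    return min(
        range(len(centroids)),
        key=lambda i: math.dist(point, centroids[i]),
    )


class RunningStats:
    """Streaming mean and sample standard deviation (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_outlier(self, x, k=3.0):
        """True if x is more than k standard deviations from the mean."""
        sd = self.stddev()
        return sd > 0 and abs(x - self.mean) > k * sd


def score_transaction(amount, features, centroids, cluster_stats, k=3.0):
    """Assign to a cluster, test against its baseline, then update it."""
    cluster = nearest_centroid(features, centroids)
    stats = cluster_stats[cluster]
    alert = stats.is_outlier(amount, k)
    # Always fold the value in so the baseline keeps adapting (assumed policy).
    stats.add(amount)
    return cluster, alert
```

Because the centroids are never moved here, assignments drift away from the true cluster shapes over time, which is the degradation the slide warns about; the periodic batch job is what would replace `centroids` with freshly computed ones.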
  25. Use Case - Financial Services
  26. Use Case: Media
  27. Use Case - Media
      Objective - Generating Metadata
      Why:
      ● Drive second screen applications
      ● Create new streams of information for resale
      How:
      ● Video / Audio analysis
      ● Closed Caption or Subtitle text processing
      ● Knowledgebase: People, Places, Products & Things
  28. Use Case - Media (contd)
      Method:
      ● Natural Language Processing
        ○ Named Entity Recognition
        ○ Topic Extraction & Disambiguation
      ● Graph databases & algorithms
      Requirements:
      ● Responses in < 1 second
      ● Ability to learn new Things
      Example of 12,000 entities from our Knowledgebase...
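The deck does not say how entity recognition is implemented. As a deliberately simplified stand-in, the sketch below tags entities in caption text by greedy longest-match lookup against a small in-memory knowledgebase (a gazetteer), rather than a statistical NER model or a graph database. The knowledgebase entries, identifiers and function names are all hypothetical.

```python
import re

# Hypothetical miniature knowledgebase: surface form -> (entity id, type).
KNOWLEDGEBASE = {
    "oxford": ("place/oxford", "Place"),
    "university of oxford": ("org/university-of-oxford", "Organisation"),
    "london": ("place/london", "Place"),
}


def tag_entities(text, kb=KNOWLEDGEBASE):
    """Return (surface text, entity id, type) for each known entity.

    Greedy longest-match over word n-grams. Lookups are lower-cased so
    that all-caps subtitle text still matches.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    max_words = max((len(name.split()) for name in kb), default=0)
    found = []
    i = 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            entry = kb.get(phrase.lower())
            if entry:
                found.append((phrase, *entry))
                i += n
                break
        else:
            i += 1  # no known entity starts at this word
    return found


def learn_entity(name, entity_id, entity_type, kb=KNOWLEDGEBASE):
    """Add a new Thing at runtime so later captions can match it."""
    kb[name.lower()] = (entity_id, entity_type)
```

A dictionary lookup like this cannot disambiguate (it has no way to tell which "Oxford" is meant), which is where the topic extraction and graph algorithms listed on the slide would come in; it does show why "ability to learn new Things" can be a cheap online operation.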
  29. Summary
  30. Summary
      Key points:
      ● Clear move towards distributed algorithms
      ● Low latency is often worth more than accuracy
      ● Trade-offs are dependent on the use-cases
      Further reading:
      ● Apache Mahout
      ● Storm Project
      ● Data Science London
      ● Machine Learning Meetup
  31. Almost finished!
  32. Introducing TUMRA Labs
      API access to some of our real-time models:
      ● Probabilistic Demographics
      Coming Soon:
      ● Language detection
      ● Sentiment analysis
      ● Metadata Generation
      Free to sign up and easy to get started!
  33. Questions?
      Work @tumra
      @cotdp