...Moving from batch to real-time machine learning


Talk by Michael Cutler, CTO @Tumra, at Data Science London, 6/09/12

Published in: Technology, Education

  1. From square to round wheels... ...moving from batch to real-time machine learning
     @tumra
     TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
     Michael Cutler - 6th Sept 2012
  2. Batch Processing
  3. Credit:
  4. In Manufacturing...
     Batch processing brought advantages :-
     ● Increased scale of production
     ● Reduced manufacturing cost
     ● Economies of scale (reusable parts)
     However :-
     ● Machinery is complex & expensive
     ● Each product requires some bespoke parts
  5. In Technology...
     Batch processing has been around since the 50s in mainframes
     Hadoop (Map/Reduce) advantages :-
     ● Increased scale of processing
     ● Reduced processing cost **
     ● Economies of scale (reusable code)
     However :-
     ● Complex & expensive **
     ● Most jobs require some bespoke code
  6. Map/Reduce != FUN
     Sure, it's "just Java" but...
     ● Requires a certain mindset
     ● Multi-stage algorithm complexity
     ● If you get stuck, R.T.F.S.
     Alleviated to an extent by tools like :-
     ● Pig, Hive, Cascading, Crunch
     Typically requires bespoke code / algorithms
  7. Continuous Processing
  8. Credit:
  9. In manufacturing...
     Described as: "a method used to manufacture, produce, or process materials without interruption"
     Key features :-
     ● Materials are processed in flows & streams
     ● Can run continuously (exc. maintenance)
     ● End-to-end latency can range from seconds to hours
     Credit: Wikipedia
  10. In Technology...
      We have a problem... most Hadoop-related technologies are inherently batch!!
      The trend towards real-time continuous computation requires :-
      ● New tools (Storm?)
      ● Better algorithms
      So what's the solution?
  11. Credit: Scott Simmerman
  12. It's a hybrid of both!
  13. Batch does have its place...
      Map/Reduce is great for "boil the ocean" jobs;
      ● tasks that take hours or days
      ● typically non-interactive with users
      ● works well for pattern mining, clustering etc.
      However, the perfect answer is useless if it arrives so late it's irrelevant...
  14. Real-time machine learning
      Quite simply "data is never at rest"...
      ● processed in streams not batches
      ● best for supervised learning models
      ● end-to-end latency can be in seconds
      Key criteria :-
      ● model always has a best answer available
      ● feedback used to train the model
  15. So what works well in real-time?
      Classification :-
      ● Easiest to implement
      Clustering :-
      ● Periodically batch recompute clusters
      ● Add new data points to the nearest centroid
      ● Rinse, repeat
      Collaborative filtering :-
  16. The machine learning gap...
      Academic Practical
  17. Machine learning gap...
      Academia are way out there with new approaches and algorithms almost every day :-
      ● Many hard to implement in a parallel way
      We need more focus on :-
      ● Inherently distributed algorithms
      ● Practical implementations
      ● Speed over marginal accuracy improvements
  18. Mathematical navel gazing
      We need practical solutions to real-world problems...
      Recommendations Rant!?!?!?!?!
      ● Most recommenders are 2D matrices
      ● Humans are not very 2D
      ● Is there an N-dimensional solution?
  19. Hybrid approach
  20. Hybrid approach
  21. Example Use-cases
      Examples;
      ● eCommerce optimisation
      ● Targeted advertising
      ● Financial services (risk modeling)
      ● Detecting anomalies in M2M data
      ● Automated metadata generation
      ... many more!
  22. Almost finished!
  23. Introducing TUMRA Labs
      API access to some of our real-time models :-
      ● Probabilistic Demographics
      ● Language detection **
      ● Sentiment analysis **
      ● Metadata Generation (entity extraction and disambiguation) **
      Free to sign up and easy to get started!
  24. Questions? @tumra
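
Slide 14 sets two criteria for a real-time model: it always has a best answer available, and feedback is used to train it. The slides do not name an algorithm, so the following is only a minimal sketch of one way to meet both criteria, assuming a logistic regression classifier updated one example at a time with stochastic gradient descent. The class name, method names and learning rate are illustrative assumptions, not anything from the talk or from TUMRA's models.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical streaming classifier (assumption: SGD-trained logistic regression)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # Can be called at any moment: the model always has a best answer,
        # even before it has seen any feedback (it then returns 0.5).
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, label):
        # Feedback step: one gradient update from a single observed outcome
        # (label is 0 or 1). No batch job or stored history is needed.
        error = label - self.predict_proba(x)
        self.weights = [
            w + self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias += self.learning_rate * error


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=1)
    print("before feedback:", model.predict_proba([1.0]))
    # Simulated stream: positive feature means label 1, negative means label 0.
    for _ in range(200):
        model.learn([1.0], 1)
        model.learn([-1.0], 0)
    print("after feedback:", model.predict_proba([1.0]))
```

Each call to `learn` costs time proportional to the number of features, which is what keeps end-to-end latency low: the prediction served to the next request already reflects the most recent feedback.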
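
The clustering recipe on slide 15 (periodically batch recompute clusters, add new data points to the nearest centroid, rinse, repeat) can be sketched as below. This is an illustration under stated assumptions rather than the speaker's implementation: the batch step is assumed to be plain k-means, the real-time step is assumed to nudge the winning centroid with a running mean, and the function names `batch_recompute` and `add_point` are made up for this example.

```python
import math
import random


def nearest(centroids, point):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def batch_recompute(points, k, iterations=10, seed=0):
    """Batch step (assumed to be plain k-means): recompute centroids from all points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    counts = [0] * k
    for p in points:
        counts[nearest(centroids, p)] += 1
    return centroids, counts


def add_point(centroids, counts, point):
    """Real-time step: assign a new point to its nearest centroid and
    shift that centroid towards it using a running mean."""
    i = nearest(centroids, point)
    counts[i] += 1
    rate = 1.0 / counts[i]
    centroids[i] = tuple(c + rate * (x - c) for c, x in zip(centroids[i], point))
    return i


if __name__ == "__main__":
    history = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
               (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, counts = batch_recompute(history, k=2)   # periodic batch job
    cluster = add_point(centroids, counts, (0.0, 0.5))  # streaming update
    print(cluster, centroids, counts)
```

In a hybrid setup the batch job would be rerun on a schedule over the accumulated data, replacing `centroids` and `counts`, while `add_point` handles everything that arrives in between.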