...Moving from batch to real-time machine learning

Michael Cutler CTO @Tumra talk at Data Science London 6/09/12

Published in: Technology, Education

  • 1. From square to round wheels...
    ...moving from batch to real-time machine learning
    @tumra
    TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
    Michael Cutler - 6th Sept 2012
  • 2. Batch Processing
  • 3. Credit:
  • 4. In Manufacturing...
    Batch processing brought advantages :-
    ● Increased scale of production
    ● Reduced manufacturing cost
    ● Economies of scale (reusable parts)
    However :-
    ● Machinery is complex & expensive
    ● Each product requires some bespoke parts
  • 5. In Technology...
    Been around since the 50s in Mainframes
    Hadoop (Map/Reduce) advantages :-
    ● Increased scale of processing
    ● Reduced processing cost **
    ● Economies of scale (reusable code)
    However :-
    ● Complex & expensive **
    ● Most jobs require some bespoke code
  • 6. Map/Reduce != FUN
    Sure, it's "just Java" but...
    ● Requires a certain mindset
    ● Multi-stage algorithm complexity
    ● If you get stuck, R.T.F.S.
    Alleviated to an extent by tools like :-
    ● Pig, Hive, Cascading, Crunch
    Typically requires bespoke code / algorithms
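
The "certain mindset" behind Map/Reduce can be illustrated with a word count. This is only a single-process Python sketch of the map, shuffle and reduce stages, not Hadoop's Java API; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


# Illustrative input only; a real job would read from distributed storage.
lines = ["batch to real-time", "real-time machine learning"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even this trivial job has to be split into independent stages keyed on intermediate values, which is where the multi-stage complexity on the slide comes from.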
  • 7. Continuous Processing
  • 8. Credit:
  • 9. In manufacturing...
    Described as: "a method used to manufacture, produce, or process materials without interruption"
    Key features :-
    ● Materials are processed in flows & streams
    ● Can run continuously (exc. maintenance)
    ● Latency e2e can be from seconds to hours
    Credit: Wikipedia
  • 10. In Technology...
    We have a problem... most Hadoop related technologies are inherently batch!!
    The trend towards real-time continuous computation requires :-
    ● New tools (Storm?)
    ● Better algorithms
    So what's the solution?
  • 11. Credit: Scott Simmerman
  • 12. It's a hybrid of both!
  • 13. Batch does have its place...
    Map/Reduce is great for "boil the ocean" jobs;
    ● tasks that take hours or days
    ● typically non-interactive with users
    ● works well for pattern mining, clustering etc.
    However, the perfect answer is useless if it arrives so late it's irrelevant...
  • 14. Real-time machine learning
    Quite simply "data is never at rest"...
    ● processed in streams not batches
    ● best for supervised learning models
    ● end-to-end latency can be in seconds
    Key criteria :-
    ● model always has a best answer available
    ● feedback used to train the model
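
The two key criteria above (a best answer is always available, and feedback trains the model) can be sketched with an online classifier. The slides do not name an algorithm, so logistic regression with per-example gradient updates is an assumption here, and the class name `OnlineLogisticRegression`, its methods and the toy stream are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier that can predict at any time and learns one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Best answer currently available: probability that x belongs to class 1."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, label):
        """Use one piece of feedback (the true label, 0 or 1) to take a gradient step."""
        error = self.predict_proba(x) - label
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=2)
before = model.predict_proba([1.0, 0.0])  # untrained model still answers

# Toy feedback stream (made up for illustration), processed one event at a time.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 50
for features, label in stream:
    model.learn(features, label)

after = model.predict_proba([1.0, 0.0])
```

Because each update touches only the current example, the model never needs the full dataset at rest, which is the property the slide asks for.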
  • 15. So what works well in real-time?
    Classification :-
    ● Easiest to implement
    Clustering :-
    ● Periodically batch recompute clusters
    ● Add new data points to the nearest centroid
    ● Rinse, repeat
    Collaborative filtering :-
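
The clustering recipe above can be sketched as follows. This assumes Euclidean distance and a running-mean centroid update between batch runs; neither detail is stated on the slide, and the function names and starting centroids are hypothetical.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def add_point(point, centroids, sizes):
    """Assign point to its nearest centroid and nudge that centroid toward it."""
    i = nearest_centroid(point, centroids)
    sizes[i] += 1
    # Running mean: the centroid moves by 1/n of the gap to the new point.
    centroids[i] = [c + (x - c) / sizes[i]
                    for c, x in zip(centroids[i], point)]
    return i


# Hypothetical state, as if produced by the most recent batch recompute.
centroids = [[0.0, 0.0], [10.0, 10.0]]
sizes = [1, 1]

assigned = add_point([2.0, 0.0], centroids, sizes)
```

The periodic batch job would then replace `centroids` and `sizes` wholesale, correcting any drift that the incremental updates accumulate.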
  • 16. The machine learning gap...
    Academic / Practical
  • 17. Machine learning gap...
    Academia are way out there with new approaches and algorithms almost every day :-
    ● Many hard to implement in a parallel way
    We need more focus on :-
    ● Inherently distributed algorithms
    ● Practical implementations
    ● Speed over marginal accuracy improvements
  • 18. Mathematical navel gazing
    We need practical solutions to real-world problems...
    Recommendations Rant!?!?!?!?!
    ● Most recommenders are 2D matrices
    ● Humans are not very 2D
    ● Is there an N-dimensional solution?
  • 19. Hybrid approach
  • 20. Hybrid approach
  • 21. Example Use-cases
    Examples;
    ● eCommerce optimisation
    ● Targeted advertising
    ● Financial services (risk modeling)
    ● Detecting anomalies in M2M data
    ● Automated metadata generation
    ... many more!
  • 22. Almost finished!
  • 23. Introducing TUMRA Labs
    API access to some of our real-time models :-
    ● Probabilistic Demographics
    ● Language detection **
    ● Sentiment analysis **
    ● Metadata Generation (entity extraction and disambiguation) **
    Free to sign up and easy to get started!
  • 24. Questions? @tumra