In Manufacturing...

Batch processing brought advantages :-
● Increased scale of production
● Reduced manufacturing cost
● Economies of scale (reusable parts)

However :-
● Machinery is complex & expensive
● Each product requires some bespoke parts

In Technology...

Batch processing has been around since the 1950s on mainframes

Hadoop (Map/Reduce) advantages :-
● Increased scale of processing
● Reduced processing cost **
● Economies of scale (reusable code)

However :-
● Complex & expensive **
● Most jobs require some bespoke code

Map/Reduce != FUN

Sure, it's "just Java" but...
● Requires a certain mindset
● Multi-stage algorithm complexity
● If you get stuck, R.T.F.S.

Alleviated to an extent by tools like :-
● Pig, Hive, Cascading, Crunch

Typically requires bespoke code / algorithms

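To make the "certain mindset" point concrete, here is a minimal sketch of the map, shuffle and reduce stages in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real job would express the same stages as Java Mapper and Reducer classes running across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Even for a word count, the problem has to be recast as key/value pairs flowing through separate stages. Anything more involved usually chains several such jobs together, which is where the multi-stage complexity comes from.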
In Manufacturing...

Described as: "a method used to manufacture, produce, or process materials without interruption"

Key features :-
● Materials are processed in flows & streams
● Can run continuously (exc. maintenance)
● End-to-end latency can be from seconds to hours

Credit: Wikipedia

In Technology...

We have a problem... most Hadoop-related technologies are inherently batch!!

The trend towards real-time continuous computation requires :-
● New tools (Storm?)
● Better algorithms

So what's the solution?

Batch does have its place...

Map/Reduce is great for "boil the ocean" jobs :-
● tasks that take hours or days
● typically non-interactive with users
● works well for pattern mining, clustering etc.

However, the perfect answer is useless if it arrives so late it's irrelevant...

Real-time machine learning

Quite simply "data is never at rest"...
● processed in streams not batches
● best for supervised learning models
● end-to-end latency can be in seconds

Key criteria :-
● model always has a best answer available
● feedback used to train the model

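The two key criteria above can be sketched with a small online learner. This is an illustrative assumption, not a model described in the talk: a classic perceptron is used because its update rule fits in a few lines, and the class and method names (`OnlinePerceptron`, `predict`, `feedback`) are hypothetical.

```python
class OnlinePerceptron:
    """A binary classifier that learns one example at a time."""

    def __init__(self, n_features, learning_rate=1.0):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Return the current best answer (0 or 1); callable at any time."""
        score = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1 if score > 0 else 0

    def feedback(self, features, label):
        """Update the model from one labelled example; no batch retrain."""
        error = label - self.predict(features)
        if error != 0:
            step = self.learning_rate * error
            self.weights = [w + step * x for w, x in zip(self.weights, features)]
            self.bias += step


# Each event in the stream is scored first, then used as training feedback.
model = OnlinePerceptron(n_features=2)
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
for features, label in stream:
    model.predict(features)
    model.feedback(features, label)
```

The model can answer before it has seen any data, and every piece of feedback changes it immediately, so there is no point at which it is "offline for retraining".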
So what works well in real-time?

Classification :-
● Easiest to implement

Clustering :-
● Periodically batch recompute clusters
● Add new data points to the nearest centroid
● Rinse, repeat

Collaborative filtering :-

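The clustering loop above can be sketched as follows. The helper names (`nearest_centroid`, `add_point`) are hypothetical, and nudging the centroid with a running mean is an assumption added for illustration; the slide only says that new points are added to the nearest centroid between batch recomputes.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def add_point(point, centroids, counts):
    """Assign point to its nearest centroid and update that centroid in place.

    counts[i] is the number of points already behind centroid i, so the
    update below is an exact running mean (an assumed design choice).
    """
    i = nearest_centroid(point, centroids)
    counts[i] += 1
    centroids[i] = [c + (x - c) / counts[i] for c, x in zip(centroids[i], point)]
    return i


# Centroids and counts would normally come from the last batch run.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
first = add_point([2.0, 0.0], centroids, counts)
second = add_point([10.0, 12.0], centroids, counts)
```

When the periodic batch job finishes, its output would simply replace `centroids` and `counts`, and the streaming assignments carry on from there.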
Machine learning gap...

Academia are way out there with new approaches and algorithms almost every day :-
● Many are hard to implement in a parallel way

We need more focus on :-
● Inherently distributed algorithms
● Practical implementations
● Speed over marginal accuracy improvements

Mathematical navel gazing

We need practical solutions to real-world problems...

Recommendations Rant!?!?!?!?!
● Most recommenders are 2D matrices
● Humans are not very 2D
● Is there an N-dimensional solution?

Introducing TUMRA Labs

API access to some of our real-time models :-
● Probabilistic Demographics
● Language detection **
● Sentiment analysis **
● Metadata Generation (entity extraction and disambiguation) **

Free to sign up and easy to get started!
http://labs.tumra.com/