2. Agenda
What is Machine Learning?
Need for Real Time Machine Learning
What is Lambda architecture?
What is Mahout?
How does a basic recommender engine work?
Some Use Cases
4. Introduction
Machine Learning from Streaming Data
• Model that considers recent history: it has been sunny and 30 degrees for the last two days, so it is unlikely to be -10 degrees and snowing the next day
• Model that is updatable: a retail sales model that remains accurate as the business gets larger
Don't they both mean the same thing?
5. Introduction
Machine Learning from Streaming Data
• Time-series prediction (weather): model that considers recent history
• Non-stationary data distributions (retail sales): model that is updatable
6. Introduction
Machine Learning from non-stationary data distributions
• Incremental algorithms: machine learning algorithms that learn incrementally over the data.
• Batch algorithms: machine learning models that are retrained periodically with a batch algorithm.
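The contrast between the two approaches can be sketched with a deliberately tiny "model": the mean of a numeric stream. This is a minimal sketch under that assumption, not an algorithm taken from the slides; `IncrementalMean` and `batch_mean` are hypothetical names.

```python
class IncrementalMean:
    """Incremental learner: updates its estimate one observation at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        # Fold one new observation into the estimate without revisiting old data.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


def batch_mean(history):
    """Batch learner: recomputes the estimate from the full stored history."""
    return sum(history) / len(history)


stream = [10.0, 12.0, 11.0, 15.0]

model = IncrementalMean()
for value in stream:
    model.update(value)

incremental_result = model.mean
batch_result = batch_mean(stream)
```

Both paths arrive at the same estimate here. The incremental version never stores the stream, while the batch version needs the whole history each time it retrains.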
7. Introduction
The Challenge for the Best Big Data Technology
• Hadoop: batch processing system that can churn through huge volumes of data
• Storm: real-time complex event processing system that can process data streams
11. Lambda Architecture
Overview with description
Speed layer
• Only new data
• Compensates for high-latency serving layer updates
• Batch layer overrides speed layer
Serving layer
• Loads and exposes the batch views for querying
• Random access to batch views
Batch layer
• Immutable, constantly growing datasets
• Batch views are computed from this raw dataset
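The three layers can be sketched in a few lines, assuming the query is a simple per-key event count. `LambdaCounter` and its methods are hypothetical names for illustration only; a real deployment would run the batch step on Hadoop and the speed step on Storm, and the batch run would not be instantaneous as it is here.

```python
from collections import Counter


class LambdaCounter:
    """Toy lambda architecture that counts events per key."""

    def __init__(self):
        self.master = []             # batch layer: append-only raw data
        self.batch_view = Counter()  # serving layer: precomputed batch view
        self.speed_view = Counter()  # speed layer: data not yet in a batch view

    def ingest(self, key):
        # New data goes to the master dataset and to the speed layer.
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Recompute the batch view from all raw data; it overrides the speed layer.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        # Answer by merging the batch view with the speed view.
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
counter.ingest("home")
counter.ingest("home")
counter.run_batch()      # batch view now covers the first two events
counter.ingest("home")   # only the speed layer has seen this event
counter.ingest("cart")

home_count = counter.query("home")
cart_count = counter.query("cart")
```

After `run_batch()` the speed view is emptied, which is the toy equivalent of the batch layer overriding the speed layer.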
12. Basic Idea behind Lambda architecture
query = function(all data)
- Nathan Marz
Big Data - Principles and best practices of scalable realtime data systems
13. Basic Idea behind Lambda
Perform some function over the data, from real-time data "0" to history data "n"
Real Time Big Data (Lambda Architecture): Storm Process on the real-time data + Hadoop Process on the history data
Letting Hadoop process the history data makes the process faster
14. The Problem
Unanswered questions of Lambda architecture
Real Time Big Data: Batch Process + Real-time
Questions to be answered
• How to define the boundary between the real-time and batch processes?
• How to synchronize the computation between the two systems?
• How to avoid gaps and overlaps?
• What algorithm to use?
• How to avoid failures and provide a fault-tolerance mechanism?
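One possible answer to the boundary and gaps/overlaps questions, offered as an assumption rather than something the slides prescribe, is to split events by timestamp with a half-open rule. The function `split_at_boundary` and the sample events below are hypothetical.

```python
def split_at_boundary(events, boundary):
    """Partition (timestamp, value) events with a half-open rule.

    Events with timestamp < boundary go to the batch side and events with
    timestamp >= boundary go to the real-time side, so every event lands
    on exactly one side.
    """
    batch_side = [e for e in events if e[0] < boundary]
    realtime_side = [e for e in events if e[0] >= boundary]
    return batch_side, realtime_side


events = [(1, 5), (2, 7), (3, 1), (4, 4), (5, 9)]
boundary = 4  # hypothetical: timestamp up to which the last batch run is complete

batch_side, realtime_side = split_at_boundary(events, boundary)

# For an additive function such as a sum, merging the two partial results
# gives the same answer as computing over all data at once.
total_from_all_data = sum(value for _, value in events)
total_from_merge = (sum(value for _, value in batch_side)
                    + sum(value for _, value in realtime_side))
```

The half-open comparison is the design choice that matters: using `<=` on one side and `>=` on the other would count the boundary event twice, and using `<` and `>` would drop it.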
15. Modified Lambda Architecture
Presentation Layer
• Presentation layer must aggregate the outputs of Storm and Hadoop
• Users see the results of their events in less than 2 seconds
• Seamless merge between short-term and long-term data
17. What is Mahout?
Introduction
• Apache Software Foundation Java library
• Scalable "machine learning" library that runs mostly on Hadoop
• Currently Mahout mainly supports four use cases: recommendation, clustering, classification and frequent itemset mining
• Core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm
19. Basic Recommender algorithm
Defining recommendation
Mahout implements a collaborative filtering framework
Two broad categories of recommender engine algorithms
• User-based: recommends items by finding similar users. Harder to scale because of the dynamic nature of users.
• Item-based: calculates similarity between items and makes recommendations. Items usually don't change much, so the similarities can be calculated offline.
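The item-based approach can be sketched in plain Python. This is not Mahout's API: the ratings, the choice of cosine similarity and all function names are assumptions made for illustration.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"A": 5.0, "B": 4.0},
    "bob": {"A": 4.0, "B": 5.0, "C": 2.0},
    "carol": {"B": 1.0, "C": 5.0},
}


def item_vectors(ratings):
    """Turn user -> item ratings into item -> {user: rating} vectors."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def item_similarities(ratings):
    """Offline step: similarity for every ordered pair of distinct items."""
    vectors = item_vectors(ratings)
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i in vectors for j in vectors if i != j
    }


def recommend(user, ratings, sims):
    """Online step: score unseen items by similarity-weighted ratings."""
    seen = ratings[user]
    candidates = {j for (_, j) in sims} - set(seen)
    scores = {
        j: sum(sims[(i, j)] * r for i, r in seen.items())
        for j in candidates
    }
    return sorted(scores, key=scores.get, reverse=True)


sims = item_similarities(ratings)
alice_recommendations = recommend("alice", ratings, sims)
```

The split between `item_similarities` and `recommend` mirrors the point above: the pairwise similarities change slowly and can be computed in a batch job, leaving only a cheap lookup for request time.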
20. Basic Recommender algorithm
Defining recommendation
User preference for an item
• Like something
• Don't like something
• Don't care
Safe to assume: 1 click = 1 like = uniform preference
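Under the "1 click = 1 like" assumption, preferences carry no rating value, so similarity has to be computed from who clicked what. The sketch below uses the Tanimoto (Jaccard) coefficient as one reasonable choice; the click log and function names are hypothetical.

```python
# Hypothetical click log of (user, item) pairs; repeated clicks are allowed.
clicks = [
    ("u1", "shoes"), ("u1", "socks"), ("u1", "shoes"),
    ("u2", "shoes"), ("u2", "socks"),
    ("u3", "socks"), ("u3", "hat"),
]


def boolean_preferences(clicks):
    """Map each item to the set of users who clicked it at least once."""
    users_by_item = {}
    for user, item in clicks:
        users_by_item.setdefault(item, set()).add(user)
    return users_by_item


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: shared users / all users of either item."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


prefs = boolean_preferences(clicks)
shoes_socks = tanimoto(prefs["shoes"], prefs["socks"])
shoes_hat = tanimoto(prefs["shoes"], prefs["hat"])
```

Storing users in a set makes the preference uniform: the second click by `u1` on shoes adds nothing.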
22. Use Cases
Real Time Machine Learning
eCommerce
Objective: Increase sales revenue
Match potential customers to the right product
Personalise user experience on web and email
Customer lifecycle management
23. Use Cases
Real Time Machine Learning
Financial Services
Objective: Real Time Fraud Detection
Compute patterns/predictors for individual customers
Classify and cluster customers and recalculate patterns and predictors
Set threshold across all data
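A minimal sketch of the per-customer pattern plus global threshold idea, assuming the "pattern" is just the running mean and standard deviation of transaction amounts. The class, the threshold value and the sample amounts are hypothetical; a real fraud system would use far richer predictors.

```python
import math


class CustomerProfile:
    """Running mean and variance of one customer's amounts (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, amount):
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)

    def zscore(self, amount):
        # Not enough history to judge: treat the amount as unremarkable.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(amount - self.mean) / std if std > 0 else 0.0


THRESHOLD = 3.0  # hypothetical global threshold applied to all customers
profiles = {}


def score_transaction(customer, amount):
    """Flag the amount if it deviates strongly, then fold it into the profile."""
    profile = profiles.setdefault(customer, CustomerProfile())
    suspicious = profile.zscore(amount) > THRESHOLD
    profile.update(amount)
    return suspicious


history = [20.0, 22.0, 19.0, 21.0, 20.0]
normal_flags = [score_transaction("c1", amount) for amount in history]
large_flag = score_transaction("c1", 500.0)
```

The profile is updated incrementally per event, which suits the speed layer. Folding flagged amounts back into the profile is a simplification; a production system would likely hold them out until reviewed.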
24. Use Cases
Real Time Machine Learning
Media
Objective: Generating Meta Data
Video/Audio/Text analysis
Find patterns/clusters for people, places, products, things
25. Use Cases
Real Time Machine Learning
Carbookplus
Objective: Generating Meta Data
Match potential trips to the right destination
Recommend the best gas station
Recommend contacts whom the user might know
Match the right advertisers to customers based on vehicle needs
26. Summary
Ability to create real time systems based on lambda architecture
Usefulness of predictive algorithms
Reason to concentrate on real time predictions
Further reading
http://storm-project.net/
http://mahout.apache.org/
http://hadoop.apache.org/