Real-time Machine Learning
Vinoth Kannan
Intelligent software architecture using Modified Lambda architecture & Apache Mahout
SkillFactory 71
Vinoth.kannan@widas.de
Agenda
• What is machine learning?
• Need for real-time machine learning
• What is Lambda architecture?
• What is Mahout?
• How does a basic recommender engine work?
• Some use cases
What is machine learning?
Introduction
Machine Learning from Streaming Data
• Model that considers recent history
  Example: it has been sunny and 30 degrees in the last two days, so it is unlikely that it will be -10 degrees and snowing the next day
• Model that is updatable
  Example: a retail sales model that remains accurate as the business gets larger
Don't they both mean the same?
Introduction
Machine Learning from Streaming Data
• Model that considers recent history: time-series prediction (example: weather)
• Model that is updatable: non-stationary data distributions (example: retail sales)
Introduction
Machine Learning from Non-stationary Data Distributions
• Incremental algorithms: machine learning algorithms that learn incrementally over the data.
• Batch algorithms: machine learning algorithms that are retrained periodically with a batch algorithm.
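The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the "model" is a single running average, and the class names, the learning rate and the retraining period are hypothetical, not taken from Mahout or any other library.

```python
from collections import deque


class IncrementalMean:
    """Incremental learner: every new observation nudges the estimate."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate  # hypothetical step size
        self.estimate = None

    def update(self, value):
        if self.estimate is None:
            self.estimate = float(value)
        else:
            # Move the estimate part of the way towards the new value.
            self.estimate += self.learning_rate * (value - self.estimate)
        return self.estimate


class PeriodicBatchMean:
    """Batch learner: recomputes the estimate from a window every `period` items."""

    def __init__(self, period=3, window=3):
        self.period = period
        self.buffer = deque(maxlen=window)  # most recent `window` observations
        self.seen = 0
        self.estimate = None

    def update(self, value):
        self.buffer.append(value)
        self.seen += 1
        if self.seen % self.period == 0:
            # Retrain from scratch on the buffered batch.
            self.estimate = sum(self.buffer) / len(self.buffer)
        return self.estimate


# A stream whose distribution shifts from 10 to 20 halfway through.
stream = [10, 10, 10, 20, 20, 20]
incremental = IncrementalMean()
batch = PeriodicBatchMean()
incremental_trace = [incremental.update(v) for v in stream]
batch_trace = [batch.update(v) for v in stream]
```

The incremental estimate starts to follow the shift as soon as the first new value arrives, while the batch estimate stays at the old value until the next retraining run.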
Introduction
The Challenge for the Best Big Data Technology
• Hadoop: batch processing system that can churn through huge volumes of data
• Storm: real-time complex event processing system that can process data streams
Wrong fight!
Hadoop + Storm = Real-time Big Data
It's a chance, not a challenge
Lambda Architecture!
Lambda Architecture
Overview with description
Speed layer
• Only new data
• Compensates for the high latency of serving layer updates
• Batch layer overrides the speed layer
Serving layer
• Loads and exposes the batch views for querying
• Random access to batch views
Batch layer
• Immutable, constantly growing datasets
• Batch views are computed from this raw dataset
Basic Idea behind Lambda Architecture
query = function(all data)
- Nathan Marz, Big Data - Principles and best practices of scalable realtime data systems
Basic Idea behind Lambda
Perform some function over all the data, from the real-time data "0" to the history data "n"
[Diagram: Lambda Architecture, with the Real Time Big Data timeline split into a Storm process and a Hadoop process]
Letting Hadoop process the history data makes the process faster
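A minimal sketch of the idea "query = function(all data)", assuming the function is a simple click count per item. The event lists and the names count_clicks, batch_view, realtime_view and query are hypothetical. In a real deployment the batch view would come from a Hadoop job and the real-time view from a Storm topology; here both are plain Python so the principle is easy to check.

```python
from collections import Counter


def count_clicks(events):
    """The 'function': clicks per item over a list of (timestamp, item) events."""
    return Counter(item for _, item in events)


# History data: already processed by the batch layer (Hadoop role).
history = [(1, "book"), (2, "phone"), (3, "book")]
# Real-time data: arrived after the last batch run (Storm role).
recent = [(4, "book"), (5, "laptop")]

batch_view = count_clicks(history)    # large, precomputed, high latency
realtime_view = count_clicks(recent)  # small, cheap to keep up to date


def query(item):
    """Answer a query by combining the precomputed view with the recent one."""
    return batch_view[item] + realtime_view[item]


# The merged answer equals the function applied to all data at once.
all_data_view = count_clicks(history + recent)
book_clicks = query("book")
```

Because counting is additive, combining the two views gives the same answer as recomputing over all data, while only the small recent part has to be processed at query time.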
The Problem
Unanswered questions of Lambda architecture
[Diagram: Real Time Big Data split into a batch process and a real-time process]
Questions to be answered
• How to define the boundary between the real-time and batch processes?
• How to synchronize the computation between the two systems?
• How to avoid gaps and overlaps?
• What algorithm to use?
• How to avoid failures and provide a fault tolerance mechanism?
Modified Lambda Architecture
Presentation Layer
• The presentation layer must aggregate the outputs of Storm and Hadoop
• The user will see the result of their events in less than 2 seconds
• Seamless merge between short-term and long-term data
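One possible way to get a seamless merge is to let the presentation layer track a cutoff timestamp: the batch result covers everything before the cutoff, and the real-time side keeps only events at or after it. The sketch below assumes this cutoff scheme and a simple per-item count. The class PresentationLayer and its method names are hypothetical, and the slides do not say that the modified architecture works exactly this way.

```python
from collections import Counter


class PresentationLayer:
    """Merges long-term (batch) and short-term (real-time) counts per item."""

    def __init__(self):
        self.batch_counts = Counter()
        self.batch_cutoff = 0    # batch counts cover timestamps < batch_cutoff
        self.recent_events = []  # (timestamp, item) not yet covered by a batch run

    def on_event(self, timestamp, item):
        """Real-time side: the event is visible to queries immediately."""
        self.recent_events.append((timestamp, item))

    def on_batch_complete(self, batch_counts, new_cutoff):
        """Batch side: a finished run replaces the old view and moves the cutoff."""
        self.batch_counts = Counter(batch_counts)
        self.batch_cutoff = new_cutoff
        # Drop events the batch now covers, so nothing is counted twice.
        self.recent_events = [
            (t, i) for t, i in self.recent_events if t >= new_cutoff
        ]

    def query(self, item):
        recent = sum(1 for _, i in self.recent_events if i == item)
        return self.batch_counts[item] + recent


layer = PresentationLayer()
for ts in (1, 2, 3):
    layer.on_event(ts, "book")
before_batch = layer.query("book")

# A batch run that processed all events with timestamp < 3 has finished.
layer.on_batch_complete({"book": 2}, new_cutoff=3)
after_batch = layer.query("book")

layer.on_event(4, "book")
after_new_event = layer.query("book")
```

The count stays the same across the batch hand-over, which is the "no gaps, no overlaps" property asked for in the questions above.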
Machine Learning with Mahout
What is Mahout?
Introduction
• Apache Software Foundation Java library
• Scalable "machine learning" library that runs mostly on Hadoop
• Currently Mahout mainly supports four use cases:
  • Recommendation
  • Clustering
  • Classification
  • Frequent itemset mining
• Core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm
Basic Recommender algorithm
How it works
Today's focus: suggesting items to a user based on the current search
Basic Recommender algorithm
Defining recommendation
Mahout implements a collaborative filtering framework
Two broad categories of recommender engine algorithms:
• User-based: recommends items by finding similar users. Harder to scale because of the dynamic nature of users.
• Item-based: calculates similarity between items and makes recommendations. Items usually don't change much, so the similarities can be calculated offline.
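The item-based approach can be sketched in plain Python as follows. This is not Mahout's API. The functions item_similarities and recommend, the cosine-style similarity on co-occurrence counts, and the sample data are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(user_items):
    """Offline step: similarity between items based on how often they co-occur."""
    item_users = defaultdict(int)  # number of users per item
    co_counts = defaultdict(int)   # number of users per item pair
    for items in user_items.values():
        for item in items:
            item_users[item] += 1
        for a, b in combinations(sorted(items), 2):
            co_counts[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), count in co_counts.items():
        score = count / sqrt(item_users[a] * item_users[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(user, user_items, sims, top_n=3):
    """Online step: score unseen items by similarity to the user's own items."""
    owned = user_items[user]
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


user_items = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "cat": {"book"},
    "dan": {"lamp", "desk"},
}
sims = item_similarities(user_items)
for_cat = recommend("cat", user_items, sims)
for_ann = recommend("ann", user_items, sims)
```

The similarity table depends only on the items, so it can be recomputed in a periodic batch job, while recommend is a cheap lookup that can run at request time.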
Basic Recommender algorithm
Defining recommendation
User preference for an item
• Like something
• Don't like something
• Don't care
Safe to assume: 1 click = 1 like = uniform preference
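With uniform preferences, a user's profile reduces to the set of items they clicked, and a set-overlap measure such as the Tanimoto (Jaccard) coefficient is a common choice for comparing profiles. The sketch below assumes a click log of (user, item) pairs; the function names and the sample log are hypothetical.

```python
def clicks_to_preferences(click_log):
    """Turn (user, item) click events into one set of liked items per user."""
    prefs = {}
    for user, item in click_log:
        # A repeated click adds nothing: 1 click = 1 like.
        prefs.setdefault(user, set()).add(item)
    return prefs


def tanimoto(a, b):
    """Size of the intersection divided by size of the union of two item sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


click_log = [
    ("u1", "a"), ("u1", "b"), ("u1", "a"),
    ("u2", "b"), ("u2", "c"),
]
prefs = clicks_to_preferences(click_log)
similarity = tanimoto(prefs["u1"], prefs["u2"])
```

Here the two users share one item out of three distinct items, so their similarity is one third.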
Mahout Library of Algorithms
Lots of algorithms to Choose From
Use Cases
Real Time Machine Learning
eCommerce
Objective: Increase sales revenue
• Match potential customers to the right product
• Personalise the user experience on web and email
• Customer lifecycle management
Use Cases
Real Time Machine Learning
Financial Services
Objective: Real-time fraud detection
• Compute patterns/predictors for individual customers
• Classify and cluster customers and recalculate patterns and predictors
• Set thresholds across all data
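As a rough illustration of per-customer predictors combined with a global threshold, the sketch below keeps a running mean and variance of each customer's transaction amounts and flags amounts that deviate strongly from that customer's history. The z-score rule, the threshold value 3.0 and all names are assumptions made for this example, not a method described in the slides.

```python
from math import sqrt


class CustomerProfile:
    """Running mean and variance of one customer's amounts (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, amount):
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)

    def zscore(self, amount):
        # Too little history, or no spread yet: treat the amount as normal.
        if self.n < 2:
            return 0.0
        std = sqrt(self.m2 / (self.n - 1))
        return abs(amount - self.mean) / std if std > 0 else 0.0


THRESHOLD = 3.0  # hypothetical threshold shared by all customers


def process(profiles, customer, amount, threshold=THRESHOLD):
    """Score the transaction against the customer's history, then learn from it."""
    profile = profiles.setdefault(customer, CustomerProfile())
    suspicious = profile.zscore(amount) > threshold
    profile.update(amount)
    return suspicious


profiles = {}
amounts = [10, 12, 11, 9, 10, 500]
flags = [process(profiles, "c1", amount) for amount in amounts]
```

Each profile is updated incrementally per event, so it fits the real-time side of the architecture, while the shared threshold is the kind of value that could be tuned in a batch job over all data.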
Use Cases
Real Time Machine Learning
Media
Objective: Generating metadata
• Video/audio/text analysis
• Find patterns/clusters for people, places, products, things
Use Cases
Real Time Machine Learning
Carbookplus
Objective: Generating metadata
• Match potential trips to the right destination
• Recommend the best gas station
• Recommend contacts whom the user might know
• Match the right advertisers to customers based on vehicle needs
Summary
• Ability to create real-time systems based on the Lambda architecture
• Usefulness of predictive algorithms
• Reasons to concentrate on real-time predictions
Further Reading
http://storm-project.net/
http://mahout.apache.org/
http://hadoop.apache.org/
Thank You
