Wednesday, March 16, 2011
Mahout
Scalable Data Mining for Everybody
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
  • Hadoop based training
• Decision Forests
  • Hadoop based training
• Logistic Regression (aka SGD)
  • fast on-line (sequential) training
So What?
Online training has low overhead for small and moderate size data-sets
[Figure annotation: "big starts here"]
An Example
And Another
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual benefit.
I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur (thirty-three million one hundred
thousand euros, only) into a foreign company's bank
account for our favor.
...
And Another
Date: Thu, May 20, 2010 at 10:51 AM
From: George <george@fumble-tech.com>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
Mahout’s SGD
• Learns on-line per example
• O(1) memory
• O(1) time per training example
• Sequential implementation
  • fast, but not parallel
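The properties listed above can be illustrated with a minimal sketch of online logistic regression trained by stochastic gradient descent. This is not Mahout's implementation: the class name `OnlineSgdSketch`, the constant learning rate, and the synthetic data are assumptions made for illustration, and the method name `classifyScalar` is borrowed only for familiarity. The model is a fixed-size weight array, so memory does not grow with the number of examples seen, and each training step touches every weight once.

```java
import java.util.Random;

/** Minimal online logistic regression trained by SGD (a sketch, not Mahout's code). */
final class OnlineSgdSketch {
    private final double[] weights;     // fixed size: memory does not grow with examples seen
    private final double learningRate;  // hypothetical constant step size

    OnlineSgdSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that the label is 1. */
    double classifyScalar(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < weights.length; i++) {
            dot += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One gradient step on a single example; the example can be discarded afterwards. */
    void train(int label, double[] x) {
        double error = label - classifyScalar(x);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * x[i];
        }
    }

    public static void main(String[] args) {
        OnlineSgdSketch model = new OnlineSgdSketch(2, 0.1);
        Random rnd = new Random(42);
        for (int n = 0; n < 10_000; n++) {
            double v = rnd.nextDouble() * 2 - 1;                // synthetic feature in [-1, 1)
            model.train(v > 0 ? 1 : 0, new double[] {1.0, v});  // first slot is a bias term
        }
        System.out.println(model.classifyScalar(new double[] {1.0, 0.8}));
        System.out.println(model.classifyScalar(new double[] {1.0, -0.8}));
    }
}
```

Because each call to `train` uses only the current example and the weight array, examples can be streamed from disk one at a time, which is what makes the sequential approach cheap for small and moderate data-sets.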
Special Features
• Hashed feature encoding
• Per-term annealing
  • learn the boring stuff once
• Auto-magical learning knob turning
  • learns correct learning rate, learns correct learning rate for learning learning rate, ...
Feature Encoding
Hashed Encoding
Feature Collisions
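The slides for hashed encoding and feature collisions survive only as titles, so the following is a generic sketch of the hashing trick rather than Mahout's encoder classes. The class `HashedEncoderSketch`, its method names, and the use of `String.hashCode()` as the hash function are all assumptions for illustration. Each word is hashed to a slot in a fixed-size vector, with no dictionary kept; when the vector is small, distinct words can land in the same slot and their counts are summed, which is a collision.

```java
/** Hashing-trick feature encoder (a generic sketch, not Mahout's encoder API). */
final class HashedEncoderSketch {

    /** Maps a feature name to a slot in a vector of the given size. */
    static int slot(String feature, int numFeatures) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(feature.hashCode(), numFeatures);
    }

    /** Encodes words into a fixed-size vector by adding 1 at each word's hashed slot. */
    static double[] encode(String[] words, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String word : words) {
            vector[slot(word, numFeatures)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] words = {"hadoop", "user", "group", "lunch", "hadoop"};
        // A large vector makes collisions unlikely; a tiny one forces them.
        double[] wide = encode(words, 1000);
        double[] narrow = encode(words, 2);
        System.out.println("slot of 'hadoop' in wide vector: " + slot("hadoop", 1000));
        System.out.println("wide[slot('hadoop')] = " + wide[slot("hadoop", 1000)]);
        System.out.println("narrow = [" + narrow[0] + ", " + narrow[1] + "]");
    }
}
```

The vector size is the design knob here: a larger vector costs more memory per model but makes collisions rarer, while a smaller one trades accuracy for space.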
Learning Rate Annealing
[Figure: learning rate (y-axis) vs. # training examples seen (x-axis)]
Per-term Annealing
[Figure: learning rate vs. # training examples seen, annotated "Common Feature" and "Rare Feature"]
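Per-term annealing, as contrasted with a single global annealing schedule above, can be sketched by keeping an update count per feature and shrinking each feature's learning rate as its own count grows. The decay formula used here, `baseRate / sqrt(1 + count)`, is an assumption chosen for illustration; the schedule Mahout actually uses may differ, and the class `PerTermAnnealingSketch` is hypothetical.

```java
/** Per-feature learning-rate decay (illustrative schedule, not Mahout's exact formula). */
final class PerTermAnnealingSketch {
    private final double baseRate;     // rate applied to a feature never seen before
    private final int[] updateCounts;  // how many times each feature has been updated

    PerTermAnnealingSketch(int numFeatures, double baseRate) {
        this.baseRate = baseRate;
        this.updateCounts = new int[numFeatures];
    }

    /** Learning rate for one feature; decays only with that feature's own history. */
    double rateFor(int feature) {
        return baseRate / Math.sqrt(1.0 + updateCounts[feature]);
    }

    /** Call once each time a feature takes part in a weight update. */
    void recordUpdate(int feature) {
        updateCounts[feature]++;
    }

    public static void main(String[] args) {
        PerTermAnnealingSketch rates = new PerTermAnnealingSketch(2, 1.0);
        for (int example = 0; example < 1000; example++) {
            rates.recordUpdate(0);          // feature 0 is common: present in every example
            if (example % 100 == 0) {
                rates.recordUpdate(1);      // feature 1 is rare: present in 1 of 100 examples
            }
        }
        System.out.println("common feature rate: " + rates.rateFor(0));
        System.out.println("rare feature rate:   " + rates.rateFor(1));
    }
}
```

With this kind of schedule a common feature settles quickly ("learn the boring stuff once"), while a rare feature keeps a high learning rate until it has actually been observed enough times.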
General Structure
• OnlineLogisticRegression
  • Traditional logistic regression
  • Stochastic Gradient Descent
  • Per-term annealing
  • Too fast (for the disk + encoder)
Next Level
• CrossFoldLearner
  • contains multiple primitive learners
  • online cross validation
  • 5x more work
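The idea of online cross validation can be sketched as follows: several primitive learners are kept side by side, and each incoming example is held out from exactly one of them, which scores it before the others train on it. This is a simplified illustration and not the code of Mahout's CrossFoldLearner; the class `CrossFoldSketch`, the inner `Learner`, the round-robin fold assignment, and the fixed learning rate are all assumptions.

```java
import java.util.Random;

/** Online cross validation over several learners (a sketch, not Mahout's CrossFoldLearner). */
final class CrossFoldSketch {

    /** Tiny logistic-regression learner, included only to keep the sketch self-contained. */
    static final class Learner {
        private final double[] w;

        Learner(int numFeatures) {
            w = new double[numFeatures];
        }

        double classify(double[] x) {
            double dot = 0.0;
            for (int i = 0; i < w.length; i++) dot += w[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-dot));
        }

        void train(int label, double[] x) {
            double error = label - classify(x);
            for (int i = 0; i < w.length; i++) w[i] += 0.1 * error * x[i];  // assumed fixed rate
        }
    }

    private final Learner[] folds;
    private long seen;            // number of examples processed so far
    private long heldOutCorrect;  // correct predictions made on held-out examples

    CrossFoldSketch(int numFolds, int numFeatures) {
        folds = new Learner[numFolds];
        for (int i = 0; i < numFolds; i++) folds[i] = new Learner(numFeatures);
    }

    void train(int label, double[] x) {
        int heldOut = (int) (seen % folds.length);  // round-robin choice of the held-out learner
        int predicted = folds[heldOut].classify(x) >= 0.5 ? 1 : 0;
        if (predicted == label) heldOutCorrect++;
        for (int i = 0; i < folds.length; i++) {
            if (i != heldOut) folds[i].train(label, x);  // every other learner trains on it
        }
        seen++;
    }

    /** Fraction of examples predicted correctly by a learner that had not trained on them. */
    double heldOutAccuracy() {
        return seen == 0 ? 0.0 : (double) heldOutCorrect / seen;
    }

    public static void main(String[] args) {
        CrossFoldSketch learner = new CrossFoldSketch(5, 2);
        Random rnd = new Random(1);
        for (int n = 0; n < 5000; n++) {
            double v = rnd.nextDouble() * 2 - 1;
            learner.train(v > 0 ? 1 : 0, new double[] {1.0, v});
        }
        System.out.println("held-out accuracy: " + learner.heldOutAccuracy());
    }
}
```

With five folds, every example costs one scoring pass plus four training steps, which is roughly where a "5x more work" figure would come from, and the held-out accuracy is available at any point during training without a separate test set.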
And again
• AdaptiveLogisticRegression
  • 20 x CrossFoldLearner
  • evolves good learning and regularization rates
  • 100 x more work than basic learner
  • still faster than disk + encoding
A comparison
• Traditional view
  • 400 x (read + OLR)
• Revised Mahout view
  • 1 x (read + mu x 100 x OLR) x eta
  • mu = efficiency from killing losers early
  • eta = efficiency from stopping early
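The two views above can be written as cost formulas. The symbols R (cost of one read-and-encode pass over the data) and L (cost of one OLR training pass) are introduced here for illustration, as is the assumption that both efficiency factors lie between 0 and 1. The factor of 100 matches the earlier slides: 20 CrossFoldLearners at 5x the work of a basic learner each.

```latex
\begin{align*}
C_{\text{traditional}} &= 400\,(R + L) \\
C_{\text{revised}} &= \eta\,(R + 100\,\mu\,L),
  \qquad 0 < \mu \le 1,\quad 0 < \eta \le 1 \\
\frac{C_{\text{traditional}}}{C_{\text{revised}}}
  &= \frac{400\,(R + L)}{\eta\,(R + 100\,\mu\,L)}
  \;\ge\; \frac{200}{\eta}
  \qquad \text{whenever } 100\,\mu\,L \le R
\end{align*}
```

The last line follows because the numerator is at least 400R and, under the stated condition, the denominator is at most 2ηR. The condition corresponds to the earlier claim that even the 100x learner is still faster than disk plus encoding; under that reading, the revised approach is dominated by a single read of the data.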
Deployment
• Training
  • ModelSerializer.writeBinary(..., model)
• Deployment
  • m = ModelSerializer.readBinary(...)
  • r = m.classifyScalar(featureVector)
The Upshot
• One machine can go fast
  • SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
  • simple sample server farm
Mahout classifier tour