SlideShare a Scribd company logo
Apache Mahout
Thursday, November 4, 2010
Apache Mahout
Now with extra whitening and classification powers!
Thursday, November 4, 2010
• Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
Thursday, November 4, 2010
Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka
Thursday, November 4, 2010
General Areas
• Recommendations
• lots of support, lots of flexibility,
production ready
• Unsupervised learning (clustering)
• lots of options, lots of flexibility,
production ready (ish)
Thursday, November 4, 2010
General Areas
• Supervised learning (classification)
• multiple architectures, fair number of
options, somewhat inter-operable
• production ready (for the right definition
of production and ready)
• Large scale SVD
• larger scale coming, beware sharp edges
Thursday, November 4, 2010
Scalable?
• Scalable means
• Time is proportional to problem size by
resource size
• Does not imply Hadoop or parallel
THE AUTHOR
t ∝
|P|
|R|
Thursday, November 4, 2010
Wall
Clock
Time
# of Training Examples
Scalable Algorithm
(Mahout wins!)
Traditional
Datamining
Works here
Scalable Solutions Required
Non-scalable Algorithm
Thursday, November 4, 2010
Scalable means ...
• One unit of work requires about a unit of
time
• Not like the company store (bit.ly/22XVa4)
t ∝
|P|
|R|
|P| = O(1) =⇒ t = O(1)
Thursday, November 4, 2010
Wall
Clock
Time
# of Training Examples
Parallel Algorithm
Sequential
Algorithm
Preferred
Parallel Algorithm Preferred
Sequential Algorithm
Thursday, November 4, 2010
Toy Example
Thursday, November 4, 2010
Training Data Sample
yes
no 0.92 0.01 circle
0.30 0.41 square
Filled?
x coordinate y coordinate
shape
predictor
variables
target
variable
Thursday, November 4, 2010
What matters most?
!
!
!
!
!
!
!
!
!
!
Thursday, November 4, 2010
SGD Classification
• Supervised learning of logistic regression
• Sequential gradient descent, not parallel
• Highly optimized for high dimensional
sparse data, possibly with interactions
• Scalable, real dang fast to train
Thursday, November 4, 2010
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
Thursday, November 4, 2010
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
Sequential
but fast
Thursday, November 4, 2010
Supervised Learning
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
T x1 ... xn
Model
Model
T
T
T
T
T
Learning
Algorithm
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
? x1 ... xn
Sequential
but fast
Stateless,
parallel
Thursday, November 4, 2010
Small example
• On 20 newsgroups
• converges in < 10,000 training examples
(less than one pass through the data)
• accuracy comparable to SVM, Naive
Bayes, Complementary Naive Bayes
• learning rate, regularization set
automagically on held-out data
Thursday, November 4, 2010
System Structure
EvolutionaryProcess ep
void train(target, features)
AdaptiveLogisticRegression
20
1
OnlineLogisticRegression folds
void train(target, tracking, features)
double auc()
CrossFoldLearner
5
1
Matrix beta
void train(target, features)
double classifyScalar(features)
OnlineLogisticRegression
Thursday, November 4, 2010
Training API
public interface OnlineLearner {
void train(int actual, Vector instance);
void train(long trackingKey, int actual, Vector instance);
void train(long trackingKey, String groupKey, int actual, Vector instance);
void close();
}
Thursday, November 4, 2010
Classification API
public class AdaptiveLogisticRegression implements OnlineLearner {
public AdaptiveLogisticRegression(int numCategories, int numFeatures,
PriorFunction prior);
public void train(int actual, Vector instance);
public void train(long trackingKey, int actual, Vector instance);
public void train(long trackingKey, String groupKey, int actual,
Vector instance);
public void close();
public double auc();
public State<Wrapper> getBest();
}
CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);
Thursday, November 4, 2010
Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
• String allows simple parsing
• byte[] and double allows speed
• Abstract interactions supported
Thursday, November 4, 2010
Speed!
• Parsing and encoding dominate single
learner
• Moderate optimization allows 1 million
training examples with 200 features to be
encoded in 14 seconds in a single core
• 20 million mixed text, categorical features
with many interactions learned in ~ 1 hour
Thursday, November 4, 2010
More Speed!
• Evolutionary optimization of learning
parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time
on single node with SGD than on Hadoop
with Complementary Naive Bayes
Thursday, November 4, 2010
Summary
• Mahout provides early production quality
scalable data-mining
• New classification systems allow industrial
scale classification
Thursday, November 4, 2010
Contact Info
Ted Dunning
tdunning@maprtech.com
Thursday, November 4, 2010
Contact Info
Ted Dunning
tdunning@maprtech.com
or tdunning@apache.com
Thursday, November 4, 2010

More Related Content

Similar to Sdforum 11-04-2010

2010.10.30 steven sustaining tdd agile tour shenzhen
2010.10.30 steven sustaining tdd   agile tour shenzhen2010.10.30 steven sustaining tdd   agile tour shenzhen
2010.10.30 steven sustaining tdd agile tour shenzhen
Odd-e
 
Building Brilliant APIs
Building Brilliant APIsBuilding Brilliant APIs
Building Brilliant APIs
bencollier
 
Node js techtalksto
Node js techtalkstoNode js techtalksto
Node js techtalksto
Jason Diller
 
Crowd-sourced Automated Firefox UI Testing
Crowd-sourced Automated Firefox UI TestingCrowd-sourced Automated Firefox UI Testing
Crowd-sourced Automated Firefox UI Testing
Henrik Skupin
 
Using+javascript+to+build+native+i os+applications
Using+javascript+to+build+native+i os+applicationsUsing+javascript+to+build+native+i os+applications
Using+javascript+to+build+native+i os+applications
Muhammad Ikram Ul Haq
 
Sustainable TDD
Sustainable TDDSustainable TDD
Sustainable TDD
Steven Mak
 
Apache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open sourceApache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open source
Luca Bonesini
 
BRAINREPUBLIC - Powered by no-SQL
BRAINREPUBLIC - Powered by no-SQLBRAINREPUBLIC - Powered by no-SQL
BRAINREPUBLIC - Powered by no-SQL
Andreas Jung
 
Future-proofing Your JavaScript Apps (Compact edition)
Future-proofing Your JavaScript Apps (Compact edition)Future-proofing Your JavaScript Apps (Compact edition)
Future-proofing Your JavaScript Apps (Compact edition)
Addy Osmani
 
ExpressionEngine FUGN presentation
ExpressionEngine FUGN presentationExpressionEngine FUGN presentation
ExpressionEngine FUGN presentation
Jens Brynildsen
 
Scala Introduction
Scala IntroductionScala Introduction
Scala Introduction
Adrian Spender
 
#3 Information extraction from news to conversations
#3 Information extraction from news to conversations#3 Information extraction from news to conversations
#3 Information extraction from news to conversations
Berlin Language Technology
 
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Jazkarta, Inc.
 
Best Practices - Mobile Developer Summit
Best Practices - Mobile Developer SummitBest Practices - Mobile Developer Summit
Best Practices - Mobile Developer Summit
wolframkriesing
 
2011 july-nyc-gtug-go
2011 july-nyc-gtug-go2011 july-nyc-gtug-go
2011 july-nyc-gtug-go
ikailan
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
Erik Hatcher
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
Tommaso Teofili
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Flink Forward
 
PyTest - The Awesome Parts by Josh Grant
PyTest - The Awesome Parts by Josh GrantPyTest - The Awesome Parts by Josh Grant
PyTest - The Awesome Parts by Josh Grant
QA or the Highway
 
Building the NGDLE with Tsugi (次) and Koseu(코스)
Building the NGDLE with Tsugi (次) and Koseu(코스)Building the NGDLE with Tsugi (次) and Koseu(코스)
Building the NGDLE with Tsugi (次) and Koseu(코스)
Charles Severance
 

Similar to Sdforum 11-04-2010 (20)

2010.10.30 steven sustaining tdd agile tour shenzhen
2010.10.30 steven sustaining tdd   agile tour shenzhen2010.10.30 steven sustaining tdd   agile tour shenzhen
2010.10.30 steven sustaining tdd agile tour shenzhen
 
Building Brilliant APIs
Building Brilliant APIsBuilding Brilliant APIs
Building Brilliant APIs
 
Node js techtalksto
Node js techtalkstoNode js techtalksto
Node js techtalksto
 
Crowd-sourced Automated Firefox UI Testing
Crowd-sourced Automated Firefox UI TestingCrowd-sourced Automated Firefox UI Testing
Crowd-sourced Automated Firefox UI Testing
 
Using+javascript+to+build+native+i os+applications
Using+javascript+to+build+native+i os+applicationsUsing+javascript+to+build+native+i os+applications
Using+javascript+to+build+native+i os+applications
 
Sustainable TDD
Sustainable TDDSustainable TDD
Sustainable TDD
 
Apache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open sourceApache Solr, il motore di ricerca enterprise open source
Apache Solr, il motore di ricerca enterprise open source
 
BRAINREPUBLIC - Powered by no-SQL
BRAINREPUBLIC - Powered by no-SQLBRAINREPUBLIC - Powered by no-SQL
BRAINREPUBLIC - Powered by no-SQL
 
Future-proofing Your JavaScript Apps (Compact edition)
Future-proofing Your JavaScript Apps (Compact edition)Future-proofing Your JavaScript Apps (Compact edition)
Future-proofing Your JavaScript Apps (Compact edition)
 
ExpressionEngine FUGN presentation
ExpressionEngine FUGN presentationExpressionEngine FUGN presentation
ExpressionEngine FUGN presentation
 
Scala Introduction
Scala IntroductionScala Introduction
Scala Introduction
 
#3 Information extraction from news to conversations
#3 Information extraction from news to conversations#3 Information extraction from news to conversations
#3 Information extraction from news to conversations
 
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
Scalable Plone hosting with Amazon EC2 for Rice University's Rhaptos open lea...
 
Best Practices - Mobile Developer Summit
Best Practices - Mobile Developer SummitBest Practices - Mobile Developer Summit
Best Practices - Mobile Developer Summit
 
2011 july-nyc-gtug-go
2011 july-nyc-gtug-go2011 july-nyc-gtug-go
2011 july-nyc-gtug-go
 
Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)Lucene's Latest (for Libraries)
Lucene's Latest (for Libraries)
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
PyTest - The Awesome Parts by Josh Grant
PyTest - The Awesome Parts by Josh GrantPyTest - The Awesome Parts by Josh Grant
PyTest - The Awesome Parts by Josh Grant
 
Building the NGDLE with Tsugi (次) and Koseu(코스)
Building the NGDLE with Tsugi (次) and Koseu(코스)Building the NGDLE with Tsugi (次) and Koseu(코스)
Building the NGDLE with Tsugi (次) and Koseu(코스)
 

More from Ted Dunning

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
Ted Dunning
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
Ted Dunning
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
Ted Dunning
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
Ted Dunning
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
Ted Dunning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
Ted Dunning
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
Ted Dunning
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
Ted Dunning
 
T digest-update
T digest-updateT digest-update
T digest-update
Ted Dunning
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
Ted Dunning
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
Ted Dunning
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
Ted Dunning
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
Ted Dunning
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Ted Dunning
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
Ted Dunning
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
Ted Dunning
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
Ted Dunning
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
Ted Dunning
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
Ted Dunning
 

More from Ted Dunning (20)

Dunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptxDunning - SIGMOD - Data Economy.pptx
Dunning - SIGMOD - Data Economy.pptx
 
How to Get Going with Kubernetes
How to Get Going with KubernetesHow to Get Going with Kubernetes
How to Get Going with Kubernetes
 
Progress for big data in Kubernetes
Progress for big data in KubernetesProgress for big data in Kubernetes
Progress for big data in Kubernetes
 
Anomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look forAnomaly Detection: How to find what you didn’t know to look for
Anomaly Detection: How to find what you didn’t know to look for
 
Streaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine LearningStreaming Architecture including Rendezvous for Machine Learning
Streaming Architecture including Rendezvous for Machine Learning
 
Machine Learning Logistics
Machine Learning LogisticsMachine Learning Logistics
Machine Learning Logistics
 
Tensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworksTensor Abuse - how to reuse machine learning frameworks
Tensor Abuse - how to reuse machine learning frameworks
 
Machine Learning logistics
Machine Learning logisticsMachine Learning logistics
Machine Learning logistics
 
T digest-update
T digest-updateT digest-update
T digest-update
 
Finding Changes in Real Data
Finding Changes in Real DataFinding Changes in Real Data
Finding Changes in Real Data
 
Where is Data Going? - RMDC Keynote
Where is Data Going? - RMDC KeynoteWhere is Data Going? - RMDC Keynote
Where is Data Going? - RMDC Keynote
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015Cheap learning-dunning-9-18-2015
Cheap learning-dunning-9-18-2015
 
Sharing Sensitive Data Securely
Sharing Sensitive Data SecurelySharing Sensitive Data Securely
Sharing Sensitive Data Securely
 
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-timeReal-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time
 
How the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside DownHow the Internet of Things is Turning the Internet Upside Down
How the Internet of Things is Turning the Internet Upside Down
 
Apache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on HadoopApache Kylin - OLAP Cubes for SQL on Hadoop
Apache Kylin - OLAP Cubes for SQL on Hadoop
 
Dunning time-series-2015
Dunning time-series-2015Dunning time-series-2015
Dunning time-series-2015
 
Doing-the-impossible
Doing-the-impossibleDoing-the-impossible
Doing-the-impossible
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 

Recently uploaded

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 

Recently uploaded (20)

“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 

Sdforum 11-04-2010

  • 2. Apache Mahout Now with extra whitening and classification powers! Thursday, November 4, 2010
  • 3. • Mahout intro • Scalability in general • Supervised learning recap • The new SGD classifiers Thursday, November 4, 2010
  • 4. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  • 5. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  • 6. Mahout? • Hebrew for “essence” • Hindi for a guy who drives an elephant Thursday, November 4, 2010
  • 7. Mahout! • Scalable data-mining and recommendations • Not all data-mining • Not the fanciest data-mining • Just some of the scalable stuff • Not a competitor for R or Weka Thursday, November 4, 2010
  • 8. General Areas • Recommendations • lots of support, lots of flexibility, production ready • Unsupervised learning (clustering) • lots of options, lots of flexibility, production ready (ish) Thursday, November 4, 2010
  • 9. General Areas • Supervised learning (classification) • multiple architectures, fair number of options, somewhat inter-operable • production ready (for the right definition of production and ready) • Large scale SVD • larger scale coming, beware sharp edges Thursday, November 4, 2010
  • 10. Scalable? • Scalable means • Time is proportional to problem size by resource size • Does not imply Hadoop or parallel THE AUTHOR t ∝ |P| |R| Thursday, November 4, 2010
  • 11. Wall Clock Time # of Training Examples Scalable Algorithm (Mahout wins!) Traditional Datamining Works here Scalable Solutions Required Non-scalable Algorithm Thursday, November 4, 2010
  • 12. Scalable means ... • One unit of work requires about a unit of time • Not like the company store (bit.ly/22XVa4) t ∝ |P| |R| |P| = O(1) =⇒ t = O(1) Thursday, November 4, 2010
  • 13. Wall Clock Time # of Training Examples Parallel Algorithm Sequential Algorithm Preferred Parallel Algorithm Preferred Sequential Algorithm Thursday, November 4, 2010
  • 15. Training Data Sample yes no 0.92 0.01 circle 0.30 0.41 square Filled? x coordinate y coordinate shape predictor variables target variable Thursday, November 4, 2010
  • 17. SGD Classification • Supervised learning of logistic regression • Sequential gradient descent, not parallel • Highly optimized for high dimensional sparse data, possibly with interactions • Scalable, real dang fast to train Thursday, November 4, 2010
  • 18. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Thursday, November 4, 2010
  • 19. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Thursday, November 4, 2010
  • 20. Supervised Learning T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn T x1 ... xn Model Model T T T T T Learning Algorithm ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn ? x1 ... xn Sequential but fast Stateless, parallel Thursday, November 4, 2010
  • 21. Small example • On 20 newsgroups • converges in < 10,000 training examples (less than one pass through the data) • accuracy comparable to SVM, Naive Bayes, Complementary Naive Bayes • learning rate, regularization set automagically on held-out data Thursday, November 4, 2010
  • 22. System Structure EvolutionaryProcess ep void train(target, features) AdaptiveLogisticRegression 20 1 OnlineLogisticRegression folds void train(target, tracking, features) double auc() CrossFoldLearner 5 1 Matrix beta void train(target, features) double classifyScalar(features) OnlineLogisticRegression Thursday, November 4, 2010
  • 23. Training API public interface OnlineLearner { void train(int actual, Vector instance); void train(long trackingKey, int actual, Vector instance); void train(long trackingKey, String groupKey, int actual, Vector instance); void close(); } Thursday, November 4, 2010
  • 24. Classification API public class AdaptiveLogisticRegression implements OnlineLearner { public AdaptiveLogisticRegression(int numCategories, int numFeatures, PriorFunction prior); public void train(int actual, Vector instance); public void train(long trackingKey, int actual, Vector instance); public void train(long trackingKey, String groupKey, int actual, Vector instance); public void close(); public double auc(); public State<Wrapper> getBest(); } CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner(); double averageCorrect = model.percentCorrect(); double averageLL = model.logLikelihood(); double p = model.classifyScalar(features); Thursday, November 4, 2010
  • 25. Speed? • Encoding API for hashed feature vectors • String, byte[] or double interfaces • String allows simple parsing • byte[] and double allows speed • Abstract interactions supported Thursday, November 4, 2010
  • 26. Speed! • Parsing and encoding dominate single learner • Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds in a single core • 20 million mixed text, categorical features with many interactions learned in ~ 1 hour Thursday, November 4, 2010
  • 27. More Speed! • Evolutionary optimization of learning parameters allows simple operation • 20x threading allows high machine use • 20 newsgroup test completes in less time on single node with SGD than on Hadoop with Complementary Naive Bayes Thursday, November 4, 2010
  • 28. Summary • Mahout provides early production quality scalable data-mining • New classification systems allow industrial scale classification Thursday, November 4, 2010
  • 30. Contact Info Ted Dunning tdunning@maprtech.com or tdunning@apache.com Thursday, November 4, 2010