{ “Mahout” : “Scalable Machine Learning Library” }
{ “Presented By” : “Varad Meru”,
“Company” : “Orzota, Inc”,
“Twitter” : “@vrdmr” }
1
{ “Mahout” : “Introduction” }
2
{ “Introduction” : “History and Etymology” }
• A scalable machine learning library built on Hadoop, written in Java.
• Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”.
• Started as a Lucene sub-project. Became an Apache TLP (top-level project) in April 2010.
• Latest version: 0.6 (released on 6 February 2012).
• “Mahout” means keeper/driver of elephants. The name fits because many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.
• Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.
• The Taste recommendation framework was added later by Sean Owen.
3
[Figure 1.1: Apache Mahout and its related projects within the Apache Foundation. The accompanying book excerpt notes that Hadoop’s mascot is an elephant, which explains the project name, and that recommender engines, clustering, and classification were the most prominent and mature themes in Mahout at the time of writing.]
{ “Mahout” : “Machine Learning” }
4
{ “Machine Learning” : “Introduction” }
“Machine Learning is Programming Computers to optimize a
Performance Criterion using Example Data or Past Experience”
• Branch of Artificial Intelligence
• Design and Development of Algorithms
• Computers evolve behavior based on empirical data.
• Supervised Learning
• Uses labeled training data to create a classifier that can predict the output for unseen inputs.
• Unsupervised Learning
• Uses unlabeled training data to create a function that describes structure in the data, such as groups of similar items.
• Semi-Supervised Learning
5
{ “Machine Learning” : “Applications” }
• Recommend Friends, Dates, Products to end-user.
• Classify content into pre-defined groups.
• Find Similar content based on Object Properties.
• Identify key topics in large Collections of Text.
• Detect Anomalies within given data.
• Rank search results with user feedback learning.
• Classify DNA sequences.
• Sentiment analysis / opinion mining.
• Computer vision.
• Natural language processing.
• Bioinformatics.
• Speech and handwriting recognition.
• Others ...
6
{“Machine Learning”: “Challenges”}
• Big Data
• Yesterday’s processing on next-generation data.
• Time for Processing
• Large and Cheap Storage
7
Size | Classification | Tools
Lines | Sample Data: Analysis and Visualization | Whiteboard, bash, ...
KBs - low MBs | Prototype Data: Analysis and Visualization | Matlab, Octave, R, Processing, bash, ...
MBs - low GBs | Online Data: Storage | MySQL (DBs), ...
MBs - low GBs | Online Data: Analysis | NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs | Online Data: Visualization | Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs | Big Data: Storage | HDFS, HBase, Cassandra, ...
GBs - TBs - PBs | Big Data: Analysis | Hive, Mahout, Hama, Giraph, ...
{ “Machine Learning” : “Mahout for Big Data”}
• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.
• Some algorithms won’t scale to massive machine clusters.
• Others fit logically on a MapReduce framework like Apache Hadoop.
• Most Mahout implementations are MapReduce-enabled.
• Focus: “Scalability with Hadoop’s MapReduce Processing Framework on BigData on Hadoop’s HDFS Storage”.
• The only machine learning library built on a MapReduce framework. Other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library.
• The only scalable machine learning framework with MapReduce and Hadoop support (source: www.mloss.org, Machine Learning Open-Source Software).
8
{ “Mahout” : “Internals” }
9
10
{ “Internals” : “Architecture” }
[Architecture diagram; components shown:]
• Applications and Examples
• Recommenders, Clustering, Classification, Frequent Pattern Mining, Evolutionary Algorithms, Regression, Dimension Reduction
• Math: Vectors/Matrices/SVD
• Utilities: Lucene/Vectorizer
• Collections (primitives)
• Apache Hadoop
• Scalable.
• Dual-mode (sequential and MapReduce-enabled).
• Support for easy extension.
• Supports a large number of data sources, including the newer NoSQL variants.
• It is a Java library: a framework of tools intended to be used and adapted by developers.
• Advanced implementations of Java’s Collections Framework for better performance.
11
{ “Internals” : “Features” }
{ “Mahout” : “Algorithms” }
12
• Help users find items they might like based on historical behavior and preferences.
• Top-level packages define the Mahout interfaces to these key abstractions:
• DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel
• UserSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• ItemSimilarity – Pearson Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity
• UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood
• Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender
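To see how these abstractions fit together, here is a minimal plain-Python sketch of a user-based recommender. It does not use Mahout; the toy ratings and the function names (pearson, nearest_n, recommend) are illustrative assumptions that only mirror the roles of DataModel, UserSimilarity, UserNeighborhood, and Recommender.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating}. Plays the role of a DataModel.
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 2.0, "c": 5.0, "d": 4.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 2.0},
}

def pearson(u, v):
    """UserSimilarity role: Pearson correlation over co-rated items."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def nearest_n(user, n):
    """UserNeighborhood role: the n most similar other users."""
    others = [v for v in ratings if v != user]
    return sorted(others, key=lambda v: pearson(user, v), reverse=True)[:n]

def recommend(user, n_neighbors=2, how_many=1):
    """Recommender role: similarity-weighted average rating of unseen items."""
    scores, weights = {}, {}
    for v in nearest_n(user, n_neighbors):
        w = pearson(user, v)
        if w <= 0:
            continue  # ignore neighbors with no positive correlation
        for item, r in ratings[v].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + w * r
                weights[item] = weights.get(item, 0.0) + w
    ranked = sorted(((s / weights[i], i) for i, s in scores.items()), reverse=True)
    return [(item, round(score, 2)) for score, item in ranked[:how_many]]

print(recommend("alice"))
```

In this toy data only "bob" correlates positively with "alice", so the one item Alice has not rated ("d") is scored from Bob's rating alone.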
13
{ “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
14
{ “Algorithms” : “Recommender Systems”, “id” : “Example”}
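Binary values for recommendation (1 = bought, 0 = did not buy):

Customer | Product 1 | Product 2 | Product 3 | Product 4
Alice | 0 | 1 | 1 | 1
Bob | 1 | 0 | 1 | 1
John | 0 | 1 | 0 | 0
Jane | 1 | 0 | 1 | 1
Bill | 1 | 1 | 1 | 1
Steve | 1 | 0 | 1 | 1
Larry | 1 | 0 | 0 | 0
Don | 1 | 1 | 1 | 0
Jack | 1 | 1 | 0 | 1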
15
{ “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}
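Tanimoto coefficient for each pair of products:

 | Product 1 | Product 2 | Product 3 | Product 4
Product 1 | 1 | 1/3 = 0.33 | 5/8 = 0.625 | 5/8 = 0.625
Product 2 | 1/3 = 0.33 | 1 | 3/8 = 0.375 | 3/8 = 0.375
Product 3 | 5/8 = 0.625 | 3/8 = 0.375 | 1 | 5/7 = 0.714
Product 4 | 5/8 = 0.625 | 3/8 = 0.375 | 5/7 = 0.714 | 1

Tanimoto Coefficient: T(A, B) = Nc / (NA + NB − Nc)
NA – Number of customers who bought Product A
NB – Number of customers who bought Product B
Nc – Number of customers who bought both Product A and Product B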
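The coefficients can be reproduced with a few lines of plain Python. This is a sketch, not Mahout code: the purchases matrix is the nine-customer, four-product binary example from the previous slide, and the function names and 0-based product indices are assumptions made for illustration.

```python
# Rows are customers, columns are products (1 = bought), as in the example slide.
purchases = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]

def buyers(product):
    """Set of customer row indices that bought the given product column."""
    return {i for i, row in enumerate(purchases) if row[product]}

def tanimoto(a, b):
    """Nc / (NA + NB - Nc): shared buyers divided by all distinct buyers."""
    buyers_a, buyers_b = buyers(a), buyers(b)
    n_c = len(buyers_a & buyers_b)
    union = len(buyers_a) + len(buyers_b) - n_c
    return n_c / union if union else 0.0

for a in range(4):
    print([round(tanimoto(a, b), 3) for b in range(4)])
```

Each printed row corresponds to one product, so the output can be compared entry by entry with the pairwise values on the slide.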
16
{ “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}
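Cosine coefficient for each pair of products:

 | Product 1 | Product 2 | Product 3 | Product 4
Product 1 | 1 | 0.507 | 0.772 | 0.772
Product 2 | 0.507 | 1 | 0.548 | 0.548
Product 3 | 0.772 | 0.548 | 1 | 0.833
Product 4 | 0.772 | 0.548 | 0.833 | 1

Cosine Coefficient: cos(A, B) = Nc / sqrt(NA × NB)
NA – Number of customers who bought Product A
NB – Number of customers who bought Product B
Nc – Number of customers who bought both Product A and Product B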
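For binary purchase data the cosine coefficient reduces to Nc / sqrt(NA × NB), because the dot product of two 0/1 columns counts the shared buyers and each squared norm counts a product's buyers. The sketch below computes it in the general vector form; it is plain Python rather than Mahout, and the matrix and function names are illustrative assumptions based on the example slide.

```python
from math import sqrt

# Rows are customers, columns are products (1 = bought), as in the example slide.
purchases = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]

def column(product):
    """Purchase vector of one product across all customers."""
    return [row[product] for row in purchases]

def cosine(a, b):
    """Dot product of the two purchase vectors divided by the product of their norms."""
    va, vb = column(a), column(b)
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0

for a in range(4):
    print([round(cosine(a, b), 3) for b in range(4)])
```

Unlike the Tanimoto coefficient, the cosine form also works unchanged for non-binary values such as ratings or purchase counts.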
• Assigning data to discrete categories.
• Train a model on labeled data.
• Run the model on new, unlabeled data.
• Classifier: An algorithm that implements classification, especially in a concrete implementation.
• Classification Algorithms
• Maximum entropy classifier
• Naïve Bayes classifier
• Decision trees, decision lists
• Support vector machines
• Kernel estimation and K-nearest-neighbor algorithms
• Perceptrons
• Neural networks (multi-layer perceptrons)
17
{ “Algorithms” : “Classification” , “id” : “Introduction”}
[Figure: a new message (“?”) to be classified as Spam or Not spam.]
18
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train: Not Spam
President Obama’s Nobel Prize Speech
19
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Train: Spam
Spam Email Content
20
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}
Run
“Order a trial Adobe chicken daily
EAB-List new summer savings, welcome!”
21
{ “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}
• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.
• Training:
• Read the Features
• Calculate per-Document Statistics
• Normalize across Categories
• Calculate normalizing factor of each label
• Testing
• Classification (fifth job, explicitly invoked)
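The train-then-classify flow can be sketched on a single machine with a small multinomial Naïve Bayes classifier. This is not Mahout's implementation and does not reproduce its four Hadoop jobs; the training sentences are made-up stand-ins for the spam and not-spam examples, and add-one (Laplace) smoothing is an assumed choice.

```python
from collections import Counter, defaultdict
from math import log

def train(docs):
    """docs: list of (label, text). Collects per-label word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, text in docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def classify(model, text):
    """Return the label with the highest log-posterior for the given text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus smoothed log likelihood of each known token.
        score = log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are skipped
                score += log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, for illustration only.
training = [
    ("spam", "order a trial now and save"),
    ("spam", "new summer savings order today"),
    ("not spam", "the president gave a speech on peace"),
    ("not spam", "the prize committee praised the speech"),
]
model = train(training)
print(classify(model, "order summer savings"))
print(classify(model, "the speech on peace"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters once documents grow beyond a few words.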
[Figure 13.2: How a classification system works. At the heart of the system is a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable. The upper path of the figure represents training the classification model; the lower path provides new examples to which the model assigns categories (the target variables).]
• Grouping unstructured data without any training data.
• Self-learning from experience.
• Small intra-cluster distance, seeking local and global minima.
• Large inter-cluster distance.
• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
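As a rough illustration of how canopy clustering can produce initial centroids, here is a single-machine sketch in plain Python. It follows the usual two-threshold description of the algorithm (a loose distance T1 and a tight distance T2, with T1 > T2) and is not Mahout's MapReduce implementation; the points, thresholds, and function names are assumptions.

```python
from math import dist

def canopies(points, t1, t2):
    """Group points into canopies using a loose (t1) and a tight (t2) threshold."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        # Every point within t1 of the center joins this canopy.
        members = [center] + [p for p in remaining if dist(center, p) < t1]
        result.append(members)
        # Points within t2 are close enough that they cannot start a new canopy.
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return result

def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.5, 5.0)]
initial_centroids = [centroid(c) for c in canopies(points, t1=3.0, t2=1.0)]
print(initial_centroids)
```

The canopy centroids can then serve as the starting centroids for K-Means, which avoids having to guess the number of clusters up front.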
22
{ “Algorithms” : “Clustering” , “id” : “Introduction”}
23
{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}
[Figures: K-Means clustering example, shown over ten slides. The last figure carries the labels “Cats” and “Dogs”.]
33
{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
[Diagram: input chunks C0 to C3 feed mappers M0 to M3 (map phase), which write intermediate outputs IO0 to IO3. The data is shuffled to reducers R0 and R1 (reduce phase), which write final outputs FO0 and FO1.]
• Assume the number of clusters is far smaller than the number of points, i.e. |Clusters| << |Points|, so the full set of centroids is small enough to ship to every mapper.
• Hadoop’s DistributedCache is used to give each Mapper access to all the current cluster centroids.
34
{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}
[Diagram: mappers M0 to M3 emit <clusterID, observation> pairs to reducers R0 and R1.]
Important arguments:
• --maxIter
• --convergenceDelta
• --method
{ “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}
Map phase: assign cluster IDs
Reduce phase: reset centroids
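The two phases can be sketched in plain Python as a pair of functions that are applied repeatedly. This is a single-process illustration, not Mahout's Hadoop job: the sample points and starting centroids are made up, and the max_iter and convergence_delta parameters are only assumed to play the roles suggested by the --maxIter and --convergenceDelta arguments.

```python
from collections import defaultdict
from math import dist

def map_phase(points, centroids):
    """Emit (clusterID, observation) pairs: each point goes to its nearest centroid."""
    for p in points:
        cluster_id = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield cluster_id, p

def reduce_phase(pairs, centroids):
    """Group observations by clusterID and reset each centroid to their mean."""
    groups = defaultdict(list)
    for cluster_id, p in pairs:
        groups[cluster_id].append(p)
    updated = list(centroids)  # a centroid with no points keeps its old position
    for cluster_id, members in groups.items():
        updated[cluster_id] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated

def kmeans(points, centroids, max_iter=10, convergence_delta=1e-3):
    """Repeat map and reduce until no centroid moves more than convergence_delta."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        moved = max(dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if moved <= convergence_delta:
            break
    return centroids

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
final_centroids = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```

In a real MapReduce run each iteration is a separate job, and the centroids from one job's reducers become the input that every mapper reads in the next.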
36
{ “Algorithms” : “Other Algorithms” }
• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests
• Clustering
‣ Latent Dirichlet Allocation
- Topic models
‣ Fuzzy K-Means
- Points are assigned multiple clusters
‣ Canopy clustering
- Fast approximations of clusters
‣ Spectral clustering
- Treat points as a graph
• Evolutionary Algorithms - Integration with Watchmaker for Genetic Programming Fitness Functions
• Dimensionality Reduction
• Regression
37
{ “Algorithms” : “Future” }
• Classification
‣ Decision Trees such as J48 and ID3
• Clustering
‣ DBSCAN and COBWEB clustering techniques
• Evolutionary Algorithms
‣ Classical Genetic Algorithms
• Association Rules
‣ Apriori (Mahout already has an alternative frequent-itemset algorithm implementation).
{ “Mahout” : “Summary” }
38
{ “Summary”: “Apache Mahout” }
• Scalable Library
• Three Primary Areas of Focus
• Other Algorithms
• All in your friendly neighborhood MapReduce
{ “Mahout” : “Demo” }
43
{ “Mahout” : “Questions” }
44
{ “Mahout” : “References” }
45
• Books
• “Mahout in Action”, Owen et al., Manning Pub.
• “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub.
• “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub.
• Videos
• CS-229, Machine Learning at Stanford University - Prof. Andrew Ng.
• Collaborative filtering at scale - Sean Owen
• Distributed Item-based Collaborative Filtering - Sebastian Schelter
• EMail Classification with Mahout - Grant Ingersoll @ Lucid Imagination
46
{ “References” : “Mahout Books, Tutorials, Links”, “id” : 1}
• WWW
• http://mahout.apache.org - Mahout@Apache
• http://hadoop.apache.org - Hadoop@Apache
• dev@mahout.apache.org - Developer mailing list
• user@mahout.apache.org - User mailing list
• http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
47
{ “References” : “Mahout Books, Tutorials, Links”, “id” : 2}
{ “Mahout” : “The End” }
48
{“Thank You” : “Have a Nice and Green Day” }

Introduction to Mahout and Machine Learning

  • 1. { “Mahout” : “Scalable Machine Learning Library” } { “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”, “Twitter” : “@vrdmr” }
  • 2. { “Mahout” : “Introduction” }
  • 3. { “Introduction” : “History and Etymology” } • A Scalable Machine Learning Library built on Hadoop, written in Java. • Driven by Ng et al.’s paper “MapReduce for Machine Learning on Multicore”. • Started as a Lucene sub-project; became an Apache TLP (top-level project) in April 2010. • Latest version: 0.6 (released on 6th Feb 2012). • “Mahout” means keeper/driver of elephants; the name fits because many of the algorithms are implemented in MapReduce on Hadoop. • Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin. • The Taste Recommendation Framework was added later by Sean Owen. Figure 1.1 Apache Mahout and its related projects within the Apache Foundation. Much of Mahout’s work has been to not only implement these algorithms conventionally, and scalable way, but also to convert some of these algorithms to work at scale on to Hadoop’s mascot is an elephant, which at last explains the project name! Mahout incubates a number of techniques and algorithms, many still in developm experimental phase. At this early stage in the project's life, three core themes are evident filtering / recommender engines, clustering, and classification. This is by no means all tha Mahout, but are the most prominent and mature themes at the time of writing. These the scope of this book. Chances are that if you are reading this, you are already aware of the interesting pot three families of techniques. But just in case, read on.
  • 4. { “Mahout” : “Machine Learning” }
  • 5. { “Machine Learning” : “Introduction” } “Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience” • Branch of Artificial Intelligence • Design and Development of Algorithms • Computers Evolve Behavior based on Empirical Data. • Supervised Learning • Using labeled training data to create a classifier that can predict output for unseen inputs. • Unsupervised Learning • Using unlabeled training data to learn a function that describes the hidden structure of the data. • Semi-Supervised Learning
  • 6. { “Machine Learning” : “Applications” } • Recommend friends, dates, and products to the end user. • Classify content into pre-defined groups. • Find similar content based on object properties. • Identify key topics in large collections of text. • Detect anomalies within given data. • Rank search results by learning from user feedback. • Classify DNA sequences. • Sentiment Analysis / Opinion Mining. • Computer Vision. • Natural Language Processing. • Bioinformatics. • Speech and Handwriting Recognition. • Others ...
  • 7. { “Machine Learning” : “Challenges” } • BigData • Yesterday’s processing on next-generation data. • Time for Processing • Large and Cheap Storage
    Size | Classification | Tools
    Lines | Sample Data: Analysis and Visualization | Whiteboard, bash,...
    KBs - low MBs | Prototype Data: Analysis and Visualization | Matlab, Octave, R, Processing, bash,...
    MBs - low GBs | Online Data: Storage | MySQL (DBs),...
    MBs - low GBs | Online Data: Analysis | NumPy, SciPy, Weka, BLAS/LAPACK,...
    MBs - low GBs | Online Data: Visualization | Flare, AmCharts, Raphael, Protovis,...
    GBs - TBs - PBs | Big Data: Storage | HDFS, HBase, Cassandra,...
    GBs - TBs - PBs | Big Data: Analysis | Hive, Mahout, Hama, Giraph,...
  • 8. { “Machine Learning” : “Mahout for Big Data” } • Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”. • Some algorithms won’t scale to massive machine clusters. • Others fit logically on a MapReduce framework like Apache Hadoop. • Most Mahout implementations are MapReduce enabled. • Focus: “Scalability with Hadoop’s MapReduce Processing Framework on BigData on Hadoop’s HDFS Storage”. • The only Machine Learning Library built on a MapReduce framework. Other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or don’t have any ML library. • The only Scalable Machine Learning Framework with MapReduce and Hadoop Support. (www.mloss.org: Machine Learning Open-Source Software)
  • 9. { “Mahout” : “Internals” }
  • 10. { “Internals” : “Architecture” } Diagram components: Applications; Examples; Recommenders; Clustering; Classification; Freq. Pattern Mining; Evolutionary Algorithms; Regression; Dimension Reduction; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop.
  • 11. { “Internals” : “Features” } • Scalable. • Dual-mode (sequential and MapReduce enabled). • Support for easy extension. • Supports a large number of data sources, including the newer NoSQL variants. • It is a Java library: a framework of tools intended to be used and adapted by developers. • Advanced implementations of Java’s Collections Framework for better performance.
  • 12. { “Mahout” : “Algorithms” }
  • 13. { “Algorithms” : “Recommender Systems”, “id” : “Introduction” } • Help users find items they might like based on historical behavior and preferences. • Top-level packages define the Mahout interfaces to these key abstractions: • DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel • UserSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity • ItemSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity • UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood. • Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender.
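How the four abstractions on this slide fit together can be sketched in a few lines. This is a plain Python illustration of a user-based recommender, not Mahout's Java API: the names `tanimoto`, `nearest_n_users`, `recommend`, and the toy purchase data are all hypothetical and chosen only to mirror DataModel, UserSimilarity, UserNeighborhood, and Recommender.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a DataModel: user -> set of item IDs the user bought.
DataModel = Dict[str, Set[str]]


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """UserSimilarity: shared items divided by all items either user has."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_n_users(model: DataModel, user: str, n: int) -> List[str]:
    """UserNeighborhood: the n users most similar to `user`."""
    others = [u for u in model if u != user]
    others.sort(key=lambda u: tanimoto(model[user], model[u]), reverse=True)
    return others[:n]


def recommend(model: DataModel, user: str, n_neighbors: int = 2,
              how_many: int = 3) -> List[str]:
    """Recommender: rank unseen items by the summed similarity of neighbors who own them."""
    scores: Dict[str, float] = {}
    for neighbor in nearest_n_users(model, user, n_neighbors):
        sim = tanimoto(model[user], model[neighbor])
        for item in model[neighbor] - model[user]:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:how_many]


# Toy data, invented for this sketch.
model: DataModel = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "desk"},
    "carol": {"book", "chair"},
    "dave": {"sofa"},
}

if __name__ == "__main__":
    print(nearest_n_users(model, "alice", 2))
    print(recommend(model, "alice"))
```

Swapping the similarity function or the neighborhood rule changes the recommender's behavior without touching the data model, which is the point of keeping the abstractions separate.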
  • 14. { “Algorithms” : “Recommender Systems”, “id” : “Example” } Binary Values Recommendation (one row per customer, one column per product; 1 = bought):
    Alice  0 1 1 1
    Bob    1 0 1 1
    John   0 1 0 0
    Jane   1 0 1 1
    Bill   1 1 1 1
    Steve  1 0 1 1
    Larry  1 0 0 0
    Don    1 1 1 0
    Jack   1 1 0 1
  • 15. { “Algorithms” : “Recommender Systems”, “Similarity” : “Tanimoto” } Tanimoto Coefficient = Nc / (NA + NB − Nc), where NA – number of customers who bought Product A; NB – number of customers who bought Product B; Nc – number of customers who bought both Product A and Product B. Pairwise values for the four products:
    1            1/3 ≈ 0.33   5/8 = 0.625  5/8 = 0.625
    1/3 ≈ 0.33   1            3/8 = 0.375  3/8 = 0.375
    5/8 = 0.625  3/8 = 0.375  1            5/7 ≈ 0.714
    5/8 = 0.625  3/8 = 0.375  5/7 ≈ 0.714  1
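The Tanimoto values on this slide can be reproduced from the binary purchase data on the previous slide. The sketch below assumes that data is laid out as nine customer rows by four product columns (an inference from the extracted numbers, not something the slide states), and the product labels "A" to "D" are placeholders because the slide's own product names were not extracted.

```python
from itertools import combinations

# Purchase data from the example slide, read as 9 customers x 4 products (assumed layout).
purchases = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
products = ["A", "B", "C", "D"]  # placeholder labels


def column(matrix, j):
    """Return column j of the matrix: who bought product j."""
    return [row[j] for row in matrix]


def tanimoto(x, y):
    """Nc / (NA + NB - Nc) for two binary purchase vectors."""
    n_a = sum(x)
    n_b = sum(y)
    n_c = sum(1 for a, b in zip(x, y) if a and b)
    denom = n_a + n_b - n_c
    return n_c / denom if denom else 0.0


def tanimoto_table(matrix, labels):
    """Pairwise Tanimoto coefficient for every pair of product columns."""
    return {
        (labels[i], labels[j]): tanimoto(column(matrix, i), column(matrix, j))
        for i, j in combinations(range(len(labels)), 2)
    }


if __name__ == "__main__":
    for pair, value in tanimoto_table(purchases, products).items():
        print(pair, round(value, 3))
```

Because the coefficient only counts co-occurrence, it suits binary "bought / did not buy" data where no rating values are available.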
  • 16. { “Algorithms” : “Recommender Systems”, “Similarity” : “Cosine” } Cosine Coefficient = Nc / √(NA · NB), where NA – number of customers who bought Product A; NB – number of customers who bought Product B; Nc – number of customers who bought both Product A and Product B. Pairwise values for the four products:
    1      0.507  0.772  0.772
    0.507  1      0.548  0.548
    0.772  0.548  1      0.833
    0.772  0.548  0.833  1
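For binary purchase vectors the cosine coefficient reduces to Nc / √(NA · NB), which the following sketch computes on the same example data. As before, the nine-customers-by-four-products layout is an assumption inferred from the extracted numbers, and the labels "A" to "D" are placeholders.

```python
import math
from itertools import combinations

# Purchase data from the example slide, read as 9 customers x 4 products (assumed layout).
purchases = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
]
products = ["A", "B", "C", "D"]  # placeholder labels


def cosine(x, y):
    """Cosine similarity of two binary vectors: Nc / sqrt(NA * NB)."""
    n_a = sum(x)
    n_b = sum(y)
    n_c = sum(1 for a, b in zip(x, y) if a and b)
    if n_a == 0 or n_b == 0:
        return 0.0
    return n_c / math.sqrt(n_a * n_b)


def cosine_table(matrix, labels):
    """Pairwise cosine coefficient for every pair of product columns."""
    cols = [[row[j] for row in matrix] for j in range(len(labels))]
    return {
        (labels[i], labels[j]): cosine(cols[i], cols[j])
        for i, j in combinations(range(len(labels)), 2)
    }


if __name__ == "__main__":
    for pair, value in cosine_table(purchases, products).items():
        print(pair, round(value, 3))
```

Unlike Tanimoto, the denominator here grows with the product of the two popularity counts, so the two measures can rank the same pairs differently.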
  • 17. { “Algorithms” : “Classification”, “id” : “Introduction” } • Assigning data to discrete categories. • Train a model on labeled data. • Run the model on new, unlabeled data. • Classifier: an algorithm that implements classification, especially in a concrete implementation. • Classification Algorithms • Maximum entropy classifier • Naïve Bayes classifier • Decision trees, decision lists • Support vector machines • Kernel estimation and K-nearest-neighbor algorithms • Perceptrons • Neural networks (multi-layer perceptrons) (Figure labels: Spam, Not spam, ?)
  • 18. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” } Train: Not Spam. President Obama’s Nobel Prize Speech
  • 19. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” } Train: Spam. Spam Email Content
  • 20. { “Algorithms” : “Classification”, “id” : “Naïve Bayes Example” } Run: “Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”
  • 21. { “Algorithms” : “Classification”, “id” : “Naïve Bayes in Mahout” } • Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs. • Training: • Read the Features • Calculate per-Document Statistics • Normalize across Categories • Calculate normalizing factor of each label • Testing: • Classification (fifth job, explicitly invoked)
    algorithm through which the system will learn, and the variables used as input are key steps in the phase of building the classification system. The basic steps in building a classification system are illustrated in figure 13.2. Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a train algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in produc with new input examples to estimate the target variable. The figure shows two phases of the classification process, with the upper path representing training classification model and the lower path providing new examples for which the model will assign catego (the target variables) as a way to emulate decisions. For the training phase, input for the train
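The train-then-classify flow of the spam example can be illustrated with a small single-process Naïve Bayes. This is a textbook multinomial Naïve Bayes with add-one smoothing written for illustration; it does not reproduce Mahout's four training jobs or its normalization steps, and the training sentences and the function names `train` and `classify` are invented for this sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def train(docs):
    """docs: iterable of (label, text). Returns document counts, word counts, and vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under the model."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen label/word pairs.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this sketch.
training_docs = [
    ("not spam", "the committee will meet on monday to review the budget"),
    ("not spam", "please find the meeting notes attached"),
    ("spam", "order a trial now and get summer savings"),
    ("spam", "new savings daily order today welcome"),
]
model = train(training_docs)

if __name__ == "__main__":
    print(classify(model, "order new summer savings"))
    print(classify(model, "meeting notes for the budget"))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents are longer than a few words.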
  • 22. { “Algorithms” : “Clustering”, “id” : “Introduction” } • Grouping unstructured data without any training data. • Self-learning from experience. • Small intra-cluster distance (trying for local and global minima). • Large inter-cluster distance. • Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.
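The idea behind canopy clustering as a cheap source of initial centroids can be sketched as follows. This is the general single-machine canopy procedure with a loose threshold T1 and a tight threshold T2 (T1 > T2), not Mahout's MapReduce implementation; the function names, the thresholds, and the sample points are assumptions made for illustration.

```python
import math


def canopies(points, t1, t2):
    """Group points into canopies using a loose threshold t1 and a tight threshold t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)  # any remaining point can start a canopy
        members = [center]
        still_free = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)      # loosely close: joins this canopy
            if d >= t2:
                still_free.append(p)   # not tightly close: may seed or join other canopies
        remaining = still_free
        result.append((center, members))
    return result


def centroid(members):
    """Mean of a list of equal-length coordinate tuples."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))


# Sample points, invented for this sketch.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]

if __name__ == "__main__":
    initial_centroids = [centroid(m) for _, m in canopies(points, t1=3.0, t2=1.0)]
    print(initial_centroids)
```

The resulting centroids, and their count, can then seed K-Means so that the number of clusters does not have to be guessed up front.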
  • 23. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 24. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 25. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 26. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 27. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 28. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 29. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 30. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 31. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” }
  • 32. { “Algorithms” : “Clustering”, “id” : “K-Means Clustering Example” } (Cluster labels: Cats, Dogs)
  • 33. { “Algorithms” : “Clustering”, “id” : “K-Means in Mahout” } Diagram labels: chunks C0–C3; mappers M0–M3; IO0–IO3; Shuffling Data; reducers R0–R1; FO0–FO1; Map Phase; Reduce Phase.
  • 34. { “Algorithms” : “Clustering”, “id” : “K-Means in Mahout” } • Assume: the number of clusters is far smaller than the number of points. • Therefore, |Clusters| << |Points|. • Hadoop’s DistributedCache is used to give each Mapper access to all the current cluster centroids. Diagram labels: mappers M0–M3 emit <clusterID, observation> pairs to reducers R0–R1. Important arguments: --maxIter, --convergenceDelta, --method.
  • 35. { “Algorithms” : “Clustering”, “id” : “MapReduce KMeans Clustering” } Map phase: assign cluster IDs. Reduce phase: reset centroids.
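The two phases can be mimicked in a single process to show what each one contributes. This sketch runs locally in plain Python and is not Mahout's Hadoop job: `map_phase`, `reduce_phase`, and `kmeans` are hypothetical names, the `max_iter` and `convergence_delta` parameters only echo the `--maxIter` and `--convergenceDelta` arguments named on the previous slide, and the sample points are invented.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Map: emit a (clusterID, observation) pair for each point's nearest centroid."""
    for p in points:
        cluster_id = min(range(len(centroids)),
                         key=lambda i: math.dist(p, centroids[i]))
        yield cluster_id, p


def reduce_phase(pairs, centroids):
    """Reduce: reset each centroid to the mean of the observations assigned to it."""
    groups = defaultdict(list)
    for cluster_id, p in pairs:
        groups[cluster_id].append(p)
    new_centroids = list(centroids)  # a cluster with no points keeps its old centroid
    for cluster_id, members in groups.items():
        new_centroids[cluster_id] = tuple(
            sum(axis) / len(members) for axis in zip(*members)
        )
    return new_centroids


def kmeans(points, centroids, max_iter=10, convergence_delta=1e-3):
    """Repeat map and reduce until centroids move less than convergence_delta."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= convergence_delta:
            break
    return centroids


# Sample data, invented for this sketch.
points = [(1, 1), (1, 2), (9, 9), (10, 9)]
initial = [(0, 0), (5, 5)]

if __name__ == "__main__":
    print(kmeans(points, initial))
```

Every mapper needs the full list of current centroids but only its own slice of the points, which is why the small-number-of-clusters assumption makes distributing the centroids to all mappers practical.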
  • 36. { “Algorithms” : “Other Algorithms” } • Classification ‣ Stochastic Gradient Descent ‣ Support Vector Machines ‣ Random Forests • Clustering ‣ Latent Dirichlet Allocation - Topic models ‣ Fuzzy K-Means - Points are assigned to multiple clusters ‣ Canopy clustering - Fast approximations of clusters ‣ Spectral clustering - Treat points as a graph • Evolutionary Algorithms - Integration with Watchmaker for Genetic Programming Fitness Functions • Dimensionality Reduction • Regression
  • 37. { “Algorithms” : “Future” } • Classification ‣ Decision Trees such as J48 and ID3 • Clustering ‣ DBSCAN and COBWEB clustering techniques • Evolutionary Algorithms ‣ Classical Genetic Algorithms • Association Rules ‣ Apriori (Mahout already has an alternative frequent-itemset algorithm implementation).
  • 38. { “Mahout” : “Summary” }
  • 39. { “Summary” : “Apache Mahout” } • Scalable Library
  • 40. { “Summary” : “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus
  • 41. { “Summary” : “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus • Other Algorithms
  • 42. { “Summary” : “Apache Mahout” } • Scalable Library • Three Primary Areas of Focus • Other Algorithms • All in your friendly neighborhood MapReduce
  • 43. { “Mahout” : “Demo” }
  • 44. { “Mahout” : “Questions” }
  • 45. { “Mahout” : “References” }
  • 46. { “References” : “Mahout Books, Tutorials, Links”, “id” : 1 } • Books • “Mahout in Action”, Owen et al., Manning Pub. • “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer Pub. • “Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer Pub. • Videos • CS-229, Machine Learning at Stanford University - Prof. Andrew Ng. • Collaborative filtering at scale - Sean Owen • Distributed Item-based Collaborative Filtering - Sebastian Schelter • Email Classification with Mahout - Grant Ingersoll @ Lucid Imagination
  • 47. { “References” : “Mahout Books, Tutorials, Links”, “id” : 2 } • WWW • http://mahout.apache.org - Mahout@Apache • http://hadoop.apache.org - Hadoop@Apache • dev@mahout.apache.org - Developer mailing list • user@mahout.apache.org - User mailing list • http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout
  • 48. { “Mahout” : “The End” } { “Thank You” : “Have a Nice and Green Day” }