SlideShare a Scribd company logo
Intro to Apache Mahout Grant Ingersoll Lucid Imagination http://www.lucidimagination.com
Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?
Topics Background and Use Cases What can you do in Mahout? Where’s the community at? Resources K-Means in Hadoop (time permitting)
Definition “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” Intro. To Machine Learning by E. Alpaydin Subset of Artificial Intelligence Lots of related fields: Information Retrieval Stats Biology Linear algebra Many more
Common Use Cases Recommend friends/dates/products Classify content into predefined groups Find similar content Find associations/patterns in actions/behaviors Identify key topics/summarize text Documents and Corpora Detect anomalies/fraud Ranking search results Others?
Apache Mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License Or are research-oriented Definition:http://dictionary.reference.com/browse/mahout
What does scalable mean to us? Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a Map Reduce framework like Apache Hadoop Still others will need different distributed programming models Others are already fast (SGD) Be pragmatic
Sampling of Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
What Can I do with Mahout Right Now?3C + FPM + O = Mahout
Collaborative Filtering Extensive framework for collaborative filtering (recommenders) Recommenders User based Item based Online and Offline support Offline can utilize Hadoop Many different Similarity measures Cosine, LLR, Tanimoto, Pearson, others
Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral) All Map/Reduce Distance Measures Manhattan, Euclidean, other Topic Modeling  Cluster words across documents to identify topics Latent Dirichlet Allocation (M/R)
Categorization Place new items into predefined categories: Sports, politics, entertainment Recommenders Implementations Naïve Bayes (M/R) Compl. Naïve Bayes (M/R) Decision Forests (M/R) Linear Regression (Seq. but Fast!) ,[object Object]
http://awe.sm/5FyNe,[object Object]
Other Primitive Collections! Collocations (M/R) Math library Vectors, Matrices, etc. Noise Reduction via Singular Value Decomp (M/R)
Prepare Data from Raw content Data Sources: Lucene integration bin/mahout lucene.vector… Document Vectorizer bin/mahout seqdirectory … bin/mahout seq2sparse … Programmatically See the Utils module in Mahout and the Iterator<Vector> classes Database File system
How to: Command Line Most algorithms have a Driver program $MAHOUT_HOME/bin/mahout.shhelps with most tasks Prepare the Data Different algorithms require different setup Run the algorithm Single Node Hadoop Print out the results or incorporate into application Several helper classes:  LDAPrintTopics, ClusterDumper, etc.
What’s Happening Now? Unified Framework for Clustering and Classification 0.5 release on the horizon (May?) Working towards 1.0 release by focusing on: Tests, examples, documentation API cleanup and consistency Gearing up for Google Summer of Code New M/R work for Hidden Markov Models
Summary Machine learning is all over the web today Mahout is about scalable machine learning Mahout has functionality for many of today’s common machine learning tasks Many Mahout implementations use Hadoop
Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk http://hadoop.apache.org
Resources “Mahout in Action”  Owen, Anil, Dunning and Friedman http://awe.sm/5FyNe “Introducing Apache Mahout”  http://www.ibm.com/developerworks/java/library/j-mahout/ “Taming Text” by Ingersoll, Morton, Farris “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and  Chris Dyer
K-Means Clustering Algorithm Nicely parallelizable! http://en.wikipedia.org/wiki/K-means_clustering
K-Means in Map-Reduce Input: Mahout Vectors representing the original content Either: A predefined set of initial centroids (Can be from Canopy) --k – The number of clusters to produce Iterate Do the centroid calculation (more in a moment) Clustering Step (optional) Output Centroids (as Mahout Vectors) Points for each Centroid (if Clustering Step was taken)
Map-Reduce Iteration Each Iteration calculates the Centroids using: KMeansMapper KMeansCombiner KMeansReducer Clustering Step Calculate the points for each Centroid using: KMeansClusterMapper
KMeansMapper During Setup: Load the initial Centroids (or the Centroids from the last iteration) Map Phase For each input Calculate it’s distance from each Centroid and output the closest one Distance Measures are pluggable Manhattan, Euclidean, Squared Euclidean, Cosine, others
KMeansReducer Setup: Load up clusters Convergence information Partial sums from KMeansCombiner (more in a moment) Reduce Phase Sum all the vectors in the cluster to produce a new Centroid Check for Convergence Output cluster
KMeansCombiner Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
KMeansClusterMapper Some applications only care about what the Centroids are, so this step is optional Setup: Load up the clusters and the DistanceMeasure used Map Phase Calculate which Cluster the point belongs to Output <ClusterId, Vector>

More Related Content

What's hot

Apache mahout
Apache mahoutApache mahout
Apache mahout
Puneet Gupta
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
James Chen
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
Cataldo Musto
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
Ajit Koti
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
Aman Adhikari
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
Yasmine Gaber
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
sscdotopen
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
OSCON Byrum
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Evan Casey
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
aneeshabakharia
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
Uri Lavi
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentation
Naoki Nakatani
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
Drew Farris
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
Stefano Dalla Palma
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
Navisro Analytics
 

What's hot (20)

Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Mahout
MahoutMahout
Mahout
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
Mahout part2
Mahout part2Mahout part2
Mahout part2
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache SparkScalable Collaborative Filtering Recommendation Algorithms on Apache Spark
Scalable Collaborative Filtering Recommendation Algorithms on Apache Spark
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Intro to Mahout
Intro to MahoutIntro to Mahout
Intro to Mahout
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Mahout classification presentation
Mahout classification presentationMahout classification presentation
Mahout classification presentation
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Collaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro AnalyticsCollaborative Filtering and Recommender Systems By Navisro Analytics
Collaborative Filtering and Recommender Systems By Navisro Analytics
 

Viewers also liked

Understanding Mahout classification documentation
Understanding Mahout  classification documentationUnderstanding Mahout  classification documentation
Understanding Mahout classification documentation
Naveen Kumar
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
Jee Vang, Ph.D.
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoopTianwei Liu
 
K fold validation
K fold validationK fold validation
K fold validation
Masrur Ahmed
 
Recomendación con Mahout sobre Cassandra
Recomendación con Mahout sobre CassandraRecomendación con Mahout sobre Cassandra
Recomendación con Mahout sobre Cassandra
Jose Felix Hernandez Barrio
 
Filtros Colaborativos y Sistemas de Recomendación
Filtros Colaborativos y Sistemas de RecomendaciónFiltros Colaborativos y Sistemas de Recomendación
Filtros Colaborativos y Sistemas de Recomendación
Gabriel Huecas
 
Java WebServices JaxWS - JaxRs
Java WebServices JaxWS - JaxRsJava WebServices JaxWS - JaxRs
Java WebServices JaxWS - JaxRsHernan Rengifo
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Apache Mahout Algorithms
Apache Mahout AlgorithmsApache Mahout Algorithms
Apache Mahout Algorithmsmozgkarakaya
 
Final Presentation for Pattern Recognition
Final Presentation for Pattern RecognitionFinal Presentation for Pattern Recognition
Final Presentation for Pattern Recognition
davidglenEE
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
Varad Meru
 
Modelo del dominio
Modelo del dominioModelo del dominio
Modelo del dominio
SCMU AQP
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
Tien-Yang (Aiden) Wu
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
Farzad Nozarian
 
Unidad 10 Mad Diagrama De Clases
Unidad 10 Mad Diagrama De ClasesUnidad 10 Mad Diagrama De Clases
Unidad 10 Mad Diagrama De ClasesSergio Sanchez
 
Neural Networks with Google TensorFlow
Neural Networks with Google TensorFlowNeural Networks with Google TensorFlow
Neural Networks with Google TensorFlow
Darshan Patel
 
Modelos de Base de Datos
Modelos de Base de DatosModelos de Base de Datos
Modelos de Base de DatosAxel Mérida
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 

Viewers also liked (20)

Understanding Mahout classification documentation
Understanding Mahout  classification documentationUnderstanding Mahout  classification documentation
Understanding Mahout classification documentation
 
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
A Quick Tutorial on Mahout’s Recommendation Engine (v 0.4)
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoop
 
K fold validation
K fold validationK fold validation
K fold validation
 
Recomendación con Mahout sobre Cassandra
Recomendación con Mahout sobre CassandraRecomendación con Mahout sobre Cassandra
Recomendación con Mahout sobre Cassandra
 
Filtros Colaborativos y Sistemas de Recomendación
Filtros Colaborativos y Sistemas de RecomendaciónFiltros Colaborativos y Sistemas de Recomendación
Filtros Colaborativos y Sistemas de Recomendación
 
Java WebServices JaxWS - JaxRs
Java WebServices JaxWS - JaxRsJava WebServices JaxWS - JaxRs
Java WebServices JaxWS - JaxRs
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Apache Mahout Algorithms
Apache Mahout AlgorithmsApache Mahout Algorithms
Apache Mahout Algorithms
 
Final Presentation for Pattern Recognition
Final Presentation for Pattern RecognitionFinal Presentation for Pattern Recognition
Final Presentation for Pattern Recognition
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Modelo del dominio
Modelo del dominioModelo del dominio
Modelo del dominio
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Unidad 10 Mad Diagrama De Clases
Unidad 10 Mad Diagrama De ClasesUnidad 10 Mad Diagrama De Clases
Unidad 10 Mad Diagrama De Clases
 
IR
IRIR
IR
 
Neural Networks with Google TensorFlow
Neural Networks with Google TensorFlowNeural Networks with Google TensorFlow
Neural Networks with Google TensorFlow
 
Modelos de Base de Datos
Modelos de Base de DatosModelos de Base de Datos
Modelos de Base de Datos
 
Modelo relacional
Modelo relacionalModelo relacional
Modelo relacional
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 

Similar to Intro to Mahout -- DC Hadoop

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101
John Ternent
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
Jake Mannix
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
TrainerAnalogicx
 
BigData
BigDataBigData
BigData
Shankar R
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
Duncan Hull
 
DB-IR-ranking
DB-IR-rankingDB-IR-ranking
DB-IR-rankingFELIX75
 
DB and IR Integration
DB and IR IntegrationDB and IR Integration
DB and IR Integration
Marco A Torres
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
Vijay Srinivas Agneeswaran, Ph.D
 
Bigdata
BigdataBigdata
Bigdata
Shankar R
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
Vital.AI
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) Developers
Mateusz Dymczyk
 
AI Presentation 1
AI Presentation 1AI Presentation 1
AI Presentation 1
Mustafa Kuğu
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
Ivo Vachkov
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
gauravsc36
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
greenwop
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
Sri Ambati
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect WorldVital.AI
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
Rohit
 

Similar to Intro to Mahout -- DC Hadoop (20)

Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101Mahout and Distributed Machine Learning 101
Mahout and Distributed Machine Learning 101
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
Seattle Scalability Mahout
Seattle Scalability MahoutSeattle Scalability Mahout
Seattle Scalability Mahout
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
BigData
BigDataBigData
BigData
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
DB-IR-ranking
DB-IR-rankingDB-IR-ranking
DB-IR-ranking
 
DB and IR Integration
DB and IR IntegrationDB and IR Integration
DB and IR Integration
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Bigdata
BigdataBigdata
Bigdata
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
Machine Learning for (JVM) Developers
Machine Learning for (JVM) DevelopersMachine Learning for (JVM) Developers
Machine Learning for (JVM) Developers
 
AI Presentation 1
AI Presentation 1AI Presentation 1
AI Presentation 1
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Expressiveness, Simplicity and Users
Expressiveness, Simplicity and UsersExpressiveness, Simplicity and Users
Expressiveness, Simplicity and Users
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
 
Natural Language Processing & Semantic Models in an Imperfect World
Natural Language Processing & Semantic Modelsin an Imperfect WorldNatural Language Processing & Semantic Modelsin an Imperfect World
Natural Language Processing & Semantic Models in an Imperfect World
 
Python in big data world
Python in big data worldPython in big data world
Python in big data world
 

More from Grant Ingersoll

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
Grant Ingersoll
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
Grant Ingersoll
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
Grant Ingersoll
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
Grant Ingersoll
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
Grant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Grant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
Grant Ingersoll
 
Taming Text
Taming TextTaming Text
Taming Text
Grant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
Grant Ingersoll
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
Grant Ingersoll
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Grant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
Grant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
Grant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
Grant Ingersoll
 

More from Grant Ingersoll (20)

Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
Taming Text
Taming TextTaming Text
Taming Text
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Apache Lucene 4
Apache Lucene 4Apache Lucene 4
Apache Lucene 4
 
OpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene EcosystemOpenSearchLab and the Lucene Ecosystem
OpenSearchLab and the Lucene Ecosystem
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 

Recently uploaded

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 

Recently uploaded (20)

GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 

Intro to Mahout -- DC Hadoop

  • 1. Intro to Apache Mahout Grant Ingersoll Lucid Imagination http://www.lucidimagination.com
  • 2. Anyone Here Use Machine Learning? Any users of: Google? Search? Priority Inbox? Facebook? Twitter? LinkedIn?
  • 3. Topics Background and Use Cases What can you do in Mahout? Where’s the community at? Resources K-Means in Hadoop (time permitting)
  • 4. Definition “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” Intro. To Machine Learning by E. Alpaydin Subset of Artificial Intelligence Lots of related fields: Information Retrieval Stats Biology Linear algebra Many more
  • 5. Common Use Cases Recommend friends/dates/products Classify content into predefined groups Find similar content Find associations/patterns in actions/behaviors Identify key topics/summarize text Documents and Corpora Detect anomalies/fraud Ranking search results Others?
  • 6. Apache Mahout An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License http://mahout.apache.org Why Mahout? Many Open Source ML libraries either: Lack Community Lack Documentation and Examples Lack Scalability Lack the Apache License Or are research-oriented Definition:http://dictionary.reference.com/browse/mahout
  • 7. What does scalable mean to us? Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm Some algorithms won’t scale to massive machine clusters Others fit logically on a Map Reduce framework like Apache Hadoop Still others will need different distributed programming models Others are already fast (SGD) Be pragmatic
  • 8. Sampling of Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • 9. What Can I do with Mahout Right Now?3C + FPM + O = Mahout
  • 10. Collaborative Filtering Extensive framework for collaborative filtering (recommenders) Recommenders User based Item based Online and Offline support Offline can utilize Hadoop Many different Similarity measures Cosine, LLR, Tanimoto, Pearson, others
  • 11. Clustering Document level Group documents based on a notion of similarity K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral) All Map/Reduce Distance Measures Manhattan, Euclidean, other Topic Modeling Cluster words across documents to identify topics Latent Dirichlet Allocation (M/R)
  • 12.
  • 13.
  • 14. Other Primitive Collections! Collocations (M/R) Math library Vectors, Matrices, etc. Noise Reduction via Singular Value Decomp (M/R)
  • 15. Prepare Data from Raw content Data Sources: Lucene integration bin/mahout lucene.vector… Document Vectorizer bin/mahout seqdirectory … bin/mahout seq2sparse … Programmatically See the Utils module in Mahout and the Iterator<Vector> classes Database File system
  • 16. How to: Command Line Most algorithms have a Driver program $MAHOUT_HOME/bin/mahout.shhelps with most tasks Prepare the Data Different algorithms require different setup Run the algorithm Single Node Hadoop Print out the results or incorporate into application Several helper classes: LDAPrintTopics, ClusterDumper, etc.
  • 17. What’s Happening Now? Unified Framework for Clustering and Classification 0.5 release on the horizon (May?) Working towards 1.0 release by focusing on: Tests, examples, documentation API cleanup and consistency Gearing up for Google Summer of Code New M/R work for Hidden Markov Models
  • 18. Summary Machine learning is all over the web today Mahout is about scalable machine learning Mahout has functionality for many of today’s common machine learning tasks Many Mahout implementations use Hadoop
  • 19. Resources http://mahout.apache.org http://cwiki.apache.org/MAHOUT {user|dev}@mahout.apache.org http://svn.apache.org/repos/asf/mahout/trunk http://hadoop.apache.org
  • 20. Resources “Mahout in Action” Owen, Anil, Dunning and Friedman http://awe.sm/5FyNe “Introducing Apache Mahout” http://www.ibm.com/developerworks/java/library/j-mahout/ “Taming Text” by Ingersoll, Morton, Farris “Programming Collective Intelligence” by Toby Segaran “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
  • 21. K-Means Clustering Algorithm Nicely parallelizable! http://en.wikipedia.org/wiki/K-means_clustering
  • 22. K-Means in Map-Reduce Input: Mahout Vectors representing the original content Either: A predefined set of initial centroids (Can be from Canopy) --k – The number of clusters to produce Iterate Do the centroid calculation (more in a moment) Clustering Step (optional) Output Centroids (as Mahout Vectors) Points for each Centroid (if Clustering Step was taken)
  • 23. Map-Reduce Iteration Each Iteration calculates the Centroids using: KMeansMapper KMeansCombiner KMeansReducer Clustering Step Calculate the points for each Centroid using: KMeansClusterMapper
  • 24. KMeansMapper During Setup: Load the initial Centroids (or the Centroids from the last iteration) Map Phase For each input Calculate it’s distance from each Centroid and output the closest one Distance Measures are pluggable Manhattan, Euclidean, Squared Euclidean, Cosine, others
  • 25. KMeansReducer Setup: Load up clusters Convergence information Partial sums from KMeansCombiner (more in a moment) Reduce Phase Sum all the vectors in the cluster to produce a new Centroid Check for Convergence Output cluster
  • 26. KMeansCombiner Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
  • 27. KMeansClusterMapper Some applications only care about what the Centroids are, so this step is optional Setup: Load up the clusters and the DistanceMeasure used Map Phase Calculate which Cluster the point belongs to Output <ClusterId, Vector>

Editor's Notes

  1. 3C: The three C’s: clustering, classification and collaborative filteringFPM: Frequent patternset miningO: Other (math, collections, etc.)
  2. Convergence just checks to see how far the centroid has moved from the previous centroid