Mahout is an open source machine learning library for Java from the Apache Software Foundation. Being Java-based, it is platform independent, and it provides a framework, a collection of patterns, and ready-made components for testing and deploying new large-scale algorithms.
These slides aim to provide a deeper understanding of its architecture.
A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
This presentation lets you know about Apache Mahout.
Apache Mahout is a machine learning library; its main goal is to build scalable machine learning libraries.
Introduction to Mahout and Machine Learning - Varad Meru
This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.
Discovering User's Topics of Interest in Recommender Systems - Gabriel Moreira
This talk introduces the main techniques of Recommender Systems and Topic Modeling.
Then, we present a case of how we've combined those techniques to build Smart Canvas (www.smartcanvas.com), a service that allows people to bring, create and curate content relevant to their organization, and also helps to tear down knowledge silos.
We present some of the Smart Canvas features powered by its recommender system:
- Highlights relevant content, explaining to users which of their topics of interest generated each recommendation.
- Associates tags with users’ profiles based on topics discovered from content they have contributed. These tags become searchable, allowing users to find experts or people with specific interests.
- Recommends people with similar interests, explaining which topics bring them together.
We give a deep dive into the design of our large-scale recommendation algorithms, giving special attention to our content-based approach that uses topic modeling techniques (like LDA and NMF) to discover people’s topics of interest from unstructured text, and social-based algorithms using a graph database connecting content, people and teams around topics.
Our typical data pipeline includes the ingestion of millions of user events (using Google PubSub and BigQuery), the batch processing of the models (with PySpark, MLlib, and Scikit-learn), the online recommendations (with Google App Engine, Titan Graph Database and Elasticsearch), and the data-driven evaluation of UX and algorithms through A/B testing. We also touch on non-functional requirements of a software-as-a-service, such as scalability, performance, availability, reliability and multi-tenancy, and how we addressed them in a robust architecture deployed on Google Cloud Platform.
A Deep Dive into Classification with Naive Bayes. Along the way we take a look at some basics from Ian Witten's Data Mining book and dig into the algorithm.
Presented on Wed Apr 27 2011 at SeaHUG in Seattle, WA.
An introduction to Apache Mahout presented at Apache BarCamp DC, May 19, 2012
A brief introduction to the examples and links to more resources for further exploration.
Big data serving: Processing and inference at scale in real time - Itai Yaffe
Jon Bratseth (VP Architect) @ Verizon Media:
The big data world has mature technologies for offline analysis and learning from data, but has lacked options for making data-driven decisions in real time.
When it is sufficient to consider a single data point, model servers such as TensorFlow Serving can be used, but in many cases you want to consider many data points to make decisions.
This is a difficult engineering problem combining state, distributed algorithms and low latency, but solving it often makes it possible to create far superior solutions when applying machine learning.
This talk will explain why this is a hard problem, show the advantages of solving it, and introduce the open source Vespa.ai platform which is used to implement such solutions in some of the largest scale problems in the world including the world's third largest ad serving system.
Configuring Mahout Clustering Jobs - Frank Scholten - lucenerevolution
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
For more than a decade internet search engines have helped users find documents they are looking for. However, what if users aren't looking for anything specific but want a summary of a large document collection and want to be surprised? One solution to this problem is document clustering. Clustering algorithms group documents that have similar content. Real-life examples of clustering are clustered search results of Google news, or tag clouds which group documents under a shared label. Apache Mahout is a framework for scalable machine learning on top of Apache Hadoop and can be used for large scale document clustering. This talk introduces clustering in general and shows you step-by-step how to configure Mahout clustering jobs to create a tag cloud from a document collection. This talk is suitable for people who have some experience with Hadoop and perhaps Mahout. Knowledge of clustering is not required.
Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine... - Gabriel Moreira
This talk introduces the main techniques of Recommender Systems and Topic Modeling. Then, we present a case of how we've combined those techniques to build Smart Canvas, a SaaS that allows people to bring, create and curate content relevant to their organization, and also helps to tear down knowledge silos.
We give a deep dive into the design of our large-scale recommendation algorithms, giving special attention to a content-based approach that uses topic modeling techniques (like LDA and NMF) to discover people’s topics of interest from unstructured text, and social-based algorithms using a graph database connecting content, people and teams around topics.
Our typical data pipeline includes the ingestion of millions of user events (using Google PubSub and BigQuery), the batch processing of the models (with PySpark, MLlib, and Scikit-learn), the online recommendations (with Google App Engine, Titan Graph Database and Elasticsearch), and the data-driven evaluation of UX and algorithms through A/B testing. We also touch on non-functional requirements of a software-as-a-service, such as scalability, performance, availability, reliability and multi-tenancy, and how we addressed them in a robust architecture deployed on Google Cloud Platform.
Short-Bio: Gabriel Moreira is a scientist passionate about solving problems with data. He is Head of Machine Learning at CI&T and a doctoral student at Instituto Tecnológico de Aeronáutica (ITA), where he also earned his Master of Science. His current research interests are recommender systems and deep learning.
https://www.meetup.com/pt-BR/machine-learning-big-data-engenharia/events/239037949/
1. Mammoth Scale Machine Learning
Speaker: Robin Anil, Apache Mahout PMC Member
OSCONʼ10
Portland, OR
July 2010
2. Quick Show of Hands
Are you fascinated by ML?
Have you used ML?
Do you have Gigabytes and Terabytes of data to analyze?
Do you have Hadoop or MapReduce experience?
Thanks for the survey!
3. Little bit about me
Apache Mahout PMC member
An ML Enthusiast
Software Engineer @ Google
Google Summer of Code Mentor
Previous Life: Google Summer of Code student for 2 years.
4. Agenda
Introducing Mahout
Different classes of problems
And their Mahout based solutions
Basic data structure
Usage examples
Sneak peek at our Next Release
6. Scale!
Scale to large datasets
- Hadoop MapReduce implementations that scale linearly with data
- Fast sequential algorithms whose runtime doesn’t depend on the size of the data
- Goal: To be as fast as possible for any algorithm
Scalable to support your business case
- Apache License, Version 2.0
Scalable community
- Vibrant, responsive and diverse
- Come to the mailing list and find out more
8. Why a new Library
Plenty of open source Machine Learning libraries either
- Lack community
- Lack scalability
- Lack documentation and examples
- Lack Apache licensing
- Are not well tested
- Are Research oriented
9. Agenda
Introducing Mahout
Different classes of problems
And their Mahout based solutions
Basic data structure
Usage examples
Sneak peek at our Next Release
10. ML on Twitter
Collection of tweets in the last hour
Each tweet is a stream of up to 140 characters, or a stream of tokens
We will keep using this example throughout this talk
14. Topic modeling
Grouping similar or co-occurring features into a topic
- Topic “Lol Cat”:
- Cat
- Meow
- Purr
- Haz
- Cheeseburger
- Lol
15. Mahout Topic Modeling
Algorithm: Latent Dirichlet Allocation
- Input a set of documents
- Output top K prominent topics and the features in each topic
18. Mahout Classification
Plenty of algorithms
- Naïve Bayes
- Complementary Naïve Bayes
- Random Forests
- Logistic Regression (Almost done)
- Support Vector Machines (patch ready)
Learn a model from manually classified data
Predict the class of a new object based on its features and the learned model
19. Detect OSCON Tweets
“Tweets without #OSCON”
Use tweets mentioning #OSCON to train a model, then classify incoming tweets
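The train-then-classify idea can be sketched with a tiny multinomial Naïve Bayes in plain Python. This is only an illustration under stated assumptions: it is not Mahout's implementation or API, the function names `train` and `classify` are hypothetical, and the four training "tweets" are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs: list of (label, token list). Returns the counts the classifier needs."""
    doc_counts = Counter()              # number of training documents per label
    term_counts = defaultdict(Counter)  # per-label term frequencies
    vocab = set()
    for label, tokens in labelled_docs:
        doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, term_counts, vocab

def classify(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    doc_counts, term_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen during training
            score += math.log((term_counts[label][token] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training data: tweets that mentioned #OSCON versus everything else.
model = train([
    ("oscon", "great mahout talk at the conference".split()),
    ("oscon", "open source talk in portland".split()),
    ("other", "my cat wants a cheeseburger".split()),
    ("other", "lol cat says meow".split()),
])
print(classify(model, "mahout talk in portland".split()))
print(classify(model, "cat says lol".split()))
```

The hashtag itself is left out of the features, so the model has to recognise conference tweets from the surrounding words alone.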
20. Recommendations
Predict what the user likes based on
- Their historical behavior
- Aggregate behavior of people similar to them
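As a rough illustration of recommending from the behavior of similar people, here is a minimal user-based sketch in plain Python. It is not Mahout's recommender framework: the `recommend` function, the set-overlap weighting, and the sample data are all assumptions made for this example.

```python
def recommend(user, likes, top_n=3):
    """likes: {user: set of liked items}. Suggest items liked by similar users."""
    own = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        union = own | items
        # Overlap of the two users' item sets, used as the similarity weight.
        similarity = len(own & items) / len(union) if union else 0.0
        if similarity == 0.0:
            continue
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Invented data: who liked what.
likes = {
    "ann": {"hadoop", "mahout"},
    "bob": {"hadoop", "mahout", "lucene"},
    "cat": {"knitting"},
}
print(recommend("ann", likes))
```

Users with nothing in common contribute no score, so "knitting" is never suggested to "ann".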
21. Mahout Recommenders
Different types of recommenders
- User based
- Item based
Full framework for storage, and for online and offline computation of recommendations
Like clustering, there is a notion of similarity in users or items
- Cosine, Tanimoto, Pearson and LLR
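Three of the measures named above can be written down in a few lines of plain Python. These are the textbook definitions, given as a sketch rather than Mahout's own classes; the function names are chosen for this example, and the log-likelihood ratio (LLR) is omitted for brevity.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def tanimoto(a, b):
    """Size of the intersection over size of the union of two item sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pearson(a, b):
    """Pearson correlation of two equal-length rating lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    dev_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    dev_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (dev_a * dev_b) if dev_a and dev_b else 0.0
```

Cosine and Pearson work on numeric ratings, while Tanimoto only needs to know which items each user touched, which suits data without explicit ratings.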
24. Mahout Parallel FPGrowth
Identify the most commonly occurring patterns from
- Sales Transactions
buy “Milk, eggs and bread”
- Query Logs
ipad -> apple, tablet, iphone
- Spam Detection
Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
25. Frequent patterns in Tweets
“Identify groups of words that occur together”
Or
“Identify related searches from search logs”
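To make "words that occur together" concrete, the sketch below counts co-occurring pairs and keeps those that reach a minimum support. Note that this is a deliberately naive pair counter, not the FP-Growth algorithm and not Mahout code; `frequent_pairs` and the sample transactions are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return {(item_a, item_b): count} for pairs seen in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_support}

# Invented transactions; tweets would be token lists instead of shopping baskets.
transactions = [
    ["milk", "eggs", "bread"],
    ["milk", "bread"],
    ["eggs", "tea"],
]
print(frequent_pairs(transactions, min_support=2))
```

FP-Growth reaches the same kind of answer for itemsets of any size without enumerating every combination, which is what makes it practical on large logs.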
26. Mahout is Evolving
MapReduce-enabled fitness functions for genetic programming
- Integration with Watchmaker
- Solves: Travelling salesman, class discovery and many others
Singular value decomposition (SVD) of large matrices
- Reduce a large matrix into a smaller one by identifying the key rows and columns and discarding the others
- MapReduce implementation of the Lanczos algorithm
27. Agenda
Introducing Mahout
Different classes of problems
And their Mahout based solutions
Basic data structure
Usage examples
Sneak peek at our Next Release
29. Representing Data as Vectors
[Figure: the point (5, 3) plotted on X and Y axes, with X = 5 and Y = 3]
The vector denoted by point (5, 3) is simply
Array([5, 3]) or HashMap([0 => 5], [1 => 3])
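The two storage forms mentioned here, an array and an index-to-value map, can be mirrored in plain Python. This is a sketch of the idea only; the helper names are hypothetical and are not Mahout's vector classes.

```python
# Dense form: the position in the list is the dimension index.
dense = [5.0, 3.0]

# Sparse form: only non-zero dimensions are stored, keyed by index.
sparse = {0: 5.0, 1: 3.0}

def to_dense(sparse_vec, size):
    """Expand an index -> value mapping into a list of length `size`."""
    return [sparse_vec.get(i, 0.0) for i in range(size)]

def to_sparse(dense_vec):
    """Keep only the non-zero entries of a list as an index -> value mapping."""
    return {i: v for i, v in enumerate(dense_vec) if v != 0.0}
```

The sparse form pays off when most dimensions are zero, as with text, where a document uses only a tiny fraction of the vocabulary.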
30. Representing Vectors – The basics
Now think 3, 4, 5, ..., n-dimensional
Think of a document as a bag of words.
“she sells sea shells on the sea shore”
Now map them to integers
she => 0
sells => 1
sea => 2
and so on
The resulting vector [1.0, 1.0, 2.0, … ]
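The mapping described on this slide can be reproduced in a few lines of plain Python. It is a sketch, not Mahout's vectorizer: `bag_of_words` is a hypothetical helper that assigns indices in order of first appearance and counts occurrences.

```python
def bag_of_words(text):
    """Return (dictionary, vector): word -> index, and the count stored at each index."""
    dictionary = {}
    vector = []
    for word in text.lower().split():
        if word not in dictionary:
            dictionary[word] = len(dictionary)  # next free integer index
            vector.append(0.0)
        vector[dictionary[word]] += 1.0
    return dictionary, vector

dictionary, vector = bag_of_words("she sells sea shells on the sea shore")
print(dictionary)
print(vector)
```

"sea" appears twice, so index 2 holds 2.0 while every other word in the sentence has a count of 1.0.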
31. Vectorizer tools
Map/Reduce tools to convert text data to vectors
- Collate multiple words (n-grams), e.g. “San Francisco”
- Normalization
- Optimize for sequential or random access
- TF-IDF calculation
- Pruning
- Stop words removal
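Three of the steps listed above (n-grams, TF-IDF weighting, and normalization) are sketched below in plain Python. The helpers are hypothetical and use the textbook weight tf × log(N / df) with an L2 norm; the formulas Mahout's tools actually apply may differ.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single feature."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weighted.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()})
    return weighted

def l2_normalize(vec):
    """Scale a {term: weight} dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)
```

A term that occurs in every document gets weight zero under this formula, which is why TF-IDF tends to push very common words out of the picture.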
32. Agenda
Introducing Mahout
Different classes of problems
And their Mahout based solutions
Basic data structure
Usage examples
Sneak peek at our Next Release
33. How to use Mahout
Command line launcher bin/mahout
See the list of tools and algorithms by running bin/mahout
Run any algorithm by its shortname:
- bin/mahout kmeans --help
By default runs locally
To run on a cluster: export HADOOP_HOME=/pathto/hadoop-0.20.2/
- Runs on the cluster configured as per the conf files in the hadoop directory
Use driver classes to launch jobs:
- KMeansDriver.runJob(Path input, Path output …)
34. Clustering Walkthrough (tiny example)
Input: set of text files in a directory
Download Mahout and unzip
- mvn install
- bin/mahout seqdirectory -i <input> -o <seq-output>
- bin/mahout seq2sparse -i <seq-output> -o <vector-output>
- bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 10 -cd 0.01 -x 20
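For readers who want to see what the last command is iterating on, here is a minimal single-machine k-means sketch in plain Python (3.8+ for `math.dist`). It is not Mahout's MapReduce implementation. The `convergence_delta` and `max_iterations` parameters are meant to play the roles of `-cd` and `-x`, and, as an assumption of this sketch, the caller passes the initial centroids explicitly rather than giving a cluster count with `-k`.

```python
import math

def kmeans(points, centroids, convergence_delta=0.01, max_iterations=20):
    """Lloyd's algorithm on lists of numbers. Returns (centroids, assignments)."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        largest_move = 0.0
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if not members:
                continue  # leave an empty cluster's centroid where it is
            new_centroid = [sum(dim) / len(members) for dim in zip(*members)]
            largest_move = max(largest_move, math.dist(new_centroid, centroids[k]))
            centroids[k] = new_centroid
        # Stop once no centroid moved farther than the convergence delta.
        if largest_move <= convergence_delta:
            break
    return centroids, assignments

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, assignments = kmeans(points, [[0, 0], [10, 10]])
print(centroids, assignments)
```

In the walkthrough the points would be the TF-IDF document vectors produced by `seq2sparse`, and each iteration would run as a MapReduce job instead of a Python loop.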
35. Clustering Walkthrough (a bit more)
Use bigrams: -ng 2
Prune low frequency: -s 10
Normalize: -n 2
Use a distance measure: -dm org.apache.mahout.common.distance.CosineDistanceMeasure
36. Clustering Walkthrough (viewing results)
bin/mahout clusterdump
-s cluster-output/clusters-9/part-00000
-d vector-output/dictionary.file-*
-dt sequencefile -n 5 -b 100
Top terms in a typical cluster
comic => 9.793121272867376
comics => 6.115341078151356
con => 5.015090566692931
sdcc => 3.927590843402978
webcomics => 2.916910980686997
37. Agenda
Introducing Mahout
Different classes of problems
And their Mahout based solutions
Basic data structure
Usage examples
Sneak peek at our Next Release
38. Mahout 0.4 (trunk)
New breed of classifiers:
- Stochastic Gradient Descent (SGD)
- Pegasos SVM (Order of magnitude faster than SVM Perf)
- LIBLINEAR (Winner, ICML 2008)
New Recommenders:
- Restricted Boltzmann Machine (RBM) based recommender
- SVD++ recommender
New Clustering algorithms:
- Spectral Clustering
- K-Means++
Full Hadoop 0.20 API compliance and performance improvements
39. Get Started
http://mahout.apache.org
dev@mahout.apache.org - Developer mailing list
user@mahout.apache.org - User mailing list
Check out the documentation and wiki for a quickstart
http://svn.apache.org/repos/asf/mahout/trunk/ Browse Code