2.
What is Machine Learning?
“Machine learning, a branch of artificial
intelligence, concerns the construction
and study of systems that can learn from
data.”
3.
Typical Use Cases
●
Recommend products/friends …
●
Classify content into predefined groups
●
Computer vision
●
Sentiment analysis/opinion mining
●
Find patterns in user behavior/actions
●
Identify key topics/summarize text
●
Detect anomalies/fraud
●
Rank search results
●
Speech and handwriting recognition
●
Natural language processing
4.
ML Algorithms (subset):
●
Supervised learning
–
Linear regression
–
Logistic regression
–
Support Vector Machines
–
Random Forests
●
Unsupervised learning
–
Clustering
–
Blind signal separation
–
Hidden Markov models
●
Semi-supervised learning
5.
Many ML libraries, frameworks
and tools:
●
Weka
●
scikit-learn (Python)
●
Pylearn/Pylearn2
●
Theano
●
Orange
●
SSBrain :)
●
More can be found here: http://mloss.org/software/
6.
Typical Workflow
●
Get data
●
Prepare data
●
Choose algorithm(s)
●
Run your algorithm(s)
●
Validate results
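The five workflow steps above can be sketched end to end in a few lines. This is only an illustration in plain Python, not Mahout code: the data set is synthetic, the chosen algorithm (simple least-squares linear regression) and the 80/20 split are assumptions made for the example, and `fit_line` is a hypothetical helper name.

```python
import random

# 1. Get data: synthetic (x, y) pairs, roughly y = 2x + 1 plus noise.
#    (Made-up data, for illustration only.)
random.seed(0)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(100)]

# 2. Prepare data: shuffle, then hold out 20% for validation.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


# 3. Choose an algorithm: simple linear regression (ordinary least squares).
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov / var
    return slope, mean_y - slope * mean_x


# 4. Run the algorithm on the training set.
slope, intercept = fit_line(train)

# 5. Validate results: mean absolute error on the held-out set.
mae = sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)
print(f"slope={slope:.2f} intercept={intercept:.2f} MAE={mae:.3f}")
```

In practice steps 1 and 2 usually take most of the effort, and step 5 often sends you back to step 2 or 3.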
7.
Every ML algorithm deals
with:
1. Data
2. Computation over this data
9.
What is Mahout?
●
Scalable ML library built on Hadoop, written in Java
●
Driven by the paper by Ng et al., “MapReduce for Machine Learning on
Multicore”
●
Started as a Lucene sub-project. Became an Apache TLP in April 2010
●
25 July 2013 - Apache Mahout 0.8 released
●
Taste Recommender Framework by Sean Owen was added in
2008
11.
When do you need Mahout?
Data Size | Task | Tools
Lines, Sample Data | Analysis and visualization | Whiteboard, bash, ...
KBs – low MBs, Prototype Data | Analysis and visualization | Octave, R, bash, ...
MBs – low GBs, Online Data | Storage | Databases (MySQL, PostgreSQL), ...
MBs – low GBs, Online Data | Analysis | NumPy, SciPy, BLAS, Weka
MBs – low GBs, Online Data | Visualization | Protovis, D3, ...
GBs – TBs – PBs, Big Data | Storage | HDFS, HBase, Cassandra, ...
GBs – TBs – PBs, Big Data | Analysis | Mahout, Hive, Pig, ...
table from Varad Meru
12.
Advantages
●
Community
●
Documentation and examples
●
Scalability
●
Apache license
●
Well tested
●
Built on existing production-quality
libraries
13.
Requirements
●
Java 1.6.x or greater
●
Maven 3.x to build the source code
●
Hadoop 0.20.0 or greater
16.
Algorithms
●
User-based and item-based recommenders
●
Matrix factorization based recommenders
●
K-Means, Fuzzy K-Means clustering
●
Latent Dirichlet Allocation
●
Singular value decomposition
●
Logistic regression based classifier
●
Complementary Naive Bayes classifier
●
Random forest decision tree based classifier
18.
Personalization level
●
Generic / Non-Personalized: everyone
receives same recommendations
●
Demographic: matches a target group
●
Ephemeral: matches current activity
●
Persistent: matches long-term interests
19.
Content based
●
User Ratings x Item Attributes => Model
●
Model applied to new items via attributes
●
Alternative: knowledge-based (item
attributes form a model of the item space)
●
Example: personalized news feeds
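A minimal sketch of the content-based idea above: ratings are combined with item attributes to build a per-user model, which is then applied to new items using only their attributes. Everything here is assumed for illustration: the article names, the three attributes, the ratings, and the helpers `build_profile` and `score`. The weighting scheme (mean-centered ratings, dot-product scoring) is one simple choice among many and is not taken from Mahout.

```python
# Hypothetical item attribute vectors: [sports, politics, technology].
item_attributes = {
    "article_a": [1.0, 0.0, 0.0],
    "article_b": [0.0, 1.0, 0.0],
    "article_c": [0.0, 0.0, 1.0],
}

# Hypothetical ratings by one user on a 1-5 scale.
user_ratings = {"article_a": 5.0, "article_b": 1.0, "article_c": 4.0}


def build_profile(ratings, attributes):
    """User ratings x item attributes => model (one weight per attribute)."""
    mean_rating = sum(ratings.values()) / len(ratings)
    size = len(next(iter(attributes.values())))
    profile = [0.0] * size
    for item, rating in ratings.items():
        # Center ratings so below-average items push attribute weights down.
        for i, value in enumerate(attributes[item]):
            profile[i] += (rating - mean_rating) * value
    return profile


def score(profile, item_vector):
    """Apply the model to an item using only its attributes (dot product)."""
    return sum(w * v for w, v in zip(profile, item_vector))


profile = build_profile(user_ratings, item_attributes)

# New, unrated items are scored purely from their attributes.
new_items = {
    "sports_tech_story": [1.0, 0.0, 1.0],
    "politics_story": [0.0, 1.0, 0.0],
}
ranked = sorted(
    new_items, key=lambda name: score(profile, new_items[name]), reverse=True
)
print(ranked)
```

Because scoring needs only attributes, brand-new items can be recommended before anyone has rated them, which is why this approach suits news feeds.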
28.
Evaluation
●
Average absolute difference
●
RMSE
●
Precision and recall
●
Precision is the proportion of top results that are relevant, for some
definition of relevant.
●
Recall is the proportion of all relevant results included in the top
results.
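The metrics above are short enough to write out directly. The sketch below is plain Python rather than Mahout's evaluator classes; the function names, the sample ratings, and the cutoff `k` are assumptions for the example, and "relevant" is simply whatever set of items you pass in.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error; penalizes large errors more than MAE does."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the top-k recommended items."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up ratings for illustration.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]
print(mean_absolute_error(predicted, actual))
print(rmse(predicted, actual))

# Made-up recommendation list and relevant set.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f", "g"}
print(precision_recall_at_k(recommended, relevant, k=3))
```

With these numbers, two of the top three recommendations are relevant, and those two hits cover half of the four relevant items.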
32.
Mahout Classification
Algorithms
●
Logistic regression (SGD) - model parameter
selection can be done in Hadoop
●
Naive Bayes - training runs on Hadoop
●
Random Forests - training is done in Hadoop
●
Hidden Markov Models - training is done in
Map-Reduce
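To make the first item in the list above concrete, here is a rough sketch of logistic regression trained with stochastic gradient descent (SGD). It is plain Python and does not use or mirror Mahout's API; the data set, the learning rate, the epoch count, and the function names `train_sgd` and `predict` are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, learning_rate=0.1, epochs=100, seed=0):
    """Fit weights (bias first) by updating on one example at a time."""
    rng = random.Random(seed)
    examples = list(examples)
    weights = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            x = [1.0] + list(features)  # prepend a constant bias input
            error = label - sigmoid(sum(w * v for w, v in zip(weights, x)))
            # Gradient step on the log-likelihood of this single example.
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
    return weights


def predict(weights, features):
    x = [1.0] + list(features)
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0


# Made-up, linearly separable data: the label is 1 when x1 + x2 > 1.
data_rng = random.Random(1)
points = [(data_rng.random(), data_rng.random()) for _ in range(200)]
examples = [((a, b), 1 if a + b > 1.0 else 0) for a, b in points]

weights = train_sgd(examples)
accuracy = sum(predict(weights, f) == y for f, y in examples) / len(examples)
print(f"training accuracy: {accuracy:.2f}")
```

Each update touches a single example, so training itself is sequential. That fits the slide's note that only model parameter selection is done in Hadoop: independent runs with different settings can be executed in parallel.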
33.
Resources
●
Mahout in Action
●
Apache Mahout Cookbook
●
Introduction to Apache Mahout
●
http://mahout.apache.org/