Machine Learning & Apache Mahout

Machine Learning
with Apache Mahout
Domingo Suarez Torres

Machine Learning (ML)
Introduction

Definition

• Machine learning, a branch of artificial
intelligence, is a scientific discipline
concerned with the design and
development of algorithms that allow
computers to evolve behaviors based on
empirical data. (1)

(1) http://en.wikipedia.org/wiki/Machine_learning

• “Machine Learning is programming
computers to optimize a performance
criterion using example data or past
experience”
• Introduction to Machine Learning, by E. Alpaydin

Applications
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Rank search results
• Fraud detection
• Spam detection
• Medical diagnostics
• Translators
• Much more!

Math

• Statistics
• Discrete Math
• Linear algebra
• Probability

Starting with ML
• Get your data
• Decide on your features per your algorithm
• Prep the data
• Different approaches for different algorithms
• Run your algorithm(s)
• Lather, rinse, repeat
• Validate your results
• Smell test, A/B testing
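
One way to make the "validate your results" step concrete is an offline hold-out check: hide part of the data, predict it, and measure the error. The sketch below is plain Python and is not part of Mahout; the helper names train_test_split and mean_absolute_error are hypothetical, and the numbers are made up for illustration. It complements A/B testing rather than replacing it.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into a training and a test part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between real and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

rows = list(range(8))                      # stand-in for 8 data records
train, test = train_test_split(rows)
print(len(train), len(test))               # 6 2
print(mean_absolute_error([5.0, 3.0], [4.0, 3.5]))  # 0.75
```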

Apache Mahout

• Machine Learning library. Platform?
• Extensible: we can plug in our own algorithms.
• Hadoop support
• 2005: Taste framework
• 2008: Included in Lucene

Scalability
• Huge amount of data, growing every second!
• Be as fast and efficient as possible given the intrinsic design of
the algorithm
• Some algorithms won’t scale to massive machine clusters
• Others fit logically on a MapReduce framework like
Apache Hadoop
• Still others will need alternative distributed programming
models
• Be pragmatic
• Most Mahout implementations are MapReduce enabled
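
To illustrate what "fits logically on a MapReduce framework" can mean, here is a minimal single-process sketch of the map and reduce phases, counting how often two items appear together in users' histories. It is plain Python, not the Hadoop or Mahout API; the function names and the sample data are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(user_items):
    """Map: for one user's items, emit ((item_a, item_b), 1) per item pair."""
    for pair in combinations(sorted(user_items), 2):
        yield pair, 1

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical input: user ID -> items the user interacted with.
baskets = {1: [101, 102], 2: [101, 102, 103], 3: [102, 103]}
emitted = [kv for items in baskets.values() for kv in map_phase(items)]
cooccurrence = reduce_phase(emitted)
print(cooccurrence[(101, 102)])  # 2
```

Each map call only looks at one user, and each reduce key can be summed independently, which is why this kind of computation can be spread across machines.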

Components

• Recommender Engines (collaborative
filtering, content-based)
• Clustering
• Classification

When to use?
• Recommendation
• Rank large datasets
• Clustering
• Group your data
• Classification
• Train me to think like you

Recommenders
• Given a data set, make a recommendation.
• Item recommendation (book, movie, etc.)
• Ranking based
• Recommendations
• User based
• Item based
• Requires knowledge of users’ relationships to items (user
preferences)

Collaborative filtering
• User based
• Item based
• Both techniques require no knowledge of
the properties of the items themselves.
• Item Type is irrelevant. Apache Mahout is
happy

Content based
• Domain-specific approaches
• Hard to meaningfully codify into a
framework
• We are responsible for choosing which
item attributes to use.
• Apache Mahout can’t handle this out of
the box, but it can be built on top.
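
As a rough idea of what a content-based similarity could look like once the attributes have been chosen, the sketch below compares items by the overlap of their attribute sets (Jaccard similarity). This is one possible technique picked for illustration, not something Mahout provides; the items and attributes are hypothetical.

```python
def jaccard(a, b):
    """Share of attributes two items have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical items described by attributes we decided to use.
items = {
    "book_a": {"fantasy", "dragons", "series"},
    "book_b": {"fantasy", "series", "magic"},
    "book_c": {"cooking", "vegetarian"},
}
print(jaccard(items["book_a"], items["book_b"]))  # 0.5
print(jaccard(items["book_a"], items["book_c"]))  # 0.0
```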

Making recommendations

• What do we need?
• Input data
• Neighborhood
• Similarity
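
The three ingredients above can be combined into a small user-based recommender. The sketch below is plain Python and does not use Mahout's classes; the sample preferences, the Euclidean-distance similarity, and the function names are assumptions chosen to keep the example short.

```python
import math

# Hypothetical input data: {user_id: {item_id: preference value}}
prefs = {
    1: {101: 5.0, 102: 3.0, 103: 2.5},
    2: {101: 2.0, 102: 2.5, 103: 5.0, 104: 2.0},
    3: {101: 5.0, 102: 3.5, 104: 4.5},
    4: {101: 4.5, 103: 3.0, 104: 4.0, 105: 5.0},
}

def euclidean_similarity(a, b):
    """Similarity in (0, 1] based on the distance over co-rated items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)

def recommend(prefs, user, n_neighbors=2, how_many=1):
    # Similarity: score every other user against the target user.
    sims = {other: euclidean_similarity(prefs[user], prefs[other])
            for other in prefs if other != user}
    # Neighborhood: keep the n most similar users.
    neighbors = sorted(sims, key=sims.get, reverse=True)[:n_neighbors]
    # Estimate each unseen item as a similarity-weighted average.
    totals, weights = {}, {}
    for other in neighbors:
        for item, value in prefs[other].items():
            if item in prefs[user]:
                continue  # the user already has a preference for it
            totals[item] = totals.get(item, 0.0) + sims[other] * value
            weights[item] = weights.get(item, 0.0) + sims[other]
    estimates = {item: totals[item] / weights[item]
                 for item in totals if weights[item] > 0}
    ranked = sorted(estimates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:how_many]

print(recommend(prefs, user=1))  # item 105 ranks first for user 1
```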

Input Data
• In Mahout terms: Preferences
• A preference contains:
• User ID
• Item ID
• Preference value
• Example:
• 1,101,5.0
• User ID: 1, Item ID: 101, Preference value: 5.0
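
A minimal sketch of reading preferences in this "user ID, item ID, value" format, in plain Python rather than through Mahout. Only the first sample row comes from the slide; the other rows and the function name parse_preferences are made up for illustration.

```python
from collections import defaultdict

def parse_preferences(lines):
    """Parse 'userID,itemID,value' lines into {user_id: {item_id: value}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id, value = line.split(",")
        prefs[int(user_id)][int(item_id)] = float(value)
    return dict(prefs)

sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
prefs = parse_preferences(sample)
print(prefs[1][101])  # 5.0
```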

Neighborhood
• Nearest N users
• Threshold
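
The two neighborhood styles can be sketched as follows, assuming we already have a similarity score between the target user and every other user. This is plain Python with hypothetical function names and made-up scores, not Mahout's own neighborhood classes.

```python
def nearest_n_users(similarities, n):
    """Keep the n users with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def threshold_users(similarities, threshold):
    """Keep every user whose similarity is at least the threshold."""
    return [user for user, sim in similarities.items() if sim >= threshold]

# Hypothetical similarities of users 2..5 to the target user.
similarities = {2: 0.9, 3: 0.4, 4: 0.75, 5: -0.2}
print(nearest_n_users(similarities, 2))    # [2, 4]
print(threshold_users(similarities, 0.7))  # [2, 4]
```

Nearest-N always returns a fixed number of neighbors, even poor ones; a threshold returns only good matches, but possibly none.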

Clustering

• Surface naturally occurring groups of data
• A notion of similarity (and dissimilarity)
• Algorithms do not require training
• Stopping condition - iterate until close
enough
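
To show what "iterate until close enough" can look like, here is a minimal k-means sketch in plain Python (3.8+ for math.dist). It is not Mahout's implementation; the 2-D points, the starting centroids, and the tolerance are assumptions for the example.

```python
import math

def kmeans(points, centroids, tolerance=1e-6, max_iterations=100):
    """Plain k-means on 2-D points; stops once centroids barely move."""
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for centroid, cluster in zip(centroids, clusters):
            if cluster:
                xs, ys = zip(*cluster)
                new_centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                new_centroids.append(centroid)  # empty cluster: keep as is
        # Stopping condition: largest centroid movement below the tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tolerance:
            break
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

No labeled examples are needed: the groups emerge from the distance between points alone.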

Clustering
• Document level
• Group documents based on a notion of similarity
• K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
• Distance Measures
• Manhattan, Euclidean, other
• Topic Modeling
• Cluster words across documents to identify topics
• Latent Dirichlet Allocation
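
The two distance measures named above differ only in how they add up the per-coordinate differences. A minimal sketch in plain Python, with function names chosen for the example:

```python
import math

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = (0.0, 0.0), (3.0, 4.0)
print(manhattan(a, b))  # 7.0
print(euclidean(a, b))  # 5.0
```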

Classification

• Requires training (supervised)
• Make a single decision with a very limited
set of outcomes
• Typical answers naturally fit into categories
Classification samples

• Credit card fraud prediction
• Customer attrition
• Diabetes detector
• Search Engine

Mahout/Hadoop
• For large data sets
• Online
• Offline (Hadoop preferred)
• You can build your solution with Mahout
• Take a look into Weka
• http://www.cs.waikato.ac.nz/ml/weka/

Join us!
• GIAMA
• An initiative by Agustin Ramos
