Machine Learning is a branch of artificial intelligence that lets us apply algorithms to data in order to determine behavior, patterns, preferences, and so on.
Apache Mahout is an open-source library that implements a variety of Machine Learning algorithms, which can be used, for example, to build a recommendation engine that guides purchases.
3. Definition
• Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data (1)
(1) http://en.wikipedia.org/wiki/Machine_learning
4.
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
• Introduction to Machine Learning, by E. Alpaydin
5. Applications
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Ranking search results
• Fraud detection
• Spam detection
• Medical diagnostics
• Translators
• Much more!
8. Starting with ML
• Get your data
• Decide on your features per your algorithm
• Prep the data
• Different approaches for different algorithms
• Run your algorithm(s)
• Lather, rinse, repeat
• Validate your results
• Smell test, A/B testing
9. Apache Mahout
• Machine Learning library. Platform?
• Extensible: we can plug in our own algorithms.
• Hadoop support
• 2005: Taste framework
• 2008: Mahout becomes a subproject of Apache Lucene
10. Scalability
• Huge amount of data, growing every second!
• Be as fast and efficient as possible given the intrinsic design of the algorithm
• Some algorithms won’t scale to massive machine clusters
• Others fit logically on a MapReduce framework like Apache Hadoop
• Still others will need alternative distributed programming models
• Be pragmatic
• Most Mahout implementations are MapReduce-enabled
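To illustrate what "fits logically on a MapReduce framework" means, here is a minimal single-process sketch in plain Python. It is not Hadoop or Mahout code: the names `map_phase` and `reduce_phase` and the sample data are hypothetical. It counts how often two items appear in the same user's history, a typical building block for item-based recommenders.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(user_items):
    """Emit ((item_a, item_b), 1) for every pair of items one user touched."""
    for items in user_items.values():
        for a, b in combinations(sorted(set(items)), 2):
            yield (a, b), 1

def reduce_phase(pairs):
    """Sum the counts emitted for each item pair."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical data: user ID -> item IDs the user interacted with.
user_items = {1: [101, 102, 103], 2: [101, 102], 3: [102, 103]}
cooccurrence = reduce_phase(map_phase(user_items))
```

Each user's record is mapped independently and the reduce step only needs all values for one key, which is why this kind of computation can be spread across many machines.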
13. When to use?
• Recommendation
• Rank large datasets
• Clustering
• Group your data
• Classification
• Train me to think like you
14. Recommenders
• Given a data set, make a recommendation.
• Item recommendation (book, movie, etc.)
• Ranking based
• Recommendations
• User based
• Item based
• Require knowledge of users' relationships to items (user preferences)
16. Collaborative filtering
• User based
• Item based
• Both techniques require no knowledge of the properties of the items themselves.
• Item type is irrelevant. Apache Mahout is happy with any kind of item.
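A minimal sketch of the user-based variant, in plain Python rather than Mahout's API. The similarity measure (1 / (1 + Euclidean distance) over co-rated items), the function names, and the sample preferences are assumptions chosen for brevity. Note that the code only ever sees IDs and preference values, never item properties.

```python
from math import sqrt

def similarity(prefs_a, prefs_b):
    """Similarity in (0, 1] from Euclidean distance over co-rated items; 0 if none."""
    shared = set(prefs_a) & set(prefs_b)
    if not shared:
        return 0.0
    dist = sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + dist)

def recommend(prefs, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for other, other_prefs in prefs.items():
        if other == user:
            continue
        sim = similarity(prefs[user], other_prefs)
        if sim <= 0:
            continue
        for item, value in other_prefs.items():
            if item not in prefs[user]:
                totals[item] = totals.get(item, 0.0) + sim * value
                weights[item] = weights.get(item, 0.0) + sim
    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:top_n]]

# Hypothetical preferences: user ID -> {item ID: preference value}.
prefs = {
    1: {101: 5.0, 102: 3.0},
    2: {101: 5.0, 102: 3.0, 103: 4.0},
    3: {101: 1.0, 102: 5.0, 103: 1.0, 104: 2.0},
}
recommendations = recommend(prefs, user=1)
```

User 2 agrees with user 1 on every shared item, so user 2's opinion of item 103 dominates its estimated score. An item-based recommender would instead compare items by the users who rated them.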
18. Content based
• Domain-specific approaches
• Hard to meaningfully codify into a framework
• We are responsible for choosing which item attributes to use.
• Apache Mahout can’t handle this out of the box, but it can be built on top of Mahout.
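As a rough illustration of a content-based approach, the sketch below compares items by the attributes we decided to attach to them. It is plain Python, not Mahout code; the Jaccard measure, the `most_similar` helper, and the catalog are hypothetical choices.

```python
def jaccard(a, b):
    """Share of attributes two items have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(catalog, item_id, top_n=2):
    """Rank the other catalog items by attribute overlap with item_id."""
    target = catalog[item_id]
    scored = [(jaccard(target, attrs), other)
              for other, attrs in catalog.items() if other != item_id]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scored[:top_n]]

# Hypothetical catalog: item ID -> attributes we chose to describe it.
catalog = {
    101: {"sci-fi", "space", "novel"},
    102: {"sci-fi", "space", "film"},
    103: {"cooking", "novel"},
}
similar = most_similar(catalog, 101)
```

The result depends entirely on which attributes were picked, which is the domain-specific part that is hard to generalize into a framework.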
20. Input Data
• In Mahout terms: Preferences
• A preference contains:
• User ID
• Item ID
• Preference value
• Example:
• 1,101,5.0
• USER ID: 1, ITEM ID: 101, PrefValue: 5.0
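A small sketch of reading preference rows in the `userID,itemID,value` layout shown above into a nested dictionary. This is plain Python for illustration; `load_preferences` is a hypothetical helper, not part of Mahout.

```python
import csv
from collections import defaultdict
from io import StringIO

def load_preferences(lines):
    """Parse 'userID,itemID,value' rows into {user_id: {item_id: value}}."""
    prefs = defaultdict(dict)
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, item_id, value = int(row[0]), int(row[1]), float(row[2])
        prefs[user_id][item_id] = value
    return dict(prefs)

# Sample rows; the first one is the example from the slide.
sample = StringIO("1,101,5.0\n1,102,3.0\n2,101,2.0\n")
prefs = load_preferences(sample)
```

Any iterable of text lines works as input, so an open file object can be passed in place of the `StringIO` sample.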
25. Clustering
• Surface naturally occurring groups of data
• A notion of similarity (and dissimilarity)
• Algorithms do not require training (unsupervised)
• Stopping condition: iterate until close enough
26. Clustering
• Document level
• Group documents based on a notion of similarity
• K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
• Distance Measures
• Manhattan, Euclidean, other
• Topic Modeling
• Cluster words across documents to identify topics
• Latent Dirichlet Allocation
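The sketch below shows K-Means with a pluggable distance measure and an "iterate until close enough" stopping condition. It is a plain-Python illustration under simplifying assumptions (random initial centroids, mean-based updates even when Manhattan distance is used), not Mahout's implementation. The function and parameter names are hypothetical.

```python
import random
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, k, distance=euclidean, max_iter=100, tol=1e-6, seed=0):
    """Assign points to the nearest centroid, recompute centroids, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        shift = max(distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # stopping condition: centroids barely moved
            break
    return centroids, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

Passing `distance=manhattan` swaps the similarity notion without touching the rest of the loop. No labeled training data is involved at any point.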
27. Classification
• Requires training (supervised)
• Make a single decision with a very limited set of outcomes
• Typical answers naturally fit into categories
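To show what "requires training" looks like in practice, here is a minimal naive Bayes text classifier with add-one smoothing, written in plain Python. It is a sketch under simple assumptions (whitespace tokenization, a tiny hypothetical training set) and does not reflect Mahout's API.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """examples: iterable of (label, text). Returns the counts the model needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            count = word_counts[label][word]
            score += log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled examples: the "training" a classifier requires.
training = [
    ("spam", "win money now"),
    ("spam", "cheap money offer"),
    ("ham", "meeting schedule for monday"),
    ("ham", "project status meeting"),
]
model = train(training)
```

Calling `classify(model, "win cheap money")` picks one label from the small, fixed set seen during training, which is the "single decision with a very limited set of outcomes" described above.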
29. Mahout/Hadoop
• For large data sets
• Online
• Offline (Hadoop preferred)
• You can build your solution with Mahout
• Take a look at Weka
• http://www.cs.waikato.ac.nz/ml/weka/