Machine Learning & Apache Mahout



Machine learning is a branch of artificial intelligence that lets us apply algorithms to data in order to determine behavior, patterns, preferences, and so on.

Apache Mahout is an open-source library that implements a variety of machine learning algorithms, which can be used, for example, to build a recommendation engine that guides purchases.

Published in: Technology, Education


  1. Machine Learning with Apache Mahout
     Domingo Suarez Torres
  2. Machine Learning (ML) Introduction
  3. Definition
     • Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data (1).
  4. “Machine Learning is programming computers to optimize a performance criterion using example data or past experience.” (Introduction to Machine Learning, E. Alpaydin)
  5. Applications
     • Recommend friends/dates/products
     • Classify content into predefined groups
     • Find similar content based on object properties
     • Find associations/patterns in actions/behaviors
     • Identify key topics in large collections of text
     • Detect anomalies in machine output
     • Rank search results
     • Fraud detection
     • Spam detection
     • Medical diagnostics
     • Translators
     • Much more!
  6. Math
     • Statistics
     • Discrete math
     • Linear algebra
     • Probability
  7. Starting with ML
     • Get your data
     • Decide on your features per your algorithm
     • Prep the data
       • Different approaches for different algorithms
     • Run your algorithm(s)
       • Lather, rinse, repeat
     • Validate your results
       • Smell test, A/B testing
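
The "validate your results" step can be sketched as a simple holdout check. This is a minimal illustration in plain Python, not part of Mahout: the function names (`split_holdout`, `item_means`, `rmse`) and the per-item-mean baseline are assumptions chosen for the example, and the preference triples are made up.

```python
import math
import random

def split_holdout(prefs, test_fraction=0.25, seed=42):
    """Shuffle (user, item, value) triples and split them into train and test lists."""
    shuffled = list(prefs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def item_means(train):
    """Fit a baseline model: the mean value per item, plus the global mean."""
    sums, counts = {}, {}
    for _, item, value in train:
        sums[item] = sums.get(item, 0.0) + value
        counts[item] = counts.get(item, 0) + 1
    global_mean = sum(value for _, _, value in train) / len(train)
    return {item: sums[item] / counts[item] for item in sums}, global_mean

def rmse(model, test):
    """Root mean squared error of the baseline on held-out triples."""
    means, global_mean = model
    # Items never seen in training fall back to the global mean.
    errors = [(means.get(item, global_mean) - value) ** 2 for _, item, value in test]
    return math.sqrt(sum(errors) / len(errors))

# Made-up data: item 101 averages 3.0 in training, the held-out value is 5.0.
train = [(1, 101, 4.0), (2, 101, 2.0), (1, 102, 5.0)]
test = [(3, 101, 5.0)]
print(rmse(item_means(train), test))  # 2.0
```

A lower error on held-out data is only one signal; the slide's smell test and A/B testing cover what an offline number cannot.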
  8. Apache Mahout
     • Machine learning library. Platform?
     • Extensible: we can plug in our own algorithms
     • Hadoop support
     • 2005: Taste framework
     • 2008: Included in Lucene
  9. Scalability
     • Huge amounts of data, growing every second!
     • Be as fast and efficient as possible given the intrinsic design of the algorithm
       • Some algorithms won’t scale to massive machine clusters
       • Others fit logically on a MapReduce framework like Apache Hadoop
       • Still others will need alternative distributed programming models
       • Be pragmatic
     • Most Mahout implementations are MapReduce enabled
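
To illustrate what "fits logically on a MapReduce framework" means, here is a single-process sketch that counts item co-occurrences as a map step followed by a reduce step. It does not use Hadoop or Mahout; the function names and the sample data are hypothetical, and a real job would distribute both steps across machines.

```python
from collections import defaultdict
from itertools import combinations

def map_user(user_items):
    """Map step: emit ((item_a, item_b), 1) for every pair of items one user touched."""
    for a, b in combinations(sorted(user_items), 2):
        yield (a, b), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def cooccurrence(items_by_user):
    """Run the map step per user, then reduce all emitted pairs together."""
    emitted = (pair for items in items_by_user.values() for pair in map_user(items))
    return reduce_counts(emitted)

# Made-up data: which items each user has a preference for.
counts = cooccurrence({1: [101, 102, 103], 2: [101, 102], 3: [102, 103]})
print(counts[(101, 102)])  # 2
```

Each user's record is mapped independently and the reduce step only needs all values for one key, which is why this kind of computation parallelizes well.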
  10. Who uses Mahout?
  11. Components
      • Recommender engines (collaborative filtering, content-based)
      • Clustering
      • Classification
  12. When to use?
      • Recommendation
        • Rank large datasets
      • Clustering
        • Group your data
      • Classification
        • Train me to think like you
  13. Recommenders
      • Given a data set, make a recommendation
        • Item recommendation (book, movie, etc.)
      • Ranking based
      • Recommendations
        • User based
        • Item based
      • Knowledge of users’ relationships to items (user preferences)
  14. Collaborative filtering
      • User based
      • Item based
      • Both techniques require no knowledge of the properties of the items themselves
      • Item type is irrelevant; Apache Mahout is happy
  16. Content based
      • Domain-specific approaches
      • Hard to meaningfully codify into a framework
      • We are responsible for choosing which item attributes to use
      • Apache Mahout can’t handle this out of the box, but a solution can be built on top of it
  17. Making recommendations
      • What do we need?
        • Input data
        • Neighborhood
        • Similarity
  18. Input data
      • In Mahout terms: preferences
      • A preference contains:
        • User ID
        • Item ID
        • Preference value
      • Example: 1,101,5.0
        • User ID: 1, item ID: 101, preference value: 5.0
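
The preference format on this slide (user ID, item ID, preference value, comma-separated) can be read with a few lines of plain Python. This is an illustrative sketch, not Mahout's own loader; `parse_preferences` and the sample lines beyond `1,101,5.0` are assumptions made for the example.

```python
from collections import defaultdict

def parse_preferences(lines):
    """Parse 'userID,itemID,value' lines into {user_id: {item_id: value}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id, value = line.split(",")
        prefs[int(user_id)][int(item_id)] = float(value)
    return dict(prefs)

# First line is the slide's example; the other two are made up.
sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
prefs = parse_preferences(sample)
print(prefs[1][101])  # 5.0
```

Keying by user first makes user-based lookups cheap; an item-based approach would more naturally invert the nesting.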
  20. Neighborhood
      • Nearest N users
      • Threshold
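
The two neighborhood strategies named on this slide differ only in how they cut the list of similar users. The sketch below assumes similarity scores to the target user have already been computed; the function names and the scores are hypothetical and this is not Mahout's API.

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def within_threshold(similarities, threshold):
    """Keep every user whose similarity is at least `threshold`."""
    return [user for user, score in similarities.items() if score >= threshold]

# Made-up similarity scores of users 2..5 to some target user.
sims = {2: 0.9, 3: 0.1, 4: 0.75, 5: -0.4}
print(nearest_n(sims, 2))           # [2, 4]
print(within_threshold(sims, 0.7))  # [2, 4]
```

Nearest-N always returns a fixed-size neighborhood even when the neighbors are poor matches, while a threshold guarantees quality but may return very few users, or none.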
  21. Similarity
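
The slide does not say which similarity measure is meant. As one common choice for user-based collaborative filtering, here is a minimal sketch of Pearson correlation over the items two users have both rated. The function name, the zero fallback for too little overlap, and the sample ratings are assumptions for illustration.

```python
import math

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation between two users over their co-rated items.

    Each argument maps item_id -> preference value. Returns 0.0 when there
    are fewer than two common items or when either user's values are constant.
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Made-up ratings: user B rates every item exactly twice as high as user A.
user_a = {101: 1.0, 102: 2.0, 103: 3.0}
user_b = {101: 2.0, 102: 4.0, 103: 6.0}
print(pearson_similarity(user_a, user_b))  # 1.0
```

Because Pearson correlation subtracts each user's mean, it treats a harsh rater and a generous rater with the same tastes as similar.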
  22. Clustering
      • Surface naturally occurring groups of data
      • A notion of similarity (and dissimilarity)
      • Algorithms do not require training
      • Stopping condition: iterate until close enough
  23. Clustering
      • Document level
        • Group documents based on a notion of similarity
        • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
        • Distance measures
          • Manhattan, Euclidean, other
      • Topic modeling
        • Cluster words across documents to identify topics
        • Latent Dirichlet Allocation
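
As a concrete picture of K-Means with a pluggable distance measure, here is a small single-machine sketch in plain Python. It is not Mahout's implementation: the function names, the fixed initial centroids, and the sample points are assumptions, and the stopping condition used is "no point changed cluster".

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, centroids, distance=euclidean, max_iterations=100):
    """K-means from given initial centroids; returns (centroids, assignments)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # stopping condition: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        # (The mean is the standard k-means update; with Manhattan distance
        # it is only a heuristic.)
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Made-up 2-D points forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = k_means(points, [(1, 1), (8, 8)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
```

Passing `distance=manhattan` swaps the measure without touching the loop, which mirrors the slide's point that the distance measure is a separate choice from the algorithm.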
  24. Classification
      • Requires training (supervised)
      • Makes a single decision with a very limited set of outcomes
      • Typical answers naturally fit into categories
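
To show what "requires training" looks like in practice, here is a deliberately tiny supervised classifier: a nearest-centroid model in plain Python. It is an illustrative stand-in, not one of Mahout's classifiers; the function names, the two features (counts of links and exclamation marks), and the labeled examples are all made up.

```python
import math
from collections import defaultdict

def train(examples):
    """Learn one centroid (mean feature vector) per label from (features, label) pairs."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(dim) / len(rows) for dim in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Return the label whose centroid is closest (Euclidean) to `features`."""
    return min(model, key=lambda label: math.dist(features, model[label]))

# Hypothetical features per message: (number of links, number of exclamation marks).
examples = [
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((0, 1), "ham"),
    ((1, 0), "ham"),
]
model = train(examples)
print(classify(model, (6, 4)))  # spam
print(classify(model, (1, 1)))  # ham
```

The output is always one of the labels seen during training, which is the "very limited set of outcomes" the slide refers to.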
  25. Classification samples
      • Credit card fraud prediction
      • Customer attrition
      • Diabetes detector
      • Search engine
  26. Mahout/Hadoop
      • For large data sets
      • Online
      • Offline (Hadoop preferred)
      • You can build your solution with Mahout
      • Take a look at Weka
  27. Resources
  30. Join us!
      • GIAMA
      • An initiative by Agustin Ramos