Machine Learning & Apache Mahout


Machine learning is a branch of artificial intelligence that lets us use algorithms that operate on data to determine behavior, patterns, preferences, and so on.

Apache Mahout is an open-source library that implements a variety of machine learning algorithms, which can be used, for example, to build a recommendation engine that guides purchases.


Transcript

  • 1. Machine Learning with Apache Mahout. Domingo Suarez Torres
  • 2. Machine Learning (ML) Introduction
  • 3. Definition
    • Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data (1)
    • (1) http://en.wikipedia.org/wiki/Machine_learning
  • 4. “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” (Introduction to Machine Learning, E. Alpaydin)
  • 5. Applications
    • Recommend friends/dates/products
    • Classify content into predefined groups
    • Find similar content based on object properties
    • Find associations/patterns in actions/behaviors
    • Identify key topics in large collections of text
    • Detect anomalies in machine output
    • Ranking search results
    • Fraud detection
    • Spam detection
    • Medical diagnostics
    • Translators
    • Much more!
  • 6. Math
    • Statistics
    • Discrete math
    • Linear algebra
    • Probability
  • 7. Starting with ML
    • Get your data
    • Decide on your features per your algorithm
    • Prep the data
      • Different approaches for different algorithms
    • Run your algorithm(s)
      • Lather, rinse, repeat
    • Validate your results
      • Smell test, A/B testing
  • 8. Apache Mahout
    • Machine learning library. Platform?
    • Extensible: we can plug in our own algorithms.
    • Hadoop support
    • 2005: Taste framework
    • 2008: Included in Lucene
  • 9. Scalability
    • Huge amount of data, growing every second!
    • Be as fast and efficient as possible given the intrinsic design of the algorithm
      • Some algorithms won’t scale to massive machine clusters
      • Others fit logically on a MapReduce framework like Apache Hadoop
      • Still others will need alternative distributed programming models
      • Be pragmatic
    • Most Mahout implementations are MapReduce enabled
  • 10. Who uses Mahout?
  • 11. Components
    • Recommender engines (collaborative filtering, content-based)
    • Clustering
    • Classification
  • 12. When to use?
    • Recommendation
      • Rank large datasets
    • Clustering
      • Group your data
    • Classification
      • Train me to think like you
  • 13. Recommenders
    • Given a data set, make a recommendation.
      • Item recommendation (book, movie, etc.)
    • Ranking based
    • Recommendations
      • User based
      • Item based
    • Uses knowledge of users’ relationships to items (user preferences)
  • 14. Collaborative filtering
    • User based
    • Item based
    • Both techniques require no knowledge of the properties of the items themselves.
    • Item type is irrelevant. Apache Mahout is happy
  • 16. Content based
    • Domain-specific approaches
    • Hard to meaningfully codify into a framework
    • We are responsible for choosing which item attributes to use.
    • Apache Mahout can’t handle this out of the box, but it can be built on top.
  • 17. Making recommendations
    • What do we need?
      • Input data
      • Neighborhood
      • Similarity
  • 18. Input Data
    • In Mahout terms: Preferences
    • A preference contains:
      • User ID
      • Item ID
      • Preference value
    • Example:
      • 1,101,5.0
      • User ID: 1, Item ID: 101, Preference value: 5.0
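As a rough illustration of the preference format above, the following Python sketch parses `userID,itemID,value` lines into a nested dictionary. It is plain Python, not Mahout's API; the function name `parse_preferences` and every sample line other than `1,101,5.0` are assumptions made up for this example.

```python
from collections import defaultdict


def parse_preferences(lines):
    """Parse 'userID,itemID,value' lines into {user_id: {item_id: value}}.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, item_id, value = line.split(",")
        prefs[int(user_id)][int(item_id)] = float(value)
    return dict(prefs)


# Only the first line comes from the slide; the others are invented samples.
sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
prefs = parse_preferences(sample)
print(prefs[1][101])  # 5.0
```

Keying by user first makes it cheap to look up everything a single user has rated, which is what a user-based recommender needs most often.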
  • 20. Neighborhood
    • Nearest N users
    • Threshold
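The two neighborhood strategies named on this slide can be sketched in a few lines of Python. This assumes the similarities between a target user and every other user are already computed and stored in a dictionary; the function names and the sample scores are hypothetical, and this is not Mahout code.

```python
def nearest_n_neighborhood(similarities, n):
    """Return the n user IDs with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [user_id for user_id, _ in ranked[:n]]


def threshold_neighborhood(similarities, threshold):
    """Return every user ID whose similarity is at least `threshold`."""
    return [user_id for user_id, score in similarities.items() if score >= threshold]


# Invented similarity scores between user 1 and users 2..5.
sims_to_user_1 = {2: 0.9, 3: 0.4, 4: 0.75, 5: -0.2}
print(nearest_n_neighborhood(sims_to_user_1, 2))    # [2, 4]
print(threshold_neighborhood(sims_to_user_1, 0.7))  # [2, 4]
```

A fixed N always yields neighbors, even poorly matched ones, while a threshold guarantees quality but may return an empty neighborhood for unusual users.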
  • 21. Similarity
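The slide does not say which similarity measure it has in mind. As one common choice, here is a minimal Python sketch of Pearson correlation between two users, computed over the items both have rated. The function name, the sample ratings, and the decision to return 0.0 when the correlation is undefined are all assumptions of this sketch.

```python
from math import sqrt


def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items rated by both users.

    prefs_a and prefs_b map item IDs to preference values.
    Returns 0.0 when the correlation is undefined (fewer than two
    shared items, or no variation in one user's shared ratings).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return 0.0
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / sqrt(var_x * var_y)


# Invented ratings: user B rates everything exactly twice as high as user A.
user_a = {101: 1.0, 102: 2.0, 103: 3.0}
user_b = {101: 2.0, 102: 4.0, 103: 6.0, 104: 1.0}
print(round(pearson_similarity(user_a, user_b), 6))  # 1.0
```

Pearson correlation ignores differences in rating scale between users, so a harsh rater and a generous rater with the same taste still come out as similar.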
  • 22. Clustering
    • Surface naturally occurring groups of data
    • A notion of similarity (and dissimilarity)
    • Algorithms do not require training
    • Stopping condition: iterate until close enough
  • 23. Clustering
    • Document level
      • Group documents based on a notion of similarity
      • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
      • Distance measures
        • Manhattan, Euclidean, other
    • Topic modeling
      • Cluster words across documents to identify topics
      • Latent Dirichlet Allocation
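To make the K-Means and distance-measure bullets concrete, here is a small single-machine Python sketch with a pluggable distance function and an "iterate until close enough" stopping condition. It is not Mahout's implementation; the function names, the default tolerance, and the sample points are assumptions for illustration.

```python
def euclidean(a, b):
    """Straight-line distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def k_means(points, initial_centroids, distance=euclidean,
            max_iterations=100, tolerance=1e-9):
    """Cluster points around len(initial_centroids) centroids.

    Stops when no centroid moves more than `tolerance`, or after
    `max_iterations` rounds. Returns (centroids, clusters), where
    clusters[i] holds the points last assigned to centroid i.
    """
    centroids = [tuple(c) for c in initial_centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        shift = max(distance(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tolerance:
            break  # close enough: centroids have stopped moving
    return centroids, clusters


# Invented 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(points, [(1.0, 1.0), (9.0, 9.0)])
print([len(c) for c in clusters])  # [3, 3]
```

Passing `distance=manhattan` changes only how points are assigned; the centroid update still uses the mean, which is a simplification of this sketch.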
  • 24. Classification
    • Requires training (supervised)
    • Makes a single decision with a very limited set of outcomes
    • Typical answers naturally fit into categories
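The "train first, then pick one of a few outcomes" idea can be shown with a deliberately tiny classifier. The slide names no algorithm, so this Python sketch swaps in a nearest-centroid classifier purely for illustration; the function names, labels, and training data are invented, and none of this is Mahout code.

```python
def train(examples):
    """Learn one centroid (mean feature vector) per label.

    `examples` is a list of (features, label) pairs; this is the
    supervised step, since every example carries its known label.
    """
    sums = {}
    counts = {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def classify(model, features):
    """Return the label whose centroid is closest to `features`."""
    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


# Invented labeled examples with two possible outcomes.
training = [
    ((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
    ((4.0, 4.2), "high"), ((3.8, 4.0), "high"),
]
model = train(training)
print(classify(model, (1.1, 0.9)))  # low
print(classify(model, (4.1, 3.9)))  # high
```

Unlike clustering, the set of possible answers is fixed by the labels seen during training, so the classifier can never invent a new category.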
  • 25. Classification samples
    • Credit card fraud prediction
    • Customer attrition
    • Diabetes detector
    • Search engine
  • 26. Mahout/Hadoop
    • For large data sets
    • Online
    • Offline (Hadoop preferred)
    • You can build your solution with Mahout
    • Take a look at Weka
      • http://www.cs.waikato.ac.nz/ml/weka/
  • 27. Resources
  • 28. Resources
  • 29. Resources
  • 30. Join us!
    • GIAMA
      • An Agustin Ramos initiative