Machine Learning & Apache Mahout

Machine learning is a branch of artificial intelligence that lets us apply algorithms to data in order to determine behavior, patterns, preferences, and so on.

Apache Mahout is an open-source library that implements a variety of machine learning algorithms, which can be used, for example, to build a recommendation engine that guides purchases.

Published in: Technology, Education

    1. Machine Learning with Apache Mahout
       Domingo Suarez Torres
    2. Machine Learning (ML): Introduction
    3. Definition
       • Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data (1)
       (1) http://en.wikipedia.org/wiki/Machine_learning
    4. Definition (continued)
       • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
         • Introduction to Machine Learning, by E. Alpaydin
    5. Applications
       • Recommend friends/dates/products
       • Classify content into predefined groups
       • Find similar content based on object properties
       • Find associations/patterns in actions/behaviors
       • Identify key topics in large collections of text
       • Detect anomalies in machine output
       • Rank search results
       • Fraud detection
       • Spam detection
       • Medical diagnostics
       • Translators
       • Much more!
    6. Math
       • Statistics
       • Discrete math
       • Linear algebra
       • Probability
    7. Starting with ML
       • Get your data
       • Decide on your features per your algorithm
       • Prep the data
         • Different approaches for different algorithms
       • Run your algorithm(s)
         • Lather, rinse, repeat
       • Validate your results
         • Smell test, A/B testing
    8. Apache Mahout
       • Machine learning library. Platform?
       • Extensible: we can use our own algorithms
       • Hadoop support
       • 2005: Taste framework
       • 2008: Included in Lucene
    9. Scalability
       • Huge amounts of data, growing every second!
       • Be as fast and efficient as possible given the intrinsic design of the algorithm
         • Some algorithms won’t scale to massive machine clusters
         • Others fit logically on a MapReduce framework like Apache Hadoop
         • Still others will need alternative distributed programming models
         • Be pragmatic
       • Most Mahout implementations are MapReduce enabled
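
To make the "fits logically on a MapReduce framework" point from the Scalability slide concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop or Mahout code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are assumptions made for illustration. It computes the average preference value per item from `user,item,value` records.

```python
from collections import defaultdict

def map_phase(record):
    """Emit one (item_id, preference_value) pair per 'user,item,value' record."""
    _user_id, item_id, value = record.split(",")
    yield item_id, float(value)

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(item_id, values):
    """Combine all values seen for one key into a single result."""
    return item_id, sum(values) / len(values)

# Hypothetical sample data, not taken from the slides.
records = ["1,101,5.0", "1,102,3.0", "2,101,2.0", "2,103,4.0", "3,101,5.0"]
pairs = [pair for record in records for pair in map_phase(record)]
averages = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(averages)  # {'101': 4.0, '102': 3.0, '103': 4.0}
```

Because each record is mapped independently and each key is reduced independently, both phases could in principle run on many machines at once, which is what makes this shape of computation a natural fit for a cluster.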
    10. Who uses Mahout?
    11. Components
        • Recommender engines (collaborative filtering, content-based)
        • Clustering
        • Classification
    12. When to use?
        • Recommendation
          • Rank large datasets
        • Clustering
          • Group your data
        • Classification
          • Train me to think like you
    13. Recommenders
        • Given a data set, make a recommendation
          • Item recommendation (book, movie, etc.)
        • Ranking based
        • Recommendations
          • User based
          • Item based
        • Knowledge of users’ relationships to items (user preferences)
    14. Collaborative filtering
        • User based
        • Item based
        • Both techniques require no knowledge of the properties of the items themselves
        • Item type is irrelevant; Apache Mahout is happy
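
The claim that collaborative filtering needs no knowledge of item properties can be illustrated with a small item-based sketch in plain Python. This is a simplified co-occurrence approach chosen for brevity, not Mahout's implementation; the function names and the sample users and items are hypothetical. Notice that items are only opaque IDs, so books, movies, and songs mix freely.

```python
from collections import defaultdict
from itertools import permutations

def cooccurrence(user_items):
    """Count how often two item IDs appear together in one user's history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts

def recommend(user_id, user_items, top_n=2):
    """Score unseen items by how often they co-occur with the user's items."""
    counts = cooccurrence(user_items)
    seen = user_items[user_id]
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken by item ID for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical histories: item IDs carry no type or attribute information.
user_items = {
    "alice": {"book-1", "movie-7"},
    "bob": {"book-1", "movie-7", "song-3"},
    "carol": {"movie-7", "song-3", "book-9"},
}
print(recommend("alice", user_items))  # ['song-3', 'book-9']
```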
    16. Content based
        • Domain-specific approaches
        • Hard to meaningfully codify into a framework
        • We are responsible for choosing which item attributes to use
        • Apache Mahout can’t handle this out of the box, but it can be built on top
    17. Making recommendations
        • What do we need?
          • Input data
          • Neighborhood
          • Similarity
    18. Input Data
        • In Mahout terms: preferences
        • A preference contains:
          • User ID
          • Item ID
          • Preference value
        • Example:
          • 1,101,5.0
          • User ID: 1, Item ID: 101, Preference value: 5.0
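
The `user ID, item ID, preference value` triple from the Input Data slide can be parsed as sketched below. This is plain Python rather than Mahout's own data model classes; the `Preference` tuple and both helper functions are hypothetical names introduced for illustration, and only the first sample line (`1,101,5.0`) comes from the slide.

```python
from typing import NamedTuple

class Preference(NamedTuple):
    user_id: int
    item_id: int
    value: float

def parse_preference(line):
    """Parse one 'userID,itemID,value' line such as '1,101,5.0'."""
    user_id, item_id, value = line.strip().split(",")
    return Preference(int(user_id), int(item_id), float(value))

def load_preferences(lines):
    """Build {user_id: {item_id: value}}, skipping blank lines."""
    prefs = {}
    for line in lines:
        if line.strip():
            p = parse_preference(line)
            prefs.setdefault(p.user_id, {})[p.item_id] = p.value
    return prefs

# The first line is the slide's example; the others are made up.
lines = ["1,101,5.0", "1,102,3.0", "", "2,101,2.0"]
prefs = load_preferences(lines)
print(prefs)  # {1: {101: 5.0, 102: 3.0}, 2: {101: 2.0}}
```

The nested dictionary keyed by user makes it cheap to look up everything one user has rated, which is the access pattern that similarity and neighborhood computations need.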
    20. Neighborhood
        • Nearest N users
        • Threshold
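
The two neighborhood strategies named on this slide differ only in how they cut the list of candidate users. A minimal sketch in plain Python, assuming similarity scores to the target user have already been computed (the function names and the scores below are hypothetical, not Mahout's API):

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity to the target user."""
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    return [user for user, _score in ranked[:n]]

def threshold(similarities, minimum):
    """Keep every user whose similarity is at least `minimum`."""
    return sorted(user for user, score in similarities.items() if score >= minimum)

# Hypothetical similarity of users 2..5 to user 1.
similarities = {2: 0.9, 3: 0.4, 4: 0.75, 5: -0.2}
print(nearest_n(similarities, 2))    # [2, 4]
print(threshold(similarities, 0.3))  # [2, 3, 4]
```

A fixed N always yields a neighborhood of predictable size, even if some members are only weakly similar; a threshold guarantees quality but may return very few users, or none.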
    21. Similarity
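
The slide does not name a specific similarity measure. As one common choice, the sketch below computes the Pearson correlation between two users over the items both have rated. It is plain Python written for illustration, with hypothetical sample ratings, and is not Mahout's implementation.

```python
from math import sqrt

def pearson(prefs_a, prefs_b):
    """Pearson correlation over items rated by both users; 0.0 if undefined."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # one user gave identical ratings; correlation is undefined
    return covariance / sqrt(var_x * var_y)

# Hypothetical ratings keyed by item ID.
user_a = {101: 1.0, 102: 2.0, 103: 3.0}
user_b = {101: 2.0, 102: 4.0, 103: 6.0}   # same taste, different scale
user_c = {101: 3.0, 102: 2.0, 103: 1.0}   # opposite taste
print(pearson(user_a, user_b))
print(pearson(user_a, user_c))
```

Values near 1 mean two users rank items the same way, values near -1 mean they rank them in opposite order. Because the measure subtracts each user's mean, a generous rater and a strict rater with the same tastes still come out as similar.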
    22. Clustering
        • Surface naturally occurring groups of data
        • A notion of similarity (and dissimilarity)
        • Algorithms do not require training
        • Stopping condition: iterate until close enough
    23. Clustering
        • Document level
          • Group documents based on a notion of similarity
          • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
          • Distance measures
            • Manhattan, Euclidean, other
        • Topic modeling
          • Cluster words across documents to identify topics
          • Latent Dirichlet Allocation
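
The clustering slides mention K-Means, pluggable distance measures, and an "iterate until close enough" stopping condition. The sketch below puts those three ideas together in plain Python. It is a minimal illustration, not Mahout's implementation; the `tolerance` and `max_iter` parameters, the starting centroids, and the sample points are all assumptions.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, distance=euclidean, tolerance=1e-6, max_iter=100):
    """Assign points to the nearest centroid, then move centroids, until stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        moved = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        shift = max(distance(old, new) for old, new in zip(centroids, moved))
        centroids = moved
        if shift <= tolerance:  # stopping condition: close enough
            break
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (9.0, 8.5)])
print(centroids)
print(clusters)
```

Passing `distance=manhattan` swaps the distance measure without touching the algorithm. No labeled training data is involved at any point, which is the sense in which the slides say clustering algorithms do not require training.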
    24. Classification
        • Requires training (supervised)
        • Make a single decision with a very limited set of outcomes
        • Typical answers naturally fit into categories
    25. Classification samples
        • Credit card fraud prediction
        • Customer attrition
        • Diabetes detector
        • Search engine
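
The classification slides describe supervised training followed by a single decision among a small set of outcomes. The slides do not name an algorithm, so the sketch below uses a tiny naive Bayes text classifier, chosen only because it is short. It is plain Python with made-up training examples and is not Mahout code.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """Learn word counts per label from (text, label) training pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = log(label_counts[label] / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a label.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:
                score += log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training data with two possible outcomes.
examples = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule tomorrow", "ham"),
    ("project meeting notes", "ham"),
]
model = train(examples)
print(classify("cheap money", model))       # spam
print(classify("meeting tomorrow", model))  # ham
```

The labeled examples are what make this supervised: the set of possible answers is fixed at training time, and every new input is forced into one of those categories.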
    26. Mahout/Hadoop
        • For large data sets
        • Online
        • Offline (Hadoop preferred)
        • You can build your solution with Mahout
        • Take a look at Weka
          • http://www.cs.waikato.ac.nz/ml/weka/
    27. Resources
    30. Join us!
        • GIAMA
          • An initiative by Agustin Ramos