Apache Mahout

Published in: Technology, Education
  1. Apache Mahout: The Elephant Driver
     Presenters: Antonio Loureiro Severien, Emmanouil Dimogerontakis, Muhammad Anis uddin Nasir
  2. What is Apache Mahout?
     ● Machine learning and data mining framework for classification, clustering and recommendation
     ● The goal of Apache Mahout, a free machine learning library, is to build scalable machine learning tools for analysing big data in a distributed manner
  3. Machine Learning
     "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" - Alpaydin, 2004
     Machine learning is concerned with the design and development of algorithms that allow machines to make decisions or even evolve behaviors based on a collection of empirical data.
  4. Data Mining
     Data mining, also called knowledge discovery in databases (KDD), is the process of discovering interesting and useful patterns and relationships in large volumes of data.
     Combines tools from:
     ● statistics
     ● artificial intelligence (such as neural networks and machine learning)
     with database management to analyze large data sets.
     - Britannica Online Encyclopedia
  5. Why Machine Learning and Data Mining?
     ● Data, Data, DATA!!!
     ● Tasks too Hard to Program
     ● Customizing software
  6. Available Machine Learning Tools
     ● WEKA
     ● R
     ● KEEL
     ● Others...
     Not enough?
  7. Apache Mahout vs others?
     Many open source Machine Learning libraries either:
     ● Lack Community
     ● Lack Documentation and Examples
     ● Lack the Apache License (business opportunity)
     ● Are research-oriented (not fit for production yet)
     ● Lack Scalability
  8. Mahout = Elephant Driver?
  9. Why do we need scalability?
     ● Big Data
  10. Applications
     ● Recommendation features
     ● Clustering of information
     ● Classification
     Examples: Movie recommendations, stock analysis, fraud detection, ad-sense recommendation, etc...
     How do we do this?
  11. Supported Algorithms
     ● Classification
     ● Clustering
     ● Recommender / Collaborative Filtering
     ● Evolutionary Algorithms
     ● Pattern Mining
     ● Regression
     ● Dimension reduction
     ● Similarity Vectors
  12. Classification (learn to assign categories to documents)
     Fully functional
     ● Logistic Regression (SGD)
     ● Bayesian
     Integrated into Mahout development
     ● Random Forests (integrated)
     ● Online Passive Aggressive (integrated)
     ● Boosting (awaiting patch commit)
     Open to be worked on...
     ● Hidden Markov Models (HMM) - Training is done in Map-Reduce
     ● Support Vector Machines (SVM) (open)
     ● Perceptron and Winnow (open)
     ● Neural Network (open)
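To make "Logistic Regression (SGD)" from the slide above concrete, here is a minimal sketch of logistic regression trained with stochastic gradient descent. It is plain Python, not Mahout's API; the function names, the toy data, and the hyperparameters (`epochs`, `learning_rate`, `seed`) are assumptions chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and bias by updating on one example at a time."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            error = y - p  # gradient of the log-likelihood w.r.t. the score
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the predicted probability is at least 0.5."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: two well-separated groups of 2-D points.
samples = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4], [2.0, 1.8], [1.7, 2.2], [2.3, 2.1]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_sgd(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Because each update touches a single example, the same loop can process data that does not fit in memory, which is one reason SGD-style training suits large data sets.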
  13. Clustering (group items that are topically related)
     Fully functional
     ● Expectation Maximization (EM)
     ● Hierarchical Clustering
     Integrated into Mahout development
     ● Canopy Clustering
     ● K-Means Clustering
     ● Fuzzy K-Means
     ● Mean Shift Clustering
     ● Dirichlet Process Clustering
     ● Latent Dirichlet Allocation
     ● Spectral Clustering
     ● Minhash Clustering
     ● Top Down Clustering
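As an illustration of the K-Means entry in the clustering slide, the following is a small single-machine sketch of the standard assign-then-recompute loop. It does not use Mahout; the farthest-point seeding, the function names, and the toy points are assumptions made so the example is deterministic and easy to follow.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def initial_centroids(points, k):
    """Assumed seeding: start at the first point, then repeatedly add the
    point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(points, k, iterations=20):
    """Alternate between assigning points and recomputing centroids."""
    centroids = initial_centroids(points, k)
    assignments = [0] * len(points)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_point(members)
    return centroids, assignments


# Hypothetical toy data: two obvious groups in the plane.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
centroids, assignments = k_means(points, k=2)
print(assignments)
```

The assignment step is independent per point and the centroid step is a per-cluster average, which is why this algorithm maps naturally onto a distributed map and reduce style of computation.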
  14. Recommenders / Collaborative Filtering (find items a user might like / find items that appear together)
     Integrated into Mahout development
     ● Non-distributed recommenders ("Taste") (integrated)
     ● Distributed Item-Based Collaborative Filtering (integrated)
     ● Collaborative Filtering using a parallel matrix factorization (integrated)
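The idea behind item-based collaborative filtering can be sketched in a few lines: items are compared by the ratings users gave them, and an unseen item is scored by a similarity-weighted average of the user's own ratings. This is a non-distributed toy in plain Python, not the "Taste" API; the cosine similarity choice, the function names, and the ratings are all assumptions for illustration.

```python
import math
from collections import defaultdict


def item_vectors(ratings):
    """Invert {user: {item: rating}} into {item: {user: rating}}."""
    vectors = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors[item][user] = rating
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating vectors keyed by user."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Score each unseen item by a similarity-weighted average of the
    user's existing ratings, highest score first."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        weighted, total = 0.0, 0.0
        for item, rating in seen.items():
            sim = cosine_similarity(vectors[candidate], vectors[item])
            weighted += sim * rating
            total += sim
        if total > 0:
            scores[candidate] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical movie ratings on a 1-5 scale.
ratings = {
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob": {"matrix": 4, "inception": 5, "notebook": 1},
    "carol": {"titanic": 5, "notebook": 4, "matrix": 1},
    "dave": {"matrix": 5, "titanic": 1},
}
recommendations = recommend(ratings, "dave")
print(recommendations)
```

In this toy data "dave" rated "matrix" highly, and "inception" is rated similarly to "matrix" by other users, so it ranks ahead of "notebook".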
  15. Who is using it?
  16. Opportunities
     ● Developers
     ● Researchers
     ● Small Business
     ● Large Business
     ● Consultancy...
       ○ on Mahout
       ○ on specific data analysis
     ● Open data
     ● etc...
  17. Apache Mahout
     Business? Ideas? Suggestions? Questions?
  18. Where to start?
     ● Wikipedia Bayes Example
       ○ https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
     ● What does it do?
       ○ Classify Wikipedia data dump by countries.
       ○ Objective: Predict what country an unseen article should be categorized into.
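The objective of the Wikipedia Bayes example, predicting a country for an unseen article, can be illustrated with a tiny multinomial naive Bayes classifier. The sketch below is plain Python rather than the Mahout example itself; the four one-line "articles", the whitespace tokenizer, and the add-one smoothing are assumptions standing in for the real Wikipedia dump and Mahout's own preprocessing.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_naive_bayes(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in labelled_docs:
        class_doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(model, text):
    """Pick the class with the highest log prior plus log likelihood,
    using add-one smoothing and skipping words never seen in training."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-ins for country-labelled articles.
training_docs = [
    ("France", "paris is the capital of france on the seine"),
    ("France", "the loire valley in france is known for wine"),
    ("Japan", "tokyo is the capital of japan"),
    ("Japan", "mount fuji is the highest mountain in japan"),
]
model = train_naive_bayes(training_docs)
print(classify(model, "a river cruise on the seine in paris"))
print(classify(model, "hiking near mount fuji"))
```

Training only needs per-class word counts, so the counting step can be spread across many machines and merged afterwards.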
  19. References
     General
     ● http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
     ● http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
     ● http://www.slideshare.net/aneeshabakharia/lca2011-mahout
     Hands-on
     ● http://www.slideshare.net/OReillyOSCON/hands-on-mahout
     Who is using it?
     ● https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
     Apache Mahout
     ● http://mahout.apache.org/
     Quickstart
     ● https://cwiki.apache.org/MAHOUT/quickstart.html