Apache Mahout

Usage Rights: CC Attribution-NonCommercial-ShareAlike License


Apache Mahout Presentation Transcript

  • Apache Mahout: The Elephant Driver
    Presenters: Antonio Loureiro Severien, Emmanouil Dimogerontakis, Muhammad Anis uddin Nasir
  • What is Apache Mahout?
    ● Machine learning and data mining framework for classification, clustering and recommendation
    ● The goal of the free Apache Mahout machine learning library is to build scalable machine learning tools for analysing big data in a distributed manner
  • Machine Learning
    "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" - Alpaydin, 2004
    Machine learning is concerned with the design and development of algorithms that allow machines to make decisions or even evolve behaviors based on a collection of empirical data.
  • Data Mining
    Data mining, also called knowledge discovery in databases (KDD), is the process of discovering interesting and useful patterns and relationships in large volumes of data.
    Combines tools from:
    ● statistics
    ● artificial intelligence (such as neural networks and machine learning)
    with database management to analyze large data sets.
    - Britannica Online Encyclopedia
  • Why Machine Learning and Data Mining?
    ● Data, Data, DATA!!!
    ● Tasks too Hard to Program
    ● Customizing software
  • Available Machine Learning Tools
    ● WEKA
    ● R
    ● KEEL
    ● Others...
    Not enough?
  • Apache Mahout vs others?
    Many open source Machine Learning libraries either:
    ● Lack Community
    ● Lack Documentation and Examples
    ● Lack the Apache License (business opportunity)
    ● Are research-oriented (not fit for production yet)
    ● Lack Scalability
  • Mahout = Elephant Driver?
  • Why do we need scalability?
    ● Big Data
  • Applications
    ● Recommendation features
    ● Clustering of information
    ● Classification
    Examples: Movie recommendations, stock analysis, fraud detection, ad-sense recommendation, etc...
    How do we do this?
  • Supported Algorithms
    ● Classification
    ● Clustering
    ● Recommender / Collaborative Filtering
    ● Evolutionary Algorithms
    ● Pattern Mining
    ● Regression
    ● Dimension reduction
    ● Similarity Vectors
  • Classification (learn to assign categories to documents)
    Fully functional
    ● Logistic Regression (SGD)
    ● Bayesian
    Integrated to Mahout Development
    ● Random Forests (integrated)
    ● Online Passive Aggressive (integrated)
    ● Boosting (awaiting patch commit)
    Open to be worked on...
    ● Hidden Markov Models (HMM) - Training is done in Map-Reduce
    ● Support Vector Machines (SVM) (open)
    ● Perceptron and Winnow (open)
    ● Neural Network (open)
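To make the "Logistic Regression (SGD)" entry concrete, here is a minimal sketch of logistic regression trained by stochastic gradient descent. It is plain Python for illustration only and does not use Mahout's API; the function names, the learning rate, the epoch count and the toy data are all assumptions chosen for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_sgd(samples, labels, learning_rate=0.1, epochs=200, seed=0):
    """Fit weights (bias stored last) with one gradient step per example."""
    rng = random.Random(seed)
    weights = [0.0] * (len(samples[0]) + 1)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            x = list(samples[i]) + [1.0]  # constant input for the bias term
            p = sigmoid(sum(w * v for w, v in zip(weights, x)))
            error = labels[i] - p  # gradient of the log-likelihood w.r.t. the score
            for j, v in enumerate(x):
                weights[j] += learning_rate * error * v
    return weights

def predict(weights, sample):
    x = list(sample) + [1.0]
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0

# Hypothetical toy data: the label is 1 when the single feature is positive.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y = [0, 0, 0, 1, 1, 1]
w = train_logistic_sgd(X, y)
predictions = [predict(w, x) for x in X]
```

Because each update touches only one example, this style of training processes data as a stream, which is one reason SGD suits large data sets.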
  • Clustering (group items that are topically related)
    Fully functional
    ● Expectation Maximization (EM)
    ● Hierarchical Clustering
    Integrated to Mahout Development
    ● Canopy Clustering
    ● K-Means Clustering
    ● Fuzzy K-Means
    ● Mean Shift Clustering
    ● Dirichlet Process Clustering
    ● Latent Dirichlet Allocation
    ● Spectral Clustering
    ● Minhash Clustering
    ● Top Down Clustering
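As an illustration of the "K-Means Clustering" entry, the following is a small single-machine sketch of the standard assign-then-update loop (Lloyd's algorithm). It is not Mahout's distributed implementation; the function names, the iteration count and the sample points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Return (centroids, assignment) after a fixed number of iterations."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Hypothetical data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignment = k_means(points, k=2)
```

The assignment step is independent per point and the update step is a per-cluster average, which is the structure that makes the algorithm a natural fit for a map and reduce split.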
  • Recommenders / Collaborative Filtering (find items a user might like / find items that appear together)
    Integrated to Mahout Development
    ● Non-distributed recommenders ("Taste") (integrated)
    ● Distributed Item-Based Collaborative Filtering (integrated)
    ● Collaborative Filtering using a parallel matrix factorization (integrated)
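The idea behind item-based collaborative filtering can be sketched in a few lines: score an unseen item for a user by averaging the user's own ratings, weighted by how similar each rated item is to the candidate. The sketch below is plain Python, not the Taste API; the ratings, the choice of cosine similarity and all names are assumptions for illustration.

```python
import math

# Hypothetical user -> {item: rating} data.
ratings = {
    "alice": {"Alien": 5.0, "Blade Runner": 4.0, "Casablanca": 1.0},
    "bob": {"Alien": 4.0, "Blade Runner": 5.0},
    "carol": {"Casablanca": 5.0, "Amelie": 4.0},
    "dave": {"Alien": 5.0, "Casablanca": 2.0},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> user ratings."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            vectors.setdefault(item, {})[user] = value
    return vectors

def cosine_similarity(a, b):
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank the items the user has not rated by similarity-weighted rating."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        weighted, total = 0.0, 0.0
        for item, value in seen.items():
            sim = cosine_similarity(vectors[candidate], vectors[item])
            weighted += sim * value
            total += sim
        if total > 0:  # skip candidates unrelated to anything the user rated
            scores[candidate] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

recommendations = recommend("dave", ratings)
```

With this data, "Blade Runner" ranks above "Amelie" for dave, because it is rated by the same users who rated "Alien", which dave scored highly.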
  • Who is using it?
  • Opportunities
    ● Developers
    ● Researchers
    ● Small Business
    ● Large Business
    ● Consultancy...
      ○ on Mahout
      ○ on specific data analysis
    ● Open data
    ● etc...
  • Apache Mahout
    Business? Ideas? Suggestions? Questions?
  • Where to start?
    ● Wikipedia Bayes Example
      ○ https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
    ● What does it do?
      ○ Classifies a Wikipedia data dump by country.
      ○ Objective: Predict which country an unseen article should be categorized under.
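The objective above, predicting a country for an unseen article, can be illustrated with a tiny multinomial naive Bayes text classifier. This is a plain Python sketch and not the code of the Wikipedia Bayes example; the training sentences, the add-one smoothing and the function names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents):
    """documents: list of (label, text). Returns the counts used for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in documents:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior plus the log likelihood of each known word (add-one smoothing).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            smoothed = word_counts[label][word] + 1
            score += math.log(smoothed / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data standing in for labelled Wikipedia articles.
training = [
    ("France", "paris is the capital on the seine river"),
    ("France", "the loire valley is known for wine and castles"),
    ("Japan", "tokyo is the capital near mount fuji"),
    ("Japan", "kyoto is known for temples and gardens"),
]
model = train_naive_bayes(training)
label = classify(model, "castles along the seine")
```

Training only needs per-label word counts, so the counting can be spread across many machines and summed, which is what makes this family of classifiers easy to scale.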
  • References
    General
      http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
      http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
      http://www.slideshare.net/aneeshabakharia/lca2011-mahout
    Hands-on
      http://www.slideshare.net/OReillyOSCON/hands-on-mahout
    Who is using it?
      https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
    Apache Mahout
      http://mahout.apache.org/
    Quickstart
      https://cwiki.apache.org/MAHOUT/quickstart.html