Apache Mahout
The Elephant Driver

Presenters:
Antonio Loureiro Severien
Emmanouil Dimogerontakis
Muhammad Anis uddin Nasir
What is Apache Mahout?
● Machine learning and data mining framework for
  classification, clustering and recommendation

● Apache Mahout is a free machine learning library whose goal
  is to build scalable machine learning tools for analysing
  big data in a distributed manner
Machine Learning
"Machine Learning is programming computers to optimize a
performance criterion using example data or past
experience" - Alpaydin, 2004

Machine learning is concerned with the design and
development of algorithms that allow machines to make
decisions, or even evolve behaviors, based on collections of
empirical data.
Data Mining
Data mining, also called knowledge discovery in
databases (KDD), is the process of discovering interesting
and useful patterns and relationships in large volumes of
data.
Combines tools from:
    ● statistics
    ● artificial intelligence (such as neural networks and
       machine learning)
with database management to analyze large data sets.
-Britannica Online Encyclopedia
Why Machine Learning and Data Mining?

● Data, Data, DATA!!!


● Tasks too Hard to Program


● Customizing software
Available Machine Learning Tools


●   WEKA
●   R
●   KEEL
●   Others...


Not enough?
Apache Mahout vs others?
Many open source Machine Learning
libraries either:
● Lack Community
● Lack Documentation and Examples
● Lack the Apache License
    (business opportunity)
● Are research-oriented
    (not fit for production yet)
● Lack Scalability
Mahout = Elephant Driver?
Why do we need scalability?
● Big Data
Applications
● Recommendation features
● Clustering of information
● Classification

Examples: Movie recommendations, stock
analysis, fraud detection, ad-sense
recommendation, etc...

            How do we do this?
Supported Algorithms
●   Classification
●   Clustering
●   Recommender / Collaborative Filtering
●   Evolutionary Algorithms
●   Pattern Mining
●   Regression
●   Dimension reduction
●   Similarity Vectors
Classification
(learn to assign categories to documents)

Fully functional
 ● Logistic Regression (SGD)
 ● Bayesian

Integrated into Mahout development
 ● Random Forests (integrated)
 ● Online Passive Aggressive (integrated)
 ● Boosting (awaiting patch commit)

Open to be worked on...
 ● Hidden Markov Models (HMM) - Training is done in Map-Reduce
 ● Support Vector Machines (SVM) (open)
 ● Perceptron and Winnow (open)
 ● Neural Network (open)
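
As a rough illustration of the first entry above, logistic regression trained with stochastic gradient descent (SGD) can be sketched in plain Python. This is not Mahout's implementation: the function names, the toy data, the learning rate and the epoch count are all assumptions chosen for readability.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and a bias by updating on one example at a time (SGD)."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x, y = samples[i], labels[i]
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = y - p  # gradient of the log-likelihood w.r.t. the score
            weights = [w + learning_rate * error * v for w, v in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 when the predicted probability is at least 0.5."""
    score = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data: two well-separated groups of 2-D points.
SAMPLES = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]
LABELS = [0, 0, 0, 1, 1, 1]
```

Because each update touches only one example, the same idea extends to data that arrives as a stream and does not fit in memory.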
Clustering
(group items that are topically related)

Fully functional
 ● Expectation Maximization (EM)
 ● Hierarchical Clustering

Integrated into Mahout development
 ● Canopy Clustering
 ● K-Means Clustering
 ● Fuzzy K-Means
 ● Mean Shift Clustering
 ● Dirichlet Process Clustering
 ● Latent Dirichlet Allocation
 ● Spectral Clustering
 ● Minhash Clustering
 ● Top Down Clustering
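
To make one of the listed algorithms concrete, here is a minimal single-machine sketch of K-Means in plain Python. It is an illustration only, not Mahout code: the function names, the toy points, the fixed iteration count and the random initialisation are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: three points near (1, 1) and three near (8.5, 8.5).
POINTS = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
```

Calling `k_means(POINTS, 2)` returns the two centroids and a cluster index per point. The assignment step is independent per point, which is what makes the algorithm a natural fit for a distributed setting.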
Recommenders /
Collaborative Filtering
(find items a user might like /
find items that appear together)

Integrated into Mahout development
●   Non-distributed recommenders ("Taste") (integrated)
●   Distributed Item-Based Collaborative Filtering (integrated)
●   Collaborative Filtering using a parallel matrix factorization (integrated)
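
The idea behind item-based collaborative filtering can be sketched in a few lines of plain Python. This is a simplified, non-distributed illustration and not the Taste API: the ratings, the function names and the choice of cosine similarity with a similarity-weighted average are assumptions.

```python
import math

# Hypothetical user -> {item: rating} preferences.
RATINGS = {
    "alice": {"matrix": 5.0, "inception": 4.0, "titanic": 1.0},
    "bob": {"matrix": 4.0, "inception": 5.0},
    "carol": {"titanic": 5.0, "notebook": 4.0},
    "dave": {"matrix": 5.0, "titanic": 2.0, "notebook": 1.0},
}


def item_vectors(ratings):
    """Invert the data into item -> {user: rating}."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            vectors.setdefault(item, {})[user] = value
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Score unseen items by a similarity-weighted average of the
    ratings the user gave to the items they have already rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        weighted, total = 0.0, 0.0
        for item, value in seen.items():
            sim = cosine(vectors[candidate], vectors[item])
            weighted += sim * value
            total += sim
        if total > 0:
            scores[candidate] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

`recommend(RATINGS, "bob")` returns a ranked list of `(item, estimated rating)` pairs for items Bob has not rated yet. Item-to-item similarities depend only on the rating matrix, so they can be precomputed in a batch job, which is the part a distributed implementation would parallelise.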
Who is using it?
Opportunities
●   Developers
●   Researchers
●   Small Business
●   Large Business
●   Consultancy...
    ○ on Mahout
    ○ on specific data analysis
● Open data
● etc...
Apache Mahout
Business?

Ideas?

Suggestions?

Questions?
Where to start?
● Wikipedia Bayes Example
   ○   https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html


● What does it do?
   ○ Classifies articles from a Wikipedia data dump by country.
   ○ Objective: Predict what country an unseen article
     should be categorized into.
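
To give a feel for what such a classifier does, here is a tiny multinomial naive Bayes text classifier in plain Python. It is a sketch under stated assumptions, not the Mahout example itself: the training sentences, the whitespace tokenisation, the add-one smoothing and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Count documents per label and word occurrences per label.

    `documents` is a list of (label, text) pairs.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in documents:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: country label plus a short article snippet.
TRAINING = [
    ("france", "paris is the capital of france"),
    ("france", "the eiffel tower stands in paris"),
    ("japan", "tokyo is the capital of japan"),
    ("japan", "mount fuji is near tokyo"),
]
```

After `model = train_naive_bayes(TRAINING)`, a call such as `classify(model, "a tower in paris")` picks the country whose training text best matches the words of the unseen snippet.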
References
General
http://www.slideshare.net/sdec2011/sdec2011-mahout-the-what-the-how-and-the-why
http://www.slideshare.net/gsingers/intro-to-mahout-dc-hadoop
http://www.slideshare.net/aneeshabakharia/lca2011-mahout
Hands-on
http://www.slideshare.net/OReillyOSCON/hands-on-mahout
Who is using it?
https://cwiki.apache.org/MAHOUT/powered-by-mahout.html
Apache Mahout
http://mahout.apache.org/
Quickstart
https://cwiki.apache.org/MAHOUT/quickstart.html
