Machine Learning
with Apache Mahout
  Domingo Suarez Torres
Machine Learning (ML)
        Introduction
Definition

     • Machine learning, a branch of artificial
        intelligence, is a scientific discipline
        concerned with the design and
        development of algorithms that allow
        computers to evolve behaviors based on
        empirical data (1)


(1) http://en.wikipedia.org/wiki/Machine_learning
• “Machine Learning is programming
  computers to optimize a performance
  criterion using example data or past
  experience”
 • Introduction to Machine Learning, by E. Alpaydin
Applications
•   Recommend friends/dates/products
•   Classify content into predefined groups
•   Find similar content based on object properties
•   Find associations/patterns in actions/behaviors
•   Identify key topics in large collections of text
•   Detect anomalies in machine output
•   Ranking search results
•   Fraud detection
•   Spam detection
•   Medical diagnostics
•   Translators
•   Much more!
Math

• Statistics
• Discrete Math
• Linear algebra
• Probability
Starting with ML
•   Get your data
•   Decide on your features per your algorithm
•   Prep the data
    •   Different approaches for different algorithms
•   Run your algorithm(s)
    •   Lather, rinse, repeat
•   Validate your results
    •   Smell test, A/B testing
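The steps above can be sketched end to end in a few lines. This is a minimal illustration, not a Mahout workflow: the data is made up, the "algorithm" is just the mean of the training targets, and validation uses a simple holdout split with mean absolute error (an assumption; the slide itself only names smell tests and A/B testing).

```python
import random

def split_holdout(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them into train and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_mean(train):
    """'Run your algorithm': here, simply the mean of the training targets."""
    return sum(y for _, y in train) / len(train)

def mean_absolute_error(model, test):
    """'Validate your results': average absolute gap between prediction and truth."""
    return sum(abs(model - y) for _, y in test) / len(test)

# Hypothetical (feature, target) rows standing in for "get your data"
data = [(x, 2.0 * x) for x in range(8)]
train, test = split_holdout(data)
model = fit_mean(train)
error = mean_absolute_error(model, test)
```

A large error here is the cue for "lather, rinse, repeat": change the features, the preparation, or the algorithm, then measure again on the held-out rows.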
Apache Mahout

• Machine Learning library. Platform?
• Extensible, we can use our own algorithm.
• Hadoop support
• 2005: Taste framework started (later merged into Mahout)
• 2008: Mahout started as a subproject of Apache Lucene
Scalability
•   Huge amount of data, growing every second!
•   Be as fast and efficient as possible given the intrinsic design of
    the algorithm
    •   Some algorithms won’t scale to massive machine clusters
    •   Others fit logically on a Map Reduce framework like
        Apache Hadoop
    •   Still others will need alternative distributed programming
        models
    •   Be pragmatic
•   Most Mahout implementations are Map Reduce enabled
Who uses Mahout?
Components

• Recommender Engines (collaborative
  filtering, content-based)
• Clustering
• Classification
When to use?
• Recommendation
 • Rank large datasets
• Clustering
 • Group your data
• Classification
 • Train me to think like you
Recommenders
•   Given a data set, make a recommendation.
    •   Item recommendation (book, movie, etc.)
•   Ranking based
•   Recommendations
    •   User based
    •   Item based
•   Requires knowledge of users' relationships to items (user
    preferences)
Collaborative filtering
• User based
• Item based
• Both techniques require no knowledge of
  the properties of the items themselves.
• Item Type is irrelevant. Apache Mahout is
  happy
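To illustrate why item properties are not needed, here is a small sketch of item-based similarity computed only from who rated what. It uses cosine similarity over per-item rating vectors, which is one common choice and an assumption of this example; the item IDs "A", "B", and "C" and the ratings are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def item_vectors(triples):
    """Turn (user, item, value) triples into {item: {user: value}}."""
    vectors = defaultdict(dict)
    for user, item, value in triples:
        vectors[item][user] = value
    return vectors

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(vec_a[u] * vec_b[u] for u in set(vec_a) & set(vec_b))
    norm_a = sqrt(sum(v * v for v in vec_a.values()))
    norm_b = sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Item IDs are opaque: they could be books, movies, or anything else.
triples = [(1, "A", 5.0), (1, "B", 5.0), (2, "A", 4.0), (2, "B", 4.0), (3, "C", 3.0)]
items = item_vectors(triples)
similarity_ab = cosine(items["A"], items["B"])  # rated alike by the same users
similarity_ac = cosine(items["A"], items["C"])  # no users in common
```

Items A and B come out as highly similar purely because the same users rated them the same way; nothing about what A or B actually is enters the computation.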
Content based
• Domain-specific approaches
• Hard to meaningfully codify into a
  framework
• We are responsible for choosing which
  item attributes to use.
• Apache Mahout can't handle this out of
  the box, but a solution can be built on top of it.
Making recommendations

 • What do we need?
  • Input data
  • Neighborhood
  • Similarity
Input Data
•   In Mahout terms: Preferences
•   A preference contains:
    •   User ID
    •   Item ID
    •   Preference value
    •   Example:
        •   1,101,5.0
        •   USER ID: 1, ITEM ID: 101, PrefValue: 5.0
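As a rough sketch of how such preference lines can be loaded, the snippet below parses "userID,itemID,value" text into a nested dictionary. Mahout's Taste API reads this kind of file through its own data model classes (for example FileDataModel); the Python function and its name `parse_preferences` are only an illustration of the format, not Mahout code.

```python
from collections import defaultdict

def parse_preferences(lines):
    """Parse 'userID,itemID,value' lines into {user_id: {item_id: value}}."""
    prefs = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, value = line.split(",")
        prefs[int(user)][int(item)] = float(value)
    return dict(prefs)

# The first line is the example from the slide; the others are made up.
sample = ["1,101,5.0", "1,102,3.0", "2,101,2.0"]
prefs = parse_preferences(sample)
```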
Neighborhood
• Nearest N users
• Threshold
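The two neighborhood strategies can be sketched as follows, assuming similarity scores to a target user have already been computed. The function names and the scores are hypothetical; Mahout offers comparable behavior through its NearestNUserNeighborhood and ThresholdUserNeighborhood classes.

```python
def nearest_n(similarities, n):
    """Keep the n users with the highest similarity score."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [user for user, _ in ranked[:n]]

def within_threshold(similarities, threshold):
    """Keep every user whose similarity is at least the threshold."""
    return [user for user, score in similarities.items() if score >= threshold]

# Hypothetical similarity of other users to one target user
scores = {2: 0.9, 3: 0.4, 4: 0.75, 5: -0.2}
top_two = nearest_n(scores, 2)
close_enough = within_threshold(scores, 0.7)
```

Nearest N always returns a fixed-size neighborhood, even if some members are only weakly similar. A threshold returns only sufficiently similar users, but the neighborhood may be empty.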
Similarity
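One widely used similarity measure between two users is the Pearson correlation over the items both have rated (Mahout ships it as PearsonCorrelationSimilarity). The sketch below is a plain re-implementation for illustration; returning 0.0 when the correlation is undefined is a choice made for this example, and the users and ratings are made up.

```python
from math import sqrt

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(prefs_a) & set(prefs_b))
    if len(common) < 2:
        return 0.0  # not enough overlap to correlate
    a = [prefs_a[i] for i in common]
    b = [prefs_b[i] for i in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # constant ratings have no defined correlation
    return cov / sqrt(var_a * var_b)

# Hypothetical users: {item_id: preference value}
u1 = {101: 5.0, 102: 3.0, 103: 1.0}
u2 = {101: 4.0, 102: 3.0, 103: 2.0}  # same taste as u1, milder ratings
u3 = {101: 1.0, 102: 3.0, 103: 5.0}  # opposite taste
same_taste = pearson_similarity(u1, u2)
opposite_taste = pearson_similarity(u1, u3)
```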
Clustering

• Surface naturally occurring groups of data
• A notion of similarity (and dissimilarity)
• Algorithms do not require training
• Stopping condition - iterate until close
  enough
Clustering
•   Document level
    •   Group documents based on a notion of similarity
    •   K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift
    •   Distance Measures
    •   Manhattan, Euclidean, other
•   Topic Modeling
    •   Cluster words across documents to identify topics
    •   Latent Dirichlet Allocation
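A compact k-means sketch shows how the distance measure and the "iterate until close enough" stopping condition fit together. It is a simplified, single-machine illustration with made-up points and starting centroids, not Mahout's MapReduce implementation. One caveat: the mean update is the natural fit for Euclidean distance, so with Manhattan distance this is only an approximation.

```python
from math import sqrt

def euclidean(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def k_means(points, centroids, distance=euclidean, max_iter=100, tol=1e-6):
    """Lloyd-style k-means; stops when no centroid moves more than tol."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its points
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        shift = max(distance(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break  # "close enough"
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, groups = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
centers_l1, _ = k_means(points, [(0.0, 0.0), (10.0, 10.0)], distance=manhattan)
```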
Classification

• Requires training (supervised)
• Make a single decision with a very limited
  set of outcomes
• Typical answers naturally fit into categories
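To make "requires training" concrete, here is a toy nearest-centroid classifier: training averages the labeled examples per category, and classification picks the closest average. This technique is a stand-in chosen for brevity, not one of Mahout's classifiers, and the features and labels are hypothetical.

```python
from collections import defaultdict

def train(examples):
    """Compute one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(dim) / len(rows) for dim in zip(*rows))
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Hypothetical features: (number of links, number of exclamation marks)
labeled = [((8, 6), "spam"), ((7, 9), "spam"), ((0, 1), "ham"), ((1, 0), "ham")]
model = train(labeled)
verdict_a = classify(model, (6, 7))
verdict_b = classify(model, (1, 1))
```

The output is always one of the labels seen during training, which matches the slide's point about a single decision with a very limited set of outcomes.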
Classification samples

• Credit card fraud prediction
• Customer attrition
• Diabetes detector
• Search Engine
Mahout/Hadoop
• For large data sets
• Online
• Offline (Hadoop preferred)
• You can build your solution with Mahout
• Take a look into Weka
 • http://www.cs.waikato.ac.nz/ml/weka/
Resources
Join us!
• GIAMA.
 • An initiative by Agustin Ramos

