Mahout By: Ariel Kogan
Who's this guy? Java Framework Team at IDI. 10 years of experience in IT. 6 years of experience in Java. Master's in Informatics Engineering, specializing in Artificial Intelligence. Has a weird accent. Aliyah. Photo: http://www.flickr.com/photos/triphenawong/4752510292/
Agenda: Machine Learning, Mahout, Recommender Engines, Clustering, Categorization, Hadoop
Machine Learning
Machine Learning Whatchatalkin' 'bout, Willis?
Machine Learning: Well-known use cases: Recommender Engines, Clustering, Classification
Machine Learning Recommender Engines: Amazon
Machine Learning Recommender Engines: Facebook
Machine Learning Clustering: Google News
Machine Learning Classification: Spam Detection
Machine Learning Classification: Picasa face recognition
Machine Learning: Why learn "Machine Learning"? Because it's interesting. Because it makes money.
Mahout
Mahout: What is it? Open source project by the Apache Software Foundation. Goal: to build scalable machine learning libraries. Large data sets (Hadoop). Commercially friendly Apache Software License. Community.
Mahout: What's that name? Mahout - [muh-hout] - (mə’haʊt). A mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop (which has a yellow elephant as its logo) for scalability and fault tolerance.
Mahout Mahout and its related projects
Mahout History
Mahout History, 2008 to 2010. Events on the timeline: the Lucene Project Management Committee announces the creation of the Mahout subproject; Taste Collaborative Filtering donates its codebase to the Mahout project; Release 0.1; Release 0.2; Release 0.3; Mahout becomes an Apache top-level project; Release 0.4; Mahout is presented at AlphaCSP's The Edge 2010.
Mahout Mailing lists activity
Mahout: Similar Products. Yet another framework? Weka (since 1999). 38 Java projects listed on mloss.org (Machine Learning Open Source Software).
Mahout: Machine Learning Challenges. Large amount of input data: techniques work better. Nature of the deploying context: must produce results quickly. The amount of input is so large that it is not feasible to process it all on one computer, even a powerful one.
Mahout: Scalability. Mahout core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm.
Mahout: MapReduce. Programming model introduced by Google in 2004. Many real-world tasks are expressible in this model ("Map-Reduce for Machine Learning on Multicore", Stanford CS Department's paper, 2006). Provides automatic parallelization and distribution. Runs on large clusters of compute nodes. Highly scalable. Hadoop is Apache's open source implementation.
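To make the programming model concrete, here is a minimal in-memory sketch of the map, group (shuffle) and reduce steps, using word counting as the task. It is plain Java and deliberately does not use the Hadoop API; the class and method names are made up for illustration, and a real cluster would run the map and reduce calls in parallel on different nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map/reduce model; this is NOT the Hadoop API.
final class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values that were emitted for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"),
    // then reduce each group to a single count.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("mahout drives the elephant",
                                       "the elephant is hadoop")));
    }
}
```

Because each map call looks at one line and each reduce call looks at one key, neither needs to see the whole dataset, which is what lets a framework distribute them.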
Mahout
Mahout
Recommender Engines
Recommender Engines: Approaches. User based. Item based. Collaborative filtering vs. content-based recommendation.
Recommender Engines: What do we need? Data model: Users, Items, Preferences (ratings). ItemSimilarity. UserSimilarity. UserNeighborhood. Recommender.
Recommender Engines: Items: T-bone, Chocolate, Lettuce, Rump. Photos: http://www.flickr.com/photos/martinimike/3770274175/ http://www.flickr.com/photos/fotoosvanrobin/3182238046/ http://www.flickr.com/photos/this_girl_daydreams/3190110968/ http://www.flickr.com/photos/19998197@N00/3238445535/
Recommender Engines: Rating scale from 5 to -5
Recommender Engines Kuki The Vegan Gilad Ariel
Recommender Engines
    // We create a DataModel based on the information contained in food.csv
    DataModel model = new FileDataModel(new File("food.csv"));
    // We use one of the several user similarity functions we have available
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Same thing with the UserNeighborhood definition
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(hoodSize, similarity, model);
    // Finally we can build our recommender
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // And ask for recommendations for a specific user
    List<RecommendedItem> recommendations = recommender.recommend(userId, howMany);
    for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
    }
Available UserSimilarity implementations: CachingUserSimilarity, EuclideanDistanceSimilarity, GenericUserSimilarity, LogLikelihoodSimilarity, PearsonCorrelationSimilarity, SpearmanCorrelationSimilarity, TanimotoCoefficientSimilarity, UncenteredCosineSimilarity
Recommender Engines: What would we recommend to Ariel? Recommendation for Ariel: T-bone, rating 4.0
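For intuition about what a Pearson-based user similarity measures, the following plain-Java sketch computes the Pearson correlation over the items two users have both rated. It is an illustration under stated assumptions, not Mahout's implementation: the class name, the map-based data layout and the ratings in main are all hypothetical (the food names and the 5 to -5 scale are borrowed from the slides).

```java
import java.util.Map;

// Hypothetical helper: Pearson correlation between two users' ratings.
final class PearsonSketch {

    // Correlates the ratings of the items that appear in BOTH maps.
    // Returns NaN when fewer than two items overlap or when one user's
    // overlapping ratings are all identical (zero variance).
    static double pearson(Map<String, Double> a, Map<String, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by the second user
            }
            double ra = e.getValue();
            double rb = other;
            n++;
            sumA += ra;
            sumB += rb;
            sumAA += ra * ra;
            sumBB += rb * rb;
            sumAB += ra * rb;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        // Made-up ratings on the 5 to -5 scale.
        Map<String, Double> meatLover = Map.of("T-bone", 5.0, "Rump", 4.0, "Lettuce", -3.0);
        Map<String, Double> vegan = Map.of("T-bone", -5.0, "Rump", -5.0, "Lettuce", 4.0);
        System.out.println(pearson(meatLover, vegan));
    }
}
```

A value near 1 means the two users rate the shared items alike, a value near -1 means they rate them in opposite ways; a neighborhood of the most similar users is then used to pick items to recommend.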
Recommender Engines Kuki The Vegan Gilad Ariel
Recommender Engines: No initial information. 10 most popular. Random selection. What other customers are looking at right now. Bestsellers. Best prices. Nothing at all.
Clustering
Clustering is about drawing lines Clustering
Clustering Clustering Steps
Clustering: Possible weather conditions recognition. Features: temperature, wind direction, humidity, wind speed. Icons: http://www.icons-land.com
Clustering Vector representation 25 / 50 = 0.5
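The 25 / 50 = 0.5 on this slide reads as scaling a raw measurement by its maximum so that every feature ends up in a comparable range. A minimal sketch of that idea follows; the class name is hypothetical, and every maximum other than the 50 used for the first feature is an assumption chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper: turn raw weather readings into a scaled feature vector.
final class WeatherVectorSketch {

    // Divide each raw value by the maximum assumed for that feature.
    static double[] toVector(double[] raw, double[] max) {
        if (raw.length != max.length) {
            throw new IllegalArgumentException("raw and max must have the same length");
        }
        double[] vector = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            vector[i] = raw[i] / max[i];
        }
        return vector;
    }

    public static void main(String[] args) {
        // Order: temperature, wind direction, humidity, wind speed.
        double[] raw = {25.0, 90.0, 40.0, 10.0};
        double[] max = {50.0, 360.0, 100.0, 100.0}; // only the 50 comes from the slide
        System.out.println(Arrays.toString(toVector(raw, max)));
    }
}
```

Without this kind of scaling, a feature with a large numeric range (such as wind direction in degrees) would dominate any distance computed between two samples.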
Clustering: Samples Generation. 300 samples, mean [0.0, 2.0], SD 0.1. 500 samples, mean [1.0, 1.0], SD 3.0. 300 samples, mean [1.0, 0.0], SD 0.5.
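A dataset with these three groups can be generated with nothing more than java.util.Random. The sketch below assumes each coordinate is an independent Gaussian around the group mean with the listed standard deviation; the class name, method names and seed are hypothetical, and the talk's own generator may have differed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical generator for the three sample groups listed on the slide.
final class SampleGeneratorSketch {

    // Draw `count` points whose coordinates are independent Gaussians
    // centred on `mean` with standard deviation `sd`.
    static List<double[]> generate(int count, double[] mean, double sd, Random rng) {
        List<double[]> points = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            double[] point = new double[mean.length];
            for (int d = 0; d < mean.length; d++) {
                point[d] = mean[d] + sd * rng.nextGaussian();
            }
            points.add(point);
        }
        return points;
    }

    // The three groups from the slide, concatenated in slide order.
    static List<double[]> slideDataset(long seed) {
        Random rng = new Random(seed);
        List<double[]> all = new ArrayList<>();
        all.addAll(generate(300, new double[] {0.0, 2.0}, 0.1, rng));
        all.addAll(generate(500, new double[] {1.0, 1.0}, 3.0, rng));
        all.addAll(generate(300, new double[] {1.0, 0.0}, 0.5, rng));
        return all;
    }

    public static void main(String[] args) {
        System.out.println(slideDataset(42L).size() + " samples generated");
    }
}
```

Note that the middle group is much more spread out than the other two, so its points overlap both neighbours; that overlap is what makes the clustering problem interesting.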
Clustering Iterations with Fuzzy K-Means
Clustering Clustering Discovery Original data generation Discovered clusters
Clustering: Distance measures: CosineDistanceMeasure, EuclideanDistanceMeasure, MahalanobisDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, TanimotoDistanceMeasure, WeightedDistanceMeasure, WeightedEuclideanDistanceMeasure, WeightedManhattanDistanceMeasure
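To show what a few of these measures compute, here is a plain-Java sketch of the textbook Euclidean, Manhattan and cosine distances between two vectors. These are the standard formulas written from scratch; the class is hypothetical and is not the Mahout classes named above, whose edge-case handling may differ.

```java
// Hypothetical helper with textbook definitions of three distance measures.
final class DistanceSketch {

    // Straight-line distance: square root of the summed squared differences.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // City-block distance: sum of the absolute differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // One minus the cosine of the angle between the vectors.
    // Assumes neither vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            throw new IllegalArgumentException("cosine distance is undefined for a zero vector");
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println("euclidean = " + euclidean(p, q));
        System.out.println("manhattan = " + manhattan(p, q));
    }
}
```

The choice matters: cosine distance ignores vector length and only compares direction, while Euclidean and Manhattan are sensitive to the magnitude of each feature.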
Categorization
Categorization Categorization Steps
Our example: What do we want to do? Categorization Java Classifier Document Sport
Categorization: Documents Preparation. Input format: Label <tab> evidence1 <space> evidence2. BayesFileFormatter (Lucene's Analyzers). Labeled documents are divided into Training and Test sets.
Categorization Using the classifier
Categorization: Categorization testing, the confusion matrix
Summary
-------------------------------------------------------
Correctly Classified Instances          :     93    93%
Incorrectly Classified Instances        :      7     7%
Total Classified Instances              :    100
=======================================================
Confusion Matrix
-------------------------------------------------------
java   sport  <--Classified as
56     3      |  59   java
4      37     |  41   sport
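The summary figures can be recomputed directly from the matrix. In the sketch below, rows are the actual label and columns are the label the classifier assigned, which is how the output above reads (59 java documents, 56 of them classified as java). The class and method names are hypothetical; only the four counts come from the slide.

```java
// Hypothetical helper: derive summary figures from a confusion matrix
// laid out as counts[actual][classifiedAs].
final class ConfusionMatrixSketch {

    // Fraction of all instances that landed on the diagonal.
    static double accuracy(int[][] counts) {
        long correct = 0;
        long total = 0;
        for (int actual = 0; actual < counts.length; actual++) {
            for (int predicted = 0; predicted < counts[actual].length; predicted++) {
                total += counts[actual][predicted];
                if (actual == predicted) {
                    correct += counts[actual][predicted];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Fraction of the instances of one actual class that were classified correctly.
    static double recall(int[][] counts, int cls) {
        long rowTotal = 0;
        for (int predicted = 0; predicted < counts[cls].length; predicted++) {
            rowTotal += counts[cls][predicted];
        }
        return rowTotal == 0 ? 0.0 : (double) counts[cls][cls] / rowTotal;
    }

    public static void main(String[] args) {
        int[][] counts = {
            {56, 3},  // actual java:  56 classified as java, 3 as sport
            {4, 37},  // actual sport: 4 classified as java, 37 as sport
        };
        System.out.println("accuracy     = " + accuracy(counts));
        System.out.println("recall java  = " + recall(counts, 0));
        System.out.println("recall sport = " + recall(counts, 1));
    }
}
```

The diagonal holds 56 + 37 = 93 correctly classified documents out of 100, which matches the 93% in the summary.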
Take me to the cluster
Hadoop: Why do we need distributed computing? The size of our dataset can't be handled by a single machine. Scale-up vs. scale-out. We need the results in near real time.
Hadoop Data Results Hadoop Compute Cluster
Hadoop: Hadoop Jobs. We need to: configure the job, submit it, control its execution, query its state. We want to: just run our machine learning algorithm!
Hadoop: Mahout's AbstractJob and Drivers. Mahout provides an out-of-the-box AbstractJob class and several Job and Driver implementations in order to run Machine Learning algorithms on the cluster without any hassle.
Hadoop: What we need. Our code, including a Job. Mahout jars. Hadoop jars. Everyone's dependency jars. Resources. The dataset.
Hadoop Packaging a Job – The Maven solution pom.xml
Hadoop Job feeding Job Dataset Hadoop Compute Cluster
Hadoop We take the project’s dependencies
Hadoop Using an Ant task, we pack everything together

Mahout's presentation at AlphaCSP's The Edge 2010

  • 61. Hadoop We attach the Job jar on the packaging phase
  • 62. Hadoop: Running our Job
      Upload our job to HDFS:              $ hadoop fs -put myjob.job /myjob.job
      Upload the dataset to HDFS:          $ hadoop fs -put dataset.dat /dataset.dat
      Run the job (jar class input output): $ hadoop jar /myjob.job c.a.RecommenderJob /dataset.dat /output.dat
  • 63. Hadoop: Summary. Our dataset is too big and we need the results fast. Mahout gives us, out of the box, all we need to run on Hadoop. We can pack everything together with Maven. Our Machine Learning algorithms are running on the cluster!
  • 64. References: Mahout's website. Mahout in Action, May 2011 (est.). Introducing Apache Mahout @ IBM developerWorks.
  • 65. Thank You Any questions? We appreciate your feedback

Editor's Notes

  1. Machine learning
  2. Make it clear that I don’t want the crowd to read the table, it’s only to generate an overwhelming sensation
  3. Recommender systems
  4. Strictly speaking, these are examples of “collaborative filtering” -- producing recommendations based on, and only based on, knowledge of users’ relationships to items. These techniques require no knowledge of the properties of the items themselves. This is, in a way, an advantage. This recommender framework couldn’t care less whether the “items” are books, theme parks, flowers, or even other people, since nothing about their attributes enters into any of the input.
  5. UserSimilarity: way to compare users (user-based approach). ItemSimilarity: way to compare items (item-based approach). Recommender: interface for providing recommendations. UserNeighborhood: interface for computing a neighborhood of similar users that can then be used by the Recommenders.
  6. http://www.flickr.com/photos/martinimike/3770274175/ http://www.flickr.com/photos/fotoosvanrobin/3182238046/ http://www.flickr.com/photos/this_girl_daydreams/3190110968/ http://www.flickr.com/photos/19998197@N00/3238445535/
  7. Talk about RecommenderEvaluator
  8. Mean, standard deviation
  9. Explain that we are going to take all project dependencies, pack them together with Ant and include them on the package phase of Maven