Mahout By: Ariel Kogan
Who's this guy? Java Framework Team at IDI; 10 years of experience in IT; 6 years of experience in Java; Master's in Informatics Engineering, specializing in Artificial Intelligence; has a weird accent; Aliyah. Photo: http://www.flickr.com/photos/triphenawong/4752510292/
Agenda: Machine Learning; Mahout; Recommender Engines; Clustering; Categorization; Hadoop
Machine Learning
Machine Learning Whatchatalkin' 'bout, Willis?
Machine Learning: well-known use cases: Recommender Engines; Clustering; Classification
Machine Learning Recommender Engines: Amazon
Machine Learning Recommender Engines: Facebook
Machine Learning Clustering: Google News
Machine Learning Classification: Spam Detection
Machine Learning Classification: Picasa face recognition
Machine Learning: Why learn "Machine Learning"? Because it's interesting; because it makes money.
Mahout
Mahout: What is it? Open source project by the Apache Software Foundation. Goal: to build scalable machine learning libraries. Large data sets (Hadoop); commercially friendly Apache Software License; community.
Mahout: What's that name? Mahout - [muh-hout] - (mə'haʊt): a mahout is a person who keeps and drives an elephant. The name Mahout comes from the project's use of Apache Hadoop, which has a yellow elephant as its logo, for scalability and fault tolerance.
Mahout Mahout and its related projects
Mahout History
Mahout: History (timeline, 2008 to 2010): the Lucene Project Management Committee announces the creation of the Mahout subproject; Taste Collaborative Filtering donates its codebase to the Mahout project; Release 0.1; Release 0.2; Release 0.3; Mahout becomes an Apache top-level project; Release 0.4; Mahout is presented at AlphaCSP's The Edge 2010.
Mahout Mailing lists activity
Mahout: Similar products. Yet another framework? Weka (since 1999); 38 Java projects listed on mloss.org (Machine Learning Open Source Software).
Mahout: Machine learning challenges. Large amount of input data: techniques work better. Nature of the deploying context: must produce results quickly. The amount of input is so large that it is not feasible to process it all on one computer, even a powerful one.
Mahout: Scalability. Mahout core algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm.
Mahout: MapReduce. Programming model introduced by Google in 2004; many real-world tasks are expressible in this model ("Map-Reduce for Machine Learning on Multicore", Stanford CS Department paper, 2006); provides automatic parallelization and distribution; runs on large clusters of compute nodes; highly scalable; Hadoop is Apache's open source implementation.
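To make the map/reduce programming model a little more concrete, here is a minimal in-memory sketch in plain Java. It does not use Hadoop or Mahout at all; the class name MapReduceSketch and the word-count task are assumptions chosen only for illustration. It shows the three conceptual phases: map each record to key/value pairs, group the pairs by key, and reduce each group to one value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map/reduce model (hypothetical class, no Hadoop involved).
class MapReduceSketch {

    // Map phase: turn one input record (a line of text) into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: combine all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: map every record, group by key (the "shuffle"), then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("mahout drives the elephant", "the elephant is hadoop")));
    }
}
```

On a real cluster the map and reduce calls run on different nodes and the grouping step moves data over the network; the sketch only mirrors the shape of the computation.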
Mahout
Mahout
Recommender Engines
Recommender Engines: Approaches. User based; item based; collaborative filtering vs. content-based recommendation.
Recommender Engines: What do we need? Data model (users, items, preferences/ratings); ItemSimilarity; UserSimilarity; UserNeighborhood; Recommender.
Recommender Engines: T-bone; Chocolate; Lettuce; Rump. Photos: http://www.flickr.com/photos/martinimike/3770274175/ http://www.flickr.com/photos/fotoosvanrobin/3182238046/ http://www.flickr.com/photos/this_girl_daydreams/3190110968/ http://www.flickr.com/photos/19998197@N00/3238445535/
Recommender Engines 5 -5
Recommender Engines: Kuki; The Vegan; Gilad; Ariel
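Recommender Engines:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    // We create a DataModel based on the information contained in food.csv
    DataModel model = new FileDataModel(new File("food.csv"));
    // We use one of the several user similarity functions we have available
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    // Same thing with the UserNeighborhood definition
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(hoodSize, similarity, model);
    // Finally we can build our recommender
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // And ask for recommendations for a specific user
    List<RecommendedItem> recommendations = recommender.recommend(userId, howMany);
    for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
    }

User similarity implementations: CachingUserSimilarity; EuclideanDistanceSimilarity; GenericUserSimilarity; LogLikelihoodSimilarity; PearsonCorrelationSimilarity; SpearmanCorrelationSimilarity; TanimotoCoefficientSimilarity; UncenteredCosineSimilarity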
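For readers who want to see roughly what a user-based recommender with a Pearson similarity computes, the following is a dependency-free sketch in plain Java. It is not Mahout's implementation: the class UserBasedSketch, its methods, and the ratings in main are all hypothetical, and the estimate is a simple similarity-weighted average over positively correlated users rather than a fixed-size neighborhood.

```java
// Plain-Java sketch of user-based collaborative filtering (hypothetical, not Mahout code).
class UserBasedSketch {

    // Pearson correlation over the items both users rated; NaN marks "no rating".
    static double pearson(double[] a, double[] b) {
        int n = 0;
        double sumA = 0, sumB = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                n++;
                sumA += a[i];
                sumB += b[i];
            }
        }
        if (n < 2) {
            return 0.0; // not enough overlap to say anything
        }
        double meanA = sumA / n, meanB = sumB / n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                double da = a[i] - meanA, db = b[i] - meanB;
                cov += da * db;
                varA += da * da;
                varB += db * db;
            }
        }
        if (varA == 0 || varB == 0) {
            return 0.0; // a user with constant ratings has no defined correlation
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Estimated rating of `item` for `user`: similarity-weighted average of the
    // ratings given by positively correlated users. Returns NaN if nobody qualifies.
    static double estimate(double[][] ratings, int user, int item) {
        double weighted = 0, totalSimilarity = 0;
        for (int other = 0; other < ratings.length; other++) {
            if (other == user || Double.isNaN(ratings[other][item])) {
                continue;
            }
            double sim = pearson(ratings[user], ratings[other]);
            if (sim > 0) {
                weighted += sim * ratings[other][item];
                totalSimilarity += sim;
            }
        }
        return totalSimilarity == 0 ? Double.NaN : weighted / totalSimilarity;
    }

    public static void main(String[] args) {
        double none = Double.NaN;
        // Columns: T-bone, Chocolate, Lettuce, Rump. All values are made up for this sketch.
        double[][] ratings = {
            {none, 4, -4, 2},   // user 0: the one we recommend for
            {4, 5, -5, 3},      // user 1: similar taste
            {-4, -5, 5, -3},    // user 2: opposite taste
        };
        System.out.println("Estimated rating of item 0 for user 0: " + estimate(ratings, 0, 0));
    }
}
```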
Recommender Engines: What would we recommend to Ariel? Recommendation for Ariel: T-bone, rating 4.0.
Recommender Engines: Kuki; The Vegan; Gilad; Ariel
Recommender Engines: No initial information. 10 most popular; random selection; what other customers are looking at right now; bestsellers; best prices; nothing at all.
Clustering
Clustering: Clustering is about drawing lines.
Clustering: Clustering steps
Clustering: Possible weather conditions recognition. Clustering inputs: temperature; wind direction; humidity; wind speed. Icons: http://www.icons-land.com
Clustering Vector representation 25 / 50 = 0.5
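The slide's 25 / 50 = 0.5 reads like a raw reading divided by a maximum value. Assuming that interpretation (the slide does not spell it out), a feature vector can be built by scaling each reading into the range [0, 1]. The class name, the feature order, and every maximum below are assumptions made for this sketch.

```java
import java.util.Arrays;

// Sketch: turn raw readings into a vector of values in [0, 1] (hypothetical helper).
class FeatureVectorSketch {

    // Divide each raw reading by the assumed maximum of its feature.
    static double[] normalize(double[] raw, double[] max) {
        if (raw.length != max.length) {
            throw new IllegalArgumentException("raw and max must have the same length");
        }
        double[] vector = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            vector[i] = raw[i] / max[i];
        }
        return vector;
    }

    public static void main(String[] args) {
        // Assumed order: temperature, wind direction, humidity, wind speed.
        double[] raw = {25, 90, 40, 10};
        double[] max = {50, 360, 100, 100}; // assumed upper bounds, for illustration only
        System.out.println(Arrays.toString(normalize(raw, max)));
    }
}
```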
Clustering: Samples generation. 300 samples, mean [0.0, 2.0], SD 0.1; 500 samples, mean [1.0, 1.0], SD 3.0; 300 samples, mean [1.0, 0.0], SD 0.5.
Clustering Iterations with Fuzzy K-Means
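As a rough illustration of what one fuzzy k-means iteration does, here is a plain-Java sketch. It is not Mahout's implementation; it follows the textbook fuzzy c-means update, where each point belongs to every cluster with a membership between 0 and 1 and each center is the membership-weighted mean of all points. The class name, the fuzziness value m = 2.0, the random seed, and the initial centers are assumptions; the sample sizes, means, and standard deviations in main are the ones listed on the samples-generation slide.

```java
import java.util.Arrays;
import java.util.Random;

// Textbook fuzzy k-means (fuzzy c-means) sketch in plain Java; hypothetical, not Mahout code.
class FuzzyKMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Membership of x in each cluster j: u_j = 1 / sum_l (d_j / d_l)^(2 / (m - 1)).
    static double[] memberships(double[] x, double[][] centers, double m) {
        int k = centers.length;
        double[] d = new double[k];
        double[] u = new double[k];
        for (int j = 0; j < k; j++) {
            d[j] = distance(x, centers[j]);
            if (d[j] == 0.0) {
                u[j] = 1.0; // x sits exactly on this center: full membership here
                return u;
            }
        }
        double exponent = 2.0 / (m - 1.0);
        for (int j = 0; j < k; j++) {
            double sum = 0;
            for (int l = 0; l < k; l++) {
                sum += Math.pow(d[j] / d[l], exponent);
            }
            u[j] = 1.0 / sum;
        }
        return u;
    }

    // One iteration: each new center is the mean of all points weighted by u^m.
    static double[][] iterate(double[][] points, double[][] centers, double m) {
        int k = centers.length, dim = centers[0].length;
        double[][] next = new double[k][dim];
        double[] weight = new double[k];
        for (double[] p : points) {
            double[] u = memberships(p, centers, m);
            for (int j = 0; j < k; j++) {
                double w = Math.pow(u[j], m);
                weight[j] += w;
                for (int c = 0; c < dim; c++) {
                    next[j][c] += w * p[c];
                }
            }
        }
        for (int j = 0; j < k; j++) {
            for (int c = 0; c < dim; c++) {
                // Keep the old center if no point has any weight for this cluster.
                next[j][c] = weight[j] == 0 ? centers[j][c] : next[j][c] / weight[j];
            }
        }
        return next;
    }

    // Draw n two-dimensional Gaussian samples around (meanX, meanY).
    static double[][] sample(Random rnd, int n, double meanX, double meanY, double sd) {
        double[][] pts = new double[n][2];
        for (int i = 0; i < n; i++) {
            pts[i][0] = meanX + sd * rnd.nextGaussian();
            pts[i][1] = meanY + sd * rnd.nextGaussian();
        }
        return pts;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // arbitrary seed
        double[][] a = sample(rnd, 300, 0.0, 2.0, 0.1);
        double[][] b = sample(rnd, 500, 1.0, 1.0, 3.0);
        double[][] c = sample(rnd, 300, 1.0, 0.0, 0.5);
        double[][] points = new double[a.length + b.length + c.length][];
        System.arraycopy(a, 0, points, 0, a.length);
        System.arraycopy(b, 0, points, a.length, b.length);
        System.arraycopy(c, 0, points, a.length + b.length, c.length);

        double[][] centers = {{-1, -1}, {0, 3}, {2, 2}}; // arbitrary starting guesses
        for (int i = 0; i < 10; i++) {
            centers = iterate(points, centers, 2.0);
        }
        System.out.println(Arrays.deepToString(centers));
    }
}
```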
Clustering: Clustering discovery. Original data generation; discovered clusters.
Clustering: distance measures: CosineDistanceMeasure; EuclideanDistanceMeasure; MahalanobisDistanceMeasure; ManhattanDistanceMeasure; SquaredEuclideanDistanceMeasure; TanimotoDistanceMeasure; WeightedDistanceMeasure; WeightedEuclideanDistanceMeasure; WeightedManhattanDistanceMeasure
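To give a feel for how some of these measures differ, the sketch below implements the standard textbook formulas for a few of them in plain Java. These are not the Mahout classes listed above; the class DistanceSketch and its method names are hypothetical, and the cosine and Tanimoto versions assume non-zero vectors.

```java
// Textbook distance formulas in plain Java (hypothetical helper, not the Mahout classes).
class DistanceSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Sum of squared coordinate differences.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Straight-line distance.
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    // Sum of absolute coordinate differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // 1 - cos(angle between a and b); ignores vector length.
    static double cosine(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // 1 - Tanimoto coefficient, where the coefficient is a.b / (|a|^2 + |b|^2 - a.b).
    static double tanimoto(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] p = {0, 0.5};
        double[] q = {1, 1};
        System.out.println("euclidean = " + euclidean(p, q));
        System.out.println("manhattan = " + manhattan(p, q));
        System.out.println("cosine    = " + cosine(p, q));
        System.out.println("tanimoto  = " + tanimoto(p, q));
    }
}
```

Which measure fits depends on the data: cosine ignores magnitude, which suits document vectors, while Euclidean and Manhattan are sensitive to feature scale.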
Categorization
Categorization: Categorization steps
Categorization: Our example: what do we want to do? Document; Classifier; Java; Sport.
Categorization: Documents preparation. Labeled documents; BayesFileFormatter (Lucene's Analyzers); output format: Label <tab> evidence1 <space> evidence2; Training; Test.
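The prepared-document format on this slide (a label, a tab, then space-separated evidence tokens) is simple enough to read back with a few lines of plain Java. The sketch below is only an illustration of that line format; the class LabeledDocumentSketch is hypothetical and is not part of Mahout or of BayesFileFormatter.

```java
import java.util.Arrays;

// Sketch: parse one "Label <tab> evidence1 <space> evidence2 ..." line (hypothetical helper).
class LabeledDocumentSketch {
    final String label;
    final String[] evidence;

    LabeledDocumentSketch(String label, String[] evidence) {
        this.label = label;
        this.evidence = evidence;
    }

    // Everything before the first tab is the label; the rest is split on spaces.
    static LabeledDocumentSketch parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected: label<TAB>evidence...");
        }
        String label = line.substring(0, tab);
        String rest = line.substring(tab + 1).trim();
        String[] evidence = rest.isEmpty() ? new String[0] : rest.split(" +");
        return new LabeledDocumentSketch(label, evidence);
    }

    public static void main(String[] args) {
        LabeledDocumentSketch doc = parse("sport\tgoal match referee");
        System.out.println(doc.label + " -> " + Arrays.toString(doc.evidence));
    }
}
```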
Categorization Using the classifier
Categorization: Categorization testing, the confusion matrix

    Summary
    -------------------------------------------------------
    Correctly Classified Instances          :     93    93%
    Incorrectly Classified Instances        :      7     7%
    Total Classified Instances              :    100
    =======================================================
    Confusion Matrix
    -------------------------------------------------------
    java   sport  <--Classified as
    56     3      |  59   java
    4      37     |  41   sport
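The summary figures follow directly from the matrix: the diagonal holds the correctly classified documents (56 + 37 = 93 out of 100). The plain-Java sketch below recomputes that accuracy and adds per-class precision and recall. It assumes rows are the actual labels (the row totals 59 and 41) and columns are the "classified as" labels; the class ConfusionMatrixSketch is hypothetical and not part of Mahout.

```java
// Sketch: accuracy, precision, and recall from a confusion matrix (hypothetical helper).
// Assumption: m[actual][predicted], matching the row totals shown on the slide.
class ConfusionMatrixSketch {

    // Fraction of all instances that lie on the diagonal.
    static double accuracy(int[][] m) {
        int correct = 0, total = 0;
        for (int actual = 0; actual < m.length; actual++) {
            for (int predicted = 0; predicted < m[actual].length; predicted++) {
                total += m[actual][predicted];
                if (actual == predicted) {
                    correct += m[actual][predicted];
                }
            }
        }
        return (double) correct / total;
    }

    // Of the instances that really belong to class c, how many were classified as c.
    static double recall(int[][] m, int c) {
        int rowTotal = 0;
        for (int predicted = 0; predicted < m[c].length; predicted++) {
            rowTotal += m[c][predicted];
        }
        return (double) m[c][c] / rowTotal;
    }

    // Of the instances classified as c, how many really belong to c.
    static double precision(int[][] m, int c) {
        int columnTotal = 0;
        for (int actual = 0; actual < m.length; actual++) {
            columnTotal += m[actual][c];
        }
        return (double) m[c][c] / columnTotal;
    }

    public static void main(String[] args) {
        int[][] matrix = {{56, 3}, {4, 37}}; // index 0 = java, index 1 = sport
        System.out.println("accuracy         = " + accuracy(matrix));
        System.out.println("java recall      = " + recall(matrix, 0));
        System.out.println("java precision   = " + precision(matrix, 0));
        System.out.println("sport recall     = " + recall(matrix, 1));
        System.out.println("sport precision  = " + precision(matrix, 1));
    }
}
```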
Take me to the cluster
Hadoop: Why do we need distributed computing? The size of our dataset can't be handled by a single machine. Scale-up vs. scale-out. We need the results in nearly real time.
Hadoop Data Results Hadoop Compute Cluster
Hadoop: Hadoop Jobs. We need to: configure the job; submit it; control its execution; query its state. We want to: just run our machine learning algorithm!
Hadoop: Mahout's AbstractJob and Drivers. Mahout provides an out-of-the-box AbstractJob class and several Job and Driver implementations in order to run machine learning algorithms on the cluster without any hassle.
Hadoop: What we need. Our code, including a Job; Mahout jars; Hadoop jars; everyone's dependency jars; resources; the dataset.
Hadoop Packaging a Job – The Maven solution pom.xml
Hadoop Job feeding Job Dataset Hadoop Compute Cluster
Hadoop We take the project’s dependencies
Hadoop Using an Ant task, we pack everything together

Mahout's presentation at AlphaCSP's The Edge 2010

  • 61. Hadoop: We attach the Job jar on the packaging phase
  • 62. Hadoop: Running our Job
      Upload our job to the HDFS:
        $ hadoop fs -put myjob.job /myjob.job
      Upload the dataset to the HDFS:
        $ hadoop fs -put dataset.dat /dataset.dat
      Run the job (arguments: jar, class, input, output):
        $ hadoop jar /myjob.job c.a.RecommenderJob /dataset.dat /output.dat
  • 63. Hadoop: Summary. Our dataset is too big and we need the results fast. Mahout gives us, out of the box, all we need to run on Hadoop. We can pack everything together with Maven. Our machine learning algorithms are running on the cluster!
  • 64. References: Mahout's website; Mahout in Action, May 2011 (est.); Introducing Apache Mahout @ IBM developerWorks
  • 65. Thank you. Any questions? We appreciate your feedback.

Editor's Notes

  1. Machine learning
  2. Make it clear that I don't want the crowd to read the table; it's only there to create an overwhelming impression
  3. Recommender systems
  4. Strictly speaking, these are examples of “collaborative filtering” -- producing recommendations based on, and only based on, knowledge of users’ relationships to items. These techniques require no knowledge of the properties of the items themselves. This is, in a way, an advantage. This recommender framework couldn’t care less whether the “items” are books, theme parks, flowers, or even other people, since nothing about their attributes enters into any of the input.
  5. UserSimilarity: way to compare users (user-based approach). ItemSimilarity: way to compare items (item-based approach). Recommender: interface for providing recommendations. UserNeighborhood: interface for computing a neighborhood of similar users that can then be used by the Recommenders.
  6. http://www.flickr.com/photos/martinimike/3770274175/ http://www.flickr.com/photos/fotoosvanrobin/3182238046/ http://www.flickr.com/photos/this_girl_daydreams/3190110968/ http://www.flickr.com/photos/19998197@N00/3238445535/
  7. Talk about RecommenderEvaluator
  8. Mean; standard deviation
  9. Explain that we are going to take all project dependencies, pack them together with Ant and include them in the package phase of Maven