Scalable Machine Learning with                   Hadoop (most of the time)                   Grant Ingersoll              ...
Anyone Here Use Machine Learning?• Any users of:  • Google?     • Search     • Translation     • Priority Inbox  • Faceboo...
Topics    • What is scalable machine learning?    • Use Cases    • Approaches      • Hadoop-based      • Alternatives    •...
Machine Learning  • “Machine Learning is programming computers to     optimize a performance criterion using example data ...
What does scalable mean for us?  • As data grows linearly, either scale linearly in time or in     machines     • 2X data ...
Common Use Cases                    http://www.readwriteweb.com/archives/linkedin_plots_your_profession                   ...
My Use Cases                                             Search                                            Relevance      ...
Scalable Approaches    • Mind the Gap      • Algorithms are the fun stuff, but you’ll spend more time on        ETL, featu...
Open Source Machine Learning Libraries    • Apache Mahout    • Vowpal Wabbit    • R Stats Project    • Weka    • LibSVM, S...
Apache Mahout• An Apache Software Foundation project to  create scalable machine learning libraries  under the Apache Soft...
Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+MahoutProprietary© 2012 LucidWorks
What Can I do with Mahout Right Now?                    3 “C”s + ExtrasProprietary© 2012 LucidWorks
Collaborative Filtering• Recommender Approaches  • User based  • Item based• Online and Offline support  • Offline can uti...
Hadoop Recommenders     • Alternating Least Squares          • Iterative, but scales well          • Deals well with spars...
Clustering• Document level  • Group documents based    on a notion of similarity  • K-Means, Fuzzy K-    Means, Dirichlet,...
Clustering In Hadoop• Many people start with K-Means, but others can be  more effective• Challenges  • Iterative nature of...
Classification                    “This gives a raw                                  classification rate                  ...
Scaling Mahout Classification                                 Client (Train/Test/Classify)                                ...
Other Mahout Features• Apache Licensed:  • Primitive Collections!  • Extensive Math library     • Vectors, Matrices, Stati...
What’s Next for Mahout?     • Streaming K-Means     • Map/Reduce Training for HMM?     • Clean Up towards 1.0 release     ...
Resources     • http://www.lucidworks.com     • grant@lucidworks.com     • @gsingers     Proprietary21   © 2012 LucidWorks
Upcoming SlideShare
Loading in...5
×

Scalable Machine Learning with Hadoop

4,290

Published on

My intro to machine learning talk at

0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
4,290
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
94
Comments
0
Likes
7
Embeds 0
No embeds

No notes for slide
  • A few things come to mind
  • Transcript of "Scalable Machine Learning with Hadoop"

    1. 1. Scalable Machine Learning with Hadoop (most of the time) Grant Ingersoll Chief Scientist October 2, 2012© Copyright 2012
    2. 2. Anyone Here Use Machine Learning?• Any users of: • Google? • Search • Translation • Priority Inbox • Facebook? Google Translate • Twitter? • LinkedIn?Proprietary© 2012 LucidWorks
    3. 3. Topics • What is scalable machine learning? • Use Cases • Approaches • Hadoop-based • Alternatives • What is Apache Mahout? Proprietary3 © 2012 LucidWorks
    4. 4. Machine Learning • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience” • Intro. To Machine Learning by E. Alpaydin • Lots of related fields: • Information Retrieval • Stats • Biology • Linear algebra • Many moreProprietary© 2012 LucidWorks
    5. 5. What does scalable mean for us? • As data grows linearly, either scale linearly in time or in machines • 2X data requires 2X time or 2X machines (or less!) • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm • Some algorithms won’t scale to massive machine clusters • Others fit logically on a Map Reduce framework like Apache Hadoop • Still others will need different distributed programming models • Be pragmaticProprietary© 2012 LucidWorks
    6. 6. Common Use Cases http://www.readwriteweb.com/archives/linkedin_plots_your_profession al_network_with_inma.phpProprietary© 2012 LucidWorks
    7. 7. My Use Cases Search Relevance Recommendations Related Items Content/User Classification Phrases Topics Analytics Discovery Proprietary7 © 2012 LucidWorks
    8. 8. Scalable Approaches • Mind the Gap • Algorithms are the fun stuff, but you’ll spend more time on ETL, feature selection and post-processing • Simpler is usually better at scale 1. Scale Data Pipeline -> Sample -> Sequential 2. Hadoop 3. Ensemble (distribute many sequential models) 4. Spark, MPI & BSP, Others Proprietary8 © 2012 LucidWorks
    9. 9. Open Source Machine Learning Libraries • Apache Mahout • Vowpal Wabbit • R Stats Project • Weka • LibSVM, SVMLight • Many, many more Proprietary9 © 2012 LucidWorks
    10. 10. Apache Mahout• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License • http://mahout.apache.org• Why Mahout? • Many Open Source ML libraries are either: • Lack Community • Lack Documentation and Examples • Lack Scalability • Lack the Apache License • Or are research-oriented http://dictionary.reference.com/browse/mahoutProprietary© 2012 LucidWorks
    11. 11. Who uses Mahout? https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+MahoutProprietary© 2012 LucidWorks
    12. 12. What Can I do with Mahout Right Now? 3 “C”s + ExtrasProprietary© 2012 LucidWorks
    13. 13. Collaborative Filtering• Recommender Approaches • User based • Item based• Online and Offline support • Offline can utilize Hadoop• Many different Similarity measures • Cosine, LLR, Tanimoto, Pearson, othersProprietary© 2012 LucidWorks
    14. 14. Hadoop Recommenders • Alternating Least Squares • Iterative, but scales well • Deals well with sparseness • “Large-scale Parallel Collaborative Filtering for the Netflix Prize” by Zhou et. al • https://cwiki.apache.org/MAHOUT/collaborative-filtering-with- als-wr.html • Slope One • Simple yet effective • Pseudo • Distribute sequential approach across Hadoop nodes Proprietary14 © 2012 LucidWorks
    15. 15. Clustering• Document level • Group documents based on a notion of similarity • K-Means, Fuzzy K- Means, Dirichlet, Canopy, Mean-Shift, Spectral, Top- Down • Topic Modeling • Pluggable Distance • Cluster words across Measures documents to identify topics • Latent Dirichlet Allocation • Using Collapsed Variational BayesProprietary http://carrotsearch.com/foamtree-overview.html© 2012 LucidWorks
    16. 16. Clustering In Hadoop• Many people start with K-Means, but others can be more effective• Challenges • Iterative nature of many clustering algorithms can be slow • Distance measures and other factors can have dramatic impact on performance and quality • When in doubt, experimentProprietary© 2012 LucidWorks
    17. 17. Classification “This gives a raw classification rate requirement of tens of millions of• Place new items into classifications per predefined categories second, which is, as they say in the old• Online and Offline country, a lot.” supported “Mahout in Action” http://awe.sm/5FyNe• Hadoop • Sequential • Naïve Bayes • Logistic Regression • Complementary Naïve Bayes • Stochastic Grad. • Decision Forests Descent • Clustering-based • Hidden Markov Model • Winnow/PerceptronProprietary© 2012 LucidWorks
    18. 18. Scaling Mahout Classification Client (Train/Test/Classify) LucidWorks Big Data Classifier Model Model ... Node 1 A N Zookeeper ... Classifier Model Model ... Node N C Q Model Model Model HDFS Model Offline (Map/Reduce?) TrainingProprietary© 2012 LucidWorks
    19. 19. Other Mahout Features• Apache Licensed: • Primitive Collections! • Extensive Math library • Vectors, Matrices, Statistics, etc. • Vector Encoding options• Singular Value Decomposition• Frequent Pattern Mining• Collocations (statistically interesting phrases)• I/O: Lucene, Cassandra, MongoDB and othersProprietary© 2012 LucidWorks
    20. 20. What’s Next for Mahout? • Streaming K-Means • Map/Reduce Training for HMM? • Clean Up towards 1.0 release • 1.0? Proprietary20 © 2012 LucidWorks
    21. 21. Resources • http://www.lucidworks.com • grant@lucidworks.com • @gsingers Proprietary21 © 2012 LucidWorks
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×