Scalable Machine Learning with Hadoop
My intro to machine learning talk at

Presentation Transcript

  • Scalable Machine Learning with Hadoop (most of the time)
    Grant Ingersoll, Chief Scientist
    October 2, 2012
    © Copyright 2012
  • Anyone Here Use Machine Learning?
    • Any users of:
      • Google?
        • Search
        • Translation
        • Priority Inbox
      • Facebook?
      • Twitter?
      • LinkedIn?
    (Slide image label: Google Translate)
    Proprietary © 2012 LucidWorks
  • Topics
    • What is scalable machine learning?
    • Use Cases
    • Approaches
      • Hadoop-based
      • Alternatives
    • What is Apache Mahout?
  • Machine Learning
    • “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
      • Intro. to Machine Learning by E. Alpaydin
    • Lots of related fields:
      • Information Retrieval
      • Stats
      • Biology
      • Linear algebra
      • Many more
  • What does scalable mean for us?
    • As data grows linearly, either scale linearly in time or in machines
      • 2X data requires 2X time or 2X machines (or less!)
    • Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
      • Some algorithms won’t scale to massive machine clusters
      • Others fit logically on a Map Reduce framework like Apache Hadoop
      • Still others will need different distributed programming models
      • Be pragmatic
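
To illustrate what "fits logically on a Map Reduce framework" can mean, here is a minimal single-process sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop code and is not from the talk; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for illustration. The point is that each record is mapped independently and each key is reduced independently, which is what lets a framework spread the work across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record; here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key; here, a simple sum."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = run_job(["Hadoop scales out", "Mahout runs on Hadoop"])
print(counts)
```

Algorithms whose work decomposes into independent per-record and per-key steps like this tend to map well onto Hadoop; algorithms that need many dependent passes over the data are the ones that may call for a different model.
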
  • Common Use Cases
    • http://www.readwriteweb.com/archives/linkedin_plots_your_professional_network_with_inma.php
  • My Use Cases
    Search Relevance Recommendations Related Items Content/User Classification Phrases Topics Analytics Discovery
  • Scalable Approaches
    • Mind the Gap
      • Algorithms are the fun stuff, but you’ll spend more time on ETL, feature selection and post-processing
    • Simpler is usually better at scale
      1. Scale Data Pipeline -> Sample -> Sequential
      2. Hadoop
      3. Ensemble (distribute many sequential models)
      4. Spark, MPI & BSP, Others
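
The first approach above (scale the data pipeline, sample, then train sequentially) depends on drawing a fixed-size sample from data that may be too large to hold in memory. The talk does not name a sampling method; the sketch below assumes reservoir sampling (Algorithm R), one common choice, and the function name `reservoir_sample` is hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are held in memory at any time, so the stream can be
    much larger than RAM.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


subset = reservoir_sample(range(10_000), k=100)
print(len(subset))
```

The resulting sample can then be fed to an ordinary single-machine learner, which is often simpler to run and debug than a distributed job.
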
  • Open Source Machine Learning Libraries
    • Apache Mahout
    • Vowpal Wabbit
    • R Stats Project
    • Weka
    • LibSVM, SVMLight
    • Many, many more
  • Apache Mahout
    • An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
      • http://mahout.apache.org
    • Why Mahout?
      • Many Open Source ML libraries either:
        • Lack Community
        • Lack Documentation and Examples
        • Lack Scalability
        • Lack the Apache License
        • Or are research-oriented
    • http://dictionary.reference.com/browse/mahout
  • Who uses Mahout?
    • https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
  • What Can I do with Mahout Right Now?
    • 3 “C”s + Extras
  • Collaborative Filtering
    • Recommender Approaches
      • User based
      • Item based
    • Online and Offline support
      • Offline can utilize Hadoop
    • Many different Similarity measures
      • Cosine, LLR, Tanimoto, Pearson, others
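
As a rough illustration of three of the similarity measures named above, here is a plain-Python sketch using the textbook definitions. It is not Mahout's implementation; the function names are hypothetical, and the vector-based measures assume both inputs are equal-length lists of ratings over the same items.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: overlap of two item sets, ignoring ratings."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(a, b):
    """Pearson correlation between two rating vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b) if std_a and std_b else 0.0


print(cosine([1, 2, 3], [2, 4, 6]))
print(tanimoto({"a", "b", "c"}, {"b", "c", "d"}))
print(pearson([1, 2, 3], [3, 2, 1]))
```

Tanimoto only looks at which items were seen, so it suits data without explicit ratings; cosine and Pearson use the rating values, and Pearson additionally corrects for users who rate everything high or low.
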
  • Hadoop Recommenders
    • Alternating Least Squares
      • Iterative, but scales well
      • Deals well with sparseness
      • “Large-scale Parallel Collaborative Filtering for the Netflix Prize” by Zhou et al.
      • https://cwiki.apache.org/MAHOUT/collaborative-filtering-with-als-wr.html
    • Slope One
      • Simple yet effective
    • Pseudo
      • Distribute sequential approach across Hadoop nodes
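
To show why Slope One is described as "simple yet effective", here is a small in-memory sketch of the weighted Slope One scheme. It is not Mahout's Hadoop job; the function names, the data layout (`{user: {item: rating}}`), and the sample ratings are assumptions for illustration.

```python
from collections import defaultdict


def train_slope_one(ratings):
    """Accumulate rating differences and co-rating counts for every item pair.

    ratings: {user: {item: rating}}
    Returns (diffs, counts), both keyed by (item_j, item_i).
    """
    diffs = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for j, r_j in user_ratings.items():
            for i, r_i in user_ratings.items():
                if i != j:
                    diffs[(j, i)] += r_j - r_i
                    counts[(j, i)] += 1
    return diffs, counts


def predict_rating(user_ratings, target, diffs, counts):
    """Predict a rating for `target` from the user's other ratings.

    Each rated item i contributes (average difference + the user's rating
    of i), weighted by how many users rated both items.
    """
    numerator, denominator = 0.0, 0
    for i, r_i in user_ratings.items():
        c = counts.get((target, i), 0)
        if i != target and c:
            numerator += (diffs[(target, i)] / c + r_i) * c
            denominator += c
    return numerator / denominator if denominator else None


ratings = {
    "alice": {"A": 5, "B": 3, "C": 2},
    "bob": {"A": 3, "B": 4},
    "carol": {"B": 2, "C": 5},
}
diffs, counts = train_slope_one(ratings)
prediction = predict_rating(ratings["bob"], "C", diffs, counts)
print(prediction)
```

Training is just summing differences per item pair, which is the kind of per-key aggregation that distributes naturally.
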
  • Clustering
    • Document level
      • Group documents based on a notion of similarity
      • K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, Spectral, Top-Down
      • Pluggable Distance Measures
    • Topic Modeling
      • Cluster words across documents to identify topics
      • Latent Dirichlet Allocation
        • Using Collapsed Variational Bayes
    • http://carrotsearch.com/foamtree-overview.html
  • Clustering In Hadoop
    • Many people start with K-Means, but others can be more effective
    • Challenges
      • Iterative nature of many clustering algorithms can be slow
      • Distance measures and other factors can have dramatic impact on performance and quality
      • When in doubt, experiment
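
The iterative nature mentioned above is easy to see in K-Means: every iteration is a full pass over the data (assign each point to its nearest centroid, then average each cluster), and the passes must run one after another. The single-machine sketch below is an illustration under assumptions, not Mahout code; the function names are hypothetical, and the `distance` parameter stands in for a pluggable distance measure.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance; any other measure with this signature can be swapped in."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_iteration(points, centroids, distance=euclidean):
    """One full pass: assign points (map-like step), then average (reduce-like step)."""
    assigned = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
        assigned[nearest].append(p)
    new_centroids = []
    for c, old in enumerate(centroids):
        members = assigned.get(c)
        if members:
            new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
        else:
            new_centroids.append(old)  # keep an empty cluster's centroid in place
    return new_centroids


def kmeans(points, centroids, max_iterations=20, tolerance=1e-6, distance=euclidean):
    """Repeat passes until the centroids stop moving or the iteration cap is hit."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids, distance)
        shift = max(euclidean(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tolerance:
            break
    return centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids)
```

On a cluster, each call to something like `kmeans_iteration` typically becomes its own job that rereads the input, which is where much of the slowness comes from. Changing `distance` changes both the cost of the inner loop and the shape of the resulting clusters.
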
  • Classification
    • Place new items into predefined categories
    • Online and Offline supported
    • Hadoop
      • Naïve Bayes
      • Complementary Naïve Bayes
      • Decision Forests
      • Clustering-based
    • Sequential
      • Logistic Regression
      • Stochastic Grad. Descent
      • Hidden Markov Model
      • Winnow/Perceptron
    • “This gives a raw classification rate requirement of tens of millions of classifications per second, which is, as they say in the old country, a lot.”
      • “Mahout in Action” http://awe.sm/5FyNe
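
As an illustration of the sequential (online) side of this list, the sketch below trains a logistic regression model with stochastic gradient descent, updating the weights one example at a time. It is a minimal plain-Python version, not Mahout's implementation; the function names, learning rate, epoch count, and toy data are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, bias, features):
    """Probability that the example belongs to class 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def sgd_train(examples, n_features, learning_rate=0.1, epochs=200):
    """Fit weights by updating after every single (features, label) example."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict_probability(weights, bias, features)
            bias += learning_rate * error
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
    return weights, bias


# Toy data: one feature, label 1 when the feature is large.
examples = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
weights, bias = sgd_train(examples, n_features=1)
print(predict_probability(weights, bias, [0.0]))
print(predict_probability(weights, bias, [4.0]))
```

Because each update touches only one example, the model can learn from a stream without holding the training set in memory, which is why this style of learner can scale on a single machine.
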
  • Scaling Mahout Classification
    [Architecture diagram. Labels: Client (Train/Test/Classify); LucidWorks Big Data; Classifier Node 1 ... Node N, each holding models; Zookeeper; HDFS (models); Offline (Map/Reduce?) Training]
  • Other Mahout Features
    • Apache Licensed:
      • Primitive Collections!
      • Extensive Math library
        • Vectors, Matrices, Statistics, etc.
      • Vector Encoding options
    • Singular Value Decomposition
    • Frequent Pattern Mining
    • Collocations (statistically interesting phrases)
    • I/O: Lucene, Cassandra, MongoDB and others
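
One common way to score "statistically interesting phrases" is the log-likelihood ratio (LLR), also listed earlier as a similarity measure. The sketch below computes the standard G² statistic on a 2x2 contingency table of bigram counts. It is an illustration, not Mahout's collocation code; the function name `llr` and the example counts are assumptions.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    For a candidate phrase "A B":
      k11 = times A is followed by B
      k12 = times A is followed by something else
      k21 = times B follows something other than A
      k22 = all other bigrams
    Values near 0 mean A and B co-occur about as often as chance predicts.
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for observed, row, col in cells:
        if observed > 0:  # a zero cell contributes nothing
            total += observed * math.log(observed * n / (row * col))
    return 2.0 * total


independent = llr(10, 10, 10, 10)
associated = llr(100, 1, 1, 1000)
print(independent, associated)
```

Ranking candidate bigrams by this score pushes pairs that co-occur far more often than chance to the top, without overvaluing pairs that are merely frequent.
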
  • What’s Next for Mahout?
    • Streaming K-Means
    • Map/Reduce Training for HMM?
    • Clean Up towards 1.0 release
    • 1.0?
  • Resources
    • http://www.lucidworks.com
    • grant@lucidworks.com
    • @gsingers