Intro to Mahout -- DC Hadoop
Introduction to Apache Mahout -- talk given at DC Hadoop Meetup on April 28

Published in: Technology
  • 3C: The three C’s: clustering, classification and collaborative filtering
    FPM: Frequent patternset mining
    O: Other (math, collections, etc.)
  • Convergence just checks to see how far the centroid has moved from the previous centroid
  • Transcript

    • 1. Intro to Apache Mahout
      Grant Ingersoll
      Lucid Imagination
      http://www.lucidimagination.com
    • 2. Anyone Here Use Machine Learning?
      Any users of:
      Google?
      Search?
      Priority Inbox?
      Facebook?
      Twitter?
      LinkedIn?
    • 3. Topics
      Background and Use Cases
      What can you do in Mahout?
      Where’s the community at?
      Resources
      K-Means in Hadoop (time permitting)
    • 4. Definition
      “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
      Introduction to Machine Learning by E. Alpaydin
      Subset of Artificial Intelligence
      Lots of related fields:
      Information Retrieval
      Stats
      Biology
      Linear algebra
      Many more
    • 5. Common Use Cases
      Recommend friends/dates/products
      Classify content into predefined groups
      Find similar content
      Find associations/patterns in actions/behaviors
      Identify key topics/summarize text
      Documents and Corpora
      Detect anomalies/fraud
      Ranking search results
      Others?
    • 6. Apache Mahout
      An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
      http://mahout.apache.org
      Why Mahout?
      Many Open Source ML libraries either:
      Lack Community
      Lack Documentation and Examples
      Lack Scalability
      Lack the Apache License
      Or are research-oriented
      Definition: http://dictionary.reference.com/browse/mahout
    • 7. What does scalable mean to us?
      Goal: Be as fast and efficient as possible given the intrinsic design of the algorithm
      Some algorithms won’t scale to massive machine clusters
      Others fit logically on a Map Reduce framework like Apache Hadoop
      Still others will need different distributed programming models
      Others are already fast (SGD)
      Be pragmatic
    • 8. Sampling of Who uses Mahout?
      https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
    • 9. What Can I do with Mahout Right Now?
      3C + FPM + O = Mahout
    • 10. Collaborative Filtering
      Extensive framework for collaborative filtering (recommenders)
      Recommenders
      User based
      Item based
      Online and Offline support
      Offline can utilize Hadoop
      Many different Similarity measures
      Cosine, LLR, Tanimoto, Pearson, others
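
To make two of the similarity measures named on this slide concrete, here is a minimal Python sketch of cosine and Tanimoto similarity. It is not Mahout's API: the function names and the inputs are assumptions chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def tanimoto_similarity(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items over all items seen."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

Cosine works on rating vectors, while Tanimoto ignores rating values and only looks at which items two users have in common, which suits data without explicit preferences.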
    • 11. Clustering
      Document level
      Group documents based on a notion of similarity
      K-Means, Fuzzy K-Means, Dirichlet, Canopy, Mean-Shift, EigenCuts (Spectral)
      All Map/Reduce
      Distance Measures
      Manhattan, Euclidean, other
      Topic Modeling
      Cluster words across documents to identify topics
      Latent Dirichlet Allocation (M/R)
    • 12. Categorization
      Place new items into predefined categories:
      Sports, politics, entertainment
      Recommenders
      Implementations
      Naïve Bayes (M/R)
      Compl. Naïve Bayes (M/R)
      Decision Forests (M/R)
      Linear Regression (Seq. but Fast!)
      See Chapter 17 of Mahout in Action for Shop It To Me use case:
      http://awe.sm/5FyNe
    • 13. Freq. Pattern Mining
      Identify frequently co-occurring items
      Useful for:
      Query Recommendations
      Apple -> iPhone, orange, OS X
      Related product placement
      Basket Analysis
      Map/Reduce
      http://www.amazon.com
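
The idea of finding frequently co-occurring items can be illustrated with a small Python sketch. This is plain pair counting with a minimum-support threshold, not the algorithm Mahout runs on Map/Reduce; the function name `frequent_pairs` and the sample baskets are assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical query/basket data, loosely following the "Apple" example above.
baskets = [
    ["apple", "iphone", "os x"],
    ["apple", "iphone"],
    ["apple", "orange"],
    ["orange", "banana"],
]
pairs = frequent_pairs(baskets, min_support=2)
```

With this data only the pair `("apple", "iphone")` reaches the support threshold of 2.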
    • 14. Other
      Primitive Collections!
      Collocations (M/R)
      Math library
      Vectors, Matrices, etc.
      Noise Reduction via Singular Value Decomp (M/R)
    • 15. Prepare Data from Raw content
      Data Sources:
      Lucene integration
      bin/mahout lucene.vector…
      Document Vectorizer
      bin/mahout seqdirectory …
      bin/mahout seq2sparse …
      Programmatically
      See the Utils module in Mahout and the Iterator<Vector> classes
      Database
      File system
    • 16. How to: Command Line
      Most algorithms have a Driver program
      $MAHOUT_HOME/bin/mahout.sh helps with most tasks
      Prepare the Data
      Different algorithms require different setup
      Run the algorithm
      Single Node
      Hadoop
      Print out the results or incorporate into application
      Several helper classes:
      LDAPrintTopics, ClusterDumper, etc.
    • 17. What’s Happening Now?
      Unified Framework for Clustering and Classification
      0.5 release on the horizon (May?)
      Working towards 1.0 release by focusing on:
      Tests, examples, documentation
      API cleanup and consistency
      Gearing up for Google Summer of Code
      New M/R work for Hidden Markov Models
    • 18. Summary
      Machine learning is all over the web today
      Mahout is about scalable machine learning
      Mahout has functionality for many of today’s common machine learning tasks
      Many Mahout implementations use Hadoop
    • 19. Resources
      http://mahout.apache.org
      http://cwiki.apache.org/MAHOUT
      {user|dev}@mahout.apache.org
      http://svn.apache.org/repos/asf/mahout/trunk
      http://hadoop.apache.org
    • 20. Resources
      “Mahout in Action”
      Owen, Anil, Dunning and Friedman
      http://awe.sm/5FyNe
      “Introducing Apache Mahout”
      http://www.ibm.com/developerworks/java/library/j-mahout/
      “Taming Text” by Ingersoll, Morton, Farris
      “Programming Collective Intelligence” by Toby Segaran
      “Data Mining - Practical Machine Learning Tools and Techniques” by Ian H. Witten and Eibe Frank
      “Data-Intensive Text Processing with MapReduce” by Jimmy Lin and Chris Dyer
    • 21. K-Means
      Clustering Algorithm
      Nicely parallelizable!
      http://en.wikipedia.org/wiki/K-means_clustering
    • 22. K-Means in Map-Reduce
      Input:
      Mahout Vectors representing the original content
      Either:
      A predefined set of initial centroids (Can be from Canopy)
      --k – The number of clusters to produce
      Iterate
      Do the centroid calculation (more in a moment)
      Clustering Step (optional)
      Output
      Centroids (as Mahout Vectors)
      Points for each Centroid (if Clustering Step was taken)
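
The overall flow on this slide (input vectors and initial centroids, iterate the centroid calculation, then an optional clustering step) can be sketched on a single machine in Python. This is an illustration under assumptions, not Mahout's driver: the function names, the `max_iterations` and `convergence_delta` parameters, and the choice of Euclidean distance are all hypothetical, and the sketch takes predefined initial centroids rather than a `--k` value.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, centroids, max_iterations=10, convergence_delta=0.001,
           cluster_points=True):
    for _ in range(max_iterations):
        # Centroid calculation: group points by nearest centroid, then average.
        groups = {i: [] for i in range(len(centroids))}
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = groups[i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        # Converged when no centroid moved farther than convergence_delta.
        converged = all(euclidean(o, n) <= convergence_delta
                        for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if converged:
            break
    assignments = None
    if cluster_points:
        # Optional clustering step: list the points that belong to each centroid.
        assignments = {i: [] for i in range(len(centroids))}
        for p in points:
            assignments[nearest(p, centroids)].append(p)
    return centroids, assignments
```

Calling `kmeans(points, initial_centroids, cluster_points=False)` returns only the centroids, matching the case where the clustering step is skipped.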
    • 23. Map-Reduce Iteration
      Each Iteration calculates the Centroids using:
      KMeansMapper
      KMeansCombiner
      KMeansReducer
      Clustering Step
      Calculate the points for each Centroid using:
      KMeansClusterMapper
    • 24. KMeansMapper
      During Setup:
      Load the initial Centroids (or the Centroids from the last iteration)
      Map Phase
      For each input
      Calculate its distance from each Centroid and output the closest one
      Distance Measures are pluggable
      Manhattan, Euclidean, Squared Euclidean, Cosine, others
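
A rough Python sketch of the map phase described above, with the distance measure passed in as a function to mirror the "pluggable" design. None of this is Mahout code: `kmeans_map` and the distance function names are assumptions for illustration, and the centroids are passed as an argument instead of being loaded during setup.

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(squared_euclidean(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # With a zero-length vector the angle is undefined; use the maximum of 1.0.
    return 1.0 - dot / norms if norms else 1.0


def kmeans_map(point, centroids, distance=euclidean):
    """Emit (index of the closest centroid, point) for one input vector."""
    closest = min(range(len(centroids)),
                  key=lambda i: distance(point, centroids[i]))
    return closest, point
```

Swapping the measure only changes the `distance` argument, for example `kmeans_map(point, centroids, distance=manhattan)`.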
    • 25. KMeansReducer
      Setup:
      Load up clusters
      Convergence information
      Partial sums from KMeansCombiner (more in a moment)
      Reduce Phase
      Sum all the vectors in the cluster to produce a new Centroid
      Check for Convergence
      Output cluster
    • 26. KMeansCombiner
      Just like KMeansReducer, but only produces partial sum of the cluster based on the data local to the Mapper
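
The relationship between the combiner and the reducer can be sketched in Python as follows. This is an illustrative assumption about the data passed between the two steps (a point count plus a vector sum), not Mahout's actual classes; `combine`, `reduce_cluster` and `convergence_delta` are hypothetical names. The reducer divides the summed vectors by the point count to obtain the mean, and convergence is checked by how far the centroid moved.

```python
import math


def combine(points):
    """Combiner: collapse one mapper's points for a cluster into (count, vector sum)."""
    points = list(points)
    total = [sum(dim) for dim in zip(*points)]
    return len(points), total


def reduce_cluster(old_centroid, partials, convergence_delta=0.001):
    """Reducer: merge partial sums into a new centroid and test convergence.

    Assumes partials holds at least one (count, vector sum) with count > 0.
    """
    count = sum(n for n, _ in partials)
    total = [sum(dim) for dim in zip(*(vec for _, vec in partials))]
    new_centroid = [t / count for t in total]
    # Converged if the centroid moved no farther than convergence_delta.
    moved = math.sqrt(sum((o - n) ** 2 for o, n in zip(old_centroid, new_centroid)))
    return new_centroid, moved <= convergence_delta
```

Because sums and counts can be merged in any order, each mapper can send one small partial result per cluster instead of every point, which cuts the data shuffled to the reducer.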
    • 27. KMeansClusterMapper
      Some applications only care about what the Centroids are, so this step is optional
      Setup:
      Load up the clusters and the DistanceMeasure used
      Map Phase
      Calculate which Cluster the point belongs to
      Output <ClusterId, Vector>
