Hands on Mahout!

Ted Dunning, Robin Anil

Transcript

    • 1. Hands on! Speakers: Ted Dunning, Robin Anil OSCON 2011, Portland
    • 2. About Us
      • Ted Dunning: Chief Application Architect at MapR; Committer and PMC Member at Apache Mahout. Previously: MusicMatch (Yahoo! Music), Veoh recommendation, ID Analytics
      • Robin Anil: Software Engineer at Google; Committer and PMC Member at Apache Mahout. Previously: Yahoo! (display ads), Minekey recommendation
    • 3. Agenda
      • Intro to Mahout (5 mins)
      • Overview of Algorithms in Mahout (10 mins)
      • Hands on Mahout!
        • Clustering (30 mins)
        • Classification (30 mins)
        • Advanced topics with Q&A (15 mins)
    • 4. Mission
      • To build a scalable machine learning library
    • 5. Scale!
      • Scale to large datasets
        • Hadoop MapReduce implementations that scale linearly with data
        • Fast sequential algorithms whose runtime doesn’t depend on the size of the data
        • Goal: To be as fast as possible for any algorithm
      • Scalable to support your business case
        • Apache Software License 2
      • Scalable community
        • Vibrant, responsive and diverse
        • Come to the mailing list and find out more
    • 6. Current state of ML libraries
      • Lack community
      • Lack scalability
      • Lack documentation and examples
      • Lack Apache licensing
      • Are not well tested
      • Are research-oriented
      • Are not built on existing production-quality libraries
      • Lack “deployability”
    • 7. Algorithms and Applications
    • 8. Clustering
      • Call it fuzzy grouping based on a notion of similarity
    • 9. Mahout Clustering
      • Plenty of Algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
      • Group similar looking objects
      • Notion of similarity, captured by a distance measure:
        • Euclidean
        • Cosine
        • Tanimoto
        • Manhattan
    • 10. Classification
      • Predicting the type of a new object based on its features
      • The types are predetermined
      • [images: a dog and a cat]
    • 11. Mahout Classification
      • Plenty of algorithms
        • Naïve Bayes
        • Complementary Naïve Bayes
        • Random Forests
        • Logistic Regression (SGD)
        • Support Vector Machines (patch ready)
      • Learn a model from manually classified data
      • Predict the class of a new object based on its features and the learned model
    • 12. Part 1 - Clustering
    • 13. Understanding data - Vectors
      • The vector denoted by point (5, 3) is simply Array([5, 3]) or HashMap([0 => 5], [1 => 3])
      [diagram: the point (5, 3) plotted on X and Y axes]
    • 14. Representing Vectors – The basics
      • Now think 3, 4, 5, … n-dimensional
      • Think of a document as a bag of words.
        • “she sells sea shells on the sea shore”
      • Now map them to integers
        • she => 0
        • sells => 1
        • sea => 2
        • and so on
      • The resulting vector: [1.0, 1.0, 2.0, …], where “sea” (index 2) counts twice. A sketch follows below.
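To make the bag-of-words idea concrete, here is a minimal Java sketch using Mahout's org.apache.mahout.math vector classes; the hand-built dictionary stands in for what seq2sparse computes for you later:

      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.List;
      import java.util.Map;

      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.Vector;

      public class BagOfWords {
        public static void main(String[] args) {
          List<String> tokens = Arrays.asList(
              "she", "sells", "sea", "shells", "on", "the", "sea", "shore");

          // Map each distinct word to the next free integer index.
          Map<String, Integer> dictionary = new HashMap<String, Integer>();
          for (String token : tokens) {
            if (!dictionary.containsKey(token)) {
              dictionary.put(token, dictionary.size());
            }
          }

          // One dimension per word; the value is the term count.
          Vector vector = new RandomAccessSparseVector(dictionary.size());
          for (String token : tokens) {
            int index = dictionary.get(token);
            vector.set(index, vector.get(index) + 1.0);  // "sea" ends up as 2.0
          }
          System.out.println(vector);
        }
      }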
    • 15. Vectors
      • Imagine one dimension for each word.
      • Each dimension is also called a feature
      • Two techniques
        • Dictionary Based
        • Randomizer Based
    • 16. Clustering Reuters dataset
    • 17. Step 1 – Convert dataset into a Hadoop Sequence File
      • http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
      • Download (8.2 MB) and extract the SGML files.
        • $ mkdir -p mahout-work/reuters-sgm
        • $ cd mahout-work/reuters-sgm && tar xzf ../reuters21578.tar.gz && cd .. && cd ..
      • Extract content from SGML to text file
        • $ bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters mahout-work/reuters-sgm mahout-work/reuters-out
    • 18. Step 1 – Convert dataset into a Hadoop Sequence File
      • Use seqdirectory tool to convert text file into a Hadoop Sequence File
        • $ bin/mahout seqdirectory
        • -i mahout-work/reuters-out
        • -o mahout-work/reuters-out-seqdir
        • -c UTF-8 -chunk 5
    • 19. Hadoop Sequence File
      • Sequence of Records, where each record is a <Key, Value> pair
        • <Key1, Value1>
        • <Key2, Value2>
        • <Keyn, Valuen>
      • Key and Value need to be of class org.apache.hadoop.io.Text
        • Key = Record name or File name or unique identifier
        • Value = Content as UTF-8 encoded string
      • TIP: Dump data from your database directly into Hadoop Sequence Files (see next slide)
    • 20. Writing to Sequence Files
      • Configuration conf = new Configuration();
      • FileSystem fs = FileSystem.get(conf);
      • Path path = new Path("testdata/part-00000");
      • SequenceFile.Writer writer = new SequenceFile.Writer(
      •     fs, conf, path, Text.class, Text.class);
      • // 'documents' stands in for your data source, e.g. rows from a database
      • for (int i = 0; i < MAX_DOCS; i++) {
      •   writer.append(new Text(documents.get(i).getId()),
      •       new Text(documents.get(i).getContent()));
      • }
      • writer.close();
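For the read side, a minimal sketch (fs, path, and conf are the same objects set up for the writer above):

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {   // iterate over every <Key, Value> record
        System.out.println(key + " => " + value);
      }
      reader.close();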
    • 21. Generate Vectors from Sequence Files
      • Steps
        • Compute Dictionary
        • Assign integers for words
        • Compute feature weights
        • Create a vector for each document using the word-to-integer mapping and the feature weights
      • Or simply run $ bin/mahout seq2sparse
    • 22. Generate Vectors from Sequence Files
      • $ bin/mahout seq2sparse -i mahout-work/reuters-out-seqdir/ -o mahout-work/reuters-out-seqdir-sparse-kmeans
      • Important options (an example command using them follows below)
        • Ngrams
        • Lucene Analyzer for tokenizing
        • Feature Pruning
          • Min support
          • Max Document Frequency
          • Min LLR (for ngrams)
        • Weighting Method
          • TF vs. TF-IDF
          • Lp norm
          • Log normalize length
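An example command exercising these options. The short flag names follow the Mahout 0.5-era seq2sparse tool, so treat them as assumptions and confirm with bin/mahout seq2sparse --help on your version:

      $ bin/mahout seq2sparse \
          -i mahout-work/reuters-out-seqdir/ \
          -o mahout-work/reuters-out-seqdir-sparse-kmeans \
          -ng 2 -ml 50 -s 5 -x 90 -wt tfidf -n 2

Here -ng 2 adds bigrams, -ml 50 sets the minimum LLR for keeping an ngram, -s 5 sets minimum support, -x 90 prunes terms appearing in more than 90% of documents, -wt tfidf selects TF-IDF weighting, and -n 2 applies L2 length normalization.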
    • 23. Start K-Means clustering
      • $ bin/mahout kmeans -i mahout-work/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c mahout-work/reuters-kmeans-clusters -o mahout-work/reuters-kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -x 10 -k 20 -ow
      • Things to watch out for
        • Number of iterations
        • Convergence delta
        • Distance Measure
        • Creating assignments
    • 24.–27. K-Means clustering [diagrams over four slides: points are assigned to the nearest of centroids c1, c2, c3, then the centroids are recomputed, repeating until convergence]
    • 28. Inspect clusters
      • $ bin/mahout clusterdump -s mahout-work/reuters-kmeans/clusters-9 -d mahout-work/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20
      • Typical output
      • :VL-21438{n=518 c=[0.56:0.019, 00:0.154, 00.03:0.018, 00.18:0.018, …
      • Top Terms:
      • iran => 3.1861672217321213
      • strike => 2.567886952727918
      • iranian => 2.133417966282966
      • union => 2.116033937940266
      • said => 2.101773806290277
      • workers => 2.066259451354332
      • gulf => 1.9501374918521601
      • had => 1.6077752463145605
      • he => 1.5355078004962228
    • 29. FAQs
      • How to get rid of useless words
      • How to see documents to cluster assignments
      • How to choose appropriate weighting
      • How to run this on a cluster
      • How to scale
      • How to choose k
      • How to improve similarity measurement
    • 30. FAQs
      • How to get rid of useless words
        • Increase minSupport and/or decrease dfPercent
        • Use StopwordsAnalyzer
      • How to see documents to cluster assignments
        • Run the clustering step after centroid generation by passing -cl
      • How to choose appropriate weighting
        • If it’s long text, go with TF-IDF. Use normalization if documents differ in length
      • How to run this on a cluster
        • Set the HADOOP_CONF_DIR environment variable to point to your Hadoop cluster’s conf directory (see the example after this list)
      • How to scale
        • Use a small value of k to partially cluster the data, then run full clustering on each partition
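A minimal example of the cluster setup, assuming a typical Hadoop client layout (the conf path is a placeholder for your own):

      $ export HADOOP_CONF_DIR=/etc/hadoop/conf
      $ bin/mahout kmeans ...

With HADOOP_CONF_DIR set, the same kmeans command from slide 23 is submitted as Hadoop jobs on the cluster instead of running locally.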
    • 31. FAQs
      • How to choose k
        • Figure out based on the data you have. Trial and error
        • Or use Canopy Clustering and distance threshold to figure it out
        • Or use Spectral clustering
      • How to improve similarity measurement
        • Not all features are equal
        • Small weight difference for certain types creates a large semantic difference
        • Use WeightedDistanceMeasure
        • Or write a custom DistanceMeasure (a sketch follows below)
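A hedged sketch of a custom measure: boosted Euclidean distance, where one feature counts ten times as much as the others. It assumes the Mahout 0.5-era DistanceMeasure interface, whose Parametered configuration methods are stubbed out below; check the interface in your version before copying:

      import java.util.Collection;
      import java.util.Collections;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.mahout.common.distance.DistanceMeasure;
      import org.apache.mahout.common.parameters.Parameter;
      import org.apache.mahout.math.Vector;

      public class BoostedEuclideanDistanceMeasure implements DistanceMeasure {

        private static final int BOOSTED_FEATURE = 0;  // hypothetical: the feature that matters most
        private static final double BOOST = 10.0;

        @Override
        public double distance(Vector v1, Vector v2) {
          Vector diff = v1.minus(v2);
          double extra = diff.get(BOOSTED_FEATURE);
          // Count the boosted feature's squared difference BOOST times instead of once.
          return Math.sqrt(diff.getLengthSquared() + (BOOST - 1.0) * extra * extra);
        }

        @Override
        public double distance(double centroidLengthSquare, Vector centroid, Vector v) {
          return distance(centroid, v);  // skip the precomputed-length optimization
        }

        // Parametered plumbing; this sketch takes no configurable parameters.
        @Override
        public void configure(Configuration config) {}

        @Override
        public Collection<Parameter<?>> getParameters() {
          return Collections.emptyList();
        }

        @Override
        public void createParameters(String prefix, Configuration jobConf) {}
      }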
    • 32. Interesting problems
      • Find users tweeting about OSCON ’11 and cluster them based on what they are tweeting
        • Can you suggest people to network with?
      • Take the user-generated tags that people have given musicians and cluster them
        • Use the clusters to pre-populate a suggest-box that autocompletes tags as users type
      • Cluster movies based on abstract and description and show related movies
        • Note how this can augment recommendation or collaborative-filtering algorithms
    • 33. More clustering algorithms
      • Canopy
      • Fuzzy K-Means
      • Mean Shift
      • Dirichlet process clustering
      • Spectral clustering
    • 34. Part 2 - Classification
    • 35. Preliminaries
      • Code is available from github:
        • git@github.com:tdunning/Chapter-16.git
      • EC2 instances available
      • Thumb drives also available
      • Email to [email_address]
      • Twitter @ted_dunning
    • 36. A Quick Review
      • What is classification?
        • goes-ins: predictors
        • goes-outs: target variable
      • What is classifiable data?
        • continuous, categorical, word-like, text-like
        • uniform schema
      • How do we convert from classifiable data to feature vector?
    • 37. Data Flow [diagram; not quite so simple]
    • 38. Classifiable Data
      • Continuous
        • A number that represents a quantity, not an id
        • Blood pressure, stock price, latitude, mass
      • Categorical
        • One of a known, small set (color, shape)
      • Word-like
        • One of a possibly unknown, possibly large set
      • Text-like
        • Many word-like things, usually unordered
    • 39. But that isn’t quite there
      • Learning algorithms need feature vectors
        • Have to convert from data to vector
      • Can assign one location per feature
        • or category
        • or word
      • Can assign one or more locations with hashing
        • scary
        • but safe on average (see the encoder sketch below)
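A small sketch of hashed encoding with Mahout's feature encoders (package org.apache.mahout.vectorizer.encoders in the 0.5-era API; the feature names and vector size are illustrative):

      import org.apache.mahout.math.RandomAccessSparseVector;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
      import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

      public class HashedEncoding {
        public static void main(String[] args) {
          Vector vector = new RandomAccessSparseVector(100);  // fixed size, no dictionary

          StaticWordValueEncoder color = new StaticWordValueEncoder("color");
          color.setProbes(2);               // hash each value to 2 locations so collisions hurt less
          color.addToVector("red", vector); // "red" lands in 2 of the 100 slots

          ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");
          intercept.addToVector("", vector);  // bias term

          System.out.println(vector);
        }
      }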
    • 40. Data Flow
    • 41. [image slide]
    • 42. The pipeline [diagram: classifiable data → vectors]
    • 43. Instance and Target Variable
    • 44. Instance and Target Variable
    • 45. Hashed Encoding
    • 46. What about collisions?
    • 47. Let’s write some code (cue relaxing background music)
    • 48. Generating new features
      • Sometimes the existing features are difficult to use
      • Restating the geometry using new reference points may help
      • Automatic reference points found with k-means can be better than manual references (a sketch follows below)
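A sketch of the idea as a hypothetical helper; it assumes the centroids came out of a k-means run like the one in Part 1:

      import java.util.List;

      import org.apache.mahout.common.distance.CosineDistanceMeasure;
      import org.apache.mahout.common.distance.DistanceMeasure;
      import org.apache.mahout.math.DenseVector;
      import org.apache.mahout.math.Vector;

      public class KMeansFeatures {
        // Restate an instance as its distances to the learned cluster centroids.
        public static Vector distancesToCentroids(Vector instance, List<Vector> centroids) {
          DistanceMeasure measure = new CosineDistanceMeasure();
          Vector features = new DenseVector(centroids.size());
          for (int i = 0; i < centroids.size(); i++) {
            features.set(i, measure.distance(centroids.get(i), instance));
          }
          return features;  // one new feature per reference point
        }
      }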
    • 49. K-means using target
    • 50. K-means features
    • 51. More code! (cue relaxing background music)
    • 52. Integration Issues
      • Feature extraction is ideal for map-reduce
        • Side data adds some complexity
      • Clustering works great with map-reduce
        • Cluster centroids to HDFS
      • Model training works better sequentially
        • Need centroids in normal files
      • Model deployment shouldn’t depend on HDFS
    • 53. Parallel Stochastic Gradient Descent [diagram: input → train sub-models → average models → model]
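Each sub-model in that diagram is trained sequentially. A minimal sketch with Mahout's SGD learner; the hyperparameters and the input lists are placeholders for your own data plumbing:

      import java.util.List;

      import org.apache.mahout.classifier.sgd.L1;
      import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
      import org.apache.mahout.math.Vector;

      public class SequentialTrainer {
        // Train binary logistic regression sequentially over encoded vectors.
        // 'targets' holds 0/1 labels aligned with 'features' (placeholder inputs).
        public static OnlineLogisticRegression train(List<Vector> features,
                                                     List<Integer> targets,
                                                     int numFeatures) {
          OnlineLogisticRegression learner =
              new OnlineLogisticRegression(2, numFeatures, new L1())
                  .learningRate(0.1)   // assumed hyperparameters; tune for your data
                  .lambda(1.0e-5);
          for (int i = 0; i < features.size(); i++) {
            learner.train(targets.get(i), features.get(i));
          }
          return learner;              // learner.classifyScalar(v) gives P(class 1)
        }
      }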
    • 54. Variational Dirichlet Assignment [diagram: input → gather sufficient statistics → update model → model]
    • 55. Old tricks, new dogs
      • Mapper
        • Assign point to cluster
        • Emit cluster id, (1, point)
      • Combiner and reducer
        • Sum counts, weighted sum of points
        • Emit cluster id, (n, sum/n)
      • Output to HDFS
      [diagram notes: cluster output written by map-reduce to HDFS; centroids read from HDFS to local disk via the distributed cache]
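Not Mahout's actual k-means job, but a hedged Java sketch of the mapper described on this slide. Loading the centroids from the distributed cache in setup() is elided, and WeightedVectorWritable's weight field is pressed into service as the count n:

      import java.io.IOException;
      import java.util.List;

      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.mahout.clustering.WeightedVectorWritable;
      import org.apache.mahout.common.distance.DistanceMeasure;
      import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
      import org.apache.mahout.math.Vector;
      import org.apache.mahout.math.VectorWritable;

      // Assign each point to its nearest centroid and emit (cluster id, (1, point)).
      public class AssignToClusterMapper
          extends Mapper<Text, VectorWritable, IntWritable, WeightedVectorWritable> {

        private List<Vector> centroids;  // load in setup() from the distributed cache
        private final DistanceMeasure measure = new EuclideanDistanceMeasure();

        @Override
        protected void map(Text key, VectorWritable value, Context context)
            throws IOException, InterruptedException {
          Vector point = value.get();
          int nearest = 0;
          double best = Double.POSITIVE_INFINITY;
          for (int i = 0; i < centroids.size(); i++) {
            double d = measure.distance(centroids.get(i), point);
            if (d < best) {
              best = d;
              nearest = i;
            }
          }
          // The combiner/reducer then sum the weights (counts) and the vectors.
          context.write(new IntWritable(nearest), new WeightedVectorWritable(1, point));
        }
      }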
    • 56. Old tricks, new dogs
      • Mapper
        • Assign point to cluster
        • Emit cluster id, 1, point
      • Combiner and reducer
        • Sum counts, weighted sum of points
        • Emit cluster id, n, sum/n
      • Output to HDFS
      [diagram notes: with MapR FS, output is written by map-reduce and read back directly over NFS]
    • 57. Modeling architecture [diagram: input → feature extraction and down-sampling with side-data join (map-reduce) → sequential SGD learning, now fed via NFS]
    • 58. More in Mahout
    • 59. Topic modeling
      • Grouping similar or co-occurring features into a topic
        • Topic “Lol Cat”:
          • Cat
          • Meow
          • Purr
          • Haz
          • Cheeseburger
          • Lol
    • 60. Mahout Topic Modeling
      • Algorithm: Latent Dirichlet Allocation
        • Input: a set of documents
        • Output: the top k prominent topics and the features in each topic (a hedged example command follows below)
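A hedged example run over the Reuters term-frequency vectors; LDA wants plain term counts (tf-vectors), not TF-IDF. The flag names follow the Mahout 0.5-era lda driver, so verify them with bin/mahout lda --help:

      $ bin/mahout lda \
          -i mahout-work/reuters-out-seqdir-sparse-kmeans/tf-vectors \
          -o mahout-work/reuters-lda \
          -k 20 -v 50000 -x 20

Here -k is the number of topics, -v the vocabulary size, and -x the maximum number of iterations.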
    • 61. Recommendations
      • Predict what the user likes based on
        • Their historical behavior
        • The aggregate behavior of people similar to them
    • 62. Mahout Recommenders
      • Different types of recommenders
        • User based
        • Item based
      • Full framework for storage, and for online and offline computation of recommendations
      • Like clustering, there is a notion of similarity in users or items
        • Cosine, Tanimoto, Pearson, and LLR (a minimal user-based sketch follows below)
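A minimal user-based sketch with the Taste API; the ratings file and user id 42 are placeholders:

      import java.io.File;

      import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
      import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
      import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
      import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
      import org.apache.mahout.cf.taste.model.DataModel;
      import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
      import org.apache.mahout.cf.taste.recommender.RecommendedItem;
      import org.apache.mahout.cf.taste.recommender.Recommender;
      import org.apache.mahout.cf.taste.similarity.UserSimilarity;

      public class UserBasedExample {
        public static void main(String[] args) throws Exception {
          // Each line of ratings.csv: userID,itemID,preference
          DataModel model = new FileDataModel(new File("ratings.csv"));
          UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
          UserNeighborhood neighborhood =
              new NearestNUserNeighborhood(10, similarity, model);
          Recommender recommender =
              new GenericUserBasedRecommender(model, neighborhood, similarity);
          for (RecommendedItem item : recommender.recommend(42L, 3)) {
            System.out.println(item.getItemID() + ": " + item.getValue());
          }
        }
      }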
    • 63. Frequent Pattern Mining
      • Find interesting groups of items based on how they co-occur in a dataset
    • 64. Mahout Parallel FPGrowth
      • Identify the most commonly occurring patterns from
        • Sales transactions: e.g., “milk, eggs, and bread” bought together
        • Query Logs
          • ipad -> apple, tablet, iphone
        • Spam Detection
          • Yahoo! http://www.slideshare.net/hadoopusergroup/mail-antispam
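To mine such patterns with Mahout's parallel FPGrowth, a hedged example; the input is one transaction per line, and the flag names follow the Mahout 0.5-era fpg driver, so confirm with bin/mahout fpg --help:

      $ bin/mahout fpg \
          -i mahout-work/transactions.dat \
          -o mahout-work/patterns \
          -k 50 -s 2 -method mapreduce

Here -k keeps the top 50 patterns per item and -s 2 sets the minimum support.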
    • 65. Get Started
      • http://mahout.apache.org
      • [email_address] - Developer mailing list
      • [email_address] - User mailing list
      • Check out the documentation and wiki for a quick start
      • Browse the code: http://svn.apache.org/repos/asf/mahout/trunk/
      • Send me email!
        • [email_address]
        • [email_address]
        • [email_address]
      • Try out MapR!
        • www.mapr.com
    • 66. Resources
      • “Mahout in Action” by Owen, Anil, Dunning, and Friedman: http://www.manning.com/owen
      • “Taming Text” by Ingersoll, Morton, and Farris: http://www.manning.com/ingersoll
      • “Introducing Apache Mahout”: http://www.ibm.com/developerworks/java/library/j-mahout/
    • 67. Thanks to
      • Apache Foundation
      • Mahout Committers
      • Google Summer of Code Organizers
      • And Students
      • OSCON
      • Open source!
    • 68. References
      • news.google.com
      • Cat http://www.flickr.com/photos/gattou/3178745634/
      • Dog http://www.flickr.com/photos/30800139@N04/3879737638/
      • Milk Eggs Bread http://www.flickr.com/photos/nauright/4792775946/
      • Amazon Recommendations
      • twitter
