Orchestrating the Intelligent Web with Apache Mahout

Presentation on Apache Mahout at Linux Conference Australia 2011

Transcript

  • 1. Orchestrating the Intelligent Web with Apache Mahout Presented by Aneesha Bakharia Twitter: aneesha Email: aneesha.bakharia@gmail.com
  • 2. What is Apache Mahout?
    • Open source
    • Machine Learning Java library
    • Scalable (Apache Hadoop)
    • Framework for developing, testing and deploying large-scale algorithms http://mahout.apache.org/
  • 3. What’s in a Name?
    • Mahout is Hindi for Elephant Driver
  • 4. What is Apache Mahout?
    • Framework
      • Vector Math/Matrices (eg SVD)
      • Collections
      • Hadoop
    • Algorithms
      • Classification, Clustering, etc
    • Your Application???
      • You can orchestrate the intelligent web!!!
  • 5. A New Breed of Developer
    • Key Skills
      • Databases
      • Programming
      • Networking
      • Security
    • … but now also
      • distributed data processing is fast becoming an essential part of the developer’s toolbox.
  • 6.
    • You never know where you will use Probability and Statistics!!!! Video snippet from Equilibrium: http://en.wikipedia.org/wiki/Equilibrium_%28film%29
  • 7.
    • You never know what you will discover!!!!
  • 8. Where people swear in the United States? http://flowingdata.com/2011/01/25/where-people-swear-in-the-united-states/
  • 9. Algorithms in Apache Mahout
    • Recommendation (collaborative filtering)
    • Clustering
    • Classification
    • Evolutionary Algorithms
  • 10. Algorithms in Apache Mahout
    • Top 10 algorithms in data mining: Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.
    • Already supported: k-Means, Apriori (fp-growth), kNN, Naive Bayes; SVM (coming)
  • 11. Requirements
    • Java 1.6 (check with java -version)
    • Maven 2.2 (check with mvn --version)
    • Hadoop 0.20
  • 12. Running Mahout
    • Command line launcher bin/mahout (This shows the list of algorithms) Valid program names are:
    • canopy: : Canopy clustering
    • cleansvd: : Cleanup and verification of SVD output
    • clusterdump: : Dump cluster output to text
    • dirichlet: : Dirichlet Clustering
    • fkmeans: : Fuzzy K-means clustering
    • fpg: : Frequent Pattern Growth
    • itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering
    • kmeans: : K-means clustering
    • lda: : Latent Dirchlet Allocation
    • ldatopics: : LDA Print Topics
    • lucene.vector: : Generate Vectors from a Lucene index
    • matrixmult: : Take the product of two matrices
    • meanshift: : Mean Shift clustering
    • recommenditembased: : Compute recommendations using item-based collaborative filtering
    • … ..
  • 13. Running Mahout
    • Run any algorithm, e.g. kmeans, locally: bin/mahout kmeans --help
    • Job-Specific Options:
      • --input (-i) input
      • --output (-o) output
      • --distanceMeasure (-dm), e.g. SquaredEuclidean
      • --numClusters (-k) k
  • 14. Running Mahout
    • Scale out: runs on a cluster as per the conf files in the Hadoop directory
    • export HADOOP_HOME=/pathto/hadoop-0.20.2/
    • Need to use the driver classes, e.g. KMeansDriver.runJob(Path input, Path output ...)
  • 15. Clustering
    • Unsupervised Machine Learning technique
    • Organise items into clusters/groups based upon similarity
    • Good for finding patterns and exploring data
  • 16. Clustering
    • Lots of Algorithms: k-means, Fuzzy K-means, Mean Shift, Canopy, Dirichlet Process, Latent Dirichlet Allocation
    • Similarity Distance Measures
      • Euclidean
      • Cosine
      • Tanimoto
      • Manhattan
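The four distance measures named on this slide can be sketched in a few lines. This is a plain-Python illustration of the textbook formulas, not Mahout's own DistanceMeasure classes; the function names are made up for the example, and the cosine and Tanimoto versions assume non-zero vectors.

```python
import math


def euclidean(a, b):
    # Straight-line distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    # Sum of absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    # 1 - cosine of the angle between the vectors (assumes non-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def tanimoto_distance(a, b):
    # 1 - Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot).
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom
```

Cosine and Tanimoto ignore or dampen vector length, which is often why they are preferred for text, where document length varies a lot.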
  • 17. Vectors
    • Documents are treated as a bag of words
      • word1 => 10
      • word2 => 2
      • word3 => 4
    • Resulting vector: [10.0, 2.0, 4.0, .... ]
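As a minimal sketch of the bag-of-words idea on this slide, the snippet below counts words and lays the counts out in a fixed vocabulary order. The helper name `bag_of_words` and the whitespace tokenisation are assumptions for illustration; Mahout's own vectorizers do considerably more.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    # Count each whitespace-separated word, then emit the counts
    # in vocabulary order so every document gets the same layout.
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


vocabulary = ["word1", "word2", "word3"]
doc = "word1 " * 10 + "word2 " * 2 + "word3 " * 4
vector = bag_of_words(doc, vocabulary)  # [10.0, 2.0, 4.0]
```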
  • 18. Range of Vectorization Tools
    • Collate multiple words (n-grams)
    • Normalization
    • TF-IDF
    • Stop word removal
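The vectorization steps listed above can be illustrated with a small sketch. It uses the textbook weighting tf × log(N / df), which is an assumption here and not necessarily the exact formula Mahout applies; the stop-word list and function names are also invented for the example.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not Mahout's list).
STOP_WORDS = {"the", "a", "is", "of", "and"}


def tokenize(text, ngram_size=1):
    # Lower-case, drop stop words, then join neighbouring words into n-grams.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return [" ".join(words[i:i + ngram_size])
            for i in range(len(words) - ngram_size + 1)]


def tf_idf(documents):
    # Weight each term by its count in the document times log(N / document frequency).
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights


weights = tf_idf(["the elephant is big", "the driver of the elephant"])
```

A term such as "elephant" that appears in every document gets weight zero, which is the point of TF-IDF: terms that do not discriminate between documents stop influencing the distance measure.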
  • 19. kmeans Example
    • Set of text files in a directory
    • Use seqdirectory to convert the files to sequence files: bin/mahout seqdirectory -i <input> -o <seq-output>
    • Use seq2sparse to convert to sparse vectors: bin/mahout seq2sparse -i <seq-output> -o <vector-output>
    • Run kmeans with k=5: bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 5
    • View output: bin/mahout clusterdump
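To show what the kmeans step is doing with those vectors, here is a minimal single-machine sketch of the algorithm in plain Python. It is not Mahout's implementation: the initial centroids are passed in by hand (where Mahout would use the -c directory or random seeds), squared Euclidean distance is hard-coded, and the loop runs a fixed number of iterations instead of testing for convergence.

```python
def kmeans(points, centroids, iterations=10):
    # Repeat: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]]
centroids, clusters = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
```

The number of initial centroids is k, so choosing k and the starting points is the caller's responsibility in this sketch.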
  • 20. Easy enough, but
    • How do you know k?
    • Data Exploration is required to find the
      • k for your purposes
      • Similarity distance for your purpose
    • Role for the Data Scientist
      • Explore, Model, Test and Evaluate
  • 21. Recommender Engines
    • The machine learning application people encounter the most
    • Recommend products (books, movies, etc) based upon past actions
    • Infer tastes and preferences to identify unknown items of interest
  • 22. Recommendation
    • Algorithms: user-based and item-based recommendation
    • Framework for storage, online and offline computation
    • Similarity Measures
      • Cosine
      • Tanimoto
      • Pearson
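A user-based recommender with Pearson similarity can be sketched as follows. The ratings, user names and function names are invented for the example, and the scoring (a similarity-weighted average over positively correlated users) is one common choice, not necessarily what Mahout's recommender classes do.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 4.0, "book_b": 2.0, "book_c": 3.0, "book_d": 5.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 2.0},
}


def pearson(u, v):
    # Pearson correlation over the items both users have rated.
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def recommend(user):
    # Score unseen items by the similarity-weighted ratings of similar users.
    scores, weights = {}, {}
    for other in ratings:
        if other == user:
            continue
        sim = pearson(user, other)
        if sim <= 0:
            continue
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
```

Calling `recommend("alice")` ranks the items alice has not rated yet; users whose tastes correlate negatively with hers are ignored.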
  • 23. Frequent Pattern Mining
    • Discover interesting patterns based upon how items occur in a sequence
    • Example: sales transactions such as (Bread, Milk and Eggs) and (Nappies, Beer)
    • Parallel FPGrowth Algorithm
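The goal of frequent pattern mining can be illustrated with a brute-force counter over the slide's shopping example. This is deliberately not FP-Growth: it enumerates every small itemset, which only works for toy data, and the function name and parameters are assumptions for the sketch.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    # Count every itemset of up to max_size items and keep those
    # that occur in at least min_support transactions.
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


transactions = [
    ["bread", "milk", "eggs"],
    ["nappies", "beer"],
    ["bread", "milk"],
    ["nappies", "beer", "milk"],
]
frequent = frequent_itemsets(transactions, min_support=2)
```

FP-Growth finds the same frequent itemsets without generating all candidates, which is what makes it practical to parallelise over large transaction logs.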
  • 24. Classification
    • Set of classes/categories (observed pattern)
    • Decide if a new input matches a category
    • Supervised technique – need training
    • E.g. spam or not spam
  • 25. Classification
    • Algorithms: Naive Bayes, Random Forest Decision Tree, SVM coming
    • Learn a model from a manually labelled training dataset
    • Predict the class of an unseen object based on features
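The train-then-predict cycle described above can be sketched with a small multinomial Naive Bayes classifier. The training sentences, labels and function names are invented for the example, and add-one (Laplace) smoothing is an assumption; Mahout's Naive Bayes implementation differs in detail.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    # Collect per-class document counts and per-class word counts.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    # Pick the class with the highest log prior plus summed log word likelihoods.
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the class.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("win cash now", "spam"),
    ("cheap cash prize", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
])
```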
  • 26. Latent Dirichlet Allocation
      • Convert text to term-document matrix
      • LDA produces
        • word-theme mapping
        • theme-document mapping
        • Allows topic overlap
      • Need to specify number of Topics (k)
  • 27. Latent Dirichlet Allocation
    • LDA
      • Tweet 1
      • Tweet 2
      • Tweet 3
    • Term-Document Matrix
              Word 1  Word 2  Word n
      Doc 1     1       0       2
      Doc 2     0       1       0
      Doc 3     0       1       1
    • Specify number of themes (k)
    • Topic to Word Mapping
              Word 1  Word 2  Word n
      Topic 1   0.5     0       1
      Topic 2   0       0.5     0
    • X
    • Tweet to Topic Mapping
              Topic 1  Topic 2
      Doc 1     1        0
      Doc 2     0        1
      Doc 3     0        1
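Reading the "X" on this slide as a matrix product, the two output mappings can be multiplied to see how they jointly approximate the term-document matrix. The sketch below uses the slide's own numbers; treating LDA as a plain matrix factorisation is a simplification of what is really a probabilistic model.

```python
def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Rows: Doc 1..3, columns: Topic 1..2 (values from the slide).
tweet_to_topic = [[1, 0], [0, 1], [0, 1]]
# Rows: Topic 1..2, columns: Word 1, Word 2, Word n (values from the slide).
topic_to_word = [[0.5, 0, 1], [0, 0.5, 0]]

# Rows: Doc 1..3, columns: Word 1, Word 2, Word n.
reconstructed = matmul(tweet_to_topic, topic_to_word)
```

Each row of the product is a document expressed as a mix of topic word profiles. Doc 1 comes out proportional to its original row (1, 0, 2), while Doc 2 and Doc 3 collapse onto the same topic, so the reconstruction is a summary of the data and not an exact copy.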
  • 28. Latent Dirichlet Allocation
      • Run LDA: bin/mahout lda --input <PATH> --output <PATH> --numTopics 20
      • View Topics: bin/mahout LDAPrintTopics --input <PATH> --output <PATH> --dictionaryType sequencefile
  • 29. Suggesting Twitter Lists
      • Twitter introduced Lists, which group people you follow so you can see only their timeline of tweets
      • Build an application that could recommend people who should be grouped in the same list.
      • Use LDA because it allows overlapping list membership, which is great because people talk about multiple topics.
  • 30. Suggesting Twitter Lists
      • Twitter API Tasks
        • Get list of people that a user follows
        • Retrieve tweets for each person
        • Save Lists back to Twitter
      • Data Processing
        • Combine all tweets for a person
        • Remove stop words
        • Stem words
        • Create a user-word matrix
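The data processing steps above can be sketched in plain Python. Everything here is illustrative: the stop-word list is tiny, `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, the function names are made up, and the tweets are assumed to have already been fetched from the Twitter API.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "i", "my"}


def stem(word):
    # Crude suffix stripping; a placeholder for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def user_word_matrix(tweets_by_user):
    # Combine each user's tweets, drop stop words, stem, and count words.
    counts_by_user = {}
    for user, tweets in tweets_by_user.items():
        words = " ".join(tweets).lower().split()
        counts_by_user[user] = Counter(stem(w) for w in words if w not in STOP_WORDS)
    # One column per distinct stemmed word, in alphabetical order.
    vocabulary = sorted(set().union(*counts_by_user.values()))
    matrix = {user: [counts[w] for w in vocabulary]
              for user, counts in counts_by_user.items()}
    return vocabulary, matrix


vocabulary, matrix = user_word_matrix({
    "ann": ["Learning Mahout", "Mahout clusters tweets"],
    "ben": ["I love the clustering talks"],
})
```

Each row of the resulting matrix plays the role of one "document" for LDA, so a user's topic mix is estimated from everything they have tweeted.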
  • 31. Suggesting Twitter Lists
      • Web UI
        • Authenticate to Twitter
        • Display suggested lists (based on estimate of k) (Could also display the important tweets that place the person in the group?)
        • Allow users to change k, i.e. decide on the number of Lists
        • Allow group re-organisation with jQuery sortables
  • 32. Gently Getting into Machine Learning and Data Mining
    • Programming Collective Intelligence by Toby Segaran
    • Mahout in Action by Owen, Anil, Dunning and Friedman
  • 33. Summary
    • Mahout offers good abstraction for building intelligent web applications
    • Skills in data analysis and exploration are now more important than ever
    • Mahout is a good platform for distributed algorithm development
  • 34. Fascinating Algorithms
    • My Top 3 algorithms
      • Some are interesting; some are disturbing and interesting at the same time
  • 35. Fascinating Algorithms
    • No 3 – Identifying Manipulated Images http://www.technologyreview.com/computing/20423/page1/
  • 36. Fascinating Algorithms
    • No 2 – Seam Carving Content Aware Resizing Example http://swieskowski.net/carve/
  • 37. Disturbing Algorithms
    • No 1 – Digital Face Beautification http://leyvand.com/research/beautification/dfb_sketch.pdf
  • 38. Disturbing Algorithms
    • No 1 – Digital Face Beautification http://leyvand.com/research/beautification/dfb_sketch.pdf
  • 39. Disturbing Algorithms
    • No 1 – Digital Face Beautification http://leyvand.com/research/beautification/dfb_sketch.pdf
    Image from Shrek Copyright Dreamworks
  • 40. Discussion/Questions
    • What will you build?