Orchestrating the Intelligent Web with Apache Mahout Presented by Aneesha Bakharia Twitter: aneesha Email: aneesha.bakharia@gmail.com
What is Apache Mahout? Open source  Machine Learning Java library Scalable (Apache Hadoop)  Framework for developing, testing and deploying large-scale algorithms http://mahout.apache.org/
What’s in a Name? Mahout is Hindi for Elephant Driver
What is Apache Mahout? Framework Vector Math/Matrices (eg SVD) Collections Hadoop Algorithms Classification, Clustering, etc Your Application??? You can orchestrate the intelligent web!!!
A New Breed of Developer Key Skills Databases Programming Networking Security … but now also distributed data processing is fast becoming an essential part of the developer’s toolbox.
You never know where you will use Probability and Statistics!!!! Video snippet from Equilibrium: http://en.wikipedia.org/wiki/Equilibrium_%28film%29
You never know what you will discover!!!!
Where people swear in the United States? http://flowingdata.com/2011/01/25/where-people-swear-in-the-united-states/
Algorithms in Apache Mahout Recommendation (collaborative filtering) Clustering Classification  Evolutionary Algorithms
Algorithms in Apache Mahout Top 10 algorithms in data mining Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining.  Knowledge and Information Systems, 14(1), 1-37. k-Means, Apriori (fp-growth), kNN,  Naive Bayes ,  SVM (coming) Already supported
Requirements Java 1.6 java -version Maven 2.2 mvn --version Hadoop 0.20
Running Mahout Command line launcher bin/mahout (This shows the list of algorithms) Valid program names are: canopy: : Canopy clustering cleansvd: : Cleanup and verification of SVD output clusterdump: : Dump cluster output to text dirichlet: : Dirichlet Clustering fkmeans: : Fuzzy K-means clustering fpg: : Frequent Pattern Growth itemsimilarity: : Compute the item-item-similarities for item-based collaborative filtering kmeans: : K-means clustering lda: : Latent Dirchlet Allocation ldatopics: : LDA Print Topics lucene.vector: : Generate Vectors from a Lucene index matrixmult: : Take the product of two matrices meanshift: : Mean Shift clustering recommenditembased: : Compute recommendations using item-based collaborative filtering … ..
Running Mahout Run any algorithm eg kmeans locally bin/mahout kmeans --help Job-Specific Options:   --input (-i) input   --output (-o) output  --distanceMeasure (-dm)  eg SquaredEuclidean  --numClusters (-k) k
Running Mahout Scale out Runs on cluster as per conf files in Hadoop directory export HADOOP_HOME=/pathto/hadoop-0.20.2/ Need to use the driver classes KMeansDriver.runJob(Path input, Path output ...)
Clustering Unsupervised Machine Learning technique Organise items into clusters/groups based upon similarity Good for finding patterns and exploring data
Clustering Lots of Algorithms: k-means, Fuzzy K-means, Mean Shift, Canopy, Dirichlet Process, Latent Dirichlet Allocation Similarity Distance Measures Euclidean Cosine Tanimoto Manhattan
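The four measures named on this slide can be sketched in plain Python. This is only an illustration of the textbook formulas, not Mahout's own DistanceMeasure implementations; the function names are made up for this sketch, and cosine and Tanimoto are written as distances (1 minus the similarity).

```python
import math

def euclidean(a, b):
    # Straight-line distance between two vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute per-dimension differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cos(angle); ignores vector length. Undefined for all-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    # 1 - (a.b / (|a|^2 + |b|^2 - a.b)). Undefined for two all-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

print(euclidean([0, 0], [3, 4]), manhattan([0, 0], [3, 4]))
```

Cosine is often a reasonable starting point for text vectors because it is insensitive to document length, but the right measure depends on the data.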
Vectors Documents Bag of words word1 => 10 word2 => 2 word3 => 4 Resulting vector [10.0, 2.0, 4.0, .... ]
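A minimal sketch of the bag-of-words idea, assuming whitespace tokenisation and a fixed vocabulary order; the helper name bag_of_words is hypothetical and this is not how Mahout's vectorizers are implemented.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Count each token, then read the counts out in vocabulary order.
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

vocabulary = ["word1", "word2", "word3"]
document = "word1 " * 10 + "word2 " * 2 + "word3 " * 4
vector = bag_of_words(document, vocabulary)
print(vector)  # [10.0, 2.0, 4.0]
```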
Range of Vectorization Tools Collate multiple words (n-grams) Normalization TF-IDF Stop word removal
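The vectorization steps listed here can be sketched as follows. The stop-word list is a tiny illustrative one, and the weighting is the textbook tf * log(N / df) formula, which may differ from the variant Mahout applies.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and"}  # illustrative list only

def tokenize(text):
    # Lowercase, split on whitespace, drop stop words.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def bigrams(tokens):
    # Collate adjacent words into 2-grams.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tf_idf(documents):
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()})
    return weights

docs = ["the elephant is big", "the elephant driver", "a big data cluster"]
weights = tf_idf(docs)
print(weights[1])
print(bigrams(tokenize(docs[2])))
```

Terms that appear in every document get a weight of zero, which is the point of the IDF factor: common words carry little information about any one document.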
kmeans Example Set of text files in a directory Use seqdirectory to convert files to sequence files bin/mahout seqdirectory -i <input> -o <seq-output> Use seq2sparse to convert to sparse vectors bin/mahout seq2sparse -i <seq-output> -o <vector-output> Run kmeans with k=5 bin/mahout kmeans -i <vector-output> -c <cluster-temp> -o <cluster-output> -k 5 View output bin/mahout clusterdump
Easy enough, but How do you know k? Data Exploration is required to find the  k for your purposes Similarity distance for your purpose  Role for the Data Scientist Explore, Model, Test and Evaluate
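One common way to explore k is to run k-means for several values and compare the within-cluster sum of squares. The sketch below is a tiny single-machine k-means in plain Python, not Mahout's distributed implementation; it assumes squared Euclidean distance and a naive initialisation (the first k points), and the sample points are invented.

```python
def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    # Put every point into the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, k, iterations=20):
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        clusters = assign(points, centroids)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign(points, centroids)

def within_cluster_ss(centroids, clusters):
    return sum(squared_euclidean(p, c)
               for c, members in zip(centroids, clusters) for p in members)

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
wcss = {}
for k in (1, 2, 3):
    centroids, clusters = kmeans(points, k)
    wcss[k] = within_cluster_ss(centroids, clusters)
    print(k, round(wcss[k], 2))
```

A sharp drop followed by a plateau in these values hints at a sensible k, but it is a heuristic; the choice still depends on what the clusters are for.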
Recommender Engines Encounter the most Recommend products (books, movies, etc) based upon past actions Infer tastes and preferences to identify unknown items of interest
Recommendation Algorithms: user and item recommendation Framework for storage, online and offline computation Similarity Measures Cosine Tanimoto Pearson
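A rough sketch of user-based recommendation with Pearson similarity, in plain Python rather than Mahout's recommender classes. The ratings, user names and function names are invented for illustration; items are scored by a similarity-weighted average over positively correlated users.

```python
import math

def pearson(prefs_a, prefs_b):
    # Correlation over the items both users have rated.
    shared = [item for item in prefs_a if item in prefs_b]
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(prefs_a[i] for i in shared) / n
    mean_b = sum(prefs_b[i] for i in shared) / n
    cov = sum((prefs_a[i] - mean_a) * (prefs_b[i] - mean_b) for i in shared)
    var_a = sum((prefs_a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((prefs_b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def recommend(user, ratings):
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], prefs)
        if sim <= 0:
            continue  # ignore users with no or opposite taste
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = [(scores[i] / weights[i], i) for i in scores]
    return [item for _, item in sorted(ranked, reverse=True)]

ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob": {"book_a": 5, "book_b": 3, "book_c": 4, "book_d": 5},
    "carol": {"book_a": 1, "book_b": 5, "book_c": 2, "book_e": 5},
}
print(recommend("alice", ratings))  # ['book_d']
```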
Frequent Pattern Mining Discover interesting patterns based upon how items occur in a sequence Example Sales Transactions  (Bread, Milk and Eggs) (Nappies, Beer) Parallel FPGrowth Algorithm
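To make the idea concrete, here is a brute-force sketch that counts how often single items and pairs occur together across transactions. It is not the Parallel FP-Growth algorithm (which avoids enumerating candidates this way); the extra transactions and the min_support value are assumptions added for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # sort so each itemset has one canonical form
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    # Keep only itemsets seen in at least min_support transactions.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

transactions = [
    ["bread", "milk", "eggs"],
    ["nappies", "beer"],
    ["bread", "milk"],
    ["nappies", "beer", "milk"],
]
frequent = frequent_itemsets(transactions, min_support=2)
for itemset, n in sorted(frequent.items()):
    print(itemset, n)
```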
Classification Set of classes/categories (observed pattern) Decide if a new input matches a category Supervised technique – need training Eg spam or not
Classification Algorithms: Naive Bayes, Random Forest Decision Tree, SVM coming Learn a model from a manually labelled training dataset Predict the class of an unseen object based on features
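The learn-then-predict cycle can be sketched with a very small multinomial Naive Bayes classifier for the spam example. This is plain Python with Laplace smoothing, not Mahout's Naive Bayes trainer; the training messages and function names are invented.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    class_counts = Counter()            # documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for text, label in labelled_docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log P(class) + sum of log P(word | class), with add-one smoothing
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train([
    ("win cash prize now", "spam"),
    ("cheap prize win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
])
print(classify("win a prize", model))
print(classify("monday meeting", model))
```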
Latent Dirichlet Allocation Convert text to term-document matrix LDA produces  word-theme mapping theme-document mapping Allows topic overlap Need to specify number of Topics (k)
Latent Dirichlet Allocation LDA Tweet 1 Tweet 2 Tweet 3 Term-Document Matrix Specify No. of Themes (k) Topic to Word Mapping X Tweet to Topic Mapping
Term-Document Matrix:
         Word 1  Word 2  Word n
Doc 1    1       0       2
Doc 2    0       1       0
Doc 3    0       1       1
Topic to Word Mapping:
         Word 1  Word 2  Word n
Topic 1  0.5     0       1
Topic 2  0       0.5     0
Tweet to Topic Mapping:
         Topic 1  Topic 2
Doc 1    1        0
Doc 2    0        1
Doc 3    0        1
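Reading the "X" on this slide as a matrix product, the two mappings multiply back into something shaped like the term-document matrix. The sketch below uses the slide's illustrative numbers; it only demonstrates the shapes and the grouping step, it does not run LDA, and in a real model each row would be a probability distribution.

```python
# Rows: Doc 1..3, columns: Topic 1..2 (numbers from the slide)
doc_topic = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Rows: Topic 1..2, columns: Word 1, Word 2, Word n (numbers from the slide)
topic_word = [[0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]

def matmul(a, b):
    # Plain nested-list matrix product: (docs x topics) * (topics x words).
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

reconstructed = matmul(doc_topic, topic_word)
print(reconstructed)

# Group documents by their strongest topic (0-based topic index).
dominant_topic = [row.index(max(row)) for row in doc_topic]
print(dominant_topic)  # [0, 1, 1]
```

Doc 2 and Doc 3 land in the same topic, which matches their similar rows in the term-document matrix.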
Latent Dirichlet Allocation Run LDA bin/mahout lda --input <PATH> --output <PATH> --numTopics 20 View Topics bin/mahout LDAPrintTopics --input <PATH> --output <PATH> --dictionaryType sequencefile
Suggesting Twitter Lists Twitter introduced Lists: group people you follow so you can see only their timeline of tweets Build an application that recommends people who should be grouped in the same list. Use LDA because it allows overlapping list membership - this is great because people talk about multiple topics.
Suggesting Twitter Lists Twitter API Tasks Get list of people that a user follows Retrieve tweets for each person Save Lists back to Twitter Data Processing Combine all tweets for a person Remove stop words Stem words Create a user-word matrix
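The data processing steps could look roughly like this. It is a sketch only: the tweets are invented, the stop-word list is tiny, and naive_stem is a crude suffix stripper standing in for a real stemmer (for example a Porter stemmer). Fetching tweets from the Twitter API is left out.

```python
from collections import Counter

STOP_WORDS = {"with", "on", "and", "the", "a"}  # illustrative list only

def naive_stem(word):
    # Hypothetical stand-in for a real stemmer: strip "ing" or a trailing "s".
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def user_word_matrix(tweets_by_user):
    counts = {}
    for user, tweets in tweets_by_user.items():
        combined = " ".join(tweets).lower()  # combine all tweets for a person
        tokens = [naive_stem(w) for w in combined.split() if w not in STOP_WORDS]
        counts[user] = Counter(tokens)
    vocabulary = sorted(set(w for c in counts.values() for w in c))
    matrix = {user: [c[w] for w in vocabulary] for user, c in counts.items()}
    return vocabulary, matrix

tweets_by_user = {
    "ann": ["Clustering tweets with Mahout", "Mahout runs on Hadoop"],
    "ben": ["Cooking recipes and cakes"],
}
vocabulary, matrix = user_word_matrix(tweets_by_user)
print(vocabulary)
print(matrix)
```

Each row of the resulting matrix plays the role of one document in the term-document matrix that LDA consumes.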
Suggesting Twitter Lists Web UI Authenticate to Twitter Display suggested lists (based on estimate of k)  (Could also display the important tweets that place the person in the group?) Allow users to change k, i.e. decide on the number of Lists Allow group re-organisation with jQuery sortables
Gently Getting into Machine Learning and Data Mining Programming Collective  Intelligence by Toby Segaran Mahout in Action by Owen, Anil, Dunning  and Friedman
Summary Mahout offers good abstraction for building intelligent web applications Skills in data analysis and exploration are now more important than ever Mahout is a good platform for distributed algorithm development
Fascinating Algorithms My Top 3 algorithms Some interesting and some disturbing and interesting at the same time
Fascinating Algorithms No 3 – Identifying Manipulated Images http://www.technologyreview.com/computing/20423/page1/
Fascinating Algorithms No 2 – Seam Carving Content Aware Resizing Example http://swieskowski.net/carve/
Disturbing Algorithms No 1 – Digital Face Beautification http://leyvand.com/research/beautification/dfb_sketch.pdf
Disturbing Algorithms No 1 – Digital Face Beautification http://leyvand.com/research/beautification/dfb_sketch.pdf Image from Shrek Copyright Dreamworks
Discussion/Questions What will you build?
