Mahout part2


Part two of a presentation about the Mahout system, based on the book Mahout in Action.

  • 1. Mahout in Action Part 2 Yasmine M. Gaber 4 April 2013
  • 2. Agenda Part 2: Clustering Part 3: Classification
  • 3. Clustering An algorithm A notion of both similarity and dissimilarity A stopping condition
  • 4. Measuring the similarity of items Euclidean Distance
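
As a concrete illustration of the Euclidean distance named on the slide above, here is a minimal plain-Java sketch (not Mahout's own DistanceMeasure classes); the point values are made up.

public class EuclideanDistanceExample {
    // Euclidean distance: square root of the sum of squared coordinate differences
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] p = {1.0, 2.0};
        double[] q = {4.0, 6.0};
        System.out.println(euclideanDistance(p, q)); // prints 5.0
    }
}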
  • 5. Creating the input Preprocess the data Use that data to create vectors Save the vectors in SequenceFile format as input for the algorithm
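
A hedged sketch of the last step on the slide above: serializing vectors into a SequenceFile so a clustering job can read them. DenseVector and VectorWritable are real Mahout/Hadoop types, but the output path and the toy points are assumptions made for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectorsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("input/vectors.seq");   // hypothetical output path

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, VectorWritable.class);
        try {
            double[][] points = {{1, 1}, {2, 1}, {8, 8}, {9, 8}}; // toy data
            VectorWritable value = new VectorWritable();
            for (int i = 0; i < points.length; i++) {
                value.set(new DenseVector(points[i]));            // wrap the vector
                writer.append(new Text("point-" + i), value);     // key, vector
            }
        } finally {
            writer.close();
        }
    }
}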
  • 6. Using Mahout clustering The SequenceFile containing the input vectors. The SequenceFile containing the initial cluster centers. The similarity measure to be used. The convergenceThreshold. The number of iterations to be done. The Vector implementation used in the input files.
  • 7. Using Mahout clustering
  • 8. Distance measures Euclidean distance measure Squared Euclidean distance measure Manhattan distance measure
  • 9. Distance measures Cosine distance measure Tanimoto distance measure
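
To make the two measures on this slide concrete, here is a plain-Java sketch of the underlying formulas (cosine distance = 1 − cosine similarity; Tanimoto distance = 1 − a·b / (|a|² + |b|² − a·b)). The vectors are made up; Mahout's CosineDistanceMeasure and TanimotoDistanceMeasure implement the same ideas.

public class OtherDistancesExample {
    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // Cosine distance = 1 - cos(angle between a and b)
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Tanimoto distance = 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
    static double tanimotoDistance(double[] a, double[] b) {
        double ab = dot(a, b);
        return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 0.0, 1.0};
        double[] y = {1.0, 1.0, 0.0};
        System.out.println(cosineDistance(x, y));   // 0.5
        System.out.println(tanimotoDistance(x, y)); // ~0.667
    }
}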
  • 10. Playing Around
  • 11. Representing data
  • 12. Representing text documents as vectors Vector Space Model (VSM) TF-IDF N-gram collocations
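
A hedged sketch of the TF-IDF weighting idea from the slide above, using one common form of the formula, weight = tf * log(N / df). The numbers are invented; seq2sparse computes a similar weight for every term in every document.

public class TfIdfExample {
    // tf = term frequency in this document, df = documents containing the term,
    // numDocs = total number of documents in the collection
    static double tfIdf(int termFreq, int docFreq, int numDocs) {
        return termFreq * Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // "oil" appears 5 times in this document, and in 100 of 10,000 documents
        System.out.println(tfIdf(5, 100, 10000));   // ~23.0: rare overall, frequent here
        // "said" also appears 5 times here, but in 9,000 of 10,000 documents
        System.out.println(tfIdf(5, 9000, 10000));  // ~0.53: too common to be useful
    }
}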
  • 13. Generating vectors from documents
    $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
    $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
  • 14. Improving quality of vectors using normalization (p-norm)
    $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
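
A minimal sketch of what p-norm normalization does (the -n 2 flag above requests the 2-norm): divide each component by the vector's p-norm so documents of different lengths become comparable. The vector below is made up.

public class PNormExample {
    static double[] normalize(double[] v, double p) {
        double norm = 0.0;
        for (double x : v) norm += Math.pow(Math.abs(x), p);
        norm = Math.pow(norm, 1.0 / p);               // the p-norm of the vector
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[] v = {3.0, 4.0};
        double[] unit = normalize(v, 2.0);            // 2-norm of {3, 4} is 5
        System.out.println(unit[0] + ", " + unit[1]); // 0.6, 0.8
    }
}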
  • 15. Clustering Categories Exclusive clustering Overlapping clustering Hierarchical clustering Probabilistic clustering
  • 16. Clustering Approaches Fixed number of centers Bottom-up approach Top-down approach
  • 17. Clustering algorithms K-means clustering Fuzzy k-means clustering Dirichlet clustering
  • 18. k-means clustering algorithm
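
As a rough, non-distributed illustration of the algorithm named above, the sketch below runs a single k-means iteration in plain Java: assign each point to its nearest center, then recompute each center as the mean of its points. The data and initial centers are invented.

import java.util.Arrays;

public class KMeansStepExample {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s; // squared Euclidean distance is enough for comparisons
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {8, 8}, {9, 8}};
        double[][] centers = {{0, 0}, {10, 10}};   // initial cluster centers (k = 2)

        // Assignment step: nearest center per point
        int[] assignment = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            assignment[i] = dist(points[i], centers[0]) <= dist(points[i], centers[1]) ? 0 : 1;
        }

        // Update step: move each center to the mean of the points assigned to it
        for (int c = 0; c < centers.length; c++) {
            double[] sum = new double[2];
            int count = 0;
            for (int i = 0; i < points.length; i++) {
                if (assignment[i] == c) {
                    sum[0] += points[i][0];
                    sum[1] += points[i][1];
                    count++;
                }
            }
            if (count > 0) centers[c] = new double[]{sum[0] / count, sum[1] / count};
        }
        System.out.println(Arrays.deepToString(centers)); // [[1.5, 1.0], [8.5, 8.0]]
    }
}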
  • 19. Running k-means clustering
  • 20. Running k-means clustering
    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl
    $ bin/mahout clusterdump -dt sequencefile -d
  • 21. Fuzzy k-means clustering Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. Also known as fuzzy c-means algorithm.
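
A small sketch of the membership computation that makes fuzzy k-means "overlapping": each point gets a degree of membership in every cluster, u_i = 1 / Σ_j (d_i / d_j)^(2/(m−1)), where m > 1 is the fuzziness factor (the -m flag on the command two slides below). The distances are made up.

public class FuzzyMembershipExample {
    // distances[i] is the distance from one point to cluster center i
    static double[] memberships(double[] distances, double m) {
        double[] u = new double[distances.length];
        double exponent = 2.0 / (m - 1.0);
        for (int i = 0; i < distances.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < distances.length; j++) {
                sum += Math.pow(distances[i] / distances[j], exponent);
            }
            u[i] = 1.0 / sum;   // memberships across all clusters sum to 1
        }
        return u;
    }

    public static void main(String[] args) {
        // distances from one point to two cluster centers, fuzziness m = 2
        double[] u = memberships(new double[]{1.0, 3.0}, 2.0);
        System.out.println(u[0] + ", " + u[1]); // 0.9 and 0.1: mostly cluster 0
    }
}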
  • 22. Running fuzzy k-means clustering
  • 23. Running fuzzy k-means clustering
    $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
    (-m is the fuzziness factor)
  • 24. Dirichlet clustering A model-based clustering algorithm
  • 25. Running Dirichlet clustering
    $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
  • 26. Evaluating and improving clustering quality Inspecting clustering output Evaluating the quality of clustering Improving clustering quality
  • 27. Inspecting clustering output
    $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10
    Top Terms: said => 11.60126582278481 bank => 5.943037974683544 dollar =>
  • 28. Analyzing clustering output Distance measure and feature selection Inter-cluster and intra-cluster distances Mixed and overlapping clusters
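
A rough sketch of the two quantities named on the slide above: average intra-cluster distance (within one cluster, which should be small) and inter-cluster distance (between cluster centers, which should be large). The clusters and centers are toy values.

public class ClusterQualityExample {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Average pairwise distance among the points of one cluster
    static double avgIntraClusterDistance(double[][] cluster) {
        double sum = 0;
        int pairs = 0;
        for (int i = 0; i < cluster.length; i++) {
            for (int j = i + 1; j < cluster.length; j++) {
                sum += dist(cluster[i], cluster[j]);
                pairs++;
            }
        }
        return pairs == 0 ? 0 : sum / pairs;
    }

    public static void main(String[] args) {
        double[][] clusterA = {{1, 1}, {2, 1}, {1, 2}};
        double[] centerA = {1.33, 1.33};
        double[] centerB = {8.5, 8.0};
        System.out.println("intra A: " + avgIntraClusterDistance(clusterA)); // ~1.14
        System.out.println("inter A-B: " + dist(centerA, centerB));          // ~9.8
    }
}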
  • 29. Improving clustering quality Improving document vector generation Writing a custom distance measure
  • 30. Real-world applications of clustering Clustering like-minded people on Twitter Suggesting tags for an artist on Last.fm using clustering Creating a related-posts feature for a website
  • 31. Classification Classification is the process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Applications of classification include, for example, spam filtering.
  • 32. Why use Mahout for classification?
  • 33. How classification works
  • 34. Classification Training versus test versus production Predictor variables versus target variable Records, fields, and values
  • 35. Types of values for predictor variables Continuous Categorical Word-like Text-like
  • 36. Classification Work flow Training the model Evaluating the model Using the model in production
  • 37. Stage 1: training the classification model. Stage 2: evaluating the classification model. Stage 3: using the model in production.
  • 38. Stage 1: training the classification model Define Categories for the Target Variable Collect Historical Data Define Predictor Variables Select a Learning Algorithm to Train the Model Use Learning Algorithm to Train the Model
  • 39. Extracting features to build a Mahout classifier
  • 40. Preprocessing raw data into classifiable data
  • 41. Converting classifiable data into vectors Use one Vector cell per word, category, or continuous value Represent Vectors implicitly as bags of words Use feature hashing
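
A minimal sketch of the third option above, feature hashing: hash each feature name into an index of a fixed-size vector, so no dictionary needs to be maintained. Mahout ships encoder classes for this; the toy version below only illustrates the idea, and the features are invented.

public class FeatureHashingExample {
    static void addFeature(double[] vector, String name, double value) {
        int index = Math.floorMod(name.hashCode(), vector.length); // hash name -> slot
        vector[index] += value;           // colliding features simply add together
    }

    public static void main(String[] args) {
        double[] v = new double[20];      // fixed-size hashed feature vector
        addFeature(v, "word=free", 1.0);
        addFeature(v, "word=money", 1.0);
        addFeature(v, "length", 42.0);
        System.out.println(java.util.Arrays.toString(v));
    }
}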
  • 42. Classifying the 20 newsgroups data set
  • 43. Choosing an algorithm
  • 44. The classifier evaluation API Percent correct Confusion matrix Entropy matrix AUC Log likelihood
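
To make two of the metrics above concrete, here is a small sketch that computes percent correct and a 2x2 confusion matrix from hypothetical actual/predicted labels (rows = actual class, columns = predicted class).

public class EvaluationExample {
    public static void main(String[] args) {
        int[] actual    = {1, 0, 1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 0, 1, 0, 1, 1, 0};

        int[][] confusion = new int[2][2];
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            confusion[actual[i]][predicted[i]]++;     // count (actual, predicted) pairs
            if (actual[i] == predicted[i]) correct++;
        }
        System.out.println("percent correct: " + (100.0 * correct / actual.length)); // 75.0
        System.out.println("confusion[actual][predicted]:");
        for (int[] row : confusion) System.out.println(java.util.Arrays.toString(row));
        // [3, 1]   actual 0: 3 classified correctly, 1 wrongly called 1
        // [1, 3]   actual 1: 1 wrongly called 0, 3 classified correctly
    }
}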
  • 45. When classifiers go bad Target leaks Broken feature extraction
  • 46. Tuning the problem Remove Fluff Variables Add New Variables, Interactions, and Derived Values
  • 47. Tuning the classifier Try Alternative Algorithms Tune the Learning Algorithm
  • 48. Thank You Contact at: Email: