# Mahout, Part 2

Part two of a presentation about the Mahout system, based on http://my.safaribooksonline.com/9781935182689/

Published in: Technology, Education

### Transcript

• 1. Mahout in Action Part 2 Yasmine M. Gaber 4 April 2013
• 2. Agenda Part 2: Clustering Part 3: Classification
• 3. Clustering An algorithm A notion of both similarity and dissimilarity A stopping condition
• 4. Measuring the similarity of items Euclidean Distance
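The Euclidean distance named on this slide is the straight-line distance between two points. A minimal plain-Python sketch of the formula (an illustration only, not Mahout's Java `DistanceMeasure` API):

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# (0,0) to (3,4) is the classic 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # → 5.0
```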
• 5. Creating the input: preprocess the data; use that data to create vectors; save the vectors in SequenceFile format as input for the algorithm
• 6. Using Mahout clustering The SequenceFile containing the input vectors. The SequenceFile containing the initial cluster centers. The similarity measure to be used. The convergenceThreshold. The number of iterations to be done. The Vector implementation used in the input files.
• 7. Using Mahout clustering
• 8. Distance measures Euclidean distance measure Squared Euclidean distance measure Manhattan distance measure
• 9. Distance measures Cosine distance measure Tanimoto distance measure
• 10. Playing Around
• 11. Representing data
• 12. Representing text documents as vectors Vector Space Model (VSM) TF-IDF N-gram collocations
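The TF-IDF weighting behind the vector space model can be sketched as follows, using the common tf × log(N/df) form; Mahout's `seq2sparse` computes a variant of the same idea, so treat this as an illustration of the weighting, not of Mahout's implementation:

```python
import math

def tfidf(docs):
    # docs: a list of token lists. Returns one {term: weight} dict per
    # document: raw term frequency damped by log(N / document frequency),
    # so terms common to many documents get a low weight.
    n = len(docs)
    df = {}                                   # document frequency per term
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weighted = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1    # raw term frequency
        weighted.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return weighted

# Toy corpus: "rate" appears in one document, "bank" in two.
docs = [["bank", "dollar"], ["bank", "rate"], ["oil", "dollar"]]
weights = tfidf(docs)
```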
• 13. Generating vectors from documents $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
• 14. Improving quality of vectors using normalization P-norm $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
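The p-norm normalization selected by the `-n` flag divides each component by the vector's p-norm, making documents of different lengths comparable. A plain-Python sketch of the math:

```python
def p_normalize(vec, p=2):
    # Divide every component by ||vec||_p = (sum |x|^p)^(1/p).
    # p=2 (the -n 2 above) yields a unit-length vector.
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec]

print(p_normalize([3, 4], p=2))  # → [0.6, 0.8]
```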
• 15. Clustering Categories Exclusive clustering Overlapping clustering Hierarchical clustering Probabilistic clustering
• 16. Clustering Approaches Fixed number of centers Bottom-up approach Top-down approach
• 17. Clustering algorithms K-means clustering Fuzzy k-means clustering Dirichlet clustering
• 18. k-means clustering algorithm
• 19. Running k-means clustering
• 20. Running k-means clustering $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl $ bin/mahout clusterdump -dt sequencefile -d
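The commands above run k-means at scale; the underlying iterate-assign-recenter loop (Lloyd's algorithm) can be sketched in plain Python. This is a toy in-memory illustration, not Mahout's distributed implementation:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    # points: list of equal-length tuples. Repeat: assign each point to
    # its nearest center (squared Euclidean), then move each center to
    # the mean of its assigned points.
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # pick k points as initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(points, k=2)
```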
• 21. Fuzzy k-means clustering Instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. Also known as the fuzzy c-means algorithm.
• 22. Running fuzzy k-means clustering
• 23. Running fuzzy k-means clustering $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure (-m is the fuzziness factor)
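Instead of a hard assignment, fuzzy k-means gives each point a degree of membership in every cluster. A plain-Python sketch of the standard membership formula (an illustration of the math, not Mahout's code; it assumes the point does not sit exactly on a center):

```python
def fuzzy_memberships(point, centers, m=2.0):
    # Membership of `point` in cluster i:
    #   u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1))
    # where d_i is the distance to center i and m is the fuzziness
    # factor (the -m flag above); memberships sum to 1.
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5 for c in centers]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exp for d_j in dists) for d_i in dists]

# A point midway between two centers belongs equally to both.
print(fuzzy_memberships((0.5, 0.0), [(0.0, 0.0), (1.0, 0.0)]))  # → [0.5, 0.5]
```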
• 24. Dirichlet clustering model-based clustering algorithm
• 25. Running Dirichlet clustering $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
• 26. Evaluating and improving clustering quality Inspecting clustering output Evaluating the quality of clustering Improving clustering quality
• 27. Inspecting clustering output $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10 Top Terms: said => 11.60126582278481 bank => 5.943037974683544 dollar =>
• 28. Analyzing clustering output Distance measure and feature selection Inter-cluster and intra-cluster distances Mixed and overlapping clusters
• 29. Improving clustering quality Improving document vector generation Writing a custom distance measure
• 30. Real-world applications of clustering Clustering like-minded people on Twitter Suggesting tags for an artist on Last.fm using clustering Creating a related-posts feature for a website
• 31. Classification Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Applications of classification, e.g. spam filtering
• 32. Why use Mahout for classification?
• 33. How classification works
• 34. Classification Training versus test versus production Predictor variables versus target variable Records, fields, and values
• 35. Types of values for predictor variables Continuous Categorical Word-like Text-like
• 36. Classification Work flow Training the model Evaluating the model Using the model in production
• 37. Stage 1: training the classification model. Stage 2: evaluating the classification model. Stage 3: using the model in production
• 38. Stage 1: training the classification model Define Categories for the Target Variable Collect Historical Data Define Predictor Variables Select a Learning Algorithm to Train the Model Use Learning Algorithm to Train the Model
• 39. Extracting features to build a Mahout classifier
• 40. Preprocessing raw data into classifiable data
• 41. Converting classifiable data into vectors Use one Vector cell per word, category, or continuous value Represent Vectors implicitly as bags of words Use feature hashing
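The feature-hashing option from the list above maps each token to a fixed-size vector slot by hashing, so no term dictionary has to be kept in memory. A plain-Python sketch (`hashlib.md5` is used only to keep the example stdlib-only and deterministic; it stands in for whatever hash a real encoder uses):

```python
import hashlib

def hash_vector(tokens, dims=16):
    # The "hashing trick": hash each token into one of `dims` slots and
    # increment that slot. Collisions are tolerated; the vector size is
    # fixed regardless of vocabulary size.
    vec = [0.0] * dims
    for tok in tokens:
        slot = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dims
        vec[slot] += 1.0
    return vec
```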
• 42. Classifying the 20 newsgroups data set
• 43. Choosing an algorithm
• 44. The classifier evaluation API Percent correct Confusion matrix Entropy matrix AUC Log likelihood
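The first two metrics in the list, percent correct and the confusion matrix, can be computed in a few lines. A generic plain-Python sketch, not Mahout's evaluation API:

```python
def confusion_matrix(actual, predicted, labels):
    # Rows are the actual class, columns the predicted class; cell [i][j]
    # counts examples of class i that the model labeled as class j.
    idx = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        m[idx[a]][idx[p]] += 1
    return m

def percent_correct(actual, predicted):
    # Fraction of examples whose predicted label matches, as a percentage.
    return 100.0 * sum(a == p for a, p in zip(actual, predicted)) / len(actual)
```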
• 45. When classifiers go bad Target leaks Broken feature extraction
• 46. Tuning the problem Remove Fluff Variables Add New Variables, Interactions, and Derived Values
• 47. Tuning the classifier Try Alternative Algorithms Tune the Learning Algorithm