Part two of a presentation about the Mahout system, based on Mahout in Action (http://my.safaribooksonline.com/9781935182689/).


1. Mahout in Action, Part 2 - Yasmine M. Gaber, 4 April 2013
2. Agenda: Part 2: Clustering; Part 3: Classification
3. Clustering: an algorithm; a notion of both similarity and dissimilarity; a stopping condition
4. Measuring the similarity of items: Euclidean distance
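The Euclidean measure above is just the square root of the summed squared component differences. A minimal plain-Java sketch (illustrative only, not tied to a particular Mahout class):

    // Euclidean distance: d(a, b) = sqrt(sum_i (a_i - b_i)^2)
    public class EuclideanDistanceExample {
        static double euclidean(double[] a, double[] b) {
            double sum = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            double[] x = {1.0, 2.0, 3.0};
            double[] y = {4.0, 6.0, 3.0};
            System.out.println(euclidean(x, y)); // sqrt(9 + 16 + 0) = 5.0
        }
    }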
5. Creating the input: preprocess the data; use that data to create vectors; save the vectors in SequenceFile format as input for the algorithm
6. Using Mahout clustering: the SequenceFile containing the input vectors; the SequenceFile containing the initial cluster centers; the similarity measure to be used; the convergence threshold; the number of iterations to be done; the Vector implementation used in the input files
7. Using Mahout clustering
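The inputs listed on slide 6 can also be supplied programmatically. A minimal sketch, assuming the Mahout in Action-era KMeansDriver.run signature (conf, input vectors, initial clusters, output, distance measure, convergence threshold, max iterations, run clustering, run sequentially); later Mahout releases changed this signature, and the paths and parameter values are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

    public class RunKMeans {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            KMeansDriver.run(conf,
                new Path("reuters-vectors/tfidf-vectors"), // SequenceFile of input vectors
                new Path("reuters-initial-clusters"),      // SequenceFile of initial cluster centers
                new Path("reuters-kmeans-clusters"),       // output directory
                new EuclideanDistanceMeasure(),            // similarity measure
                0.01,                                      // convergence threshold
                20,                                        // number of iterations
                true,                                      // also assign input points to clusters
                false);                                    // run as MapReduce rather than sequentially
        }
    }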
8. Distance measures: Euclidean distance measure; squared Euclidean distance measure; Manhattan distance measure
9. Distance measures: cosine distance measure; Tanimoto distance measure
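The cosine and Tanimoto measures reduce to short formulas over dot products; a plain-Java sketch (not the Mahout classes themselves):

    // Cosine distance:   1 - (a.b) / (|a| * |b|)
    // Tanimoto distance: 1 - (a.b) / (|a|^2 + |b|^2 - a.b)
    public class OtherDistances {
        static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        static double cosineDistance(double[] a, double[] b) {
            return 1.0 - dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
        }

        static double tanimotoDistance(double[] a, double[] b) {
            double ab = dot(a, b);
            return 1.0 - ab / (dot(a, a) + dot(b, b) - ab);
        }

        public static void main(String[] args) {
            double[] x = {1.0, 0.0, 1.0};
            double[] y = {1.0, 1.0, 0.0};
            System.out.println(cosineDistance(x, y));   // 0.5
            System.out.println(tanimotoDistance(x, y)); // 1 - 1/3 = 0.666...
        }
    }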
10. Playing around
11. Representing data
12. Representing text documents as vectors: Vector Space Model (VSM); TF-IDF; n-gram collocations
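TF-IDF is the weighting scheme behind these document vectors: terms frequent in a document but rare in the corpus get the highest weights. A hedged sketch of the standard tf * log(N/df) form (Mahout's seq2sparse may apply different smoothing; the counts here are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    // weight(term, doc) = tf(term, doc) * log(N / df(term))
    public class TfIdfExample {
        public static void main(String[] args) {
            int numDocs = 1000;                       // N: documents in the corpus
            Map<String, Integer> docFreq = new HashMap<>();
            docFreq.put("said", 900);                 // very common -> low weight
            docFreq.put("dollar", 40);                // rarer -> higher weight

            Map<String, Integer> termFreq = new HashMap<>();
            termFreq.put("said", 3);                  // counts within one document
            termFreq.put("dollar", 2);

            for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
                double idf = Math.log((double) numDocs / docFreq.get(e.getKey()));
                System.out.println(e.getKey() + " => " + e.getValue() * idf);
            }
        }
    }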
13. Generating vectors from documents:
    $ bin/mahout seqdirectory -c UTF-8 -i examples/reuters-extracted/ -o reuters-seqfiles
    $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
14. Improving quality of vectors using normalization (p-norm):
    $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-normalized-bigram -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2
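The -n 2 flag requests 2-norm normalization. A small sketch of what p-norm normalization does to a single vector (plain Java, not the seq2sparse internals):

    // p-norm normalization: divide every component by (sum_i |v_i|^p)^(1/p).
    // With p = 2 each document vector is scaled to unit Euclidean length,
    // so long documents do not dominate the distance computations.
    public class PNormExample {
        static double[] normalize(double[] v, double p) {
            double norm = 0.0;
            for (double x : v) norm += Math.pow(Math.abs(x), p);
            norm = Math.pow(norm, 1.0 / p);
            double[] out = new double[v.length];
            for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
            return out;
        }

        public static void main(String[] args) {
            double[] tfidf = {3.0, 4.0};
            double[] unit = normalize(tfidf, 2.0);   // {0.6, 0.8}, length 1
            System.out.println(unit[0] + ", " + unit[1]);
        }
    }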
15. Clustering categories: exclusive clustering; overlapping clustering; hierarchical clustering; probabilistic clustering
16. Clustering approaches: fixed number of centers; bottom-up approach; top-down approach
17. Clustering algorithms: k-means clustering; fuzzy k-means clustering; Dirichlet clustering
18. The k-means clustering algorithm
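A compact plain-Java sketch of one k-means iteration (assign each point to its nearest center, then move each center to the mean of its points); Mahout's distributed job performs the same two steps once per MapReduce pass:

    public class KMeansIteration {
        static double squaredDistance(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return s;
        }

        // points: n x dim, centers: k x dim; returns the updated centers.
        static double[][] iterate(double[][] points, double[][] centers) {
            int k = centers.length, dim = points[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                       // assignment step
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) best = c;
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
            for (int c = 0; c < k; c++) {                     // update step
                if (counts[c] == 0) continue;                 // an empty cluster keeps its old center
                for (int d = 0; d < dim; d++) centers[c][d] = sums[c][d] / counts[c];
            }
            return centers;
        }

        public static void main(String[] args) {
            double[][] points = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
            double[][] centers = {{0, 0}, {10, 10}};
            centers = iterate(points, centers);
            System.out.println(centers[0][0] + "," + centers[0][1]); // 1.0,1.5
            System.out.println(centers[1][0] + "," + centers[1][1]); // 8.5,8.0
        }
    }

The loop repeats until no center moves more than the convergence threshold (-cd) or the iteration limit (-x) is reached.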
19. Running k-means clustering
20. Running k-means clustering:
    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -cd 1.0 -k 20 -x 20 -cl
    $ bin/mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-initial-clusters -o reuters-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1 -k 20 -x 20 -cl
    $ bin/mahout clusterdump -dt sequencefile -d
21. Fuzzy k-means clustering: instead of the exclusive clustering in k-means, fuzzy k-means tries to generate overlapping clusters from the data set. It is also known as the fuzzy c-means algorithm.
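Where k-means assigns each point to exactly one cluster, fuzzy k-means gives each point a membership weight in every cluster. A sketch of the standard membership formula, where m is the fuzziness factor passed as -m in the command below (assumes no distance is exactly zero):

    // u_i = 1 / sum_k (d_i / d_k)^(2 / (m - 1)), for distances d_* from one point to each center
    public class FuzzyMembership {
        static double[] memberships(double[] distances, double m) {
            double[] u = new double[distances.length];
            for (int i = 0; i < distances.length; i++) {
                double sum = 0.0;
                for (int k = 0; k < distances.length; k++) {
                    sum += Math.pow(distances[i] / distances[k], 2.0 / (m - 1.0));
                }
                u[i] = 1.0 / sum;
            }
            return u;
        }

        public static void main(String[] args) {
            double[] d = {1.0, 2.0, 4.0};               // distances to three cluster centers
            double[] u = memberships(d, 2.0);           // m = 2: about {0.762, 0.190, 0.048}
            System.out.println(u[0] + " " + u[1] + " " + u[2]);
        }
    }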
22. Running fuzzy k-means clustering
23. Running fuzzy k-means clustering:
    $ bin/mahout fkmeans -i reuters-vectors/tfidf-vectors/ -c reuters-fkmeans-centroids -o reuters-fkmeans-clusters -cd 1.0 -k 21 -m 2 -ow -x 10 -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
    (-m sets the fuzziness factor)
24. Dirichlet clustering: a model-based clustering algorithm
25. Running Dirichlet clustering:
    $ bin/mahout dirichlet -i reuters-vectors/tfidf-vectors -o reuters-dirichlet-clusters -k 60 -x 10 -a0 1.0 -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution -mp org.apache.mahout.math.SequentialAccessSparseVector
26. Evaluating and improving clustering quality: inspecting clustering output; evaluating the quality of clustering; improving clustering quality
27. Inspecting clustering output:
    $ bin/mahout clusterdump -s kmeans-output/clusters-19/ -d reuters-vectors/dictionary.file-0 -dt sequencefile -n 10
    Top terms: said => 11.60126582278481, bank => 5.943037974683544, dollar =>
28. Analyzing clustering output: distance measure and feature selection; inter-cluster and intra-cluster distances; mixed and overlapping clusters
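One way to make the inter-cluster and intra-cluster distances above concrete: average the center-to-center distances and the point-to-own-center distances (plain Java, not Mahout's evaluator classes); a good clustering keeps the intra-cluster average small relative to the inter-cluster average:

    public class ClusterDistances {
        static double distance(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return Math.sqrt(s);
        }

        // average distance between every pair of cluster centers
        static double avgInterCluster(double[][] centers) {
            double sum = 0.0; int pairs = 0;
            for (int i = 0; i < centers.length; i++) {
                for (int j = i + 1; j < centers.length; j++) { sum += distance(centers[i], centers[j]); pairs++; }
            }
            return sum / pairs;
        }

        // average distance from each point to the center it was assigned to
        static double avgIntraCluster(double[][] points, int[] assignments, double[][] centers) {
            double sum = 0.0;
            for (int p = 0; p < points.length; p++) sum += distance(points[p], centers[assignments[p]]);
            return sum / points.length;
        }

        public static void main(String[] args) {
            double[][] centers = {{0, 0}, {10, 0}};
            double[][] points = {{1, 0}, {-1, 0}, {9, 0}, {12, 0}};
            int[] assignments = {0, 0, 1, 1};
            System.out.println("inter: " + avgInterCluster(centers));                      // 10.0
            System.out.println("intra: " + avgIntraCluster(points, assignments, centers)); // 1.25
        }
    }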
29. Improving clustering quality: improving document vector generation; writing a custom distance measure
30. Real-world applications of clustering: clustering like-minded people on Twitter; suggesting tags for an artist on Last.fm using clustering; creating a related-posts feature for a website
31. Classification is a process of using specific information (input) to choose a single selection (output) from a short list of predetermined potential responses. Applications of classification include spam filtering.
32. Why use Mahout for classification?
33. How classification works
34. Classification: training versus test versus production; predictor variables versus target variable; records, fields, and values
35. Types of values for predictor variables: continuous; categorical; word-like; text-like
36. Classification workflow: training the model; evaluating the model; using the model in production
37. Stage 1: training the classification model; Stage 2: evaluating the classification model; Stage 3: using the model in production
38. Stage 1, training the classification model: define categories for the target variable; collect historical data; define predictor variables; select a learning algorithm to train the model; use the learning algorithm to train the model
39. Extracting features to build a Mahout classifier
40. Preprocessing raw data into classifiable data
41. Converting classifiable data into vectors: use one Vector cell per word, category, or continuous value; represent Vectors implicitly as bags of words; use feature hashing
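A sketch of the feature-hashing idea mentioned in slide 41: hash each word to a fixed-size index range instead of maintaining a dictionary, tolerating occasional collisions (plain Java, not Mahout's own encoder classes; the vector size is an arbitrary choice):

    public class FeatureHashing {
        static final int FEATURES = 10000;             // fixed vector size, chosen up front

        static double[] encode(String[] tokens) {
            double[] vector = new double[FEATURES];
            for (String token : tokens) {
                int index = Math.floorMod(token.hashCode(), FEATURES);
                vector[index] += 1.0;                  // collisions simply share a cell
            }
            return vector;
        }

        public static void main(String[] args) {
            double[] v = encode("the dollar rose against the yen".split(" "));
            System.out.println(v.length);              // always 10000, no dictionary needed
        }
    }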
42. Classifying the 20 newsgroups data set
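A minimal training-loop sketch in the spirit of the book's 20 newsgroups example, assuming Mahout's SGD classifier OnlineLogisticRegression with 20 output categories over hashed feature vectors; reading the corpus, tokenizing, and encoding are omitted, and the parameter values are illustrative only:

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class TrainNewsgroups {
        static final int FEATURES = 10000;

        public static void main(String[] args) {
            OnlineLogisticRegression learner =
                new OnlineLogisticRegression(20, FEATURES, new L1()) // 20 newsgroups = 20 categories
                    .lambda(1e-4)                                    // regularization strength (assumed value)
                    .learningRate(20);                               // initial learning rate (assumed value)

            // In the real example every document becomes a Vector paired with its
            // newsgroup id (0..19); a single dummy example stands in here.
            Vector doc = new RandomAccessSparseVector(FEATURES);
            doc.set(42, 1.0);
            learner.train(7, doc);

            Vector scores = learner.classifyFull(doc);   // one score per newsgroup
            System.out.println(scores.maxValueIndex());  // index of the most likely newsgroup
        }
    }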
43. Choosing an algorithm
44. The classifier evaluation API: percent correct; confusion matrix; entropy matrix; AUC; log likelihood
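Two of the listed metrics are easy to show directly: percent correct and a confusion matrix built from (actual, predicted) label pairs (plain Java, not the Mahout evaluation API itself):

    import java.util.Arrays;

    public class SimpleEvaluation {
        public static void main(String[] args) {
            int numCategories = 3;
            int[] actual    = {0, 0, 1, 1, 2, 2, 2, 1};
            int[] predicted = {0, 1, 1, 1, 2, 0, 2, 1};

            int[][] confusion = new int[numCategories][numCategories];
            int correct = 0;
            for (int i = 0; i < actual.length; i++) {
                confusion[actual[i]][predicted[i]]++;       // rows = actual, columns = predicted
                if (actual[i] == predicted[i]) correct++;
            }

            System.out.println("percent correct: " + 100.0 * correct / actual.length); // 75.0
            for (int[] row : confusion) System.out.println(Arrays.toString(row));
        }
    }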
45. When classifiers go bad: target leaks; broken feature extraction
46. Tuning the problem: remove fluff variables; add new variables, interactions, and derived values
47. Tuning the classifier: try alternative algorithms; tune the learning algorithm
48. Thank you. Contact: Yasmine.Gaber@espace.com.eg (email), Twitter.com/yasmine_mohamed (Twitter)