Mahout classification presentation

These slides were presented in class on April 7th, 2014.

Published in: Technology, Education

  1. Classification on Mahout
     Naoki Nakatani, San Jose State University, CS185C, Spring 2014
  2. Agenda
     ● Classification Overview
     ● Mahout Overview
       ○ Classification on Mahout
     ● Case Study with Demo
       ○ Problem Description
       ○ Working Environment
       ○ Data Preparation
       ○ ML Model Generation
  3. Classification?
     ● Classifying examples into a given set of categories
     ● Supervised learning
       ○ Prepare data
       ○ Build classifier (train & test)
       ○ Apply classifier to new data
     Image: http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
  4. Mahout?
     ● Scalable machine learning library, so it can handle big data
     ● Runs on Hadoop, reading its data from HDFS
     ● Classification, clustering, collaborative filtering, etc.
     Image: http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
  5. Classification on Mahout?
     Classification: classifying examples into a given set of categories
     Mahout: a scalable machine learning library that can handle big data
     Classification on Mahout: classifying big data into a given set of categories
  6. Case Study & Demo
     Given a question with a title and body, can we automatically generate tags for it?
     Title: Where can I find the LaTeX3 manual?
     Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
     Tags: Documentation latex3 expl3
  7. Dataset
     File: TrainSmall.tsv
     Fields: id, title, body, tags
     Characteristics: each question contains only one tag
     Row layout (placeholder values): “----”, “-----------”, “------------------------”, “--- --- --- ---”
  8. Working Environment
     ● Mac OS X 10.9.1
     ● Eclipse 4.3.2
     ● Hadoop 1.2.1
     ● Mahout 0.9
     ● Source code available here.
  9. Prerequisite (Where are you?)
     ● Your input TSV file is in result > output-topfivetags.
     ● You are in the “result” directory in Terminal.
     ● The “hadoop” and “mahout” commands work.
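
The prerequisites above can be checked from the terminal before starting. The following is a minimal sketch, not part of the original slides: the helper name `have` is made up for illustration, and the folder name is the one given on the slide.

```shell
# Return success if the named command is on the PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Report whether the two required tools can be found.
for tool in hadoop mahout; do
  if have "$tool"; then
    echo "$tool: found"
  else
    echo "$tool: not found on PATH"
  fi
done

# Folder name taken from the slide; run this from the "result" directory.
INPUT_DIR="output-topfivetags"
if [ -d "$INPUT_DIR" ]; then
  echo "$INPUT_DIR: present"
else
  echo "$INPUT_DIR: missing"
fi
```

If either tool is reported as not found, the later steps will fail, so it is worth fixing the PATH first.
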
  10. Prepare Data
      1. Convert the TSV file to Hadoop sequence file format, specifying the tag as the category. (Run TSVToSeq.java)
      This creates the output-tsvtoseq folder containing the chunk-0 file.
  11. Prepare Data
      1. Make a directory in HDFS and upload chunk-0 (the sequence file) to it.
  12. hadoop fs -mkdir <directory>
  13. hadoop fs -put <source> <destination>
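
As a concrete illustration of the two commands above, here is a sketch with the placeholders filled in. The HDFS directory name `tags-seq` is an assumption made for this example; the local path follows the output-tsvtoseq/chunk-0 naming from the slides. The `run` helper is also hypothetical: it only prints each command unless `RUN=1` is set, so the snippet is safe to try without a cluster.

```shell
# Print each command; execute it only when RUN=1 (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical HDFS target directory (assumption, not from the slides).
HDFS_DIR="tags-seq"
# Local sequence file produced by the conversion step.
LOCAL_SEQ="output-tsvtoseq/chunk-0"

# Create the HDFS directory, then upload the sequence file into it.
run hadoop fs -mkdir "$HDFS_DIR" &&
  run hadoop fs -put "$LOCAL_SEQ" "$HDFS_DIR"
```

A relative HDFS path such as `tags-seq` resolves under the user's HDFS home directory. Running `hadoop fs -ls tags-seq` afterwards should list chunk-0.
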
  14. Prepare Data
      2. Transform questions into vectors. (mahout seq2sparse)
  15. mahout seq2sparse -i <input directory> -o <output directory>
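
A filled-in version of the seq2sparse command might look like the sketch below. The directory names are assumptions for illustration. The extra options (`-lnorm`, `-nv`, `-wt tfidf`) do not appear on the slide; they are the ones used in Mahout's own 20 newsgroups example and are shown here only as a common choice. As before, the hypothetical `run` helper prints the command unless `RUN=1` is set.

```shell
# Print each command; execute it only when RUN=1 (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical HDFS directories (assumptions, not from the slides).
SEQ_DIR="tags-seq"       # sequence files uploaded earlier
VEC_DIR="tags-vectors"   # where seq2sparse writes its output

# -wt tfidf : weight terms by TF-IDF
# -nv       : emit named vectors, so each vector keeps its document key
# -lnorm    : apply log normalization to the output vectors
run mahout seq2sparse -i "$SEQ_DIR" -o "$VEC_DIR" -lnorm -nv -wt tfidf
```

With TF-IDF weighting, the vectors to use in the next step end up in the `tfidf-vectors` subdirectory of the output directory.
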
  16. Prepare Data
      3. Split data into
         a. Train set: to train the model
         b. Test set: to test the model
  17. mahout split -i <input directory> --trainingOutput <output dir to train> --testOutput <output dir to test> --randomSelectionPct <integer> --overwrite --sequenceFiles -xm sequential
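
The split command above, with its placeholders filled in, could look like the following sketch. All directory names and the 20 percent figure are assumptions chosen for illustration; the slides do not say which percentage was used. The hypothetical `run` helper prints the command unless `RUN=1` is set.

```shell
# Print each command; execute it only when RUN=1 (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical HDFS directories (assumptions, not from the slides).
VEC_DIR="tags-vectors"     # seq2sparse output directory
TRAIN_DIR="tags-train"     # vectors used to train the model
TEST_DIR="tags-test"       # vectors held out to test the model
TEST_PCT=20                # percentage of items randomly selected as test data

run mahout split -i "$VEC_DIR/tfidf-vectors" \
  --trainingOutput "$TRAIN_DIR" \
  --testOutput "$TEST_DIR" \
  --randomSelectionPct "$TEST_PCT" \
  --overwrite --sequenceFiles -xm sequential
```

Because the selection is random, the exact contents of the two sets change from run to run.
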
  18. Build Classifier
      1. Choose the algorithm to use for classification
      Available algorithms:
      ○ Naive Bayes
        ■ trainnb, testnb
        ■ org.apache.mahout.classifier.naivebayes
      ○ Hidden Markov Model
        ■ baumwelch, hmmpredict
        ■ org.apache.mahout.classifier.sequencelearning.hmm
      ○ Logistic Regression
        ■ trainlogistic, runlogistic
        ■ org.apache.mahout.classifier.sgd
      ○ Random Forest
        ■ ?
        ■ ?
  19. Build Classifier (Naive Bayes)
      2. Train & test the model using the train set
      Should yield high accuracy
  20. mahout trainnb -i <dir to train vectors> -el -li <dir to put label index> -o <dir to put model> -ow -c
  21. mahout testnb -i <dir to train vectors> -m <dir to model> -l <dir to label index> -ow -o <output dir> -c
  22. Build Classifier (Naive Bayes)
      3. Test the model using the test set
      Check whether the accuracy is satisfactory
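
The train-and-test sequence described above can be sketched as three commands: train on the train set, test on the train set as a sanity check, then test on the held-out test set. All directory names below are assumptions for illustration, and the hypothetical `run` helper prints each command unless `RUN=1` is set. The flags are the ones shown on the slides.

```shell
# Print each command; execute it only when RUN=1 (dry run by default).
run() {
  echo "+ $*"
  if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Hypothetical HDFS paths (assumptions, not from the slides).
TRAIN_DIR="tags-train"
TEST_DIR="tags-test"
MODEL_DIR="tags-model"
LABEL_INDEX="tags-labelindex"

# Train: -el extracts labels from the input, -ow overwrites old output,
# -c trains the complementary Naive Bayes variant.
run mahout trainnb -i "$TRAIN_DIR" -el -li "$LABEL_INDEX" -o "$MODEL_DIR" -ow -c

# Test on the train set: accuracy here should be high.
run mahout testnb -i "$TRAIN_DIR" -m "$MODEL_DIR" -l "$LABEL_INDEX" -ow -o "tags-eval-train" -c

# Test on the held-out test set: this is the accuracy that matters.
run mahout testnb -i "$TEST_DIR" -m "$MODEL_DIR" -l "$LABEL_INDEX" -ow -o "tags-eval-test" -c
```

Since the model is trained with `-c`, the same flag is passed to testnb so that both steps use the complementary variant. testnb prints a summary with the accuracy and a confusion matrix; a large gap between the two runs suggests overfitting.
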
  23. Apply Classifier
      What do you have at this point?
      ● model
      ● label index
      You can start classifying new data! (Check this example)
  24. References
      ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages
      ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)
  25. Happy Machine Learning!
