Tag Extraction Final Presentation - CS185CSpring2014

These slides were presented in class on May 7th 2014.
Task allocation
• George : ETL, Data Analysis, Machine Learning, Multi-label classification with Apache Spark
• Naoki : ETL, Data Analysis, Machine Learning, Feature Engineering, Multi-label classification with Apache Mahout

Published in: Technology, Education

  1. 1. Tag Extraction George McBay, Naoki Nakatani San Jose State University CS185C Spring 2014
  2. 2. Agenda ● Problem Description ● ETL ● Data Analysis ● Machine Learning ● Optimization ● Feature Engineering ● Title vs Body ● Stop Words ● Multi-label Classification ● Apache Spark
  4. 4. Problem Given a question with a title and a body, can we automatically generate tags for it? Example ● Title: Where can I find the LaTeX3 manual? ● Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link? ● Tags: Documentation latex3 expl3
  5. 5. Dataset Files : ● Train.csv ● Test.csv Fields : ● id, title, body, tags (Train) ● id, title, body (Test) Characteristics : ● Quoted csv ● Body contains \n ● Tags separated by space ● Entry delimited by 0 [Figure: layout of the quoted records in the csv files]
  6. 6. Working Environment ● Mac OS 10.9.1 ● Apache Hadoop 1.2.1 ● Apache Mahout 0.8 ● Apache Spark 0.9
  8. 8. ETL Extract : Assume the data has already been extracted from the website Transform : Use OpenCSV 1. Remove whitespace characters (‘ ’, ‘\n’, ‘\t’) 2. Combine fields with ‘\t’ 3. Write to tsv file Load : Upload to HDFS
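The transform step can be sketched in a few lines of Python. The slides used OpenCSV (a Java library); this sketch swaps in Python's standard `csv` module instead. The function name `csv_to_tsv` and the decision to collapse each run of whitespace into a single space are assumptions for illustration, not the authors' code.

```python
import csv
import io
import re

WHITESPACE_RUN = re.compile(r"\s+")

def csv_to_tsv(csv_text):
    """Convert quoted CSV text into TSV text, one record per line."""
    # newline="" lets the csv module handle newlines inside quoted fields
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out_lines = []
    for row in reader:
        # Assumption: every run of spaces, tabs and newlines becomes one space
        cleaned = [WHITESPACE_RUN.sub(" ", field).strip() for field in row]
        out_lines.append("\t".join(cleaned))
    return "\n".join(out_lines)

# Hypothetical record in the layout id, title, body, tags
sample = '"1","A title","<p>line one\nline two</p>","java android"\n'
tsv = csv_to_tsv(sample)
```

After this step each question occupies exactly one line, which is what line-oriented Map-Reduce input expects.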
  10. 10. Data Analysis Tag Occurrence Count TSV File → Map-Reduce • Input : <index, question> • Mapper output : <tag, 1> for each tag • Reducer output : <tag, count> for each tag Result (count, tag) : ● 7785 c# ● 6788 java ● 6575 php ● 6135 javascript ● 5317 android ● 4949 jquery ● 3278 c++ ● 3082 python
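The mapper and reducer described on this slide can be imitated in plain Python. This is a minimal sketch rather than the Hadoop job itself: the function names `map_tags` and `reduce_counts` are hypothetical, and it assumes the tags are the last tab-separated field of each TSV line.

```python
from collections import defaultdict

def map_tags(index, question):
    """Mapper: emit (tag, 1) for every tag of one question."""
    # Assumption: the tags are the last tab-separated field, split by spaces
    tags = question.split("\t")[-1].split()
    return [(tag, 1) for tag in tags]

def reduce_counts(pairs):
    """Reducer: sum the ones emitted for each tag."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)

# Hypothetical TSV lines in the layout id, title, body, tags
questions = [
    "1\ttitle a\tbody a\tjava android",
    "2\ttitle b\tbody b\tjava",
    "3\ttitle c\tbody c\tphp jquery",
]
emitted = [pair for i, q in enumerate(questions) for pair in map_tags(i, q)]
tag_counts = reduce_counts(emitted)
```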
  12. 12. Question Filtering for ML TSV File → Map-Reduce → TSV File • Input : <index, question> • Mapper output : <index, question> if the question contains a top-5 tag • Reducer output : <index, question> • Result : TSV file with only the questions that have at least one of the top-5 tags
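A rough Python sketch of the filter follows. The counts are the ones reported on the Data Analysis slide; the helper names `top_tags` and `has_top_tag` and the sample questions are hypothetical.

```python
# Tag counts copied from the Data Analysis slide
TAG_COUNTS = {
    "c#": 7785, "java": 6788, "php": 6575, "javascript": 6135,
    "android": 5317, "jquery": 4949, "c++": 3278, "python": 3082,
}

def top_tags(counts, n=5):
    """Return the n most frequent tags as a set."""
    return set(sorted(counts, key=counts.get, reverse=True)[:n])

def has_top_tag(question, top):
    """Mapper test: keep a question if any of its tags is a top tag."""
    # Assumption: the tags are the last tab-separated field, split by spaces
    tags = question.split("\t")[-1].split()
    return any(tag in top for tag in tags)

TOP5 = top_tags(TAG_COUNTS)

# Hypothetical TSV lines in the layout id, title, body, tags
questions = [
    "1\ttitle a\tbody a\tjava swing",
    "2\ttitle b\tbody b\tpython numpy",
    "3\ttitle c\tbody c\tjquery javascript",
]
filtered = [q for q in questions if has_top_tag(q, TOP5)]
```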
  13. 13. Machine Learning ● Problem ○ Can we classify questions into one of 5 categories (tags) ? Classification ● Naive Bayes Classifier ● Detail in Mahout Classification Presentation
  14. 14. Machine Learning Correctly Classified Instances : 10209 81.8816% Incorrectly Classified Instances : 2259 18.1184% Total Classified Instances : 12468
  16. 16. Title vs Body Intuitively… Title is a short summary describing the body of the question ⇒ Title must be more important than body! How to put more emphasis on title? ● Build separate models for title & body + more weight for title model? ● Prepend title several times and feed into regular model?
  17. 17. Two models approach Title model not accurate… ● Too short for model to distinguish labels ● Longer text wins!
  18. 18. Repeated title approach Slight improvement! ● Testing against train-set ~ 93% ⇒ ~ 95% ● Testing against test-set ~ 80% ⇒ ~ 82% Repeating the title adds ● more stop words ⇒ No effect ● more keywords (if the title has any)
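The repeated-title idea amounts to a small preprocessing step before the text reaches the classifier. In this sketch the function name `build_document` and the default of three repetitions are assumptions; the slides do not say how many times the title was prepended.

```python
def build_document(title, body, title_repeats=3):
    """Prepend the title several times so its words carry more weight."""
    # Assumption: copies are joined with single spaces
    return " ".join([title] * title_repeats + [body])

doc = build_document("display image in html", "Can we display image inside a text field?", 2)
```

Every keyword in the title now counts `title_repeats` times in the term frequencies, while the body is left unchanged.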
  20. 20. Diving into model ● Top 10 words from each category ● Popular (redundant) words showing up in all categories (I, it, code, etc) BUT ● Some words specific to each category (activity for android, jquery for javascript, echo for php)
  21. 21. Which words to drop? Word count against TrainSmall.tsv? ● Total count : 19276034 Top 5: ● p - 827029 ● the - 545950 ● i - 476056 ● to - 393027 ● a - 362328 Problem ● Key words have high count too ○ 39th - http - 51412wc ○ 63rd - java - 35076wc ○ 91st - php - 25135wc Can’t even throw away first 100 words...
  22. 22. Which words to drop? Word count against ordinary English text? ● 20 books from gutenberg.org ● Total count : 1041565 ● A lot less technical! (java appears only 4 times, probably as the island in Indonesia?) ● Safe to throw away the 1959 words that appear more than 50 times (> 50wc)
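Building the stop-word list from a reference corpus can be sketched as below. The tokenizer pattern and the names `word_counts` and `stop_words` are assumptions; only the "more than 50 occurrences" threshold comes from the slide, and the tiny example uses a threshold of 2 so that it fits in one line of text.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased tokens; the token pattern is an assumption."""
    return Counter(re.findall(r"[a-z0-9#+']+", text.lower()))

def stop_words(reference_counts, threshold=50):
    """Words seen more than `threshold` times in ordinary English text."""
    return {word for word, count in reference_counts.items() if count > threshold}

# Tiny stand-in for the 20 Gutenberg books
reference = word_counts("the cat and the dog and the java")
to_drop = stop_words(reference, threshold=2)
```

Because the reference text is non-technical, keywords such as java stay below the threshold and are kept.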
  23. 23. BUT
  24. 24. Not much improvement... ● Due to tf-idf measurement ○ Less weight for words appearing in many documents ○ More weight for words appearing only in specific documents
  26. 26. Any room for improvement? What is the source of error? ● android ⇔ java ==> both java ● javascript ⇔ php ===> both web-related ● java classified as c# ===> many questions have both tags
  27. 27. Any room for improvement? No problem if we can give multiple labels to one question!
  28. 28. Multi-label classification ● Modification from previous classification task ○ Top5 tags ⇒ Top1000 tags ○ 1 tag for 1 question ⇒ 5 tags for 1 question (Pick 5 most probable tags) ○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times [Figure: a question with tag1 and tag2 is fed to the model as (tag1, body) and (tag2, body)]
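The two changes (one training instance per tag, and five predicted tags per question) can be sketched as follows. The helper names `expand_training` and `top_k_tags` and the example scores are hypothetical; the scores stand in for whatever per-tag values the classifier produces.

```python
def expand_training(questions):
    """Turn each (text, tags) pair into one (tag, text) instance per tag."""
    return [(tag, text) for text, tags in questions for tag in tags]

def top_k_tags(scores, k=5):
    """Pick the k most probable tags from a {tag: score} mapping."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

instances = expand_training([("upgrade iphone 3gs", ["iphone", "ios"])])

# Hypothetical classifier scores for one question
scores = {"iphone": 0.9, "ios": 0.8, "osx": 0.4, "objective-c": 0.3,
          "php": 0.2, "java": 0.1}
predicted = top_k_tags(scores)
```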
  29. 29. Good outcome (Example 1) TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● iphone ● ios ● osx ● objective-c ● php
  30. 30. GREAT outcome (Example 2) TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● javascript ● jquery ===> Never appears in text! ● html ● c# ● php
  31. 31. Stats Row : # actual tags assigned to one question Col : # predicted tags which are also in the actual tag set [Ex] Out of the total 32798 questions which have 2 tags: ● For 14541 questions, the model suggested both actual tags. ● For 13922 questions, the model suggested 1 of the 2 actual tags. ● For 4335 questions, the model suggested neither of the actual tags.
  32. 32. How to evaluate Generous evaluator If model gets at least 1 correct, approve it! Total accuracy = 83.55% (B)
  33. 33. How to evaluate Strict evaluator Never approve unless model gets all correct! Total accuracy = 43.04% (F)
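The two evaluators can be written down directly from their descriptions. The function names are hypothetical; the first two examples are the ones shown on the outcome slides, and the third is an invented case added so that the strict evaluator approves something.

```python
def generous(actual, predicted):
    """Approve if at least one actual tag was predicted."""
    return len(set(actual) & set(predicted)) >= 1

def strict(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(examples, evaluator):
    """Fraction of (actual, predicted) pairs the evaluator approves."""
    return sum(evaluator(a, p) for a, p in examples) / len(examples)

examples = [
    (["iphone", "ios", "upgrade"], ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"],
     ["javascript", "jquery", "html", "c#", "php"]),
    (["java"], ["java", "android", "c#", "php", "python"]),  # hypothetical case
]
generous_acc = accuracy(examples, generous)
strict_acc = accuracy(examples, strict)
```

Both outcome examples from the slides pass the generous evaluator and fail the strict one, which illustrates why the two total accuracies differ so much.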
  34. 34. Conclusion for performance ● Overall, good! ○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related) ● but, not there yet... ○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api, facebook-like ⇒ facebook) ○ Still showing unrelated tags (php python everywhere!)
  36. 36. Spark Advantages: - Easy to get started with - Interactive shell - Less code to write
  37. 37. Spark Disadvantages: - Not many references for MLlib - Still new
  38. 38. Spark ● Used PySpark, the Python interface to Spark ● Implemented the ML model from the ground up using Python dictionaries and map-reduce procedures
  39. 39. How It Works 5 basic procedures used: ● map ● flatMap ● reduce ● reduceByKey ● collectAsMap
  40. 40. How It Works LINE → key_val = line.flatMap(~).map(~) → (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) → key_val = key_val.reduceByKey(~) → (a, 2) (b, 1) (c, 1) (d, 2)
  41. 41. How It Works dict = key_val.collectAsMap() (a, 2) (b, 1) (c, 1) (d, 2) → {a : 2, b : 1, c : 1, d : 2}
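The pipeline on these two slides is a word count. So that it runs without a Spark installation, the sketch below imitates the PySpark operations `flatMap`, `map`, `reduceByKey` and `collectAsMap` with plain Python helpers. The helper names `flat_map` and `reduce_by_key`, the input lines, and the functions passed in place of the elided `~` arguments are all assumptions.

```python
from itertools import chain

def flat_map(func, items):
    """Like RDD.flatMap: apply func to each item and flatten the results."""
    return list(chain.from_iterable(func(item) for item in items))

def reduce_by_key(func, pairs):
    """Like RDD.reduceByKey: merge the values of equal keys with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return list(merged.items())

lines = ["a b c d", "a d"]                   # hypothetical input lines
words = flat_map(str.split, lines)           # flatMap: lines -> words
key_val = [(word, 1) for word in words]      # map: word -> (word, 1)
key_val = reduce_by_key(lambda x, y: x + y, key_val)
counts = dict(key_val)                       # collectAsMap: pairs -> dictionary
```

On an actual RDD the same chain would read roughly `lines.flatMap(...).map(...).reduceByKey(...).collectAsMap()`, with the work distributed across the cluster until the final collect.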
  42. 42. How It Works Model: - statistical model - matrix of weights - uses tf-idf
  43. 43. How It Works Tags
  44. 44. How It Works Tags Words from document
  45. 45. How It Works Tags Relevance Words from document
  46. 46. How It Works Implemented as → { tag : { word : weight } }
  47. 47. How It Works ● Most relevant tag chosen by sum of weights associated to words contained in the document
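With the model stored as `{ tag : { word : weight } }`, choosing a tag is a sum over the document's words. This is a minimal sketch: the names `score_tags` and `most_relevant_tag` and the toy weights are hypothetical, and it assumes that a word occurring twice in the document contributes its weight twice.

```python
def score_tags(model, words):
    """Sum, per tag, the weights of the words that occur in the document."""
    # Assumption: words unknown to a tag contribute 0
    return {tag: sum(weights.get(word, 0.0) for word in words)
            for tag, weights in model.items()}

def most_relevant_tag(model, words):
    """Return the tag with the highest summed weight."""
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Toy model with invented weights
model = {
    "python": {"def": 0.6, "import": 0.4},
    "php": {"echo": 0.7, "import": 0.1},
}
best = most_relevant_tag(model, ["import", "def", "print"])
```

Sorting the scores instead of taking the maximum gives the five most probable tags used for multi-label prediction.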
  48. 48. How It Works Now, how are the weights calculated? ● First calculate idf (inverse document frequency) for each word ● Next calculate tf (term frequency) associated with each tag ● Multiply each tf entry by the idf of its word, then normalize
  49. 49. How It Works idf for a word is defined by: idf(word) = log(D / F(word)) where D = total # of docs in the training set and F(word) = # of docs which contain the word
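The idf definition translates almost literally into code. The function name `idf` and the toy documents are hypothetical, and the natural logarithm is an assumption since the slide does not state a base.

```python
import math

def idf(documents):
    """idf(word) = log(D / F(word)) over a list of tokenized documents."""
    total = len(documents)                   # D
    doc_freq = {}                            # F(word)
    for words in documents:
        for word in set(words):              # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / freq) for word, freq in doc_freq.items()}

docs = [["the", "java"], ["the", "php"], ["the", "java", "class"]]
idf_values = idf(docs)
```

A word that appears in every document gets an idf of 0, which is why dropping stop words by hand brought little extra improvement.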
  50. 50. How It Works Two ways to calculate tf: 1) number of times you see the term associated with a tag 2) number of documents you see the term associated with a tag (in other words only count one time per doc)
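Both ways of counting tf can share one function with a switch. The name `tf_per_tag`, the `once_per_doc` flag and the toy training data are hypothetical.

```python
from collections import defaultdict

def tf_per_tag(training, once_per_doc=False):
    """training: list of (tags, words) pairs. Returns {tag: {word: count}}."""
    tf = defaultdict(lambda: defaultdict(int))
    for tags, words in training:
        # Way 2 counts a word at most once per document, way 1 counts every occurrence
        terms = set(words) if once_per_doc else words
        for tag in tags:
            for word in terms:
                tf[tag][word] += 1
    return {tag: dict(counts) for tag, counts in tf.items()}

training = [
    (["python"], ["def", "def", "import"]),
    (["python", "math"], ["def", "x"]),
]
tf_occurrences = tf_per_tag(training)                    # way 1
tf_documents = tf_per_tag(training, once_per_doc=True)   # way 2
```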
  51. 51. Results TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● ios4.3 ● iphone-3gs ● cocoa-touch ● ios4 ● upgrade
  52. 52. Results TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● html ● img ● alignment ● get ● web
  53. 53. Results Top: Predicted Below: Actual
  54. 54. Results ● Not perfect ● But very close ● Relevant words for tags look right
  55. 55. Results most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ] most relevant words for tag “math”: [u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]
  56. 56. Adjusting What can be adjusted? ● Pretty much anything! ● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.
  57. 57. Conclusion ● Adjusting the metrics to get the right model can be time consuming (many things can be adjusted)! ● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and text classification in general), because of how tf-idf is defined.
