Tag Extraction Final Presentation - CS185CSpring2014
These slides were presented in class on May 7th 2014.
Task allocation
• George : ETL, Data Analysis, Machine Learning, Multi-label classification with Apache Spark
• Naoki : ETL, Data Analysis, Machine Learning, Feature Engineering, Multi-label classification with Apache Mahout

Published in: Technology, Education
Transcript

  • 1. Tag Extraction George McBay, Naoki Nakatani San Jose State University CS185C Spring 2014
  • 2. Agenda ● Problem Description ● ETL ● Data Analysis ● Machine Learning ● Optimization ● Feature Engineering ○ Title vs Body ○ Stop Words ● Multi-label Classification ● Apache Spark
  • 4. Problem Given a question with a title and body, can we automatically generate tags for it? Example ● Title: Where can I find the LaTeX3 manual? ● Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link? ● Tags: Documentation latex3 expl3
  • 5. Dataset Files: ● Train.csv ● Test.csv Fields: ● id, title, body, tags (Train) ● id, title, body (Test) Characteristics: ● Quoted CSV ● Body contains \n ● Tags separated by spaces ● Entry delimited by 0 [Figure: record layout, four quoted comma-separated fields per entry, with the body field possibly spanning several lines]
  • 6. Working Environment ● Mac OS 10.9.1 ● Apache Hadoop 1.2.1 ● Apache Mahout 0.8 ● Apache Spark 0.9
  • 8. ETL Extract: Assume data is extracted from the website Transform: Use OpenCSV 1. Remove whitespace characters (‘ ’, ‘\n’, ‘\t’) 2. Combine fields with ‘\t’ 3. Write to a TSV file Load: Upload to HDFS
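The Transform step can be sketched in a few lines. This is only an illustration, not the authors' code: the slides use OpenCSV (Java), while this sketch swaps in Python's standard `csv` module. It also assumes that "removing" whitespace means collapsing each run of spaces, newlines, and tabs into a single space so that words do not run together. The function names and the sample row are hypothetical.

```python
import csv
import io
import re

def row_to_tsv(fields):
    """Collapse whitespace runs inside each field, then join fields with tabs."""
    # Assumption: runs of ' ', '\n', '\t' become one space rather than vanishing.
    cleaned = [re.sub(r"\s+", " ", field).strip() for field in fields]
    return "\t".join(cleaned)

def csv_to_tsv(csv_text):
    """Convert quoted CSV text (fields may span several lines) into TSV lines."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [row_to_tsv(row) for row in reader if row]

# Hypothetical sample entry: id, title, body (with a newline and a tab), tags.
sample = '"1","A title","<p>line one\nline two\twith tab</p>","java android"\n'
tsv_lines = csv_to_tsv(sample)
```

After this step every entry occupies exactly one line, which is what line-oriented Map-Reduce input expects.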
  • 10. Data Analysis Tag Occurrence Count TSV File ⇒ Map-Reduce • Input : <index, question> • Mapper output : <tag, 1> for each tag • Reducer output : <tag, count> for each tag Top tags by count: ● c# 7785 ● java 6788 ● php 6575 ● javascript 6135 ● android 5317 ● jquery 4949 ● c++ 3278 ● python 3082
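The mapper and reducer described above can be imitated in plain Python to see the data flow. This is a sketch under assumptions: it runs in memory rather than on Hadoop, it assumes the tags are the last tab-separated field of each line, and the helper names and sample lines are hypothetical.

```python
from collections import defaultdict

def mapper(index, question):
    """Emit (tag, 1) for every tag of one question line."""
    # Assumption: tags are the last tab-separated field, separated by spaces.
    tags = question.split("\t")[-1].split()
    return [(tag, 1) for tag in tags]

def reducer(tag, counts):
    """Sum the ones emitted for a single tag."""
    return (tag, sum(counts))

def run_job(lines):
    """Group mapper output by key, then reduce each group (in-memory shuffle)."""
    groups = defaultdict(list)
    for index, question in enumerate(lines):
        for tag, one in mapper(index, question):
            groups[tag].append(one)
    return dict(reducer(tag, counts) for tag, counts in groups.items())

# Hypothetical input lines: id, title, body, tags.
lines = ["1\tt\tb\tjava android", "2\tt\tb\tjava", "3\tt\tb\tphp"]
tag_counts = run_job(lines)
```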
  • 12. Question Filtering for ML Input: TSV file Map-Reduce • Input : <index, question> • Mapper output : <index, question> if the question contains a top-5 tag • Reducer output : <index, question> Output: TSV file with the questions that have at least one of the top-5 tags
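The filtering mapper amounts to a membership test on the tag field. A minimal sketch, assuming the top-5 tags are the five most frequent ones from the tag count (c#, java, php, javascript, android) and that tags are the last tab-separated field; the function name and sample lines are hypothetical.

```python
# Top-5 tags taken from the tag occurrence count.
TOP5 = {"c#", "java", "php", "javascript", "android"}

def has_top_tag(question, top_tags=TOP5):
    """Return True if the question line carries at least one top tag."""
    tags = question.split("\t")[-1].split()
    return any(tag in top_tags for tag in tags)

# Hypothetical input lines: id, title, body, tags.
lines = [
    "1\tt\tb\tjava android",
    "2\tt\tb\tlatex3 expl3",
    "3\tt\tb\tjquery php",
]
filtered = [question for question in lines if has_top_tag(question)]
```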
  • 13. Machine Learning ● Problem ○ Can we classify questions into one of 5 categories (tags)? Classification ● Naive Bayes Classifier ● Details in the Mahout Classification Presentation
  • 14. Machine Learning ● Correctly Classified Instances : 10209 (81.8816%) ● Incorrectly Classified Instances : 2259 (18.1184%) ● Total Classified Instances : 12468
  • 16. Title vs Body Intuitively… Title is a short summary describing the body of the question ⇒ Title must be more important than body! How to put more emphasis on title? ● Build separate models for title & body + more weight for title model? ● Prepend title several times and feed into regular model?
  • 17. Two models approach Title model not accurate… ● Too short for model to distinguish labels ● Longer text wins!
  • 18. Repeated title approach Slight improvement! ● Testing against train-set ~ 93% ⇒ ~ 95% ● Testing against test-set ~ 80% ⇒ ~ 82% Repeating the title adds ● more stop words ⇒ No effect ● more keywords (if the title contains any)
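The repeated title approach only changes how the input text is assembled before it reaches the classifier. A minimal sketch; the function name and the default repeat count are assumptions, since the slides do not say how many times the title was prepended.

```python
def build_document(title, body, title_repeats=3):
    """Prepend the title several times so its words weigh more in the model."""
    # title_repeats is a hypothetical tuning knob, not a value from the slides.
    return " ".join([title] * title_repeats + [body])

doc = build_document("Upgrade iPhone 3GS", "I have a iPhone 3GS at 4.2.1", 2)
```

With term-frequency based features, each repeat adds one more occurrence of every title word, which is why keywords in the title gain weight while stop words are still discounted.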
  • 20. Diving into model ● Top 10 words from each category ● Popular (redundant) words showing up in all categories (I, it, code, etc) BUT ● Some words specific to each category (activity for android, jquery for javascript, echo for php)
  • 21. Which words to drop? Word count against TrainSmall.tsv? ● Total count : 19276034 Top 5: ● p - 827029 ● the - 545950 ● i - 476056 ● to - 393027 ● a - 362328 Problem ● Keywords have high counts too ○ 39th - http - 51412 ○ 63rd - java - 35076 ○ 91st - php - 25135 Can’t even throw away the first 100 words...
  • 22. Which words to drop? Word count against ordinary English text? ● 20 books from gutenberg.org ● Total count : 1041565 ● A lot less technical! (java appears only 4 times, probably referring to the Indonesian island) ● Safe to throw away the 1959 words with a count above 50
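Deriving a stop word list from a non-technical reference corpus can be sketched as follows. The tokenizer regex, function names, and sample texts are assumptions; only the idea of counting words in ordinary English text and dropping those above a count threshold (50 on the slide) comes from the source.

```python
from collections import Counter
import re

def word_counts(texts):
    """Count lowercase tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: keeps characters that appear in tags like c# and c++.
        counts.update(re.findall(r"[a-z0-9#+']+", text.lower()))
    return counts

def stop_words(reference_texts, min_count=50):
    """Words occurring more than min_count times in the reference corpus."""
    counts = word_counts(reference_texts)
    return {word for word, count in counts.items() if count > min_count}

# Tiny hypothetical corpus, with a low threshold so the effect is visible.
reference = ["the cat and the dog", "the java island"]
common = stop_words(reference, min_count=2)
```

Technical keywords such as java stay below the threshold in ordinary prose, so they survive the filter.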
  • 23. BUT
  • 24. Not much improvement... ● Due to tf-idf measurement ○ Less weight for words appearing in many documents ○ More weight for words appearing only in specific documents
  • 26. Any room for improvement? What is the source of error? ● android ⇔ java ==> both java ● javascript ⇔ php ===> both web-related ● java classified as c# ===> many questions have both tags
  • 27. Any room for improvement? No problem if we can give multiple labels to one question!
  • 28. Multi-label classification ● Modification from previous classification task ○ Top5 tags ⇒ Top1000 tags ○ 1 tag for 1 question ⇒ 5 tags for 1 question (Pick 5 most probable tags) ○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times [Figure: a question with tags tag1 and tag2 is fed to the model as two training examples, (tag1, body) and (tag2, body)]
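The two changes that matter for the code are the training-set expansion and the top-5 selection. The sketch below assumes the classifier exposes a per-tag score for each question; the function names and the score dictionary are hypothetical, and this is not the Mahout API.

```python
import heapq

def expand_training(question_body, tags):
    """One question with several tags becomes one training example per tag."""
    return [(tag, question_body) for tag in tags]

def top_k_tags(tag_scores, k=5):
    """Pick the k most probable tags from a {tag: score} mapping."""
    return heapq.nlargest(k, tag_scores, key=tag_scores.get)

examples = expand_training("body", ["tag1", "tag2"])

# Hypothetical scores for one question.
scores = {"iphone": 0.9, "ios": 0.8, "osx": 0.3, "php": 0.1, "java": 0.05, "c#": 0.01}
predicted = top_k_tags(scores)
```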
  • 29. Good outcome (Example 1) TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● iphone ● ios ● osx ● objective-c ● php
  • 30. GREAT outcome (Example 2) TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● javascript ● jquery ===> Never appears in text! ● html ● c# ● php
  • 31. Stats Row : # actual tags assigned to one question Col : # predicted tags which are also in the actual tag set [Ex] Out of a total of 32798 questions which have 2 tags: ● For 14541 questions, the model suggested both actual tags. ● For 13922 questions, the model suggested 1 of the 2 actual tags. ● For 4335 questions, the model suggested neither of the actual tags.
  • 32. How to evaluate Generous evaluator If model gets at least 1 correct, approve it! Total accuracy = 83.55% (B)
  • 33. How to evaluate Strict evaluator Never approve unless model gets all correct! Total accuracy = 43.04% (F)
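The two evaluators differ only in the set condition they check. A minimal sketch, assuming "all correct" means every actual tag appears among the predicted tags; the function names are hypothetical, and the two sample questions are Example 1 and Example 2 from the slides.

```python
def generous(actual, predicted):
    """Approve if at least one predicted tag is an actual tag."""
    return len(set(actual) & set(predicted)) >= 1

def strict(actual, predicted):
    """Approve only if every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(pairs, evaluator):
    """Fraction of (actual, predicted) pairs the evaluator approves."""
    return sum(evaluator(actual, predicted) for actual, predicted in pairs) / len(pairs)

pairs = [
    (["iphone", "ios", "upgrade"],
     ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"],
     ["javascript", "jquery", "html", "c#", "php"]),
]
generous_accuracy = accuracy(pairs, generous)
strict_accuracy = accuracy(pairs, strict)
```

Both sample questions pass the generous evaluator and fail the strict one, which illustrates why the two reported accuracies are so far apart.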
  • 34. Conclusion for performance ● Overall, good! ○ Predicted tag set is relatively close to the actual tag set (Apple-related, Web-related) ● but, not there yet... ○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#), or sub-tags (facebook-graph-api, facebook-like ⇒ facebook) ○ Still showing unrelated tags (php python everywhere!)
  • 36. Spark Advantages: - Easy to get started with - Interactive shell - Less code to write
  • 37. Spark Disadvantages: - Not many references for MLlib - Still new
  • 38. Spark ● Used PySpark, the Python interface to Spark ● Implemented the ML model from the ground up using Python dictionaries and map-reduce procedures
  • 39. How It Works 5 basic procedures used: ● map ● flatMap ● reduce ● reduceByKey ● collectAsMap
  • 40. How It Works ● key_val = line.flatMap(~).map(~) turns LINE into (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) ● key_val = key_val.reduceByKey(~) turns those pairs into (a, 2) (b, 1) (c, 1) (d, 2)
  • 41. How It Works ● dict = key_val.collectAsMap() turns (a, 2) (b, 1) (c, 1) (d, 2) into {a : 2, b : 1, c : 1, d : 2}
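The pipeline above can be traced without a Spark cluster. The sketch below emulates each RDD step in plain Python and shows a possible PySpark call in a comment; the lambdas in those comments are assumptions, since the slides elide the function arguments with "~", and the input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical input: two lines whose words are a, b, c, d, a, d.
lines = ["a b c d", "a d"]

# PySpark (assumed arguments):
#   key_val = line.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
words = chain.from_iterable(line.split() for line in lines)  # flatMap
key_val = [(word, 1) for word in words]                      # map

# PySpark (assumed argument): key_val = key_val.reduceByKey(lambda x, y: x + y)
totals = defaultdict(int)
for key, value in key_val:
    totals[key] += value

# PySpark: collectAsMap() returns the pairs to the driver as a dictionary.
word_dict = dict(totals)
```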
  • 42. How It Works Model: - statistical model - matrix of weights - uses tf-idf
  • 43-45. How It Works [Figure: matrix of tags against words from the document; each entry is the relevance of a word to a tag]
  • 46. How It Works Implemented as → { tag : { word : weight } }
  • 47. How It Works ● The most relevant tag is chosen by the sum of the weights associated with the words contained in the document
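Scoring with the nested dictionary `{ tag : { word : weight } }` can be sketched as follows. The weights and tag names are made up for illustration, and the sketch assumes that every occurrence of a word contributes its weight and that unknown words contribute zero; the slides do not specify either detail.

```python
def score_tags(model, words):
    """Sum, per tag, the weights of the words found in the document."""
    return {
        tag: sum(weights.get(word, 0.0) for word in words)
        for tag, weights in model.items()
    }

def most_relevant_tag(model, words):
    """Return the tag with the highest summed weight."""
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Hypothetical model with made-up weights.
model = {
    "python": {"def": 0.7, "import": 0.6},
    "php": {"echo": 0.9},
}
document_words = ["def", "import", "x"]
best = most_relevant_tag(model, document_words)
```

Taking the five highest scores instead of the single maximum gives the multi-label variant.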
  • 48. How It Works Now, how are the weights calculated? ● First calculate idf (inverse document frequency) for each word ● Next calculate tf (term frequency) associated with each tag ● Multiply each entry by idf, then normalize
  • 49. How It Works idf for a word is defined by: idf(word) = log(D / F(word)), where D = total number of documents in the training set and F(word) = number of documents which contain the word
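The idf definition translates directly into code. A minimal sketch, assuming the natural logarithm (the slide does not state the base) and documents given as lists of words; the function name and sample documents are hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Compute idf(word) = log(D / F(word)) for every word in the training set."""
    total = len(documents)  # D: total number of documents
    doc_freq = {}           # F(word): number of documents containing the word
    for words in documents:
        for word in set(words):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(total / freq) for word, freq in doc_freq.items()}

# Hypothetical training documents, already split into words.
documents = [["a", "b"], ["a", "c"], ["a", "b", "b"]]
idf = inverse_document_frequency(documents)
```

A word that appears in every document, such as "a" here, gets an idf of zero, which is why common words carry little weight in the model.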
  • 50. How It Works Two ways to calculate tf: 1) number of times you see the term associated with a tag 2) number of documents you see the term associated with a tag (in other words only count one time per doc)
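The two tf variants differ only in whether repeated words inside one document are counted again. A sketch under assumptions: training data is a list of (tags, words) pairs, and the function name, flag name, and sample data are hypothetical.

```python
from collections import defaultdict

def term_frequency(tagged_docs, once_per_doc=False):
    """Build {tag: {word: count}} using one of the two counting variants."""
    tf = defaultdict(lambda: defaultdict(int))
    for tags, words in tagged_docs:
        # Variant 2 counts each word at most once per document.
        counted = set(words) if once_per_doc else words
        for tag in tags:
            for word in counted:
                tf[tag][word] += 1
    return {tag: dict(counts) for tag, counts in tf.items()}

# Hypothetical training data: (tags, words of the document).
tagged_docs = [
    (["php"], ["echo", "echo", "x"]),
    (["php", "html"], ["echo"]),
]
tf_by_occurrence = term_frequency(tagged_docs)
tf_by_document = term_frequency(tagged_docs, once_per_doc=True)
```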
  • 51. Results TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p> Actual tags ● iphone ● ios ● upgrade Predicted tags ● ios4.3 ● iphone-3gs ● cocoa-touch ● ios4 ● upgrade
  • 52. Results TITLE: Is it possible to display an image in text field in html? BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p> Actual tags ● javascript ● jquery ● html ● css ● web Predicted tags ● html ● img ● alignment ● get ● web
  • 53. Results Top: Predicted Below: Actual
  • 54. Results ● Not perfect ● But very close ● Relevant words for tags look right
  • 55. Results most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ] most relevant words for tag “math”: [u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]
  • 56. Adjusting What can be adjusted? ● Pretty much anything! ● I tried playing with: tf, idf, tag_frequency, normalization, cleaning text, etc.
  • 57. Conclusion ● Adjusting the metrics to get the right model can be time-consuming (many things can be adjusted)! ● But still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and text classification in general), because of how tf-idf is defined.
