600.465 Connecting the dots - I(NLP in Practice)<br />Delip Rao<br />delip@jhu.edu<br />
What is “Text”?<br />
What is “Text”?<br />
What is “Text”?<br />
“Real” World<br />Tons of data on the web<br />A lot of it is text<br />In many languages<br />In many genres<br />Languag...
But we have 600.465<br /><ul><li>We can study anything about language ...
1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data</li></ul>Forward Backward, <br />Gradient Descent, LBFGS, Simulated Annealing, Contrastive Estimation...
NLP for fun and profit<br />Making NLP more accessible<br />Provide APIs for common NLP tasks<br />vartext = document.get(...
Desideratum: Multilinguality<br />Except for feature extraction, systems should be language agnostic<br />
In this lecture<br />Understand how to solve and ace in NLP tasks<br />Learn general methodology or approaches<br />End-to...
Case study: Named Entity Recognition<br />
Case study: Named Entity Recognition<br />Demo: http://viewer.opencalais.com<br /><ul><li>How do we build something like t...
How do we find out well we are doing?
How can we improve?</li></li></ul><li>Case study: Named Entity Recognition<br />Define the problem<br />Say, PERSON, LOCAT...
Case study: Named Entity Recognition<br />Collect data to learn from<br />Sentences with words marked as PER, ORG, LOC, NO...
Pay the experts<br />
Wisdom of the crowds<br />
Getting the data: Annotation<br />Time consuming<br />Costs $$$<br />Need for quality control<br />Inter-annotator aggreem...
Only France and Great Britain backed Fischler ‘s proposal .<br />Only France and Great Britain backed Fischler‘s proposal ...
<ul><li>1. Formalize some insights
2. Study the formalism mathematically
3. Develop & implement algorithms
4. Test on real data</li></ul>Our recipe …<br />
NER: Designing features<br />Need to segment sentences<br />Tokenize the sentences<br />Preprocessing<br />Not as trivial ...
NER: Designing features<br />
NER: Designing features<br />
NER: Designing features<br />
NER: Designing features<br />
NER: Designing features<br />These are extracted during preprocessing!<br />
NER: Designing features<br />
NER: Designing features<br />
Upcoming SlideShare
Loading in …5
×

NLP in Practice - Part I

1,781 views

Published on

How do we build NLP-based systems for real world problems?

See http://wp.me/p1K9ov-I for details.

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,781
On SlideShare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

NLP in Practice - Part I

  1. 1. 600.465 Connecting the dots - I(NLP in Practice)<br />Delip Rao<br />delip@jhu.edu<br />
  2. 2.
  3. 3. What is “Text”?<br />
  4. 4. What is “Text”?<br />
  5. 5. What is “Text”?<br />
  6. 6. “Real” World<br />Tons of data on the web<br />A lot of it is text<br />In many languages<br />In many genres<br />Language by itself is complex. <br />The Web further complicates language.<br />
  7. 7. But we have 600.465<br /><ul><li>We can study anything about language ...
  8. 8. 1. Formalize some insights
  9. 9. 2. Study the formalism mathematically
  10. 10. 3. Develop & implement algorithms
  11. 11. 4. Test on real data</li></ul>Forward Backward, <br />Gradient Descent, LBFGS, Simulated Annealing, Contrastive Estimation, …<br />feature functions!<br />f(wi = off, wi+1 = the)<br />f(wi = obama, yi = NP)<br />Adapted from : Jason Eisner<br />
  12. 12. NLP for fun and profit<br />Making NLP more accessible<br />Provide APIs for common NLP tasks<br />vartext = document.get(…);<br />varentities = agent.markNE(text);<br />Big $$$$<br />Backend to intelligent processing of text<br />
  13. 13. Desideratum: Multilinguality<br />Except for feature extraction, systems should be language agnostic<br />
  14. 14. In this lecture<br />Understand how to solve and ace in NLP tasks<br />Learn general methodology or approaches<br />End-to-End development using an example task<br />Overview of (un)common NLP tasks<br />
  15. 15. Case study: Named Entity Recognition<br />
  16. 16. Case study: Named Entity Recognition<br />Demo: http://viewer.opencalais.com<br /><ul><li>How do we build something like this?
  17. 17. How do we find out well we are doing?
  18. 18. How can we improve?</li></li></ul><li>Case study: Named Entity Recognition<br />Define the problem<br />Say, PERSON, LOCATION, ORGANIZATION<br />The UN secretary general met president Obama at Hague.<br />The UN secretary general met president Obama at Hague.<br />ORG<br />PER<br />LOC<br />
  19. 19. Case study: Named Entity Recognition<br />Collect data to learn from<br />Sentences with words marked as PER, ORG, LOC, NONE<br />How do we get this data?<br />
  20. 20. Pay the experts<br />
  21. 21. Wisdom of the crowds<br />
  22. 22. Getting the data: Annotation<br />Time consuming<br />Costs $$$<br />Need for quality control<br />Inter-annotator aggreement<br />Kappa score (Kippendorf, 1980)<br />Smarter ways to annotate<br />Get fewer annotations: Active Learning<br />Rationales (Zaidan, Eisner & Piatko, 2007)<br />
  23. 23. Only France and Great Britain backed Fischler ‘s proposal .<br />Only France and Great Britain backed Fischler‘s proposal .<br />Input<br />x<br />Labels<br />y<br />
  24. 24. <ul><li>1. Formalize some insights
  25. 25. 2. Study the formalism mathematically
  26. 26. 3. Develop & implement algorithms
  27. 27. 4. Test on real data</li></ul>Our recipe …<br />
  28. 28. NER: Designing features<br />Need to segment sentences<br />Tokenize the sentences<br />Preprocessing<br />Not as trivial as you think<br />Original text itself might be in an ugly HTML<br />Cleaneval!<br />
  29. 29. NER: Designing features<br />
  30. 30. NER: Designing features<br />
  31. 31. NER: Designing features<br />
  32. 32. NER: Designing features<br />
  33. 33. NER: Designing features<br />These are extracted during preprocessing!<br />
  34. 34. NER: Designing features<br />
  35. 35. NER: Designing features<br />
  36. 36. NER: Designing features<br />Can you think of other features?<br />HAS_DIGITS<br />IS_HYPHENATED<br />IS_ALLCAPS<br />FREQ_WORD<br />RARE_WORD<br />USEFUL_UNIGRAM_PER<br />USEFUL_BIGRAM_PER<br />USEFUL_UNIGRAM_LOC<br />USEFUL_BIGRAM_LOC<br />USEFUL_UNIGRAM_ORG<br />USEFUL_BIGRAM_ORG<br />USEFUL_SUFFIX_PER<br />USEFUL_SUFFIX_LOC<br />USEFUL_SUFFIX_ORG<br />WORD<br />PREV_WORD<br />NEXT_WORD<br />PREV_BIGRAM<br />NEXT_BIGRAM<br />POS<br />PREV_POS<br />NEXT_POS<br />PREV_POS_BIGRAM<br />NEXT_POS_BIGRAM<br />IN_LEXICON_PER<br />IN_LEXICON_LOC<br />IN_LEXICON_ORG<br />IS_CAPITALIZED<br />
  37. 37. Case: Named Entity Recognition<br />Evaluation Metrics<br />Token accuracy: What percent of the tokens got labeled correctly<br />Problem with accuracy<br />Precision-Recall-F<br />president O<br />Barack B-PER<br />Obama O<br />
  38. 38. NER: How can we improve?<br />Engineer better features<br />Design better models<br />Conditional Random Fields<br />Y1<br />Y2<br />Y3<br />Y4<br />x1<br />x2<br />x3<br />x4<br />
  39. 39. NER: How else can we improve?<br />Unlabeled data!<br />example from Jerry Zhu<br />
  40. 40. NER : Challenges<br />Domain transfer<br /> WSJ NYT<br /> WSJ  Blogs ??<br /> WSJ  Twitter ??!?<br />Tough nut: Organizations<br />Non textual data?<br />Entity Extraction is a Boring Solved Problem – or is it?<br />(Vilain, Su and Lubar, 2007)<br />
  41. 41. NER: Related application<br />Extracting real estate information from Criagslist Ads<br />Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door. <br />Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-inkitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door. <br />
  42. 42. NER: Related Application<br />BioNLP: Annotation of chemical entities<br />Corbet, Batchelor & Teufel, 2007<br />
  43. 43. Shared Tasks: NLP in practice<br />Shared Task<br />Everybody works on a (mostly) common dataset<br />Evaluation measures are defined<br />Participants get ranked on the evaluation measures<br />Advance the state of the art<br />Set benchmarks<br />Tasks involve common hard problems or new interesting problems<br />

×