Your SlideShare is downloading. ×
NLP in Practice - Part I
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

NLP in Practice - Part I

1,420

Published on

How do we build NLP-based systems for real world problems? …

How do we build NLP-based systems for real world problems?

See http://wp.me/p1K9ov-I for details.

Published in: Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,420
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
0
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. 600.465 Connecting the dots - I(NLP in Practice)
    Delip Rao
    delip@jhu.edu
  • 2.
  • 3. What is “Text”?
  • 4. What is “Text”?
  • 5. What is “Text”?
  • 6. “Real” World
    Tons of data on the web
    A lot of it is text
    In many languages
    In many genres
    Language by itself is complex.
    The Web further complicates language.
  • 7. But we have 600.465
    • We can study anything about language ...
    • 8. 1. Formalize some insights
    • 9. 2. Study the formalism mathematically
    • 10. 3. Develop & implement algorithms
    • 11. 4. Test on real data
    Forward Backward,
    Gradient Descent, LBFGS, Simulated Annealing, Contrastive Estimation, …
    feature functions!
    f(wi = off, wi+1 = the)
    f(wi = obama, yi = NP)
    Adapted from : Jason Eisner
  • 12. NLP for fun and profit
    Making NLP more accessible
    Provide APIs for common NLP tasks
    vartext = document.get(…);
    varentities = agent.markNE(text);
    Big $$$$
    Backend to intelligent processing of text
  • 13. Desideratum: Multilinguality
    Except for feature extraction, systems should be language agnostic
  • 14. In this lecture
    Understand how to solve and ace in NLP tasks
    Learn general methodology or approaches
    End-to-End development using an example task
    Overview of (un)common NLP tasks
  • 15. Case study: Named Entity Recognition
  • 16. Case study: Named Entity Recognition
    Demo: http://viewer.opencalais.com
    • How do we build something like this?
    • 17. How do we find out well we are doing?
    • 18. How can we improve?
  • Case study: Named Entity Recognition
    Define the problem
    Say, PERSON, LOCATION, ORGANIZATION
    The UN secretary general met president Obama at Hague.
    The UN secretary general met president Obama at Hague.
    ORG
    PER
    LOC
  • 19. Case study: Named Entity Recognition
    Collect data to learn from
    Sentences with words marked as PER, ORG, LOC, NONE
    How do we get this data?
  • 20. Pay the experts
  • 21. Wisdom of the crowds
  • 22. Getting the data: Annotation
    Time consuming
    Costs $$$
    Need for quality control
    Inter-annotator aggreement
    Kappa score (Kippendorf, 1980)
    Smarter ways to annotate
    Get fewer annotations: Active Learning
    Rationales (Zaidan, Eisner & Piatko, 2007)
  • 23. Only France and Great Britain backed Fischler ‘s proposal .
    Only France and Great Britain backed Fischler‘s proposal .
    Input
    x
    Labels
    y
  • 24.
    • 1. Formalize some insights
    • 25. 2. Study the formalism mathematically
    • 26. 3. Develop & implement algorithms
    • 27. 4. Test on real data
    Our recipe …
  • 28. NER: Designing features
    Need to segment sentences
    Tokenize the sentences
    Preprocessing
    Not as trivial as you think
    Original text itself might be in an ugly HTML
    Cleaneval!
  • 29. NER: Designing features
  • 30. NER: Designing features
  • 31. NER: Designing features
  • 32. NER: Designing features
  • 33. NER: Designing features
    These are extracted during preprocessing!
  • 34. NER: Designing features
  • 35. NER: Designing features
  • 36. NER: Designing features
    Can you think of other features?
    HAS_DIGITS
    IS_HYPHENATED
    IS_ALLCAPS
    FREQ_WORD
    RARE_WORD
    USEFUL_UNIGRAM_PER
    USEFUL_BIGRAM_PER
    USEFUL_UNIGRAM_LOC
    USEFUL_BIGRAM_LOC
    USEFUL_UNIGRAM_ORG
    USEFUL_BIGRAM_ORG
    USEFUL_SUFFIX_PER
    USEFUL_SUFFIX_LOC
    USEFUL_SUFFIX_ORG
    WORD
    PREV_WORD
    NEXT_WORD
    PREV_BIGRAM
    NEXT_BIGRAM
    POS
    PREV_POS
    NEXT_POS
    PREV_POS_BIGRAM
    NEXT_POS_BIGRAM
    IN_LEXICON_PER
    IN_LEXICON_LOC
    IN_LEXICON_ORG
    IS_CAPITALIZED
  • 37. Case: Named Entity Recognition
    Evaluation Metrics
    Token accuracy: What percent of the tokens got labeled correctly
    Problem with accuracy
    Precision-Recall-F
    president O
    Barack B-PER
    Obama O
  • 38. NER: How can we improve?
    Engineer better features
    Design better models
    Conditional Random Fields
    Y1
    Y2
    Y3
    Y4
    x1
    x2
    x3
    x4
  • 39. NER: How else can we improve?
    Unlabeled data!
    example from Jerry Zhu
  • 40. NER : Challenges
    Domain transfer
    WSJ NYT
    WSJ  Blogs ??
    WSJ  Twitter ??!?
    Tough nut: Organizations
    Non textual data?
    Entity Extraction is a Boring Solved Problem – or is it?
    (Vilain, Su and Lubar, 2007)
  • 41. NER: Related application
    Extracting real estate information from Criagslist Ads
    Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-in kitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.
    Our oversized one, two and three bedroom apartment homes with floor plans featuring 1 and 2 baths offer space unlike any competition. Relax and enjoy the views from your own private balcony or patio, or feel free to entertain, with plenty of space in your large living room, dining area and eat-inkitchen. The lovely pool and sun deck make summer fun a splash. Our location makes commuting a breeze – Near MTA bus lines, the Metro station, major shopping areas, and for the little ones, an elementary school is right next door.
  • 42. NER: Related Application
    BioNLP: Annotation of chemical entities
    Corbet, Batchelor & Teufel, 2007
  • 43. Shared Tasks: NLP in practice
    Shared Task
    Everybody works on a (mostly) common dataset
    Evaluation measures are defined
    Participants get ranked on the evaluation measures
    Advance the state of the art
    Set benchmarks
    Tasks involve common hard problems or new interesting problems

×