NLP Introduction: Open Source Natural Language Processing
1. Open Source Natural Language Processing
Francis Bond
<www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
<bond@ieee.org>
2009-08-21 (GeekCamp)
2. Self Introduction
• BA in Japanese and Mathematics
• BEng in Power and Control
• PhD in “Machine Translation”
• 1991-2006 NTT (Nippon Telegraph and Telephone)
  - Japanese - English/Malay Machine Translation
  - Japanese corpus, grammar and ontology (Hinoki)
• 2006-2009 NICT (National Inst. for Info. and Comm. Technology)
  - Japanese - English, Chinese Machine Translation
  - Japanese WordNet (Released in March 2009)
3. Overview
• What is NLP (and Why do it)?
• Machine Translation Examples
• Why Open Source?
• Wrap Up
• State of the Art
4. The basic problem
We get words
People saw her duck.
We want meaning
5. People saw her duck (1)
http://www.animaltalk.us/for/Animals/fw-cute-picture-of-your-daughter-with-duck/
6. People saw her duck (2)
http://www.nataliedee.com/012109/ducking-incoming-balls.jpg
7. People saw her duck (3)
OpenClipArtLibrary
8. Syntax
(1) [S [NP [N People]] [VP [V:see saw] [NP [DET her] [N duck]]]]
(2) [S [NP [N People]] [VP [V:see saw] [VP [NP [N her]] [V duck]]]]
(3) [S [NP [N People]] [VP [V:saw saw] [NP [DET her] [N duck]]]]
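The three analyses can be counted mechanically with a small chart parser. The sketch below is a toy illustration only: the lexicon, the rule set, and the SC (small clause) label are assumptions chosen so that exactly these three readings come out, not the grammar used in the talk.

```python
from collections import defaultdict

# Hypothetical toy lexicon: each word lists every category it may take.
LEXICON = {
    "people": ["NP"],
    "saw": ["V:see", "V:saw"],   # past of "see" vs. the verb "to saw"
    "her": ["DET", "NP"],        # possessive vs. object pronoun
    "duck": ["N", "VP"],         # the bird vs. the intransitive verb
}

# Hypothetical binary rules (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "DET", "N"),
    ("VP", "V:see", "NP"),
    ("VP", "V:saw", "NP"),
    ("VP", "V:see", "SC"),       # "saw [her duck]" as an event
    ("SC", "NP", "VP"),
]

def count_parses(words, root="S"):
    """CKY-style chart that counts the distinct trees for each span."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON.get(word.lower(), []):
            chart[i, i + 1][tag] += 1
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    lcount = chart[start, mid].get(left, 0)
                    rcount = chart[mid, end].get(right, 0)
                    if lcount and rcount:
                        chart[start, end][parent] += lcount * rcount
    return chart[0, n].get(root, 0)

print(count_parses("People saw her duck".split()))  # 3
```

Even this four-word sentence gets three trees from six rules, which is why ambiguity grows so quickly with realistic grammars.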
9. Structural Semantics
Who did what to whom, how, where, when and why?
(1) see(people, duck_i: past) poss(duck_i, pron:[3rd, sg, fem]: past)
(2) see(people, duck_j) duck_j(pron:[3rd, sg, fem])
(3) saw(people, duck_i) poss(duck_i, pron:[3rd, sg, fem])
10. Lexical Semantics
What are people? What’s a duck? What does sawing entail?
(4) people ⊂ entity
(5) see ⊂ perceive
(6) saw ⊂ cut
(7) duck_i ⊂ bird
(8) duck_j ⊂ move
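Relations like (4) to (8) are what a thesaurus or WordNet-style resource stores. As a minimal sketch, assuming only the five links listed here and hypothetical sense labels `duck_i` (noun) and `duck_j` (verb), a lookup could look like this:

```python
# Subset (hypernym) links copied from items (4)-(8).
# The sense labels duck_i and duck_j are illustrative names only.
HYPERNYM = {
    "people": "entity",
    "see": "perceive",
    "saw": "cut",
    "duck_i": "bird",
    "duck_j": "move",
}

def is_a(sense, ancestor):
    """Follow hypernym links upward and report whether `ancestor` is reached."""
    seen = set()
    while sense in HYPERNYM and sense not in seen:
        seen.add(sense)
        sense = HYPERNYM[sense]
        if sense == ancestor:
            return True
    return False

def senses_under(candidate_senses, ancestor):
    """Keep only the candidate senses that fall under `ancestor`."""
    return [s for s in candidate_senses if is_a(s, ancestor)]

print(senses_under(["duck_i", "duck_j"], "move"))  # ['duck_j']
```

A real resource has many levels and many parents per sense; the loop already handles longer chains, but multiple inheritance would need a graph search.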
11. Pragmatics
The study of meaning in context.
• Which people?
• What duck?
• Why did you say that?
• What does it imply?
12. The problem restated
• How can we model and resolve ambiguity?
• Two main approaches
  - Deduce implicit models
    ∗ bag of words, n-gram chunks, ...
  - Define explicit models
    ∗ Grammars, lexicons and thesauri
• Then build a statistical language model (machine learning)
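The implicit models named above need very little machinery. A minimal sketch, assuming whitespace tokenisation (real systems need proper segmentation, especially for Japanese or Chinese); the function names are illustrative:

```python
from collections import Counter

def bag_of_words(tokens):
    """Unordered word counts: all word-order information is discarded."""
    return Counter(tokens)

def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "people saw her duck".split()
print(bag_of_words(tokens))
print(ngrams(tokens, 2))
```

Neither representation says which reading of "saw" or "duck" is meant; that has to come from statistics over large amounts of text.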
13. Not just algorithms
• The data is as important as the algorithm
• Two areas of development
  - Open (?) Content
    ∗ The Web!, Text Corpora, WordNet, Wikipedia, dictionaries, ...
  - Open Software
    ∗ NLTK (Python), GATE, DELPH-IN, ...
• Copyright issues are always with us (;_;)
14. Some Examples
• Speech Recognition
• Text-to-speech
• Segmentation: split strings into words
• Part-of-Speech Tagging: nouns or verbs
• Named Entity Recognition
• Syntactic Parsing: syntactic trees and dependencies
• Word Sense Disambiguation: lexical semantics
• Semantic Parsing: structural semantics
15. Two Examples of Open Source MT
• MOSES (http://www.statmt.org/moses/)
  - Open Source Statistical MT tool kit
  - Just add bilingual corpus!
• LOGON (www.delph-in.net/)
  - Open Source Knowledge-based MT tool kit
  - Just add transfer rules!
16. Statistical Machine Translation?
Basic Idea (Brown et al 1990)
\hat{E} = \arg\max_{E} P(E \mid J)
[Figure: a Translation Model P(J \mid E) relates Japanese J to English E, and an English Language Model P(E) scores E. The Decoder takes J and returns \hat{E} = \arg\max_{E} P(E) P(J \mid E).]
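The decoder's job can be sketched over a fixed candidate list. This is only an illustration of the argmax: real decoders such as Moses search a huge hypothesis space, and every probability below is an invented toy number, not a model output.

```python
import math

def decode(source, candidates, language_model, translation_model):
    """Return the candidate E maximising P(E) * P(J | E), scored in log space."""
    def score(e):
        return math.log(language_model[e]) + math.log(translation_model[(source, e)])
    return max(candidates, key=score)

J = "その銀行はここから遠いですか。"
CANDIDATES = [
    "Is that bank distant from here?",
    "The bank is a long way from here?",
]
# Invented toy values: P(E) rewards fluent English, P(J|E) rewards fidelity.
P_E = {CANDIDATES[0]: 0.02, CANDIDATES[1]: 0.001}
P_J_GIVEN_E = {(J, CANDIDATES[0]): 0.3, (J, CANDIDATES[1]): 0.4}

print(decode(J, CANDIDATES, P_E, P_J_GIVEN_E))
```

With these numbers the second candidate has the better translation-model score, but the language model outweighs it, which is the point of combining the two.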
17. Translation Model (IBM Model 4)
P(J, A \mid E)
  English input:         could you recommend another hotel
  Fertility Model:       n(\phi_i \mid E_i)
                         could could recommend another another hotel
  NULL Generation Model: \binom{m - \phi_0}{\phi_0} p_0^{m - 2\phi_0} p_1^{\phi_0}
                         could could recommend NULL another another hotel NULL
  Lexicon Model:         t(J_j \mid E_{A_j})
                         ていただけ ます 紹介し を 他 の ホテル か
  Distortion Model:      d_1(j - k \mid A(E_i) B(J_j)), d_{>1}(j - j' \mid B(J_j))
                         他 の ホテル を 紹介し ていただけ ます か
Now with chunks (another hotel ↔ 他 の ホテル)!
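The first step of this generative story is easy to reproduce by hand. In the sketch below the fertilities are fixed values read off the example; in the actual model they are drawn from the learned distribution n(φ|E), and the function name is illustrative.

```python
def apply_fertility(words, fertility):
    """Copy each English word as many times as its fertility (default 1)."""
    expanded = []
    for word in words:
        expanded.extend([word] * fertility.get(word, 1))
    return expanded

english = "could you recommend another hotel".split()
# Fertilities read off the example: "you" is dropped, two words are doubled.
fertility = {"could": 2, "you": 0, "another": 2}

print(" ".join(apply_fertility(english, fertility)))
# could could recommend another another hotel
```

The later steps (NULL insertion, lexical translation, distortion) are probabilistic in the same way, each with its own parameter table.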
18. Knowledge-based MT
Source Text → Source Analysis (JACY) → MRS_S → Semantic Transfer → MRS_T → Target Generation (ERG) → Target Text
Stochastic Model(s)
• From text to meaning and back again
  - Grammars for Japanese and English
  - Stochastic models to choose interpretations
  - Brittle but powerful
19. Some Examples
Source 私はいやいやその仕事をした 。
Ref I did the work against my will.
Moses I did the work against his will.
JaEn I did that work unwillingly.
Source バイオリンの音色はとても美しい。
Ref The sound of the violin is very sweet.
Moses The violin 音色 is very beautiful .
JaEn Really, the violin timbers are beautiful.
Source メイドはテーブルにナイフとフォークを並べた。
Ref The maid arranged the knives and forks on the table.
Moses The maid on the table arranged the knives and forks.
JaEn The maid set up the fork with the knife in the table.
20. Source その銀行はここから遠いですか。
Ref Is there bank far from here?
Moses The bank is a long way from here?
JaEn Is that bank distant from here?
Source シェークスピアに匹敵する劇作家はいない。
Ref No dramatist can compare with Shakespeare.
Moses Shakespeare is quite equal to a dramatist. (no no)
JaEn A playwright, that matches in Shie-kusupia, doesn’t live.
Source 彼はなぜそんなことをしたのか。
Ref Why did he do that?
Moses Why did he did such a thing?
JaEn Why did he do that business?
21. Why Open?
• NLP needs serious resources
  - They cannot be built and maintained by a single group
  - Open source is a very practical way of achieving flexible multi-group collaboration
• NLP needs standards and historically the successful ones have been created bottom-up.
• Seeing one’s work used by other groups is very rewarding.
• People are generally enthusiastic about contributing to widely used work.
Not just the warm inner glow
22.
• Making resources open source removes difficulties in distributing work or in continuing work at another institution.
• Researchers are evaluated by the impact that their work has: Open Source work generally has more impact.
• Research should be open in principle:
    ... the principle of openness in research - the principle of freedom of access by all interested persons to the underlying data, to the processes, and to the final results of research - is one of overriding importance.
    Openness in Research (Stanford, Research Policy Handbook 2.6)
23. NLP by regexp
Bilingual Dictionaries from mainly monolingual text!
• Fully Bracketed Examples
  - 「収穫逓減の法則(the law of diminishing return)」
• Partly Bracketed Examples
  - 図1に,明瞭性 (Clarity)・新奇性 (Novelty)
• Over a million pairs from the Japanese Web corpus
  - Not yet released (copyright again)
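Both bracket patterns can be approximated with two regular expressions. The patterns below are assumptions written for these two examples, not the expressions used to build the million-pair list: the fully bracketed case uses 「…」 to delimit the Japanese term, and the partly bracketed case guesses that the term is the run of kanji or katakana just before the parenthesis.

```python
import re

# Fully bracketed: 「Japanese term(English gloss)」; 「 marks where the term starts.
FULL = re.compile(r"「([^「」((]+)[((]([A-Za-z][A-Za-z \-]*)[))]」")

# Partly bracketed: only the gloss is delimited, so assume the term is the
# kanji/katakana run right before it (the separator ・ is deliberately excluded).
PARTIAL = re.compile(
    r"([\u4e00-\u9fff\u30a1-\u30fa\u30fc]+)\s*[((]([A-Za-z][A-Za-z \-]*)[))]"
)

def extract_pairs(text):
    """Return (Japanese, English) pairs, preferring fully bracketed matches."""
    pairs = FULL.findall(text)
    return pairs if pairs else PARTIAL.findall(text)

print(extract_pairs("「収穫逓減の法則(the law of diminishing return)」"))
print(extract_pairs("図1に,明瞭性 (Clarity)・新奇性 (Novelty)"))
```

The kanji-run heuristic would cut 収穫逓減の法則 down to 法則 because of the hiragana の, which is why the fully bracketed examples are the more reliable source.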
It’s fun
24. The ultimate goal
• NLP is fairly wide in scope
• We want to know everything about everything and how it fits together
  - The best source of knowledge we have is still text
  - Replace human bandwidth with machine bandwidth
  - Process, refine, reprocess
• Need both technical and social approaches
  - Linguistic Analysis
  - Machine Learning
  - User Generated Content
Mad Scientists of the World Unite
25. Closing
• There are many great open source NLP tools
  - The bleeding edge is mainly open source
  - If you want to know more
  - Or even better want to play with them
  - Or best of all develop them
  ⇒ Say hello: (especially PhD candidates)
    bond@ieee.org
And now, the end is near
26. Another Example of the Problem
(9) Everyone gets a little of Cucumber’s ♥.
• Lexical gaps: Cucumber (name)
• Lexical gaps: ♥ (noun; we have it as a verb: I ♥ NY)
• How to model ambiguity
  - Cucumber is deliberately ambiguous here
    ∗ research shows rude jokes are funnier
    ∗ can we model this?
Topical Example
27. Solutions
• Morphological analysis should guess the POS
  - Based on two to three words of previous context and a large learned lexicon and model
  - This allows us to parse
  - Actually there are issues with ♥ (words are [a-z -]+)
• Recognizing “Cucumber” as software
  - Cucumber is a tool that can execute . . .
• Linking ♥ to love: ♥_n → ♥_v (v2n derivational rule)
• Scaling is the problem
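Two of these points can be shown in a few lines. The sketch assumes a hypothetical lexicon stored as a mapping from word form to a set of POS tags; `fits_word_pattern` and `apply_v2n` are illustrative names, and the rule is left unrestricted only to keep the example short.

```python
import re

WORD = re.compile(r"[a-z -]+")  # the word pattern quoted on the slide

def fits_word_pattern(token):
    """True if the whole token is made of letters, spaces and hyphens."""
    return WORD.fullmatch(token.lower()) is not None

def apply_v2n(lexicon):
    """Add a noun entry for every form that is listed only as a verb."""
    derived = {form: set(tags) for form, tags in lexicon.items()}
    for form, tags in lexicon.items():
        if "v" in tags and "n" not in tags:
            derived[form].add("n")
    return derived

print(fits_word_pattern("love"))   # True
print(fits_word_pattern("♥"))      # False: the symbol never reaches the lexicon
print(apply_v2n({"♥": {"v"}}))
```

Applied to a whole lexicon, an unrestricted rule like this overgenerates badly, which is one face of the scaling problem.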
Feel free to use these slides or extracts from them for any purpose at all, Francis Bond 2009-08-22.