Open Source
  Natural Language Processing
                  Francis Bond
       <www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
          Nanyang Technological University

               <bond@ieee.org>



               2009-08-21 (GeekCamp)
Self Introduction

• BA in Japanese and Mathematics
• BEng in Power and Control
• PhD in “Machine Translation”
• 1991-2006 NTT (Nippon Telegraph and Telephone)
  - Japanese - English/Malay Machine Translation
  - Japanese corpus, grammar and ontology (Hinoki)
• 2006-2009 NICT (National Inst. for Info. and Comm. Technology)
  - Japanese - English, Chinese Machine Translation
  - Japanese WordNet (released in March 2009)

Overview

• What is NLP (and why do it)?
• Machine Translation Examples
• Why Open Source?
• Wrap Up
• State of the Art




The basic problem

                             We get words

                        People saw her duck.

                         We want meaning




People saw her duck (1)

[Image credit: http://www.animaltalk.us/for/Animals/fw-cute-picture-of-your-daughter-with-duck/]

People saw her duck (2)

[Image credit: http://www.nataliedee.com/012109/ducking-incoming-balls.jpg]



People saw her duck (3)

[Image credit: OpenClipArtLibrary]




Syntax

  (1)  [S [NP [N People]] [VP [V [V:see saw]] [NP [DET her] [N duck]]]]

  (2)  [S [NP [N People]] [VP [V [V:see saw]] [VP [NP [N her]] [V duck]]]]

  (3)  [S [NP [N People]] [VP [V [V:saw saw]] [NP [DET her] [N duck]]]]
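
The three analyses can be reproduced with a very small chart parser. The following is a minimal sketch, not the grammar used in the talk: the rule set, the category names `V_see`, `V_saw` and `SC` (for the embedded "her duck" clause, which the slide labels VP), and the lexicon are all assumptions chosen so that exactly these three readings come out.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (hypothetical, for illustration only):
# (left child, right child) -> list of possible parent categories.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V_see", "NP"): ["VP"],   # see + object noun phrase
    ("V_saw", "NP"): ["VP"],   # saw (cut) + object noun phrase
    ("V_see", "SC"): ["VP"],   # see + embedded clause ("her duck" = she ducks)
    ("NP", "V"): ["SC"],
    ("DET", "N"): ["NP"],
}

# Each word lists every category it may have; this is where the ambiguity lives.
LEXICON = {
    "people": ["NP"],
    "saw": ["V_see", "V_saw"],
    "her": ["DET", "NP"],
    "duck": ["N", "V"],
}


def parse(words):
    """CKY-style chart parsing; returns every bracketed S tree over all words."""
    n = len(words)
    # chart[(i, j)][category] holds the trees covering words[i:j].
    chart = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(words):
        for cat in LEXICON[word]:
            chart[(i, i + 1)][cat].append(f"[{cat} {word}]")
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                left_cell = list(chart[(i, k)].items())
                right_cell = list(chart[(k, j)].items())
                for lcat, ltrees in left_cell:
                    for rcat, rtrees in right_cell:
                        for parent in BINARY_RULES.get((lcat, rcat), []):
                            for lt in ltrees:
                                for rt in rtrees:
                                    chart[(i, j)][parent].append(
                                        f"[{parent} {lt} {rt}]"
                                    )
    return chart[(0, n)]["S"]


if __name__ == "__main__":
    for tree in parse("people saw her duck".split()):
        print(tree)
```

Even with four words and six rules the parser has to keep three complete trees alive; nothing in the syntax alone says which one the speaker meant.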




Structural Semantics

     Who did what to whom, how, where, when and why?

    (1)      see(people, duck_i: past) poss(duck_i, pron:[3rd, sg, fem]: past)
    (2)      see(people, duck_j) duck_j(pron:[3rd, sg, fem])
    (3)      saw(people, duck_i) poss(duck_i, pron:[3rd, sg, fem])




Lexical Semantics

     What are people? What’s a duck? What does sawing entail?

    (4)      people ⊂ entity
    (5)      see ⊂ perceive
    (6)      saw ⊂ cut
    (7)      duck_i ⊂ bird
    (8)      duck_j ⊂ move
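
Relations like these are what a thesaurus or wordnet stores. As a rough sketch, the five relations above can be held in a dictionary and followed upwards; the names `duck_i` and `duck_j` stand for the bird and the movement sense, and the extra link `bird ⊂ animal` is an assumption added here only to show a chain of more than one step.

```python
# Hypernym ("is a kind of") links taken from the slide, one parent per concept.
HYPERNYM = {
    "people": "entity",
    "see": "perceive",
    "saw": "cut",
    "duck_i": "bird",   # the bird sense
    "duck_j": "move",   # the "lower your head" sense
    "bird": "animal",   # assumption: not on the slide, added for illustration
}


def hypernym_chain(concept):
    """Follow hypernym links upwards and return every ancestor in order."""
    chain = []
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def is_a(concept, ancestor):
    """True if `ancestor` is reachable from `concept` via hypernym links."""
    return ancestor in hypernym_chain(concept)


if __name__ == "__main__":
    print(hypernym_chain("duck_i"))
    print(is_a("duck_j", "bird"))
```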




Pragmatics

     The study of meaning in context.

• Which people?

• What duck?

• Why did you say that?

• What does it imply?




The problem restated

• How can we model and resolve ambiguity?

• Two main approaches
  - Deduce implicit models
    ∗ bag of words, n-gram chunks, . . .
  - Define explicit models
    ∗ Grammars, lexicons and thesauri

• Then build a statistical language model (machine learning)
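
To make the "implicit model" idea concrete, here is a minimal sketch of a bag of words and a bigram model estimated by simple counting. The three-sentence corpus is invented for illustration; a real model would be trained on a large text collection and would need smoothing for unseen word pairs.

```python
from collections import Counter

# Tiny invented corpus (assumption, for illustration only).
CORPUS = [
    "people saw her duck",
    "people saw her",
    "her duck saw people",
]


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_bigrams(sentences):
    """Count bigrams and how often each word starts a bigram."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        for w1, w2 in ngrams(sentence.split(), 2):
            bigram_counts[(w1, w2)] += 1
            context_counts[w1] += 1
    return bigram_counts, context_counts


def bigram_prob(w1, w2, bigram_counts, context_counts):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 was never a context."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]


# Bag of words: word order is thrown away, only frequencies remain.
BAG_OF_WORDS = Counter(" ".join(CORPUS).split())
BIGRAMS, CONTEXTS = train_bigrams(CORPUS)

if __name__ == "__main__":
    print(BAG_OF_WORDS)
    print(bigram_prob("saw", "her", BIGRAMS, CONTEXTS))
```

No grammar is written down anywhere; whatever the model knows about word order is deduced from the counts.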



Not just algorithms

• The data is as important as the algorithm

• Two areas of development
  - Open (?) Content
    ∗ The Web!, Text Corpora, WordNet, Wikipedia,
      dictionaries, . . .
  - Open Software
    ∗ NLTK (python), Gate, DELPH-IN, . . .

• Copyright issues are always with us (;_;)


Some Examples

• Speech Recognition
• Text-to-speech
• Segmentation: split strings into words
• Part-of-Speech (nouns or verbs)
• Named Entity Recognition
• Syntactic Parsing: syntactic trees and dependencies
• Word Sense Disambiguation: lexical semantics
• Semantic Parsing: structural semantics

Two Examples of Open Source MT

• MOSES (http://www.statmt.org/moses/)
  - Open Source Statistical MT tool kit
    Just add bilingual corpus!

• LOGON (www.delph-in.net/)
  - Open Source Knowledge-based MT tool kit
    Just add transfer rules!




Statistical Machine Translation?

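     Basic Idea (Brown et al., 1990)

       \hat{E} = \operatorname*{argmax}_{E} P(E \mid J)

     Components:
       Translation Model   P(J \mid E)   relates Japanese J to English E
       Language Model      P(E)          scores English E
       Decoder             J → \operatorname*{argmax}_{E} P(E) \, P(J \mid E) → \hat{E}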
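
The step from P(E|J) to the product that the decoder maximises is Bayes' rule. Written out, using the slide's own symbols (J for the Japanese input, E for a candidate English output):

```latex
\begin{align*}
\hat{E} &= \operatorname*{argmax}_{E} P(E \mid J) \\
        &= \operatorname*{argmax}_{E} \frac{P(E)\, P(J \mid E)}{P(J)} \\
        &= \operatorname*{argmax}_{E} P(E)\, P(J \mid E)
\end{align*}
```

The denominator P(J) is the same for every candidate E, so dropping it does not change which E wins. This is why the translation model runs "backwards" (Japanese given English) and a separate English language model can be trained on monolingual text.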



Translation Model (IBM Model 4)
P(J, A \mid E), built up in four steps:

                           could you recommend another hotel
  Fertility Model          n(\phi_i \mid E_i)
                           could could recommend another another hotel
  NULL Generation Model    \binom{m - \phi_0}{\phi_0} \, p_0^{m - 2\phi_0} \, p_1^{\phi_0}
                           could could recommend NULL another another hotel NU
  Lexicon Model            t(J_j \mid E_{A_j})
                           ていただけ ます 紹介し を 他 の ホテル か
  Distortion Model         d_1(j - k \mid A(E_i) B(J_j)),  d_{>1}(j - j' \mid B(J_j))
                           他 の ホテル を 紹介し ていただけ ます か

Now with chunks (another hotel ↔ 他 の ホテル)!
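
As a small worked example, the NULL generation term of IBM Model 4 can be evaluated directly: choose which φ0 of the m − φ0 non-NULL-generated positions get a NULL-generated word, times p0 for each position that does not and p1 for each that does. This is a sketch under stated assumptions: m = 8 is the length of the Japanese sentence on the slide, φ0 = 2 assumes that the truncated last token of the third line is a second NULL, and p1 = 0.2 is an invented value, not a trained parameter.

```python
from math import comb


def null_generation_prob(m, phi0, p1):
    """NULL generation term: C(m - phi0, phi0) * p0^(m - 2*phi0) * p1^phi0.

    m    -- number of words in the Japanese sentence
    phi0 -- number of Japanese words generated from NULL
    p1   -- probability of inserting a NULL-generated word (p0 = 1 - p1)
    """
    if not 0 <= 2 * phi0 <= m:
        raise ValueError("need 0 <= 2 * phi0 <= m")
    p0 = 1.0 - p1
    return comb(m - phi0, phi0) * p0 ** (m - 2 * phi0) * p1 ** phi0


if __name__ == "__main__":
    # m = 8 Japanese words, phi0 = 2 of them from NULL, p1 = 0.2 (assumed).
    print(null_generation_prob(m=8, phi0=2, p1=0.2))
```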
Knowledge-based MT

 Source Text → Source Analysis (JACY) → MRS_S → Semantic Transfer → MRS_T → Target Generation (ERG) → Target Text

 Stochastic Model(s)


• From text to meaning and back again
  - Grammars for Japanese and English
  - Stochastic models to choose interpretations
  - Brittle but powerful



Some Examples
  Source          私はいやいやその仕事をした。
  Ref             I did the work against my will.
  Moses           I did the work against his will.
  JaEn            I did that work unwillingly.
  Source          バイオリンの音色はとても美しい。
  Ref             The sound of the violin is very sweet.
  Moses           The violin 音色 is very beautiful .
  JaEn            Really, the violin timbers are beautiful.
  Source          メイドはテーブルにナイフとフォークを並べた。
  Ref             The maid arranged the knives and forks on the table.
  Moses           The maid on the table arranged the knives and forks.
  JaEn            The maid set up the fork with the knife in the table.
  Source          その銀行はここから遠いですか。
  Ref             Is there bank far from here?
  Moses           The bank is a long way from here?
  JaEn            Is that bank distant from here?
  Source          シェークスピアに匹敵する劇作家はいない。
  Ref             No dramatist can compare with Shakespeare.
  Moses           Shakespeare is quite equal to a dramatist.      (no no)
  JaEn            A playwright, that matches in Shie-kusupia, doesn’t live.
  Source          彼はなぜそんなことをしたのか。
  Ref             Why did he do that?
  Moses           Why did he did such a thing?
  JaEn            Why did he do that business?
Why Open?

• NLP needs serious resources
  - They cannot be built and maintained by a single group
  - Open source is a very practical way of achieving flexible
    multi-group collaboration

• NLP needs standards, and historically the successful ones
  have been created bottom-up.

• Seeing one’s work used by other groups is very rewarding.

• People are generally enthusiastic about contributing to widely
  used work.
Not just the warm inner glow
• Making resources open source removes difficulties in
  distributing work or in continuing work at another institution.

• Researchers are evaluated by the impact that their work has:
  Open Source work generally has more impact.

• Research should be open in principle:
        . . . the principle of openness in research - the principle
        of freedom of access by all interested persons to the
        underlying data, to the processes, and to the final
        results of research - is one of overriding importance.
        Openness in Research (Stanford, Research Policy Handbook 2.6)


NLP by regexp

      Bilingual Dictionaries from mainly monolingual text!

• Fully Bracketed Examples
   - 「収穫逓減の法則(the law of diminishing return)」

• Partly Bracketed Examples
   - 図1に,明瞭性 (Clarity)・新奇性 (Novelty)

• Over a million pairs from the Japanese Web corpus
   - Not yet released (copyright again)
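
The two example patterns can be approximated with a single regular expression. This is a minimal sketch, not the extraction script behind the million pairs: the character ranges (hiragana, katakana letters, the long-vowel mark and common kanji), the restriction of the English side to ASCII letters, spaces and hyphens, and the function name `extract_pairs` are all assumptions made for illustration.

```python
import re

# A run of Japanese characters: hiragana, katakana letters, the prolonged sound
# mark, and CJK unified ideographs. The middle dot "・" is deliberately left out
# so that it separates terms instead of being glued onto the next one.
JA_RUN = r"[\u3041-\u3096\u30a1-\u30fa\u30fc\u4e00-\u9fff]+"

# Japanese term, optional whitespace, then an English gloss in parentheses
# (ASCII or full-width brackets).
PAIR = re.compile(rf"({JA_RUN})\s*[((]([A-Za-z][A-Za-z \-]*)[))]")


def extract_pairs(text):
    """Return (Japanese term, English gloss) pairs found in mostly Japanese text."""
    return [(ja, en.strip()) for ja, en in PAIR.findall(text)]


if __name__ == "__main__":
    print(extract_pairs("「収穫逓減の法則(the law of diminishing return)」"))
    print(extract_pairs("図1に,明瞭性 (Clarity)・新奇性 (Novelty)"))
```

In the partly bracketed case nothing marks where the Japanese term starts, so the pattern simply takes the longest Japanese run before the bracket; on real web text that guess would need filtering.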

It’s fun
The ultimate goal
• NLP is fairly wide in scope

• We want to know everything about everything and
  how it fits together
  - The best source of knowledge we have is still text
  - Replace human bandwidth with machine bandwidth
  - Process, refine, reprocess

• Need both technical and social approaches
  - Linguistic Analysis
  - Machine Learning
  - User Generated Content

Mad Scientists of the World Unite
Closing

• There are many great open source NLP tools
   - the bleeding edge is mainly open source

 If you want to know more,

 or even better, want to play with them,

 or best of all, develop them

⇒ Say hello (especially PhD candidates): bond@ieee.org

And now, the end is near
Another Example of the Problem

    (9)       Everyone gets a little of Cucumber’s ♥.

• Lexical gaps: Cucumber (name)

• Lexical gaps: ♥ (noun; we have it as a verb: I ♥ NY)

• How to model ambiguity
  - Cucumber is deliberately ambiguous here
    ∗ research shows rude jokes are funnier
    ∗ can we model this?


Topical Example
Solutions

• Morphological analysis should guess the POS
  - Based on two to three words of previous context
    and a large learned lexicon and model
  - This allows us to parse
  - Actually there are issues with ♥ (words are [a-z -]+)

• Recognizing “Cucumber” as software
  Cucumber is a tool that can execute . . .

• Linking ♥ to love: ♥_n → ♥_v (v2n derivational rule)

• Scaling is the problem
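
Two of these points are easy to see in a few lines. The sketch below checks tokens against the word pattern `[a-z -]+` quoted above, then applies a blanket verb-to-noun rule to a toy lexicon. The lexicon, the function names and the "every verb may also be a noun" rule are assumptions for illustration; a real derivational rule would be far more restricted, which is part of why scaling is hard.

```python
import re

# The working definition of a word quoted on the slide.
WORD = re.compile(r"[a-z -]+")


def is_word(token):
    """True if the whole token fits the [a-z -]+ definition."""
    return WORD.fullmatch(token) is not None


# Toy lexicon (hypothetical): surface form -> set of part-of-speech tags.
LEXICON = {
    "♥": {"v"},          # known only as a verb, as in "I ♥ NY"
    "duck": {"n", "v"},
}


def apply_v2n(lexicon):
    """Return a copy of the lexicon where every verb also gets a noun reading."""
    derived = {form: set(tags) for form, tags in lexicon.items()}
    for form, tags in lexicon.items():
        if "v" in tags:
            derived[form].add("n")
    return derived


if __name__ == "__main__":
    print(is_word("duck"), is_word("♥"))
    print(apply_v2n(LEXICON)["♥"])
```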

Feel free to use these slides or extracts from them for any purpose at all, Francis Bond 2009-08-22.

NLP Introduction: Open Source Natural Language Processing

  • 1. Open Source Natural Language Processing Francis Bond <www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University <bond@ieee.org> 2009-08-21 (GeekCamp)
  • 2. Self Introduction ¢ BA in Japanese and Mathematics ¢ BEng in Power and Control ¢ PhD in “Machine Translation” ¢ 1991-2006 NTT (Nippon Telegraph and Telephone) ­ Japanese - English/Malay Machine Translation ­ Japanese corpus, grammar and ontology (Hinoki) ¢ 2006-2009 NICT (National Inst. for Info. and Comm. Technology) ­ Japanese - English, Chinese Machine Translation ­ Japanese WordNet (Released in March 2009) 2009-08-21 (GeekCamp) 1
  • 3. Overview ¢ What is NLP (and Why do it)? ¢ Machine Translation Examples ¢ Why Open Source? ¢ Wrap Up ¢ State of the Art 2009-08-21 (GeekCamp) 2
  • 4. The basic problem We get words People saw her duck. We want meaning 2009-08-21 (GeekCamp) 3
  • 5. People saw her duck1 http://www.animaltalk.us/for/Animals/ fw-cute-picture-of-your-daughter-with-duck/ 2009-08-21 (GeekCamp) 4
  • 6. People saw her duck2 http://www.nataliedee.com/012109/ ducking-incoming-balls.jpg 2009-08-21 (GeekCamp) 5
  • 7. People saw her duck3 OpenClipArtLibrary 2009-08-21 (GeekCamp) 6
  • 8. Syntax (1) (2) (3) S S S NP VP NP VP NP VP N N N V NP V VP V NP N N N V:see DET N V NP V V:saw DET N N N N saw her N V:see N V saw her N People People People N saw her V N duck. duck. duck. 2009-08-21 (GeekCamp) 7
  • 9. Structural Semantics Who did what to whom, how, where, when and why? (1) see(people, ducki: past) poss(ducki, pron:[3rd, sg, fem]: past) (2) see(people, duckj ) duckj (pron:[3rd, sg, fem]) (3) saw(people, ducki) poss(ducki, pron:[3rd, sg, fem]) 2009-08-21 (GeekCamp) 8
  • 10. Lexical Semantics What are people? What’s a duck? What does sawing entail? (4) people ⊂ entity (5) see ⊂ perceive (6) saw ⊂ cut (7) ducki ⊂ bird (8) duckj ⊂ move 2009-08-21 (GeekCamp) 9
  • 11. Pragmatics The study of meaning in context. ¢ Which people? ¢ What duck? ¢ Why did you say that? ¢ What does it imply? 2009-08-21 (GeekCamp) 10
  • 12. The problem restated ¢ How can we model and resolve ambiguity? ¢ Two main approaches ­ Deduce implicit models ∗ bag of words, n-gram chunks, . . . ­ Define explicit models ∗ Grammars, lexicons and thesauri ¢ Then build a statistical language model (machine learning) 2009-08-21 (GeekCamp) 11
  • 13. Not just algorithms ¢ The data is as important as the algorithm ¢ Two areas of development ­ Open (?) Content ∗ The Web!, Text Corpora, WordNet, Wikipedia, dictionaries, . . . ­ Open Software ∗ NLTK (python), Gate, DELPH-IN, . . . ¢ Copyright issues are always with us (;_;) 2009-08-21 (GeekCamp) 12
  • 14. Some Examples ¢ Speech Recognition ¢ Text-to-speech ¢ Segmentation: split strings into words ¢ Part-of-Speech (nouns or verbs) ¢ Named Entity Recognition ¢ Syntactic Parsing: syntactic trees and dependencies ¢ Word Sense Disambiguation: lexical semantics ¢ Semantic Parsing: structural semantics 2009-08-21 (GeekCamp) 13
  • 15. Two Examples of Open Source MT ¢ MOSES (http://www.statmt.org/moses/) ­ Open Source Statistical MT tool kit Just add bilingual corpus! ¢ LOGON (www.delph-in.net/) ­ Open Source Knowledge-based MT tool kit Just add transfer rules! 2009-08-21 (GeekCamp) 14
  • 16. Statistical Machine Translation? Basic Idea (Brown et al 1990) ˆ E = argmax P (E|J ) E Japanese Translation Model English Language Model J P (J |E) E P (E) Decoder ˆ J argmaxE P (E)P (J |E) E 2009-08-21 (GeekCamp) 15
  • 17. Translation Model (IBM Model 4) P (J, A|E) Fertility Model could you recommend another hotel n(φi|Ei) NULL Generation Model could could recommend another another hotel m−φ0 φ0 p0 2φ0 pφ0 m− 1 Lexicon Model could could recommend NULL another another hotel NU t(Jj |EAj ) Distortion Model ていただけ ます 紹介し を 他 の ホテル か d1(j − k|A(Ei)B(Jj )) d1>(j − j ′|B(Jj )) 他 の ホテル を 紹介し ていただけ ます か Now with chunks (another hotel ↔ 他 の ホテル)! 2009-08-21 (GeekCamp) 16
  • 18. Knowledge-based MT Source Source Semantic Target Target Text Analysis MRS S MRS T Generation Text (JACY) Transfer (ERG) Stochastic Model(s) ¢ From text to meaning and back again ­ Grammars for Japanese and English ­ Stochastic models to choose interpretations ­ Brittle but powerful 2009-08-21 (GeekCamp) 17
  • 19. Some Examples
      Source: 私はいやいやその仕事をした。
      Ref:    I did the work against my will.
      Moses:  I did the work against his will.
      JaEn:   I did that work unwillingly.

      Source: バイオリンの音色はとても美しい。
      Ref:    The sound of the violin is very sweet.
      Moses:  The violin 音色 is very beautiful .
      JaEn:   Really, the violin timbers are beautiful.

      Source: メイドはテーブルにナイフとフォークを並べた。
      Ref:    The maid arranged the knives and forks on the table.
      Moses:  The maid on the table arranged the knives and forks.
      JaEn:   The maid set up the fork with the knife in the table.
  • 20. Some Examples (continued)
      Source: その銀行はここから遠いですか。
      Ref:    Is there bank far from here?
      Moses:  The bank is a long way from here?
      JaEn:   Is that bank distant from here?

      Source: シェークスピアに匹敵する劇作家はいない。
      Ref:    No dramatist can compare with Shakespeare.
      Moses:  Shakespeare is quite equal to a dramatist. (no no)
      JaEn:   A playwright, that matches in Shie-kusupia, doesn’t live.

      Source: 彼はなぜそんなことをしたのか。
      Ref:    Why did he do that?
      Moses:  Why did he did such a thing?
      JaEn:   Why did he do that business?
  • 21. Why Open?
      - NLP needs serious resources
          - They cannot be built and maintained by a single group
          - Open source is a very practical way of achieving flexible multi-group collaboration
      - NLP needs standards and historically the successful ones have been created bottom-up.
      - Seeing one’s work used by other groups is very rewarding.
      - People are generally enthusiastic about contributing to widely used work.
      Not just the warm inner glow
  • 22. Why Open? (continued)
      - Making resources open source removes difficulties in distributing work or in continuing work at another institution.
      - Researchers are evaluated by the impact that their work has: Open Source work generally has more impact.
      - Research should be open in principle:
          “. . . the principle of openness in research - the principle of freedom of access by all interested persons to the underlying data, to the processes, and to the final results of research - is one of overriding importance.”
          Openness in Research (Stanford, Research Policy Handbook 2.6)
  • 23. NLP by regexp
      Bilingual Dictionaries from mainly monolingual text!
      - Fully Bracketed Examples
          - 「収穫逓減の法則(the law of diminishing return)」
      - Partly Bracketed Examples
          - 図1に,明瞭性 (Clarity)・新奇性 (Novelty)
      - Over a million pairs from the Japanese Web corpus
          - Not yet released (copyright again)
      It’s fun
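The slide does not show the actual expressions used, so the following is only a guess at the general shape of such an extractor: a run of Japanese characters immediately followed by a parenthesised Latin-script gloss. The pattern, the character ranges and the name `extract_pairs` are all assumptions made for this sketch.

```python
import re

# Hiragana, katakana letters (not the middle dot U+30FB), the long-vowel
# mark, and the common CJK ideograph block. These ranges are an assumption.
JAPANESE = r"[\u3041-\u3096\u30a1-\u30fa\u30fc\u4e00-\u9fff]+"

# Japanese term, optional space, then an ASCII or full-width bracketed gloss.
PAIR = re.compile(
    "(" + JAPANESE + r")\s*[(（]([A-Za-z][A-Za-z \-]*)[)）]"
)

def extract_pairs(text):
    """Return (japanese, english) tuples for every bracketed gloss found."""
    return PAIR.findall(text)

fully = "「収穫逓減の法則(the law of diminishing return)」"
partly = "図1に，明瞭性 (Clarity)・新奇性 (Novelty)"

print(extract_pairs(fully))
print(extract_pairs(partly))
```

A pattern this simple will over-capture when the term is preceded by other Japanese text with no punctuation in between, so a real pipeline would presumably need filtering on top; that part is not described on the slide.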
  • 24. The ultimate goal
      - NLP is fairly wide in scope
      - We want to know everything about everything and how it fits together
          - The best source of knowledge we have is still text
          - Replace human bandwidth with machine bandwidth
          - Process, refine, reprocess
      - Need both technical and social approaches
          - Linguistic Analysis
          - Machine Learning
          - User Generated Content
      Mad Scientists of the World Unite
  • 25. Closing
      - There are many great open source NLP tools
          - the bleeding edge is mainly open source
      If you want to know more,
      or even better want to play with them,
      or best of all develop them
      ⇒ Say hello (especially PhD candidates): bond@ieee.org
      And now, the end is near
  • 26. Another Example of the Problem
      (9) Everyone gets a little of Cucumber’s ♥.
      - Lexical gaps: Cucumber (name)
      - Lexical gaps: ♥ (noun; we have it as a verb: I ♥ NY)
      - How to model ambiguity
          - Cucumber is deliberately ambiguous here
              - research shows rude jokes are funnier
              - can we model this?
      Topical Example
  • 27. Solutions
      - Morphological analysis should guess the POS
          - Based on two to three words of previous context and a large learned lexicon and model
          - This allows us to parse
          - Actually there are issues with ♥ (words are [a-z -]+)
      - Recognizing “Cucumber” as software: Cucumber is a tool that can execute . . .
      - Linking ♥ to love: ♥n → ♥v (v2n derivational rule)
      - Scaling is the problem
      Feel free to use these slides or extracts from them for any purpose at all, Francis Bond 2009-08-22.
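Two of the points on this slide are easy to see in miniature: the word pattern quoted on the slide rejects ♥, and a derivational rule can supply the missing noun reading from the verb entry. The sketch below is purely illustrative; the toy lexicon, the tag names and the function `apply_v2n` are hypothetical and do not reflect how any particular grammar implements such rules.

```python
import re

# The word pattern quoted on the slide.
WORD = re.compile(r"[a-z -]+")

def is_word(token):
    """True if the whole token matches the slide's word pattern."""
    return WORD.fullmatch(token) is not None

# Hypothetical toy lexicon: surface form -> set of part-of-speech tags.
lexicon = {"love": {"v", "n"}, "♥": {"v"}, "hotel": {"n"}}

def apply_v2n(entries):
    """Return a new lexicon where every verb entry also gets a noun reading."""
    return {
        word: (tags | {"n"}) if "v" in tags else set(tags)
        for word, tags in entries.items()
    }

derived = apply_v2n(lexicon)
print(is_word("love"), is_word("♥"))
print(derived["♥"])
```

Applying the rule to every verb over-generates badly, which is one small instance of the scaling problem the slide ends on.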