Python for NLP and the Natural
Language Toolkit
Introductions
 Anirudh K Menon
Software Engineer at IGate, working on the Big Data & Analytics team there.
I am a Computer Science engineer with experience in web development and the
big data space.
My e-mail : animenon@mail.com
 You?
Natural Language Processing
 Fundamental goal: deep understanding of broad language
 Not just string processing or keyword matching!
 End systems that we want to build:
 Ambitious: speech recognition, machine translation, question answering…
 Modest: spelling correction, text categorization…
Example: Machine Translation
NLP applications
 Text Categorization
 Spelling & Grammar Corrections
 Information Extraction
 Speech Recognition
 Information Retrieval
 Synonym Generation
 Summarization
 Machine Translation
 Question Answering
 Dialog Systems
 Language generation
Why NLP is difficult
 An NLP system needs to answer the question “who did what to whom”
 Language is ambiguous
 At all levels: lexical, phrase, semantic
 Iraqi Head Seeks Arms
 Word sense is ambiguous (head, arms)
 Stolen Painting Found by Tree
 Thematic role is ambiguous: tree is agent or location?
 Ban on Nude Dancing on Governor’s Desk
 Syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?
 Hospitals Are Sued by 7 Foot Doctors
 Semantics is ambiguous : what is 7 foot?
Why NLP is difficult
 Language is flexible
 New words, new meanings
 Different meanings in different contexts
 Language is subtle
 He arrived at the lecture
 He chuckled at the lecture
 He chuckled his way through the lecture
 **He arrived his way through the lecture
 Language is complex!
Corpus-based statistical approaches to
tackle NLP problem
 How can a machine understand these differences?
 Decorate the cake with the frosting
 Decorate the cake with the kids
 Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:
 The verb decorate requires an animate being as agent
 The object cake can be formed from inanimate entities (cream, dough,
frosting…)
 Such approaches have been shown to be time-consuming to build, to scale up
poorly, and to be very brittle to new, unusual, or metaphorical uses of language
 To swallow requires an animate being as agent/subject and a physical object as object, yet:
 I swallowed his story, or the actor swallowed his lines
 The supernova swallowed the planet
Corpus-based statistical approaches
to tackle NLP problem
 Feature extraction (usually linguistically motivated)
 Statistical models
 Data (corpora, labels, linguistic resources)
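The three ingredients above can be sketched with a toy example: a tiny labeled corpus (data), bag-of-words counts (features), and an add-one-smoothed Naive Bayes-style score (statistical model). The sentences, labels, and function names here are all hypothetical, invented for illustration; NLTK ships real classifiers in its classifier module.

```python
from collections import Counter

# Hypothetical labeled corpus: sentences tagged by topic.
data = [
    ("the cake has frosting and cream", "cooking"),
    ("decorate the cake with dough", "cooking"),
    ("the kids play in the garden", "family"),
    ("the kids laugh with the dog", "family"),
]

# Feature extraction: bag-of-words counts per label.
counts = {"cooking": Counter(), "family": Counter()}
for sentence, label in data:
    counts[label].update(sentence.split())

vocab = len({w for c in counts.values() for w in c})

def score(sentence, label):
    # Product of add-one-smoothed word probabilities under the label.
    total = sum(counts[label].values())
    p = 1.0
    for w in sentence.split():
        p *= (counts[label][w] + 1) / (total + vocab)
    return p

def classify(sentence):
    return max(counts, key=lambda label: score(sentence, label))

print(classify("frosting on the cake"))   # cooking
print(classify("the kids with the dog"))  # family
```

Unlike the hand-coded selectional rules, nothing here says that frosting is inanimate; the model simply learns from how the words are distributed in the data.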
Intro to NLTK
 The NLTK is a set of Python modules to carry out many common natural
language tasks.
 NLTK defines an infrastructure that can be used to build NLP programs in
Python.
 It provides basic classes for representing data relevant to natural language
processing.
 There are versions for Windows, OS X, Unix, and Linux; detailed instructions
are on the Installation tab
 Install the library (on any platform):
$ pip install --upgrade nltk
 Then download the data packages from a Python shell:
>>> import nltk
>>> nltk.download('all')
NLTK: Top-Level Organization
 NLTK is organized as a flat hierarchy of packages
and modules.
 Each module provides the tools necessary to
address a specific task
 Modules contain two types of classes:
 Data-oriented classes are used to represent information
relevant to natural language processing.
 Task-oriented classes encapsulate the resources and
methods needed to perform a specific task.
Modules
 The NLTK modules include:
 token: classes for representing and processing individual elements of
text, such as words and sentences
 probability: classes for representing and processing probabilistic
information.
 tree: classes for representing and processing hierarchical information
over text.
 cfg: classes for representing and processing context-free grammars.
 tagger: tagging each word with a part of speech, a sense, etc.
 parser: building trees over text (includes chart, chunk and probabilistic
parsers)
 classifier: classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
 draw: visualize NLP structures and processes
 corpus: access (tagged) corpus data
 We will cover some of these explicitly as we reach topics.
 Standard interfaces for performing tasks such as part-of-speech tagging,
syntactic parsing, and text classification.
 Standard implementations for each task can be combined to solve
complex problems.
Example
 The most basic natural language processing technique is tokenization.
 Tokenization means splitting the input into tokens.
Eg: Word Tokenization –
Input : “Hey there, How are you all?”
Output : “Hey”, “there,”, “How”, “are”, “you”, “all?”
The task of converting a text from a single string to a list of tokens is known as
tokenization.
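The output on this slide corresponds to naive whitespace splitting, which plain Python can do with no extra libraries (a minimal sketch; NLTK's word_tokenize, once the 'punkt' model is downloaded, would additionally split the punctuation into its own tokens):

```python
# Naive whitespace tokenization, matching the output above:
# punctuation stays attached to the words ("there,", "all?").
sentence = "Hey there, How are you all?"
tokens = sentence.split()
print(tokens)  # ['Hey', 'there,', 'How', 'are', 'you', 'all?']
```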
Tokens and Types
 The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
 For example, the sentence “my dog likes his dog”
contains five occurrences of words, but four vocabulary
items.
 To avoid confusion use more precise terminology:
1. Word token: an occurrence of a word
2. Word Type: a vocabulary item
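The token/type distinction is easy to check in plain Python (a minimal sketch; NLTK's FreqDist gives the same counts with per-word frequencies as well):

```python
sentence = "my dog likes his dog"
tokens = sentence.split()   # word tokens: individual occurrences
types = set(tokens)         # word types: distinct vocabulary items
print(len(tokens))  # 5
print(len(types))   # 4  ('my', 'dog', 'likes', 'his')
```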
Examples on python shell
 Tokenization
 Sentence Detection
 Common Usages, etc.
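As a taste of the shell session, sentence detection can be approximated with a regular expression (a crude sketch only; NLTK's sent_tokenize, backed by the 'punkt' model, handles abbreviations, quotes, and other edge cases far more robustly):

```python
import re

text = "NLTK is a toolkit. It helps with NLP! Does it work? Yes."
# Crude sentence detection: split after ., ! or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# ['NLTK is a toolkit.', 'It helps with NLP!', 'Does it work?', 'Yes.']
```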
References
1. CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
2. nltk.sourceforge.net/tutorial/introduction/index.html
3. Applied Natural Language Processing, Fall 2009, by Barbara Rosario
Thank you for listening!
Contact : animenon@mail.com
