Python for NLP and the Natural
Language Toolkit
Introductions
 Anirudh K Menon
Software Engineer at IGate, working on the Big Data & Analytics team there.
I am a Computer Science engineer with experience in web development and the
big data space.
My e-mail : animenon@mail.com
 You?
Natural Language Processing
 Fundamental goal: deep understanding of broad language
 Not just string processing or keyword matching!
 End systems that we want to build:
 Ambitious: speech recognition, machine translation, question answering…
 Modest: spelling correction, text categorization…
Example: Machine Translation
NLP applications
 Text Categorization
 Spelling & Grammar Corrections
 Information Extraction
 Speech Recognition
 Information Retrieval
 Synonym Generation
 Summarization
 Machine Translation
 Question Answering
 Dialog Systems
 Language generation
Why NLP is difficult
 An NLP system needs to answer the question “who did what to whom”
 Language is ambiguous
 At all levels: lexical, phrase, semantic
 Iraqi Head Seeks Arms
 Word sense is ambiguous (head, arms)
 Stolen Painting Found by Tree
 Thematic role is ambiguous: tree is agent or location?
 Ban on Nude Dancing on Governor’s Desk
 Syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?
 Hospitals Are Sued by 7 Foot Doctors
 Semantics is ambiguous : what is 7 foot?
Why NLP is difficult
 Language is flexible
 New words, new meanings
 Different meanings in different contexts
 Language is subtle
 He arrived at the lecture
 He chuckled at the lecture
 He chuckled his way through the lecture
 **He arrived his way through the lecture
 Language is complex!
Corpus-based statistical approaches to
tackle NLP problem
 How can a machine understand these differences?
 Decorate the cake with the frosting
 Decorate the cake with the kids
 Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:
 The verb decorate requires an animate being as agent
 The object cake can be formed from inanimate entities (cream, dough,
frosting…)
 Such approaches have been shown to be time-consuming to build, to scale up
poorly, and to be very brittle to new, unusual, or metaphorical uses of language
 To swallow requires an animate being as agent/subject and a physical object as object, yet:
 I swallowed his story, or the actor swallowed his lines
 The supernova swallowed the planet
Corpus-based statistical approaches
to tackle NLP problem
 Feature extraction (usually linguistically motivated)
 Statistical models
 Data (corpora, labels, linguistic resources)
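The three ingredients above can be sketched with a toy example: a tiny labeled corpus (data), bag-of-words counts (features), and an add-one-smoothed Naive Bayes-style score (statistical model). The sentences, labels, and function names here are all hypothetical, invented for illustration; NLTK ships real classifiers in its classifier module.

```python
from collections import Counter

# Hypothetical labeled corpus: sentences tagged by topic.
data = [
    ("the cake has frosting and cream", "cooking"),
    ("decorate the cake with dough", "cooking"),
    ("the kids play in the garden", "family"),
    ("the kids laugh with the dog", "family"),
]

# Feature extraction: bag-of-words counts per label.
counts = {"cooking": Counter(), "family": Counter()}
for sentence, label in data:
    counts[label].update(sentence.split())

vocab = len({w for c in counts.values() for w in c})

def score(sentence, label):
    # Product of add-one-smoothed word probabilities under the label.
    total = sum(counts[label].values())
    p = 1.0
    for w in sentence.split():
        p *= (counts[label][w] + 1) / (total + vocab)
    return p

def classify(sentence):
    return max(counts, key=lambda label: score(sentence, label))

print(classify("frosting on the cake"))   # cooking
print(classify("the kids with the dog"))  # family
```

Unlike the hand-coded selectional rules, nothing here says that frosting is inanimate; the model simply learns from how the words are distributed in the data.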
Intro to NLTK
 The NLTK is a set of Python modules to carry out many common natural
language tasks.
 NLTK defines an infrastructure that can be used to build NLP programs in
Python.
 It provides basic classes for representing data relevant to natural language
processing.
 There are versions for Windows, OS X, Unix, and Linux; detailed instructions
are on the Installation tab
 Install the library (on any platform):
$ pip install --upgrade nltk
 Then download the data packages from a Python shell:
>>> import nltk
>>> nltk.download('all')
NLTK: Top-Level Organization
 NLTK is organized as a flat hierarchy of packages
and modules.
 Each module provides the tools necessary to
address a specific task
 Modules contain two types of classes:
 Data-oriented classes are used to represent information
relevant to natural language processing.
 Task-oriented classes encapsulate the resources and
methods needed to perform a specific task.
Modules
 The NLTK modules include:
 token: classes for representing and processing individual elements of
text, such as words and sentences
 probability: classes for representing and processing probabilistic
information.
 tree: classes for representing and processing hierarchical information
over text.
 cfg: classes for representing and processing context-free grammars.
 tagger: tagging each word with a part of speech, a sense, etc.
 parser: building trees over text (includes chart, chunk and probabilistic
parsers)
 classifier: classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
 draw: visualize NLP structures and processes
 corpus: access (tagged) corpus data
 We will cover some of these explicitly as we reach topics.
 Standard interfaces for performing tasks such as part-of-speech tagging,
syntactic parsing, and text classification.
 Standard implementations for each task can be combined to solve
complex problems.
Example
 The most basic natural language processing technique is tokenization.
 Tokenization means splitting the input into tokens.
Eg: Word Tokenization –
Input : “Hey there, How are you all?”
Output : “Hey”, “there,”, “How”, “are”, “you”, “all?”
The task of converting a text from a single string to a list of tokens is known as
tokenization.
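The output on this slide corresponds to naive whitespace splitting, which plain Python can do with no extra libraries (a minimal sketch; NLTK's word_tokenize, once the 'punkt' model is downloaded, would additionally split the punctuation into its own tokens):

```python
# Naive whitespace tokenization, matching the output above:
# punctuation stays attached to the words ("there,", "all?").
sentence = "Hey there, How are you all?"
tokens = sentence.split()
print(tokens)  # ['Hey', 'there,', 'How', 'are', 'you', 'all?']
```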
Tokens and Types
 The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
 For example, the sentence “my dog likes his dog”
contains five occurrences of words, but four vocabulary
items.
 To avoid confusion use more precise terminology:
1. Word token: an occurrence of a word
2. Word Type: a vocabulary item
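The token/type distinction is easy to check in plain Python (a minimal sketch; NLTK's FreqDist gives the same counts with per-word frequencies as well):

```python
sentence = "my dog likes his dog"
tokens = sentence.split()   # word tokens: individual occurrences
types = set(tokens)         # word types: distinct vocabulary items
print(len(tokens))  # 5
print(len(types))   # 4  ('my', 'dog', 'likes', 'his')
```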
Examples on python shell
 Tokenization
 Sentence Detection
 Common Usages, etc.
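As a taste of the shell session, sentence detection can be approximated with a regular expression (a crude sketch only; NLTK's sent_tokenize, backed by the 'punkt' model, handles abbreviations, quotes, and other edge cases far more robustly):

```python
import re

text = "NLTK is a toolkit. It helps with NLP! Does it work? Yes."
# Crude sentence detection: split after ., ! or ? followed by whitespace.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# ['NLTK is a toolkit.', 'It helps with NLP!', 'Does it work?', 'Yes.']
```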
References
1. CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
2. nltk.sourceforge.net/tutorial/introduction/index.html
3. Applied Natural Language Processing, Fall 2009, by Barbara Rosario
Thank you for listening!
Contact : animenon@mail.com
