2. Introductions
Anirudh K Menon
Software Engineer at IGate, part of the Big Data & Analytics team there.
I am a Computer Science engineer with experience in web development and the big data space.
My e-mail: animenon@mail.com
You?
3. Natural Language Processing
Fundamental goal: deep understanding of broad language
Not just string processing or keyword matching!
End systems that we want to build:
Ambitious: speech recognition, machine translation, question answering…
Modest: spelling correction, text categorization…
5. NLP applications
Text Categorization
Spelling & Grammar Corrections
Information Extraction
Speech Recognition
Information Retrieval
Synonym Generation
Summarization
Machine Translation
Question Answering
Dialog Systems
Language generation
6. Why NLP is difficult
An NLP system needs to answer the question “who did what to whom?”
Language is ambiguous
At all levels: lexical, phrase, semantic
Iraqi Head Seeks Arms
Word sense is ambiguous (head, arms)
Stolen Painting Found by Tree
Thematic role is ambiguous: tree is agent or location?
Ban on Nude Dancing on Governor’s Desk
Syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?
Hospitals Are Sued by 7 Foot Doctors
Semantics is ambiguous: are these doctors seven feet tall, or seven foot doctors (podiatrists)?
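Lexical ambiguity, as in the first headline, can be sketched with a toy sense inventory (the senses below are invented for illustration, not taken from any real lexicon):

```python
# Toy sense inventory: one surface form maps to several senses.
SENSES = {
    "head": ["leader of a state or group", "body part above the neck"],
    "arms": ["weapons", "upper limbs"],
}

def is_ambiguous(word):
    """A word is lexically ambiguous if it has more than one listed sense."""
    return len(SENSES.get(word, [])) > 1

headline = "Iraqi Head Seeks Arms"
ambiguous = [w for w in headline.lower().split() if is_ambiguous(w)]
# 'head' and 'arms' each carry two senses, so the headline already has
# 2 * 2 = 4 literal readings before any syntax is considered.
```

With two two-way ambiguous words, the readings multiply — which is exactly why ambiguity compounds across a sentence.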
7. Why NLP is difficult
Language is flexible
New words, new meanings
Different meanings in different contexts
Language is subtle
He arrived at the lecture
He chuckled at the lecture
He chuckled his way through the lecture
*He arrived his way through the lecture (the * marks it as ungrammatical)
Language is complex!
8. Corpus-based statistical approaches to tackle NLP problems
How can a machine understand these differences?
Decorate the cake with the frosting
Decorate the cake with the kids
Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:
The verb decorate requires an animate being as agent
The object cake can be formed from inanimate entities (cream, dough, frosting…)
Such approaches have been shown to be time-consuming to build, to scale up poorly, and to be very brittle in the face of new, unusual, or metaphorical uses of language
To swallow requires an animate being as agent/subject and a physical object as object
But: “I swallowed his story”, “The actor swallowed his lines”,
“The supernova swallowed the planet”
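A minimal sketch of such a hand-coded selectional restriction shows the brittleness directly (the lexicon and categories below are invented for illustration):

```python
# Hand-coded "world knowledge" lexicons -- invented for this sketch.
ANIMATE = {"kids", "actor", "i", "he"}
PHYSICAL = {"cake", "frosting", "planet", "lines"}

def swallow_ok(agent, obj):
    """Rule: 'swallow' requires an animate agent and a physical object."""
    return agent in ANIMATE and obj in PHYSICAL

print(swallow_ok("actor", "cake"))        # literal use passes
print(swallow_ok("supernova", "planet"))  # metaphorical use is rejected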
9. Corpus-based statistical approaches to tackle NLP problems
Feature extraction (usually linguistically motivated)
Statistical models
Data (corpora, labels, linguistic resources)
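The statistical alternative can be sketched in miniature: instead of hand-coded rules, learn a preference from counts over labeled data (the tiny training set and role labels below are invented for illustration):

```python
from collections import Counter

# Invented labeled examples: (verb, preposition, noun, role of the PP).
training = [
    ("decorate", "with", "frosting", "INSTRUMENT"),
    ("decorate", "with", "frosting", "INSTRUMENT"),
    ("decorate", "with", "icing", "INSTRUMENT"),
    ("decorate", "with", "kids", "ACCOMPANIMENT"),
]

# Count how often each noun was seen with each role.
counts = Counter((noun, role) for _, _, noun, role in training)

def likely_role(noun):
    """Pick the role most often observed with this noun in training data."""
    candidates = [(c, role) for (n, role), c in counts.items() if n == noun]
    return max(candidates)[1] if candidates else "UNKNOWN"

print(likely_role("frosting"))  # INSTRUMENT
print(likely_role("kids"))      # ACCOMPANIMENT
```

The model gracefully degrades on unseen nouns instead of breaking, and improves as more data is added — the key advantages over brittle hand-coded constraints.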
10. Intro to NLTK
The NLTK is a set of Python modules to carry out many common natural
language tasks.
NLTK defines an infrastructure that can be used to build NLP programs in
Python.
It provides basic classes for representing data relevant to natural language
processing.
There are versions for Windows, OS X, Unix, Linux. Detailed instructions
are on the Installation tab.
Install (all platforms):
$ pip install --upgrade nltk
Then download the corpora and data packages from a Python shell:
>>> import nltk
>>> nltk.download('all')
11. NLTK: Top-Level Organization
NLTK is organized as a flat hierarchy of packages
and modules.
Each module provides the tools necessary to
address a specific task
Modules contain two types of classes:
Data-oriented classes are used to represent information
relevant to natural language processing.
Task-oriented classes encapsulate the resources and
methods needed to perform a specific task.
12. Modules
The NLTK modules include:
token: classes for representing and processing individual elements of
text, such as words and sentences
probability: classes for representing and processing probabilistic
information.
tree: classes for representing and processing hierarchical information
over text.
cfg: classes for representing and processing context free grammars.
tagger: tagging each word with a part-of-speech, a sense, etc
parser: building trees over text (includes chart, chunk and probabilistic
parsers)
classifier: classify text into categories (includes feature,
featureSelection, maxent, naivebayes)
draw: visualize NLP structures and processes
corpus: access (tagged) corpus data
We will cover some of these explicitly as we reach topics.
13. Standard interfaces for performing tasks such as part-of-speech tagging,
syntactic parsing, and text classification.
Standard implementations for each task can be combined to solve
complex problems.
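The idea of composing standard task interfaces can be sketched in plain Python (this is a toy pipeline, not NLTK’s actual classes; the lexicon is invented): each stage consumes the previous stage’s output.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens, lexicon):
    """Toy tagger: look each word up; a real tagger would use context."""
    return [(t, lexicon.get(t.lower(), "NN")) for t in tokens]

def extract_nouns(tagged):
    """Toy 'classifier' stage: keep only the nouns."""
    return [w for w, pos in tagged if pos == "NN"]

# Invented mini-lexicon for the demo sentence.
lexicon = {"my": "PRP$", "dog": "NN", "likes": "VBZ", "his": "PRP$"}

tagged = tag(tokenize("my dog likes his dog"), lexicon)
print(extract_nouns(tagged))  # ['dog', 'dog']
```

Because each stage exposes a simple input/output contract, any implementation of a stage (rule-based, statistical, neural) can be swapped in without touching the others.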
14. Example
The most basic natural language processing technique is tokenization.
Tokenization means splitting the input into tokens.
E.g. word tokenization –
Input: “Hey there, How are you all?”
Output: “Hey”, “there,”, “How”, “are”, “you”, “all?” (simple whitespace splitting; a full word tokenizer would also separate the punctuation)
The task of converting a text from a single string to a list of tokens is known as
tokenization.
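Two simple tokenizers illustrate the idea (a plain-Python sketch; NLTK’s own word_tokenize behaves differently and needs downloaded data). Whitespace splitting reproduces the slide’s output; a regex tokenizer also separates the punctuation:

```python
import re

text = "Hey there, How are you all?"

# Whitespace splitting: punctuation stays attached to words.
whitespace_tokens = text.split()
# ['Hey', 'there,', 'How', 'are', 'you', 'all?']

# Regex tokenization: words and punctuation become separate tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['Hey', 'there', ',', 'How', 'are', 'you', 'all', '?']
```

Which behavior is “right” depends on the downstream task — taggers and parsers usually want punctuation as separate tokens.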
15. Tokens and Types
The term word can be used in two different ways:
1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item
For example, the sentence “my dog likes his dog”
contains five occurrences of words, but four vocabulary
items.
To avoid confusion use more precise terminology:
1. Word token: an occurrence of a word
2. Word Type: a vocabulary item
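The distinction is easy to compute for the slide’s example sentence:

```python
sentence = "my dog likes his dog"

tokens = sentence.split()  # every occurrence of a word
types = set(tokens)        # distinct vocabulary items

print(len(tokens))  # 5 word tokens
print(len(types))   # 4 word types ('dog' occurs twice)
```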
16. Examples on python shell
Tokenization
Sentence Detection
Common Usages, etc.
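Sentence detection can be sketched with a rough regex splitter (a toy; NLTK’s trained punkt tokenizer handles abbreviations like “Dr.” far better than this):

```python
import re

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

text = "NLTK is a Python toolkit. It ships many corpora! Shall we try it?"
print(split_sentences(text))
# ['NLTK is a Python toolkit.', 'It ships many corpora!', 'Shall we try it?']
```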
17. References
1. CS1573: AI Application Development, Spring 2003
(modified from Edward Loper’s notes)
2. nltk.sourceforge.net/tutorial/introduction/index.html
3. Applied Natural Language Processing, Fall 2009, by Barbara Rosario
18. Thank you for listening!
Contact : animenon@mail.com