NLTK
Mohammed Shokr
16-Mar-16
Natural Language Toolkit (NLTK)
■ A collection of Python programs, modules, data sets, and tutorials to
support research and development in Natural Language Processing
(NLP)
■ Written by Steven Bird, Edward Loper, and Ewan Klein
■ NLTK is
– Free and Open source
– Easy to use
– Modular
– Well documented
– Simple and extensible
Installation of NLTK
■ NLTK requires Python 2.7 or 3.2+
Python REPL
• Read-Eval-Print Loop
• REPL: a procedure that simply loops, accepting one command at a
time, executing it, and printing the result.
• GUI or CLI
Installation of NLTK
1. Start a Command Prompt as an Administrator (Windows users)
1. Click Start.
2. In the Start Search box, type cmd, and then press CTRL+SHIFT+ENTER.
3. If the User Account Control dialog box appears, confirm that the action it
displays is what you want, and then click Continue.
2. Switch from user to superuser (Linux users)
– sudo su
Installation of NLTK
Install NLTK: run
pip install nltk
Test installation
run
python
then type
import nltk
Installing NLTK Data
■ NLTK comes with many corpora, toy grammars, trained models, etc. A complete
list is posted at: http://nltk.org/nltk_data/
■ Run the Python REPL and type the commands:
■ A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central
installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix). Next, select the
packages or collections you want to download.
>>> import nltk
>>> nltk.download()
NLTK Downloader
Getting Started with NLTK
NLTK Text
Text Pre-processing
Sentence splitter
from nltk.tokenize import sent_tokenize
input_string = 'Hello Every One. How Are you ? life is not easy.'
all_sent = sent_tokenize(input_string)
print (all_sent)
Tokenization
sent = "Hi Everyone ! How do you do ?"
# split() built-in string method
print (sent.split())
# word_tokenize
from nltk.tokenize import word_tokenize
print (word_tokenize(sent))
Stemming
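Stemming can be illustrated with NLTK's PorterStemmer, which strips suffixes by rule; a minimal sketch (no extra nltk_data is needed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'flies', 'happily', 'studies']
# Porter stemming applies suffix-stripping rules, so the output
# need not be a dictionary word (e.g. 'studies' -> 'studi').
print([stemmer.stem(w) for w in words])
```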
Lemmatization
Morphology
Edit-Distance
We can create a very basic spellchecker
by just using a dictionary lookup.
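A sketch of that idea with nltk.edit_distance and a toy word list; the `correct` helper is hypothetical, and a real checker would load a full dictionary such as nltk.corpus.words:

```python
from nltk.metrics import edit_distance

# Toy dictionary; 'correct' is an illustrative helper, not an NLTK API.
dictionary = ['apple', 'bake', 'bird', 'cat', 'happy']

def correct(word, max_dist=2):
    """Return the closest dictionary word within max_dist edits."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

print(correct('aple'))  # -> 'apple' (one insertion away)
print(correct('brid'))  # -> 'bird' (two substitutions away)
```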
Part of Speech Tagging
Part of Speech
Tagging
Penn Treebank Part-of-Speech Tags
Part of Speech Tagging
■ Stanford tagger
■ N-gram tagger
■ Regex tagger
■ Brill tagger
■ Machine learning based tagger
■ NER tagger
– Named Entity Recognition (NER)
– NLTK provides the ne_chunk() method
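Of the taggers listed above, the regex tagger is the easiest to sketch, since it needs no trained model or downloaded data (the patterns below are illustrative, not a complete tagset):

```python
from nltk.tag import RegexpTagger

# Hand-written patterns over Penn Treebank tags, tried in order;
# the final '.*' rule is a catch-all default.
patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # default: singular noun
]
tagger = RegexpTagger(patterns)
print(tagger.tag(['The', 'cats', 'were', 'running', 'around']))
```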
NER tagger
– Tokens
– Tagged
– Entities
Parsing Structure in Text
Shallow vs. deep parsing
■ In deep (full) parsing, grammar formalisms such as CFG or probabilistic
context-free grammar (PCFG), together with a search strategy, are used to give a
complete syntactic structure to a sentence.
■ Shallow parsing is the task of parsing a limited part of the syntactic information
from the given text.
The two approaches in parsing
■ The rule-based approach
– Based on manually coded rules/grammar (CFG, and so on)
– Top-down
– Includes CFG and regex-based parsers
■ The probabilistic approach
– Learns rules/grammar by using probabilistic models
– Uses observed probabilities of linguistic features
– Bottom-up
– Includes PCFG and the Stanford parser
context-free grammar (CFG)
■ Generating sentences from context-free grammars:
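A sketch with a toy grammar, using nltk.parse.generate to enumerate the sentences the grammar derives (no downloaded data needed):

```python
from nltk import CFG
from nltk.parse.generate import generate

# A tiny toy grammar; generate() walks the productions and
# yields each derivable sentence as a list of tokens.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased'
""")
for sentence in generate(grammar, n=4):
    print(' '.join(sentence))
```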
Different types of parsers
■ Recursive descent parser
■ Shift-reduce parser
■ Chart parser
■ Regex parser
■ Dependency parsing
Different types of parsers
Recursive descent parser One of the most straightforward forms of parsing is recursive
descent parsing. This is a top-down process in which the parser
attempts to verify that the syntax of the input stream is correct,
as it is read from left to right.
Shift-reduce parser The shift-reduce parser is a simple kind of bottom-up parser.
Chart parser A chart parser applies the algorithm design technique of
dynamic programming to the parsing problem.
Regex parser A regex parser uses a regular expression defined in the form of
grammar on top of a POS-tagged string.
Dependency parsing Dependency parsing (DP) is a modern parsing mechanism. The
main concept of DP is that each linguistic unit (word) is
connected to the others by directed links.
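As a sketch, the recursive descent parser from the table above can be run over a toy grammar (no downloaded data needed):

```python
import nltk

# Toy grammar; RecursiveDescentParser expands it top-down,
# left to right, to find parse trees for the input tokens.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'saw'
""")
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse('the dog saw the cat'.split()):
    print(tree)
```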
Chunking
■ Chunking is shallow parsing: instead of reaching for the deep structure of
the sentence, we try to group chunks of the sentence that carry some
meaning.
■ For example, in the sentence "the President speaks about the health care reforms",
"the President" and "the health care reforms" are noun-phrase chunks.
So, let's write some code snippets to do some basic
chunking:
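A minimal sketch with RegexpParser, using the deck's example sentence tagged by hand so no tagger model is needed:

```python
import nltk

# The example sentence, POS-tagged by hand.
tagged = [('the', 'DT'), ('President', 'NNP'), ('speaks', 'VBZ'),
          ('about', 'IN'), ('the', 'DT'), ('health', 'NN'),
          ('care', 'NN'), ('reforms', 'NNS')]

# NP chunk = optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')
tree = chunker.parse(tagged)
print(tree)  # two NP chunks: "the President", "the health care reforms"
```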
Display a parse tree
# import treebank corpus (requires the 'treebank' package from nltk_data)
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()
Relation Extraction
Output
Resources
■ NLTK 3.0 documentation
– http://www.nltk.org/
■ NLTK Essentials
– https://www.packtpub.com/big-data-and-business-intelligence/nltk-essentials
■ nltk_tutorial_repo [Code]
– https://git.io/vaRIR
Thanks.
Questions? Send me an email! (I love talking!)
Mohammed Shokr
mohammedshokr2014@gmail.com
@MShokr1

Editor's Notes

• Shallow vs. deep parsing: While deep parsing is required for more complex NLP applications, such as dialogue systems and summarization, shallow parsing is better suited for information extraction and text mining applications.