NLTK
Mohammed Shokr
16-Mar-16
Natural Language Toolkit (NLTK)
■ A collection of Python programs, modules, data sets, and tutorials to
support research and development in Natural Language Processing
(NLP)
■ Written by Steven Bird, Edward Loper, and Ewan Klein
■ NLTK is
– Free and Open source
– Easy to use
– Modular
– Well documented
– Simple and extensible
Installation of NLTK
■ NLTK requires Python 2.7 or 3.2+
Python REPL
• Read-Eval-Print Loop
• REPL: a procedure that simply loops, accepting one command at a
time, executing it, and printing the result.
• GUI or CLI
Installation of NLTK
1. Start a Command Prompt as an Administrator (Windows users)
1. Click Start.
2. In the Start Search box, type cmd, and then press CTRL+SHIFT+ENTER.
3. If the User Account Control dialog box appears, confirm that the action it
displays is what you want, and then click Continue.
2. Switch from user to superuser (Linux users)
– sudo su
Installation of NLTK
Install NLTK: run
pip install nltk
Test installation
run
python
then type
import nltk
Installing NLTK Data
■ NLTK comes with many corpora, toy grammars, trained models, etc. A complete
list is posted at: http://nltk.org/nltk_data/
■ Run the Python REPL and type the commands:
■ A new window should open, showing the NLTK Downloader. Click on the File menu and select Change Download Directory. For central
installation, set this to C:\nltk_data (Windows), /usr/local/share/nltk_data (Mac), or /usr/share/nltk_data (Unix). Next, select the
packages or collections you want to download.
>>> import nltk
>>> nltk.download()
NLTK Downloader
Getting Started with NLTK
NLTK Text
Text Pre-processing
Sentence splitter
from nltk.tokenize import sent_tokenize
input_string = 'Hello Every One. How Are you ? life is not easy.'
all_sent = sent_tokenize(input_string)
print (all_sent)
Tokenization
sent = "Hi Everyone ! How do you do ?"
# split() built-in string method
print (sent.split())
# word_tokenize
from nltk.tokenize import word_tokenize
print (word_tokenize(sent))
Stemming
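Stemming can be illustrated with NLTK's PorterStemmer, which strips suffixes by rule; a minimal sketch (no extra nltk_data is needed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'flies', 'happily', 'studies']
# Porter stemming applies suffix-stripping rules, so the output
# need not be a dictionary word (e.g. 'studies' -> 'studi').
print([stemmer.stem(w) for w in words])
```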
Lemmatization
Morphology
Edit-Distance
We can create a very basic spellchecker
by just using a dictionary lookup.
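A sketch of that idea with nltk.edit_distance and a toy word list; the `correct` helper is hypothetical, and a real checker would load a full dictionary such as nltk.corpus.words:

```python
from nltk.metrics import edit_distance

# Toy dictionary; 'correct' is an illustrative helper, not an NLTK API.
dictionary = ['apple', 'bake', 'bird', 'cat', 'happy']

def correct(word, max_dist=2):
    """Return the closest dictionary word within max_dist edits."""
    best = min(dictionary, key=lambda w: edit_distance(word, w))
    return best if edit_distance(word, best) <= max_dist else word

print(correct('aple'))  # -> 'apple' (one insertion away)
print(correct('brid'))  # -> 'bird' (two substitutions away)
```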
Part of Speech Tagging
Part of Speech
Tagging
Penn Treebank Part-of-Speech Tags
Part of Speech Tagging
■ Stanford tagger
■ N-gram tagger
■ Regex tagger
■ Brill tagger
■ Machine learning based tagger
■ NER tagger
– Named Entity Recognition (NER)
– NLTK provides the ne_chunk() method
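Of the taggers listed above, the regex tagger is the easiest to sketch, since it needs no trained model or downloaded data (the patterns below are illustrative, not a complete tagset):

```python
from nltk.tag import RegexpTagger

# Hand-written patterns over Penn Treebank tags, tried in order;
# the final '.*' rule is a catch-all default.
patterns = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # default: singular noun
]
tagger = RegexpTagger(patterns)
print(tagger.tag(['The', 'cats', 'were', 'running', 'around']))
```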
NER tagger
– Tokens
– Tagged
– Entities
Parsing Structure in Text
Shallow vs. deep parsing
■ In deep (full) parsing, grammar formalisms such as CFG or probabilistic
context-free grammar (PCFG), together with a search strategy, are used to give a
complete syntactic structure to a sentence.
■ Shallow parsing is the task of parsing a limited part of the syntactic information
from the given text.
The two approaches in parsing
■ The rule-based approach
– Based on manually coded rules/grammar (CFG, and so on)
– Top-down
– Includes CFG and regex-based parsers
■ The probabilistic approach
– Learns rules/grammar by using probabilistic models
– Uses observed probabilities of linguistic features
– Bottom-up
– Includes PCFG and the Stanford parser
context-free grammar (CFG)
■ Generating sentences from context-free grammars:
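A sketch with a toy grammar, using nltk.parse.generate to enumerate the sentences the grammar derives (no downloaded data needed):

```python
from nltk import CFG
from nltk.parse.generate import generate

# A tiny toy grammar; generate() walks the productions and
# yields each derivable sentence as a list of tokens.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased'
""")
for sentence in generate(grammar, n=4):
    print(' '.join(sentence))
```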
Different types of parsers
■ Recursive descent parser
■ Shift-reduce parser
■ Chart parser
■ Regex parser
■ Dependency parsing
Different types of parsers
Recursive descent parser One of the most straightforward forms of parsing is recursive
descent parsing. This is a top-down process in which the parser
attempts to verify that the syntax of the input stream is correct,
as it is read from left to right.
Shift-reduce parser The shift-reduce parser is a simple kind of bottom-up parser.
Chart parser A chart parser applies the algorithm design technique of
dynamic programming to the parsing problem.
Regex parser A regex parser uses a regular expression defined in the form of
grammar on top of a POS-tagged string.
Dependency parsing Dependency parsing (DP) is a modern parsing mechanism. The
main concept of DP is that each linguistic unit (word) is
connected to the others by directed links.
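As a sketch, the recursive descent parser from the table above can be run over a toy grammar (no downloaded data needed):

```python
import nltk

# Toy grammar; RecursiveDescentParser expands it top-down,
# left to right, to find parse trees for the input tokens.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'saw'
""")
parser = nltk.RecursiveDescentParser(grammar)
for tree in parser.parse('the dog saw the cat'.split()):
    print(tree)
```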
Chunking
■ Chunking is shallow parsing: instead of reaching for the deep structure of
the sentence, we try to group chunks of the sentence that carry some
meaning.
■ For example, in the sentence "the President speaks about the health care reforms",
"the President" and "the health care reforms" are noun-phrase chunks.
So, let's write some code snippets to do some basic
chunking:
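A minimal sketch with RegexpParser, using the deck's example sentence tagged by hand so no tagger model is needed:

```python
import nltk

# The example sentence, POS-tagged by hand.
tagged = [('the', 'DT'), ('President', 'NNP'), ('speaks', 'VBZ'),
          ('about', 'IN'), ('the', 'DT'), ('health', 'NN'),
          ('care', 'NN'), ('reforms', 'NNS')]

# NP chunk = optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser('NP: {<DT>?<JJ>*<NN.*>+}')
tree = chunker.parse(tagged)
print(tree)  # two NP chunks: "the President", "the health care reforms"
```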
Display a parse tree
# import treebank corpus (requires the 'treebank' package from nltk_data)
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
t.draw()
Relation Extraction
Output
Resources
■ NLTK 3.0 documentation
– http://www.nltk.org/
■ NLTK Essentials
– https://www.packtpub.com/big-data-and-business-intelligence/nltk-essentials
■ nltk_tutorial_repo [Code]
– https://git.io/vaRIR
Thanks.
Questions? Send me an email! (I love talking!)
Mohammed Shokr
mohammedshokr2014@gmail.com
@MShokr1

Editor's Notes

• Shallow vs. deep parsing: While deep parsing is required for more complex NLP applications, such as dialogue systems and summarization, shallow parsing is better suited for information extraction and text mining applications.