In this presentation, delivered at the AI & ML meetup on 2nd Feb, Sangram Mishra develops the same NLP solution using both NLTK and OpenNLP, then compares and contrasts the two open-source technologies for deeper insight into choosing and using them for real-world projects.
4. Natural language processing (NLP) is a field of
computer science and artificial intelligence concerned
with the interactions between computers and human
(natural) languages, and, in particular, with
programming computers to fruitfully process large
amounts of natural language data.
What???
5. In 1950, Alan Turing published an article titled
"Computing Machinery and Intelligence", which
proposed what is now called the Turing test as a
criterion of intelligence.
ELIZA (1964)
ELIZA might provide a generic response, for example,
responding to "My head hurts" with "Why do you say
your head hurts?".
When???
7. Where?
Machine Translation
Fighting Spam
Mail Inbox or Spam
Information Extraction
Social media monitoring
Summarization
Question Answering
8. Tasks in OpenNLP
The Apache OpenNLP library is a machine
learning based toolkit for the processing of natural
language text.
It supports the most common NLP tasks, such
as language detection, tokenization, sentence
segmentation, part-of-speech tagging, named entity
extraction, chunking, parsing and coreference
resolution.
9. Tasks in OpenNLP
Text data arrives as unstructured text, in the form
of comments, reviews or articles.
Meaningful information is extracted from it with the
help of a set of tasks.
Tokenizing:
Take a large piece of text and break it into smaller components:
sentences or individual words.
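As a toy illustration of the idea (a naive sketch, not the NLTK implementation shown later in the deck), a tokenizer can split on sentence-ending punctuation and then on word boundaries:

```python
import re

text = "Mary had a little lamb. Her fleece was as white as snow."

# Naive sentence split: break after '.', '!' or '?' followed by a space.
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Naive word split: pull out runs of letters, dropping punctuation.
words = [re.findall(r"[A-Za-z']+", s) for s in sentences]

print(sentences)
print(words)
```

Real tokenizers handle abbreviations, decimals and quotes, which is why NLTK and OpenNLP use trained models rather than regular expressions.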
10. Stop words removal
Once tokenized,
the next step is stop-word removal, i.e. separating
the words that carry specific meaning from the words
that merely add structure to the sentence.
Eg:
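A minimal sketch of the idea, using a small hand-picked stop-word list (NLTK ships a much fuller list in nltk.corpus.stopwords):

```python
# Toy stop-word list; nltk.corpus.stopwords provides a far larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "as", "my", "i", "it"}

tokens = ["my", "head", "hurts", "and", "it", "is", "cold"]

# Keep only the words that carry specific meaning.
content_words = [t for t in tokens if t not in STOP_WORDS]

print(content_words)
```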
11. N-Grams
Once stop words are removed,
look for commonly occurring sequences of words, because these
will often be the most informative parts of the text.
https://en.wikipedia.org/wiki/N-gram#Examples
An n-gram is a run of n consecutive words; a run of two
consecutive words is called a bigram.
Eg:
Code(s) Description
M79.661 Pain in right lower leg
M79.662 Pain in left lower leg
M79.669 Pain in unspecified lower leg
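Extracting n-grams needs no library at all; a minimal pure-Python sketch, using tokens from the code-description example above:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["pain", "in", "right", "lower", "leg"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

NLTK offers the same operation as nltk.util.ngrams; counting which bigrams recur across a corpus surfaces the most informative phrases.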
12. Word Sense Disambiguation
Eg:
I am taking aspirin for my cold
Let's go inside, I'm cold
It's cold today, only 2 degrees
It identifies the meaning of a word based on the
context in which it is used.
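One classic approach can be sketched as a simplified Lesk-style overlap: score each sense by how many words its dictionary gloss shares with the sentence. The mini-glosses below are made up for illustration; NLTK's actual implementation (nltk.wsd.lesk) uses WordNet glosses.

```python
# Hypothetical mini-glosses for two senses of "cold".
SENSES = {
    "illness":     {"aspirin", "medicine", "fever", "taking"},
    "temperature": {"degrees", "weather", "today", "inside"},
}

def disambiguate(context_words):
    # Pick the sense whose gloss overlaps the context the most.
    context = set(context_words)
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

print(disambiguate(["i", "am", "taking", "aspirin", "for", "my", "cold"]))
print(disambiguate(["it", "is", "cold", "today", "only", "2", "degrees"]))
```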
13. Parts-of-Speech Tagging
It can occur either as part of WSD or as an independent
task.
It helps in identifying parts of speech, whether Noun,
Verb, Adjective, etc.
Stemming
Converting a word to its base form.
Eg: Close, Closed, Closely, Closer
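A deliberately crude suffix-stripping sketch, tuned only to the example above; real stemmers such as NLTK's PorterStemmer apply far more careful rules:

```python
def crude_stem(word):
    # Strip a few suffixes chosen to fit the Close/Closed/Closely/Closer example.
    # The length check avoids mangling very short words.
    for suffix in ("ly", "r", "d"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("Close", "Closed", "Closely", "Closer"):
    print(w, "->", crude_stem(w.lower()))
```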
14. Python NLTK and OpenNLP
NLTK is one of the leading platforms for working with
human language data in Python; the NLTK module is
used for natural language processing.
NLTK is an acronym for Natural Language Toolkit.
The Apache OpenNLP library is a machine learning based
toolkit for the processing of natural language text.
It supports the most common NLP tasks, such as
tokenization, sentence segmentation, part-of-speech
tagging, named entity extraction, chunking, parsing, and
coreference resolution.
15. Python NLTK
Step 1: Collect all the individual sentences in an article
into a list.
Tokenization with NLTK: from nltk.tokenize we can
import the functions sent_tokenize (break down into
sentences) and word_tokenize (break down into words).
Import stopwords from the nltk.corpus module and
punctuation from the string module.
Note: a sentence ends with a period symbol (.) followed
by a space.
16. Frequency distribution
Construct a frequency distribution : words and no of
times each word occurs
Functions Defined for NLTK's Frequency Distributions
Example Description
fdist = FreqDist(samples) create a frequency distribution containing the given samples
fdist[sample] += 1 increment the count for this sample
fdist['monstrous'] count of the number of times a given sample occurred
fdist.freq('monstrous') frequency of a given sample
fdist.N() total number of samples
fdist.most_common(n) the n most common samples and their frequencies
for sample in fdist: iterate over the samples
fdist.max() sample with the greatest count
fdist.tabulate() tabulate the frequency distribution
fdist.plot() graphical plot of the frequency distribution
fdist.plot(cumulative=True) cumulative plot of the frequency distribution
fdist1 |= fdist2 update fdist1 with counts from fdist2
fdist1 < fdist2 test if samples in fdist1 occur less frequently than in fdist2
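NLTK's FreqDist is essentially a counting dictionary; its core behaviour in the table above can be mimicked with collections.Counter from the standard library:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

fdist = Counter(words)          # like FreqDist(samples)
fdist["cat"] += 1               # increment the count for this sample

print(fdist["the"])             # count of a given sample
print(fdist.most_common(2))     # the n most common samples and their counts
print(sum(fdist.values()))      # total number of samples, like fdist.N()
```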
17. Tokenizing: Sentence Detection
Python Usage
Step 1: Import NLTK
Step 2:
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Mary had a little lamb. Her fleece was as white as snow"
sents = sent_tokenize(text)
print(sents)
Java Usage
Step 1: Load the sentence detector model, e.g. en-sent.bin.
Step 2: For example:
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
  SentenceModel model = new SentenceModel(modelIn);
  SentenceDetectorME detector = new SentenceDetectorME(model);
  String sents[] = detector.sentDetect(text);
}
18. OpenNLP syntax
OpenNLP components have similar APIs. Normally, to
execute a task, one should provide a model and an
input.
A model is usually loaded by passing a FileInputStream
for the model file to the constructor of the model class:

try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
  SomeModel model = new SomeModel(modelIn);
}
20. Breaking into Word
Python
words=[word_tokenize(sent) for sent in sents]
print(words)
Java
InputStream is = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
String tokens[] = tokenizer.tokenize("Hi. How are you? This is Mike.");
for (String a : tokens)
  System.out.println(a);
is.close();