In this presentation, delivered at the AI & ML meetup on 2nd Feb, Sangram Mishra develops the same NLP solution using both NLTK and OpenNLP, then compares and contrasts the two open-source technologies for deeper insight into choosing and using them for real-world projects.
4. Natural language processing (NLP) is a field of
computer science and artificial intelligence concerned
with the interactions between computers and human
(natural) languages, and, in particular, with
programming computers to fruitfully process large
amounts of natural language data.
What???
5. In 1950, Alan Turing published an article titled
"Computing Machinery and Intelligence", which
proposed what is now called the Turing test as a
criterion of intelligence.
ELIZA (1964)
ELIZA might provide a generic response, for example,
responding to "My head hurts" with "Why do you say
your head hurts?".
When???
7. Where?
Machine Translation
Fighting Spam
Mail Inbox or Spam
Information Extraction
Social media monitoring
Summarization
Question Answering
8. Tasks in OpenNLP
The Apache OpenNLP library is a machine
learning based toolkit for the processing of natural
language text.
It supports the most common NLP tasks, such
as language detection, tokenization, sentence
segmentation, part-of-speech tagging, named entity
extraction, chunking, parsing and coreference
resolution.
9. Tasks in OpenNLP
Text data arrives as unstructured text, in the form
of comments, reviews or articles.
Meaningful information is extracted from it with the
help of a set of tasks.
Tokenizing:
Take a large piece of text and break it into smaller components:
sentences or individual words.
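As a toy illustration of the idea (a naive sketch, not the NLTK implementation shown later in the deck), a tokenizer can split on sentence-ending punctuation and then on word boundaries:

```python
import re

text = "Mary had a little lamb. Her fleece was as white as snow."

# Naive sentence split: break after '.', '!' or '?' followed by a space.
sentences = re.split(r'(?<=[.!?])\s+', text.strip())

# Naive word split: pull out runs of letters, dropping punctuation.
words = [re.findall(r"[A-Za-z']+", s) for s in sentences]

print(sentences)
print(words)
```

Real tokenizers handle abbreviations, decimals and quotes, which is why NLTK and OpenNLP use trained models rather than regular expressions.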
10. Stop words removal
Once tokenized,
the next step is stop-word removal, i.e. separating
the words that carry specific meaning from the words
that merely add structure to the sentence.
Eg:
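A minimal sketch of the idea, using a small hand-picked stop-word list (NLTK ships a much fuller list in nltk.corpus.stopwords):

```python
# Toy stop-word list; nltk.corpus.stopwords provides a far larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "as", "my", "i", "it"}

tokens = ["my", "head", "hurts", "and", "it", "is", "cold"]

# Keep only the words that carry specific meaning.
content_words = [t for t in tokens if t not in STOP_WORDS]

print(content_words)
```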
11. N-Grams
Once stop words are removed,
look for commonly occurring sequences of words, because these
will often be the most informative parts of the text.
https://en.wikipedia.org/wiki/N-gram#Examples
An n-gram is a run of n consecutive words; a run of two
consecutive words is called a bigram.
Eg:
Code(s) Description
M79.661 Pain in right lower leg
M79.662 Pain in left lower leg
M79.669 Pain in unspecified lower leg
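Extracting n-grams needs no library at all; a minimal pure-Python sketch, using tokens from the code-description example above:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["pain", "in", "right", "lower", "leg"]
bigrams = ngrams(tokens, 2)
print(bigrams)
```

NLTK offers the same operation as nltk.util.ngrams; counting which bigrams recur across a corpus surfaces the most informative phrases.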
12. Word Sense Disambiguation
Eg:
I am taking aspirin for my cold
Let's go inside, I'm cold
It's cold today, only 2 degrees
It identifies the meaning of a word based on the
context in which it is used.
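One classic approach can be sketched as a simplified Lesk-style overlap: score each sense by how many words its dictionary gloss shares with the sentence. The mini-glosses below are made up for illustration; NLTK's actual implementation (nltk.wsd.lesk) uses WordNet glosses.

```python
# Hypothetical mini-glosses for two senses of "cold".
SENSES = {
    "illness":     {"aspirin", "medicine", "fever", "taking"},
    "temperature": {"degrees", "weather", "today", "inside"},
}

def disambiguate(context_words):
    # Pick the sense whose gloss overlaps the context the most.
    context = set(context_words)
    return max(SENSES, key=lambda s: len(SENSES[s] & context))

print(disambiguate(["i", "am", "taking", "aspirin", "for", "my", "cold"]))
print(disambiguate(["it", "is", "cold", "today", "only", "2", "degrees"]))
```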
13. Parts-of-Speech Tagging
It can occur either as part of WSD or as an independent
task.
It helps in identifying parts of speech, whether Noun,
Verb, Adjective, etc.
Stemming
Converting a word to its base form.
Eg: Close, Closed, Closely, Closer
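A deliberately crude suffix-stripping sketch, tuned only to the example above; real stemmers such as NLTK's PorterStemmer apply far more careful rules:

```python
def crude_stem(word):
    # Strip a few suffixes chosen to fit the Close/Closed/Closely/Closer example.
    # The length check avoids mangling very short words.
    for suffix in ("ly", "r", "d"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("Close", "Closed", "Closely", "Closer"):
    print(w, "->", crude_stem(w.lower()))
```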
14. Python NLTK and OpenNLP
NLTK is one of the leading platforms for working with
human language data in Python; the NLTK module is
used for natural language processing.
NLTK is an acronym for Natural Language Toolkit.
The Apache OpenNLP library is a machine learning based
toolkit for the processing of natural language text.
It supports the most common NLP tasks, such as
tokenization, sentence segmentation, part-of-speech
tagging, named entity extraction, chunking, parsing, and
coreference resolution.
15. Python NLTK
Step 1: Collect all the individual sentences in an article
into a list.
Tokenization with NLTK: from nltk.tokenize we can
import the functions sent_tokenize (break down into
sentences) and word_tokenize (break down into words).
Import stopwords from the nltk.corpus module and
punctuation from the string module.
Note: a sentence ends with a period symbol (.) followed
by a space.
16. Frequency distribution
Construct a frequency distribution : words and no of
times each word occurs
Functions Defined for NLTK's Frequency Distributions
Example Description
fdist = FreqDist(samples) create a frequency distribution containing the given samples
fdist[sample] += 1 increment the count for this sample
fdist['monstrous'] count of the number of times a given sample occurred
fdist.freq('monstrous') frequency of a given sample
fdist.N() total number of samples
fdist.most_common(n) the n most common samples and their frequencies
for sample in fdist: iterate over the samples
fdist.max() sample with the greatest count
fdist.tabulate() tabulate the frequency distribution
fdist.plot() graphical plot of the frequency distribution
fdist.plot(cumulative=True) cumulative plot of the frequency distribution
fdist1 |= fdist2 update fdist1 with counts from fdist2
fdist1 < fdist2 test if samples in fdist1 occur less frequently than in fdist2
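NLTK's FreqDist is essentially a counting dictionary; its core behaviour in the table above can be mimicked with collections.Counter from the standard library:

```python
from collections import Counter

words = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]

fdist = Counter(words)          # like FreqDist(samples)
fdist["cat"] += 1               # increment the count for this sample

print(fdist["the"])             # count of a given sample
print(fdist.most_common(2))     # the n most common samples and their counts
print(sum(fdist.values()))      # total number of samples, like fdist.N()
```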
17. Tokenizing: Sentence Detection
Python Usage
Step 1: Import NLTK
Step 2:
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Mary had a little lamb. Her fleece was as white as snow"
sents = sent_tokenize(text)
print(sents)
Java Usage
Step 1: Load the sentence detector model, e.g. en-sent.bin.
Step 2: For example:
try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
  SentenceModel model = new SentenceModel(modelIn);
  SentenceDetectorME detector = new SentenceDetectorME(model);
  String sents[] = detector.sentDetect(text);
}
18. OpenNLP syntax
OpenNLP components have similar APIs. Normally, to
execute a task, one should provide a model and an
input.
A model is usually loaded by passing a FileInputStream
for the model file to the constructor of the model class:

try (InputStream modelIn = new FileInputStream("lang-model-name.bin")) {
  SomeModel model = new SomeModel(modelIn);
}
20. Breaking into Word
Python
words=[word_tokenize(sent) for sent in sents]
print(words)
Java
InputStream is = new FileInputStream("en-token.bin");
TokenizerModel model = new TokenizerModel(is);
Tokenizer tokenizer = new TokenizerME(model);
String tokens[] = tokenizer.tokenize("Hi. How are you? This is Mike.");
for (String a : tokens)
  System.out.println(a);
is.close();