A standard workflow for many varieties of text analysis is to tokenize the text, remove stop words from the list of tokens, and then stem the remaining tokens. These tokens are also sometimes used to generate n-grams. In this presentation, we will cover very fast noun and noun phrase extraction, showing a method that uses Part-of-Speech (POS) tagging plus additional logic to find the nouns and noun phrases in text with a widely available Python library.
This alternate workflow replaces the standard workflow, generating a list of tokens or features that represent the nouns in the text and are ready to be used in further analysis. These noun and noun phrase features perform better in classification-type analysis than stemmed tokens, can serve many of the same purposes as n-grams, and avoid or eliminate many of the problems associated with the standard workflow and with n-grams. The primary limitation of the method is that POS tagging and interpretation are computationally intensive. However, we have implemented our solution in a manner that has scaled to hundreds of GB of text, and we believe it will reasonably scale into the low TB range without hardware changes.
In this presentation, we will walk through live Python code for extracting nouns and noun phrases from a document. We will also discuss and review code for extending this style of feature generation to large corpora using a Spark cluster and PySpark.
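As a rough sketch (pseudo code, not the presenter's actual implementation) of how the per-document extraction could be distributed with PySpark: it assumes a running Spark cluster, a corpus laid out one document per line, and an extract_noun_phrases(text) function implementing the POS-based extraction.

```python
# Pseudo code: distributing per-document noun phrase extraction with PySpark.
# Paths, the corpus layout, and extract_noun_phrases() are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("noun-phrase-extraction").getOrCreate()

# Assumes one document per line; adjust the reader to the real corpus layout.
docs = spark.sparkContext.textFile("hdfs:///corpus/docs.txt")

def extract_partition(lines):
    # Process a whole partition per task -- tagger initialization is the
    # expensive part, so amortize it across all documents in the partition.
    for doc in lines:
        yield extract_noun_phrases(doc)

features = docs.mapPartitions(extract_partition)
features.saveAsTextFile("hdfs:///corpus-features")
```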
3. Information Technology
About Me
Lots of database design and architecture experience
Text analytics class during my Master’s degree in Predictive Analytics
Preparing text for analysis has been a theme of just about every project I’ve been involved in for the last 4+ years
Agenda
Overview of using part-of-speech tagging to extract tokens from text
Python notebooks showing some different extraction approaches
– Comparison of standard pipeline tokens to noun phrase extraction
Pseudo code example of scaling the method to a Spark cluster
Q&A
Begin With the End in Mind
A part-of-speech based token extraction pipeline for text:
– Can be straightforward to implement
– Generates tokens that are more human readable
– Can find phrases naturally (replacement for n-grams generation)
– Removes many stopwords “for free”
Lemmatizing or singularizing terms can collapse tokens that differ only in their plural / singular form
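A minimal sketch of what singularization does — a toy rule set, not the full logic of real libraries such as pattern.en’s singularize or an NLTK WordNet lemmatizer:

```python
# Toy rule-based singularizer; real libraries (e.g. pattern.en.singularize)
# handle irregular plurals and many more suffix patterns.
def singularize(word):
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"       # "policies" -> "policy"
    if word.endswith(("ses", "xes", "zes", "ches", "shes")):
        return word[:-2]             # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]             # "tokens" -> "token"
    return word                      # "class" stays "class"

print([singularize(w) for w in ["tokens", "policies", "boxes", "class"]])
# -> ['token', 'policy', 'box', 'class']
```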
Typical pipeline
Tokenize the text
Convert all tokens to lower case
Apply a stemmer to reduce tokens to common roots
– E.g. boil, boils, boiler, boiled, boiling → boil
Remove stop words
– Common English words (the, of, to, in, and, or, which)
Might also include n-gram generation
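The typical pipeline above can be sketched end to end. This is a hedged illustration: the tokenizer, stemmer, and stop word list are deliberately tiny stand-ins for the real components (e.g. NLTK’s word_tokenize, PorterStemmer, and stopwords corpus).

```python
# Toy versions of each pipeline stage; real code would use NLTK equivalents.
import re

STOP_WORDS = {"the", "of", "to", "in", "and", "or", "which"}  # abbreviated list

def toy_stem(word):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def standard_pipeline(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenize + lowercase
    stems = [toy_stem(t) for t in tokens]               # reduce to common roots
    kept = [s for s in stems if s not in STOP_WORDS]    # remove stop words
    bigrams = list(zip(kept, kept[1:]))                 # optional n-gram step
    return kept, bigrams

kept, bigrams = standard_pipeline("The boiler boils and the boiling kettle")
print(kept)    # -> ['boil', 'boil', 'boil', 'kettle']
```

Note how the toy stemmer reproduces the boil/boils/boiler/boiling collapse from the slide above, and how the stop word step silently drops “the” and “and”.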
Challenge
The standard pipeline for preparing text for analysis yields tokens that can be of low value
These tokens can be difficult to explain to business users – what do the tokens mean?
It uses many rules for handling text, and text (nearly) always has exceptions to each rule
Solution
Replace the standard tokenize-stem-stop-n-gram pipeline with a part-of-speech based pipeline
Extract words and phrases with the desired parts of speech
Use singularization instead of stemming to bring singular and plural forms of words together
The result is not perfect, but of better quality than the standard tokenization approach
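A sketch of the extraction step itself. To keep it self-contained, the input is already a list of (token, tag) pairs with Penn Treebank tags — in practice these would come from a tagger such as nltk.pos_tag or Pattern’s parser; the grouping and singularization rules here are simplified illustrations, not the presenter’s exact logic.

```python
def simple_singularize(word):
    # Simplified plural collapsing; a real pipeline would use a fuller rule set.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def extract_noun_phrases(tagged):
    """Group consecutive NN* tokens into noun phrases; singularize the head."""
    phrases, run = [], []
    for word, tag in tagged + [("", ".")]:      # sentinel flushes the last run
        if tag.startswith("NN"):
            run.append(word.lower())
        elif run:
            run[-1] = simple_singularize(run[-1])
            phrases.append(" ".join(run))
            run = []
    return phrases

tagged = [("The", "DT"), ("noun", "NN"), ("phrase", "NN"), ("features", "NNS"),
          ("work", "VBP"), ("on", "IN"), ("tagged", "VBN"), ("tokens", "NNS")]
print(extract_noun_phrases(tagged))
# -> ['noun phrase feature', 'token']
```

Note that “The” (DT) and “on” (IN) disappear without any stop word list: their parts of speech are simply never collected.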
Benefit
The resulting tokens naturally:
Include phrases (an n-gram replacement)
Exclude the standard English stop words, which have parts of speech that aren’t typically used in analysis
Are more readable and don’t need stemming to collapse similar words to a common root
Resources and Links
Pattern
– For Python* 3 compatibility, I had to install a development branch of the library. See (link) for details
– Something like:
git clone -b development https://github.com/clips/pattern
cd pattern
sudo python setup.py install
NLTK
Penn-Treebank (on Wikipedia) and Tags
Intel® Distribution for Python* (link)
Others: RDRPOSTagger, spaCy, rakutenma, TextBlob