A standard workflow for many varieties of text analysis is to tokenize the text, remove stop words from the list of tokens, and then stem the remaining tokens. These tokens are also sometimes used to generate n-grams. In this presentation, we will cover very fast noun and noun phrase extraction, showing a method that uses Part-of-Speech (POS) tagging plus additional logic to find the nouns and noun phrases in text with a widely available Python library.
This alternate workflow replaces the standard workflow, generating a list of tokens or features that represent the nouns in the text and are ready to be used in further analysis. These noun and noun phrase features perform better in classification-type analysis than stemmed tokens, can serve many of the same purposes as n-grams, and avoid or eliminate many of the problems associated with the standard workflow and with n-grams. The primary limitation of the method is that POS tagging and interpretation are computationally intensive. However, we have implemented our solution in a manner that has scaled to hundreds of GB of text, and we believe it will reasonably scale into the low TB range without hardware changes.
In this presentation, we will walk through live Python code for extracting nouns and noun phrases from a document. We will also discuss and review code for extending this style of feature generation to large corpora using a Spark cluster and PySpark.
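As a rough sketch (pseudo code, not the presenter's actual implementation) of how the per-document extraction could be distributed with PySpark: it assumes a running Spark cluster, a corpus laid out one document per line, and an extract_noun_phrases(text) function implementing the POS-based extraction.

```python
# Pseudo code: distributing per-document noun phrase extraction with PySpark.
# Paths, the corpus layout, and extract_noun_phrases() are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("noun-phrase-extraction").getOrCreate()

# Assumes one document per line; adjust the reader to the real corpus layout.
docs = spark.sparkContext.textFile("hdfs:///corpus/docs.txt")

def extract_partition(lines):
    # Process a whole partition per task -- tagger initialization is the
    # expensive part, so amortize it across all documents in the partition.
    for doc in lines:
        yield extract_noun_phrases(doc)

features = docs.mapPartitions(extract_partition)
features.saveAsTextFile("hdfs:///corpus-features")
```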
3. Information Technology
About Me
Lots of database design and architecture experience
Text analytics class during my Master’s degree in Predictive Analytics
Preparing text for analysis has been a theme of just about every project I’ve been involved in for the last 4+ years
Agenda
Overview of using part-of-speech tagging to extract tokens from text
Python notebooks showing some different extraction approaches
– Comparison of standard pipeline tokens to noun phrase extraction
Pseudo code example of scaling the method to a Spark cluster
Q&A
Begin With the End in Mind
A part-of-speech based token extraction pipeline for text:
– Can be straightforward to implement
– Generates tokens that are more human readable
– Can find phrases naturally (replacement for n-grams generation)
– Removes many stopwords “for free”
Lemmatizing or singularizing terms can collapse tokens that differ only in their plural / singular form
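A minimal sketch of what singularization does — a toy rule set, not the full logic of real libraries such as pattern.en’s singularize or an NLTK WordNet lemmatizer:

```python
# Toy rule-based singularizer; real libraries (e.g. pattern.en.singularize)
# handle irregular plurals and many more suffix patterns.
def singularize(word):
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"       # "policies" -> "policy"
    if word.endswith(("ses", "xes", "zes", "ches", "shes")):
        return word[:-2]             # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]             # "tokens" -> "token"
    return word                      # "class" stays "class"

print([singularize(w) for w in ["tokens", "policies", "boxes", "class"]])
# -> ['token', 'policy', 'box', 'class']
```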
Typical pipeline
Tokenize the text
Convert all tokens to lower case
Apply a stemmer to reduce tokens to common roots
– E.g. boil, boils, boiler, boiled, boiling → boil
Remove stop words
– Common English words (the, of, to, in, and, or, which)
Might also include n-gram generation
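The typical pipeline above can be sketched end to end. This is a hedged illustration: the tokenizer, stemmer, and stop word list are deliberately tiny stand-ins for the real components (e.g. NLTK’s word_tokenize, PorterStemmer, and stopwords corpus).

```python
# Toy versions of each pipeline stage; real code would use NLTK equivalents.
import re

STOP_WORDS = {"the", "of", "to", "in", "and", "or", "which"}  # abbreviated list

def toy_stem(word):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def standard_pipeline(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # tokenize + lowercase
    stems = [toy_stem(t) for t in tokens]               # reduce to common roots
    kept = [s for s in stems if s not in STOP_WORDS]    # remove stop words
    bigrams = list(zip(kept, kept[1:]))                 # optional n-gram step
    return kept, bigrams

kept, bigrams = standard_pipeline("The boiler boils and the boiling kettle")
print(kept)    # -> ['boil', 'boil', 'boil', 'kettle']
```

Note how the toy stemmer reproduces the boil/boils/boiler/boiling collapse from the slide above, and how the stop word step silently drops “the” and “and”.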
Challenge
The standard pipeline for preparing text for analysis yields tokens that can be of low value
These tokens can be difficult to explain to business users – what do the tokens mean?
It uses many rules for handling text, and text (nearly) always has exceptions to each rule
Solution
Replace the standard tokenize-stem-stop-n-gram pipeline with a part-of-speech based pipeline
Extract words and phrases with the desired parts of speech
Use singularization instead of stemming to bring singular and plural forms of words together
The result is not perfect, but of better quality than the standard tokenization approach
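A sketch of the extraction step itself. To keep it self-contained, the input is already a list of (token, tag) pairs with Penn Treebank tags — in practice these would come from a tagger such as nltk.pos_tag or Pattern’s parser; the grouping and singularization rules here are simplified illustrations, not the presenter’s exact logic.

```python
def simple_singularize(word):
    # Simplified plural collapsing; a real pipeline would use a fuller rule set.
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def extract_noun_phrases(tagged):
    """Group consecutive NN* tokens into noun phrases; singularize the head."""
    phrases, run = [], []
    for word, tag in tagged + [("", ".")]:      # sentinel flushes the last run
        if tag.startswith("NN"):
            run.append(word.lower())
        elif run:
            run[-1] = simple_singularize(run[-1])
            phrases.append(" ".join(run))
            run = []
    return phrases

tagged = [("The", "DT"), ("noun", "NN"), ("phrase", "NN"), ("features", "NNS"),
          ("work", "VBP"), ("on", "IN"), ("tagged", "VBN"), ("tokens", "NNS")]
print(extract_noun_phrases(tagged))
# -> ['noun phrase feature', 'token']
```

Note that “The” (DT) and “on” (IN) disappear without any stop word list: their parts of speech are simply never collected.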
Benefit
The resulting tokens naturally:
Include phrases (an n-gram replacement)
Exclude the standard English stop words, which have parts of speech that aren’t typically used in analysis
Are more readable and don’t need stemming to collapse similar words to a common root
Resources and Links
Pattern
– For Python* 3 compatibility, I had to install a development branch of the library. See (link) for details
– Something like:
git clone -b development https://github.com/clips/pattern
cd pattern
sudo python setup.py install
NLTK
Penn-Treebank (on Wikipedia) and Tags
Intel® Distribution for Python* (link)
Others: RDRPOSTagger, spaCy, rakutenma, TextBlob