NLP IN THE WILD
COLLEEN M. FARRELLY, DATASEMBLY
COMMON INDUSTRY NLP
PROBLEMS
• Sentiment analysis/tracking of customer
feedback
• Computational linguistics/psychology of
language usage
• Chatbots
• Translation services
• Supervised learning
• Document summary
• Problem formulation
• Data collection
• Choice of tools (math/ML)
• Application
• Results
CASE STUDY 1: CONSUMER GROUP
CLUSTERING
• Want to understand how
different groups interact
with a chatbot
• Sales implications
• Group-specific needs
for future feature builds
• Chatbot conversation data
sample
• NLP to derive salient text
features
• Persistent homology to
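The persistent homology bullet above is cut off in the source, so its exact role in this case study is not stated. As a minimal sketch of the technique only, the code below computes zero-dimensional persistence (when connected components merge as the distance scale grows) for a handful of points. The `features` values and the `h0_persistence` helper are hypothetical illustrations, not the author's data or method; a package such as ripser would normally be used and also covers higher dimensions.

```python
from itertools import combinations
from math import dist


def h0_persistence(points):
    """Return death scales of connected components in a Vietoris-Rips filtration.

    Every point starts as its own component at scale 0. Edges are added in
    order of length, and a component "dies" when an edge first merges it
    into another one. The single component that never dies is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Hypothetical 2-D text features for five chatbot conversations (made up).
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
deaths = h0_persistence(features)
print(deaths)
```

Three components die at a small scale and one survives until a much larger scale; a wide gap like that is the kind of signal that suggests two well-separated groups in the data.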
CASE 2: SUPERVISED LEARNING
• Want to classify products by
type (such as fruit or canned
soup) using title text
• Data includes a small sample of
scraped titles from a sample of
retailers with manual annotation
of product type
• Text cleaning and embedding
algorithms to prepare the text
data for machine learning
• Supervised learning algorithm
to create the classifier
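The pipeline above (clean the titles, turn them into features, fit a supervised model) can be sketched in a few lines. This is an illustration under assumptions, not the classifier from the case study: the titles and labels are made up, a bag-of-words count stands in for the embedding step, and multinomial naive Bayes is an arbitrary choice of learner. The helper names `clean`, `train`, and `predict` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def clean(title):
    """Lowercase a title and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", title.lower())


def train(examples):
    """Collect per-label word counts from (title, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for title, label in examples:
        label_counts[label] += 1
        word_counts[label].update(clean(title))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, title):
    """Return the label with the highest naive Bayes log score."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # Add-one smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(n / total)
        for token in clean(title):
            if token in vocab:  # tokens never seen in training are skipped
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up annotated titles standing in for the scraped sample.
annotated = [
    ("Organic Gala Apples 3 lb Bag", "fruit"),
    ("Fresh Bananas, Bunch", "fruit"),
    ("Seedless Red Grapes 2 lb", "fruit"),
    ("Chicken Noodle Soup 10.5 oz Can", "canned soup"),
    ("Condensed Tomato Soup Can", "canned soup"),
    ("Cream of Mushroom Soup 10.5 oz", "canned soup"),
]
model = train(annotated)
print(predict(model, "Tomato Soup 4 pack"))
print(predict(model, "Organic Bananas"))
```

With a small annotated sample, a simple count-based model like this is a reasonable baseline before moving to learned embeddings.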
CASE 3: TOPIC
MODELING
• Want to find main
topics discussed in a
corpus of documents
(poems)
• Poetry data sample
across genres of
poetry by a single
author
• Topic modeling to
classify poems
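To make the topic-modeling step concrete, here is a small sketch of latent Dirichlet allocation fitted by collapsed Gibbs sampling, written in plain Python so it runs without extra installs. It is an assumption that LDA was the model used; the toy "poems", the `lda_gibbs` helper, and all parameter values are hypothetical. In practice a library such as Gensim would do this work.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                z, wid = assignments[d][n], word_id[w]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


# Made-up, pre-tokenized "poems" for illustration only.
poems = [
    ["sea", "wave", "salt", "tide", "sea"],
    ["tide", "wave", "shore", "salt"],
    ["city", "street", "neon", "night"],
    ["night", "street", "city", "rain"],
]
mixtures = lda_gibbs(poems, num_topics=2)
# Label each poem with its dominant topic.
labels = [max(range(len(m)), key=m.__getitem__) for m in mixtures]
print(labels)
```

Each poem gets a mixture over topics that sums to one, and taking the largest entry gives a simple way to classify poems by topic. Topic numbering is arbitrary and results on a corpus this small depend on the random seed.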
CASE 4: TIME-BASED ANALYSIS OF MINDSET
• Want to quickly understand
changes in leader’s behavior at
onset of war
• Public statement sample by
president over course of several
weeks as input data
• NLP to derive linguistic features
• Longitudinal models and topology-
based changepoint algorithm on
linguistic feature time series
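As a rough illustration of changepoint detection on a linguistic feature time series, the sketch below finds the single split that best separates a series into two constant-mean segments. This is a plain least-squares detector swapped in for illustration, not the topology-based algorithm named above; the `pronoun_rate` values and the `best_changepoint` helper are hypothetical.

```python
def best_changepoint(series):
    """Return the index that best splits the series into two segments.

    "Best" means the split minimizing the total squared error when each
    segment is summarized by its own mean.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    candidates = range(1, len(series))
    return min(candidates, key=lambda t: sse(series[:t]) + sse(series[t:]))


# Made-up rate of one linguistic feature per public statement, in time order.
pronoun_rate = [0.021, 0.019, 0.022, 0.020, 0.041, 0.043, 0.040, 0.044]
split = best_changepoint(pronoun_rate)
print(split)
```

Here the detected split falls between the fourth and fifth statements, where the made-up feature roughly doubles. Real series are noisier and may contain several changepoints, which is where longitudinal models and more robust detectors come in.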
HELPFUL PYTHON PACKAGES
• NLP:
• NLTK (part-of-speech tagging,
munging data…)
• Gensim (topic models)
• Vader (sentiment analysis)
• TDA (topological data analysis):
• Persim/ripser (persistent
homology)
• Kmapper (Mapper algorithm)
• Structural equation modeling/latent
class modeling:
• Semopy (similar to lavaan in R)
CONTACT ME
• cfarrelly@med.miami.edu
• LinkedIn (Colleen M. Farrelly)
