Text is the next frontier in big data. Human speech and writing, from newspaper articles and speeches to informal conversations on social media, comes in a constant stream, encoding information about the human experience (people, relationships, events, and ideas) that simply cannot be captured through traditional data sources. Leveraging natural language data is often the best way to curate and tailor a user’s experience, and natural language processing is rapidly becoming an industry standard.
However, the work of translating raw text and audio into valuable data requires special effort: not just data management and cleaning, but also the extraction and extrapolation of language patterns. Natural language, unlike strict rule-based formal languages, is contextual, constantly evolving, and sometimes unpredictable. This problem space calls for a flexible algorithm that can learn patterns from examples and make predictions about new data. Enter machine learning!
Machine learning algorithms and natural language processing tools are more widely available than ever, and in this talk, we’ll explore an approach for putting the pieces together to transform a raw text corpus into a data product with real business value. We’ll traverse an end-to-end pipeline for parsing many thousands or millions of documents, transforming them into feature vectors, comparing machine learning models, and finally applying the most performant model to new incoming text.
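The first stage of that pipeline, parsing documents and transforming them into feature vectors, can be sketched in miniature. A stdlib-only illustration; the tokenizer, bag-of-words scheme, and toy corpus below are assumptions made for the example, not code from the talk:

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters: a deliberately simple tokenizer.
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(corpus):
    # Map each distinct token to a column index in the feature vector.
    vocab = {}
    for doc in corpus:
        for tok in tokenize(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(doc, vocab):
    # Bag-of-words: count occurrences of each known vocabulary token.
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocab]

corpus = ["The food was great", "The service was slow"]
vocab = build_vocabulary(corpus)
vectors = [vectorize(doc, vocab) for doc in corpus]
```

In practice a library vectorizer would replace the hand-rolled one, but the shape of the transformation, raw text in, fixed-length numeric vectors out, is the same.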
Whether you’re a data scientist, a big data enthusiast, or a developer who wants to leverage natural language data in your application, this talk will provide a practical approach to text analysis and its real-world applications.
4. tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
5. “Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth”
6. Natural Language Understanding (AI) vs. Computational Linguistics (NLP)
Natural Language Understanding (AI): models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.
Computational Linguistics (NLP): approaches that demonstrate how humans interpret and understand language, and show how languages evolve.
9. “It sucks I didn't take pictures of the food I ordered here because I really
wanted to show it off.
The restaurant isn't the biggest. It's pretty small. I had people constantly run
into my bag that I hung on the edge of my chair. Quite annoying honestly but
it's my bad for carrying such a large bag.
It didn't take long for the food to come out. I've been disappointed with one
of New York's best rated brunch spots that I waited 2+ hours for before so I
decided not to have any expectations for this place at all. However, the food
here actually tastes great.”
- 9/6/2017 Yelp Review
12. Sample Sentiment Analysis Pipeline
Training: Training Data (Historic Reviews) + Training Labels (# Stars) → Feature Vectors → Classification Algorithm → Predictive Model
Prediction: New Data (New Review) → Feature Vector → Predictive Model → Predicted Label (# Stars)
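The training and prediction paths on this slide can be sketched end to end. A minimal, hedged illustration using a hand-rolled multinomial naive Bayes with Laplace smoothing; the toy reviews, star labels, and tokenizer are assumptions made for the example:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# Toy historic reviews with star labels, standing in for real training data.
training = [
    ("the food here actually tastes great", 5),
    ("amazing brunch and friendly staff", 5),
    ("pretty small and quite annoying", 2),
    ("waited two hours and left disappointed", 1),
]

# Fit: accumulate per-label token counts (a tiny multinomial naive Bayes).
token_counts = defaultdict(Counter)
label_counts = Counter()
for text, stars in training:
    label_counts[stars] += 1
    token_counts[stars].update(tokenize(text))

vocab = {tok for counter in token_counts.values() for tok in counter}
total_docs = sum(label_counts.values())

def predict(review):
    # Score each label: log prior plus Laplace-smoothed log likelihoods.
    best_label, best_score = None, float("-inf")
    for stars, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)
        total_tokens = sum(token_counts[stars].values())
        for tok in tokenize(review):
            score += math.log(
                (token_counts[stars][tok] + 1) / (total_tokens + len(vocab))
            )
        if score > best_score:
            best_label, best_score = stars, score
    return best_label
```

A new incoming review flows through the same feature extraction as the training data before the fitted model assigns it a star label, which is the essential symmetry of the slide's two paths.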
18. Data Management Layer
Data Management Layer: Raw Data → Instance Database → Model Storage
Model Selection Triples: Feature Engineering × Algorithm Selection × Hyperparameter Tuning → Model Family → Model Form
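One way to read the model selection triple is as a search space: each candidate model is one combination of a feature engineering choice, an algorithm, and a hyperparameter setting. A sketch of enumerating that grid; the candidate names and the scoring function are placeholder assumptions, not real estimators:

```python
from itertools import product

# Placeholder components of the triple; in practice these would be
# real vectorizers, estimators, and hyperparameter settings.
feature_choices = ["unigrams", "bigrams"]
algorithms = ["naive_bayes", "logistic_regression"]
hyperparams = [{"alpha": 0.1}, {"alpha": 1.0}]

def evaluate(features, algorithm, params):
    # Stand-in for cross-validated scoring of a fitted model.
    score = 0.5
    score += 0.2 if features == "bigrams" else 0.0
    score += 0.1 if algorithm == "logistic_regression" else 0.0
    score += 0.05 if params["alpha"] == 0.1 else 0.0
    return score

# Enumerate every model selection triple and keep the best scorer.
best = max(product(feature_choices, algorithms, hyperparams),
           key=lambda triple: evaluate(*triple))
```

Storing the instances and fitted models alongside each triple is what makes this iteration repeatable, which is the role of the data management layer on the slide.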
21. Partisan Discourse: Architecture
Start here: Debate Transcripts → Initial Model
Submit URL → Preprocessing → Feature Extraction → Classification → Feedback
Feedback → Corpus Storage → Corpus Monitoring
Model Selection: Fit Model → Evaluate Model → Model Storage → Model Monitoring
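The fit → evaluate → store → monitor loop implies persisting fitted models and treating user feedback as new corpus data. A minimal sketch of that idea using `pickle`; the dictionary-based model, the `classify` helper, and the corpus list are hypothetical stand-ins, not the application's actual code:

```python
import os
import pickle
import tempfile

# A stand-in "fitted model": maps known tokens to a class vote.
model = {"great": "positive", "annoying": "negative"}

def classify(model, text):
    # Majority vote over the tokens the model knows about.
    votes = [model[t] for t in text.lower().split() if t in model]
    return max(set(votes), key=votes.count) if votes else "unknown"

# Model storage: serialize the fitted model so the app can reload it later.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    loaded = pickle.load(fh)

# Deployment doubles as ingestion: each classified document, plus any
# user feedback on the label, becomes a new training instance.
corpus_storage = []
label = classify(loaded, "the food here tastes great")
corpus_storage.append(("the food here tastes great", label))
```

The point of the sketch is the round trip: the stored model serves predictions, and the predictions (with feedback) flow back into corpus storage to retrain it.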
22. Partisan Discourse: New Documents
Users can:
- add new documents
- add labels to train the model
23. Partisan Discourse: User Model
Over time, models evolve:
- Global model
- Local models
- User models
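A sketch of how global and user models might coexist: per-user models override a shared global model, with a fallback when a user has not yet trained their own. The registry shape here is a hypothetical assumption for illustration:

```python
# Hypothetical model registry: the global model is shared by everyone,
# while users who have supplied labels get their own fitted model.
global_model = {"owner": "global"}
user_models = {"alice": {"owner": "alice"}}

def model_for(user):
    # Fall back to the global model until this user has trained one.
    return user_models.get(user, global_model)
```

As users label documents, their local models diverge from the global one, which is the evolution the slide describes.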
25. tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
26. Everyday NLP Applications
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition