Practical Text Analytics
and its Real-World Applications
Rebecca Bilbro
Lead Data Scientist at ByteCubed
Faculty at Georgetown Univ.
Partner at District Data Labs
@rebeccabilbro
Overview
tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
“Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth”
Natural Language Understanding (AI)
Models for semantic understanding, reasoning, and generation of natural languages for human-computer interaction.
vs.
Computational Linguistics (NLP)
Approaches to demonstrate how humans interpret and understand language and show how languages evolve.
negative
angry, bad, contempt,
deceive, evil, fake, grim,
hoarder, ignorant, joke,
kaput, lies, measly, nasty,
obscure, pointless, quit,
rampant, stupid, trivial,
unclean, venomous,
weak, yell, zealot
positive
awesome, best, cool,
dazzle, easy, friendly,
golden, happy, improve,
joy, keen, lucky, marvel,
normal, original, peerless,
quick, remedy, super,
tidy, upbeat, vivid, warm,
yay, zenith
“It sucks I didn't take pictures of the food I ordered here because I really
wanted to show it off.
The restaurant isn't the biggest. It's pretty small. I had people constantly run
into my bag that I hung on the edge of my chair. Quite annoying honestly but
it's my bad for carrying such a large bag.
It didn't take long for the food to come out. I've been disappointed with one
of New York's best rated brunch spots that I waited 2+ hours for before so I
decided not to have any expectations for this place at all. However, the food
here actually tastes great.”
- 9/6/2017 Yelp Review
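A minimal sketch of why word-list rules struggle with a review like this one. It scores the review against the positive and negative word lists from the lexicon slide; the tokenizer and the `lexicon_hits` helper are assumptions made for illustration, not part of the talk.

```python
import re

# Word lists copied from the lexicon slide.
NEGATIVE = set("""angry bad contempt deceive evil fake grim hoarder ignorant joke
kaput lies measly nasty obscure pointless quit rampant stupid trivial unclean
venomous weak yell zealot""".split())
POSITIVE = set("""awesome best cool dazzle easy friendly golden happy improve joy
keen lucky marvel normal original peerless quick remedy super tidy upbeat vivid
warm yay zenith""".split())

REVIEW = """It sucks I didn't take pictures of the food I ordered here because I
really wanted to show it off. The restaurant isn't the biggest. It's pretty
small. I had people constantly run into my bag that I hung on the edge of my
chair. Quite annoying honestly but it's my bad for carrying such a large bag.
It didn't take long for the food to come out. I've been disappointed with one
of New York's best rated brunch spots that I waited 2+ hours for before so I
decided not to have any expectations for this place at all. However, the food
here actually tastes great."""

def lexicon_hits(text):
    """Return the positive and negative lexicon words found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = [t for t in tokens if t in POSITIVE]
    negative = [t for t in tokens if t in NEGATIVE]
    return positive, negative

positive, negative = lexicon_hits(REVIEW)
score = len(positive) - len(negative)  # naive rule: positives minus negatives
print(positive, negative, score)
```

The only hits are "best", which describes a different restaurant that disappointed the reviewer, and "bad", which is the reviewer blaming themselves. The rule-based score comes out neutral for a review that ends up clearly favorable, and it misses "sucks", "annoying", "disappointed", and "great" entirely.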
Sample Sentiment Analysis Pipeline
Training: Training Data (Historic Reviews) + Training Labels (# Stars) → Feature Vectors → Classification Algorithm → Predictive Model
Prediction: New Data (New Review) → Feature Vector → Predictive Model → Predicted Label (# Stars)
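The pipeline above can be sketched in a few lines of plain Python. The slide does not name a classification algorithm, so this sketch swaps in multinomial naive Bayes with add-one smoothing as one possible choice. The four training reviews, their star labels, and the `NaiveBayesSketch` class are hypothetical stand-ins for historic review data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase a review and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesSketch:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, reviews, stars):
        self.class_docs = Counter(stars)         # number of reviews per star label
        self.word_counts = defaultdict(Counter)  # star label -> word counts
        for review, label in zip(reviews, stars):
            self.word_counts[label].update(tokenize(review))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, review):
        total_docs = sum(self.class_docs.values())
        # Words never seen in training carry no evidence, so drop them.
        tokens = [t for t in tokenize(review) if t in self.vocab]
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            scores[label] = math.log(n_docs / total_docs) + sum(
                math.log((counts[t] + 1) / denom) for t in tokens
            )
        return max(scores, key=scores.get)

# Hypothetical training data: historic reviews and their star labels.
TRAIN_REVIEWS = [
    "The food was great and the staff were friendly",
    "Great brunch, tasty food, will come back",
    "Terrible service and the food was cold",
    "Cold food, rude staff, a long wait",
]
TRAIN_STARS = [5, 5, 1, 1]

model = NaiveBayesSketch().fit(TRAIN_REVIEWS, TRAIN_STARS)
print(model.predict("The food tastes great"))  # new review -> predicted stars
```

On this toy data the new review is assigned 5 stars. With real reviews, the same fit and predict split applies, but the vectorization and algorithm would be chosen through model selection.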
Instances = Documents or Utterances
(no matter their size)
Vectorization
Bag-of-words · One-hot encoding · TFIDF · Distributed representation
Documents:
- The elephant sneezed at the sight of potatoes.
- Bats can see via echolocation. See the bat sight sneeze!
- Wondering, she opened the door to the studio.
Frequency vector for the second document (term, count):
at 0 · bat 2 · can 1 · door 0 · echolocation 1 · elephant 0 · of 0 · open 0 · potato 0 · see 2 · she 0 · sight 1 · sneeze 1 · studio 0 · the 1 · to 0 · via 1 · wonder 0
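A short sketch of the first three encodings, using the three example documents from this slide. The hand-written `LEMMAS` table is an assumption that stands in for a real stemmer or lemmatizer, and the TF-IDF weighting shown (raw count times log(N / document frequency)) is only one of several common variants.

```python
import math
import re
from collections import Counter

DOCS = [
    "The elephant sneezed at the sight of potatoes.",
    "Bats can see via echolocation. See the bat sight sneeze!",
    "Wondering, she opened the door to the studio.",
]

# Hand-written lemma table (an assumption for this sketch); a real system
# would use a stemmer or lemmatizer instead.
LEMMAS = {"bats": "bat", "sneezed": "sneeze", "potatoes": "potato",
          "wondering": "wonder", "opened": "open"}

def tokenize(text):
    """Lowercase, split into words, and map each word to its lemma."""
    return [LEMMAS.get(t, t) for t in re.findall(r"[a-z]+", text.lower())]

TOKENS = [tokenize(d) for d in DOCS]
VOCAB = sorted({t for toks in TOKENS for t in toks})  # 18 terms, alphabetical

def frequency_vector(tokens):
    """Bag-of-words: how many times each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in VOCAB]

def one_hot_vector(tokens):
    """One-hot: 1 if the term occurs at all, else 0."""
    present = set(tokens)
    return [int(w in present) for w in VOCAB]

def tfidf_vector(tokens):
    """TF-IDF: term count scaled by log(N / document frequency)."""
    counts = Counter(tokens)
    weights = []
    for w in VOCAB:
        df = sum(w in toks for toks in TOKENS)  # documents containing w
        weights.append(counts[w] * math.log(len(TOKENS) / df))
    return weights

bats = TOKENS[1]
print(frequency_vector(bats))
print(one_hot_vector(bats))
print(tfidf_vector(bats))
```

The frequency vector for the second document reproduces the counts on the slide. In the TF-IDF vector, "the" drops to zero because it occurs in every document, while "bat" and "see" keep the highest weights.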
The Model Selection Triple
Feature Analysis · Algorithm Selection · Hyperparameter Tuning
Arun Kumar http://bit.ly/2abVNrI
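One way to read the triple is as a search space: every candidate is a combination of a feature representation, an algorithm, and a hyperparameter setting. The sketch below enumerates a tiny space and scores each triple with leave-one-out accuracy. The toy documents, the two featurizers, the two classifiers, and the helper names are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

# 1. Feature analysis candidates.
def count_features(text):
    return Counter(tokenize(text))

def binary_features(text):
    return Counter(set(tokenize(text)))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# 2. Algorithm candidates.
def knn(train, query, k=1):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(train, key=lambda pair: cosine(query, pair[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

def centroid(train, query):
    """Pick the label whose summed word vector is most similar to the query."""
    sums = {}
    for vec, label in train:
        sums.setdefault(label, Counter()).update(vec)
    return max(sums, key=lambda label: cosine(query, sums[label]))

FEATURES = {"counts": count_features, "binary": binary_features}
# 3. Hyperparameter candidates, listed per algorithm.
ALGORITHMS = {"knn": (knn, [{"k": 1}, {"k": 3}]), "centroid": (centroid, [{}])}

def leave_one_out(docs, labels, featurize, predict, params):
    """Fraction of documents classified correctly when each is held out in turn."""
    vectors = [featurize(d) for d in docs]
    hits = 0
    for i, vec in enumerate(vectors):
        train = [(v, l) for j, (v, l) in enumerate(zip(vectors, labels)) if j != i]
        hits += predict(train, vec, **params) == labels[i]
    return hits / len(docs)

def search(docs, labels):
    """Score every (features, algorithm, hyperparameters) triple."""
    results = {}
    for f_name, featurize in FEATURES.items():
        for a_name, (predict, grid) in ALGORITHMS.items():
            for params in grid:
                triple = (f_name, a_name, tuple(sorted(params.items())))
                results[triple] = leave_one_out(docs, labels, featurize, predict, params)
    return results

# Hypothetical toy corpus.
DOCS = ["great food and friendly staff", "tasty brunch, great coffee",
        "friendly service, tasty food", "cold food and rude staff",
        "rude service, long wait", "long wait for cold coffee"]
LABELS = ["pos", "pos", "pos", "neg", "neg", "neg"]

results = search(DOCS, LABELS)
best = max(results, key=results.get)
print(best, results[best])
```

Keeping the three choices in separate, swappable slots is what makes iteration cheap: adding a new vectorizer or algorithm extends the search without rewriting the evaluation loop.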
[Diagram: Model Selection Triples. Labels: Data Management Layer, Raw Data, Instance Database, Model Storage, Feature Engineering, Algorithm Selection, Hyperparameter Tuning, Model Family, Model Form]
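The diagram pairs the model selection triples with a storage layer. As a rough sketch of what model storage could look like, the code below records each evaluated triple and its score in SQLite. The table layout, the function names, and the example rows (including their scores) are made-up placeholders, not details from the talk.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a small store for evaluated triples."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS triples (
               id INTEGER PRIMARY KEY,
               features TEXT NOT NULL,
               algorithm TEXT NOT NULL,
               hyperparameters TEXT NOT NULL,
               score REAL NOT NULL
           )"""
    )
    return conn

def record_triple(conn, features, algorithm, hyperparameters, score):
    """Persist one (features, algorithm, hyperparameters) triple with its score."""
    conn.execute(
        "INSERT INTO triples (features, algorithm, hyperparameters, score) "
        "VALUES (?, ?, ?, ?)",
        (features, algorithm, json.dumps(hyperparameters, sort_keys=True), score),
    )
    conn.commit()

def best_triple(conn):
    """Return the highest-scoring triple, or None if the store is empty."""
    row = conn.execute(
        "SELECT features, algorithm, hyperparameters, score "
        "FROM triples ORDER BY score DESC LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    features, algorithm, hyperparameters, score = row
    return features, algorithm, json.loads(hyperparameters), score

store = open_store()
# Placeholder rows; the scores are invented for illustration.
record_triple(store, "tfidf", "naive_bayes", {"alpha": 1.0}, 0.81)
record_triple(store, "counts", "knn", {"k": 3}, 0.74)
print(best_triple(store))
```

Storing every triple, not only the winner, keeps the history of what was tried and makes later comparisons possible as the corpus grows.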
Case Study:
Predicting Political Orientation
Partisan Discourse: Architecture
[Architecture diagram. Labels: Initial Model, Debate Transcripts, Submit URL, Preprocessing, Feature Extraction, Fit Model, Evaluate Model, Model Selection, Model Storage, Model Monitoring, Corpus Storage, Corpus Monitoring, Classification, Feedback, "start here"]
Partisan Discourse: New Documents
Users can:
- add new documents
- add labels to train the model
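A minimal sketch of that feedback loop, assuming a hypothetical in-memory corpus and a deliberately simple word-vote classifier. The class names, the example sentences, and the `party_a` / `party_b` labels are invented for illustration; the application itself is only described at the architecture level in these slides.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class FeedbackCorpus:
    """Hypothetical corpus store: documents may arrive unlabeled and be labeled later."""

    def __init__(self):
        self.documents = {}  # doc_id -> text
        self.labels = {}     # doc_id -> label

    def add_document(self, text, label=None):
        doc_id = len(self.documents)
        self.documents[doc_id] = text
        if label is not None:
            self.labels[doc_id] = label
        return doc_id

    def add_label(self, doc_id, label):
        self.labels[doc_id] = label

    def labeled(self):
        return [(self.documents[i], label) for i, label in self.labels.items()]

class WordVoteModel:
    """Toy classifier: each known word votes for the labels it was seen with."""

    def fit(self, labeled_docs):
        self.counts = defaultdict(Counter)
        for text, label in labeled_docs:
            for tok in set(tokenize(text)):
                self.counts[tok][label] += 1
        return self

    def predict(self, text):
        votes = Counter()
        for tok in tokenize(text):
            votes.update(self.counts.get(tok, {}))
        return votes.most_common(1)[0][0] if votes else None

corpus = FeedbackCorpus()
corpus.add_document("We must cut taxes and shrink government", "party_a")
corpus.add_document("We must expand healthcare and raise wages", "party_b")
model = WordVoteModel().fit(corpus.labeled())

new_text = "Protect the climate with clean energy"
before = model.predict(new_text)          # no known words yet, so no prediction

doc_id = corpus.add_document(new_text)    # a user adds a new document...
corpus.add_label(doc_id, "party_b")       # ...and later supplies a label
model = WordVoteModel().fit(corpus.labeled())  # refit on the enlarged corpus
after = model.predict("Clean energy creates jobs")
print(before, after)
```

Before the user's contribution the model has nothing to say about the new vocabulary; after one labeled document it does. Each deployed interaction becomes a chance to ingest more training data.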
Partisan Discourse: User Model
Over time, models evolve:
- Global model
- Local models
- User models
Pipeline: Data Loader → Text Normalization → Text Vectorization → Feature Decomposition → Estimator
Feature Union Pipeline: Data Loader → Feature Union → Estimator
Feature union components (diagram labels): Text Normalization, Document Features, Text Extraction, Summary Vectorization, Article Vectorization, Concept Features, Metadata Features, Dict Vectorizer
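The two diagrams above can be sketched with a few small classes. This is a plain-Python imitation of the fit/transform pattern, written only to show how steps chain and how a feature union concatenates the outputs of parallel extractors. In practice scikit-learn provides `Pipeline` and `FeatureUnion` classes in `sklearn.pipeline` for this purpose. Every class name, the length feature, and the toy data below are assumptions.

```python
import re

class Normalizer:
    """Text normalization: lowercase and tokenize each document."""
    def fit(self, docs):
        return self
    def transform(self, docs):
        return [re.findall(r"[a-z']+", d.lower()) for d in docs]

class CountVectorizerSketch:
    """Text vectorization: term counts over a vocabulary learned in fit."""
    def fit(self, token_lists):
        self.vocab = sorted({t for toks in token_lists for t in toks})
        return self
    def transform(self, token_lists):
        return [[toks.count(w) for w in self.vocab] for toks in token_lists]

class LengthFeatures:
    """A stand-in for document-level features: the token count."""
    def fit(self, token_lists):
        return self
    def transform(self, token_lists):
        return [[len(toks)] for toks in token_lists]

class MiniFeatureUnion:
    """Run several extractors on the same input and concatenate their columns."""
    def __init__(self, transformers):
        self.transformers = transformers
    def fit(self, X):
        for t in self.transformers:
            t.fit(X)
        return self
    def transform(self, X):
        parts = [t.transform(X) for t in self.transformers]
        return [sum(rows, []) for rows in zip(*parts)]

class NearestNeighbor:
    """Estimator: label of the closest training vector (squared Euclidean)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [self.y[min(range(len(self.X)), key=lambda i: dist(row, self.X[i]))]
                for row in X]

class MiniPipeline:
    """Chain transformers, ending in an estimator."""
    def __init__(self, steps):
        self.steps = steps
    def fit(self, X, y):
        for step in self.steps[:-1]:
            X = step.fit(X).transform(X)
        self.steps[-1].fit(X, y)
        return self
    def predict(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)

pipeline = MiniPipeline([
    Normalizer(),
    MiniFeatureUnion([CountVectorizerSketch(), LengthFeatures()]),
    NearestNeighbor(),
])
pipeline.fit(["good good food", "bad service"], ["pos", "neg"])  # toy data
print(pipeline.predict(["good food"]))
```

Because the whole chain is one object, the same normalization and vocabulary learned during fitting are applied at prediction time, which is the main reason pipelines matter in production.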
tl;dr
● Text is the next frontier in big data.
● Language-aware data products are:
○ Not academia, but informed by it.
○ Not automagic; they just feel that way.
● Machine learning is flexible; rules are not.
● Text comes with some unique requirements.
● Facilitate iteration with the model selection triple.
● Deployment is an opportunity to ingest more data.
● Pipelines are necessary for production.
Everyday NLP Applications
• Summarization
• Reference Resolution
• Machine Translation
• Language Generation
• Language Understanding
• Document Classification
• Author Identification
• Part of Speech Tagging
• Question Answering
• Information Extraction
• Information Retrieval
• Speech Recognition
• Sense Disambiguation
• Topic Recognition
• Relationship Detection
• Named Entity Recognition
Thank you!
@rebeccabilbro

A Practical Approach to Text Analysis and its Real-world Applications (Strata Hadoop Keynote)
