From Unstructured Text to
Structured Data
Dan Sullivan
MarTechPDX Meetup
February 21, 2018
“Text is unstructured”
Unstructured?
“Unstructured” to Structured
“We booked a Napa group
tour. Booking and pickup
was seamless. All very
professional. Rob our
driver was fantastic. He
has very deep roots in
Napa and is a wealth of
info.”
Identifying Structured Attributes of Text
● Sentiment Analysis
● Topic Modeling
● Named Entity Recognition
● Event Extraction
Sentiment Analysis
* Analysis of tone or opinion of a communication
* Polarity: {positive, neutral, negative}
* Categorization: {angry, pleased, confused …}
* Scale -10 … +10
Sentiment Analysis Techniques
● Keywords
● Affective Norms
for English Words
(ANEW)
Image Source: https://www.csc2.ncsu.edu/faculty/healey/tweet_viz/tweet_app/
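The keyword technique above can be sketched in a few lines. This is a minimal illustration, not a production scorer: the word lists `POSITIVE` and `NEGATIVE` and the function `keyword_polarity` are hypothetical names invented for this example, and a rated lexicon such as ANEW would replace the hand-picked sets.

```python
import re

# Hypothetical mini-lexicon chosen for this example only.
POSITIVE = {"seamless", "professional", "fantastic", "wealth"}
NEGATIVE = {"late", "rude", "dirty", "terrible"}


def keyword_polarity(text):
    """Return (polarity label, score clamped to the -10 ... +10 scale)."""
    words = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(word in POSITIVE for word in words)
    negative_hits = sum(word in NEGATIVE for word in words)
    score = positive_hits - negative_hits
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, max(-10, min(10, score))


review = (
    "We booked a Napa group tour. Booking and pickup was seamless. "
    "All very professional. Rob our driver was fantastic. "
    "He has very deep roots in Napa and is a wealth of info."
)
print(keyword_polarity(review))  # ('positive', 4)
```

Counting keywords ignores negation ("not fantastic") and context, which is one reason rated lexicons and trained models are often preferred.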
Topic Modeling
● Technique for identifying dominant themes in
documents
● Does not require labeled training data (unsupervised)
● Multiple Algorithms
○ Probabilistic Latent Semantic Indexing (PLSI)
○ Latent Dirichlet allocation (LDA)
● Assumptions
○ Each document is about a mixture of topics
○ Each word used in a document is attributable to a topic
Hotels, Cambodia,
pop-up restaurant
Image source: NYTimes.com
Cuisine, small plates,
Las Vegas
Cuisine, vegetables,
Alaska
Documents have Multiple Topics
● Topics represented by words; documents about a set of topics
○ Doc 1: 50% hotel, 50% beach
○ Doc 2: 25% cuisine, 30% Thailand, 45% vegetarian
○ Doc 3: 30% wine, 40% Napa, 30% roadway
● Learning Topics
○ Probability of topic given a document P(topic|doc)
○ Probability of word given a topic P(word|topic)
○ Reassign word to new topic with probability P(topic|doc) × P(word|topic)
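The learning steps above can be sketched as a small collapsed Gibbs sampler for LDA. This is an illustrative sketch under assumptions, not the method used in the talk: the function name `lda_gibbs`, the smoothing values `alpha` and `beta`, and the toy documents are all invented for this example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of non-empty token lists."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]           # topic counts per doc
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                          # words per topic
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            topic = rng.randrange(num_topics)
            doc_assignments.append(topic)
            doc_topic[d][topic] += 1
            topic_word[topic][word] += 1
            topic_total[topic] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Weight each topic by P(topic|doc) * P(word|topic), smoothed.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]

                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    # Smoothed topic proportions for each document.
    doc_mix = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    return doc_mix, topic_word


docs = [
    ["hotel", "beach", "hotel", "beach"],
    ["cuisine", "thailand", "vegetarian", "cuisine"],
    ["wine", "napa", "roadway", "wine"],
]
doc_mix, topic_word = lda_gibbs(docs, num_topics=2)
for row in doc_mix:
    print([round(p, 2) for p in row])
```

Each printed row is one document's estimated topic mixture. Topic numbers are arbitrary labels, so the output is interpreted by inspecting which words dominate each topic.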
Named Entity Recognition
● Process of identifying words and
phrases describing objects in
specific categories
● Common classes of entities:
○ Persons
○ Organizations
○ Geographic locations
○ Dates
○ Monetary amounts
Travel Review
"Space needle had beautiful views
of the city I will recommend it for
people of all ages, the Chihuahua
gardens were amazing an
unexpected bonus to my space
needle visit this are must see when
visiting Seattle"
Named Entity Recognition Techniques
● Linguistic – utilize structure of sentence
● Statistical – detect patterns in training data
● Custom patterns – regular expressions
● Dictionaries
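Two of the techniques above, dictionaries and custom patterns, are simple enough to sketch directly. The example below is a hedged illustration: the `GAZETTEER` entries, the labels, the two regular expressions, and the function `find_entities` are assumptions made for this example, not part of any particular NER library.

```python
import re

# Hypothetical dictionary (gazetteer) mapping known phrases to entity labels.
GAZETTEER = {
    "space needle": "LANDMARK",
    "seattle": "LOCATION",
}

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical custom patterns for two common entity classes.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
    "DATE": re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b"),
}


def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for phrase, label in GAZETTEER.items():
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]


review = (
    "Space needle had beautiful views of the city I will recommend it for "
    "people of all ages, the Chihuahua gardens were amazing an unexpected "
    "bonus to my space needle visit this are must see when visiting Seattle"
)
print(find_entities(review))
# [('Space needle', 'LANDMARK'), ('space needle', 'LANDMARK'), ('Seattle', 'LOCATION')]
```

Names missing from the dictionary, such as "Chihuahua gardens" in the review, go undetected, which is the main limitation that linguistic and statistical techniques aim to address.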
Event Extraction
● Entities and relations between entities
○ Company A acquires Company B
○ Engineer A filed patent application on Topic B on
Date C
○ Politician P announces A on Twitter on Date B
● Assign roles to entities
● Assign categories to entities
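As a rough sketch of the first example above, a single hand-written pattern can pull an acquisition event out of a sentence and assign a role to each entity. The `Event` type, the `ACQUISITION` pattern, and `extract_acquisition` are hypothetical names for this illustration; the pattern assumes entity names are runs of capitalized words, which real text often violates.

```python
import re
from typing import NamedTuple, Optional


class Event(NamedTuple):
    category: str   # kind of event
    acquirer: str   # role: the entity doing the acquiring
    target: str     # role: the entity being acquired


# Assumption: an entity name is one or more capitalized words.
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"
ACQUISITION = re.compile(
    r"(?P<acquirer>" + NAME + r") acquires (?P<target>" + NAME + r")"
)


def extract_acquisition(sentence: str) -> Optional[Event]:
    """Return an acquisition Event if the sentence matches, else None."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return Event("acquisition", match.group("acquirer"), match.group("target"))


print(extract_acquisition("Company A acquires Company B"))
# Event(category='acquisition', acquirer='Company A', target='Company B')
```

In practice, event extraction usually builds on named entity recognition rather than capitalization alone, with separate patterns or classifiers per event type.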
Biomedical Example
Methods and Purpose
Sentiment Analysis: What is the tone of the text?
Topic Modeling: What concepts are discussed?
Named Entity Recognition: What entities are mentioned?
Event Extraction: How are entities related?
Text Mining Tools
Q & A
Sentiment Analysis Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
Topic Model Use Cases
● Data exploration in large corpus
● Pre-classification analysis
● Identify dominant themes
Named Entity Recognition Use Cases
● Name normalization
● Entity correlation
● Quantified metrics based on texts
● Building block for event extraction
Background
● Worked in AI and natural
language processing prior to the AI
Winter
● Most recent work in NLP involved
text mining biomedical literature
● Focus on how to map
“unstructured” text to data
structures that are amenable to
structured analytics
