Deep Distillation from Natural Language Text

Deep Distillation from Text
Naveen Ashish
University of Southern California & Cognie Inc.,
March 18th 2014

This is about …..
“DEEP TEXT DISTILLATION”
The hard nut of having computers “understand” natural
language (text) ….
 Pushing the boundaries of what we can achieve ….
“It's (the problem of computers understanding natural language) ambitious ... in
fact there's no more important project than understanding intelligence and
recreating it.” - Ray Kurzweil (2013)
“Alan Turing based the Turing Test entirely on written language. … To really master
natural language … that’s the key to the Turing Test–to a human requires the full
scope of human intelligence. … So the point is that natural language is a very
profound domain to do artificial intelligence in.” - Ray Kurzweil (2013)

Why ….
 the problem is far from solved ….. !!!!
 unstructured data everywhere
95 % !
search
text analytics
big data analytics
health informatics
social-media intelligence

Introduction
About myself
Associate Professor (Informatics), Keck School of Medicine,
University of Southern California
Cognie Inc.,
Work leverages
Information extraction work and systems developed at UC Irvine
 XAR, UCI-PEP
Advisory consulting engagements with several companies and
start-ups

Outline
Deep distillation: What is and why
State-of-the-art
Fundamentals
Approach
Details
Expressions, Entities, Sentiment
Case studies
Retail, Health, Risk assessment
Conclusions

What is “Deep” text distillation ?

Data
Abstract
This paper describes the results of a
study investigating ….
…..
We conclude that salt and diabetes are
largely unrelated.

Deep Distillation
The abstract, not explicitly mentioned !
What falls in this category
Expressions
Contextual sentiment
Aspect classification
I think you need better chefs → SUGGESTION
The mocha is too sweet → NEGATIVE
I used to take Lipitor for … → PERSONAL EXPERIENCE
The dim lights have a cozy effect …. → AMBIENCE

A Common Intersection
Distill at sentence level
Aggregate to entire feedback, post, comment or
thread
Three primary elements
Expression/Intent
Entities/Aspects (and Classes)
Sentiment
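A minimal sketch of the sentence-level record and its roll-up, assuming hypothetical names (`Distillation`, `aggregate`) that are not taken from the COGNIE platform. The aspect and sentiment values in the sample are assigned by hand for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Distillation:
    """One distilled sentence, holding the three primary elements."""
    sentence: str
    expression: str                               # e.g. "COMPLAINT", "SUGGESTION"
    aspects: list = field(default_factory=list)   # e.g. ["Service"]
    sentiment: str = "NEUTRAL"                    # "POSITIVE", "NEGATIVE" or "NEUTRAL"


def aggregate(distilled):
    """Roll sentence-level results up to a whole feedback, post, comment or thread."""
    return {
        "expressions": Counter(d.expression for d in distilled),
        "aspects": Counter(a for d in distilled for a in d.aspects),
        "sentiment": Counter(d.sentiment for d in distilled),
    }


# Hand-labeled sample (labels are illustrative, not system output).
feedback = [
    Distillation("Service is slow", "COMPLAINT", ["Service"], "NEGATIVE"),
    Distillation("you should have more veggie options", "SUGGESTION", ["Food"]),
    Distillation("this is the best store on the west side", "ADVOCACY", [], "POSITIVE"),
]
summary = aggregate(feedback)
```

Counts like these are what feed summary statements such as the share of complaints in the overall feedback.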

Why Deeper ?
 Goal: Get actionable insights from data !
 Hypothesis: Deeper extraction → Better insights !
The top advice items advised for skin rash are aloe vera,
vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top
issues being slow service, drinks and coffee

Context
COGNIE™: A PLATFORM for text analytics
[Diagram: COGNIE™ platform with XAR and UCI-PEP; application areas SHIP, Survey Analytics, Retail Analytics, Risk Assessment]

Expressions
Beyond entities and sentiment: EXPRESSIONS
Introduced in [Ashish et al, 2011]

Expressions
HEALTH
You should try Vitamin E oil … → ADVICE
..I have had arthritis since 1991… → EXPERIENCE
..for me lipitor worked like a charm… → OUTCOME

Expressions
RETAIL/ENTERPRISE
…showers had no hot water !… → COMPLAINT
..you should have more veggie options… → SUGGESTION
..meats on special this weekend… → ANNOUNCEMENT
..this is the best store on the west side… → ADVOCACY
RISK ASSESSMENT
There is hardly any evidence to suggest a link between salt and diabetes → (-)
These results confirm that high intake of salt leads to increase in BP → (+)

Text Analytics Spectrum
Wide offering of
 Text analytics engines
 Text analysis tools – many open-source
Largely still for “spotting things”
 entities, concepts, sentiment, topics, emotions ….
Going deeper
 Luminoso
 Attensity (Intents)
Deep Learning for Sentiment
 Stanford
 Recursive Neural Networks

Approach
natural language processing
machine learning
semantics

Architecture: COGNIE™ Platform
NLP
 Segmentation
 POS Tagging
 Entity extraction
 Anaphora
 Parsing
 Gram analysis
Knowledge Engineering
 Existing (DMOZ, SNOMED, UMLS)
 Creation
 Declarative
Machine Learning
 Naïve-Bayes
 MaxEnt
 TFIDF
 CRF
 RNN Deep Learning
 ENSEMBLE

The Indicators: “Give Aways”
A combination of multiple types of elements !
…showers had no hot water !… → COMPLAINT
(You) should have more veggie options… → SUGGESTION
..i have been on lipitor… → EXPERIENCE
..this is the best store on the west side… → ADVOCACY

Approach: Given Indicators
NLP
Identification of individual elements
 Unsupervised
Relationships between elements
Semantics
Identification of individual elements
 Knowledge driven
Machine Learning Classification
Combine elements → classify

Natural Language Processing
 UIMA and GATE
 Stanford NLP Tools
POS tagging
 Parsing
 NE Recognizer
 Geo-tagger
 ….

Natural Language Processing
 Text Segmentation
In many cases the “unit” of distillation is a sentence
 Segmentation
 UIMA (or GATE)
 Custom
 Complex sentence segmentation
 Breakup into individual clauses
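A rough sketch of the "custom" segmentation idea using only regular expressions. This is an assumed heuristic for illustration, not the UIMA or GATE segmenter; the function names and the clause rule (split on commas, semicolons and "but") are hypothetical.

```python
import re

# Sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Clause boundary (assumed heuristic): comma, semicolon, or the word "but".
_CLAUSE_SPLIT = re.compile(r"\s*[,;]\s*|\s+but\s+")


def split_sentences(text):
    """Split raw text into sentences, the usual unit of distillation."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Break a complex sentence into individual clauses."""
    return [c.strip() for c in _CLAUSE_SPLIT.split(sentence) if c.strip()]


text = "Food was awesome, service needs improvement. You need to be open longer!"
sentences = split_sentences(text)
clauses = split_clauses(sentences[0])
```

Each clause can then be distilled separately, so that a sentence praising the food and criticizing the service yields two results.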

NLP
 Part-of-speech tags are key indicators
Expression distillation
 Entity extraction
Names, Locations, Organizations
 Parsing
If required
 Anaphora

NGram Analysis
Unigram and Bigram analysis
Obtain
Grams
Frequency
Entropy
Grams of tokens as well as POS Patterns
VB VBD
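A small sketch of gram analysis over labeled sentences. The slide does not define "entropy"; here it is assumed to mean the entropy of the class-label distribution for each gram, so a low value marks a gram that points to one class. The same function works on POS tag sequences in place of tokens.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of a token (or POS tag) sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gram_stats(labeled_sentences, n):
    """Return {gram: (frequency, entropy of its class-label distribution)}."""
    by_gram = defaultdict(Counter)
    for tokens, label in labeled_sentences:
        for gram in ngrams(tokens, n):
            by_gram[gram][label] += 1
    stats = {}
    for gram, counts in by_gram.items():
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        stats[gram] = (total, entropy)
    return stats


data = [
    ("you should try vitamin e oil".split(), "ADVICE"),
    ("you should have more veggie options".split(), "SUGGESTION"),
    ("showers had no hot water".split(), "COMPLAINT"),
]
bigrams = gram_stats(data, 2)
```

In this toy sample the bigram ("you", "should") occurs in two classes and so carries more entropy than ("no", "hot"), which occurs in one.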

Before Automated Classification: Manual
Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ
 spicy
LEX-EXCESS
 too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY 
POS-VB POS-MD ….
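One possible reading of SoL matching, as a sketch. It assumes ANY matches zero or more tokens and covers only the two lexicon labels; ONT- and POS- labels would need an ontology and a tagger. The word "sweet" is added to the lexicon for illustration.

```python
LEXICON = {
    "LEX-EXCESS": {"too", "very"},
    "LEX-FOODADJ": {"spicy", "sweet"},   # "sweet" added for illustration
}


def labels_for(token):
    """All lexicon labels that apply to a single token."""
    return {label for label, words in LEXICON.items() if token.lower() in words}


def matches(pattern, tokens):
    """True if the whole token sequence matches the label pattern.

    ANY is assumed to match zero or more tokens.
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])


PATTERN = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]
hit = matches(PATTERN, "The mocha is too sweet".split())
miss = matches(PATTERN, "Service is slow".split())
```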

Classification: Machine Learning
 Classification tasks
Expression
(Contextual) Sentiment
Aspect category
Frameworks
Weka
Mallet

Baseline Classifiers
 Mallet and Weka
NaiveBayes
MaxEnt
CRF
 Gram-based
Uni, Bi and Trigram features
Baseline
~ 10% accuracy
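To make the gram-based baseline concrete, here is a tiny multinomial Naive Bayes over unigrams in plain Python. It is a stand-in written for illustration, not the Mallet or Weka implementation, and the four training sentences are far too few to say anything about accuracy.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token counts."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)          # class prior
            for t in tokens:
                # Add-one (Laplace) smoothing so unseen words do not zero out a class.
                score += math.log((self.word_counts[label][t] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train = [
    ("showers had no hot water", "COMPLAINT"),
    ("service is slow", "COMPLAINT"),
    ("you should have more veggie options", "SUGGESTION"),
    ("you should be open longer", "SUGGESTION"),
]
model = NaiveBayes().fit([s.split() for s, _ in train], [label for _, label in train])
```

Bigram and trigram features can be added by passing gram strings alongside the tokens.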

Expression Classification: Features
 Features
Polar words
Punctuation
Ngrams
POS patterns
Length !
Beginning
Ontology
…

Classifiers
 Trees
Decision Tree (J48)
Functions
Logistic Regression
SVM
Sequence Tagging
CRF: Conditional Random Fields

Expression Classification: Results
Have achieved 75% precision and recall for all
expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering

Contextual Sentiment
 (Just) polar words can be misleading !
Polar words may not be present at all !
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow

Semantics: Ontologies
 Health
 Drugs
 Conditions
 Procedures
 Symptoms
 …
Retail (Dining)
 Food/Entrees
 Service
 Ambience
 ….

Leverage Existing Knowledge Sources
Health informatics
 UMLS
 NCI Thesaurus
 SNOMED
Retail
 DMOZ
Many others
 Freebase
 Wikipedia, DBPedia
OpenData
 data.gov

Knowledge Engineering Tools
“Mini” ontology creation
API access
Freebase
BioPortal
Wrappers
DMOZ, ….

Practical Requirements
Confidence Measures
Below threshold routed to manual transcription teams
Polarity
Snippets
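A sketch of confidence-based routing. The `Result` fields mirror the items on this slide; the threshold value of 0.7 and the sample scores are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    snippet: str        # text span that supports the label
    label: str          # e.g. "COMPLAINT"
    polarity: str       # e.g. "NEGATIVE"
    confidence: float   # classifier confidence in [0, 1]


def route(results, threshold=0.7):
    """Split results into automatically accepted and manually reviewed."""
    automatic, manual = [], []
    for r in results:
        (automatic if r.confidence >= threshold else manual).append(r)
    return automatic, manual


results = [
    Result("showers had no hot water", "COMPLAINT", "NEGATIVE", 0.93),
    Result("you need to be open longer", "SUGGESTION", "NEGATIVE", 0.55),
]
automatic, manual = route(results)
```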

COGNIE™: Open Source Tools
Framework
UIMA
Classification
Weka
Mallet
NLP
Stanford tools
Indexing
Lucene
Databases
MySQL, MongoDB
Knowledge Engineering
Protégé

Case Study: Health Informatics

Case Study: Retail & Survey Analytics
Feedback
 Direct, device collected
 Social-media
Typically short, few sentences
Strong requirement for aspect classification
 [Food,Service,Ambience,Pricing,Other]
Negative : “Immediate” vs “Long Term” classification
…food was awesome, service needs improvement ….
you need to be open longer !

Case Study: Risk Assessment
 Biomedical Literature Abstracts
Correlation direction (+ -)
Subject
Article type
Features
Clauses
Negation and Triggers
Semantic Heterogeneity

MapReduce
Throughput can be an issue
Complex language processing algorithms
Large ontologies in some cases
Hadoop MapReduce
[Kahn and Ashish, 2014]
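The map and reduce steps can be illustrated in-process in plain Python. This is not the Hadoop API and not the implementation from the cited work; `distill` is a trivial stand-in for the expensive per-sentence step.

```python
from collections import defaultdict


def distill(sentence):
    """Stand-in for the costly per-sentence distillation step."""
    return "SUGGESTION" if "should" in sentence.lower() else "OTHER"


def mapper(sentence):
    """Map: emit (expression label, 1) for each sentence."""
    yield distill(sentence), 1


def reducer(key, values):
    """Reduce: total the counts for one label."""
    return key, sum(values)


def run(sentences):
    groups = defaultdict(list)      # shuffle phase: group values by key
    for s in sentences:
        for key, value in mapper(s):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())


counts = run([
    "You should have more veggie options",
    "Showers had no hot water",
    "You should be open longer",
])
```

Because each sentence is distilled independently, the map step is the part that spreads across nodes.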

Conclusions
Deeper distillation from text is important
Can be achieved by
Detecting and combining multiple elements in text
 Feature engineering
 Knowledge engineering
 Classifier selection
Does not have to be perfect
Every domain, dataset has its nuances

thank you !
naveen.ashish@cognie.com
