1. Deep Distillation from Text
Naveen Ashish
University of Southern California & Cognie Inc.
March 18th 2014
2. This is about …..
“DEEP TEXT DISTILLATION”
The hard nut of having computers “understand” natural
language (text) ….
Pushing the boundaries of what we can achieve ….
"It's (the problem of computers understanding natural language) ambitious ...in
fact there's no more important project than understanding intelligence and
recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language….To really master
natural language …that’s the key to the Turing Test–to a human requires the full
scope of human intelligence. …So the point is that natural language is a very
profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
3. Why ….
the problem is far from solved ….. !!!!
unstructured data everywhere
95 % !
search
text analytics
big data analytics
health informatics
social-media intelligence
4. Introduction
About myself
Associate Professor (Informatics), Keck School of Medicine,
University of Southern California
Cognie Inc.
Work leverages
Information extraction work and systems developed at UC Irvine
XAR, UCI-PEP
Advisory consulting engagements with several companies and
start-ups
5. Outline
Deep distillation: What it is and why
State-of-the-art
Fundamentals
Approach
Details
Expressions, Entities, Sentiment
Case studies
Retail, Health, Risk assessment
Conclusions
8. Deep Distillation
The abstract, not explicitly mentioned !
What falls in this category
Expressions
Contextual sentiment
Aspect classification
I think you need better chefs SUGGESTION
The mocha is too sweet NEGATIVE
I used to take Lipitor for … PERSONAL EXPERIENCE
The dim lights have a cozy effect …. AMBIENCE
9. A Common Intersection
Distill at sentence level
Aggregate to entire feedback, post, comment or
thread
Three primary elements
Expression/Intent
Entities/Aspects (and Classes)
Sentiment
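A minimal sketch of how a sentence-level result and its roll-up to a whole post might be represented. The names `SentenceDistillation` and `aggregate`, and the labels on the sample sentences, are illustrative assumptions and are not taken from the talk.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SentenceDistillation:
    """Hypothetical record holding the three elements for one sentence."""
    text: str
    expression: str                               # e.g. "COMPLAINT", "SUGGESTION"
    aspects: list = field(default_factory=list)   # e.g. ["Food"]
    sentiment: str = "NEUTRAL"                    # "POSITIVE", "NEGATIVE" or "NEUTRAL"


def aggregate(sentences):
    """Roll sentence-level results up to a feedback item, post, comment or thread."""
    return {
        "expressions": Counter(s.expression for s in sentences),
        "aspects": Counter(a for s in sentences for a in s.aspects),
        "sentiment": Counter(s.sentiment for s in sentences),
    }


# Toy post made of two sentences; the labels are assigned by hand for illustration.
post = [
    SentenceDistillation("The mocha is too sweet", "COMPLAINT", ["Food"], "NEGATIVE"),
    SentenceDistillation("I think you need better chefs", "SUGGESTION", ["Food"]),
]
summary = aggregate(post)
```

Keeping the sentence as the unit means the aggregate is only a set of counts, so the same roll-up works for a single comment or a long thread.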
10. Why Deeper ?
Goal: Get actionable insights from data !
Hypothesis: Deeper extraction → Better insights !
The top items advised for skin rash are aloe vera,
vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top
issues being slow service, drinks and coffee
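Insights of this kind can be read off the distilled records with simple counting. The sketch below assumes each record already carries an `expression` and an `aspect` field; the function names and the sample records are invented for illustration.

```python
from collections import Counter


def expression_share(records, expression):
    """Percentage of records labeled with the given expression."""
    if not records:
        return 0.0
    hits = sum(1 for r in records if r["expression"] == expression)
    return 100.0 * hits / len(records)


def top_aspects(records, expression, n=3):
    """Most frequent aspects among records with the given expression."""
    counts = Counter(r["aspect"] for r in records if r["expression"] == expression)
    return [aspect for aspect, _ in counts.most_common(n)]


# Made-up records, not data from the talk.
records = [
    {"expression": "COMPLAINT", "aspect": "slow service"},
    {"expression": "COMPLAINT", "aspect": "slow service"},
    {"expression": "COMPLAINT", "aspect": "coffee"},
    {"expression": "ADVOCACY", "aspect": "store"},
]
share = expression_share(records, "COMPLAINT")
issues = top_aspects(records, "COMPLAINT")
```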
11. Context
COGNIE™: A PLATFORM for text analytics
[Diagram labels: COGNIE™; XAR; UCI-PEP; Ship Survey Analytics; Retail Analytics; Risk Assessment]
13. Expressions
HEALTH
You should try Vitamin E oil … ADVICE
..I have had arthritis since 1991… EXPERIENCE
..for me lipitor worked like a charm… OUTCOME
14. Expressions
RETAIL/ENTERPRISE
…showers had no hot water !… COMPLAINT
..you should have more veggie options… SUGGESTION
..meats on special this weekend… ANNOUNCEMENT
..this is the best store on the west side… ADVOCACY
RISK ASSESSMENT
There is hardly any evidence to suggest a link between salt and diabetes -
These results confirm that high intake of salt leads to increase in BP +
20. The Indicators: “Give Aways”
A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
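A rough sketch of spotting such indicators with regular expressions. The indicator names and patterns below are assumptions chosen to fit the four examples on this slide; they are not the indicator set used in the talk.

```python
import re

# Hypothetical indicator patterns, one per type of "give away".
INDICATORS = {
    "modal_should": re.compile(r"\bshould\b", re.I),
    "negation": re.compile(r"\b(no|not|never|hardly)\b", re.I),
    "first_person": re.compile(r"\b(i|me|my)\b", re.I),
    "second_person": re.compile(r"\byou\b", re.I),
    "superlative": re.compile(r"\b(best|worst)\b", re.I),
}


def find_indicators(sentence):
    """Return the sorted names of all indicators present in the sentence."""
    return sorted(name for name, pattern in INDICATORS.items()
                  if pattern.search(sentence))


complaint = find_indicators("showers had no hot water !")
suggestion = find_indicators("(You) should have more veggie options")
experience = find_indicators("i have been on lipitor")
advocacy = find_indicators("this is the best store on the west side")
```

No single indicator decides the expression; the output is meant to be combined with other elements as features for a classifier.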
21. Approach: Given Indicators
NLP
Identification of individual elements
Unsupervised
Relationships between elements
Semantics
Identification of individual elements
Knowledge driven
Machine Learning Classification
Combine elements, then classify
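The three steps can be sketched end to end on toy data. The talk does not name a classifier, so a small Naive Bayes model is swapped in here purely for illustration; `LEXICON`, `features`, `NaiveBayes` and the training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical knowledge source: terms mapped to concept classes.
LEXICON = {"lipitor": "DRUG", "vitamin": "REMEDY", "oil": "REMEDY",
           "arthritis": "CONDITION"}


def features(sentence):
    """Combine surface elements (words) with knowledge-driven elements (classes)."""
    tokens = sentence.lower().split()
    feats = {"w=" + t for t in tokens}
    feats |= {"c=" + LEXICON[t] for t in tokens if t in LEXICON}
    return feats


class NaiveBayes:
    """Tiny Naive Bayes over feature presence, with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.feat_counts[label].update(features(sentence))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        feats = features(sentence) & self.vocab   # ignore unseen features

        def score(label):
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            prior = math.log(self.class_counts[label] / total)
            return prior + sum(math.log((counts[f] + 1) / denom) for f in feats)

        return max(self.class_counts, key=score)


# Toy training set, invented for this sketch.
train_sentences = ["You should try vitamin E oil", "You should take aloe vera",
                   "I have had arthritis since 1991", "I have been on lipitor"]
train_labels = ["ADVICE", "ADVICE", "EXPERIENCE", "EXPERIENCE"]
model = NaiveBayes().fit(train_sentences, train_labels)
```

With this toy model, `model.predict("you should try oatmeal")` falls under ADVICE and `model.predict("I have been taking lipitor")` under EXPERIENCE.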
22. Natural Language Processing
UIMA and GATE
Stanford NLP Tools
POS tagging
Parsing
NE Recognizer
Geo-tagger
….
23. Natural Language Processing
Text Segmentation
In many cases the “unit” of distillation is a sentence
Segmentation
UIMA (or GATE)
Custom
Complex sentence segmentation
Break up into individual clauses
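A minimal sketch of a custom segmenter using only regular expressions. It stands in for the UIMA or GATE segmenters named above and makes simplifying assumptions: sentences end in `.`, `!` or `?`, and clauses are separated by commas, semicolons or the word "but". Abbreviations and nested clauses are not handled.

```python
import re


def split_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def split_clauses(sentence):
    """Break a complex sentence into clauses at commas, semicolons or 'but'."""
    parts = re.split(r"\s*(?:[,;]|\bbut\b)\s*", sentence)
    return [p.strip(" .!?") for p in parts if p.strip(" .!?")]


text = "The food was awesome, service needs improvement. You need to be open longer!"
sentences = split_sentences(text)
clauses = split_clauses(sentences[0])
```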
31. Expression Classification: Results
Have achieved 75% precision and recall for all
expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering
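For reference, per-expression precision and recall can be computed from gold and predicted labels as below. The function name and the label lists are made up for illustration and are not the evaluation data behind the reported figure.

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one label over aligned gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for four sentences.
gold = ["COMPLAINT", "COMPLAINT", "SUGGESTION", "ADVOCACY"]
predicted = ["COMPLAINT", "SUGGESTION", "SUGGESTION", "COMPLAINT"]
complaint_pr = precision_recall(gold, predicted, "COMPLAINT")
suggestion_pr = precision_recall(gold, predicted, "SUGGESTION")
```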
32. Contextual Sentiment
(Just) polar words can be misleading !
Polar words may not be present at all !
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
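One way to picture "combination of elements" is a lookup keyed on an aspect together with its descriptor, plus a rule for "too". The rule table below is a hand-written assumption covering only the examples on this slide; a real system would need learned or curated context rules.

```python
import re

# Hypothetical context rules: polarity depends on the aspect a descriptor modifies.
CONTEXT_POLARITY = {
    ("service", "slow"): "NEGATIVE",
    ("service", "fast"): "POSITIVE",
    ("aisles", "narrow"): "NEGATIVE",
    ("wait", "hour"): "NEGATIVE",
}


def contextual_sentiment(sentence):
    """Assign sentiment from element combinations rather than polar words alone."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # "too" followed by another word is treated as a sign of excess.
    if "too" in tokens[:-1]:
        return "NEGATIVE"
    for (aspect, descriptor), polarity in CONTEXT_POLARITY.items():
        if aspect in tokens and descriptor in tokens:
            return polarity
    return "NEUTRAL"


examples = ["The mocha is too sweet", "Wait time is over an hour",
            "Aisles are too narrow", "Service is slow"]
labels = [contextual_sentiment(s) for s in examples]
```

Note that "sweet" on its own would usually count as a positive word, which is why the combination with "too" matters here.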
42. Case Study: Retail & Survey Analytics
Feedback
Direct, device collected
Social-media
Typically short, few sentences
Strong requirement for aspect classification
[Food,Service,Ambience,Pricing,Other]
Negative : “Immediate” vs “Long Term” classification
…food was awesome, service needs improvement ….
you need to be open longer !
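A keyword-based sketch of aspect classification applied clause by clause, so that one piece of feedback can yield several aspects. The keyword sets are assumptions made up for this example; only the five aspect names come from the slide.

```python
import re

# Hypothetical keyword lists for four of the five aspects; "Other" is the fallback.
ASPECT_KEYWORDS = {
    "Food": {"food", "mocha", "coffee", "veggie", "chefs"},
    "Service": {"service", "wait", "staff"},
    "Ambience": {"lights", "cozy", "music"},
    "Pricing": {"price", "prices", "expensive", "cheap"},
}


def classify_aspect(clause):
    """Return the first aspect whose keywords appear in the clause."""
    tokens = set(re.findall(r"[a-z]+", clause.lower()))
    for aspect, keywords in ASPECT_KEYWORDS.items():
        if tokens & keywords:
            return aspect
    return "Other"


def aspects_by_clause(feedback):
    """Split feedback on punctuation and classify each clause separately."""
    clauses = [c.strip() for c in re.split(r"[,;.!?]", feedback) if c.strip()]
    return [(c, classify_aspect(c)) for c in clauses]


mixed = aspects_by_clause("food was awesome, service needs improvement")
hours = aspects_by_clause("you need to be open longer !")
```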
43. Case Study: Risk Assessment
Biomedical Literature Abstracts
Correlation direction (+ -)
Subject
Article type
Features
Clauses
Negation and Triggers
Semantic Heterogeneity
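A small sketch of how triggers and negation could be combined to label correlation direction, using the same + and - marks as the risk assessment examples. The trigger and negation patterns are assumptions for illustration and cover only a handful of phrasings.

```python
import re

# Hypothetical trigger phrases that signal a stated relationship.
TRIGGERS = re.compile(r"\b(leads? to|link|associated with|causes?|increases?)\b", re.I)
# Hypothetical negation cues that reverse or deny the relationship.
NEGATIONS = re.compile(r"\b(no|not|hardly any|lack of)\b", re.I)


def correlation_direction(sentence):
    """Return '+', '-' or None when no relationship trigger is found."""
    if not TRIGGERS.search(sentence):
        return None
    return "-" if NEGATIONS.search(sentence) else "+"


negative = correlation_direction(
    "There is hardly any evidence to suggest a link between salt and diabetes")
positive = correlation_direction(
    "These results confirm that high intake of salt leads to increase in BP")
```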
45. MapReduce
Throughput can be an issue
Complex language processing algorithms
Large ontologies in some cases
Hadoop MapReduce
[Kahn and Ashish, 2014]
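The map and reduce steps can be sketched locally in plain Python. This is not the job layout from the cited work, which the slide does not describe; `distill` is a trivial stand-in for the expensive per-sentence processing, and `run_local` imitates the grouping that Hadoop would perform between the two phases.

```python
from collections import defaultdict


def distill(sentence):
    """Stand-in for the costly per-sentence distillation step (hypothetical)."""
    return "SUGGESTION" if "should" in sentence.lower() else "OTHER"


def mapper(sentence):
    """Emit one (label, 1) pair per sentence; sentences are independent."""
    yield distill(sentence), 1


def reducer(key, values):
    """Sum the counts collected for one label."""
    return key, sum(values)


def run_local(sentences):
    """Simulate map, shuffle and reduce in a single process."""
    grouped = defaultdict(list)
    for sentence in sentences:
        for key, value in mapper(sentence):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())


counts = run_local([
    "You should have more veggie options",
    "Showers had no hot water",
    "You should open longer",
])
```

Because each sentence is processed independently in the map step, the heavy language processing is the part that spreads across machines.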
47. Conclusions
Deeper distillation from text is important
Can be achieved by
Detecting and combining multiple elements in text
Feature engineering
Knowledge engineering
Classifier selection
Does not have to be perfect
Every domain, dataset has its nuances