1. Deep Machine Reading: Taming Unstructured, Natural Language Data
Naveen Ashish
University of Southern California & Cognie Inc.,
BigDataTECHCON, San Francisco, October 29th, 2014
2. This is about …..
DEEP MACHINE READING
The hard nut of having computers “understand” natural language (text) ….
Pushing the boundaries of what we can achieve ….
3. A True AI Challenge
"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)
“Alan Turing based the Turing Test entirely on written language …. To really master natural language … that’s the key to the Turing Test – to a human requires the full scope of human intelligence. … So the point is that natural language is a very profound domain to do artificial intelligence in.” - Ray Kurzweil (2013)
“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014
4. Commercial Relevance Today
The problem of taming unstructured data is far from solved!
search
text analytics
big data analytics
health informatics
social-media intelligence
mining research literature
5. Cognie Inc.
Incorporated in 2006
High-end consulting for semantic-search
Focus is on machine reading technologies
Work leverages
Information extraction work and systems conceptualized as part of university research
XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)
PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey, 2014)
Team
Developers, Student interns, Researchers
Blog
http://cognie.blog.com
Today
Building custom text analytics engines
6. Model
Build custom text understanding engines for domains
Cognie™ Platform for Building Text Analytics Engines
Retail Text Engine
Health NLP Engine
Research Mining Engine
Customization, Application Integration, Evolution
7. Outline
Deep machine reading: What is, and why needed
State-of-the-art
Fundamentals
Approach
Details
Case studies
Retail, Health, Risk assessment, Customer support, Intelligence
Conclusions
9. Deep Machine Reading is ….
The ability to distill the abstract from text
The ability to comprehensively extract multiple concepts and relationships from the text
The ability to link extracted elements to known concepts
The ability to use the text (data) itself, to improve understanding of that text
10. The Abstract, in Text
The abstract: not explicitly mentioned!
What falls in this category
Expressions
Contextual sentiment
Aspects or Categories
I think you need better chefs SUGGESTION
The mocha is too sweet NEGATIVE
I used to take Lipitor for … PERSONAL EXPERIENCE
The dim lights have a cozy effect …. AMBIENCE
11. Classification, rather than Extraction
Much of the technology, up to recently, is extraction focused
Extract particular terms, elements, concepts from the text
Extraction
Named-Entity extraction
PERSONS, ORGANIZATIONS, LOCATIONS, …
Sentiment extraction
Based on polar words
Need for much more sophisticated classification of text snippets
Along different dimensions of interest
12. A Comprehensive Signature of Text
Cognie experience
Many applications have unique requirements of what they want from the text
“ …and for six months I was indeed taking Lipitor but I must say ….” PERSONAL EXPERIENCE
“…there is direct correlation between Cadmium exposure and lung …” CAUSALITY
But, many groups of applications have common requirements within
Primary elements required from text
Expressions
Entities
Sentiment
Contextual
Qualified
Emotion
Topics
Categories/Aspects
Specific signal (“directionality”)
Relationships
13. Deeper Text Analysis → Better Insights
Goal: Get actionable insights from data!
Hypothesis: Deeper extraction → Better insights!
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee
73% of all research articles indicate that Cadmium is a causal factor for lung irritation
14. Context
COGNIE™: A PLATFORM for text analytics
COGNIE™
XAR
UCI-PEP
SHIP
SURVEY ANALYTICS
RETAIL ANALYTICS
RISK ASSESSMENT
15. Modus Operandi
All applications require a structured representation of the (unstructured) data
A structured database/meta-base that powers
Analytics dashboards
Data coding processes
Risk assessment computations
Consumer health portals
….
Manual extraction processes are typically in place
Goal is to eliminate or alleviate manual effort
16. Text Analytics Spectrum
Gamut of Text Analytics Engines in Market
• Lexalytics
• Alchemy API
• Semantria
• Clarabridge
• ConveyAPI
• Linguamatics
• ….
Engines Aiming Deeper
• Luminoso
• Attensity
• …
Availability of Open-source Text Analysis Tools
• UIMA
• GATE
• Deep Learning for Sentiment Analysis (Stanford)
• Recursive Neural Networks
• http://openair.allenai.org
21. Step 0: Basic Text Analysis
Text Segmentation
In many cases the “unit” of distillation is a sentence
Segmentation strategies
Built-in, such as in UIMA or GATE
Custom segmentation
Sentence decomposition
Decompose sentence into individual clauses
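A minimal sketch of custom segmentation and clause decomposition, assuming plain English text. The regular expressions and the clause markers (comma, semicolon, "but", "and") are illustrative assumptions, not the rules used in UIMA, GATE, or the Cognie engines; a production system would more likely rely on a parser.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Hypothetical clause markers; a real engine would use a parser instead.
_CLAUSE_SPLIT = re.compile(r"\s*(?:,|;|\bbut\b|\band\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Return the non-empty clauses of a single sentence."""
    return [c.strip() for c in _CLAUSE_SPLIT.split(sentence) if c.strip()]


review = "The mocha is too sweet, but the service was quick. Showers had no hot water!"
sentences = split_sentences(review)
clauses = split_clauses(sentences[0])
```

Each clause can then be classified on its own, which matters when one sentence mixes, say, a complaint and a compliment.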
23. Expressions
…showers had no hot water !… COMPLAINT
..you should have more veggie options… SUGGESTION
RETAIL/ENTERPRISE
..meats on special this weekend… ANNOUNCEMENT
..this is the best store on the west side… ADVOCACY
There is hardly any evidence to suggest a link between salt and diabetes (-)
These results confirm that high intake of salt leads to increase in BP (+)
RISK ASSESSMENT
24. Expressions
You should try Vitamin E oil … ADVICE
..I have had arthritis since 1991… EXPERIENCE
HEALTH
..for me lipitor worked like a charm… OUTCOME
25. The Indicators: “Give Aways”
A combination of multiple types of elements!
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
26. Approach: Given Indicators
NLP
Identification of individual elements
Unsupervised
Relationships between elements
Semantics
Identification of individual elements
Knowledge driven
Machine Learning Classification
Combine elements → classify
27. Expression Classification: Relevant Features
Curated lexicons of specific indicative phrases
Examples
“could you”, “I took”, ….
Approach
Manual creation of “seed” lexicons
Automated expansion from data plus resource such as WordNet
The Sentiment
For instance a Complaint would almost always have negative sentiment
Punctuations, Other expressions or emoticons
28. Expression Classification Features
Positional information of words, phrases, or part-of-speech patterns in the sentence
Suggestions will usually begin with certain ‘request’ words
Custom patterns
Such as subject-verb-object for PERSONAL EXPERIENCE
Ontology concepts
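The feature types listed on the last two slides can be sketched as a small feature extractor. This is an illustration under stated assumptions: the seed lexicons, the negative word list, and the feature names are hypothetical, and matching is naive substring matching rather than the curated, WordNet-expanded lexicons described above.

```python
# Hypothetical seed lexicons; real ones would be curated and expanded.
SEED_LEXICONS = {
    "request": ["could you", "you should", "should have"],
    "experience": ["i took", "i have had", "i have been on"],
}
# Hypothetical stand-in for a sentiment component.
POLAR_NEGATIVE = {"slow", "cold", "rude", "no"}


def expression_features(sentence):
    """Map one sentence to a dict of simple boolean indicator features."""
    lowered = sentence.lower()
    tokens = lowered.replace("!", " ").replace(".", " ").split()
    features = {}
    for name, phrases in SEED_LEXICONS.items():
        # Lexicon feature: does any indicative phrase occur anywhere?
        features["lex_" + name] = any(p in lowered for p in phrases)
        # Positional feature: does the sentence begin with such a phrase?
        features["starts_" + name] = any(lowered.startswith(p) for p in phrases)
    features["has_negative_word"] = any(t in POLAR_NEGATIVE for t in tokens)
    features["has_exclamation"] = "!" in sentence
    return features


suggestion = expression_features("You should have more veggie options")
complaint = expression_features("Showers had no hot water!")
```

The resulting dictionaries can be fed to any of the classifiers named on the following slides.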
29. Expression Classification: Results
Have achieved 75% precision and recall for all expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering
30. Before Automated Classification: Manual Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ
spicy
LEX-EXCESS
too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY → Negative
POS-VB POS-MD * → Suggestion
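A sketch of how a Sequence-of-Labels pattern might be matched against a tokenized sentence. It assumes that ANY absorbs zero or more tokens and that every other pattern element must match a label of exactly one token; these semantics, the extra lexicon entry "sweet", and the function names are assumptions for illustration. ONT- and POS- labels are omitted because they would need an ontology and a tagger.

```python
# Lexicon labels from the slide; "sweet" is added here for the example.
LEXICONS = {"LEX-FOODADJ": {"spicy", "sweet"}, "LEX-EXCESS": {"too", "very"}}


def labels_for(token):
    """Return every lexicon label that applies to a token."""
    return {name for name, words in LEXICONS.items() if token in words}


def matches(pattern, tokens):
    """Check a label pattern in which ANY absorbs zero or more tokens."""
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        # Try every possible number of tokens for the wildcard.
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    if not tokens or head not in labels_for(tokens[0]):
        return False
    return matches(rest, tokens[1:])


NEGATIVE = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]
is_negative = matches(NEGATIVE, "the mocha is too sweet".split())
is_plain = matches(NEGATIVE, "the mocha is sweet".split())
```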
32. Baseline Classifiers for Expressions
Mallet and Weka
NaiveBayes
MaxEnt
CRF
Gram-based
Uni, Bi and Trigram features
Baseline
~ 10% accuracy
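For illustration, a gram-based baseline can be sketched in plain Python as a multinomial Naive Bayes over unigram, bigram and trigram counts. This is not the Mallet or Weka setup used in the talk; the class, the add-one smoothing, and the four toy training sentences are assumptions made to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all uni-, bi- and trigrams of a lowercased sentence."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)
        self.priors = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict(self, text):
        def score(label):
            total = sum(self.counts[label].values()) + len(self.vocab)
            # Unnormalised prior; the shared constant cancels in the argmax.
            log_prob = math.log(self.priors[label])
            for gram in ngrams(text):
                log_prob += math.log((self.counts[label][gram] + 1) / total)
            return log_prob
        return max(self.priors, key=score)


# Toy training data, partly adapted from the slide examples.
train = [
    ("showers had no hot water", "COMPLAINT"),
    ("service is slow", "COMPLAINT"),
    ("you should have more veggie options", "SUGGESTION"),
    ("could you add more parking", "SUGGESTION"),
]
model = NaiveBayes().fit(*zip(*train))
prediction = model.predict("you should have more chefs")
```

A baseline like this sees only surface grams, which is one reason the richer lexicon, positional and ontology features matter.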
33. Expression Classifiers
Trees
Decision Tree (J48)
Functions
Logistic Regression
SVM
Sequence Tagging
CRF: Conditional Random Fields
34. Entities
Named-entity extractors
The generic PERSON, ORGANIZATION, LOCATION
N-gram and part-of-speech analysis
Frequently mentioned ‘entities’
Improves recall
Ontology driven concept mapping
Using pre-assembled domain ontologies/taxonomies/dictionaries
Based on modules like UIMA ConceptMapper
Scale is a challenge
35. Contextual Sentiment
(Just) polar words can be misleading!
Polar words may not be present at all!
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
36. Qualified Sentiment
Classify negative comments
Further segregate into
Immediately actionable items
‘Long term’ issues
Approach
Curation of N-grams for each type of negative comments
Classifier
37. Topic Mining
Motivated by feedback survey analytics
People can talk about “anything”
Interested in broad ‘topics’ of discussion
But the set of topics is dynamic, not necessarily known
Unsupervised topic mining
LDA: Latent Dirichlet Allocation
As-is, led to very fragmented topics that were not semantically meaningful
Solution: consolidation of terms using WordNet
Expand terms using WordNet synonyms
Consolidate with manual curation after
Semi-automated approach
38. Cohesive Topic Mining
Problem with WordNet (synonym) expansion
Prone to semantic divergence
Example
Presentation → Project(or) → Milestones
(Almost) strongly connected components in relationship graph
Manual review after
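One way to read the strongly-connected-components idea: treat synonym expansion as a directed graph and keep terms in one topic only when they reach each other in both directions, so a one-way drift such as presentation to projector to project does not merge topics. The sketch below uses Kosaraju's algorithm for strict components; the "almost" relaxation mentioned on the slide is not implemented, and the edge list is a hypothetical stand-in for WordNet lookups.

```python
from collections import defaultdict

# Hypothetical directed "expands-to" edges, as a thesaurus lookup might give.
EDGES = [
    ("presentation", "talk"), ("talk", "presentation"),
    ("presentation", "projector"), ("projector", "project"),
    ("project", "milestones"), ("milestones", "project"),
]


def strongly_connected_components(edges):
    """Kosaraju's algorithm: return a list of components (sets of terms)."""
    graph, reverse = defaultdict(list), defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    def visit(node, adjacency, seen, out):
        # Iterative depth-first search that appends nodes in finish order.
        stack = [(node, iter(adjacency[node]))]
        seen.add(node)
        while stack:
            current, neighbours = stack[-1]
            nxt = next((n for n in neighbours if n not in seen), None)
            if nxt is None:
                stack.pop()
                out.append(current)
            else:
                seen.add(nxt)
                stack.append((nxt, iter(adjacency[nxt])))

    # Pass 1: record finish order on the original graph.
    order, seen = [], set()
    for node in sorted(nodes):
        if node not in seen:
            visit(node, graph, seen, order)

    # Pass 2: collect components on the reversed graph.
    components, seen = [], set()
    for node in reversed(order):
        if node not in seen:
            members = []
            visit(node, reverse, seen, members)
            components.append(set(members))
    return components


topics = strongly_connected_components(EDGES)
```

Here "projector" ends up alone, so the presentation topic and the project topic stay separate and can go to manual review as distinct groups.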
40. Categories over Topics
Consolidate topics into broad, fixed categories
Ontology mapping approach
Each category has associated concepts
Topic signature maps to category concepts
Hershey
Bieber
Cocoa beans
Personnel
Competitors
Yearly reviews
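A sketch of the ontology mapping step, assuming each category is simply a set of associated concept strings and that a topic signature is a list of terms. The concept sets below, including the "Ingredients" category, and the overlap score are hypothetical choices for illustration.

```python
# Hypothetical category ontology: each category lists associated concepts.
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey", "rival", "competitor"},
    "Personnel": {"staff", "manager", "hiring", "yearly reviews"},
    "Ingredients": {"cocoa beans", "sugar", "milk"},
}


def map_topic_to_category(signature):
    """Pick the category whose concepts overlap most with the topic terms."""
    terms = {t.lower() for t in signature}
    if not terms:
        return (None, 0.0)
    scores = {
        name: len(terms & concepts) / len(terms)
        for name, concepts in CATEGORY_CONCEPTS.items()
    }
    best = max(scores, key=scores.get)
    # Report no category when nothing overlaps at all.
    return (best, scores[best]) if scores[best] > 0 else (None, 0.0)


category, score = map_topic_to_category(["Hershey", "competitor", "pricing"])
```

The score doubles as a rough confidence value: a topic that matches no category can be left unassigned instead of being forced into one.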
41. Emotion Extraction
Plutchik wheel of emotions
Fundamental emotion concepts captured in ontology
Augmented with indicator terms, and their synonyms
Ontology driven extraction for emotion concepts
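Ontology-driven emotion extraction can be sketched as a lookup from indicator terms to emotion concepts. Only four of Plutchik's eight basic emotions are shown, and the indicator terms are hypothetical examples; a real ontology would also carry synonyms and multi-word expressions.

```python
import re

# A few of Plutchik's basic emotions with hypothetical indicator terms.
EMOTION_INDICATORS = {
    "joy": {"happy", "delighted", "love"},
    "anger": {"angry", "furious", "outraged"},
    "fear": {"afraid", "scared", "worried"},
    "sadness": {"sad", "disappointed", "unhappy"},
}


def extract_emotions(text):
    """Return {emotion: sorted matched indicator terms} for one text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {
        emotion: sorted(tokens & terms)
        for emotion, terms in EMOTION_INDICATORS.items()
        if tokens & terms
    }


emotions = extract_emotions("I was worried at first, but I love the new store!")
```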
45. Knowledge Engineering Tools
Getting available ontologies into usable formats
Available as database dumps, RDF, or Web data
“Mini” ontology creation
Curate manually when possible (small dictionaries)
Example: list of competitors
API access
Freebase https://www.freebase.com/query
Query using ‘MQL’ – Metaweb Query Language (SPARQL-like)
BioPortal http://data.bioontology.org/documentation
Provided sometimes by customer!
46. Practical Requirements
Confidence Measures
Quantitative confidence score for extracted elements
Binary confidence Y/N
Not confident → Routed for manual review
‘Explanation’ for classification
Relevant snippets
“….and the checkout times continue to be long despite …” Complaint
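The confidence and explanation requirements can be sketched as a record that carries its supporting snippet, plus a threshold that turns the quantitative score into a binary accept or review decision. The 0.7 threshold, the field names, and the scores are assumptions for illustration only.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.7  # hypothetical cut-off, tuned per application


@dataclass
class Extraction:
    snippet: str       # relevant text that explains the decision
    label: str         # e.g. "Complaint"
    confidence: float  # classifier score in [0, 1]


def route(extractions, threshold=REVIEW_THRESHOLD):
    """Split extractions into auto-accepted and manual-review queues."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    review = [e for e in extractions if e.confidence < threshold]
    return accepted, review


items = [
    Extraction("checkout times continue to be long", "Complaint", 0.92),
    Extraction("the lights are dim", "Complaint", 0.41),
]
accepted, review = route(items)
```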
47. Feedback Learning Mechanisms
Manual overview is not dismissed entirely
Comprehensive pipeline for manual review
Learn and improve from feedback
51. Scalability
Scale requirements
Large numbers of documents as opposed to large document size
Throughput can be an issue
Complex language processing algorithms
Feature extraction can be complex
Large ontologies in some cases
Solutions
Multi-threading and Thread pooling architecture
Hadoop MapReduce [Kahn and Ashish, 2014]
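A minimal sketch of the thread pooling idea for many small documents, using Python's standard library. The analyze function is a placeholder for a real per-document pipeline, and the worker count is arbitrary. In Python, threads mainly help when stages wait on I/O (ontology services, databases); CPU-heavy stages would more likely use processes or a cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(document):
    """Stand-in for a per-document pipeline; here it only counts tokens."""
    return {"length": len(document.split())}


def analyze_corpus(documents, workers=4):
    """Run analyze over many small documents with a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map returns results in the same order as the input documents.
        return list(pool.map(analyze, documents))


results = analyze_corpus(["service is slow", "the mocha is too sweet"])
```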
53. Grand Challenge Projects
Aristo
At AI2, the Allen Institute for Artificial Intelligence
http://www.allenai.org
Areas
Knowledge Extraction
Reasoning
Question Answering
Can the system answer 4th, 6th grade exams?
Project NELL
Never Ending Language Learning
http://rtw.ml.cmu.edu/rtw/
“Learnt” 50+ million facts from Web data
54. Conclusions
Deeper distillation from text is required
Can be achieved by
Detecting and combining multiple elements in text
Feature engineering
Knowledge engineering
Classifier selection
Semantics and Knowledge Engineering is key
Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains