Deep Machine Reading

Deep Machine Reading: Taming Unstructured, Natural Language Data
Naveen Ashish
University of Southern California & Cognie Inc.,
BigDataTECHCON, San Francisco, October 29th, 2014

This is about …..
DEEP MACHINE READING
The hard nut of having computers “understand” natural language (text) ….
Pushing the boundaries of what we can achieve ….

A True AI Challenge
"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language …. To really master natural language … that’s the key to the Turing Test – to a human requires the full scope of human intelligence. … So the point is that natural language is a very profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014

Commercial Relevance Today
the problem of taming unstructured data is far from solved ….. !!!!
search
text analytics
big data analytics
health informatics
social-media intelligence
mining research literature

Cognie Inc.
Incorporated in 2006
High-end consulting for semantic-search
Focus is on machine reading technologies
Work leverages
Information extraction work and systems conceptualized as part of university research
XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)
PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey, 2014)
Team
Developers, Student interns, Researchers
Blog
http://cognie.blog.com
Today
Building custom text analytics engines

Model
Build custom text understanding engines for domains
Cognie™ Platform for Building Text Analytics Engines
Retail Text Engine
Health NLP Engine
Research Mining Engine
Customization, Application Integration, Evolution

Outline
Deep machine reading: What is, and why needed
State-of-the-art
Fundamentals
Approach
Details
Case studies
Retail, Health, Risk assessment, Customer support, Intelligence
Conclusions

What is “Deep” machine reading ?

Deep Machine Reading is ….
The ability to distill the abstract from text
The ability to comprehensively extract multiple concepts and relationships from the text
The ability to link extracted elements to known concepts
The ability to use the text (data) itself, to improve understanding of that text

The Abstract, in Text
The abstract, not explicitly mentioned !
What falls in this category
Expressions
Contextual sentiment
Aspects or Categories
I think you need better chefs → SUGGESTION
The mocha is too sweet → NEGATIVE
I used to take Lipitor for … → PERSONAL EXPERIENCE
The dim lights have a cozy effect … → AMBIENCE

Classification, rather than Extrication
Much of the technology, up to recently, is extrication focused
Extricate particular terms, elements, concepts from the text
Extrication
Named-Entity extraction
PERSONS, ORGANIZATIONS, LOCATIONS, …
Sentiment extraction
Based on polar words
Need for much more sophisticated classification of text snippets
Along different dimensions of interest

A Comprehensive Signature of Text
Cognie experience
Many applications have unique requirements of what they want from the text
“… and for six months I was indeed taking Lipitor but I must say …” → PERSONAL EXPERIENCE
“… there is direct correlation between Cadmium exposure and lung …” → CAUSALITY
But, many groups of applications have common requirements within
Primary elements required from text
Expressions
Entities
Sentiment
Contextual
Qualified
Emotion
Topics
Categories/Aspects
Specific signal (“directionality”)
Relationships

Deeper Text Analysis → Better Insights
Goal: Get actionable insights from data!
Hypothesis: Deeper extraction → Better insights!
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top issues being slow service, drinks and coffee
73% of all research articles indicate that Cadmium is a causal factor for lung irritation

Context
COGNIE™: A PLATFORM for text analytics
COGNIE™
XAR
UCI-PEP
SHIP
SURVEY ANALYTICS
RETAIL ANALYTICS
RISK ASSESSMENT

Modus Operandi
All applications require a structured representation of the (unstructured) data
A structured database/meta-base that powers
Analytics dashboards
Data coding processes
Risk assessment computations
Consumer health portals
….
Manual extraction processes are typically in place
Goal is to eliminate or alleviate manual effort

Text Analytics Spectrum
Gamut of Text Analytics Engines in Market
•Lexalytics
•Alchemy API
•Semantria
•Clarabridge
•ConveyAPI
•Linguamatics
•….
Engines Aiming Deeper
•Luminoso
•Attensity
•…
Availability of Open-source Text Analysis Tools
•UIMA
•GATE
•Deep Learning for Sentiment Analysis (Stanford)
•Recursive Neural Networks
•http://openair.allenai.org

Approach
natural language processing
machine learning
semantics

Architecture: COGNIE™ Platform
NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis
Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative
Machine Learning: Naïve Bayes, MaxEnt, TF-IDF, CRF, RNN Deep Learning, ENSEMBLE

COGNIE™: Open-source Leverage
Framework: UIMA
Classification: Weka, Mallet
NLP: Stanford CoreNLP
Indexing: Lucene
Databases: MySQL, MongoDB
Knowledge Engineering: Protégé
Topic mining: Mallet
Sentiment: Stanford Deep Learner

Step 0: Basic Text Analysis
Text Segmentation
In many cases the “unit” of distillation is a sentence
Segmentation strategies
Built-in, such as in UIMA or GATE
Custom segmentation
Sentence decomposition
Decompose sentence into individual clauses

Expressions
Beyond entities and sentiment: EXPRESSIONS
Introduced in [Ashish et al, 2011]

Expressions
RETAIL/ENTERPRISE
…showers had no hot water!… → COMPLAINT
…you should have more veggie options… → SUGGESTION
…meats on special this weekend… → ANNOUNCEMENT
…this is the best store on the west side… → ADVOCACY
RISK ASSESSMENT
There is hardly any evidence to suggest a link between salt and diabetes → (-)
These results confirm that high intake of salt leads to increase in BP → (+)

Expressions
HEALTH
You should try Vitamin E oil … → ADVICE
…I have had arthritis since 1991… → EXPERIENCE
…for me Lipitor worked like a charm… → OUTCOME

The Indicators: “Give Aways”
A combination of multiple types of elements!
…showers had no hot water!… → COMPLAINT
(You) should have more veggie options… → SUGGESTION
…I have been on Lipitor… → EXPERIENCE
…this is the best store on the west side… → ADVOCACY

Approach: Given Indicators
NLP
Identification of individual elements
Unsupervised
Relationships between elements
Semantics
Identification of individual elements
Knowledge driven
Machine Learning Classification
Combine elements classify

Expression Classification: Relevant Features
Curated lexicons of specific indicative phrases
Examples
“could you”, “I took”, ….
Approach
Manual creation of “seed” lexicons
Automated expansion from data plus resources such as WordNet
The Sentiment
For instance a Complaint would almost always have negative sentiment
Punctuation, other expressions, or emoticons

Expression Classification Features
Positional information of words, phrases, or part-of-speech patterns in the sentence
Suggestions will usually begin with certain ‘request’ words
Custom patterns
Such as subject-verb-object for PERSONAL EXPERIENCE
Ontology concepts
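
The feature types listed above can be sketched as a small feature extractor. This is an illustration under stated assumptions: the seed lexicons, the polar-word list, and the feature names are hypothetical, and part-of-speech and ontology features are left out to keep the sketch self-contained.

```python
import re

# Hypothetical "seed" lexicons of indicative phrases; in practice these would be
# curated manually and then expanded from data and resources such as WordNet.
SEED_LEXICONS = {
    "request": ["could you", "you should", "please"],
    "experience": ["i took", "i have had", "i have been on"],
}
NEGATIVE_WORDS = {"no", "slow", "worst", "bad"}  # illustrative polar words only

def extract_features(sentence):
    """Turn one sentence into a dictionary of simple classification features."""
    text = sentence.lower()
    tokens = re.findall(r"[a-z']+", text)
    features = {}
    for name, phrases in SEED_LEXICONS.items():
        # Lexicon feature: how many indicative phrases occur in the sentence.
        features[f"lex_{name}"] = sum(p in text for p in phrases)
        # Positional feature: does the sentence begin with an indicative phrase?
        features[f"starts_with_{name}"] = any(text.startswith(p) for p in phrases)
    # Crude sentiment and punctuation signals.
    features["negative_words"] = sum(t in NEGATIVE_WORDS for t in tokens)
    features["has_exclamation"] = "!" in sentence
    return features

print(extract_features("You should have more veggie options"))
print(extract_features("Showers had no hot water!"))
```

The resulting dictionaries would then be handed to whichever classifier is selected.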

Expression Classification: Results
Have achieved 75% precision and recall for all expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering

Before Automated Classification: Manual Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ: spicy
LEX-EXCESS: too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY → Negative
POS-VB POS-MD * → Suggestion

Classification: Machine Learning
Classification tasks
Expression
(Contextual) Sentiment
Aspect category
Frameworks
Weka
Mallet

Baseline Classifiers for Expressions
Mallet and Weka
NaiveBayes
MaxEnt
CRF
Gram-based
Uni, Bi and Trigram features
Baseline
~ 10% accuracy
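
For illustration, a gram-based Naive Bayes baseline can be sketched in plain Python. This is a stand-in for the idea, not the Mallet or Weka implementations used in the talk; the class name, the add-one smoothing, and the toy training sentences are assumptions.

```python
import math
from collections import Counter, defaultdict

def ngrams(sentence, max_n=3):
    """Return all uni-, bi- and trigram features of a sentence."""
    tokens = sentence.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

class NaiveBayes:
    """Multinomial Naive Bayes over n-gram counts with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for sentence, label in zip(sentences, labels):
            self.feature_counts[label].update(ngrams(sentence))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, sentence):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # class prior
            for f in ngrams(sentence):
                if f in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

model = NaiveBayes().fit(
    ["you should have more veggie options", "i think you need better chefs",
     "showers had no hot water", "service is slow"],
    ["SUGGESTION", "SUGGESTION", "COMPLAINT", "COMPLAINT"],
)
print(model.predict("you should have more chairs"))
```

With only surface n-grams as features, such a baseline has no access to lexicons, sentiment, or ontology concepts.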

Expression Classifiers
Trees
Decision Tree (J48)
Functions
Logistic Regression
SVM
Sequence Tagging
CRF: Conditional Random Fields

Entities
Named-entity extractors
The generic PERSON, ORGANIZATION, LOCATION
N-gram and part-of-speech analysis
Frequently mentioned ‘entities’
Improves recall
Ontology driven concept mapping
Using pre-assembled domain ontologies/taxonomies/dictionaries
Based on modules like UIMA ConceptMapper
Scale is a challenge
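
Ontology-driven concept mapping can be pictured as a longest-match dictionary lookup. The sketch below is a plain-Python stand-in, not the UIMA ConceptMapper API; the mini-dictionary and its concept identifiers are hypothetical.

```python
# Hypothetical mini-dictionary: surface form -> concept identifier.
CONCEPTS = {
    "lipitor": "DRUG:lipitor",
    "vitamin e oil": "PRODUCT:vitamin_e_oil",
    "vitamin e": "SUBSTANCE:vitamin_e",
    "skin rash": "CONDITION:skin_rash",
}
MAX_PHRASE_LEN = max(len(k.split()) for k in CONCEPTS)

def map_concepts(text):
    """Scan left to right, preferring the longest dictionary phrase at each position."""
    tokens = text.lower().replace(",", " ").split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in CONCEPTS:
                found.append((phrase, CONCEPTS[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no concept starts here
    return found

print(map_concepts("You should try vitamin E oil for skin rash"))
```

A hash lookup per candidate phrase keeps this cheap for small dictionaries; with ontologies the size of UMLS or SNOMED, loading and indexing the dictionary becomes the scaling problem.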

Contextual Sentiment
(Just) polar words can be misleading !
Polar words may not be present at all!
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow

Qualified Sentiment
Classify negative comments
Further segregate into
Immediately actionable items
‘Long term’ issues
Approach
Curation of N-grams for each type of negative comments
Classifier

Topic Mining
Motivated by feedback survey analytics
People can talk about “anything”
Interested in broad ‘topics’ of discussion
But the set of topics is dynamic, not necessarily known
Unsupervised topic mining
LDA: Latent Dirichlet Allocation
As-is, led to very fragmented topics that were not semantically meaningful
Solution: consolidation of terms using WordNet
Expand terms using WordNet synonyms
Consolidate with manual curation after
Semi-automated approach

Cohesive Topic Mining
Problem with WordNet (synonym) expansion
Prone to semantic divergence
Example
Presentation Project(or) Milestones
(Almost) strongly connected components in relationship graph
Manual review after
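
One way to picture the component idea: treat term relationships as a directed graph and group only terms that can reach each other, so a one-way chain of loosely related terms is not merged into a single topic. The graph below is a hypothetical example, and exact strongly connected components are used here as a simplification of the "(almost)" variant.

```python
# Hypothetical directed relationship graph: term -> related terms.
RELATED = {
    "presentation": {"talk", "projector"},
    "talk": {"presentation"},
    "projector": {"project"},
    "project": {"milestones"},
    "milestones": {"project"},
}

def reachable(graph, start):
    """Return every node reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def strongly_connected_components(graph):
    """Group nodes that are mutually reachable; fine for small term graphs."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    reach = {n: reachable(graph, n) for n in nodes}
    components, assigned = [], set()
    for node in sorted(nodes):
        if node in assigned:
            continue
        component = {m for m in reach[node] if node in reach[m]}
        components.append(component)
        assigned |= component
    return components

for component in strongly_connected_components(RELATED):
    print(sorted(component))
```

Here "presentation" and "talk" stay together, while "projector" is not pulled into the "project"/"milestones" group. The reachability approach is quadratic; a larger graph would call for Tarjan's or Kosaraju's algorithm.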

Aspect Classification
Binning data into few broad categories
Approach
N-gram mining
Classification

Categories over Topics
Consolidate topics into broad, fixed categories
Ontology mapping approach
Each category has associated concepts
Topic signature maps to category concepts
Hershey
Bieber
Cocoa beans
Personnel
Competitors
Yearly reviews
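
A small sketch of the ontology-mapping step, assuming each category is represented by a set of concept terms and a mined topic by its top terms. The category contents, the overlap scoring, and the function name are illustrative assumptions.

```python
# Hypothetical category -> associated concepts.
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey", "rival", "competitor"},
    "Personnel": {"yearly reviews", "manager", "hiring"},
}

def map_topic_to_category(topic_terms, min_overlap=1):
    """Assign a topic signature to the category sharing the most concepts."""
    terms = {t.lower() for t in topic_terms}
    scores = {category: len(terms & concepts)
              for category, concepts in CATEGORY_CONCEPTS.items()}
    best = max(scores, key=scores.get)
    # Leave the topic uncategorized when nothing overlaps enough.
    return best if scores[best] >= min_overlap else None

print(map_topic_to_category(["Hershey", "prices", "competitor"]))
print(map_topic_to_category(["manager", "yearly reviews"]))
```

Topics that map to no category can be surfaced for manual review instead of being forced into a bin.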

Emotion Extraction
Plutchik wheel of emotions
Fundamental emotion concepts captured in ontology
Augmented with indicator terms, and their synonyms
Ontology driven extraction for emotion concepts

Semantics
Domain knowledge is not ‘nice-to-have’ but critical
HEALTH
•Condition names
•Drug names
•Symptoms
•Procedures
•..
RETAIL
•Food items
•Other products
•Competitors
•…
RESEARCH
•Chemical substances
•Harmful conditions
•…
INTELLIGENCE
•Manufacturers
•Vehicles
…

Leverage Existing Knowledge Sources
Health informatics
UMLS
http://www.nlm.nih.gov/research/umls/
NCI Thesaurus
http://ncit.nci.nih.gov/
SNOMED
http://www.nlm.nih.gov/snomed
Retail
DMOZ
http://www.dmoz.org
Many other
Freebase
http://www.freebase.com
Wikipedia, DBPedia
OpenData
data.gov

Knowledge Engineering Tools
Getting available ontologies into usable formats
Available as database dumps, RDF, or Web data
“Mini” ontology creation
Curate manually when possible (small dictionaries)
Example: list of competitors
API access
Freebase https://www.freebase.com/query
Query using ‘MQL’ – Metaweb Query Language (SPARQL-like)
BioPortal http://data.bioontology.org/documentation
Provided sometimes by customer !

Practical Requirements
Confidence Measures
Quantitative confidence score for extracted elements
Binary confidence Y/N
Not confident Routed for manual review
‘Explanation’ for classification
Relevant snippets
“… and the checkout times continue to be long despite …” → Complaint
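
The confidence requirement can be sketched as a simple router: each extraction carries a score and a supporting snippet, and anything below a threshold goes to manual review. The `Extraction` type, the score range, and the 0.8 threshold are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    snippet: str       # relevant text, kept as the 'explanation'
    label: str         # predicted class
    confidence: float  # assumed to be a score in [0, 1]

def route(extractions, threshold=0.8):
    """Split extractions into auto-accepted and manual-review lists."""
    accepted, review = [], []
    for e in extractions:
        (accepted if e.confidence >= threshold else review).append(e)
    return accepted, review

results = [
    Extraction("checkout times continue to be long", "Complaint", 0.93),
    Extraction("meats on special this weekend", "Announcement", 0.55),
]
accepted, review = route(results)
print(len(accepted), "accepted;", len(review), "routed for manual review")
```

The binary Y/N view is this same rule with the threshold applied; keeping the raw score as well lets the threshold be tuned per application.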

Feedback Learning Mechanisms
Manual overview is not dismissed entirely
Comprehensive pipeline for manual review
Learn and improve from feedback

Applications
Core Cognie Platform
Retail Analytics Engine
Health Distillation Engine
Survey Analytics Engine
Research Mining Engine
Coding Validation Engine
Risk Analysis System
Coding Processes
Health Insights Portal

Scalability
Scale requirements
Large numbers of documents as opposed to large document size
Throughput can be an issue
Complex language processing algorithms
Feature extraction can be complex
Large ontologies in some cases
Solutions
Multi-threading and Thread pooling architecture
Hadoop MapReduce [Kahn and Ashish, 2014]
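
Because the load is many small documents, the work parallelizes per document. A minimal thread-pool sketch using Python's standard library follows; `analyze` is a hypothetical placeholder for the real per-document pipeline, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(document):
    """Placeholder for segmentation, feature extraction and classification."""
    return {"tokens": len(document.split())}

def analyze_all(documents, max_workers=4):
    """Process documents on a reusable pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input documents.
        return list(pool.map(analyze, documents))

docs = ["showers had no hot water", "service is slow", "the mocha is too sweet"]
print(analyze_all(docs))
```

In CPython, threads mainly help when the pipeline waits on I/O or calls into native libraries; for CPU-bound pure-Python processing, a process pool or a MapReduce-style job is the more likely fit.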

Grand Challenge Projects
Aristo
At AI2, the Allen Institute for AI
http://www.allenai.org
Areas
Knowledge Extraction
Reasoning
Question Answering
Can the system answer 4th, 6th grade exams?
Project NELL
Never Ending Language Learning
http://rtw.ml.cmu.edu/rtw/
“Learnt” 50+ million facts from Web data

Conclusions
Deeper distillation from text is required
Can be achieved by
Detecting and combining multiple elements in text
Feature engineering
Knowledge engineering
Classifier selection
Semantics and Knowledge Engineering are key
Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains

thank you !
naveen.ashish@cognie.com
