Deep Machine Reading: Taming Unstructured, Natural Language Data 
Naveen Ashish 
University of Southern California & Cognie Inc., 
BigDataTECHCON, San Francisco, October 29th, 2014
This is about ….. 
DEEP MACHINE READING 
The hard nut of having computers “understand” natural language (text) …. 
Pushing the boundaries of what we can achieve ….
A True AI Challenge
"It's (the problem of computers understanding natural language) ambitious ... in fact there's no more important project than understanding intelligence and recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language …. To really master natural language … that’s the key to the Turing Test – to a human requires the full scope of human intelligence. … So the point is that natural language is a very profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
“Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014
Commercial Relevance Today 
the problem of taming unstructured data is far from solved ….. !!!! 
search 
text analytics 
big data analytics 
health informatics 
social-media intelligence 
mining research literature
Cognie Inc.
Incorporated in 2006 
High-end consulting for semantic-search 
Focus is on machine reading technologies 
Work leverages 
Information extraction work and systems conceptualized as part of university research 
XAR: eXtraction with Adaptive Rules (Ashish and Mehrotra, 2009)
PEP: Pathology Extraction Pipeline (Ashish, Dahm and Boicey, 2014)
Team 
Developers, Student interns, Researchers 
Blog 
http://cognie.blog.com 
Today 
Building custom text analytics engines
Model 
Build custom text understanding engines for domains 
Cognie™ Platform for Building Text Analytics Engines
Retail Text Engine
Health NLP Engine
Research Mining Engine
Customization, Application Integration, Evolution
Outline 
Deep machine reading: what it is, and why it is needed
State-of-the-art 
Fundamentals 
Approach 
Details 
Case studies 
Retail, Health, Risk assessment, Customer support, Intelligence 
Conclusions
What is “Deep” machine reading ?
Deep Machine Reading is …. 
The ability to distill the abstract from text 
The ability to comprehensively extract multiple concepts and relationships from the text 
The ability to link extracted elements to known concepts 
The ability to use the text (data) itself, to improve understanding of that text
The Abstract, in Text 
The abstract, not explicitly mentioned ! 
What falls in this category 
Expressions 
Contextual sentiment 
Aspects or Categories 
I think you need better chefs SUGGESTION 
The mocha is too sweet NEGATIVE 
I used to take Lipitor for … PERSONAL EXPERIENCE
The dim lights have a cozy effect … AMBIENCE
Classification, rather than Extrication 
Much of the technology, until recently, has been extrication-focused
Extricate particular terms, elements, concepts from the text 
Extrication 
Named-Entity extraction 
PERSONS, ORGANIZATIONS, LOCATIONS, … 
Sentiment extraction 
Based on polar words 
Need for much more sophisticated classification of text snippets 
Along different dimensions of interest
A Comprehensive Signature of Text 
Cognie experience
Many applications have unique requirements of what they want from the text
“… and for six months I was indeed taking Lipitor but I must say …” PERSONAL EXPERIENCE
“… there is direct correlation between Cadmium exposure and lung …” CAUSALITY
But many groups of applications share common requirements within the group
Primary elements required from text 
Expressions 
Entities 
Sentiment 
Contextual 
Qualified 
Emotion 
Topics 
Categories/Aspects 
Specific signal (“directionality”) 
Relationships
Deeper Text Analysis Better Insights 
Goal: Get actionable insights from data ! 
Hypothesis: Deeper extraction Better insights ! 
The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback, with top issues being slow service, drinks and coffee
73% of all research articles indicate that Cadmium is a causal factor for lung irritation
Context 
COGNIE™: A PLATFORM for text analytics
COGNIE™
XAR 
UCI-PEP 
SHIP 
SURVEY ANALYTICS 
RETAIL ANALYTICS 
RISK ASSESSMENT
Modus Operandi 
All applications require a structured representation of the (unstructured) data 
A structured database/meta-base that powers 
Analytics dashboards 
Data coding processes 
Risk assessment computations 
Consumer health portals 
…. 
Manual extraction processes are typically in place 
Goal is to eliminate or alleviate manual effort
Text Analytics Spectrum 
Gamut of Text Analytics Engines in Market
•Lexalytics 
•Alchemy API 
•Semantria 
•Clarabridge 
•ConveyAPI 
•Linguamatics 
•…. 
Engines Aiming Deeper 
•Luminoso 
•Attensity 
•… 
Availability of Open-source Text Analysis Tools 
•UIMA 
•GATE 
•Deep Learning for Sentiment Analysis (Stanford) 
•Recursive Neural Networks 
•http://openair.allenai.org
Approach
Approach 
natural language processing 
machine learning 
semantics
Architecture: COGNIE™ Platform
Segmentation 
POS Tagging 
Entity extraction 
Anaphora 
Parsing 
Gram analysis 
Existing (DMOZ, SNOMED,UMLS) 
Creation 
Declarative 
Naïve-Bayes 
MaxEnt 
TFIDF 
CRF 
RNN Deep Learning 
ENSEMBLE 
NLP 
Machine Learning 
Knowledge Engineering
COGNIE™: Open-source Leverage
Framework 
UIMA 
Classification 
Weka 
Mallet 
NLP 
Stanford CoreNLP 
Indexing 
Lucene 
Databases 
MySQL, MongoDB 
Knowledge Engineering 
Protégé 
Topic mining 
Mallet 
Sentiment 
Stanford Deep Learner
Step 0: Basic Text Analysis 
Text Segmentation 
In many cases the “unit” of distillation is a sentence 
Segmentation strategies 
Built-in, such as in UIMA or GATE 
Custom segmentation 
Sentence decomposition 
Decompose sentence into individual clauses
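The segmentation and decomposition steps above can be sketched as follows. This is a minimal illustration that assumes a regex-based splitter instead of the built-in UIMA or GATE segmenters; the function names, the list of clause markers, and the sample text are assumptions made for the example, not part of the Cognie platform.

```python
import re

# Assumption: a sentence ends with ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumption: a small, hand-picked set of clause boundaries.
CLAUSE_SPLIT = re.compile(r"\s*(?:,|;|\bbut\b|\band\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Split raw text into sentences on terminal punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Decompose one sentence into clauses at commas, semicolons, 'but', 'and'."""
    body = sentence.rstrip(".!? ")
    return [c for c in CLAUSE_SPLIT.split(body) if c]


text = "The mocha is too sweet, but the service is great. Aisles are too narrow!"
sentences = split_sentences(text)
clauses = [split_clauses(s) for s in sentences]
```

Each clause can then be classified on its own, which matters when one sentence mixes a complaint with praise.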
Expressions 
Beyond entities and sentiment: EXPRESSIONS
EXPRESSIONS 
Introduced in [Ashish et al, 2011]
Expressions 
…showers had no hot water !… COMPLAINT 
..you should have more veggie options… SUGGESTION 
RETAIL/ENTERPRISE 
..meats on special this weekend… ANNOUNCEMENT 
..this is the best store on the west side… ADVOCACY 
There is hardly any evidence to suggest a link between salt and diabetes (-)
These results confirm that high intake of salt leads to increase in BP (+)
RISK ASSESSMENT
Expressions 
You should try Vitamin E oil … ADVICE 
..I have had arthritis since 1991… EXPERIENCE 
HEALTH 
..for me lipitor worked like a charm… OUTCOME
The Indicators: “Give Aways” 
A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT 
(You) should have more veggie options… SUGGESTION 
..i have been on lipitor… EXPERIENCE 
..this is the best store on the west side… ADVOCACY
Approach: Given Indicators 
NLP 
Identification of individual elements 
Unsupervised 
Relationships between elements
Semantics 
Identification of individual elements 
Knowledge driven 
Machine Learning Classification 
Combine elements classify
Expression Classification: Relevant Features 
Curated lexicons of specific indicative phrases 
Examples 
“could you”, “I took”, …. 
Approach 
Manual creation of “seed” lexicons 
Automated expansion from data plus resources such as WordNet
The Sentiment 
For instance a Complaint would almost always have negative sentiment 
Punctuation, other expressions, or emoticons
Expression Classification Features 
Positional information of words, phrases, or part-of-speech patterns in the sentence
Suggestions will usually begin with certain ‘request’ words 
Custom patterns 
Such as subject-verb-object for PERSONAL EXPERIENCE 
Ontology concepts
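The feature types listed above (lexicon phrases, sentiment cues, punctuation, positional information) can be sketched as a single feature extractor. This is a rough illustration only: the seed lexicons, the list of 'request' words, and the negative cue words are tiny hand-made assumptions, and the feature names are hypothetical.

```python
import re

# Assumption: tiny hand-made "seed" lexicons; real ones would be curated and expanded.
SEED_LEXICONS = {
    "SUGGESTION": ["could you", "you should", "should have"],
    "EXPERIENCE": ["i took", "i have been", "i have had"],
}
REQUEST_WORDS = {"could", "should", "please", "would"}  # assumed 'request' openers
NEGATIVE_WORDS = {"no", "slow", "too", "never"}          # assumed polar cues


def extract_features(sentence):
    """Map a sentence to a dict of simple binary and count features."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    features = {}
    # Lexicon features: does any indicative phrase for a label occur?
    for label, phrases in SEED_LEXICONS.items():
        features["lex_" + label] = any(p in lowered for p in phrases)
    # Positional feature: a 'request' word within the first two tokens.
    features["starts_with_request"] = any(t in REQUEST_WORDS for t in tokens[:2])
    # Crude sentiment cue and punctuation features.
    features["negative_cues"] = sum(t in NEGATIVE_WORDS for t in tokens)
    features["has_exclamation"] = "!" in sentence
    return features


f1 = extract_features("You should have more veggie options")
f2 = extract_features("Showers had no hot water !")
```

A dictionary like this can be fed to any of the classifiers named in the slides once it is converted to that toolkit's feature-vector format.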
Expression Classification: Results 
Have achieved 75% precision and recall for all expressions considered 
Factors 
Feature engineering 
Classifier selection 
Knowledge engineering
Before Automated Classification: Manual Patterns 
SoL: Sequences of Labels 
Labels 
LEX-FOODADJ 
spicy 
LEX-EXCESS 
too, very 
ONT-FOOD 
POS-NOUN 
Sequences (Patterns) 
ANY LEX-EXCESS LEX-FOODADJ ANY → Negative
POS-VB POS-MD * → Suggestion
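A sequence-of-labels pattern of this kind can be matched with a few lines of code. The sketch below covers only the lexicon-based labels; part-of-speech labels are left out because they need a tagger. It assumes that ANY stands for zero or more tokens, and the lexicon contents beyond "spicy", "too" and "very" are invented for illustration.

```python
# Assumption: toy lexicons standing in for the curated LEX-* and ONT-* resources.
LEXICONS = {
    "LEX-FOODADJ": {"spicy", "sweet", "salty"},
    "LEX-EXCESS": {"too", "very"},
    "ONT-FOOD": {"mocha", "curry"},
}


def labels_for(token):
    """Return every label whose lexicon contains the token."""
    return {name for name, words in LEXICONS.items() if token in words}


def matches(pattern, tokens):
    """True if the label sequence matches the whole token list.

    'ANY' is treated as a wildcard for zero or more tokens (an assumption).
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        # Try every possible split point for the wildcard.
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])


PATTERNS = [(["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"], "Negative")]


def classify(sentence):
    """Return the label of the first matching pattern, or None."""
    tokens = sentence.lower().split()
    for pattern, label in PATTERNS:
        if matches(pattern, tokens):
            return label
    return None


result = classify("The mocha is too sweet")
```

Here "The mocha is too sweet" matches because "too" carries LEX-EXCESS and "sweet" carries LEX-FOODADJ, while "The mocha is sweet" matches nothing.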
Classification: Machine Learning 
Classification tasks 
Expression 
(Contextual) Sentiment 
Aspect category 
Frameworks 
Weka 
Mallet
Baseline Classifiers for Expressions 
Mallet and Weka 
NaiveBayes 
MaxEnt 
CRF 
Gram-based 
Unigram, bigram and trigram features
Baseline 
~ 10% accuracy
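A gram-based baseline of this kind can be sketched in plain Python. The slides use Mallet and Weka; the block below is a stand-in rather than either toolkit's API: a small multinomial Naive Bayes with add-one smoothing over unigram, bigram and trigram counts. The training sentences are invented toy data.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all uni-, bi- and trigrams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over n-gram counts."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(ngrams(text))
        self.vocab = {g for c in self.gram_counts.values() for g in c}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.label_counts.items():
            grams = self.gram_counts[label]
            denom = sum(grams.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for g in ngrams(text):
                score += math.log((grams[g] + 1) / denom)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data, for illustration only.
train = [
    ("showers had no hot water", "COMPLAINT"),
    ("service is slow", "COMPLAINT"),
    ("you should have more veggie options", "SUGGESTION"),
    ("you should try vitamin e oil", "SUGGESTION"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
prediction = model.predict("you should have more parking")
```

With surface n-grams as the only features, the model has no access to lexicons, sentiment or position, which is the gap the richer features described in the slides are meant to close.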
Expression Classifiers 
Trees 
Decision Tree (J48) 
Functions 
Logistic Regression 
SVM 
Sequence Tagging 
CRF: Conditional Random Fields
Entities 
Named-entity extractors 
The generic PERSON, ORGANIZATION, LOCATION 
N-gram and part-of-speech analysis
Frequently mentioned ‘entities’ 
Improves recall 
Ontology driven concept mapping 
Using pre-assembled domain ontologies/taxonomies/dictionaries 
Based on modules like UIMA ConceptMapper 
Scale is a challenge
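Ontology-driven concept mapping can be illustrated with a greedy longest-match dictionary lookup. This is a simplified sketch of the general idea, not the UIMA ConceptMapper API; the dictionary entries and the concept identifiers are made up for the example.

```python
# Assumption: a toy dictionary; the concept identifiers are invented for illustration.
DICTIONARY = {
    ("vitamin", "e", "oil"): "CONCEPT:vitamin_e_oil",
    ("vitamin", "e"): "CONCEPT:vitamin_e",
    ("lipitor",): "CONCEPT:lipitor",
    ("skin", "rash"): "CONCEPT:skin_rash",
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def map_concepts(text):
    """Greedy longest-match lookup of dictionary terms in a token stream."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            concept = DICTIONARY.get(tuple(tokens[i:i + n]))
            if concept:
                found.append((" ".join(tokens[i:i + n]), concept))
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return found


hits = map_concepts("You should try Vitamin E oil for skin rash")
```

Preferring the longest match keeps "vitamin e oil" from being reported as the shorter "vitamin e". With a large ontology, the per-token lookups are where the scale concern noted above shows up.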
Contextual Sentiment 
(Just) polar words can be misleading ! 
Polar words may not be present at all !
Combination of elements 
The mocha is too sweet 
Wait time is over an hour 
Aisles are too narrow 
Service is slow
Qualified Sentiment 
Classify negative comments 
Further segregate into 
Immediately actionable items 
‘Long term’ issues 
Approach 
Curation of n-grams for each type of negative comment
Classifier
Topic Mining 
Motivated by feedback survey analytics 
People can talk about “anything” 
Interested in broad ‘topics’ of discussion 
But the set of topics is dynamic, not necessarily known 
Unsupervised topic mining 
LDA: Latent Dirichlet Allocation
As-is, led to very fragmented topics that were not semantically meaningful
Solution: consolidation of terms using WordNet 
Expand terms using WordNet synonyms 
Consolidate with manual curation after 
Semi-automated approach
Cohesive Topic Mining 
Problem with WordNet (synonym) expansion 
Prone to semantic divergence 
Example 
Presentation Project(or) Milestones 
(Almost) strongly connected components in relationship graph 
Manual review after
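The idea of grouping terms by strongly connected components can be sketched as below. The block uses exact strongly connected components (Kosaraju's algorithm) and does not attempt the "almost" relaxation mentioned on the slide; the edge list is an invented toy relationship graph.

```python
from collections import defaultdict

# Assumption: a toy directed "is related to" graph, e.g. from synonym expansion.
EDGES = [
    ("presentation", "slides"), ("slides", "presentation"),
    ("presentation", "projector"),          # one-way link: risk of divergence
    ("projector", "project"),
    ("project", "milestones"), ("milestones", "project"),
]


def strongly_connected_components(edges):
    """Kosaraju's algorithm: returns a list of sets of mutually reachable terms."""
    graph, reverse, nodes = defaultdict(list), defaultdict(list), set()
    for a, b in edges:
        graph[a].append(b)
        reverse[b].append(a)
        nodes.update((a, b))

    def dfs(start, adjacency, seen, out):
        # Iterative depth-first search that records nodes in post-order.
        stack = [(start, iter(adjacency[start]))]
        seen.add(start)
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(adjacency[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    # Pass 1: finishing order on the original graph.
    order, seen = [], set()
    for node in sorted(nodes):
        if node not in seen:
            dfs(node, graph, seen, order)

    # Pass 2: collect components on the reversed graph.
    components, seen = [], set()
    for node in reversed(order):
        if node not in seen:
            component = []
            dfs(node, reverse, seen, component)
            components.append(set(component))
    return components


topics = [c for c in strongly_connected_components(EDGES) if len(c) > 1]
```

Because the links from "presentation" to "projector" and on to "project" run one way only, the two groups stay separate instead of drifting into a single topic.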
Aspect Classification 
Binning data into few broad categories 
Approach 
N-gram mining
Classification
Categories over Topics 
Consolidate topics into broad, fixed categories 
Ontology mapping approach 
Each category has associated concepts 
Topic signature maps to category concepts 
Hershey 
Bieber 
Cocoa beans 
Personnel 
Competitors 
Yearly reviews
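Mapping a topic signature to a fixed category can be sketched as a simple overlap score. The category names echo the slide, but the concept sets and the overlap threshold are hypothetical and chosen only to make the example run.

```python
# Assumption: hypothetical concept sets for two of the categories named on the slide.
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey", "rival brand", "market share"},
    "Personnel": {"yearly reviews", "manager", "hiring"},
}


def assign_category(topic_signature, min_overlap=1):
    """Pick the category whose concepts overlap most with the topic's top terms."""
    terms = {t.lower() for t in topic_signature}
    scores = {
        category: len(concepts & terms)
        for category, concepts in CATEGORY_CONCEPTS.items()
    }
    best = max(scores, key=scores.get)
    # Topics with too little overlap stay unassigned.
    return best if scores[best] >= min_overlap else None


category = assign_category(["Hershey", "cocoa beans", "Bieber"])
```

A topic whose terms match no category concepts is left unassigned rather than forced into a bin.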
Emotion Extraction 
Plutchik wheel of emotions
Fundamental emotion concepts captured in ontology 
Augmented with indicator terms, and their synonyms 
Ontology driven extraction for emotion concepts
Semantics is Key
Semantics 
Domain knowledge is not ‘nice-to-have’ but critical 
HEALTH 
•Condition names 
•Drug names 
•Symptoms 
•Procedures 
•.. 
RETAIL 
•Food items 
•Other products 
•Competitors 
•… 
RESEARCH 
•Chemical substances 
•Harmful conditions 
•… 
INTELLIGENCE 
•Manufacturers 
•Vehicles 
…
Leverage Existing Knowledge Sources
Health informatics 
UMLS 
http://www.nlm.nih.gov/research/umls/ 
NCI Thesaurus 
http://ncit.nci.nih.gov/ 
SNOMED 
http://www.nlm.nih.gov/snomed 
Retail 
DMOZ 
http://www.dmoz.org 
Many others
Freebase
http://www.freebase.com
Wikipedia, DBpedia
OpenData 
data.gov
Knowledge Engineering Tools 
Getting available ontologies into usable formats 
Available as database dumps, RDF, or Web data 
“Mini” ontology creation 
Curate manually when possible (small dictionaries) 
Example: list of competitors 
API access 
Freebase https://www.freebase.com/query 
Query using ‘MQL’, the Metaweb Query Language (SPARQL-like)
BioPortal http://data.bioontology.org/documentation
Sometimes provided by the customer !
Practical Requirements 
Confidence Measures 
Quantitative confidence score for extracted elements 
Binary confidence Y/N 
Not confident Routed for manual review 
‘Explanation’ for classification 
Relevant snippets 
“….and the checkout times continue to be long despite …”  Complaint
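The confidence and explanation requirements above can be sketched as a small routing step. The threshold value, the field names, and the function name are assumptions made for illustration.

```python
def route(label, confidence, snippet, threshold=0.8):
    """Accept confident results; send the rest to a manual review queue.

    threshold is an assumed cut-off; in practice it would be tuned per application.
    """
    decision = "accept" if confidence >= threshold else "manual_review"
    return {
        "label": label,
        "confidence": confidence,   # quantitative score
        "explanation": snippet,     # relevant snippet that supports the label
        "decision": decision,       # binary outcome derived from the score
    }


result = route("Complaint", 0.55, "the checkout times continue to be long")
```

Keeping the supporting snippet next to the label gives a human reviewer the context needed to confirm or correct the classification quickly.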
Feedback Learning Mechanisms 
Manual review is not dismissed entirely
Comprehensive pipeline for manual review 
Learn and improve from feedback
Applications
Applications 
Core Cognie Platform
Retail Analytics Engine
Health Distillation Engine
Survey Analytics Engine
Research Mining Engine
Coding Validation Engine
Risk Analysis System
Coding Processes
Health Insights Portal
Scale
Scalability 
Scale requirements 
Large numbers of documents as opposed to large document size 
Throughput can be an issue 
Complex language processing algorithms 
Feature extraction can be complex 
Large ontologies in some cases 
Solutions 
Multi-threading and Thread pooling architecture 
Hadoop MapReduce [Kahn and Ashish, 2014]
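The thread-pooling approach for many small documents can be sketched with Python's standard library. The `analyze` function is a hypothetical placeholder for the per-document pipeline, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(document):
    """Placeholder for the per-document pipeline (hypothetical)."""
    return {
        "length": len(document.split()),
        "has_complaint": "slow" in document.lower(),
    }


def analyze_all(documents, workers=4):
    """Process many small documents in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))


results = analyze_all(["Service is slow", "Great coffee and staff"])
```

In CPython, threads mainly help when the per-document work waits on I/O or on native libraries; for CPU-bound feature extraction, a process pool or a cluster framework such as the MapReduce option above is the more likely fit.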
Conclusions
Grand Challenge Projects 
Aristo 
At AI2, the Allen Institute for AI
http://www.allenai.org 
Areas 
Knowledge Extraction 
Reasoning 
Question Answering 
Can the system answer 4th, 6th grade exams ?
Project NELL 
Never Ending Language Learning 
http://rtw.ml.cmu.edu/rtw/ 
“Learnt” 50+ million facts from Web data
Conclusions 
Deeper distillation from text is required 
Can be achieved by 
Detecting and combining multiple elements in text 
Feature engineering 
Knowledge engineering 
Classifier selection 
Semantics and Knowledge Engineering are key
Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains
thank you ! 
naveen.ashish@cognie.com

Deep Machine Reading

  • 1. Deep Machine Reading: Taming Unstructured, Natural Language Data Naveen Ashish University of Southern California & Cognie Inc., BigDataTECHCON, San Francisco, October 29th2014
  • 2. This is about ….. DEEP MACHINE READING The hard nut of having computers “understand” natural language (text) …. Pushing the boundaries of what we can achieve ….
  • 3. A True AI Challenge "It's (the problem of computers understanding natural language) ambitious ...in fact there's no more important project than understanding intelligence and recreating it.“ -Ray Kurzweil(2013) Alan Turing based the Turing Test entirely on written language….To really master natural language …that’s the key to the Turing Test–to a human requires the full scope of human intelligence. …So the point is that natural language is a very profound domain to do artificial intelligence in. - Ray Kurzweil (2013) “Another example of a good language problem is question answering, like What’s the second-biggest city in California that is not near a river?” Michael Jordan, in response to “What would you do with $1B?”, IEEE Spectrum Interview Oct 2014
  • 4. Commercial Relevance Today the problem of taming unstructured data is far from solved ….. !!!! search text analytics big data analytics health informatics social-media intelligence mining research literature
  • 5. CognieInc., CognieInc., Incorporated in 2006 High-end consulting for semantic-search Focus is on machine reading technologies Work leverages Information extraction work and systems conceptualized as part of university research XAR: eXtractionwith Adaptive Rules (Ashish and Mehrotra, 2009) PEP: Pathology Extraction Pipeline (Ashish, Dahmand Boicey2014) Team Developers, Student interns, Researchers Blog http://cognie.blog.com Today Building custom text analytics engines
  • 6. Model Build custom text understanding engines for domains CognieTMPlatform for Building Text Analytics Engines Retail Text Engine Health NLP Engine Research Mining Engine Customization, Application Integration, Evolution
  • 7. Outline Deep machine reading: What is, and why needed State-of-the-art Fundamentals Approach Details Case studies Retail, Health, Risk assessment, Customer support, Intelligence Conclusions
  • 8. What is “Deep” machine reading?
  • 9. Deep Machine Reading is: the ability to distill the abstract from text; the ability to comprehensively extract multiple concepts and relationships from the text; the ability to link extracted elements to known concepts; the ability to use the text (data) itself to improve understanding of that text.
  • 10. The Abstract, in Text: the abstract is not explicitly mentioned! What falls in this category: Expressions, Contextual sentiment, Aspects or Categories. Examples: “I think you need better chefs” → SUGGESTION; “The mocha is too sweet” → NEGATIVE; “I used to take Lipitor for …” → PERSONAL EXPERIENCE; “The dim lights have a cozy effect ….” → AMBIENCE.
  • 11. Classification, rather than Extrication: Much of the technology, up to recently, is extrication focused: extricate particular terms, elements, concepts from the text. Extrication: Named-entity extraction (PERSONS, ORGANIZATIONS, LOCATIONS, …); Sentiment extraction (based on polar words). There is a need for much more sophisticated classification of text snippets, along different dimensions of interest.
  • 12. A Comprehensive Signature of Text. Cognie experience: many applications have unique requirements of what they want from the text. “… and for six months I was indeed taking Lipitor but I must say ….” → PERSONAL EXPERIENCE; “… there is direct correlation between Cadmium exposure and lung …” → CAUSALITY. But many groups of applications have common requirements within. Primary elements required from text: Expressions; Entities; Sentiment (Contextual, Qualified); Emotion; Topics; Categories/Aspects; Specific signal (“directionality”); Relationships.
  • 13. Deeper Text Analysis → Better Insights. Goal: get actionable insights from data! Hypothesis: deeper extraction → better insights! Examples: “The top advice items advised for skin rash are aloe vera, vitamin E oil and oatmeal.” “Complaints comprise 36% of the overall feedback, with top issues being slow service, drinks and coffee.” “73% of all research articles indicate that Cadmium is a causal factor for lung irritation.”
  • 14. Context COGNIETM: A PLATFORM for text analytics COGNIE TM XAR UCI-PEP SHIP SURVEY ANALYTICS RETAIL ANALYTICS RISK ASSESSMENT
  • 15. Modus Operandi: All applications require a structured representation of the (unstructured) data: a structured database/meta-base that powers analytics dashboards, data coding processes, risk assessment computations, consumer health portals, …. Manual extraction processes are typically in place. The goal is to eliminate or alleviate manual effort.
  • 16. Text Analytics Spectrum. Gamut of Text Analytics Engines in Market: Lexalytics, Alchemy API, Semantria, Clarabridge, ConveyAPI, Linguamatics, …. Engines Aiming Deeper: Luminoso, Attensity, …. Availability of Open-source Text Analysis Tools: UIMA, GATE, Deep Learning for Sentiment Analysis (Stanford), Recursive Neural Networks, http://openair.allenai.org
  • 18. Approach: natural language processing, machine learning, semantics.
  • 19. Architecture: COGNIE™ Platform. NLP: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis. Knowledge Engineering: Existing (DMOZ, SNOMED, UMLS), Creation, Declarative. Machine Learning: Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN, Deep Learning, Ensemble.
  • 20. COGNIE™: Open-source Leverage. Framework: UIMA. Classification: Weka, Mallet. NLP: Stanford CoreNLP. Indexing: Lucene. Databases: MySQL, MongoDB. Knowledge Engineering: Protégé. Topic mining: Mallet. Sentiment: Stanford Deep Learner.
  • 21. Step 0: Basic Text Analysis. Text segmentation: in many cases the “unit” of distillation is a sentence. Segmentation strategies: built-in, such as in UIMA or GATE; custom segmentation. Sentence decomposition: decompose a sentence into individual clauses.
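The segmentation and decomposition step can be illustrated with a minimal sketch. It assumes simple punctuation-based sentence boundaries and a small, hypothetical set of clause markers (commas, semicolons, "and", "but"); it is not the UIMA or GATE segmenter, and a real engine would more likely rely on a parser.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Assumption: hypothetical clause markers, chosen only for illustration.
CLAUSE_SPLIT = re.compile(r"\s*(?:,|;|\bbut\b|\band\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Split raw text into sentences on terminal punctuation."""
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]


def split_clauses(sentence):
    """Decompose one sentence into clauses at the assumed markers."""
    body = sentence.rstrip(".!? ")
    return [c for c in CLAUSE_SPLIT.split(body) if c]


if __name__ == "__main__":
    for sentence in split_sentences("The food was great, but the service is slow. Aisles are too narrow!"):
        print(split_clauses(sentence))
```

Splitting at clause level lets each clause be classified separately, which matters when one sentence mixes, say, praise and a complaint.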
  • 22. Expressions Beyond entities and sentiment : EXPRESSSIONS EXPRESSIONS Introduced in [Ashish et al, 2011]
  • 23. Expressions. RETAIL/ENTERPRISE: “… showers had no hot water! …” → COMPLAINT; “.. you should have more veggie options …” → SUGGESTION; “.. meats on special this weekend …” → ANNOUNCEMENT; “.. this is the best store on the west side …” → ADVOCACY. RISK ASSESSMENT: “There is hardly any evidence to suggest a link between salt and diabetes” → (-); “These results confirm that high intake of salt leads to increase in BP” → (+).
  • 24. Expressions. HEALTH: “You should try Vitamin E oil …” → ADVICE; “.. I have had arthritis since 1991 …” → EXPERIENCE; “.. for me lipitor worked like a charm …” → OUTCOME.
  • 25. The Indicators: “Give Aways”. A combination of multiple types of elements! “… showers had no hot water! …” → COMPLAINT; “(You) should have more veggie options …” → SUGGESTION; “.. i have been on lipitor …” → EXPERIENCE; “.. this is the best store on the west side …” → ADVOCACY.
  • 26. Approach: Given Indicators. NLP: identification of individual elements (unsupervised); relationships between elements. Semantics: identification of individual elements (knowledge driven). Machine Learning: classification; combine elements → classify.
  • 27. Expression Classification: Relevant Features. Curated lexicons of specific indicative phrases; examples: “could you”, “I took”, …. Approach: manual creation of “seed” lexicons; automated expansion from data plus resources such as WordNet. The sentiment: for instance, a Complaint would almost always have negative sentiment. Punctuation, other expressions or emoticons.
  • 28. Expression Classification Features. Positional information of words, phrases, or part-of-speech patterns in the sentence: suggestions will usually begin with certain ‘request’ words. Custom patterns, such as subject-verb-object for PERSONAL EXPERIENCE. Ontology concepts.
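The feature types listed for expression classification (seed lexicons, positional cues, punctuation) can be sketched as a small feature extractor. The lexicon contents, the request-word list and the feature names below are hypothetical placeholders, not the curated resources the talk refers to.

```python
# Assumption: tiny seed lexicons standing in for the curated ones.
SEED_LEXICONS = {
    "SUGGESTION": ["could you", "you should", "should have"],
    "EXPERIENCE": ["i took", "i have had", "i have been on"],
}
# Assumption: hypothetical 'request' words that tend to open a suggestion.
REQUEST_WORDS = {"please", "could", "should", "would"}


def extract_features(sentence):
    """Return a dict of boolean features for one sentence."""
    text = sentence.lower().strip()
    tokens = text.split()
    features = {}
    # Lexicon features: does any indicative phrase occur in the sentence?
    for label, phrases in SEED_LEXICONS.items():
        features["lex_" + label] = any(p in text for p in phrases)
    # Positional feature: does the sentence open with a request word?
    features["starts_with_request"] = bool(tokens) and tokens[0] in REQUEST_WORDS
    # Punctuation feature.
    features["has_exclamation"] = "!" in text
    return features


if __name__ == "__main__":
    print(extract_features("Could you add more veggie options?"))
    print(extract_features("I took Lipitor for six months!"))
```

A dictionary of named features like this can be handed to most classifier toolkits after vectorization; sentiment and ontology-concept features would be added the same way.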
  • 29. Expression Classification: Results. Have achieved 75% precision and recall for all expressions considered. Factors: feature engineering, classifier selection, knowledge engineering.
  • 30. Before Automated Classification: Manual Patterns. SoL: Sequences of Labels. Labels: LEX-FOODADJ (spicy); LEX-EXCESS (too, very); ONT-FOOD; POS-NOUN. Sequences (patterns): ANY LEX-EXCESS LEX-FOODADJ ANY → Negative; POS-VB POS-MD * → Suggestion.
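A sequence-of-labels pattern such as ANY LEX-EXCESS LEX-FOODADJ ANY can be matched with a few lines of code. This is a sketch under stated assumptions: ANY is read as "zero or more tokens", labels come only from small hypothetical lexicons (no POS tagger or ontology), and the word "sweet" is added to LEX-FOODADJ purely for illustration.

```python
# Assumption: lexicon-backed labels only; "sweet" is an illustrative addition.
LEXICONS = {
    "LEX-EXCESS": {"too", "very"},
    "LEX-FOODADJ": {"spicy", "sweet"},
}


def label_tokens(sentence):
    """Attach to each token the set of labels whose lexicon contains it."""
    labels = []
    for token in sentence.lower().split():
        word = token.strip(".,!?")
        labels.append({name for name, words in LEXICONS.items() if word in words})
    return labels


def matches(pattern, labels):
    """True if the whole label sequence matches the pattern.

    ANY consumes zero or more tokens; any other symbol must be carried
    by exactly one token.
    """
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, labels[i:]) for i in range(len(labels) + 1))
    return bool(labels) and head in labels[0] and matches(rest, labels[1:])


NEGATIVE = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]

if __name__ == "__main__":
    print(matches(NEGATIVE, label_tokens("The mocha is too sweet")))
    print(matches(NEGATIVE, label_tokens("The mocha is sweet")))
```

Patterns of this kind need no training data, which makes them a natural starting point before a learned classifier is available.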
  • 31. Classification: Machine Learning. Classification tasks: Expression, (Contextual) Sentiment, Aspect category. Frameworks: Weka, Mallet.
  • 32. Baseline Classifiers for Expressions. Mallet and Weka: NaiveBayes, MaxEnt, CRF. Gram-based: uni-, bi- and trigram features. Baseline ~ 10% accuracy.
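To make the gram-based baseline concrete, here is a minimal multinomial Naive Bayes over unigram, bigram and trigram features. It is a plain-Python stand-in written for illustration, not the Mallet or Weka implementation, and the four training sentences are toy examples adapted from the slides.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for gram in ngrams(text):
                self.feature_counts[label][gram] += 1
                self.vocab.add(gram)

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(count / total)
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for gram in ngrams(text):
                score += math.log((self.feature_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = NaiveBayes()
    model.fit([
        ("showers had no hot water", "COMPLAINT"),
        ("service is slow", "COMPLAINT"),
        ("you should have more veggie options", "SUGGESTION"),
        ("you should try vitamin e oil", "SUGGESTION"),
    ])
    print(model.predict("you should have more chairs"))
```

A baseline like this uses surface n-grams only, which is what the richer lexicon, positional and ontology features are meant to improve on.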
  • 33. Expression Classifiers. Trees: Decision Tree (J48). Functions: Logistic Regression, SVM. Sequence tagging: CRF (Conditional Random Fields).
  • 34. Entities Named-entity extractors The generic PERSON, ORGANIZATION, LOCATION Ngramand part-of-speech analysis Frequently mentioned ‘entities’ Improves recall Ontology driven concept mapping Using pre-assembled domain ontologies/taxonomies/dictionaries Based on modules like UIMA ConceptMapper Scale is a challenge
  • 35. Contextual Sentiment. (Just) polar words can be misleading! Polar words may not be present at all! Sentiment comes from a combination of elements: “The mocha is too sweet”; “Wait time is over an hour”; “Aisles are too narrow”; “Service is slow”.
  • 36. Qualified Sentiment. Classify negative comments, then further segregate into immediately actionable items and ‘long term’ issues. Approach: curation of N-grams for each type of negative comment; classifier.
  • 37. Topic Mining. Motivated by feedback survey analytics: people can talk about “anything”; interested in broad ‘topics’ of discussion; but the set of topics is dynamic, not necessarily known. Unsupervised topic mining: LDA (Latent Dirichlet Allocation). As-is, it led to very fragmented topics that were semantically not meaningful. Solution: consolidation of terms using WordNet: expand terms using WordNet synonyms; consolidate with manual curation after. A semi-automated approach.
  • 38. Cohesive Topic Mining. Problem with WordNet (synonym) expansion: prone to semantic divergence. Example: Presentation, Project(or), Milestones. Solution: (almost) strongly connected components in the relationship graph, with manual review after.
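One way to read the component-based grouping is sketched below: keep only mutual links between terms and take connected components of the resulting graph, so that a one-way link cannot pull unrelated terms into the same topic. Both the mutual-link rule and the toy relationship data are assumptions made for illustration; the slide does not specify the exact graph construction.

```python
from collections import defaultdict


def cohesive_groups(related):
    """Group terms that are connected through mutual links.

    related: dict mapping a term to the set of terms it points to.
    Assumption: a link counts only if it exists in both directions.
    """
    graph = defaultdict(set)
    for a, targets in related.items():
        for b in targets:
            if a in related.get(b, set()):
                graph[a].add(b)
                graph[b].add(a)

    seen, groups = set(), []
    for start in sorted(related):
        if start in seen:
            continue
        # Depth-first walk to collect one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


if __name__ == "__main__":
    toy_links = {
        "presentation": {"project", "talk"},
        "talk": {"presentation"},
        "project": {"milestone"},
        "milestone": {"project"},
    }
    print(cohesive_groups(toy_links))
```

In the toy data the one-way link from "presentation" to "project" is dropped, so the two themes stay separate; the resulting groups would still go to manual review.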
  • 39. Aspect Classification. Binning data into a few broad categories. Approach: N-gram mining, classification.
  • 40. Categories over Topics. Consolidate topics into broad, fixed categories. Ontology mapping approach: each category has associated concepts; the topic signature maps to category concepts. Examples: Hershey, Bieber, Cocoa beans, Personnel, Competitors, Yearly reviews.
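Mapping a topic signature to a fixed category can be sketched as a concept-overlap score. The category names echo the slide, but the concept sets, the overlap measure and the `min_overlap` threshold are hypothetical choices made for this example.

```python
# Assumption: hypothetical concept sets attached to two fixed categories.
CATEGORY_CONCEPTS = {
    "Competitors": {"hershey", "competitor", "rival brand"},
    "Personnel": {"staff", "manager", "yearly reviews"},
}


def map_topic(signature, min_overlap=1):
    """Return the category sharing the most concepts with the topic signature.

    signature: iterable of topic terms (lower-cased).
    Returns None when no category reaches min_overlap shared concepts.
    """
    terms = set(signature)
    best_category, best_overlap = None, 0
    for category in sorted(CATEGORY_CONCEPTS):
        overlap = len(CATEGORY_CONCEPTS[category] & terms)
        if overlap > best_overlap:
            best_category, best_overlap = category, overlap
    return best_category if best_overlap >= min_overlap else None


if __name__ == "__main__":
    print(map_topic(["hershey", "cocoa beans", "price"]))
    print(map_topic(["manager", "yearly reviews", "schedule"]))
```

Topics that map to no category are a useful signal that the category ontology is missing concepts.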
  • 41. Emotion Extraction. Plutchik wheel of emotions: fundamental emotion concepts captured in an ontology, augmented with indicator terms and their synonyms. Ontology-driven extraction for emotion concepts.
  • 43. Semantics Domain knowledge is not ‘nice-to-have’ but critical HEALTH •Condition names •Drug names •Symptoms •Procedures •.. RETAIL •Food items •Other products •Competitors •… RESEARCH •Chemical substances •Harmful conditions •… INTELLIGENCE •Manufacturers •Vehicles …
  • 44. Leverage Existing Knowledge Sources. Health informatics: UMLS http://www.nlm.nih.gov/research/umls/, NCI Thesaurus http://ncit.nci.nih.gov/, SNOMED http://www.nlm.nih.gov/snomed. Retail: DMOZ http://www.dmoz.org. Many others: Freebase http://www.freebase.com; Wikipedia, DBPedia; OpenData data.gov.
  • 45. Knowledge Engineering Tools. Getting available ontologies into usable formats: available as database dumps, RDF, or Web data. “Mini” ontology creation: curate manually when possible (small dictionaries); example: list of competitors. API access: Freebase https://www.freebase.com/query, queried using ‘MQL’, the Metaweb Query Language (SPARQL-like); BioPortal http://data.bioontology.org/documentation. Provided sometimes by the customer!
  • 46. Practical Requirements. Confidence measures: quantitative confidence score for extracted elements; binary confidence (Y/N); not confident → routed for manual review. ‘Explanation’ for classification: relevant snippets, e.g. “…. and the checkout times continue to be long despite …” → Complaint.
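Confidence-based routing with an attached explanation snippet can be sketched in a few lines. The `Extraction` record, its field names and the 0.8 threshold are hypothetical; the slide only states that low-confidence items are routed for manual review.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    label: str         # e.g. "COMPLAINT"
    confidence: float  # assumed to lie in [0, 1]
    snippet: str       # text span offered as the explanation


def route(extractions, threshold=0.8):
    """Split extractions into auto-accepted items and items for manual review."""
    accepted, review = [], []
    for item in extractions:
        (accepted if item.confidence >= threshold else review).append(item)
    return accepted, review


if __name__ == "__main__":
    items = [
        Extraction("COMPLAINT", 0.93, "the checkout times continue to be long"),
        Extraction("SUGGESTION", 0.55, "more veggie options"),
    ]
    accepted, review = route(items)
    print([e.label for e in accepted], [e.label for e in review])
```

Keeping the snippet on every record means a reviewer sees why the engine chose a label, and corrected records can later serve as training feedback.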
  • 47. Feedback Learning Mechanisms. Manual overview is not dismissed entirely: comprehensive pipeline for manual review; learn and improve from feedback.
  • 49. Applications. Core Cognie Platform: Retail Analytics Engine, Health Distillation Engine, Survey Analytics Engine, Research Mining Engine, Coding Validation Engine, Risk Analysis System, Coding Processes, Health Insights Portal.
  • 50. Scale
  • 51. Scalability. Scale requirements: large numbers of documents, as opposed to large document size. Throughput can be an issue: complex language processing algorithms; feature extraction can be complex; large ontologies in some cases. Solutions: multi-threading and thread pooling architecture; Hadoop MapReduce [Kahn and Ashish, 2014].
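The thread-pooling idea, many small documents processed independently, can be sketched with Python's standard `concurrent.futures` module. The `analyze` function is a hypothetical placeholder for a full pipeline, and in CPython a thread pool mainly helps when the per-document work waits on I/O or calls out to native code; this is an illustration of the pattern, not the platform's architecture.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(document):
    """Hypothetical stand-in for the per-document analysis pipeline."""
    tokens = document.split()
    return {"tokens": len(tokens), "has_exclamation": "!" in document}


def analyze_all(documents, workers=4):
    """Run the pipeline over many documents with a fixed-size thread pool.

    pool.map returns results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))


if __name__ == "__main__":
    docs = ["Service is slow", "The showers had no hot water!", "Aisles are too narrow"]
    for doc, result in zip(docs, analyze_all(docs)):
        print(doc, result)
```

Because each document is independent, the same per-document function maps naturally onto a MapReduce-style job when one machine is not enough.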
  • 53. Grand Challenge Projects. Aristo: at AI2, Allen AI Institute, http://www.allenai.org. Areas: knowledge extraction, reasoning, question answering. Can the system answer 4th and 6th grade exams? Project NELL: Never Ending Language Learning, http://rtw.ml.cmu.edu/rtw/. “Learnt” 50+ million facts from Web data.
  • 54. Conclusions. Deeper distillation from text is required. It can be achieved by detecting and combining multiple elements in text: feature engineering, knowledge engineering, classifier selection. Semantics and knowledge engineering are key. Have been successful in leveraging the Cognie™ Platform to develop custom text analytics engines in multiple domains.
  • 55. Thank you! naveen.ashish@cognie.com