Deep Distillation from Text
Naveen Ashish
University of Southern California & Cognie Inc.
March 18th 2014
This is about …..
“DEEP TEXT DISTILLATION”
The hard nut of having computers “understand” natural
language (text) ….
 Pushing the boundaries of what we can achieve ….
"It's (the problem of computers understanding natural language) ambitious ...in
fact there's no more important project than understanding intelligence and
recreating it." - Ray Kurzweil (2013)
Alan Turing based the Turing Test entirely on written language….To really master
natural language …that’s the key to the Turing Test–to a human requires the full
scope of human intelligence. …So the point is that natural language is a very
profound domain to do artificial intelligence in. - Ray Kurzweil (2013)
Why ….
 the problem is far from solved ….. !!!!
 unstructured data everywhere
95 % !
search
text analytics
big data analytics
health informatics
social-media intelligence
Introduction
About myself
Associate Professor (Informatics), Keck School of Medicine,
University of Southern California
Cognie Inc.,
Work leverages
Information extraction work and systems developed at UC Irvine
 XAR, UCI-PEP
Advisory consulting engagements with several companies and
start-ups
Outline
Deep distillation: What it is and why
State-of-the-art
Fundamentals
Approach
Details
Expressions, Entities, Sentiment
Case studies
Retail, Health, Risk assessment
Conclusions
What is “Deep” text distillation ?
Data
Abstract
This paper describes the results of a
study investigating ….
…..
We conclude that salt and diabetes are
largely unrelated.
Deep Distillation
The abstract, not explicitly mentioned !
What falls in this category
Expressions
Contextual sentiment
Aspect classification
I think you need better chefs → SUGGESTION
The mocha is too sweet → NEGATIVE
I used to take Lipitor for … → PERSONAL EXPERIENCE
The dim lights have a cozy effect …. → AMBIENCE
A Common Intersection
Distill at sentence level
Aggregate to entire feedback, post, comment or
thread
Three primary elements
Expression/Intent
Entities/Aspects (and Classes)
Sentiment
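The sentence-then-aggregate idea on this slide can be sketched as follows. This is a minimal illustration, not the COGNIE implementation: the `SentenceResult` class, its field names, and the counting-based `aggregate` function are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceResult:
    """Distilled elements for one sentence (field names are illustrative)."""
    expression: str   # e.g. "COMPLAINT", "SUGGESTION"
    aspect: str       # e.g. "Service", "Food"
    sentiment: str    # e.g. "NEGATIVE", "POSITIVE"


def aggregate(results):
    """Roll sentence-level results up to one feedback item by counting."""
    return {
        "expressions": Counter(r.expression for r in results),
        "aspects": Counter(r.aspect for r in results),
        "sentiment": Counter(r.sentiment for r in results),
    }


# Hypothetical distillation output for a three-sentence feedback item.
feedback = [
    SentenceResult("ADVOCACY", "Food", "POSITIVE"),
    SentenceResult("COMPLAINT", "Service", "NEGATIVE"),
    SentenceResult("SUGGESTION", "Service", "NEGATIVE"),
]
summary = aggregate(feedback)
print(summary["aspects"].most_common(1))
```

Counting is only one possible aggregation; a thread-level summary could equally weight sentences by confidence or position.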
Why Deeper ?
 Goal: Get actionable insights from data !
 Hypothesis: Deeper extraction → Better insights !
The top advice items for skin rash are aloe vera,
vitamin E oil and oatmeal
Complaints comprise 36% of the overall feedback with top
issues being slow service, drinks and coffee
Context
COGNIE™: A platform for text analytics
[Diagram labels: COGNIE™; XAR, UCI-PEP; SHIP SURVEY ANALYTICS; RETAIL ANALYTICS; RISK ASSESSMENT]
Expressions
Beyond entities and sentiment: EXPRESSIONS
Introduced in [Ashish et al, 2011]
Expressions
You should try Vitamin E oil … → ADVICE
..I have had arthritis since 1991… → EXPERIENCE
HEALTH
..for me lipitor worked like a charm… → OUTCOME
Expressions
…showers had no hot water !… → COMPLAINT
..you should have more veggie options… → SUGGESTION
RETAIL/ENTERPRISE
..meats on special this weekend… → ANNOUNCEMENT
..this is the best store on the west side… → ADVOCACY
There is hardly any evidence to suggest a link between salt and diabetes → -
These results confirm that high intake of salt leads to an increase in BP → +
RISK ASSESSMENT
The Landscape
Text Analytics Spectrum
Wide offering of
 Text analytics engines
 Text analysis tools – many open-source
Largely still for “spotting things”
 entities, concepts, sentiment, topics, emotions ….
Going deeper
 Luminoso
 Attensity (Intents)
Deep Learning for Sentiment
 Stanford
 Recursive Neural Networks
Approach
Approach
natural language processing
machine learning
semantics
Architecture: COGNIE™ Platform
[Architecture diagram. Component labels: Segmentation, POS Tagging, Entity extraction, Anaphora, Parsing, Gram analysis; Existing (DMOZ, SNOMED, UMLS), Creation, Declarative; Naïve-Bayes, MaxEnt, TFIDF, CRF, RNN Deep Learning, ENSEMBLE. Layer labels: NLP, Machine Learning, Knowledge Engineering]
The Indicators: “Give Aways”
A combination of multiple types of elements !
…showers had no hot water !… COMPLAINT
(You) should have more veggie options… SUGGESTION
..i have been on lipitor… EXPERIENCE
..this is the best store on the west side… ADVOCACY
Approach: Given Indicators
NLP
Identification of individual elements
 Unsupervised
Relationships between elements
Semantics
Identification of individual elements
 Knowledge driven
Machine Learning Classification
Combine elements → classify
Natural Language Processing
 UIMA and GATE
 Stanford NLP Tools
POS tagging
 Parsing
 NE Recognizer
 Geo-tagger
 ….
Natural Language Processing
 Text Segmentation
In many cases the “unit” of distillation is a sentence
 Segmentation
 UIMA (or GATE)
 Custom
 Complex sentence segmentation
 Breakup into individual clauses
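A rough sketch of sentence segmentation followed by clause break-up is shown below. It uses plain regular expressions as a stand-in for UIMA, GATE, or a custom segmenter; the boundary rules (sentence-final punctuation, commas, semicolons, "and", "but") are assumptions chosen for the example and will mis-split abbreviations such as "Dr.".

```python
import re

# Split after ., ! or ? when followed by whitespace (naive on purpose).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumed clause boundaries: commas, semicolons, and two conjunctions.
_CLAUSE_SPLIT = re.compile(r"\s*(?:[,;]|\bbut\b|\band\b)\s*", re.IGNORECASE)


def split_sentences(text):
    """Return the non-empty sentences of a text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def split_clauses(sentence):
    """Break one sentence into its non-empty clauses."""
    return [c.strip() for c in _CLAUSE_SPLIT.split(sentence) if c.strip()]


text = "The food was awesome, service needs improvement. You need to be open longer!"
for sentence in split_sentences(text):
    print(split_clauses(sentence))
```

Each clause can then be distilled on its own, which matters for feedback that praises one aspect and criticizes another in the same sentence.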
NLP
 Part-of-speech tags are key indicators
Expression distillation
 Entity extraction
Names, Locations, Organizations
 Parsing
If required
 Anaphora
NGram Analysis
Unigram and Bigram analysis
Obtain
Grams
Frequency
Entropy
Grams of tokens as well as POS Patterns
VB VBD
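The gram statistics listed on this slide could be computed along the following lines. The slide does not say which entropy is meant; this sketch assumes it is the entropy of a gram's distribution over expression labels, so that low-entropy grams are strong indicators. The labeled sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """All contiguous n-grams of a token (or POS tag) sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gram_stats(labeled_sentences, n):
    """Return {gram: (frequency, entropy of its label distribution in bits)}."""
    by_gram = defaultdict(Counter)
    for tokens, label in labeled_sentences:
        for gram in ngrams(tokens, n):
            by_gram[gram][label] += 1
    stats = {}
    for gram, label_counts in by_gram.items():
        total = sum(label_counts.values())
        entropy = sum((c / total) * math.log2(total / c)
                      for c in label_counts.values())
        stats[gram] = (total, entropy)
    return stats


# Hypothetical labeled sentences, pre-tokenized and lowercased.
data = [
    ("you should try vitamin e oil".split(), "ADVICE"),
    ("you should have more veggie options".split(), "SUGGESTION"),
    ("you should see a doctor".split(), "ADVICE"),
    ("showers had no hot water".split(), "COMPLAINT"),
]
bigrams = gram_stats(data, 2)
print(bigrams[("you", "should")])
```

The same functions work on POS patterns: pass tag sequences such as `["VB", "VBD"]` in place of tokens.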
Before Automated Classification: Manual
Patterns
SoL: Sequences of Labels
Labels
LEX-FOODADJ
 spicy
LEX-EXCESS
 too, very
ONT-FOOD
POS-NOUN
Sequences (Patterns)
ANY LEX-EXCESS LEX-FOODADJ ANY 
POS-VB POS-MD ….
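A minimal matcher for such label sequences might look like this. The lexicon entries are the ones on the slide; everything else is an assumption for illustration, in particular that ANY stands for zero or more arbitrary tokens and that only lexicon labels (not ONT- or POS- labels) are resolved.

```python
# Lexicon entries come from the slide; the lookup scheme is assumed.
LEXICON = {
    "LEX-FOODADJ": {"spicy"},
    "LEX-EXCESS": {"too", "very"},
}


def labels_for(token):
    """All labels a token carries (lexicon labels only, in this sketch)."""
    return {label for label, words in LEXICON.items() if token.lower() in words}


def matches(pattern, tokens):
    """True if the label pattern matches the whole token sequence.

    ANY is treated as a wildcard for zero or more tokens (an assumption).
    """
    if not pattern:
        return not tokens
    head, rest = pattern[0], pattern[1:]
    if head == "ANY":
        return any(matches(rest, tokens[i:]) for i in range(len(tokens) + 1))
    return bool(tokens) and head in labels_for(tokens[0]) and matches(rest, tokens[1:])


pattern = ["ANY", "LEX-EXCESS", "LEX-FOODADJ", "ANY"]
print(matches(pattern, "the curry is too spicy".split()))  # True
print(matches(pattern, "the curry is spicy".split()))      # False
```

Supporting ONT- and POS- labels would mean extending `labels_for` with an ontology lookup and a POS tagger, which this sketch leaves out.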
Classification: Machine Learning
 Classification tasks
Expression
(Contextual) Sentiment
Aspect category
Frameworks
Weka
Mallet
Baseline Classifiers
 Mallet and Weka
NaiveBayes
MaxEnt
CRF
 Gram-based
Uni, Bi and Trigram features
Baseline
~ 10% accuracy
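For orientation, a gram-based baseline of this kind can be written in a few lines. The sketch below is a from-scratch multinomial Naive Bayes over unigrams with add-one smoothing; it is not the Weka or Mallet implementation, and the four training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over unigrams with add-one smoothing."""

    def fit(self, sentences, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            for word in sentence.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, sentence):
        words = sentence.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each word.
            score = math.log(doc_count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data in the style of the retail examples.
train = [
    ("showers had no hot water", "COMPLAINT"),
    ("the room had no towels", "COMPLAINT"),
    ("you should have more veggie options", "SUGGESTION"),
    ("you should add more parking", "SUGGESTION"),
]
model = NaiveBayes().fit([s for s, _ in train], [l for _, l in train])
print(model.predict("you should have more staff"))
```

A toy set like this separates easily; on real feedback, plain gram features are what the slide reports as the weak baseline.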
Expression Classification: Features
 Features
Polar words
Punctuations
Ngrams
POS patterns
Length !
Beginning
Ontology
…
Classifiers
 Trees
Decision Tree (J48)
Functions
Logistic Regression
SVM
Sequence Tagging
CRF: Conditional Random Fields
Expression Classification: Results
Have achieved 75% precision and recall for all
expressions considered
Factors
Feature engineering
Classifier selection
Knowledge engineering
Contextual Sentiment
 (Just) polar words can be misleading !
Polar words may not be present at all !
Combination of elements
The mocha is too sweet
Wait time is over an hour
Aisles are too narrow
Service is slow
Semantics: Ontologies
 Health
 Drugs
 Conditions
 Procedures
 Symptoms
 …
Retail (Dining)
 Food/Entrees
 Service
 Ambience
 ….
Leverage Existing Knowledge Sources
Health informatics
 UMLS
 NCI Thesaurus
 SNOMED
Retail
 DMOZ
Many other
 Freebase
 Wikipedia, DBPedia
OpenData
 data.gov
Knowledge Engineering Tools
“Mini” ontology creation
API access
Freebase
BioPortal
Wrappers
DMOZ, ….
Practical Requirements
Confidence Measures
Below threshold routed to manual transcription teams
Polarity
Snippets
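The confidence requirement amounts to a simple routing step, sketched below. The `Prediction` fields, the `route` function, and the 0.7 threshold are assumptions made for illustration; the slide only states that results below a threshold go to manual teams.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    snippet: str       # the text span the label was derived from
    label: str
    confidence: float  # assumed to lie in [0, 1]


def route(predictions, threshold=0.7):
    """Split predictions into auto-accepted and manual-review queues."""
    accepted = [p for p in predictions if p.confidence >= threshold]
    manual = [p for p in predictions if p.confidence < threshold]
    return accepted, manual


# Hypothetical classifier output.
preds = [
    Prediction("showers had no hot water", "COMPLAINT", 0.93),
    Prediction("meats on special this weekend", "ANNOUNCEMENT", 0.55),
]
accepted, manual = route(preds)
print(len(accepted), len(manual))
```

Keeping the snippet with each prediction lets a reviewer see the evidence behind a label without reopening the full feedback item.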
Open-Source Leverage
COGNIE™: Open Source Tools
Framework
UIMA
Classification
Weka
Mallet
NLP
Stanford tools
Indexing
Lucene
Databases
MySQL, MongoDB
Knowledge Engineering
Protégé
Select Case Studies
Case Study: Health Informatics
Distillation
Case Study: Retail & Survey Analytics
Feedback
 Direct, device collected
 Social-media
Typically short, few sentences
Strong requirement for aspect classification
 [Food,Service,Ambience,Pricing,Other]
Negative : “Immediate” vs “Long Term” classification
…food was awesome, service needs improvement ….
you need to be open longer !
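Aspect classification over the five categories can be approximated with an ontology lookup, as sketched here. The term lists are a hypothetical mini ontology invented for the example, and a sentence is allowed to carry more than one aspect.

```python
# Hypothetical mini ontology: aspect category -> indicator terms.
ASPECT_TERMS = {
    "Food": {"food", "mocha", "veggie", "meats", "chefs"},
    "Service": {"service", "wait", "staff"},
    "Ambience": {"lights", "cozy", "music"},
    "Pricing": {"price", "prices", "expensive"},
}


def classify_aspects(sentence):
    """Return the sorted aspect categories whose terms occur in the sentence."""
    words = {w.strip(".,!?").lower() for w in sentence.split()}
    found = sorted(a for a, terms in ASPECT_TERMS.items() if words & terms)
    return found or ["Other"]


print(classify_aspects("food was awesome, service needs improvement"))
print(classify_aspects("you need to be open longer !"))
```

The second example falls through to "Other" because none of its words appear in the term lists, which is where a trained classifier would be expected to do better than a lookup.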
Case Study: Risk Assessment
 Biomedical Literature Abstracts
Correlation direction (+ -)
Subject
Article type
Features
Clauses
Negation and Triggers
Semantic Heterogeneity
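How negation and trigger cues might combine into a correlation direction is sketched below. The cue lists are hypothetical and deliberately small, and the rule is crude: any negation cue yields "-", otherwise a trigger yields "+". The test sentences are the examples used earlier in the deck.

```python
import re

# Hypothetical cue lists; the slide names "Negation and Triggers" only.
NEGATION = re.compile(r"\b(no|not|hardly any|unrelated)\b", re.IGNORECASE)
TRIGGER = re.compile(r"\b(leads? to|increases?|causes?|associated with)\b",
                     re.IGNORECASE)


def correlation_direction(sentence):
    """Return '-', '+', or None when neither kind of cue is present."""
    if NEGATION.search(sentence):
        return "-"
    if TRIGGER.search(sentence):
        return "+"
    return None


examples = [
    "There is hardly any evidence to suggest a link between salt and diabetes",
    "These results confirm that high intake of salt leads to increase in BP",
    "We conclude that salt and diabetes are largely unrelated.",
]
for sentence in examples:
    print(correlation_direction(sentence))
```

A rule like this ignores clause structure, so a negation in one clause would wrongly flip a finding stated in another; that is one reason the slide lists clauses as a feature.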
Performance
MapReduce
Throughput can be an issue
Complex language processing algorithms
Large ontologies in some cases
Hadoop MapReduce
[Kahn and Ashish, 2014]
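The map and reduce roles for this kind of workload can be illustrated in a single process. This sketch does not use the Hadoop API; the sort stands in for the shuffle phase, and `classify` is a hypothetical one-line rule standing in for the expensive distillation step.

```python
from itertools import groupby
from operator import itemgetter


def classify(sentence):
    """Stand-in for the expensive distillation step (hypothetical rule)."""
    return "SUGGESTION" if "should" in sentence.lower() else "OTHER"


def mapper(record):
    """Emit (label, 1) for every sentence in one feedback record."""
    for sentence in record.split("."):
        if sentence.strip():
            yield classify(sentence), 1


def reducer(label, counts):
    """Sum the counts emitted for one label."""
    return label, sum(counts)


def run(records):
    # Map phase, then sort to imitate the shuffle, then reduce per key.
    pairs = sorted(pair for record in records for pair in mapper(record))
    return dict(reducer(key, (count for _, count in group))
                for key, group in groupby(pairs, key=itemgetter(0)))


records = [
    "You should have more veggie options. The store is clean.",
    "You should open earlier.",
]
print(run(records))
```

Because each record is processed independently in the map phase, the costly language processing parallelizes across records, and only small (label, count) pairs reach the reducers.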
Conclusions
Conclusions
Deeper distillation from text is important
Can be achieved by
Detecting and combining multiple elements in text
 Feature engineering
 Knowledge engineering
 Classifier selection
Does not have to be perfect
Every domain, dataset has its nuances
thank you !
naveen.ashish@cognie.com
