Mind the Semantic Gap
How "talking semantics" can help you perform better data science
Panos Alexopoulos
Head of Ontology
We are all here for the same purpose
Some of us work on the data supply side
• We collect and generate data
• We represent, integrate, store and
make them accessible through data
models (and relevant technology)
• We get them ready for usage and
exploitation
Some others work on the data exploitation side
• We use data to build predictive,
descriptive or other types of analytics
solutions
• We use data to build and power AI
applications
And many of us do both
But there is a gap between the two sides that very
often we don’t see
And that’s the semantic gap
• The situation when the data models of
the supply side are misunderstood
and misused by the exploitation side.
• The situation when the data
requirements of the exploitation side
are misunderstood by the supply side.
• Typically, the more distant supply is
from usage, the greater the
semantic gap.
Data meaning is communicated through (semantic)
data models
• Conceptual descriptions and representations of data that convey the
latter’s meaning in an explicit and commonly understood and accepted
way among humans and systems.
The semantic gap is caused by bad semantic models
• We model data meaning in a
wrong way.
• We model data meaning in a
non-explicit way
• We model data meaning in a
not commonly accepted way
Let’s talk about
names
Which data model is correct?
Well, none!
What do we do wrong?
• We often give inaccurate and misleading
or ambiguous names to data modeling
elements:
• If I name a table “Car” then its rows
should represent concrete cars (e.g.,
the car with registration number XYZ)
• But if my rows represent car models
(e.g., BMW 3.16 or AUDI A4), then the
table should be named “CarModel”, not
“Car”.
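The naming point above can be sketched in code. This is only an illustration: the class and field names (`CarModel`, `Car`, `registration_number`, `manufacturer`) are assumptions made for the example, not part of any particular schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarModel:
    """A kind of car offered by a manufacturer, e.g. the Audi A4."""
    manufacturer: str
    name: str


@dataclass(frozen=True)
class Car:
    """One concrete vehicle, identified by its registration number."""
    registration_number: str
    model: CarModel


audi_a4 = CarModel(manufacturer="Audi", name="A4")
my_car = Car(registration_number="XYZ", model=audi_a4)
your_car = Car(registration_number="ABC", model=audi_a4)

# Two distinct cars share one car model, so counting "cars" and
# counting "car models" give different answers.
cars = [my_car, your_car]
distinct_models = {car.model for car in cars}
print(len(cars), len(distinct_models))
```

With the two concepts named separately, a question such as "how many cars do we have?" no longer depends on guessing what a row stands for.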
Why do we do it?
• Not realizing there are any other interpretations of
the name we use
• Assuming other interpretations are irrelevant
and that people will know what we mean
• Assuming that the correct meaning will be
inferred by the context.
How to narrow the gap
• Always contemplate an element’s name in
relative isolation and try to think of all the possible
and legitimate ways it can be interpreted by a
human.
• If an element’s name has more than one
interpretation, make it unambiguous, even if
the other interpretations are not within the
domain or not very likely to occur
• Observe how the element is used in practice by
your modelers, annotators, developers and users.
Let’s talk about
synonymy
At Textkernel we do Labour Market Analytics
• Supply-Demand Analysis
• Top Skills per Job
• Career Paths
For that we need synonyms!
• Two terms are synonymous when they mean the same thing in (almost)
all contexts.
• We need synonyms to get statistics on the actual professions and skills,
no matter the form or language in which they are expressed in text
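A minimal sketch of why synonyms matter for such statistics: mentions are mapped to one canonical profession before they are counted. The `CANONICAL` table and the sample mentions are invented for this illustration and are not taken from any real taxonomy.

```python
from collections import Counter

# Hypothetical synonym table: normalized surface form -> canonical profession.
CANONICAL = {
    "ceo": "Chief Executive Officer",
    "chief executive officer": "Chief Executive Officer",
    "software developer": "Software Developer",
    "softwareontwikkelaar": "Software Developer",  # Dutch
    "softwareentwickler": "Software Developer",    # German
}


def count_professions(mentions):
    """Count mentions per canonical profession, ignoring case and language."""
    counts = Counter()
    for mention in mentions:
        key = mention.strip().lower()
        counts[CANONICAL.get(key, "UNKNOWN")] += 1
    return counts


mentions = [
    "CEO",
    "Chief Executive Officer",
    "Softwareontwikkelaar",
    "software developer",
    "astronaut",
]
counts = count_professions(mentions)
print(counts)
```

The counts are only as good as the table: every entry that is merely related rather than synonymous ends up inflating the wrong profession.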
Can we use any data model for synonymy? Not really!
Term | Synonyms | Model
Profession | Occupation, Vocation, Work, Living | KBpedia
Chief Executive Officer | CEO, chief operating officer | WordNet
Chief Executive Officer | Senior executive officer, chairman, CEO, managing director, president | ESCO
Economist | economics science researcher, macro analyst, economics analyst, interest analyst, ... | ESCO
Data Scientist | data engineer, research data scientist, data expert, data research scientist | ESCO
Why this gap?
• We forget or ignore that synonymy is a vague
and context-dependent relation.
• We mix synonymy with hyponymy and
semantic relatedness and similarity
• We are unaware of subtle but important
differences in meaning for our particular
domain or context
• We don’t document biases, assumptions and
choices
How to narrow the gap
• Insist on meaning equivalence over mere
relatedness
• Get multiple opinions (from people and data)
• If you can’t be sure that your synonyms are
indeed synonyms, then don’t call them
that
• Always document the criteria, assumptions
and biases of your synonymy.
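One possible way to follow these guidelines in a data model is to type each link between terms and store the rationale next to it. The sketch below assumes a hypothetical `Relation` enum and `TermLink` record; the classification of the example terms is our own illustrative reading, not a judgment taken from any of the models above.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    SYNONYM = "synonym"    # same meaning in (almost) all contexts
    NARROWER = "narrower"  # hyponym: a more specific concept
    RELATED = "related"    # associated, but not equivalent


@dataclass(frozen=True)
class TermLink:
    term: str
    other: str
    relation: Relation
    rationale: str  # documented criterion or assumption behind the choice


LINKS = [
    TermLink("Chief Executive Officer", "CEO", Relation.SYNONYM,
             "abbreviation of the same title"),
    TermLink("Chief Executive Officer", "chief operating officer", Relation.RELATED,
             "a different executive role"),
    TermLink("Data Scientist", "research data scientist", Relation.NARROWER,
             "a data scientist working in research"),
    TermLink("Data Scientist", "data engineer", Relation.RELATED,
             "a neighbouring role with a different focus"),
]


def synonyms_of(term):
    """Return only the terms explicitly marked as synonyms of `term`."""
    return [
        link.other
        for link in LINKS
        if link.term == term and link.relation is Relation.SYNONYM
    ]


print(synonyms_of("Chief Executive Officer"))
print(synonyms_of("Data Scientist"))
```

Consumers that need strict equivalence can then filter on `Relation.SYNONYM`, while looser use cases can opt in to the other relation types knowingly.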
Let’s talk about
semantic relatedness
Another critical capability for good analytics is entity
disambiguation
For that we need semantically related terms!
• The meaning of an ambiguous term in a
text is most likely the one that is related to
the meanings of the other terms in the
same text.
• Therefore, knowing which terms are
semantically related helps in performing
disambiguation.
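The idea can be sketched as a simple overlap count between the terms in a text and the related terms of each candidate meaning. This is a deliberately naive illustration, not the method of any particular system; the candidate names and their related-term sets are hand-written assumptions.

```python
# Hypothetical candidate meanings for the ambiguous mention "Ronaldo",
# each with a small hand-written set of related terms (lower-cased).
CANDIDATES = {
    "Cristiano Ronaldo": {"casemiro", "portugal"},
    "Ronaldo (Brazilian footballer)": {"brazil", "inter milan", "rivaldo"},
}


def disambiguate(context_terms, candidates):
    """Pick the candidate whose related terms overlap most with the context."""
    context = {term.lower() for term in context_terms}
    scores = {
        entity: len(related & context)
        for entity, related in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


best, scores = disambiguate(["Casemiro", "Claudio Bravo"], CANDIDATES)
print(best, scores)
```

On a tie, `max` simply returns the first candidate, which is one reason a real system would need better evidence than raw overlap.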
Can we use any related terms for disambiguation? Not really!
• We need related terms that are not very
ambiguous themselves
• We need related terms that are highly specific
to our target term.
• We need related terms that are prevalent in
the data we process.
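These three criteria could be turned into a rough filter over candidate related terms. The sketch below is an assumption-heavy illustration: the `RelatedTerm` fields, the thresholds, and all the numbers are invented, and suitable values would have to be found empirically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelatedTerm:
    text: str
    num_senses: int        # ambiguity: how many meanings the term itself has
    num_targets: int       # specificity: how many target entities it relates to
    corpus_frequency: int  # prevalence: occurrences in the data we process


def is_useful(term, max_senses=1, max_targets=3, min_frequency=5):
    """Keep terms that are unambiguous, specific and prevalent enough."""
    return (
        term.num_senses <= max_senses
        and term.num_targets <= max_targets
        and term.corpus_frequency >= min_frequency
    )


candidates = [
    RelatedTerm("Real", num_senses=5, num_targets=1, corpus_frequency=120),
    RelatedTerm("football", num_senses=1, num_targets=500, corpus_frequency=300),
    RelatedTerm("Casemiro", num_senses=1, num_targets=2, corpus_frequency=40),
    RelatedTerm("rare nickname", num_senses=1, num_targets=1, corpus_frequency=0),
]

useful = [term.text for term in candidates if is_useful(term)]
print(useful)
```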
A soccer experiment
Back in 2015, my old team had to detect and
disambiguate mentions of soccer players and teams in
short textual extracts from video scenes from football
matches:
“It's the 70th minute of the game and after a magnificent
pass by Casemiro, Ronaldo managed to beat Claudio Bravo.
Real now leads 1-0."
For that we used an in-house system, called Knowledge
Tagger, and DBpedia as domain knowledge about soccer
teams and players.
A soccer experiment
Initially, we ran the system with all the DBpedia
related entities for each player as disambiguation
evidence.
Precision was 60% and recall 55%
Then we pruned DBpedia and kept only three
relations:
• Players and their current teams
• Players and their current co-players
• Players and their current managers
Precision increased to 82% and recall to 80%
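The pruning step can be pictured as a whitelist over relation types. The sketch below uses toy triples; the relation names (`current_team`, `current_co_player`, `current_manager`, and the others) are made up for the illustration and are not actual DBpedia property names, and the code is not the Knowledge Tagger implementation.

```python
# Toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Player A", "current_team", "Team X"),
    ("Player A", "current_co_player", "Player B"),
    ("Player A", "current_manager", "Manager M"),
    ("Player A", "birth_place", "City C"),
    ("Player A", "former_team", "Team Y"),
]

# The three kinds of relation kept as disambiguation evidence.
KEPT_RELATIONS = {"current_team", "current_co_player", "current_manager"}


def prune(triples, kept_relations):
    """Keep only triples whose relation is on the whitelist."""
    return [triple for triple in triples if triple[1] in kept_relations]


def evidence_for(entity, triples):
    """Collect the related entities that remain as evidence for `entity`."""
    return {obj for subj, _relation, obj in triples if subj == entity}


pruned = prune(TRIPLES, KEPT_RELATIONS)
print(sorted(evidence_for("Player A", pruned)))
```

Everything outside the whitelist, such as a birthplace or a former team, no longer counts as evidence when a mention is scored.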
Why this gap?
• We usually don’t want just any relatedness but
a relatedness that actually helps our goal.
• Our task’s required relatedness seems to be
compatible with the one provided by the data,
yet there are subtle differences that make the
latter non-useful or even harmful.
• Semantic relatedness is a vague relation for
which it’s relatively easy to get agreement
outside of any context, but hard within one.
How to narrow the gap
• Uncover the hidden assumptions and expectations
behind the “should be related” requirement.
• Give people examples of terms that you think
may be related
• Ask them to judge them as related or not in
context.
• Challenge them to justify their decisions.
• Identify patterns and rules that characterize
these decisions.
• Use this information to derive the “relatedness”
you need.
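A small sketch of how such judgments might be collected and summarized, so that disagreements surface as cases to discuss. The judges, term pairs, contexts and votes below are all invented for the illustration.

```python
from collections import defaultdict

# (judge, term_a, term_b, context, judged_related)
JUDGMENTS = [
    ("judge1", "Ronaldo", "Casemiro", "match commentary", True),
    ("judge2", "Ronaldo", "Casemiro", "match commentary", True),
    ("judge3", "Ronaldo", "Casemiro", "match commentary", True),
    ("judge1", "Ronaldo", "Portugal", "club match commentary", True),
    ("judge2", "Ronaldo", "Portugal", "club match commentary", False),
    ("judge3", "Ronaldo", "Portugal", "club match commentary", False),
]


def summarize(judgments):
    """Share of 'related' votes per (term pair, context), plus unanimity."""
    votes = defaultdict(list)
    for _judge, term_a, term_b, context, related in judgments:
        votes[(term_a, term_b, context)].append(related)
    summary = {}
    for key, key_votes in votes.items():
        share = sum(key_votes) / len(key_votes)
        summary[key] = {
            "related_share": share,
            "unanimous": share in (0.0, 1.0),
        }
    return summary


summary = summarize(JUDGMENTS)
needs_discussion = [key for key, info in summary.items() if not info["unanimous"]]
print(needs_discussion)
```

The non-unanimous cases are the ones where asking judges to justify their decision is most likely to reveal a hidden assumption.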
Let’s summarize
Takeaways
The Semantic Gap in Data Science is real
➔ We often model data meaning badly
➔ We often understand the data meaning wrongly
➔ We often produce the wrong results
Closing it is hard
➔ Ambiguity
➔ Vagueness
➔ Variety and diversity
➔ Context-dependence
We can avoid and/or narrow it, though, by paying more attention
➔ Understand basic semantic phenomena
➔ Understand how data can be misunderstood
➔ Be aware of and document assumptions, choices and biases
Thank you!
Panos Alexopoulos
Head of Ontology @ Textkernel
Writing a book on semantic data modeling @ O’Reilly
E-mail: alexopoulos@textkernel.nl
Web: http://www.panosalexopoulos.com
LinkedIn: www.linkedin.com/in/panosalexopoulos
Twitter: @PAlexop

Mind the Semantic Gap in Data Science

  • 1. Mind the Semantic Gap How "talking semantics" can help you perform better data science Panos Alexopoulos Head of Ontology
  • 2. We are all here for the same purpose
  • 3. Some of us work on the data supply side • We collect and generate data • We represent, integrate, store and make them accessible through data models (and relevant technology) • We get them ready for usage and exploitation
  • 4. Some others work on the data exploitation side • We use data to build predictive, descriptive or other types of analytics solutions • We use data to build and power AI applications
  • 5. And many of us do both
  • 6. But there is a gap between the two sides that very often we don’t see
  • 7. And that’s the semantic gap • The situation when the data models of the supply side are misunderstood and misused by the exploitation side. • The situation when the data requirements of the exploitation side are misunderstood by the supply side. • Typically, the more distant supply is from usage, the greater the semantic gap.
  • 8. Data meaning is communicated through (semantic) data models • Conceptual descriptions and representations of data that convey the latter’s meaning in an explicit and commonly understood and accepted way among humans and systems.
  • 9. The semantic gap is caused by bad semantic models • We model data meaning in a wrong way. • We model data meaning in a non-explicit way • We model data meaning in a not commonly accepted way
  • 11. Which data model is correct?
  • 13. What do we do wrong? • We often give inaccurate and misleading or ambiguous names to data modeling elements: • If I name a table “Car” then its rows should represent concrete cars (e.g., the car with registration number XYZ) • But if my rows represent car models (e.g., BMW 3.16 or AUDI A4), then the table should be named “CarModel”, not “Car”.
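A minimal sketch of the naming distinction made on this slide, assuming a Python representation rather than database tables. The class and field names, and the second registration number "ABC", are illustrative assumptions and not part of the original talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarModel:
    """A kind of car, e.g. the Audi A4 (not an individual vehicle)."""
    manufacturer: str
    name: str


@dataclass(frozen=True)
class Car:
    """One concrete vehicle, identified by its registration number."""
    registration_number: str
    model: CarModel


def count_cars(cars):
    """Number of concrete vehicles."""
    return len(cars)


def count_models(cars):
    """Number of distinct car models among the vehicles."""
    return len({car.model for car in cars})


# Hypothetical data: two concrete cars that share one model.
a4 = CarModel(manufacturer="Audi", name="A4")
fleet = [
    Car(registration_number="XYZ", model=a4),
    Car(registration_number="ABC", model=a4),
]

print(count_cars(fleet), count_models(fleet))
```

With the two notions kept apart, "how many cars do we have?" and "how many car models do we have?" give different answers, which is exactly the difference a table named "Car" that actually holds models would hide.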
  • 14. Why do we do it? • Not realizing there are any other interpretations of the name we use • Assuming other interpretations are irrelevant and that people will know what we mean • Assuming that the correct meaning will be inferred from the context.
  • 15. How to narrow the gap • Always contemplate an element’s name in relative isolation and try to think of all the possible and legitimate ways it can be interpreted by a human. • If an element’s name has more than one interpretation, make it unambiguous, even if the other interpretations are not within the domain or not very likely to occur. • Observe how the element is used in practice by your modelers, annotators, developers and users.
  • 17. At Textkernel we do Labour Market Analytics • Supply-Demand Analysis • Top Skills per Job • Career Paths
  • 18. For that we need synonyms! • Two terms are synonymous when they mean the same thing in (almost) all contexts. • We need synonyms to get statistics on the actual professions and skills, no matter the form or language they are expressed in within the text.
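The role synonyms play in such statistics can be sketched as follows. This is a toy illustration only: the synonym table, the `normalize` and `count_professions` helpers and the sample mentions are assumptions made for this example, not Textkernel's data or pipeline.

```python
from collections import Counter

# Hypothetical synonym table: lower-cased surface form -> canonical label.
SYNONYMS = {
    "ceo": "Chief Executive Officer",
    "chief executive officer": "Chief Executive Officer",
    "data scientist": "Data Scientist",
}


def normalize(term):
    """Map a surface form to its canonical label, or None if unknown."""
    return SYNONYMS.get(term.strip().lower())


def count_professions(mentions):
    """Count mentions per canonical profession, ignoring unknown terms."""
    counts = Counter()
    for mention in mentions:
        canonical = normalize(mention)
        if canonical is not None:
            counts[canonical] += 1
    return counts


mentions = ["CEO", "Chief Executive Officer", "data scientist", "ceo ", "plumber"]
counts = count_professions(mentions)
print(counts)
```

Every entry in the table is counted as the same profession, so a pair that is merely related rather than equivalent would silently distort the statistics.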
  • 19. Can we use any data model for synonymy? Not really!
      Term | Synonyms | Model
      Profession | Occupation, Vocation, Work, Living | KBPedia
      Chief Executive Officer | CEO, chief operating officer | Wordnet
      Chief Executive Officer | Senior executive officer, chairman, CEO, managing director, president | ESCO
      Economist | economics science researcher, macro analyst, economics analyst, interest analyst, ... | ESCO
      Data Scientist | data engineer, research data scientist, data expert, data research scientist | ESCO
  • 20. Why this gap? • We forget or ignore that synonymy is a vague and context-dependent relation. • We mix synonymy with hyponymy and semantic relatedness and similarity. • We are unaware of subtle but important differences in meaning for our particular domain or context. • We don’t document biases, assumptions and choices.
  • 21. How to narrow the gap • Insist on meaning equivalence over mere relatedness. • Get multiple opinions (from people and data). • If you can’t be sure that your synonyms are indeed synonyms, then don’t call them synonyms. • Always document the criteria, assumptions and biases of your synonymy.
  • 23. Another critical capability for good analytics is entity disambiguation
  • 24. For that we need semantically related terms! • The meaning of an ambiguous term in a text is most likely the one that is related to the meanings of the other terms in the same text. • Therefore, knowing which terms are semantically related helps in performing disambiguation.
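The idea on this slide can be sketched as a simple context-overlap scorer. This is a toy under stated assumptions, not the Knowledge Tagger system: the candidate entities and their related terms are hand-written for illustration (not taken from DBpedia), and `tokenize` and `disambiguate` are hypothetical helper names.

```python
import re

# Hand-written toy data: ambiguous mention -> candidate entity -> related terms.
CANDIDATES = {
    "Ronaldo": {
        "Cristiano Ronaldo": {"casemiro", "real"},
        "Ronaldo Nazário": {"brazil", "inter"},
    }
}


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def disambiguate(mention, text, candidates):
    """Pick the candidate whose related terms overlap most with the text.

    Returns None when no candidate shares any term with the context.
    """
    context = tokenize(text)
    scores = {
        entity: len(related & context)
        for entity, related in candidates[mention].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


text = ("It's the 70th minute of the game and after a magnificent pass by "
        "Casemiro, Ronaldo managed to beat Claudio Bravo. Real now leads 1-0.")
result = disambiguate("Ronaldo", text, CANDIDATES)
print(result)
```

The quality of the outcome depends entirely on which related terms are listed per candidate. A related term that is itself ambiguous or that fits several candidates adds noise instead of evidence.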
  • 25. Can we use any related terms for disambiguation? Not really! • We need related terms that are not very ambiguous themselves • We need related terms that are highly specific to our target term. • We need related terms that are prevalent in the data we process.
  • 26. A soccer experiment Back in 2015, my old team had to detect and disambiguate mentions of soccer players and teams in short textual extracts from video scenes from football matches: “It's the 70th minute of the game and after a magnificent pass by Casemiro, Ronaldo managed to beat Claudio Bravo. Real now leads 1-0.” For that we used an in-house system called Knowledge Tagger, and DBpedia as domain knowledge about soccer teams and players.
  • 27. A soccer experiment Initially, we ran the system with all the DBpedia related entities for each player as disambiguation evidence. Precision was 60% and recall 55%. Then we pruned DBpedia and kept only three relations: • Players and their current teams • Players and their current co-players • Players and their current managers Precision increased to 82% and recall to 80%.
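The pruning step described above might look roughly like the following sketch. The relation names (`current_team`, `current_co_player`, `current_manager`, `birth_place`, `former_team`) and all entities are placeholders invented for this example. They are not actual DBpedia properties, and this is not the code used in the experiment.

```python
# Hypothetical relation names standing in for the three relations that were kept.
KEPT_RELATIONS = {"current_team", "current_co_player", "current_manager"}

# Toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Player A", "current_team", "Team X"),
    ("Player A", "current_co_player", "Player B"),
    ("Player A", "current_manager", "Manager M"),
    ("Player A", "birth_place", "City C"),
    ("Player A", "former_team", "Team Y"),
]


def prune(triples, kept_relations):
    """Keep only the triples whose relation is in kept_relations."""
    return [triple for triple in triples if triple[1] in kept_relations]


def evidence_for(entity, triples):
    """Collect the related entities used as disambiguation evidence."""
    return {obj for subj, _, obj in triples if subj == entity}


pruned = prune(TRIPLES, KEPT_RELATIONS)
evidence = evidence_for("Player A", pruned)
print(sorted(evidence))
```

After pruning, only entities tied to the player's current situation remain as evidence, which is the kind of relatedness a commentary on a live match is likely to contain.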
  • 28. Why this gap? • We usually don’t want just any relatedness but a relatedness that actually helps our goal. • Our task’s required relatedness seems to be compatible with the one provided by the data, yet there are subtle differences that make the latter non-useful or even harmful. • Semantic relatedness is a vague relation for which it’s relatively easy to get agreement outside of any context, but hard within one.
  • 29. How to narrow the gap • Uncover the hidden assumptions and expectations behind the “should be related” requirement. • Give people examples of terms that you think can be related. • Ask them to judge them as related or not in context. • Challenge them to justify their decisions. • Identify patterns and rules that characterize these decisions. • Use this information to derive the “relatedness” you need.
  • 31. Takeaways • The Semantic Gap in Data Science is real ➔ We often model data meaning badly ➔ We often understand the data meaning wrongly ➔ We often produce the wrong results • Closing it is hard ➔ Ambiguity ➔ Vagueness ➔ Variety and diversity ➔ Context-dependence • We can avoid and/or narrow it, though, by paying more attention ➔ Understand basic semantic phenomena ➔ Understand how data can be misunderstood ➔ Be aware of and document assumptions, choices and biases
  • 32. Thank you! Panos Alexopoulos Head of Ontology @ Textkernel Writing a book on semantic data modeling @ O’Reilly E-mail: alexopoulos@textkernel.nl Web: http://www.panosalexopoulos.com LinkedIn: www.linkedin.com/in/panosalexopoulos Twitter: @PAlexop