
Mind the Semantic Gap


Mind the Semantic Gap - How "talking semantics" can help you perform better data science

Published in: Technology


  1. Mind the Semantic Gap: How "talking semantics" can help you perform better data science. Panos Alexopoulos, Head of Ontology
  2. We are all here for the same purpose
  3. Some of us work on the data supply side • We collect and generate data • We represent, integrate, store and make them accessible through data models (and relevant technology) • We get them ready for usage and exploitation
  4. Some others work on the data exploitation side • We use data to build predictive, descriptive or other types of analytics solutions • We use data to build and power AI applications
  5. And many of us do both
  6. But there is a gap between the two sides that we very often don't see
  7. And that's the semantic gap • The situation where the data models of the supply side are misunderstood and misused by the exploitation side. • The situation where the data requirements of the exploitation side are misunderstood by the supply side. • Typically, the more distant supply is from usage, the greater the semantic gap.
  8. Data meaning is communicated through (semantic) data models • Conceptual descriptions and representations of data that convey the latter's meaning in an explicit and commonly understood and accepted way among humans and systems.
  9. The semantic gap is caused by bad semantic models • We model data meaning in a wrong way. • We model data meaning in a non-explicit way. • We model data meaning in a not commonly accepted way.
  10. Let's talk about names
  11. Which data model is correct?
  12. Well, none!
  13. What do we do wrong? • We often give inaccurate, misleading or ambiguous names to data modeling elements: • If I name a table "Car" then its rows should represent concrete cars (e.g., the car with registration number XYZ). • But if my rows represent car models (e.g., BMW 316 or AUDI A4), then the table should be named "CarModel", not "Car".
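The naming pitfall above can be sketched in code. A minimal Python illustration; the class and field names are my own choices, not from the talk:

```python
from dataclasses import dataclass

# A model line is a concept like "BMW 316": it does not identify
# any single physical vehicle.
@dataclass
class CarModel:
    brand: str  # e.g. "BMW"
    line: str   # e.g. "316" -- a model line, not one car

# A concrete car is a different concept: one physical vehicle
# of some model, identified by its registration number.
@dataclass
class Car:
    model: CarModel
    registration_number: str  # e.g. "XYZ"

bmw_316 = CarModel(brand="BMW", line="316")
my_car = Car(model=bmw_316, registration_number="XYZ")
```

If the table (or class) named "Car" actually held rows like `bmw_316`, downstream users would wrongly assume each row is a physical vehicle.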
  14. Why do we do it? • Not realizing there are any other interpretations of the name we use. • Assuming other interpretations are irrelevant and that people will know what we mean. • Assuming that the correct meaning will be inferred from the context.
  15. How to narrow the gap • Always contemplate an element's name in relative isolation and try to think of all the possible and legitimate ways it can be interpreted by a human. • If an element's name has more than one interpretation, make it unambiguous, even if the other interpretations are not within the domain or not very likely to occur. • Observe how the element is used in practice by your modelers, annotators, developers and users.
  16. Let's talk about synonymy
  17. At Textkernel we do Labour Market Analytics • Supply-Demand Analysis • Top Skills per Job • Career Paths
  18. For that we need synonyms! • Two terms are synonymous when they mean the same thing in (almost) all contexts. • We need synonyms to get statistics on the actual professions and skills, no matter the form or language in which they are expressed in text.
  19. Can we use any data model for synonymy? Not really! Consider the terms, the synonyms each model lists for them, and the source model: • Profession: Occupation, Vocation, Work, Living (KBpedia) • Chief Executive Officer: CEO, chief operating officer (WordNet) • Chief Executive Officer: Senior executive officer, chairman, CEO, managing director, president (ESCO) • Economist: economics science researcher, macro analyst, economics analyst, interest analyst, ... (ESCO) • Data Scientist: data engineer, research data scientist, data expert, data research scientist (ESCO)
  20. Why this gap? • We forget or ignore that synonymy is a vague and context-dependent relation. • We mix up synonymy with hyponymy, semantic relatedness and similarity. • We are unaware of subtle but important differences in meaning for our particular domain or context. • We don't document biases, assumptions and choices.
  21. How to narrow the gap • Insist on meaning equivalence over mere relatedness. • Get multiple opinions (from people and data). • If you can't be sure that your synonyms are indeed synonyms, then don't call them that. • Always document the criteria, assumptions and biases behind your synonymy.
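One way to follow these recommendations in practice is to make every synonym entry carry its context and justification, and to keep merely related terms in a separate structure so they are never silently treated as equivalent. A minimal sketch with invented data and function names:

```python
# Each synonym carries the context in which it holds and the reason
# it was accepted -- documenting the criteria and assumptions.
synonyms = {
    "Chief Executive Officer": [
        ("CEO", "general", "standard abbreviation, equivalent in all contexts"),
        ("managing director", "UK companies", "equivalent role title in this jurisdiction"),
    ],
}

# Related-but-not-equivalent terms live in a separate structure,
# so relatedness is never recorded as synonymy.
related_terms = {
    "Chief Executive Officer": ["chief operating officer", "chairman"],
}

def expand(term, context):
    """Return the term plus its synonyms that are valid in the given context."""
    out = [term]
    for syn, ctx, _reason in synonyms.get(term, []):
        if ctx in ("general", context):
            out.append(syn)
    return out
```

With this shape, `expand("Chief Executive Officer", "UK companies")` includes "managing director", while in any other context only the universally valid "CEO" is added.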
  22. Let's talk about semantic relatedness
  23. Another critical capability for good analytics is entity disambiguation
  24. For that we need semantically related terms! • The meaning of an ambiguous term in a text is most likely the one that is related to the meanings of the other terms in the same text. • Therefore, knowing which terms are semantically related helps in performing disambiguation.
  25. Can we use any related terms for disambiguation? Not really! • We need related terms that are not very ambiguous themselves. • We need related terms that are highly specific to our target term. • We need related terms that are prevalent in the data we process.
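The idea can be made concrete with a small scoring sketch: pick the candidate meaning whose related terms overlap most with the surrounding text. The candidate entities and evidence sets below are invented for illustration, not taken from any real knowledge base:

```python
# Hypothetical evidence sets: for each candidate meaning of "Ronaldo",
# the related terms we expect to co-occur with it in text.
evidence = {
    "Ronaldo (footballer)": {"casemiro", "real", "goal", "pass", "match"},
    "Ronaldo (musician)": {"album", "concert", "stage", "song"},
}

def disambiguate(candidates, context_words):
    # Score each candidate by how many of its related terms
    # actually appear in the surrounding context.
    scores = {c: len(evidence[c] & context_words) for c in candidates}
    return max(scores, key=scores.get)

text = "after a magnificent pass by Casemiro, Ronaldo managed to beat Claudio Bravo"
context = {w.strip(",.").lower() for w in text.split()}
best = disambiguate(["Ronaldo (footballer)", "Ronaldo (musician)"], context)
```

Here "pass" and "casemiro" occur in the context, so the footballer sense wins. The sketch also shows why the three requirements above matter: evidence terms that are themselves ambiguous, unspecific, or absent from the data contribute noise or nothing to the score.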
  26. A soccer experiment • Back in 2015, my old team had to detect and disambiguate mentions of soccer players and teams in short textual extracts from video scenes of football matches: "It's the 70th minute of the game and after a magnificent pass by Casemiro, Ronaldo managed to beat Claudio Bravo. Real now leads 1-0." • For that we used an in-house system called Knowledge Tagger, and DBpedia as domain knowledge about soccer teams and players.
  27. A soccer experiment • Initially, we ran the system with all the DBpedia-related entities for each player as disambiguation evidence. Precision was 60% and recall 55%. • Then we pruned DBpedia and kept only three relations: players and their current teams, players and their current co-players, and players and their current managers. • Precision increased to 82% and recall to 80%.
  28. Why this gap? • We usually don't want just any relatedness but a relatedness that actually helps our goal. • Our task's required relatedness seems to be compatible with the one provided by the data, yet there are subtle differences that make the latter non-useful or even harmful. • Semantic relatedness is a vague relation for which it's relatively easy to get agreement outside of any context, but hard within one.
  29. How to narrow the gap • Uncover the hidden assumptions and expectations behind the "should be related" requirement. • Give people examples of terms that you think can be related. • Ask them to judge them as related or not in context. • Challenge them to justify their decisions. • Identify patterns and rules that characterize these decisions. • Use this information to derive the "relatedness" you need.
  30. Let's summarize
  31. Takeaways • The Semantic Gap in Data Science is real: ➔ We often model data meaning badly ➔ We often understand the data meaning wrongly ➔ We often produce the wrong results • Closing it is hard: ➔ Ambiguity ➔ Vagueness ➔ Variety and diversity ➔ Context-dependence • We can avoid and/or narrow it though by paying more attention: ➔ Understand basic semantic phenomena ➔ Understand how data can be misunderstood ➔ Be aware of and document assumptions, choices and biases
  32. Thank you! Panos Alexopoulos, Head of Ontology @ Textkernel. Writing a book on semantic data modeling @ O'Reilly. E-mail: Web: LinkedIn: Twitter: @PAlexop