
A pair of shoes in the thesaurus; some reflexions on human and computer indexing


Presentation at Society of Indexers 2010 Conference, 30 September 2010, Middelburg


  1. 1. A pair of shoes in the thesaurus: reflexions on human and computer indexing
     Eric Sieverts, Media, information & communication, Amsterdam University of Applied Sciences / Section Innovation & Development, University Library Utrecht
     Society of Indexers Conference 2010, The challenging future of indexing, 30 September 2010, Middelburg
  2. 2. agenda
     • the holy grail for search systems: let people find what they search for
     • searching in the world of Google
     • what's wrong with Google (and the like)
     • metadata and indexing
     • indexing and knowledge organization
     • knowledge organization and the semantic web
     Eric Sieverts | Middelburg 30-9-2010
  3. 3. searching in the world of Google
     • Google appears to be "the measure of all things" in search:
       • with Google "everything can be found"
  4. 4. searching in the world of Google
     • Google appears to be "the measure of all things" in search:
       • with Google "everything can be found"
     • but isn't there a paradox?
       • if Google (or Yahoo! or Bing) contains everything (> 500,000,000,000 items), can "it" still be found?
       • >> anticipation of the user's intentions and peerless ranking algorithms become increasingly important
  5. 5. the basic search-and-find paradigm (diagram labels: searcher / query; match; documents; search, search, search, ......)
  6. 6. validity for free-text matching? (diagram labels: match; search, search, search, ......)
     • (paraphrasing a Dutch poetry title "Lees maar er staat niet wat er staat") "just read; it does not mean what you're reading"
     • How does Google know what you mean?
     • How does Google know what a document means?
  7. 7. to what query is this Google's answer?
     • filename: thesaurus.jpg
     • is this meant to be representative of the ease of use of thesauri?
  8. 8. Want to know something about "hallenkerken" (Dutch for "hall churches") through Google Books? Google's first hit is a book about building thesauri, which contains the word only once, in an example of broader and narrower terms
  9. 9. searching in the world of Google
     • the new Google Instant tries to predict user intent (the holy grail for search engine developers)
     • after typing 1 or 2 letters it already presents results for the statistically most probable (longer) words
     • but is Google really guessing right?
  10. 10. classical situation with controlled human indexing
      • searcher must enter the "term(s)" that have been used to characterize the subject
      • indexer must assign "correct" terms to characterize the document
      • match: in principle a perfect match is possible
  11. 11. classical situation with controlled human indexing: however
      • not user-friendly: searcher has to come up with the correct terms
      • expensive: indexers must analyze the document in order to assign the correct terms
  12. 12. search in the world of Google
      • searcher just types some words (or often only a single word)
      • search system contains (all) the words from the documents themselves
      • match: often you don't find all you need; still satisfied?
  13. 13. why still user satisfaction?
      • despite recall and precision problems:
      • search system looks attractively simple
      • searcher always finds something (in 500 billion web pages)
      • smart relevance ranking, providing some relevant items among the first 10 for most (simple) questions, for the majority of users, very often even at #1 already
      • and: who cares about lousy recall & precision (in the Google world)?
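The slide refers to recall and precision without defining them. As a reminder, precision is the share of retrieved items that are relevant, and recall is the share of all relevant items that were retrieved. A minimal sketch, in which the function name and the document identifiers are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical first result page: 10 hits shown, 3 of them relevant,
# while the collection holds 12 relevant documents in total.
retrieved = [f"doc{i}" for i in range(10)]
relevant = ["doc0", "doc4", "doc7"] + [f"other{i}" for i in range(9)]

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.3 0.25
```

With numbers like these a user can be perfectly happy (there is something useful among the first 10) even though three quarters of the relevant material was never seen.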
  14. 14. language technology at searcher side
      • original simple query is expanded & disambiguated
      • statistics generate additional terms to refine queries
      • search system contains just the words from the documents themselves
      • match: will improved queries result in better answers?
  15. 15. language technology for better "query"
      • "word stemming" and "fuzzy search": automatically search for more word forms >> better recall
      • semantic network (or ontology) contains semantic relations between words: query expanded with semantically related terms >> better recall
      • for different meanings of a word, a semantic network (or ontology) contains relations with different words >> disambiguation >> better precision
      • no scientific evidence yet about how much improvement
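The first two points on this slide (stemming, then expansion with related terms) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `RELATED` table is a hypothetical two-entry "semantic network", and `stem` is a deliberately naive English suffix stripper, not a real stemming algorithm.

```python
# Hypothetical mini semantic network: base form -> semantically related terms.
RELATED = {
    "shoe": {"footwear", "boot"},
    "church": {"chapel", "cathedral"},
}


def stem(word):
    """Naive plural stripping; real stemmers handle far more cases."""
    word = word.lower()
    if word.endswith(("ches", "shes", "xes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def expand_query(query):
    """Stem every query word and add its related terms (better recall)."""
    terms = set()
    for word in query.split():
        base = stem(word)
        terms.add(base)
        terms |= RELATED.get(base, set())
    return terms


expanded = expand_query("Shoes churches")
print(sorted(expanded))
# ['boot', 'cathedral', 'chapel', 'church', 'footwear', 'shoe']
```

The expanded set would then be sent to the search system as an OR-query; how much this helps in practice is, as the slide says, not well established.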
  16. 16. language technology for better "query"
      • statistical analysis of a search result generates characteristic terms, from which users can choose to refine their query
      • such words can also be derived from a synonym list, thesaurus, semantic network, et cetera
      • mostly >> better precision
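One simple way to derive "characteristic terms" from a result set is to count in how many of the retrieved documents each word occurs and offer the most frequent ones as refinements. The sketch below assumes exactly that; the stopword list, the function name and the three result snippets are invented for the example, and real systems typically also compare against frequencies in the whole collection.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "with"}


def suggest_terms(result_texts, query, top_n=3):
    """Suggest refinement terms: words occurring in the most result documents."""
    query_words = set(query.lower().split())
    doc_freq = Counter()
    for text in result_texts:
        # Count each word once per document, skipping stopwords and query words.
        doc_freq.update(set(text.lower().split()) - STOPWORDS - query_words)
    # Sort by descending document frequency, then alphabetically for stable output.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


results = [
    "thesaurus construction and indexing",
    "indexing with a thesaurus of terms",
    "thesaurus software for indexing terms",
]
suggestions = suggest_terms(results, "thesaurus", top_n=2)
print(suggestions)  # ['indexing', 'terms']
```

Choosing one of the suggestions and adding it to the query narrows the result set, which is why the slide expects mostly better precision.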
  17. 17. language technology at the document
      • search with "correct" or "important" terms
      • language technology enriches the document with "correct" terms (from a thesaurus) or derives characteristic terms from the text
      • match: in principle a perfect match is possible
  18. 18. automatic classification
  19. 19. automatic classification or enrichment
      • 1. deriving specific terms from the document itself: on the basis of word lists and text analysis, specific types of terms (e.g. names of persons, places, products, parties, companies, etc.) can be recognized and marked as such
      • 2. adding characteristics to classify a document: after training, a system can analyze documents and classify them with terms from a thesaurus or with classes from a taxonomy
      • despite some limitations it's getting better all the time, even for less tangible tasks such as sentiment analysis
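Point 1 on this slide, recognizing terms of a specific type on the basis of word lists, can be illustrated with a small lookup. The word lists (`GAZETTEER`) and the sample sentence below are assumptions made for this sketch; production services combine such lists with much richer text analysis.

```python
import re

# Hypothetical word lists, one per entity type.
GAZETTEER = {
    "person": ["Rodin", "Balzac"],
    "place": ["Middelburg", "Utrecht", "Amsterdam"],
}


def mark_entities(text):
    """Return (name, type) pairs for every listed name found, in text order."""
    found = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), name, entity_type))
    return [(name, entity_type) for _, name, entity_type in sorted(found)]


entities = mark_entities("Rodin's statue of Balzac was discussed in Middelburg.")
print(entities)
# [('Rodin', 'person'), ('Balzac', 'person'), ('Middelburg', 'place')]
```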
  20. 20. The Calais Web Service automatically creates rich semantic metadata: Named Entities, Facts, Events
  21. 21. geographical recognition in Google Books
  22. 22. training a system (diagram labels: thesaurus; training documents; analysis module; "fingerprints"; training module; enrichment of thesaurus) © Joop van Gent, Irion
  23. 23. classification with system (diagram labels: enriched thesaurus; new documents; analysis module; "fingerprints"; classification module; enriched documents) © Joop van Gent, Irion
  24. 24. endgame tips: checkmate with bishop and knight (in Dutch: "horse"): chess / equestrianism
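The train-then-classify idea from the preceding slides, applied to the chess versus equestrianism example, can be sketched as follows. This is not the Irion system: here a "fingerprint" is assumed to be nothing more than a word-frequency profile built from a few made-up training sentences, and classification simply picks the class whose profile overlaps most with the new text.

```python
from collections import Counter


def fingerprint(texts):
    """Build a word-frequency profile ('fingerprint') from training texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def classify(text, fingerprints):
    """Score each class by summed training frequency of the text's words."""
    words = text.lower().split()
    scores = {
        label: sum(profile[word] for word in words)
        for label, profile in fingerprints.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical training documents per thesaurus term.
training = {
    "chess": [
        "checkmate with bishop and knight",
        "endgame tips for rook and pawn",
    ],
    "equestrianism": [
        "horse jumping and dressage",
        "saddle and stable tips for your horse",
    ],
}
fingerprints = {label: fingerprint(texts) for label, texts in training.items()}

label, scores = classify("endgame checkmate with the horse", fingerprints)
print(label, scores)  # chess {'chess': 3, 'equestrianism': 2}
```

The ambiguous word "horse" pulls towards equestrianism, but the surrounding words tip the balance to chess. Raw counts are used here for brevity; a realistic system would normalize and weight the terms.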
  25. 25. knowledge organization systems
      metadata: more than keywords or thesauri?
  26. 26. knowledge organization systems
      • knowledge organization systems (KOS) can be more than just metadata models or tools for subject indexing
      • 4 types of KOS:
        • categorization systems (like classifications and taxonomies)
        • metadata models (like MARC or Dublin Core)
        • relational models (like thesauri, semantic networks, ontologies)
        • term lists (like authority files)
      • more about ontologies in a moment
  27. 27. knowledge organization systems
      • 4 types of functions for KOS:
        • description and labeling (e.g. subject indexing with a thesaurus)
        • definition (e.g. specification of the meaning of concepts in a thesaurus or ontology)
        • translation (e.g. concordance between systems for interoperability)
        • navigation (through the systematic structure of a taxonomy or classification, or the hierarchy of concepts in a thesaurus or ontology)
      • some of these play a role in the semantic web
  28. 28. ontologies
      • "knowledge representation" in which knowledge about (a small part of) the world is stored
      • mostly not directly used for subject indexing
      • allows more complete and complex representations of reality than a thesaurus
        • with many possible types of relations between concepts
        • with fixed roles and properties of these concepts
      • often for limited domains ("wine ontology")
      • sometimes broader, in so-called "core ontologies"
      • for example: CIDOC-CRM (conceptual reference model) for concepts, relations and properties in the field of cultural heritage
  29. 29. relations between some concepts in a simple "wine ontology"
  30. 30. example of the relations between concepts about the statue of Balzac by Rodin [in CIDOC-CRM]
  31. 31. semantic web
  32. 32. ontologies
      • "ontologies" in relation to the semantic web
      • in a more general sense: a general name for all kinds of subject indexing systems (thesauri, classifications, taxonomies, name authority lists, .....)
      • essential requirements:
        • ontology must be available in a form that can be read, interpreted and processed by a computer program
        • -> needs notations and formal languages to describe it
  33. 33. ontology notation for semantic web
      • RDF (Resource Description Framework): standard to describe relations between an object and its metadata
      • OWL (Web Ontology Language): standard for computer-readable description of ontologies
      • RDFS (RDF Schema): standard for description of a KOS in RDF
      • SKOS (Simple Knowledge Organization System): standard for describing KOSs and the relations between them in RDF
  34. 34. resource description framework
      • RDF uses XML to describe the relation between a resource (or object), its metadata and the metadata standards used
      • resources should have a URI to refer to them
      • RDF uses "namespaces" to refer to computer-readable descriptions of the standards (link via URL)
      • RDF is meant to (re)use and to combine existing semantic systems
      • properties (metadata) are registered in so-called triples: subject <predicate> object (which we could perhaps also write: thing <property> value)
      • RDF triples are used in "linked data"
  35. 35. rdf triples
      • subject <predicate> object
      • doc1 <has author> auth1
      • auth1 <has name> john smith
      • auth1 <has affiliation> home inc.
      • auth1 <has email> [email_address]
      graphical representation of simple network of 4 RDF triples
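The four triples on this slide form a small graph that can be followed from node to node. A minimal sketch using plain Python tuples instead of an RDF library; the helper names `objects` and `author_names` are made up for this example, and the e-mail value is left as the placeholder that appears in the transcript.

```python
# The slide's triples as (subject, predicate, object) tuples.
triples = [
    ("doc1", "has author", "auth1"),
    ("auth1", "has name", "john smith"),
    ("auth1", "has affiliation", "home inc."),
    ("auth1", "has email", "[email_address]"),
]


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(triples, doc):
    """Follow doc -> has author -> has name across two triples."""
    return [
        name
        for author in objects(triples, doc, "has author")
        for name in objects(triples, author, "has name")
    ]


print(author_names(triples, "doc1"))  # ['john smith']
```

In real RDF the subjects and predicates would be URIs rather than short strings, which is what makes triples from different sources combinable.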
  36. 36. SKOS: representation of thesaurus term & relations can be described in RDF
      Term: Economic cooperation
      Used For: Economic co-operation
      Broader terms: Economic policy
      Narrower terms: Economic integration, European economic cooperation, European industrial cooperation, Industrial cooperation
      Related terms: Interdependence
      Scope Note: Includes cooperative measures in banking, trade, industry etc., between and among countries.
  37. 37. SKOS representation in RDF
      <rdf:RDF xmlns:rdf="…" xmlns:skos="…">
        <skos:Concept>
          <skos:prefLabel>Economic cooperation</skos:prefLabel>
          <skos:altLabel>Economic co-operation</skos:altLabel>
          <skos:scopeNote>Includes cooperative measures in banking, trade, industry etc., between and among countries.</skos:scopeNote>
          <skos:broader>
            <skos:Concept>
              <skos:prefLabel>Economic policy</skos:prefLabel>
            </skos:Concept>
          </skos:broader>
          <skos:related>
            <skos:Concept>
              <skos:prefLabel>Interdependence</skos:prefLabel>
            </skos:Concept>
          </skos:related>
          <skos:narrower>
            <skos:Concept>
              <skos:prefLabel>Economic integration</skos:prefLabel>
            </skos:Concept>
          </skos:narrower>
          <!-- ...more narrower terms omitted ... -->
        </skos:Concept>
      </rdf:RDF>
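To show that such a SKOS record really is computer-readable, the sketch below parses a shortened copy of it with Python's standard XML parser and reads back the preferred label and the broader and narrower terms. The namespace URIs in the transcript were lost, so the standard W3C RDF and SKOS namespace URIs are filled in here as an assumption; the helper `related_labels` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Standard W3C namespace URIs, assumed here because the slide's were lost.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

SKOS_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:broader>
      <skos:Concept>
        <skos:prefLabel>Economic policy</skos:prefLabel>
      </skos:Concept>
    </skos:broader>
    <skos:narrower>
      <skos:Concept>
        <skos:prefLabel>Economic integration</skos:prefLabel>
      </skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>"""


def related_labels(concept, relation):
    """Preferred labels of the concepts nested under skos:<relation>."""
    path = f"skos:{relation}/skos:Concept/skos:prefLabel"
    return [element.text for element in concept.findall(path, NS)]


root = ET.fromstring(SKOS_XML)
concept = root.find("skos:Concept", NS)
term = concept.find("skos:prefLabel", NS).text
broader = related_labels(concept, "broader")
narrower = related_labels(concept, "narrower")

print(term, broader, narrower)
# Economic cooperation ['Economic policy'] ['Economic integration']
```

A dedicated RDF library would normally be used instead of a generic XML parser, since the same triples can be serialized in several different ways.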
  38. 38. RDF and "linked data"
      • a lot of buzz recently about "linked (open) data"
      • it's just RDF triples, so it's computer-readable
      • it's on the internet, so it's open
      • it's meant to be re-used, so it's an important ingredient for the semantic web
      • it's standardized, so it can be re-used
      • everybody can (and has to) contribute data, so it is also somewhat messy
  39. 39. the "linked data cloud", September 2010: 24 billion RDF triples online
  40. 40. viaf: virtual international authority file dbpedia: data from Wikipedia artists geonames: 6.2 M toponyms BBC: wildlife finder LCSH Reuters: openCalais IMDB
  41. 41. topic maps
      • XML-based information systems
      • that can be considered as ontologies
      • that need no additional notations and/or standards to make them computer-readable
      • that combine knowledge representations and the indexed information in a single self-contained, interlinked system
      • suited to make local knowledge accessible
  42. 42. topic maps
      • consist of: concepts (= topics)
      • that are characterized with
        • "names" (can be any word, or even several, to describe them) (names are topics themselves as well!)
        • "types" (describing to what class of concepts it belongs) (types are topics themselves as well!)
        • "associations" (specified types of relations between topics) (associations are also topics, thus having types!)
        • "occurrences" (information items "about" the concept-topic) (occurrences are also topics, thus having types!)
      • all of this described in XML
  43. 43. simple example of opera topic map, adapted from Pepper (diagram)
      topic types: composer, opera, city, country
      association types: situated in, influenced, composed, location for, place of birth
      topics: verdi, puccini, lucca, italy (italia, italië, italien), tosca, madame-butterfly (madama-butterfly), roma (rome)
      occurrences
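The building blocks of a topic map (topics with names and types, plus typed associations between them) map naturally onto a small data structure. The sketch below is a simplification made for illustration: it keeps types and associations as plain strings and tuples rather than as topics in their own right, and the three associations shown are assumptions about how the diagram links its topics (the diagram itself did not survive in this transcript).

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    names: list                      # one topic can carry several names
    type: str                        # the class of concepts it belongs to
    occurrences: list = field(default_factory=list)  # items "about" the topic


topics = {
    "puccini": Topic(["Puccini"], "composer"),
    "tosca": Topic(["Tosca"], "opera"),
    "lucca": Topic(["Lucca"], "city"),
    "italy": Topic(["Italy", "Italia", "Italië", "Italien"], "country"),
}

# (topic, association type, topic); assumed links, see the note above.
associations = [
    ("puccini", "composed", "tosca"),
    ("puccini", "place of birth", "lucca"),
    ("lucca", "situated in", "italy"),
]


def related(topic_id, association_type):
    """Topics reached from topic_id through the given association type."""
    return [
        target
        for source, kind, target in associations
        if source == topic_id and kind == association_type
    ]


birthplace = related("puccini", "place of birth")[0]
country = related(birthplace, "situated in")[0]
print(topics[country].names)  # ['Italy', 'Italia', 'Italië', 'Italien']
```

Because one topic holds all its names, a search for "Italien" and a search for "Italia" end up at the same node, which is the practical point of the multiple-names feature.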
  44. 44. topic map application
      • the Royal Academy of Music in London developed a model to describe "everything" around music, from work/composition to the experience of a particular performance
      • conceptually similar to the relational FRBR model in the library world
      © Antony Pitts, Kal Ahmed, MusicDNA
  45. 45. © Antony Pitts, Kal Ahmed, MusicDNA
  46. 46. semantic web
      • ultimate application of interoperability
      • using a combination of methods and standards for storing, structuring, filling, formalizing, describing and interpreting metadata
        • RDF(S)
        • ontologies (as well as thesauri, taxonomies, semantic networks, …)
        • formal languages (like SKOS and OWL)
        • annotation of resources/objects (= subject indexing)
      • so that computers will be able to interpret meaning and to combine knowledge from separate systems
  47. 47. rdf annotation of web resource © Guus Schreiber UvA / VU
  48. 48. iconclass annotation
  49. 49. "species ontology" © Guus Schreiber UvA / VU
  50. 50. search, search, search, search, search, ...... match
      • the semantic web (and interoperability) still require a lot of subject indexing, but with smart systems that:
        • (help to) index dumb documents
        • can infer meaning
        • can match heterogeneous metadata
        • can improve dumb searches
      • even a monkey may find correct information, even information he didn't know he was looking for