Enriching search engines - Thomas Francart - English version

Thomas Francart/Mondeca - How to add value to search engines using semantic information.

Published in: Technology, Design

  1. Mondeca’s approach to enriching search engines using business knowledge
     Mondeca, thomas.francart@mondeca.com, 07/03/2012
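  2. The intersection points of several domains
     [Diagram: three overlapping domains, Knowledge, Search and Content. Knowledge and Search meet in knowledge-based enhanced search; Search and Content in smart structure-based indexing of content; Knowledge and Content in content semantic annotation.]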
  3. LERUDI use case
     [Diagram labels: ITM, AFS, Knowledge, Search, WCM, SDK, CMS, Content]
  4. What knowledge are we talking about?
     • Internal business/reference vocabularies:
       – Thesauri (multilingual)
       – Dictionaries
       – Named entities lists
       – Classification rules
       – Thesaurus alignments
       – …
     • Structured data - Always
     • Linked Data:
       – E.g.: GEMET thesaurus, subset of DBPedia named entities, etc.
  5. At which level do we bring value?
     • At 2 different levels:
       – when indexing content
         • via index enrichment
       – when users perform a search
         • by assisting them in the query (re)formulation
     • The preferred / most useful technique is to enrich content during the indexing phase
       – but this implies that content must be reindexed every time business knowledge evolves or changes
  6. The search engine we used to demonstrate this
     • Lucene SolR:
       – Open-source
       – Has advanced plain text search capabilities
       – Allows faceted search
       – Offers a highlight feature
       – Has spellchecker capabilities
       – Includes a « More Like This » (find related content) feature
       – Is UIMA compliant
       – … full feature list available at: http://lucene.apache.org/solr/features.html
     • Principles discussed in the next slides may be applied to other search engines
  7. SolR explorer: a test interface
     • SolR returns an XML feed to an HTTP request
       – http://localhost:8080/solr/select/q=lac&start=0&length=10
     • SolR explorer:
       – A web interface to visualize / navigate / test the returned XML feed
       – Definitely not meant for end users!
       – https://issues.apache.org/jira/browse/SOLR-1163
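The request URL on the slide is abbreviated. As a minimal sketch, assuming the standard SolR select handler with its `q`, `start` and `rows` parameters (the slide writes `/q=` and `length`; the helper name `build_select_url` is ours), such a request could be assembled like this:

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, start=0, rows=10):
    """Return a select URL for `query`, paginated with start/rows.

    Assumption: the standard select handler is mounted under <base_url>/select.
    """
    params = {"q": query, "start": start, "rows": rows}
    # urlencode percent-encodes accented characters as UTF-8
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_select_url("http://localhost:8080/solr", "lac"))
```

Building the query string with `urlencode` rather than by hand keeps accented search terms such as « pêche » valid in the URL.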
  8. The data set
     • Structured catalogue of an e-tourism portal
       – Hotels
       – Restaurants
       – Activities
       – Contacts
       – Etc.
     • Each resource is linked to a web site
  9. Starting point: simple web indexing, without enrichment
  10. Plan
      1 - Synonyms
      2 - Translations
      3 – Specific terms
      4 – Spelling mistakes
      5 - Facets
      6 – Dynamic classification
      7 – Vocabulary alignments
      8 - Disambiguation
  11. 1: enrichment using synonyms
      • Why?
        – Increase recall, expand a request using similar terms
      • How?
        – By providing a list of equivalent terms to the search engine
        – SolR configuration:
          <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory" />
              <!-- in this example, we will only use synonyms at index time -->
              <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
              <!-- ... -->
            </analyzer>
            <analyzer type="query">
              <tokenizer class="solr.WhitespaceTokenizerFactory" />
              <!-- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
              <!-- ... -->
            </analyzer>
          </fieldType>
  12. Format of the synonyms file
      • One line for each group of equivalent terms
      • Option 1: use all equivalent terms
        – If one term is found, all equivalent terms are added to the index
          déréglementation,libéralisation,dérégulation
          croisière,croisière de plaisance,croisière maritime
          spectacle,attraction,show
          vacances familiales,tourisme familial
          pêche,pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique
          office de tourisme,otsi,office municipal de tourisme,syndicat d'initiative
      • Option 2: use controlled term only
        – If one of the terms is found, only the controlled term is added to the index
          libéralisation,dérégulation => déréglementation
          croisière de plaisance,croisière maritime => croisière
          attraction,show => spectacle
          tourisme familial => vacances familiales
          pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => pêche
          otsi,office municipal de tourisme,syndicat d'initiative => office de tourisme
      Thomas Francart - Enrichissement des moteurs de recherche à partir de connaissances métier
  13. Generation of a synonyms file
      • Generation of « synonyms.txt » from a SKOS file
        – E.g.: using the World Tourism Organisation thesaurus
      SKOS source:
          <skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
            <skos:altLabel xml:lang="en">long stays</skos:altLabel>
            <skos:altLabel xml:lang="fr">marché des vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">genre de vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">long séjour</skos:altLabel>
            <skos:altLabel xml:lang="en">holiday markets</skos:altLabel>
            <skos:altLabel xml:lang="en">vacations</skos:altLabel>
            <skos:altLabel xml:lang="es">mercado de vacaciones</skos:altLabel>
            <skos:altLabel xml:lang="fr">activité de vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">type de vacances</skos:altLabel>
            <skos:altLabel xml:lang="en">holiday tourism</skos:altLabel>
            <skos:altLabel xml:lang="fr">congés payés</skos:altLabel>
            <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
            <skos:altLabel xml:lang="fr">06.09</skos:altLabel>
            <skos:broader rdf:resource="http://thes.world-tourism.org#FLUX_TOURISTIQUE" />
            <skos:inScheme rdf:resource="http://thes.world-tourism.org#_06_FLUX_TOURISTIQUE" />
            <!-- … -->
            <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_DHIVER" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_DETE" />
            <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
            <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
            <skos:prefLabel xml:lang="es">VACACIONES</skos:prefLabel>
          </skos:Concept>
      Generated synonyms.txt:
          …
          Activités nautiques
          HOLIDAYS,VACANCES,long stays,marché des vacances,genre de vacances,long séjour,holiday markets,vacations,activité de vacances,type de vacances,holiday tourism,congés payés,06.09
          KOREA DPR,COREE RDP,20.03.05.03
          TOURISM IN NATIONAL ECONOMIES,TOURISME DANS L'ECONOMIE NATIONALE,04.04.04,place du tourisme dans l'économie
          …
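The slides do not show how the file is generated. The following is a minimal sketch, assuming the thesaurus is available as RDF/XML with `skos:Concept` elements as in the excerpt above. The function name `skos_to_synonym_lines`, the language filter and the tiny sample document are ours, not Mondeca's tooling.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def skos_to_synonym_lines(rdf_xml, languages=("en", "fr")):
    """Return one comma-separated line of equivalent labels per skos:Concept."""
    root = ET.fromstring(rdf_xml)
    lines = []
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        labels = []
        # preferred labels first, then alternative labels, without duplicates
        for tag in ("prefLabel", "altLabel"):
            for label in concept.findall(f"{{{SKOS}}}{tag}"):
                text = (label.text or "").strip()
                if text and label.get(XML_LANG) in languages and text not in labels:
                    labels.append(text)
        if len(labels) > 1:  # a lone label has nothing to be expanded to
            lines.append(",".join(labels))
    return lines


# Hypothetical, shortened sample in the shape of the WTO excerpt
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
    <skos:altLabel xml:lang="en">long stays</skos:altLabel>
    <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
    <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

if __name__ == "__main__":
    print("\n".join(skos_to_synonym_lines(SAMPLE)))
```

This sketch skips single-label concepts and does not escape labels that themselves contain a comma; both would need a decision in a real export.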
  14. Result
  15. Handle synonyms at index-time or query-time?
      • In most cases, it is recommended to handle synonyms at index-time
        – A synonym composed of several words (e.g.: « nautical activities ») is tokenised at query time and will not be correctly identified
          • Even when using quotes?
        – Query-time expansion impacts the search engine’s scoring algorithms (IDF)
        – Prefix queries (« naut* ») or fuzzy queries (« activities~ ») are not analysed at query time and will not be extended to synonyms
      • But:
        – The index grows accordingly
        – If synonyms change, the content must be reindexed
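The multi-word problem can be illustrated with a deliberately simplified model, which is not SolR code: the query is split on whitespace first, and each token is then looked up on its own. The synonym table below is hypothetical.

```python
# Hypothetical synonym table: one multi-word entry, one single-word entry
SYNONYMS = {
    "nautical activities": ["water sports"],
    "show": ["attraction"],
}


def expand_query_time(query):
    """Expand a query token by token, as a whitespace-splitting parser would."""
    expanded = []
    for token in query.lower().split():
        expanded.append(token)
        # the lookup only ever sees single tokens, never the full phrase
        expanded.extend(SYNONYMS.get(token, []))
    return expanded


if __name__ == "__main__":
    print(expand_query_time("show"))
    print(expand_query_time("nautical activities"))
```

The single-word entry is expanded, while the two-word entry never matches because neither « nautical » nor « activities » is a key on its own.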
  16. To expand, or not to expand queries…?
      • One possible solution to avoid inflating the index:
        – Avoid expanding from a list of synonyms…
          spectacle,attraction,show
          pêche,pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique
          office de tourisme,otsi,office municipal de tourisme,syndicat d'initiative
        – …but rather restrict expansion to one controlled value…
          attraction,show => spectacle
          pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => pêche
          otsi,office municipal de tourisme,syndicat d'initiative => office de tourisme
        – …which could be the URI of a concept
          attraction,show,spectacle => http://thes.world-tourism.org#SPECTACLE
          pêche,pêche à la ligne,permis de pêche,article de pêche,pêche au gros,pêche touristique => http://thes.world-tourism.org#PECHE
          office de tourisme,otsi,office municipal de tourisme,syndicat d'initiative => http://thes.world-tourism.org#OFFICE_DE_TOURISME
      • Advantages:
        – Index size does not inflate
        – No impact on scoring algorithms
      • But it requires analysis both when indexing and when querying
      • It does not solve the issue of synonyms composed of several words
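Both reduction styles follow the same pattern, so the lines can be produced by one small helper. This is a sketch only; the function name `controlled_rule` and its signature are ours.

```python
def controlled_rule(controlled_term, variants, target=None):
    """Build one synonyms line that reduces a group of terms to a single value.

    Without `target`, the variants are reduced to the controlled term.
    With `target` (e.g. a concept URI), the controlled term itself must be
    mapped as well, so it joins the left-hand side.
    """
    if target is None:
        return f"{','.join(variants)} => {controlled_term}"
    return f"{','.join([controlled_term, *variants])} => {target}"


if __name__ == "__main__":
    print(controlled_rule("spectacle", ["attraction", "show"]))
    print(controlled_rule("spectacle", ["attraction", "show"],
                          target="http://thes.world-tourism.org#SPECTACLE"))
```

The same file must then be applied in both the index and the query analyzers, which is the extra analysis cost the slide mentions.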
  17. Mixed approaches
      • Use two synonym lists:
        – One tailored for indexing
        – Another one tailored for search expansion at query-time
      • When new synonyms are needed:
        – Add them to the synonym list tailored for search
          • They can be leveraged in real time, no need for reindexing
          • Does not solve the question of synonyms composed of several words
        – Add them to the synonym list tailored for indexing too
          • They will be leveraged at the next indexing phase
      • At the next indexing phase:
        – Empty the synonym list tailored for search
      • Another mixed approach:
        – Process all single-word synonyms when searching
        – Process all synonyms composed of several words at the indexing phase
  19. 2: enrichment using translations
      • Why?
        – Add multilingual capabilities to the search engine / allow searching for content in a different language than the one used in the query
      • Same methodology as for synonyms
        – Translations are declared as equivalent synonyms
      • Example
        – Using the GEMET thesaurus (sustainable development)
        – Can be downloaded in SKOS at http://www.eionet.europa.eu/gemet
      SolR configuration:
          <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
              <!-- tokenizer omitted on the slide; see the synonyms configuration -->
              <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
              <filter class="solr.SynonymFilterFactory" synonyms="gemet.txt" ignoreCase="true" expand="true" />
            </analyzer>
            <!-- … -->
          </fieldType>
      SKOS source:
          <rdf:Description rdf:about="concept/10910">
            <skos:prefLabel xml:lang="fr">station de montagne</skos:prefLabel>
            <skos:prefLabel xml:lang="en">mountain resort</skos:prefLabel>
            <skos:prefLabel xml:lang="es">centro turístico de montaña</skos:prefLabel>
          </rdf:Description>
      Generated gemet.txt:
          …
          achat,purchase,compra
          mosaïque,mosaic,mosaico
          station de montagne,mountain resort,centro turístico de montaña
          …
  20. Result
  21. !?
      • Why would a search using « mosaic » match « poterie » and « vitrail »?
      • In GEMET, the only information available is:
          …
          achat,purchase,compra
          mosaïque,mosaic,mosaico
          station de montagne,mountain resort,centro turístico de montaña
          …
      • BUT, in WTO, we also find the following information:
          ARTISANAT,vitrail,orfèvrerie,mécanique,dentelle,plomberie,tapisserie,ébénisterie,mosaïque,modélisme,tissage,porcelaine,crafts,artisanat d'art,menuiserie,cristallerie,joaillerie,émaux,peinture sur soie,poterie
      • As we are using both the GEMET and WTO dictionaries of synonyms, the result when indexing is:
        – « poterie » → « mosaïque » → « mosaic »
      • We are exploiting both WTO synonyms and translations from GEMET
        – Beware of any unwanted interactions!
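The interaction can be reproduced with a simplified model of two chained expanding filters. This is not SolR code; the helper names are ours, terms are lower-cased to mimic ignoreCase, and the WTO group is shortened.

```python
def build_lookup(groups):
    """Map each term to the full set of terms of its equivalence group."""
    lookup = {}
    for group in groups:
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


def apply_filter(tokens, lookup):
    """Expand every token to its whole group, in the spirit of expand="true"."""
    expanded = set()
    for token in tokens:
        expanded |= lookup.get(token, {token})
    return expanded


# Shortened excerpts of the two synonym files shown on the slide
WTO = [["artisanat", "vitrail", "mosaïque", "poterie"]]
GEMET = [["mosaïque", "mosaic", "mosaico"]]

if __name__ == "__main__":
    after_wto = apply_filter({"poterie"}, build_lookup(WTO))
    after_both = apply_filter(after_wto, build_lookup(GEMET))
    print("mosaic" in after_both)
```

With GEMET alone, « poterie » is never indexed under « mosaic »; it is the first filter's output feeding the second one that creates the unwanted match.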
  23. 3: enrichment using specific terms
      • Why?
        – To increase recall. Allows searching on generic notions
      • The GEMET and WTO thesauri rely on a hierarchy of terms
        – Loisirs > loisirs de plein air > randonnée > randonnée cycliste
        – Loisirs > sorties > spectacle > cirque
      • A search on « sorties » (outings) should find documents containing « spectacle » or « cirque »
        – A search on « loisirs » (leisure) should find documents containing « randonnée » (trek) or « spectacle » (show)
        – Etc.
  24. Generation of the specific terms file
      • How?
        – Same methodology as for the synonyms
        – Translation of specific terms is performed when indexing, so as to translate a specific term into all of its corresponding generic terms
          • If done at search time, we would translate from generic to specific
          • If « peinture » (painting) is in the text, then we must add « loisirs culturels » and « loisirs », which are the generic terms of that specific one
      SKOS source:
          <skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS">
            <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_DE_PLEIN_AIR" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#SORTIE" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_DINTERIEUR" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#OISIVETE" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#LOISIRS_CULTURELS" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#JEU" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#ARTISANAT" />
            <skos:prefLabel xml:lang="fr">LOISIRS</skos:prefLabel>
          </skos:Concept>
          <skos:Concept rdf:about="http://thes.world-tourism.org#LOISIRS_CULTURELS">
            <skos:altLabel xml:lang="fr">loisirs artistiques</skos:altLabel>
            <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#PEINTURE" />
            <skos:prefLabel xml:lang="fr">LOISIRS CULTURELS</skos:prefLabel>
          </skos:Concept>
          <skos:Concept rdf:about="http://thes.world-tourism.org#PEINTURE">
            <skos:altLabel xml:lang="fr">09.03.07</skos:altLabel>
            <skos:broader rdf:resource="http://thes.world-tourism.org#LOISIRS_CULTURELS" />
            <skos:prefLabel xml:lang="fr">PEINTURE</skos:prefLabel>
          </skos:Concept>
      Generated specific terms file:
          RESEAU => TRAFIC,TRANSPORT
          PEINTURE => LOISIRS CULTURELS,LOISIRS
          FETE => MANIFESTATION CULTURELLE,MANIFESTATION TOURISTIQUE
          TRANSPORT FLUVIAL => MODE DE TRANSPORT,TRANSPORT
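Producing these lines amounts to walking the skos:broader links upwards from each term. The sketch below assumes the broader relations have already been extracted into a dictionary of labels; the helper names and the `keep_term` option are ours.

```python
def ancestors(term, broader):
    """Return all generic terms above `term`, nearest first."""
    result, seen = [], {term}
    queue = list(broader.get(term, []))
    while queue:
        parent = queue.pop(0)
        if parent in seen:  # guard against cycles in the thesaurus
            continue
        seen.add(parent)
        result.append(parent)
        queue.extend(broader.get(parent, []))
    return result


def generic_term_lines(broader, keep_term=False):
    """Build one 'specific => generic,...' line per term that has a broader term."""
    lines = []
    for term in broader:
        generics = ancestors(term, broader)
        if generics:
            targets = [term, *generics] if keep_term else generics
            lines.append(f"{term} => {','.join(targets)}")
    return lines


# Broader links taken from the SKOS excerpt on the slide
BROADER = {
    "PEINTURE": ["LOISIRS CULTURELS"],
    "LOISIRS CULTURELS": ["LOISIRS"],
}

if __name__ == "__main__":
    print("\n".join(generic_term_lines(BROADER)))
```

To our understanding of SolR's explicit `=>` mappings, the matched term is replaced by the right-hand side, so `keep_term=True` may be the safer choice when the specific term itself must stay searchable; this is worth checking against the SolR version in use.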
  25. Result
  27. 4: spell checking
      • Why?
        – Provide users with similar terms for a wrong entry
          • e.g.: « retsaurant » → « Did you mean ‘restaurant’? »
      • How? There are two ways to build smart spellchecking:
        – By using the index as a dictionary
          • Spelling corrections are in fact existing entries in the index
            – Hence an almost 100% chance of finding results, except when spellchecked terms are combined with other terms from the query
          • But not all of the controlled/business terms are necessarily available for spell checking
            – If they do not exist in the indexed content
        – By using a list of controlled terms
          • The suggested spelling corrections will not necessarily trigger results
            – There is no guarantee that any of the indexed documents contains the proposed terms
          • But all business terms are available for controlled searches
  28. Spellchecking using an authority list
      • SolR configuration: solrconfig.xml
          <config>
            <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
              <str name="queryAnalyzerFieldType">textSpell</str>
              <lst name="spellchecker">
                <str name="name">default</str>
                <str name="field">name</str>
                <str name="spellcheckIndexDir">./spellchecker</str>
              </lst>
              <lst name="spellchecker">
                <str name="classname">solr.FileBasedSpellChecker</str>
                <str name="name">file</str>
                <str name="sourceLocation">spellcheck.txt</str>
                <str name="characterEncoding">UTF-8</str>
                <str name="accuracy">0.8</str>
                <str name="spellcheckIndexDir">./spellcheckerFile</str>
              </lst>
            </searchComponent>
            <requestHandler name="standard" class="solr.SearchHandler" default="true">
              <lst name="defaults">
                <str name="echoParams">explicit</str>
                <str name="spellcheck.onlyMorePopular">false</str>
                <str name="spellcheck.extendedResults">false</str>
                <str name="spellcheck.count">1</str>
              </lst>
              <arr name="last-components">
                <str>spellcheck</str>
              </arr>
            </requestHandler>
          </config>
  29. Generation of spellcheck.txt
      • Generation of spellcheck.txt from the WTO SKOS
      SKOS source:
          <skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
            <skos:altLabel xml:lang="en">long stays</skos:altLabel>
            <skos:altLabel xml:lang="fr">marché des vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">genre de vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">long séjour</skos:altLabel>
            <skos:altLabel xml:lang="en">holiday markets</skos:altLabel>
            <skos:altLabel xml:lang="en">vacations</skos:altLabel>
            <skos:altLabel xml:lang="es">mercado de vacaciones</skos:altLabel>
            <skos:altLabel xml:lang="fr">activité de vacances</skos:altLabel>
            <skos:altLabel xml:lang="fr">type de vacances</skos:altLabel>
            <skos:altLabel xml:lang="en">holiday tourism</skos:altLabel>
            <skos:altLabel xml:lang="fr">congés payés</skos:altLabel>
            <skos:altLabel xml:lang="es">estancia larga</skos:altLabel>
            <skos:altLabel xml:lang="fr">06.09</skos:altLabel>
            <skos:broader rdf:resource="http://thes.world-tourism.org#FLUX_TOURISTIQUE" />
            <skos:inScheme rdf:resource="http://thes.world-tourism.org#_06_FLUX_TOURISTIQUE" />
            <!-- … -->
            <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_DHIVER" />
            <skos:narrower rdf:resource="http://thes.world-tourism.org#VACANCES_DETE" />
            <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
            <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
            <skos:prefLabel xml:lang="es">VACACIONES</skos:prefLabel>
          </skos:Concept>
      Generated spellcheck.txt (one term per line):
          ORGANISMO DE CREDITO
          14.11.02
          Activités nautiques
          HOLIDAYS
          VACANCES
          VACACIONES
          marché des vacances
          genre de vacances
          long séjour
          activité de vacances
          type de vacances
          congés payés
          06.09
          KOREA DPR
          …
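For the file-based spellchecker the export is simpler than for synonyms: one label per line, without grouping. A minimal sketch, again assuming RDF/XML input (the function name and the sample document are ours):

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
LABEL_TAGS = (f"{{{SKOS}}}prefLabel", f"{{{SKOS}}}altLabel")


def skos_to_spellcheck_terms(rdf_xml):
    """Return every distinct SKOS label, in document order, one entry per term."""
    terms = []
    for element in ET.fromstring(rdf_xml).iter():
        if element.tag in LABEL_TAGS:
            text = (element.text or "").strip()
            if text and text not in terms:
                terms.append(text)
    return terms


# Hypothetical, shortened sample in the shape of the WTO excerpt
SAMPLE_LABELS = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://thes.world-tourism.org#VACANCES">
    <skos:altLabel xml:lang="fr">long séjour</skos:altLabel>
    <skos:altLabel xml:lang="fr">06.09</skos:altLabel>
    <skos:prefLabel xml:lang="en">HOLIDAYS</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">VACANCES</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

if __name__ == "__main__":
    print("\n".join(skos_to_spellcheck_terms(SAMPLE_LABELS)))
```

Notation codes such as « 06.09 » are kept here because they also appear in the file on the slide; filtering them out would be a reasonable variation.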
  30. Result
  32. 5: content semantic structuring
      • « Smarter content = smarter index »
      • It takes content semantic structuring to enhance the search experience
        – Associate meaningful metadata to content
        – Meaningful metadata bring unambiguous values from reference vocabularies (identification using URIs)
      • Associating structured metadata to content enables faceted navigation
      • This is a wide-ranging process which we will not describe in detail in this presentation
        – E.g.: use of text mining and/or integration middleware such as Mondeca’s CA Manager
          • SolR supports UIMA integration in its indexing chain to add text mining tools
        – E.g.: manual tagging in the case of tourism catalogues
  33. Structured catalogue in RDF
  34. Index schema configuration
      • 1 index field for each metadata
        – In conf/schema.xml
          <field name="Mot_Cle_103696" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="animaux_acceptes" multiValued="false" type="string" indexed="true" stored="true" />
          <field name="bassin_touristique_at" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="bordereau_Tourinfrance_103952" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="commune_at" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="zone_geographique_at" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="paiement_accepte" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="label_at" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="langue_parlee" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="type_h" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="classement" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="tarif_nuit_mini" multiValued="true" type="string" indexed="true" stored="true" />
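Once each metadata has its own field, facets are requested per field at query time. The sketch below assumes the standard SolR `facet`, `facet.field` and `fq` request parameters; the helper name and the filter value « Hotel » are hypothetical.

```python
from urllib.parse import urlencode


def build_facet_url(base_url, query, facet_fields, filters=None):
    """Return a select URL asking for facet counts on the given fields.

    `filters` maps a field name to the facet value the user clicked on;
    each entry becomes one filter query (fq).
    """
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [("fq", f'{field}:"{value}"') for field, value in (filters or {}).items()]
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(build_facet_url("http://localhost:8080/solr", "lac",
                          ["classement", "langue_parlee"],
                          filters={"type_h": "Hotel"}))
```

Using `string` fields, as in the schema above, keeps each facet value intact instead of splitting it into tokens.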
  35. Result: facets
  37. 6: dynamic content classification
      • Why?
        – The classification plan used in the catalogue is not meant to be understood by end users
          • « objective » vs. « subjective » vision of the content
        – There is a need to adapt the classification plan:
          • To different types of audiences
          • For different channels
        – The same catalogue needs to be presented according to different perspectives
        – To increase content repurposing
      • Example: looking for a place to stay?
        – Simple?
        – Classic?
        – Elegant?
  38. ITM-rules: creating the rules
  39. Rules definitions: format
      • New hierarchical classifications are in SKOS
      • A SPARQL classification rule (generated from ITM Rule Editor) is associated to each entry in the SKOS file
          <skos:Concept rdf:about="itm:n#_migration_taxo_106544">
            <skos:prefLabel xml:lang="fr">Raffiné</skos:prefLabel>
            <skos:definition>
              PREFIX r: <itm:n#>
              PREFIX q: <http://www.nievre-tourisme.com/onto#>
              CONSTRUCT { ?SEARCHED_TOPIC <http://purl.org/dc/terms/subject> r:_migration_taxo_106544 . }
              WHERE {
                ?SEARCHED_TOPIC a q:Hebergement .
                ?SEARCHED_TOPIC q:classement q:class_CAT4 .
              }
            </skos:definition>
            <skos:definition>
              PREFIX j: <itm:n#>
              PREFIX i: <http://www.nievre-tourisme.com/onto#>
              CONSTRUCT { ?SEARCHED_TOPIC <http://purl.org/dc/terms/subject> j:_migration_taxo_106544 . }
              WHERE {
                ?SEARCHED_TOPIC a i:Hebergement .
                ?SEARCHED_TOPIC i:classement i:class_4EP .
              }
            </skos:definition>
          </skos:Concept>
  40. Content Classifier: rules execution
      • Inputs:
        – Taxonomy (classification rules): SKOS + SPARQL
          • ?x is a <Hotel> and price(?x) < 50
          • ?x is a <Camping> and size(?x) > 300
          • …
        – Terminology: SKOS + RDF
        – Content metadata: RDF
      • Classification engine:
        – Based on an RDF triplestore
        – Loads terminology and metadata
        – Infers on terminology
          • OWL & SKOS inference
          • Custom rules
        – Applies SPARQL classification rules
        – Optionally, simplifies the RDF structure
      • Output: classification metadata
        – Content classified with additional dcterms:subject and dc:subject properties
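The engine itself runs SPARQL CONSTRUCT rules on a triplestore. As a plain-Python stand-in (deliberately not SPARQL, and not Mondeca's implementation), the following sketch shows the principle: a resource receives a category whenever all conditions of one rule hold. The resource identifiers and the value `class_CAT1` are hypothetical.

```python
def classify(resources, rules):
    """Return {resource_id: [category, ...]} for every rule whose conditions all hold.

    resources: {resource_id: {property: [values]}}
    rules: list of (category, {property: required_value}) pairs
    """
    result = {}
    for resource_id, properties in resources.items():
        for category, conditions in rules:
            matches = all(value in properties.get(prop, ())
                          for prop, value in conditions.items())
            if matches and category not in result.get(resource_id, []):
                result.setdefault(resource_id, []).append(category)
    return result


# Two rules for the same category, mirroring the two skos:definition blocks
RULES = [
    ("_migration_taxo_106544", {"type": "Hebergement", "classement": "class_CAT4"}),
    ("_migration_taxo_106544", {"type": "Hebergement", "classement": "class_4EP"}),
]

RESOURCES = {
    "hotel_1": {"type": ["Hebergement"], "classement": ["class_CAT4"]},
    "gite_2": {"type": ["Hebergement"], "classement": ["class_4EP"]},
    "camping_3": {"type": ["Hebergement"], "classement": ["class_CAT1"]},
}

if __name__ == "__main__":
    print(classify(RESOURCES, RULES))
```

The resulting categories play the role of the additional dcterms:subject values that are later indexed in dedicated fields.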
  41. Catalogue classified with additional metadata
  42. Additional index fields for the new classifications
      • In conf/schema.xml
          <field name="taxo_confort" multiValued="true" type="string" indexed="true" stored="true" />
          <field name="taxo_generale" multiValued="true" type="string" indexed="true" stored="true" />
  43. Dynamic classification: result (facets on the new taxo_confort and taxo_generale fields)
  45. 7: using reference vocabulary alignments
      • Why?
        – What if content A is annotated using thesaurus A, and users want to search content using thesaurus B?
        – Allows queries on a corpus annotated with a thesaurus different from the one used to control queries
      [Diagram: WTO and GEMET linked by a thesaurus alignment]
  46. ITM-align: creation of alignments
  47. Alignment formats
      « EDOAL » format from INRIA: http://alignapi.gforge.inria.fr/edoal.html
          <map>
            <Cell rdf:about="150046">
              <entity1>
                <edoal:Class rdf:about="http://eurlex-directory-codes.europa.eu/0350" />
              </entity1>
              <entity2>
                <edoal:Class rdf:about="http://eurovoc.europa.eu/2897" />
              </entity2>
              <relation>fr.inrialpes.exmo.align.impl.rel.EquivRelation</relation>
              <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</measure>
            </Cell>
          </map>
          <map>
            <Cell rdf:about="152849">
              <entity1>
                <edoal:Class rdf:about="http://eurlex-directory-codes.europa.eu/0350" />
              </entity1>
              <entity2>
                <edoal:Class rdf:about="http://eurovoc.europa.eu/2479" />
              </entity2>
              <relation>fr.inrialpes.exmo.align.impl.rel.EquivRelation</relation>
              <measure rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.0</measure>
            </Cell>
          </map>
      Callouts on the slide: entity1 and entity2 are the aligned concepts, relation is the relation type, measure is the score.
  48. Using alignments
      • When indexing
      • The original document annotations are translated using the alignment
        – from thesaurus A to thesaurus B
      • The index is enriched with concepts from thesaurus B
        – The index now contains annotations based on thesaurus A and thesaurus B
      • One can then search the corpus using concepts from thesaurus B
      • The alignment is interpreted by specific code in the indexing chain; there is no specific configuration in SolR
        – except to specify a dedicated field which will be used for the result of the alignment translation
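The « specific code in the indexing chain » is not shown in the slides. A minimal sketch of what such a step could look like, assuming the alignment cells have already been parsed into a dictionary; the field names `keywords` and `keywords_aligned` and the score threshold are ours.

```python
def translate_annotations(annotations, alignment, min_score=1.0):
    """Translate concept URIs through the alignment, keeping confident matches only."""
    translated = []
    for concept in annotations:
        for target, score in alignment.get(concept, []):
            if score >= min_score and target not in translated:
                translated.append(target)
    return translated


def enrich_document(document, alignment):
    """Return a copy of the document with a dedicated field for aligned concepts."""
    enriched = dict(document)
    enriched["keywords_aligned"] = translate_annotations(document["keywords"], alignment)
    return enriched


# The two cells from the EDOAL excerpt: entity1 -> [(entity2, measure)]
ALIGNMENT = {
    "http://eurlex-directory-codes.europa.eu/0350": [
        ("http://eurovoc.europa.eu/2897", 1.0),
        ("http://eurovoc.europa.eu/2479", 1.0),
    ],
}

if __name__ == "__main__":
    doc = {"id": "doc1", "keywords": ["http://eurlex-directory-codes.europa.eu/0350"]}
    print(enrich_document(doc, ALIGNMENT))
```

Since the cells declare equivalence relations, the same table could be inverted to translate in the other direction.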
  49. Reference vocabulary alignments: result
      [Screenshot callouts: keywords from the source thesaurus (eurovoc); keywords from concepts translated using alignments (from eurovoc to eurlex)]
  51. Disambiguation
      • Why?
        – Match a user’s searched term to a controlled entity
          • « loisirs » → http://thes.world-tourism.org#LOISIRS
      • Disambiguating entities when searching only makes sense if the same entities have been disambiguated when indexing
        – Either the document was explicitly categorized using a controlled entity (its URI)
        – Or the entity was extracted using text mining tools
      • Disambiguation of an entity from a controlled vocabulary by the search engine is possible only if the controlled vocabulary has itself been indexed by the search engine
  52. Disambiguation: principle
      1. Use reference vocabulary when indexing
      2. Indexing of reference vocabulary
      3. Keyword disambiguation using a controlled entity
      4. Search on controlled entity id
      [Diagram data: the vocabulary index maps venus to http://www.z.fr/e1 and cupidon to http://www.z.fr/e2; the content index maps http://www.z.fr/e1 to doc1 and doc2.]
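The four steps can be sketched with two in-memory dictionaries standing in for the two indexes. This is an illustration of the principle only, using the data from the diagram; the function name is ours.

```python
# Step 2: the reference vocabulary is itself indexed (label -> entity URI)
VOCABULARY_INDEX = {
    "venus": "http://www.z.fr/e1",
    "cupidon": "http://www.z.fr/e2",
}

# Step 1: documents were annotated with entity URIs when indexing
DOCUMENT_INDEX = {
    "http://www.z.fr/e1": ["doc1", "doc2"],
}


def search(keyword):
    """Disambiguate the keyword to an entity (step 3), then search on its id (step 4)."""
    entity = VOCABULARY_INDEX.get(keyword.strip().lower())
    if entity is None:
        return None, []
    return entity, DOCUMENT_INDEX.get(entity, [])


if __name__ == "__main__":
    print(search("Venus"))
    print(search("cupidon"))
```

A keyword that resolves to an entity no document was annotated with returns the entity but no documents, which is the case the slide on disambiguation warns about.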
  53. Disambiguation: result
  54. Thank you for your attention!
      thomas.francart@mondeca.com
