A Domain Based Approach to Information Retrieval in Digital Libraries

The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this end, improving retrieval performance must go beyond simple lexical interpretation of the user queries and pass through an understanding of their semantic content and aims. Any digital library would benefit greatly from effective Information Retrieval techniques to offer its users. This paper proposes an approach to Information Retrieval based on a correspondence of the domain of discourse between the query and the documents in the repository. Such an association is based on standard general-purpose linguistic resources (WordNet and WordNet Domains) and on a novel similarity assessment technique. Although the work is at a preliminary stage, interesting initial results encourage further extension and improvement of the approach.

Published in: Technology, Education

Transcript

  • 1. Università degli studi di Bari “Aldo Moro”, Dipartimento di Informatica. A Domain Based Approach to Information Retrieval in Digital Libraries. F. Rotella, S. Ferilli, F. Leuzzi (L.A.C.A.M.). Contacts: ferilli@di.uniba.it, {fabio.leuzzi, rotella.fulvio}@gmail.com; http://lacam.di.uniba.it:8000. 8th Italian Research Conference on Digital Libraries, Bari, Italy, February 9-10, 2012
  • 2. Overview ● Introduction & Objectives ● Keyword Extraction ● Word Sense Disambiguation ● Synset Clustering ● A Multistrategy Similarity Measure ● Document Partitioning ● User Query Processing ● A Preliminary Evaluation ● Conclusions & Future Works [Slide footer: A Domain Based Approach to Information Retrieval in Digital Libraries - F. Rotella, S. Ferilli, F. Leuzzi]
  • 3. Introduction Some repositories leave the responsibility for quality to the authors. + Anybody can produce and distribute documents. = The average quality of the repository contents may be low. Users are often overwhelmed by documents that only appear suitable for satisfying their information needs.
  • 4. Introduction ● Possible way out: Information Retrieval systems ● Numerical/statistical manipulation of (key)words has been widely explored in the literature ● Still unable to fully solve the problem ● Achieving better retrieval performance requires going beyond simple lexical interpretation of the user queries ● Pass through an understanding of their semantic content and aims ● Ontological taxonomy ● WordNet ● WordNet Domains
  • 5. Objectives Improving the use of a digital library (DL) ● Use of advanced techniques for document retrieval ● Try to overcome the ambiguity of natural language ● Inspired by the typical behavior of humans: ● take into account the possible meanings of words ● select the most appropriate one according to the context of the discourse
  • 6. Keyword Extraction ● Each document in the digital library is progressively split into paragraphs, sentences, and single words ● Integrated in the DOMINUS framework ● Obtained the syntactic structure of sentences, and the lemmas ● Integrated in the Stanford Parser ● Classical VSM ● TF*IDF weighting ● Two filters: ● Only nouns considered ● The representation of adverbs, verbs and adjectives in WordNet is different ● Only the top 10% of keywords for each document ● To be noise-tolerant ● To limit the possibility of including non-discriminative and very general words in the representation of a document
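The weighting and the top-10% filter described in this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the nouns have already been extracted and lemmatized, uses one common TF*IDF variant (relative term frequency times the natural log of the inverse document frequency), and the function name `top_keywords` is hypothetical.

```python
import math
from collections import Counter

def top_keywords(docs, fraction=0.10):
    """docs maps a document id to its list of noun lemmas (assumed pre-extracted).
    Returns, per document, the top `fraction` of its distinct terms by TF*IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    result = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        # Assumed variant: relative frequency * ln(N / df).
        weights = {t: (tf[t] / len(terms)) * math.log(n_docs / df[t]) for t in tf}
        # Keep at least one keyword, otherwise the top 10% of distinct terms.
        keep = max(1, math.ceil(fraction * len(weights)))
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        result[doc_id] = ranked[:keep]
    return result

toy_docs = {
    "d1": ["music", "opera", "music", "life"],
    "d2": ["life", "belief", "doctrine", "belief"],
}
keywords = top_keywords(toy_docs)
```

A term that occurs in every document (here "life") gets weight zero, which is how the filter discards very general words.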
  • 7. Word Sense Disambiguation: Domain Driven. One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain. Steps shown in the diagram: ● Extraction of all synsets for each term ● Extraction of all domains for each synset ● Prevalent domain identification ● Choice of the prevalent-domain synset
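A minimal sketch of domain-driven disambiguation under the One Domain per Discourse assumption. The lookup tables stand in for WordNet and WordNet Domains; every synset and domain name below is a made-up toy value, and the fallback for terms with no synset in the prevalent domain is an assumption, since the slides do not say how that case is handled.

```python
from collections import Counter

def disambiguate(terms, synsets_of, domains_of):
    """Pick, for each term, a synset belonging to the text's prevalent domain.
    Assumes `terms` is non-empty and every term has at least one synset."""
    # Count how often each domain appears over all candidate synsets.
    domain_counts = Counter(
        d for t in terms for s in synsets_of[t] for d in domains_of[s]
    )
    # Prevalent domain; ties are broken alphabetically (an assumption).
    prevalent = max(sorted(domain_counts), key=domain_counts.__getitem__)
    chosen = {}
    for t in terms:
        in_domain = [s for s in synsets_of[t] if prevalent in domains_of[s]]
        # Fallback (assumption): keep the first listed synset if none matches.
        chosen[t] = in_domain[0] if in_domain else synsets_of[t][0]
    return prevalent, chosen

# Toy stand-ins for WordNet / WordNet Domains lookups (hypothetical names).
toy_synsets = {
    "mass": ["mass.physics", "mass.music"],
    "opera": ["opera.music"],
    "melody": ["melody.music"],
}
toy_domains = {
    "mass.physics": ["physics"],
    "mass.music": ["music", "religion"],
    "opera.music": ["music"],
    "melody.music": ["music"],
}
prevalent, senses = disambiguate(["mass", "opera", "melody"], toy_synsets, toy_domains)
```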
  • 8. Synset Clustering: Pairwise complete link agglomerative strategy ● Each synset generates a singleton cluster ● For each pair of clusters ● If the complete link property holds ● Merge the involved clusters
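The clustering loop can be sketched as below. The slide does not define the complete link property precisely; this sketch assumes the usual reading, namely that every cross-cluster pair of items must reach a similarity threshold. The threshold value, the toy similarity table, and the function name are illustrative assumptions.

```python
from itertools import combinations

def complete_link_clusters(items, sim, threshold):
    """Agglomerative clustering: merge two clusters only when every pair of
    items taken across them has similarity >= threshold (assumed criterion)."""
    clusters = [[x] for x in items]  # each item starts as a singleton cluster
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if all(sim(x, y) >= threshold for x in clusters[a] for y in clusters[b]):
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]  # safe: b > a, so index a is unaffected
                merged = True
                break  # restart the scan over the updated cluster list
    return clusters

# Toy symmetric similarity table (hypothetical values).
toy_sim = {
    frozenset({"opera", "mass"}): 0.9,
    frozenset({"belief", "doctrine"}): 0.8,
}
sim = lambda x, y: toy_sim.get(frozenset({x, y}), 0.0)
clusters = complete_link_clusters(["opera", "mass", "belief", "doctrine"], sim, 0.7)
```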
  • 9. A Multistrategy Similarity Measure: 3 components are summed and normalized, in ]0,1[ ● depth (ancestors) ● breadth (direct neighbors) ● breadth (inverse neighbors) WordNet relationships are considered [Slide footer: Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella]
  • 10. A Multistrategy Similarity Measure: Considered Relationships ● member meronymy: the latter synset is a member meronym of the former; ● substance meronymy: the latter synset is a substance meronym of the former; ● part meronymy: the latter synset is a part meronym of the former; ● similarity: the latter synset is similar in meaning to the former; ● antonym: specifies antonymous word; ● attribute: defines the attribute relation between noun and adjective synset pairs in which the adjective is a value of the noun; ● additional information: additional information about the first word can be obtained by seeing the second word; ● part of speech based: specifies two different relations based on the parts of speech involved; ● participle: the adjective first word is a participle of the verb second word; ● hypernymy: the latter synset is a hypernym of the former.
  • 11. Document Partitioning ● SynsetWord structure: ● Original word ● TF*IDF weight ● Synset ● The Pairwise Clustering step returned a set of synset clusters ● For each document in the collection ● Each of its SynsetWords votes with its TF*IDF weight ● The first three clusters are chosen from the ranked list ● They represent the intensional description of the document
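The voting step can be sketched as follows. The slide does not state which cluster a SynsetWord votes for; this sketch assumes it votes for every cluster that contains its synset. The `SynsetWord` tuple mirrors the three fields listed above, while the function name and toy values are hypothetical.

```python
from collections import namedtuple, defaultdict

# Mirrors the SynsetWord structure: original word, TF*IDF weight, synset.
SynsetWord = namedtuple("SynsetWord", ["word", "weight", "synset"])

def describe_document(synset_words, clusters, top=3):
    """Return the indices of the `top` clusters with the most weighted votes."""
    votes = defaultdict(float)
    for sw in synset_words:
        for idx, cluster in enumerate(clusters):
            # Assumption: a SynsetWord votes for clusters containing its synset.
            if sw.synset in cluster:
                votes[idx] += sw.weight
    ranked = sorted(votes, key=lambda i: (-votes[i], i))
    return ranked[:top]

toy_clusters = [{"s1", "s2"}, {"s3"}, {"s4"}, {"s5"}]
toy_doc = [
    SynsetWord("a", 0.5, "s1"),
    SynsetWord("b", 0.2, "s2"),
    SynsetWord("c", 0.6, "s3"),
    SynsetWord("d", 0.1, "s4"),
    SynsetWord("e", 0.05, "s5"),
]
description = describe_document(toy_doc, toy_clusters)
```

Here cluster 0 wins because two keywords vote for it, even though the single heaviest keyword belongs to cluster 1.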
  • 12. User Query Elaboration: Overview ● Same grammatical preprocessing as in the previous phase ● Queries are usually very short ● No keyword extraction: all nouns are retained for the next operations ● Domain-driven WSD is unreliable on such short queries ● For each word, all corresponding synsets in WordNet are kept ● A single lexical query yields many semantic queries ● All possible combinations of synsets
  • 13. User Query Elaboration: A Brute Force WSD. For each combination: ● a similarity is evaluated against each cluster that has at least one associated document ● using the same similarity function as for clustering Twofold objective: ● finding the combination of synsets that represents the best word sense disambiguation ● obtaining the most similar cluster to the involved words
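The exhaustive search over sense combinations can be sketched as below. The slides do not say how a combination is scored against a whole cluster, so this sketch assumes the average pairwise similarity; the toy similarity function (1.0 when two names share a suffix) only stands in for the multistrategy measure, and all names are hypothetical.

```python
from itertools import product

def best_combination(query_nouns, synsets_of, clusters, docs_of, sim):
    """Try every combination of one synset per query noun against every
    cluster that has documents; return (score, combination, cluster id)."""
    best = None
    for combo in product(*(synsets_of[n] for n in query_nouns)):
        for cid, cluster in clusters.items():
            if not docs_of.get(cid):
                continue  # skip clusters with no associated document
            # Assumed aggregation: mean similarity over all cross pairs.
            score = sum(sim(q, c) for q in combo for c in cluster) / (
                len(combo) * len(cluster)
            )
            if best is None or score > best[0]:
                best = (score, combo, cid)
    return best

def same_domain(a, b):
    # Toy similarity: 1.0 if the names share the suffix after the dot.
    return 1.0 if a.split(".")[1] == b.split(".")[1] else 0.0

toy_synsets = {"mass": ["mass.physics", "mass.music"], "melody": ["melody.music"]}
toy_clusters = {
    "c_music": ["opera.music", "genre.music"],
    "c_physics": ["energy.physics"],
    "c_unused": ["mass.music"],
}
toy_docs = {"c_music": ["d1", "d2"], "c_physics": ["d3"], "c_unused": []}
best = best_combination(["mass", "melody"], toy_synsets, toy_clusters, toy_docs, same_domain)
```

The number of combinations is the product of the sense counts of the query nouns, which stays manageable only because queries are short.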
  • 14. User Query Elaboration: Query Results. The best combination is used to obtain the list of clusters ranked by descending relevance, which can be used as an answer to the user search. The results are then displayed to the user: specifically, the first n sets of documents are shown, where n is the minimum value that yields at least 10 results.
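The cut-off rule for displayed results can be sketched as follows. Counting distinct documents (rather than counting a document once per cluster it belongs to) is an assumption, and the function name and toy data are hypothetical.

```python
def clusters_to_show(ranked_clusters, docs_of, min_results=10):
    """Take clusters in rank order until at least `min_results` distinct
    documents are collected; return the clusters used and the documents."""
    shown, results = [], []
    for cid in ranked_clusters:
        if len(results) >= min_results:
            break
        shown.append(cid)
        for doc in docs_of[cid]:
            if doc not in results:  # assumption: each document is listed once
                results.append(doc)
    return shown, results

toy_docs = {
    "c1": ["d1", "d2", "d3", "d4", "d5", "d6"],
    "c2": ["d5", "d6", "d7", "d8", "d9", "d10", "d11"],
    "c3": ["d12"],
}
shown, results = clusters_to_show(["c1", "c2", "c3"], toy_docs)
```

Because whole document sets are added at a time, the list can exceed 10 entries, as in this toy run where two clusters yield 11 documents.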
  • 15. A Preliminary Evaluation: The Quality of Clusters. 86 documents, 4 topics: 27 general science and physics; 21 music; 15 politics; 23 religion. Query: Reincarnation and eternal life. Best combination: ● synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person’s own actions; ● synset: 100006269; lemmas: life; gloss: living things collectively. Most similar cluster: ● synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person’s own actions; ● synset: 105943300; lemmas: doctrine, philosophical system, philosophy and school of thought; gloss: a belief (or system of beliefs) accepted as authoritative by some group or school; ● synset: 105941423; lemmas: belief; gloss: any cognitive content held as true.
  • 16. A Preliminary Evaluation: The Quality of Clusters. Query: Ornaments and melodies. Best combination: ● synset: 103169390; lemmas: decoration, ornament and ornamentation; gloss: something used to beautify; ● synset: 107028373; lemmas: air, line, melodic line, melodic phrase, melody, strain and tune; gloss: a succession of notes forming a distinctive sequence. Most similar cluster: ● synset: 107025900; lemmas: classical, classical music and serious music; gloss: traditional genre of music conforming to an established form and appealing to critical interest and developed musical taste; ● synset: 107033753; lemmas: mass; gloss: a musical setting for a Mass; ● synset: 107026352; lemmas: opera; gloss: a drama set to music, consists of singing with orchestral accompaniment and an orchestral overture and interludes; ● synset: 107071942; lemmas: genre, music genre, musical genre and musical style; gloss: an expressive style of music; ● synset: 107064715; lemmas: rock, rock ’n’ roll, rock and roll, rock music, rock’n’roll and rock-and-roll; gloss: a genre of popular music originating in the 1950s, a blend of black rhythm-and-blues with white country-and-western.
  • 17. A Preliminary Evaluation: Synthesis of Outcomes
      # | Query | Outcomes (result positions and topic) | Precision | Recall
      1 | Ornaments and melodies | [1 to 9] music; [10 to 11] religion | 0.82 (1.0) | 0.43 (9/21)
      2 | Reincarnation and eternal life | [1 to 9] religion; [10] science | 0.9 (1.0) | 0.39 (9/23)
      3 | Traditions and folks | [1 to 4] music; [5 to 6] religion; [7 to 10] music | 0.8 (1.0) | 0.38 (8/21)
      4 | Limits of theory of relativity | [1 to 2] science; [3] politics; [4 to 5] religion; [6 to 15] science | 0.8 | 0.44 (12/27)
      5 | Capitalism vs communism | [1 to 3] politics; [4] science; [5 to 6] religion; [7 to 11] politics; [12] science; [13] music | 0.61 (0.77) | 0.53 (8/15)
      6 | Markets and new economy | [1] politics; [2] music; [3] science; [4 to 8] politics; [9 to 10] religion | 0.6 (0.7) | 0.4 (6/15)
      7 | Relationship between democracy and parliament | [1 to 3] politics; [4] science; [5 to 6] politics; [7 to 10] religion | 0.5 (0.6) | 0.33 (5/15)
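The unparenthesized precision and recall figures can be recomputed from the outcome ranges and the topic sizes given in the evaluation setup (27 science, 21 music, 15 politics, 23 religion). A minimal sketch, assuming the expected topic of each query is the one implied by its recall denominator; the function name is hypothetical.

```python
# Topic sizes as stated in the evaluation setup (86 documents in total).
TOPIC_SIZES = {"science": 27, "music": 21, "politics": 15, "religion": 23}

def precision_recall(outcomes, target):
    """outcomes: list of (first_position, last_position, topic) ranges.
    target: the topic the query is expected to retrieve (an assumption)."""
    retrieved = sum(end - start + 1 for start, end, _ in outcomes)
    relevant = sum(
        end - start + 1 for start, end, topic in outcomes if topic == target
    )
    return relevant / retrieved, relevant / TOPIC_SIZES[target]

# Query 1: "Ornaments and melodies", expected topic music.
p1, r1 = precision_recall([(1, 9, "music"), (10, 11, "religion")], "music")

# Query 4: "Limits of theory of relativity", expected topic science.
p4, r4 = precision_recall(
    [(1, 2, "science"), (3, 3, "politics"), (4, 5, "religion"), (6, 15, "science")],
    "science",
)
```

For query 1 this gives 9/11 and 9/21, and for query 4 it gives 12/15 and 12/27, matching the reported values after rounding.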
  • 18. Conclusions Proposed an approach to extract information from digital libraries ● Go beyond simple lexical matching, toward the semantic content underlying queries The approach consists of: ● An off-line preprocessing of the entire corpus ● Find sets of synsets as intensional descriptions for the documents ● An on-line phase on the queries ● Find the most suitable sense, evaluating all possible combinations of synsets against each intensional description of the documents ● In order to propose the most relevant ones as the result Preliminary experiments show that this approach can be viable.
  • 19. Future Works ● Replacing the One Domain per Discourse (ODD) assumption with a more elaborate strategy for WSD ● Avoiding the pre-processing step ● To handle cases when new documents are progressively included in the collection ● Including adverbs, verbs and adjectives ● To improve the quality of the semantic representatives of the documents ● Exploring other approaches to choose better intensional descriptions of each document
