Non-textual ranking in Digital Libraries
Transcript

  • 1. 1 Non-textual ranking in digital libraries Philipp Mayr Hochschule Darmstadt Jour fixe ISE on 18.11.2009 Hochschule Darmstadt Fachbereich Media Slides in cooperation with Peter Mutschke & Philipp Schaer
  • 2. 2 Agenda • Introduction • Ranking in DL • IRM project • Non-textual ranking in IRM • Results • Conclusion & Demo
  • 3. 3 Background I Database perspective: • Large and heterogeneous document sets for subject specific questions • Various relevant and accessible databases for a topic • Focus on bibliographic databases (metadata) • journal articles • monographs
  • 4. 4 Background II User perspective: • Relevant & high-quality documents (relevance ranking) • Comprehensive search: documents from other fields • Flexible search systems: alternative search strategies and techniques (e.g. Berrypicking) • Value-added, e.g. direct access to full texts or metrics like citation counts in Google Scholar
  • 5. 5 Digital Libraries • Indexing & Abstracting databases • Library catalogues • Full texts • Links to online resources • Data • Digital objects • …
  • 6. 6 Ranking Models: • Exact match vs. best match (e.g. tf-idf) • Sorting vs. ranking Textual vs. non-textual ranking
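As a rough illustration of the best-match idea, the sketch below scores documents by the summed tf-idf of the query terms instead of requiring an exact Boolean match. It is a toy example only; the tokenized documents and the plain tf-idf variant are assumptions, not the weighting used in any of the systems discussed here.

```python
# Toy tf-idf best-match ranking (illustration only; the weighting variant is an assumption).
import math
from collections import Counter

def tfidf_rank(query_terms, documents):
    """Rank documents (given as lists of index terms) by the summed tf-idf of the query terms."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))  # document frequency
    scores = []
    for i, doc in enumerate(documents):
        tf = Counter(doc)
        score = sum(tf[t] * math.log(n / df[t]) for t in query_terms if t in df)
        scores.append((i, score))
    return sorted(scores, key=lambda s: s[1], reverse=True)  # best match: highest score first

docs = [
    ["ranking", "digital", "library", "metadata"],
    ["youth", "violence", "survey"],
    ["ranking", "ranking", "citation", "analysis"],
]
print(tfidf_rank(["ranking", "metadata"], docs))
```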
  • 7. 7 Ranking: non-textual Link analysis (PageRank, HITS) Relevance feedback (user feedback) Popularity (documents accessed)
  • 8. 8 Ranking in Digital Libraries
  • 9. 9 Ranking in Digital Libraries
  • 10. 10 Ranking in Digital Libraries
  • 11. 11 Non-textual ranking in DL • Link analysis (PageRank, HITS) • Relevance feedback (user feedback) • Popularity (documents accessed) These approaches are problematic in DL. BUT we have: • High quality metadata • Controlled vocabularies • Maintained (curated) collections
  • 12. 12 Project IRM
  • 13. 13 Value Added Services for IR Systems Major problem areas of scholarly IR systems (Krause 2007): 1. search term vagueness 2. information overload by large result sets IRM services → structural attributes of the science system: • (1) Search Term Recommender: more appropriate terms from controlled vocabulary (co-word analysis) • (2a) Bradfordizing: re-ranking by core journals (bibliometrics) • (2b) Author Centrality: re-ranking by centrality in co-authorship networks (network analysis)
  • 14. 14 Search Term Recommender (Petras 2006) Search Term Service: recommending strongly associated terms from controlled vocabulary
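A co-word based recommender can be sketched in a few lines: count in how many records a free query term co-occurs with each controlled descriptor and return the most strongly associated descriptors. The record structure and the association measure (Jaccard) are assumptions for illustration; this is not the STR implementation described by Petras (2006).

```python
# Co-word search term recommender sketch (record layout and Jaccard measure are assumptions).
from collections import defaultdict

def build_indexes(records):
    """records: (free_terms, controlled_descriptors) pairs extracted from bibliographic metadata."""
    free_idx = defaultdict(set)   # free term  -> record ids
    ctrl_idx = defaultdict(set)   # descriptor -> record ids
    for rid, (free_terms, descriptors) in enumerate(records):
        for t in free_terms:
            free_idx[t].add(rid)
        for d in descriptors:
            ctrl_idx[d].add(rid)
    return free_idx, ctrl_idx

def recommend(query_term, free_idx, ctrl_idx, k=3):
    """Return the k controlled descriptors most strongly associated (Jaccard) with query_term."""
    q_docs = free_idx.get(query_term, set())
    scored = [(d, len(q_docs & docs) / len(q_docs | docs))
              for d, docs in ctrl_idx.items() if q_docs | docs]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

records = [
    ({"teenager", "violence"}, {"Jugend", "Gewalt"}),
    ({"teenager", "school"},   {"Jugend", "Schule"}),
    ({"poverty", "city"},      {"Armut", "Stadt"}),
]
free_idx, ctrl_idx = build_indexes(records)
print(recommend("teenager", free_idx, ctrl_idx))
```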
  • 15. 15 Bradfordizing (White 1981, Mayr 2009) Bradford Law of Scattering (Bradford 1948), idealized example for 450 articles: Nucleus/Core: 150 papers in 3 journals; Zone 2: 150 papers in 9 journals; Zone 3: 150 papers in 27 journals. Ranking by Bradfordizing: sorting the core journal papers / core books on top. Figure: bradfordized list of journals in informetrics. Applied to monographs: publisher as sorting criterion.
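The three-zone split can be computed directly from a result set's journal productivity distribution. The sketch below assumes, as in the idealized 450-article example above, that the zones are chosen to hold roughly equal numbers of articles; the zone boundaries used in the project may be derived differently.

```python
# Split journals ranked by productivity into three Bradford zones of roughly equal
# article counts (sketch; the zone boundaries used in practice may differ).
def bradford_zones(journal_counts):
    """journal_counts: journal identifier -> number of articles in the result set."""
    ranked = sorted(journal_counts.items(), key=lambda jc: jc[1], reverse=True)
    total = sum(count for _, count in ranked)
    zones, zone, seen = [[], [], []], 0, 0
    for journal, count in ranked:
        zones[zone].append(journal)
        seen += count
        if zone < 2 and seen >= total * (zone + 1) / 3:
            zone += 1
    return zones  # [core, zone 2, zone 3]

# Idealized example from the slide: 3 + 9 + 27 journals, roughly 150 articles per zone.
counts = {f"J{i}": 50 for i in range(3)}
counts.update({f"J{i}": 16 for i in range(3, 12)})
counts.update({f"J{i}": 5 for i in range(12, 39)})
core, z2, z3 = bradford_zones(counts)
print(len(core), len(z2), len(z3))   # -> 3 9 27
```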
  • 16. 16 Author Centrality (Mutschke 2001, 2004) Ranking by Author Centrality: sorting central author papers on top
  • 17. 17 Scenarios for combined ranking services. Iterative use: Result Set → Core Journal Papers → Central Author Papers → Relevant Papers. Simultaneous use: Result Set → Central Author Papers and Core Journal Papers in parallel.
  • 18. 18 Combination Matrix (number = order of application; the same number in a row = simultaneous use)

    Combination | Author Centrality | Bradfordizing | STR
    0           | -                 | -             | -
    1           | 1                 | -             | -
    2           | -                 | 1             | -
    3           | -                 | -             | 1
    4           | 1                 | 2             | -
    5           | 1                 | 1             | -
    6           | 2                 | 1             | -
    7           | 2                 | -             | 1
    8           | -                 | 2             | 1
    9           | 2                 | 3             | 1
    10          | 3                 | 2             | 1
    11          | 2                 | 2             | 1
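The two usage modes behind the matrix can be expressed as follows. This is only a sketch: modelling "simultaneous use" as a rank-sum fusion of the individual re-rankings is an assumption, not necessarily how the IRM prototype merges the lists.

```python
# Sketch of iterative vs. simultaneous use of re-ranking services.
# Simultaneous use is modelled here as rank-sum fusion, which is an assumption.
def iterative(result_set, services):
    """Apply re-ranking services one after another (e.g. STR, then Bradfordizing)."""
    docs = list(result_set)
    for service in services:
        docs = service(docs)
    return docs

def simultaneous(result_set, services):
    """Re-rank the same result set with every service and fuse the lists by rank sum."""
    rankings = [service(list(result_set)) for service in services]
    return sorted(result_set, key=lambda doc: sum(r.index(doc) for r in rankings))

# toy services that just sort by document id length, in opposite directions
shortest_first = lambda docs: sorted(docs, key=len)
longest_first = lambda docs: sorted(docs, key=len, reverse=True)
hits = ["d1", "d22", "d333"]
print(iterative(hits, [shortest_first, longest_first]))
print(simultaneous(hits, [shortest_first, longest_first]))
```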
  • 19. 19 Main Research Issue: Contribution to retrieval quality and usability • Precision: – Do central authors (core journals) provide more relevant hits? – Do highly associated co-words have any positive effects? • Value-adding effects: – Do central authors (core journals) provide OTHER relevant hits? – Do co-word relationships provide OTHER relevant search terms? • Mashup effects: – Do combinations of the services enhance the effects?
  • 20. 20 Evaluation Design • precision in existing evaluation data: – Clef 2003-2007: 125 topics; 65,297 SOLIS documents – KoMoHe 2007: 39 topics; 31,155 SOLIS documents • plausibility tests: – author centrality / journal coreness ↔ precision – Bradfordizing ↔ author centrality • precision tests with users (Online-Assessment-Tool) • usability tests with users (acceptance)
  • 21. 21 Prototype Architecture 2,235,769 documents from
  • 22. 22 Motivation: non-textual approaches in DL • Larger document sets for subject specific searches need to be concentrated again (compensation, structuring) • Exploring alternative ranking approaches which can provide insights into document spaces and enhance retrieval • Plausibility that the nucleus of a literature or central authors provide utility for users searching large document spaces
  • 23. 23 Bradfordizing Basis: Bradford Law of Scattering Approach: Use of the document distributions (scattering) in scientific journal and monograph publications. Core journals on research topics → bibliometric approach • Identification of "core journals" and core publishers • ISSN and ISBN as identifiers
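In code, Bradfordizing a result set comes down to counting hits per journal (via ISSN) and moving papers from the most productive journals to the top. A minimal sketch, assuming a hypothetical record structure with an 'issn' field (for monographs, an 'isbn' or publisher field would take its place):

```python
# Bradfordizing sketch: papers from the journals that contribute most hits to the
# result set are sorted to the top. The record structure is hypothetical.
from collections import Counter

def bradfordize(records):
    journal_freq = Counter(r["issn"] for r in records if r.get("issn"))
    # more productive journal -> earlier position; ties keep the original ranking
    return sorted(records, key=lambda r: -journal_freq.get(r.get("issn"), 0))

hits = [
    {"title": "A", "issn": "1234-5678"},
    {"title": "B", "issn": "9999-0000"},
    {"title": "C", "issn": "1234-5678"},
]
print([r["title"] for r in bradfordize(hits)])   # -> ['A', 'C', 'B']
```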
  • 24. 24 Author Centrality Basis: Graph theory, network analysis Approach: Use of the interaction (communication) pattern → co-authorship relations in a research community • Identification of "experts" • Identification of networked, "central" persons • Different centrality measures
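The re-ranking by author centrality can be sketched with networkx: build the co-authorship graph of the result set, compute a centrality score per author, and rank each document by its most central author. Closeness centrality and the record structure are assumptions here; the actual service may use other centrality measures and a larger network.

```python
# Author centrality sketch (assumptions: closeness centrality, hypothetical record layout).
from itertools import combinations
import networkx as nx

def rank_by_author_centrality(records):
    G = nx.Graph()
    for r in records:
        G.add_nodes_from(r["authors"])
        G.add_edges_from(combinations(r["authors"], 2))   # co-authorship edges
    centrality = nx.closeness_centrality(G)
    # a document is ranked by its most central author
    return sorted(records,
                  key=lambda r: max((centrality[a] for a in r["authors"]), default=0.0),
                  reverse=True)

hits = [
    {"title": "A", "authors": ["Mayr", "Mutschke"]},
    {"title": "B", "authors": ["Schaer"]},
    {"title": "C", "authors": ["Mutschke", "Schaer"]},
]
print([r["title"] for r in rank_by_author_centrality(hits)])
```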
  • 25. 25 Results
  • 26. 26 Results: Bradfordizing Bradford distributions appear in all subject domains and also for queries in databases. It follows that Bradfordizing can be used for re-sorting results, generally for topic specific queries in bibliographic databases. Example: articles and journals in the three zones (core, z2, z3) for 5 topics:

    topic   | core articles | core journals | z2 articles | z2 journals | z3 articles | z3 journals
    1       | 45            | 3             | 46          | 11          | 52          | 42
    2       | 85            | 6             | 88          | 20          | 87          | 63
    3       | 72            | 9             | 66          | 19          | 72          | 55
    4       | 99            | 6             | 97          | 15          | 92          | 61
    5       | 66            | 2             | 71          | 14          | 65          | 50
    average | 73.4          | 5.2           | 73.6        | 15.8        | 74          | 54.2
  • 27. 27 CLEF topics 2006 [Figure: log-log plot (1–1000 journals vs. 1–1000 articles) of the three Bradford zones (core, z2, z3) for CLEF topics top151–top175; SOLIS database (German literature)]
  • 28. 28 Results: Bradfordizing Results from qualitative interviews with information professionals: 1. Spontaneous naming of core journals can be difficult: no naming of core journals for 50% of the topics; high plausibility of the bradfordized journals. 2. The majority attest a positive relevance effect for core journals. The highest added value can be expected for novice researchers and students in a scientific field. Perhaps zone 3 (periphery, long tail) is most valuable for senior researchers in a field.
  • 29. 29 Results: Bradfordizing Comparison of the average precision between the zones shows: core is more relevant than zone 2 and zone 3; zone 2 is more relevant than zone 3; mostly significant improvements (t-test, Wilcoxon); lower improvements for KoMoHe; a continuous shift of relevance → relevance-related distributions. (*) significant based on the Wilcoxon signed-rank test and the paired t-test. Improvements in %:

    CLEF (articles) | core vs. z3 | core vs. z2 | z2 vs. z3 | core vs. baseline
    2003            | 86.56 (*)   | 34.57 (*)   | 38.63 (*) | 32.65 (*)
    2004            | 69.23 (*)   | 22.45       | 38.20     | 26.25 (*)
    2005            | 78.03 (*)   | 29.05 (*)   | 37.95 (*) | 29.52 (*)
    2006            | 17.63       | 7.66        | 9.27      | 8.46
    2007            | 28.18 (*)   | 8.31        | 18.35     | 11.77
    average         | 55.93 (*)   | 20.41 (*)   | 28.48 (*) | 21.73 (*)

    KoMoHe (articles) | core vs. z3 | core vs. z2 | z2 vs. z3 | core vs. baseline
    Test1             | 18.82       | 11.75       | 6.32      | 9.84
    Test2             | 11.58       | 6.16        | 5.11      | 6.12
    Test3             | 19.32 (*)   | 8.67 (*)    | 9.80 (*)  | 9.00 (*)
    average           | 16.57 (*)   | 8.86        | 7.08 (*)  | 8.32 (*)
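The significance marks in the tables refer to a paired t-test and a Wilcoxon signed-rank test over per-topic precision values; with scipy such a test looks roughly as follows. The precision numbers in the sketch are invented purely for illustration.

```python
# Paired significance tests over per-topic precision values (the numbers are made up).
from scipy.stats import ttest_rel, wilcoxon

precision_core = [0.52, 0.61, 0.48, 0.70, 0.55]   # precision per topic, core zone
precision_z3   = [0.30, 0.45, 0.40, 0.52, 0.41]   # precision per topic, zone 3

t_stat, t_p = ttest_rel(precision_core, precision_z3)
w_stat, w_p = wilcoxon(precision_core, precision_z3)
print(f"paired t-test: p = {t_p:.3f}; Wilcoxon signed-rank: p = {w_p:.3f}")
```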
  • 30. 30 Results: Author centrality
  • 31. 31 Heuristic evaluation of the ex-post ranking model. User-evaluated applications (Jugendinstitut 1997, ASI-Tagung 2003). Result set sorted by PY, IDF and ACL, with the informational added value of ACL (PY = publication year, IDF = inverse document frequency, ACL = author closeness):

    Query                              | sorted by PY | sorted by IDF | sorted by ACL | Informational added value of ACL
    Jugend – Gewalt                    | 0.25         | 0.60          | 0.55          | 92
    Rechtsextremismus – Ostdeutschland | 0.35         | 0.45          | 0.60          | 122
    Frau – Personalpolitik             | 0.35         | 0.60          | 0.65          | 100
    Widerstand – Drittes Reich         | 0.40         | 0.65          | 0.95          | 138
    Zwangsarbeit – II. Weltkrieg       | 0.55         | 0.65          | 0.70          | 92
    Eliten – BRD                       | 0.40         | 0.70          | 0.85          | 107
    Armut – Stadt                      | 0.30         | 0.35          | 0.55          | 157
    Arbeiterbewegung – 19./20. Jahrh.  | 0.55         | 0.55          | 0.90          | 164
    Wertewandel – Jugend               | 0.40         | 0.50          | 0.30          | 50
    Terrorismus – Demokratie           | 0.20         | 0.35          | 0.60          | 129
    Average                            | 0.38         | 0.54          | 0.67          | 115

    Retrieval test and qualitative evaluations. Results (hypotheses) for rankings by author centrality: higher precision than traditional rankings [?]; high informational added value (other documents) [?].
  • 32. 32 Evaluation of Author Centrality on CLEF Data • moderate positive relationship between rate of networking and precision • precision of TF-IDF rankings (0.60) significantly higher than author centrality based rankings (0.31) – BUT: • very little overlap of documents on top of the ranking lists: 90% of relevant hits provided by author centrality did not appear on top of TF-IDF rankings → added precision of 28% • author centrality seems to favor OTHER relevant documents than traditional rankings • value-adding effect: other view to the information space
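The "added precision" argument is essentially an overlap measurement: which relevant documents in the top of the author-centrality ranking never reach the top of the TF-IDF ranking. A small sketch with an assumed cut-off and toy data:

```python
# Sketch of the overlap analysis behind the "added precision" figure: the share of relevant
# documents in the top-k of one ranking that are missing from the top-k of the other
# (cut-off and toy data are assumptions).
def added_relevant(ranking_a, ranking_b, relevant, k=10):
    top_a = set(ranking_a[:k]) & relevant
    top_b = set(ranking_b[:k])
    new_hits = top_a - top_b
    return len(new_hits) / len(top_a) if top_a else 0.0

centrality_ranking = ["d3", "d7", "d1", "d9", "d2"]
tfidf_ranking      = ["d1", "d2", "d4", "d5", "d6"]
relevant_docs      = {"d3", "d7", "d1"}
print(added_relevant(centrality_ranking, tfidf_ranking, relevant_docs, k=5))   # -> 0.666...
```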
  • 33. 33 Central actors in the neighbourhood of an author (expert search). Ego = Ulrich Teichler. Keyword = Hochschule or Studium. Ranking of the co-actors by centrality.
  • 34. 34 Central authors in a document collection (example: right-wing extremism). Keyword = Rechtsextremismus... or Rechtsextremismus or Antisemitismus or Rassismus or Ausländerfeindlichkeit or Ethnozentrismus or Faschismus or Neofaschismus, from 1996 onwards. 2833 SOLIS/FORIS records, 1851 networked actors (giant component: 65).
  • 35. 35 Conclusion • Methods are non-textual • Scientometric/bibliometric approach • Network analysis • The methods can successfully be applied (holds true in different domains, databases and document types) • Added value can be demonstrated (significant precision improvement) • Users are satisfied, both intuitively and empirically • High plausibility of the methods
  • 36. 36 Demo
  • 37. 37 Demo
  • 38. 38 Demo Link to the prototype: http://multiweb.gesis.org/GrailsSTR/testSTR/index
  • 39. 39 Dr. Philipp Mayr F14, Room 39b (06151) 16-9394 mailto:philipp.mayr@h-da.de or mailto:philipp.mayr@gesis.org http://www.gesis.org/index.php?id=2479
