MetaSearch vs Harvesting and Indexing

A comparison between metasearch/federated search and harvesting & indexing in libraries.


  1. MetaSearch vs Harvesting and Indexing
     Lukas Koster, Library of the University of Amsterdam, http://commonplace.net, 2009
     Image: http://www.flickr.com/photos/donpezzano/3044975399/
  2. So many databases to search
  3. MetaSearch – Federated Search
     [Diagram labels] Search; Search Engine; MetaSearch tool (translate search syntax, conversion, merging, deduplication, ranking of the first 30 per DB, results); database connectors (Z39.50, SRU, proprietary); record formats (MARC21, MARCXML, DC); databases.
     Searching and data fetching: one integrated, interdependent, on-the-fly procedure.
  4. Technical bottlenecks
     [Diagram labels] Same diagram as slide 3, with the bottlenecks marked: connection, access, authorisation.
  5. Technical bottlenecks
     - Changes in
       - Remote database server IP address
       - Remote database server hostname
       - Remote database server configuration
       - Remote database authentication
       - Firewall
       - Database system
       - Network
  6. MetaSearch limitations
     - Differences in searches, indexes
       - Author
       - Subject
       - Multiple languages
     - Speed (slowness)
     - Limited number of searchable databases
     - Not all results in first set
     - Relevance
  7. Author searches
     - Variations in author name storage formats
       - Henry James
       - James, Henry
       - James, H.
       - H.James
       - Which Henry James?
       - Or is it: Henry, J. / James Henry?
     - Variations in supported search formats
       - Only one?
       - All of the above?
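
The name variations on slide 7 can be illustrated with a minimal sketch. The function `author_key` and its "surname plus first initial" matching rule are assumptions made for illustration only, not something the slides prescribe:

```python
import re

def author_key(name: str) -> tuple[str, str]:
    """Reduce an author name to (surname, first initial) for loose matching.

    Hypothetical helper: the matching rule is an illustrative assumption.
    """
    name = name.strip()
    if "," in name:
        # Inverted form: "James, Henry" or "James, H."
        surname, _, given = name.partition(",")
    else:
        # Direct form: "Henry James" or "H.James"; the last token is taken as surname
        parts = [p for p in re.split(r"[.\s]+", name) if p]
        surname, given = parts[-1], " ".join(parts[:-1])
    given = given.strip()
    initial = given[0].lower() if given else ""
    return surname.strip().lower(), initial

variants = ["Henry James", "James, Henry", "James, H.", "H.James"]
keys = {author_key(v) for v in variants}  # all four collapse to one key
```

Such a key cannot answer "Which Henry James?", and "James Henry" without a comma would be read as surname Henry, which is exactly the ambiguity the slide points at.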
  8. Variations in author names
  9. Subject searches
     - Different qualification, keyword schemes per database
       - LoC Subject Headings
       - Dutch Basic Classification
       - Local subject schemes
     - Different use of subjects per database
       - Cooking
       - Cookery
       - Food
     - Different use of subjects within one database
     - Errors
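
One way a metasearch tool might cope with per-database subject terms is a hand-maintained crosswalk. The sketch below assumes three hypothetical databases (`db_a`, `db_b`, `db_c`) that each use one of the three terms from slide 9; the table and function names are invented for illustration:

```python
# Hypothetical crosswalk: the user's term mapped to the heading each database uses.
SUBJECT_CROSSWALK = {
    "cooking": {"db_a": "Cooking", "db_b": "Cookery", "db_c": "Food"},
}

def subject_for(db: str, term: str) -> str:
    """Return the database-specific heading, or the term unchanged if unmapped."""
    return SUBJECT_CROSSWALK.get(term.lower(), {}).get(db, term)
```

A table like this has to be maintained per database and does not help with inconsistent use within a single database or with cataloguing errors.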
  10. Multilingual searches
      - All words searches
      - Subject searches
        - English “cooking”
        - Japanese “???”
      - Title searches
        - Translations (We need FRBR!)
      - Author searches (historical names)
        - See: Erasmus
  11. All processing on the fly
      - Issues, dependent on each other:
        - Speed (slowness)
        - Limited number of searchable databases
        - Not all results in first set
        - Relevance
  12. Speed (slowness)
      - Dependent on
        - Search term transformation
        - Response time of external databases
        - Speed of internet connection
        - Conversion of results to presentation format
        - Merging of results
        - Deduplication of results
        - Relevance ranking
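
The dependence on external response times can be sketched with simulated remote databases. Everything here is assumed for illustration: the target names, the latencies, and the choice to drop targets that miss a timeout. Real connectors would issue Z39.50, SRU, or proprietary requests instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def search_remote(name: str, latency: float):
    """Stand-in for a remote search; sleeps to simulate response time."""
    time.sleep(latency)
    return name, [f"{name} record"]

def federated_search(targets: dict, timeout: float) -> dict:
    """Query all targets in parallel; keep only those answering within `timeout`."""
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        futures = [pool.submit(search_remote, n, lat) for n, lat in targets.items()]
        done, _not_done = wait(futures, timeout=timeout)
        answered = dict(f.result() for f in done)
        # Note: leaving the `with` block still waits for stragglers to finish
        # in this simple sketch; their results are just not used.
    return answered

targets = {"catalogue": 0.01, "repository": 0.02, "slow_vendor": 0.5}
results = federated_search(targets, timeout=0.2)
```

Even with parallel requests, the user waits for the slowest database that is allowed to answer, and the local conversion, merging, deduplication, and ranking steps come on top of that.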
  13. Limited number of databases
      - Searching too many databases takes too long
      - Local processing time influenced by
        - Merging (takes time)
        - Deduplication (takes time)
        - Ranking (takes time)
  14. Not all results in first set
      - Merging, deduplication, ranking of all results takes too long
      - Only the first 30 or so records of each database are processed initially
      - Get more: the next 30 per database are fetched and processed
  15. Relevance
      - Dependent on default sort order (relevance? date?) of each external database
      - Dependent on default ranking mechanism of each database
      - Local ranking is initially performed on the first batches of 30 records per database
      - After fetching additional records, ranking is done again:
        - Initial top results may go down
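
The batch behaviour described on slides 14 and 15 can be sketched as follows. The two in-memory "databases", the titles, and the local relevance scores are invented, and the batch size is 2 rather than 30 to keep the example readable:

```python
BATCH = 2  # the slides mention roughly 30 per database; kept small here

def fetch(db, offset, size=BATCH):
    """Return one batch of records in the database's own default order."""
    return db[offset:offset + size]

def merge_dedup_rank(records):
    """Merge records, deduplicate on lower-cased title, sort by local score."""
    best = {}
    for title, score in records:
        key = title.lower()
        if key not in best or score > best[key][1]:
            best[key] = (title, score)
    return sorted(best.values(), key=lambda r: r[1], reverse=True)

# Hypothetical databases; each returns records in its own order, not by our score.
db_a = [("Cooking basics", 0.4), ("Food history", 0.3), ("The art of cookery", 0.9)]
db_b = [("Food history", 0.5), ("Kitchen chemistry", 0.2), ("Cooking for one", 0.7)]

first = merge_dedup_rank(fetch(db_a, 0) + fetch(db_b, 0))
# "Get more": the next batch is fetched and everything is ranked again.
second = merge_dedup_rank(
    fetch(db_a, 0) + fetch(db_b, 0) + fetch(db_a, BATCH) + fetch(db_b, BATCH)
)
```

In this toy data "Food history" tops the first list but drops to third place once the second batches arrive, which is the "initial top results may go down" effect.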
  16. Solution?
      - Don’t rank
      - Don’t deduplicate
      - Don’t merge (in advance)
      - If you don’t merge, there is no point in deduplicating or ranking!
      - “Does not make much sense anyway”
      - “Does not always work anyway”
      - “So, you have separate lists that you can merge later on”
  17. Search with MetaSearch
  18. Translate search syntax on the fly
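
A minimal sketch of on-the-fly syntax translation, assuming one SRU target that accepts CQL with Dublin Core index names and one Z39.50 target addressed with a prefix query using Bib-1 use attributes. The function names and the three-index mapping are illustrative; which indexes a given server actually supports varies, which is part of the problem the slides describe.

```python
# Illustrative mappings from a generic index name to target-specific syntax.
CQL_INDEX = {"author": "dc.creator", "title": "dc.title", "subject": "dc.subject"}
BIB1_USE = {"author": 1003, "title": 4, "subject": 21}  # Bib-1 use attributes

def to_cql(index: str, term: str) -> str:
    """Build a CQL query string for an SRU target."""
    return f'{CQL_INDEX[index]} = "{term}"'

def to_pqf(index: str, term: str) -> str:
    """Build a prefix-notation query for a Z39.50 target."""
    return f'@attr 1={BIB1_USE[index]} "{term}"'

query = ("author", "James, Henry")
translated = {"sru": to_cql(*query), "z3950": to_pqf(*query)}
```

Proprietary targets would each need their own translation function alongside these two.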
  19. Fetching results
  20. Conversion of results on the fly
  21. Conversion of results on the fly
  22. Conversion of results on the fly
  23. Results with MetaSearch
  24. Results with MetaSearch
  25. Harvesting and indexing
      [Diagram labels] Search; Search Engine; H&I tool (harvesting, normalising, indexing, ranking, results); central index; databases.
      Searching and data fetching: two completely separate procedures.
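
The separation of harvesting from searching can be sketched with a tiny in-memory central index. The record layout, the field names (including `245a` as a stand-in for a MARC title field), and the all-words matching rule are simplifying assumptions, not how Aquabrowser or Primo work internally:

```python
from collections import defaultdict

def normalise(raw: dict) -> dict:
    """Map source-specific fields onto one common schema (illustrative)."""
    title = raw.get("title") or raw.get("245a", "")
    return {"id": raw["id"], "source": raw["source"], "title": title}

def build_index(harvested):
    """Harvest-time step: normalise every record and index its title words."""
    index, store = defaultdict(set), {}
    for raw in harvested:
        rec = normalise(raw)
        store[rec["id"]] = rec
        for word in rec["title"].lower().split():
            index[word].add(rec["id"])
    return index, store

def search(index, store, query: str):
    """Search-time step: a local lookup, with no remote database involved."""
    words = query.lower().split()
    if not words:
        return []
    ids = set.intersection(*(index.get(w, set()) for w in words))
    return [store[i] for i in sorted(ids)]

harvested = [
    {"id": "a1", "source": "catalogue", "title": "The Portrait of a Lady"},
    {"id": "b7", "source": "repository", "245a": "Portrait of a novelist"},
    {"id": "c3", "source": "catalogue", "title": "Daisy Miller"},
]
index, store = build_index(harvested)
hits = search(index, store, "portrait")
```

Because normalising and indexing happen before any user searches, the search step sees one uniform index covering all results, at the cost of the index being only as fresh as the last harvest.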
  26. Advantages of H&I
      - Speed
      - No maximum number of searchable databases
      - All results in first set
      - No differences in searches, indexes
      - Relevance
      - Fewer technical bottlenecks
        - Central index always available in case of connection problem
  27. H&I: Aquabrowser
  28. H&I: Primo
  29. MetaSearch = “Just in time”
      - Bookshop – Central Book Deposit
      - Always order on request
      - Risk of logistics problems
      Image: http://www.flickr.com/photos/stijnnieuwendijk/125159282/
  30. H&I = “Just in case”
      - Bookshop with large stock
      - Customers always find something
      - Maybe not the most recent stuff
      Image: http://www.flickr.com/photos/brewbooks/2131521680/
  31. Images
      - http://www.flickr.com/photos/donpezzano/3044975399/
      - http://www.flickr.com/photos/halighalie/663414371/
      - http://www.flickr.com/photos/notionscapital/2280408255/
      - http://www.flickr.com/photos/giveawayboy/2691195763/
      - http://www.flickr.com/photos/stijnnieuwendijk/125159282/
      - http://www.flickr.com/photos/brewbooks/2131521680/
      - http://www.flickr.com/photos/joshb/444529511/
      - http://www.flickr.com/photos/eaglelover2006/3168378578/
      - http://www.flickr.com/photos/robbie73/3387189144/
      - http://www.flickr.com/photos/saralparker/2602254206/
      - http://www.flickr.com/photos/manchesterlibrary/2034771121/
      - http://www.flickr.com/photos/bk/158637798/
      - http://www.flickr.com/photos/saamiam/3802869384/
      - http://www.flickr.com/photos/roboppy/37024023/
