Web-Scale Discovery: From start to for sale in one year


A presentation on building a web-scale discovery solution in one year, delivered at the 2010 Access Conference in Winnipeg.

Published in: Technology

Slide notes
  • Here is the content provider
  • Here is how they transmit the data. It flows in one direction, but not always reliably
  • Then you have to acquire the data and try to make some sense of it
  • Then you need to refine the data so that it can be used
  • Then you need to provide the data to your users in a relevant fashion
  • Overall, a very messy job

    1. WEB-SCALE DISCOVERY: FROM START TO FOR-SALE IN ONE YEAR
    2. It ain’t easy
       “We will so feel their pain, but I hope technology and content provider engagement have improved to make it a bit easier for them!”
       Miriam Blake, Los Alamos National Laboratory Research Library, writing on Code4Lib on Jun 30, 2010, about Summon and other discovery tools in light of her library's own experience building a unified index.
    3. She goes on to say…
       “Aside from the contracts, I can also attest to the major amount of work it has been. We have 95M bibliographic records, stored in > 75TB of disk, and counting. It's all running on SOLR, with a local interface and the distributed aDORe repository on backend. ~ 2 FTE keep it running in production now.”
    4. Did she say 95 Million Records?
       That’s a drop in the bucket compared with what a “Unified Discovery Index” needs to provide to be successful.
    5. Requirements
       Strong Publisher Relations
       Plenty of funding
       Fresh team with lots of experience
    6. Content Acquisitions
       Met with hundreds of content providers from all over the globe
       Nearly 7000 publishers represented in Summon index
    7. Content Acquisition Methods
    8. Merged Deduplication
    9. Data Normalization
       Cleanup is important
       Dates
       Does October 15, 2010 = Fall 2010?
       Author Names
       Publication Title
       Is it really a journal article or a book review?
       Is an obituary really a newspaper article?
    10. Complex Indexing Models
        Planning
        Maintenance
        Hardware
        Planning
        Analysis
        Planning
    11. Better example
        Let’s pretend Data is Crude Oil
    12. Content Provider
    13. Content Acquisition
    14. Content Acquisition and Cleanup
    15. (image only, no slide text)
    16. Relevancy
    17. Very Messy Job
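
The normalization question on slide 9 ("Does October 15, 2010 = Fall 2010?") and the merged deduplication named on slide 8 can be illustrated with a small sketch. This is not how Summon does it; the slides give no implementation detail. Everything below is an assumption made for illustration: the season-to-month mapping (Northern Hemisphere, winter omitted), the title-only match key, the function names, and the sample records. The idea is to turn each loose date string into a date range, treat two dates as compatible when their ranges overlap, and merge records whose titles and dates both match.

```python
import calendar
import re
from datetime import date, datetime

# Assumed season-to-month mapping; publishers may define seasons differently.
SEASON_MONTHS = {
    "spring": (3, 5),
    "summer": (6, 8),
    "fall": (9, 11),
    "autumn": (9, 11),
}


def to_date_range(text):
    """Return the (start, end) dates covered by a loosely formatted date string."""
    text = text.strip()
    # Exact day such as "October 15, 2010" (English month names assumed).
    try:
        day = datetime.strptime(text, "%B %d, %Y").date()
        return day, day
    except ValueError:
        pass
    # Season plus year such as "Fall 2010".
    match = re.fullmatch(r"([A-Za-z]+)\s+(\d{4})", text)
    if match and match.group(1).lower() in SEASON_MONTHS:
        first, last = SEASON_MONTHS[match.group(1).lower()]
        year = int(match.group(2))
        last_day = calendar.monthrange(year, last)[1]
        return date(year, first, 1), date(year, last, last_day)
    # Bare year such as "2010".
    if re.fullmatch(r"\d{4}", text):
        year = int(text)
        return date(year, 1, 1), date(year, 12, 31)
    raise ValueError(f"unrecognized date: {text!r}")


def may_match(a, b):
    """True when the two date strings cover overlapping ranges."""
    a_start, a_end = to_date_range(a)
    b_start, b_end = to_date_range(b)
    return a_start <= b_end and b_start <= a_end


def match_key(record):
    """Crude key: lowercased title with punctuation and extra spaces removed."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    return " ".join(title.split())


def merge_duplicates(records):
    """Merge records with the same key and compatible dates.

    The first record seen is kept; later duplicates only fill its empty fields.
    """
    merged = []
    for record in records:
        for kept in merged:
            same_title = match_key(kept) == match_key(record)
            if same_title and may_match(kept["date"], record["date"]):
                for field, value in record.items():
                    if not kept.get(field):
                        kept[field] = value
                break
        else:
            merged.append(dict(record))
    return merged


if __name__ == "__main__":
    # Made-up records from two hypothetical providers.
    sample = [
        {"title": "An Example Article!", "date": "Fall 2010", "author": ""},
        {"title": "an example article", "date": "October 15, 2010", "author": "A. Author"},
    ]
    print(may_match("October 15, 2010", "Fall 2010"))
    print(merge_duplicates(sample))
```

Under these assumptions the two dates are compatible rather than equal: October 15 falls inside the assumed fall range, so the two sample records merge into one and the missing author is filled in from the second. A production index would need far more careful keys (identifiers, author names, content type) than a normalized title.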
