Web-Scale Discovery: From start to for sale in one year

A presentation on building a web-scale discovery solution in one year, delivered at the 2010 Access Conference in Winnipeg.

Published in: Technology

  1. Web-Scale Discovery: From start to for sale in one year

  2. It ain’t easy
     “We will so feel their pain, but I hope technology and content provider engagement have improved to make it a bit easier for them!”
     Miriam Blake, Los Alamos National Laboratory Research Library, writing on Code4Lib on Jun 30, 2010 about Summon and other discovery tools, in reference to the library's own experience of building a unified index.
  3. She goes on to say…
     “Aside from the contracts, I can also attest to the major amount of work it has been. We have 95M bibliographic records, stored in > 75TB of disk, and counting. It's all running on SOLR, with a local interface and the distributed aDORe repository on backend. ~ 2 FTE keep it running in production now.”
  4. Did she say 95 Million Records?
     That’s a drop in the bucket compared with what a “Unified Discovery Index” needs to provide to be successful.
  5. Requirements
     Strong Publisher Relations
     Plenty of funding
     Fresh team with lots of experience
  6. Content Acquisitions
     Met with hundreds of content providers from all over the globe
     Nearly 7000 publishers represented in Summon index
  7. Content Acquisition Methods
  8. Merged Deduplication
  9. Data Normalization
     Cleanup is important
     Dates
     Does October 15, 2010 = Fall 2010?
     Author Names
     Publication Title
     Is it really a journal article or a book review?
     Is an obituary really a newspaper article?
  10. Complex Indexing Models
      Planning
      Maintenance
      Hardware
      Planning
      Analysis
      Planning
  11. Better example
      Let’s pretend Data is Crude Oil
  12. Content Provider
  13. Content Acquisition
  14. Content Acquisition and Cleanup
  15. (no text)
  16. Relevancy
  17. Very Messy Job
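
The date question on the Data Normalization slide ("Does October 15, 2010 = Fall 2010?") can be made concrete with a small sketch. This is not Summon's actual normalization code, which the slides do not show. It assumes one possible approach: map every date string to a (start, end) range, then ask whether two ranges overlap instead of whether they are equal. The function names `normalize_date` and `overlaps`, the `SEASONS` table, and the season boundaries (Northern Hemisphere, whole months, winter omitted because it spans two years) are all hypothetical choices made for illustration.

```python
import calendar
import re
from datetime import date, datetime

# Assumed season boundaries as (first month, last month).
# Northern Hemisphere, whole months; winter is left out because it
# crosses a year boundary and would need extra handling.
SEASONS = {
    "spring": (3, 5),
    "summer": (6, 8),
    "fall": (9, 11),
    "autumn": (9, 11),
}


def normalize_date(text):
    """Return the (start, end) dates a publication date string covers, or None."""
    text = text.strip()

    # Exact day, e.g. "October 15, 2010" (English month names assumed).
    try:
        day = datetime.strptime(text, "%B %d, %Y").date()
        return day, day
    except ValueError:
        pass

    # Season plus year, e.g. "Fall 2010".
    match = re.fullmatch(r"([A-Za-z]+)\s+(\d{4})", text)
    if match and match.group(1).lower() in SEASONS:
        first_month, last_month = SEASONS[match.group(1).lower()]
        year = int(match.group(2))
        last_day = calendar.monthrange(year, last_month)[1]
        return date(year, first_month, 1), date(year, last_month, last_day)

    # Anything else is left for manual cleanup.
    return None


def overlaps(a, b):
    """True when two (start, end) ranges share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


exact = normalize_date("October 15, 2010")
season = normalize_date("Fall 2010")
print(exact == season)          # the two values are not the same range
print(overlaps(exact, season))  # but the exact day falls inside the season
```

Under these assumptions the answer to the slide's question is "not equal, but compatible": the two strings normalize to different ranges that overlap, so a record dated "Fall 2010" could plausibly describe the same item as one dated "October 15, 2010".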