Sigir Presentation Craig Scott

    1. The Integration of Web-based Content within Scopus
       Prepared by: Craig Scott (Scirus)
       Date/Place: July 27th 2007, SIGIR 2007, Amsterdam
    2. Agenda
       • What are Scirus and Scopus? Why integrate?
       • Federated Content Integration
       • Web and Patent Citation Integration
       • Summary
    3. What are Scirus and Scopus?
       • Scirus, scirus.com
          • Free, science-specific web search engine
          • Launched 2001
          • >430 M document index
          • Web pages, journal content, institutional repositories, subject repositories, patents, books, opencourseware, dissertations etc.
       • Scopus, scopus.com
          • The largest abstract and citation database of peer-reviewed literature
          • With smart tools to track, analyse and visualise research
          • Launched 2004
          • >30 M document index, >15,000 journal titles, >4,000 publishers
    4. Why integrate the two?
       • Customer need
          • Increasing interest in completing the coverage overview with ‘grey literature’
          • Expose the influence of primary literature on patents and ‘grey literature’
       • Change in landscape
          • Rapid growth of scientific information available on the web
          • Growth in institutional and subject repositories
       • Competitive advantage
          • Enhance traditional A&I citation information with web content
          • Strong differentiator for Scopus, single starting point
    5. Agenda
       • What are Scirus and Scopus? Why integrate?
       • Federated Content Integration
       • Web and Patent Citation Integration
       • Summary
    6. What was integrated?
       • Scirus indexes ~430 M documents:
          • Scientific web (scientists’ homepages, university sites etc.): ~380 M
          • Patents (US, European, Japanese, WIPO, UK): ~21 M
          • Selected sources (institutional and subject repositories, OCW etc.): ~2 M
          • Excluded: publisher journal sources: ~25 M
    7. How was it integrated?
       • Search technology for both products provided by FAST Search & Transfer
       • However:
          • Separate indexes
          • Different software release versions
          • Different update and release cycles
          • Different architectural/hardware priorities
          • Different index structures
          • Different query syntaxes
    8. Federated Search
       • Web Service based (SOAP)
       • Provides tabbed search, faceted search, search refinement
       • Simple broadcast of search terms entered
       • Query translation (different index structures, query syntax)
       • Result retrieval, processing, rendering
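The broadcast, translate and retrieve steps on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Backend` class, the two translator functions, their query syntaxes and the stub search function are all hypothetical, and neither the real SOAP interface nor the actual FAST query syntax is reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Backend:
    """One searchable index (hypothetical stand-in for a Scopus or Scirus index)."""
    name: str                               # label of the result tab
    translate: Callable[[List[str]], str]   # search terms -> backend-specific query
    search: Callable[[str], List[str]]      # backend query -> result titles


def to_fielded_syntax(terms: List[str]) -> str:
    # Assumed syntax for an index that expects fielded, prefix-style queries.
    return "AND(" + ",".join(f"text:{t}" for t in terms) + ")"


def to_plain_syntax(terms: List[str]) -> str:
    # Assumed syntax for an index that expects infix boolean queries.
    return " AND ".join(terms)


def fake_search(label: str) -> Callable[[str], List[str]]:
    # Stub search call; a real system would call a web service here.
    return lambda query: [f"{label} hit for {query}"]


def federated_search(terms: List[str], backends: List[Backend]) -> Dict[str, List[str]]:
    """Broadcast the same terms to every backend and group results per tab."""
    results: Dict[str, List[str]] = {}
    for backend in backends:
        query = backend.translate(terms)        # query translation step
        results[backend.name] = backend.search(query)  # retrieval step
    return results
```

Each backend keeps its own translator, so differences in index structure or query syntax stay isolated from the code that broadcasts the terms.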
    9. Federated search across Scopus and Scirus
    10. Federated search…Web results
    11. Federated search…Web results facets
    12. Agenda
       • What are Scirus and Scopus? Why integrate?
       • Federated Content Integration
       • Web and Patent Citation Integration
       • Summary
    13. WebCitations and PatentCitations
       • Interest in exposing the influence of primary literature on
          • Patents: practical application in Medicine, Engineering, Chemistry…
          • Theses & Dissertations
          • Other grey literature
       • Scirus/Scopus connected via Federated search
          • Focused on Keyword Search
       • Not suitable for citation index analysis [Smith et al., 2007]
          • Data formats, quality and normalization
          • Need to extract, parse and tag references from unstructured docs
          • Need to match these extracted refs with the bibliometric frontmatter of an article housed in a separate database
          • Need to overcome faulty or missing citation information
    14. Scopus data
       • Single schema, highly structured
       • Normalised data
       • Extremely rich granularity
       • Highly QC'd, manually corrected if required
    15. Patent data
       • Single schema, structured
       • Item-level granularity only
    16. Web data: PDF, PS, PPT, MS Word, HTML
       • Little or no structure
       • No normalization
       • Thousands of different creators
    17. Solution
       • Parity Computing’s BibExtractor
          • Parity Tagger engine for extracting refs and tagging fields
          • Parity Linker engine to provide high-accuracy ref linking
       • Extracts and tags references with a rich structure
       • Handles unstructured and binary input
       • Automatically corrects and normalizes
    18. BibExtractor
       • During Scirus document processing
          • Set of keys generated for each extracted reference candidate
       • On the fly, during Scopus document rendering
          • JavaBean generates set of keys from article bibliographic information
       • FAST federated search matches keys
          • Multiple keys
          • Any single typical error or omission in a reference (e.g. missing volume number or misspelled author) is overcome by at least one of the keys, so that there will still be a match
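One way to see why multiple keys tolerate a single error is a leave-one-field-out scheme, sketched below. This is an assumption for illustration only: the slides do not describe how BibExtractor actually derives its keys, and the field names, the `generate_keys` function and the SHA-1 hashing are all hypothetical choices.

```python
import hashlib
from itertools import combinations
from typing import Dict, Set

# Hypothetical bibliographic fields used to build keys.
FIELDS = ("author", "journal", "year", "volume", "first_page")


def normalise(value: str) -> str:
    """Lower-case and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def generate_keys(ref: Dict[str, object], min_fields: int = 3) -> Set[str]:
    """Return one key for all present fields plus one key per omitted field."""
    present = [f for f in FIELDS if ref.get(f)]
    if len(present) < min_fields:
        return set()                      # too little information to key on
    subsets = [tuple(present)]
    if len(present) - 1 >= min_fields:
        # Each subset drops exactly one field, so one bad field cannot
        # spoil every key.
        subsets += combinations(present, len(present) - 1)
    keys = set()
    for subset in subsets:
        raw = "|".join(f"{f}={normalise(str(ref[f]))}" for f in subset)
        keys.add(hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16])
    return keys


def is_match(ref_a: Dict[str, object], ref_b: Dict[str, object]) -> bool:
    """Two records match if they share at least one key."""
    return not generate_keys(ref_a).isdisjoint(generate_keys(ref_b))
```

With the reference shown on the next slide (Pommier, Adv. Virus Res., 1999, volume 52, first page 427), a web citation that omits the volume or misspells the author still shares the key built from the remaining fields with the article record, while an unrelated reference shares none.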
    19. Key Matching
       • &query= (OR(keycode:3432892933214533363,keycode:4705044283615064583,keycode:3254172693062972934,keycode:3802902805014063493,keycode:4803493593804163354,keycode:3844025092624903053,keycode:3683284293294562633,keycode:3124342944233734474,keycode:2914504304914114294,keycode:2843494294624963914,keycode:4002734582812583043,keycode:5083403014734374203,keycode:3894263804974544702,keycode:3544645074593474482,keycode:4293683454244753725,keycode:4904562793032942843,keycode:4212823283734704303,keycode:2813783923563442814,keycode:3204473704944794703,keycode:3384152714633484214,keycode:2902673864824414783,keycode:4274343294633914785,keycode:4704814683324283612,keycode:3223854174132874655,keycode:4285085053504982923%2Ckeycode:3553263163704344434,keycode:5033374333732803765,keycode:4173674584684312755,keycode:4383714472774305114,keycode:4742744334904204583,keycode:3492913574793993844,keycode:4353152764113584973,keycode:4122843453174022602,keycode:3504822894843224753,keycode:4214733324914272624,keycode:4303973173534553463,keycode:4044672933053514754,keycode:3454444303644212694,keycode:4704304722572723814,keycode:4953842903473604143,keycode:3704574243492762703,keycode:4193033542962724444,keycode:4972564254563303894,keycode:2953703822834675072,keycode:4233134052862573394,keycode:4273833164724123782,keycode:4943183042643454164,keycode:4614452924392753914,keycode:2792854154793923954,keycode:3963715043203173364,keycode:2973934723284944383,keycode:4753834082753262753,keycode:3954103772804793893,keycode:4372923254794094823,keycode:3943314903474672934,keycode:4464984962913084712,keycode:4074574032973973943,keycode:2573592674485023074,keycode:2842842842863555103))
       • H4IDKDP3IDLDP3PDPDJ4HDVCC6LEPEN5TEVEA6
       • H4IDKDP3IDLDP3PDPDJ4HDVCJ5KFHEN5HELEO5
       Generated BibKeys for Ref
       • H4IDKDP3IDLDP3PDPDJ4HDVCC6LEPEN5TEVEA6
       • H4IDKDP3IDLDP3PDPDJ4HDVCJ5KFHEN5HELEO5
       • NDIDE4VCPDJ4PDHDP3PFJEO5LEPED5PEMEF5KEVEO5LFTEN5PEUEB5TELFI5
       Pommier, Y.; Neamati, N. Adv. Virus Res. 1999, 52, 427-458.
       Matching Scopus item’s BibKeys
       Ref Candidate
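The query on this slide is a single OR over many `keycode:` terms, so any one shared key is enough for a hit. The sketch below builds a string of the same shape and emulates the OR semantics against a small in-memory index. The function names and the dictionary-based index are hypothetical; the real lookup is done by the FAST search engine, and how BibKeys are converted to numeric keycodes is not shown in the slides.

```python
from typing import Dict, Iterable, List, Set


def build_keycode_query(keycodes: Iterable[str]) -> str:
    """Format keycodes as an OR query shaped like the one on the slide."""
    terms = ",".join(f"keycode:{k}" for k in keycodes)
    return f"(OR({terms}))"


def match_any(keycodes: Iterable[str], index: Dict[str, List[str]]) -> Set[str]:
    """Emulate the OR: return every document indexed under at least one key."""
    hits: Set[str] = set()
    for key in keycodes:
        hits.update(index.get(key, []))
    return hits
```

For example, `build_keycode_query(["11", "22"])` gives `(OR(keycode:11,keycode:22))`, and a document indexed under only one of the two keys is still returned by `match_any`.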
    20. Scirus document processing
       • “Typical” Scirus/FAST search engine document flow
    21. Key Generation, Scirus Document Processing
    22. Key matching, realtime
    23. WebCites and PatentCites
    24. WebCites
    25. WebCites
    26. WebCites
    27. Precision and recall evaluation
       • Internal Scopus evaluation
       • P ≈ 99%, R ≈ 95%
       • Table (columns: Source, Type, Precision, Recall; cell order as extracted): 93.15 96.11 80.88 94.7 Recall 95.15 HTML Special Other binaries Scanned binaries XML Type 96.99 Special 97.15 Special 99.1 Patent Precision Source
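For readers unfamiliar with the two measures, precision is the share of generated citation links that are correct, and recall is the share of true citation links that were found. The sketch below computes both from two sets of links. The function name and the (citing document, cited article) pair representation are assumptions, and the slides do not say how the internal evaluation was actually run.

```python
from typing import Hashable, Set, Tuple


def precision_recall(predicted: Set[Hashable], relevant: Set[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of predicted links.

    predicted: links produced by the matcher
    relevant:  links known to be correct (the gold standard)
    """
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

As a toy case with made-up links: if the matcher proposes three links of which two are correct, and four correct links exist in total, precision is 2/3 and recall is 2/4.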
    28. Agenda
       • What are Scirus and Scopus? Why integrate?
       • Federated Content Integration
       • Web and Patent Citation Integration
       • Summary
    29. Summary
       • Challenges met
          • Multi-party development
          • Multi-system connection
          • Large scale
          • Rapid development
       • Benefits gained
          • Overview beyond primary literature
          • Added a new dimension to citation-based analysis and literature research and review
          • With high P/R
    30. Questions, discussion…
       • Thank you! Any questions?
       • Contact:
          • Craig Scott, Senior Product Manager, Scirus
          • [email_address]
       • www.scirus.com
       • www.scopus.com
       • www.fastsearch.com
       • www.paritycomputing.com
