Sigir Presentation Craig Scott
 

Presentation Transcript

  • The Integration of Web-based Content within Scopus
    • Prepared by: Craig Scott (Scirus)
    • Date/Place: July 27th 2007, SIGIR 2007, Amsterdam
  • Agenda
    • What are Scirus and Scopus? Why integrate?
    • Federated Content Integration
    • Web and Patent Citation Integration
    • Summary
  • What are Scirus and Scopus?
    • Scirus, scirus.com
      • Free, science-specific web search engine
      • Launched 2001
      • >430 M document index
      • Web pages, journal content, institutional repositories, subject repositories, patents, books, opencourseware, dissertations etc.
    • Scopus, scopus.com
      • The largest abstract and citation database of peer-reviewed literature
      • With smart tools to track, analyse and visualise research
      • Launched 2004
      • >30 M document index, >15,000 journal titles, >4,000 publishers
  • Why integrate the two?
    • Customer need
      • Increasing interest in completing the coverage overview with ‘grey literature’
      • Expose the influence of primary literature on patents and ‘grey literature’
    • Change in landscape
      • Rapid growth of scientific information available on the web
      • Growth in institutional and subject repositories
    • Competitive advantage
      • Enhance traditional A&I citation information with web content
      • Strong differentiator for Scopus, single starting point
  • Agenda
    • What are Scirus and Scopus? Why integrate?
    • Federated Content Integration
    • Web and Patent Citation Integration
    • Summary
  • What was integrated?
    • Scirus indexes ~430 M documents:
      • Scientific web (scientists’ homepages, university sites etc.): ~380 M
      • Patents (US, European, Japanese, WIPO, UK): ~21 M
      • Selected sources (institutional and subject repositories, OCW etc.): ~2 M
      • Excluded: publisher journal sources: ~25 M
  • How was it integrated?
    • Search technology for both products provided by FAST Search & Transfer
    • However
      • Separate indexes
      • Different software release versions
      • Different update and release cycles
      • Different architectural/hardware priorities
      • Different index structures
      • Different query syntaxes
  • Federated Search
    • Web Service based (SOAP)
    • Provides tabbed search, faceted search, search refinement
    • Simple broadcast of search terms entered
    • Query translation (different index structures, query syntax)
    • Result retrieval, processing, rendering
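The broadcast-and-translate steps listed above can be illustrated with a minimal sketch. It is not the Scopus/Scirus implementation: the real system is a SOAP web service on FAST, and the two query syntaxes below (`translate_fielded`, `translate_plus`) are invented purely for illustration, as are the stand-in backends.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Translator = Callable[[List[str]], str]
Searcher = Callable[[str], List[str]]


def translate_fielded(terms: List[str]) -> str:
    """Hypothetical syntax A: each term wrapped in a field operator."""
    return " AND ".join(f"ALL({t})" for t in terms)


def translate_plus(terms: List[str]) -> str:
    """Hypothetical syntax B: required terms marked with a leading '+'."""
    return " ".join(f"+{t}" for t in terms)


def federated_search(
    terms: List[str],
    backends: Dict[str, Tuple[Translator, Searcher]],
) -> Dict[str, List[str]]:
    """Broadcast the same terms to every backend, translating per backend.

    Returns one result list per backend name, e.g. one per result tab.
    """
    def run(name: str) -> Tuple[str, List[str]]:
        translate, search = backends[name]
        return name, search(translate(terms))

    # Query the backends in parallel; each gets its own translated query.
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run, backends))


if __name__ == "__main__":
    # Stand-in backends that simply echo the query they received.
    demo_backends = {
        "Scopus": (translate_fielded, lambda q: [f"scopus hit for {q}"]),
        "Web": (translate_plus, lambda q: [f"web hit for {q}"]),
    }
    for tab, hits in federated_search(["hiv", "integrase"], demo_backends).items():
        print(tab, hits)
```

Keeping the translator separate from the searcher means each index can change its query syntax without affecting the other, which matches the slide's point that the two products have different index structures and release cycles.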
  • Federated search across Scopus and Scirus
  • Federated search…Web results
  • Federated search…Web results facets
  • Agenda
    • What are Scirus and Scopus? Why integrate?
    • Federated Content Integration
    • Web and Patent Citation Integration
    • Summary
  • WebCitations and PatentCitations
    • Interest in exposing the influence of primary literature on
      • Patents: practical application in Medicine, Engineering, Chemistry…
      • Theses & Dissertations
      • Other grey literature
    • Scirus/Scopus connected via Federated search
      • Focused on Keyword Search
    • Keyword-based federated search is not suitable for citation index analysis [Smith et al., 2007]
      • Data formats, quality and normalization
      • Need to extract, parse and tag references from unstructured docs
      • Need to match these extracted refs with the bibliometric frontmatter of an article housed in a separate database
      • Need to overcome faulty or missing citation information
  • Scopus data
    • Single schema, highly structured
    • Normalised data
    • Extremely rich granularity
    • Highly quality-controlled, manually corrected if required
  • Patent data
    • Single schema, structured
    • Item-level granularity only
  • Web data: PDF, PS, PPT, MS Word, HTML
    • Little or no structure
    • No normalization
    • Thousands of different creators
  • Solution
    • Parity Computing’s BibExtractor
      • Parity Tagger engine for extracting refs and tagging fields
      • Parity Linker engine to provide high-accuracy ref linking
    • Extracts and tags references with a rich structure
    • Handles unstructured and binary input
    • Automatically corrects and normalizes
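To make "extracts and tags references" concrete, here is a toy sketch that tags the fields of one reference string. It is not how BibExtractor works (its internals are not described in the slides); it is a single regular expression, `ACS_STYLE`, written for the one citation style of the sample reference used later in the deck, and all names are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumed layout: "Surname, I.; Surname, I. Journal Abbrev. YEAR, VOL, PAGES."
ACS_STYLE = re.compile(
    r"^(?P<authors>.+?\.)\s+"               # author list, ends with an initial
    r"(?P<journal>[A-Z][A-Za-z. ]+?)\s+"    # abbreviated journal title
    r"(?P<year>(?:19|20)\d{2}),\s*"         # four-digit year
    r"(?P<volume>\d+),\s*"                  # volume number
    r"(?P<first_page>\d+)(?:-(?P<last_page>\d+))?\.?$"  # page or page range
)


def parse_reference(text: str) -> Optional[Dict[str, object]]:
    """Return tagged fields, or None if the text does not fit the pattern."""
    match = ACS_STYLE.match(text.strip())
    if match is None:
        return None
    fields: Dict[str, object] = match.groupdict()
    # Split the author list into individual "Surname, I." entries.
    fields["authors"] = [a.strip() for a in match["authors"].split(";")]
    return fields


if __name__ == "__main__":
    sample = "Pommier, Y.; Neamati, N. Adv. Virus Res. 1999, 52, 427-458."
    print(parse_reference(sample))
```

A production tagger has to cope with thousands of layouts and with text recovered from binary formats, which is why a single pattern like this one only illustrates the output shape, not the method.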
  • BibExtractor
    • During Scirus document processing
      • Set of keys generated for each extracted reference candidate
    • On the fly, during Scopus document rendering
      • JavaBean generates set of keys from article bibliographic information
    • FAST federated search matches keys
      • multiple keys
      • any single typical error or omission in a reference (e.g. missing volume number or misspelled author) is overcome by at least one of the keys so that there will still be a match
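The multiple-key idea can be sketched as follows. This is an assumed scheme, not Parity's actual key algorithm: it uses four hypothetical fields and builds one key per three-field subset, so that a reference with one missing or wrong field still shares at least one key with the correct record.

```python
import hashlib
from itertools import combinations
from typing import Dict, Optional, Set

# Hypothetical field set; the real keys may use different fields.
FIELDS = ("author", "year", "volume", "page")


def normalise(value: str) -> str:
    """Lower-case and drop everything except letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def make_keys(ref: Dict[str, Optional[str]]) -> Set[str]:
    """Build one key for every 3-field subset whose fields are all present."""
    present = {f: normalise(ref[f]) for f in FIELDS if ref.get(f)}
    keys: Set[str] = set()
    for subset in combinations(FIELDS, len(FIELDS) - 1):
        if all(f in present for f in subset):
            raw = "|".join(f"{f}={present[f]}" for f in subset)
            keys.add(hashlib.sha1(raw.encode("utf-8")).hexdigest()[:16])
    return keys


def is_match(a: Dict[str, Optional[str]], b: Dict[str, Optional[str]]) -> bool:
    """Two records match when they share at least one key."""
    return bool(make_keys(a) & make_keys(b))


if __name__ == "__main__":
    article = {"author": "Pommier", "year": "1999", "volume": "52", "page": "427"}
    missing_volume = {"author": "Pommier", "year": "1999", "volume": None, "page": "427"}
    misspelled = {"author": "Pomier", "year": "1999", "volume": "52", "page": "427"}
    print(is_match(article, missing_volume), is_match(article, misspelled))
```

The trade-off in any scheme of this kind is that keys built from fewer fields are more tolerant of errors but also more likely to collide with unrelated references, so the choice of fields drives both precision and recall.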
  • Key Matching
    • &query= (OR(keycode:3432892933214533363,keycode:4705044283615064583,keycode:3254172693062972934,keycode:3802902805014063493,keycode:4803493593804163354,keycode:3844025092624903053,keycode:3683284293294562633,keycode:3124342944233734474,keycode:2914504304914114294,keycode:2843494294624963914,keycode:4002734582812583043,keycode:5083403014734374203,keycode:3894263804974544702,keycode:3544645074593474482,keycode:4293683454244753725,keycode:4904562793032942843,keycode:4212823283734704303,keycode:2813783923563442814,keycode:3204473704944794703,keycode:3384152714633484214,keycode:2902673864824414783,keycode:4274343294633914785,keycode:4704814683324283612,keycode:3223854174132874655,keycode:4285085053504982923%2Ckeycode:3553263163704344434,keycode:5033374333732803765,keycode:4173674584684312755,keycode:4383714472774305114,keycode:4742744334904204583,keycode:3492913574793993844,keycode:4353152764113584973,keycode:4122843453174022602,keycode:3504822894843224753,keycode:4214733324914272624,keycode:4303973173534553463,keycode:4044672933053514754,keycode:3454444303644212694,keycode:4704304722572723814,keycode:4953842903473604143,keycode:3704574243492762703,keycode:4193033542962724444,keycode:4972564254563303894,keycode:2953703822834675072,keycode:4233134052862573394,keycode:4273833164724123782,keycode:4943183042643454164,keycode:4614452924392753914,keycode:2792854154793923954,keycode:3963715043203173364,keycode:2973934723284944383,keycode:4753834082753262753,keycode:3954103772804793893,keycode:4372923254794094823,keycode:3943314903474672934,keycode:4464984962913084712,keycode:4074574032973973943,keycode:2573592674485023074,keycode:2842842842863555103))
    • Ref Candidate: Pommier, Y.; Neamati, N. Adv. Virus Res. 1999, 52, 427-458.
    • Generated BibKeys for Ref
      • H4IDKDP3IDLDP3PDPDJ4HDVCC6LEPEN5TEVEA6
      • H4IDKDP3IDLDP3PDPDJ4HDVCJ5KFHEN5HELEO5
    • Matching Scopus item’s BibKeys
      • H4IDKDP3IDLDP3PDPDJ4HDVCC6LEPEN5TEVEA6
      • H4IDKDP3IDLDP3PDPDJ4HDVCJ5KFHEN5HELEO5
      • NDIDE4VCPDJ4PDHDP3PFJEO5LEPED5PEMEF5KEVEO5LFTEN5PEUEB5TELFI5
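The query on this slide is an OR over many `keycode:` terms. A minimal sketch of assembling such a query from a list of key codes is shown below; the function names are hypothetical, and whether FAST expects exactly this shape or encoding is an assumption based only on the string shown on the slide.

```python
from typing import Iterable
from urllib.parse import quote


def build_key_query(keycodes: Iterable[int]) -> str:
    """Join key codes into a single OR(...) expression."""
    codes = list(keycodes)
    if not codes:
        raise ValueError("at least one key code is required")
    return "OR(" + ",".join(f"keycode:{code}" for code in codes) + ")"


def as_url_param(query: str) -> str:
    """Encode the expression as a 'query=' URL parameter.

    Parentheses, colons and commas are left unescaped for readability.
    """
    return "query=" + quote(query, safe="():,")


if __name__ == "__main__":
    # The first two key codes from the slide's example query.
    codes = [3432892933214533363, 4705044283615064583]
    print(as_url_param(build_key_query(codes)))
```

Because the match is a plain disjunction, a single shared key between the article and an indexed reference is enough to return the citing document.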
  • Scirus document processing
    • “Typical” Scirus/FAST search engine document flow
  • Key Generation, Scirus Document Processing
  • Key matching, realtime
  • WebCites and PatentCites
  • WebCites
  • Precision and recall evaluation
    • Internal Scopus evaluation
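    • Overall: P ≈ 99%, R ≈ 95%
    • Per-source results (table columns: Source, Type, Precision, Recall; the row alignment is not recoverable from this transcript)
      • Source: Special, Patent
      • Type: HTML, Other binaries, Scanned binaries, XML
      • Precision: 96.99, 97.15, 99.1
      • Recall: 93.15, 96.11, 80.88, 94.7, 95.15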
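For readers unfamiliar with the two measures, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP is the number of correct links, FP the number of wrong links and FN the number of missed links. The sketch below computes both; the counts in the usage example are made up for illustration and are not the evaluation data behind the slide.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Return (precision, recall) for a set of citation links."""
    if true_pos + false_pos == 0 or true_pos + false_neg == 0:
        raise ValueError("precision or recall is undefined for these counts")
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


if __name__ == "__main__":
    # Invented counts: 95 correct links, 1 wrong link, 5 missed links.
    p, r = precision_recall(true_pos=95, false_pos=1, false_neg=5)
    print(f"P = {p:.1%}, R = {r:.1%}")
```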
  • Agenda
    • What are Scirus and Scopus? Why integrate?
    • Federated Content Integration
    • Web and Patent Citation Integration
    • Summary
  • Summary
    • Challenges met
      • Multi-party development
      • Multi-system connection
      • Large scale
      • Rapid development
    • Benefits gained
      • Overview beyond primary literature
      • Added a new dimension to citation-based analysis and literature research and review
      • With high precision and recall (P/R)
  • Questions, discussion…
    • Thank you! Any questions?
    • Contact:
      • Craig Scott, Senior Product Manager, Scirus
      • [email_address]
    • www.scirus.com
    • www.scopus.com
    • www.fastsearch.com
    • www.paritycomputing.com