Rocky2010 roeder full_textbiomedicalliteratureprocesing
 

    Presentation Transcript

    • Full Text Biomedical Literature Processing: More Than a Scaling Challenge
      Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver), Gully Burns (ISI), Lawrence Hunter (UC Denver)
    • Obtaining Documents
      Identify documents by querying PubMed
      Challenging due to variations in names
      Not all documents are freely available
      One project identified 3034 documents
      1253 (41%) licensed, available without charge
      418 (14%) available in PubMed Central
      Availability affects experiment reproducibility
      Downloading can be problematic
      Manual download is slow. PMC Open Access is limited
      Arrange bulk download from publishers based on existing licenses
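
      One way to cope with name variations when querying PubMed is to OR the known variants together in a single search term. The Python sketch below only builds a request URL for the NCBI E-utilities ESearch endpoint and does not contact the network. The helper names (or_query, esearch_url), the example variants, and the choice of the [tiab] field tag are illustrative assumptions, not the authors' actual queries.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def or_query(variants, field=None):
    """OR together quoted name variants, optionally limited to one search field."""
    suffix = f"[{field}]" if field else ""
    return " OR ".join(f'"{v}"{suffix}' for v in variants)


def esearch_url(term, retmax=100):
    """Build (but do not send) an ESearch request for PubMed IDs matching `term`."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical variants of one name; a real project would supply its own list.
term = or_query(["BRCA1", "breast cancer 1"], field="tiab")
url = esearch_url(term, retmax=500)
```

      Fetching the URL returns matching PubMed IDs. Whether the full text of each hit can then be downloaded is a separate licensing question, as the figures above show.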
    • File Formats
      Documents are available in many formats:
      HTML, XML, PDF, plain text
      Convert to plain text for NLP tool input
      Stripping XML or HTML markup is relatively easy
      ISI is working on PDF Extract to find the correct text flow in PDF files
      Keep document zoning, other markup
      headings, sections, captions, italics
      Identify source character encoding properly
      XML declares its encoding in the file; other formats often do not
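
      A minimal Python sketch of stripping HTML markup while keeping document zoning, using only the standard library's html.parser. The tag-to-zone mapping and the names ZonedTextExtractor and extract_zones are assumptions made for illustration; they are not part of PDF Extract or of the authors' pipeline.

```python
from html.parser import HTMLParser


class ZonedTextExtractor(HTMLParser):
    """Collect visible text, tagging each chunk with the zone it came from."""

    # Assumed mapping from HTML tags to the zones named on the slide.
    ZONES = {
        "h1": "heading", "h2": "heading", "h3": "heading",
        "figcaption": "caption", "caption": "caption",
        "i": "italics", "em": "italics",
    }
    SKIP = {"script", "style"}  # content that is never document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []       # currently open zone tags
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []      # list of (zone, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.ZONES:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            zone = self.ZONES[self.stack[-1]] if self.stack else "body"
            self.chunks.append((zone, text))


def extract_zones(html):
    """Return (zone, text) pairs for an HTML string."""
    parser = ZonedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Results</h1><p>We found <i>BRCA1</i> binding.</p><script>x=1</script>"
zones = extract_zones(sample)
```

      Joining the text of all chunks gives plain input for an NLP tool, while the zone labels stay available for later steps that treat headings, captions, or italics differently.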
    • Character Representation
      Encoding is a mapping from bytes to characters
      Difficult to discern which encoding a file uses
      ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
      Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters
      Java regular expression classes (\w, \s) don’t match non-ASCII characters by default
      Some characters look like others:
      dash, en dash, minus
      space, em space, non-breaking space
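
      The three problems on this slide can be reproduced in a few lines. The sketch below uses Python rather than Java; the re.ASCII flag is used here only to imitate ASCII-only character classes. The LOOKALIKES table and the helper names are illustrative assumptions, and folding characters this way is a choice that may not suit every pipeline (a minus sign and a hyphen can carry different meanings).

```python
import re
import unicodedata

# 1. Decoding with the wrong encoding: one case raises no error at all,
#    the other can only mark the damage with U+FFFD (often displayed as '?').
utf8_bytes = "caf\u00e9".encode("utf-8")
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
misread = utf8_bytes.decode("iso-8859-1")                  # silently wrong
replaced = latin1_bytes.decode("utf-8", errors="replace")  # bad byte replaced

# 2. ASCII-only character classes break words at accented letters
#    and do not treat a no-break space as whitespace.
text = "na\u00efve caf\u00e9"
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
unicode_words = re.findall(r"\w+", text)
ascii_split = re.split(r"\s+", "5\u00a0mg", flags=re.ASCII)
unicode_split = re.split(r"\s+", "5\u00a0mg")

# 3. Look-alike characters: name them, then optionally fold them to ASCII.
LOOKALIKES = {
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
    "\u2003": " ",  # EM SPACE
    "\u00a0": " ",  # NO-BREAK SPACE
}
TABLE = str.maketrans(LOOKALIKES)


def describe(chars):
    """Return the Unicode name of every character, to tell look-alikes apart."""
    return [unicodedata.name(ch, "UNKNOWN") for ch in chars]


def normalize_lookalikes(s):
    """Fold the look-alike characters in LOOKALIKES to plain ASCII."""
    return s.translate(TABLE)
```

      Calling describe on a suspicious span is a cheap first diagnostic before deciding whether normalize_lookalikes is appropriate for the text at hand.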
    • Scaling
      Use a cluster when you need more than a desktop
      Prefer an easy migration from desktop to cluster
      Concurrency (threading) issues are minimized since most NLP processes are independent
      Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster
      NFS shares disks between nodes
      SGE starts and manages processes on cluster
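
      Because documents are processed independently, one simple way to move from desktop to cluster is to give each task a disjoint slice of the file list. The Python sketch below assumes an SGE array job, where SGE sets the SGE_TASK_ID and SGE_TASK_LAST environment variables, and falls back to a single task on a desktop. The function names and the round-robin split are illustrative assumptions, not the authors' setup.

```python
import os


def current_task(environ=os.environ):
    """Return (task_id, n_tasks), 1-based; (1, 1) when not in an SGE array job."""
    raw_id = environ.get("SGE_TASK_ID", "")
    if not raw_id.isdigit():
        # Unset on a desktop; SGE reports "undefined" for non-array jobs.
        return 1, 1
    raw_last = environ.get("SGE_TASK_LAST", raw_id)
    task_id = int(raw_id)
    n_tasks = int(raw_last) if raw_last.isdigit() else task_id
    return task_id, n_tasks


def shard(paths, task_id, n_tasks):
    """Return the documents owned by `task_id`, split round-robin over sorted paths."""
    if not 1 <= task_id <= n_tasks:
        raise ValueError("task_id must be between 1 and n_tasks")
    return sorted(paths)[task_id - 1::n_tasks]


def my_documents(paths, environ=os.environ):
    """Documents this process should handle, on a desktop or on the cluster."""
    task_id, n_tasks = current_task(environ)
    return shard(paths, task_id, n_tasks)
```

      Sorting before slicing means every task sees the same order, so the slices do not overlap as long as all nodes read the same file list, for example from the shared NFS disk.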
    • Acknowledgements
      UC Denver
      Helen Johnson
      Tom Christiansen
      Karin Verspoor, NIH grant R01 LM010120-01
      Larry Hunter, NIH grants 2R01LM009254-04, 2R01LM008111-04A1, 5R01GM083649-02
      ISI
      Gully Burns, NSF grant #0849977