Rocky2010 roeder full_textbiomedicalliteratureprocesing

Published in: Technology
  1. Full Text Biomedical Literature Processing: More Than a Scaling Challenge
     Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver), Gully Burns (ISI), Lawrence Hunter (UC Denver)
  2. Obtaining Documents
     • Identify documents by querying PubMed
     • Challenging due to variations in names
     • Not all documents are freely available
     • One project identified 3034 documents
     • 1253 (41%) licensed, available without charge
     • 418 (14%) available in PubMed Central
     • Availability affects experiment reproducibility
     • Downloading can be problematic
     • Manual download is slow, and PMC Open Access is limited
     • Arrange bulk download from publishers based on existing licenses
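The PubMed querying step on the slide above can be sketched as follows. This is a minimal illustration, not the authors' method: it assumes the varying names are author names (the slide does not say which names vary), the spellings are made up, and only the NCBI E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, and `retmode` parameters are real. The sketch only builds the URL and does not contact the network.

```python
from urllib.parse import urlencode

# Real NCBI E-utilities search endpoint; everything built on top of it is illustrative.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(name_variants, retmax=100):
    """Return an esearch URL that ORs together several spellings of one name."""
    term = " OR ".join(f'"{name}"[Author]' for name in name_variants)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical spellings of a single author, not taken from the slides.
url = build_pubmed_query(["Smith J", "Smith JA", "Smith John A"])
print(url)

# The availability figures from the slide, as shares of the 3034 documents.
total, licensed, in_pmc = 3034, 1253, 418
print(f"{licensed / total:.0%} licensed, {in_pmc / total:.0%} in PubMed Central")
```

Combining the variants in one query keeps the result set in a single response, at the cost of possibly matching unrelated people who share a spelling.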
  3. File Formats
     • Documents are available in many formats: HTML, XML, PDF, plain text
     • Convert to plain text for NLP tool input
     • Stripping XML or HTML markup is relatively easy
     • ISI is working on PDF Extract to recover the correct text flow
     • Keep document zoning and other markup: headings, sections, captions, italics
     • Identify the source character encoding properly
     • XML stores the encoding in the file; the other formats do not
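Stripping HTML markup while keeping the zoning information mentioned on the slide above might look like the sketch below. It uses only Python's standard `html.parser`; the class name, the chosen set of zone tags, and the sample fragment are assumptions made for illustration, and real publisher HTML would need a more careful mapping.

```python
from html.parser import HTMLParser


class ZonedTextExtractor(HTMLParser):
    """Collect text and remember which tag (zone) each piece came from."""

    # Assumed set of tags worth keeping as zones: headings, paragraphs, italics, captions.
    ZONES = {"h1", "h2", "h3", "p", "i", "em", "figcaption"}

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open zone tags, innermost last
        self.chunks = []  # (zone, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.ZONES:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            zone = self.stack[-1] if self.stack else "body"
            self.chunks.append((zone, text))


def extract_zones(html):
    parser = ZonedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up fragment standing in for a full-text article.
sample = "<h1>Results</h1><p>Expression of <i>BRCA1</i> increased.</p>"
chunks = extract_zones(sample)
plain_text = " ".join(text for _, text in chunks)
print(chunks)
print(plain_text)
```

The plain text feeds the NLP tools, while the `(zone, text)` pairs preserve where headings and italics were, so later stages can still tell a section title from body text.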
  4. Character Representation
     • Encoding is a mapping from bytes to characters
     • Difficult to discern which encoding a file uses: ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
     • Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters
     • Java regular expression classes (\w, \s) don’t match non-ASCII characters by default
     • Some characters look like others:
     • dash, en dash, minus
     • space, em space, non-breaking space
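The encoding pitfalls on the slide above can be reproduced in a few lines. This sketch is in Python rather than Java, so the regular expression part is only an analogue: Python's `re.ASCII` flag is used here to mimic the ASCII-only behavior the slide describes for Java's `\w`. The sample string is made up.

```python
import re
import unicodedata

# "naïve café", written with escapes so the example does not depend on file encoding.
text = "na\u00efve caf\u00e9"
utf8_bytes = text.encode("utf-8")

# Wrong decoder, no exception: each two-byte UTF-8 sequence becomes two characters.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)

# Forcing the text into ASCII with errors="replace" yields spurious '?' characters.
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# Analogue of ASCII-only character classes: with re.ASCII, \w stops at accented letters.
ascii_tokens = re.findall(r"\w+", text, flags=re.ASCII)
unicode_tokens = re.findall(r"\w+", text)
print(ascii_tokens, unicode_tokens)

# Look-alike characters are distinct code points with distinct Unicode names.
for ch in ["-", "\u2013", "\u2212", " ", "\u2003", "\u00a0"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Printing the Unicode names is a cheap way to find out which dash or space a document actually contains before deciding whether to normalize it.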
  5. Scaling
     • Use a cluster when you need more than a desktop
     • Prefer an easy migration from desktop to cluster
     • Concurrency (threading) issues are minimized since most NLP processes are independent
     • We have had success with Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48-core) cluster
     • NFS shares disks between nodes
     • SGE starts and manages processes on the cluster
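Because the documents are processed independently, the same script can run on a desktop or as one task of a Grid Engine array job. The sketch below is one possible arrangement, not the authors' setup: it relies on the `SGE_TASK_ID` environment variable that Grid Engine sets for array tasks, while the file names, the `process` placeholder, and `N_TASKS` are hypothetical.

```python
import os


def task_index(environ=os.environ):
    """Return the 1-based SGE array task id, or 1 when run outside an array job."""
    raw = environ.get("SGE_TASK_ID", "1")
    try:
        return int(raw)
    except ValueError:  # non-array jobs may report a non-numeric value such as "undefined"
        return 1


def shard(paths, task_id, n_tasks):
    """Pick every n_tasks-th path, so the shards are disjoint and cover all paths."""
    return paths[task_id - 1::n_tasks]


def process(path):
    """Placeholder for a per-document NLP step (hypothetical)."""
    return f"processed {path}"


# Hypothetical file names standing in for documents on the shared NFS volume.
documents = [f"doc{i:04d}.txt" for i in range(10)]
N_TASKS = 4  # assumed to match the array size requested at submission

for path in shard(documents, task_index(), N_TASKS):
    print(process(path))
```

Run directly on a desktop, the script handles the first shard. Submitted as an array job (for example with `qsub -t 1-4`), each task picks its own shard from the NFS-shared directory, and no communication between tasks is needed.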
  6. Acknowledgements
     • UC Denver
     • Helen Johnson
     • Tom Christiansen
     • Karin Verspoor, NIH grant R01 LM010120-01
     • Larry Hunter, NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
     • ISI
     • Gully Burns, NSF grant #0849977