Full Text Biomedical Literature Processing: More Than a Scaling Challenge
Christophe Roeder, Tom Christiansen, Helen Johnson, Karin Verspoor (UC Denver), Gully Burns (ISI), Lawrence Hunter (UC Denver)

Obtaining Documents
  • Identify documents by querying PubMed
      • Challenging due to variations in names
  • Not all documents are freely available
      • One project identified 3034 documents
      • 1253 (41%) licensed, available without charge
      • 418 (14%) available in PubMed Central
      • Availability affects experiment reproducibility
  • Downloading can be problematic
      • Manual download is slow
      • PMC Open Access is limited
      • Arrange bulk download from publishers based on existing licenses
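
The PubMed query step can be sketched as below. This is a minimal sketch, not the authors' method: it only builds a request URL for the NCBI E-utilities ESearch endpoint and sends nothing. The slide does not say which names vary, so treating them as author-name variants is an assumption, and the helper names field_query and esearch_url are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_query(variants, field="Author"):
    """OR together several spellings of one name under a PubMed field tag.

    Assumption: the varying names are author names; pass another field
    tag if the variation is in some other kind of name.
    """
    return " OR ".join(f'"{name}"[{field}]' for name in variants)


def esearch_url(term, retmax=100):
    """Build (but do not send) an ESearch request URL for PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


term = field_query(["Hunter L", "Hunter LE"])
url = esearch_url(term)
```

Listing the known spellings explicitly keeps the query reproducible, which matters when the resulting document set is later reported as part of an experiment.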

File Formats
  • Documents are available in many formats: HTML, XML, PDF, plain text
  • Convert to plain text for NLP tool input
      • Stripping XML or HTML markup is relatively easy
      • ISI is working on PDF Extract to find the correct flow
  • Keep document zoning, other markup
      • headings, sections, captions, italics
  • Identify source character encoding properly
      • XML stores the encoding in the file, others do not
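
One way to strip markup while keeping zoning is to record character offsets for the tags of interest as the plain text is assembled. The sketch below uses Python's standard html.parser for HTML input. The class name, the tag-to-zone table, and the choice to emit a newline after block elements are all illustrative assumptions, not the pipeline described on the slide.

```python
from html.parser import HTMLParser


class ZonedTextExtractor(HTMLParser):
    """Collect plain text plus (zone, start, end) character spans."""

    # Hypothetical mapping from HTML tags to zone labels.
    ZONE_TAGS = {"h1": "heading", "h2": "heading", "h3": "heading",
                 "i": "italic", "em": "italic", "figcaption": "caption"}
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "figcaption"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []      # pieces of plain text, in document order
        self.length = 0       # running length of the text so far
        self.open_zones = []  # stack of (tag, start offset)
        self.zones = []       # finished (label, start, end) spans

    def handle_starttag(self, tag, attrs):
        if tag in self.ZONE_TAGS:
            self.open_zones.append((tag, self.length))

    def handle_endtag(self, tag):
        if self.open_zones and self.open_zones[-1][0] == tag:
            _, start = self.open_zones.pop()
            self.zones.append((self.ZONE_TAGS[tag], start, self.length))
        if tag in self.BLOCK_TAGS:
            self._emit("\n")  # keep block boundaries visible in the text

    def handle_data(self, data):
        self._emit(data)

    def _emit(self, text):
        self.chunks.append(text)
        self.length += len(text)


def extract(html):
    """Return (plain_text, zones) for an HTML fragment."""
    parser = ZonedTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.zones


text, zones = extract("<h1>Results</h1><p>The <i>BRCA1</i> gene</p>")
```

Storing zones as offsets into the plain text (stand-off annotation) lets NLP tools run on unmarked text while headings, captions, and italics remain recoverable afterwards.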

Character Representation
  • An encoding is a mapping between bytes and characters
  • Difficult to discern which encoding a file uses
      • ASCII, UTF-8, MacRoman, ISO-8859-1, or other?
  • Reading a file with the wrong encoding can produce unreported errors and spurious ‘?’ characters
  • Java's predefined regular expression character classes do not match non-ASCII characters by default
  • Some characters look like others:
      • dash, en dash, minus
      • space, em space, non-breaking space
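
The encoding and look-alike problems above can be illustrated with a short sketch. It assumes a simple policy that is not from the slides: try strict ASCII, then strict UTF-8, and only then fall back to a single-byte encoding chosen by the caller. Single-byte encodings such as ISO-8859-1 accept any byte sequence, so a fallback result is a guess rather than a detection. The look-alike table is illustrative and far from exhaustive.

```python
# Look-alike characters folded to plain ASCII counterparts (illustrative only).
LOOKALIKES = {
    0x2010: "-",  # hyphen
    0x2013: "-",  # en dash
    0x2212: "-",  # minus sign
    0x00A0: " ",  # non-breaking space
    0x2003: " ",  # em space
}


def decode_with_fallback(raw, fallback="iso-8859-1"):
    """Decode bytes strictly as ASCII, then UTF-8, else use the fallback.

    Returns (text, encoding_used). A fallback result is only a guess,
    because a single-byte encoding never reports a decoding error.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            pass
    return raw.decode(fallback), fallback


def normalize_lookalikes(text):
    """Replace visually similar dashes and spaces with ASCII '-' and ' '."""
    return text.translate(LOOKALIKES)


# Decoding with the wrong encoding raises no error; it silently yields
# different characters.
garbled = "é".encode("utf-8").decode("iso-8859-1")

# Encoding to a narrower character set with replacement yields '?'.
question_mark = "é".encode("ascii", errors="replace")

cleaned = normalize_lookalikes("p\u2212value\u00a0<\u20030.05")
```

Whether to fold look-alikes at all depends on the downstream task: a minus sign and a hyphen can carry different meanings in biomedical text, so a pipeline may prefer to record the original characters alongside the normalized ones.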

Scaling
  • Use a cluster when you need more than a desktop
  • Prefer an easy migration from desktop to cluster
  • Concurrency (threading) issues are minimized since most NLP processes are independent
  • Finding success using Sun/Oracle Grid Engine (SGE) and Network File System (NFS) on a small (48 core) cluster
      • NFS shares disks between nodes
      • SGE starts and manages processes on the cluster
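
Because documents can be processed independently, one way to move from desktop to cluster is to let each process pick its own slice of the document list. The sketch below assumes an SGE array job submitted with a task range starting at 1 and a step of 1 (for example `qsub -t 1-48`), in which case SGE sets the SGE_TASK_ID and SGE_TASK_LAST environment variables. The function names are hypothetical, and this is not necessarily how the authors partitioned their work.

```python
import os


def task_slot(env=os.environ):
    """Return (task_id, n_tasks), both 1-based.

    Outside an array job the variables are missing or non-numeric,
    so the whole workload is treated as task 1 of 1.
    """
    raw_id = env.get("SGE_TASK_ID", "undefined")
    raw_last = env.get("SGE_TASK_LAST", "undefined")
    if not (raw_id.isdigit() and raw_last.isdigit()):
        return 1, 1
    return int(raw_id), int(raw_last)


def shard(paths, task_id, n_tasks):
    """Pick every n-th path for this task from a deterministic ordering.

    Sorting first means every task sees the same order, so the shards
    are disjoint and together cover all paths.
    """
    ordered = sorted(paths)
    return ordered[task_id - 1::n_tasks]


# On a desktop this selects every document; under an array job it
# selects only this task's share.
documents = ["doc_c.txt", "doc_a.txt", "doc_b.txt", "doc_d.txt", "doc_e.txt"]
task_id, n_tasks = task_slot({})  # empty mapping simulates a desktop run
mine = shard(documents, task_id, n_tasks)
```

The same script then runs unchanged in both settings, and with the document directory on NFS every node resolves the same paths.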

Acknowledgements
  • UC Denver
      • Helen Johnson
      • Tom Christiansen
      • Karin Verspoor, NIH grant R01 LM010120-01
      • Larry Hunter, NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
  • ISI
      • Gully Burns, NSF grant #0849977

Rocky2010 roeder full_textbiomedicalliteratureprocesing
