CiteXtract: extracting references from the life science literature


Presentation given in September 2006 at the Sequence database group at the European Bioinformatics Institute in Cambridge, UK.


  1. 1. CiteXtract: extracting references from the life science literature Nikolay Nikolov Stoehr/Zhu Group European Bioinformatics Institute September, 2006
  2. 2. Contents: <ul><li>About citeXtract </li></ul><ul><li>What’s done and what’s next </li></ul><ul><li>Acknowledgements </li></ul>
  3. 3. About citeXtract <ul><li>CiteXtract will add citation information to citeXplore </li></ul><ul><li>CiteXplore contains the PubMed, EPO, and CBA data </li></ul>
  4. 4. About citeXtract: Motivation <ul><li>Why add citations? </li></ul><ul><li>Better navigation </li></ul><ul><li>Forward (“cites”) and backward (“cited by”) links (Google Scholar offers only backward navigation; Entrez offers none at all) </li></ul><ul><li>Better ranking </li></ul><ul><li>The word-based similarity ranking used by Lucene sometimes places papers from obscure journals higher in the result set than papers from Nature or Science (Google Scholar takes paper popularity into account; Entrez doesn’t) </li></ul><ul><li>Citation context </li></ul><ul><li>A feature entirely new for the life sciences literature (offered by neither Google Scholar nor Entrez). </li></ul>
  5. 5. citeXtract: Citation context <ul><li>Context for Apoptosis and delayed degeneration after spinal cord injury in rats and monkeys MJ Crowe, JC Bresnahan, SL Shuman, JN Masters, 1997 </li></ul><ul><li>[1] Apoptosis in neurodegenerative disorders - group of 9 » MP Mattson… - Nat Rev Mol Cell Biol, 2000: </li></ul><ul><li>“Studies of SCI in rats and monkeys show apoptosis of oligodendrocytes involving a progressive inflammation-like process [Crowe et al., 1997]. Thus, apoptosis of both neurons and oligodendrocytes may contribute greatly to the paralysis of patients with SCI.” </li></ul><ul><li>[2] Long-Distance Axonal Regeneration in the Transected Adult Rat Spinal Cord A Ramon-Cueto, GW Plant, J Avila, MB Bunge - Journal of Neuroscience, 1998 </li></ul><ul><li>“Hoechst-labeled nuclei were round or elongated with smooth borders. They displayed uniform staining and showed an absence of chromatin condensation or fragmentation. The labeling pattern and the lack of small aggregates of DNA indicated that Hoechst-labeled cells were alive and corresponded to EG (Crowe et al., 1997). Clearly, some macrophages had taken up the Hoechst dye, but their appearance was very different (Fig. 3B).” </li></ul><ul><li>[3] Inhibition of Akt Kinase by Cell-permeable Ceramide and Its Implications for Ceramide-induced … - group of 3 » H Zhou, SA Summers, MJ Birnbaum, RN Pittman - Journal of Biological Chemistry, 1998 </li></ul><ul><li>“In the nervous system, apoptosis is required for normal development and has been implicated in the pathogenesis of various neurodegenerative conditions such as Alzheimer's disease, Huntington's disease, stroke, head trauma, and neuronal death following spinal cord injury (Crowe et al., 1997). Although cellular mechanisms underlying apoptosis remain unclear, a number of recent studies suggest that activating sphingomyelinase and subsequent generation of ceramide play an important role in regulating apoptosis in many systems” </li></ul>
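The citation contexts shown on this slide could be pulled out of a citing paper's text roughly as follows. This is only a minimal sketch, not the citeXtract implementation: the function name `citation_contexts`, the `window` parameter, the naive sentence splitter, and the restriction to "Author et al., Year" markers are all assumptions made for illustration.

```python
import re

def citation_contexts(text, first_author, year, window=1):
    """Return the sentences around each author-year citation of one paper.

    `window` is the number of neighbouring sentences kept on each side
    (a hypothetical parameter, not something the slides specify).
    """
    # Naive sentence splitter: break after ., ! or ? when the next token
    # starts with a capital letter, quote or bracket. "et al., 1997" and
    # "Fig. 3" survive because no whitespace-plus-capital follows the dot.
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z"“(\[])', text.strip())
    # Matches "(Crowe et al., 1997)" and "[Crowe et al., 1997]", tolerating
    # the stray blanks that PDF conversion tends to insert.
    marker = re.compile(
        r'[\[(]\s*' + re.escape(first_author) + r'\s+et al\.,?\s*'
        + str(year) + r'\s*[\])]'
    )
    contexts = []
    for i, sentence in enumerate(sentences):
        if marker.search(sentence):
            lo, hi = max(0, i - window), i + window + 1
            contexts.append(" ".join(sentences[lo:hi]))
    return contexts
```

Numbered citation styles such as "[12]" would need a different marker pattern plus a lookup into the paper's reference list, which this sketch does not attempt.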
  6. 6. citeXtract: Challenges <ul><li>PDF conversion </li></ul><ul><li>HTML irregularities </li></ul><ul><li>Text flow problem </li></ul><ul><li>Citation format variation </li></ul>
  7. 7. Getting HTML from PDF <ul><li>PDF is currently the most popular file format, but it is a binary format and has to be converted into a text file (preferably one that keeps the formatting information). </li></ul><ul><li>Dozens of PDF converters exist, but most are unusable (the text or the formatting gets corrupted). A shareware tool (PDF2HTML) was found to deliver very good results at an affordable cost. </li></ul><ul><li>However, some PDFs are encoded as images (mostly older papers digitized by scanning). PDF2HTML cannot process them (an OCR package is needed). </li></ul><ul><li>Even for text-based PDFs, the HTML output still contains various irregularities, two of which are common: </li></ul><ul><li>– blanks inserted at random positions </li></ul><ul><li>– overlapping line segments </li></ul>
  8. 8. HTML irregularities: Blanks
  9. 9. HTML irregularities: Overlaps
  10. 10. Reason for the blanks/overlaps <ul><li>In PDF, text is encoded as pixel regions. The PDF converter reproduces the map of pixel regions by encoding the following in the HTML: </li></ul><ul><li>- the top-left pixel of the region </li></ul><ul><li>- the character sequence contained in the region </li></ul><ul><li>- font information for the characters (font name, style, size) </li></ul><ul><li>The browser engine concatenates the line segments into lines using the font information together with the Cartesian coordinates and the character sequence. When a font is rendered differently from the original, segments come out either too short (blanks) or too long (overlaps). </li></ul><ul><li>In addition, the Cartesian coordinates do not always seem to match the original exactly, which can cause slight vertical shifts as well (but since the vertical offsets are usually very small, they cause little or no harm). </li></ul>
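To make the mechanism concrete, the sketch below takes segments described by the three pieces of information listed above and flags where an estimated segment width leaves a blank or produces an overlap. Everything here is illustrative: the `Segment` record, the `width_factor` (average character width as a fraction of the font size), and the `tolerance` are assumed names and values, not the converter's actual output format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    x: float          # left edge of the region, in pixels
    y: float          # top edge of the region, in pixels
    text: str         # character sequence contained in the region
    font_size: float  # font size in pixels

def classify_joins(segments, width_factor=0.5, tolerance=1.0):
    """Label each junction between neighbouring segments of one line.

    Returns a list of (left_text, right_text, label) where label is
    'gap', 'overlap' or 'ok'. width_factor and tolerance are assumed
    values chosen only for this illustration.
    """
    ordered = sorted(segments, key=lambda s: s.x)
    labels = []
    for left, right in zip(ordered, ordered[1:]):
        # Estimated right edge of the left segment under the assumed font metrics.
        est_right_edge = left.x + len(left.text) * left.font_size * width_factor
        delta = right.x - est_right_edge
        if delta > tolerance:
            label = "gap"        # rendered too short: a blank appears
        elif delta < -tolerance:
            label = "overlap"    # rendered too long: segments collide
        else:
            label = "ok"
        labels.append((left.text, right.text, label))
    return labels
```

A renderer whose real average character width differs from `width_factor` would shift every junction the same way, which is how one font substitution can produce blanks or overlaps throughout a page.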
  11. 11. Solution for the blanks/overlaps <ul><li>To process the HTML lines correctly, every font variety would have to be supported (which would mean implementing the rendering engine of a browser). This is too much work and, as the browsers show, it doesn’t guarantee success. </li></ul><ul><li>Instead, an empirically determined constant is used for the character width. Problem: this adds noise and makes identifying columns difficult. </li></ul><ul><li>The noise is eliminated by using histograms of the X coordinates. The histograms show a roughly Gaussian (platykurtic) distribution, with the highest point of the graph nearly coinciding with the middle of the column. The “valleys” correspond to the white space between the columns. An empirically determined cut-off value is used to separate the columns. Finally, a covering algorithm stitches the lines of each column together. </li></ul>
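The histogram step can be sketched as follows. This is an illustration under stated assumptions rather than the citeXtract code: the function name `split_columns`, the `bin_width`, and the `cutoff` value are hypothetical, and the input is assumed to be the estimated x-coordinates of characters on one page.

```python
from collections import Counter

def split_columns(x_coords, bin_width=10, cutoff=2):
    """Find column extents from a histogram of x-coordinates.

    Bins whose count falls below `cutoff` are treated as valleys (white
    space or noise); runs of bins at or above it form one column.
    Returns a list of (x_start, x_end) pairs. bin_width and cutoff are
    assumed values that would have to be tuned empirically.
    """
    hist = Counter(int(x // bin_width) for x in x_coords)
    if not hist:
        return []
    columns, current = [], []
    for b in range(min(hist), max(hist) + 1):
        if hist.get(b, 0) >= cutoff:
            current.append(b)
        elif current:
            # A valley ends the current run of populated bins.
            columns.append((current[0] * bin_width, (current[-1] + 1) * bin_width))
            current = []
    if current:
        columns.append((current[0] * bin_width, (current[-1] + 1) * bin_width))
    return columns
```

Once the column extents are known, each line segment can be assigned to the column that contains its x-coordinate before the lines are stitched together; that assignment step is not shown here.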
  12. 12. Journal Layout Variation <ul><li>Layout varies not just between journals, but also within the same issue of a journal (and even on the same page). </li></ul><ul><li>This makes determining the reading order difficult, because the converter does not include any text flow tags (reading order of columns or paragraphs). </li></ul>
  13. 13. Journal Layout Variation <ul><li>Different layouts mean different text flow models. </li></ul><ul><li>Regular text flow: </li></ul>
  14. 14. citeXtract challenges: Journal Layout Variation <ul><li>Irregular text flow: </li></ul>
  15. 15. citeXtract challenges: Citation format irregularities Most popular: Sometimes the title is left out: Order of elements and how they are separated can differ, too:
  16. 16. citeXtract: current state <ul><li>Limited to the most popular journal layout (2 columns) </li></ul><ul><li>Achieved 98% success in correctly extracting the first author, title, and year of publication </li></ul><ul><li>A second phase (fuzzy matching against PubMed sources) is under way </li></ul>
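One way the fuzzy-matching phase might work is sketched below, using Python's standard `difflib.SequenceMatcher` to compare an extracted title against candidate records. The slides do not say which matching technique is used, so this is purely an assumption; the record layout (`id`, `title`, `year`), the function names, and the 0.85 threshold are hypothetical, and no real PubMed API is called.

```python
import re
from difflib import SequenceMatcher

def normalise(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())

def best_match(extracted, candidates, threshold=0.85):
    """Return the candidate record most similar to the extracted reference.

    Candidates must share the publication year; among those, the one with
    the highest title similarity wins if it reaches `threshold` (an assumed
    value). Returns None when nothing qualifies.
    """
    best, best_score = None, 0.0
    target = normalise(extracted["title"])
    for cand in candidates:
        if cand["year"] != extracted["year"]:
            continue
        score = SequenceMatcher(None, target, normalise(cand["title"])).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else None
```

Comparing normalised titles makes the match tolerant of the stray blanks and punctuation differences that PDF conversion introduces, while the year check cheaply filters out most wrong candidates.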
  17. 17. citeXtract: what comes next <ul><li>Irregular layout </li></ul><ul><li>Extracting the context </li></ul><ul><li>Getting the papers </li></ul>
  18. 18. citeXtract: Acknowledgements <ul><li>I owe thanks to many people. First, to Weimin Zhu for his support with my Marie Curie Fellowship and his understanding of my flexible schedule; to Peter Stoehr for helping me continue this project; to Mark Rijnbeek and Paula de Matos for their continuous support; and to the whole ChEBI group for the friendly atmosphere and all the morning coffees :) </li></ul>
  19. 19. References <ul><li>The images on pp. 12-14 (examples of text flow patterns) are from T.M. Breuel, “High Performance Document Layout Analysis”, Symposium on Document Image Understanding Technology, Greenbelt, Maryland, April 9-11, 2003 </li></ul>