Hacking the JPEG/PDF tree format


Presentation made on the 26th of October in Edinburgh at the Scottish Phylogenetic Discussion Group

Published in: Education, Technology, Business
  • I will start with a brief look at the history of illustrating phylogenies, discuss why we might want to re-use the growing number of old trees, and then present the ways currently available to liberate published trees. Finally, I will suggest ways of keeping our trees looking green, or even turning them gold.
  • We have been illustrating the relationships between species as trees ever since Darwin published the Origin of Species. In fact, this tree metaphor is the only figure in the Origin of Species.
  • Since then, we have invented many weird and wonderful ways of representing the relationships between organisms using trees. For example, Ernst Haeckel's "tree of life" illustrates Darwin's metaphor of the pattern of universal common descent.
  • But why re-use old trees anyway? They can be used for comparative analyses, for building larger trees, and for looking at the effects of methodologies and different types of characters.
  • They can also be aggregated into useful products such as TimeTree.
  • Additionally, past phylogenies will remain central to guiding researchers towards poorly supported relationships and under-sampled lineages, and to putting recent molecular studies into the context of previous studies. Shown here: a supernetwork of 50 order-level phylogenies sourced from publications since 1969, according to Davis et al. (2010). SplitsTree (Huson & Bryant, 2006) was used to generate the supernetwork.
  • Old trees can also be used to determine how phylogenetic hypotheses have changed and whether we are reaching a consensus, and they provide the means of placing extinct species within the context of extant lineages. But in my opinion the most important reason for re-using old trees is that they cannot be readily replicated: some of these phylogenies are based on morphological characters, for example. Redoing perfectly good work is also a waste of time and resources.
  • Since 1859 we have been pretty busy building trees. The number of studies depicting phylogenies has exploded with the development of the polymerase chain reaction and the ever-decreasing cost of sequencing. Journals were created specifically for publishing molecular phylogenies, such as Molecular Phylogenetics and Evolution, established in 1992. With over 110,000 phylogenetic publications indexed in PubMed, the amount of phylogenetic information available in the pages of manuscripts should not be underestimated. In the early years of phylogenetics, the best format for exchanging and sharing phylogenetic hypotheses was to print illustrations in manuscripts. This has locked phylogenetic hypotheses into the pages of journals and books, and accessing this information is not a trivial task.
  • The idea of using a program to convert a tree image into a computer-readable representation of that tree is not new, and the approaches range from manual to semi-automated.
  • TreeThief, developed around 2000, was the first such program. It requires the manual entry of data and only works on Mac OS 9.
  • TreeRogue is similar in concept to TreeThief. It requires a program called GraphClick to capture the positions of points; the output from GraphClick is then processed by an R script to build the tree.
  • TreeSnatcher Plus is a GUI-driven Java application that automates the generation of a machine-readable representation of multifurcating phylogenetic trees contained in pixel images. The user supervises the semi-automatic recognition process and makes corrections to the image and to the topology where necessary.
  • Due to the scale of the problem, there is a need to automate the process of defrosting the phylogenies embedded in PDFs. Automating the recognition of a tree has parallels with OCR, developed since the 1930s, in which an image of text is converted into a digital format. In comparison, the optical recognition of tree images is in its infancy: it requires not only OCR of the text but also recognition of the topology of the tree, in order to generate a digital bracket file format such as Phylip or Nexus that can be used, for example, to reconstruct supertrees. So I've tried to develop a program to do just that.
  • TreeRipper is fully automated tree recognition software. It takes an input image, cleans the image, detects the contour of the phylogeny to create the bracket Nexus format, and uses an OCR program to recognize the label names. TreeRipper was developed as a C++ command-line program to process large numbers of trees. It uses the Magick++ standard template libraries.
  • To test TreeRipper, we downloaded 322 images which had phylogen* or supertree in their caption, from 249 articles published in the Open Access journal BMC Evolutionary Biology between 1997 and 2009. Only 38% met the prerequisites for TreeRipper. The processing of these 114 trees took under 3 hours, with 32% of topologies successfully recognized. The success rate depends on factors such as the resolution of the image (the higher the better, but the higher the slower) and the messiness of the image (lots of internal labels make it less successful).
  • TreeRipper is available as a web service.
  • Although it is possible to automatically recognize tree images from legacy literature, the software is so far limited to rectangular phylogenies despite the diversity of ways we illustrate phylogenetic relationships.
  • What can we do to avoid this problem in the future? Green OA self-archiving: authors publish trees in any journal and then self-archive the digital format of the tree. Gold OA publishing: authors publish the tree in an open access journal that provides immediate OA to all the data on the publisher's website.
  • The green OA system has been available in the form of databases like TreeBASE, which has become a valuable repository as it holds morphological and genetic data with the associated published phylogeny. However, because submission to TreeBASE is not a prerequisite for publication, the rapid growth of published phylogenies has not been matched by the availability of those trees in the database. So relying on self-archiving hasn't really worked.
  • It might be worth thinking of ways we can turn our trees gold. In the same way that information about a PDF is embedded in the PDF file (you see author, DOI, title, etc.) or that geolocation is embedded in a photo when it is taken (you see the geographical coordinates), we could embed the digital format of phylogenies into the images that we use in our publications. The usual way we work is to visualize the tree in TreeView or FigTree and then save that tree to make further annotations to the phylogeny (e.g. adding bootstrap supports) in an image editor. One solution would be, upon saving the file in TreeView or FigTree, to associate the tree file with the image.
  • I would be the first to admit that it is not easy being green.
  • Hacking the JPEG/PDF tree format
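
The notes above describe the target of tree recognition as a "bracket" file format. As a rough illustration of what that format encodes, here is a minimal Python sketch that reads a bracket (Newick-style) string into nested lists and collects the tip labels. It is not TreeRipper's code: the function names `parse_newick` and `leaves` are hypothetical, and the sketch assumes a string with only tip labels (quoted or bare), with no branch lengths, internal labels or comments. The example string is the mammal tree shown on slide 14.

```python
def parse_newick(text):
    """Parse a simple bracket string into nested lists of tip labels.

    Assumption: tips only (quoted or bare), no branch lengths,
    no internal node labels, no comments.
    """
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
            return children
        if text[pos] == "'":              # quoted tip label
            end = text.index("'", pos + 1)
            label = text[pos + 1:end]
            pos = end + 1
            return label
        start = pos                       # bare tip label
        while pos < len(text) and text[pos] not in ",();":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if text[pos:pos + 1] != ";":
        raise ValueError("bracket string must end with ';'")
    return tree


def leaves(tree):
    """Return the tip labels of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaves(child)]


# The bracket string from slide 14.
SLIDE_14_TREE = (
    "('Marsupialia',('Xenarthra',(('Eulipotyphla',('Scandentia','Primates')),"
    "('Afrosoricida',(('Tubulidentata','Macroscelidea'),"
    "('Hyracoidea',('Sirenia','Proboscidea')))))));"
)

if __name__ == "__main__":
    print(leaves(parse_newick(SLIDE_14_TREE)))
```

The nesting of the lists mirrors the nesting of the brackets, which is why recovering the topology from an image matters as much as reading the labels.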

    1. 1. Hacking the JPEG/PDF tree format Joseph Hughes
    2. 2. Illustrating trees Why re-use old wood? Growing tree numbers Hacking trees TreeThief TreeSnatcher TreeRogue TreeRipper Keeping our trees green gold
    3. 3. In the past Darwin, C. R. The origin of species by means of natural selection. 1859 Illustrating trees
    4. 4. Weird and wonderful trees! Haeckel, E. The evolution of Man. 1879. Illustrating trees
    5. 5. Why re-use old wood? To evaluate comparative data (evolution, ecology, biogeography, disease); to use as inputs for building larger trees (constraints, supertrees, megatrees); to study the effects of methodology (priors on tree shape). O'Meara, B. Nature Precedings 2011 Why re-use old wood?
    6. 6. Aggregate data in useful resources Kumar & Hedges. Bioinformatics 2011 Why re-use old wood?
    7. 7. Guide researchers Why re-use old wood?
    8. 8. And: to determine how phylogenetic hypotheses have changed over time; because the tree cannot be readily replicated (morphological characters, too time consuming, too expensive for the taxpayer). Why re-use old wood? We need to re-use! We need to be more green!
    9. 9. We have been busy >110,000 phylogenetic studies indexed in PubMed What have we been doing? 1983: invention of the PCR 1992: First issue of Mol. Phyl. Evol.
    10. 10. Different approaches Hacking trees
    11. 11. TreeThief: a tool for manual phylogenetic entry by Andrew Rambaut; MacOS 9; http://goo.gl/BFM2N Hacking trees
    12. 12. TreeRogue: TreeThief-like approach by Nick Matzke; GraphClick ($8) and R script; http://goo.gl/eunO2 Hacking trees
    13. 13. TreeSnatcher and +: a semi-automated approach by Thomas Laubach; GUI-driven Java app; multifurcating trees in any shape; http://goo.gl/Das63 Hacking trees Laubach, T., von Haeseler A. Bioinformatics 2007
    14. 14. Hacking the JPEG/PDF format ('Marsupialia',('Xenarthra',(('Eulipotyphla',('Scandentia','Primates')),('Afrosoricida',(('Tubulidentata','Macroscelidea'),('Hyracoidea',('Sirenia','Proboscidea'))))))); This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. Hacking trees
    15. 15. TreeRipper Hughes, J. BMC Bioinformatics 2011 Hacking trees
    16. 16. Testing Hughes, J. BMC Bioinformatics 2011 Hacking trees
    17. 17. Short URL: http://goo.gl/EZ67K Hacking trees
    18. 18. Limitations Hacking trees
    19. 19. Keeping our trees (random thoughts): Green OA Self Archiving; Gold OA Publishing. Keeping our trees
    20. 20. Keeping trees green Page, R.D.M. Nature Precedings 2007 Keeping our trees
    21. 21. Turning trees gold Keeping our trees
    22. 22. It's not easy being green!
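
The "turning trees gold" idea, embedding the digital tree in the figure itself much as author and title metadata are embedded in a PDF, can be sketched in a few lines. The example below is only one possible illustration and rests on assumptions not made in the talk: it uses PNG rather than JPEG or PDF, because a PNG `tEXt` chunk can be written with the Python standard library alone, and the keyword `Newick` and the helper names (`embed_tree`, `extract_tree`, `tiny_png`) are hypothetical. No tree viewer is claimed to do this.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_tree(png, newick, keyword="Newick"):
    """Return a copy of the PNG with the tree stored in a tEXt chunk.

    The keyword "Newick" is an assumption for this sketch, not a standard.
    """
    payload = keyword.encode("latin-1") + b"\x00" + newick.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":          # insert just before the end marker
            out.append(make_chunk(b"tEXt", payload))
        out.append(make_chunk(chunk_type, data))
    return b"".join(out)


def extract_tree(png, keyword="Newick"):
    """Return the embedded tree string, or None if there is none."""
    prefix = keyword.encode("latin-1") + b"\x00"
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt" and data.startswith(prefix):
            return data[len(prefix):].decode("latin-1")
    return None


def tiny_png():
    """A 1x1 white greyscale PNG, standing in for a real tree figure."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\xff")      # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    figure = embed_tree(tiny_png(), "(A,(B,C));")
    print(extract_tree(figure))
```

A figure saved this way still opens in any image viewer, but the tree travels with it. Whether such metadata would survive a journal's production pipeline is an open question that this sketch does not address.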