Hacking the JPEG/PDF tree format

Joseph Hughes

Presentation made on the 26th of October in Edinburgh at the Scottish Phylogenetic Discussion Group

Hacking the JPEG/PDF tree format Joseph Hughes

Illustrating trees Why re-use old wood? Growing tree numbers Hacking trees TreeThief TreeSnatcher TreeRogue TreeRipper Keeping our trees green gold

In the past Darwin, C. R. The origin of species by means of natural selection. 1859 Illustrating trees

Weird and wonderful trees! Haeckel, E. The evolution of Man. 1879. Illustrating trees

Why re-use old wood? O'Meara, B. Nature Precedings 2011 Why re-use old wood?

Aggregate data in useful resources Kumar & Hedges. Bioinformatics 2011 Why re-use old wood?

And Why re-use old wood? We need to re-use! We need to be more green!

We have been busy >110,000 phylogenetic studies indexed in PubMed What have we been doing? 1983: invention of the PCR 1992: First issue of Mol. Phyl. Evol.

TreeThief Hacking trees

TreeRogue Hacking trees

TreeSnatcher and + Hacking trees Laubach, T., von Haeseler A. Bioinformatics 2007

Hacking the JPEG/PDF format ('Marsupialia',('Xenarthra',(('Eulipotyphla',('Scandentia','Primates')),('Afrosoricida',(('Tubulidentata','Macroscelidea'),('Hyracoidea',('Sirenia','Proboscidea'))))))); This is a lot of 12 point text to test the ocr code and see if it works on all types of file format. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. Hacking trees

TreeRipper Hughes, J. BMC Bioinformatics 2011 Hacking trees

Testing Hughes, J. BMC Bioinformatics 2011 Hacking trees

http://goo.gl/EZ67K Short URL: goo.gl/EZ67K Hacking trees

Keeping our trees (random thoughts) Keeping our trees

Keeping trees green Page, R.D.M. Nature Precedings 2007 Keeping our trees

Editor's Notes

I will start with a brief look at the history of illustrating phylogenies and discuss why we might want to re-use the growing number of old trees. I will then present the ways that are currently available to liberate published trees. Finally, I will suggest ways of keeping our trees looking green or even turning them gold.
We have been illustrating the relationships between species as trees ever since the publication of the Origin of Species by Darwin. In fact, this tree is the only figure in the Origin of Species.
Since then, we have invented many weird and wonderful ways of representing the relationships between organisms using trees. For example, Ernst Haeckel's "tree of life" illustrates Darwin's metaphor of the pattern of universal common descent.
But why re-use old trees anyway? They can be used for comparative analyses, for building larger trees, and for looking at the effect of methodologies and different types of characters.
They can also be aggregated into useful resources such as TimeTree.
Additionally, past phylogenies will remain central to guiding researchers towards studying poorly supported relationships and under-sampled lineages, and to putting recent molecular studies into the context of previous studies. The slide shows a supernetwork of 50 order-level phylogenies sourced from publications since 1969, according to Davis et al. (2010). SplitsTree (Huson & Bryant, 2006) was used to generate the supernetwork.
They can also be used to determine how phylogenetic hypotheses have changed and whether we are reaching a consensus, and they provide the means of placing extinct species within the context of extant lineages. But in my opinion the most important reason for re-using old trees is that they cannot be readily replicated: some of these phylogenies are based on morphological characters, for example. It is also a waste of time and resources to redo perfectly good work.
Since 1859 we have been pretty busy building trees. The number of studies depicting phylogenies has exploded with the development of the polymerase chain reaction and the ever-decreasing cost of sequencing. Journals were created specifically for publishing molecular phylogenies, such as Molecular Phylogenetics and Evolution, established in 1992. With over 110,000 phylogenetic publications indexed in PubMed, the amount of phylogenetic information available in the pages of manuscripts should not be underestimated. In the early years of phylogenetics, the best format for exchanging and sharing phylogenetic hypotheses was to embed or print illustrations in manuscripts. This has resulted in phylogenetic hypotheses being locked up in the pages of journals and books, and accessing this information is not a trivial task.
The idea of using a program to convert a tree image into a computer-readable representation of that tree is not new, and the existing approaches range from manual to semi-automated.
TreeThief, developed around 2000, was the first such program. It requires the manual entry of data and only works on Mac OS 9.
TreeRogue is a similar concept to TreeThief. It requires a program called GraphClick, which lets you record the positions of points; the output from GraphClick is then processed by an R script to obtain the tree.
TreeSnatcher Plus is a GUI-driven Java application that automates the generation of a machine-readable representation of multifurcating phylogenetic trees contained in pixel images. The user supervises the semi-automatic recognition process and makes corrections to the image and to the topology where necessary.
Due to the scale of the problem, there is a need to automate the process to defrost the phylogenies embedded in PDFs. Automating the recognition of a tree has parallels with OCR, developed since the 1930s, where an image of text is converted into a digital format. In comparison, the optical recognition of tree images is in its infancy. It requires not only OCR of the text but also recognition of the topology of the tree, in order to generate a digital bracket file format like Phylip and Nexus that can be used, for example, to reconstruct supertrees. So I've tried to develop a program to do just that.
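
To make the "bracket format" concrete, here is a minimal Python sketch that reads the bracket string shown on the slide into nested tuples and lists its taxa. It is an illustration only, not part of TreeRipper: the function names `parse_newick` and `leaves` are hypothetical, and the parser assumes a topology-only string (no branch lengths or support values, no whitespace between tokens) with optionally single-quoted labels.

```python
# Bracket string copied from the slide, split across lines for readability.
SLIDE_TREE = (
    "('Marsupialia',('Xenarthra',(('Eulipotyphla',('Scandentia','Primates')),"
    "('Afrosoricida',(('Tubulidentata','Macroscelidea'),"
    "('Hyracoidea',('Sirenia','Proboscidea')))))));"
)


def parse_newick(text):
    """Parse a topology-only bracket string into nested tuples of labels."""
    s = text.strip()
    if not s.endswith(";"):
        raise ValueError("a bracket tree must end with ';'")
    s = s[:-1]
    pos = 0

    def peek():
        # Current character, or "" once the input is exhausted.
        return s[pos] if pos < len(s) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume '('
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
            return tuple(children)
        if peek() == "'":
            # Quoted label: read up to the closing quote.
            end = s.index("'", pos + 1)
            label = s[pos + 1:end]
            pos = end + 1
            return label
        # Unquoted label: read up to the next structural character.
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    tree = parse_node()
    if pos != len(s):
        raise ValueError(f"unexpected text at position {pos}")
    return tree


def leaves(tree):
    """Return the leaf labels of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


if __name__ == "__main__":
    print(leaves(parse_newick(SLIDE_TREE)))
```

The nesting of the tuples mirrors the nesting of the brackets, which is why recovering the topology from the image is the hard part: once the brackets are right, the tree is fully specified.
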
TreeRipper is fully automated tree recognition software. It takes an input image, cleans the image, detects the contour of the phylogeny to create the bracket Nexus format, and uses an OCR program to recognize the label names. TreeRipper was developed as a C++ command-line program to process large numbers of trees. It uses the Magick++ standard template libraries.
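
As an illustration of what the final output step could look like, the following Python sketch wraps a bracket string in a minimal Nexus file containing a single TREES block. This is not TreeRipper's own code (TreeRipper is written in C++); the function name `to_nexus` and the default tree name `tree1` are assumptions made for the example.

```python
def to_nexus(newick, tree_name="tree1"):
    """Wrap a bracket (Newick) string in a minimal Nexus TREES block."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"  # every tree statement ends with a semicolon
    lines = [
        "#NEXUS",
        "BEGIN TREES;",
        f"    TREE {tree_name} = {newick}",
        "END;",
        "",  # trailing newline at end of file
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_nexus("((Scandentia,Primates),Eulipotyphla)"))
```

A file in this shape can be opened by tree viewers that read Nexus, which is what makes the recovered topology re-usable.
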
To test TreeRipper, we downloaded 322 images which had phylogen* or supertree in their caption from 249 articles published in the Open Access journal BMC Evolutionary Biology between 1997 and 2009. Only 38% met the prerequisites for TreeRipper. The processing of these 114 trees took under 3 hours, with 32% of topologies successfully recognized. The success rate depends on factors such as the resolution of the image (the higher the better, but the higher the slower) and the messiness of the image (lots of internal labels make recognition less successful).
TreeRipper is available as a web service.
Although it is possible to automatically recognize tree images from legacy literature, the software is so far limited to rectangular phylogenies despite the diversity of ways we illustrate phylogenetic relationships.
What can we do to avoid this problem in the future? With green OA self-archiving, authors publish trees in any journal and then self-archive the digital format of the tree. With gold OA publishing, authors publish the tree in an open access journal that provides immediate OA to all the data on the publisher's website.
The green OA system has been available in the form of databases like TreeBASE, which has become a valuable repository as it holds morphological and genetic data with the associated published phylogeny. However, because submission to TreeBASE is not a prerequisite for publication, the rapid growth of published phylogenies has not been matched by the availability of those trees in the database. So relying on self-archiving hasn't really worked.
It might be worth thinking of ways we can turn our trees gold. In the same way that information about a PDF is embedded in the PDF file (author, DOI, title, etc.) or that geolocation is embedded in a photo when it is taken (the geographical coordinates), we could embed the digital format of phylogenies into the images that we use in our publications. The usual way we work is to visualize the tree in TreeView or FigTree and then save that tree to make further annotations to the phylogeny (e.g. adding bootstrap supports) in an image editor. One solution would be, upon saving the file in TreeView or FigTree, to associate the tree file with the image.
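
One way this idea might be sketched, purely as an assumption on my part, is with the text metadata chunks of the PNG format rather than JPEG or PDF, because a PNG `tEXt` chunk can be written with the Python standard library alone. The keyword `Newick` is a hypothetical choice, not a registered PNG keyword, and the helper names (`embed_tree`, `extract_tree`, `tiny_png`) are made up for this example. Neither TreeView nor FigTree is claimed to do this.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk in a PNG byte string."""
    if png[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_tree(png, newick, keyword="Newick"):
    """Return a copy of the PNG with the tree stored in a tEXt chunk."""
    payload = keyword.encode("latin-1") + b"\x00" + newick.encode("latin-1")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out.append(make_chunk(b"tEXt", payload))  # insert before the end marker
        out.append(make_chunk(chunk_type, data))
    return b"".join(out)


def extract_tree(png, keyword="Newick"):
    """Return the embedded tree string, or None if the image has none."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return value.decode("latin-1")
    return None


def tiny_png():
    """A 1x1 grayscale PNG, standing in for a real tree figure."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


if __name__ == "__main__":
    figure = embed_tree(tiny_png(), "((Scandentia,Primates),Eulipotyphla);")
    print(extract_tree(figure))
```

The image still displays normally, since viewers ignore text chunks they do not use, while a harvesting tool could read the tree back without any image recognition. Metadata of this kind can be stripped when a figure is re-saved or converted, so this is a sketch of the idea and not a complete solution.
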
I would be the first to admit that it is not easy being green.
