Sharing re-usable phylogenetic data: we're not there yet
A talk in two parts:
1.) Outlining the extent of the problem
(lack of) sharing, standards, care (?)
2.) What I'm trying to do about it:
Digging data out of PDFs
Where's the data?
Just ~4% of published phylogenetic studies in 2010
publicly archived their supporting phylo data in
Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie E, Kumar S, Rosauer D, & Vos R. 2012
Sharing and re-use of phylogenetic trees (and associated data) to facilitate synthesis
BMC Research Notes 10.1186/1756-0500-5-574
Check our data yourself on Dryad here: 10.5061/dryad.h6pf365t
Scientists cannot be relied upon to
share published data upon request
This has been known for a while now
e.g. (in Psychology) Wicherts et al 2006
But has been confirmed to be true for phylogenetics too:
Drew et al 2013 'Lost Branches in the Tree of Life'
report that just ~16% of researchers contacted supplied
the requested ('published') phylo data.
My own experience tallies with this – I soon stopped bothering to try and
ask people via email for a copy of their published data. It's a waste of time.
The (Single) Supplementary Data File
was a Y2K solution – a dump
Many legacy journal supplementary data systems
bury data and leave it there to decompose
Often not re-usable in form e.g. a lazy PDF
Sometimes 'typeset', corrupting the data
A jumble of words & data where the bit you
want is on page 92 (no programmatic access)
BURIED and really not very discoverable
Do reviewers even look at it? I think not tbh
I wasted too much of my PhD
trying to get usable data to re-analyze
This is what I felt like...
So I tried to do something
An open letter in support of
palaeontology data archiving
Which was picked up by Nature News
Which, in turn, got me in touch with:
Since few will help you to re-use their data
You've got to dig it out
make it re-usable yourself
re-release it openly
so no-one else wastes their time doing this
It's not just phylogenetics.
I learned from the Open Knowledge Conference (Berlin 2011)
that a lot of different academic fields also seem to struggle to
make re-usable published data available.
If it's a common, shared problem...
why not seek a shared, cross-disciplinary solution?
Building upon tools first developed
in computational chemistry by the Murray-Rust lab
ChemicalTagger → PhyloTagger (Entity tagging)
(Chem)PubCrawler → (Phylo)PubCrawler
(to get 10,000+ PDFs to work on)
BBSRC grant approved
“PLUTo: Phyloinformatic Literature Unlocking Tools”
Software for making published phyloinformatic
data discoverable, open, and reusable
...I just need to get my PhD viva done & rubber-stamped
Instructions for getting the current working setup here:
(multiple repositories, dependencies & requirements!)
Evolution of ultraviolet
vision in the largest avian
radiation - the passerines
Anders Ödeen 1* , Olle Håstad
and Per Alström 4
Styles , superscripts
Orthonyx x 2
Cnemophilus x 4
Philesturnus x 2
Motacilla x 2
Toxorhampus x 2
Typical phylo tree: 60 nodes, complex and minuscule annotation,
vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
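Once tip labels such as "Orthonyx x 2" have been pulled out of a figure, they still need turning into structured records before anyone can re-use them. The following is a minimal sketch of that last step only, not how AMI itself works. It assumes the "x N" suffix is a count attached to a genus name; the function names and the regular expression are hypothetical.

```python
import re

# Example tip labels, copied from the slide above.
LABELS = [
    "Orthonyx x 2",
    "Cnemophilus x 4",
    "Philesturnus x 2",
    "Motacilla x 2",
    "Toxorhampus x 2",
]

# Assumed label shape: a capitalised genus name, then "x", then an integer.
LABEL_RE = re.compile(r"^(?P<genus>[A-Z][a-z]+)\s+x\s+(?P<count>\d+)$")


def parse_tip_label(label):
    """Return (genus, count) for a 'Genus x N' label, or None if it does not match."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        return None
    return match.group("genus"), int(match.group("count"))


def parse_tip_labels(labels):
    """Map each genus to its count, silently skipping labels that do not match."""
    parsed = {}
    for label in labels:
        result = parse_tip_label(label)
        if result is not None:
            genus, count = result
            parsed[genus] = count
    return parsed


if __name__ == "__main__":
    # Print a tab-separated table that can be pasted into a spreadsheet.
    for genus, count in parse_tip_labels(LABELS).items():
        print(f"{genus}\t{count}")
```

Labels that do not fit the assumed shape (for example stray caption text) are skipped rather than guessed at, so they can be checked by hand.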
Acknowledgements & Thanks
For the Panton Fellowship,
inspiration and support
To the organisers
of both the session:
Nico, Hilmar, Rutger
and the conference
as a whole!
For travel & accommodation
support, without which I couldn't
possibly attend TDWG
My main collaborators on PLUTo: Matthew Wills and Peter Murray-Rust