SlideShare a Scribd company logo
1 of 19
Download to read offline
PLUTo
Phyloinformatic
Literature
Unlocking
Tools
Ross Mounce @rmounce
Peter Murray-Rust @petermurrayrust
Matthew Wills
Machine-
Readable
Re-usable
Data &
Metadata
#iEvoBio
Talk Structure
● Why re-extract?
Everyone shares their data, right? [no]
● Where are the trees?
Creating an atlas of phylogeny
● How to scalably extract tree data?
Liberating Figure Images & Captions
Extracting Re-usable Data from Images
If I go too fast/slow, the full slidedeck is on Slideshare
http://www.slideshare.net/rossmounce/
Why hack data from the literature?
Multiple independent studies show re-usable phylogenetic data is
NOT publicly available online for most studies
● Stoltzfus et al. (2012) BMC Research Notes estimates 4%
● Drew et al. (2013) PLOS Biology estimates 17%
● Magee et al. (2014) arXiv preprint, estimates 25%
Why the difference between studies? Different methods & scope
Drew & Magee sampled only from 'better' papers
Drew: from well-known journals (only), excluding less-read journals
Magee: from papers citing relatively new, complex methods
Over ALL journals/papers Stoltzfus (2012) probably provides the most representative estimate
Pop Quiz Time
Which journal publishes
the most papers containing
phylogenetic analyses,
per year?
Credit: thanks @rdmpage for the 'pop quiz' technique
#1 is IJSEM
International
Journal of
Systematic &
Evolutionary
Microbiology
#5 PLOS ONE
(probably #3 now)
Source: Web of Science /
Mounce (2013) PhD thesis
The long tail distribution of phylogenetic analyses
There's at least a
1000 different journals
in which phylogenetic
analyses have been
published in.
Collectively this
represents significant
volume.
In terms of journals,
volume of phylogeny
papers published has
no relation to 'quality'
of phylogenetic
analysis
Creating an atlas of phylogeny
Problems:
● Indexers like Google Scholar, Scopus & Web of Science
don't perfectly index the literature – many false negatives
(relevant papers not found that should be found)
● No-one has access to ALL journals. Paywalls. Grr
● Even with legitimate access, publisher-imposed &
copyright restrictions hamper phylogeny discovery
Solutions (partial):
●
As of June 1st
2014 the UK has new copyright exceptions
to enable and protect text & data mining for
non-commercial research purposes [link]
Searching for phylogeny is hard
Make it a lot easier!
Search by “presence
of phylogenetic trees”
Link to journal search here
Creating an OA atlas of phylogeny
● Free-to-use platform (free as in beer, it's not open)
One Terabyte of free storage per account
● Highly popular platform for image sharing
(in top 100 most frequently visited websites of the world)
● Supports Creative Commons licensing (many platforms don't)
● Feature-rich, good UI, useful API, etc...
Full attribution visible next to figure. One-click link to source. Full caption text. Searchable.
View-counter (METRICS!). Open licencing marked (tells you it's CC BY on mouse-over)
Only one publisher currently embeds
useful metadata in their figure images
Well done PLOS!
Not perfect though.
Author names &
the paper title are
NOT embedded
The OA 'Atlas of Phylogeny'
nearly 10,000 figures!
● 4045 phylogeny figures from PLOS ONE
– bit.ly/PLOStrees
● 5215 phylogeny figures from 154 OA journals
(Pensoft, BMC, FrontiersIn, other PLOS journals,
Hindawi, MDPI) & a tiny number of hybrid OA papers
from Elsevier, Royal Society and Magnolia Press.
– bit.ly/phylofigs
correct as of June 22nd 2014
How to get the data from the image?
● Previous work
TreeThief (Rambaut, 2000) old, not used anymore
TreeRipper (Hughes, 2011) automated, but v. picky
TreeSnatcher Plus (Laubach et al. 2012) manual
TreeSnatcher authors report it took them
21 minutes to manually extract the tree &
taxon labels from this radial bustard tree,
using TreeSnatcher Plus (Supp. Data. 6)
● Faster than TreeSnatcher Plus
● Less picky about tree style than TreeRipper
Our approach: automated!
Stages 1&2 :
binarization (Black or White) & thinning (1 pixel width structures)
Stage 3 :
Assume largest 'pixel island' is the tree structure
Several stages later... (segmentation, find root if applicable, etc)
Re-draw / Re-use extracted data!
from NeXML output
Newick output also optional
Tip labels will be OCR'd, harder for radial trees
Still in very active development...
https://bitbucket.org/petermr/imageanalysis
https://bitbucket.org/petermr/diagramanalyzer
Java, Maven, Apache PDFbox, BoofCV,
Test-driven development, openly-licensed
Please stop publishing needlessly
composite figures in online-only journals!!!

More Related Content

What's hot

GARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceGARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceDavid Johnson
 
Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Carole Goble
 
2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurghJun Zhao
 
Your research as open science
Your research as open scienceYour research as open science
Your research as open scienceIrina Radchenko
 
Semantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBISemantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBISimon Jupp
 
Fixing the infrastructure for open science
Fixing the infrastructure for open scienceFixing the infrastructure for open science
Fixing the infrastructure for open scienceBjörn Brembs
 
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke InformatieWorkshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatiejstaaks
 
Open Notebook Science in Drug Discovery
Open Notebook Science in Drug DiscoveryOpen Notebook Science in Drug Discovery
Open Notebook Science in Drug DiscoveryJean-Claude Bradley
 
Annotopia open annotation services platform
Annotopia open annotation services platformAnnotopia open annotation services platform
Annotopia open annotation services platformTim Clark
 
Nanoinformatics 2010 SMIRP-ONS Talk
Nanoinformatics 2010 SMIRP-ONS TalkNanoinformatics 2010 SMIRP-ONS Talk
Nanoinformatics 2010 SMIRP-ONS TalkJean-Claude Bradley
 
Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...FAIRDOM
 
Community curation at PomBase
Community curation at PomBaseCommunity curation at PomBase
Community curation at PomBaseValerie Wood
 
Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...Valery Tkachenko
 
Fairport domain specific metadata using w3 c dcat & skos w ontology views
Fairport domain specific metadata using w3 c dcat & skos w ontology viewsFairport domain specific metadata using w3 c dcat & skos w ontology views
Fairport domain specific metadata using w3 c dcat & skos w ontology viewsTim Clark
 
eXframe: A Semantic Web Platform for Genomic Experiments
eXframe: A Semantic Web Platform for Genomic ExperimentseXframe: A Semantic Web Platform for Genomic Experiments
eXframe: A Semantic Web Platform for Genomic ExperimentsTim Clark
 
exFrame: a Semantic Web Platform for Genomics Experiments
exFrame: a Semantic Web Platform for Genomics ExperimentsexFrame: a Semantic Web Platform for Genomics Experiments
exFrame: a Semantic Web Platform for Genomics ExperimentsTim Clark
 
On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...Susanna-Assunta Sansone
 

What's hot (20)

GARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant ScienceGARNet workshop on Integrating Large Data into Plant Science
GARNet workshop on Integrating Large Data into Plant Science
 
Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how Reproducibility and Scientific Research: why, what, where, when, who, how
Reproducibility and Scientific Research: why, what, where, when, who, how
 
2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh2011 03-provenance-workshop-edingurgh
2011 03-provenance-workshop-edingurgh
 
Your research as open science
Your research as open scienceYour research as open science
Your research as open science
 
Crosslinks
Crosslinks Crosslinks
Crosslinks
 
Semantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBISemantics as a service at EMBL-EBI
Semantics as a service at EMBL-EBI
 
Fixing the infrastructure for open science
Fixing the infrastructure for open scienceFixing the infrastructure for open science
Fixing the infrastructure for open science
 
IGERT Open Notebook Science
IGERT Open Notebook ScienceIGERT Open Notebook Science
IGERT Open Notebook Science
 
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke InformatieWorkshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
Workshop Systematic Reviews Werkgroep Sociaal Wetenschappelijke Informatie
 
Bh14 ogo
Bh14 ogoBh14 ogo
Bh14 ogo
 
Open Notebook Science in Drug Discovery
Open Notebook Science in Drug DiscoveryOpen Notebook Science in Drug Discovery
Open Notebook Science in Drug Discovery
 
Annotopia open annotation services platform
Annotopia open annotation services platformAnnotopia open annotation services platform
Annotopia open annotation services platform
 
Nanoinformatics 2010 SMIRP-ONS Talk
Nanoinformatics 2010 SMIRP-ONS TalkNanoinformatics 2010 SMIRP-ONS Talk
Nanoinformatics 2010 SMIRP-ONS Talk
 
Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...
 
Community curation at PomBase
Community curation at PomBaseCommunity curation at PomBase
Community curation at PomBase
 
Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...Clustering the royal society of chemistry chemical repository to enable enhan...
Clustering the royal society of chemistry chemical repository to enable enhan...
 
Fairport domain specific metadata using w3 c dcat & skos w ontology views
Fairport domain specific metadata using w3 c dcat & skos w ontology viewsFairport domain specific metadata using w3 c dcat & skos w ontology views
Fairport domain specific metadata using w3 c dcat & skos w ontology views
 
eXframe: A Semantic Web Platform for Genomic Experiments
eXframe: A Semantic Web Platform for Genomic ExperimentseXframe: A Semantic Web Platform for Genomic Experiments
eXframe: A Semantic Web Platform for Genomic Experiments
 
exFrame: a Semantic Web Platform for Genomics Experiments
exFrame: a Semantic Web Platform for Genomics ExperimentsexFrame: a Semantic Web Platform for Genomics Experiments
exFrame: a Semantic Web Platform for Genomics Experiments
 
On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...On community-standards, data curation and scholarly communication - BITS, Ita...
On community-standards, data curation and scholarly communication - BITS, Ita...
 

Viewers also liked

Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themRoss Mounce
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Kaitlin Thaney
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureRoss Mounce
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century ResearchRoss Mounce
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research DataRoss Mounce
 
Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Ross Mounce
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? Nancy Pontika
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open DataRoss Mounce
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesAlex Holcombe
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingKent Anderson
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Ross Mounce
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career ResearchersRoss Mounce
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeRon Martinez
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You OnJill Cirasella
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationhierohiero
 

Viewers also liked (16)

Museum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on themMuseum impact: linking-up specimens with research published on them
Museum impact: linking-up specimens with research published on them
 
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
Leveraging the power of the web - Rocky Mountain Advanced Computing Conference
 
Open Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | FutureOpen Research Data: Licensing | Standards | Future
Open Research Data: Licensing | Standards | Future
 
Modern Tools & Rationales for 21st Century Research
Modern Tools & Rationales  for 21st Century ResearchModern Tools & Rationales  for 21st Century Research
Modern Tools & Rationales for 21st Century Research
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research Data
 
Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]Open scholarship [a FOSTER open science talk]
Open scholarship [a FOSTER open science talk]
 
Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?Open Access Publishing, Threat or Opportunity?
Open Access Publishing, Threat or Opportunity?
 
How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why? How can repositories support the text-mining of their content and why?
How can repositories support the text-mining of their content and why?
 
The OpenCon Intro to Open Data
The OpenCon Intro to Open DataThe OpenCon Intro to Open Data
The OpenCon Intro to Open Data
 
Subscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundariesSubscription costs versus open access costs, & Dissolving journals' boundaries
Subscription costs versus open access costs, & Dissolving journals' boundaries
 
SocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meetingSocialCite makes its debut at the HighWire Press meeting
SocialCite makes its debut at the HighWire Press meeting
 
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
Specimen-level mining: bringing knowledge back 'home' to the Natural History ...
 
Open Access for Early Career Researchers
Open Access for Early Career ResearchersOpen Access for Early Career Researchers
Open Access for Early Career Researchers
 
Research publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challengeResearch publication support for scholars in Brazil: Rising to the challenge
Research publication support for scholars in Brazil: Rising to the challenge
 
Open Access: Which Side Are You On
Open Access: Which Side Are You OnOpen Access: Which Side Are You On
Open Access: Which Side Are You On
 
Fifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly informationFifty shades of green and gold: open access to scholarly information
Fifty shades of green and gold: open access to scholarly information
 

Similar to The PLUTo project @iEvoBio 2014

Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...GigaScience, BGI Hong Kong
 
Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Sandra Binning
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKpetermurrayrust
 
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...Phoenix Bioinformatics
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer SchoolCarole Goble
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...Carole Goble
 
Open Access Week - Oxford, 20-24 Oct 2014
Open Access Week - Oxford, 20-24 Oct 2014Open Access Week - Oxford, 20-24 Oct 2014
Open Access Week - Oxford, 20-24 Oct 2014Susanna-Assunta Sansone
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchDatapetermurrayrust
 
RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015William Gunn
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Susanna-Assunta Sansone
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData TheContentMine
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustLEARN Project
 
re3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositoriesre3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data RepositoriesHeinz Pampel
 
Open PHACTS April 2017 Science webinar Workflow tools
Open PHACTS April 2017 Science webinar Workflow toolsOpen PHACTS April 2017 Science webinar Workflow tools
Open PHACTS April 2017 Science webinar Workflow toolsopen_phacts
 
Stories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global InfrastructureStories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global InfrastructureResearch Data Alliance
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science Carole Goble
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8Scott Edmunds
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Carole Goble
 

Similar to The PLUTo project @iEvoBio 2014 (20)

Open Access Publishing: More Readers, More Impact
Open Access Publishing: More Readers, More ImpactOpen Access Publishing: More Readers, More Impact
Open Access Publishing: More Readers, More Impact
 
Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...Laurie Goodman at #SSPBoston: Article+Data+ToolsReproducibility, Reuse, & Ra...
Laurie Goodman at #SSPBoston: Article+Data+Tools Reproducibility, Reuse, & Ra...
 
Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?Public Data Archiving in Ecology and Evolution: How well are we doing?
Public Data Archiving in Ecology and Evolution: How well are we doing?
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
BioPortal: ontologies and integrated data resources at the click of a mouse
BioPortal: ontologies and integrated data resourcesat the click of a mouseBioPortal: ontologies and integrated data resourcesat the click of a mouse
BioPortal: ontologies and integrated data resources at the click of a mouse
 
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...
A Few Simple Things Authors Can Do to Make Their Data More Discoverable and R...
 
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
Being FAIR:  FAIR data and model management SSBSS 2017 Summer SchoolBeing FAIR:  FAIR data and model management SSBSS 2017 Summer School
Being FAIR: FAIR data and model management SSBSS 2017 Summer School
 
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
ISMB/ECCB 2013 Keynote Goble Results may vary: what is reproducible? why do o...
 
Open Access Week - Oxford, 20-24 Oct 2014
Open Access Week - Oxford, 20-24 Oct 2014Open Access Week - Oxford, 20-24 Oct 2014
Open Access Week - Oxford, 20-24 Oct 2014
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 
RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015RDA Scholarly Infrastructure 2015
RDA Scholarly Infrastructure 2015
 
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
Oxford DTP - Sansone - Data publications and Scientific Data - Dec 2014
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
re3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositoriesre3data.org – a Registry of Research Data Repositories
re3data.org – a Registry of Research Data Repositories
 
Open PHACTS April 2017 Science webinar Workflow tools
Open PHACTS April 2017 Science webinar Workflow toolsOpen PHACTS April 2017 Science webinar Workflow tools
Open PHACTS April 2017 Science webinar Workflow tools
 
Stories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global InfrastructureStories of “Glocality"—Nations in a Global Infrastructure
Stories of “Glocality"—Nations in a Global Infrastructure
 
Research Objects for FAIRer Science
Research Objects for FAIRer Science Research Objects for FAIRer Science
Research Objects for FAIRer Science
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
 
Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017Being Reproducible: SSBSS Summer School 2017
Being Reproducible: SSBSS Summer School 2017
 

More from Ross Mounce

Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningRoss Mounce
 
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Ross Mounce
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For ResearchersRoss Mounce
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for ScienceRoss Mounce
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetRoss Mounce
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Ross Mounce
 

More from Ross Mounce (9)

Workshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data miningWorkshop 5: Uptake of, and concepts in text and data mining
Workshop 5: Uptake of, and concepts in text and data mining
 
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
Liberating OA figures from PDF to Flickr (A Pro-iBiosphere talk)
 
Social Media For Researchers
Social Media For ResearchersSocial Media For Researchers
Social Media For Researchers
 
Social Media for Science
Social Media for ScienceSocial Media for Science
Social Media for Science
 
Sharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yetSharing re-usable phylogenetic data: we're not there yet
Sharing re-usable phylogenetic data: we're not there yet
 
Herding Cats
Herding CatsHerding Cats
Herding Cats
 
Content Mining
Content MiningContent Mining
Content Mining
 
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
Phylogenetic Congruence between Cranial and Postcranial Characters in Archosa...
 
ProgPal2011
ProgPal2011ProgPal2011
ProgPal2011
 

Recently uploaded

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...M56BOOKSTORE PRODUCT/SERVICE
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfsanyamsingh5019
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxiammrhaywood
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 

Recently uploaded (20)

How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Sanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdfSanyam Choudhary Chemistry practical.pdf
Sanyam Choudhary Chemistry practical.pdf
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 

The PLUTo project @iEvoBio 2014

  • 1. PLUTo Phyloinformatic Literature Unlocking Tools Ross Mounce @rmounce Peter Murray-Rust @petermurrayrust Matthew Wills Machine- Readable Re-usable Data & Metadata #iEvoBio
  • 2. Talk Structure ● Why re-extract? Everyone shares their data, right? [no] ● Where are the trees? Creating an atlas of phylogeny ● How to scalably extract tree data? Liberating Figure Images & Captions Extracting Re-usable Data from Images If I go too fast/slow, the full slidedeck is on Slideshare http://www.slideshare.net/rossmounce/
  • 3. Why hack data from the literature? Multiple independent studies show re-usable phylogenetic data is NOT publicly available online for most studies ● Stoltzfus et al. (2012) BMC Research Notes estimates 4% ● Drew et al. (2013) PLOS Biology estimates 17% ● Magee et al. (2014) arXiv preprint, estimates 25% Why the difference between studies? Different methods & scope Drew & Magee sampled only from 'better' papers Drew: from well-known journals (only), excluding less-read journals Magee: from papers citing relatively new, complex methods Over ALL journals/papers Stoltzfus (2012) probably provides the most representative estimate
  • 4. Pop Quiz Time Which journal publishes the most papers containing phylogenetic analyses, per year? Credit: thanks @rdmpage for the 'pop quiz' technique
  • 5. #1 is IJSEM International Journal of Systematic & Evolutionary Microbiology #5 PLOS ONE (probably #3 now) Source: Web of Science / Mounce (2013) PhD thesis
  • 6. The long tail distribution of phylogenetic analyses There's at least a 1000 different journals in which phylogenetic analyses have been published in. Collectively this represents significant volume. In terms of journals, volume of phylogeny papers published has no relation to 'quality' of phylogenetic analysis
  • 7. Creating an atlas of phylogeny Problems: ● Indexers like Google Scholar, Scopus & Web of Science don't perfectly index the literature – many false negatives (relevant papers not found that should be found) ● No-one has access to ALL journals. Paywalls. Grr ● Even with legitimate access, publisher-imposed & copyright restrictions hamper phylogeny discovery Solutions (partial): ● As of June 1st 2014 the UK has new copyright exceptions to enable and protect text & data mining for non-commercial research purposes [link]
  • 8. Searching for phylogeny is hard Make it a lot easier! Search by “presence of phylogenetic trees” Link to journal search here
  • 9. Creating an OA atlas of phylogeny ● Free-to-use platform (free as in beer, it's not open) One Terabyte of free storage per account ● Highly popular platform for image sharing (in top 100 most frequently visited websites of the world) ● Supports Creative Commons licensing (many platforms don't) ● Feature-rich, good UI, useful API, etc...
  • 10. Full attribution visible next to figure. One-click link to source. Full caption text. Searchable. View-counter (METRICS!). Open licencing marked (tells you it's CC BY on mouse-over)
  • 11. Only one publisher currently embeds useful metadata in their figure images Well done PLOS! Not perfect though. Author names & the paper title are NOT embedded
  • 12. The OA 'Atlas of Phylogeny' nearly 10,000 figures! ● 4045 phylogeny figures from PLOS ONE – bit.ly/PLOStrees ● 5215 phylogeny figures from 154 OA journals (Pensoft, BMC, FrontiersIn, other PLOS journals, Hindawi, MDPI) & a tiny number of hybrid OA papers from Elsevier, Royal Society and Magnolia Press. – bit.ly/phylofigs correct as of June 22nd 2014
  • 13. How to get the data from the image? ● Previous work TreeThief (Rambaut, 2000) old, not used anymore TreeRipper (Hughes, 2011) automated, but v. picky TreeSnatcher Plus (Laubach et al. 2012) manual TreeSnatcher authors report it took them 21 minutes to manually extract the tree & taxon labels from this radial bustard tree, using TreeSnatcher Plus (Supp. Data. 6)
  • 14. ● Faster than TreeSnatcher Plus ● Less picky about tree style than TreeRipper Our approach: automated!
  • 15. Stages 1&2 : binarization (Black or White) & thinning (1 pixel width structures)
  • 16. Stage 3 : Assume largest 'pixel island' is the tree structure
  • 17. Several stages later... (segmentation, find root if applicable, etc) Re-draw / Re-use extracted data! from NeXML output Newick output also optional Tip labels will be OCR'd, harder for radial trees
  • 18. Still in very active development... https://bitbucket.org/petermr/imageanalysis https://bitbucket.org/petermr/diagramanalyzer Java, Maven, Apache PDFbox, BoofCV, Test-driven development, openly-licensed
  • 19. Please stop publishing needlessly composite figures in online-only journals!!!