Data Provenance for Phyloinformatics:
Introduction & Survey Results
Elliott Hauser
UNC Information Science

Karen Cranston
NESCent Informatics
Overview:
What is Phylogenetics?
What is Phylogenetic Data?
...many things!
Source: DRAFT: Current Best Practices for Publishing Trees Electronically, 2010. Stoltzfus et al.
http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/LinkingTrees2010
What is Phylogenetic Data?
<A sample NeXML file>
Source: http://github.com/miapa/miapa-etl/tree/master/nexmlex
What is a
Minimum Information Standard?
The answer to this question, for a domain:

"What is the minimum information necessary
for an independent scientist to carry out an
independent analysis of the data?"

Quackenbush, 2005

For Phylogenetics, this is MIAPA:
Minimum Information About a Phylogenetic Analysis
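
One way to picture a minimum information standard is as a checklist that a submitted record either satisfies or does not. The sketch below is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS` are hypothetical placeholders, not the MIAPA checklist, which these slides do not spell out.

```python
from typing import Iterable, List

# Hypothetical checklist for illustration only; not the actual MIAPA items.
REQUIRED_FIELDS = (
    "taxa",
    "character_matrix",
    "alignment_method",
    "inference_method",
    "tree",
)


def missing_fields(record: dict, required: Iterable[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the required fields that are absent or empty, in checklist order."""
    return [name for name in required if not record.get(name)]


# A record that supplies a tree and taxa but says little about how the tree was made.
example = {
    "taxa": ["Homo sapiens", "Pan troglodytes"],
    "tree": "(Homo_sapiens,Pan_troglodytes);",
    "inference_method": "",  # present but empty, so it counts as missing
}

print(missing_fields(example))
```

A consumer could run a check like this before attempting a reanalysis; a producer could run it before submission to see what is still owed.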
What do we need to know to analyze
this tree?
Overview:
What is MIAPA?
Source: Leebens-Mack et al. 2006
Overview:
Producers' and Consumers' attitudes
[Figure annotations: most important metadata type; least important metadata type]
Source: Cranston MIAPA survey, 2012 (unpublished)
Half of all metadata types are
critically important to two or more subfields
Source: Cranston MIAPA survey, 2012 (unpublished)
The majority of metadata types are
easy to produce for all subfields
Source: Cranston MIAPA survey, 2012 (unpublished)
How to balance the needs of
Producers and Consumers?
[Figure annotations: most important metadata type; least important metadata type]
Source: Cranston MIAPA survey, 2012 (unpublished)
Metadata at work:
The Open Tree of Life Project
Conflicting Data, Conflicting Needs:
● A Single, 'Best' Tree of Life
● Access to Underlying, Conflicting Trees
A new research area:
Computational data provenance
...Huh?
A new research area:
Computational data provenance

Computational: The result of a computation

Data provenance: Where/how it came to be


As science becomes more and more computational, we need to know more about our data!
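
To make "where/how it came to be" concrete, a computed result can carry a small record of its inputs, the tool that produced it, and the settings used. The following is a minimal sketch, not a standard format: the function `provenance_record`, its field names, and the tool name `example-tree-builder` are all hypothetical and chosen for illustration.

```python
import hashlib
import json


def provenance_record(input_bytes: bytes, tool: str, version: str,
                      parameters: dict, output_text: str) -> dict:
    """Describe how a computed result came to be (hypothetical record layout)."""
    return {
        # Hashes identify the exact input and output without storing them here.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "tool": tool,
        "version": version,
        # Sorted so that the same settings always serialize the same way.
        "parameters": dict(sorted(parameters.items())),
        "output_sha256": hashlib.sha256(output_text.encode("utf-8")).hexdigest(),
    }


alignment = b">A\nACGT\n>B\nACGA\n"  # toy input data
tree = "(A,B);"                      # toy computed result

record = provenance_record(
    input_bytes=alignment,
    tool="example-tree-builder",     # hypothetical tool name
    version="0.1",
    parameters={"model": "JC69", "bootstrap": 100},
    output_text=tree,
)
print(json.dumps(record, indent=2))
```

Storing a record like this next to each tree would let a later reader tell whether two conflicting trees differ in their data, their method, or both.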
Reprise:
What is Phylogenetics?
A perfect field for computational data provenance!
Discussion
Will our survey results predict actual behavior?

What tools, if any, will preserve and encourage
submission of computational data provenance?

Is computational data different from measurement
data, classification data, or other types of
metadata? If so, does that affect our work?
Thanks!
eah13@mac.com
Reprise: balancing the needs of
Producers and Consumers?
[Figure annotations: most important metadata type; least important metadata type]
