KURATOR: A Provenance-enabled 
Workflow Platform and Toolkit to Curate 
Biodiversity Data 
Bertram Ludäscher 
Graduate School of Library and Information Science (GSLIS) 
National Center for Supercomputing Applications (NCSA)
Outline 
• Kurator: 
– What problems is Kurator tackling and for whom? 
– Curation Workflow Example 
– How we’re going about it 
• Not Today: 
– Related Biodiversity Informatics Projects 
• Filtered-Push 
• Exploring Taxon Concepts (ETC) 
• Euler 
– Other Informatics Projects 
• DataONE 
• SKOPE 
ERRT @ GSLIS 10/22/2014 2 
What is Kurator? 
• NSF-DBI #1356751 
– Collaborative Research: ABI Development: 
Kurator: A Provenance-enabled Workflow Platform 
and Toolkit to Curate Biodiversity Data 
– Sept. 2014 – 2017 
– @Illinois: 
• B. Ludäscher, James Macklin, Tim McPhillips, … 
– @Harvard: 
• James Hanken, Paul Morris, Bob Morris, … 
Problem: Data & Metadata Quality 
• Collections & occurrence data are 
all over the map 
– … literally (off the map!) 
• Issues: 
– Lat/Long transposition, 
coordinate & projection issues 
– Scientific Names (spelling 
errors, other) 
– Data entry/creation, “fuzzy” 
data, naming issues, bit rot, 
data conversions and 
transformations, schema 
mappings, … (you name it) 
• Precursor: 
– Filtered-Push Collaboration 
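
The lat/long transposition issue listed above can be illustrated with a small check. This is a minimal sketch, not Kurator or Filtered-Push code: `BoundingBox`, `check_coordinates`, and the result labels are hypothetical names, and the assumption is that each record comes with a rough expected region (for example, derived from its stated country or state).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rough expected region for a record (hypothetical helper)."""
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def in_valid_range(lat: float, lon: float) -> bool:
    """Latitude must lie in [-90, 90], longitude in [-180, 180]."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def check_coordinates(lat: float, lon: float, expected: BoundingBox) -> str:
    """Return 'ok', 'transposed?' or 'suspect' for one coordinate pair."""
    if in_valid_range(lat, lon) and expected.contains(lat, lon):
        return "ok"
    # Swap the two values: if the swapped pair falls inside the expected
    # region, a lat/long transposition is a plausible explanation.
    if in_valid_range(lon, lat) and expected.contains(lon, lat):
        return "transposed?"
    return "suspect"


# Illustrative box only; the numbers are an assumption, not survey data.
example_box = BoundingBox(41.0, 43.0, -73.6, -69.8)
print(check_coordinates(42.4, -71.1, example_box))
print(check_coordinates(-71.1, 42.4, example_box))
```

The check only flags a record; whether to swap the values automatically is a separate curation decision.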
What Problems does Kurator try to solve? 
• Detect and flag data quality issues 
• Repair if possible 
• Keep track of provenance 
– automatic repairs 
– human curator edits 
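
One way to picture "flag, repair, keep track of provenance" is a record that logs every change together with the agent that made it. The sketch below is an assumption for illustration only: `CuratedRecord`, `ProvenanceEntry`, the agent strings, and the field name are hypothetical and do not describe Kurator's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass
class ProvenanceEntry:
    field_name: str
    old_value: Any
    new_value: Any
    agent: str       # hypothetical convention: "auto:<service>" or "human:<curator>"
    comment: str
    timestamp: str   # ISO 8601, UTC


@dataclass
class CuratedRecord:
    data: Dict[str, Any]
    flags: List[str] = field(default_factory=list)
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def flag(self, message: str) -> None:
        """Record a detected quality issue without changing the data."""
        self.flags.append(message)

    def repair(self, field_name: str, new_value: Any,
               agent: str, comment: str) -> None:
        """Change a value and log who changed it, from what, and why."""
        self.provenance.append(ProvenanceEntry(
            field_name=field_name,
            old_value=self.data.get(field_name),
            new_value=new_value,
            agent=agent,
            comment=comment,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
        self.data[field_name] = new_value


record = CuratedRecord({"scientificName": "Quercus albba"})
record.flag("scientificName: possible misspelling")
record.repair("scientificName", "Quercus alba",
              agent="auto:name-checker", comment="closest known name")
for entry in record.provenance:
    print(entry.agent, entry.old_value, "->", entry.new_value)
```

Automatic repairs and human curator edits go through the same `repair` call, so both kinds of change end up in one history that can be reviewed or pushed back to the source collection.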
Who are the customers? 
• Collection Managers 
– … who are managing the collections databases 
– Can run curation workflows periodically 
• … in the presence of new data and/or new curation services 
• (Biodiversity) Researchers 
– To perform an analysis in the presence of (partially) 
dirty data, researchers need to 
• Clean or fix dirty data 
• Throw out unfixable data 
– Pushing changes to the original data collections and 
collection managers (cf. FPush) 
Example: Kepler/Kurator (FPush project) 
Simplified Example Workflow 
• Related Research (Tianhong Song, UC Davis) 
– Analyze linear workflow “story” 
– Use patterns to discover wf design issues 
(e.g. use before update); then fix them 
– Parallelize when possible 
• Kurator: 
– Allow easy 
assembly of such 
workflows 
– For tool makers 
– … and tool users 
– … scalability 
challenge. 
Example Output … 
… close up … 
How we do it 
• Build a library of curation services such that 
curation workflows can be run from various 
platforms 
– Scientific workflow systems 
• e.g. Restflow, Kepler, Taverna, Galaxy 
– Other platforms 
• e.g. Akka, Python-based, … 
• … leveraging existing technologies 
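
The idea of a service library that is independent of the platform running it can be sketched as follows. This is a hedged illustration, not Kurator's API: a curation service is assumed to be a plain function from record to record, and `run_in_process` and `run_on_json_lines` are hypothetical adapters standing in for a script and for a host such as a workflow system that exchanges JSON lines.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]
Service = Callable[[Record], Record]   # assumed shape of a curation service


def trim_whitespace(record: Record) -> Record:
    """Example service: strip stray whitespace from string values."""
    return {key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()}


def run_in_process(services: List[Service],
                   records: Iterable[Record]) -> Iterator[Record]:
    """Adapter 1: call the services directly, e.g. from a Python script."""
    for record in records:
        for service in services:
            record = service(record)
        yield record


def run_on_json_lines(services: List[Service],
                      lines: Iterable[str]) -> Iterator[str]:
    """Adapter 2: same services behind a JSON-lines interface, which an
    external platform could drive through files or pipes."""
    parsed = (json.loads(line) for line in lines if line.strip())
    for record in run_in_process(services, parsed):
        yield json.dumps(record, sort_keys=True)


for line in run_on_json_lines([trim_whitespace], ['{"locality": " Urbana "}']):
    print(line)
```

The services themselves know nothing about the host; only the thin adapters differ between platforms.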
How we do it 
• Open source, community-friendly approach 
– git repository (NCSA open source projects) 
• Agile software development 
– NCSA support tools, e.g. JIRA, Bamboo 
• Inspired by 
– Small bioinformatics tools manifesto (post-facto) 
– Unix tenets (small, interoperable tools, … ) 
– Experience with other (sometimes not so agile) 
development projects 
Kurator: Agile Development 
Q & A … 
• What do data curation and quality control mean in 
your domain / application / research? 
• Are there particular issues that are important to 
you? 
• Join us! 
– Kurator & other Biodiversity Interest 
• Hackers welcome, too. 
– Email: ludaesch@illinois.edu 
Related Research (Tianhong Song) 
• Automated Design, Analysis, Optimization of 
Curation Workflows. 
• Idea: 
• Example Workflow 
[Scientific Name Validation] → [GeoRef Validation] → [Date Validation] 
Related Research (Tianhong Song) 
• Analyze linear workflow 
“story” 
• Use patterns to discover wf 
design issues (e.g. use before 
update); then fix them 
• Parallelize when possible 
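
The analysis steps above can be sketched on a linear workflow whose steps declare which fields they read and write. This is one possible reading of the "use before update" pattern and of the parallelization idea, not the method from the cited research; `Step`, `use_before_update`, `parallel_stages`, and the field names are hypothetical, and step names are assumed to be unique.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def use_before_update(steps: List[Step]) -> List[Tuple[str, str, str]]:
    """List (earlier, later, field) where an earlier step reads a field
    that a later step updates, so the earlier step saw a stale value."""
    issues = []
    for i, earlier in enumerate(steps):
        for later in steps[i + 1:]:
            for name in sorted(earlier.reads & later.writes):
                issues.append((earlier.name, later.name, name))
    return issues


def conflicts(a: Step, b: Step) -> bool:
    """Two steps must stay ordered if one writes what the other touches."""
    return bool(a.writes & (b.reads | b.writes) or b.writes & a.reads)


def parallel_stages(steps: List[Step]) -> List[List[str]]:
    """Group a linear workflow into stages; steps that share a stage have
    no conflicts with each other and could run in parallel."""
    level: Dict[str, int] = {}
    for i, step in enumerate(steps):
        earlier = [level[p.name] for p in steps[:i] if conflicts(p, step)]
        level[step.name] = 1 + max(earlier, default=-1)
    stages: List[List[str]] = [[] for _ in range(1 + max(level.values(), default=-1))]
    for step in steps:
        stages[level[step.name]].append(step.name)
    return stages


workflow = [
    Step("Scientific Name Validation",
         frozenset({"scientificName"}), frozenset({"scientificName"})),
    Step("GeoRef Validation",
         frozenset({"country", "latitude", "longitude"}),
         frozenset({"latitude", "longitude"})),
    Step("Date Validation",
         frozenset({"eventDate"}), frozenset({"eventDate"})),
]
print(use_before_update(workflow))
print(parallel_stages(workflow))
```

With these declared field sets the three validators touch disjoint fields, so the sketch places them in a single stage. Appending a step that rewrites `country` would be reported as a use-before-update issue against the georeference step.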
