1. KURATOR: A Provenance-enabled
Workflow Platform and Toolkit to Curate
Biodiversity Data
Bertram Ludäscher
Graduate School of Library and Information Science (GSLIS)
National Center for Supercomputing Applications (NCSA)
2. Outline
• Kurator:
– What problems is Kurator tackling and for whom?
– Curation Workflow Example
– How we’re going about it
• Not Today:
– Related Biodiversity Informatics Projects
• Filtered-Push
• Exploring Taxon Concepts (ETC)
• Euler
– Other Informatics Projects
• DataONE
• SKOPE
ERRT @ GSLIS 10/22/2014
3. What is Kurator?
• NSF-DBI #1356751
– Collaborative Research: ABI Development:
Kurator: A Provenance-enabled Workflow Platform
and Toolkit to Curate Biodiversity Data
– Sept. 2014 – 2017
– @Illinois:
• B. Ludäscher, James Macklin, Tim McPhillips, …
– @Harvard:
• James Hanken, Paul Morris, Bob Morris, …
4. Problem: Data & Metadata Quality
• Collections & occurrence data is all over the map
– … literally (off the map!)
• Issues:
– Lat/Long transposition, coordinate & projection issues
– Scientific names (spelling errors, other)
– Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Precursor:
– Filtered-Push Collaboration
5. What problems does Kurator try to solve?
• Detect and flag data quality issues
• Repair if possible
• Keep track of provenance
– automatic repairs
– human curator edits
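The detect / repair / track-provenance cycle above can be sketched for one of the issues named earlier, lat/long transposition. This is a minimal illustration, not Kurator code: the `Record` class, the `check_coordinates` function, and the provenance entry layout are all hypothetical. It also assumes a swap is only detectable when it pushes latitude outside the valid range.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical occurrence record with room for flags and provenance."""
    lat: float
    lon: float
    flags: list = field(default_factory=list)
    provenance: list = field(default_factory=list)


def check_coordinates(rec, actor="auto:coordinate-check"):
    """Flag out-of-range coordinates and repair a likely lat/long swap."""
    if -90 <= rec.lat <= 90 and -180 <= rec.lon <= 180:
        return rec  # nothing to flag

    # Detect and flag the quality issue.
    rec.flags.append("coordinates out of range")

    # Repair if possible: swapping the values yields a valid pair.
    if -90 <= rec.lon <= 90 and -180 <= rec.lat <= 180:
        before = (rec.lat, rec.lon)
        rec.lat, rec.lon = rec.lon, rec.lat
        # Keep track of provenance: who changed what, from which values.
        rec.provenance.append({
            "actor": actor,
            "action": "swapped lat/long",
            "before": before,
            "after": (rec.lat, rec.lon),
        })
    return rec


repaired = check_coordinates(Record(lat=120.5, lon=42.3))
print(repaired.lat, repaired.lon, repaired.provenance)
```

A human curator edit could be recorded the same way by passing a different `actor` value, so automatic repairs and manual edits end up in one provenance trail.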
6. Who are the customers?
• Collection Managers
– … who are managing the collections databases
– Can run curation workflows periodically
• … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
– To perform an analysis in the presence of (partially) dirty data, researchers need to
• Clean or fix dirty data
• Throw out unfixable data
– Push changes back to the original data collections and collection managers (cf. Filtered-Push)
8. Simplified Example Workflow
• Related Research (Tianhong Song, UC Davis)
– Analyze linear workflow “story”
– Use patterns to discover workflow design issues (e.g. use before update); then fix them
– Parallelize when possible
• Kurator:
– Allow easy assembly of such workflows
– For tool makers
– … and tool users
– … scalability challenge.
11. How we do it
• Build a library of curation services such that curation workflows can be run from various platforms
– Scientific workflow systems
• e.g. Restflow, Kepler, Taverna, Galaxy
– Other platforms
• e.g. Akka, Python-based, …
• … leveraging existing technologies
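One way to read "a library of curation services" is as plain functions with no dependency on any particular workflow engine, so that a thin wrapper per platform can call them. The sketch below illustrates that idea for scientific-name checking. The function name, the status labels, and the tiny `KNOWN_NAMES` list are assumptions made for illustration; a real service would consult a taxonomic name authority rather than a hard-coded list.

```python
import difflib

# Hypothetical authority list, for illustration only.
KNOWN_NAMES = ["Quercus alba", "Quercus rubra", "Acer saccharum"]


def validate_scientific_name(name, known=KNOWN_NAMES, cutoff=0.85):
    """Return (status, suggestion) for a scientific name.

    status is "valid", "corrected" (a close match was found),
    or "unknown" (nothing close enough to suggest).
    """
    cleaned = " ".join(name.split())  # normalize stray whitespace
    if cleaned in known:
        return ("valid", cleaned)
    # Fuzzy match against the known names to catch spelling errors.
    matches = difflib.get_close_matches(cleaned, known, n=1, cutoff=cutoff)
    if matches:
        return ("corrected", matches[0])
    return ("unknown", None)


for candidate in ["Quercus alba", "Quercus albba", "Homo sapiens"]:
    print(candidate, "->", validate_scientific_name(candidate))
```

Because the function takes and returns plain values, the same code could in principle sit behind a workflow actor, a script, or a web service without modification.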
12. How we do it
• Open source, community-friendly approach
– git repository (NCSA open source projects)
• Agile software development
– NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
– Small bioinformatics tools manifesto (post-facto)
– Unix tenets (small, interoperable tools, … )
– Experience with other (sometimes not so agile)
development projects
14. Q & A …
• What do data curation and quality control mean in your domain / application / research?
• Are there particular issues that are important to you?
• Join us!
– Kurator & other Biodiversity Interest
• Hackers welcome, too.
– Email: ludaesch@illinois.edu
15. Related Research (Tianhong Song)
• Automated Design, Analysis, Optimization of
Curation Workflows.
• Idea:
• Example Workflow
[Scientific Name Validation] [GeoRef Validation] [Date Validation]
16. Related Research (Tianhong Song)
• Analyze linear workflow “story”
• Use patterns to discover workflow design issues (e.g. use before update); then fix them
• Parallelize when possible
• Parallelize when possible
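The analysis described above can be illustrated with a small sketch. It is not the method from the related research, only one plausible reading of it: each step of a linear workflow declares the record fields it reads and writes, a "use before update" issue is reported when a step reads a field that a later step updates, and steps with no data conflicts are grouped into stages that could run in parallel. The `Step` type, the function names, and the field names are hypothetical.

```python
from collections import namedtuple

# A step declares which record fields it reads and which it writes.
Step = namedtuple("Step", "name reads writes")


def use_before_update(steps):
    """List (reader, updater, field) where a field is read before a later update."""
    issues = []
    for i, early in enumerate(steps):
        for late in steps[i + 1:]:
            for fld in sorted(early.reads & late.writes):
                issues.append((early.name, late.name, fld))
    return issues


def conflicts(a, b):
    """Two steps conflict if either one writes a field the other touches."""
    return bool(a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def parallel_stages(steps):
    """Group steps into stages; steps within one stage do not conflict."""
    level = {}
    for i, step in enumerate(steps):
        deps = [level[p.name] for p in steps[:i] if conflicts(p, step)]
        level[step.name] = 1 + max(deps) if deps else 0
    stages = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for step in steps:
        stages[level[step.name]].append(step.name)
    return stages


# The three validators from the example workflow, with assumed field names.
workflow = [
    Step("Scientific Name Validation", {"scientificName"}, {"scientificName"}),
    Step("GeoRef Validation", {"latitude", "longitude"}, {"latitude", "longitude"}),
    Step("Date Validation", {"eventDate"}, {"eventDate"}),
]
print(use_before_update(workflow))
print(parallel_stages(workflow))
```

Under these assumed field sets the three validators touch disjoint fields, so the sketch places them in a single stage. A step that read `latitude` before the georeference validator ran would be reported as a use-before-update issue and kept in an earlier stage.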