Workflow Support for Continuous Data Quality 
Control in a FilteredPush Network 
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips 
P. Morris, B. Morris, T. Song
Problem: Data & Metadata Quality 
• Collections & occurrence data 
… is all over the map 
… literally (off the map!) 
• DQ Issues, e.g., … 
– Lat/long transposition, coordinate & projection issues
– Scientific names (spelling errors, other)
– Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Related Projects: 
– Filtered-Push 
– Kurator 
What problems are we trying to solve? 
• Detect and flag data quality issues 
• Repair if possible 
– … ask human curators as needed 
• Keep track of provenance 
– automatic repairs 
– human curators’ edits 
• Employ workflow (semi-)automation 
– Scientific workflow systems: 
• Kepler/COMAD, RestFlow, Galaxy, BioVeL/Taverna, Argo, VisTrails, …
– Related technologies
• Akka parallel execution platform
• Script-based automation (e.g. Python) and digital notebooks (IPython)
Data Curation Workflow 
Dou, L., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science 9:1614-1619. doi:10.1016/j.procs.2012.04.177
Customers of Curation Workflows 
• Collection Managers 
– … who are managing the collections databases 
– Can run curation workflows periodically 
• … in the presence of new data and/or new curation services 
• (Biodiversity) Researchers 
– To perform an analysis in the presence of (partially) dirty data, researchers need to
• Clean or fix dirty data 
• Throw out unfixable data 
– Reporting back to the collection managers (cf. FPush) 
Filtered Push 
http://xkcd.com/386/ 
(1) Kvetch about data 
(2) Push to interested parties 
(3) Human Filter 
(4) Change data in databases
(5) Store all assertions
Source: Paul J. Morris 
[Figure: embedded slide (sections: Introduction, NEVP Digitization, NEVP Data Flow, Annotations, Duplicates, Quality Control). Labels: Symbiota Instance; Symbiota Instance & DB; Akka curation workflow on FP2, working on DW; spreadsheet reports]
Source: Paul J. Morris
Overall Dataflow 
[Figure labels: Access Point; Symbiota Portal; FilteredPush Node; Akka Kurator Workflows; Occurrence Records; Quality Control Annotations; Quality Control Workflow; Quality Controlled Data Set]
Source: Paul J. Morris 
Example Curation Workflow … 
• Load Dataset 
• Scientific Name Validation 
• Georeference Validation 
• Collection Date Validation 
• [Create Annotations into FPush Network] 
• Output results 
– translate to spreadsheet 
– with provenance! 
	 
some steps of a larger workflow
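The steps listed above can be sketched as a chain of small functions over occurrence records. This is a minimal sketch under stated assumptions, not the Kurator or FilteredPush implementation: the record fields, the two stand-in validators, the `OK`/`FLAGGED` labels, and the helpers `run_workflow` and `to_spreadsheet` are all hypothetical.

```python
import csv
import io

# Hypothetical stand-in validators: each takes a record (dict) and returns
# a (status, comment) pair. Real steps would consult name, georeference,
# and date services; these only check whether a value is present.
def validate_scientific_name(record):
    if record.get("scientificName", "").strip():
        return "OK", "name present"
    return "FLAGGED", "scientificName is empty"

def validate_collection_date(record):
    if record.get("eventDate", "").strip():
        return "OK", "date present"
    return "FLAGGED", "eventDate is empty"

STEPS = [
    ("ScientificNameValidation", validate_scientific_name),
    ("CollectionDateValidation", validate_collection_date),
]

def run_workflow(records):
    """Apply every step to every record; keep one provenance row per check."""
    rows = []
    for record in records:
        for step_name, step in STEPS:
            status, comment = step(record)
            rows.append({
                "recordId": record["id"],
                "step": step_name,
                "status": status,
                "comment": comment,
            })
    return rows

def to_spreadsheet(rows):
    """Serialize the provenance rows as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["recordId", "step", "status", "comment"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Made-up input records for illustration.
records = [
    {"id": "1", "scientificName": "Quercus alba", "eventDate": "1901-05-02"},
    {"id": "2", "scientificName": "", "eventDate": "1899-07-14"},
]
report = run_workflow(records)
print(to_spreadsheet(report))
```

Keeping each step a plain function with the same signature is one way to let the same checks run under different workflow systems.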
… Curation Workflow Output … 
… close up … 
• CORRECT 
– Checked and OK 
• CURATED: 
– Checked and fixed 
• UNABLE_CURATE 
– Internally inconsistent 
– cannot fix 
• UNABLED_DET_VALIDITY 
– Not enough data: 
• No external reference found 
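The four outcome labels above could be modeled as an enumeration. In this sketch the member names are spelled exactly as on the slide, while the `classify` function and its decision order are assumptions made for illustration only.

```python
from enum import Enum

class CurationStatus(Enum):
    # Names are spelled exactly as on the slide; values paraphrase its glosses.
    CORRECT = "checked and OK"
    CURATED = "checked and fixed"
    UNABLE_CURATE = "internally inconsistent, cannot fix"
    UNABLED_DET_VALIDITY = "not enough data, no external reference found"

def classify(original, reference, consistent=True):
    """Hypothetical decision rule mapping one check onto a status.

    original   -- the value found in the record
    reference  -- the value an external source suggests, or None if none found
    consistent -- False when the record contradicts itself
    """
    if not consistent:
        return CurationStatus.UNABLE_CURATE
    if reference is None:
        return CurationStatus.UNABLED_DET_VALIDITY
    if original == reference:
        return CurationStatus.CORRECT
    return CurationStatus.CURATED

print(classify("Quercus albba", "Quercus alba").name)
```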
… even more close: Spreadsheet Provenance 
• Assertions made 
– Sign-changed coordinates are on the Earth's surface
– Coordinates not inside country
– Transposed/sign-changed coordinates to place inside country
– Transposed/sign-changed coordinates are near georeference of locality from GeoLocate
• Sources used 
– Land data from Natural Earth 
– Country boundary data from GeoCommunity 
– GeoLocate 
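One way to carry the assertions and sources alongside each change is a small record type that flattens into a spreadsheet row. This is only a sketch: the `ProvenanceEntry` class, its field names, the identifier, and the coordinate values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceEntry:
    record_id: str
    field_name: str
    original_value: str
    curated_value: str
    assertions: list = field(default_factory=list)
    sources: list = field(default_factory=list)

    def as_row(self):
        """Flatten the entry into one spreadsheet row (a plain dict)."""
        return {
            "recordId": self.record_id,
            "field": self.field_name,
            "original": self.original_value,
            "curated": self.curated_value,
            "assertions": " | ".join(self.assertions),
            "sources": " | ".join(self.sources),
        }

entry = ProvenanceEntry(
    record_id="occ-001",                 # made-up identifier
    field_name="latitude/longitude",
    original_value="42.0, 71.0",         # made-up coordinates
    curated_value="42.0, -71.0",
)
entry.assertions.append("Coordinates not inside country")
entry.assertions.append("Sign changed to place coordinates inside country")
entry.sources.append("Country boundary data from GeoCommunity")
print(entry.as_row())
```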
Date Validation 
• Check: 
– Collector’s life span vs. Date-Collected
• Possible outcomes: 
– Valid 
– Corrected 
– Unable to validate 
• Internal inconsistency 
– Contradicting dates 
• External inconsistency 
– Lack of date data 
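The life-span check can be sketched as follows. The function name, argument names, and comment strings are assumptions. The slide does not say how a date gets corrected, so this sketch never returns "Corrected".

```python
from datetime import date

def validate_date_collected(collected, born, died):
    """Compare a collection date against the collector's life span.

    Returns an (outcome, comment) pair. Any argument may be None when
    the corresponding date is unknown.
    """
    if collected is None or born is None or died is None:
        return "Unable to validate", "lack of date data"
    if born > died:
        return "Unable to validate", "contradicting dates: birth after death"
    if born <= collected <= died:
        return "Valid", "date collected falls within collector's life span"
    return "Unable to validate", "date collected falls outside collector's life span"

# Made-up dates for illustration.
print(validate_date_collected(date(1901, 5, 2), date(1860, 1, 1), date(1930, 1, 1)))
print(validate_date_collected(date(1950, 5, 2), date(1860, 1, 1), date(1930, 1, 1)))
```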
The Logic Behind Each Step … 
• Date Collected 
– … collector’s lifetime vs. date collected
• Georeference Validation 
– Lat/long valid (on Earth) 
– … within a country (shape file), point in polygon 
– If georef is “bad” then try 
• … transpositions, sign-swapping, etc. of lat/long
• If they match => fix it!
• Make sure to record in provenance
• Using the transposed (or sign-fixed) original data (not the GeoLocate)
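The georeference logic above can be sketched with a ray-casting point-in-polygon test plus a list of candidate fixes. Everything here is an assumption made for illustration: the rectangular "country" polygon, the function names, and the order in which candidates are tried. The comparison against a GeoLocate georeference is left out.

```python
def on_earth(lat, lon):
    """True if the pair is a valid latitude/longitude."""
    return -90 <= lat <= 90 and -180 <= lon <= 180

def point_in_polygon(lon, lat, polygon):
    """Ray casting: count edge crossings of a ray running east from the point.

    polygon is a list of (lon, lat) vertices.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def candidate_fixes(lat, lon):
    """Yield (note, (lat, lon)) pairs for sign changes and transpositions."""
    yield "sign of latitude changed", (-lat, lon)
    yield "sign of longitude changed", (lat, -lon)
    yield "both signs changed", (-lat, -lon)
    yield "transposed", (lon, lat)
    yield "transposed, sign of latitude changed", (-lon, lat)
    yield "transposed, sign of longitude changed", (lon, -lat)
    yield "transposed, both signs changed", (-lon, -lat)

def validate_georeference(lat, lon, country_polygon):
    """Return (status, (lat, lon), note); the note belongs in provenance."""
    if on_earth(lat, lon) and point_in_polygon(lon, lat, country_polygon):
        return "CORRECT", (lat, lon), "coordinates inside country"
    for note, (new_lat, new_lon) in candidate_fixes(lat, lon):
        if on_earth(new_lat, new_lon) and point_in_polygon(
            new_lon, new_lat, country_polygon
        ):
            return "CURATED", (new_lat, new_lon), note + " to place inside country"
    return "UNABLE_CURATE", (lat, lon), "no candidate places the point inside country"

# Hypothetical country outline: a rectangle, lon -10..-5, lat 40..45.
COUNTRY = [(-10, 40), (-5, 40), (-5, 45), (-10, 45)]
print(validate_georeference(42, 7, COUNTRY))
print(validate_georeference(-7, 42, COUNTRY))
```

The repaired value is derived from the record's own numbers, which matches the note above about using the transposed or sign-fixed original data.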
… Logic Behind Each Step (cont’d) 
• Scientific Name Validation 
– Customer-dependent: 
• Collection Managers: 
– Nomenclature 
• Researchers: 
– Taxonomy (current names) 
– Several Remote services 
• IPNI, GNI, … 
• …. <your logic here> … 
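A name check might look roughly like the following. A small in-memory list stands in for the remote services, so this is not how IPNI or GNI are queried. The reference names, the `validate_name` function, and the 0.85 similarity cutoff are assumptions. Fuzzy matching uses `difflib` from the Python standard library.

```python
import difflib

# Hypothetical local reference list standing in for a remote name service.
REFERENCE_NAMES = ["Quercus alba", "Quercus rubra", "Acer saccharum"]

def validate_name(name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Return (status, name) using exact match, then fuzzy match."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in reference:
        return "CORRECT", cleaned
    matches = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    if matches:
        return "CURATED", matches[0]
    # Status label spelled as on the earlier slide: no external reference found.
    return "UNABLED_DET_VALIDITY", cleaned

print(validate_name("Quercus albba"))
print(validate_name("Pinus strobus"))
```

Swapping the reference source (nomenclatural list versus current taxonomy) is where the customer-dependent behavior would enter.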
Curation Workflow Challenges: 
Machine Cycles 
• Scalability & Technology Issues: 
– Clean aggregated data at a FP Node 
• Headless 
• Use of Kepler/COMAD, pros & cons: 
– OK on human cycles, but NOT OK on machine cycles 
• Akka 
– Parallelize remote service invocation: helps 
– Non-trivial programming 
• => add another layer on top of Akka 
• .. or … ?? <tell us about your technology!> 
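The benefit of parallelizing remote service invocation can be illustrated without Akka. The sketch below swaps in a Python thread pool from `concurrent.futures`, which is a different technology from Akka's actor model. The `lookup_name` function is a hypothetical stand-in that sleeps instead of calling a real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def lookup_name(name):
    """Hypothetical stand-in for one remote name-service call."""
    time.sleep(0.05)  # simulate network latency
    looks_binomial = len(name.split()) == 2
    return name, looks_binomial

def validate_all(names, max_workers=8):
    """Issue the lookups concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup_name, names))

names = ["Quercus alba", "Acer", "Pinus strobus", "Betula papyrifera"]
start = time.perf_counter()
results = validate_all(names)
elapsed = time.perf_counter() - start
print(results)
print(f"{len(names)} lookups in {elapsed:.2f}s")
```

Because the calls spend most of their time waiting on the network, threads overlap the waiting, and the total time approaches that of a single call rather than the sum of all calls.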
Challenges: Human Cycles 
• New Kurator project: 
– Enable tool makers 
– Make it easy to build 
• components (software “actors”, services) 
• workflows (gluing services together) 
• Data Curation Workflows Interest Group !? 
– Service builders 
– Service & Workflow Registries 
• cf. myExperiment 
– Service aggregators 
• cf. BioVeL, DwC validator, …
What is Kurator? 
• NSF-DBI #1356751 
– Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
– Sept. 2014 – 2017 
– @Illinois: 
• Bertram Ludäscher, James Macklin, Tim McPhillips, …
– @Harvard: 
• James Hanken, Paul Morris, Bob Morris, … 
– @TDWG community 
• <your name here> 
Kurator Tenets 
• Technology Agnostic 
– … to the extent we can … 
– … avoid reinventing the wheel 
– … one size probably doesn’t fit all 
=> Deploy curation steps on different wf systems, platforms 
• For Tool Makers 
• Agile, Community-Driven Development 
• Kurator just started, evolving 
– Get involved now! 
– Kick-off meeting November 17 & 18 
• @ NCSA (University of Illinois, Urbana-Champaign) 
How we do it 
• Build a library of curation services such that curation workflows can be run from various platforms
– Scientific workflow systems
• e.g. RestFlow, Kepler, Taverna, Galaxy
– Other platforms 
• e.g. Akka, Python-based, … 
• … leveraging existing technologies 
How we do it 
• Open source, community-friendly approach 
– git repository (NCSA open source projects) 
• Agile software development 
– NCSA support tools, e.g. JIRA, Bamboo 
• Inspired by 
– Small bioinformatics tools manifesto (post-facto) 
• cf. Unix tenets (small tools, use filters, pipes, … KISS!) 
– Experience with other (sometimes not so agile) development projects
Agile Kurator Development 
Interested in looking under the hood? 
Kurator/Akka curation wf demo: 
Wed PM 
Initial URL: 
opensource.ncsa.illinois.edu/projects/KURATOR
Related Research (Tianhong Song, UC Davis) 
• Analyze linear workflow “story”
• Use patterns to discover wf design issues (e.g. use before update); then fix them
• Parallelize when possible 
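One plausible reading of "use before update" is that an earlier step reads a field which a later step rewrites, so the earlier step worked from a stale value. The sketch below detects that pattern in a linear list of steps. The step names, the read/write sets, and the function `find_use_before_update` are hypothetical and are not taken from the research being summarized.

```python
def find_use_before_update(steps):
    """Find fields read by an earlier step and written by a later one.

    steps is a list of (name, reads, writes) tuples in execution order,
    where reads and writes are sets of field names.
    Returns (reader, writer, field) triples.
    """
    issues = []
    for i, (reader, reads, _) in enumerate(steps):
        for writer, _, writes in steps[i + 1:]:
            for field_name in sorted(reads & writes):
                issues.append((reader, writer, field_name))
    return issues

# Made-up linear workflow: the georeference check reads "country"
# before a later step cleans that same field.
workflow = [
    ("GeoreferenceValidation", {"country", "lat", "lon"}, {"lat", "lon"}),
    ("CountryNameCleanup", {"country"}, {"country"}),
    ("DateValidation", {"eventDate"}, {"eventDate"}),
]
print(find_use_before_update(workflow))
```

The same read and write sets also show which steps share no fields (here `DateValidation` touches nothing the others use), which is the kind of information a parallelization pass would need.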
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 
From Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable ProvenanceFrom Provenance Standards and Tools to Queries and Actionable Provenance
From Provenance Standards and Tools to Queries and Actionable Provenance
 
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligionWild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
Wild Ideas at TDWG'17: Embrace multiple possible worlds; abandon techno-ligion
 

Tdwg14 fp-kurator-ludaescher

  • 1. Workflow Support for Continuous Data Quality Control in a FilteredPush Network. J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
  • 2. Problem: Data & Metadata Quality • Collections & occurrence data … is all over the map … literally (off the map!) • DQ Issues, e.g., … – Lat/Long transposition, coordinate & projection issues – Scientific Names (spelling errors, other) – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it) • Related Projects: – Filtered-Push – Kurator
  • 3. What problems are we trying to solve? • Detect and flag data quality issues • Repair if possible – … ask human curators as needed • Keep track of provenance – automatic repairs – human curators’ edits • Employ workflow (semi-)automation – Scientific workflow systems: • Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, … – Related technologies • Akka parallel execution platform • Script-based automation (e.g. Python) and digital notebooks (iPython)
  • 4. Data Curation Workflow. Dou, Lei, G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177
  • 5. Customers of Curation Workflows • Collection Managers – … who are managing the collections databases – Can run curation workflows periodically • … in the presence of new data and/or new curation services • (Biodiversity) Researchers – To perform an analysis in the presence of (partially) dirty data, researchers need to • Clean or fix dirty data • Throw out unfixable data – Reporting back to the collection managers (cf. FPush)
  • 6. Filtered Push http://xkcd.com/386/ (1) Kvetch about data (2) Push to interested parties (3) Human Filter (4) Change data in databases (5) Store all assertions Source: Paul J. Morris
  • 7. Introduction NEVP Digitization NEVP Data Flow Annotations Duplicates Quality Control Symbiota Instance Symbiota Instance & DB Akka curation workflow on FP2, working on DW spreadsheet reports Source: Paul J. Morris
  • 8. Overall Dataflow Access Point Symbiota Portal FilteredPush Node Akka Kurator Workflows Occurrence Records Quality Control Annotations Quality Control Workflow Quality Controlled Data Set Source: Paul J. Morris
  • 9. Example Curation Workflow … (some steps of a larger workflow) • Load Dataset • Scientific Name Validation • Georeference Validation • Collection Date Validation • [Create Annotations into FPush Network] • Output results – translate to spreadsheet – with provenance!
  • 10. … Curation Workflow Output …
  • 11. … close up … • CORRECT – Checked and OK • CURATED – Checked and fixed • UNABLE_CURATE – Internally inconsistent – cannot fix • UNABLED_DET_VALIDITY – Not enough data: • No external reference found
  • 12. … even closer: Spreadsheet Provenance • Assertions made – Sign-changed coordinates are on the Earth's surface – Coordinates not inside country – Transposed/sign-changed coordinates to place inside country – Transposed/sign-changed coordinates are near georeference of locality from GeoLocate • Sources used – Land data from Natural Earth – Country boundary data from GeoCommunity – GeoLocate
  • 13. Date Validation • Check: – Collector’s life span – … vs. Date-Collected • Possible outcomes: – Valid – Corrected – Unable to validate • Internal inconsistency – Contradicting dates • External inconsistency – Lack of date data
  • 14. The Logic Behind Each Step … • Date Collected – … collector's lifetime vs. date collected • Georeference Validation – Lat/long valid (on Earth) – … within a country (shape file), point in polygon – If georef is “bad” then try • … transpositions, sign-swapping, etc. of lat/long • If they match → fix it! • Make sure to record in provenance • Using the transposed (or sign-fixed) original data (not the GeoLocate georeference)
  • 15. … Logic Behind Each Step (cont’d) • Scientific Name Validation – Customer-dependent: • Collection Managers: – Nomenclature • Researchers: – Taxonomy (current names) – Several Remote services • IPNI, GNI, … • … <your logic here> …
  • 16. Curation Workflow Challenges: Machine Cycles • Scalability & Technology Issues: – Clean aggregated data at a FP Node • Headless • Use of Kepler/COMAD, pros & cons: – OK on human cycles, but NOT OK on machine cycles • Akka – Parallelize remote service invocation: helps – Non-trivial programming • => add another layer on top of Akka • … or … ?? <tell us about your technology!>
  • 17. Challenges: Human Cycles • New Kurator project: – Enable tool makers – Make it easy to build • components (software “actors”, services) • workflows (gluing services together) • Data Curation Workflows Interest Group !? – Service builders – Service & Workflow Registries • cf. myExperiment – Service aggregators • cf. BioVel, DwC validator, …
  • 18. What is Kurator? • NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data – Sept. 2014 – 2017 – @Illinois: • B. Ludäscher, James Macklin, Tim McPhillips, … – @Harvard: • James Hanken, Paul Morris, Bob Morris, … – @TDWG community • <your name here>
  • 19. Kurator Tenets • Technology Agnostic – … to the extent we can … – … avoid reinventing the wheel – … one size probably doesn’t fit all => Deploy curation steps on different wf systems, platforms • For Tool Makers • Agile, Community-Driven Development • Kurator just started, evolving – Get involved now! – Kick-off meeting November 17 & 18 • @ NCSA (University of Illinois, Urbana-Champaign)
  • 20. How we do it • Build a library of curation services such that curation workflows can be run from various platforms – Scientific workflow systems • e.g. Restflow, Kepler, Taverna, Galaxy – Other platforms • e.g. Akka, Python-based, … • … leveraging existing technologies
  • 21. How we do it • Open source, community-friendly approach – git repository (NCSA open source projects) • Agile software development – NCSA support tools, e.g. JIRA, Bamboo • Inspired by – Small bioinformatics tools manifesto (post-facto) • cf. Unix tenets (small tools, use filters, pipes, … KISS!) – Experience with other (sometimes not so agile) development projects
  • 22. Agile Kurator Development. Interested in looking under the hood? Kurator/Akka curation wf demo: Wed PM. Initial URL: opensource.ncsa.illinois.edu/projects/KURATOR
  • 23. Related Research (Tianhong Song, UC Davis) • Analyze linear workflow “story” • Use patterns to discover wf design issues (e.g. use before update); then fix them • Parallelize when possible
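
The date check on slide 13 (collector's life span vs. date collected) might be sketched roughly as follows. This is a minimal illustration, not the Kurator implementation: the function name, the enum, and the reason strings are assumptions made for the example, and the "Corrected" outcome is left out because the slides do not say how a correction is derived.

```python
from datetime import date
from enum import Enum
from typing import Optional, Tuple


class DateStatus(Enum):
    # Outcome labels taken from slide 13; "Corrected" is omitted here.
    VALID = "Valid"
    UNABLE_TO_VALIDATE = "Unable to validate"


def check_date_collected(
    collected: Optional[date],
    born: Optional[date],
    died: Optional[date],
) -> Tuple[DateStatus, str]:
    """Compare a collecting date with the collector's life span (hypothetical helper)."""
    # External inconsistency: some date needed for the comparison is missing.
    if collected is None or born is None or died is None:
        return DateStatus.UNABLE_TO_VALIDATE, "lack of date data"
    # The collecting date lies inside the collector's life span.
    if born <= collected <= died:
        return DateStatus.VALID, "date collected is within collector's life span"
    # Internal inconsistency: the dates contradict each other.
    return DateStatus.UNABLE_TO_VALIDATE, "contradicting dates"
```

Returning a reason string next to the status keeps the explanation available for a provenance column in the output spreadsheet.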
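
The georeference logic on slide 14 (on-Earth check, point in polygon, then transposition and sign-swapping, with provenance) could look something like the sketch below. It rests on several assumptions: the country is a simple list of (lat, lon) vertices rather than a real shape file, the function names are invented for illustration, and the comparison against a GeoLocate georeference is left out. The status strings follow slide 11.

```python
from itertools import product
from typing import Iterator, List, Tuple

Point = Tuple[float, float]  # (lat, lon)


def on_earth(lat: float, lon: float) -> bool:
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def point_in_polygon(lat: float, lon: float, polygon: List[Point]) -> bool:
    """Ray-casting test against a simple polygon of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def candidate_fixes(lat: float, lon: float) -> Iterator[Tuple[str, float, float]]:
    """Yield transposed and/or sign-changed variants, skipping the original."""
    for transposed, lat_sign, lon_sign in product((False, True), (1, -1), (1, -1)):
        if not transposed and lat_sign == 1 and lon_sign == 1:
            continue
        a, b = (lon, lat) if transposed else (lat, lon)
        parts = []
        if transposed:
            parts.append("transposed")
        if lat_sign == -1 or lon_sign == -1:
            parts.append("sign changed")
        yield "/".join(parts), lat_sign * a, lon_sign * b


def validate_georeference(lat: float, lon: float, country: List[Point]):
    """Return (status, (lat, lon), provenance) for one record."""
    provenance = []
    if on_earth(lat, lon) and point_in_polygon(lat, lon, country):
        provenance.append("coordinates are inside country")
        return "CORRECT", (lat, lon), provenance
    provenance.append("coordinates not inside country")
    for label, new_lat, new_lon in candidate_fixes(lat, lon):
        if on_earth(new_lat, new_lon) and point_in_polygon(new_lat, new_lon, country):
            provenance.append(f"{label} coordinates to place inside country")
            return "CURATED", (new_lat, new_lon), provenance
    return "UNABLE_CURATE", (lat, lon), provenance


# Hypothetical rectangular "country" used only for illustration.
BOX = [(40.0, -75.0), (40.0, -70.0), (45.0, -70.0), (45.0, -75.0)]
```

With the illustrative `BOX`, a record entered as (-71.1, 42.4) is repaired by transposition, and the provenance list records both the failed check and the fix that was applied.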
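
Slide 16 notes that parallelizing remote service invocation helps. The idea can be illustrated with a plain Python thread pool, which is only a stand-in here and not Akka. The `lookup_name` function and the `KNOWN_NAMES` set are hypothetical placeholders for a call to a remote name service such as IPNI or GNI.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical local stand-in for a remote name service.
KNOWN_NAMES = {"Quercus alba", "Acer rubrum"}


def lookup_name(name: str) -> bool:
    """Pretend remote lookup: True if the service knows the name."""
    return name in KNOWN_NAMES


def validate_names(names: List[str], max_workers: int = 4) -> Dict[str, bool]:
    """Run the lookups concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        found = list(pool.map(lookup_name, names))
    return dict(zip(names, found))
```

Because the work is dominated by waiting on remote services, a small pool of threads is usually enough to overlap the requests; an actor framework adds supervision and routing on top of the same basic idea.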