Professor at University of Illinois at Urbana-Champaign
Oct. 28, 2014
Tdwg14 fp-kurator-ludaescher
Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
Presentation given at TDWG 2014
Jönköping, Sweden
1. Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips
P. Morris, B. Morris, T. Song
2. Problem: Data & Metadata Quality
• Collections & occurrence data
… is all over the map
… literally (off the map!)
• DQ Issues, e.g., …
– Lat/Long transposition, coordinate & projection issues
– Scientific Names (spelling errors, other)
– Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Related Projects:
– Filtered-Push
– Kurator
3. What problems are we trying to solve?
• Detect and flag data quality issues
• Repair if possible
– … ask human curators as needed
• Keep track of provenance
– automatic repairs
– human curators’ edits
• Employ workflow (semi-)automation
– Scientific workflow systems:
• Kepler/COMAD, RestFlow, Galaxy, BioVeL/Taverna, Argo, VisTrails, …
– Related technologies
• Akka parallel execution platform
• Script-based automation (e.g. Python) and digital notebooks (IPython)
4. Data Curation Workflow
Dou, L., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science, 9:1614-1619. doi:10.1016/j.procs.2012.04.177
5. Customers of Curation Workflows
• Collection Managers
– … who are managing the collections databases
– Can run curation workflows periodically
• … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
– To perform an analysis in the presence of (partially) dirty data, researchers need to
• Clean or fix dirty data
• Throw out unfixable data
– Reporting back to the collection managers (cf. FPush)
6. Filtered Push
http://xkcd.com/386/
(1) Kvetch about data
(2) Push to interested parties
(3) Human Filter
(4) Change data in databases
(5) Store all assertions
Source: Paul J. Morris
7. [Figure; embedded slide sections: Introduction, NEVP Digitization, NEVP Data Flow, Annotations, Duplicates, Quality Control]
Symbiota Instance
Symbiota Instance & DB
Akka curation workflow
on FP2, working on DW
spreadsheet reports
Source: Paul J. Morris
8. Overall Dataflow
[Diagram labels: Access Point; Symbiota Portal; FilteredPush Node; Akka Kurator Workflows; Occurrence Records; Quality Control Annotations; Quality Control Workflow; Quality Controlled Data Set]
Source: Paul J. Morris
9. Example Curation Workflow …
• Load Dataset
• Scientific Name Validation
• Georeference Validation
• Collection Date Validation
• [Create Annotations into FPush Network]
• Output results
– translate to spreadsheet
– with provenance!
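The steps above can be sketched as a small script. This is only an illustration, not the Kurator or FilteredPush implementation: the three check functions are hypothetical placeholders, the field names (scientificName, decimalLatitude, decimalLongitude, eventDate) are assumed Darwin Core-style column names, and the optional annotation step is left out. Each step returns a provenance note that ends up as one row of a CSV "spreadsheet".

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]

def check_scientific_name(record: Record) -> str:
    # Placeholder: only tests presence; a real step would consult a name service.
    return "scientificName present" if record.get("scientificName") else "scientificName missing"

def check_georeference(record: Record) -> str:
    # Placeholder: only tests that the coordinates are numeric and on Earth.
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        return "coordinates missing or not numeric"
    on_earth = -90 <= lat <= 90 and -180 <= lon <= 180
    return "coordinates on Earth" if on_earth else "coordinates off the map"

def check_collection_date(record: Record) -> str:
    # Placeholder: only tests presence of a date string.
    return "eventDate present" if record.get("eventDate") else "eventDate missing"

# Step names follow the slide; the functions behind them are assumptions.
STEPS: List[Tuple[str, Callable[[Record], str]]] = [
    ("Scientific Name Validation", check_scientific_name),
    ("Georeference Validation", check_georeference),
    ("Collection Date Validation", check_collection_date),
]

def run_workflow(records: List[Record]) -> str:
    """Run every step on every record and return the provenance as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "step", "note"])
    for record in records:
        for name, step in STEPS:
            writer.writerow([record.get("id", ""), name, step(record)])
    return out.getvalue()
```

Calling run_workflow on a list of dictionaries returns text that can be saved as a .csv file and opened in a spreadsheet, with one provenance row per record and step.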
some steps of a larger workflow
11. … close up …
• CORRECT
– Checked and OK
• CURATED:
– Checked and fixed
• UNABLE_CURATE
– Internally inconsistent
– cannot fix
• UNABLED_DET_VALIDITY
– Not enough data:
• No external reference found
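One way to carry these four outcomes through a script is an enumeration. The sketch below copies the status names exactly as they appear on the slide; the needs_human_review helper is a hypothetical addition, assuming that the two "unable" outcomes are the ones to hand to a human curator.

```python
from enum import Enum

class CurationStatus(Enum):
    # Names as shown on the slide; values paraphrase the slide's explanations.
    CORRECT = "Checked and OK"
    CURATED = "Checked and fixed"
    UNABLE_CURATE = "Internally inconsistent, cannot fix"
    UNABLED_DET_VALIDITY = "Not enough data, no external reference found"

def needs_human_review(status: CurationStatus) -> bool:
    """Hypothetical helper: flag records the workflow could not settle on its own."""
    return status in (CurationStatus.UNABLE_CURATE,
                      CurationStatus.UNABLED_DET_VALIDITY)
```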
12. … even more close: Spreadsheet Provenance
• Assertions made
– Sign-changed coordinates are on the Earth's surface
– Coordinates not inside country
– Transposed/sign-changed coordinates to place inside country
– Transposed/sign-changed coordinates are near georeference of locality from GeoLocate
• Sources used
– Land data from Natural Earth
– Country boundary data from GeoCommunity
– GeoLocate
13. Date Validation
• Check:
– Collector’s life span
– .. vs. Date-Collected
• Possible outcomes:
– Valid
– Corrected
– Unable to validate
• Internal inconsistency
– Contradicting dates
• External inconsistency
– Lack of date data
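A minimal sketch of the life-span check, assuming the collector's birth and death dates have already been looked up. The function name and the wording of the notes are hypothetical. The slide also lists a "Corrected" outcome but does not say how a correction is derived, so this sketch does not attempt one.

```python
from datetime import date
from typing import Optional, Tuple

def validate_date_collected(collected: Optional[date],
                            born: Optional[date],
                            died: Optional[date]) -> Tuple[str, str]:
    """Return (outcome, note) for a date collected versus a collector's life span."""
    if collected is None or born is None or died is None:
        # Nothing to compare against.
        return ("Unable to validate", "lack of date data")
    if born > died:
        # The reference dates contradict each other.
        return ("Unable to validate", "contradicting dates: birth after death")
    if born <= collected <= died:
        return ("Valid", "date collected falls within the collector's life span")
    return ("Unable to validate", "date collected outside the collector's life span")
```

For example, validate_date_collected(date(1900, 5, 1), date(1860, 1, 1), date(1930, 1, 1)) reports "Valid", while a missing date collected is reported as lack of date data.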
14. The Logic Behind Each Step …
• Date Collected
– … collector's lifetime vs. date collected
• Georeference Validation
– Lat/long valid (on Earth)
– … within a country (shape file), point in polygon
– If georef is “bad” then try
• … transpositions, sign-swapping, etc. of lat/long
• If they match, fix it!
• Make sure to record in provenance
• Using the transposed (or sign-fixed) original data (not the GeoLocate)
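The georeference logic above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the country boundary is passed in as a plain list of (lat, lon) vertices instead of being read from a shape file, the point-in-polygon test is a basic ray-casting routine, the status strings reuse the names from the "close up" slide, and the comparison against a GeoLocate result is omitted.

```python
from typing import Iterator, List, Tuple

Point = Tuple[float, float]  # (lat, lon)

def on_earth(lat: float, lon: float) -> bool:
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0

def point_in_polygon(lat: float, lon: float, polygon: List[Point]) -> bool:
    """Ray casting: count polygon edges crossed by a ray from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):  # edge straddles the point's latitude
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside

def candidate_repairs(lat: float, lon: float) -> Iterator[Tuple[str, float, float]]:
    """Yield sign-changed and transposed variants of the original coordinates."""
    yield ("sign of latitude changed", -lat, lon)
    yield ("sign of longitude changed", lat, -lon)
    yield ("signs of latitude and longitude changed", -lat, -lon)
    yield ("latitude and longitude transposed", lon, lat)

def validate_georeference(lat: float, lon: float, country: List[Point]):
    """Return (status, (lat, lon), provenance notes)."""
    provenance: List[str] = []
    if on_earth(lat, lon) and point_in_polygon(lat, lon, country):
        provenance.append("coordinates inside country")
        return "CORRECT", (lat, lon), provenance
    provenance.append("coordinates not inside country")
    for description, new_lat, new_lon in candidate_repairs(lat, lon):
        if on_earth(new_lat, new_lon) and point_in_polygon(new_lat, new_lon, country):
            # Record the repair so the change stays traceable.
            provenance.append(description + " to place inside country")
            return "CURATED", (new_lat, new_lon), provenance
    return "UNABLE_CURATE", (lat, lon), provenance
```

With a toy rectangular "country" such as [(40, -80), (40, -70), (50, -70), (50, -80)], the point (45, 75) is repaired by a longitude sign change, and (-75, 45) by transposing latitude and longitude. In both cases the repaired value is derived from the original data, and the note in the provenance list says which transformation was applied.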
16. Curation Workflow Challenges: Machine Cycles
• Scalability & Technology Issues:
– Clean aggregated data at a FP Node
• Headless
• Use of Kepler/COMAD, pros & cons:
– OK on human cycles, but NOT OK on machine cycles
• Akka
– Parallelize remote service invocation: helps
– Non-trivial programming
• => add another layer on top of Akka
• .. or … ?? <tell us about your technology!>
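To illustrate why parallelizing remote service invocation helps, here is a small Python sketch. It uses the standard-library thread pool, not Akka, and lookup_name is a hypothetical stand-in that sleeps to mimic network latency instead of calling a real name service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

def lookup_name(name: str) -> str:
    """Stand-in for a remote validation call (hypothetical)."""
    time.sleep(0.05)  # pretend to wait on the network
    return name.strip().capitalize()

def validate_all(names: List[str], max_workers: int = 8) -> List[str]:
    """Issue the lookups concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup_name, names))
```

Because each call spends most of its time waiting, running several at once shortens the total wall-clock time roughly in proportion to the number of workers, up to whatever limits the remote service imposes.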
17. Challenges: Human Cycles
• New Kurator project:
– Enable tool makers
– Make it easy to build
• components (software “actors”, services)
• workflows (gluing services together)
• Data Curation Workflows Interest Group !?
– Service builders
– Service & Workflow Registries
• cf. myExperiment
– Service aggregators
• cf. BioVeL, DwC validator, …
18. What is Kurator?
• NSF-DBI #1356751
– Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
– Sept. 2014 – 2017
– @Illinois:
• B. Ludäscher, James Macklin, Tim McPhillips, …
– @Harvard:
• James Hanken, Paul Morris, Bob Morris, …
– @TDWG community
• <your name here>
19. Kurator Tenets
• Technology Agnostic
– … to the extent we can …
– … avoid reinventing the wheel
– … one size probably doesn’t fit all
=> Deploy curation steps on different wf systems, platforms
• For Tool Makers
• Agile, Community-Driven Development
• Kurator just started, evolving
– Get involved now!
– Kick-off meeting November 17 & 18
• @ NCSA (University of Illinois, Urbana-Champaign)
20. How we do it
• Build a library of curation services such that curation workflows can be run from various platforms
– Scientific workflow systems
• e.g. Restflow, Kepler, Taverna, Galaxy
– Other platforms
• e.g. Akka, Python-based, …
• … leveraging existing technologies
21. How we do it
• Open source, community-friendly approach
– git repository (NCSA open source projects)
• Agile software development
– NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
– Small bioinformatics tools manifesto (post-facto)
• cf. Unix tenets (small tools, use filters, pipes, … KISS!)
– Experience with other (sometimes not so agile) development projects
22. Agile Kurator Development
Interested in looking under the hood?
Kurator/Akka curation wf demo:
Wed PM
Initial URL:
opensource.ncsa.illinois.edu/projects/KURATOR
23. Related Research (Tianhong Song, UC Davis)
• Analyze linear workflow “story”
• Use patterns to discover wf design issues (e.g. use before update); then fix them
• Parallelize when possible