Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate Research
1. Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate Research
Eric Stephan, Todd Halter, Brian Ermold
IPAW, 2010
2. Discussion Outline
! Background on the Atmospheric Radiation Measurement (ARM) program
! Challenges without Provenance
! Requirements Analysis
! Multi-Tier Provenance Model
! Use of Open Provenance Model
! Impacts
3. Background
! Atmospheric Radiation Measurement Program
! Production system designed and developed in 1990
! Data is collected from over 300 remote sensors worldwide, expanding to over 400 sensors in 2010
! Data collection will reach over 500 GB/day of atmospheric and satellite data by FY11
! Value added products (VAPs) are developed to correlate, aggregate, and support quality studies of raw data for use in computational models
4. Challenges Facing Current VAP Development
! Causality, Lineage, Referential Knowledge Not Formalized:
! Captured in multiple ways and stored in different media and representation forms
! Sample causality not directly accessible to scientists
! Inability to seamlessly analyze and visualize knowledge
! Provenance Required By Different Audiences
! Producers: Operations/VAP developers
! Consumers: scientists relying on VAPs
5. Requirements Analysis 1 of 2
[Figure: the three tiers of the provenance model]
Value Added Product Lineage (Path): Directed Graph
Workflow Causality (Hedge): Acyclic Graph and Value Added Product Common Properties
Sample Causality (Branch): Ordered Autonomous Acyclic Graphs When Processing Data Product
6. Requirements Analysis 2 of 2
Tier          Purpose    Resources                 Status  Operations  Developer  Researcher
Path          Lineage    N/A                       Future  Needed      Needed     Needed
Path          Curation   Sample Level QC           Exists  In Use      Needed     Needed
Path/Hedge    Reference  Metadata Repository       Exists  In Use      In Use     Needed
Hedge         Reference  Configuration files       Exists  In Use      In Use     Needed
Hedge/Branch  Causality  Log files                 Exists  Needed      In Use     Needed
Hedge/Branch  Derived    Trends/Anomalies          Future  Needed      Needed     Needed
Branch        Causality  Sample Derivation Method  Exists  In Use      Needed     Needed
Branch        Causality  Sample Source             Exists  In Use      Needed     Needed
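The requirements table can be queried once it is held as data. The sketch below is only an illustration: it transcribes the rows above into Python tuples and lists the resources each audience still marks as "Needed". The names `REQUIREMENTS` and `needed_by` are hypothetical and not part of any ARM tooling.

```python
# Rows transcribed from the requirements table:
# (tier, purpose, resource, status, operations, developer, researcher)
REQUIREMENTS = [
    ("Path", "Lineage", "N/A", "Future", "Needed", "Needed", "Needed"),
    ("Path", "Curation", "Sample Level QC", "Exists", "In Use", "Needed", "Needed"),
    ("Path/Hedge", "Reference", "Metadata Repository", "Exists", "In Use", "In Use", "Needed"),
    ("Hedge", "Reference", "Configuration files", "Exists", "In Use", "In Use", "Needed"),
    ("Hedge/Branch", "Causality", "Log files", "Exists", "Needed", "In Use", "Needed"),
    ("Hedge/Branch", "Derived", "Trends/Anomalies", "Future", "Needed", "Needed", "Needed"),
    ("Branch", "Causality", "Sample Derivation Method", "Exists", "In Use", "Needed", "Needed"),
    ("Branch", "Causality", "Sample Source", "Exists", "In Use", "Needed", "Needed"),
]

# Column index of each audience within a row (hypothetical helper mapping).
AUDIENCE_COLUMN = {"operations": 4, "developer": 5, "researcher": 6}


def needed_by(audience):
    """Return (tier, purpose, resource) for rows the audience marks 'Needed'."""
    column = AUDIENCE_COLUMN[audience]
    return [row[:3] for row in REQUIREMENTS if row[column] == "Needed"]


if __name__ == "__main__":
    for audience in AUDIENCE_COLUMN:
        print(audience, len(needed_by(audience)))
```

Read this way, the table shows researchers marking every listed resource as needed, while operations staff already use most of the existing ones.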
7. ARM Provenance Model
! Characteristics
! Knowledge required to depict interdependency, overall processing, and discrete sample processing
! Multi-tier
! Each tier represents a different granularity and purpose
! Each hedge is in the context of a path; each branch is in the context of a hedge
! Declared tiers make it easier to cross-compare knowledge
! Because sample provenance at the branch tier is autonomous and ordered, provenance can be processed in parallel or stored in chunks
! Leverage Standards and Community Efforts
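One way to picture the tiers is as nested containers: a path holds hedges, and each hedge holds an ordered list of branches. The following is a minimal sketch under that assumption, not the ARM implementation. The class names, the VAP and sample identifiers, and the configuration file name are all hypothetical. The edge labels are the five causal dependencies defined by the Open Provenance Model; how ARM maps its data onto them is assumed here.

```python
from dataclasses import dataclass, field

# Causal dependency types defined by the Open Provenance Model.
OPM_EDGES = {"used", "wasGeneratedBy", "wasControlledBy",
             "wasTriggeredBy", "wasDerivedFrom"}


@dataclass
class Branch:
    """High granularity: provenance of one sample as an autonomous graph."""
    sample_id: str
    edges: list = field(default_factory=list)  # (effect, edge_type, cause)

    def add_edge(self, effect, edge_type, cause):
        if edge_type not in OPM_EDGES:
            raise ValueError(f"unknown OPM edge type: {edge_type}")
        self.edges.append((effect, edge_type, cause))


@dataclass
class Hedge:
    """Medium granularity: one VAP workflow and properties shared by its samples."""
    vap_name: str
    common_properties: dict = field(default_factory=dict)
    branches: list = field(default_factory=list)  # kept in processing order


@dataclass
class Path:
    """Low granularity: lineage between VAPs, plus the hedges it gives context to."""
    lineage: list = field(default_factory=list)  # (derived_vap, source_vap)
    hedges: dict = field(default_factory=dict)   # vap_name -> Hedge


def build_example():
    """Build a tiny path with one hedge and three sample branches."""
    path = Path()
    path.lineage.append(("vap_b", "vap_a"))  # hypothetical VAP names
    hedge = Hedge("vap_b", {"config": "vap_b.conf"})
    path.hedges[hedge.vap_name] = hedge
    for i in range(3):
        branch = Branch(f"sample_{i}")
        branch.add_edge(f"vap_b/sample_{i}", "wasGeneratedBy", "process_x")
        branch.add_edge("process_x", "used", f"vap_a/sample_{i}")
        hedge.branches.append(branch)
    return path
```

Because no branch refers to another branch, each one can be serialized or analyzed on its own, which is what makes chunked storage and parallel processing straightforward.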
10. Estimated Cost of Provenance
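[Figure: estimated provenance cost per tier]
Tiers (left to right): Path, Hedge, Branch, at low, medium, and high granularity respectively
Cost annotations: < 5K graph; ~5-10K; ~30K for each VAP sample; 2 bytes for each VAP sample
Other labels: VAP Lineage; VAP Sample; Sample Quality Control; Field Origin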
11. Analysis Examples
! Timeline Inspection, Anomaly and Trend Detection
! Aggregation
! Out of 43,200 potential samples (560K log entries)
! 15 distinct processes
! 60 distinct process results, e.g.:
! No AERO G data within minutes of x
! No RRTM_LW output for x
! No RRTM_SW output for x
! No clear sky longwave cloud forcing run for x
! No clear sky shortwave cloud forcing run for x
! No emissivities file RRTM_SW_sfcemissdata
! This can be used to help users know the kinds of questions they can ask.
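The aggregation step above can be sketched as a simple count of distinct (process, result) pairs across log entries. The log line format `timestamp|process|result`, the process names, and the function name are assumptions made for illustration; the actual ARM log layout is not given in these slides.

```python
from collections import Counter


def aggregate_results(log_lines):
    """Count distinct (process, result) pairs across log entries.

    Assumes each line looks like 'timestamp|process|result' (hypothetical format).
    """
    counts = Counter()
    for line in log_lines:
        _timestamp, process, result = line.split("|", 2)
        counts[(process.strip(), result.strip())] += 1
    return counts


# Hypothetical entries modeled on the result messages listed above.
SAMPLE_LOG = [
    "t0|rrtm_lw|No RRTM_LW output",
    "t1|rrtm_lw|No RRTM_LW output",
    "t1|rrtm_sw|No RRTM_SW output",
]

counts = aggregate_results(SAMPLE_LOG)

if __name__ == "__main__":
    for (process, result), n in counts.most_common():
        print(f"{process}: {result} ({n} entries)")
```

Listing the distinct results and their frequencies is one way to show users which questions the provenance can answer.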
12. Impacts
! Provenance articulates ARM data processing causality and lineage in a formal and recognizable way.
! Adding provenance creates a data intensive computing challenge due to the sheer volume of provenance represented as a large semantic graph.
! Use of a multi-tier model makes analysis and visualization possible because the provenance graph can be broken into chunks for distributed or parallel processing.
! Modeling the branch tier as autonomous acyclic graphs makes quantitative analysis possible to look for trends or anomalies within one data product, or between multiple data products.
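As an illustration of chunked, parallel analysis at the branch tier, the sketch below splits a list of sample graphs into chunks and scans each chunk in a worker thread for samples that record no input. The dictionary layout, the choice of "no `used` edge" as the anomaly, and all function names are assumptions for the example, not the method used by ARM.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def missing_inputs(chunk):
    """Return ids of branches in the chunk that record no 'used' edge."""
    return [
        branch["sample_id"]
        for branch in chunk
        if not any(edge_type == "used" for _, edge_type, _ in branch["edges"])
    ]


def find_anomalies(branches, chunk_size=2, workers=4):
    """Scan autonomous branch graphs chunk by chunk, in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results stay in sample order.
        parts = list(pool.map(missing_inputs, chunked(branches, chunk_size)))
    return [sample_id for part in parts for sample_id in part]


def make_branch(index, has_input):
    """Build a hypothetical branch as a plain dict of OPM-style edges."""
    edges = [(f"out_{index}", "wasGeneratedBy", "process_x")]
    if has_input:
        edges.append(("process_x", "used", f"in_{index}"))
    return {"sample_id": f"s{index}", "edges": edges}


if __name__ == "__main__":
    demo = [make_branch(i, has_input=(i % 2 == 0)) for i in range(5)]
    print(find_anomalies(demo))
```

The same pattern applies to distributed processing: because each branch is self-contained, a chunk can be sent to any worker without shipping the rest of the graph.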