Leveraging The Open Provenance Model as a Multi-Tier Model for Global Climate Research
Eric Stephan, Todd Halter, Brian Ermold
IPAW, 2010
Discussion Outline

•   Background on the Atmospheric Radiation Measurement (ARM) program
•   Challenges without Provenance
•   Requirements Analysis
•   Multi-Tier Provenance Model
•   Use of Open Provenance Model
•   Impacts
Background

•   Atmospheric Radiation Measurement Program
    •   Production system designed and developed in 1990
    •   Data is collected from over 300 remote sensors worldwide, expanding to over 400 sensors in 2010
    •   Data collection will reach over 500 GB/day of atmospheric and satellite data by FY11
    •   Value added products (VAPs) are developed to correlate and aggregate raw data and to support quality studies of it, feeding computational models

Challenges Facing Current VAP Development

•   Causality, Lineage, Referential Knowledge Not Formalized:
    •   Captured in multiple ways and stored in different media and representation forms
    •   Sample causality not directly accessible to scientists
    •   Inability to seamlessly analyze and visualize knowledge
•   Provenance Required By Different Audiences
    •   Producers: Operations/VAP developers
    •   Consumers: scientists relying on VAPs

Requirements Analysis 1 of 2

Value Added Product Lineage              Directed Graph (Path)
Value Added Product Workflow Causality   Acyclic Graph and Common Properties (Hedge)
Sample Causality                         Ordered Autonomous Acyclic Graphs When Processing Data Product (Branch)
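
The hedge and branch tiers are both described as acyclic graphs. A minimal sketch of what that requirement means in practice, assuming Python 3.9+ and a hypothetical workflow whose step names are invented for illustration (they are not taken from ARM): a graph is acyclic exactly when a valid processing order exists.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical hedge-tier workflow for one VAP: each step maps to the set of
# steps it depends on. Step names are illustrative only.
hedge = {
    "read_inputs": set(),
    "apply_qc": {"read_inputs"},
    "run_model": {"apply_qc"},
    "write_vap": {"run_model", "apply_qc"},
}


def processing_order(graph):
    """Return a valid processing order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None


order = processing_order(hedge)                        # dependencies come first
cyclic = processing_order({"a": {"b"}, "b": {"a"}})    # None: not acyclic
print(order, cyclic)
```

A provenance graph that fails this check cannot be replayed step by step, which is one reason to state the acyclic requirement explicitly.
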
Requirements Analysis 2 of 2


Tier           Purpose     Resources                  Status   Operations   Developer   Researcher
Path           Lineage     N/A                        Future   Needed       Needed      Needed
Path           Curation    Sample Level QC            Exists   In Use       Needed      Needed
Path/Hedge     Reference   Metadata Repository        Exists   In Use       In Use      Needed
Hedge          Reference   Configuration files        Exists   In Use       In Use      Needed
Hedge/Branch   Causality   Log files                  Exists   Needed       In Use      Needed
Hedge/Branch   Derived     Trends/Anomalies           Future   Needed       Needed      Needed
Branch         Causality   Sample Derivation Method   Exists   In Use       Needed      Needed
Branch         Causality   Sample Source              Exists   In Use       Needed      Needed

ARM Provenance Model


•   Characteristics
    •   Captures the knowledge required to depict interdependency, overall processing, and discrete sample processing
    •   Multi-tier
        •   Each tier represents a different granularity and purpose
        •   Each hedge exists in the context of a path, and each branch in the context of a hedge
        •   Declared tiers make it easier to cross-compare knowledge
        •   Because sample provenance at the branch tier is autonomous and ordered, provenance can be processed in parallel or stored in chunks
•   Leverage Standards and Community Efforts
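
The nesting of tiers and the chunked processing of branches can be sketched as follows. This is an illustration under stated assumptions, not the ARM implementation: the class names, field names, VAP names, and result strings are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Branch:
    """Sample-level provenance: an ordered, self-contained record."""
    sample_id: int
    steps: list  # ordered (process, result) pairs


@dataclass
class Hedge:
    """Workflow-level provenance for one VAP; owns its branches."""
    vap: str
    config: dict
    branches: list = field(default_factory=list)


@dataclass
class Path:
    """Lineage between VAPs; owns the hedge recorded for each VAP."""
    derived_from: dict
    hedges: dict = field(default_factory=dict)


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_failures(chunk):
    """Count steps whose result is not 'ok' within one chunk of branches."""
    return sum(1 for b in chunk for _, result in b.steps if result != "ok")


# Hypothetical data: every fourth sample is missing model output.
hedge = Hedge(vap="vap_b", config={"version": "1.0"})
hedge.branches = [
    Branch(i, [("read", "ok"), ("model", "ok" if i % 4 else "no output")])
    for i in range(10)
]
path = Path(derived_from={"vap_b": ["vap_a"]}, hedges={"vap_b": hedge})

# Branches do not reference each other, so each chunk is analyzed independently.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_chunk = list(pool.map(count_failures, chunks(hedge.branches, 4)))
total_failures = sum(per_chunk)
print(per_chunk, total_failures)
```

Because no branch depends on another, the chunk size and worker count can change without affecting the result.
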

[Figure: Provenance Listener]

Estimated Cost of Provenance

Tier     Scope         Granularity   Estimated cost
Path     VAP Lineage   Low           < 5K graph
Hedge    VAP           Medium        ~5-10K
Branch   Sample        High          ~30K for each VAP sample

Sample Quality Control, Field Origin: 2 bytes for each VAP sample
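
To get a feel for how these per-tier figures add up, here is a rough back-of-the-envelope sketch. It assumes that "K" means kilobytes of 1,000 bytes, and it borrows the 43,200 sample count quoted on the Analysis Examples slide purely as an illustrative size. The 2-byte quality control and field origin entries are left out because they are negligible next to ~30K per sample.

```python
# Figures from the slide; the unit "K" is ASSUMED to mean 1,000 bytes.
KB = 1_000
PATH_BYTES = 5 * KB                 # upper bound of "< 5K graph"
HEDGE_BYTES = (5 * KB, 10 * KB)     # "~5-10K" per VAP
BRANCH_BYTES = 30 * KB              # "~30K for each VAP sample"


def estimate_bytes(samples, hedge_bytes):
    """Total provenance size for one VAP with the given number of samples."""
    return PATH_BYTES + hedge_bytes + samples * BRANCH_BYTES


SAMPLES = 43_200  # illustrative sample count, see lead-in
low = estimate_bytes(SAMPLES, HEDGE_BYTES[0])
high = estimate_bytes(SAMPLES, HEDGE_BYTES[1])
branch_share = SAMPLES * BRANCH_BYTES / high

print(f"{low / 1e9:.3f} to {high / 1e9:.3f} GB, branch share {branch_share:.4%}")
```

Under these assumptions nearly all of the volume sits in the branch tier, so the path and hedge tiers stay cheap to store and query on their own.
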
Analysis Examples
•   Timeline Inspection
•   Anomaly and Trend Detection
•   Aggregation
•   Out of 43,200 potential samples (560K log entries)
    •   15 distinct processes
    •   60 distinct process results, e.g.
        •   No AERO G data within minutes of x
        •   No RRTM_LW output for x
        •   No RRTM_SW output for x
        •   No clear sky longwave cloud forcing run for x
        •   No clear sky shortwave cloud forcing run for x
        •   No emissivities file RRTM_SW_sfcemissdata
    •   These aggregates help users learn what kinds of questions they can ask.
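
The aggregation step can be sketched as below. The log format, process names, and timestamps are hypothetical; the only idea taken from the slide is that per-sample messages collapse into a small set of distinct results once the sample-specific value is replaced by a placeholder `x`.

```python
import re
from collections import Counter

# ASSUMED timestamp format (HH:MM); real log entries may differ.
TIME = re.compile(r"\b\d{2}:\d{2}\b")


def normalize(message):
    """Replace the sample-specific time with the placeholder 'x'."""
    return TIME.sub("x", message)


def aggregate(entries):
    """Count distinct processes and distinct normalized results."""
    processes = Counter(process for _, process, _ in entries)
    results = Counter(normalize(message) for _, _, message in entries)
    return processes, results


# Hypothetical entries: (sample time, process, result message).
log_entries = [
    ("00:01", "rrtm_lw", "No RRTM_LW output for 00:01"),
    ("00:02", "rrtm_lw", "No RRTM_LW output for 00:02"),
    ("00:02", "rrtm_sw", "No RRTM_SW output for 00:02"),
    ("00:03", "aerosol", "No AERO G data within minutes of 00:03"),
]

processes, results = aggregate(log_entries)
for message, count in results.most_common():
    print(count, message)
```

The resulting list of distinct results doubles as a menu of questions, for example "which samples had no RRTM_LW output?".
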
Impacts

•   Provenance articulates ARM data processing causality
    and lineage in a formal and recognizable way.

•   Adding provenance creates a data-intensive computing
    challenge due to the sheer volume of provenance
    represented as a large semantic graph.

•   Use of a multi-tier model makes analysis and visualization
    possible because the provenance graph can be broken
    into chunks for distributed or parallel processing.

•   Modeling the branch tier as autonomous acyclic graphs
    makes it possible to look quantitatively for trends or
    anomalies within one data product, or between multiple
    data products.
