  • Panda Provenance

    1. Panda: A System for Provenance and Data. Presented by Vladimir Bukhin.
    2. Contents
       • Use of Provenance
       • Panda Goals
       • Example Workflow
       • Provenance Operations
       • Panda Implementation
    3. Use of Provenance
       • Explanation: examine the sources and evolution of data elements.
       • Verification: audit how data was produced.
       • Re-computation: having found an error, propagate the changes downstream.
    4. Panda Goals
       • Merge data-based and process-based provenance.
       • Define provenance operators to query and analyze data together with its provenance.
       • Create an open-source, configurable system that can be used for a wide variety of applications and can be coupled with outside data and systems.
    5. Example Workflow
       • De-duplicate two data sets.
       • Partition into Euro and USA, then take the union.
       • A prediction process identifies the items most likely to be purchased, based on the combination of the two data sets.
       • Aggregation: output of the prediction table.
    6. Provenance Operations
       • Backward tracing: if cowboy hats are the most-sold item, where are people buying them from?
       • Forward tracing: if we correct an error in the data, how would the outcome change?
       • Forward propagation: rerun the processes for the affected erroneous data and recalculate the end result.
       • Refresh: recalculate because of new data.
    7. Panda Implementation
       • Query language to answer questions like: which customer list contributes the most to the top 100 predicted items?
       • Use 'predicates' as references back to data origins (to trace back to the information source).
       • Python mapping/transformation nodes run per data point.
    8. References
       • R. Ikeda and J. Widom, "Panda: A System for Provenance and Data," IEEE Data Engineering Bulletin, Vol. 33, No. 3, September 2010.
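
The backward- and forward-tracing operations from the Provenance Operations slide can be illustrated with a small sketch. This is not Panda's API or query language: `ProvenanceStore`, `dedup`, `count_items`, the record layout, and the sample data are all hypothetical names invented for illustration, and the workflow is simplified to de-duplication, union, and a per-item count (the partition and prediction steps are omitted). The idea shown is only that each processing step records which inputs produced each output, so the links can later be walked in either direction.

```python
from collections import defaultdict


class ProvenanceStore:
    """Hypothetical store of output-to-input links (not Panda's actual design)."""

    def __init__(self):
        self.parents = defaultdict(set)   # output id -> ids it was derived from
        self.children = defaultdict(set)  # input id -> ids derived from it

    def link(self, output_id, input_ids):
        for input_id in input_ids:
            self.parents[output_id].add(input_id)
            self.children[input_id].add(output_id)

    @staticmethod
    def _walk(start, edges):
        # Transitive closure over the recorded links, excluding the start node.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward_trace(self, element_id):
        """Everything the given element was derived from."""
        return self._walk(element_id, self.parents)

    def forward_trace(self, element_id):
        """Everything derived from the given element."""
        return self._walk(element_id, self.children)


def dedup(source, records, store):
    """Collapse identical records, linking each survivor to all its duplicates."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec].append(f"{source}:{idx}")
    out = {}
    for n, (rec, input_ids) in enumerate(groups.items()):
        out_id = f"dedup-{source}:{n}"
        store.link(out_id, input_ids)
        out[out_id] = rec
    return out


def count_items(records, store):
    """Aggregate purchases per item, linking each count to its input records."""
    by_item = defaultdict(list)
    for rec_id, (_customer, _region, item) in records.items():
        by_item[item].append(rec_id)
    counts = {}
    for item, rec_ids in by_item.items():
        store.link(f"item:{item}", rec_ids)
        counts[item] = len(rec_ids)
    return counts


# Made-up sample data: (customer, region, item).
euro = [("ana", "Euro", "cowboy hat"), ("ana", "Euro", "cowboy hat"),
        ("ben", "Euro", "scarf")]
usa = [("cy", "USA", "cowboy hat"), ("dee", "USA", "boots")]

store = ProvenanceStore()
combined = {**dedup("euro", euro, store), **dedup("usa", usa, store)}  # union
counts = count_items(combined, store)

# Backward tracing: which raw input records contributed to the cowboy hat count?
hat_sources = sorted(i for i in store.backward_trace("item:cowboy hat")
                     if not i.startswith("dedup-"))

# Forward tracing: which downstream elements depend on raw record usa:1?
affected = store.forward_trace("usa:1")
```

Here `hat_sources` lists the raw records in both customer lists that feed the cowboy hat count, and `affected` is the set of elements that would need to be recomputed if record `usa:1` were corrected, which is the starting point for forward propagation.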
