Panda: A System for Provenance and Data
Presented by Vladimir Bukhin
Contents
• Use of Provenance.
• Panda Goals.
• Example Workflow.
• Provenance Operations.
• Panda Implementation.
Use of Provenance
• Explanation: Examine the sources and evolution of data elements.
• Verification: Audit how data was produced.
• Re-computation: Having found an error, propagate changes downstream.
Panda Goals
• Merge data-based and process-based provenance.
• Define provenance operators to query and analyze data mixed with provenance.
• Create an open-source, configurable system that can be used for a wide variety of applications and can be coupled with outside data and systems.
Example Workflow
• De-duplicate two data sets.
• Partition into Euro and USA, then take the union.
• Prediction: A process predicts the items most likely to be purchased, based on the union of the two data sets.
• Aggregation: Summarize the prediction output into a table.
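The workflow above can be sketched in plain Python. This is a minimal illustration, not Panda's actual API: the customer records, the `dedup`, `predict`, and `aggregate` functions, and the toy prediction rule are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical customer lists; names and fields are illustrative only.
cust_europe = [("Amelie", "Paris"), ("Amelie", "Paris"), ("Hans", "Berlin")]
cust_usa = [("Bob", "Austin"), ("Carol", "Denver"), ("Bob", "Austin"), ("Dave", "Austin")]

def dedup(records):
    """Drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(records))

def predict(customer):
    """Toy stand-in for an opaque prediction process (assumed rule)."""
    name, city = customer
    return "cowboy hat" if city in {"Austin", "Denver"} else "umbrella"

def aggregate(items):
    """Count how often each item was predicted."""
    return Counter(items)

europe = dedup(cust_europe)
usa = dedup(cust_usa)
customers = europe + usa  # union of the two de-duplicated lists
predictions = [predict(c) for c in customers]
item_counts = aggregate(predictions)
```

With this toy data, `item_counts` holds one count per predicted item, which plays the role of the final aggregated table.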
Provenance Operations
• Backward Tracing: If cowboy hats are the most-sold item, where are people buying them from?
• Forward Tracing: If we correct an error in the data, how would the outcome change?
• Forward Propagation: Rerun the processes affected by the erroneous data and recalculate the end result.
• Refresh: Recalculate results to reflect new data.
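One way to picture backward and forward tracing is as lookups over recorded provenance links. The sketch below assumes a hypothetical `provenance` dictionary that maps each derived element to the elements it was derived from. The element names are invented, and this is not Panda's storage format.

```python
# Hypothetical provenance links: derived element -> elements it came from.
provenance = {
    "count:cowboy hat": ["pred:bob", "pred:carol"],
    "count:umbrella": ["pred:hans"],
    "pred:bob": ["usa:bob"],
    "pred:carol": ["usa:carol"],
    "pred:hans": ["europe:hans"],
}

def backward_trace(element):
    """Return the source elements (those with no recorded parents) behind `element`."""
    parents = provenance.get(element)
    if not parents:
        return {element}
    sources = set()
    for parent in parents:
        sources |= backward_trace(parent)
    return sources

def forward_trace(element):
    """Return every element derived, directly or indirectly, from `element`."""
    derived = set()
    for child, parents in provenance.items():
        if element in parents:
            derived.add(child)
            derived |= forward_trace(child)
    return derived
```

Here `backward_trace("count:cowboy hat")` returns the two USA customer records, and `forward_trace("europe:hans")` returns the prediction and the count that would need recomputing if that record were corrected.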
Panda Implementation
• Query language to answer questions like: Which customer list contributes the most to the top 100 predicted items?
• Use ‘predicates’ as references back to data origins, so that results can be traced to their information source.
• Mapping/transformation nodes are written in Python and run once per data point.
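The idea of a per-data-point node that records predicates can be sketched as follows. This assumes a predicate is simply a Python callable over input records. The `run_map_node` helper and the record layout are hypothetical, and Panda's actual predicate representation may differ.

```python
def run_map_node(transform, inputs, key):
    """Apply `transform` to each input record and pair each output with a
    predicate that selects its originating input record by `key`."""
    outputs = []
    for record in inputs:
        source_key = record[key]
        # The default argument binds the current key, avoiding late binding in the loop.
        predicate = lambda r, k=source_key: r[key] == k
        outputs.append((transform(record), predicate))
    return outputs

# Hypothetical input records.
customers = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Paris"}]
results = run_map_node(lambda r: r["city"].upper(), customers, key="id")

# Trace the first output back to its source by evaluating its predicate.
value, predicate = results[0]
origin = [r for r in customers if predicate(r)]
```

Storing a predicate rather than a copy of the input keeps the output small while still letting a trace recover the originating record on demand.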
References
• R. Ikeda and J. Widom, Panda: A System for Provenance and Data, IEEE Data Engineering Bulletin, Vol. 33, No. 3, September 2010.