2015 TaPP - Interoperability for Provenance-aware Databases using PROV and JSON

Boris Glavic, Professor at Illinois Institute of Technology
Jul. 11, 2015

Editor's Notes

  1. Hello everyone, I am Xing, a Ph.D. student in the database group at Illinois Institute of Technology. I am presenting Interoperability for Provenance-aware Databases using PROV and JSON. This is joint work with Raghav and Boris.
  2. This is the outline of my talk. I will first introduce the problem, in particular provenance-aware systems, and explain why interoperability between provenance-aware systems is hard to achieve. Then I will give an overview of related work. Afterwards I will present an overview of our solution and discuss in detail how we realize export and import of provenance in our approach. Finally, I will show some experimental results and finish with conclusions and future work.
  3. The PROV standard is a standardized, extensible representation of provenance graphs that is very useful for exchanging provenance information between systems. A provenance-aware database management system computes the provenance of database operations; examples include Perm, GProM, DBNotes, Orchestra, and LogicBlox.
  4. Here we use an example to explain a simplified PROV graph, which will help us illustrate the interoperability problems that arise when data is created by multiple systems. The example is about extracting demographic information about Twitter users from monthly logs of tweets. Our example pipeline has two input files, Jan and Feb, which are monthly logs of tweets. They are fed into extractors, E1 and E2 here, which extract single tweets from each file; in this case we get tw1 to tw3. The three tweets are then fed into a classifier that predicts demographic information about the user, in this case the state, age, and gender. For each tweet we insert a tuple with the classifier's result into the database, e.g., t1 for the first tweet. Then we run the query shown at the top, which computes the average age of Twitter users per state and gives us the two result tuples, to1 and to2, shown at the bottom. Such a provenance graph is useful, for example, for tracing erroneous outputs back to the inputs: the dotted red lines represent data dependencies, which are wasDerivedFrom edges in the PROV graph; a sketch of such an edge is shown below. Of course, such a graph requires that we can track provenance both outside and inside the database and connect the two types of provenance.
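     As a point of reference, the wasDerivedFrom edge between to1 and t1 could be serialized in PROV-JSON roughly as follows; the node and edge identifiers are hypothetical, since the talk does not show the naming scheme:

        { "entity": { "t1": {}, "to1": {} },
          "wasDerivedFrom": {
            "_:wDF1": { "prov:generatedEntity": "to1",
                        "prov:usedEntity":      "t1" } } }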
  5. Provenance-aware relational database systems support tracking of data provenance, but they do not support import and export of provenance in PROV. In particular, in GProM we can produce wasDerivedFrom edges between output tuples and input tuples, for example between to1 and t1 (going back to the previous slide). However, we cannot export this information as a PROV graph, and we lose track of how tuples were created by processes outside the database; for instance, we would not know that t1 was actually derived through the tweet-extraction workflow.
  6. GProM, the system used in this work, is a generic provenance middleware that computes provenance over multiple database backends by translating queries written in a SQL language with provenance extensions into the SQL dialect of the backend database system. The left graph shows the simplified flow of the GProM system. First, the parser receives the provenance request and translates the SQL code into an algebra tree, which becomes the input of the provenance rewriter. The provenance rewriter then rewrites the algebra tree and generates a new one that computes the provenance of the query. Finally, the SQL code generator translates the algebra tree back into SQL code that can be run on the database. Now let us see how we compute the database-side provenance in a system like GProM: we use the PROVENANCE OF SQL language construct, which computes the provenance of a query, as sketched below.
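     For the running example, the request could look roughly as follows; the table and attribute names are hypothetical, since the talk does not show the schema:

        -- ask GProM to compute the provenance of the average-age query
        PROVENANCE OF (
          SELECT state, avg(age) AS avg_age
          FROM   users
          GROUP BY state
        );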
  7. Now let us look at the result of computing the provenance of query Q in GProM, where Q is the query from the previous slide. It gives us a relational table that essentially represents wasDerivedFrom edges. For example, consider the first row here: the result tuple is (Illinois, 50), that is, tuple to2, which was derived from our Illinois input tuple, in this case tuple t2. So this row represents the wasDerivedFrom edge between to2 and t2.
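     Schematically, such a result pairs each output tuple with the input tuples in its provenance as additional attributes; the attribute names below are illustrative:

        STATE    | AVG_AGE | PROV_USERS_STATE | PROV_USERS_AGE
        ---------+---------+------------------+---------------
        Illinois | 50      | Illinois         | 50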
  8. Our goal here is to make provenance-aware databases interoperable with other provenance-aware systems, so that we can create the kind of PROV graph I showed you in the beginning. Our approach is as follows: we implement export and import of provenance information represented as PROV-JSON. JSON is the JavaScript Object Notation, a lightweight data-interchange format, and PROV-JSON is a JSON serialization of the PROV data model. This allows us to import provenance when importing tuples, so we do not lose the link to how a tuple was created, and it allows us to export the provenance of queries for use in other provenance-aware systems. This only really works if we also propagate imported provenance during query processing, because otherwise we lose the connection between a query's output tuples and the provenance of the input tuples that were used. We implemented this in the GProM system using SQL.
  9. One related approach tries to link provenance that was recorded by independent systems by identifying common nodes in independently developed provenance graphs. It differs from ours in that it addresses the problem of finding connections between such provenance graphs, but not how to physically integrate them into one object, which is what we are addressing. Essentially, two other kinds of approaches have been developed for addressing some of the interoperability problems between databases and other provenance-aware systems: either introducing a common model for both types of provenance, or monitoring access to the database and linking database provenance with the other system's provenance. Our approach differs from these: we also use a common model, namely PROV, but we do not require a central authority for monitoring and recording provenance as both types of systems do. Provenance can be developed independently, and we support interoperability by importing and exporting it, which makes our approach less tightly coupled to other systems. For example, Gehani et al. [6] try to identify nodes in two provenance graphs that represent the same real-world entity, activity, or actor and discuss how to integrate such provenance graphs. Finally, our approach for tracking the provenance of JSON path expressions is similar to work on tracking the provenance of path expressions in XML query languages [12]. [12] J. N. Foster, T. J. Green, and V. Tannen. Annotated XML: Queries and Provenance. In PODS, pages 271–280, 2008.
  10. In summary, we introduce techniques for exporting database provenance as PROV documents, importing PROV graphs alongside data, and linking the outputs of SQL operations to the imported provenance of their inputs. Our implementation in GProM offloads the generation of PROV documents to the backend database by using SQL and string-concatenation functions.
  11. For export, we add a new TRANSLATE AS clause that lets the user specify whether to translate the provenance into a different format such as PROV-JSON. If the user asks for translation, we construct the PROV-JSON document on the fly by adding new SQL operations on top of the database provenance computation. The steps are as follows. We run several projections over the result of the provenance computation, where each projection creates the elements of one particular type, for example entity nodes, or wasDerivedFrom and wasGeneratedBy edges. Essentially, we concatenate fixed strings with values from the tuples; for example, this is a snippet of how we would create the JSON fragment for a single wasGeneratedBy edge, where we refer to actual attributes and concatenate them with fixed strings (a sketch is shown below). We do this for all types of nodes and edges. In the next step we use aggregation to concatenate all snippets of a certain type into one string; for example, all used edges are aggregated into one string representing the used edges. Finally, a last projection performs string concatenation to create the final JSON document.
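     A minimal sketch of such a projection, assuming Oracle-style || concatenation and rownum; the attribute names of the provenance table are hypothetical:

        -- build one wasGeneratedBy JSON fragment per provenance tuple
        SELECT '"_:wGB' || rownum || '": { "prov:entity": "'
               || result_entity_id || '", "prov:activity": "'
               || query_activity_id || '" }' AS wgb_fragment
        FROM   provenance_of_q;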
  12. If the user imports data, she can store any available PROV graph alongside the data; for this, we append IMPORT PROV FOR to the import statement. We add three columns to each table to store imported provenance: prov_doc stores a PROV-JSON snippet representing the tuple's provenance; prov_eid indicates which of the entities in this snippet represents the imported tuple; and prov_time stores a timestamp recording when the tuple was imported. These attributes are useful for querying imported provenance and, even more importantly, for correctly including it in exported provenance.
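     A hypothetical sketch of the resulting storage layout for the users table from the example; the talk does not show the exact column types or the full IMPORT PROV FOR syntax:

        -- the three provenance columns added to a table with imported data
        CREATE TABLE users (
          state     VARCHAR(20),
          age       INT,
          gender    VARCHAR(10),
          prov_doc  CLOB,          -- PROV-JSON snippet for this tuple's provenance
          prov_eid  VARCHAR(100),  -- id of the entity representing this tuple in prov_doc
          prov_time TIMESTAMP      -- time the tuple was imported
        );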
  13. This slide shows the figure from the 4th slide, without the query and the query outputs.
  14. To export the provenance of a query over data with imported provenance, we include the imported provenance as bundles in the generated PROV graph and connect the entities representing input tuples in the imported provenance to the query activity and the output-tuple entities. Bundles enable nesting of PROV graphs within PROV graphs by treating a nested graph as a new entity; a sketch follows below.
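     Roughly, the generated document could nest an imported graph like this; all identifiers are hypothetical:

        { "entity": { "to1": {}, "b1": {} },
          "bundle": {
            "b1": { "entity": { "t1": {}, "tw1": {} },
                    "wasDerivedFrom": {
                      "_:d1": { "prov:generatedEntity": "t1",
                                "prov:usedEntity":      "tw1" } } } },
          "wasDerivedFrom": {
            "_:d2": { "prov:generatedEntity": "to1",
                      "prov:usedEntity":      "t1" } } }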
  15. We also handle updates. If a tuple is modified, e.g., by running a SQL UPDATE statement, this should be reflected when provenance is exported. For instance, assume the user has run an update to correct tuple t1's age value (setting age to 70) before running the query; a sketch of such an update follows below. This update should be reflected in the exported provenance: the update activity, shown as u in the graph, should appear, and there should be two versions of the entity for tuple t1, where t'1 is the updated version of t1. GProM relies on temporal database features to access past versions.
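     The correcting update might look as follows; the WHERE predicate identifying t1 is illustrative, since the talk does not show t1's attribute values:

        -- correct t1's mis-classified age before running the query
        UPDATE users SET age = 70 WHERE state = 'Illinois' AND gender = 'male';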
  16. The main challenge in supporting updates for export is how to track the provenance of updates under transactional semantics. This problem has recently been addressed in GProM using the novel concept of reenactment queries: queries that simulate part of the history and have the same provenance as that history, so we can use them to compute provenance. Using GProM, the user can request the provenance of a past update, transaction, or set of updates executed within a given time interval, and the PROV document for updates is constructed from provenance computed on the fly.
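     For instance, a request for the provenance of a past transaction might look roughly like this; the exact syntax and the transaction identifier are illustrative:

        -- reenact a past transaction to compute its provenance
        PROVENANCE OF TRANSACTION 1234;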
  17. We used TPC-H benchmark datasets with scale factors from 0.01 to 10 (10MB up to 10GB). Experiments were run on the machine configuration shown here. We computed the provenance of a three-way join between the relations customer, orders, and nation, with additional selection conditions to control selectivity (and, thus, the size of the exported PROV-JSON document). Provenance export for queries is fully functional, while import of PROV is currently done manually. Exporting of propagated imported provenance is supported, but we only support the default storage layout.
  18. The two figures show the runtimes of these experiments, averaged over 100 runs, for database sizes of 1GB and 10GB, varying the number of result tuples between 15 and 15K. Generating the PROV-JSON document comes at some additional cost over just computing the provenance. However, this cost is linear in the size of the generated document and independent of the size of the database.
  19. To conclude, I have shown our approach of integrating import and export of provenance represented as PROV-JSON into provenance-aware databases, with the goal of supporting interoperability between provenance-aware databases and other provenance-aware systems. This solves the problem I showed with the graph at the beginning. We construct PROV graphs on the fly using SQL, so PROV construction is part of the query that generates the provenance, and we connect database provenance to imported PROV data by propagating it during provenance computation, which creates the links between final query results and the provenance of the imported data. As future work, we plan to finalize our implementation for updates; we already know how to do this, but it is not implemented yet. It would also be nice to support automatic storage management, e.g., deduplication, for imported provenance; there are database-side techniques we could use, such as the deduplication feature of Oracle SecureFiles, or we could treat it as a design problem. Finally, it would be nice to have automatic support for cross-referencing, like the approach discussed in the related work section, to support the use case where the user does not know how the provenance graphs relate to each other.