Thoughts on Knowledge Graphs & Deeper Provenance


Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at




  1. Faculty of Science. Paul Groth | @pgroth | Oct 29, 2019. Data Provenance Staff Week, Universidad de La Rioja. Thoughts on Knowledge Graphs & Deeper Provenance
  2. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  3. The making of data is important. “There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science, Committee on Data of the International Council for Science (CODATA)
  4. Software 2.0. “In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets.” The making of data is hard. (Figure 8 Data Science Surveys 2017 & 2018)
  5. (image-only slide)
  6. NOT JUST DATA SCIENCE. Some observations from @gregory_km's survey & interviews: • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented • Participants require details about data collection and handling • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common. Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019). Searching Data: A Review of Observational Data Retrieval Practices. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24165. K. Gregory, H. Cousijn, P. Groth, A. Scharnhorst, S. Wyatt (2018). Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv:1801.04971.
  7. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  8. Knowledge Graphs for Integration. Nickel et al. (2015) describe knowledge graphs as “graph structured knowledge bases (KBs) which store factual information in form of relationships between entities”. Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review of Relational Machine Learning for Knowledge Graphs, 1–18.
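The triple view in this definition can be made concrete with a toy example; the entities and relations below are invented for illustration and do not come from the deck:

```python
# A knowledge graph as a set of (subject, predicate, object) triples,
# with a simple lookup of the facts asserted about an entity.
triples = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "is_a", "anticoagulant"),
}

def facts_about(entity):
    """Return all triples in which the entity appears as subject."""
    return sorted(t for t in triples if t[0] == entity)

print(facts_about("aspirin"))
```

Real knowledge graphs use the same structure at the scale of millions of entities, typically with typed nodes and globally unique identifiers rather than bare strings.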
  9. Knowledge Graphs: In Science
  10. Adoption of Knowledge Graphs (Frank van Harmelen)
  11. Knowledge Graphs
  12. LARGE SCALE PIPELINES. (Flattened pipeline diagram; stages include: Open Information Extraction, Triple Extraction, Entity Resolution, Concept Resolution, Matrix Construction over a universal schema of surface-form and structured relations, Matrix Factorization / Matrix Completion with a factorization model, and Curation; outputs: predicted relations, knowledge graph, taxonomy. Scale figures: 14M SD articles, 475M triples, 3.3 million relations, 49M relations, ~15k → 1M entries.) Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. "Applying Universal Schemas for Domain Specific Ontology Expansion." 5th Workshop on Automated Knowledge Base Construction (AKBC), 2016. Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
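The matrix factorization at the heart of this pipeline can be sketched in miniature. Everything below (the entity pairs, relation phrases, latent dimension, learning rate) is invented for illustration; the real system operates over millions of cells mixing surface-form and structured relations:

```python
import random

# Toy universal-schema matrix: rows are entity pairs, columns are relation
# phrases (surface forms and structured relations side by side). 1 = observed.
pairs = ["(aspirin, headache)", "(ibuprofen, pain)"]
relations = ["treats", "is used against", "causes"]
observed = {(0, 0): 1.0, (0, 1): 1.0, (1, 1): 1.0}  # sparse observed cells

k = 4  # latent dimension
random.seed(0)
P = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in pairs]
R = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in relations]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# SGD on squared error over the observed cells only.
for _ in range(500):
    for (i, j), y in observed.items():
        err = y - dot(P[i], R[j])
        for d in range(k):
            P[i][d] += 0.1 * err * R[j][d]
            R[j][d] += 0.1 * err * P[i][d]

# Matrix completion: score an unobserved cell. Because "treats" and
# "is used against" co-occur on row 0, row 1 inherits a high score
# for the structured relation "treats" it was never observed with.
print(dot(P[1], R[0]))
```

The predicted score for the unobserved cell is what the slide calls a "predicted relation"; curation then decides whether it enters the knowledge graph.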
  13. Bottlenecks 1. Manual 2. Difficulty in creating flexible reusable workflows 3. Lack of transparency. Paul Groth. "The Knowledge-Remixing Bottleneck." IEEE Intelligent Systems, vol. 28, no. 5, pp. 44–48, Sept.–Oct. 2013. doi:10.1109/MIS.2013.138. Paul Groth. "Transparency and Reliability in the Data Supply Chain." IEEE Internet Computing, vol. 17, no. 2, pp. 69–71, March–April 2013. doi:10.1109/MIC.2013.41
  14. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  15. TRANSPARENCY
  16. PROVENANCE • Where and how was this data or document produced? • Data provenance is “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data” – W3C Provenance Recommendation • Central issues: data workflows go beyond single systems; how do you capture this information effectively?; what functionality can the provenance support?
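A toy sketch of what such a record lets you do. The identifiers below are invented, and real systems use the W3C PROV vocabulary over RDF rather than bare strings; only the relation names are taken from PROV:

```python
# A provenance record as PROV-style triples: an activity uses entities,
# generates entities, and is associated with an agent.
provenance = [
    ("clean_dataset_v2", "wasGeneratedBy", "run:cleaning-2019-10-29"),
    ("run:cleaning-2019-10-29", "used", "raw_dataset_v1"),
    ("run:cleaning-2019-10-29", "wasAssociatedWith", "agent:paul"),
    ("raw_dataset_v1", "wasAttributedTo", "agent:lab-sensor-3"),
]

def derivation_chain(entity, prov):
    """Walk back through wasGeneratedBy/used links to find the inputs
    an entity was (transitively) produced from."""
    inputs = []
    for s, p, o in prov:
        if s == entity and p == "wasGeneratedBy":
            for s2, p2, o2 in prov:
                if s2 == o and p2 == "used":
                    inputs.append(o2)
                    inputs.extend(derivation_chain(o2, prov))
    return inputs

print(derivation_chain("clean_dataset_v2", provenance))
```

This answers the slide's first question ("where and how was this data produced?") mechanically: follow the generation and usage edges back to the sources and the responsible agents.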
  17. DATA PROVENANCE INTEROPERABILITY. Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015): 235–257. Luc Moreau and Paul Groth. "Provenance: An Introduction to PROV." Synthesis Lectures on the Semantic Web: Theory and Technology 3.4 (2013): 1–129. Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39–56.
  18. OPENPHACTS PROVENANCE
  19. • Select one of the activities in the PROV graph • Entities and Activities are sized according to information flow • Missing type information is automatically inferred • Embed the generated visualisation in your own webpage
  20. What to capture? Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology 20(3), 2011.
  21. State of the art • Disclosed Provenance: + accuracy, + high-level semantics; − intrusive, − manual effort. Examples: CPL (Macko ’12), Trio (Widom ’09), Wings (Gil ’11), Taverna (Oinn ’06), VisTrails (Freire ’06) • Observed Provenance: + non-intrusive, + minimal manual effort; − false positives, − semantic gap. Examples: ES3 (Frew ’08), Trec (Vahdat ’98), PASSv2 (Holland ’08), DTrace Tool (Gessiou ’12), ProvChain (Liang ’17), Titian (Interlandi ’15), DiffDataFlow (Chothia ’16)
  22. What if we missed something? Provenance is post-hoc. Disclosed provenance systems: re-apply the methodology (e.g. PrIMe) and produce a new application version; time consuming. Observed provenance systems: update the applied instrumentation; instrumentation becomes progressively more intensive.
  23. Re-execution. Common tactic in disclosed provenance: • DB: Reenactment queries (Glavic ’14) • DistSys: Chimera (Foster ’02), Hadoop (Logothetis ’13), DistTape (Zhao ’12) • Workflows: Pegasus (Groth ’09) • PL: Slicing (Perera ’12) • Desktop: Excel (Asuncion ’11). Can we extend this idea to observed provenance systems?
  24. Faster Capture: Record & Replay. M. Stamatogiannakis, E. Athanasopoulos, H. Bos, P. Groth (2017). PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM Transactions on Internet Technology (TOIT) 17(4), 37.
  25. Methodology (cycle): Execution Capture → Instrumentation → Provenance Analysis → Selection
  26. Prototype Implementation • PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ’14) • Based on the QEMU virtualization platform
  27. Prototype Implementation (2/3) • PANDA logs self-contained execution traces: an initial RAM snapshot plus a log of non-deterministic inputs • Logging happens at virtual CPU I/O ports; virtual device state is not logged → can’t “go live”. (Diagram: input, interrupts, and DMA into CPU/RAM; the initial RAM snapshot and non-determinism log together form the PANDA execution trace.)
  28. Prototype Implementation (3/3) • Analysis plugins: read-only access to the VM state; invoked per instruction, memory access, context switch, etc.; can be combined to implement complex functionality (OSI Linux, PROV-Tracer, ProcStrMatch, taint tracking) • Debian Linux guest • Provenance stored as PROV/RDF triples, queried with SPARQL. (Diagram: plugins over the PANDA execution trace feed a triple store; the PROV model relates Entity, Activity, and Agent via used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf, and startedAtTime/endedAtTime.)
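The slide stores provenance as PROV/RDF triples queried with SPARQL. A minimal sketch of the same querying idea, with a hand-rolled basic-graph-pattern matcher standing in for a real SPARQL engine (the triples and identifiers are invented; variables start with "?"):

```python
# Toy system-provenance triples of the kind PROV-Tracer might emit.
triples = [
    ("file:/tmp/out.txt", "prov:wasGeneratedBy", "proc:42"),
    ("proc:42", "prov:used", "file:/etc/config"),
    ("proc:42", "prov:wasAssociatedWith", "user:alice"),
]

def match(pattern, binding, triple):
    """Try to extend a variable binding so pattern matches triple."""
    b = dict(binding)
    for p, v in zip(pattern, triple):
        if p.startswith("?"):
            if b.get(p, v) != v:
                return None
            b[p] = v
        elif p != v:
            return None
    return b

def query(patterns, bindings=({},)):
    # Join each pattern against the store, SPARQL basic-graph-pattern style.
    for pat in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pat, b, t)) is not None]
    return bindings

# "Which files did the process that generated out.txt read?"
print(query([("file:/tmp/out.txt", "prov:wasGeneratedBy", "?p"),
             ("?p", "prov:used", "?f")]))
```

A production setup would hand the same two-pattern query to a SPARQL endpoint over the triple store; the join semantics are the same.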
  29. Enabling more detail • Observed provenance systems treat programs as black boxes • Can’t tell if an input file was actually used • Can’t quantify the influence of input on output
  30. DATA TRACKER • Captures high-fidelity provenance using taint tracking • Key building blocks: libdft (Kemerlis ’12) ➞ reusable taint-tracking framework; Intel Pin (Luk ’05) ➞ dynamic instrumentation framework • Does not require modification of applications • Does not require knowledge of application semantics. Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box: capturing data provenance using dynamic instrumentation." International Provenance and Annotation Workshop (IPAW’14), pp. 155–167, 2014.
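The core idea of taint tracking can be illustrated in a few lines. DataTracker itself propagates taint at the instruction level via Intel Pin and libdft, not in Python; this sketch shows only the propagation principle, with invented file names:

```python
# Each value carries the set of input sources it was derived from;
# every operation propagates the union of its operands' taint sets.
class Tainted:
    def __init__(self, value, taint=frozenset()):
        self.value = value
        self.taint = frozenset(taint)

    def __add__(self, other):
        # Derived data is tainted by everything that flowed into it.
        return Tainted(self.value + other.value, self.taint | other.taint)

a = Tainted(10, {"input1.csv"})   # byte read from one input file
b = Tainted(32, {"input2.csv"})   # byte read from another
c = Tainted(5)                    # program constant, no taint

result = a + b + c
print(result.value, sorted(result.taint))
```

Because taint follows actual data flow, this answers the previous slide's complaints: an input file that is opened but never used contributes no taint to any output, and the taint set of an output quantifies exactly which inputs influenced it.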
  31. Systems provenance • Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas Moyer. Trustworthy Whole-System Provenance for the Linux Kernel. USENIX Security Symposium (SECURITY), August 2015.
  32. Is this enough? • We can capture ever more provenance • Still the question: what to capture? • But is that enough?
  33. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  34. Peer Review At Scale – People. 2016: • 1.5 million papers submitted • 420,000 articles published • 2,500 journals • 20,000 “level 1” editors • 60,000 editors. (Source: content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf)
  35. But are we missing a user?
  36. Machines see things differently than people. From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1. Thanks Brad Allen
  37. Machines learn things differently than people. Thanks Brad Allen
  38. Models reuse data. From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B. and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
  39. Models reuse models
  40. People read what machines say. Gil, Y. and Garijo, D. Towards Automating Data Narratives. In Proceedings of the Twenty-Second ACM International Conference on Intelligent User Interfaces (IUI-17), Limassol, Cyprus, 2017.
  41. Machines, People, Organizations. Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  42. Machines, People, Organizations • 10 different extractors, e.g. the mapping-based infobox extractor • The infobox extractor uses a hand-built ontology based on the 350 most commonly used English-language infoboxes • Integrates with YAGO • YAGO relies on Wikipedia + WordNet: an upper ontology from WordNet, then a mapping to Wikipedia categories based on frequencies • WordNet is built by psycholinguists. Groth, Paul. "The Knowledge-Remixing Bottleneck." IEEE Intelligent Systems, vol. 28, no. 5, pp. 44–48, Sept.–Oct. 2013. doi:10.1109/MIS.2013.138
  43. An example
  44. All source assessment • People are sources too – they need modelling and assessment • Must take into account the entire provenance history, including assessment structures • Propagation, but also discounting and elevation, are needed for the computation of assessment • Not just explanation, but decisions. Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., van Hage, W.R. and Nottamkandath, A. 2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of Data and Information Quality 7, 1–2 (Jan. 2016), 1–28. Ceolin, D., Groth, P. and van Hage, W.R. 2010. Calculating the Trust of Event Descriptions using Provenance. Proceedings of the SWPM 2010 Workshop at the 9th International Semantic Web Conference, ISWC-2010 (Nov. 2010).
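One way the propagation/discounting/elevation idea can be sketched. The trust values, discount factor, and noisy-OR combination below are illustrative choices, not the models from the cited papers:

```python
# Trust of the original sources (people or sensors); values are invented.
source_trust = {"sensor": 0.9, "crowd": 0.6}
DISCOUNT = 0.85  # each derivation step in the provenance chain loses some trust

def assess(node, graph):
    """Propagate trust from sources to a derived artifact through
    its provenance graph (node -> list of inputs it was derived from)."""
    parents = graph.get(node, [])
    if not parents:                       # a source node
        return source_trust.get(node, 0.5)
    scores = [DISCOUNT * assess(p, graph) for p in parents]
    # Elevation via a noisy-OR: several independent agreeing inputs
    # raise the result above any single discounted score, capped at 1.0.
    combined = 1.0
    for s in scores:
        combined *= (1.0 - s)
    return 1.0 - combined

graph = {"claim": ["sensor", "crowd"]}
print(assess("claim", graph))
```

The claim backed by both sources scores higher than either discounted input alone (0.765 and 0.51), which is the "elevation" the slide calls for; a chain of many derivation steps would instead be progressively discounted.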
  45. Giant Global Provenance Graph? Martin Fenner and Amir Aryani. “Introducing the PID Graph.” March 28, 2019. P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data Intelligence 2 (2020), 78–86. doi:10.1162/dint_a_00030
  46. Conclusion • Data reuse through integration/munging/remixing is pervasive • Knowledge graphs are common and complex • Our information environments are heterogeneous, deep, intermixed and socially embedded • Use provenance to help humans and machines perform assessments and make decisions. Contact: Paul Groth | @pgroth |

Editor's Notes

  • Work with DANS
    Reviewed 400 papers; deep dive on 114
  • Effectively = low capture overhead and low implementation overhead
    Queries, re-execution, abstraction
  • A big problem for systems capturing provenance is deciding what to capture.
    For disclosed provenance systems we can apply some methodology to decide what to capture.
  • Disclosed provenance methods require knowledge of application semantics and modification of the application.
    OTOH observed provenance methods usually have a high false positive ratio.
  • Execution Capture: happens in real time
    Instrumentation: applied on the captured trace to generate provenance information
    Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
    Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
  • We implemented our methodology using PANDA.
  • PANDA is based on QEMU.

    Input includes both executed instructions and data.

    RAM snapshot + ND log are enough to accurately replay the whole execution.

    The ND log consists of inputs to CPU/RAM; other device status is not logged → we can replay but we cannot “go live” (i.e. resume execution)
  • Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state.

    Plugins are implemented as dynamic libraries.

    We focus on the highlighted plugins in this presentation.
  • We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
  • We don’t start with a full formal definition but formalize over time from usage