Faculty of Science
Paul Groth | @pgroth | pgroth.com
Oct 29, 2019
Data Provenance Staff Week - Universidad de La Rioja
Thoughts on Knowledge Graphs & Deeper Provenance
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
The making of data is important
“There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science, Committee on Data of the International Council for Science (CODATA), http://dataintegration.codata.org
Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets”
Figure 8 Data Science Surveys, 2017 & 2018
The making of data is hard
http://www.tamr.com/piketty-revisited-improving-economics-data-science/
NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from @gregory_km's survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Knowledge Graphs for Integration
Knowledge graphs are "graph structured knowledge bases (KBs) which store factual information in form of relationships between entities" (Nickel et al. 2015).
Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review
of Relational Machine Learning for Knowledge Graphs, 1–18.
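To make the definition concrete, here is a minimal sketch in plain Python of a knowledge graph stored as subject-predicate-object triples, with a simple lookup; the entities and relations are hypothetical examples, not drawn from any system discussed here.

```python
# A toy knowledge graph as a set of (subject, predicate, object) triples.
# All entities and relations are hypothetical examples.
triples = {
    ("Amsterdam", "capitalOf", "Netherlands"),
    ("Amsterdam", "locatedIn", "Europe"),
    ("UvA", "locatedIn", "Amsterdam"),
}

def objects(subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Amsterdam", "capitalOf"))  # {'Netherlands'}
```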
Knowledge Graphs: In Science
Frank van Harmelen Adoption of Knowledge Graphs: https://www.slideshare.net/Frank.van.Harmelen/adoption-of-knowledge-graphs-mid-2019
Knowledge Graphs
LARGE SCALE PIPELINES
[Pipeline diagram: 14M ScienceDirect articles feed Open Information Extraction (surface-form relations) and Triple Extraction over a taxonomy (structured relations); after Entity Resolution and Concept Resolution, Matrix Construction builds a universal schema, Matrix Factorization learns a factorization model, and Matrix Completion produces predicted relations that are curated into the knowledge graph. Scale markers: 475M triples, 3.3 million relations, 49M relations, ~15k → 1M entries.]
(The matrix-completion step is sketched in code after the references below.)
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
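To illustrate the matrix-completion step named in the pipeline above, here is a minimal sketch. The entity pairs and relations are hypothetical, and a plain truncated SVD stands in for the trained factorization model used in the actual system.

```python
import numpy as np

# Toy universal-schema matrix: rows are entity pairs, columns are relations
# (a mix of textual surface forms and structured relations); 1 = observed.
pairs = ["(aspirin, pain)", "(ibuprofen, inflammation)", "(aspirin, fever)"]
relations = ["treats", "is_used_for", "reduces"]
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

# Low-rank factorization X ~ U_k S_k V_k^T (here rank 2).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
scores = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Matrix completion: unobserved cells with high scores become predicted
# relations, which a curation step would then accept or reject.
for i, pair in enumerate(pairs):
    for j, rel in enumerate(relations):
        if X[i, j] == 0:
            print(f"{pair} --{rel}--> score {scores[i, j]:.2f}")
```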
Bottlenecks
1. Manual effort
2. Difficulty in creating flexible, reusable workflows
3. Lack of transparency
Paul Groth."The Knowledge-Remixing Bottleneck,"
Intelligent Systems, IEEE , vol.28, no.5, pp.44,48, Sept.-
Oct. 2013 doi: 10.1109/MIS.2013.138
Paul Groth, "Transparency and Reliability in the Data Supply
Chain," Internet Computing, IEEE, vol.17, no.2, pp.69,71,
March-April 2013 doi: 10.1109/MIC.2013.41
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
TRANSPARENCY
PROVENANCE
• Where and how was this data or document produced?
• Data Provenance is “a record that describes the people,
institutions, entities, and activities involved in producing,
influencing, or delivering a piece of data” – W3C
Provenance Recommendation
• Central issues:
  • Data workflows go beyond single systems
  • How do you capture this information effectively?
  • What functionality can the provenance support?
From: https://www.w3.org/TR/prov-primer/
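As a concrete illustration of this model, the following minimal sketch uses the Python prov package (pip install prov) to link an entity, the activity that generated it, and the responsible agent; all names and the namespace are hypothetical.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# An entity (a dataset), the activity that produced it, and the
# responsible agent, connected with standard PROV relations.
dataset = doc.entity("ex:cleaned_dataset")
cleaning = doc.activity("ex:data_cleaning")
analyst = doc.agent("ex:analyst")

doc.wasGeneratedBy(dataset, cleaning)
doc.wasAssociatedWith(cleaning, analyst)
doc.wasAttributedTo(dataset, analyst)

print(doc.get_provn())  # serialize in PROV-N notation
```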
DATA PROVENANCE INTEROPERABILITY
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The
rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide
Web 35 (2015): 235-257.
Luc Moreau and Paul Groth. "Provenance: an introduction to Prov." Synthesis Lectures
on the Semantic Web: Theory and Technology 3.4 (2013): 1-129.
Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for
provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39-56.
OPENPHACTS PROVENANCE
• Select one of the activities in the PROV graph
• Entities and Activities are sized according to information flow
• Missing type information is automatically inferred
• Embed the generated visualisation in your own webpage
What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
State of the art
• Disclosed provenance
  + Accuracy
  + High-level semantics
  − Intrusive
  − Manual effort
• Observed provenance
  + Non-intrusive
  + Minimal manual effort
  − False positives
  − Semantic gap
Systems: CPL (Macko ‘12), Trio (Widom ‘09), Wings (Gil ‘11), Taverna (Oinn ‘06), VisTrails (Freire ‘06), ES3 (Frew ‘08), TREC (Vahdat ‘98), PASSv2 (Holland ‘08), DTrace Tool (Gessiou ‘12), ProvChain (Liang ’17), Titian (Interlandi ‘15), DiffDataFlow (Chothia ‘16)
What if we missed something?
Disclosed provenance systems:
• Re-apply the methodology (e.g. PrIMe) and produce a new application version.
• Time-consuming.
Observed provenance systems:
• Update the applied instrumentation.
• Instrumentation becomes progressively more intense.
Provenance is Post-Hoc
Re-execution
Common tactic in disclosed provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• Desktop: Excel (Asuncion ‘11)
Can we extend this idea to observed provenance systems?
Faster Capture: Record & Replay
PROV 2R: Practical Provenance Analysis of Unstructured Processes
M Stamatogiannakis, E Athanasopoulos, H Bos, P Groth (2017)
ACM Transactions on Internet Technology (TOIT) 17 (4), 37
[Methodology cycle: Execution Capture → Instrumentation → Provenance Analysis → Selection, then back around with more intensive instrumentation.]
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14)
• Based on the QEMU virtualization platform
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces:
  – an initial RAM snapshot;
  – non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
  – Virtual device state is not logged → can't "go live".
[Diagram: input, interrupts, and DMA into the virtual CPU/RAM are captured; the initial RAM snapshot plus the non-determinism log form the PANDA execution trace.]
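The core record & replay trick can be sketched in a few lines: if every non-deterministic input is logged while recording, the execution can later be reproduced exactly, and arbitrarily heavy provenance analysis can run on the replay. This toy Python version uses a hypothetical computation; PANDA additionally snapshots RAM and records at the virtual-hardware level.

```python
import random

def compute(next_input):
    # Some computation that consumes non-deterministic inputs.
    return sum(next_input() for _ in range(3))

# Record phase: log every non-deterministic value as it is consumed.
log = []
def recorded():
    v = random.randint(0, 9)  # stands in for interrupts, I/O, clocks, ...
    log.append(v)
    return v
original = compute(recorded)

# Replay phase: feed the log back in. The execution is now deterministic,
# so heavyweight instrumentation can be applied without perturbing it.
it = iter(log)
replayed = compute(lambda: next(it))
assert replayed == original
```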
Prototype Implementation (3/3)
• Analysis plugins
  – Read-only access to the VM state.
  – Invoked per instruction, memory access, context switch, etc.
  – Can be combined to implement complex functionality.
  – OSI Linux, PROV-Tracer, ProcStrMatch, taint tracking.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL (see the sketch below).
[Diagram: plugins A, B and C read the CPU/RAM state reconstructed from the PANDA execution trace and write provenance to a triple store.]
[Diagram of the PROV data model: Entity, Activity and Agent nodes connected by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo and actedOnBehalfOf; activities carry startedAtTime and endedAtTime values typed xsd:dateTime.]
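To show what querying such provenance looks like, here is a minimal sketch using rdflib (pip install rdflib) over a hand-written PROV graph; the entities are hypothetical examples, not actual PROV-Tracer output.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.org/> .
    ex:report       prov:wasGeneratedBy ex:edit_session .
    ex:edit_session prov:used           ex:raw_notes .
""", format="turtle")

# Which source entities were used by the activity that generated ex:report?
query = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?source WHERE {
        ?output prov:wasGeneratedBy ?activity .
        ?activity prov:used ?source .
    }
"""
for row in g.query(query):
    print(row.source)  # http://example.org/raw_notes
```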
Enabling more detail
• Observed provenance systems treat programs as black boxes
  • Can't tell whether an input file was actually used
  • Can't quantify the influence of an input on an output
DATA TRACKER
• Captures high-fidelity provenance using taint tracking (see the sketch below)
• Key building blocks:
  • libdft (Kemerlis ‘12) ➞ reusable taint-tracking framework
  • Intel Pin (Luk ‘05) ➞ dynamic instrumentation framework
• Does not require modification of applications
• Does not require knowledge of application semantics
Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box:
capturing data provenance using dynamic instrumentation." In International Provenance and
Annotation Workshop (IPAW’14), pp. 155-167. 2014.
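The propagation rule at the heart of taint tracking fits in a few lines. This toy Python version tags each value with the set of input sources it depends on and unions those sets through operations; the file names are hypothetical, and DataTracker itself does this at the machine-instruction level via Pin/libdft rather than on language objects.

```python
class Tainted:
    """A value carrying the set of input sources ("taint labels")
    that influenced it."""
    def __init__(self, value, labels):
        self.value, self.labels = value, set(labels)

    def __add__(self, other):
        # Propagation rule: the result depends on both operands' sources.
        return Tainted(self.value + other.value, self.labels | other.labels)

a = Tainted(10, {"input_a.csv"})
b = Tainted(32, {"input_b.csv"})
out = a + b

# The output's labels say which inputs actually influenced it --
# exactly what black-box observed provenance cannot determine.
print(out.value, out.labels)  # 42 {'input_a.csv', 'input_b.csv'}
```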
Systems provenance
• Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas
Moyer. Trustworthy Whole-System Provenance for the
Linux Kernel. USENIX Security Symposium (SECURITY),
August 2015.
Is this enough?
• We can capture ever more provenance
• Still the question: what to capture?
• But is that enough?
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Peer Review At Scale – People
2016:
• 1.5 million papers submitted
• Published 420,000 articles
• 2,500 journals
• 20,000 “level 1” editors
• 60,000 editors
http://senseaboutscience.org/wp-content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf
But are we missing a user?
Machines see things differently than people
From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1.
Thanks Brad Allen
Machines learn things differently than people
Thanks Brad Allen
Models reuse data
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.
and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
Models reuse models
People read what machines say
Towards Automating Data Narratives.
Gil, Y.; and Garijo, D. In Proceedings of the
Twenty-Second ACM International Conference
on Intelligent User Interfaces (IUI-17), Limassol,
Cyprus, 2017.
Lauruhn, Michael, and Paul Groth. "Sources of
Change for Modern Knowledge Organization
Systems." Knowledge Organization 43, no. 8
(2016).
Machines, People, Organizations
Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors, e.g. the mapping-based infobox extractor
  • uses a hand-built ontology based on the 350 most commonly used English-language infoboxes
  • integrates with YAGO
• YAGO relies on Wikipedia + WordNet
  • upper ontology from WordNet, then a mapping to Wikipedia categories based on frequencies
• WordNet is built by psycholinguists
Machines, People, Organizations
An example
All source assessment
• People are sources too: they need modelling and assessment
• Must take into account the entire provenance history, including assessment structures
• Propagation, but also discounting and elevation, are needed for computing assessments (see the sketch after the references below)
• Not just explanation, but decisions
Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W.R. Van and Nottamkandath, A.
2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of
Data and Information Quality. 7, 1–2 (Jan. 2016), 1–28. DOI:https://doi.org/10.1145/2818382.
Ceolin, D., Groth, P. and van Hage, W.R. 2010. Calculating the Trust of Event Descriptions using Provenance. Proceedings of the SWPM 2010 Workshop at the 9th International Semantic Web Conference (ISWC 2010), Nov. 2010.
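As a rough illustration of the propagation, discounting and elevation mentioned above, here is a minimal sketch; the function, its parameters and the numbers are hypothetical, and the cited work uses subjective logic and richer provenance-aware assessment structures.

```python
def assess(source_trust, derivation_steps, discount=0.9, endorsement=None):
    """Propagate trust along a provenance chain, discounting at each
    derivation step; an independent endorsement elevates the result."""
    trust = source_trust * (discount ** derivation_steps)
    if endorsement is not None:
        trust = trust + (1 - trust) * endorsement  # elevation toward 1
    return trust

# A fact derived in 3 steps from a source with reputation 0.8:
print(assess(0.8, 3))                   # discounted only: ~0.58
print(assess(0.8, 3, endorsement=0.5))  # discounted, then elevated: ~0.79
```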
Giant Global Provenance Graph?
Martin Fenner and Amir Aryani “Introducing the PID Graph” March 28, 2019
https://doi.org/10.5438/jwvf-8a66
P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data
Intelligence 2(2020), 78–86. doi: 10.1162/dint_a_00030
• Data reuse through integration/munging/remixing is pervasive
• Knowledge graphs are common and complex
• Our information environments are heterogeneous, deep, intermixed and socially embedded
• Use provenance to help humans and machines perform assessments and make decisions
Conclusion
Contact:
Paul Groth | @pgroth | pgroth.com


Editor's Notes

  • #7 Work with DANS. Reviewed 400 papers, with a deep dive into 114.
  • #17 Effectively = low overhead, low implementation overhead. Functionality: queries, re-execution, abstraction.
  • #21 A big problem for systems capturing provenance is deciding what to capture. For disclosed provenance systems we can apply some methodology to decide what to capture.
  • #22 Disclosed provenance methods require knowledge of application semantics and modification of the application. Observed provenance methods, on the other hand, usually have a high false-positive ratio.
  • #26 Execution Capture happens in real time. Instrumentation is applied on the captured trace to generate provenance information. Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries). Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
  • #27 We implemented our methodology using PANDA.
  • #28 PANDA is based on QEMU. Input includes both executed instructions and data. The RAM snapshot + ND log are enough to accurately replay the whole execution. The ND log consists of inputs to CPU/RAM; other device status is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
  • #29 Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
  • #31 We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
  • #44 We don’t start with a full formal definition but formalize over time from usage