Faculty of Science
Paul Groth | @pgroth | pgroth.com
Oct 29, 2019
Data Provenance Staff Week - Universidad de La Rioja
Thoughts on Knowledge Graphs & Deeper Provenance
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
The making of data is important
“There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science, Committee on Data of the International Council for Science (CODATA), http://dataintegration.codata.org
Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets”
Figure 8 Data Science Surveys, 2017 & 2018
The making of data is hard
http://www.tamr.com/piketty-revisited-improving-economics-data-science/
NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from @gregory_km's survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Knowledge Graphs for Integration
Knowledge graphs are "graph structured knowledge bases (KBs) which store factual information in form of relationships between entities" (Nickel et al. 2015).
Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review
of Relational Machine Learning for Knowledge Graphs, 1–18.
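To make the definition concrete, here is a minimal sketch in plain Python of a knowledge graph stored as subject-predicate-object triples, with a simple lookup; the entities and relations are hypothetical examples, not drawn from any system discussed here.

```python
# A toy knowledge graph as a set of (subject, predicate, object) triples.
# All entities and relations are hypothetical examples.
triples = {
    ("Amsterdam", "capitalOf", "Netherlands"),
    ("Amsterdam", "locatedIn", "Europe"),
    ("UvA", "locatedIn", "Amsterdam"),
}

def objects(subject, predicate):
    """Return all objects linked to `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("Amsterdam", "capitalOf"))  # {'Netherlands'}
```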
Knowledge Graphs: In Science
Frank van Harmelen Adoption of Knowledge Graphs: https://www.slideshare.net/Frank.van.Harmelen/adoption-of-knowledge-graphs-mid-2019
Knowledge Graphs
LARGE SCALE PIPELINES
[Pipeline diagram: 14M ScienceDirect articles feed Open Information Extraction (surface-form relations) and Triple Extraction over a taxonomy (structured relations); after Entity Resolution and Concept Resolution, Matrix Construction builds a universal schema, Matrix Factorization learns a factorization model, and Matrix Completion produces predicted relations that are curated into the knowledge graph. Scale markers: 475M triples, 3.3 million relations, 49M relations, ~15k → 1M entries.]
(The matrix-completion step is sketched in code after the references below.)
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
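To illustrate the matrix-completion step named in the pipeline above, here is a minimal sketch. The entity pairs and relations are hypothetical, and a plain truncated SVD stands in for the trained factorization model used in the actual system.

```python
import numpy as np

# Toy universal-schema matrix: rows are entity pairs, columns are relations
# (a mix of textual surface forms and structured relations); 1 = observed.
pairs = ["(aspirin, pain)", "(ibuprofen, inflammation)", "(aspirin, fever)"]
relations = ["treats", "is_used_for", "reduces"]
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)

# Low-rank factorization X ~ U_k S_k V_k^T (here rank 2).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
scores = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Matrix completion: unobserved cells with high scores become predicted
# relations, which a curation step would then accept or reject.
for i, pair in enumerate(pairs):
    for j, rel in enumerate(relations):
        if X[i, j] == 0:
            print(f"{pair} --{rel}--> score {scores[i, j]:.2f}")
```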
Bottlenecks
1. Manual effort
2. Difficulty in creating flexible, reusable workflows
3. Lack of transparency
Paul Groth."The Knowledge-Remixing Bottleneck,"
Intelligent Systems, IEEE , vol.28, no.5, pp.44,48, Sept.-
Oct. 2013 doi: 10.1109/MIS.2013.138
Paul Groth, "Transparency and Reliability in the Data Supply
Chain," Internet Computing, IEEE, vol.17, no.2, pp.69,71,
March-April 2013 doi: 10.1109/MIC.2013.41
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
TRANSPARENCY
PROVENANCE
• Where and how was this data or document produced?
• Data Provenance is “a record that describes the people,
institutions, entities, and activities involved in producing,
influencing, or delivering a piece of data” – W3C
Provenance Recommendation
• Central issues:
  • Data workflows go beyond single systems
  • How do you capture this information effectively?
  • What functionality can the provenance support?
From: https://www.w3.org/TR/prov-primer/
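As a concrete illustration of this model, the following minimal sketch uses the Python prov package (pip install prov) to link an entity, the activity that generated it, and the responsible agent; all names and the namespace are hypothetical.

```python
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# An entity (a dataset), the activity that produced it, and the
# responsible agent, connected with standard PROV relations.
dataset = doc.entity("ex:cleaned_dataset")
cleaning = doc.activity("ex:data_cleaning")
analyst = doc.agent("ex:analyst")

doc.wasGeneratedBy(dataset, cleaning)
doc.wasAssociatedWith(cleaning, analyst)
doc.wasAttributedTo(dataset, analyst)

print(doc.get_provn())  # serialize in PROV-N notation
```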
DATA PROVENANCE INTEROPERABILITY
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The
rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide
Web 35 (2015): 235-257.
Luc Moreau and Paul Groth. "Provenance: an introduction to Prov." Synthesis Lectures
on the Semantic Web: Theory and Technology 3.4 (2013): 1-129.
Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for
provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39-56.
OPENPHACTS PROVENANCE
• Select one of the activities in the PROV graph
• Entities and Activities are sized according to information flow
• Missing type information is automatically inferred
• Embed the generated visualisation in your own webpage
What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
State of the art
• Disclosed provenance
  + Accuracy
  + High-level semantics
  − Intrusive
  − Manual effort
• Observed provenance
  + Non-intrusive
  + Minimal manual effort
  − False positives
  − Semantic gap
Systems: CPL (Macko ‘12), Trio (Widom ‘09), Wings (Gil ‘11), Taverna (Oinn ‘06), VisTrails (Freire ‘06), ES3 (Frew ‘08), TREC (Vahdat ‘98), PASSv2 (Holland ‘08), DTrace Tool (Gessiou ‘12), ProvChain (Liang ’17), Titian (Interlandi ‘15), DiffDataFlow (Chothia ‘16)
What if we missed something?
Disclosed provenance systems:
• Re-apply the methodology (e.g. PrIMe) and produce a new application version.
• Time-consuming.
Observed provenance systems:
• Update the applied instrumentation.
• Instrumentation becomes progressively more intense.
Provenance is Post-Hoc
Re-execution
Common tactic in disclosed provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• Desktop: Excel (Asuncion ‘11)
Can we extend this idea to observed provenance systems?
Faster Capture: Record & Replay
PROV 2R: Practical Provenance Analysis of Unstructured Processes
M Stamatogiannakis, E Athanasopoulos, H Bos, P Groth (2017)
ACM Transactions on Internet Technology (TOIT) 17 (4), 37
[Methodology cycle: Execution Capture → Instrumentation → Provenance Analysis → Selection, then back around with more intensive instrumentation.]
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14)
• Based on the QEMU virtualization platform
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces:
  – an initial RAM snapshot;
  – non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
  – Virtual device state is not logged → can't "go live".
[Diagram: input, interrupts, and DMA into the virtual CPU/RAM are captured; the initial RAM snapshot plus the non-determinism log form the PANDA execution trace.]
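The core record & replay trick can be sketched in a few lines: if every non-deterministic input is logged while recording, the execution can later be reproduced exactly, and arbitrarily heavy provenance analysis can run on the replay. This toy Python version uses a hypothetical computation; PANDA additionally snapshots RAM and records at the virtual-hardware level.

```python
import random

def compute(next_input):
    # Some computation that consumes non-deterministic inputs.
    return sum(next_input() for _ in range(3))

# Record phase: log every non-deterministic value as it is consumed.
log = []
def recorded():
    v = random.randint(0, 9)  # stands in for interrupts, I/O, clocks, ...
    log.append(v)
    return v
original = compute(recorded)

# Replay phase: feed the log back in. The execution is now deterministic,
# so heavyweight instrumentation can be applied without perturbing it.
it = iter(log)
replayed = compute(lambda: next(it))
assert replayed == original
```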
Prototype Implementation (3/3)
• Analysis plugins
  – Read-only access to the VM state.
  – Invoked per instruction, memory access, context switch, etc.
  – Can be combined to implement complex functionality.
  – OSI Linux, PROV-Tracer, ProcStrMatch, taint tracking.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL (see the sketch below).
[Diagram: plugins A, B and C read the CPU/RAM state reconstructed from the PANDA execution trace and write provenance to a triple store.]
[Diagram of the PROV data model: Entity, Activity and Agent nodes connected by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo and actedOnBehalfOf; activities carry startedAtTime and endedAtTime values typed xsd:dateTime.]
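To show what querying such provenance looks like, here is a minimal sketch using rdflib (pip install rdflib) over a hand-written PROV graph; the entities are hypothetical examples, not actual PROV-Tracer output.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix prov: <http://www.w3.org/ns/prov#> .
    @prefix ex:   <http://example.org/> .
    ex:report       prov:wasGeneratedBy ex:edit_session .
    ex:edit_session prov:used           ex:raw_notes .
""", format="turtle")

# Which source entities were used by the activity that generated ex:report?
query = """
    PREFIX prov: <http://www.w3.org/ns/prov#>
    SELECT ?source WHERE {
        ?output prov:wasGeneratedBy ?activity .
        ?activity prov:used ?source .
    }
"""
for row in g.query(query):
    print(row.source)  # http://example.org/raw_notes
```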
Enabling more detail
• Observed provenance systems treat programs as black boxes
  • Can't tell whether an input file was actually used
  • Can't quantify the influence of an input on an output
DATA TRACKER
• Captures high-fidelity provenance using taint tracking (see the sketch below)
• Key building blocks:
  • libdft (Kemerlis ‘12) ➞ reusable taint-tracking framework
  • Intel Pin (Luk ‘05) ➞ dynamic instrumentation framework
• Does not require modification of applications
• Does not require knowledge of application semantics
Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box:
capturing data provenance using dynamic instrumentation." In International Provenance and
Annotation Workshop (IPAW’14), pp. 155-167. 2014.
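The propagation rule at the heart of taint tracking fits in a few lines. This toy Python version tags each value with the set of input sources it depends on and unions those sets through operations; the file names are hypothetical, and DataTracker itself does this at the machine-instruction level via Pin/libdft rather than on language objects.

```python
class Tainted:
    """A value carrying the set of input sources ("taint labels")
    that influenced it."""
    def __init__(self, value, labels):
        self.value, self.labels = value, set(labels)

    def __add__(self, other):
        # Propagation rule: the result depends on both operands' sources.
        return Tainted(self.value + other.value, self.labels | other.labels)

a = Tainted(10, {"input_a.csv"})
b = Tainted(32, {"input_b.csv"})
out = a + b

# The output's labels say which inputs actually influenced it --
# exactly what black-box observed provenance cannot determine.
print(out.value, out.labels)  # 42 {'input_a.csv', 'input_b.csv'}
```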
Systems provenance
• Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas
Moyer. Trustworthy Whole-System Provenance for the
Linux Kernel. USENIX Security Symposium (SECURITY),
August 2015.
Is this enough?
• We can capture ever more provenance
• Still the question: what to capture?
• But is that enough?
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Peer Review At Scale – People
2016:
• 1.5 million papers submitted
• Published 420,000 articles
• 2,500 journals
• 20,000 “level 1” editors
• 60,000 editors
http://senseaboutscience.org/wp-content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf
But are we missing a user?
Machines see things differently than people
From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1.
Thanks Brad Allen
Machines learn things differently than people
Thanks Brad Allen
Models reuse data
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.
and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
Models reuse models
People read what machines say
Towards Automating Data Narratives.
Gil, Y.; and Garijo, D. In Proceedings of the
Twenty-Second ACM International Conference
on Intelligent User Interfaces (IUI-17), Limassol,
Cyprus, 2017.
Lauruhn, Michael, and Paul Groth. "Sources of
Change for Modern Knowledge Organization
Systems." Knowledge Organization 43, no. 8
(2016).
Machines, People, Organizations
Paul Groth, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors, e.g. the mapping-based infobox extractor
  • uses a hand-built ontology based on the 350 most commonly used English-language infoboxes
  • integrates with YAGO
• YAGO relies on Wikipedia + WordNet
  • upper ontology from WordNet, then a mapping to Wikipedia categories based on frequencies
• WordNet is built by psycholinguists
Machines, People, Organizations
An example
All source assessment
• People are sources too: they need modelling and assessment
• Must take into account the entire provenance history, including assessment structures
• Propagation, but also discounting and elevation, are needed for computing assessments (see the sketch after the references below)
• Not just explanation, but decisions
Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W.R. Van and Nottamkandath, A.
2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of
Data and Information Quality. 7, 1–2 (Jan. 2016), 1–28. DOI:https://doi.org/10.1145/2818382.
Ceolin, D., Groth, P. and van Hage, W.R. 2010. Calculating the Trust of Event Descriptions using Provenance. Proceedings of the SWPM 2010 Workshop at the 9th International Semantic Web Conference (ISWC 2010), Nov. 2010.
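As a rough illustration of the propagation, discounting and elevation mentioned above, here is a minimal sketch; the function, its parameters and the numbers are hypothetical, and the cited work uses subjective logic and richer provenance-aware assessment structures.

```python
def assess(source_trust, derivation_steps, discount=0.9, endorsement=None):
    """Propagate trust along a provenance chain, discounting at each
    derivation step; an independent endorsement elevates the result."""
    trust = source_trust * (discount ** derivation_steps)
    if endorsement is not None:
        trust = trust + (1 - trust) * endorsement  # elevation toward 1
    return trust

# A fact derived in 3 steps from a source with reputation 0.8:
print(assess(0.8, 3))                   # discounted only: ~0.58
print(assess(0.8, 3, endorsement=0.5))  # discounted, then elevated: ~0.79
```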
Giant Global Provenance Graph?
Martin Fenner and Amir Aryani “Introducing the PID Graph” March 28, 2019
https://doi.org/10.5438/jwvf-8a66
P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data
Intelligence 2(2020), 78–86. doi: 10.1162/dint_a_00030
• Data reuse through integration/munging/remixing is pervasive
• Knowledge graphs are common and complex
• Our information environments are heterogeneous, deep, intermixed and socially embedded
• Use provenance to help humans and machines perform assessments and make decisions
Conclusion
Contact:
Paul Groth | @pgroth | pgroth.com


Editor's Notes

  • #7 Work with DANS. Reviewed 400 papers, with a deep dive into 114.
  • #17 Effectively = low overhead, low implementation overhead. Functionality: queries, re-execution, abstraction.
  • #21 A big problem for systems capturing provenance is deciding what to capture. For disclosed provenance systems we can apply some methodology to decide what to capture.
  • #22 Disclosed provenance methods require knowledge of application semantics and modification of the application. Observed provenance methods, on the other hand, usually have a high false-positive ratio.
  • #26 Execution Capture happens in real time. Instrumentation is applied on the captured trace to generate provenance information. Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries). Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
  • #27 We implemented our methodology using PANDA.
  • #28 PANDA is based on QEMU. Input includes both executed instructions and data. The RAM snapshot + ND log are enough to accurately replay the whole execution. The ND log consists of inputs to CPU/RAM; other device status is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
  • #29 Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
  • #31 We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
  • #44 We don’t start with a full formal definition but formalize over time from usage