Faculty of Science
Paul Groth | @pgroth | pgroth.com
Oct 29, 2019
Data Provenance Staff Week - Universidad de La Rioja
Thoughts on Knowledge Graphs
& Deeper Provenance
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
The making of data is important
“There is a major, largely unrealised potential to
merge and integrate the data from different
disciplines of science in order to reveal deep
patterns in the multi-facetted complexity that
underlies most of the domains of application that
are intrinsic to the major global challenges that
confront humanity.” – Grand Challenge for
Science
http://dataintegration.codata.org
Committee on Data of the
International Council for Science
(CODATA)
Software 2.0
https://link.medium.com/srrJhEl5bS
“In the 2.0 stack, the programming is done by
accumulating, massaging and cleaning datasets”
Figure 8 Data Science Surveys 2017 & 2018
The making of data is hard
http://www.tamr.com/piketty-revisited-improving-economics-data-science/
NOT JUST DATA SCIENCE
Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019).
Searching Data: A Review of Observational Data Retrieval
Practices. Journal of the Association for Information Science and
Technology. doi:10.1002/asi.24165
Some observations from the @gregory_km survey & interviews:
• The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented.
• Participants require details about data collection and handling.
• Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common.
K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018).
Understanding Data Retrieval Practices: A Social Informatics Perspective.
arXiv preprint arXiv:1801.04971
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Knowledge Graphs for Integration
Knowledge graphs are "graph structured knowledge bases (KBs) which store factual information in form of relationships between entities" (Nickel et al. 2015).
Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review
of Relational Machine Learning for Knowledge Graphs, 1–18.
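As a minimal sketch of the definition above, a knowledge graph can be held as a set of (subject, relation, object) triples and queried by pattern. The entities, relation names, and helper functions below are made up for illustration and do not come from any dataset mentioned in the talk.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical facts, stored as relationships between entities.
GRAPH: Set[Triple] = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "is_a", "drug"),
    ("ibuprofen", "treats", "headache"),
    ("ibuprofen", "is_a", "drug"),
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          r: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in graph:
        if ((s is None or triple[0] == s)
                and (r is None or triple[1] == r)
                and (o is None or triple[2] == o)):
            yield triple


def treatments_for(condition: str) -> list:
    """Return, sorted, every entity with a 'treats' edge to the condition."""
    return sorted(subj for subj, _, _ in match(GRAPH, r="treats", o=condition))
```

Integration then amounts to adding triples from another source to the same set, provided the entity identifiers have been reconciled first.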
Knowledge Graphs: In Science
Frank van Harmelen Adoption of Knowledge Graphs: https://www.slideshare.net/Frank.van.Harmelen/adoption-of-knowledge-graphs-mid-2019
Knowledge Graphs
LARGE SCALE PIPELINES
[Figure: pipeline diagram. Component labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction; Concept Resolution. Scale annotations: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.]
Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel
“Applying Universal Schemas for Domain Specific Ontology Expansion”
5th Workshop on Automated Knowledge Base Construction (AKBC) 2016
Michael Lauruhn, and Paul Groth. "Sources of Change for Modern
Knowledge Organization Systems." Knowledge Organization 43, no. 8
(2016).
Bottlenecks
1. Manual
2. Difficulty in creating flexible reusable workflows
3. Lack of transparency
Paul Groth. "The Knowledge-Remixing Bottleneck." Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
Paul Groth. "Transparency and Reliability in the Data Supply Chain." Internet Computing, IEEE, vol. 17, no. 2, pp. 69-71, March-April 2013. doi: 10.1109/MIC.2013.41
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
TRANSPARENCY
PROVENANCE
• Where and how was this data or document produced?
• Data Provenance is “a record that describes the people,
institutions, entities, and activities involved in producing,
influencing, or delivering a piece of data” – W3C
Provenance Recommendation
• Central issues:
• Data workflows go beyond single systems
• How do you capture this information
effectively?
• What functionality can the provenance
support?
From: https://www.w3.org/TR/prov-primer/
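To make the W3C definition concrete, here is a minimal sketch of a provenance record that uses the PROV relation names `wasGeneratedBy`, `used`, and `wasAssociatedWith`. The identifiers (`chart1`, `plot1`, `alice`, and so on) and the helper functions are hypothetical, and plain Python tuples stand in for a real PROV serialization.

```python
# (subject, PROV relation, object) statements; all identifiers are hypothetical.
PROV = [
    ("chart1", "wasGeneratedBy", "plot1"),
    ("plot1", "used", "dataset1"),
    ("plot1", "wasAssociatedWith", "alice"),
    ("dataset1", "wasGeneratedBy", "survey1"),
    ("survey1", "wasAssociatedWith", "stats_office"),
]


def objects(subject, relation):
    """Return every object linked to the subject by the given relation."""
    return [o for s, r, o in PROV if s == subject and r == relation]


def how_produced(entity):
    """Walk backwards from an entity and list the activities, responsible
    agents, and inputs that led to it (breadth-first)."""
    steps, frontier, seen = [], [entity], set()
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue
        seen.add(current)
        for activity in objects(current, "wasGeneratedBy"):
            inputs = objects(activity, "used")
            steps.append({
                "entity": current,
                "activity": activity,
                "agents": objects(activity, "wasAssociatedWith"),
                "inputs": inputs,
            })
            frontier.extend(inputs)
    return steps
```

Calling `how_produced("chart1")` answers the "where and how was this produced?" question by returning one step per generating activity, from the chart back to the original survey.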
DATA PROVENANCE INTEROPERABILITY
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The
rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide
Web 35 (2015): 235-257.
Luc Moreau and Paul Groth. "Provenance: an introduction to Prov." Synthesis Lectures
on the Semantic Web: Theory and Technology 3.4 (2013): 1-129.
Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for
provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39-56.
OPENPHACTS PROVENANCE
Select one of the activities in the PROV graph
Entities and Activities are sized according to information flow
Missing type information is automatically inferred
Embed the generated visualisation in your own webpage
What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau.
PrIMe: A methodology for developing provenance-aware applications.
ACM Transactions on Software Engineering and Methodology, 20(3), 2011.
State of the art
• Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual effort
• Observed Provenance
– False positives
– Semantic gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko ‘12)
Trio (Widom ‘09)
Wings (Gil ‘11)
Taverna (Oinn ‘06)
VisTrails (Freire ‘06)
ES3 (Frew ‘08)
Trec (Vahdat ‘98)
PASSv2 (Holland ‘08)
DTrace Tool (Gessiou ‘12)
ProvChain (Liang ’17)
Titian (Interlandi ‘15)
DiffDataFlow (Chothia ‘16)
What if we missed something?
Disclosed provenance systems:
• Re-apply methodology (e.g. PrIMe), produce new application version.
• Time consuming.
Observed provenance systems:
• Update the applied instrumentation.
• Instrumentation becomes progressively more intense.
Provenance is Post-Hoc
Re-execution
Common tactic in disclosed provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• Desktop: Excel (Asuncion ‘11)
Can we extend this idea to observed provenance systems?
Faster Capture: Record & Replay
PROV 2R: Practical Provenance Analysis of Unstructured Processes
M Stamatogiannakis, E Athanasopoulos, H Bos, P Groth (2017)
ACM Transactions on Internet Technology (TOIT) 17 (4), 37
Methodology
[Figure: methodology diagram with stages labeled Selection, Provenance analysis, Instrumentation, Execution Capture]
Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged → can’t “go-live”.
[Figure: PANDA execution trace. Labels: CPU, RAM, Input, Interrupt, DMA, Initial RAM Snapshot, Non-determinism log]
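The record-and-replay idea can be sketched in a few lines: while recording, only the initial state and the non-deterministic inputs are logged, and the expensive analysis is attached later, during a deterministic replay. This is a toy illustration in plain Python and not PANDA's API; `run`, `record`, `replay`, and the observer callback are all hypothetical names.

```python
import random


def run(state, next_input, steps, observer=None):
    """Deterministic 'machine': each step folds one input into the state."""
    for step in range(steps):
        value = next_input()
        state = (state * 31 + value) % 1_000_003
        if observer is not None:
            observer(step, value, state)  # analysis hook, only used on replay
    return state


def record(initial_state, steps, seed):
    """Run once without analysis, logging the initial state and all inputs."""
    rng = random.Random(seed)
    log = []

    def live_input():
        value = rng.randrange(256)  # stands in for interrupts, device input, DMA
        log.append(value)
        return value

    final_state = run(initial_state, live_input, steps)
    return {"initial_state": initial_state, "inputs": log}, final_state


def replay(trace, observer=None):
    """Re-run from the logged snapshot, feeding back the logged inputs."""
    inputs = iter(trace["inputs"])
    return run(trace["initial_state"], lambda: next(inputs),
               len(trace["inputs"]), observer)
```

Because the replay reproduces the recorded run exactly, different observers can be attached on later replays without re-running the original workload, which is what makes it possible to decide after the fact what provenance to collect.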
Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: analysis architecture. Labels: PANDA Execution Trace, PANDA (CPU, RAM), Plugin A, Plugin B, Plugin C, Triple Store]
[Figure: PROV core model. Classes: Entity, Activity, Agent. Relations: used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf. Properties: startedAtTime and endedAtTime (xsd:dateTime)]
Enabling more detail
Application
• Observed provenance systems treat programs as
black-boxes
• Can’t tell if an input file was actually used
• Can’t quantify the influence of input to output
DATA TRACKER
• Captures high-fidelity provenance using Taint Tracking
• Key building blocks:
• libdft (Kemerlis ‘12) ➞ Reusable taint-tracking framework
• Intel Pin (Luk ‘05) ➞ Dynamic instrumentation framework
• Does not require modification of applications
• Does not require knowledge of application semantics
Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box:
capturing data provenance using dynamic instrumentation." In International Provenance and
Annotation Workshop (IPAW’14), pp. 155-167. 2014.
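A toy sketch of the taint-tracking idea: every value carries the set of input sources it was derived from, and each operation propagates the union of those sets to its result. This is a deliberate simplification in plain Python and not how libdft or Intel Pin work internally (they instrument binaries at the instruction level); the `Tainted` class, `read_input`, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tainted:
    """A value plus the names of the inputs that influenced it."""
    value: int
    sources: frozenset = frozenset()

    def __add__(self, other):
        # Untagged constants are wrapped with an empty source set.
        other = other if isinstance(other, Tainted) else Tainted(other)
        return Tainted(self.value + other.value, self.sources | other.sources)


def read_input(name, value):
    """Tag a freshly read value with the name of its source."""
    return Tainted(value, frozenset({name}))


a = read_input("a.csv", 10)
b = read_input("b.csv", 5)
c = read_input("c.csv", 99)  # opened by the program but never used in the result

total = a + b + 1
```

A black-box observer would report all three files as inputs of `total`; the propagated tags show that only `a.csv` and `b.csv` actually influenced it.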
Systems provenance
• Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas
Moyer. Trustworthy Whole-System Provenance for the
Linux Kernel. USENIX Security Symposium (SECURITY),
August 2015.
Is this enough?
• We can capture ever more provenance
• Still the question: what to capture?
• But is that enough?
OUTLINE
1. The problem
2. Knowledge Graphs
3. Transparency through data provenance
4. Assessment for human-machine teams
Peer Review At Scale – People
2016:
• 1.5 million papers submitted
• Published 420,000 articles
• 2,500 journals
• 20,000 “level 1” editors
• 60,000 editors
http://senseaboutscience.org/wp-content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf
But are we missing a user?
Machines see things differently than people
From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1.
Thanks Brad Allen
Machines learn things differently than people
Thanks Brad Allen
Models reuse data
From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.
and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
Models reuse models
People read what machines say
Towards Automating Data Narratives.
Gil, Y.; and Garijo, D. In Proceedings of the
Twenty-Second ACM International Conference
on Intelligent User Interfaces (IUI-17), Limassol,
Cyprus, 2017.
Lauruhn, Michael, and Paul Groth. "Sources of
Change for Modern Knowledge Organization
Systems." Knowledge Organization 43, no. 8
(2016).
Machines, People, Organizations
Groth, Paul. "The Knowledge-Remixing Bottleneck." Intelligent Systems, IEEE, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
• 10 different extractors
• E.g. mapping-based infobox extractor
• Infobox uses a hand-built ontology based on the 350 most commonly used English language infoboxes
• Integrates with YAGO
• YAGO relies on Wikipedia + WordNet
• Upper ontology from WordNet and then a mapping to Wikipedia categories based on frequencies
• WordNet is built by psycholinguists
Machines, People, Organizations
An example
All source assessment
• People are sources too – need modelling and assessment
• Must take into account the entire provenance history including assessment structures
• Propagation but also discounting and elevation are needed for computation of assessment
• Not just explanation, but decisions
Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W.R. Van and Nottamkandath, A.
2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of
Data and Information Quality. 7, 1–2 (Jan. 2016), 1–28. DOI:https://doi.org/10.1145/2818382.
Ceolin, D., Groth, P. and Hage, W.R. Van 2010. Calculating the Trust of Event Descriptions
using Provenance. Proceedings Of The SWPM 2010, Workshop At The 9th International
Semantic Web Conference, ISWC-2010 (Nov. 2010).
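One way to read "propagation, discounting and elevation" is as operations on a score that flows along provenance edges. The sketch below is purely illustrative and is not the method of Ceolin et al.; the scoring rule, the graph, the reputations, and the constants `DISCOUNT` and `BOOST` are all assumptions.

```python
# Hypothetical provenance: what each artifact was derived from and who made it.
DERIVED_FROM = {
    "report": ["table"],
    "table": ["raw_a", "raw_b"],
    "raw_a": [],
    "raw_b": [],
}
ATTRIBUTED_TO = {
    "report": "analyst",        # a person
    "table": "extractor_bot",   # a machine
    "raw_a": "lab",
    "raw_b": "crowd",
}
REPUTATION = {"analyst": 0.9, "extractor_bot": 0.8, "lab": 1.0, "crowd": 0.5}
REVIEWED = {"table"}  # artifacts endorsed by an assessment structure (e.g. review)

DISCOUNT = 0.9  # assumed loss of confidence per derivation step
BOOST = 0.5     # assumed share of the remaining doubt removed by a review


def assess(artifact):
    """Score an artifact from its source's reputation and its full history."""
    score = REPUTATION[ATTRIBUTED_TO[artifact]]
    inputs = DERIVED_FROM[artifact]
    if inputs:
        # Propagation with discounting: no better than the weakest input,
        # reduced once more for the derivation step itself.
        score *= DISCOUNT * min(assess(i) for i in inputs)
    if artifact in REVIEWED:
        # Elevation: an endorsement closes part of the gap to full trust.
        score += (1.0 - score) * BOOST
    return score
```

With these made-up numbers the reviewed table scores 0.68 and the report derived from it about 0.55, which could then feed a decision threshold rather than only an explanation.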
Giant Global Provenance Graph?
Martin Fenner and Amir Aryani “Introducing the PID Graph” March 28, 2019
https://doi.org/10.5438/jwvf-8a66
P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data
Intelligence 2(2020), 78–86. doi: 10.1162/dint_a_00030
Conclusion
• Data reuse through integration/munging/remixing is pervasive
• Knowledge graphs are common and complex
• Our information environments are heterogeneous, deep, intermixed and socially embedded
• Use provenance to help humans and machines perform assessments and make decisions
Contact:
Paul Groth | @pgroth | pgroth.com

More Related Content

What's hot

Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph FuturesPaul Groth
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
Machines are people too
Machines are people tooMachines are people too
Machines are people tooPaul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data ShowcasingPaul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the WebRinke Hoekstra
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataRinke Hoekstra
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationRinke Hoekstra
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Rinke Hoekstra
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Sören Auer
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphSören Auer
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 

What's hot (20)

Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data Showcasing
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
Knowledge Representation on the Web
Knowledge Representation on the WebKnowledge Representation on the Web
Knowledge Representation on the Web
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
 
Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...Describing Scholarly Contributions semantically with the Open Research Knowle...
Describing Scholarly Contributions semantically with the Open Research Knowle...
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Towards an Open Research Knowledge Graph
Towards an Open Research Knowledge GraphTowards an Open Research Knowledge Graph
Towards an Open Research Knowledge Graph
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 

Similar to Thoughts on Knowledge Graphs & Deeper Provenance

Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedPhilip Bourne
 
Tragedy of the (Data) Commons
Tragedy of the (Data) CommonsTragedy of the (Data) Commons
Tragedy of the (Data) CommonsJames Hendler
 
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsReal-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsAnita de Waard
 
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Carole Goble
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsJason Hattrick-Simpers
 
Data Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangeData Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangePhilip Bourne
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?LEARN Project
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityBarry Smith
 
Biomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AloneBiomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AlonePhilip Bourne
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridIan Foster
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8Scott Edmunds
 
What Data Science Will Mean to You - One Person's View
What Data Science Will Mean to You - One Person's ViewWhat Data Science Will Mean to You - One Person's View
What Data Science Will Mean to You - One Person's ViewPhilip Bourne
 
AI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data ScienceAI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data SciencePhilip Bourne
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
One View of Data Science
One View of Data ScienceOne View of Data Science
One View of Data SciencePhilip Bourne
 
Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)James Hendler
 

Similar to Thoughts on Knowledge Graphs & Deeper Provenance (20)

Data Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has ChangedData Science and AI in Biomedicine: The World has Changed
Data Science and AI in Biomedicine: The World has Changed
 
Tragedy of the (Data) Commons
Tragedy of the (Data) CommonsTragedy of the (Data) Commons
Tragedy of the (Data) Commons
 
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data EcosystemsReal-World Data Challenges: Moving Towards Richer Data Ecosystems
Real-World Data Challenges: Moving Towards Richer Data Ecosystems
 
Information entanglement
Information entanglementInformation entanglement
Information entanglement
 
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
Trust and Accountability: experiences from the FAIRDOM Commons Initiative.
 
Hattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in MaterialsHattrick-Simpers MRS Webinar on AI in Materials
Hattrick-Simpers MRS Webinar on AI in Materials
 
Data Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything ChangeData Science Meets Biomedicine, Does Anything Change
Data Science Meets Biomedicine, Does Anything Change
 
Lecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptxLecture_1_Intro_toDS&AI.pptx
Lecture_1_Intro_toDS&AI.pptx
 
Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?Open Data in a Big Data World: easy to say, but hard to do?
Open Data in a Big Data World: easy to say, but hard to do?
 
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityOntology Tutorial: Semantic Technology for Intelligence, Defense and Security
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security
 
Biomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not AloneBiomedical Data Science: We Are Not Alone
Biomedical Data Science: We Are Not Alone
 
10 problems 06
10 problems 0610 problems 06
10 problems 06
 
Aaas Data Intensive Science And Grid
Aaas Data Intensive Science And GridAaas Data Intensive Science And Grid
Aaas Data Intensive Science And Grid
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8HKU Data Curation MLIM7350 Class 8
HKU Data Curation MLIM7350 Class 8
 
What Data Science Will Mean to You - One Person's View
What Data Science Will Mean to You - One Person's ViewWhat Data Science Will Mean to You - One Person's View
What Data Science Will Mean to You - One Person's View
 
AI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data ScienceAI from the Perspective of a School of Data Science
AI from the Perspective of a School of Data Science
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science Services
 
One View of Data Science
One View of Data ScienceOne View of Data Science
One View of Data Science
 
Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)Tragedy of the Data Commons (ODSC-East, 2021)
Tragedy of the Data Commons (ODSC-East, 2021)
 

More from Paul Groth

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphPaul Groth
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsPaul Groth
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
Are we finally ready for transclusion?*
Are we finally ready for transclusion?*Are we finally ready for transclusion?*
Are we finally ready for transclusion?*Paul Groth
 
Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsPaul Groth
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialPaul Groth
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersPaul Groth
 
Tradeoffs in Automatic Provenance Capture
Tradeoffs in Automatic Provenance CaptureTradeoffs in Automatic Provenance Capture
Tradeoffs in Automatic Provenance CapturePaul Groth
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaPaul Groth
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at ElsevierPaul Groth
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 

Thoughts on Knowledge Graphs & Deeper Provenance

  • 1. Faculty of Science Paul Groth | @pgroth | pgroth.com Oct 29, 2019 Data Provenance Staff Week - Universidad de La Rioja Thoughts on Knowledge Graphs & Deeper Provenance
  • 2. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  • 3. The making of data is important “There is a major, largely unrealised potential to merge and integrate the data from different disciplines of science in order to reveal deep patterns in the multi-facetted complexity that underlies most of the domains of application that are intrinsic to the major global challenges that confront humanity.” – Grand Challenge for Science http://dataintegration.codata.org Committee on Data of the International Council for Science (CODATA)
  • 4. Software 2.0 https://link.medium.com/srrJhEl5bS “In the 2.0 stack, the programming is done by accumulating, massaging and cleaning datasets” Figure 8 Data Science Surveys 2017 & 2018 The making of data is hard
  • 6. NOT JUST DATA SCIENCE Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A., & Wyatt, S. (2019). Searching Data: A Review of Observational Data Retrieval Practices. Journal of the Association for Information Science and Technology. doi:10.1002/asi.24165 Some observations from @gregory_km's survey & interviews: • The needs and behaviors of specific user groups (e.g. early career researchers, policy makers, students) are not well documented. • Participants require details about data collection and handling. • Reconstructing data tables from journal articles, using general search engines, and making direct data requests are common. K Gregory, H Cousijn, P Groth, A Scharnhorst, S Wyatt (2018). Understanding Data Retrieval Practices: A Social Informatics Perspective. arXiv preprint arXiv:1801.04971
  • 7. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  • 8. Knowledge Graphs for Integration Knowledge graphs are "graph structured knowledge bases (KBs) which store factual information in form of relationships between entities" (Nickel et al. 2015). Nickel, M., Murphy, K., Tresp, V., & Gabrilovich, E. (2015). A Review of Relational Machine Learning for Knowledge Graphs, 1–18.
  • 10. Frank van Harmelen, Adoption of Knowledge Graphs: https://www.slideshare.net/Frank.van.Harmelen/adoption-of-knowledge-graphs-mid-2019
  • 12. LARGE SCALE PIPELINES [Pipeline diagram. Labels: Content; Universal schema; Surface form relations; Structured relations; Factorization model; Matrix Construction; Open Information Extraction; Entity Resolution; Matrix Factorization; Knowledge graph; Curation; Predicted relations; Matrix Completion; Taxonomy; Triple Extraction; Concept Resolution. Figures shown: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.] Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion.” 5th Workshop on Automated Knowledge Base Construction (AKBC) 2016. Michael Lauruhn and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  • 13. Bottlenecks 1. Manual 2. Difficulty in creating flexible reusable workflows 3. Lack of transparency. Paul Groth, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol.28, no.5, pp.44,48, Sept.-Oct. 2013 doi: 10.1109/MIS.2013.138 Paul Groth, "Transparency and Reliability in the Data Supply Chain," Internet Computing, IEEE, vol.17, no.2, pp.69,71, March-April 2013 doi: 10.1109/MIC.2013.41
  • 14. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  • 16. PROVENANCE • Where and how was this data or document produced? • Data Provenance is “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data” – W3C Provenance Recommendation • Central issues: • Data workflows go beyond single systems • How do you capture this information effectively? • What functionality can the provenance support? From: https://www.w3.org/TR/prov-primer/
  • 17. DATA PROVENANCE INTEROPERABILITY Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. "The rationale of PROV." Web Semantics: Science, Services and Agents on the World Wide Web 35 (2015): 235-257. Luc Moreau and Paul Groth. "Provenance: an introduction to Prov." Synthesis Lectures on the Semantic Web: Theory and Technology 3.4 (2013): 1-129. Paul Groth, Yolanda Gil, James Cheney, and Simon Miles. "Requirements for provenance on the web." International Journal of Digital Curation 7, no. 1 (2012): 39-56.
  • 19. Select one of the activities in the PROV graph. Entities and Activities are sized according to information flow. Missing type information is automatically inferred. Embed the generated visualisation in your own webpage.
  • 20. What to capture? Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20 (3), 2011.
  • 21. State of the art • Disclosed Provenance: + Accuracy + High-level semantics – Intrusive – Manual effort • Observed Provenance: + Non-intrusive + Minimal manual effort – False positives – Semantic gap • Systems: CPL (Macko ’12), Trio (Widom ’09), Wings (Gil ’11), Taverna (Oinn ’06), VisTrails (Freire ’06), ES3 (Frew ’08), Trec (Vahdat ’98), PASSv2 (Holland ’08), DTrace Tool (Gessiou ’12), ProvChain (Liang ’17), Titian (Interlandi ’15), DiffDataFlow (Chothia ’16)
  • 22. What if we missed something? Provenance is post-hoc. Disclosed provenance systems: • Re-apply methodology (e.g. PrIMe), produce new application version. • Time consuming. Observed provenance systems: • Update the applied instrumentation. • Instrumentation becomes progressively more intense.
  • 23. Re-execution Common tactic in disclosed provenance: • DB: Reenactment queries (Glavic ‘14) • DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12) • Workflows: Pegasus (Groth ‘09) • PL: Slicing (Perera ‘12) • Desktop: Excel (Asuncion ‘11) Can we extend this idea to observed provenance systems?
  • 24. Faster Capture: Record & Replay. PROV2R: Practical Provenance Analysis of Unstructured Processes. M Stamatogiannakis, E Athanasopoulos, H Bos, P Groth (2017). ACM Transactions on Internet Technology (TOIT) 17 (4), 37
  • 26. Prototype Implementation • PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14). • Based on the QEMU virtualization platform.
  • 27. Prototype Implementation (2/3) • PANDA logs self-contained execution traces. – An initial RAM snapshot. – Non-deterministic inputs. • Logging happens at virtual CPU I/O ports. – Virtual device state is not logged, so the trace can’t “go live”. [Diagram labels: PANDA; CPU; RAM; Input; Interrupt; DMA; Initial RAM Snapshot; Non-determinism log; PANDA Execution Trace]
  • 28. Prototype Implementation (3/3) • Analysis plugins – Read-only access to the VM state. – Invoked per instruction, memory access, context switch, etc. – Can be combined to implement complex functionality. – OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking • Debian Linux guest. • Provenance stored as PROV/RDF triples, queried with SPARQL. [Diagram labels: PANDA Execution Trace; PANDA; CPU; RAM; Plugin A; Plugin B; Plugin C; Triple Store. PROV terms shown: Entity, Activity, Agent; used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAssociatedWith, wasAttributedTo, actedOnBehalfOf; startedAtTime and endedAtTime (xsd:dateTime)]
  • 29. Enabling more detail • Observed provenance systems treat programs as black boxes. • Can’t tell if an input file was actually used. • Can’t quantify the influence of input to output.
  • 30. DATA TRACKER • Captures high-fidelity provenance using Taint Tracking • Key building blocks: • libdft (Kemerlis ‘12) ➞ Reusable taint-tracking framework • Intel Pin (Luk ‘05) ➞ Dynamic instrumentation framework • Does not require modification of applications • Does not require knowledge of application semantics Stamatogiannakis, Manolis, Paul Groth, and Herbert Bos. "Looking inside the black-box: capturing data provenance using dynamic instrumentation." In International Provenance and Annotation Workshop (IPAW’14), pp. 155-167. 2014.
  • 31. Systems provenance • Adam Bates, Dave Tian, Kevin R.B. Butler, and Thomas Moyer. Trustworthy Whole-System Provenance for the Linux Kernel. USENIX Security Symposium (SECURITY), August 2015.
  • 32. Is this enough? • We can capture ever more provenance • Still the question: what to capture? • But is that enough?
  • 33. OUTLINE 1. The problem 2. Knowledge Graphs 3. Transparency through data provenance 4. Assessment for human-machine teams
  • 34. Peer Review At Scale – People. 2016: • 1.5 million papers submitted • Published 420,000 articles • 2,500 journals • 20,000 “level 1” editors • 60,000 editors http://senseaboutscience.org/wp-content/uploads/2016/09/peer-review-the-nuts-and-bolts.pdf
  • 35. But are we missing a user?
  • 36. Machines see things differently than people. From: Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv:1610.01644v1. Thanks Brad Allen
  • 37. Machines learn things differently than people. Thanks Brad Allen
  • 38. Models reuse data. From: Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B. and Vijayanarasimhan, S. YouTube-8M: a large-scale video classification benchmark. arXiv:1609.08675.
  • 40. People read what machines say. Towards Automating Data Narratives. Gil, Y.; and Garijo, D. In Proceedings of the Twenty-Second ACM International Conference on Intelligent User Interfaces (IUI-17), Limassol, Cyprus, 2017.
  • 41. Machines, People, Organizations. Lauruhn, Michael, and Paul Groth. "Sources of Change for Modern Knowledge Organization Systems." Knowledge Organization 43, no. 8 (2016).
  • 42. Machines, People, Organizations • 10 different extractors • E.g. mapping-based infobox extractor • Infobox uses a hand-built ontology based on the 350 commonly used English language infoboxes • Integrates with Yago • Yago relies on Wikipedia + WordNet • Upper ontology from WordNet and then a mapping to Wikipedia categories based on frequencies • WordNet is built by psycholinguists. Groth, Paul, "The Knowledge-Remixing Bottleneck," Intelligent Systems, IEEE, vol.28, no.5, pp.44,48, Sept.-Oct. 2013 doi: 10.1109/MIS.2013.138
  • 44. All source assessment • People are sources too – need modelling and assessment • Must take into account the entire provenance history including assessment structures • Propagation but also discounting and elevation are needed for computation of assessment • Not just explanation, but decisions. Ceolin, D., Groth, P., Maccatrozzo, V., Fokkink, W., Hage, W.R. Van and Nottamkandath, A. 2016. Combining User Reputation and Provenance Analysis for Trust Assessment. Journal of Data and Information Quality. 7, 1–2 (Jan. 2016), 1–28. DOI:https://doi.org/10.1145/2818382. Ceolin, D., Groth, P. and Hage, W.R. Van 2010. Calculating the Trust of Event Descriptions using Provenance. Proceedings Of The SWPM 2010, Workshop At The 9th International Semantic Web Conference, ISWC-2010 (Nov. 2010).
  • 45. Giant Global Provenance Graph? Martin Fenner and Amir Aryani “Introducing the PID Graph” March 28, 2019 https://doi.org/10.5438/jwvf-8a66 P. Groth, H. Cousijn, T. Clark & C. Goble. FAIR data reuse – the path through data citation. Data Intelligence 2(2020), 78–86. doi: 10.1162/dint_a_00030
  • 46. Conclusion • Data reuse through integration/munging/remixing is pervasive • Knowledge graphs are common and complex • Our information environments are heterogeneous, deep, intermixed and socially embedded • Use provenance to help humans and machines perform assessments and make decisions. Contact: Paul Groth | @pgroth | pgroth.com
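
Slide 28 lists the PROV relations (used, wasGeneratedBy, wasAssociatedWith and so on) and says provenance is kept as PROV/RDF triples and queried with SPARQL. As a rough illustration only, the sketch below uses plain Python tuples in place of a triple store and a small graph traversal in place of a SPARQL query. The file, activity and agent names are hypothetical and do not come from the talk.

```python
from collections import defaultdict

# Hypothetical provenance records as (subject, predicate, object) triples.
# Only the predicate names come from the PROV vocabulary on slide 28.
TRIPLES = [
    ("report.pdf", "wasGeneratedBy", "render"),
    ("render", "used", "table.csv"),
    ("table.csv", "wasGeneratedBy", "clean"),
    ("clean", "used", "raw.csv"),
    ("clean", "wasAssociatedWith", "alice"),
]


def index(triples):
    """Group objects by (subject, predicate) for quick lookup."""
    idx = defaultdict(list)
    for subject, predicate, obj in triples:
        idx[(subject, predicate)].append(obj)
    return idx


def upstream_entities(entity, triples):
    """Return every entity that `entity` transitively depends on.

    Follows entity -wasGeneratedBy-> activity -used-> entity chains.
    """
    idx = index(triples)
    seen, stack = set(), [entity]
    while stack:
        current = stack.pop()
        for activity in idx[(current, "wasGeneratedBy")]:
            for source in idx[(activity, "used")]:
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
    return seen
```

Calling `upstream_entities("report.pdf", TRIPLES)` walks back through both activities and collects the table and the raw file. In a real deployment the same question would more likely be expressed as a SPARQL property path over the stored PROV graph.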

Editor's Notes

  1. Work with DANS. Reviewed 400 papers; deep dive on 114.
  2. Effectively = low overhead, low implementation overhead. Queries, re-execution, abstraction.
  3. A big problem for systems capturing provenance is deciding what to capture. For disclosed provenance systems we can apply some methodology to decide what to capture.
  4. Disclosed provenance methods require knowledge of application semantics and modification of the application. On the other hand, observed provenance methods usually have a high false positive ratio.
  5. Execution Capture: happens in real time. Instrumentation: applied on the captured trace to generate provenance information. Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries). Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
  6. We implemented our methodology using PANDA.
  7. PANDA is based on QEMU. Input includes both executed instructions and data. RAM snapshot + ND log are enough to accurately replay the whole execution. The ND log consists of inputs to CPU/RAM; other device status is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
  8. Note: Technically, plugins can modify VM state. However, this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
  9. We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
  10. We don’t start with a full formal definition but formalize over time from usage
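
Editor's note 5 describes a four-stage loop: capture the execution once, instrument the captured trace, analyse the resulting provenance, then select a subset and replay it with heavier instrumentation. The toy sketch below illustrates that loop under strong simplifying assumptions: the trace is a plain Python list, and the event fields and both "plugins" are hypothetical stand-ins rather than PANDA's actual API or trace format.

```python
# Toy model: a trace recorded once at run time (the "execution capture" stage).
# Event fields are made up for illustration.
TRACE = [
    {"step": 0, "pid": 1, "op": "open", "file": "in.txt"},
    {"step": 1, "pid": 1, "op": "read", "file": "in.txt", "data": "abc"},
    {"step": 2, "pid": 2, "op": "open", "file": "other.txt"},
    {"step": 3, "pid": 1, "op": "write", "file": "out.txt", "data": "abc"},
]


def replay(trace, plugin, select=lambda event: True):
    """Replay the recorded trace, running `plugin` on the selected events."""
    return [result
            for event in trace if select(event)
            for result in plugin(event)]


def coarse_plugin(event):
    # Cheap instrumentation: only which process touched which file.
    yield (event["pid"], event["op"], event["file"])


def detailed_plugin(event):
    # More intensive instrumentation: also record the data that moved.
    if "data" in event:
        yield (event["pid"], event["op"], event["file"], event["data"])


# Instrumentation, pass 1: coarse provenance over the whole trace.
coarse = replay(TRACE, coarse_plugin)

# Analysis and selection: find the process that wrote out.txt.
writers = {pid for pid, op, name in coarse
           if op == "write" and name == "out.txt"}

# Instrumentation, pass 2: replay only that process with the heavier plugin.
detailed = replay(TRACE, detailed_plugin, lambda e: e["pid"] in writers)
```

The point of the design is that the expensive plugin never runs during the original execution: the trace is recorded once, and each later pass can afford more detail because it looks at a smaller slice.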