SlideShare a Scribd company logo
1 of 24
Decoupling Provenance
Capture and Analysis from
Execution
Manolis Stamatogiannakis (@mstamat)
Paul Groth (@pgroth)
Herbert Bos
Stamatogiannakis, M., Groth, P., & Bos, H. (2014). Looking
inside the black-box: Capturing data provenance using
dynamic instrumentation. In Provenance and Annotation of
Data and Processes (pp. 155-167). Springer International
Publishing.
http://bit.ly/dtracker-demo
2
Capturing Provenance
Disclosed Provenance
+ Accuracy
+ High-level semantics
– Intrusive
– Manual Effort
Observed Provenance
– False positives
– Semantic Gap
+ Non-intrusive
+ Minimal manual effort
CPL (Macko ‘12)
Trio (Widom ‘09)
PrIME (Miles ‘09)
Taverna (Oinn ‘06)
VisTrails (Fraire ‘06)
ES3 (Frew ‘08)
Trec (Vahdat ‘98)
PASSv2 (Holland ‘08)
DTrace Tool (Gessiou ‘12)
3
What to capture?
Miles, Simon, Groth, Paul, Munroe, Steve and Moreau, Luc
(2011) PrIMe: a methodology for developing provenance-aware
applications. ACM Transactions on Software Engineering and
Methodology, 20, (3), 8:1-8:42.
4
Provenance is post hoc.
Aim:
Eliminate the need for developers to know
what provenance needs to be captured.
5
Re-execution
Common tactic in provenance:
• DB: Reenactment queries (Glavic ‘14)
• DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13),
DistTape (Zhao ‘12)
• Workflows: Pegasus (Groth ‘09)
• PL: Slicing (Perera ‘12)
• OS: pTrace (Guo ‘11)
• Desktop: Excel (Asuncion ‘11)
6
http://www.androidreran.com
7
Methodology
Selection
Provenance analysis
Instrumentation
Execution Capture
8
• PANDA: an open-source Platform for Architecture-
Neutral Dynamic Analysis. (Dolan-Gavitt et al. ‘14)
• Based on QEMU virtualization platform.
• Logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
Prototype Implementation
(1/2)
PANDA
CPU RAM
Input
Interrupt
DMA Initial RAM
Snapshot
Non-
determinism
log
RAM
PANDA Execution Trace
9
Prototype Implementation
(2/2)
• Debian Linux guest.
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch.
• Provenance stored PROV/RDF triples, queried with
SPARQL.
PANDA
Executio
n Trace
PANDA
Triple
Store
Plugin APlugin C
Plugin B
CPU
RAM
10
used
endedAtTime
wasAssociatedWith
actedOnBehalfOf
wasGeneratedBy
wasAttributedTo
wasDerivedFrom
wasInformedBy
Activity
Entity
Agent
xsd:dateTime
startedAtTime
xsd:dateTime
OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from
the hardware state (RAM/registers).
11
Introspecting Kernel Structures
(1/2)
• Kernel structure
members are known.
• Their offsets depend on
compile-time
configuration.
• Each linux vendor
supplies a few different
kernel configurations.
• Rule of thumb: same
vendor/configuration/ve
rsion combo  same
offsets.
12
Introspecting Kernel Structures
(2/2)
• Offset profile created once for
each kernel “family”.
• E.g. one profile for all
Debian/amd64/3.2.* kernels.
• The profile is used by osi_linux
module to extract process info
from the execution trace.
Kernel
offset
profile
Live OS
Probe
Modul
e
Proces
s Info
PANDA
Replay VM
osi_linux
PAND
A
Trace
Monitored
OS
Pointer arithmetic:
struct task_struct t;
int off = &(t.memb) – &t;
printk(“off: %d”, off);
13
The PROV-Tracer Plugin
• Registers for process creation/destruction
events.
• Decodes executed system calls.
• Keeps track of what files are used as
input/output by each process.
• Emits provenance in an intermediate
format when a process terminates.
14
More Analysis Plugins
• ProcStrMatch plugin.
– Which processes contained string S in their
memory?
• Other possible types of analysis:
– Taint tracking
– Dynamic slicing
15
Execution Overhead (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional
1.1x – 1.2x slowdown.
Virtualization is the dominant overhead
factor.
16
Execution Overhead (2/2)
• QEMU is a suboptimal virtualization
option.
• ReVirt – User Mode Linux (Dunlap et al. ‘02)
– Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMWare (Xu et al. ‘07)
– Slowdown: 1.05x-2.6x rec. + ??? virt.
Virtualization slowdown is considered
acceptable.
Recording overhead is fairly low. 17
Storage Requirements
• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt et al. ’14):
– 17-915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage
storage requirements.
– Compression, HD rotation, VM snapshots.
• 24/7 recording seems within limits of todays’
technology.
18
An Example (1)
<exe://pam-foreground-~3451> prov:endedAtTime 199090196 .
<exe://getent~3451> a prov:Activity .
<exe://getent~3451> rdf:type dt:getent .
<exe://cut~3452> a prov:Activity .
<exe://cut~3452> rdf:type dt:cut .
<file:/etc/nsswitch.conf> a prov:Entity .
<file:/etc/nsswitch.conf> rdfs:label "/etc/nsswitch.conf" .
<file:/etc/nsswitch.conf> rdf:type dt:Unknown .
<exe://getent~3451> prov:used <file:/etc/nsswitch.conf> .
# unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288
<exe://getent~3451> prov:startedAtTime 199090196 .
<exe://getent~3451> prov:endedAtTime 200392668 .
<file:FD0_3452> a prov:Entity .
<file:FD0_3452> rdfs:label "FD0_3452"
19
An Example (2
20
Example (3)
21
Example (4)
22
Conclusion
• Decouple capture/analysis from execution
• VMs provide a useful indirection
mechanism
• Future:
– More plugins
– Real world analysis (rrshare.org)
– Cloud analysis
• Are traces/immutable logs primitive?
23
Source & Text
• PROV-Tracer source:
– https://github.com/m000/panda/tree/prov_tracer/
– Plugins under qemu/panda_plugins.
• Paper:
– http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_I_3.p
df
24

More Related Content

What's hot

Skydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSkydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSylvain Afchain
 
JCache / JSR107 shortcomings
JCache / JSR107  shortcomingsJCache / JSR107  shortcomings
JCache / JSR107 shortcomingscruftex
 
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cruftex
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_partyOpen Party
 
Dpdk accelerated Ostinato
Dpdk accelerated OstinatoDpdk accelerated Ostinato
Dpdk accelerated Ostinatopstavirs
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systemsVsevolod Stakhov
 
Skydive, real-time network analyzer
Skydive, real-time network analyzer Skydive, real-time network analyzer
Skydive, real-time network analyzer Sylvain Afchain
 
Practical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingPractical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingLubomir Rintel
 
Continuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnitContinuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnitScyllaDB
 
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]Ostinato - Craft Packets, Generate Traffic [SharkFest '20]
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]pstavirs
 
Introduction to RCU
Introduction to RCUIntroduction to RCU
Introduction to RCUKernel TLV
 
ゼロから作るパケット転送用OS (Internet Week 2014)
ゼロから作るパケット転送用OS (Internet Week 2014)ゼロから作るパケット転送用OS (Internet Week 2014)
ゼロから作るパケット転送用OS (Internet Week 2014)Hirochika Asai
 
Preparing OpenSHMEM for Exascale
Preparing OpenSHMEM for ExascalePreparing OpenSHMEM for Exascale
Preparing OpenSHMEM for Exascaleinside-BigData.com
 

What's hot (17)

Skydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integrationSkydive, real-time network analyzer, container integration
Skydive, real-time network analyzer, container integration
 
JCache / JSR107 shortcomings
JCache / JSR107  shortcomingsJCache / JSR107  shortcomings
JCache / JSR107 shortcomings
 
Skydive 5/07/2016
Skydive 5/07/2016Skydive 5/07/2016
Skydive 5/07/2016
 
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015cache2k, Java Caching, Turbo Charged, FOSDEM 2015
cache2k, Java Caching, Turbo Charged, FOSDEM 2015
 
Hs java open_party
Hs java open_partyHs java open_party
Hs java open_party
 
Dpdk accelerated Ostinato
Dpdk accelerated OstinatoDpdk accelerated Ostinato
Dpdk accelerated Ostinato
 
Cryptography and secure systems
Cryptography and secure systemsCryptography and secure systems
Cryptography and secure systems
 
Skydive, real-time network analyzer
Skydive, real-time network analyzer Skydive, real-time network analyzer
Skydive, real-time network analyzer
 
Practical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profilingPractical SystemTAP basics: Perl memory profiling
Practical SystemTAP basics: Perl memory profiling
 
Continuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnitContinuous Performance Regression Testing with JfrUnit
Continuous Performance Regression Testing with JfrUnit
 
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]Ostinato - Craft Packets, Generate Traffic [SharkFest '20]
Ostinato - Craft Packets, Generate Traffic [SharkFest '20]
 
Introduction to RCU
Introduction to RCUIntroduction to RCU
Introduction to RCU
 
Stress your DUT
Stress your DUTStress your DUT
Stress your DUT
 
ゼロから作るパケット転送用OS (Internet Week 2014)
ゼロから作るパケット転送用OS (Internet Week 2014)ゼロから作るパケット転送用OS (Internet Week 2014)
ゼロから作るパケット転送用OS (Internet Week 2014)
 
Lisa14
Lisa14Lisa14
Lisa14
 
Preparing OpenSHMEM for Exascale
Preparing OpenSHMEM for ExascalePreparing OpenSHMEM for Exascale
Preparing OpenSHMEM for Exascale
 
Open shmem
Open shmemOpen shmem
Open shmem
 

Viewers also liked

Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsPaul Groth
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metricsPaul Groth
 
"Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited "Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited Paul Groth
 
Transparency in the Data Supply Chain
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply ChainPaul Groth
 
Altmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactAltmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactPaul Groth
 
Data Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionData Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionPaul Groth
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersPaul Groth
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at ElsevierPaul Groth
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkPaul Groth
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialPaul Groth
 
Knowledge Graphs at Elsevier
Knowledge Graphs at ElsevierKnowledge Graphs at Elsevier
Knowledge Graphs at ElsevierPaul Groth
 
Open PHACTS API Walkthrough
Open PHACTS API WalkthroughOpen PHACTS API Walkthrough
Open PHACTS API WalkthroughPaul Groth
 
Ideals and Norms in Scholarship
Ideals and Norms in ScholarshipIdeals and Norms in Scholarship
Ideals and Norms in ScholarshipPaul Groth
 
Machine Reading: What it means for publishers?
Machine Reading: What it means for publishers?Machine Reading: What it means for publishers?
Machine Reading: What it means for publishers?Paul Groth
 
How much does it cost sspmeeting may2015_kiley
How much does it cost sspmeeting may2015_kileyHow much does it cost sspmeeting may2015_kiley
How much does it cost sspmeeting may2015_kileyRobert Kiley
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaPaul Groth
 
Φύλλο εικόνας Background
Φύλλο εικόνας BackgroundΦύλλο εικόνας Background
Φύλλο εικόνας BackgroundPetros Michailidis
 
人類的最佳十個時間
人類的最佳十個時間人類的最佳十個時間
人類的最佳十個時間spacejam02
 
Gf presents 2.2.2.12
Gf presents 2.2.2.12Gf presents 2.2.2.12
Gf presents 2.2.2.12markoverbye
 

Viewers also liked (20)

Sources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization SystemsSources of Change in Modern Knowledge Organization Systems
Sources of Change in Modern Knowledge Organization Systems
 
Telling your research story with (alt)metrics
Telling your research story with (alt)metricsTelling your research story with (alt)metrics
Telling your research story with (alt)metrics
 
"Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited "Don't Publish, Release" - Revisited
"Don't Publish, Release" - Revisited
 
Transparency in the Data Supply Chain
Transparency in the Data Supply ChainTransparency in the Data Supply Chain
Transparency in the Data Supply Chain
 
Altmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impactAltmetrics: painting a broader picture of impact
Altmetrics: painting a broader picture of impact
 
Data Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tensionData Integration vs Transparency: Tackling the tension
Data Integration vs Transparency: Tackling the tension
 
Data for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchersData for Science: How Elsevier is using data science to empower researchers
Data for Science: How Elsevier is using data science to empower researchers
 
Information architecture at Elsevier
Information architecture at ElsevierInformation architecture at Elsevier
Information architecture at Elsevier
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
Structured Data & the Future of Educational Material
Structured Data & the Future of Educational MaterialStructured Data & the Future of Educational Material
Structured Data & the Future of Educational Material
 
Knowledge Graphs at Elsevier
Knowledge Graphs at ElsevierKnowledge Graphs at Elsevier
Knowledge Graphs at Elsevier
 
Open PHACTS API Walkthrough
Open PHACTS API WalkthroughOpen PHACTS API Walkthrough
Open PHACTS API Walkthrough
 
Ideals and Norms in Scholarship
Ideals and Norms in ScholarshipIdeals and Norms in Scholarship
Ideals and Norms in Scholarship
 
Machine Reading: What it means for publishers?
Machine Reading: What it means for publishers?Machine Reading: What it means for publishers?
Machine Reading: What it means for publishers?
 
How much does it cost sspmeeting may2015_kiley
How much does it cost sspmeeting may2015_kileyHow much does it cost sspmeeting may2015_kiley
How much does it cost sspmeeting may2015_kiley
 
Knowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPediaKnowledge Graph Construction and the Role of DBPedia
Knowledge Graph Construction and the Role of DBPedia
 
Φύλλο εικόνας Background
Φύλλο εικόνας BackgroundΦύλλο εικόνας Background
Φύλλο εικόνας Background
 
人類的最佳十個時間
人類的最佳十個時間人類的最佳十個時間
人類的最佳十個時間
 
Gf presents 2.2.2.12
Gf presents 2.2.2.12Gf presents 2.2.2.12
Gf presents 2.2.2.12
 
Tween Booktalks
Tween BooktalksTween Booktalks
Tween Booktalks
 

Similar to Decoupling Provenance Capture and Analysis from Execution

Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopQuey-Liang Kao
 
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis SystemScalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis SystemTamas K Lengyel
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNoSuchCon
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationPaul Groth
 
Distributed Postgres
Distributed PostgresDistributed Postgres
Distributed PostgresStas Kelvich
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http acceleratorno no
 
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent Sandbox
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent SandboxHITBSecConf 2017-Shadow-Box-the Practical and Omnipotent Sandbox
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent SandboxSeunghun han
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...PROIDEA
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsScyllaDB
 
Hacktivity2014: Virtual Machine Introspection to Detect and Protect
Hacktivity2014: Virtual Machine Introspection to Detect and ProtectHacktivity2014: Virtual Machine Introspection to Detect and Protect
Hacktivity2014: Virtual Machine Introspection to Detect and ProtectTamas K Lengyel
 
Advanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupAdvanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupMongoDB
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHackito Ergo Sum
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!stricaud
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...Yandex
 
NUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioNUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioHajime Tazaki
 
Got Big Data? Splunk on Nutanix
Got Big Data? Splunk on NutanixGot Big Data? Splunk on Nutanix
Got Big Data? Splunk on NutanixNEXTtour
 

Similar to Decoupling Provenance Capture and Analysis from Execution (20)

Lrz kurs: big data analysis
Lrz kurs: big data analysisLrz kurs: big data analysis
Lrz kurs: big data analysis
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System Workshop
 
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis SystemScalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System
Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System
 
NSC #2 - Challenge Solution
NSC #2 - Challenge SolutionNSC #2 - Challenge Solution
NSC #2 - Challenge Solution
 
Progressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computationProgressive Provenance Capture Through Re-computation
Progressive Provenance Capture Through Re-computation
 
Distributed Postgres
Distributed PostgresDistributed Postgres
Distributed Postgres
 
Varnish http accelerator
Varnish http acceleratorVarnish http accelerator
Varnish http accelerator
 
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent Sandbox
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent SandboxHITBSecConf 2017-Shadow-Box-the Practical and Omnipotent Sandbox
HITBSecConf 2017-Shadow-Box-the Practical and Omnipotent Sandbox
 
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
CONFidence 2017: Escaping the (sand)box: The promises and pitfalls of modern ...
 
Hardware Assisted Latency Investigations
Hardware Assisted Latency InvestigationsHardware Assisted Latency Investigations
Hardware Assisted Latency Investigations
 
Hacktivity2014: Virtual Machine Introspection to Detect and Protect
Hacktivity2014: Virtual Machine Introspection to Detect and ProtectHacktivity2014: Virtual Machine Introspection to Detect and Protect
Hacktivity2014: Virtual Machine Introspection to Detect and Protect
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
Advanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and BackupAdvanced Administration, Monitoring and Backup
Advanced Administration, Monitoring and Backup
 
Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
HES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you canHES2011 - Sebastien Tricaud - Capture me if you can
HES2011 - Sebastien Tricaud - Capture me if you can
 
Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!Hackito Ergo Sum 2011: Capture me if you can!
Hackito Ergo Sum 2011: Capture me if you can!
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo..."Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
 
NUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osioNUSE (Network Stack in Userspace) at #osio
NUSE (Network Stack in Userspace) at #osio
 
Got Big Data? Splunk on Nutanix
Got Big Data? Splunk on NutanixGot Big Data? Splunk on Nutanix
Got Big Data? Splunk on Nutanix
 
Strata - 03/31/2012
Strata - 03/31/2012Strata - 03/31/2012
Strata - 03/31/2012
 

More from Paul Groth

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIPaul Groth
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningPaul Groth
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-cziPaul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph FuturesPaul Groth
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph MaintenancePaul Groth
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper ProvenancePaul Groth
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of DataPaul Groth
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text Paul Groth
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data ShowcasingPaul Groth
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphPaul Groth
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for SciencePaul Groth
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?Paul Groth
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsPaul Groth
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsPaul Groth
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsPaul Groth
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chainPaul Groth
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicinePaul Groth
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataPaul Groth
 

More from Paul Groth (20)

Data Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AIData Curation and Debugging for Data Centric AI
Data Curation and Debugging for Data Centric AI
 
Content + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learningContent + Signals: The value of the entire data estate for machine learning
Content + Signals: The value of the entire data estate for machine learning
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Knowledge Graph Futures
Knowledge Graph FuturesKnowledge Graph Futures
Knowledge Graph Futures
 
Knowledge Graph Maintenance
Knowledge Graph MaintenanceKnowledge Graph Maintenance
Knowledge Graph Maintenance
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Thinking About the Making of Data
Thinking About the Making of DataThinking About the Making of Data
Thinking About the Making of Data
 
End-to-End Learning for Answering Structured Queries Directly over Text
End-to-End Learning for  Answering Structured Queries Directly over Text End-to-End Learning for  Answering Structured Queries Directly over Text
End-to-End Learning for Answering Structured Queries Directly over Text
 
From Data Search to Data Showcasing
From Data Search to Data ShowcasingFrom Data Search to Data Showcasing
From Data Search to Data Showcasing
 
Elsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge GraphElsevier’s Healthcare Knowledge Graph
Elsevier’s Healthcare Knowledge Graph
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?More ways of symbol grounding for knowledge graphs?
More ways of symbol grounding for knowledge graphs?
 
Diversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domainsDiversity and Depth: Implementing AI across many long tail domains
Diversity and Depth: Implementing AI across many long tail domains
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge GraphsCombining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs
 
The need for a transparent data supply chain
The need for a transparent data supply chainThe need for a transparent data supply chain
The need for a transparent data supply chain
 
Knowledge graph construction for research & medicine
Knowledge graph construction for research & medicineKnowledge graph construction for research & medicine
Knowledge graph construction for research & medicine
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 

Recently uploaded

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 

Decoupling Provenance Capture and Analysis from Execution

  • 1. Decoupling Provenance Capture and Analysis from Execution Manolis Stamatogiannakis (@mstamat) Paul Groth (@pgroth) Herbert Bos
  • 2. Stamatogiannakis, M., Groth, P., & Bos, H. (2014). Looking inside the black-box: Capturing data provenance using dynamic instrumentation. In Provenance and Annotation of Data and Processes (pp. 155-167). Springer International Publishing. http://bit.ly/dtracker-demo 2
  • 3. Capturing Provenance Disclosed Provenance + Accuracy + High-level semantics – Intrusive – Manual Effort Observed Provenance – False positives – Semantic Gap + Non-intrusive + Minimal manual effort CPL (Macko ‘12) Trio (Widom ‘09) PrIME (Miles ‘09) Taverna (Oinn ‘06) VisTrails (Fraire ‘06) ES3 (Frew ‘08) Trec (Vahdat ‘98) PASSv2 (Holland ‘08) DTrace Tool (Gessiou ‘12) 3
  • 4. What to capture? Miles, Simon, Groth, Paul, Munroe, Steve and Moreau, Luc (2011) PrIMe: a methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20, (3), 8:1-8:42. 4
  • 5. Provenance is post hoc. Aim: Eliminate the need for developers to know what provenance needs to be captured. 5
  • 6. Re-execution Common tactic in provenance: • DB: Reenactment queries (Glavic ‘14) • DistSys: Chimera (Foster ‘02), Hadoop (Logothetis ‘13), DistTape (Zhao ‘12) • Workflows: Pegasus (Groth ‘09) • PL: Slicing (Perera ‘12) • OS: pTrace (Guo ‘11) • Desktop: Excel (Asuncion ‘11) 6
  • 9. • PANDA: an open-source Platform for Architecture- Neutral Dynamic Analysis. (Dolan-Gavitt et al. ‘14) • Based on QEMU virtualization platform. • Logs self-contained execution traces. – An initial RAM snapshot. – Non-deterministic inputs. Prototype Implementation (1/2) PANDA CPU RAM Input Interrupt DMA Initial RAM Snapshot Non- determinism log RAM PANDA Execution Trace 9
  • 10. Prototype Implementation (2/2) • Debian Linux guest. • Analysis plugins – Read-only access to the VM state. – Invoked per instr., memory access, context switch, etc. – Can be combined to implement complex functionality. – OSI Linux, PROV-Tracer, ProcStrMatch. • Provenance stored PROV/RDF triples, queried with SPARQL. PANDA Executio n Trace PANDA Triple Store Plugin APlugin C Plugin B CPU RAM 10 used endedAtTime wasAssociatedWith actedOnBehalfOf wasGeneratedBy wasAttributedTo wasDerivedFrom wasInformedBy Activity Entity Agent xsd:dateTime startedAtTime xsd:dateTime
  • 11. OS Introspection • What processes are currently executing? • Which libraries are used? • What files are used? • Possible approaches: – Execute code inside the guest-OS. – Reproduce guest-OS semantics purely from the hardware state (RAM/registers). 11
  • 12. Introspecting Kernel Structures (1/2) • Kernel structure members are known. • Their offsets depend on compile-time configuration. • Each linux vendor supplies a few different kernel configurations. • Rule of thumb: same vendor/configuration/ve rsion combo  same offsets. 12
  • 13. Introspecting Kernel Structures (2/2) • Offset profile created once for each kernel “family”. • E.g. one profile for all Debian/amd64/3.2.* kernels. • The profile is used by osi_linux module to extract process info from the execution trace. Kernel offset profile Live OS Probe Modul e Proces s Info PANDA Replay VM osi_linux PAND A Trace Monitored OS Pointer arithmetic: struct task_struct t; int off = &(t.memb) – &t; printk(“off: %d”, off); 13
  • 14. The PROV-Tracer Plugin • Registers for process creation/destruction events. • Decodes executed system calls. • Keeps track of what files are used as input/output by each process. • Emits provenance in an intermediate format when a process terminates. 14
  • 15. More Analysis Plugins • ProcStrMatch plugin. – Which processes contained string S in their memory? • Other possible types of analysis: – Taint tracking – Dynamic slicing 15
  • 16. Execution Overhead (1/2) • QEMU incurs a 5x slowdown. • PANDA recording imposes an additional 1.1x – 1.2x slowdown. Virtualization is the dominant overhead factor. 16
  • 17. Execution Overhead (2/2) • QEMU is a suboptimal virtualization option. • ReVirt – User Mode Linux (Dunlap et al. ‘02) – Slowdown: 1.08x rec. + 1.58x virt. • ReTrace – VMWare (Xu et al. ‘07) – Slowdown: 1.05x-2.6x rec. + ??? virt. Virtualization slowdown is considered acceptable. Recording overhead is fairly low. 17
  • 18. Storage Requirements • Storage requirements vary with the workload. • For PANDA (Dolan-Gavitt et al. ’14): – 17-915 instructions per byte. • In practice: O(10MB/min) uncompressed. • Different approaches to reduce/manage storage requirements. – Compression, HD rotation, VM snapshots. • 24/7 recording seems within limits of todays’ technology. 18
  • 19. An Example (1) <exe://pam-foreground-~3451> prov:endedAtTime 199090196 . <exe://getent~3451> a prov:Activity . <exe://getent~3451> rdf:type dt:getent . <exe://cut~3452> a prov:Activity . <exe://cut~3452> rdf:type dt:cut . <file:/etc/nsswitch.conf> a prov:Entity . <file:/etc/nsswitch.conf> rdfs:label "/etc/nsswitch.conf" . <file:/etc/nsswitch.conf> rdf:type dt:Unknown . <exe://getent~3451> prov:used <file:/etc/nsswitch.conf> . # unused file:3477815296:getent~3451:/etc/passwd:r0:w0:f524288 <exe://getent~3451> prov:startedAtTime 199090196 . <exe://getent~3451> prov:endedAtTime 200392668 . <file:FD0_3452> a prov:Entity . <file:FD0_3452> rdfs:label "FD0_3452" 19
  • 23. Conclusion • Decouple capture/analysis from execution • VMs provide a useful indirection mechanism • Future: – More plugins – Real world analysis (rrshare.org) – Cloud analysis • Are traces/immutable logs primitive? 23
  • 24. Source & Text • PROV-Tracer source: – https://github.com/m000/panda/tree/prov_tracer/ – Plugins under qemu/panda_plugins. • Paper: – http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_I_3.p df 24

Editor's Notes

  1. Last year in IPAW, we demonstrated a data tracking tool. No need for instrumentation, not affected by instrumentation captured. Solves the NxM problem.
  2. Gibson Tapp (2009
  3. Execution Capture: happens realtime Instrumentation: applied on the captured trace to generate provenance information Analysis: the provenance information is explored through existing tools (e.g. SPARQL queries) Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
  4. PANDA is based on QEMU. Input includes both executed instructions and data. RAM snapshot + ND log are enough to accurately replay the whole execution. ND log conists of inputs to CPU/RAM and other device status is not logged  we can replay but we cannot “go live” (i.e. resume execution)
  5. Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
  6. Typical information that can be retrieved through VM introspections. In general, executing code inside the guest OS is complex. Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
  7. Someone may ask why not just use the vendor-specific kernel configuration to extract the offsets from source. This is possible, but more complex than our approach. Kernel is a complex piece of software and heavily relies on its build system to get everything compile.
  8. A kernel module is inserted to a Live OS. Extracts the offset of members of kernel structures via simple pointer arithmetic. The offsets are saved as a “profile”. A PANDA trace is extracted. Using the trace and a (compatible) profile, we can introspect the guest OS
  9. QEMU is a good choice for prototyping, but overall submoptimal as virtualization option. Xu et al. do not give any numbers for virtualization slowdown. They (rightfully) consider it acceptable for most cases. 1.05x is for CPU bound processing. 2.6x is for I/O bound processing.