Progressive Provenance Capture Through Re-computation

Provenance capture relies upon instrumentation of processes (e.g. probes or extensive logging). The more instrumentation we can add to processes the richer our provenance traces can be, for example, through the addition of comprehensive descriptions of steps performed, mapping to higher levels of abstraction through ontologies, or distinguishing between automated or user actions. However, this instrumentation has costs in terms of capture time/overhead and it can be difficult to ascertain what should be instrumented upfront. In this talk, I'll discuss our research on using record-replay technology within virtual machines to incrementally add additional provenance instrumentation by replaying computations after the fact.

Progressive Provenance Capture Through Re-computation
Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Joint work with Manolis Stamatogiannakis and Herbert Bos, Vrije Universiteit Amsterdam
Incremental Re-computation Workshop - Provenance Week 2018

What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau.
PrIMe: A methodology for developing provenance-aware applications.
ACM Transactions on Software Engineering and Methodology, 20(3), 2011.

Provenance is Post-Hoc
• What if we missed something?
• Disclosed provenance systems:
– Re-apply methodology (e.g. PrIMe), produce new application version.
– Time consuming.
• Observed provenance systems:
– Update the applied instrumentation.
– Instrumentation becomes progressively more intense.

Provenance is Post-Hoc
Aim: Eliminate the need for developers to know in advance what provenance needs to be captured.

Re-execution
• Common tactic in disclosed provenance:
– DB: Reenactment queries (Glavic ’14)
– DistSys: Chimera (Foster ’02), Hadoop (Logothetis ’13), DistTape (Zhao ’12)
– Workflows: Pegasus (Groth ’09)
– PL: Slicing (Perera ’12)
– Desktop: Excel (Asuncion ’11)
• Can we extend this idea to observed provenance systems?

Methodology
[Figure: methodology stages: Execution Capture, Instrumentation, Provenance analysis, Selection]
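The four stages can be sketched as a small loop: capture an execution once, then replay it repeatedly, attaching heavier analyses only to the parts selected in an earlier pass. This is a minimal illustration, not the authors' implementation. The event format, the `capture`, `replay`, `coarse`, and `fine` names, and the sample events are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Event = Tuple[int, str, str]          # (pid, operation, path): a toy event format
Analysis = Callable[[Event], List[str]]


@dataclass(frozen=True)
class Trace:
    """Stand-in for a recorded execution; a real trace holds far more state."""
    events: Tuple[Event, ...]


def capture() -> Trace:
    """Execution capture: performed once, with no provenance analysis attached."""
    return Trace(events=(
        (101, "open", "index.html"),
        (101, "read", "index.html"),
        (102, "open", "logo.png"),
        (101, "write", "index.html"),
    ))


def replay(trace: Trace, analyses: List[Analysis],
           selected: Callable[[Event], bool] = lambda e: True) -> List[str]:
    """Instrumentation and analysis: run the analyses over the selected events."""
    facts: List[str] = []
    for event in trace.events:
        if selected(event):
            for analyse in analyses:
                facts.extend(analyse(event))
    return facts


def coarse(event: Event) -> List[str]:
    """Cheap analysis: which process opened which file."""
    pid, op, path = event
    return [f"{pid} opened {path}"] if op == "open" else []


def fine(event: Event) -> List[str]:
    """Costlier analysis: report individual reads and writes."""
    pid, op, path = event
    return [f"{pid} {op} {path}"] if op in ("read", "write") else []


trace = capture()
first_pass = replay(trace, [coarse])
# Selection: suppose the first pass points at process 101 as the one of interest.
second_pass = replay(trace, [fine], selected=lambda e: e[0] == 101)
print(first_pass)
print(second_pass)
```

The point of the sketch is that `capture()` runs once, while `replay()` can run as many times as needed with different analyses, so the decision about what to instrument moves from before the execution to after it.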

Prototype Implementation (1/3)
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ’14).
• Based on the QEMU virtualization platform.

Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so a replay can’t “go live”.
[Figure: PANDA execution trace = initial RAM snapshot + non-determinism log of the inputs, interrupts, and DMA reaching the CPU and RAM]
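Why a snapshot plus a log of non-deterministic inputs is enough for replay can be shown with a toy model. The sketch below is not PANDA code: the `step` function stands in for a deterministic CPU, a single integer stands in for RAM, and the random "external" values stand in for interrupts and DMA. All names and numbers are assumptions chosen for illustration.

```python
import random
from typing import List, Tuple

NdLog = List[Tuple[int, int]]   # (instruction number, value that arrived)


def step(state: int, instr_no: int, external: int = 0) -> int:
    """Deterministic step: the next state depends only on the arguments."""
    return (state * 31 + instr_no + external) % 1_000_003


def record(initial: int, steps: int, seed: int) -> Tuple[int, NdLog, int]:
    """Live run: external inputs arrive at unpredictable points and are logged."""
    rng = random.Random(seed)
    log: NdLog = []
    state = initial
    for n in range(steps):
        external = 0
        if rng.random() < 0.2:             # an "interrupt" or "DMA write" occurs
            external = rng.randrange(1, 256)
            log.append((n, external))      # log when it happened and what arrived
        state = step(state, n, external)
    return initial, log, state             # snapshot, non-determinism log, result


def replay(snapshot: int, log: NdLog, steps: int) -> int:
    """Replay: start from the snapshot and inject logged inputs at the same points."""
    pending = dict(log)
    state = snapshot
    for n in range(steps):
        state = step(state, n, pending.get(n, 0))
    return state


snapshot, nd_log, live_final = record(initial=42, steps=1000, seed=7)
replayed_final = replay(snapshot, nd_log, steps=1000)
print(live_final == replayed_final)
```

Nothing in `replay` talks to a device or a random source, which mirrors the limitation on the slide: the replay reproduces the recorded run exactly but has no device state from which to continue a live execution.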

Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure: PANDA replays the execution trace (CPU, RAM); plugins A, B, and C observe the replay and write to a triple store]
[Figure: PROV core model. Classes: Entity, Activity, Agent. Relations: used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy. Activity times: startedAtTime, endedAtTime (xsd:dateTime)]
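A rough sketch of how an analysis plugin might turn replayed events into PROV statements follows. It is not the PROV-Tracer plugin and does not use PANDA's plugin API: the `ProvPlugin` class, its `on_file_event` callback, and the `ex:` identifiers are hypothetical. An in-memory set of triples stands in for the triple store, and the `match` helper stands in for a simple SPARQL pattern. Only the `prov:` terms are taken from the PROV vocabulary.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class ProvPlugin:
    """Hypothetical plugin: observes replayed events and never modifies them."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def on_file_event(self, pid: int, op: str, path: str) -> None:
        activity, entity = f"ex:process{pid}", f"ex:file/{path}"
        self.triples.add((activity, "rdf:type", "prov:Activity"))
        self.triples.add((entity, "rdf:type", "prov:Entity"))
        if op == "read":
            self.triples.add((activity, "prov:used", entity))
        elif op == "write":
            self.triples.add((entity, "prov:wasGeneratedBy", activity))


def match(triples: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Tiny pattern match standing in for a SPARQL basic graph pattern."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))


plugin = ProvPlugin()
for event in [(101, "read", "index.html"),
              (101, "write", "index.new.html"),
              (102, "read", "logo.png")]:
    plugin.on_file_event(*event)

used = match(plugin.triples, p="prov:used")
generated = match(plugin.triples, p="prov:wasGeneratedBy")
print(used)
print(generated)
```

Because the plugin only reads events, several such plugins can observe the same replay and write to the same store, which is how more detailed analyses can be layered on later.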

OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
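The second approach, recovering OS-level facts from raw memory alone, can be illustrated with a toy example. The record layout below is invented for the sketch and is not a real kernel structure. It assumes each process record holds a pid, the offset of the next record, and a fixed-width name, linked in a circular list.

```python
import struct
from typing import List, Tuple

# Toy record layout (an assumption, not a real kernel structure):
# little-endian u32 pid, u32 offset of the next record, 16-byte NUL-padded name.
RECORD = struct.Struct("<II16s")


def build_ram(processes: List[Tuple[int, str]], base: int = 64) -> Tuple[bytes, int]:
    """Lay out a circular linked list of process records inside a byte array."""
    ram = bytearray(base + RECORD.size * len(processes))
    offsets = [base + i * RECORD.size for i in range(len(processes))]
    for i, (pid, name) in enumerate(processes):
        next_offset = offsets[(i + 1) % len(offsets)]
        RECORD.pack_into(ram, offsets[i], pid, next_offset, name.encode())
    return bytes(ram), offsets[0]


def list_processes(ram: bytes, head: int) -> List[Tuple[int, str]]:
    """Recover the process list from raw memory, running no code in the guest."""
    found: List[Tuple[int, str]] = []
    offset = head
    while True:
        pid, next_offset, raw_name = RECORD.unpack_from(ram, offset)
        found.append((pid, raw_name.rstrip(b"\0").decode()))
        offset = next_offset
        if offset == head:                 # back at the start of the circular list
            return found


ram, head = build_ram([(1, "init"), (101, "firefox"), (102, "vim")])
processes = list_processes(ram, head)
print(processes)
```

The introspection code needs to know the layout and the location of the list head in advance, which is the main cost of this approach: that knowledge has to be supplied per guest OS and kernel version.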

An example
(1) Alice downloads the front page of example.org.
(2) Alice edits the document and fixes a link that points to the wrong page.
(3) Alice re-uploads the HTML document and the image.
(4) Bob downloads the front page of example.org.
(5) Bob removes a paragraph of text.
(6) Bob re-uploads the HTML document.
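One question provenance should answer for this scenario is who contributed to the page that is finally online. The sketch below encodes the six steps as a derivation chain and walks it backwards. The version labels (`v0`, `v1`, `v2`) and step names are hypothetical, and a plain dictionary stands in for `wasDerivedFrom` and `wasAttributedTo` statements.

```python
from typing import Dict, List

# Hypothetical labels for the six steps; v0 is the page as first served.
was_derived_from: Dict[str, str] = {
    "alice_download": "v0",
    "alice_edit": "alice_download",
    "v1": "alice_edit",            # Alice's re-upload
    "bob_download": "v1",
    "bob_edit": "bob_download",
    "v2": "bob_edit",              # Bob's re-upload
}
was_attributed_to: Dict[str, str] = {"alice_edit": "Alice", "bob_edit": "Bob"}


def lineage(entity: str) -> List[str]:
    """Follow the derivation links back to the original version."""
    chain: List[str] = []
    while entity in was_derived_from:
        entity = was_derived_from[entity]
        chain.append(entity)
    return chain


def contributors(entity: str) -> List[str]:
    """Everyone an edit in the entity's lineage is attributed to."""
    steps = [entity, *lineage(entity)]
    return sorted({was_attributed_to[s] for s in steps if s in was_attributed_to})


print(contributors("v1"))
print(contributors("v2"))
```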

Thoughts
• Decoupling provenance analysis from execution is possible by the use of VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
• 24/7 execution recording seems possible.
• Can we extend this notion of instrumentation to other capture systems?
Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth:
PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM
Transactions on Internet Technology 17(4): 37:1-37:24 (2017)

What's hot

Tradeoffs in Automatic Provenance Capture

Paul Groth

Review of micro Orm in c#

Kanstantsin Harbachou

Performance

Christophe Marchal

Spark Summit East 2015

Timothy Danford

Big data analysis from command line using GNU text utils. A lot of big data analysis tasks can be implemented using utils that can be found on almost every computer. Using such utils can help save time, money and give a good hint regarding an instance of problem. This presentations contains some historical background about GNU text utils, what they are capable of and when should one prefer command line utils upon modern Big Data technologies.

Your data isn't that big @ Big Things Meetup 2016-05-16

Boaz Menuhin

Lecture 9 -_pthreads-linux_threads

Prashant Pawar

Graylog2 (MongoBerlin/MongoHamburg 2010)

lennartkoopmann

Get Started with CrateDB: Sensor Data

Crate.io

Network & Filesystem: Doing less cross rings memory copy

Scaleway

Ns3

Rehmat Ullah

What's hot (10)

Tradeoffs in Automatic Provenance Capture

Review of micro Orm in c#

Performance

Spark Summit East 2015

Your data isn't that big @ Big Things Meetup 2016-05-16

Lecture 9 -_pthreads-linux_threads

Graylog2 (MongoBerlin/MongoHamburg 2010)

Get Started with CrateDB: Sensor Data

Network & Filesystem: Doing less cross rings memory copy

Ns3

Similar to Progressive Provenance Capture Through Re-computation

talks-afanasyev2013ndnsim-tutorial.pptx

hazwan30

Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System

Tamas K Lengyel

Currently, approaches to scientific research require activities that take up much time but do not actually advance our scientific understanding. For example, researchers and students spend countless hours reformatting data and writing code to attempt to reproduce previously published research. What if the scientific community could find a better way to create and publish our workflows, data, and models to minimize the amount of the time spent “reinventing the wheel”? Popper is an NSF and CROSS sponsored protocol and CLI tool for implementing scientific exploration pipelines following a DevOps approach. Popper allows researchers and students to generate work that is easy to reproduce. Modern open source software (OSS) development communities have created tools that make it easier to manage large codebases, allowing them to deal with high levels of complexity, not only in terms of managing code changes, but with the entire ecosystem that is needed in order to deliver changes to software in an agile, rapidly changing environment. These practices and tools are collectively referred to as DevOps. The Popper Experimentation Protocol repurposes the DevOps practice in the context of scientific explorations so that researchers can leverage existing tools and technologies to maintain and publish scientific analyses that are easy to reproduce. By following Popper, researchers can produce portable, automated and version-controlled experimentation pipelines that are easier to re-execute. In this talk/poster, we will briefly introduce DevOps and give an overview of best practices. We will then show how these practices can be repurposed for carrying out scientific explorations and illustrate using some examples. We will also walk the audience through the usage of the Popper CLI tool, showing examples from multiple domains such as High Energy Physics, Genomics, and Atmospheric Sciences.

Reproducible, Automated and Portable Computational and Data Science Experimen...

Ivo Jimenez

Interactive Data Analysis for End Users on HN Science Cloud

Helix Nebula The Science Cloud

In this video from the 2015 Stanford HPC Conference, Pavel Shamis from ORNL presents: Preparing OpenSHMEM for Exascale. "OpenSHMEM is a partitioned global address space (PGAS) one-sided communications library that enables remote memory access (RMA) across processing elements (PEs). Its API allows data to be transferred from one PE memory space to another PE’s symmetric memory space; decoupling the data transfers from synchronizations. OpenSHMEM is useful for applications that are latency driven or that have irregular communication patterns, because its one-sided API can be mapped very efficiently to hardware (e.g. RDMA interconnects, etc), and its one-sided programming model helps the overlapping of communication with computation. Summit is Oak Ridge National Laboratory’s next high performance supercomputer system that will be based on a many core/GPU hybrid architecture. In order to prepare OpenSHMEM for future systems, it is important to enhance its programming model to enable efficient utilization of the new hardware capabilities (e.g. massive multithreaded systems, accesses different type memories, next generation of interconnects, etc). This session will present recent advances in the area of OpenSHMEM extensions, implementations, and tools.” Watch the video: http://insidehpc.com/2015/02/video-preparing-openshmem-for-exascale/ See more talks in the Stanford HPC Conference Video Gallery: http://wp.me/P3RLHQ-dOO

Preparing OpenSHMEM for Exascale

inside-BigData.com

Shaping the Future: To Globus Compute and Beyond!

Globus

Linux Memory Analysis with Volatility

Andrew Case

Big data at experimental facilities

Ian Foster

Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)

Yury Leonychev

4055-841_Project_ShailendraSadh

Shailendra Sadh - CISSP

The Attached slide was presented at Null Open Security/OWAP/G4H combined community event, the document shared here is a representation of Independent study on usage of Metasploit on purpose built vulnerable machine Metasploitable3. With New attack vectors such as Elastic Search API and Jenkins servers -21/01/2017 Contains 1. Introduction to Metasploit (why metasploit?) 2. Demo Setup and talked on how to- Using Metasploitable3 3. Networking with VirtualBox for personal lab 4. Auxiliary Modules (Scanners and Servers ) - Demo of snmp_enum 5. Exploit Module (searching exploits) 6. Payload types 7. Exploit Demo 1 - /exploit/multi/elasticsearch/script_mvel_rce 8. Exploit Demo 2 - /exploit/multi/http/jenkins_script_console

Metasploit For Beginners

Ramnath Shenoy

Practical Chaos Engineering

SIGHUP

Ase2010 shang

SAIL_QU

Mac Memory Analysis with Volatility

Andrew Case

Monitoring in 2017 - TIAD Camp Docker

The Incredible Automation Day

"Data Provenance: Principles and Why it matters for BioMedical Applications"

Pinar Alper

Video: https://youtu.be/eO94l0aGLCA?t=3m37s . Talk by Brendan Gregg for ACM Applicative 2016 "System Methodology - Holistic Performance Analysis on Modern Systems Traditional systems performance engineering makes do with vendor-supplied metrics, often involving interpretation and inference, and with numerous blind spots. Much in the field of systems performance is still living in the past: documentation, procedures, and analysis GUIs built upon the same old metrics. For modern systems, we can choose the metrics, and can choose ones we need to support new holistic performance analysis methodologies. These methodologies provide faster, more accurate, and more complete analysis, and can provide a starting point for unfamiliar systems. Methodologies are especially helpful for modern applications and their workloads, which can pose extremely complex problems with no obvious starting point. There are also continuous deployment environments such as the Netflix cloud, where these problems must be solved in shorter time frames. Fortunately, with advances in system observability and tracers, we have virtually endless custom metrics to aid performance analysis. The problem becomes which metrics to use, and how to navigate them quickly to locate the root cause of problems. System methodologies provide a starting point for analysis, as well as guidance for quickly moving through the metrics to root cause. They also pose questions that the existing metrics may not yet answer, which may be critical in solving the toughest problems. System methodologies include the USE method, workload characterization, drill-down analysis, off-CPU analysis, and more. This talk will discuss various system performance issues, and the methodologies, tools, and processes used to solve them. The focus is on single systems (any operating system), including single cloud instances, and quickly locating performance issues or exonerating the system. 
Many methodologies will be discussed, along with recommendations for their implementation, which may be as documented checklists of tools, or custom dashboards of supporting metrics. In general, you will learn to think differently about your systems, and how to ask better questions."

ACM Applicative System Methodology 2016

Brendan Gregg

The size, number and complexity of macromolecular structures has been growing dramatically in recent years making visualisation and analysis of macromolecules non-trivial and sometimes impossible. At the same time, developments within genomics, web-based game development and Big Data mean that hardware and software now support such analysis. However existing macromolecular file formats present an I/O bottleneck meaning the power of such technologies cannot be harnessed. In this work we present a modern MacroMolecular Transmission Format (MMTF). MMTF is 91% smaller than mmCIF and is up to two orders of magnitude faster to parse. Both these changes provide a paradigm shift in the way structural biology can be carried out. The largest structures can now be visualised on all devices and the entire archive can be interactively queried and analysed in seconds through an efficient in-memory representation.

Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...

Anthony Bradley

Analisis Estatico y de Comportamiento de un Binario Malicioso

Conferencias FIST

### Delivered at grrcon.com ### One of the primary data sources we use on the Splunk Security Research Team is attack data collected from various corners of the globe. We often obtain this data in the wild using honeypots, with the goal of uncovering new or unusual attack techniques and other malicious activities for research purposes. The nirvana state is a honeypot tailored to mimic the kind of attack/attacker you are hoping to study. To do this effectively, the honeypot must very closely resemble a legitimate system. As a principal security research at Splunk, co-founder of Zenedge (Now part of Oracle), and Security Architect at Akamai I have spent many years protecting organizations from targeted as well as internet-wide attacks, and honeypots has been extremely useful (at times better than threat intel) tool at capturing and studying active malicious actors. In this talk, I aim to provide an introduction to honeypots, explain some of the experiences and lessons learned we have had running Cowrie a medium interaction SSH honeypot base on Kippo. How we modified cowrie to make it more realistic and mimic the systems and attack we are trying to capture as well as our approach for the next generation of honeypots we plan to use in our research work. The audience in this talk will learn how to deploy and use cowrie honeypot as a defense mechanism in their organization. Also, we will share techniques on how to modify cowrie in order to masquerade different systems and vulnerabilities mimicking the asset(s) being defended. Finally, share example data produced by the honeypot and analytic techniques that can be used as feedback to improve the deployed honeypot. We will close off the talk by sharing thoughts on how we are evolving our approach for capturing attack data using honeypots and why.

How to Make a Honeypot Stickier (SSH*)

Jose Hernandez

Similar to Progressive Provenance Capture Through Re-computation (20)

talks-afanasyev2013ndnsim-tutorial.pptx

Scalability, Fidelity and Stealth in the DRAKVUF Dynamic Malware Analysis System

Reproducible, Automated and Portable Computational and Data Science Experimen...

Interactive Data Analysis for End Users on HN Science Cloud

Preparing OpenSHMEM for Exascale

Shaping the Future: To Globus Compute and Beyond!

Linux Memory Analysis with Volatility

Big data at experimental facilities

Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)

4055-841_Project_ShailendraSadh

Metasploit For Beginners

Practical Chaos Engineering

Ase2010 shang

Mac Memory Analysis with Volatility

Monitoring in 2017 - TIAD Camp Docker

"Data Provenance: Principles and Why it matters for BioMedical Applications"

ACM Applicative System Methodology 2016

Small, fast and useful – MMTF a new paradigm in macromolecular data transmiss...

Analisis Estatico y de Comportamiento de un Binario Malicioso

How to Make a Honeypot Stickier (SSH*)

More from Paul Groth

It is increasingly recognized that data is a central challenge for AI systems - whether training an entirely new model, discovering data for a model, or applying an existing model to new data. Given this centrality of data, there is need to provide new tools that are able to help data teams create, curate and debug datasets in the context of complex machine learning pipelines. In this talk, I outline the underlying challenges for data debugging and curation in these environments. I then discuss our recent research that both takes advantage of ML to improve datasets but also uses core database techniques for debugging in such complex ML pipelines. Presented at DBML 2022 at ICDE - https://www.wis.ewi.tudelft.nl/dbml2022

Data Curation and Debugging for Data Centric AI

Paul Groth

Content-centric organizations have increasingly recognized the value of their material for analytics and decision support systems based on machine learning. However, as anyone involved in machine learning projects will tell you the difficulty is not in the provision of the content itself but in the production of annotations necessary to make use of that content for ML. The transformation of content into training data often requires manual human annotation. This is expensive particularly when the nature of the content requires subject matter experts to be involved. In this talk, I highlight emerging approaches to tackling this challenge using what's known as weak supervision - using other signals to help annotate data. I discuss how content companies often overlook resources that they have in-house to provide these signals. I aim to show how looking at a data estate in terms of signals can amplify its value for artificial intelligence.

Content + Signals: The value of the entire data estate for machine learning

Paul Groth

Description Data is a critical both to facilitate an organization and as a product. How can you make that data more usable for both internal and external stakeholders? There are a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data (re)use. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data. I put this in the context of the notion data communities that organizations can use to help foster the use of data both within your organization and externally.

Data Communities - reusable data in and outside your organization.

Paul Groth

The literature contains a myriad of recommendations, advice, and strictures about what data providers should do to facilitate data reuse. It can be overwhelming. Based on recent empirical work (analyzing data reuse proxies at scale, understanding data sensemaking and looking at how researchers search for data), I talk about what practices are a good place to start for helping others to reuse your data.

Minimal viable-datareuse-czi

Paul Groth

Presentation for NEC Lab Europe. Knowledge graphs are increasingly built using complex multifaceted machine learning-based systems relying on a wide of different data sources. To be effective these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end to end systems. In particular, I will discuss recent work on using inductive representations for link predication. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.

Knowledge Graph Maintenance

Paul Groth

Knowledge Graph Futures

Paul Groth

Knowledge Graph Maintenance

Paul Groth

Thoughts on Knowledge Graphs & Deeper Provenance

Paul Groth

Thinking About the Making of Data

Paul Groth

End-to-End Learning for Answering Structured Queries Directly over Text

Paul Groth

From Data Search to Data Showcasing

Paul Groth

Elsevier’s Healthcare Knowledge Graph

Paul Groth

Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, in order to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning; the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e. experiments) into knowledge graphs.

The Challenge of Deeper Knowledge Graphs for Science

Paul Groth

More ways of symbol grounding for knowledge graphs?

Paul Groth

Presentation at the IJCAI 2018 Industry Day Elsevier serves researchers, doctors, and nurses. They have come to expect the same AI based services that they use in everyday life in their work environment, e.g.: recommendations, answer driven search, and summarized information. However, providing these sorts of services over the plethora of low resource domains that characterize science and medicine is a challenging proposition. (For example, most of the shelf NLP components are trained on newspaper corpora and exhibit much worse performance on scientific text). Furthermore, the level of precision expected in these domains is quite high. In this talk, we overview our efforts to overcome this challenge through the application of four techniques: 1) unsupervised learning; 2) leveraging of highly skilled but low volume expert annotators; 2) designing annotation tasks for non-experts in expert domains; and 4) transfer learning. We conclude with a series of open issues for the AI community stemming from our experience.

Diversity and Depth: Implementing AI across many long tail domains

Paul Groth

From Text to Data to the World: The Future of Knowledge Graphs

Paul Groth

Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs

Paul Groth

The need for a transparent data supply chain

Paul Groth

Knowledge graph construction for research & medicine

Paul Groth

The Roots: Linked data and the foundations of successful Agriculture Data

Paul Groth

More from Paul Groth (20)

Data Curation and Debugging for Data Centric AI

Content + Signals: The value of the entire data estate for machine learning

Data Communities - reusable data in and outside your organization.

Minimal viable-datareuse-czi

Knowledge Graph Maintenance

Knowledge Graph Futures

Knowledge Graph Maintenance

Thoughts on Knowledge Graphs & Deeper Provenance

Thinking About the Making of Data

End-to-End Learning for Answering Structured Queries Directly over Text

From Data Search to Data Showcasing

Elsevier’s Healthcare Knowledge Graph

The Challenge of Deeper Knowledge Graphs for Science

More ways of symbol grounding for knowledge graphs?

Diversity and Depth: Implementing AI across many long tail domains

From Text to Data to the World: The Future of Knowledge Graphs

Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs

The need for a transparent data supply chain

Knowledge graph construction for research & medicine

The Roots: Linked data and the foundations of successful Agriculture Data

Recently uploaded

Following the popularity of "Cloud Revolution: Exploring the New Wave of Serverless Spatial Data," we're thrilled to announce this much-anticipated encore webinar. In this sequel, we'll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR. Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios. Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects. Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you're building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Safe Software

Vector Search -An Introduction in Oracle Database 23ai.pptx

Remote DBA Services

FWD Group - Insurer Innovation Award 2024

The Digital Insurer

Artificial Intelligence Chap.5 : Uncertainty

Khushali Kathiriya

Passkeys: Developing APIs to enable passwordless authentication Cody Salas, Sr Developer Advocate | Solutions Architect - Yubico Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...

apidays

Understanding the FAA Part 107 License ..

Christopher Logan Kennedy

The value of a flexible API Management solution for Open Banking Steve Melan, Manager for IT Innovation and Architecture - State's and Saving's Bank of Luxembourg Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The value of a flexible API Management solution for O...

apidays

Whatsapp Number Escorts Call girls 8617370543 Available 24x7 Mcleodganj Call Girls Service Offer Genuine VIP Model Escorts Call Girls in Your Budget. Mcleodganj Call Girls Service Provide Real Call Girls Number. Make Your Sexual Pleasure Memorable with Our Mcleodganj Call Girls at Affordable Price. Top VIP Escorts Call Girls, High Profile Independent Escorts Call Girls, Housewife Women Escorts Call Girl, College Girls Escorts Call Girls, Russian Escorts Call girls Service in Your Budget.

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Deepika Singh

Accelerating FinTech Innovation: Unleashing API Economy and GenAI Vasa Krishnan, Chief Technology Officer - FinResults Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

apidays

Corporate and higher education. Two industries that, in the past, have had a clear divide with very little crossover. The difference in goals, learning styles and objectives paved the way for differing learning technologies platforms to evolve. Now, those stark lines are blurring as both sides are discovering they have content that’s relevant to the other. Join Tammy Rutherford as she walks through the pros and cons of corporate and higher ed collaborating. And the challenges of these different technology platforms working together for a brighter future.

Corporate and higher education May webinar.pptx

Rustici Software

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

MadyBayot

presentation ICT roal in 21st century education

jfdjdjcjdnsjd

Progressive Provenance Capture Through Re-computation

1. Progressive Provenance Capture Through Re-computation
Paul Groth, Elsevier Labs
@pgroth | pgroth.com
Joint work with Manolis Stamatogiannakis and Herbert Bos, Vrije Universiteit Amsterdam
Incremental Re-computation Workshop - Provenance Week 2018

2. What to capture?
Simon Miles, Paul Groth, Steve Munroe, Luc Moreau. PrIMe: A methodology for developing provenance-aware applications. ACM Transactions on Software Engineering and Methodology, 20(3), 2011.

3. Provenance is Post-Hoc
• What if we missed something?
• Disclosed provenance systems:
– Re-apply methodology (e.g. PrIMe), produce new application version.
– Time consuming.
• Observed provenance systems:
– Update the applied instrumentation.
– Instrumentation becomes progressively more intense.

4. Provenance is Post-Hoc
Aim: Eliminate the need for developers to know what provenance needs to be captured.

5. Re-execution
• Common tactic in disclosed provenance:
– DB: Reenactment queries (Glavic ’14)
– DistSys: Chimera (Foster ’02), Hadoop (Logothetis ’13), DistTape (Zhao ’12)
– Workflows: Pegasus (Groth ’09)
– PL: Slicing (Perera ’12)
– Desktop: Excel (Asuncion ’11)
• Can we extend this idea to observed provenance systems?

6. Full-system logging and replay

7. Methodology
[Figure: cycle of four steps: Execution Capture, Instrumentation, Provenance analysis, Selection]

8. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis. (Dolan-Gavitt ’14)
• Based on the QEMU virtualization platform.

9. Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged → can’t “go live”.
[Figure labels: PANDA, CPU, RAM, Input, Interrupt, DMA, Initial RAM Snapshot, Non-determinism log, PANDA Execution Trace]
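The record-and-replay idea on this slide can be illustrated with a toy model. This is only a sketch under loud assumptions: `Trace`, `step`, `record` and `replay` are hypothetical names invented here, the "CPU" is a single deterministic function, and nothing below is PANDA's actual API or trace format. It shows why an initial snapshot plus a log of non-deterministic inputs is enough to reproduce an execution, and how observers can be attached during replay that were not present during recording.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical stand-in for a self-contained execution trace."""
    initial_ram: list[int]                            # snapshot taken when recording starts
    nd_log: list[int] = field(default_factory=list)   # non-deterministic inputs, in order


def step(ram: list[int], value: int) -> None:
    """Deterministic toy 'CPU' step: fold one input value into RAM."""
    slot = value % len(ram)
    ram[slot] = (ram[slot] * 31 + value) % 1_000_003


def record(initial_ram: list[int], steps: int) -> tuple[Trace, list[int]]:
    """Run 'live': draw unpredictable inputs and log each one as it arrives."""
    trace = Trace(initial_ram=list(initial_ram))
    ram = list(initial_ram)
    for _ in range(steps):
        value = random.randrange(256)   # stands in for device input, interrupts, DMA
        trace.nd_log.append(value)
        step(ram, value)
    return trace, ram


def replay(trace: Trace, observers=()) -> list[int]:
    """Re-run from the snapshot, feeding logged inputs instead of live ones."""
    ram = list(trace.initial_ram)
    for index, value in enumerate(trace.nd_log):
        step(ram, value)
        for observer in observers:
            observer(index, value, tuple(ram))   # observers get a read-only copy
    return ram
```

A replay with no observers reproduces the final state of the recorded run, and a later replay can attach observers (for example `seen = []; replay(trace, [lambda i, v, ram: seen.append(v)])`) without re-running the original workload.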

10. Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instruction, memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch, Taint tracking
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Figure labels: PANDA Execution Trace, PANDA (CPU, RAM), Plugin A, Plugin B, Plugin C, Triple Store]
[Figure labels, PROV data model: Entity, Activity, Agent; used, wasGeneratedBy, wasDerivedFrom, wasAttributedTo, wasAssociatedWith, actedOnBehalfOf, wasInformedBy; startedAtTime and endedAtTime (xsd:dateTime)]

11. OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the hardware state (RAM/registers).

12. An example
(1) Alice downloads the front page of example.org.
(2) Alice edits the document and fixes a link that points to the wrong page.
(3) Alice re-uploads the HTML document and the image.
(4) Bob downloads the front page of example.org.
(5) Bob removes a paragraph of text.
(6) Bob re-uploads the HTML document.
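To make the example concrete, the six steps could be written down as PROV-style statements and queried. The sketch below is an assumption-laden illustration: it uses plain Python tuples in place of the PROV/RDF triple store and SPARQL queries used by the prototype, and every identifier (`index_v0`, `alice_edit`, and so on) is hypothetical. Only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasAttributedTo`, `wasAssociatedWith`) come from the PROV data model.

```python
# Hypothetical (subject, predicate, object) statements for the Alice/Bob example.
TRIPLES = {
    # Alice downloads, edits and re-uploads the front page.
    ("alice_edit", "wasAssociatedWith", "alice"),
    ("alice_edit", "used", "index_v0"),
    ("index_v1", "wasGeneratedBy", "alice_edit"),
    ("index_v1", "wasDerivedFrom", "index_v0"),
    ("index_v1", "wasAttributedTo", "alice"),
    # Bob downloads Alice's version, removes a paragraph and re-uploads.
    ("bob_edit", "wasAssociatedWith", "bob"),
    ("bob_edit", "used", "index_v1"),
    ("index_v2", "wasGeneratedBy", "bob_edit"),
    ("index_v2", "wasDerivedFrom", "index_v1"),
    ("index_v2", "wasAttributedTo", "bob"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def ancestors(triples, entity):
    """Every entity reachable from `entity` by following wasDerivedFrom."""
    seen = set()
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for parent in objects(triples, current, "wasDerivedFrom"):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def contributors(triples, entity):
    """Agents to whom the entity or any of its ancestors is attributed."""
    agents = set()
    for item in ancestors(triples, entity) | {entity}:
        agents |= objects(triples, item, "wasAttributedTo")
    return agents
```

With these statements, `ancestors(TRIPLES, "index_v2")` walks back through both edits, and `contributors(TRIPLES, "index_v2")` gathers the agents attributed along that chain.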

13. Select Replay

14. Thoughts
• Decoupling provenance analysis from execution is possible by the use of VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
• 24/7 execution recording seems possible.
• Can we extend this notion of instrumentation to other capture systems?
Manolis Stamatogiannakis, Elias Athanasopoulos, Herbert Bos, Paul Groth: PROV2R: Practical Provenance Analysis of Unstructured Processes. ACM Transactions on Internet Technology 17(4): 37:1-37:24 (2017)

Editor's Notes

A big problem for systems capturing provenance is deciding what to capture. For disclosed provenance systems we can apply some methodology to decide what to capture.
The root of the problem is that provenance is post-hoc. Deciding what to capture in advance will always miss something. Ideally, we would like to…
Decouple analysis from execution. This has been proposed for security analysis on mobile phones (Paranoid Android, Portokalidis ’10).
The methodology has four steps:
• Execution Capture: happens in real time.
• Instrumentation: applied on the captured trace to generate provenance information.
• Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries).
• Selection: a subset of the execution trace is selected, and we start again with more intensive instrumentation.
We implemented our methodology using PANDA.
PANDA is based on QEMU. Input includes both executed instructions and data. The RAM snapshot and the non-determinism (ND) log are enough to accurately replay the whole execution. The ND log consists of inputs to the CPU/RAM; other device state is not logged, so we can replay but we cannot “go live” (i.e. resume execution).
Note: Technically, plugins can modify VM state. However, this will eventually crash the execution, as the trace will be out of sync with the replay state. Plugins are implemented as dynamic libraries. We focus on the highlighted plugins in this presentation.
Typical information that can be retrieved through VM introspection. In general, executing code inside the guest OS is complex. Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
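The capture, instrumentation, analysis and selection cycle described in these notes can be sketched in a few lines. Everything here is a hypothetical illustration: the `Event` record, the sample `TRACE`, and the `coarse` and `fine` instrumentation passes are invented for this sketch and do not reflect the prototype's plugins or trace format. The point is only that the trace is captured once, and the more expensive pass runs on the selected subset rather than on the whole recording.

```python
from collections import namedtuple

# Hypothetical record of one captured event; a real trace is far lower level.
Event = namedtuple("Event", "time pid op path data")

# Execution capture: done once, in real time, with no provenance logic attached.
TRACE = [
    Event(0, 10, "read", "index.html", "<a href='/old'>"),
    Event(1, 10, "write", "index.html", "<a href='/new'>"),
    Event(2, 11, "read", "logo.png", "PNG"),
    Event(3, 12, "read", "index.html", "<a href='/new'>"),
    Event(4, 12, "write", "index.html", ""),
]


def coarse(events):
    """Cheap instrumentation: which process used or generated which file."""
    return [(e.pid, "used" if e.op == "read" else "generated", e.path)
            for e in events]


def fine(events):
    """More intensive instrumentation: keep timing and the data itself."""
    return [(e.time, e.pid, e.op, e.path, e.data) for e in events]


def analyse(records, path):
    """Analysis: find the processes that generated a file of interest."""
    return {pid for pid, relation, p in records
            if relation == "generated" and p == path}


def select(events, pids):
    """Selection: keep only the part of the trace worth replaying in detail."""
    return [e for e in events if e.pid in pids]


def progressive_capture(trace, path):
    records = coarse(trace)         # first replay, light instrumentation
    pids = analyse(records, path)   # explore the resulting provenance
    subset = select(trace, pids)    # narrow down the trace
    return fine(subset)             # second replay, heavier instrumentation
```

Calling `progressive_capture(TRACE, "index.html")` runs the detailed pass only over the events of the processes that wrote the file, leaving the unrelated process out of the expensive analysis.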
