Data munging is a crucial task across domains ranging from drug discovery and policy studies to data science. Indeed, it has been reported that data munging accounts for 60% of the time spent in data analysis. Because data munging involves a wide variety of tasks using data from multiple sources, it often becomes difficult to understand how a cleaned dataset was actually produced (i.e., its provenance). In this talk, I discuss our recent work on tracking data provenance within desktop systems, which addresses the problems of efficient and fine-grained capture. I also describe our work on scalable provenance tracking within a triple store/graph database that supports messy web data. Finally, I briefly touch on whether we will move from ad hoc data munging approaches to more declarative knowledge representation languages such as Probabilistic Soft Logic.
Presented at Information Sciences Institute - August 13, 2015
Provenance for Data Munging Environments
1. Paul Groth
Elsevier Labs
@pgroth | pgroth.com
Provenance for Data Munging Environments
Information Sciences Institute – August 13, 2015
2. Outline
• What’s data munging and why is it important?
• The role of provenance
• The reality….
• Desktop data munging & provenance
• Database data munging & provenance
• Declarative data munging (?)
10. Solution: Tracking and exposing provenance*
* “a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data”
The PROV Data Model (W3C Recommendation)
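To make the PROV idea concrete, here is a minimal sketch of what such a record could look like for a single munging step, written with the Python prov package (the package and the file names are illustrative; they are not part of the talk):

from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/')

# The things involved: a raw input file, the cleaned output, and the person who ran the step.
raw = doc.entity('ex:raw.csv')
clean = doc.entity('ex:cleaned.csv')
alice = doc.agent('ex:alice')

# The activity that produced the output from the input.
munge = doc.activity('ex:munging-run-1')
doc.used(munge, raw)
doc.wasGeneratedBy(clean, munge)
doc.wasDerivedFrom(clean, raw)
doc.wasAssociatedWith(munge, alice)

print(doc.get_provn())  # the record in PROV-N notation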
23. References
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Looking Inside the Black-Box: Capturing Data Provenance Using Dynamic Instrumentation. 5th International Provenance and Annotation Workshop (IPAW'14).
Manolis Stamatogiannakis, Paul Groth, Herbert Bos. Decoupling Provenance Capture and Analysis from Execution. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
23
25. Challenge
• Can we capture provenance
– with a low false-positive ratio?
– without manual/obtrusive integration effort?
• We have to rely on observed provenance.
25
26. State of the art
Application
• Observed provenance systems treat programs as black-boxes.
• Can’t tell if an input file was actually used.
• Can’t quantify the influence of an input on an output.
26
30. Evaluation: tackling the n×m problem
30
• DataTracker is able to track the actual use of the input data.
• Read data ≠ Used data.
• Eliminates false positives present in other observed provenance capture methods.
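To see why this matters, here is a toy sketch in Python (purely illustrative; this is not DataTracker's implementation) of how taint propagation distinguishes data that is merely read from data that actually reaches an output:

class Tainted:
    # A value that carries the set of input sources it was derived from.
    def __init__(self, value, sources):
        self.value = value
        self.sources = set(sources)
    def __add__(self, other):
        # Any computation on tainted values propagates the union of their sources.
        return Tainted(self.value + other.value, self.sources | other.sources)

a = Tainted(10, {"input_a.csv"})  # read and actually used
b = Tainted(32, {"input_b.csv"})  # read and actually used
c = Tainted(99, {"input_c.csv"})  # read, but never used for the output

out = a + b
print(out.value, out.sources)     # 42 {'input_a.csv', 'input_b.csv'}

A black-box observer would link the output to all three inputs (the n×m derivation edges), whereas the taint labels show that input_c.csv never influenced it.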
32. Can we do good enough?
• Can taint tracking
a. become an “always-on” feature?
b. be turned on for all running processes?
• What if we want to also run other analysis code?
• Can we pre-determine the right analysis code?
32
36. Prototype Implementation
• PANDA: an open-source Platform for Architecture-Neutral Dynamic Analysis (Dolan-Gavitt ‘14).
• Based on the QEMU virtualization platform.
36
37. Prototype Implementation (2/3)
• PANDA logs self-contained execution traces.
– An initial RAM snapshot.
– Non-deterministic inputs.
• Logging happens at virtual CPU I/O ports.
– Virtual device state is not logged, so we can’t “go live”.
[Diagram: inputs, interrupts, and DMA into the virtual CPU/RAM, together with the initial RAM snapshot and the non-determinism log, make up a PANDA execution trace]
37
38. Prototype Implementation (3/3)
• Analysis plugins
– Read-only access to the VM state.
– Invoked per instr., memory access, context switch, etc.
– Can be combined to implement complex functionality.
– OSI Linux, PROV-Tracer, ProcStrMatch.
• Debian Linux guest.
• Provenance stored as PROV/RDF triples, queried with SPARQL.
[Diagram: analysis plugins (Plugin A, B, C) run over the PANDA execution trace with read-only access to the replayed CPU/RAM state and emit provenance into a triple store]
38
[Diagram: the W3C PROV data model – Entities, Activities, and Agents linked by used, wasGeneratedBy, wasDerivedFrom, wasInformedBy, wasAttributedTo, wasAssociatedWith, and actedOnBehalfOf, with startedAtTime/endedAtTime (xsd:dateTime) on activities]
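As an illustration of querying such PROV/RDF triples with SPARQL, a small self-contained sketch using rdflib (the data is made up; the prototype itself stores traces in its own triple store):

from rdflib import Graph

g = Graph()
g.parse(format="turtle", data="""
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .
ex:cleaned_csv prov:wasGeneratedBy ex:cleaning_run ;
               prov:wasDerivedFrom ex:raw_csv .
ex:cleaning_run prov:used ex:raw_csv .
""")

q = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?output ?source
WHERE { ?output prov:wasDerivedFrom ?source . }
"""
for row in g.query(q):
    print(row.output, row.source)  # which outputs were derived from which inputs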
39. OS Introspection
• What processes are currently executing?
• Which libraries are used?
• What files are used?
• Possible approaches:
– Execute code inside the guest-OS.
– Reproduce guest-OS semantics purely from the hardware state (RAM/registers).
39
40. The PROV-Tracer Plugin
• Registers for process creation/destruction events.
• Decodes executed system calls.
• Keeps track of what files are used as input/output by each process.
• Emits provenance in an intermediate format when a process terminates.
40
41. More Analysis Plugins
• ProcStrMatch plugin.
– Which processes contained string S in their memory?
• Other possible types of analysis:
– Taint tracking
– Dynamic slicing
41
42. Overhead (again) (1/2)
• QEMU incurs a 5x slowdown.
• PANDA recording imposes an additional 1.1x–1.2x slowdown.
Virtualization is the dominant overhead factor.
42
43. Overhead (again) (2/2)
• QEMU is a suboptimal virtualization option.
• ReVirt – User Mode Linux (Dunlap ‘02)
– Slowdown: 1.08x rec. + 1.58x virt.
• ReTrace – VMware (Xu ‘07)
– Slowdown: 1.05x–2.6x rec. + ??? virt.
Virtualization slowdown is considered acceptable.
Recording overhead is fairly low.
43
44. Storage Requirements
• Storage requirements vary with the workload.
• For PANDA (Dolan-Gavitt ‘14):
– 17–915 instructions per byte.
• In practice: O(10MB/min) uncompressed.
• Different approaches to reduce/manage storage requirements.
– Compression, HD rotation, VM snapshots.
• 24/7 recording seems within the limits of today’s technology.
44
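For a rough sense of scale (assuming the O(10MB/min) figure above): 10 MB/min × 60 × 24 ≈ 14 GB per machine per day uncompressed, i.e. on the order of tens of GB per day, which matches the “few dozens of GBs per day” mentioned in the speaker notes.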
45. Highlights
• Taint-tracking analysis is a powerful method for capturing provenance.
– Eliminates many false positives.
– Tackles the “n×m problem”.
• Decoupling provenance analysis from execution is possible through the use of VM record & replay.
• Execution traces can be used for post-hoc provenance analysis.
45
47. References
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. World Wide Web Conference 2014.
Marcin Wylot, Philip Cudré-Mauroux, Paul Groth. Executing Provenance-Enabled Queries over Web Data. World Wide Web Conference 2015.
47
48. RDF is great for munging data
➢ Ability to arbitrarily add new information (schemaless)
➢ Syntaxes make it easy to concatenate new data
➢ Information has a well-defined structure
➢ Identifiers are distributed but controlled
48
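A small illustration of the concatenation point (hypothetical data): because each statement in a line-based syntax such as N-Triples is self-contained, merging two sources is just appending their files, e.g.
<http://example.org/eiffel> <http://www.w3.org/2000/01/rdf-schema#label> "Eiffel Tower" . (from source A)
<http://example.org/eiffel> <http://example.org/inCountry> <http://example.org/FR> . (from source B)
and the result is still valid RDF.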
50. Graph-based Query
select ?lat ?long ?g1 ?g2 ?g3 ?g4
where {
graph ?g1 {?a [] "Eiffel Tower" . }
graph ?g2 {?a inCountry FR . }
graph ?g3 {?a lat ?lat . }
graph ?g4 {?a long ?long . }
}
lat long l1 l2 l4 l4,
lat long l1 l2 l4 l5,
lat long l1 l2 l5 l4,
lat long l1 l2 l5 l5,
lat long l1 l3 l4 l4,
lat long l1 l3 l4 l5,
lat long l1 l3 l5 l4,
lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4,
lat long l2 l2 l4 l5,
lat long l2 l2 l5 l4,
lat long l2 l2 l5 l5,
lat long l2 l3 l4 l4,
lat long l2 l3 l4 l5,
lat long l2 l3 l5 l4,
lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4,
lat long l3 l2 l4 l5,
lat long l3 l2 l5 l4,
lat long l3 l2 l5 l5,
lat long l3 l3 l4 l4,
lat long l3 l3 l4 l5,
lat long l3 l3 l5 l4,
lat long l3 l3 l5 l5,
51. Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined to deliver a result
52. Polynomials Operators
➢ Union (⊕)
○ a constraint or projection satisfied with multiple sources: l1 ⊕ l2 ⊕ l3
○ multiple entities satisfy a set of constraints or projections
➢ Join (⊗)
○ sources joined to handle a constraint or a projection
○ object-subject (OS) and object-object (OO) joins between sets of constraints: (l1 ⊕ l2) ⊗ (l3 ⊕ l4)
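Putting the two operators together, for the simple star-query example walked through in the speaker notes below: if the first constraint is satisfied by sources l1, l2, or l3, the second by l4 or l5, and the two projections by l6 or l7 and l8 or l9 respectively, the resulting polynomial is (l1 ⊕ l2 ⊕ l3) ⊗ (l4 ⊕ l5) ⊗ (l6 ⊕ l7) ⊗ (l8 ⊕ l9), the joins reflecting that all the triples were joined on the same variable ?a.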
56. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~110 million triples each; ~25GB each
57. Workloads
➢ 8 queries defined for BTC
○ T. Neumann and G. Weikum. Scalable Join Processing on Very Large RDF Graphs. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 627–640. ACM, 2009.
➢ Two additional queries with UNION and OPTIONAL clauses
➢ 7 new queries for WDC
http://exascale.info/tripleprov
58. Results
Overhead of tracking provenance compared to the vanilla version of the system for the BTC dataset.
[Chart legend: source-level co-located, source-level annotated, triple-level co-located, triple-level annotated]
59. TripleProv: Query Execution Pipeline
input: a provenance-enabled query
➢ execute the provenance query
➢ optionally pre-materialize or co-locate data
➢ optionally rewrite the workload queries
➢ execute the workload queries
output: the workload query results, restricted to those derived from data specified by the provenance query
59
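A minimal, self-contained Python sketch of this strategy over toy in-memory quads (illustrative only; this is not TripleProv's actual API):

QUADS = [  # (subject, predicate, object, context/source)
    ("eiffel", "label", "Eiffel Tower", "l1"),
    ("eiffel", "inCountry", "FR", "l2"),
    ("eiffel", "lat", "48.858", "l4"),
]

def provenance_query(trusted_sources):
    # Step 1: the provenance query selects the context values of interest.
    return {quad[3] for quad in QUADS if quad[3] in trusted_sources}

def workload_query(pattern, contexts):
    # Steps 2-4: the (possibly rewritten) workload query only touches data
    # whose context value was selected by the provenance query.
    s, p, o = pattern
    return [quad for quad in QUADS
            if quad[3] in contexts
            and (s is None or quad[0] == s)
            and (p is None or quad[1] == p)
            and (o is None or quad[2] == o)]

contexts = provenance_query({"l1", "l2"})
print(workload_query(("eiffel", None, None), contexts))  # results restricted to data from l1/l2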
60. Experiments
What is the most efficient query execution strategy for provenance-enabled queries?
60
61. Datasets
➢ Two collections of RDF data gathered from the Web
○ Billion Triple Challenge (BTC): crawled from the Linked Open Data cloud
○ Web Data Commons (WDC): RDFa and Microdata extracted from Common Crawl
➢ Typical collections gathered from multiple sources
➢ Sampled subsets of ~40 million triples each; ~10GB each
➢ Added provenance-specific triples (184 for WDC and 360 for BTC) so that the provenance queries do not modify the result sets of the workload queries
61
62. Results for BTC
➢ Full Materialization: 44x faster than the vanilla version of the system
➢ Partial Materialization: 35x faster
➢ Pre-Filtering: 23x faster
➢ Adaptive Partial Materialization executes a provenance query and materializes data 475 times faster than Full Materialization
➢ Query Rewriting and Post-Filtering strategies perform significantly slower
62
63. Data Analysis
➢ How many context values refer to how many triples? How selective are they?
➢ 6'819'826 unique context values in the BTC dataset.
➢ The majority of the context values are highly selective.
➢ Average selectivity:
○ 5.8 triples per context value
○ 2.3 molecules per context value
63
65. References
Sara Magliacane, Philip Stutz, Paul Groth, Abraham Bernstein. foxPSL: A Fast, Optimized and eXtended PSL Implementation. International Journal of Approximate Reasoning (2015).
65
66. Why logic?
- Concise & natural way to represent relations
- Declarative representation:
- Can reuse, extend, combine rules
- Experts can write rules
- First order logic:
- Can exploit symmetries to avoid duplicated computation (e.g. lifted inference)
67. Let the reasoner munge the data.
See Sebastian Riedel’s (and others’) work towards pushing more NLP problems into the reasoner.
http://cl.naist.jp/~kevinduh/z/acltutorialslides/matrix_acl2015tutorial.pdf
68. Statistical Relational Learning
● Several flavors:
o Markov Logic Networks
o Bayesian Logic Programs
o Probabilistic Soft Logic (PSL) [Broecheler, Getoor, UAI 2010]
● PSL has been successfully applied to:
o Entity resolution, link prediction
o Ontology alignment, knowledge graph identification
o Computer vision, trust propagation, …
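To make the “experts can write rules” point from slide 66 concrete: a PSL program is a set of weighted first-order rules over soft truth values in [0, 1]. A commonly used illustrative example (generic PSL style, not the exact foxPSL DSL syntax) looks like:

3.0 : friend(A, B) ∧ votes(A, P) → votes(B, P)
8.0 : spouse(A, B) ∧ votes(A, P) → votes(B, P)

Inference then assigns truth values that minimize the total weighted distance to satisfaction of these rules.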
70. FoxPSL: Fast Optimized eXtended PSL
[Diagram: the foxPSL DSL – classes, ∃/partially grounded rules, and optimizations]
71. Experiments: comparison with ACO
SLURM cluster: 4 nodes, each with 2x10 cores and 128GB RAM
ACO = implementation of consensus optimization on GraphLab, used for grounded PSL
72. Conclusions
• Data munging is a central task
• Provenance is a requirement
• Now:
• Provenance by stealth (ack Carole Goble)
• Separate provenance analysis from instrumentation.
• Future:
• The computer should do the work
73. Future Research
• Explore optimizations of taint tracking for capturing provenance.
• Provenance analysis of real-world traces (e.g. from rrshare.org).
• Tracking provenance across environments
• Traces/logs as central provenance primitive
• Declarative data munging
73
Disclosed provenance methods require knowledge of application semantics and modification of the application.
On the other hand, observed provenance methods usually have a high false-positive ratio.
Let’s look at a physical-world provenance problem.
Geologists want to know the provenance of streams flowing out of the foothills of a mountain. To do so, they pour dye on the suspected source of the stream.
We can apply a similar method, called taint tracking, to find the provenance of data streams.
Taint tracking allows us to examine the flow of data in what was previously a black box.
We built a tool based on taint tracking to capture provenance. Our tool is called DataTracker and has two key building blocks.
We evaluated DataTracker with some sample programs to show that it can tackle the nxm problem and eliminate false positives present in other observed provenance capture methods.
The nxm problem is a major drawback of other observed provenance methods. In summary, it means that in the presence of n inputs and m outputs, the provenance graph will include nxm derivation edges.
Decouple analysis from execution.
Has been proposed for security analysis on mobile phones. (Paranoid Android, Portokalidis ‘10)
Execution Capture: happens in real time
Instrumentation: applied on the captured trace to generate provenance information
Analysis: the provenance information is explored using existing tools (e.g. SPARQL queries)
Selection: a subset of the execution trace is selected – we start again with more intensive instrumentation
We implemented our methodology using PANDA.
PANDA is based on QEMU.
Input includes both executed instructions and data.
RAM snapshot + ND log are enough to accurately replay the whole execution.
The ND log consists of inputs to the CPU/RAM; other device state is not logged, so we can replay but we cannot “go live” (i.e. resume execution)
Note: Technically, plugins can modify VM state. However this will eventually crash the execution as the trace will be out of sync with the replay state.
Plugins are implemented as dynamic libraries.
We focus on the highlighted plugins in this presentation.
Typical information that can be retrieved through VM introspections.
In general, executing code inside the guest OS is complex.
Moreover, in the case of PANDA we don’t have access to the state of devices. This makes injection and execution of new code even more complex and also more limited.
QEMU is a good choice for prototyping, but overall suboptimal as a virtualization option.
Xu et al. do not give any numbers for virtualization slowdown. They (rightfully) consider it acceptable for most cases.
1.05x is for CPU bound processing. 2.6x is for I/O bound processing.
A few dozens of GBs per day.
nowadays, as we integrate a myriad of datasets from the Web
we need a solution to:
trace which pieces of data were combined, and how, to deliver a result (previous work)
tailor the query execution process with information on data provenance, to filter the pieces of data used in processing a query (this work)
------------------
we have to deal with issues like ascertaining trust, establishing transparency, and estimating the cost of a query answer
before moving to our way of dealing with it, let’s first have a look at whether it could be done with some existing systems
let’s try to use named graphs to store the source for each triple….
- we can load quads; the 4th element is taken as the named graph
- we can even query it to retrieve some kind of provenance information….
in the picture,
g1, g2, g3, g4 - the named graphs we use to store the source of the data
as a result we have a huge list of permuted elements,
l - lineage, source of the triples used to produce a particular entity
- standard query results, enriched with named graphs
- simple list of concatenated sources
- permutations of values bound to variables referring to data used to answer the query
- no formal, compact representation of provenance
- no detailed full-fledged provenance polynomials,
and how would it be with TripleProv?? ……. voila….
the question is: How to represent provenance information?
it must fulfill three main conditions
characterize the ways each source contributed to the result
pinpoint the exact sources of each result
we need a capacity….. to trace back the list of sources and the way they were combined to deliver a result
in our polynomials, we use two logical operators
Union
constraint or projection is satisfied with multiple sources (same triple from multiple sources)
multiple entities satisfy a set of constraints or projections (the answer is composed of multiple records)
Join
sources joined to handle a set of constraints or projections; joins are based on the subject…
OS (object-subject) and OO (object-object) joins between sets of constraints
Let me now give you some examples…..
As a first example we take a simple star query
the polynomial shows that
- the first constraint was satisfied with lineage l1, l2 or l3, => Union of multiple sources, the constraint was satisfied with triples from multiple sources
- the second was satisfied with l4 or l5.
- the first projection was processed with elements having a lineage of l6 or l7,
- the second one was processed with elements from l8 or l9.
All the triples involved were joined on variable ?a, which is expressed in the polynomial…..by the join operators
TripleProv is built on top of a native RDF store named Diplodocus,
it has a modular architecture
containing 6 main subcomponents:
query executor: responsible for parsing the incoming query, rewriting the query plans, and collecting and finally returning the results along with the provenance polynomials
lexicographic tree: in charge of encoding URIs and literals into compact system identifiers and of translating them back;
type index: clusters all keys based on their RDF types;
RDF molecules: the main storage structure; it stores RDF data as very compact subgraphs, along with the source for each piece of data
molecule index: for each key we store a list of molecules where the key can be found.
the main question in the database world is: how fast is it?
we translate this into…...
how expensive it is to trace provenance…..
what is the overhead of tracking provenance
Two subsets…. sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
Typical collections gathered from multiple sources
tracking provenance for them seems to precisely address the problem we focus on:
what is the provenance of a query answer in a dataset integrated from many sources
as a workload for BTC we used
- 8 Queries from the work of Thomas Neumann, SIGMOD 2009
- two extra queries with UNION and OPTIONAL clauses
for WDC we prepared 7 various queries
they represent different kinds of typical query patterns including
star-queries up to 5 joins,
object-object joins,
object-subject joins,
and triangular joins
all of them are available on the project web page,
now we can have a quick look at the performance
on the picture you can see the overhead over the vanilla version of the system (w/o provenance) for BTC dataset
horizontal axis: queries
vertical axis: overhead
you can see results for 4 variants of the system; those are permutations of granularity levels and storage models
--------------------------------------------------------------------------------------------
Overall, the performance penalty created by tracking provenance ranges from a few percent to almost 350%.
we observe a significant difference between the two storage models implemented
-retrieving data from co-located structures takes about 10%-20% more time than from simply annotated graph nodes
caused by the additional look-ups and loops that have to be considered when reading from extra physical data containers
We also notice a difference between the two granularity levels.
the more detailed triple level requires more time
such a simple post-execution join would of course result in poor performance,
in our methods the query execution process can vary depending on the exact strategy
typically we start by executing the blue provenance query and optionally pre-materializing or co-locating data;
the green workload queries are then optionally rewritten…..
by taking into account results of the provenance query
and finally they get executed
The process returns as output the workload query results, restricted to those that follow the specification expressed in the provenance query
the main question in the database world is: how fast is it?
in our case we will try to answer the question,
what is the most efficient query execution strategy for provenance-enabled queries?
for our experiments, we used….
Two subsets sampled from collections of RDF data gathered from the Web
Billion Triple Challenge
Web Data Commons
those are… typical collections gathered from multiple sources
executing provenance-enabled queries for them seems to precisely address the problem we focus on,
our goal is to fairly compare our provenance-aware query execution strategies and the vanilla version of the system, that's why...
for the datasets we added some triples so that the provenance queries do not change the results of workload queries
overall…
Full Materialization: 44x faster than the vanilla version of the system
Partial Materialization: 35x faster
Pre-Filtering: 23x faster
The advantage of the Partial Materialization strategy over the Full Materialization strategy…
is that for the Partial Materialization, the time to execute a provenance query and materialize data is 475 times lower.
it’s basically faster to prepare data for executing workload queries
Query Rewriting and Post-Filtering strategies perform significantly slower
to better understand the influence of provenance queries on performance,
So to find the reason for such a performance gain over the pure triple store
we analysed the BTC dataset and provenance distribution
the figure shows how many context values refer to how many triples
we found that
there are only a handful of context values that are widespread (left-hand side of the figure)
and that the vast majority of the context values are highly selective (right-hand side of the figure)
we leveraged those properties during the query execution,
our strategies prune molecules early based on their context values