Presented at the International Workshop on Semantic Big Data (SBD 2016), held in conjunction with the 2016 ACM SIGMOD Conference
July 1st, 2016, San Francisco, USA
TPC-DI - The First Industry Benchmark for Data Integration (Tilmann Rabl)
This presentation was given by Meikel Poess on September 3, 2014, at VLDB 2014 in Hangzhou, China.
Full paper and additional information available at:
http://msrg.org/papers/VLDB2014TPCDI
Abstract:
Historically, the process of synchronizing a decision support system with data from operational systems has been referred to as Extract, Transform, Load (ETL), and the tools supporting this process have been referred to as ETL tools. Recently, ETL was replaced by the more comprehensive acronym, data integration (DI). DI describes the process of extracting and combining data from a variety of data source formats, transforming that data into a unified data model representation and loading it into a data store. This is done in the context of a variety of scenarios, such as data acquisition for business intelligence, analytics and data warehousing, but also synchronization of data between operational applications, data migrations and conversions, master data management, enterprise data sharing and delivery of data services in a service-oriented architecture context, amongst others. With these scenarios relying on up-to-date information, it is critical to implement a highly performing, scalable and easy-to-maintain data integration system. This is especially important as the complexity, variety and volume of data are constantly increasing and the performance of data integration systems is becoming critical. Despite the significance of having a highly performing DI system, there has been no industry standard for measuring and comparing performance. The TPC, acknowledging this void, has released TPC-DI, an innovative benchmark for data integration. This paper motivates the reasons behind its development, describes its main characteristics including workload, run rules and metric, and explains key decisions.
This presentation was given at ISC 2014 on June 26, 2014, in Leipzig, Germany.
More information available at:
http://msrg.org/papers/ISC2014-Rabl
Abstract:
The Workshops for Big Data Benchmarking (http://clds.sdsc.edu/bdbc/workshops), which have been underway since May 2012, have identified a set of characteristics of big data applications that apply to industry as well as scientific application scenarios involving pipelines of processing with steps that include aggregation, cleaning, and annotation of large volumes of data; filtering, integration, fusion, subsetting, and compaction of data; and, subsequent analysis, including visualization, data mining, predictive analytics and, eventually, decision making. One of the outcomes of the WBDB workshops has been the formation of a Transaction Processing Performance Council (TPC) subcommittee on Big Data, which is initially defining a Hadoop systems benchmark, TPCx-HS, based on Terasort. TPCx-HS would be a simple, functional benchmark that would assist in determining basic resiliency and scalability features of large-scale systems. Other proposals are also actively under development including BigBench, which extends the TPC-DS benchmark for big data scenarios; Big Decision Benchmark from HP; HiBench from Intel; and the Deep Analytics Pipeline (DAP), which defines a sequence of end-to-end processing steps consisting of some of the operations mentioned above. Pipeline benchmarks reveal the need for different processing modalities and system characteristics for different steps in the pipeline. For example, early processing steps may process very large volumes of data and may benefit from a Hadoop and MapReduce-style of computing, while later steps may operate on more structured data and may require, say, SMP-style architectures or very large memory systems. This talk will provide an overview of these benchmark activities and discuss opportunities for collaboration and future work with industry partners.
Integration of Data Ninja Services with Oracle Spatial and Graph (Data Ninja API)
Data Ninja Services provides a set of cloud-based APIs that can extract entities and their relationships from document text, and produce RDF triples that can be loaded into Oracle Spatial and Graph in a seamless integration. The risk analysis case study, based on the Zika virus, combines actionable insights from Oracle with the semantic content produced by the Data Ninja services.
Presentation slides for the SDCSB Cytoscape Workshop on 5/19/2016. The presentation covers the current status of the Cytoscape project and gives an overview of the Cytoscape ecosystem. It briefly mentions the Cytoscape Cyberinfrastructure.
Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on ... (Revolution Analytics)
[Presentation by Skylar Lyon at DataWeek 2014, September 17 2014.]
I recently faced the task of scaling out an existing analytics process. The schedule was compressed - it always is in my world. The data was big - 400+ million rows waiting in a database. What did I do? I offered my favorite type of solution - quick and dirty.
At the outset, I wasn't sure how easy it would be. Nor was I certain of realized performance gains. But the concept seemed sound and the exercise fun. Let's move the compute to the data via Revolution R Enterprise for Teradata.
This presentation outlines my approach to leveraging a colleague's R models as I experimented with running R in-database. Would my path lead to significant improvement? Could it be used to productionize the workflow?
SBIC Enterprise Information Security Strategic Technologies (EMC)
This report from the Security for Business Innovation Council describes next-generation technologies that support an Information-Driven Security strategy.
The Global IT Trust Curve Survey - Comprehensive Results Presentation (EMC)
The 2013 IT Trust Curve study surveyed 3,200 respondents to assess their organizations’ IT maturity levels and ability to withstand and quickly recover from disruptive incidents such as unplanned downtime, security breaches, and data loss.
Discover the impact and upside of having high IT Trust maturity, as captured in this overview of the survey results.
More via http://www.emc.com/campaign/it-trust-curve/index.htm
Managing Cyber Risk: Are Companies Safeguarding Their Assets? (EMC)
This white paper summarizes the results of a survey done by RSA, NYSE Governance Series, and Corporate Board Member, in association with Ernst & Young, with 200 audit committee members responding on a variety of issues regarding their cyber risk oversight program.
White Paper: Next-Generation Genome Sequencing Using EMC Isilon Scale-Out NAS... (EMC)
This EMC Isilon sizing and performance guideline white paper reviews the Key Performance Indicators (KPIs) that most strongly impact the production processes for storing data from Next-Generation Sequencing (NGS) workflows.
This white paper from Goode Intelligence explores how existing provisioning solutions are failing to support the business in an era where new IT service models are rapidly being deployed. New IT service models that support mobile and cloud computing have created problems for organizations that are already struggling with outdated identity and access governance tools. The paper explores a vision for Provisioning 2.0 where the goal is to weave provisioning into the very fabric of business process. Provisioning 2.0 is business driven, is easy to deploy and maintain and is built for today’s agile IT.
ACM SIGMOD SBD2016 - Querying and reasoning over large scale building dataset... (Pieter Pauwels)
Presentation at the International Workshop on Semantic Big Data (SBD 2016), held in conjunction with the 2016 ACM SIGMOD Conference in San Francisco, USA. Authored by Pieter Pauwels, Tarcisio Mendes de Farias, Chi Zhang, Ana Roxin, Jakob Beetz, Jos De Roo, Christophe Nicolle.
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St... (t_ivanov)
Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of the processing engine and the file format together, which makes it impossible to isolate their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves the best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
Ontology-based data access: why it is so cool! (Josef Hardi)
A brief introduction to ontology-based data access (OBDA) and its core implementation. I also presented a recent simple benchmark comparing -ontop- and Semantika, two of the most readily available OBDA frameworks, in terms of query performance (details in the appendix section). The slides were presented at the Friday Research Meeting of the Stanford Center for Biomedical Informatics Research (BMIR).
License: Creative Commons by Attribution 3.0
Gluent Extending Enterprise Applications with Hadoop (Gluent)
This presentation shows how to transparently extend enterprise applications with the power of modern data platforms such as Hadoop. No application rewriting is needed and there is no downtime when virtualizing data with Gluent.
Big Data HPC Convergence and a bunch of other things (Geoffrey Fox)
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
WSO2 Machine Learner takes data one step further, pairing data gathering and analytics with predictive intelligence: this helps you understand not just the present, but also predict scenarios and generate solutions for the future.
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming (Paco Nathan)
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio (Alluxio, Inc.)
Alluxio Global Online Meetup
Apr 23, 2020
For more Alluxio events: https://www.alluxio.io/events/
Speakers:
Jiao (Jennie) Wang, Intel
Tsai Louie, Intel
Bin Fan, Alluxio
Today, many people run deep learning applications with training data from separate storage such as object storage or remote data centers. This presentation will demo the Intel Analytics Zoo + Alluxio stack, an architecture that enables high performance while keeping cost and resource efficiency balanced, without the network becoming an I/O bottleneck.
Intel Analytics Zoo is a unified data analytics and AI platform open-sourced by Intel. It seamlessly unites TensorFlow, Keras, PyTorch, Spark, Flink, and Ray programs into an integrated pipeline, which can transparently scale from a laptop to large clusters to process production big data. Alluxio, as an open-source data orchestration layer, accelerates data loading and processing in Analytics Zoo deep learning applications.
In this talk, we will go over:
- What is Analytics Zoo and how it works
- How to run Analytics Zoo with Alluxio in deep learning applications
- Initial performance benchmark results using the Analytics Zoo + Alluxio stack
Exascale Computing Project - Driving a HUGE Change in a Changing World (inside-BigData.com)
In this video from the OpenFabrics Workshop in Austin, Al Geist from ORNL presents: Exascale Computing Project - Driving a HUGE Change in a Changing World.
"In this keynote, Mr. Geist will discuss the need for future Department of Energy supercomputers to solve emerging data science and machine learning problems in addition to running traditional modeling and simulation applications. In August 2016, the Exascale Computing Project (ECP) was approved to support a huge lift in the trajectory of U.S. High Performance Computing (HPC). The ECP goals are intended to enable the delivery of capable exascale computers in 2022 and one early exascale system in 2021, which will foster a rich exascale ecosystem and work toward ensuring continued U.S. leadership in HPC. He will also share how the ECP plans to achieve these goals and the potential positive impacts for OFA."
Learn more: https://exascaleproject.org/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: https://www.openfabrics.org/index.php/abstracts-agenda.html
High Performance Data Analytics and a Java Grande Run Time (Geoffrey Fox)
There is perhaps a broad consensus as to the important issues in practical parallel computing as applied to large-scale simulations; this is reflected in supercomputer architectures, algorithms, libraries, languages, compilers and best practice for application development. However, the same is not so true for data-intensive computing, even though commercial clouds devote many more resources to data analytics than supercomputers devote to simulations. Here we use a sample of over 50 big data applications to identify characteristics of data-intensive applications and to deduce the needed runtimes and architectures. We propose a big data version of the famous Berkeley dwarfs and NAS parallel benchmarks. Our analysis builds on the Apache software stack that is well used in modern cloud computing. We give some examples including clustering, deep learning and multi-dimensional scaling. One suggestion from this work is the value of a high-performance Java (Grande) runtime that supports both simulations and big data.
Critical Facilities Operations Framework: Explanations and illustrative examples.
For training videos, please visit https://m.youtube.com/channel/UCYw2fG4p7buyhJD0EYHahuQ
Presentation given on Wednesday, October 23, 2019, during my participation in the BIM Workshop (BIM In Motion) organized by Bouygues Construction at their Challenger site in the Paris region. After an introduction to (knowledge-based) expert systems, several examples of relevant applications in a BIM context are given. Links are also provided to publications and presentations covering these approaches in more detail.
Presentation at the BIM (BIM In Motion) Workshop organized by Bouygues Construction, at their Challenger site outside Paris. The BIM Workshop took place on Wednesday October the 23rd 2019.
Presentation made at the 5th eduBIM Workshop. After a review and rating of the main vocabularies for BIM published on the Linked Open Data cloud, some applications are discussed.
[CIB] Achieving interoperability between BIM and GIS - final (Ana Roxin)
Presentation given Thursday, September 19th 2019 at CIB W78, at Northumbria University, by Elio Hbeich (1st-year PhD student). After a brief summary of the main issues related to BIM/GIS interoperability, we depict our conceptual approach for achieving BIM/GIS semantic interoperability. This approach relies on a) federation among GIS and BIM bodies of knowledge, and b) granularity for defining and linking abstractions of the overall knowledge.
After a quick presentation of the main issue with BIM today (following Mark Baldwin's post), Linked Data principles are defined and exemplified in the context of BIM (ifcOWL). An example is provided regarding how IFD, QUDT and ifcOWL vocabularies could be linked. Finally, three main application areas for BIM are presented. Links are given to main work done in the field since 2016.
On the relation between Model View Definitions (MVDs) and Linked Data technol... (Ana Roxin)
This white paper outlines the proposals from the Linked Data Working Group (LDWG) on how technologies and approaches that are common to the domains of Semantic Web, Linked Data, and the Web of Data (hereafter jointly called ‘Linked Data’) are related to Model View Definitions (MVDs). After a brief introduction of both the MVD concept (Section 1) and linked data technologies (Section 2), two main topics are discussed:
● Technical: handling MVDs with Linked Data technologies (Section 3)
● Industrial use cases: making the most of the traditional MVD approach and Linked Data technologies (Section 4)
The purpose of this white paper is to discuss how Linked Data technologies and approaches could be effectively deployed to support industrial use cases that are typically related to the generation, use and maintenance of MVDs.
Geographic information - standards available for describing geographical data (Ana Roxin)
Presentation done at the 1st "Geopositioning and intelligent mobility" day
UTBM (University of Technology of Belfort-Montbéliard), Belfort, March 2010
Customizing Semantic Profiling for Digital Advertising (Ana Roxin)
Presentation done at the 3rd International Workshop on Methods, Evaluation, Tools and Applications for the Creation and Consumption of Structured Data for the e-Society (Meta4eS’14)
ifcWOD (Web Of Data) - Semantically Adapting IFC Model Relations into OWL Pro... (Ana Roxin)
Presented at the Technical Room, at the buildingSMART Summit
12th April 2016, Rotterdam, The Netherlands
Describes the semi-automatic conception of the ifcWOD ontology, based on the IFC EXPRESS model, ifcOWL and IFC Property Set Definitions (PSD).
COBieOWL: An OWL Ontology Based on the COBie Standard (Ana Roxin)
Presentation made on October 28th 2015, at The 14th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE2015), Rhodes, Greece.
We describe our method for semi-automatically conceiving an OWL ontology for the COBie standard starting from a COBie spreadsheet template. We call this ontology COBieOWL and we populate it directly from COBie spreadsheet data files as used by building actors. We also discuss various benefits of adopting our approach, for example: it reduces semantic heterogeneity of the COBie model.
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
Acorn Recovery: Restore IT infra within minutes (IP ServerOne)
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery offering (DRaaS) by IP ServerOne: a DR solution that helps restore your IT infrastructure within minutes.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives... (Orkestra)
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
0x01 - Newton's Third Law: Static vs. Dynamic Abusers (OWASP Beja)
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it for their own needs. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota), was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found out at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Querying and reasoning over large scale building datasets: an outline of a performance benchmark
1. Querying and reasoning over large scale building datasets: an outline of a performance benchmark
Pieter Pauwels, Tarcisio Mendes de Farias, Chi Zhang, Ana Roxin, Jakob Beetz, Jos De Roo, Christophe Nicolle
International Workshop on Semantic Big Data (SBD 2016), in conjunction with the 2016 ACM SIGMOD Conference in San Francisco, USA
Contact: Ana Roxin – ana-maria.roxin@u-bourgogne.fr; Pieter Pauwels – Pieter.pauwels@ugent.be
3. Context description
◼ The architectural design and construction domains work on a daily basis with massive amounts of data.
◼ In the context of BIM, a neutral, interoperable representation of information is provided by the Industry Foundation Classes (IFC) standard.
   • The EXPRESS format is difficult to handle.
◼ Semantic Web technologies have been identified as a possible solution:
   • Semantic data enrichment
   • Schema and data transformations
◼ A semantic approach involves 3 main components:
   • Schema (TBox): an OWL ontology; the information structure
   • Instances (ABox): assertions that respect the schema definition
   • Rules (RBox): if-then statements involving elements from the ABox and the TBox
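To make the TBox/ABox/RBox split concrete, here is a minimal Jena sketch under stated assumptions: the file names, the example rule and its URIs are illustrative placeholders, not the benchmark's actual resources.

    // Minimal sketch: load a TBox, an ABox, and one RBox rule with Apache Jena.
    // File names and the example rule/URIs are hypothetical.
    import java.util.List;
    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
    import org.apache.jena.reasoner.rulesys.Rule;

    public class TboxAboxRboxSketch {
        public static void main(String[] args) {
            Model tbox = ModelFactory.createDefaultModel().read("ifcowl.ttl");    // schema (TBox)
            Model abox = ModelFactory.createDefaultModel().read("building.ttl");  // instances (ABox)
            // One if-then rule in Jena's rule syntax (RBox); the URIs are illustrative.
            List<Rule> rules = Rule.parseRules(
                "[r1: (?rel <http://example.org/relatingPropertySet> ?ps)"
              + "  -> (?rel <http://example.org/hasPropertySet> ?ps)]");
            GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
            // Bind the schema and apply the rule over the instance data.
            InfModel inferred = ModelFactory.createInfModel(reasoner.bindSchema(tbox), abox);
            System.out.println("Triples after inference: " + inferred.size());
        }
    }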
4. Problem identified
◼ Different implementations exist for the components (TBox, ABox, RBox) of such a semantic approach:
   • diverse reasoning engines
   • diverse query processing techniques
   • diverse query handling
   • diverse dataset sizes
   • diverse dataset complexity
◼ An appropriate rule and query execution performance benchmark is missing: expressiveness vs. performance.
5. Performance benchmark variables
◼ Main components:
   • Schema (TBox): ifcOWL
   • Instances (ABox): 369 ifcOWL-compliant building models
   • Rules (RBox): 68 data transformation rules
◼ These elements are implemented into 3 different systems:
   • SPIN (SPARQL Inference Notation) and Jena
   • EYE
   • Stardog
◼ An ensemble of queries is addressed to the resulting systems.
6. TBox - the ifcOWL ontology
◼ All building models are encoded using the ifcOWL ontology, built up under the impulse of numerous initiatives over the last 10 years.
◼ The ontology used is the one made publicly available by the buildingSMART Linked Data Working Group (LDWG):
   • http://ifcowl.openbimstandards.org/IFC4#
   • http://ifcowl.openbimstandards.org/IFC4_ADD1#
   • http://ifcowl.openbimstandards.org/IFC2X3_TC1#
   • http://ifcowl.openbimstandards.org/IFC2X3_Final#
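For readers who want to inspect the ontology themselves, a small Jena sketch along these lines could load it; this assumes the first IFC2X3_TC1 URL above still dereferences to an RDF document.

    // Sketch: load the publicly available ifcOWL ontology into a Jena model.
    // Assumes the LDWG URL still dereferences to an RDF document.
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class LoadIfcOwl {
        public static void main(String[] args) {
            Model schema = ModelFactory.createDefaultModel();
            schema.read("http://ifcowl.openbimstandards.org/IFC2X3_TC1");
            System.out.println("ifcOWL triples loaded: " + schema.size());
        }
    }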
7. Call for papers – special issue in SWJ
◼ Semantic Web Journal – Interoperability, Usability, Applicability
   http://www.semantic-web-journal.net
◼ Special issue on "Semantic Technologies and Interoperability in the Built Environment"
◼ Important dates:
   • March 1st, 2017 – paper submission deadline
   • May 1st, 2017 – notification of acceptance
◼ Topics of interest:
   • Ontologies for AEC/FM
   • Linking BIM models to external data sources
   • Multiple scale integration through semantic interoperability
   • Multilingual data access and annotation
   • Query processing, query performance
   • Semantic-based building monitoring systems
   • Reasoning with building data
   • Building data publication strategies
   • Big Linked Data for building information
8. ifcOWL Stats
   • Axioms: 21306
   • Logical axioms: 13649
   • Classes: 1230
   • Object properties: 1578
   • Data properties: 5
   • Individuals: 1627
   • DL expressivity: SROIQ(D)
   • SubClassOf axioms: 4622
   • EquivalentClasses axioms: 266
   • DisjointClasses axioms: 2429
   • SubObjectPropertyOf axioms: 1
   • InverseObjectProperties axioms: 94
   • FunctionalObjectProperty axioms: 1441
   • TransitiveObjectProperty axioms: 1
   • ObjectPropertyDomain axioms: 1577
   • ObjectPropertyRange axioms: 1576
   • FunctionalDataProperty axioms: 5
   • DataPropertyDomain axioms: 5
   • DataPropertyRange axioms: 5
Source: Pieter Pauwels and Walter Terkaj, EXPRESS to OWL for construction industry: towards a recommendable and usable ifcOWL ontology. Automation in Construction 63: 100-133 (2016).
9. ABox – Building sets
◼ Some BIM models are publicly available (364), whereas others are undisclosed (5).
◼ The building information models were created with different BIM modelling environments, exported to IFC2x3, and transformed into ifcOWL-compliant RDF graphs using a publicly available converter.
◼ Files per BIM environment:
   • Tekla Structures: 227 (61.5%)
   • unknown or manual: 38 (10.3%)
   • Autodesk Revit: 27 (7.3%)
   • Xella BIM: 15
   • Autodesk AutoCAD: 12
   • iTConcrete: 9
   • SDS: 8
   • Nemetschek AllPlan: 7
   • GraphiSoft ArchiCAD: 5
   • various others: 21
◼ Files per size class (IFC instances / average file size / number of files):
   • 0 – 500,000 instances / 0 – 30 MB / 321 files
   • 500,000 – 2,000,000 instances / 30 – 100 MB / 37 files
   • > 2,000,000 instances / > 100 MB / 11 files
10. RBox – Data transformation rules
◼ Need for a representative set of rewrite rules
◼ 68 manually built rules, classified into several rule sets (RS) according to their content:
   • RS1: contains 2 rules for rewriting property set references into additional property statements sbd:hasPropertySet and sbd:hasProperty. This is a small, yet often used rule set that can be used in many contexts to simplify querying and data publication of common simple properties attached to IFC entity instances.
   • RS2: includes 31 rules, all involving subtypes of the IfcRelationship class (e.g. ifcowl:IfcRelAssigns, ifcowl:IfcRelDecomposes, ifcowl:IfcRelAssociates, ifcowl:IfcRelDefines, ifcowl:IfcRelConnects).
   • RS3: contains 3 rules related to handling lists in IFC.
   • RS4: contains one rule that allows wrapping simple data types.
   • RS5: consists of 20 rules for inferring single property statements sbd:hasPropertySet and sbd:hasProperty.
   • RS6: extends RS5 and RS1 with 6 additional rules for inferring whether an object is internal or external to a building.
   • RS7: contains 7 rules dealing with the (de)composition of building spaces and spatial elements.
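As a rough illustration of what an RS1-style rewrite could look like when expressed as a SPARQL CONSTRUCT and executed with Jena: the ifcowl: property names and the sbd: namespace URI below are assumptions, not the benchmark's exact rule bodies.

    // Hypothetical RS1-style rewrite expressed as a SPARQL CONSTRUCT, run with Jena.
    // The ifcowl: property names and the sbd: namespace URI are assumptions.
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class Rs1RewriteSketch {
        static final String RULE =
            "PREFIX ifcowl: <http://ifcowl.openbimstandards.org/IFC2X3_TC1#>\n"
          + "PREFIX sbd: <http://example.org/sbd#>\n"                 // assumed namespace
          + "CONSTRUCT { ?obj sbd:hasPropertySet ?pset . ?obj sbd:hasProperty ?prop }\n"
          + "WHERE {\n"
          + "  ?rel ifcowl:relatedObjects_IfcRelDefines ?obj .\n"     // assumed property names
          + "  ?rel ifcowl:relatingPropertyDefinition_IfcRelDefinesByProperties ?pset .\n"
          + "  ?pset ifcowl:hasProperties_IfcPropertySet ?prop\n"
          + "}";

        public static void main(String[] args) {
            Model building = ModelFactory.createDefaultModel().read("building.ttl");
            try (QueryExecution qe = QueryExecutionFactory.create(RULE, building)) {
                Model derived = qe.execConstruct();  // materialize the rule's right-hand side
                building.add(derived);
                System.out.println("Derived triples: " + derived.size());
            }
        }
    }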
12. Implementation

SPIN + Jena TDB
• Implemented based on the open source APIs of Topbraid SPIN (SPIN API 1.4.0) and Apache Jena (Jena Core 2.11.0, Jena ARQ 2.11.0, Jena TDB 1.0.0).
• Rules are written with the Topbraid Composer Free version and exported as RDF Turtle files.
• A small Java program reads the RDF models, schema and rules from the TDB store and queries the data.
• All the SPARQL queries are configured using the Jena org.apache.jena.sparql.algebra package.
• To avoid unnecessary reasoning processes, only the RDFS vocabulary is supported in this test environment.

EYE
• Version 'EYE-Winter16.0302.1557' ('SWI-Prolog 7.2.3 (amd64): Aug 25 2015, 12:24:59').
• EYE is a semi-backward reasoner enhanced with Euler path detection.
• As our rule set currently contains only rules using =>, forward reasoning takes place.
• Each command is executed 5 times.
• Each command includes the full ontology, the full set of rules and the RDFS vocabulary, as well as one of the 369 building model files and one of the 3 query files.
• No triple store is used: triples are processed directly from the considered files.

Stardog
• Stardog 4.0.2 semantic graph database (Java 8, RDF 1.1 graph data model, OWL 2 profiles, SPARQL 1.1).
• OWL reasoner plus rule engine; supports SWRL rules and backward-chaining reasoning.
• Reasoning is performed through a query rewriting approach, with SWRL rules taken into account during the query rewriting process.
• Stardog allows attaining a DL expressivity level of SROIQ(D).
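A minimal sketch of the TDB-backed setup described for the SPIN + Jena approach, written against current Apache Jena package names (the deck used Jena 2.11, whose packages were named differently); the store path and query are placeholders.

    // Sketch of a TDB-backed query setup; store path and query are placeholders.
    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.tdb.TDBFactory;

    public class TdbQuerySketch {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/data/tdb");  // placeholder store path
            dataset.begin(ReadWrite.READ);                            // transactional read
            try {
                String q = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }";
                try (QueryExecution qe =
                         QueryExecutionFactory.create(q, dataset.getDefaultModel())) {
                    ResultSetFormatter.out(qe.execSelect());          // print the result table
                }
            } finally {
                dataset.end();
            }
        }
    }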
13. Queries
◼ We have built a limited list of 60 queries, each of which triggers at least one of the available rules.
◼ As we focus here on query execution performance, the considered queries are entirely based on the right-hand sides of the considered rules.
◼ 3 queries are reported:
   • Q1, a simple query with few results: ?obj sbd:hasProperty ?p
   • Q2, a simple query with many results: ?point sbd:hasCoordinateX ?x . ?point sbd:hasCoordinateY ?y . ?point sbd:hasCoordinateZ ?z
   • Q3, a complex query that triggers a considerable number of rules: ?d rdf:type sbd:ExternalWall
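The slide lists only the queries' triple patterns. A complete, runnable version of Q2, sketched with Jena, might look as follows; the sbd: namespace URI and the input file are assumptions.

    // Runnable sketch of Q2; the sbd: namespace URI and input file are assumptions.
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class Q2Sketch {
        static final String Q2 =
            "PREFIX sbd: <http://example.org/sbd#>\n"   // assumed namespace
          + "SELECT ?point ?x ?y ?z WHERE {\n"
          + "  ?point sbd:hasCoordinateX ?x .\n"
          + "  ?point sbd:hasCoordinateY ?y .\n"
          + "  ?point sbd:hasCoordinateZ ?z\n"
          + "}";

        public static void main(String[] args) {
            Model building = ModelFactory.createDefaultModel().read("building.ttl");
            try (QueryExecution qe = QueryExecutionFactory.create(Q2, building)) {
                qe.execSelect().forEachRemaining(sol ->
                    System.out.println(sol.get("point") + ": " + sol.get("x")
                        + ", " + sol.get("y") + ", " + sol.get("z")));
            }
        }
    }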
14. Test environment
◼ One central server, supplied by the University of Burgundy research group CheckSem, with the following specifications: Ubuntu OS, Intel Xeon CPU E5-2430 at 2.2 GHz, 6 cores, 16 GB of DDR3 RAM.
◼ 3 Virtual Machines (VMs) were set up on this central server: a SPIN VM (Jena TDB), an EYE VM (EYE inference engine) and a Stardog VM (Stardog triplestore).
◼ The VMs were managed as separate test environments:
   • Each VM had 2 of the 6 cores allocated.
   • Each contained the above resources (ontologies, data, rules, queries).
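The implementation slide notes that each EYE command was executed 5 times; a generic repeat-and-average harness in that spirit, sketched here with Jena against a placeholder model and query, could look like this.

    // Repeat-and-average timing harness (5 runs, as the deck describes for EYE);
    // the model and query are placeholders.
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    public class QueryTimerSketch {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel().read("building.ttl");
            String q = "SELECT * WHERE { ?s ?p ?o } LIMIT 1000";
            final int runs = 5;
            long totalNanos = 0;
            for (int i = 0; i < runs; i++) {
                long t0 = System.nanoTime();
                try (QueryExecution qe = QueryExecutionFactory.create(q, m)) {
                    ResultSetFormatter.consume(qe.execSelect());  // drain all results
                }
                totalNanos += System.nanoTime() - t0;
            }
            System.out.printf("average over %d runs: %.3f s%n",
                              runs, totalNanos / (double) runs / 1e9);
        }
    }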
15. Results
◼ Queries applied on 6 hand-picked building models (BM1-BM6) of varying size.
◼ In the SPIN approach: for Q1 and Q2, execution time = backward-chaining inference process + actual query execution time; for Q3, execution time = query execution time itself.
◼ In the EYE approach: networking time is ignored.
◼ In the Stardog approach: execution time = backward-chaining inference + actual query execution time.
◼ Measured times per query and building model (SPIN / EYE / Stardog, in seconds):
   Q1 (simple, few results)
   • BM1: 135.36 / 37.11 / 13.44
   • BM2: 1.47 / 0.29 / 0.17
   • BM3: 24.01 / 4.87 / 1.4
   • BM4: 41.28 / 12.95 / 3.55
   • BM5: 4.99 / 1.05 / 0.33
   • BM6: 0.55 / 0.16 / 0.08
   Q2 (simple, many results)
   • BM1: 46.17 / 2.10 / 6.82
   • BM2: 92.03 / 4.20 / 15.83
   • BM3: 82.68 / 4.12 / 15.28
   • BM4: 19.93 / 1.04 / 2.81
   • BM5: 3.69 / 0.21 / 1.36
   • BM6: 0.74 / 0.045 / 1.00
   Q3 (complex)
   • BM1: 0.001 / 0.001 / 0.07
   • BM2: 0.006 / 0.003 / 0.12
   • BM3: 0.002 / 0.003 / 0.31
   • BM4: 0.005 / 0.001 / 0.20
   • BM5: 0.006 / 0.013 / 0.20
   • BM6: 0.001 / 0.001 / 0.13
16. Query time related to result count
[Figure: query time vs. number of results for Q1, per approach (green = SPIN; blue = EYE; black = Stardog)]
[Figure: query time vs. number of results for Q2, per approach (green = SPIN; blue = EYE; black = Stardog)]
17. Additional findings
◼ Indexing algorithms, query rewriting techniques and rule handling strategies
   • The three considered procedures are quite far apart from each other, which explains the considerable performance differences, not only between the procedures but also between diverse usages within one and the same system.
   • The algorithms and optimization techniques differ per approach: there are differences in the indexing algorithms, query rewriting techniques and rule handling strategies used.
◼ Forward- versus backward-chaining
   • The disadvantage of a forward-chaining reasoning process is that millions of triples can be materialized (EYE, and SPIN for Q1 and Q2).
   • Using backward-chaining reasoning avoids triple materialization, thus saving query execution time (Stardog, and SPIN for Q3).
◼ Type of data in the building model
   • Query Q3 triggers a rule that in turn triggers several other rules in the rule set; if the first rule does not fire, however, the process stops early.
   • Query Q2, however, fires relatively long rules; it takes more time to make these matches in all three approaches.
◼ Impact of the triple store
   • Loading files in memory at query execution time leads to considerable delays.
◼ Impact of the number of output results
   • Linear relation: the more results are available, the more triples need to be matched, leading to more assertions.
18. Conclusion and future work
◼ Comparison of 3 different approaches: SPIN, EYE and Stardog.
◼ 3 queries applied over 6 different building models.
◼ Future work consists in:
   • further specifying this initial performance benchmark with additional data and rules,
   • executing additional queries on the rest of the set of building models,
   • comparing results on a wider scale: for the individual approaches separately, as well as against other approaches not considered here.