Big Linked Data ETL Benchmark
on Cloud Commodity Hardware
iMinds – Ghent University
Dieter De Witte, Laurens De Vocht,
Ruben Verborgh, Erik Mannens, Rik Van de Walle
Ontoforce
Kenny Knecht, Filip Pattyn, Hans Constandt
Introduction
Approach
Benchmark
Results
Conclusions & Next Steps
Introduction
• Facilitate development of a semantic federated query engine to close the (semantic) analytics gap in life sciences.
• The query engine drives an exploratory search application: DisQover.
• Approach federated querying by implementing an ETL pipeline that indexes the user views in advance.
• Combine Linked Open Data with private and licensed (proprietary) data to enable discovery of biomedical data and new insights in medicine development.
DisQover: which data?
Challenges
• Ensure minimal knowledge about data linking or annotation is required to explore and find results.
• Writing SPARQL directly requires detailed knowledge of the predicates and might require first exploring the data to determine the URIs.
• Scaling out to more data.
• Search queries are complex because search spans two distinct domains:
1. the ‘space’ of clinical studies;
2. ‘drugs/chemicals’.
Approach
How to do federated search with minimal latency for the end user?
Which RDF stores support the infrastructure?
What aspects should the design of a reusable benchmark take into account?
Scaling out: techniques
The scaling-out approach relies on low-end commodity hardware but uses many nodes in a distributed system:
1. Specialized scalable RDF stores, the focus of this work;
2. Translating SPARQL and RDF to existing NoSQL stores;
3. Translating SPARQL and RDF to existing Big Data approaches such as MapReduce, Impala, Apache Spark;
4. Distributing the data in physically separated SPARQL endpoints over the Semantic Web, using federated querying techniques to resolve complex questions.
Note: Compression (in-memory) is an alternative to distribution. RDF datasets can be compressed (e.g. “Header Dictionary Triples” – HDT).
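To illustrate the compression note, here is a minimal sketch of dictionary encoding, the basic idea behind the "Dictionary" and "Triples" parts of HDT: every RDF term is stored once and each triple becomes a tuple of small integers. The real HDT format is considerably more elaborate; the function names and the `ex:` terms are assumptions made for this example.

```python
def encode(triples):
    """Map each distinct term to an integer ID and rewrite triples as ID tuples."""
    dictionary = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        ids = tuple(dictionary.setdefault(term, len(dictionary)) for term in triple)
        encoded.append(ids)
    return dictionary, encoded


def decode(dictionary, encoded):
    """Invert the dictionary and restore the original triples."""
    terms = {term_id: term for term, term_id in dictionary.items()}
    return [tuple(terms[term_id] for term_id in ids) for ids in encoded]


# Hypothetical example data.
triples = [
    ("ex:drug1", "ex:treats", "ex:disease1"),
    ("ex:drug2", "ex:treats", "ex:disease1"),
]
dictionary, encoded = encode(triples)
```

Repeated terms such as `ex:treats` are stored once in the dictionary, which is where the space saving comes from on datasets with many recurring URIs.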
ETL instead of direct querying
[Figure: direct querying compared with ETL]
Why?
• Typical DisQover queries introduce considerable latency when federated directly.
• Facets consist of multiple separate SPARQL queries and serve both as filter and as dashboard.
• Data integration in DisQover: facets filter across all data originating from multiple different sources.
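As a rough illustration of why facets multiply the query load, the sketch below builds one SPARQL aggregate query per facet, with the active selections of other facets added as extra triple patterns. It is not DisQover's actual query generation; the function name, the predicate URIs and the query shape are assumptions for this example.

```python
def facet_query(facet_predicate, filters=()):
    """Build a SPARQL query that counts items per value of one facet.

    filters is a sequence of (predicate, value) URI pairs selected in other facets.
    """
    patterns = [f"  ?item <{facet_predicate}> ?value ."]
    for predicate, value in filters:
        patterns.append(f"  ?item <{predicate}> <{value}> .")
    body = "\n".join(patterns)
    return (
        "SELECT ?value (COUNT(DISTINCT ?item) AS ?count)\n"
        "WHERE {\n" + body + "\n}\n"
        "GROUP BY ?value\n"
        "ORDER BY DESC(?count)"
    )


# Hypothetical facets: every facet shown in the dashboard needs its own query.
active = [("http://example.org/condition", "http://example.org/asthma")]
queries = [
    facet_query("http://example.org/phase", active),
    facet_query("http://example.org/sponsor", active),
]
```

Each user interaction would re-run every such query, so sending them to remote endpoints at request time adds up quickly, whereas an ETL step can precompute the indexes.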
Benchmark
Focus of the benchmark design:
• The ETL part needs to be optimally cost-efficient.
• SPARQL queries for the indexes are maximally aligned with the front-end.
• What are the trade-offs for each RDF store?
Questions the benchmark answers
• What is the most cost-effective storage solution to support Linked Data applications that need to deal with heavy ETL query workloads?
• Which performance trade-offs do storage solutions offer in terms of scalability?
• What is the impact of different query types (templates)?
• Is there a difference in performance between the stores based on the structural properties of the queries?
Note: implicitly derived facts, inference and reasoning are not taken into account.
Data and Query Generation
WatDiv provides stress-testing tools for SPARQL. Existing benchmarks are not always suitable for testing systems under diverse queries and varied workloads. WatDiv:
• is a generic benchmark, not application-specific;
• covers a broad spectrum of result cardinality and triple-pattern selectivity, ensured through the data and query generation method;
• is repeatable with different dataset sizes or numbers of queries.
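For readers unfamiliar with the term, triple-pattern selectivity can be read as the fraction of triples in a dataset that match a pattern. The sketch below computes it naively over an in-memory list; it is only an illustration of the concept, not how WatDiv measures or controls it, and the data and names are made up.

```python
def selectivity(triples, pattern):
    """Fraction of triples matching a (subject, predicate, object) pattern.

    None in the pattern acts as a variable and matches any term.
    """
    if not triples:
        raise ValueError("selectivity is undefined for an empty dataset")
    matches = sum(
        all(p is None or p == term for p, term in zip(pattern, triple))
        for triple in triples
    )
    return matches / len(triples)


# Hypothetical toy dataset.
data = [
    ("ex:drug1", "ex:treats", "ex:disease1"),
    ("ex:drug2", "ex:treats", "ex:disease1"),
    ("ex:drug1", "ex:label", "Aspirin"),
    ("ex:study1", "ex:investigates", "ex:drug1"),
]
```

A low value means the pattern is highly selective (few matches); a benchmark that spans both ends of this range exercises different parts of a store's query planner.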
RDF Store Selection
The RDF store should be capable of serving in a production environment with Linked Data in Life Sciences.
The initial selection was made by choosing stores with:
• high adoption/popularity as defined by the DB-Engines.com ranking for RDF stores;
• enterprise support;
• support for distributed deployment;
• full SPARQL 1.1 compliance.
The four stores we selected all comply with these constraints.
Note: The names of two of the stores we tested could not be disclosed. They are referred to as Enterprise Store I and II (ESI and ESII).
Process
The benchmark process consists of a data loading phase, followed by running the SPARQL benchmarker:
1. The data is loaded in compressed format (gzip).
2. The benchmarker runs in multi-threaded mode (8 threads) and runs a set of 2000 queries multiple times.
3. These runs consist of at least one warm-up run, which is not counted.
4. To obtain robust results, the tail results (most extreme) are discarded before calculating average query runtimes.
5. The benchmarker generates a CSV file containing the run times, response times, etc. of all queries, which we visualized.
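Steps 3 and 4 amount to a trimmed mean over the counted runs. The sketch below shows one way to compute it; the slides do not state how large the discarded tails are, so the 10% `trim_fraction` and the function name are assumptions.

```python
from statistics import mean


def robust_mean_runtime(runs, warmup_runs=1, trim_fraction=0.1):
    """Average runtime after dropping warm-up runs and the extreme tails.

    runs is a list of runs, each a list of per-query runtimes in seconds.
    trim_fraction is the share of samples cut from EACH end (assumed value).
    """
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    # Warm-up runs only fill caches and are not counted.
    samples = sorted(t for run in runs[warmup_runs:] for t in run)
    if not samples:
        raise ValueError("no counted runs left after removing warm-up runs")
    k = int(len(samples) * trim_fraction)
    # Discard the k fastest and k slowest samples before averaging.
    return mean(samples[k:len(samples) - k])


# Made-up runtimes: one warm-up run followed by two counted runs.
runs = [
    [100.0, 100.0],
    [1.0, 2.0, 3.0, 4.0, 50.0],
    [1.0, 2.0, 3.0, 4.0, 0.0],
]
```

With these values the warm-up run and the two outliers (0.0 and 50.0) are ignored, so a single slow query does not dominate the reported average.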
Infrastructure
Query Driver
“SPARQL Query Benchmarker” is a general-purpose API and CLI designed primarily for testing remote SPARQL servers.
By default, operations are run in a random order to prevent the system under test (SUT) from learning the pattern of operations.
Hardware
All benchmarks were executed on Amazon Web Services (AWS), using Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
We used the default (commercial) deployments of the SUT so that the results are reproducible:
• both the hardware and the machine images can be easily acquired;
• more generally, cloud deployments offer the advantage of not requiring dedicated on-premises hardware.
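The idea of running operations in random order from several threads can be sketched as follows. This is not the SPARQL Query Benchmarker's own code or API; `run_benchmark` and the `execute` callback are hypothetical stand-ins for whatever sends a query to the endpoint.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter


def run_benchmark(queries, execute, threads=8, seed=None):
    """Run every query once, in shuffled order, on a pool of worker threads.

    execute is a caller-supplied function (hypothetical here) that sends one
    query to the system under test. Returns a list of (query, seconds) pairs.
    """
    order = list(queries)
    # Shuffle so the system under test cannot learn the pattern of operations.
    random.Random(seed).shuffle(order)

    def timed(query):
        start = perf_counter()
        execute(query)
        return query, perf_counter() - start

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(timed, order))


# Usage with a no-op executor standing in for a real SPARQL request.
names = [f"q{i}" for i in range(20)]
results = run_benchmark(names, lambda query: None, seed=1)
```

Passing a `seed` keeps the shuffled order reproducible between repetitions, which can help when comparing stores on identical workloads.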
Results
Cost
Scalability
Behavior (Different Query Types)
Errors and Time-outs
Cost
Scalability: 0.01 B – 0.1 B – 1 B
Scalability: 1B
Behavior: different query types
[Figure: query types S, F and L, and combinations of those (C)]
Behavior: different query types
Errors and time-outs
Every runtime > 300 s is a time-out.
If the runtimes reach a maximum below 300 s, we detect an internally set time-out.
This was in particular the case for ESII (3 nodes).
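One possible way to apply these two rules to the benchmarker's runtimes is sketched below. The 300 s limit comes from the slide; the plateau heuristic (several runtimes clustered at the observed maximum), its `tolerance` and `min_cluster` parameters and the sample numbers are assumptions made for illustration.

```python
BENCHMARK_TIMEOUT_S = 300.0  # limit stated on the slide


def classify_timeouts(runtimes, tolerance=0.5, min_cluster=3):
    """Return (benchmark_timeouts, suspected_internal_limit).

    A runtime above 300 s counts as a benchmark time-out. If several of the
    remaining runtimes pile up at their maximum, that maximum is reported as
    a suspected store-internal time-out; otherwise None is returned.
    """
    timeouts = [t for t in runtimes if t > BENCHMARK_TIMEOUT_S]
    below = [t for t in runtimes if t <= BENCHMARK_TIMEOUT_S]
    if not below:
        return timeouts, None
    ceiling = max(below)
    # Count how many runtimes sit within `tolerance` seconds of the ceiling.
    cluster = [t for t in below if ceiling - t <= tolerance]
    limit = ceiling if len(cluster) >= min_cluster else None
    return timeouts, limit


# Made-up runtimes in seconds.
plateau = [0.4, 2.1, 59.9, 60.0, 60.0, 60.1, 12.0]
no_plateau = [1.0, 2.0, 301.0]
```

A plateau well below the benchmark limit suggests the store aborted the queries itself, so those runtimes should be read as failures rather than as fast answers.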
Scalability: 1B revisited
ESII-3 still outperforms ESII-1 when looking at queries that did not time out.
Issues in the followed approach
• We chose virtual machine images in the cloud (AWS) for reproducibility, but cloud solutions might not always be best suited for production.
• The results of different benchmark studies might depend on many (hidden) configuration factors, leading to different or even contradicting results.
• The difference in performance between the stores might be attributed to the use of commodity hardware in the cloud.
• Differences can be partially attributed to the quality of the recommended configuration parameters as provided by the virtual machine images.
Conclusions & Next steps
• Compared enterprise RDF stores in their default configuration, without the intervention of enterprise support.
• Run the stores in their optimal configuration (reflecting a production setting) and with more instances (> 3).
• Repeat the benchmark with DisQover data and queries.
• Create an overview of RDF solutions for different use cases, configurations and real-world (life science) datasets.
• Investigate whether the WatDiv results are confirmed when running the benchmark with other queries and data.
• Release tools for repeating the benchmark with new storage solutions.
Contact Details
E-MAIL: laurens.devocht@ugent.be
TWITTER: @laurens_d_v
SLIDES: slideshare.net/laurensdv

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 

Recently uploaded (20)

Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 

Big Linked Data ETL Benchmark on Cloud Commodity Hardware

  • 1. Big Linked Data ETL Benchmark on Cloud Commodity Hardware. iMinds – Ghent University: Dieter De Witte, Laurens De Vocht, Ruben Verborgh, Erik Mannens, Rik Van de Walle. Ontoforce: Kenny Knecht, Filip Pattyn, Hans Constandt.
  • 4. Introduction: Facilitate development of a semantic federated query engine to close the (semantic) analytics gap in life sciences. The query engine drives an exploratory search application: DisQover. Approach federated querying by implementing an ETL pipeline that indexes the user views in advance. Combine Linked Open Data with private and licensed (proprietary) data to support discovery of biomedical data and new insights in medicine development.
  • 6. Challenges: Ensure that minimal knowledge about data linking or annotation is required to explore and find results. Writing SPARQL directly requires detailed knowledge of the predicates and might require first exploring to determine the URIs. Scaling out to more data. Search queries are complex because search spans two distinct domains: (1) the ‘space’ of clinical studies; (2) ‘drugs/chemicals’.
  • 8. Approach: How to do federated search with minimal latency for the end user? Which RDF stores support the infrastructure? What aspects should the design of a reusable benchmark take into account?
  • 9. Scaling out: techniques. The scaling-out approach relies on low-end commodity hardware but uses many nodes in a distributed system: 1. Specialized scalable RDF stores, the focus of this work; 2. Translating SPARQL and RDF to existing NoSQL stores; 3. Translating SPARQL and RDF to existing Big Data approaches such as MapReduce, Impala, Apache Spark; 4. Distributing the data in physically separated SPARQL endpoints over the Semantic Web, using federated querying techniques to resolve complex questions. Note: Compression (in-memory) is an alternative to distribution. RDF datasets can be compressed (e.g. “Header Dictionary Triples” – HDT).
  • 10. ETL instead of direct querying (diagram labels: Direct, ETL)
  • 11. Why? Typical DisQover queries introduce much query latency when directly federated. Facets consist of multiple separate SPARQL queries and serve both as filter and as dashboard. Data integration in DisQover: facets filter across all data originating from multiple different sources.
  • 13. Benchmark. Focus of the benchmark design (ETL): The ETL part needs to be optimally cost-efficient. SPARQL queries for indexes are maximally aligned with the front end. What are the trade-offs for each RDF store?
  • 14. Questions the benchmark answers: What is the most cost-effective storage solution to support Linked Data applications that need to be able to deal with heavy ETL query workloads? Which performance trade-offs do storage solutions offer in terms of scalability? What is the impact of different query types (templates)? Is there a difference in performance between the stores based on the structural properties of the queries? Note: implicitly derived facts, inference and reasoning are not taken into account.
  • 15. Data and Query Generation: WatDiv provides stress testing tools for SPARQL, because existing benchmarks are not always suitable for testing systems under diverse queries and varied workloads. It is a generic benchmark, not application specific. It covers a broad spectrum of result cardinality and triple-pattern selectivity, ensured through the data and query generation method. The benchmark is repeatable with different dataset sizes or numbers of queries.
  • 16. RDF Store Selection: The RDF store should be capable of serving in a production environment with Linked Data in Life Sciences. The initial selection was made by choosing stores with: a high adoption/popularity as defined by the DB-Engines.com ranking for RDF stores; enterprise support; support for distributed deployment; full SPARQL 1.1 compliance. The four stores we selected all comply with these constraints. Note: The names of two stores we tested could not be disclosed. They are referred to as Enterprise Store I and II (ESI and ESII).
  • 17. Process: The benchmark process consists of a data loading phase, followed by running the SPARQL benchmarker: 1. The data is loaded in compressed format (gzip). 2. The benchmarker runs in multi-threaded mode (8 threads) and runs a set of 2000 queries multiple times. 3. These runs consist of at least one warm-up run, which is not counted. 4. In order to obtain robust results, the tail results (most extreme) are discarded before calculating average query runtimes. 5. The benchmarker generates a CSV file containing the run times, response times, etc. of all queries, which we visualized.
  • 18. Infrastructure. Query Driver: “SPARQL Query Benchmarker” is a general-purpose API and CLI that is designed primarily for testing remote SPARQL servers. By default, operations are run in a random order to avoid the system under test (SUT) learning the pattern of operations. Hardware: All benchmarks were executed on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) and Simple Storage Service (S3). The default (commercial) deployments of the SUT were used for the results to be reproducible: both the hardware and the machine images can be easily acquired. More generally, cloud deployments offer the advantage of not requiring dedicated on-premises hardware.
  • 20. Results: Cost; Scalability; Behavior (Different Query Types); Errors and Time-outs.
  • 22. Scalability: 0.01 B – 0.1 B – 1 B
  • 24. Behavior: different query types (figure labels: S, F, L; combinations of those: C)
  • 26. Errors and time-outs: Every runtime > 300s is a time-out. If the runtime reaches a maximum of < 300s, we detect an internally set time-out. This was in particular the case for ESII (3 nodes). (figure label: 60)
  • 27. Scalability: 1B revisited. ESII-3 still outperforms ESII-1 when looking at queries that did not time out. (figure label: 60)
  • 28. Issues in the followed approach: We chose virtual machine images in the cloud (AWS) for reproducibility, but cloud solutions might not always be best suited for production. The results of different benchmark studies might depend on many (hidden) configuration factors, leading to different or even contradicting results. The difference in performance between the stores might be attributed to the use of commodity hardware in the cloud. Differences are partially attributed to the quality of the recommended configuration parameters as provided by the virtual machine images.
  • 30. Conclusions & Next steps: Compared enterprise RDF stores in their default configuration, without the intervention of enterprise support. Run stores in their optimal configuration (reflecting a production setting) with more instances (> 3). Repeat the benchmark with DisQover data and queries. Create an overview of RDF solutions for different use cases, configurations and real-world (life science) datasets. Investigate whether the WatDiv results are confirmed when running the benchmark with other queries and data. Release tools for repeating the benchmark with new storage solutions.
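
The aggregation described under Process (warm-up run not counted, most extreme tail results discarded before averaging) and the 300 s time-out rule could be sketched roughly as below. This is a minimal illustration, not the SPARQL Query Benchmarker's own implementation: the function name `summarize_runs`, the list-of-lists data layout, the trim fraction, and the choice to trim both tails symmetrically are all assumptions, since the slides do not specify them.

```python
from statistics import mean

TIMEOUT_S = 300.0  # from the slides: every runtime > 300 s counts as a time-out


def summarize_runs(runs, warmup_runs=1, trim_fraction=0.05):
    """Aggregate per-query runtimes (in seconds) over several benchmark runs.

    runs: list of runs, each a list of runtimes (this layout is an assumption).
    warmup_runs: number of leading runs that are dropped and not counted.
    trim_fraction: share of results cut from EACH tail (assumed value).
    """
    measured = runs[warmup_runs:]
    runtimes = [t for run in measured for t in run]

    # Separate time-outs from queries that completed.
    timeouts = sum(1 for t in runtimes if t > TIMEOUT_S)
    completed = sorted(t for t in runtimes if t <= TIMEOUT_S)

    # Discard the most extreme results on both ends, then average the rest.
    k = int(len(completed) * trim_fraction)
    kept = completed[k:len(completed) - k]
    return {
        "timeouts": timeouts,
        "trimmed_mean_s": mean(kept) if kept else None,
    }


runs = [
    [100.0, 100.0],  # warm-up run, ignored
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 301.0],
]
summary = summarize_runs(runs, warmup_runs=1, trim_fraction=0.1)
print(summary)
```

Counting time-outs separately from the trimmed mean keeps a store that fails many queries from looking fast merely because its failures were excluded from the average.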

Editor's Notes

  1. 1. Title slide 2. Problem: What is the problem that you are addressing and why is the problem important? Who will benefit if you succeed? Who should care? 3. State of the art: Why is the problem difficult? What have others tried to do? 4. Research questions and hypothesis: What is the object of your study? What is the hypothesis that you will test? 5. Preliminary results: Do you have any preliminary results that demonstrate that your approach is promising? 6. Your approach: What is the main idea behind your approach? The key innovation? 7. Evaluation plan: How do you plan to test your hypothesis? What will you measure? What will you compare to? 8. Reflections: Provide an argument, based either on common knowledge or on evidence that you have accumulated, that your approach is likely to succeed. 9. Lessons Learned: Summarize the lessons that you have learned so far. Discuss the positive and negative results that you have observed.
  6. There is no clear second place. Whereas ESI performs well on small datasets, Blazegraph shows better results for larger datasets. ESII’s performance on a single-instance benchmark is worse than the others, but its claim of being highly scalable is confirmed in a configuration with three instances, where it performs significantly better. All data stores have acceptable results for 10 and 100 million triples, and the choice between them could depend on additional features each store has to offer, such as support for full-text indexing, support for linked data fragment interfaces or superior automatic inferencing. For the larger datasets, Virtuoso should be the first choice as a single-instance solution. The initial results in a distributed setup with ESII are promising in terms of scaling out, suggesting that this store’s power might only be revealed in large multi-instance benchmarks.