Soheila Dehghanzadeh
Agenda
Introduction to Trade-offs in Integration Systems
Requirements and Research Questions
Contributions
Conclusions and Future Work
Introduction
What is data integration?
• “Combining data from different distributed sources” [1].
Why is it important?
• Most queries require integrating data from various sources.
Why is it challenging?
• Sources are autonomous and distributed.
• Distributing a query among the sources to compute the response raises performance, scalability and availability problems.
• Caching solves these problems but leads to inconsistencies.
• Maintaining the cache increases latency.
[1] https://en.wikipedia.org/wiki/Data_integration
The latency/consistency trade-off
[Figure: latency (low to high) against consistency (low to high), showing where data warehouses, mediator systems and the ideal case fall.]
Data integration
Data integration approaches
• Data warehouse (DW)
• Low latency
• Low consistency
[Figure: the latency/consistency diagram with the data warehouse highlighted: low latency, low consistency.]
Data Market: Lowest latency with a
consistency threshold
Minimize cost (financial and latency) as long as consistency stays above a threshold
Customer: Find me emails of “The North Face” customers.
Data market: My existing data can provide you a response with 60% freshness.
Customer (option 1): Ok.
Data market: Here is the response.
Customer (option 2): No, I want the fastest response with at least 80% freshness.
Data market: To provide 80% freshness you need to wait 30 sec and pay $60.
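The negotiation above can be read as a small constrained choice: among the possible ways of answering, take the fastest one whose freshness meets the customer's threshold. The following is a minimal Python sketch of that reading, not the data market's actual logic. The `Plan` type, the `fastest_plan` helper and the zero price for the existing data are assumptions made for illustration; only the 60%, 80%, 30 sec and $60 figures come from the slide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """One hypothetical way of answering a query."""
    name: str
    freshness: float      # estimated fraction of fresh results, 0.0 to 1.0
    wait_seconds: float
    price_usd: float


def fastest_plan(plans: Sequence[Plan], min_freshness: float) -> Optional[Plan]:
    """Return the lowest-latency plan that meets the freshness threshold.

    Ties on latency are broken by the lower price. None means no plan
    can reach the requested freshness.
    """
    eligible = [p for p in plans if p.freshness >= min_freshness]
    if not eligible:
        return None
    return min(eligible, key=lambda p: (p.wait_seconds, p.price_usd))


plans = [
    # Price of the existing data is not given on the slide; 0.0 is assumed.
    Plan("existing data", freshness=0.60, wait_seconds=0.0, price_usd=0.0),
    Plan("refresh first", freshness=0.80, wait_seconds=30.0, price_usd=60.0),
]

print(fastest_plan(plans, 0.60))
print(fastest_plan(plans, 0.80))
```

A customer who accepts 60% freshness is served from the existing data, while one who insists on 80% gets the plan that waits and pays.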
Research Question 1
How can data be maintained optimally when consistency is constrained by a threshold and latency must be minimized?
Summary of contribution 1
A method to estimate the response freshness using the existing data (JIST2014, ISWC2014).
• Extend summarization techniques to trace the freshness.
• Indexing, histogram and Qtree
• Use summary to estimate the response freshness.
Evaluation
• We estimated the freshness of a query with a 6% error rate.
Future work
• Use more advanced summarizations to lower the error rate.
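To make the idea of a freshness-aware summary concrete, here is a minimal Python sketch of a histogram whose buckets record how many of their entries are fresh. It is an illustration under assumptions, not the method from the cited papers: the `Bucket` layout, the uniform-spread assumption inside a bucket and the `estimate_freshness` name are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bucket:
    low: float    # inclusive lower bound of the bucket
    high: float   # exclusive upper bound of the bucket (high > low)
    total: int    # cached entries whose value falls in the bucket
    fresh: int    # how many of those entries are known to be up to date


def estimate_freshness(buckets: List[Bucket], q_low: float, q_high: float) -> float:
    """Estimate the fraction of fresh results for the range query [q_low, q_high).

    Entries are assumed to be spread uniformly inside each bucket, so a
    partially overlapped bucket contributes in proportion to the overlap.
    An empty estimated result is reported as fully fresh (an assumption).
    """
    est_total = 0.0
    est_fresh = 0.0
    for b in buckets:
        overlap = min(b.high, q_high) - max(b.low, q_low)
        if overlap <= 0:
            continue
        share = overlap / (b.high - b.low)
        est_total += share * b.total
        est_fresh += share * b.fresh
    return est_fresh / est_total if est_total else 1.0


summary = [Bucket(0, 10, total=100, fresh=90), Bucket(10, 20, total=50, fresh=20)]
print(estimate_freshness(summary, 5, 15))
```

The estimate is computed from the summary alone, without touching the sources, which is what makes it usable before deciding whether a refresh is worth its cost.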
Data integration
Data integration approaches
• Data warehouse (DW)
• Low latency
• Low consistency
• Mediator systems (MS)
• High latency
• High consistency
[Figure: the latency/consistency diagram with mediator systems highlighted: high latency, high consistency.]
Mediator system: Highest consistency with a latency threshold
[Figure: an RDF stream generator is joined with background data (SPARQL endpoint) through a local view. The freshness of the local view decreases over time and a maintenance process refreshes it, which is a cost/quality trade-off.]
Research Question 2
How can data be maintained optimally when latency is constrained by a threshold and consistency must be maximized?
Summary of contribution 2
A maintenance process to maximize consistency with respect to a latency constraint (WWW2015, ICWE2015).
• Query-driven: maintain cache entries that are involved in the current evaluation
• Freshness-driven: maintain cache entries that
• Are stale
• Change less frequently
• Affect future evaluations
Evaluation
• The proposed approach outperforms a set of baseline policies.
This work has already been followed up
• Queries with FILTER clauses (ICWE2016)
• Queries with complex join patterns (ISWC2016)
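The two selection criteria of the maintenance process can be sketched in a few lines of Python. This is a hedged illustration, not the published policy: the `CacheEntry` fields, the lexicographic ranking and the `budget` parameter (standing in for the number of refreshes the latency constraint allows) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class CacheEntry:
    key: str
    stale: bool          # believed to be out of date
    change_rate: float   # estimated changes per time unit (assumed known)
    future_uses: int     # upcoming evaluations expected to need this entry


def choose_refreshes(cache: Dict[str, CacheEntry],
                     window_keys: Iterable[str],
                     budget: int) -> List[str]:
    """Pick at most `budget` cache keys to refresh before answering."""
    # Query driven: only entries that take part in the current evaluation.
    candidates = [cache[k] for k in set(window_keys) if k in cache]
    # Freshness driven: only stale entries are worth a refresh ...
    candidates = [e for e in candidates if e.stale]
    # ... preferring those reused most often and changing least often.
    candidates.sort(key=lambda e: (-e.future_uses, e.change_rate, e.key))
    return [e.key for e in candidates[:budget]]


cache = {
    "x": CacheEntry("x", stale=True, change_rate=0.5, future_uses=3),
    "y": CacheEntry("y", stale=True, change_rate=0.1, future_uses=3),
    "z": CacheEntry("z", stale=False, change_rate=0.1, future_uses=5),
    "w": CacheEntry("w", stale=True, change_rate=0.1, future_uses=9),
}
print(choose_refreshes(cache, ["x", "y", "z", "x"], budget=1))
```

Entry `w` is never chosen here because it is not part of the current evaluation, and `z` is skipped because it is not stale. Between `x` and `y`, the slower-changing `y` wins because a refresh of it is expected to stay valid longer.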
Data integration
Data integration approaches
• Data warehouse (DW)
• Low latency
• Low consistency
• Mediator systems (MS)
• High latency
• High consistency
Integration in a real system
Contributing the proposed policies to
CSPARQL
• So far we assumed that all data required to answer the query exists in the local cache and only needs to be maintained.
• What if the required data does not fit in the local cache?
[Figure: a local cache of entries in front of the SERVICE provider.]
Research Question 3
How can a space constraint be taken into account while optimizing data integration with respect to latency or consistency constraints?
Summary of contribution 3
• An extension of the maintenance policy (contribution 2) to take into
account both latency and space constraints.
• Fetching policies to cope with cache incompleteness
• A freshness based cache replacement policy
• An implementation in CSPARQL
• Evaluation
• The proposed replacement policy outperforms state-of-the-art
replacement policies.
• Future work
• Investigating more complex queries (e.g., with multiple SERVICE
clauses, complex join patterns)
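A freshness-based replacement policy combined with fetching on a cache miss could look roughly like the Python sketch below. It is only an illustration: the slides do not state the eviction rule, so the choice made here (evict the entry least likely to still be fresh, assuming changes follow a Poisson process with a known rate) is an assumption, as are the `FreshnessCache` class and its method names.

```python
import math
from typing import Callable, Dict, Tuple


class FreshnessCache:
    """Bounded cache that evicts the entry least likely to still be fresh."""

    def __init__(self, capacity: int, fetch: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch  # stand-in for a call to the remote SERVICE provider
        # key -> (value, time of last refresh, assumed change rate)
        self.entries: Dict[str, Tuple[object, float, float]] = {}

    def p_fresh(self, key: str, now: float) -> float:
        """Probability that the entry is unchanged since its last refresh."""
        _, refreshed_at, rate = self.entries[key]
        return math.exp(-rate * (now - refreshed_at))

    def get(self, key: str, now: float, rate: float) -> object:
        """Return the cached value, fetching (and evicting) on a miss."""
        if key not in self.entries:
            if len(self.entries) >= self.capacity:
                victim = min(self.entries, key=lambda k: self.p_fresh(k, now))
                del self.entries[victim]
            self.entries[key] = (self.fetch(key), now, rate)
        return self.entries[key][0]


fetched = []


def fake_provider(key: str) -> str:
    """Hypothetical provider that records which keys were requested."""
    fetched.append(key)
    return key.upper()


cache = FreshnessCache(capacity=2, fetch=fake_provider)
cache.get("a", now=0.0, rate=1.0)    # fast-changing entry
cache.get("b", now=0.0, rate=0.01)   # slow-changing entry
cache.get("a", now=5.0, rate=1.0)    # hit, no fetch
cache.get("c", now=10.0, rate=0.5)   # miss on a full cache
print(sorted(cache.entries), fetched)
```

When `c` arrives, the fast-changing `a` is evicted because it is the entry least likely to still be fresh, so the limited space is kept for the entry that is most likely still usable.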
Conclusions
An ideal integration engine (low latency and high consistency) is not possible because these two dimensions trade off against each other.
Contributions:
• Optimizing response latency under a consistency threshold, studied in the context of a data marketplace.
• A maintenance policy to optimize response consistency under a latency threshold, in the context of knowledge-based event processing.
• Introduction of space constraints to integrate my approach in CSPARQL.
Data Integration
1. Maintaining the cache based on the latency constraint of the query (Event Detection): a data stream and a data source feed the cache, its freshness decreases over time, and the maintenance process refreshes it based on the latency constraint of the query (critical latency).
2. Maintaining the cache based on the consistency constraint of the query (Data Market): data sources feed the cache, its freshness decreases over time, and the maintenance process refreshes it based on the consistency constraint of the query (critical consistency).
Soheila.dehghanzdeh@insight-centre.org Unit for Reasoning
and Querying
Mediator system: Highest consistency with
a latency threshold
Query: find Twitter users that have been mentioned more than 5 times in the last minute and are followed by more than 1000 users.

Twitter mention stream (input to the stream processor):
• #X is super hero
• #X won the gold medal
• #X broke the world record
• #X is awesome
• #X …
• #Y is super hero
• #Y won the bronze medal
• #Y broke the world record
• #Y is awesome
• #Y …
• #Z is great
• #Z won the silver medal
• #Z broke the world record
• #Z is awesome
• Well done to #Z, #Y, #X

Twitter Follower API:
• #X has 1007 followers
• #Y has 2000 followers
• #Z has 500 followers

Result:
User   Mentioned   Followed by
#X     7           1007
#Y     6           2000

Updated follower counts:
• #X has 1007 followers
• #Y has 2000 followers
• #Z has 600 followers
• #X has 998 followers
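The query in this example can be sketched as a window aggregate joined with follower counts. The Python below is a minimal illustration, not the stream processor's implementation: `popular_users` is a hypothetical helper, the window is given as a plain list of mentioned users, and the mention count of 5 for #Z is an assumption (the slide only tabulates #X and #Y).

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def popular_users(window_mentions: Sequence[str],
                  followers: Dict[str, int],
                  min_mentions: int = 5,
                  min_followers: int = 1000) -> List[Tuple[str, int, int]]:
    """Join per-user mention counts in the window with follower counts."""
    counts = Counter(window_mentions)
    return sorted(
        (user, n, followers[user])
        for user, n in counts.items()
        if n > min_mentions and followers.get(user, 0) > min_followers
    )


# Mentions seen in the last minute; the count for #Z is assumed.
window = ["#X"] * 7 + ["#Y"] * 6 + ["#Z"] * 5

cached = {"#X": 1007, "#Y": 2000, "#Z": 500}   # local copy of follower counts
live = {"#X": 998, "#Y": 2000, "#Z": 600}      # what the API would return now

print(popular_users(window, cached))
print(popular_users(window, live))
```

Answering from the cached counts is fast but still reports #X, whose follower count has dropped to 998 in the meantime. Answering from the live counts is consistent but pays a remote call per user, which is the trade-off the maintenance process has to manage.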
Contributing the proposed policies to
CSPARQL
Requirements
• A local cache R
• Fetch SERVICE from R
• Maintain R
• ESPER external time
The modified engine is available on GitHub.
[Figure: a local cache of time-stamped entries in front of the SERVICE provider.]
Workloads with significant improvements
with proposed policy
We hypothesize that WSJ-WBM is more influential if:
• Hypothesis 1: the BKG data changes more slowly
• Hypothesis 2: the BKG data changes with more diversity in change rate
• Hypothesis 3: there is a negative correlation between the streaming rate and the change rate
• Hypothesis 4: the total number of possible events (i.e., caching space) is larger
The time overhead of WSJ-WBM is negligible.
Experimental setup
A data generator to generate various workloads with:
• Various change-rate distributions within an interval: random or normal distribution
• Various streaming rates: the inter-arrival time of elements follows a Poisson distribution with various lambda intervals
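A workload generator along these lines might be sketched as follows. This is an assumption-laden illustration rather than the generator used in the experiments: "random" is read as uniform, the normal distribution is clipped to the interval, and Poisson variates are drawn with Knuth's multiplication method. All function names are hypothetical.

```python
import math
import random
from typing import List


def poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def arrival_times(rng: random.Random, lam: float, n: int) -> List[int]:
    """Timestamps of n stream elements whose gaps are Poisson(lam) distributed."""
    now, times = 0, []
    for _ in range(n):
        now += poisson(rng, lam)
        times.append(now)
    return times


def change_rates(rng: random.Random, n: int, low: float, high: float,
                 normal: bool = False) -> List[float]:
    """Change rates inside [low, high]: uniform, or normal clipped to the interval."""
    if not normal:
        return [rng.uniform(low, high) for _ in range(n)]
    mean, std = (low + high) / 2, (high - low) / 6
    return [min(high, max(low, rng.gauss(mean, std))) for _ in range(n)]


rng = random.Random(42)          # fixed seed so a workload can be regenerated
times = arrival_times(rng, lam=4.0, n=10000)
rates = change_rates(rng, n=1000, low=0.1, high=0.9, normal=True)
print(times[:5], rates[:3])
```

Varying `lam` changes the streaming rate, and varying `low`, `high` and `normal` changes how diverse the change rates of the background data are.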
[Figure: Hypothesis 1: the BKG data changes more slowly.]
[Figure: Hypothesis 2: the BKG data changes with more diversity in change rate.]
[Figure: Hypothesis 3: negative correlation between the streaming rate and the change rate.]
[Figure: Hypothesis 4: the total number of possible events (i.e., caching space) is larger.]
[Figure: the time overhead of WSJ-WBM is negligible.]
Combining RDF Streams and Remotely Stored
Background Data
We move to an approximate setting: we introduce a local view that stores part of the data involved in query processing, and we update part of that view to capture the dynamicity of the data.
A query-driven maintenance process
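SELECT * WHERE WINDOW(S, ω, β) P_W . SERVICE(BKG) P_S
[Figure: the WINDOW clause and the SERVICE clause meet in a JOIN over the local view; a proposer, a ranker and a maintainer update the local view in four numbered steps. Labels shown in the figure: E, C, RND, LRU, WBM, CWSJ, WSJ, GNR, LRU, FRP.]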
Evaluation
Model Attribute Check Company Auto Property
 
Synthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptxSynthetic Fiber Construction in lab .pptx
Synthetic Fiber Construction in lab .pptx
 
Sectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdfSectors of the Indian Economy - Class 10 Study Notes pdf
Sectors of the Indian Economy - Class 10 Study Notes pdf
 
How to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERPHow to Create Map Views in the Odoo 17 ERP
How to Create Map Views in the Odoo 17 ERP
 
PART A. Introduction to Costumer Service
PART A. Introduction to Costumer ServicePART A. Introduction to Costumer Service
PART A. Introduction to Costumer Service
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
How to Break the cycle of negative Thoughts
How to Break the cycle of negative ThoughtsHow to Break the cycle of negative Thoughts
How to Break the cycle of negative Thoughts
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Introduction to Quality Improvement Essentials
Introduction to Quality Improvement EssentialsIntroduction to Quality Improvement Essentials
Introduction to Quality Improvement Essentials
 

Offsite presentation original

  • 2. Agenda: Introduction to Trade-offs in Integration Systems; Requirements and Research Questions; Contributions; Conclusions and Future Work
  • 3. Introduction. What is data integration? • “Combining data from different distributed sources” [1]. Why is it important? • Most queries require integrating data from various sources. Why is it challenging? • Sources are autonomous and distributed. • Distributing a query among sources to provide the response has performance, scalability and availability problems. • Caching solves the above problems but leads to inconsistencies. • Maintaining the cache increases latency. [1] https://en.wikipedia.org/wiki/Data_integration
  • 4. The latency/consistency trade-off [Figure: latency (low to high) against consistency (low to high), marking the data warehouse, mediator systems, and the ideal case]
  • 5. Data integration. Data integration approaches: • Data warehouse (DW): low latency, low consistency [Figure: latency/consistency plane marking the data warehouse, mediator systems, and the ideal case]
  • 7. Data Market: Lowest latency with a consistency threshold. Minimize cost (financial and latency) as long as consistency is above a threshold. [Dialogue] User: “Find me emails of ‘The North Face’ customers.” Market: “My existing data can provide you a response with 60% freshness.” Either the user accepts (“Ok”; “Here is the response”), or the user asks for more: “No, I want the fastest response with at least 80% freshness.” Market: “To provide 80% freshness you need to wait 30 sec and pay $60.”
  • 8. Research Question 1: How to optimally maintain data when consistency is restricted and latency is demanded to be minimized?
  • 9. Summary of contribution 1. A method to estimate the response freshness using the existing data (JIST2014, ISWC2014). • Extend summarization techniques to trace the freshness. • Indexing, histogram and Qtree • Use summary to estimate the response freshness. Evaluation • We managed to estimate the freshness of a query with 6% error rate. Future work • Use more advanced summarizations to lower the error rate.
  • 10. Data integration. Data integration approaches: • Data warehouse (DW): low latency, low consistency • Mediator systems (MS): high latency, high consistency [Figure: latency/consistency plane marking the data warehouse, mediator systems, and the ideal case, with the data warehouse highlighted]
  • 12. Mediator system: Highest consistency with a latency threshold [Diagram: a Join between the RDF Stream Generator and the Background data (SPARQL endpoint)]
  • 13. Mediator system: Highest consistency with a latency threshold [Diagram: a Join between the RDF Stream Generator and the Background data (SPARQL endpoint), with a Local View added]
  • 14. Mediator system: Highest consistency with a latency threshold [Diagram: a Join between the RDF Stream Generator and the Background data (SPARQL endpoint), with a Local View and a Maintenance Process. Labels: Freshness decreases; Refresh; Cost/Quality trade-off]
  • 15. Research Question 2: How to optimally maintain data when the latency is restricted and consistency is demanded to be maximized?
  • 16. Summary of contribution 2. A maintenance process to maximize consistency with respect to a latency constraint (WWW2015, ICWE2015). • Query driven: maintain cache entries that are involved in the current evaluation • Freshness driven: maintain cache entries that are stale, change less frequently, and affect future evaluations. Evaluation • The proposed approach outperforms a set of baseline policies. This work has already been followed up • Queries with FILTER clauses (ICWE2016) • Queries with complex join patterns (ISWC2016)
  • 17. Data integration. Data integration approaches: • Data warehouse (DW): low latency, low consistency • Mediator systems (MS): high latency, high consistency. Integration in a real system. [Figure: latency/consistency plane marking the data warehouse, mediator systems, and the ideal case, with both approaches highlighted]
  • 18. Contributing the proposed policies to CSPARQL • So far we assumed that all data required to provide the response exists in the local cache but needs to be maintained. • What if the required data does not fit in the local cache? [Diagram labels: entries, SERVICE provider, local cache]
  • 19. Research Question 3: How to take into account a space constraint while optimizing data integration with regard to latency or consistency constraints?
  • 20. Summary of contribution 3 • An extension of the maintenance policy (contribution 2) to take into account both latency and space constraints. • Fetching policies to cope with cache incompleteness • A freshness-based cache replacement policy • An implementation in CSPARQL • Evaluation • The proposed replacement policy outperforms state-of-the-art replacement policies. • Future work • Investigating more complex queries (e.g., with multiple SERVICE clauses, complex join patterns)
  • 21. Conclusions. An ideal integration engine (low latency and high consistency) is not possible because these two dimensions are in trade-off. Contributions: • Optimizing response latency with a consistency threshold has been studied in the context of Data Marketplace. • A maintenance policy to optimize response consistency with a latency threshold in the context of knowledge-based event processing. • Introduction of space constraints to integrate my approach in CSPARQL. [Figure: latency/consistency plane marking the data warehouse, mediator systems, and the ideal case]
  • 23. Data Integration [Diagram 1 labels: Data Stream; Data Source; Cache; Maintenance Process; Freshness decreases; Refresh based on latency constraint; Query (critical latency). Diagram 2 labels: Data Source; Data Source; Cache; Maintenance Process; Freshness decreases; Refresh based on consistency constraint; Query (critical consistency)] 1. Maintaining cache based on latency constraint of query (Event Detection) 2. Maintaining cache based on consistency constraint of query (Data Market). Soheila.dehghanzdeh@insight-centre.org, Unit for Reasoning and Querying
  • 24. Mediator system: Highest consistency with a latency threshold. Query: find Twitter users that have been mentioned more than 5 times in the last minute and are followed by more than 1000 users. [Diagram: a Stream Processor combines the Twitter mention stream with the Twitter Follower API] Follower data: #X has 1007 followers; #Y has 2000 followers; #Z has 500 followers. Mention stream: “#X is super hero”; “#X won the gold medal”; “#X broke the world record”; “#X is awesome”; “#X …”; “#Y is super hero”; “#Y won the bronze medal”; “#Y broke the world record”; “#Y is awesome”; “#Y …”; “#Z is great”; “#Z won the silver medal”; “#Z broke the world record”; “#Z is awesome”; “Well done to #Z, #Y, #X”. Result (User / Mentioned / Followed by): #X / 7 / 1007; #Y / 6 / 2000. Follower data over time: #X has 1007 followers; #Y has 2000 followers; #Z has 600 followers; #X has 998 followers
  • 25. Contributing the proposed policies to CSPARQL. Requirements: • A local cache R • Fetch SERVICE from R • Maintain R • ESPER external time. The modified engine is available on GitHub. [Diagram labels: time stamp, entries, SERVICE provider, local cache]
  • 26. Workloads with significant improvements with the proposed policy. We hypothesize that WSJ-WBM is more influential if: • Hypothesis 1: the BKG data changes more slowly • Hypothesis 2: the BKG data changes with more diversity in change rate • Hypothesis 3: there is a negative correlation between the streaming rate and the change rate • Hypothesis 4: the total number of possible events (i.e., caching space) is larger. The time overhead of WSJ-WBM is negligible
  • 27. Experimental setup. A data generator to generate various workloads with • Various change rate distributions within an interval: random or normal distribution • Various streaming rates: the inter-arrival time of elements follows a Poisson distribution with various lambda intervals
  • 28. Hypothesis 1: BKG data changes more slowly.
  • 29. Hypothesis 2: BKG data changes with more diversity in change rate.
  • 30. Hypothesis 3: negative correlation between the streaming and change rate
  • 31. Hypothesis 4: total number of possible events (i.e., caching space) is larger
  • 32. Hypothesis 4: The time overhead of WSJ-WBM is negligible [Plot panels: Local, Remote]
  • 33. Combining RDF Streams and Remotely Stored Background Data. We move to an approximate setting, and we introduce a local view to store part of the data involved in the query processing, and update part of it to capture the dynamicity
  • 34. A query-driven maintenance process. SELECT * WHERE WINDOW(S, ω, β) PW . SERVICE(BKG) PS [Diagram: the WINDOW clause and the SERVICE clause feed a JOIN; a Proposer, a Ranker and a Maintainer work on the Local View, with steps numbered 1 to 4. Other labels: E, C, RND, LRU, WBM, CWSJ, WSJ, GNR, LRU, FRP]
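The query-driven, freshness-driven maintenance idea of slides 14, 16 and 34 can be illustrated with a minimal Python sketch. It is not the WSJ or WBM policy from the talk, whose details the slides do not give. The `CacheEntry` fields `stale_after` and `change_interval`, the ranking rule and the `budget` parameter are all assumptions made for illustration; the follower numbers come from slide 24, and the mention count for #Z is made up. Before each evaluation, the sketch refreshes at most `budget` entries that take part in the current window and are believed stale, preferring those that change less often, and then joins the window with the local view.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    followers: int        # cached value from the remote source
    stale_after: int      # estimated time at which the value goes stale (assumed field)
    change_interval: int  # estimated time between changes at the source (assumed field)


def pick_for_refresh(view, window_users, now, budget):
    """Choose up to `budget` users of the current window to refresh.

    Assumed rule: only entries believed stale are candidates, and
    slowly changing entries come first. The talk's policies may differ.
    """
    stale = [u for u in window_users if u in view and now >= view[u].stale_after]
    stale.sort(key=lambda u: view[u].change_interval, reverse=True)
    return stale[:budget]


def evaluate(mentions, view, remote, now, budget,
             min_mentions=5, min_followers=1000):
    """Refresh a few entries, then join the window with the local view."""
    for user in pick_for_refresh(view, mentions.keys(), now, budget):
        entry = view[user]
        entry.followers = remote[user]  # one remote fetch per refreshed entry
        entry.stale_after = now + entry.change_interval
    return sorted(
        user for user, count in mentions.items()
        if count > min_mentions
        and user in view
        and view[user].followers > min_followers
    )


if __name__ == "__main__":
    view = {
        "#X": CacheEntry(1007, stale_after=5, change_interval=20),
        "#Y": CacheEntry(2000, stale_after=50, change_interval=60),
        "#Z": CacheEntry(500, stale_after=5, change_interval=8),
    }
    remote = {"#X": 998, "#Y": 2000, "#Z": 600}  # current values at the source
    mentions = {"#X": 7, "#Y": 6, "#Z": 5}       # counts in the last window
    print(evaluate(mentions, view, remote, now=10, budget=0))
    print(evaluate(mentions, view, remote, now=10, budget=1))
```

With a budget of 0 the answer is computed from the stale view and still contains #X. With a budget of 1 the sketch refreshes #X, sees 998 followers, and drops it from the answer. This is the cost/quality trade-off of slide 14 in miniature: each refresh costs one remote call and may correct one result.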

Editor's Notes

  1. Never use inconsistency
  2. Introduction to trade-offs. Given that not everybody knows my work, I give a summary of my work first, then … I want to briefly explain what I did in previous years, focusing particularly on what I did in the last year.
  3. What is the problem? Talk about the trade-off. The problem I investigate is …. First I studied this problem in the data warehouse setting.
  4. Users want to keep latency low but with a reasonable amount of consistency => consistency constraint
  5. Fetch those that can increase the response latency more. Users aim to minimize the cost as long as the quality of the data is reasonable. A threshold for consistency. Cost is the result of fetching requests => latency. Infochimps and Microsoft Azure Marketplace. You are charged for the amount of requested freshness in your response.
  6. What was the contribution on DW
  7. What is the problem? Talk about the trade-off. The problem I investigate is …. First I studied this problem in the data warehouse setting.
  8. We want to keep consistency high but with a reasonable latency => latency constraint
  9. The less we maintain, the faster we can process queries, but how much less? How to minimize the maintenance? Extension: to consider all users from the stream, if a user doesn’t exist in the local view, we fetch it and let it replace one of the existing entries in the local view.
  10. The less we maintain, the faster we can process queries, but how much less? How to minimize the maintenance? Extension: to consider all users from the stream, if a user doesn’t exist in the local view, we fetch it and let it replace one of the existing entries in the local view.
  11. 2-3 slides introducing WSJ and WBM. Limitation of ICWE => we assume that the local view always contains all the elements needed to compute the current answer.
  12. I also did some other experiments with CSPARQL; I measured the overhead and it is less than 1%. The clock of CSPARQL should consider the timestamp carried by the streaming data.
  13. The less we maintain, the faster we can process queries, but how much less? How to minimize the maintenance? Extension: to consider all users from the stream, if a user doesn’t exist in the local view, we fetch it and let it replace one of the existing entries in the local view.
  14. I also did some other experiments with CSPARQL; I measured the overhead and it is less than 1%. The clock of CSPARQL should consider the timestamp carried by the streaming data.
  15. The time overhead of WSJ-WBM is negligible ????
  16. Put the size of the cache in these 2 plots
  17. Introduce overhead percentage… Local vs remote? Why does remote have less overhead than local???
  18. In real use cases, BKG data is located in different locations, and it is not possible to replicate it on the engine machine: there are limitations on the amount of data that can be retrieved over time; data changes on the source and the changes are not pushed to the engine. RSP-QL captures: the dynamicity of the graph -> time-varying graph; the remote location -> SERVICE clause. We move to an approximate setting, and we introduce a cache to store part of the data involved in the query processing, and update part of it to capture the dynamicity.
  19. Ranker -> introduce the symbols of the next slide
  20. Introduce the axes first, then the lines a little at a time
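The extension described in notes 9, 10 and 13 (fetch a missing user and let it replace an existing entry) can be sketched as follows. This is a minimal illustration under stated assumptions, not the replacement policy evaluated in the talk: the `LocalView` class, the `fetch` callable and the eviction rule (drop the entry whose estimated freshness expires first) are all hypothetical.

```python
class LocalView:
    """Bounded local view with a freshness-based replacement rule (assumed)."""

    def __init__(self, capacity, fetch):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.fetch = fetch   # hypothetical callable: key -> (value, change_interval)
        self.entries = {}    # key -> (value, stale_after)

    def get(self, key, now):
        """Return the cached value, fetching and replacing on a miss."""
        if key not in self.entries:
            if len(self.entries) >= self.capacity:
                # Evict the entry that is stale, or will become stale, soonest.
                victim = min(self.entries, key=lambda k: self.entries[k][1])
                del self.entries[victim]
            value, change_interval = self.fetch(key)
            self.entries[key] = (value, now + change_interval)
        return self.entries[key][0]


if __name__ == "__main__":
    # Made-up source: follower count and estimated time between changes.
    source = {"#X": (1007, 20), "#Y": (2000, 60), "#Z": (600, 8)}
    view = LocalView(capacity=2, fetch=source.__getitem__)
    view.get("#X", now=0)
    view.get("#Z", now=0)
    view.get("#Y", now=1)        # the view is full, so one entry is evicted
    print(sorted(view.entries))
```

Evicting the entry closest to going stale keeps the entries that are expected to stay usable longest; a recency-based rule such as LRU would ignore that information.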