2. Outline
• Introduction
• Terminology
• Problem definition
• Proposed solution
• Experimental results
• Conclusion
Insight Centre for Data Analytics Slide 2
3. Introduction: Query Processing On Linked Data
• Report changes to the local store (maintenance):
• sources pro-actively report changes or their existence (pushing), or
• the query processor discovers new sources and changes by crawling (pulling).
• Maintenance trade-off:
• fast maintenance leads to high quality but slow responses, and vice versa.
• Problem: maintenance according to a user-defined trade-off.
• Why is it important? It eliminates unnecessary maintenance and leads to faster
responses and better scalability.
[Figure: off-line materialization, known as replication in databases and caching on the Web. New sources are materialized off-line into a local store; the query processor answers queries from the local store, trading off scalability, availability, and performance.]
4. View Maintenance Categorization
Trade-off management vs. change-reporting mechanism (time/quality trade-off):

                      query level               replica level
                      quality      time        quality      time
                      constraint   constraint  constraint   constraint
  update stream       A            B           C            D
  no update stream    E            F           G            H
5. Problem Definition
• Problem E
• Optimizing maintenance to satisfy quality constraints with the lowest
response time for each query.
• Problem F
• Optimizing maintenance to satisfy time constraints with the highest
response quality for each query.
6. Terminology
• Quality requirements:
• Freshness = B / (A + B)
• Completeness = B / (B + C)
• Maintenance plan:
• Each set of views chosen for maintenance is called a maintenance plan.
• Given n views, the number of maintenance plans is 2^n.
• Each maintenance plan leads to a different response quality.
Example: views V1, V2, V3, V4 with freshness 20%, 90%, 10%, and 80% respectively.
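The two quality measures above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the response sets and tuple names are invented for the example.

```python
# Freshness and completeness per the definitions on this slide:
# A = stale tuples returned by the local store,
# B = tuples shared by the local response and the actual response,
# C = valid tuples the local store missed.

def freshness(local_response, actual_response):
    """B / (A + B): fraction of returned tuples that are still valid."""
    b = len(local_response & actual_response)
    a = len(local_response - actual_response)
    return b / (a + b) if (a + b) else 1.0

def completeness(local_response, actual_response):
    """B / (B + C): fraction of the valid tuples that were returned."""
    b = len(local_response & actual_response)
    c = len(actual_response - local_response)
    return b / (b + c) if (b + c) else 1.0

local = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}   # returned by the cache
actual = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}  # true current answer

print(freshness(local, actual))     # 2 shared of 3 returned -> 0.666...
print(completeness(local, actual))  # 2 shared of 3 valid    -> 0.666...
```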
7. Freshness Example
Plan 1: no maintenance.
  View 1 (60% fresh)   View 2 (40% fresh)   Join result (50% fresh)
  a1 b1 T              a1 c1 F              a1 b1 c1 F
  a2 b2 T              a1 c2 F              a1 b1 c2 F
  a3 b3 F              a1 c3 T              a1 b1 c3 T
  a4 b4 T              a2 c4 T              a2 b2 c4 T
  a5 b5 F              a6 c5 F

Plan 2: maintain view 1.
  View 1 (100% fresh)  View 2 (40% fresh)   Join result (still 50% fresh)
  a1 b1 T              a1 c1 F              a1 b1 c1 F
  a2 b2 T              a1 c2 F              a1 b1 c2 F
  a3 b3 T              a1 c3 T              a1 b1 c3 T
  a4 b4 T              a2 c4 T              a2 b2 c4 T
  a5 b5 T              a6 c5 F

Plan 3: maintain view 2.
  View 1 (60% fresh)   View 2 (100% fresh)  Join result (100% fresh)
  a1 b1 T              a1 c1 T              a1 b1 c1 T
  a2 b2 T              a1 c2 T              a1 b1 c2 T
  a3 b3 F              a1 c3 T              a1 b1 c3 T
  a4 b4 T              a2 c4 T              a2 b2 c4 T
  a5 b5 F              a6 c5 T
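The example above can be reproduced mechanically: a joined tuple is fresh only if both input tuples are fresh. This is an illustrative sketch using the slide's data, not the authors' code.

```python
# Each mapping is a list of (tuple, is_fresh) pairs; join on the shared
# first component and propagate freshness through the join.

def join_freshness(mapping1, mapping2):
    results = []
    for (a1, b), f1 in mapping1:
        for (a2, c), f2 in mapping2:
            if a1 == a2:                   # join on the shared variable
                results.append(f1 and f2)  # fresh iff both sides are fresh
    return sum(results) / len(results)

m1 = [(("a1", "b1"), True), (("a2", "b2"), True), (("a3", "b3"), False),
      (("a4", "b4"), True), (("a5", "b5"), False)]              # 60% fresh
m2 = [(("a1", "c1"), False), (("a1", "c2"), False), (("a1", "c3"), True),
      (("a2", "c4"), True), (("a6", "c5"), False)]              # 40% fresh

def maintain(m):
    """Maintaining a view makes all of its tuples fresh again."""
    return [(t, True) for t, _ in m]

print(join_freshness(m1, m2))            # plan 1: no maintenance  -> 0.5
print(join_freshness(maintain(m1), m2))  # plan 2: maintain view 1 -> 0.5
print(join_freshness(m1, maintain(m2)))  # plan 3: maintain view 2 -> 1.0
```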
8. Research questions
• What is the least costly maintenance plan that fulfills the response
quality requirements?
• What is the quality of the response without maintenance?
• What is the quality of the response of each maintenance plan?
9. Experiment
• We use the BSBM benchmark to create a dataset and a query set.
• We label triples with true/false to specify their freshness status.
• We summarize the cache to estimate the quality of a query response
without actually executing the query on the cache.
• To summarize the cache, we extended cardinality estimation techniques
for the freshness-estimation problem.
Example of labeled triples:
  Alice  Lives  Dublin     True
  Bob    Lives  Berlin     False
  Alice  Job    Teacher    True
  Bob    Job    Developer  False
10. Cardinality Estimation
• Capture the data distribution by splitting data into buckets
and only keep the bucket cardinality in the summary.
Example: the dataset below is summarized at two granularities; each bucket
stores its cardinality and, for freshness estimation, its number of fresh
triples.

Data (triple, freshness):
  Alice  Job    Teacher      True
  Alice  Lives  Dublin       True
  Alice  Job    PhD student  False
  Alice  Lives  Athlon       False
  Bob    Job    Manager      True
  Bob    Lives  Berlin       True
  Bob    Lives  Chicago      True
  Bob    Lives  Munich       False
  Bob    Lives  Belfast      False
  Bob    Lives  Limerick     False
  Bob    Job    CEO          False
  Bob    Job    Consultant   False

Coarse summary (per predicate):         total  fresh
  *      Job    *                       5      2
  *      Lives  *                       7      3

Granular summary (subject, predicate):  total  fresh
  Alice  Job    *                       2      1
  Bob    Job    *                       3      1
  Alice  Lives  *                       2      1
  Bob    Lives  *                       5      2

Queries:
  Q1: ?a Job ?b
  Q2: (?a Job ?b) ^ (?a Lives ?c)

Cardinality estimates:
        Coarse summary          Granular summary
        Estimated  Actual       Estimated  Actual
  Q1    5          5            5          5
  Q2    35         19           19         19

Freshness estimates:
        Coarse summary          Granular summary
        Estimated  Actual       Estimated  Actual
  Q1    2/5        2/5          2/5        2/5
  Q2    6/35       3/19         3/19       3/19
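The granular summary above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (bucket layout and function names are mine), not the authors' implementation.

```python
# Extend a cardinality summary for freshness estimation: each
# (subject, predicate) bucket stores (total, fresh) instead of just a count,
# and join estimates multiply bucket values per join key.

from collections import defaultdict

triples = [  # (subject, predicate, object, is_fresh), from the slide
    ("Alice", "Job", "Teacher", True),  ("Alice", "Lives", "Dublin", True),
    ("Alice", "Job", "PhD student", False), ("Alice", "Lives", "Athlon", False),
    ("Bob", "Job", "Manager", True),    ("Bob", "Lives", "Berlin", True),
    ("Bob", "Lives", "Chicago", True),  ("Bob", "Lives", "Munich", False),
    ("Bob", "Lives", "Belfast", False), ("Bob", "Lives", "Limerick", False),
    ("Bob", "Job", "CEO", False),       ("Bob", "Job", "Consultant", False),
]

# One (total, fresh) bucket per (subject, predicate).
summary = defaultdict(lambda: [0, 0])
for s, p, _, fresh in triples:
    summary[(s, p)][0] += 1
    summary[(s, p)][1] += fresh

def estimate_join(pred1, pred2):
    """Estimated (cardinality, fresh count) for (?a pred1 ?b) ^ (?a pred2 ?c)."""
    card = fresh = 0
    for s in {s for s, p in summary if p == pred1}:
        n1, f1 = summary[(s, pred1)]
        n2, f2 = summary.get((s, pred2), (0, 0))
        card += n1 * n2
        fresh += f1 * f2   # a joined tuple is fresh iff both inputs are
    return card, fresh

print(estimate_join("Job", "Lives"))  # (19, 3), matching the actual values
```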
11. Cardinality Estimation Approaches
• System R assumptions for cardinality estimation:
• data is uniformly distributed per attribute;
• predicates are independent (whether in the same table or across
different tables).
• Predicate-multiplication approaches make both assumptions.
• Histograms capture the dependencies among predicates for more
accurate estimation.
12. Measure accuracy of the estimation approach
Measure the difference between the actual and estimated freshness over a
query set using the root-mean-square deviation, where n is the number of
queries.
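A minimal sketch of this error measure, following the speaker notes (root of the averaged squared differences over n queries). The sample values below are illustrative only.

```python
# Root-mean-square deviation between estimated and actual freshness.

import math

def rmsd(estimated, actual):
    n = len(estimated)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimated, actual)) / n)

est = [0.40, 6 / 35, 0.50]   # estimated freshness per query (illustrative)
act = [0.40, 3 / 19, 0.50]   # actual freshness per query
print(rmsd(est, act))        # small error, driven by the second query
```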
14. Estimation Error 1
Data (triple, freshness):
  a  Job       teacher    T
  a  Job       professor  F
  a  Job       PhD        F
  b  Job       developer  T
  a  Lives in  Dublin     T
  b  Lives in  Galway     F
  b  Lives in  Cork       T
  b  Lives in  Limerick   T

Summary (per predicate):
  ?s Job ?o       50% fresh
  ?s Lives in ?o  75% fresh

Query: <?s, Job, ?o1> join <?s, Lives in, ?o2>
Join result:
  a  Teacher    Dublin    T
  a  Professor  Dublin    F
  a  PhD        Dublin    F
  b  Developer  Galway    F
  b  Developer  Cork      T
  b  Developer  Limerick  T

Estimated freshness: 50% x 75% = 37.5%; actual freshness: 50%.

Reason: dependencies.
Solution:
• A more granular index on the join dimension (subject) and the bounded
dimension (predicate).
• Histogram- and table-level synopses can capture these dependencies and
reduce the error accordingly.
Experiment: we did not observe this error in our experiment because the
dataset contained no such dependencies.
15. Estimation Error 2
20 October 2014
Summary (per predicate):
  ?s Job ?o1       50% fresh
  ?s Lives in ?o2  75% fresh

Data (triple, freshness):
  a  Job       teacher    T
  a  Job       professor  F
  a  Job       PhD        F
  b  Job       developer  T
  a  Lives in  Dublin     T
  b  Lives in  Galway     F
  b  Lives in  Cork       T
  b  Lives in  Limerick   T

Query: <?s, Job, Developer> join <?s, Lives in, ?o2>
Join result:
  b  Developer  Galway    F
  b  Developer  Cork      T
  b  Developer  Limerick  T

Estimated freshness: 50% x 75% = 37.5%; actual freshness: 66%.

Reason: bounded object.
Solution:
• A more granular index on the join dimension (subject) and the bounded
dimensions (predicate and object) => this requires indexing the whole
dataset, which is not efficient.
Experiment: we did not observe any improvement on this error by using a
histogram.
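The arithmetic behind this error is simple to state explicitly. This is an illustrative sketch of the slide's numbers: a per-predicate summary yields the same estimate whether or not the object is bound, while the actual freshness of the restricted join differs.

```python
# Bounded-object estimation error: the summary only has per-predicate
# freshness, so binding Job = Developer does not change the estimate.

job_fresh = 2 / 4        # ?s Job ?o      -> 50% fresh in the summary
lives_fresh = 3 / 4      # ?s Lives in ?o -> 75% fresh in the summary

estimated = job_fresh * lives_fresh   # 0.375, with or without the binding
actual = 2 / 3                        # join restricted to Job = Developer

print(estimated, actual)  # 0.375 vs ~0.667: the summary cannot see the binding
```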
16. Concern 1 on problem definition
Example: two views (Job and Lives in) over users Bob and Alice, shown in
four successive states as more entries expire, with the resulting join
freshness:

  State  View 1 (Job)  View 2 (Lives in)  Join result
  1      100%          100%               100%
  2      66%           75%                50%
  3      33%           50%                16%
  4      0%            25%                0%

View 1 holds (Bob, Job, Teacher), (Bob, Job, PhD), and (Alice, Job,
Professor); view 2 holds (Bob, Lives in, Limerick), (Bob, Lives in,
Galway), (Alice, Lives in, Dublin), and (Alice, Lives in, Cork); the join
has six result tuples.
17. Concern 2 on the suggested solution
• We need to build one summary per maintenance plan, because the summary
of one maintenance plan cannot be used to estimate the freshness of a
query executed on another maintenance plan.
• This is very inefficient given the space requirements and the cost of
maintaining these summaries.
18. Conclusion
• We defined quality constraints based on freshness and completeness.
• We summarized a snapshot of a dataset to estimate the freshness of various
queries, extending indexing and histogram techniques for our
freshness-estimation problem.
• We need to build an individual summary for each maintenance plan, since a
summary for one maintenance plan cannot be used to estimate the quality of a
query executed on other maintenance plans.
• Our experiment was not affected by estimation errors caused by
dependencies, because the dataset contained no such dependencies. The next
step is to design a more realistic dataset and again compare the results of
the histogram and predicate-multiplication approaches.
• Summarization techniques are designed for a very static environment, and
any change to the underlying data requires rebuilding the summary from
scratch. So does it really make sense to extend cardinality estimation to
freshness estimation?
19. Problem Definition
• Problem E
• Optimizing maintenance to satisfy quality constraints with the lowest
response time for each query.
• Problem F
• Optimizing maintenance to satisfy time constraints with the highest
response quality for each query.
20. Problem description without join
[Figure: a replica between the sources and the user.]
• The user queries the replica with time constraints.
• The replica should maintain only the subset of results that is most
likely to be expired.
22. Use Case: Twitter Stream
• Stream data: number of mentions in the last Twitter window.
• Background data: user follower count, kept in a replica.
• Rising-stars query: find users who have been mentioned more than 100
times in the last 10 minutes and have more than 1000 followers,
with a constraint on the execution time.
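The rising-stars query can be sketched as a windowed stream join against the replica. This is a hypothetical illustration: the function names, data shapes, and thresholds default from the query text, not from the authors' system.

```python
# Join a window of mention counts from the stream with follower counts
# kept in a (possibly stale) replica of the background data.

from collections import Counter

def rising_stars(window_mentions, follower_replica,
                 min_mentions=100, min_followers=1000):
    """Users mentioned > min_mentions in the window with > min_followers."""
    counts = Counter(window_mentions)        # mentions in the last window
    return [user for user, n in counts.items()
            if n > min_mentions
            and follower_replica.get(user, 0) > min_followers]

window = ["alice"] * 150 + ["bob"] * 90      # stream of mentioned users
replica = {"alice": 5000, "bob": 20000}      # cached follower counts

print(rising_stars(window, replica))  # ['alice']: bob misses the mention bar
```

Under a time constraint, the interesting part is not this filter but which replica entries to refresh before evaluating it, which the join operators on the next slide address.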
23. Continuous join operator with one replica
• We implemented a set of continuous join operators:
• DWJoin: uses the static replica and never changes it (the quality of
the response degrades over time).
• Baseline join: uses an LRU policy to choose which entries from the set
of matches to update (the least recently updated entry does not
necessarily require updating).
• Oracle join: fetches data directly from the source.
• Smart join: computes statistics on the change rate and chooses the
entries that are most likely to be expired for fetching.
• Mixed baseline-smart (possible extension).
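The smart-join policy above can be sketched as follows. This is a hypothetical illustration: the exponential staleness model, the names, and the budgeted selection are assumptions for the sketch, not the implemented operator.

```python
# Estimate a per-entry staleness probability from a learned change rate
# and, under a refresh budget, update the likeliest-expired entries first.

import math

def staleness_probability(change_rate, age):
    """P(entry changed since last fetch), assuming Poisson-style changes."""
    return 1.0 - math.exp(-change_rate * age)

def choose_refresh(entries, now, budget):
    """entries: {key: (change_rate, last_fetch_time)} -> keys to refresh."""
    scored = sorted(entries,
                    key=lambda k: staleness_probability(
                        entries[k][0], now - entries[k][1]),
                    reverse=True)
    return scored[:budget]    # refresh only the likeliest-expired entries

entries = {
    "bob":   (0.50, 0.0),    # changes often, fetched long ago
    "alice": (0.01, 9.0),    # changes rarely
    "carol": (0.50, 9.5),    # changes often but just fetched
}
print(choose_refresh(entries, now=10.0, budget=1))  # ['bob']
```

The baseline (LRU) policy would pick "bob" here too, but only because it happens to be oldest; the change-rate score is what lets the smart policy skip old-but-stable entries like "alice".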
25. Possible extensions
• The problem becomes more complicated when the query is a join between
replicas: updating which combination of entries yields the highest
improvement in the join result?
[Figure: a join between two replicas.]
26. Future works
• Use a better model for learning the change rate in the smart policy.
• We believe the smart policy will perform better when the change rate is
more predictable.
• Investigate the setting where there are joins on the
background-knowledge side, to know which combination of stale entries
would contribute most to the result correctness if updated.
27. Thanks a lot for your attention !
Editor's Notes
Hi All and thanks for coming to my presentation.
In this work, I'm going to talk about how to optimize SPARQL query processing on static and dynamic data, based on the quality requirements of the query response: response time and freshness.
The outline of the talk is as follows:
First we will have a brief introduction on query processing and proposed approaches to make it faster
Second, we introduce the terminology of our work.
Third, we illustrate the targeted problem with an example.
Afterwards we define the problem.
The proposed solution and experimental results will then be presented.
At the end we will conclude the talk with some directions for future works.
To process queries on Linked Data, the naïve approach is that the query processor gets the query, fetches the relevant data from the original sources, combines it, and provides the response to the user. However, fetching data from the original sources takes a long time, and if the original sources become temporarily unavailable, the query processor cannot provide the full response.
Enter ------------
To get rid of the availability and latency problems, researchers came up with the idea of off-line materialization, which is called replication in the database context and caching in the Web context. They proposed to materialize as much data as possible in the local store and to answer queries using only the local store. This provides very fast response times and does not suffer from availability issues.
Enter-----------
However, if the original sources are updated or new sources become available, the query processor cannot reflect these changes in its responses, and the provided responses will suffer from low quality.
Enter------------
To address this issue, maintenance mechanisms help the query processor compensate for the quality issues.
Enter------
However, highly frequent maintenance consumes all computational resources, and queries have to wait for them. Thus, a response with high quality can only be achieved with a long response time, and vice versa.
Enter ------
The problem we are targeting here is to do on-demand maintenance based on the quality requirements of the query response.
Enter ------
The importance of this problem is that it eliminates unnecessary maintenance and leads to faster responses and better scalability.
Here we define the terminology.
<<<point to the first figure>>>
To specify the quality requirements of the response, suppose the shaded circle is the response provided by the local store and the transparent circle is the actual response. These two responses share a set of tuples, represented by "B". "A" represents out-of-date responses provided by the query processor. "C" represents valid responses that have been missed by the query processor due to the maintenance delay. If "A" is empty, the query processor has provided a valid but possibly incomplete response. If "C" is empty, the query processor has provided a complete but not necessarily fully fresh response. Therefore, freshness is defined as B divided by A plus B, and completeness as B divided by B plus C.
----------------------
In each maintenance round, the query processor decides to maintain a set of views, which we call a maintenance plan. Given n views, there obviously exist 2 to the power of n maintenance plans. As we will show in the next slide, each maintenance plan leads to a different response quality. Note that in this work we have not addressed response completeness and leave it for future work; therefore, we deal only with response freshness as the response quality requirement.
In this example we label fresh tuples with T and stale tuples with F. Suppose we have a join between two mappings. We want to show that different maintenance plans lead to different response freshness.
<<<point to first row>>> One maintenance plan is to maintain nothing, so we measure the freshness of the response with the current data in the local store. As we can see in the first row, mapping 1 with 60% freshness joins with mapping 2 with 40% freshness, and the result is 50% fresh.
<<<point to second row>>> In the second row we show the next maintenance plan, which is to maintain mapping 1; however, the join result is still only 50% fresh.
<<<point to third row>>> In the third row we show another maintenance plan, which is to maintain mapping 2; this time the join result becomes 100% fresh.
Therefore, different maintenance plans lead to different response quality.
The problem we are targeting is to find the least costly maintenance plan that fulfils the response quality requirements. This boils down to two sub-problems: first, estimating the quality of the response provided by the present cache without maintenance; second, estimating the quality of the response for the other maintenance plans.
In this work we only targeted the first sub-problem.
To estimate the quality of the response provided by the present cache, we use the BSBM benchmark generator to generate a dataset and a query set. We labeled triples with true/false to specify their freshness status.
We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.
To summarize the cache, we extended cardinality estimation techniques for the freshness-estimation problem.
In the next slide we present how to extend cardinality estimation methods for freshness estimation.
Cardinality estimation approaches try to capture the data distribution by splitting the data into buckets and keeping the bucket cardinalities in the summary. In our example, we summarize the whole dataset into an index that stores the cardinality of each predicate. To test the summary, we run two queries, Q1 and Q2. The summary provides an accurate estimate for Q1, but to estimate the cardinality of Q2 it multiplies the cardinalities of its triple patterns, which gives 35, while the actual response cardinality is 19. As we can see, this summary fails to provide a good estimate.
Enter------
However, a more granular summary can provide more accurate estimates. As we can see, the second index provides accurate cardinality estimates for both Q1 and Q2.
Enter------
Now, to extend cardinality estimation methods to freshness estimation, we extend the summaries with one more column that stores the number of fresh responses in addition to the total number of entries in each category.
Enter----------
As we can see, the first index provides an accurate freshness estimate for Q1 but fails to provide a good estimate for Q2.
Enter-----------
However, by storing more granular information in the second index, we can provide accurate freshness estimates for both Q1 and Q2.
We investigated two types of cardinality estimation techniques for our freshness-estimation problem: index-based and histogram-based. A more accurate family of cardinality estimation techniques exists that tries to capture dependencies.
To summarize the underlying data for cardinality estimation, the original System R made two simplifying assumptions: first, that data is uniformly distributed per attribute; second, that join predicates are independent. Indexing approaches make both assumptions to simplify the summarization and estimation process.
However, such assumptions barely hold in real datasets. Thus, researchers came up with histograms to address the uniform-data-distribution assumption.
Using probabilistic graphical models, we can build summaries that address the join-predicate-independence assumption.
In this paper we extended the indexing and histogram cardinality estimation methods to freshness estimation, following the procedure explained in the previous slide.
In order to measure the accuracy of the estimation approach, we used the root-mean-square deviation. That is, we sum the squared differences between the estimated and actual freshness over all queries and take the square root of the average to get a single figure for the error of the method.
The results showed that the indexing approach can achieve both a very low estimation error and a low storage cost. The histogram can provide a lower estimation error only at a huge summary size.
We believe the majority of the estimation error in our queries is caused by join dependencies, which histograms do not address. We hope to further reduce the estimation error using probabilistic graphical models in future work.
We have materialized data and a system that fetches a subset of the materialized data under time constraints, and we need to optimize the view maintenance.
A continuous join query is the perfect use case for our scenario.