Once upon a time in Datatown
(Or: Query-driven Data Completeness Management)
Simon Razniewski
Supervisor: Werner Nutt
Once upon a time in Datatown …
One database for all schools
Monitoring school developments
• The central school administration decided last year that
Ruby on Rails shall now be taught instead of HTML in
Computer Science classes.
• School district administrator Alice wants to monitor the
impact of this decision
Query to the DB: How many pupils have grade A in Computer Science?
Query result:
2012: 8563
2013: 8619 (+0.7%)
2014: 3202 (-63%)
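The percentages shown next to the counts can be re-derived from the raw numbers. A minimal sketch in Python; the helper name `relative_change` is our own and not part of the talk:

```python
# Counts copied from the query result above.
counts = {2012: 8563, 2013: 8619, 2014: 3202}


def relative_change(previous: int, current: int) -> float:
    """Year-over-year change in percent."""
    return (current - previous) / previous * 100


years = sorted(counts)
# Pair each year with its predecessor and compute the change.
changes = {
    year: relative_change(counts[prev], counts[year])
    for prev, year in zip(years, years[1:])
}

for year, pct in changes.items():
    print(f"{year}: {counts[year]} ({pct:+.1f}%)")
```

Rounded to whole percent, the 2014 change is the -63% that alarms Alice.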
• Was teaching Ruby instead of HTML a terrible idea??
• Alice orders her assistant Bob to investigate
Bob calls school A … “No problem, everything as usual”
Bob calls school B … “No, the CS grades in our school
are as usual”
Bob calls school C … “What do you want? Everything is fine
here”
…
• Bob concludes that something must be wrong with the data
“Something must be wrong with the data”
• Bob calls the DB admin Tom
• Tom: “Dude, of course these numbers are nonsense,
most of the data wasn’t loaded yet!”
“Most of the data wasn’t loaded yet”
• Alice is relieved to hear that the change in teaching
probably did not wreck the grades
• But how can such misunderstandings be prevented
in the future?
How can such misunderstandings be prevented
in the future?
• Alice gives Bob and Tom a research question: find a technique for analyzing whether
query answers over partially complete databases are complete
• Tom finds cryptic old papers in the archive that seem related
• Tom: “Motro describes a similar problem to ours:
When do queries return complete answers over incomplete databases?”
• Bob: “Levy introduces a formalism to describe
which parts of a database table are complete”
• Tom: “But those papers do not contain algorithms”
Integrity = Validity + Completeness, Amihai Motro, 1989
Obtaining complete answers from incomplete databases, Alon Y. Levy, 1996
“But those papers do not contain algorithms”
• Bob: “Maybe we can reduce this to conjunctive-query
style containment?”
• Tom: “That works, but note that we also need to find
procedures for asymmetric containment problems”
Bob and Tom sit down and write these procedures
Result 1: Development of decision procedures for
completeness reasoning and complexity analysis
[VLDB’11]
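To give a feel for the kind of check Bob and Tom are after, here is a deliberately simplified sketch in Python. It is not the decision procedure from the VLDB’11 paper: it assumes single-table queries with equality conditions only, and it tests whether one completeness statement alone covers the query. The table name `grades`, its columns, and the names `Selection`, `covers` and `is_complete` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selection:
    """Rows of `table` whose columns equal the given values (hypothetical model)."""
    table: str
    conditions: frozenset  # set of (column, value) pairs

    @staticmethod
    def of(table, **conditions):
        return Selection(table, frozenset(conditions.items()))


def covers(statement: Selection, query: Selection) -> bool:
    """A statement guarantees all rows a query needs if the query is at least
    as restrictive: same table, and every condition of the statement also
    appears in the query."""
    return statement.table == query.table and statement.conditions <= query.conditions


def is_complete(query: Selection, statements) -> bool:
    """Sufficient check only: True if a single statement covers the query.
    False means "not guaranteed", not "proven incomplete"."""
    return any(covers(s, query) for s in statements)


# Hypothetical completeness statements for Datatown's grades table.
statements = [
    Selection.of("grades", year=2013),                     # all of 2013 is loaded
    Selection.of("grades", year=2014, school="School A"),  # 2014: only School A so far
]

q_2013 = Selection.of("grades", subject="CS", grade="A", year=2013)
q_2014 = Selection.of("grades", subject="CS", grade="A", year=2014)
q_2014_a = Selection.of("grades", subject="CS", grade="A", year=2014, school="School A")

print(is_complete(q_2013, statements))    # True
print(is_complete(q_2014, statements))    # False
print(is_complete(q_2014_a, statements))  # True
```

With such statements in place, Alice's 2014 query would have been flagged as not guaranteed complete before anyone blamed Ruby on Rails.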
Does this also work for null values?
• When first presenting a demo system to Alice, the demo system crashes:
java.lang.NullPointerException
("Grade in CS is null")
• Tom: “Understandable, because it is not clear what a null means: either Computer
Science was an ungraded subject, or the grade is missing.”
• Alice: “Fix it!”
…Tom goes to work
Result 2: Extension of completeness reasoning to databases with null values, complexity
analysis, and introduction of a technique to avoid the ambiguity of null values
[CIKM’12]
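The two readings of a null that Tom mentions can be kept apart by recording them explicitly. The sketch below is only an illustration of that idea in Python and should not be read as the technique from the CIKM’12 paper; `GradeStatus`, `GradeRecord` and `count_grade_a` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GradeStatus(Enum):
    GRADED = "graded"
    NOT_APPLICABLE = "not applicable"  # the subject is ungraded for this pupil
    MISSING = "missing"                # a grade exists but is not in the database


@dataclass(frozen=True)
class GradeRecord:
    pupil: str
    status: GradeStatus
    grade: Optional[str] = None  # set only when status is GRADED


def count_grade_a(records):
    """Return (number of A grades, whether that number can be trusted)."""
    count = sum(1 for r in records if r.status is GradeStatus.GRADED and r.grade == "A")
    # A missing grade might be an A, so any MISSING record makes the count unreliable.
    trustworthy = all(r.status is not GradeStatus.MISSING for r in records)
    return count, trustworthy


records = [
    GradeRecord("Ann", GradeStatus.GRADED, "A"),
    GradeRecord("Ben", GradeStatus.NOT_APPLICABLE),
    GradeRecord("Cem", GradeStatus.MISSING),
]
print(count_grade_a(records))      # (1, False)
print(count_grade_a(records[:2]))  # (1, True)
```

An ungraded subject does not affect the count, whereas a missing grade does, which is exactly the distinction a bare null cannot express.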
Late evening, Bar Nadamas
• Alice greatly impresses her colleague Frank,
head of the statistics office of Datatown, with
her new completeness tool/toy
• New EU guidelines on open government require Frank’s office to publish their data
in RDF
– Frank: “Do you think this tool could be adapted to also handle RDF data?”
– Alice: “What’s the difference?”
– Frank: “Well, there’s the OPT-construct, the RDFS closure, and also, the completeness
statements should be expressed in RDF themselves”
– Alice: “Let me ask Tom…”
Result 3: Formalisms and algorithms for assessing the completeness of
SPARQL queries over RDF data
[ISWC’13]
Completeness of Geographical Data
• When bored at work, Bob likes to draw random objects
into the free open mapping project OpenStreetMap
• After being blocked for the 17th time, he finally decides to
make a useful contribution
• Bob: Couldn’t we use the completeness statements on the
OSM Wiki to also annotate spatial queries with
completeness information?
(a few games of Minesweeper later)
• Bob: But things are different there: query completeness is
not a binary issue. Instead, queries are complete in
certain areas and incomplete in others. Also, we
can now divide the objects in the database into certain,
possible and impossible answers
Result 4: Model, algorithms and experimental evaluation of techniques for calculating the
completeness area of a query over spatial data, and for classifying answers into certain
answers, possible answers and impossible answers
[BNCOD’13, SIGSPATIAL’14 (submitted)]
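The idea of a completeness area can be illustrated with a small Python sketch. It is a simplification under our own assumptions, not the model from the papers: completeness statements have the hypothetical form "all objects of type X inside rectangle R are mapped", regions are axis-aligned rectangles, and the classification of answers into certain, possible and impossible is left out.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Overlap of two rectangles, or None if they do not overlap."""
    min_x, min_y = max(a.min_x, b.min_x), max(a.min_y, b.min_y)
    max_x, max_y = min(a.max_x, b.max_x), min(a.max_y, b.max_y)
    if min_x >= max_x or min_y >= max_y:
        return None
    return Rect(min_x, min_y, max_x, max_y)


def completeness_area(feature, query_region, statements):
    """Parts of the query region in which objects of type `feature`
    are stated to be completely mapped (parts may overlap)."""
    parts = []
    for stated_feature, region in statements:
        if stated_feature == feature:
            part = intersection(query_region, region)
            if part is not None:
                parts.append(part)
    return parts


# Hypothetical statements, as they might be collected from a wiki page.
statements = [
    ("school", Rect(0, 0, 10, 10)),
    ("pharmacy", Rect(5, 5, 20, 20)),
    ("school", Rect(30, 30, 40, 40)),
]
query_region = Rect(5, 0, 15, 10)

print(completeness_area("school", query_region, statements))
```

Here a query for schools in `query_region` is complete only in the left half of that region; for the right half no statement is available.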
Automation
• The techniques work, but minions are often
too lazy to give completeness statements
• Alice: Can’t we automate this by looking
at the work processes?
• Tom: In principle yes, but we need a formal description of processes that
manipulate data in the database and in the real world
• Alice: Transition systems are the most general formalism, let’s extend
those
• Tom: Ok, and I think we can again use containment-style reasoning
Result 5: Introduction of quality-aware transition systems (QATS) and
development of algorithms for checking query completeness
over QATS
[BPM’13]
Open issues
• Alice: If a query is not complete, could you give me at
least numerical estimates?
• Bob: How can we utilize the state of the database to
draw additional conclusions?
• Tom: I would like to study Mathematics and solve
the problem of Query Determinacy as raised
by Gauß, Segoufin, Fermat and Vianu
The end.
Main publications
• 1: Completeness of Queries over Incomplete Databases, Simon Razniewski and Werner Nutt, Int. Conference on
Very Large Data Bases (VLDB), 2011
– Acceptance rate: 18.1%
• 2: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, Conference on Information
and Knowledge Management (CIKM), 2012
– Acceptance rate: 13.4%
• 3: Completeness Statements about RDF Data Sources and Their Use for Query Answering, Fariz Darari, Werner
Nutt, Giuseppe Pirro and Simon Razniewski, Int. Semantic Web Conference (ISWC), 2013
– Acceptance rate: 21.5%
• 4a: Assessing the Completeness of Geographical Data, Simon Razniewski and Werner Nutt, British National
Conference on Databases (BNCOD), 2013 (Short Paper)
– Acceptance rate: 47.6%
• 4b: Adding Completeness Information to Query Answers over Spatial Databases, Simon Razniewski and Werner
Nutt, SIGSPATIAL 2014
– Submitted
• 5: Verification of Query Completeness over Processes, Simon Razniewski, Werner Nutt and Marco Montali,
International Conference on Business Process Management (BPM), 2013
– Acceptance rate: 14.4%