Once upon a time in Datatown ...


Slides from my PhD defense admission exam, Bozen, 9.7.2014

Published in: Science, Education, Technology

  1. Once upon a time in Datatown (Or: Query-driven Data Completeness Management). Simon Razniewski. Supervisor: Werner Nutt
  2. Once upon a time in Datatown … One database for all schools
  3. Monitoring school developments
     • The central school administration decided last year that Ruby on Rails, instead of HTML, shall now be taught in Computer Science classes.
     • School district administrator Alice wants to monitor the impact of this decision.
     • Query against the DB: How many pupils have grade A in Computer Science?
     • Query result: 2012: 8563; 2013: 8619 (+0.7%); 2014: 3202 (-63%)
  4. 2014: 3202 (-63%)
     • Was teaching Ruby instead of HTML a terrible idea?
     • Alice orders her assistant Bob to investigate.
     • Bob calls school A … “No problem, everything as usual”
     • Bob calls school B … “No, the CS grades in our school are as usual”
     • Bob calls school C … “What do you want? Everything is fine here” …
     • Bob concludes that something must be wrong with the data.
  5. “Something must be wrong with the data”
     • Bob calls the DB admin Tom.
     • Tom: “Dude, of course these numbers are nonsense, most of the data wasn’t loaded yet!”
  6. “Most of the data wasn’t loaded yet”
     • Alice is relieved to hear that the change in teaching probably did not wreck the grades.
     • But how can such misunderstandings be prevented in the future?
  7. How can such misunderstandings be prevented in the future?
     • Alice gives Bob and Tom a research question: find a technique for analyzing whether query answers over partially complete databases are complete.
     • Tom finds cryptic old papers in the archive that seem related.
     • Tom: “Motro describes a problem similar to ours: When do queries return complete answers over incomplete databases?”
     • Bob: “Levy introduces a formalism to describe which parts of a database table are complete”
     • Tom: “But those papers do not contain algorithms”
     • Papers shown: “Integrity = Validity + Completeness”, Amihai Motro, 1989; “Obtaining complete answers from incomplete databases”, Alon Y. Levy, 1996
  8. “But those papers do not contain algorithms”
     • Bob: “Maybe we can reduce this to conjunctive-query style containment?”
     • Tom: “That works, but note that we also need to find procedures for asymmetric containment problems”
     • Bob and Tom sit down and write these procedures.
     • Result 1: Development of decision procedures for completeness reasoning and complexity analysis [VLDB’11]
  9. Does this also work for null values?
     • When a demo system is first presented to Alice, it crashes: java.lang.NullPointerException ("Grade in CS is null")
     • Tom: “Understandable, because it is not clear what a null means: whether computer science was an ungraded subject, or whether the grade is missing”
     • Alice: “Fix it!” … Tom goes to work.
     • Result 2: Extension of completeness reasoning to databases with null values, complexity analysis, and introduction of a technique to avoid the ambiguity of null values [CIKM’12]
  10. Late evening, Bar Nadamas
     • Alice greatly impresses her colleague Frank, head of the statistics office of Datatown, with her new completeness tool/toy.
     • New EU guidelines on open government require Frank’s office to publish their data in RDF.
     • Frank: “Do you think this tool could be adapted to also handle RDF data?”
     • Alice: “What’s the difference?”
     • Frank: “Well, there’s the OPT-construct, the RDFS closure, and also, the completeness statements should be expressed in RDF themselves”
     • Alice: “Let me ask Tom…”
     • Result 3: Formalisms and algorithms for assessing the completeness of SPARQL queries over RDF data [ISWC’13]
  11. Completeness of Geographical Data
     • When bored at work, Bob likes to draw random objects into the free, open mapping project OpenStreetMap.
     • After getting blocked for the 17th time, he finally decides to make a useful contribution.
     • Bob: “Couldn’t we use the completeness statements on the OSM Wiki to also annotate spatial queries with completeness information?” (a few games of Minesweeper later)
     • Bob: “But things are different there: query completeness is not a binary issue. Instead, queries are complete in certain areas while in others they are not. Also, we can now divide the objects in the database into certain, possible and impossible answers.”
     • Result 4: Model, algorithms and experimental evaluation of techniques for calculating the completeness area of a query over spatial data, and for classifying answers into certain answers, possible answers and impossible answers [BNCOD’13, SIGSPATIAL’14 (submitted)]
  12. Automation
     • The techniques work, but minions are often too lazy to give completeness statements.
     • Alice: “Can’t we automate this by looking at the work processes?”
     • Tom: “In principle yes, but we need a formal description of processes that manipulate data in the database and in the real world”
     • Alice: “Transition systems are the most general formalism, let’s extend those”
     • Tom: “OK, and I think we can again use containment-style reasoning”
     • Result 5: Introduction of quality-aware transition systems (QATS) and development of algorithms for checking query completeness over QATS [BPM’13]
  13. Open issues
     • Alice: “If a query is not complete, could you give me at least numerical estimates?”
     • Bob: “How can we utilize the state of the database to draw additional conclusions?”
     • Tom: “I would like to study Mathematics and solve the problem of Query Determinacy as raised by Gauß, Segoufin, Fermat and Vianu”
  14. The end.
  15. Main publications
     • 1: Completeness of Queries over Incomplete Databases, Simon Razniewski and Werner Nutt, Int. Conference on Very Large Databases (VLDB), 2011. Acceptance rate: 18.1%
     • 2: Completeness of Queries over SQL Databases, Werner Nutt and Simon Razniewski, Conference on Information and Knowledge Management (CIKM), 2012. Acceptance rate: 13.4%
     • 3: Completeness Statements about RDF Data Sources and Their Use for Query Answering, Fariz Darari, Werner Nutt, Giuseppe Pirro and Simon Razniewski, Int. Semantic Web Conference (ISWC), 2013. Acceptance rate: 21.5%
     • 4a: Assessing the Completeness of Geographical Data, Simon Razniewski and Werner Nutt, British National Conference on Databases (BNCOD), 2013 (Short Paper). Acceptance rate: 47.6%
     • 4b: Adding Completeness Information to Query Answers over Spatial Databases, Simon Razniewski and Werner Nutt, SIGSPATIAL 2014. Submitted
     • 5: Verification of Query Completeness over Processes, Simon Razniewski, Werner Nutt and Marco Montali, International Conference on Business Process Management (BPM), 2013. Acceptance rate: 14.4%
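
The idea behind slides 7 and 8 (statements about which parts of a table are complete, and a containment-style check of whether a query stays within those parts) can be illustrated with a deliberately tiny sketch. It is not the decision procedure from the VLDB’11 paper: it covers only single-table queries with equality conditions on non-null constants, and it is a sufficient check only. The table name `grades`, the attribute names, and all class and function names are hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class CompletenessStatement:
    """Hypothetical: asserts that `table` holds ALL real-world rows
    that satisfy every attribute = constant pair in `condition`."""
    table: str
    condition: dict


@dataclass
class SelectionQuery:
    """Hypothetical: a query selecting rows of one table by equality conditions."""
    table: str
    condition: dict


def is_guaranteed_complete(query, statements):
    """Sufficient check: the query is guaranteed complete if some statement
    on the same table uses only conditions that the query also imposes.
    Then every row the query needs lies inside the part declared complete."""
    return any(
        s.table == query.table
        and all(query.condition.get(attr) == value
                for attr, value in s.condition.items())
        for s in statements
    )


# Assumed situation from the story: grades are fully loaded for 2012 and 2013 only.
statements = [
    CompletenessStatement("grades", {"year": 2012}),
    CompletenessStatement("grades", {"year": 2013}),
]
q2013 = SelectionQuery("grades", {"year": 2013, "subject": "CS", "grade": "A"})
q2014 = SelectionQuery("grades", {"year": 2014, "subject": "CS", "grade": "A"})

print(is_guaranteed_complete(q2013, statements))  # True
print(is_guaranteed_complete(q2014, statements))  # False
```

With such a check in place, Alice's 2014 count would have been flagged as possibly incomplete rather than read as a 63% drop. Joins, null values and comparisons beyond equality need the fuller containment reasoning that the talk refers to.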