Successfully reported this slideshow.
Your SlideShare is downloading. ×

Incomplete And Missing Data In Geoscience Databases

Ad

Incomplete and missing data in geoscience databases Towards the OWA relational model ? Stephen Henley Presented at the eSI...

Ad

Not just geoscience <ul><li>The title says this is about geoscience – but the conclusions are much more widely applicable ...

Ad

Geoscience data <ul><li>Very commonly may be </li></ul><ul><ul><li>Imprecise </li></ul></ul><ul><ul><li>Incomplete </li></...

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Ad

Upcoming SlideShare
Climate Change Geoscience
Climate Change Geoscience
Loading in …3
×

Check these out next

1 of 36 Ad
1 of 36 Ad

More Related Content

Incomplete And Missing Data In Geoscience Databases

  1. 1. Incomplete and missing data in geoscience databases Towards the OWA relational model ? Stephen Henley Presented at the eSI workshop The Closed World of Databases Meets the Open World of the Semantic Web, Edinburgh 12-13 Oct 2006 Resources Computing International Ltd
  2. 2. Not just geoscience <ul><li>The title says this is about geoscience – but the conclusions are much more widely applicable </li></ul>
  3. 3. Geoscience data <ul><li>Very commonly may be </li></ul><ul><ul><li>Imprecise </li></ul></ul><ul><ul><li>Incomplete </li></ul></ul><ul><ul><li>Missing </li></ul></ul>
  4. 4. Typical imprecise data Sample SiO2 % Cu ppm #101 53.5 128 #102 49.2 185 #103 66.3 163
  5. 5. Typical imprecise data Sample SiO2 % Cu ppm #101 53.5 128 #102 49.2 185 #103 66.3 163
  6. 6. What is “49.2% SiO2” ? <ul><li>A recorded value from a laboratory </li></ul><ul><li>Imprecise: the true value could be 49.2%, 49.21% or 48.55%? </li></ul><ul><li>because of instrumental errors </li></ul><ul><li>and because of sampling errors </li></ul><ul><li>The full data item should include “49.2” AND data about the error distribution </li></ul>
  7. 7. Each value has its own error distribution Sample SiO2 % Cu ppm #101 ~53.5 ~128 #102 ~49.2 ~185 #103 ~66.3 ~163
  8. 8. What about queries ? <ul><li>Given the SiO2 value of “~ 49.2” </li></ul><ul><ul><li>Query “WHERE SiO2 >50 …” </li></ul></ul><ul><ul><li>Not TRUE or FALSE but P = 0.317 (for example) </li></ul></ul><ul><li>So the simple 2VL does not apply – instead a continuous scale of probability estimates from P=0 (FALSE) to P=1 (TRUE) </li></ul><ul><li>Related to ‘fuzzy logic’ – but let’s not go there today ! </li></ul>
  9. 9. Incomplete data
  10. 10. Incomplete data Hole_ID Total_Depth D_green #301 320.0 250.0 #302 300.0 270.0 #303 200.0 Unknown ?
  11. 11. Incomplete data Hole_ID Total_Depth D_green #301 320.0 250.0 #302 300.0 270.0 #303 200.0 > 200.0
  12. 12. Incomplete data
  13. 13. Incomplete data <ul><li>This value “>200.0” is semi-quantitative </li></ul><ul><li>(another similar example – “below detection limit” in chemical analysis data – e.g. “< 5 ppm”) </li></ul><ul><li>It is not a NULL so Chris Date ought to be quite happy about it </li></ul><ul><li>BUT queries will not always give unambiguous TRUE or FALSE </li></ul>
  14. 14. Querying a tuple with D_green value “>200” <ul><li>WHERE D_green > 150 … TRUE </li></ul><ul><li>WHERE D_green < 100 … FALSE </li></ul><ul><li>WHERE D_green > 250 … UNKNOWN </li></ul><ul><li>WHERE D_Green <250 … UNKNOWN </li></ul><ul><li>So 2VL (true/false) is inadequate here also </li></ul>
  15. 15. Missing data <ul><li>Very often there are genuine gaps in data sets, for many possible reasons </li></ul><ul><ul><li>Samples not collected </li></ul></ul><ul><ul><li>Observations not taken </li></ul></ul><ul><ul><li>Instrumental malfunction </li></ul></ul><ul><ul><li>. . . 1001 other possible reasons </li></ul></ul><ul><li>These gaps may be single data items or whole rows (tuples) </li></ul>
  16. 16. Missing data item Sample SiO2 % Cu ppm #101 53.5 128 #102 - 185 #103 66.3 163
  17. 17. Missing data item <ul><li>We know sample #102 must have a SiO2 value – we just don’t know what it is </li></ul><ul><li>So this value is just “missing”. It’s not “inapplicable” (which might justify re-designing the database) </li></ul><ul><li>If we use Chris Date’s ‘CWA relational’ model then we are not allowed ‘NULL’ so how do we represent this ? </li></ul>
  18. 18. CWA: Avoiding NULL <ul><li>Several proposed methods to get around the prohibition of NULL, including – </li></ul><ul><ul><li>Default-value solutions (Chris Date) </li></ul></ul><ul><ul><li>Other suggestions (Hugh Darwen and Fabian Pascal) </li></ul></ul>
  19. 19. The default-value ‘solution’ as proposed by Date <ul><li>Instead of a global ‘null’ </li></ul><ul><li>A default value defined separately for each domain </li></ul><ul><li>If a legitimate value for the domain, how are missing values distinguished from actual values ? </li></ul><ul><li>If not a legitimate value for the domain, it’s just another sort of ‘null’ – no better, but more complicated, so in fact worse </li></ul>
  20. 20. Proposals by Darwen and Pascal <ul><li>Different in detail, but both involve decomposition to ‘hide’ the missingness of data values </li></ul>
  21. 21. Decompose into ‘null-free’ relations Sample SiO2% Cu ppm #101 53.5 128 #102 - 185 #103 66.3 163 Sample SiO2 % Sample Cu ppm #101 53.5 #101 128 #103 66.3 #102 185 #103 163
  22. 22. In this way … <ul><li>We certainly get rid of the ‘NULL’ </li></ul><ul><li>Any missing data item is expressed instead as a missing tuple in a binary relation </li></ul>
  23. 23. The CWA states that … <ul><li>where r is any relation and t is any possible tuple that conforms to the heading of r :- </li></ul><ul><li>If t appears in the body of r , then it is a true instantiation of the predicate (i.e. the corresponding proposition is considered to be true); </li></ul><ul><li>conversely, if t does not appear in the body of r , then it is a false instantiation (i.e. the corresponding proposition is considered to be false ) </li></ul><ul><ul><ul><ul><li>Date & Darwen 1998, 2000, … </li></ul></ul></ul></ul>
  24. 24. … . so <ul><li>Under the CWA, any tuple that is legitimate but is missing is assumed to represent a FALSE proposition. </li></ul><ul><li>So what about our decomposed ‘null-free’ relations ? … </li></ul>
  25. 25. No tuple for sample #102 in the SiO2 relation Sample SiO2 % Sample Cu ppm #101 53.5 #101 128 #103 66.3 #102 185 #103 163
  26. 26. Under the CWA … <ul><li>There are infinitely many possible legitimate tuples for sample #102: for example </li></ul><ul><li>But NONE of them is included </li></ul><ul><li>So ALL are interpreted as FALSE propositions </li></ul><ul><li>This means that under the CWA interpretation, for sample #102 there is NO acceptable value of SiO2 – it does not mean that the value is merely unknown. </li></ul>Sample SiO2% or Sample SiO2% #102 51.2 #102 45.5
  27. 27. This implies that … <ul><li>If we have any missing data, then the CWA is not appropriate. </li></ul><ul><li>Does this mean we can’t use the relational model for geoscience data ? </li></ul><ul><li>Of course not. Just that the narrow ‘CWA’ version of relational, defined by Date, Darwen, & Pascal, is inadequate </li></ul><ul><li>- but is that really the only game in town ? </li></ul>
  28. 28. Codd was right <ul><li>We need to revert to the “true” relational model as defined by Codd – which ALLOWS for the reality, that there will always be missing and incomplete data </li></ul><ul><li>Codd’s 1979 RM/T paper and his 1990 book leave many unanswered questions – but they do allow us to use the open world assumption </li></ul><ul><li>This does not restrict us to 2VL but uses a 3VL - including truth value UNKNOWN . </li></ul>
  29. 29. So let’s take a look at the truth tables <ul><li>2VL – two valued logic for CWA </li></ul><ul><li>Then extended to allow for probabilities </li></ul><ul><li>Then 3VL as needed by OWA </li></ul>
  30. 30. CWA - 2VL T represents TRUE F represents FALSE
  31. 31. 2VL with probabilities T represents p=1; F represents p=0 p(A  B) , p(A  B) in general need statistical computation
  32. 32. OWA - 3VL T represents TRUE F represents FALSE U represents UNKNOWN
  33. 33. Conclusions <ul><li>If any data are imprecise, incomplete, or missing, then CWA and 2VL are inadequate </li></ul><ul><li>Imprecise data need a probabilistic approach – is this an extension of CWA / 2VL ? </li></ul><ul><li>If we have any incomplete (e.g. truncated) or missing data we need OWA / 3VL </li></ul>
  34. 34. Conclusions <ul><li>A database is not about what IS, but about what IS KNOWN. </li></ul><ul><li>Perfectly reasonable to use the CWA about what IS – </li></ul><ul><li>– but not about what IS KNOWN – precisely because ‘I don’t know’ has to be a valid answer: hence truth value UNKNOWN must be legal </li></ul>
  35. 35. Conclusions <ul><li>3VL need not be scary. It isn’t actually much more complex than 2VL </li></ul><ul><li>Relational databases can use the richness of the OWA. We just need to do it right. </li></ul><ul><li>See www.OpenWorldDBMS.com </li></ul>
  36. 36. Some final words from E.F.Codd (1990) <ul><li>In developing the relational model, I have tried to follow Einstein’s advice, “Make it as simple as possible, but no simpler”. I believe that in the last clause he was discouraging the pursuit of simplicity to the extent of distorting reality. </li></ul><ul><li>Is insistence on CWA and 2VL perhaps distorting reality ? A little TOO simple ? </li></ul>

Editor's Notes

  • What I am going to talk about is an old problem. It’s one that we were aware of when I was working with Keith Jeffery on relational systems development 30 years ago. Unfortunately, rather than solving the problem, insistence on the relational model being restricted to the closed world assumption is pushing us backwards
  • - as are data in many other fields too
  • Laboratory analyses are typically reported – and recorded in databases – as single numbers with greater or less precision , but usually the accuracy is ignored.
  • … so this number 49.2 may well be what has been reported by the laboratory but of course it is unlikely to be exactly the ‘true’ value for the silica content of sample 102.
  • Each data value consists of the reported ‘mean’ or ‘best estimate’ together with a definition of the error distribution about that estimate.
  • With such data, a simple &gt; test will no longer give absolute true or false results, but computed probability estimates based on the error distributions of the operands.
  • OK, let’s move on to a different problem. Three drillholes. What is the depth of the ‘top of green’ in drill hole 303 ?
  • Is it actually unknown ? Not completely. We do know that it hasn’t been intersected to the maximum depth that hole 303 has been drilled. We have some information about it.
  • So instead of ‘missing data’ or the dreaded NULL we need to put a ‘partial’ data value in there.
  • Let’s just take another look at the geology before moving on.
  • So we have the strange situation of a data value which sometimes gives absolute true or false truth values in queries, and other times gives unknown.
  • Now to the thorny question of completely missing data items.
  • OK, sample 102 wasn’t analysed for SiO2, by mistake, or oversight, or the instrument malfunctioned. The value is just missing in the purest sense of the word.
  • Among those who insist that relational means closed-world, there have been some very inventive methods devised to avoid the NULL representation for missing data – and a strong feeling that all data ought really to be complete.
  • The default-value or special-value solution simply does not hold water. Not only is it more complicated than a global ‘null’ representation, but it also requires applications (or the user) to carry and manipulate a lot of extra garbage. At best it consists of a domain-specific null value which needs the same sort of processing as a global null.
  • Effectively what we are looking at is normalisation to 6NFwhere all relations are, at most, binary.
  • So at least we have established that any occurrence of NULL can be converted to a missing tuple. Does this actually get us anywhere ?
  • This is important. We can discuss later what the “corresponding proposition” is or ought to be, because this may lie at the heart of the question.
  • Decomposition has eliminated the tuple that we had, with the missing-data placeholder for SiO2 in sample 102.
  • Here is the nub of it. What is the “corresponding proposition” ? “ Sample N contains X% SiO2” or “Sample N is reported to contain X% SiO2” ? In other words is the database a record of (a) what is or of (b) what is known ? This reasoning works for (a). For (b) it is perfectly valid for a value to be unknown and for a null or some other missing value placeholder to be used, so the problem does not arise
  • Basically the same but allowing a continuity of values between True and False
  • Please note that U is a truth value. It should not be confused with NULL or any missing-data placeholder. Indeed, it would be perfectly valid to have a type 3VL domain in which T,F, and U all appear as genuine values, as well as a missing-data placeholder (let’s call it null?)

×