trio-boeing.ppt

406 views
295 views

Published on

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
406
On SlideShare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

trio-boeing.ppt

  1. 1. Trio-One: Layering Uncertainty and Lineage on a Conventional DBMS Martin Theobald Jennifer Widom Stanford University DATA UNCERTAINTY LINEAGE
  2. 2. Outline <ul><li>Motivation and Applications </li></ul><ul><li>Working Model for ULDBs </li></ul><ul><li>Trio-One System Architecture </li></ul><ul><li>Efficient Confidence Computations </li></ul>
  3. 3. Motivation <ul><li>Many applications involve data that is uncertain </li></ul><ul><li>(approximate, probabilistic, inexact, incomplete, </li></ul><ul><li>imprecise, fuzzy, inaccurate, ...) </li></ul><ul><li>Many of the same applications need to track the lineage of their data </li></ul><ul><li>Neither uncertainty nor lineage are supported by conventional Database Management Systems (DBMSs) </li></ul>Coincidence or Fate?
  4. 4. Sample Applications <ul><li>Deduplication </li></ul><ul><ul><li>Uncertainty: Match and merge </li></ul></ul><ul><ul><li>Lineage: Source records </li></ul></ul><ul><li>Information extraction </li></ul><ul><ul><li>Uncertainty: Extracted labels and values </li></ul></ul><ul><ul><li>Lineage: Original context </li></ul></ul><ul><li>Information integration </li></ul><ul><ul><li>Uncertainty: Inconsistent information </li></ul></ul><ul><ul><li>Lineage: Original sources </li></ul></ul>
  5. 5. Sample Applications <ul><li>Scientific experiments </li></ul><ul><ul><li>Uncertainty: Captured (and derived) data </li></ul></ul><ul><ul><li>Lineage: Layers of views </li></ul></ul><ul><li>Sensor data </li></ul><ul><ul><li>Uncertainty: Sensor values, aggregations, missing readings </li></ul></ul><ul><ul><li>Lineage: Original readings, views </li></ul></ul>
  6. 6. Claim <ul><li>The connection between uncertainty and lineage goes deeper than just a shared need by several applications </li></ul>Coincidence or Fate?
  7. 7. Lineage and Uncertainty <ul><li>Lineage... </li></ul><ul><ul><li>Enables simple and consistent representation of uncertain data </li></ul></ul><ul><ul><li>Allows for intuitive and complete working models </li></ul></ul><ul><ul><li>Correlates uncertainty in query results with uncertainty in the input data </li></ul></ul><ul><ul><li>Can make computation over uncertain data more efficient </li></ul></ul><ul><li>Applications use lineage to reduce or resolve </li></ul><ul><li>uncertainty </li></ul>
  8. 8. Goal <ul><li>A new kind of DBMS in which: </li></ul><ul><ul><li>Data </li></ul></ul><ul><ul><li>Uncertainty </li></ul></ul><ul><ul><li>Lineage </li></ul></ul><ul><li>are all first-class interrelated concepts </li></ul>Trio }
  9. 9. The Trio Trio <ul><li>Data Model </li></ul><ul><ul><li>Simplest extension to relational model that’s sufficiently expressive </li></ul></ul><ul><li>Query Language </li></ul><ul><ul><li>Simple extension to SQL with well-defined semantics and intuitive behavior </li></ul></ul><ul><li>System </li></ul><ul><ul><li>A complete open-source DBMS that people want to use </li></ul></ul>
  10. 10. The Present <ul><li>Data Model </li></ul><ul><ul><li>Uncertainty-Lineage Databases (ULDBs) </li></ul></ul><ul><li>Query Language </li></ul><ul><ul><li>TriQL </li></ul></ul><ul><li>System </li></ul><ul><ul><li>Trio-One: Layered prototype deployable on top of any standard relational DBMS </li></ul></ul>
  11. 11. Running Example: Crime-Solving <ul><li>Saw( witness , car ) // may be uncertain </li></ul><ul><li>Drives( person , car ) // may be uncertain </li></ul><ul><li>Suspects( person ) = π person (Saw ⋈ car Drives) </li></ul>
  12. 12. Data Model: Uncertainty <ul><li>An uncertain database represents a set of </li></ul><ul><li>possible instances of data values </li></ul><ul><ul><li>Amy saw either a Honda or a Toyota </li></ul></ul><ul><ul><li>Jimmy drives a Toyota, a Mazda, or both </li></ul></ul><ul><ul><li>Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3 </li></ul></ul><ul><ul><li>Hank is a suspect with confidence 0.7 </li></ul></ul>
  13. 13. Our Model for Uncertainty <ul><li>1. Alternatives (mutually exclusive) </li></ul><ul><li>2. ‘?’ (Maybe) Annotations </li></ul><ul><li>3. Confidences </li></ul>
  14. 14. Our Model for Uncertainty <ul><li>1. Alternatives: uncertainty about value </li></ul><ul><li>2. ‘?’ (Maybe) Annotations </li></ul><ul><li>3. Confidences </li></ul>= Three possible instances (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) Saw (witness,car) { Honda, Toyota, Mazda } car Amy witness
  15. 15. Our Model for Uncertainty <ul><li>1. Alternatives </li></ul><ul><li>2. ‘?’ (Maybe): uncertainty about presence </li></ul><ul><li>3. Confidences </li></ul>Six possible instances ? (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) Saw (witness,car)
  16. 16. Our Model for Uncertainty <ul><li>1. Alternatives </li></ul><ul><li>2. ‘?’ (Maybe) Annotations </li></ul><ul><li>3. Confidences: weighted uncertainty </li></ul>? Six possible instances, each with a probability (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 Saw (witness,car)
  17. 17. Models for Uncertainty <ul><li>Our model (so far) is not especially new </li></ul><ul><li>We spent some time exploring the space of models for uncertainty </li></ul><ul><li>Tension between understandability and expressiveness </li></ul><ul><ul><li>Our model is understandable </li></ul></ul><ul><ul><li>But it is not complete, or even closed under common operations </li></ul></ul>
  18. 18. Closure and Completeness <ul><li>Completeness </li></ul><ul><li>Can represent any finite set of possible instances </li></ul><ul><li>Closure </li></ul><ul><li>Can represent any result of relational operators </li></ul><ul><li>Note: Completeness  Closure </li></ul>
  19. 19. Models for Uncertainty <ul><li>A-Tuples – Uncertainty on attribute values </li></ul><ul><ul><li>(Amy, {Toyota, Mazda}) </li></ul></ul><ul><ul><li>Not closed </li></ul></ul><ul><li>X-Tuples – Uncertainty on entire tuples </li></ul><ul><ul><li>{ (Amy, Toyota), (Amy, Mazda) } </li></ul></ul><ul><ul><li>Still not closed </li></ul></ul><ul><li>C-Tables – Tuples + Boolean constraints </li></ul><ul><ul><li>{ (Amy, Toyota, X=1 ), </li></ul></ul><ul><li>(Amy, Mazda, X<>1 ), </li></ul><ul><li>(Cathy, Honda, X<>1 ) } </li></ul><ul><ul><li>Closed and complete! </li></ul></ul><ul><ul><li>Understandable? </li></ul></ul>
  20. 20. Our Model is Not Closed Suspects = π person (Saw ⋈ car Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota) ∥ (Jimmy, Mazda) Drives (person,car) Jimmy Billy ∥ Frank Hank Suspects
  21. 21. to the Rescue <ul><li>Lineage (provenance): “where data came from” </li></ul><ul><ul><li>Internal lineage </li></ul></ul><ul><ul><li>External lineage </li></ul></ul><ul><li>In Trio: A function λ from alternatives to other </li></ul><ul><li>alternatives (or external sources) </li></ul>Lineage
  22. 22. Example with Lineage ? ? ? Suspects = π person (Saw ⋈ car Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 11 ID (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) 23 22 21 ID (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota) ∥ (Jimmy, Mazda) Drives (person,car) 33 32 31 ID Jimmy Billy ∥ Frank Hank Suspects Correctly captures possible instances in the result
  23. 23. Trio Data Model <ul><li>1. Alternatives (mutually exclusive) </li></ul><ul><li>2. ‘?’ (Maybe) Annotations </li></ul><ul><li>3. Confidences </li></ul><ul><li>4. Lineage </li></ul><ul><li>ULDBs are closed and complete </li></ul>Uncertainty-Lineage Databases (ULDBs)
  24. 24. ULDBs: Lineage <ul><li>Conjunctive lineage sufficient for most operations </li></ul><ul><ul><li>Disjunctive lineage for duplicate-elimination </li></ul></ul><ul><ul><li>Negative lineage for difference </li></ul></ul>
  25. 25. ULDBs: Minimality <ul><li>A ULDB relation R represents a set of possible </li></ul><ul><li>instances </li></ul><ul><li>Does every tuple in R appear in some possible instance? (no extraneous tuples) </li></ul><ul><li>Does every maybe -tuple in R not appear in some possible instance? (no extraneous ‘?’s) </li></ul><ul><li>Also </li></ul>Data-minimality Lineage-minimality
  26. 26. Data Minimality Examples <ul><li>Extraneous ‘?’ </li></ul>? λ (40,1)=(10,1); λ (40,2)=(10,2) extraneous ? ? 10 (Billy, Honda) ∥ (Frank, Honda) . . . . . . 40 . . . Billy ∥ Frank . . . 20 . . . (Billy, Honda) . . . 30 . . . (Frank, Honda) . . .
  27. 27. Data Minimality Examples <ul><li>Extraneous tuple </li></ul>extraneous ? ? ? (Diane, Mazda) ∥ (Diane, Acura) Diane (Diane, Mazda) (Diane, Acura)
  28. 28. ULDBs: Membership Questions <ul><li>Does a given tuple t appear in some (all) possible instance (s) of R ? </li></ul><ul><li>Is a given table T one of (all of) the possible instances of R ? </li></ul>Polynomial algorithms based on data-minimization NP-Hard
  29. 29. Non-Theorists... <ul><li>Wake Back Up! </li></ul>
  30. 30. Querying ULDBs <ul><li>Simple extension to SQL </li></ul><ul><li>Formal semantics, intuitive meaning </li></ul><ul><li>Query uncertainty, confidences, and lineage </li></ul>TriQL
  31. 31. Initial TriQL Example SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car ? ? ? λ (31) = (11,2),(21,2) λ (32,1)=(11,1),(22,1); λ (32,2)=(11,1),(22,2) λ (33) = (11,1), 23 11 ID (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) 23 22 21 ID (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota) ∥ (Jimmy, Mazda) Drives (person,car) 33 32 31 ID Jimmy Billy ∥ Frank Hank Suspects
  32. 32. Formal Semantics <ul><li>Query Q on ULDB D </li></ul>D D 1 , D 2 , …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q ( D n ) D’ implementation of Q operational semantics D + Result
  33. 33. Operational Semantics <ul><li>Over conventional relational database: </li></ul><ul><li>For each tuple in cross-product of X 1 , X 2 , ..., X n </li></ul><ul><ul><li>Evaluate the predicate </li></ul></ul><ul><ul><li>If true, project attr-list to create result tuple </li></ul></ul><ul><ul><li>If INTO clause, insert into table </li></ul></ul>SELECT attr-list [ INTO table ] FROM X 1 , X 2 , ..., X n WHERE predicate
  34. 34. Operational Semantics <ul><li>Over ULDB: </li></ul><ul><li>For each tuple in cross-product of X 1 , X 2 , ..., X n </li></ul><ul><ul><li>Create “super tuple” T from all combinations of alternatives </li></ul></ul><ul><ul><li>Evaluate predicate on each alternative in T ; keep only the true ones </li></ul></ul><ul><ul><li>Project attr-list on each alternative to create result tuple </li></ul></ul><ul><ul><li>Details: ‘?’, lineage, confidences </li></ul></ul>SELECT attr-list [ INTO table ] FROM X 1 , X 2 , ..., X n WHERE predicate
  35. 35. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car)
  36. 36. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
  37. 37. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda,Jim,Mazda) ∥ (Cathy,Mazda,Bill,Mazda)
  38. 38. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda, Jim ,Mazda) ∥ (Cathy,Mazda, Bill ,Mazda)
  39. 39. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda, Jim ,Mazda) ∥ (Cathy,Mazda, Bill ,Mazda) (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
  40. 40. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda, Jim ,Mazda) ∥ (Cathy,Mazda, Bill ,Mazda) (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda,Hank,Honda) ∥ (Cathy,Mazda,Hank,Honda)
  41. 41. Operational Semantics: Example SELECT Drives.person FROM Saw, Drives WHERE Saw.car = Drives.car (Cathy,Honda,Jim,Mazda) ∥ (Cathy,Honda,Bill,Mazda) ∥ (Cathy,Mazda, Jim ,Mazda) ∥ (Cathy,Mazda, Bill ,Mazda) (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) (Cathy,Honda, Hank ,Honda) ∥ (Cathy,Mazda,Hank,Honda)
  42. 42. Operational Semantics: Example SELECT Drives.person INTO Suspects FROM Saw, Drives WHERE Saw.car = Drives.car ? ? λ ( ) = ... λ ( ) = ... (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) Jim ∥ Bill Hank Suspects
  43. 43. Confidences <ul><li>Confidences supplied with base data </li></ul><ul><li>Trio computes confidences on query results </li></ul><ul><ul><li>Default probabilistic interpretation </li></ul></ul><ul><ul><li>Can choose to plug in different arithmetic </li></ul></ul>? ? ? 0.3 0.4 0.6 Probabilistic Min (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Saw (witness,car) (Hank, Honda) ): 1.0 (Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 Drives (person,car) Jim: 0.12 ∥ Bill: 0.24 Hank: 0.6 Suspects
  44. 44. TriQL: Querying Confidences <ul><li>Built-in function: Conf() </li></ul><ul><li>SELECT Drives.person INTO Suspects </li></ul><ul><li>FROM Saw, Drives </li></ul><ul><li>WHERE Saw.car = Drives.car </li></ul><ul><li>AND Conf(Saw) > 0.5 AND Conf(Drives) > 0.8 </li></ul><ul><li> SELECT Drives.person INTO Suspects </li></ul><ul><li>FROM Saw, Drives </li></ul><ul><li>WHERE Saw.car = Drives.car </li></ul><ul><li>AND Conf(*) > 0.4 </li></ul>
  45. 45. TriQL: Querying Lineage <ul><li>Built-in join predicate: Lineage() </li></ul><ul><li>SELECT Saw.witness INTO AccusesHank </li></ul><ul><li>FROM Suspects, Saw </li></ul><ul><li>WHERE Lineage(Suspects,Saw) </li></ul><ul><li>AND Suspects.person = ‘Hank’ </li></ul><ul><li>Als o works for transitive lineage: </li></ul><ul><li>SELECT witness </li></ul><ul><li>FROM AccusesHank </li></ul><ul><li>WHERE Lineage(Saw,AccusesHank) </li></ul>
  46. 46. Additional Query Constructs <ul><li>“ Horizontal subqueries” </li></ul><ul><ul><li>Refer to tuple alternatives as a relation </li></ul></ul><ul><li>Unmerged (horizontal duplicates) </li></ul><ul><li>Flatten , GroupAlts </li></ul><ul><li>NoLineage , NoConf , NoMaybe </li></ul><ul><li>Query-computed confidences </li></ul><ul><li>Data modification statements </li></ul>
  47. 47. Final Example Query Credibility List suspects with conf values based on accuser credibility (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) PrimeSuspect (crime#, accuser, suspect) Cathy Betty Amy person 15 5 10 score Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Suspects
  48. 48. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) PrimeSuspect (crime#, accuser, suspect) Cathy Betty Amy person 15 5 10 score Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Suspects
  49. 49. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P) (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) PrimeSuspect (crime#, accuser, suspect) Cathy Betty Amy person 15 5 10 score Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Suspects
  50. 50. Final Example Query Credibility SELECT suspect, score/[sum(score)] as conf FROM (SELECT suspect, (SELECT score FROM Credibility C WHERE C.person = P.accuser) FROM PrimeSuspect P) “ Scaled as conf” “ Uniform as conf” … (1, Amy, Jimmy) ∥ (1, Betty, Billy) ∥ (1, Cathy, Hank) (2, Cathy, Frank) ∥ (2, Betty, Freddy) PrimeSuspect (crime#, accuser, suspect) Cathy Betty Amy person 15 5 10 score Jimmy: 0.33 ∥ Billy: 0.5 ∥ Hank: 0.166 Frank: 0.25 ∥ Freddy: 0.75 Suspects
  51. 51. The Trio System <ul><li>Version 1 </li></ul><ul><ul><li>Entirely on top of conventional DBMS </li></ul></ul><ul><ul><li>Surprisingly easy and complete, reasonably efficient </li></ul></ul>
  52. 52. The Trio System: Version 1 Relational DBMS create trio table T(A,B) select C into R ... Trio Metadata Trio API SQL commands <ul><li>Result cursors </li></ul><ul><li>Traverse lineage </li></ul>Lin:R aidTo table aidFr R C xid aid T B A xid aid
  53. 53. Relation Encoding ? ? Saw Drives Suspects Lin:Suspects Mazda Bill 2 1 22 Mazda Jim 2 1 21 2 xid 23 aid 1 conf Hank person Honda car Jim ∥ Bill Hank Suspects Bill 3 1 32 33 31 aid 2 1 xid 2 3 conf Hank Jim person (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Hank, Honda) (Jim, Mazda) ∥ (Bill, Mazda) Drives (person,car) Honda Cathy 2 1 11 1 xid 12 aid 2 conf Cathy witness Mazda car 11 Saw 33 12 Saw 31 21 Drives 31 aidTo table aidFr Drives Drives Saw 33 32 32 22 12 23
  54. 54. Relation Encoding with Confidences Saw Drives Suspects 0.12 0.24 0.6 Mazda Bill 0.6 1 22 Mazda Jim 0.3 1 21 2 xid 23 aid 1.0 conf Hank person Honda car Bill NULL 1 32 33 31 aid 2 1 xid NULL NULL conf Hank Jim person (Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4 Saw (witness,car) (Hank, Honda) (Jim, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 Drives (person,car) Honda Cathy 0.6 1 11 1 xid 12 aid 0.4 conf Cathy witness Mazda car
  55. 55. Query Translation <ul><li>Persistent results: Query Q into result table R </li></ul><ul><li>Run query Q’ to produce “super-result” R </li></ul><ul><ul><li>Q’ ≈ Q but adds aid/xid’s of source tuples, joins lineage tables for lineage() predicates </li></ul></ul><ul><li>Group R into alternatives, generate new xid’s </li></ul><ul><li>Materialze lineage data into lin:R </li></ul><ul><li>Compute confidences? (optional) </li></ul><ul><li>Add/update metadata: schemas, confidence info, lineage structure </li></ul><ul><li>Transient results: stop at 2, return cursor </li></ul><ul><li> over new xid’s </li></ul>
  56. 56. The Trio System: Version 1 Relational DBMS create trio table T(A,B) select C into R ... Trio Metadata Trio API SQL commands <ul><li>Result cursors </li></ul><ul><li>Traverse lineage </li></ul>Lin:R aid table aid R C xid aid T B A xid aid
  57. 57. The Trio System: Version 1 Trio API and translator (TriQL parser in Python) Trio Metadata TrioExplorer (GUI client) Trio Stored Procedures Encoded Data Tables Lineage Tables Standard SQL <ul><li>Data & system </li></ul><ul><li>columns </li></ul><ul><li>A-IDs for </li></ul><ul><li>alternatives </li></ul><ul><li>“ Verticalized” </li></ul><ul><li>through shared X-IDs </li></ul><ul><li>Confidences </li></ul><ul><li>One lineage table </li></ul><ul><li>per derived table </li></ul><ul><li>Connecting A-IDs </li></ul><ul><li>Table types </li></ul><ul><li>Schema-level </li></ul><ul><li>lineage structure </li></ul><ul><li>conf() </li></ul><ul><li>lineage() </li></ul><ul><li>TriQL queries </li></ul><ul><li>DDL commands </li></ul><ul><li>Schema browsing </li></ul><ul><li>Table browsing </li></ul><ul><li>Explore lineage </li></ul><ul><li>On-demand </li></ul><ul><li>confidence </li></ul><ul><li>computation </li></ul>Command-line client Relational DBMS SPI-Plugin for PostgreSQL Replacing “ Fetch-All”
  58. 58. The Trio System: Version 2 Trio API and translator TrioExplorer (GUI client) Command-line client Trio DBMS Specialized Trio Data Structures Specialized Trio Processing & Optimizing Standard SQL Direct function calls
  59. 59. Efficient Confidence Computation <ul><li>Previous approach (probabilistic databases) </li></ul><ul><ul><li>Each operator computes confidences during query execution </li></ul></ul><ul><ul><li>Only certain (“safe”) query plans allowed </li></ul></ul><ul><li>In ULDBs </li></ul><ul><ul><li>Confidence of alternative A is function of confidences in λ ( A ) </li></ul></ul><ul><li>Our approach </li></ul><ul><ul><li>Decoupled data and confidence computation </li></ul></ul><ul><ul><li>Allow any query plan for data computation </li></ul></ul><ul><ul><li>Compute confidences on-demand based on lineage </li></ul></ul>
  60. 60. CIDs & Confidence Memoization <ul><li>Find set of Closest Independent Descendants (CIDs) for each relation in the schema-level lineage DAG (relations with no shared ancestors) </li></ul><ul><li>Batched confidence computations in SQL </li></ul><ul><ul><li>Select unions of distinct base alternatives from all root-to-leaf lineage paths </li></ul></ul><ul><ul><li>With CIDs and memoization enabled </li></ul></ul><ul><ul><ul><li>if confidences are memoized then stop at CIDs </li></ul></ul></ul><ul><ul><ul><li>else traverse down to the leafs (base data) </li></ul></ul></ul><ul><ul><ul><li>and memoize at CIDs </li></ul></ul></ul><ul><ul><li>(ongoing work) </li></ul></ul>
  61. 61. “Probabilistic TPC-H” Setup <ul><li>Horizontal split of TPC-H tables </li></ul><ul><li>Uniform-random number of alternatives [1-10] </li></ul><ul><li>Uniform confidences </li></ul>Partsupps Lineitems Orders O 2 O 1 O L P OL LP OLP 1.5M 1.0M 0.1M 1.0M 0.5M 0.7M O 3 L 2 L 1 L 3 L 4 P 2 P 1 P 3 O L P CID(OLP) = {O, L, P} 0.8M 6M 1.5M
  62. 62. Confidence Computation Times
  63. 63. Per-Tuple vs. Batching (no CIDs)
  64. 64. Memoization & Join Multiplicity
  65. 65. Current Topics <ul><li>System </li></ul><ul><ul><li>Full query language </li></ul></ul><ul><ul><li>Nice interface </li></ul></ul><ul><ul><li>Performance experiments </li></ul></ul><ul><ul><li>Demo applications </li></ul></ul><ul><li>Algorithms: confidence & lineage computation, </li></ul><ul><li>extraneous data, membership questions </li></ul><ul><ul><li>Minimize lineage traversal </li></ul></ul><ul><ul><li>Confidence memoization </li></ul></ul><ul><ul><li>Efficient batch computations </li></ul></ul>
  66. 66. Future Directions <ul><li>Theory, Model, Algorithms </li></ul><ul><ul><li>Unlimited opportunities </li></ul></ul><ul><li>System </li></ul><ul><ul><li>Storage, indexing, partitioning </li></ul></ul><ul><ul><li>Statistics and query optimization </li></ul></ul><ul><li>More features </li></ul><ul><ul><li>Top- K by confidence </li></ul></ul><ul><ul><li>Incomplete relations; continuous uncertainty; correlated uncertainty </li></ul></ul><ul><ul><li>External lineage; update lineage; versioning </li></ul></ul>
  67. 67. Search “stanford trio” Current Trio team: Parag Agrawal, Anish Das Sarma, Raghotham Murthy, Michi Mutsuzaki, Shubha Nabar, Tomoe Sugihara, Martin Theobald, Jennifer Widom Special thanks to: Ashok Chandra, Alon Halevy, Jeff Ullman, Omar Benjelloun but don’t forget the lineage… DATA UNCERTAINTY LINEAGE

×