Trio:  A System for Data, Uncertainty, and Lineage Search “stanford trio” http://i.stanford.edu/trio DATA UNCERTAINTY LINEAGE
People Current Jennifer Widom  (faculty) Omar Benjelloun  (post-doc) Parag Agrawal, Anish Das Sarma, Shubha Nabar  (PhD) Michi Mutsuzaki  (MS) Tomoe Sugihara  (visitor) Incoming Martin Theobald  (post-doc) Raghu Murthy  (MS) Ander de Keijzer  (visitor) Alums Alon Halevy, Ashok Chandra  (visitors) Chris Hayworth  (MS)
Why Uncertainty + Lineage? Many applications seem to need both From a technical standpoint, it turns out that  lineage... Enables simple and consistent representation of uncertain data Correlates uncertainty in query results with uncertainty in the input data Can make computation over uncertain data more efficient
Trio Components Data Model ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model  Query Language TriQL:  Simple extension to SQL, well-defined semantics and intuitive behavior System Version 1:  Complete system and GUI built  on top of conventional DBMS
Running Example: Crime-Solving Saw( witness , car )  // may be uncertain Drives( person , car )  // may be uncertain Suspects( person )   =  π person (Saw  ⋈  Drives)
Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences
Our Model for Uncertainty 1. Alternatives:   uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences = Three possible instances (Amy, Honda)  ∥  (Amy, Toyota)  ∥  (Amy, Mazda) Saw (witness,car) { Honda, Toyota, Mazda } car Amy witness
Our Model for Uncertainty 1. Alternatives 2.   ‘?’ (Maybe):  uncertainty about presence 3. Confidences Six possible instances ? (Amy, Honda)  ∥  (Amy, Toyota)  ∥  (Amy, Mazda) (Betty, Acura) Saw (witness,car)
Our Model for Uncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences:  weighted uncertainty ? Six possible instances, each with a probability (Amy, Honda): 0.5  ∥  (Amy,Toyota): 0.3  ∥  (Amy, Mazda): 0.2 (Betty, Acura): 0.6 Saw (witness,car)
Models for Uncertainty Our model (so far) is not especially new We spent some time exploring the space of models for uncertainty  [ICDE 06, journal] Tension between  understandability  and  expressiveness Our model is  understandable But it is  not complete, or even closed  under  common operations
Our Model is Not Closed Suspects   =  π person (Saw  ⋈  Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT (Cathy, Honda)  ∥  (Cathy, Mazda) Saw (witness,car) (Billy, Honda)  ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota)  ∥ (Jimmy, Mazda) Drives (person,car) Jimmy Billy  ∥ Frank Hank Suspects
Lineage to the Rescue Lineage Captures  “where data came from” In Trio:   A function   λ  from alternatives to other alternatives (or external sources)
Example with Lineage ? ? ? Suspects   =  π person (Saw  ⋈  Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1);  λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 11 ID (Cathy, Honda)  ∥  (Cathy, Mazda) Saw (witness,car) 23 22 21 ID (Billy, Honda)  ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota)  ∥ (Jimmy, Mazda) Drives (person,car) 33 32 31 ID Jimmy Billy  ∥ Frank Hank Suspects Correctly captures possible instances in the result
Uncertainty-Lineage Databases (ULDBs) 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences 4. Lineage ULDBs are closed and complete [VLDB 06]
ULDBs: Lineage Conjunctive lineage  sufficient for most operations Duplicate-elimination:  Disjunctive lineage   Difference:  Negative lineage General case after multiple operations/queries:  Boolean formula
ULDBs: Interesting Questions Data-minimality:  extraneous alternatives, extraneous “?” Lineage-minimality:  harder Membership:  tuple and table,  some-instance  and  all-instances Coexistence:  multiple tuples Extraction:  remove tables, retain possible-instances
Example: Extraneous Data extraneous ? ? ? (Diane, Mazda)  ∥ (Diane, Acura) Diane (Diane, Mazda) (Diane, Acura)
Example: Coexistence ? ? ? ? Can’t coexist Mazda Acura (Diane, Mazda)  ∥ (Diane, Acura) (Diane, Mazda) (Diane, Acura)
Querying ULDBs: Semantics Query  Q   on ULDB  D D D 1 ,  D 2 , …,  D n possible instances Q   on each instance representation of instances Q(D 1 ),  Q(D 2 ), …,  Q ( D n ) D’ implementation of  Q operational semantics D + Result
Querying ULDBs: TriQL Basic TriQL:  SQL with new semantics Obeys commutative diagram for uncertain data Tracks lineage Query results: new table or on-the-fly Implemented TriQL:  also built-in predicates  conf(),   lineage(), lineage*()
Additional TriQL Constructs [Language manual on web site] “ Horizontal subqueries” Refer to tuple alternatives as a relation Unmerged  (horizontal duplicates) Flatten ,  GroupAlts NoLineage ,  NoConf ,  NoMaybe Query-specified confidences  [done] Data modification statements
Confidence Computation Confidences computed  on-demand  based on lineage Confidence of alternative  A  is function of confidences in  λ * ( A ) Permits any query plan for data computation Default probabilistic interpretation, but queries can override SELECT person,  min(conf(Saw),conf(Drives)) as conf FROM Saw, Drives WHERE Saw.car = Drives.car
Trio System: Version 1 Standard relational DBMS Trio API and translator (Python) Command-line client Trio Metadata TrioExplorer (GUI client) Trio Stored Procedures Encoded Data Tables Lineage Tables Standard SQL “ Verticalize” Shared IDs for alternatives Columns for  confidence,“?” One per result table Uses unique IDs Table types Schema-level lineage structure conf() lineage()  “==>” lineage*() “==>>” DDL commands TriQL queries Schema browsing Table browsing Explore lineage On-demand confidence computation
Current & Future Topics Algorithms:  confidence computation, coexistence extraneous data Minimize lineage traversal Memoization Batch operations System Full query language More internal processing  ? Storage and indexing Statistics and query optimization
Current & Future Topics Top- K  by confidence  Extend basic uncertainty model Incomplete relations Continuous uncertainty Correlated uncertainty  ? External lineage, update lineage, versioning

Mazda Trio Meeting

  • 1.
    Trio: ASystem for Data, Uncertainty, and Lineage Search “stanford trio” http://i.stanford.edu/trio DATA UNCERTAINTY LINEAGE
  • 2.
    People Current JenniferWidom (faculty) Omar Benjelloun (post-doc) Parag Agrawal, Anish Das Sarma, Shubha Nabar (PhD) Michi Mutsuzaki (MS) Tomoe Sugihara (visitor) Incoming Martin Theobald (post-doc) Raghu Murthy (MS) Ander de Keijzer (visitor) Alums Alon Halevy, Ashok Chandra (visitors) Chris Hayworth (MS)
  • 3.
    Why Uncertainty +Lineage? Many applications seem to need both From a technical standpoint, it turns out that lineage... Enables simple and consistent representation of uncertain data Correlates uncertainty in query results with uncertainty in the input data Can make computation over uncertain data more efficient
  • 4.
    Trio Components DataModel ULDBs (Uncertainty-Lineage Databases): Simple extension to relational model Query Language TriQL: Simple extension to SQL, well-defined semantics and intuitive behavior System Version 1: Complete system and GUI built on top of conventional DBMS
  • 5.
    Running Example: Crime-SolvingSaw( witness , car ) // may be uncertain Drives( person , car ) // may be uncertain Suspects( person ) = π person (Saw ⋈ Drives)
  • 6.
    Our Model forUncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences
  • 7.
    Our Model forUncertainty 1. Alternatives: uncertainty about value 2. ‘?’ (Maybe) Annotations 3. Confidences = Three possible instances (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) Saw (witness,car) { Honda, Toyota, Mazda } car Amy witness
  • 8.
    Our Model forUncertainty 1. Alternatives 2. ‘?’ (Maybe): uncertainty about presence 3. Confidences Six possible instances ? (Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda) (Betty, Acura) Saw (witness,car)
  • 9.
    Our Model forUncertainty 1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences: weighted uncertainty ? Six possible instances, each with a probability (Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2 (Betty, Acura): 0.6 Saw (witness,car)
  • 10.
    Models for UncertaintyOur model (so far) is not especially new We spent some time exploring the space of models for uncertainty [ICDE 06, journal] Tension between understandability and expressiveness Our model is understandable But it is not complete, or even closed under common operations
  • 11.
    Our Model isNot Closed Suspects = π person (Saw ⋈ Drives) ? ? ? Does not correctly capture possible instances in the result CANNOT (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota) ∥ (Jimmy, Mazda) Drives (person,car) Jimmy Billy ∥ Frank Hank Suspects
  • 12.
    Lineage to theRescue Lineage Captures “where data came from” In Trio: A function λ from alternatives to other alternatives (or external sources)
  • 13.
    Example with Lineage? ? ? Suspects = π person (Saw ⋈ Drives) λ (31) = (11,2),(21,2) λ (32,1) = (11,1),(22,1); λ (32,2) = (11,1),(22,2) λ (33) = (11,1), 23 11 ID (Cathy, Honda) ∥ (Cathy, Mazda) Saw (witness,car) 23 22 21 ID (Billy, Honda) ∥ (Frank, Honda) (Hank, Honda) (Jimmy, Toyota) ∥ (Jimmy, Mazda) Drives (person,car) 33 32 31 ID Jimmy Billy ∥ Frank Hank Suspects Correctly captures possible instances in the result
  • 14.
    Uncertainty-Lineage Databases (ULDBs)1. Alternatives 2. ‘?’ (Maybe) Annotations 3. Confidences 4. Lineage ULDBs are closed and complete [VLDB 06]
  • 15.
    ULDBs: Lineage Conjunctivelineage sufficient for most operations Duplicate-elimination: Disjunctive lineage Difference: Negative lineage General case after multiple operations/queries: Boolean formula
  • 16.
    ULDBs: Interesting QuestionsData-minimality: extraneous alternatives, extraneous “?” Lineage-minimality: harder Membership: tuple and table, some-instance and all-instances Coexistence: multiple tuples Extraction: remove tables, retain possible-instances
  • 17.
    Example: Extraneous Dataextraneous ? ? ? (Diane, Mazda) ∥ (Diane, Acura) Diane (Diane, Mazda) (Diane, Acura)
  • 18.
    Example: Coexistence ?? ? ? Can’t coexist Mazda Acura (Diane, Mazda) ∥ (Diane, Acura) (Diane, Mazda) (Diane, Acura)
  • 19.
    Querying ULDBs: SemanticsQuery Q on ULDB D D D 1 , D 2 , …, D n possible instances Q on each instance representation of instances Q(D 1 ), Q(D 2 ), …, Q ( D n ) D’ implementation of Q operational semantics D + Result
  • 20.
    Querying ULDBs: TriQLBasic TriQL: SQL with new semantics Obeys commutative diagram for uncertain data Tracks lineage Query results: new table or on-the-fly Implemented TriQL: also built-in predicates conf(), lineage(), lineage*()
  • 21.
    Additional TriQL Constructs[Language manual on web site] “ Horizontal subqueries” Refer to tuple alternatives as a relation Unmerged (horizontal duplicates) Flatten , GroupAlts NoLineage , NoConf , NoMaybe Query-specified confidences [done] Data modification statements
  • 22.
    Confidence Computation Confidencescomputed on-demand based on lineage Confidence of alternative A is function of confidences in λ * ( A ) Permits any query plan for data computation Default probabilistic interpretation, but queries can override SELECT person, min(conf(Saw),conf(Drives)) as conf FROM Saw, Drives WHERE Saw.car = Drives.car
  • 23.
    Trio System: Version1 Standard relational DBMS Trio API and translator (Python) Command-line client Trio Metadata TrioExplorer (GUI client) Trio Stored Procedures Encoded Data Tables Lineage Tables Standard SQL “ Verticalize” Shared IDs for alternatives Columns for confidence,“?” One per result table Uses unique IDs Table types Schema-level lineage structure conf() lineage() “==>” lineage*() “==>>” DDL commands TriQL queries Schema browsing Table browsing Explore lineage On-demand confidence computation
  • 24.
    Current & FutureTopics Algorithms: confidence computation, coexistence extraneous data Minimize lineage traversal Memoization Batch operations System Full query language More internal processing ? Storage and indexing Statistics and query optimization
  • 25.
    Current & FutureTopics Top- K by confidence Extend basic uncertainty model Incomplete relations Continuous uncertainty Correlated uncertainty ? External lineage, update lineage, versioning