Architecture
  • In a push scenario, the action is initiated by the source; in a pull scenario, by the client.
  • Heterogeneity in formats, vocabularies and access methods prevents unified access. An integrated view, governed by an integrated global schema/ontology, provides unified access.
  • Start with the warehousing vs. on-demand axis. Info acquisition: does the client acquire the information by issuing a declarative query against an integrated view, or does it have to specify step by step, with brute-force code or a workflow system, how the data are to be obtained and put together? Information model: during the current transient period other combinations are also observed. There is great work on query processing, on almost all of the points above, coming from IDM: query reformulation, distributed query optimization and execution.
  • Apart from the data integration problem, there is the access methods integration problem. The developer builds queries, but can the mediator execute them, given the limited access methods of the sources? I developed the CLIDE system, which guides the developer toward executable queries only. This will be the first part of my talk. The second part will be the QURSED system, which semi-automates the generation of powerful query forms and reports for complex data, thus relieving the developer of extensive coding. I will simply mention two more contributions. I developed ProtoPlasm, a schema matching platform, which automatically computes correspondences between attributes of the source schemas and attributes of the integrated one, thus helping the integration engineer carry out the task. Last, but not least, I designed QSSL, a language for exporting data-oriented web services; source owners use it to export a large number of parameterized queries in a compact way. Let’s start with CLIDE.
  • Queries consist of table, selection, join and projection atoms
  • Note: these are NOT foreign keys; they are semantically compatible attributes.
  • Mediators have query capabilities, and feasible queries are not obvious. The query is infeasible because none of the web services provides all the computers; they only provide them for a certain attribute value. For example, Amazon does not give you all books, on purpose, for security or business reasons. Mediators accept queries and use the underlying views to answer them. Not all queries can be answered: the ones that can are called feasible and the rest infeasible. Keywords: provide capabilities transparency (analogous to physical layer transparency in RDBMSs); extend source capabilities; exact rewritings only; Local-as-View approach; Closed World Assumption.
  • Covered extensively in the literature. The developer has to replicate the logic of the mediator and figure out which queries are feasible. Option 1 is not transparent.
  • He has the option to execute the join or introduce some other table from the schema
  • Rate must be provided according to V2. Either provide a constant, or provide its value indirectly by saying that Rou1.rate is equal to Net1.rate via a join. Either constant or join.
  • Because we are interested in rapid convergence, we will formalize the CLIDE interaction using the interaction graph. A path corresponds to a particular interaction. Now, let’s see how the CLIDE back-end works: how we model the interaction, which actions we color and how, and what properties we guarantee. Syntactic isomorphism of q1 and q2: rename the aliases of q1 such that distinct aliases get distinct names; then q1 and q2 have the same sets of atoms in the SELECT, FROM and WHERE clauses.
  • Actions are colored based on queries that syntactically extend and semantically restrict the current one.
  • Blue is crucial in ensuring completeness and minimality. Colors can be defined in terms of paths. Yellow: all feasible queries require picking this action. If you pick red, you can’t reach a feasible query.
  • The definition depends on the entire set of feasible queries, and feasible queries are infinitely many! How do we color algorithmically? Intuition: it is enough to consider FQ_C, because all the others can be reached from them. Input: a node n in the interaction graph representing the current query q(n), and the set of possible actions A_C. Output: a partition of A_C into yellow, blue, white and red suggested actions, plus the feasibility flag.
  • Maximally contained queries FQ_MC in q(n): for each q_i in FQ_MC, q_i is feasible and contained in q(n). For distinct q_i, q_j in FQ_MC, q_i is not contained in q_j. For each query q_i in FQ there exists q_j in FQ_MC such that q_i is contained in q_j.
  • Projections and parameters complicate the algorithm a little bit more. Let’s talk about the rest first.
  • Polynomial in the query tree width when the query is acyclic (acyclic queries).

Architecture Presentation Transcript

  • 1. From Enterprise Information Integration to Community-Based Mediation. A presentation by Yannis Papakonstantinou on joint works with Alin Deutsch, Yannis Katsis, Michalis Petropoulos. CSE Department
  • 2. Data Integration Requirements & Desiderata (high level)
    • Provide application with integrated database
      • single point of (query/update) access to the data
    • Provide distribution and heterogeneity transparency
      • heterogeneous formats, heterogeneous interfaces, different rates of change (static versus dynamic), autonomous sources
    • Decouple application logic from integration
    • Easily add/change sources
    • Customize the delivery of content
  • 3. Most-Generic Integration System Architecture
    [Diagram: Client Applications, Integration Software, Information Sources; Push and Pull]
  • 4. SIGMOD Community’s Architecture for Unified Access to Data & Services
    [Diagram: Information + Service Sources; Wrappers; Local Common Model (XML) Views + Services; Mediator with Cache & Replication; Integrated (XML) Global View / Ontology + Services; (Web) Client Applications]
  • 5. Approaches towards View-Based Data Integration
    • Integration Specification Method
      • Local As View (LAV)
      • Global As View (GAV)
      • GLAV=GAV+LAV
    • Info Model & Query Language
      • Relational (SQL)
      • XML (XQuery)
      • Object-Oriented
    • Storage Method
      • Warehousing (materialized views)
      • On-Demand (virtual views)
  • 6. Enterprise Information Integration Reaches Maturity
    • Materialized View (Warehousing) approach well-adopted since mid/late 90s
      • GAV function role played by Extract-Transform-Load tools
      • Human Intervention Occasionally Needed in Cleaning Up
        • Concordance tables for Object Identification
    • Virtual View (Mediation) approach at early adoption
      • many years of research
        • Distributed db’s, federated db’s, mediators
      • moving well into mainstream
        • BEA AquaLogic (XML, Virtual, GAV view)
        • IBM DB2
    Enterprise
  • 7. Current Enterprise Information Integration Deployments
    [Diagram: Marketing Local View M, Sales Local View S, Service Local View E; GAV View V; Integrated Global View V(M, S, E); Integration Admin; View Builder (design time); Mediator Query Processor (run time); Schemas; Data]
    • Small Domain
    • Mostly Vertical Partition of Sources
    • Primarily Application-Driven View or Identity View
    • Integration Administrator/Developer in charge
    Enterprise
  • 8. Opportunities and Needs Presented by “Motivated” Communities
    • Emerging Myriads of Internet Communities of
      • Myriads of sources and clients
      • Source owners motivated to participate
    • EII does not address needs
      • Expensive
      • Bottleneck of Single Integration Admin
    • Make building corresponding portals similar to starting and participating in newsgroups
    • Appropriate tools needed to enable source owner and client participation
    Communities
  • 9. A Community-Based Information Modeling Architecture
    [Diagram: Source Owner’s Domains with Information Sources 1 … n, Local XML Views S1 … Sn and Data Services 1 … n; GLAV mappings V1 … Vn; Integrated View Owner’s Domain with Integrated XML View G and Data Services; GAV Application Views V1a(G) … Vma(G); Client Applications 1 … m]
  • 10. Visual Tools Matter! (example from the Enosys Query Builder)
    [Screenshot: C:EnosysprojectsallPONS.qpr* - Enosys Query Builder. Callouts: open & view source schemas in XML; drag & drop to create target XML view; target schema (XML view); automatically generated maps]
  • 11. [Screenshot: C:EnosysprojectsallPONS.qpr* - Enosys Query Builder. Callouts: XQuery based on design specs; run & test XQuery; XML result]
  • 12. Architecture for Large-Scale Data Integration System and Design Tools
    • QURSED: How can the user query and browse the integrated data?
    • CLIDE: What queries can my app issue? What integrated view services can I build?
    • RIDE-Services: How do I export my database services functionality?
    • RIDE: How do I export my data?
    [Diagram: Source Domain (Source Owner, Data Sources, Source Schemas); Web Domain (Web Services, Web Services Cache (Metadata)); Integration Domain (Integration Engineer, Mediator, Global View Schema); Application Domain (Developer, Application, Web Forms & Reports)]
  • 13. Dual Interactive Registration Problems
    • Register Source: given the Global Schema, Constraints & Queries, guide the source owner in registering a new source and services
    • Register Client: given the Sources, guide the client in query/form writing
    [Diagram: New Source and Services and New App / New Query, each registering against the Global View]
  • 14. Source Data Registration
    • How do my source attributes map to global attributes
      • mappers & automatic matchers
    • How do my data relate to queries & other sources
      • Inconsistencies?
      • What does it take to contribute to queries?
      • How much should I clean up?
    • Multiple ways of dealing with redundancy
    [Diagram: New Source, Global View, Apps, Queries; Server Side]
  • 15. How to achieve this Goal
    • Before: look at all sources & queries, then decide how to register your source
    • Now: follow the suggestions of the interface (Source Registration Tool)
    [Diagram: New Source, Global View, Apps, Queries, before and now]
  • 16. Our Goal in Source Registration Guide the source owner visually through the registration of the source so as to avoid/warn about (potential) inconsistencies and contribute information to the answer of the queries while exposing the minimum information possible and/or minimizing effort
  • 17. The Problem
    [Diagram: Client Queries, Mediator (Global DB), Sources (Actual Local DBs)]
  • 18. The Contribution Problem
    • What is the contribution of source S to the result of the query Q?
    [Diagram: Client Queries with query Q, Mediator (Global DB), Sources (Actual Local DBs) with source S]
  • 19. The Problem
    • What is the contribution of source S to the result of the query Q?
    [Diagram: Client Queries, Mediator (Global DB), Sources (Actual Local DBs) providing cars and reviews. Examples: Q: cars; Q: cars JOIN reviews. Labels: S is Self Sufficient w.r.t. Q; S is Now Complementary w.r.t. Q]
  • 20. Relational Schemas: Local and Global ?
    • Relational Schemas
    • Visual Representation
    • Source 1 (Business Magazine), schema S1: make(Carmake, Origin, Sales)
    • Source 2 (Car Magazine), schema S2: auto(Id, Model, Carmake), detail(Id, Engine, Baseprice)
    • Global Car Portal, schema G: car(Model, Carmake, Doors, Baseprice), brand(Carmake, Origin)
    [Diagram legend: relations, attributes]
  • 21. Source Registration using GLAV Mappings
    • Source Registration: correspondence between a source schema and the global schema = set of mapping constraints of the form (U ⊆ V)
      • U is a CQ= over the source schema
      • V is a CQ= over the global schema
    • Open World
    • Global and Local As View (GLAV)
  • 22. Target Constraints
    • Constraints on the global schema = set of constraints of the form (U ⊆ V)
      • U is a CQ= over the global schema
      • V is a CQ= over the global schema
    • Also expresses dependencies (PKs, referential integrity, …)
  • 23. Visual Representation of Mappings (1) ?
    • Visual Representation (IBM Clio)
    • Business Magazine: provides Carmake and Origin
      • U1(C, O) :- make(C, O, S)
      • V1(C, O) :- brand(C, O)
      • (U1 ⊆ V1)
    [Diagram: lines from make(Carmake, Origin, Sales) in S1 to brand(Carmake, Origin) in G]
  • 24. Visual Representation of Mappings (2) ?
    • Visual Representation (IBM Clio)
    • Car Magazine: provides Model, Carmake and Baseprice
    [Diagram: lines from auto and detail in S2 (variables I, M, C, E, B) to car in G (variables M, C, B)]
  • 25. Example of Target Constraint ?
    • (Model, Carmake) is a PK of car
    • U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)
    • V1(M, C, D, B, D, B) :- car(M, C, D, B)
    • (U1 ⊆ V1)
    [Diagram: global schema G with car(Model, Carmake, Doors, Baseprice) and brand(Carmake, Origin)]
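The primary-key constraint on this slide can be checked mechanically on a small instance. The following is a minimal Python sketch, assuming the constraint is read as containment of U1 in V1 and that car is a set of (Model, Carmake, Doors, Baseprice) tuples; the function names and the sample tuples are hypothetical and not part of the original slides.

```python
def u1(car):
    # U1(M, C, D1, B1, D2, B2) :- car(M, C, D1, B1), car(M, C, D2, B2)
    return {
        (m, c, d1, b1, d2, b2)
        for (m, c, d1, b1) in car
        for (m2, c2, d2, b2) in car
        if (m, c) == (m2, c2)
    }


def v1(car):
    # V1(M, C, D, B, D, B) :- car(M, C, D, B)
    return {(m, c, d, b, d, b) for (m, c, d, b) in car}


def satisfies_pk(car):
    # (Model, Carmake) is a key exactly when every pair of tuples that agree
    # on (M, C) also agrees on the remaining attributes, i.e. U1 is inside V1.
    return u1(car) <= v1(car)


ok_instance = {("M3", "BMW", 2, "45K")}
bad_instance = {("M3", "BMW", 2, "45K"), ("M3", "BMW", 4, "45K")}
```

Here `satisfies_pk(ok_instance)` holds, while `bad_instance` violates the key because two tuples share (M3, BMW) but differ on Doors.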
  • 26. Query Semantics
    • Queries in UCQ =
    • Set of Possible Global Instances
      • Set of global instances that satisfy all constraints
    • Query Answers = Set of Certain Answers
      • The tuples appearing in the answer to Q for every possible global instance
    [Diagram: the Certain Answers to Q are the intersection of the answers to Q over the possible global instances]
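As a rough illustration of certain answers, the sketch below intersects the answers of a query over an explicitly enumerated, finite list of possible global instances. That finiteness is an assumption made only for the example (in general the set of possible instances is not finite), and the function names and sample tuples are hypothetical.

```python
def certain_answers(query, possible_instances):
    """Return the tuples that `query` yields on every listed instance."""
    answers = [set(query(inst)) for inst in possible_instances]
    return set.intersection(*answers) if answers else set()


def models_with_price(instance):
    # Hypothetical query: (Model, Baseprice) pairs from the car relation.
    return {(model, price) for (model, _make, _doors, price) in instance["car"]}


# Two hypothetical global instances that both satisfy the constraints.
inst_a = {"car": {("M3", "BMW", 2, "45K")}}
inst_b = {"car": {("M3", "BMW", 4, "45K"), ("A4", "Audi", 4, "35K")}}

result = certain_answers(models_with_price, [inst_a, inst_b])
```

Only the tuple present in both answers survives, so `result` contains the M3 row and not the A4 row.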
  • 27. Source Instance’s Contribution
    • For given instances of the sources
    • Contribution to Q of a source instance = the tuples in the answer to Q not provided by the other sources
    [Diagram: answer to Q with the source minus answer to Q without it]
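The contribution of a source instance can be sketched as a set difference. The Python below is a simplified illustration, assuming that the global instance is just the union of the tuples the sources provide (it ignores mappings, target constraints and certain-answer semantics); all names and sample data are hypothetical.

```python
def answer(query, sources):
    """Evaluate `query` over the union of the relations the sources provide."""
    instance = {}
    for src in sources:
        for relation, tuples in src.items():
            instance.setdefault(relation, set()).update(tuples)
    return query(instance)


def contribution(query, source, others):
    # Tuples in the answer that are not obtained from the other sources alone.
    return answer(query, others + [source]) - answer(query, others)


def cars_join_reviews(instance):
    cars = instance.get("cars", set())
    reviews = instance.get("reviews", set())
    return {
        (model, price, text)
        for (model, price) in cars
        for (reviewed, text) in reviews
        if model == reviewed
    }


review_source = {"reviews": {("M3", "great")}}
car_source = {"cars": {("M3", "45K")}}

alone = contribution(cars_join_reviews, review_source, [])
combined = contribution(cars_join_reviews, review_source, [car_source])
```

In this toy setting the review source contributes nothing on its own but contributes a joined tuple once the car source is present, which mirrors the difference between a self sufficient and a complementary registration.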
  • 28. Source Registration’s Contribution
    • Source Registration: Source Mappings
    • Degrees of Source Registration’s Contribution, from more contribution to less contribution:
      • Self Sufficient
      • Now Complementary
      • Later Complementary
      • Unusable
  • 29. Self Sufficient Registration: Example
    • Query: Baseprices of Models
    • Green Registration is Self Sufficient
    [Diagram: global schema G with car(Model, Carmake, Doors, Baseprice) and brand(Carmake, Origin); example car tuple: Model M3, Carmake BMW, Doors ?, Baseprice 45K]
  • 30. Self Sufficient Registration: Definition
    • There exists a source instance such that the source has a non-empty contribution in the absence of the other sources
    [Diagram: Answer to Q with the source minus Answer to Q without it is non-empty; Self Sufficient]
  • 31. Now Complementary Registration: Example
    • Query: Baseprices of Models by German manufacturers (Origin = ‘Germany’)
    • Green Registration is Now Complementary
    [Diagram: global schema G; example tuples: car with Model M3, Carmake BMW, Doors ?, Baseprice 45K; brand with Carmake BMW, Origin Germany]
  • 32. Now Complementary Registration: Definition
    • Not Self Sufficient, and
    • there exist source instances such that the source has a non-empty contribution in combination with the other existing sources
    [Diagram: Answer to Q difference is non-empty; Now Complementary]
  • 33. Later Complementary Registration: Example
    • Query: Baseprices of Models by German manufacturers (Origin = ‘Germany’)
    • Green Registration is Later Complementary
    [Diagram: global schema G; example tuples: car with Model M3, Carmake BMW, Doors ?, Baseprice 45K; brand with Carmake BMW, Origin Germany]
  • 34. Later Complementary Registration: Definition
    • Not Self Sufficient, and
    • Not Now Complementary, and
    • there exist potential future sources and source instances such that the source has a non-empty contribution in combination with the future sources
    [Diagram: Answer to Q difference is non-empty; Later Complementary]
  • 35. Unusable Registration: Example
    • Query: Origin of Carmakes
    • Green Registration is Unusable
    [Diagram: global schema G with car(Model, Carmake, Doors, Baseprice) and brand(Carmake, Origin)]
  • 36. Unusable Registration: Definition
    • Not Self Sufficient, and Not Now Complementary, and Not Later Complementary
    • The source has an empty contribution regardless of what sources enter the system
    [Diagram: Answer to Q difference is empty; Unusable]
  • 37. Subtleties for Unusable Registrations
    • Query: Baseprices and Doors of Models
    • Green Registration is Unusable
    [Diagram: global schema G; example car tuples: (Model M3, Carmake BMW, Doors ?, Baseprice 45K) and (Model M3, Carmake BMW, Doors 2, Baseprice ?)]
  • 38. In presence of PK Unusable Example becomes Later Complementary
    • Query: Baseprices and Doors of Models
    • Green Registration is Later Complementary
    [Diagram: global schema G; the car tuples (M3, BMW, Doors ?, Baseprice 45K) and (M3, BMW, Doors 2, Baseprice ?) merge on the key (M3, BMW) into (M3, BMW, Doors 2, Baseprice 45K)]
  • 39. Decidability Results
    • Overview: what is decidable, per degree and per class of target constraints
    Degree              | None | Primary keys | Primary keys + Referential Integrity Constraints
    Self Sufficient     | Yes  | Yes          | No
    Now complementary   | Yes  | Yes          | No
    Later complementary | Yes  | Yes          | ?
    Unusable            | Yes  | Yes          | ?
  • 40. Issues
    • Unique client query vs. multiple client queries
      • Contribute to all queries? One query? Specific queries? Some queries based on some ranking?
    • Data independence vs. data dependence
      • e.g. M1: cars, refPrices; M2: reviews; Q: cars JOIN reviews JOIN refPrices
      • DB1: cars, refPrices (Audis); DB2: reviews (Hondas)
      • (M2, Q) is now-complementary, but the Certain Answers for instances DB1, DB2 are empty
  • 41. Putting it all together
    • Goal: guide the source owner visually through the registration of the source so as to raise contribution to the answer of the queries while exposing the minimum info possible and/or minimizing effort
    • Contribution: 4 categories: Self Sufficient / Now Complementary / Later Complementary / Unusable
    • Architecture: Query Answering / Mappings / Schemas
    [Diagram: Query Q over the Global Schema; Local Schemas S1 … Sn of the registered sources with Mappings M1 … Mn; new source Sn+1 with mapping Mn+1; S’]
  • 42. Example 1 (without primary keys in the target)
    • Status: Unusable
    • BLUE: Map at least one of the groups
    [Diagram: Local Schema AutoTrader (car, ad *; vin, cmodel, price, carId, id); Global Schema Community and Query AppQuery (car, usedAd *, review *, refPrice *; price, model, drive, vin, condition, quality)]
  • 43. Example 1 (without primary keys in the target)
    • Mapping drawn: = cmo
    • Status: Unusable
    [Diagram: same schemas as on slide 42]
  • 44. Example 1 (without primary keys in the target)
    • Mappings drawn: = cmo, = price
    • Status: Later Complementary
    [Diagram: same schemas as on slide 42]
  • 45. Example 2 (with primary keys in the target)
    • Status: Unusable
    [Diagram: same schemas as on slide 42]
  • 46. Example 2 (with primary keys in the target)
    • Mappings drawn: = price, = vin
    • Status: Later Complementary
    [Diagram: same schemas as on slide 42]
  • 47. Lessons learned
    • To merge data with that of other sources (become complementary), pick a relation and provide…
      • In absence of primary keys: … all its attributes asked by the query
      • In presence of primary keys: … its primary key and one of its attributes asked by the query
    • The number of choices increases in presence of primary keys
    • Foreign keys on the target affect the suggestions
    • Target constraints make a difference
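The rule of thumb on this slide can be written down as a small predicate. This is only a sketch of the stated lesson for a single relation, not the decision procedure behind the tool; it assumes that with primary keys both options are available, and the function name and attribute sets are hypothetical.

```python
def can_become_complementary(provided, asked, primary_key=None):
    """Apply the slide's rule of thumb to one target relation.

    provided:    attributes of the relation that the source maps to
    asked:       attributes of the relation that the query asks for
    primary_key: key attributes of the relation, or None if it has no key
    """
    provided, asked = set(provided), set(asked)
    provides_all_asked = asked <= provided
    if primary_key is None:
        return provides_all_asked
    provides_key_and_one = set(primary_key) <= provided and bool(provided & asked)
    return provides_all_asked or provides_key_and_one


# Query "Baseprices and Doors of Models" over car; the source lacks Doors.
asked = {"Model", "Baseprice", "Doors"}
provided = {"Model", "Carmake", "Baseprice"}

without_pk = can_become_complementary(provided, asked)
with_pk = can_become_complementary(provided, asked, primary_key={"Model", "Carmake"})
```

With these inputs the registration cannot be merged without a key but can be merged once (Model, Carmake) is a key, in line with the Unusable and Later Complementary examples on slides 37 and 38.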
  • 48. Large-Scale Data Integration Systems
    • QURSED: How can the user query and browse the integrated data?
    • CLIDE: What queries can the mediator answer for me?
    • RIDE-Services: How do I export my database services functionality?
    • RIDE: How do I export my data?
    [Diagram: Source Domain (Source Owner, Data Sources, Source Schemas); Web Domain (Web Services); Integration Domain (Integration Engineer, Mediator, Global View Schema); Application Domain (Developer, Application, Web Forms & Reports)]
  • 49. Running Example
    • Dell
      • Schema
        • Computers (cid, cpu, ram, price)
        • NetCards (cid, rate, standard, interface)
      • Parameterized Views
        • V1 ComByCpu(cpu) → (Computer)*: Computers for a given cpu
            SELECT DISTINCT Com1.*
            FROM Computers Com1
            WHERE Com1.cpu = cpu
        • V2 ComNetByCpuRate(cpu, rate) → (Computer, NetCard)*: Computers & NetCards for a given cpu & rate
            SELECT DISTINCT Com1.*, Net1.*
            FROM Computers Com1, NetCards Net1
            WHERE Com1.cid = Net1.cid
            AND Com1.cpu = cpu
            AND Net1.rate = rate
    • Cisco
      • Schema
        • Routers (rate, standard, price, type)
      • Views
        • V3 RouByTypeW() → (Router)*: Wired Routers
            SELECT DISTINCT Rou1.*
            FROM Routers Rou1
            WHERE Rou1.type = 'Wired'
        • V4 RouByTypeWL() → (Router)*: Wireless Routers
            SELECT DISTINCT Rou1.*
            FROM Routers Rou1
            WHERE Rou1.type = 'Wireless'
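To try the parameterized views, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory database, the Python function names and the column types are assumptions made for the example; the sample rows are the ones shown on slide 51, and the view parameters are passed as bound SQL parameters.

```python
import sqlite3


def make_db():
    # In-memory database holding the Dell part of the running example.
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE Computers (cid TEXT, cpu TEXT, ram INTEGER, price INTEGER);
        CREATE TABLE NetCards (cid TEXT, rate TEXT, standard TEXT, interface TEXT);
        INSERT INTO Computers VALUES ('A123', 'P4', 512, 400), ('B123', 'P4', 1024, 550);
        INSERT INTO NetCards VALUES ('A123', '10', '.11b', 'USB'), ('B123', '54', '.11g', 'USB');
        """
    )
    return db


def com_by_cpu(db, cpu):
    # V1 ComByCpu(cpu): computers for a given cpu (ordered only for stable output).
    sql = """SELECT DISTINCT Com1.* FROM Computers Com1
             WHERE Com1.cpu = ? ORDER BY Com1.cid"""
    return db.execute(sql, (cpu,)).fetchall()


def com_net_by_cpu_rate(db, cpu, rate):
    # V2 ComNetByCpuRate(cpu, rate): computers and their net cards.
    sql = """SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, NetCards Net1
             WHERE Com1.cid = Net1.cid AND Com1.cpu = ? AND Net1.rate = ?"""
    return db.execute(sql, (cpu, rate)).fetchall()


db = make_db()
p4_computers = com_by_cpu(db, "P4")
p4_with_54 = com_net_by_cpu_rate(db, "P4", "54")
```

Note that neither function can be called without a cpu value, which is exactly the access limitation that makes "get all Computers" infeasible later in the talk.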
  • 50. Running Example
    • Global schema puts together the Dell and Cisco schemas
    • Resembles the schema of CNET.com portal
    • Column Associations
    • (Computers.cid, NetCards.cid)
    • (NetCards.rate, Routers.rate)
    • (NetCards.standard, Routers.standard)
    [Diagram: Developer and Application on top of the Mediator with the Global Schema; views V1 and V2 exported by Dell, V3 and V4 exported by Cisco]
  • 51. Sophisticated Mediators Make Feasibility Hard to Predict
    • Feasible Queries FQ
    • Equivalent CQ query rewritings using the views
    • Might involve more than one view
    • Order might matter
    • Infeasible query: Get all Computers
    • Feasible query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers
      • The Mediator calls V4 RouByTypeWL(), then V2 ComNetByCpuRate(‘P4’, ‘10’) and ComNetByCpuRate(‘P4’, ‘54’)
    • Result of the feasible query:
    Computers.* (cid, cpu, ram, price) | NetCards.* (cid, rate, standard, interface) | Routers.* (rate, standard, price, type)
    A123, P4, 512, 400                 | A123, 10, .11b, USB                         | 10, .11b, 50, Wireless
    B123, P4, 1024, 550                | B123, 54, .11g, USB                         | 54, .11g, 120, Wireless
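The feasible plan on this slide, where the order of view calls matters, can be sketched as follows. This is an illustrative Python sketch over in-memory lists, not the mediator's actual execution engine; the function names are hypothetical and the rows are the sample rows from the slide.

```python
ROUTERS = [
    {"rate": "10", "standard": ".11b", "price": 50, "type": "Wireless"},
    {"rate": "54", "standard": ".11g", "price": 120, "type": "Wireless"},
]
COMPUTERS = [
    {"cid": "A123", "cpu": "P4", "ram": 512, "price": 400},
    {"cid": "B123", "cpu": "P4", "ram": 1024, "price": 550},
]
NETCARDS = [
    {"cid": "A123", "rate": "10", "standard": ".11b", "interface": "USB"},
    {"cid": "B123", "rate": "54", "standard": ".11g", "interface": "USB"},
]


def rou_by_type_wl():
    # V4: needs no parameters, so it can be called first.
    return [r for r in ROUTERS if r["type"] == "Wireless"]


def com_net_by_cpu_rate(cpu, rate):
    # V2: both parameters are mandatory.
    return [
        (c, n)
        for c in COMPUTERS
        for n in NETCARDS
        if c["cid"] == n["cid"] and c["cpu"] == cpu and n["rate"] == rate
    ]


def p4_with_wireless_routers():
    rows = []
    for rou in rou_by_type_wl():
        # The rate values returned by V4 feed the rate parameter of V2.
        for com, net in com_net_by_cpu_rate("P4", rou["rate"]):
            if net["standard"] == rou["standard"]:  # column association
                rows.append((com["cid"], net["rate"], rou["price"]))
    return rows


plan_result = p4_with_wireless_routers()
```

Calling V2 first is not possible here because its rate parameter is unknown until V4 has returned the wireless routers.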
  • 52. Problem
    • Large number of sources
    • Large number of views
    • Mediator capabilities
    • Developer formulates an application query
    • Is an application query feasible?
    • If not, how do I know which ones are feasible?
    • Previous options:
      • The developer had to browse the view definitions and somehow formulate a feasible query
      • Or formulate queries until a feasible one is found (trial-and-error)
    • No system-provided guidance
  • 53. The CLIDE Solution
    • A query formulation interface, which interactively guides the user toward feasible queries by employing a coloring scheme
    [Diagram: Developer uses CLIDE to build the Application on top of the Mediator with the Global Schema; views V1, V2 from Dell and V3, V4 from Cisco]
  • 54. QBE-Like Interfaces Microsoft SQL-Server
  • 55. CLIDE Interface
    • Table, selection, projection and join actions
    • Color-based suggestions
    • Feasibility Flag
    [Screenshot labels: Table Boxes, Table Alias, Projection Boxes, Selection Boxes, Feasibility Flag]
  • 56. CLIDE Interface
    • Yellow: required action
      • All feasible queries require this action
    • White: optional action
      • Feasible queries can be formulated w/ or w/o these actions
    Snapshot 1
  • 57. CLIDE Interface Snapshot 2
    • Blue: required choice of action
      • At least one feasible (next) query cannot be formulated unless this action is performed
    [Diagram: the Mediator calls V1 ComByCpu(‘P4’); result (cid, cpu, ram, price): (A123, P4, 512, 400), (B123, P4, 1024, 550); projected on (ram, price): (512, 400), (1024, 550)]
  • 58. CLIDE Interface
    • Join Lines:
    • Only yellow and blue are displayed
    • Must appear in Column Associations
    Snapshot 3
  • 59. CLIDE Interface Snapshot 4
  • 60. CLIDE Interface Snapshot 5
    • *: any other constant
    • Red: prohibited action
      • Does not appear in any feasible query
      • Leads to a “Dead End” state
  • 61. CLIDE Interface Snapshot 6
    [Diagram: the Mediator calls V4 RouByTypeWL() and V2 ComNetByCpuRate(‘P4’, rate); intermediate tables Computers.*, NetCards.*, Routers.*; final result with columns ram, price, rate, interface, price]
  • 62. CLIDE Facts
    • Rapid Convergence
      • At every step, yellow and blue actions lead to a feasible query in a minimum number of steps
    • Completeness of Suggestions
      • Every feasible query can be formulated by performing yellow and blue actions at every step
    • Minimality of Suggestions
      • At every step, only a minimal number of actions are suggested, i.e., the ones that are needed to preserve completeness
  • 63. Interaction Graph
    • Nodes are queries
      • One for each q ∈ CQ
    • Edges are actions
      • Table, selection, projection and join actions
    • Green nodes are feasible queries
    • Infinitely big structure
      • All CQ queries
      • All possible combinations of actions formulating them
    [Diagram: interaction graph with Table Actions (Com1, Net1, Rou1), Selection Actions (Com1.cpu=‘P4’), Join Actions (Com1.cid=Net1.cid) and projection actions (Com1.ram, Com1.price) as edge labels]
  • 64. Interaction Graph: Colorable Actions
    • Colorable actions A_C label outgoing edges of the current node
    [Diagram: Current Node with outgoing edges Net1, Rou1, Com2, Com1.cpu=*, Com1.price=*, Com1.ram=*, Com1.cid=*, Com1.cid, Com1.cpu]
  • 65. Interaction Graph: Colors
    • Yellow action α
      • Every path from the current node n to a feasible node contains α
    • Blue action α
      • At least one feasible query cannot be formulated unless this action is performed (minimality)
    • Red action α
      • No path to a feasible node contains α
    [Diagram: Current Node with outgoing edges Net1, Rou1, Com2, Com1.cpu=*, Com1.price=*, Com1.ram=*, Com1.cid=*, Com1.cid, Com1.cpu; deeper edges include Com1.cid=Net1.cid, Com2.cid=Net1.cid, Com2.cpu=‘P4’, Net1.rate=‘54Mbps’, Net1.rate=Rou1.rate]
  • 66. Color Determined By a Finite Set of Feasible Queries
    • Challenge: Infinitely Many Feasible Queries
    • Start by considering the closest feasible queries FQ_C
    • FQ_C is sufficient to color actions in A_C
    • Theorem: Set of Closest Feasible Queries is Finite
    • How far can closest feasible queries FQ_C be?
    • Based on Maximally Contained Queries FQ_MC ?
    [Diagram: current node n; Closest Feasible Queries FQ_C within a Radius; infinitely many feasible queries beyond]
  • 67. Color Algorithm
    • Assuming fixed SELECT clause (projection list)
    • Covered extensively in literature
      • MiniCon, Bucket, InverseRules
    • FQ MC is finite
    [Diagram: Maximally Contained Queries FQ_MC, with example queries labeled Maximally Contained or Not Maximally Contained. Q1: Get all Computers. Q2: Get all Computers with a given cpu. Q3: Get all Computers with a given cpu & ram. Q4: Get all Computers with a given ram]
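The idea of keeping only the maximally contained feasible queries can be illustrated on a deliberately simplified query model. The sketch below assumes each query is a selection-only query over Computers, represented by the set of attributes it binds, so that binding more attributes yields a contained query; it does not reproduce MiniCon, Bucket or InverseRules, and all names are hypothetical.

```python
def contained_in(q_i, q_j):
    # In this simplified model, q_i is contained in q_j when q_i binds
    # every attribute that q_j binds (and possibly more).
    return q_j <= q_i


def maximally_contained(feasible):
    """Keep the feasible queries not strictly contained in another one."""
    return [
        q
        for q in feasible
        if not any(q != other and contained_in(q, other) for other in feasible)
    ]


# Hypothetical feasible queries: computers for a given cpu, and
# computers for a given cpu & ram.
by_cpu = frozenset({"cpu"})
by_cpu_and_ram = frozenset({"cpu", "ram"})

fq_mc = maximally_contained([by_cpu, by_cpu_and_ram])
```

The cpu & ram query is dropped because it is contained in the cpu-only query, so `fq_mc` keeps a single element.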
  • 68. Color Algorithm
    • Compute maximally contained queries FQ_MC
    • The radius p_L is the longest path to a node n' such that q(n') is in FQ_MC
    • All FQ_C queries are reachable via a path of length p ≤ p_L
    [Diagram: current node n; Closest Feasible Queries FQ_C; Maximally Contained Queries FQ_MC at radius p_L]
  • 69. Color Algorithm
    • More on Closest Feasible Queries
    • Theorem: All queries in FQ_MC are in FQ_C
    • But not all queries in FQ_C are in FQ_MC
    [Diagram: current node n; Closest Feasible Queries FQ_C; Maximally Contained Feasible Queries FQ_MC; more feasible nodes]
  • 70. Color Algorithm
    • More on Closest Feasible Queries
    • Naïve Approach
      • Start from n and explore paths up to length p_L
    [Diagram: current node n; Closest Feasible Queries FQ_C; Maximally Contained Feasible Queries FQ_MC]
  • 71. Color Algorithm
    • Collapse Aliases to compute FQ_C FQ_MC
    • Check satisfiability
    [Diagram: Collapse Aliases; current node n; Closest Feasible Queries FQ_C; Maximally Contained Feasible Queries FQ_MC]
  • 72. Color Algorithm
    • Coloring Non-Projection Actions
      • No interaction graph materialization
      • Use of containment mapping from current query to the closest feasible ones
      • An action α is colored
        • Yellow, if α is mapped into all queries in FQ_C
        • Red, if α is not mapped into any query in FQ_C
        • Blue, if α is mapped into at least one query q_F in FQ_C, no other action in A_P is mapped into q_F, and α is neither yellow nor red
    • Coloring Projection Actions
      • Never colored yellow
      • Can be colored blue only if
        • the current query is feasible
        • it is not colored red
      • Which ones are red?
        • Bring all projection atoms from views such that feasibility is preserved
        • If action α is not mapped into any query in FQ_C, then α is red
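The yellow and red rules for non-projection actions can be sketched in a few lines. This Python sketch makes two loud simplifications: it replaces the containment mapping by plain set membership of an action in a query, and it does not decide between blue and white, because the slide's blue rule depends on the set A_P, which is not defined here. The function name, the action names and the sample queries are hypothetical.

```python
def color_actions(candidate_actions, closest_feasible):
    """Color candidate actions against the closest feasible queries FQ_C.

    candidate_actions: iterable of action labels
    closest_feasible:  list of sets, each holding the actions of one
                       closest feasible query
    """
    colors = {}
    for action in candidate_actions:
        hits = sum(1 for query in closest_feasible if action in query)
        if closest_feasible and hits == len(closest_feasible):
            colors[action] = "yellow"  # appears in all queries in FQ_C
        elif hits == 0:
            colors[action] = "red"  # appears in no query in FQ_C
        else:
            colors[action] = "blue-or-white"  # needs the A_P test to decide
    return colors


# Hypothetical closest feasible queries over abstract actions a1..a3.
fq_c = [{"a1", "a2"}, {"a1"}]
colors = color_actions(["a1", "a2", "a3"], fq_c)
```

With this input a1 comes out yellow, a3 red, and a2 is left undecided between blue and white.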
  • 73. CLIDE Implementation
    [Diagram: Front-End (Developer; sends Action and Current Query, receives Colored Actions); Back-End for Parameterized Views (inputs: Schemas, Views, Column Associations; components: MiniCon, Containment Test, Collapse Aliases, Color Actions; intermediate results: Maximally Contained Queries, Optimal Maximally Contained Queries, Closest Feasible Queries); Other Back-End]
    • MiniCon
    • Outputs redundant and non-minimal queries
    • Affects CLIDE’s rapid convergence and minimality properties
    • Containment Test
    • Well-known NP-complete problem
    • Polynomial when query is acyclic
    • Collapse Aliases / Color Actions
    • Reuse containment mappings created by MiniCon
  • 74. CLIDE Performance
    • Queries
      • A-span = 7, B-span = 4, Selections = 4, 6, 8, 10
    • Schema
      • Chains of Stars
      [Diagram: relations A, B_1 … B_K, C_1 … C_L, B_i, C_i]
    • Views
      [Diagram: views over A, B_1 … B_K, C_1 … C_L, B_i1 … B_iM, C_i1 … C_iM]