Building Data Integration Queries By Demonstration


  1. 1. Building Data Integration Queries by Demonstration. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. IUI ’07. Reporter: Chao-Ting Ting
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Approach </li></ul><ul><ul><li>Building a single column table </li></ul></ul><ul><ul><ul><li>Set intersection constraint </li></ul></ul></ul><ul><ul><li>Building a multiple column table </li></ul></ul><ul><ul><ul><li>Reachable attributes </li></ul></ul></ul><ul><ul><ul><li>Partial plans </li></ul></ul></ul><ul><li>Experiment </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction <ul><li>With the growth of the Internet, most information can now be found online </li></ul><ul><li>The information we need is usually scattered across multiple websites </li></ul><ul><ul><li>Accessing, combining, filtering, and making sense of that data manually is very time-consuming </li></ul></ul>
  4. 4. Example <ul><li>A particular restaurant may have </li></ul><ul><ul><li>rave reviews on a restaurant review website </li></ul></ul><ul><ul><li>a ‘C’ rating on a government health inspection website </li></ul></ul><ul><li>A health-conscious person would need information from both websites </li></ul>
  5. 5. Limitation <ul><li>Computer-literate users have only two choices </li></ul><ul><ul><li>finding the information on their own by browsing websites </li></ul></ul><ul><ul><li>relying on data integration providers to supply web interfaces to the integrated information </li></ul></ul>
  6. 6. Goals <ul><li>Karma </li></ul><ul><ul><li>A user interface where any computer literate user could easily build his/her own mashups </li></ul></ul><ul><ul><li>A service that integrates information from multiple data sources </li></ul></ul>
  7. 7. Problems to solve <ul><li>Data retrieval </li></ul><ul><li>Data cleaning and schema matching </li></ul><ul><li>Data integration </li></ul><ul><li>Filtering and visualization </li></ul>
  8. 8. In this paper <ul><li>Focus on the data integration aspect </li></ul><ul><ul><li>Translate partially filled rows into queries & retrieve data from multiple sources </li></ul></ul><ul><ul><li>Use constraints and partial plans to narrow the possible attributes/values </li></ul></ul><ul><ul><li>Generate consistent queries that always return data </li></ul></ul>
  9. 9. Hypothesis <ul><li>The data is already clean and aligned </li></ul><ul><ul><li>Misspellings are fixed and format inconsistencies are resolved </li></ul></ul><ul><ul><li>Attribute names are unified </li></ul></ul>
  10. 10. A snapshot of Karma
  11. 11. Approach - Index tables and data source definition <ul><li>S : the set of all available web sources </li></ul><ul><li>A : a lookup hashtable mapping an attribute to its sources, a -> { s }, where 1) {s} ⊆ S and 2) ∀ s ∈ {s} : a ∈ att(s) </li></ul><ul><li>V : a lookup hashtable mapping a value to its (attribute, source) pairs, v -> {(a,s)}, where ∀ (a,s): v ∈ val(a,s) ∧ a ∈ att(s) </li></ul><ul><ul><li>att(s) : a procedure that returns the set of attributes of the source s </li></ul></ul><ul><ul><li>val(a,s) : a procedure that returns the set of values associated with the attribute a in the source s </li></ul></ul>
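The two index tables can be sketched in a few lines of Python. This is only an illustration under assumptions: the `SOURCES` dictionary, its rows, and the `build_indexes` helper are hypothetical stand-ins, not part of Karma; only `att`, `val`, `A`, and `V` follow the definitions on the slide.

```python
from collections import defaultdict

# Hypothetical toy sources: source name -> list of rows (attribute -> value).
# The rows are made up for illustration only.
SOURCES = {
    "Zagat": [
        {"restaurant name": "Chez Demo", "cuisine": "French"},
        {"restaurant name": "Pho Demo", "cuisine": "Vietnamese"},
    ],
    "EU_country_info": [
        {"country name": "France", "language": "French"},
    ],
}

def att(s):
    """Return the set of attributes of source s."""
    return {a for row in SOURCES[s] for a in row}

def val(a, s):
    """Return the set of values of attribute a in source s."""
    return {row[a] for row in SOURCES[s] if a in row}

def build_indexes(sources):
    """Build A (attribute -> sources) and V (value -> (attribute, source) pairs)."""
    A = defaultdict(set)
    V = defaultdict(set)
    for s in sources:
        for a in att(s):
            A[a].add(s)
            for v in val(a, s):
                V[v].add((a, s))
    return A, V

A, V = build_indexes(SOURCES)
```

With these toy rows, looking up `V["French"]` yields both `("cuisine", "Zagat")` and `("language", "EU_country_info")`, which is the kind of ambiguity the later constraints resolve.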
  12. 12. The data sources in the scenario <ul><li>Zagat ( $restaurant name, $cuisine, $address, $city, $state, $zipcode, review rating ) </li></ul><ul><li>Asian_food_review ( $restaurant name, $cuisine, $price, $address, $city, $state, $zipcode, review rating ) </li></ul><ul><li>LA_health_rating ( $restaurant name, $address, $city, $state, $zipcode, inspection date, health rating ) </li></ul><ul><li>EU_country_info ( $country name, language, population, gdp, date, location ) </li></ul>(Attributes prefixed with $ in the data source model act as primary keys.)
  13. 13. Building a single column table - start by entering an attribute <ul><li>{ v } = val( a,s ) where s ∈ {s} </li></ul><ul><ul><li>(SELECT Cuisine FROM Zagat) </li></ul></ul><ul><ul><li>UNION (SELECT Cuisine FROM Asian_food_review) </li></ul></ul>
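As a rough sketch of how the suggested values for a chosen attribute could be gathered, the snippet below builds the UNION query from the list of sources that carry the attribute and runs it on an in-memory SQLite database. The `union_query` helper and the sample rows are assumptions made for illustration; SQLite is used only as a convenient stand-in for the web sources, and the parentheses around each SELECT are dropped because SQLite does not accept them in a compound query.

```python
import sqlite3

def union_query(attribute, sources):
    """Build one SELECT per source and combine them with UNION.

    Names are interpolated directly, so they must come from a trusted
    catalogue such as the attribute index, never from raw user input.
    """
    return " UNION ".join(f"SELECT {attribute} FROM {s}" for s in sources)

# Made-up rows standing in for the two sources that have a Cuisine attribute.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Zagat (Cuisine TEXT);
CREATE TABLE Asian_food_review (Cuisine TEXT);
INSERT INTO Zagat VALUES ('French'), ('Vietnamese');
INSERT INTO Asian_food_review VALUES ('Vietnamese'), ('Thai');
""")

sql = union_query("Cuisine", ["Zagat", "Asian_food_review"])
suggestions = sorted(row[0] for row in conn.execute(sql))
```

UNION removes duplicates, so a value that appears in both sources is suggested only once.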
  14. 14. The set intersection constraint Cuisine
  15. 15. The set intersection constraint <ul><li>{x} = Set intersection({a}) over all the value rows </li></ul><ul><li>To enter a value, the possible value set is: </li></ul><ul><ul><li>{v} = val( a,s ) where a ∈ {x} ∧ s is any source where att(s) ∩ {x} ≠ {} </li></ul></ul>
  16. 16. Example : start by entering a value & use the set intersection constraint <ul><li>Row 1, value French: {(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)} </li></ul><ul><ul><li>(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review) UNION (SELECT Language FROM EU_country_info) </li></ul></ul><ul><li>Row 2, value Vietnamese: {(a,s)} = {(Cuisine, Zagat), (Cuisine, Asian_food_review)} </li></ul><ul><li>Set intersection of the attributes: Cuisine </li></ul><ul><ul><li>(SELECT Cuisine FROM Zagat) UNION (SELECT Cuisine FROM Asian_food_review) </li></ul></ul>
  17. 17. Building a multiple column table <ul><li>Enter attribute </li></ul><ul><li>Enter value when the user has selected the attribute </li></ul><ul><li>Enter value when the user hasn’t selected the attribute </li></ul>
  18. 18. Reachable attributes <ul><li>Current table: column “Cuisine”, row 1 = French </li></ul><ul><li>{(a,s)} = {(Cuisine, Zagat), (Language, EU_country_info)} </li></ul><ul><li>Reachable attributes of row 1 : { restaurant name, cuisine, address, city, state, zipcode, review rating, country name, language, population, gdp, date, location, inspection date, health rating } </li></ul><ul><ul><li>inspection date and health rating are obtained from the source “LA_health_rating” </li></ul></ul>
  19. 19. Constraint for entering a new attribute <ul><li>The set intersection of the “reachable” attributes of all partially filled rows </li></ul><ul><li>Each attribute in the set intersection of the “reachable” attribute sets must produce a non-empty suggested value set </li></ul><ul><ul><li>Execute multiple queries through partial plans </li></ul></ul>
  20. 20. Partial plan tree <ul><li>Node colors: blue = value node, white = place holder node, black = hidden node </li></ul><ul><li>Value node: f( a,s,v ) = (cuisine, Zagat, French) </li></ul><ul><li>Place holder node: a( a,s,v ) = (restaurant_name, Zagat, _PLACE_HOLDER) </li></ul>
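The French and Vietnamese example can be traced with a small Python sketch of the set intersection constraint. The `A` and `V` dictionaries below hard-code only the entries named on the slides, and the two helper functions are hypothetical names chosen for this illustration.

```python
# Index entries taken from the example; everything else is omitted.
V = {
    "French": {("Cuisine", "Zagat"), ("Language", "EU_country_info")},
    "Vietnamese": {("Cuisine", "Zagat"), ("Cuisine", "Asian_food_review")},
}
A = {
    "Cuisine": {"Zagat", "Asian_food_review"},
    "Language": {"EU_country_info"},
}

def candidate_attributes(entered_values):
    """{x}: attributes consistent with every value entered so far."""
    attribute_sets = [{a for a, _ in V[v]} for v in entered_values]
    return set.intersection(*attribute_sets) if attribute_sets else set()

def suggestion_columns(entered_values):
    """(attribute, source) pairs whose values are offered for the next row."""
    x = candidate_attributes(entered_values)
    return sorted((a, s) for a in x for s in A[a])

after_french = suggestion_columns(["French"])
after_both = suggestion_columns(["French", "Vietnamese"])
```

After "French" alone, three columns feed the suggestions (Cuisine from both restaurant sources and Language from EU_country_info). Once "Vietnamese" is entered, Language drops out and only the two Cuisine columns remain, matching the two UNION queries in the example.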
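One way to compute a reachable attribute set is sketched below. The joinability rule is an assumption, not something the slides state: a source is treated as reachable when all of its primary-key attributes are already available. Under that assumption the sketch reproduces the set listed for row 1. The `SOURCES` layout and the `reachable_attributes` helper are hypothetical.

```python
# Source model from the scenario: primary-key attributes and all attributes.
SOURCES = {
    "Zagat": {
        "keys": {"restaurant name", "cuisine", "address", "city", "state", "zipcode"},
        "attrs": {"restaurant name", "cuisine", "address", "city", "state",
                  "zipcode", "review rating"},
    },
    "Asian_food_review": {
        "keys": {"restaurant name", "cuisine", "price", "address", "city",
                 "state", "zipcode"},
        "attrs": {"restaurant name", "cuisine", "price", "address", "city",
                  "state", "zipcode", "review rating"},
    },
    "LA_health_rating": {
        "keys": {"restaurant name", "address", "city", "state", "zipcode"},
        "attrs": {"restaurant name", "address", "city", "state", "zipcode",
                  "inspection date", "health rating"},
    },
    "EU_country_info": {
        "keys": {"country name"},
        "attrs": {"country name", "language", "population", "gdp", "date",
                  "location"},
    },
}

def reachable_attributes(start_sources):
    """Union of attributes reachable from each starting source.

    Assumed rule: a source can be joined in once every one of its
    primary-key attributes is among the attributes gathered so far.
    """
    reached = set()
    for start in start_sources:
        attrs = set(SOURCES[start]["attrs"])
        changed = True
        while changed:
            changed = False
            for src in SOURCES.values():
                if src["keys"] <= attrs and not src["attrs"] <= attrs:
                    attrs |= src["attrs"]
                    changed = True
        reached |= attrs
    return reached

# "French" occurs in Zagat (cuisine) and EU_country_info (language).
row1 = reachable_attributes(["Zagat", "EU_country_info"])
```

Here `price` stays unreachable because Asian_food_review needs it as a key and no starting source provides it, while `inspection date` and `health rating` come in through LA_health_rating.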
  21. 21. Tree evaluation <ul><li>A value node implies an equality (“=”) condition on its value </li></ul><ul><li>A place holder node can only be included in the SELECT part of the query </li></ul><ul><li>A node with multiple parents implies a join condition over its parents </li></ul>
  22. 22. Example (find the possible candidate set for the attribute “restaurant name”)
  23. 23. Tree Construction - points for attention <ul><li>Row 1: French </li></ul><ul><ul><li>root -> f (1st node), f( a,s,v ) = (cuisine, Zagat, French) </li></ul></ul><ul><ul><li>root -> f (1st node), f( a,s,v ) = (language, EU_country_info, French) </li></ul></ul><ul><li>Row 2: Vietnamese </li></ul><ul><ul><li>root -> f (1st node), f( a,s,v ) = (cuisine, Zagat, Vietnamese) </li></ul></ul><ul><ul><li>root -> f (1st node), f( a,s,v ) = (cuisine, Asian_food_review, Vietnamese) </li></ul></ul>
  24. 24. Construction Step 1 <ul><li>Compute the set intersection of the reachable attributes of all the rows & keep only reachable attributes that return non-empty suggested value sets </li></ul><ul><li>Starting tree: root -> f (1st node), f( a,s,v ) = (cuisine, Zagat, French) </li></ul>
  25. 25. Construction Step 2 <ul><li>User selects the attribute & the attribute is in the same data source </li></ul><ul><ul><li>add the new node as the child of the root </li></ul></ul><ul><li>User selects the attribute & the attribute is in a different source </li></ul><ul><ul><li>create the necessary hidden nodes according to the primary key constraints and set the new node as the child of those hidden nodes </li></ul></ul><ul><li>User decides to enter a value </li></ul><ul><ul><li>Permute the partial plan over the attribute set from Step 1 </li></ul></ul>
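The three evaluation rules can be sketched as a small translator from plan nodes to SQL. This is a simplified illustration under assumptions, not the paper's algorithm: the `Node` class and `plan_to_sql` function are hypothetical, the join rule is read as "each parent from another source contributes an equality on the shared attribute name", and the table rows are made up.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    attr: str
    source: str
    kind: str                      # "value", "placeholder", or "hidden"
    value: Optional[str] = None
    parents: List["Node"] = field(default_factory=list)

def plan_to_sql(nodes):
    """Translate plan nodes into a parameterized SQL query."""
    select, sources, where, params = [], [], [], []
    for n in nodes:
        if n.source not in sources:
            sources.append(n.source)
        if n.kind == "value":          # rule 1: equality condition
            where.append(f"{n.source}.{n.attr} = ?")
            params.append(n.value)
        elif n.kind == "placeholder":  # rule 2: appears only in SELECT
            select.append(f"{n.source}.{n.attr}")
        for p in n.parents:            # rule 3 (assumed reading): join via parents
            if p.source != n.source:
                where.append(f"{p.source}.{p.attr} = {n.source}.{p.attr}")
    sql = f"SELECT DISTINCT {', '.join(select)} FROM {', '.join(sources)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params

f = Node("cuisine", "Zagat", "value", "French")
name = Node("restaurant_name", "Zagat", "placeholder", parents=[f])
zipc = Node("zipcode", "Zagat", "hidden", parents=[f])
rating = Node("health_rating", "LA_health_rating", "placeholder",
              parents=[name, zipc])
sql, params = plan_to_sql([f, name, zipc, rating])

# Made-up rows to check that the generated query runs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Zagat (restaurant_name TEXT, cuisine TEXT, zipcode TEXT);
CREATE TABLE LA_health_rating (restaurant_name TEXT, zipcode TEXT, health_rating TEXT);
INSERT INTO Zagat VALUES ('Chez Demo', 'French', '90001'), ('Pho Demo', 'Vietnamese', '90002');
INSERT INTO LA_health_rating VALUES ('Chez Demo', '90001', 'A'), ('Pho Demo', '90002', 'C');
""")
rows = conn.execute(sql, params).fetchall()
```

The hidden `zipcode` node never shows up in the SELECT list; it only contributes a join condition, which is the role hidden nodes play when the user has not asked for the key columns.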
  26. 26. Experiment <ul><li>Retrieving the restaurant health rating </li></ul><ul><ul><li>only retrieves data from one data source (LA_health_rating) </li></ul></ul><ul><li>Retrieving the restaurant information with reviews and health ratings </li></ul><ul><ul><li>includes the join between two data sources </li></ul></ul><ul><ul><li>limits the number of sources so there would be no union operation </li></ul></ul><ul><li>Retrieving the restaurant information with reviews and health ratings </li></ul><ul><ul><li>includes union and join as discussed in the previous section </li></ul></ul>
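  27. 27. Experiment <ul><li>Baseline : Microsoft Access, which integrates the QBE (Query by Example) approach into its query design view </li></ul><ul><li>Cost units </li></ul><ul><ul><li>Typing in a value or selecting a value = 1t unit </li></ul></ul><ul><ul><li>Selecting a data source to use = 1d unit </li></ul></ul><ul><ul><li>Selecting an attribute = 1a unit </li></ul></ul>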
  28. 28. Conclusion <ul><li>An approach to data integration with the following characteristics </li></ul><ul><ul><li>does not require the user to know details about data sources or existing values </li></ul></ul><ul><ul><li>suggests valid possible values to the user </li></ul></ul><ul><ul><li>creates consistent queries that always return values </li></ul></ul>
