# A Graph-Based Approach to Learn Semantic Descriptions of Data Sources

Published in: Education, Technology

• $M$ = set of known source models; $k$ = total number of links in $M$ ($k \ge |E_G|$). For a link $e$:
  – $c_1(e)$ = number of links in $M$ whose <label, source, target> match $e$
  – $c_2(e)$ = number of links in $M$ whose <label, source> match $e$
  – $c_3(e)$ = number of links in $M$ whose <label, target> match $e$
  – $c_4(e)$ = number of links in $M$ whose <label> matches $e$
  – $w_e = \min\big(k - \varepsilon/[k - c_1(e)],\; k - \varepsilon/[k\,(k - c_2(e))],\; k - \varepsilon/[k\,(k - c_3(e))],\; k - \varepsilon/[k \cdot k\,(k - c_4(e))]\big)$
• If the number of mappings from semantic types to nodes in the graph is too large, keep only the top k most promising mappings, scoring each mapping by its coherence (how many of its nodes lie in the same pattern).
• When there is a large number of mappings, we sort them by coherence: how many nodes of each mapping belong to the same pattern?
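
The pruning step in these notes can be sketched as follows. This is only an illustration under stated assumptions: the notes do not define coherence precisely, so here it is taken to be the largest number of nodes in a mapping that share one pattern (one supporting model). The node identifiers such as `City.name#1` and the helper names `coherence` and `top_k_mappings` are hypothetical.

```python
from collections import Counter
from itertools import product


def coherence(mapping, node_patterns):
    """Largest number of nodes in the mapping that share a single pattern (assumed definition)."""
    counts = Counter(p for node in mapping for p in node_patterns.get(node, ()))
    return max(counts.values(), default=0)


def top_k_mappings(candidates_per_type, node_patterns, k):
    """Enumerate every mapping (one node per semantic type) and keep the k most coherent."""
    mappings = product(*candidates_per_type)
    ranked = sorted(mappings, key=lambda m: coherence(m, node_patterns), reverse=True)
    return ranked[:k]


# Hypothetical graph nodes and the patterns (known models) each node belongs to.
node_patterns = {
    "City.name#1": {"s1", "s2"},
    "State.name#1": {"s1", "s2"},
    "City.name#2": {"s3"},
    "State.name#2": {"s3"},
    "Place.postalCode": set(),
}
candidates_per_type = [
    ["Place.postalCode"],
    ["City.name#1", "City.name#2"],
    ["State.name#1", "State.name#2"],
]
best = top_k_mappings(candidates_per_type, node_patterns, k=2)
```

With this toy input, the two retained mappings are the ones whose city and state nodes come from the same component.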
### A Graph-Based Approach to Learn Semantic Descriptions of Data Sources

1. A Graph-Based Approach to Learn Semantic Descriptions of Data Sources. Mohsen Taheriyan, Craig Knoblock, Pedro Szekely, Jose Luis Ambite
2. Problem: How to learn semantic descriptions?
3. First, what is a semantic description?
4. Semantic Description
   • Describing the source in terms of the concepts and relationships defined by the domain ontology
   • [Figure: domain ontology. Classes: Person, Organization, Event, Place, City, State. Properties: bornIn, livesIn, worksFor, ceo, organizer, location, nearby, isPartOf, state, name, birthdate, phone, postalCode, title, startDate. Legend: object property, data property, subClassOf]
   • Source:

   | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
   | --- | --- | --- | --- | --- |
   | Bill Gates | Oct 1955 | Microsoft | Seattle | WA |
   | Mark Zuckerberg | May 1984 | Facebook | White Plains | NY |
   | Larry Page | Mar 1973 | Google | East Lansing | MI |

5. Semantic Types

   | | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
   | --- | --- | --- | --- | --- | --- |
   | Semantic type | name of Person | birthdate of Person | name of Organization | name of City | name of State |
   | | Bill Gates | Oct 1955 | Microsoft | Seattle | WA |
   | | Mark Zuckerberg | May 1984 | Facebook | White Plains | NY |
   | | Larry Page | Mar 1973 | Google | East Lansing | MI |

6. Relationships
   • [Figure: semantic model over the five columns. Person bornIn City, Person worksFor Organization, City state State; name and birthdate link the classes to Columns 1 to 5]
   • This semantic model is converted to a semantic description in R2RML
7. Previous approach to learn semantic descriptions
8. Karma (http://www.isi.edu/integration/karma, @KarmaSemWeb)
   • Inputs: sample data and the domain ontology
   • Learn semantic types (CRF)
   • Extract relationships (Steiner tree)
   • Output: semantic model
9. Refining the Model: Initial Model [Figure: initial model]
10. Refining the Model: Refined Model [Figure: refined model]
    • Previous work does not learn the changes the user makes to relationships
    • The user has to go through the refinement process each time
11. Our new approach to learn semantic descriptions
12. Key Idea
    • Sources in the same domain often have similar data
    • Exploit knowledge of existing source models
    • Leverage relationships in known source models to hypothesize relationships for new sources
13. Approach
    • Inputs: known source models (S1, S2, …, Sn), the domain ontology, and a new source
    • Steps: Learn Semantic Types (CRF), Construct Graph G, Generate Candidate Models, Rank Results
14. Example
    • Domain ontology
    • Known source models:
      – S1 = personalInfo(name, birthdate, city, state, workplace): Person worksFor Organization, Person bornIn City, City state State
      – S2 = getCities(state, city): City state State
      – S3 = businessInfo(company, ceo, city, state): Organization ceo Person, Organization location City, City isPartOf State
    • New source: S4 = postalCodeLookup(zipcode, city, state)
15. Build a Graph from Known Models
    • Create a component in G for each known source model
      – Only add if the model is not a subgraph of an existing component
    • Annotate links with the list of supporting models
    • [Figure: Component 1, built from S1 = personalInfo. Links Person worksFor Organization, Person bornIn City, City state State, and the data links to Person.name, Person.birthdate, Org.name, City.name, State.name, each annotated {s1}]
16. Build a Graph from Known Models
    • [Figure: after adding S2 = getCities, the links City state State, City.name, and State.name in Component 1 are annotated {s1,s2}; the remaining links stay {s1}]
17. Build a Graph from Known Models
    • [Figure: Component 2, built from S3 = businessInfo. Links Organization location City, Organization ceo Person, City isPartOf State, and the data links to Org.name, City.name, State.name, Person.name, each annotated {s3}]
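
The component construction on slides 15 to 17 can be sketched roughly as below. This is a simplified illustration, not the authors' implementation: it assumes each model is a set of (source, label, target) links, treats "subgraph" as plain set inclusion over those links, and only handles a new model that is contained in an existing component (not the reverse). The function names and the `Organization.name` spelling are hypothetical; the link lists follow the slide example.

```python
def add_model(components, model_id, links):
    """Add one known source model to the list of graph components.

    Each component is a dict mapping a (source, label, target) link
    to the set of model ids that support it.
    """
    links = set(links)
    for component in components:
        if links.issubset(component):          # model is covered by this component
            for link in links:
                component[link].add(model_id)  # record the extra supporting model
            return
    components.append({link: {model_id} for link in links})


def name_links(*classes):
    """Data-property links such as (City, name, City.name)."""
    return [(c, "name", c + ".name") for c in classes]


s1 = [("Person", "worksFor", "Organization"), ("Person", "bornIn", "City"),
      ("City", "state", "State"), ("Person", "birthdate", "Person.birthdate"),
      *name_links("Person", "Organization", "City", "State")]
s2 = [("City", "state", "State"), *name_links("City", "State")]
s3 = [("Organization", "location", "City"), ("Organization", "ceo", "Person"),
      ("City", "isPartOf", "State"),
      *name_links("Organization", "City", "State", "Person")]

components = []
for model_id, links in [("s1", s1), ("s2", s2), ("s3", s3)]:
    add_model(components, model_id, links)
```

Under these assumptions the three models yield two components, and the links shared by S1 and S2 carry the annotation `{"s1", "s2"}`, as in the figures.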
18. Build a Graph from Known Models
    • Connect graph components using all paths inferred from the ontology
    • [Figure: Components 1 and 2 connected by ontology links (worksFor, ceo, organizer, location, isPartOf), which bring in the additional class nodes Event and Place]
19. Build a Graph from Known Models
    • Assign a low weight $\varepsilon$ to links within a component (black links)
    • Weight the other links (green links) according to how many links in the known models match them
    • $M$ = known source models
    • $W_{max}$ = number of links in $M$ ($\ge |E_G|$) = 18
    • $c_1(e)$ = number of links in $M$ whose <label, source, target> match $e$
    • $c_2(e)$ = number of links in $M$ whose <label> matches $e$
    • $w_e = \min\left(W_{max} - c_1(e),\; W_{max} - c_2(e)/W_{max}\right)$
    • [Figure: graph G with weighted ontology links; the weights shown are 17, 17.94, and 18]
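
As a rough check of the weighting formula on slide 19, the sketch below recomputes the three weight values that appear in the figure (17, 17.94, 18). The link lists follow the slide example (18 links in total); the function name `link_weight`, the `Organization.name` spelling, and the three example links, including their directions, are assumptions made for illustration.

```python
def link_weight(link, known_links, w_max):
    """w_e = min(W_max - c1(e), W_max - c2(e) / W_max) for a link between components."""
    _, label, _ = link
    c1 = sum(1 for known in known_links if known == link)       # full <label, source, target> match
    c2 = sum(1 for known in known_links if known[1] == label)   # label-only match
    return min(w_max - c1, w_max - c2 / w_max)


def name_links(*classes):
    return [(c, "name", c + ".name") for c in classes]


s1 = [("Person", "worksFor", "Organization"), ("Person", "bornIn", "City"),
      ("City", "state", "State"), ("Person", "birthdate", "Person.birthdate"),
      *name_links("Person", "Organization", "City", "State")]
s2 = [("City", "state", "State"), *name_links("City", "State")]
s3 = [("Organization", "location", "City"), ("Organization", "ceo", "Person"),
      ("City", "isPartOf", "State"),
      *name_links("Organization", "City", "State", "Person")]

known_links = s1 + s2 + s3
w_max = len(known_links)  # 18 links in the known models

w_full_match = link_weight(("Person", "worksFor", "Organization"), known_links, w_max)
w_label_match = link_weight(("Event", "location", "Place"), known_links, w_max)
w_no_match = link_weight(("Event", "organizer", "Person"), known_links, w_max)
```

A link that fully matches one known link gets 18 - 1 = 17, a link that matches only on its label gets 18 - 1/18 ≈ 17.94, and a link with no match keeps the maximum weight of 18.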
20. Learn Semantic Types (Previous Work)
    • A CRF-based model assigns a semantic type to each column from its data
    • A semantic type is either
      – an ontology class, or
      – a data property plus its domain
    • Example: the columns of S4 = postalCodeLookup(zipcode, city, state) are assigned Place.postalCode, City.name, and State.name
21. Generate Candidate Models
    • Map learned semantic types to nodes in graph G
      – There might be multiple mappings
    • Compute the Steiner tree (minimal tree) for each mapping
    • [Figure: graph G with the semantic types of S4 = postalCodeLookup(zipcode, city, state): Place.postalCode, City.name, State.name]
22. Generate Candidate Models: Mapping 1 [Figure: graph G with Mapping 1 for S4]
23. Generate Candidate Models: Mapping 1 [Figure: graph G with Mapping 1 for S4]
24. Generate Candidate Models: Mapping 2 [Figure: graph G with Mapping 2 for S4]
25. Generate Candidate Models: Mapping 2 [Figure: graph G with Mapping 2 for S4]
26. Generate Candidate Models: Mapping 3 [Figure: graph G with Mapping 3 for S4]
27. Generate Candidate Models: Mapping 3 [Figure: graph G with Mapping 3 for S4]
28. Generate Candidate Models: Mapping 4 [Figure: graph G with Mapping 4 for S4]
29. Generate Candidate Models: Mapping 4 [Figure: graph G with Mapping 4 for S4]
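
The "compute a Steiner tree for each mapping" step can be illustrated with a small standard-library sketch. The slides do not say which Steiner tree algorithm is used, so this swaps in a simple shortest-path heuristic (grow the tree by repeatedly attaching the nearest remaining terminal), which gives an approximation rather than a guaranteed minimum. The toy graph, the value `EPS = 0.01`, and the weight on the postalCode link are assumptions.

```python
import heapq


def steiner_tree(edges, terminals):
    """Approximate Steiner tree over an undirected weighted graph.

    edges: iterable of (u, v, weight); terminals: nodes that must be connected.
    Returns (tree_edges, total_cost).
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    tree_nodes, tree_edges = {terminals[0]}, []
    remaining = set(terminals[1:])
    while remaining:
        # Multi-source Dijkstra from every node already in the tree.
        dist = {n: 0.0 for n in tree_nodes}
        prev = {}
        heap = [(0.0, n) for n in tree_nodes]
        heapq.heapify(heap)
        target = None
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue
            if u in remaining:          # nearest terminal not yet in the tree
                target = u
                break
            for v, w in adj[u].items():
                if d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        if target is None:
            raise ValueError("terminals are not connected")
        remaining.discard(target)
        # Walk back from the terminal until the existing tree is reached.
        path_nodes, node = [], target
        while node not in tree_nodes:
            tree_edges.append((prev[node], node, adj[prev[node]][node]))
            path_nodes.append(node)
            node = prev[node]
        tree_nodes.update(path_nodes)
    return tree_edges, sum(w for _, _, w in tree_edges)


EPS = 0.01  # assumed low weight for links inside a component
edges = [
    ("City", "City.name", EPS), ("State", "State.name", EPS), ("City", "State", EPS),
    ("Place", "Place.postalCode", 18),          # hypothetical weight
    ("City", "Place", 17.94), ("State", "Place", 17.94),
]
tree_edges, cost = steiner_tree(edges, ["Place.postalCode", "City.name", "State.name"])
```

On this toy graph the tree uses five links: the postalCode link, one ontology link into Place, and three low-weight links inside the component.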
30. Rank Source Models
    • Rank the candidates based on:
      – Cost: sum of the weights
      – Coherence: prefer the models with a higher number of supporting models
    • [Figure: the four candidate models over Place, City, and State, with links annotated {s1,s2} or {s3}, in ranked order. Rank 1: Candidate 1; Rank 2: Candidate 4; Rank 3: Candidate 2; Rank 3: Candidate 3]
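
One possible way to combine the two ranking criteria is sketched below. The slide names cost and coherence but does not say how they are combined, so the ordering here (lower cost first, higher coherence as a tie-breaker) is an assumption, as is measuring coherence by the largest number of links supported by one model. The candidate data is hypothetical and is not the data from the figure.

```python
from collections import Counter


def rank_candidates(candidates):
    """candidates: name -> list of (weight, supporting_models) pairs, one per link."""
    def sort_key(item):
        _, links = item
        cost = sum(weight for weight, _ in links)
        support = Counter(m for _, models in links for m in models)
        coherence = max(support.values(), default=0)
        # Assumed combination: cheapest first, then most coherent.
        return (round(cost, 6), -coherence)

    return [name for name, _ in sorted(candidates.items(), key=sort_key)]


# Hypothetical candidates: each link carries its weight and supporting models.
candidates = {
    "C": [(0.01, {"s3"}), (17.94, set()), (18, set())],
    "B": [(0.01, {"s3"}), (0.01, {"s1"}), (0.01, {"s1"}), (18, set())],
    "A": [(0.01, {"s1", "s2"}), (0.01, {"s1", "s2"}), (0.01, {"s1", "s2"}), (18, set())],
}
ranking = rank_candidates(candidates)
```

Here A and B have the same cost, so A ranks first because more of its links come from a single known model; C is last because of its higher cost.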
31. Evaluation
    • Dataset 1
      – 17 data sources containing overlapping data
      – Semantic descriptions created manually using the DBpedia, FOAF, GeoNames, and WGS84 ontologies
    • Dataset 2
      – 6 museum sources
      – Semantic descriptions created by domain experts using the EDM, SKOS, and FOAF ontologies
    • Learned a source model for each source, taking the other models as input
    • Computed the Graph Edit Distance (GED) between the learned model and the correct one
      – Operations: node insertion, node deletion, edge insertion, edge deletion, edge relabeling
    • Compared the results with our previous work in Karma
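
To make the evaluation metric concrete, here is a deliberately simplified sketch of a graph edit distance over the five operations listed on the slide. It assumes that nodes are identified by their labels and that there is at most one edge per ordered node pair, which reduces the distance to set differences; general GED has to search over node correspondences and is much harder. The two small models are hypothetical.

```python
def graph_edit_distance(learned, correct):
    """Each graph is (set_of_nodes, {(source, target): label}).

    Counts node insertions/deletions, edge insertions/deletions and edge
    relabelings, assuming nodes are matched by label (a simplification).
    """
    learned_nodes, learned_edges = learned
    correct_nodes, correct_edges = correct
    node_ops = len(learned_nodes ^ correct_nodes)
    edge_ops = len(learned_edges.keys() ^ correct_edges.keys())
    shared = learned_edges.keys() & correct_edges.keys()
    relabels = sum(1 for e in shared if learned_edges[e] != correct_edges[e])
    return node_ops + edge_ops + relabels


# Hypothetical learned and correct models for a three-class source.
learned = ({"Place", "City", "State"},
           {("Place", "City"): "isPartOf", ("City", "State"): "isPartOf"})
correct = ({"Place", "City", "State"},
           {("Place", "City"): "isPartOf", ("City", "State"): "state"})
ged = graph_edit_distance(learned, correct)
```

In this example the models differ by a single edge relabeling, so the distance is 1.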
32. Results - Dataset 1

    | Source Signature | #Attributes | GED: Previous work | GED: New Approach (Rank 1) |
    | --- | --- | --- | --- |
    | nearestCity(lat, lng, city, state, country) | 5 | 6 | 1 |
    | findRestaurant(zipcode, restaurantName, phone, address) | 4 | 1 | 0 |
    | zipcodesInCity(city, state, postalCode) | 3 | 3 | 1 |
    | parseAddress(address, city, state, zipcode, country) | 5 | 6 | 1 |
    | citiesOfState(state, city) | 2 | 1 | 0 |
    | ocean(lat, lng, name) | 3 | 2 | 1 |
    | postalCodeLookup(zipCode, city, state, country) | 4 | 6 | 1 |
    | country(lat, lng, code, name) | 4 | 2 | 0 |
    | companyCEO(company, name) | 2 | 1 | 0 |
    | personalInfo(firstname, lastname, birthdate, brithCity, birthCountry) | 5 | 4 | 1 |
    | businessInfo(company, phone, homepage, city, country, name) | 6 | 10 | 8 |
    | restaurantChef(restaurant, firstname, lastname) | 3 | 2 | 1 |
    | findSchool(city, state, name, code, homepage, ranking, dean) | 7 | 8 | 6 |
    | employees(organization, firstname, lastname, birthdate) | 4 | 1 | 2 |
    | education(person, hometown, homecountry, school, city, country) | 6 | 9 | 4 |
    | administrativeDistrict(city, province, country) | 3 | 4 | 1 |
    | capital(country, city) | 2 | 2 | 1 |
    | TOTAL | 68 | 68 | 29 |

    57% improvement
33. Results - Dataset 2

    | Source Signature | #Attributes | GED: Previous work | GED: New Approach (Rank 1) |
    | --- | --- | --- | --- |
    | S1(Attribution, BeginDate, EndDate, Title, Dated, Medium, Dimensions) | 7 | 1 | 0 |
    | S2(ObjectID, ObjectTitle, ObjectWorkType, ArtistName, ArtistBirthDate, ArtistDeathDate, ObjectEarliestDate, ObjectRights, ObjectFacetValue1) | 8 | 2 | 3 |
    | S3(death, birth, name) | 3 | 0 | 0 |
    | S4(accessionNumber, artist, creditLine, dimensions, imageURL, materials, relatedArtworksURL, creationDate, provenance, keywordValues) | 10 | 9 | 6 |
    | S5(AccessionNumber, Classification, CreditLine, Date, Description, DimensionsOrphan, WhatValues, Who, image, relatedArtworksValues) | 10 | 9 | 5 |
    | S6(Artist, ArtistBornDate, ArtistDiedDate, Classification, Copyright, CreditLine, Image, KeywordValues, Ref, SitterValues) | 10 | 8 | 6 |
    | TOTAL | 68 | 29 | 20 |

    31% improvement
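
The slides do not state how the improvement percentages are computed. One reading that reproduces both reported figures is the relative reduction in total GED, sketched below; the function name `improvement` is hypothetical and the totals are taken from the TOTAL rows of the two result slides.

```python
def improvement(previous_total_ged, new_total_ged):
    """Relative reduction in total graph edit distance, in percent (assumed definition)."""
    return 100 * (previous_total_ged - new_total_ged) / previous_total_ged


dataset1 = improvement(68, 29)  # totals from the Dataset 1 results
dataset2 = improvement(29, 20)  # totals from the Dataset 2 results
```

Rounded to whole percentages, these values match the 57% and 31% shown on the slides.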
34. Related Work
    • Writing semantic descriptions by hand
      – R2RML, SWRL
      – Tedious and time-consuming task
      – Requires expertise in Semantic Web technologies
    • Semantic annotation of Web services and Web tables
      – Very limited in learning the relationships
    • Learning Semantic Definitions of Online Information Sources [Carman, Knoblock, 2007]
      – Learns LAV rules from known sources
      – Can only learn descriptions that overlap known sources
35. Discussion
    • Automatically build rich semantic descriptions of data sources
    • Exploit the background knowledge from (i) the domain ontology and (ii) the known source models
    • Semantic descriptions are key ingredients for automating many tasks, e.g.,
      – Source Discovery
      – Data Integration
      – Service Composition
36. Future Work
    • Investigate how to create a more compact graph
      – Consolidate the overlapping segments of the known semantic models
    • Relax the problem by removing the constraint that the correct semantic type of each attribute is known
      – The CRF part returns a set of candidate semantic types along with their confidence values
    • Use the data available in the Linked Open Data (LOD) cloud to learn more accurate models
    • Put the user in the loop
      – Integrate the new approach into Karma
      – The user refines one of the suggested models
      – The new model will be added to the graph as a new pattern