Graphinder Semantic Search
Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Resear...
Agenda
• Introduction
• Graphinder: Overview
• Keyword Query Translation
• Keyword Query Result Ranking
• Keyword Query R...
INTRODUCTION
Motivation: lots of structured data
Semantic Search: use information about entities and
relationships explicitly given in structured data to provide
relevant ...
Entity Semantic Search: find relevant entity, return
structured data summary, facts, related entities
Relational Semantic Search: find relevant entities
involved in a relationship, return entity summaries…
Semantic Search Problem: understand user inputs as
entities and relationships and find relevant answers

“single written b...
Relational Semantic Search at Facebook: recognizes entities and
relationships via LMs, uses manually specified template (g...
OVERVIEW
Graphinder Semantic Search: a translation-based approach
for relational keyword search over data graphs

Single

Artist

P...
Graphinder: selected publications
• On-demand, domain-independent, relational keyword search
over data graphs
– Str...
QUERY TRANSLATION
0) Query Translation: constructing pseudo schema graph
representing all possible connections between data elements
•...
1) Query Translation: constructing search space
representing all possible interpretations of query keywords
“written by fr...
2) Query Translation: score-directed algorithm for finding
top-k subgraphs connecting keyword matching elements
“written b...
RESULT RANKING
Ranking Using Structured LMs: Keyword query is short and
ambiguous, while structured data provide rich structure
informati...
Relevance Models
[Figure: query “freddie queen” with F pseudo-relevant feedback documents; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, West]
• Term probabilities ...
Structured Relevance Models
[Figure: structured data; query “queen single” with F results; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, W...]
Ranking: construct edge-specific query model for each unique e
from feedback resources FR, edge-specific model for every
c...
QUERY REWRITING
Query Rewriting: find syntactically and semantically valid
rewrites to suggest as user types
single from freddy mercury qu...
Probabilistic Model for Query Rewriting: the rank of a
query rewrite (suggestion) S is based on the
probability of observi...
Token Rewriting
• Modeling token rewriting P(Q|S)

Split: |
Concatenate: +

• Independence assumption

• Modeling syntacti...
Query Segmentation
• Modeling query segmentation P(S|D)
single writer freddie mercury que

α = concatenate?
α = split?
whe...
Estimating Probability of Segmentation
• Maximum likelihood estimation (MLE)

where C(ti…tj) denotes the count of occurren...
Estimating Probability of Segmentation Case 1: previous
segment si has length equal or more than context N
• Two cases: (1...
Estimating Probability of Segmentation Case 2: previous
segment si has length less than context N
• (2) When the previous ...
EXPERIMENTAL RESULTS &
CONCLUSIONS
• Graphinder, a relational keyword search approach for suggesting query
completions, translating queries and...
Thanks!

Tran Duc Thanh
tran.du.th@gmail.com
http://sites.google.com/site/kimducthanh/
References (1)
– [VLDB14] Yongtao Ma, Thanh Tran
Probabilistic Query Rewriting for Efficient and Effective Keyword Sea...
References (2)
– [WWW12] Daniel Herzig, Thanh Tran
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Int...
BACKUP
Relational semantic search over data graphs, covering Facebook Graph Search and our solution to relational keyword search over structured data.

  • Construct the query model from structured data elements that are close to the query. Index resources in the data graph, where resources are treated as documents and attributes and attribute values are indexed as document terms. Use a standard inverted index implementation and IR search engine to retrieve resources for a given keyword query. An initial run of the query yields F results.
  • Query model: the probability of terms in the query model is estimated using the F resources. Intuitively, the probability of a term is estimated as the probability of observing this term in the F resources (based on the probability of observing the term in the e-value of r, and the probability of e), weighted by the importance of that resource: a resource is more important if query terms are more likely to be observed in that resource, compared to other resources in F.
    Edge-specific resource model: the probability of observing term v in the e-value of r, smoothed with the probability of observing term v in all values of r.
    The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM, aggregated over every e. Alpha controls the importance of edges.
    Instead of single entities, rank complex graphs comprising multiple entities, called Joined Result Tuples (JRTs): complex results are modeled as a geometric mean of the entity models.
    Ranking aggregated JRTs: the cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResM. The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms).
    A language model is constructed for every attribute of the resource to capture the probability of a word being observed via repeated sampling from the content of a specific attribute of r.
    Lambda controls the weight of the edge-specific attribute: a small value means less emphasis on the terms of the attribute and more emphasis on the terms of the entire resource (terms in all attributes).
    Pe is the probability of observing a word v in the edge-specific attribute a; P* is the probability of observing a word v in all attributes of r.
    Consider the co-occurrences of a word and query words in the content of a specific attribute a.
    The sampling process we implement is i.i.d. sampling: query words and w are i.i.d. sampled from a unigram distribution a, i.e., representing the content of the specific attribute a, then sample v from a, and then sample k times query words from a distribution representing the content of all attributes of r.
  • Graphinder semantic search

    1. 1. Graphinder Semantic Search Relational Keyword Search over Data Graphs Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma Researcher: www.sites.google.com/site/kimducthanh Co-Founder: www.graphinder.com
    2. 2. Agenda
         • Introduction
         • Graphinder: Overview
         • Keyword Query Translation
         • Keyword Query Result Ranking
         • Keyword Query Rewriting
           – Suggesting correct and meaningful queries
           – Auto-complete as user types
    3. 3. INTRODUCTION
    4. 4. Motivation: lots of structured data
    5. 5. Semantic Search: use information about entities and relationships explicitly given in structured data to provide relevant answers for complex questions asked using intuitive interfaces
         Example queries: “singles written by freddie, who is member of the band queen”; “single written by freddie queen”
         [Figure: data graph with nodes Single, Artist, Queen, Person, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971, single and edge writer]
         MusicBrainz: <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, type, Artist> <Freddie Mercury, member, Queen> <Queen, type, Band>
         DBpedia: <x, type, Single> <x, writtenBy, Freddy>
         Links: <Freddy, same-as, Freddy Mercury>
    6. 6. Entity Semantic Search: find relevant entity, return structured data summary, facts, related entities
    7. 7. Relational Semantic Search: find relevant entities involved in a relationship, return entity summaries…
    8. 8. Semantic Search Problem: understand user inputs as entities and relationships and find relevant answers
         Example queries: “single written by freddie queen”; “singles written by freddie, who is member of the band queen”
         [Figure: data graph with nodes Single, Artist, Queen, Freddie Mercury, Brian May, Person, Queen Elizabeth 1, Liar, 1971, single and edge writer]
         Query Translation: What are possible connections (schema-level) between recognized entities and relationships?
           1) <x, type, Single> <Freddie Mercury, writer, x> <Freddie Mercury, member, Queen>
           2) …
         Query Answering: What are actual connections (data-level) between recognized entities and relationships?
           1) <Liar Liar, type, Single> <Freddie Mercury, writer, Liar Liar> <Freddie Mercury, member, Queen>
           2) …
    9. 9. Relational Semantic Search at Facebook: recognizes entities and relationships via LMs, uses a manually specified template (grammar) to find possible connections between them, and computes answers via the resulting translated queries
         Example query: “my friends, who is member of queen”
         Parse tree:
           [start] my friends, who is member of [id:Queen1] friends(x,me), member(x,Queen1)
           [user-head] my friends friends(x,me)
           [user-filter] who is member of [id:1] member(x,Queen1)
           [who] who -
           [member-vp] is member of [id:1] member(x,Queen1)
           [member-of-v] is member of member()
           friends member {band} [id:Queen1] Queen1 queen
         Grammar: set of production rules, capturing all possible connections, i.e. the search space of all parse trees
           [start] → [users]
           [users] → my friends friends(x, me)
           […] → is member of [bands] member(x, $1)
           [bands] → {band} $1
           …
         Grammar-based Query Translation: which combination of production rules results in a parse tree that connects the recognized entities and relationships?
    10. 10. OVERVIEW
    11. 11. Graphinder Semantic Search: a translation-based approach for relational keyword search over data graphs
          [Figure: data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971, single and edge writer; components Sem. Auto-completion and Query Translation]
          - Entity + Relationships
          - Multi-source
          - Domain-independent
          - Low manual effort
    12. 12. Graphinder: selected publications
          • On-demand, domain-independent, relational keyword search over data graphs
            – Structure index for data graphs (TKDE13b)
            – Top-k exploration of translation candidates (ICDE09)
            – Index-based materialization of graphs (CIKM11a)
            – Ranking results using structured relevance model (SRM) (CIKM11b)
          • Multi-source
            – Deduplication using inferred type information: TYPifier (ICDE13), TYPiMatch (WSDM13)
            – On-the-fly deduplication using SRM (WWW11)
            – Ranking with deduplication (ISWC13)
            – Routing keyword queries to relevant data graphs (TKDE13a)
            – Hermes: keyword search over heterogeneous data graphs (SIGMOD09)
          • Semantic auto-completion
            – Computing valid query rewrites for given keywords (VLDB14)
    13. 13. QUERY TRANSLATION
    14. 14. 0) Query Translation: constructing pseudo schema graph representing all possible connections between data elements
          • Structure index for data graph: nodes are groups of data elements that share the same structure pattern
          • Parameters: structure pattern with edge labels L and paths of maximum length n
          • Pseudo schema
            – Node groups all instances that have the same set of properties
            – Structure pattern: all properties, i.e., all outgoing paths with n = 1, L = all edge labels
          • Algorithm:
            – Start with one single partition/node representing all instances
            – Split until all nodes are “stable”, i.e., all contained instances share the same structure pattern
          [Figure: data graph (nodes Single, Artist, Queen, Freddie Mercury, Brian May, Person, Queen Elizabeth 1, Liar, single; edges writer, member) and pseudo schema (nodes Artist, Thing12, Single, Person, Value2; edges producer, writer, marital status)]
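The pseudo schema construction on slide 14 can be sketched in a few lines of Python. This is a minimal illustration, not the structure index implementation from the cited work: it assumes the data graph is given as (subject, property, object) triples, and it groups instances directly by their set of outgoing edge labels, which for n = 1 yields the same partition as splitting until all nodes are stable. The function name `build_pseudo_schema` and the example triples are hypothetical.

```python
from collections import defaultdict


def build_pseudo_schema(triples):
    """Group instances by their set of outgoing edge labels (n = 1).

    Returns (groups, edges): groups maps a frozenset of edge labels to the
    instances sharing that pattern; edges are (group, label, group) triples.
    """
    # Collect the outgoing edge labels of every subject.
    properties = defaultdict(set)
    for subject, label, _ in triples:
        properties[subject].add(label)

    # Instances with the same label set fall into the same schema node.
    groups = defaultdict(set)
    for instance, labels in properties.items():
        groups[frozenset(labels)].add(instance)

    # Lift data-level edges to edges between schema nodes.
    group_of = {inst: key for key, members in groups.items() for inst in members}
    edges = set()
    for subject, label, obj in triples:
        if obj in group_of:  # objects without outgoing edges are skipped here
            edges.add((group_of[subject], label, group_of[obj]))
    return dict(groups), edges


# Hypothetical example data, loosely following the slides.
triples = [
    ("Liar", "type", "Single"),
    ("Liar", "writer", "Freddie Mercury"),
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "type", "Artist"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
]
groups, edges = build_pseudo_schema(triples)
```

Here Freddie Mercury and Brian May end up in one schema node because both have exactly the properties type and member.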
    15. 15. 1) Query Translation: constructing search space representing all possible interpretations of query keywords
          Example query: “written by freddie queen single”
          Keyword Interpretation: use inverted index and LM-based ranking function to return relevant schema and data elements
          Search Space Construction: augment pseudo schema with query-specific keyword matching elements
          • All possible connections of predicates applicable to recognized query keywords
          Following steps: Top-k Subgraph Exploration, Result Retrieval & Ranking
          [Figure: Data Index and Schema Index returning the elements Freddie Mercury, Queen Elizabeth 1, Queen, single, writer, Single; search space with nodes Artist, Freddie Mercury, Band, Queen, Single, Person, Literal, Queen Elizabeth 1, single and edges producer, member, writer, marital status]
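The keyword interpretation step on slide 15 relies on an inverted index over schema and data elements. The sketch below shows only the lookup part, under the assumption that each element has a textual label; the LM-based ranking mentioned on the slide is left out. The element identifiers (`artist:freddie_mercury` and so on) and the function names are made up for illustration.

```python
from collections import defaultdict


def build_index(elements):
    """Map each lowercased label term to the elements whose label contains it."""
    index = defaultdict(set)
    for element, label in elements.items():
        for term in label.lower().split():
            index[term].add(element)
    return index


def interpret(index, query):
    """Return, per query keyword, the matching elements (unranked, sorted by id)."""
    return {kw: sorted(index.get(kw, set())) for kw in query.lower().split()}


# Hypothetical schema and data elements with their labels.
elements = {
    "artist:freddie_mercury": "Freddie Mercury",
    "band:queen": "Queen",
    "person:queen_elizabeth_1": "Queen Elizabeth 1",
    "class:Single": "Single",
    "literal:single": "single",
    "edge:writer": "writer",
}
index = build_index(elements)
matches = interpret(index, "freddie queen single")
```

The keyword "queen" matches both the band and the person, which is the ambiguity the search space has to represent.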
    16. 16. 2) Query Translation: score-directed algorithm for finding top-k subgraphs connecting keyword matching elements
          Example query: “written by freddie queen single”
          [Figure: search space with nodes Artist, Freddie Mercury, Band, Queen, Single, Person, Literal, Queen Elizabeth 1, single and edges member, producer, marital status, writer]
          Example translation: <x, type, Single> <Queen, producer, x> <Freddie Mercury, writer, x> <Queen, type, Band> <Freddie Mercury, type, Artist>
          • Algorithm: score-directed top-k Steiner graph search
          • Start: explore all distinct paths starting from keyword elements
          • Every iteration
            – One-step expansion of the current path with the highest score
            – When a connecting element is found, merge paths and add the resulting graph to the candidate list
          • Top-k termination: lowest score of the candidate list > highest possible score that can be achieved with paths in the queues yet to be explored
          • Termination: all paths of maximum length d have been explored
          • Final step: mapping rules to translate Steiner graph to structured query
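The exploration loop on slide 16 can be illustrated with a simplified best-first search. This is a sketch under several assumptions that are not from the slides: edges carry non-negative costs (lower is better) instead of scores, each keyword has exactly one matching element, a candidate is identified by its connecting element rather than the full merged subgraph, and the maximum path length d is not enforced. The function name `top_k_connections` and the example graph are hypothetical.

```python
import heapq


def top_k_connections(graph, keyword_nodes, k=1):
    """Best-first exploration from all keyword elements at once.

    graph: node -> list of (neighbor, cost); keyword_nodes: keyword -> start node.
    Returns up to k (total_cost, connecting_node) pairs, cheapest first.
    """
    reached = {}  # node -> {keyword: cheapest cost from that keyword's element}
    queue = [(0.0, kw, node) for kw, node in keyword_nodes.items()]
    heapq.heapify(queue)
    candidates = []
    while queue:
        # Top-k termination: nothing left in the queue can beat the k-th candidate.
        if len(candidates) >= k and candidates[k - 1][0] <= queue[0][0]:
            break
        cost, kw, node = heapq.heappop(queue)
        seen = reached.setdefault(node, {})
        if kw in seen:
            continue  # this keyword already reached the node at lower or equal cost
        seen[kw] = cost
        if len(seen) == len(keyword_nodes):
            # Connecting element: every keyword has a path to this node.
            candidates.append((sum(seen.values()), node))
            candidates.sort()
        for neighbor, edge_cost in graph.get(node, []):
            if kw not in reached.get(neighbor, {}):
                heapq.heappush(queue, (cost + edge_cost, kw, neighbor))
    return candidates[:k]


# Hypothetical search space: undirected edges listed in both directions.
graph = {
    "Single": [("Artist", 1.0), ("Band", 2.0)],
    "Artist": [("Single", 1.0), ("Band", 1.0), ("Freddie Mercury", 1.0)],
    "Band": [("Single", 2.0), ("Artist", 1.0), ("Queen", 1.0)],
    "Freddie Mercury": [("Artist", 1.0)],
    "Queen": [("Band", 1.0)],
}
keyword_nodes = {"single": "Single", "freddie": "Freddie Mercury", "queen": "Queen"}
best = top_k_connections(graph, keyword_nodes, k=1)
```

With costs instead of scores the termination test is mirrored: the search stops once the k-th best candidate is no more expensive than the cheapest path still waiting in the queue.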
    17. 17. RESULT RANKING
    18. 18. Ranking Using Structured LMs: Keyword query is short and ambiguous, while structured data provide rich structure information: ranking based on LMs capturing both content and structure
          • Structured LMs for structured results r: $RM_r(v) = P(v \mid r)$
          • Structured LM for queries using structured pseudo-relevant feedback results FR (relevance model): $RM_{F_R}(v) = P(v \mid F_R)$
          • Compute distance between query and result LMs: $Score(r) = \sum_{v \in V} RM_{F_R}(v) \log RM_r(v)$
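The scoring step on slide 18 can be sketched as follows, reading the score as the sum over vocabulary terms v of RM_FR(v) · log RM_r(v), i.e. a negative cross-entropy where higher is better. The small `epsilon` floor for unseen terms, the function name `score` and the toy distributions are assumptions for illustration only; the slides and notes describe smoothing within the resource model instead.

```python
import math


def score(query_model, result_model, epsilon=1e-9):
    """Sum of RM_FR(v) * log RM_r(v) over the terms of the query model.

    epsilon is an assumed floor that avoids log(0) for terms the result
    model has never seen.
    """
    return sum(
        p * math.log(result_model.get(v, epsilon))
        for v, p in query_model.items()
    )


# Hypothetical term distributions.
query_model = {"mercury": 0.5, "brian": 0.5}
close_result = {"mercury": 0.5, "brian": 0.5}
distant_result = {"mercury": 0.9, "brian": 0.1}

close_score = score(query_model, close_result)
distant_score = score(query_model, distant_result)
```

A result whose model matches the query model more closely receives the higher score.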
    19. 19. Relevance Models
          • Term probabilities of the query model are based on documents
          • Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents
          [Figure: query “freddie queen”; F feedback documents and candidate documents, each with the terms Mercury, Brian May, Protest, Raid, Clash, Bank, West]
    20. 20. Structured Relevance Models
          • Term probabilities of the query model are based on pseudo-relevant structured data
          • Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates
          [Figure: query “queen single”; F results and candidate results over structured data, each with the terms Mercury, Brian May, Protest, Raid, Clash, Bank, West]
    21. 21. Ranking: construct edge-specific query model for each unique e from feedback resources FR, edge-specific model for every candidate r, and finally, compute distance
          Formula annotations: for all resources r in FR; prob of observing term v in value of property e of resource r; importance of resource r w.r.t. query

          v         RMname  RMcomment  RMx
          Mercury   .091    .01        …
          Brian     .082    .01        …
          Champion  .081    .02        …
          Protest   .001    .042       …
          Raid      .006    .014       …
          …         …       …          …

          v         RMname  RMcomment  RMx
          Mercury   .073    .01        …
          Brian     .052    .01        …
          …         …       …          …
    22. 22. QUERY REWRITING
    23. 23. Query Rewriting: find syntactically and semantically valid rewrites to suggest as user types
          Example query: single from freddy mercury que
          Benefits:
          - Higher selectivity of query terms (quality)
          - Reduced number of query terms (efficiency)
          - Better search experience…
          Challenges: many rewrite candidates, some are semantically not “valid” in the relational setting
          - single (marital status) writer “freddie mercury” queen (the queen of UK)
          Rewriting steps:
          - Token rewriting via syntactic distance: 1) single from freddie mercury queen …
          - Token rewriting via semantic distance: 1) single writer freddie mercury queen …
          - Query segmentation: 1) single writer “freddie mercury” queen …
          Keyword Interpretation:
          - Imprecise / fuzzy matching
          - Match every keyword
          Keyword / Key Phrase Interpretation:
          - Precise matching
          - Match keyword and key phrases
          Following steps in both cases: Search Space Construction, Result Retrieval & Ranking
          [Figure: Data Index and Schema Index returning the elements Freddie Mercury, Queen Elizabeth 1, Queen, single, writer, Single before rewriting, and Freddie Mercury, Queen, writer, Single after rewriting]
    24. 24. Probabilistic Model for Query Rewriting: the rank of a query rewrite (suggestion) S is based on the probability of observing S in the data, given the query
          Formula annotations: based on Bayes' Theorem; probability that users write spelling errors / semantically related queries is independent of data D; constant given query Q and data D
          Example query: single writer freddy mercury que
          1) single writer freddie mercury queen
          2) single writer freddrick mercury monarch
          3) song writer freddrick mercury head of state
          Token Rewriting: S is ranked high when prob that query Q can be observed in S is high
          Query Segmentation: S is ranked high when prob that S can be observed in the data D is high
          [Figure: data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971, single and edge writer]
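The formula for slide 24 is not part of this transcript. One reading that is consistent with the slide's annotations (Bayes' theorem, token rewriting independent of the data D, and a denominator that is constant for a given Q and D) would be the following; treat it as a reconstruction under those assumptions rather than the slide's exact notation.

```latex
\begin{align*}
P(S \mid Q, D) &= \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)} \\
               &\propto P(Q \mid S)\, P(S \mid D)
\end{align*}
```

The first factor corresponds to token rewriting, P(Q|S), and the second to query segmentation, P(S|D), matching the two components named on the slide.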
    25. 25. Token Rewriting
          • Modeling token rewriting P(Q|S)
          • Independence assumption
          • Modeling syntactic and semantic differences P(q|t): is high when q is syntactically and semantically close to t
          Example query: single writer freddy mercury que
          1) single writer “freddie mercury” queen
          2) single writer “freddrick mercury” monarch
          3) single writer “freddrick mercury” head of state
          Segmentation notation: Split: |, Concatenate: +
          single | writer | freddie + mercury | queen
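Under the independence assumption on slide 25, P(Q|S) factorizes into a product of per-token terms P(q|t). The sketch below uses the `difflib.SequenceMatcher` ratio from the Python standard library as a stand-in for syntactic closeness only. That choice is an assumption, the value is an unnormalized similarity rather than a true probability, and semantic distance is not modeled. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def token_similarity(q, t):
    """Stand-in for P(q|t): string similarity in [0, 1] (syntactic only)."""
    return SequenceMatcher(None, q, t).ratio()


def rewrite_score(query_tokens, suggestion_tokens):
    """Product of per-token similarities, following the independence assumption."""
    if len(query_tokens) != len(suggestion_tokens):
        raise ValueError("this sketch assumes a one-to-one token alignment")
    result = 1.0
    for q, t in zip(query_tokens, suggestion_tokens):
        result *= token_similarity(q, t)
    return result


query = ["single", "writer", "freddy", "mercury", "que"]
rewrite_1 = ["single", "writer", "freddie", "mercury", "queen"]
rewrite_2 = ["single", "writer", "freddrick", "mercury", "monarch"]

score_1 = rewrite_score(query, rewrite_1)
score_2 = rewrite_score(query, rewrite_2)
```

Because "que" shares no characters with "monarch", the purely syntactic stand-in gives the second rewrite a score of zero. A semantic component would be needed to rank such rewrites at all.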
    26. 26. Query Segmentation
          • Modeling query segmentation P(S|D), where PD(αiti+1|t1α1t2…αi-1ti) stands for P(αiti+1|t1α1t2…αi-1ti,D)
          • Nth order Markov assumption
          Example query: single writer freddie mercury que; after “single writer freddie”: α = concatenate? α = split?
          [Figure: data graph with nodes Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971, single and edge writer]
    27. 27. Estimating Probability of Segmentation
          • Maximum likelihood estimation (MLE), where C(ti…tj) denotes the count of occurrences of the token sequence ti…tj
          Segmentation in structured data setting
          • Concatenate two segments si and sj when they co-occur in the data
          • Split when si and sj are connected (si ↭ sj), i.e., when the two data elements ni and nj mentioning si and sj are connected in the data
          Example query: single writer freddie mercury queen; after “single writer freddie”: α = concatenate? α = split?
          [Figure: data graph with nodes Single, Artist, Person, Queen, Freddie Mercury, Brian May, Queen Elizabeth 1, Liar, 1971, single and edge writer]
    28. 28. Estimating Probability of Segmentation Case 1: previous segment si has length equal or more than context N
          • Two cases: (1) l(si) ≥ N; (2) l(si) < N
          • (1) When the previously induced segment si has length equal or more than N, i.e. l(si) ≥ N, it suffices to focus on si (N) to predict the next action αi on ti+1
          Example: freddie j. mercury queen
          • Estimation of probability, where C(st) denotes the count of co-occurrences of the sequence st in D and C(s ↭ t) is the count of all occurrences of token t connected to segment s
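The two counts named on slide 28 can be computed from a small labeled graph as sketched below. The estimation formula itself is not in this transcript, so normalizing the counts as C(st) / (C(st) + C(s ↭ t)) for concatenate and C(s ↭ t) / (C(st) + C(s ↭ t)) for split is an assumption made for illustration. The data layout (node labels plus undirected edges) and all names are hypothetical as well.

```python
def contains_sequence(tokens, seq):
    """True if seq occurs as a contiguous run inside tokens."""
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))


def action_probabilities(labels, edges, segment, token):
    """Estimate concatenate/split probabilities from counts in the data.

    labels: node -> label text; edges: list of (node, node) pairs, undirected.
    ASSUMPTION: the two counts are normalized against each other.
    """
    s, t = segment.lower().split(), token.lower().split()
    tokenized = {node: label.lower().split() for node, label in labels.items()}

    # C(st): data elements whose label contains s immediately followed by t.
    c_concat = sum(contains_sequence(toks, s + t) for toks in tokenized.values())

    # C(s <-> t): connected element pairs where one mentions s and the other t.
    c_connected = 0
    for a, b in edges:
        forward = contains_sequence(tokenized[a], s) and contains_sequence(tokenized[b], t)
        backward = contains_sequence(tokenized[b], s) and contains_sequence(tokenized[a], t)
        if forward or backward:
            c_connected += 1

    total = c_concat + c_connected
    if total == 0:
        return {"concatenate": 0.0, "split": 0.0}
    return {"concatenate": c_concat / total, "split": c_connected / total}


# Hypothetical data elements and connections.
labels = {"n1": "Freddie Mercury", "n2": "Queen", "n3": "Liar"}
edges = [("n1", "n2"), ("n3", "n1")]

p_name = action_probabilities(labels, edges, "freddie", "mercury")
p_band = action_probabilities(labels, edges, "mercury", "queen")
```

In this toy data "freddie" and "mercury" only co-occur inside one label, so concatenation is preferred, while "mercury" and "queen" only appear in connected elements, so a split is preferred.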
    29. 29. Estimating Probability of Segmentation Case 2: previous segment si has length less than context N
          • (2) When the previous segment si has length less than N, i.e. l(si) < N, the action αi on the next token ti+1 depends on si and Pi(N), the set of segments that precede si and that, together with si, contain at most N tokens in total, i.e.,
          Example: single writer freddie mercury
          • Estimation of probability, where C(P ↭ s) denotes the count of all occurrences of the segment s connected to all segments in P
    30. 30. EXPERIMENTAL RESULTS & CONCLUSIONS
    31. 31. • Graphinder, a relational keyword search approach for suggesting query completions, translating queries and ranking results
          • Keyword translation performance
            – Query translation and index-based approaches at least one order of magnitude faster than online in-memory search (bidirectional)
            – Query translation comparable with index-based approaches, but less space
          • Keyword translation result quality
            – According to recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement)
          • Effect of query rewriting
            – Better user experience
            – Improves efficiency by reducing number of query terms
            – Improves quality / selectivity of query terms
            – …depends on complexity of queries and underlying keyword search engine
          • Tight integration of query suggestion and translation
          • From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system
    32. 32. Thanks! Tran Duc Thanh tran.du.th@gmail.com http://sites.google.com/site/kimducthanh/
    33. 33. References (1)
          – [VLDB14] Yongtao Ma, Thanh Tran. Probabilistic Query Rewriting for Efficient and Effective Keyword Search on Graph Data. In International Conference on Very Large Data Bases (VLDB'14). Hangzhou, China, September, 2014
          – [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika and Thanh Tran. Federated Entity Search Using On-the-Fly Consolidation. In International Semantic Web Conference (ISWC'13). Sydney, Australia, October, 2013
          – [ICDE13] Yongtao Ma, Thanh Tran. TYPifier: Inferring the Type Semantics of Structured Data. In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April, 2013
          – [WSDM13] Yongtao Ma, Thanh Tran. TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for Heterogeneous Web Data Integration. In International Conference on Web Search and Data Mining (WSDM'13). Rome, Italy, February, 2013
          – [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph. Managing Structured and Semi-structured RDF Data Using Structure Indexes. In Transactions on Knowledge and Data Engineering journal.
          – [TKDE12b] Thanh Tran, Lei Zhang. Keyword Query Routing. In Transactions on Knowledge and Data Engineering journal.
    34. 34. References (2)
          – [WWW12] Daniel Herzig, Thanh Tran. Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration. In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon, France, April, 2012
          – [CIKM11a] Günter Ladwig, Thanh Tran. Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011
          – [CIKM11b] Veli Bicer, Thanh Tran. Ranking Support for Keyword Search on Structured Data using Relevance Models. In Proceedings of 20th ACM Conference on Information and Knowledge Management (CIKM'11). Glasgow, UK, October, 2011
          – [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey Pound, Henry S. Thompson, Thanh Tran Duc. Repeatable and Reliable Search System Evaluation using Crowdsourcing. In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11), Beijing, China, July, 2011
          – [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009
          – [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun, Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer. Hermes: A Travel through Semantics in the Data Web. In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009
    35. 35. BACKUP
