Keyword Search on Structured Data using Relevance Models

Published in: Education, Technology
  • Slide note: Top-k queries are a long-studied topic in the database and information retrieval communities. The main objective of these queries is to return the K highest-ranked answers quickly and efficiently. A top-k query returns the subset of most relevant answers, instead of ALL answers, for two reasons: (i) to minimize the cost metric associated with retrieving all answers (e.g., disk, network), and (ii) to maximize the quality of the answer set, so that the user is not overwhelmed with irrelevant results.

    1. Keyword Search on Structured Data using Relevance Models*
       Veli Bicer, FZI Research Center for Information Technology (FZI Forschungszentrum Informatik), Karlsruhe, Germany
       Joint work with Thanh Tran from the Semantic Search Group, AIFB Institute, KIT
       * Based on the papers at the 20th ACM Conference on Information and Knowledge Management (CIKM'11) and the 10th International Semantic Web Conference (ISWC'11)
       © FZI Forschungszentrum Informatik
    2. About the presenter
       Veli Bicer
       • Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
       • Associated Researcher at Karlsruhe Service Research Institute (KSRI)
       • KSRI founded by IBM Germany
       Research Interests
       • Semantic Data Management/Search
       • Relational Learning
       • Software Engineering (for Services)
       Projects
       • German Internet Research Programme THESEUS
       • KOIOS Semantic Search in Core Technology Cluster
       • TEXO Internet-of-Services Use-case
       • Previously, EU ICT Artemis, Satine, Saphire and Ride
       Date: 10.04.2012
    3. Agenda
       Introduction
       • Keyword search on structured data
       • Relevance models
       Approach
       • Ranking scheme using relevance models
       • Top-k query processing
       Experiments
       Application
       • Search on environmental data
       Conclusion
    4. Introduction
    5. Keyword Search on Structured Data
       Rationale
       • 4 billion web searches daily
       • Data-driven websites have a relational database backend
       • Predefined search forms constrain retrieval
       • SQL is difficult to learn
       → Simplify data retrieval by not using SQL
    6. Keyword Search on Structured Data
       Example
       • Who is the character played by Audrey Hepburn in Roman Holiday?
       Query result
       • A tree of tuples that is reduced with respect to the query.
       Which would you rather write?
       • SELECT Character.name
         FROM Person, Character, Movie
         WHERE Person.id = Character.pid
           AND Character.mid = Movie.id
           AND Person.name = 'Audrey Hepburn'
           AND Movie.title = 'Roman Holiday';
       • or "Hepburn Holiday"
       Person
         id | name
         p1 | Audrey Hepburn
         p3 | Kate Winslet
         …  | …
       Character
         id | name          | pid | mid
         c1 | Princess Ann  | p1  | m1
         c3 | Iris Simpkins | p3  | m2
         …  | …
       Movie
         id | title         | plot
         m1 | Roman Holiday | Princess Ann is a royal princess of …
         m2 | The Holiday   | Iris swaps her cottage for the holiday along the next two …
         m3 | The Aviator   | Hughes and Hepburn go to a holiday and fly together …
         …  | …
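
       The structured query on this slide can be tried out with a small in-memory database. The following is a minimal sketch, assuming SQLite and only the few rows visible on the slide (the plot column is left out); the helper names `build_demo_db` and `character_played` are hypothetical and not part of the talk.

```python
import sqlite3

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the example rows from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (id TEXT PRIMARY KEY, name TEXT);
        CREATE TABLE Movie (id TEXT PRIMARY KEY, title TEXT);
        CREATE TABLE "Character" (id TEXT PRIMARY KEY, name TEXT, pid TEXT, mid TEXT);
        INSERT INTO Person VALUES ('p1', 'Audrey Hepburn');
        INSERT INTO Person VALUES ('p3', 'Kate Winslet');
        INSERT INTO Movie VALUES ('m1', 'Roman Holiday');
        INSERT INTO Movie VALUES ('m2', 'The Holiday');
        INSERT INTO "Character" VALUES ('c1', 'Princess Ann', 'p1', 'm1');
        INSERT INTO "Character" VALUES ('c3', 'Iris Simpkins', 'p3', 'm2');
    """)
    return conn

# Same join as on the slide, with table aliases and bound parameters.
QUERY = """
    SELECT c.name
    FROM Person AS p, "Character" AS c, Movie AS m
    WHERE p.id = c.pid
      AND c.mid = m.id
      AND p.name = ?
      AND m.title = ?
"""

def character_played(conn: sqlite3.Connection, actor: str, title: str) -> list:
    """Return the names of characters played by `actor` in the movie `title`."""
    return [row[0] for row in conn.execute(QUERY, (actor, title))]
```

       Calling `character_played(build_demo_db(), "Audrey Hepburn", "Roman Holiday")` runs the three-way join that the keyword query "Hepburn Holiday" is meant to replace.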
    7. Keyword Search on Structured Data
       Many approaches have been proposed recently
       • Performance focus
       • Less consideration of ranking
       Recent study (Coffman and Weaver, CIKM 2010)
       • Effectiveness of previous work is below expectations
       • The problem lies in the ranking strategies, not in performance
       Two major types of ranking schemes:
       • IR-inspired TF-IDF ranking: (Liu et al, 2006), (SPARK, 2007)
       • Proximity-based approaches: (BANKS, 2002), (Bidirectional, 2005)
       Problem:
       • A robust and principled approach is missing!
    8. Relevance Models
       Proposed by Lavrenko and Croft (SIGIR 01)
       Assumes that
       • queries and documents are samples from a hidden representation space and
       • are generated from the same generative model
       Initial representation of relevance R is unknown
       • Estimated from the query
       [Figure: three diagrams labelled Classical Model, Language Model and Relevance Model, relating query Q, document D and relevance R]
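
       To make the idea of estimating R from the query concrete, here is a minimal sketch of one common relevance-model estimate over plain-text feedback documents: each document's word distribution is weighted by the likelihood it assigns to the query. The uniform document prior, the floor value `eps` and the function names are assumptions made for illustration; this is not the edge-specific model of the talk.

```python
from collections import Counter

def doc_model(text: str) -> dict:
    """Maximum-likelihood unigram model of one document."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def relevance_model(query: list, feedback_docs: list, eps: float = 1e-6) -> dict:
    """Estimate P(w|R) as a query-likelihood-weighted mixture of document models.

    Assumes a uniform prior over feedback documents; `eps` is a floor
    probability for query words a document does not contain.
    """
    rm = Counter()
    for text in feedback_docs:
        model = doc_model(text)
        query_likelihood = 1.0
        for q in query:
            query_likelihood *= model.get(q.lower(), eps)
        for w, p in model.items():
            rm[w] += p * query_likelihood
    norm = sum(rm.values())
    return {w: p / norm for w, p in rm.items()}
```

       With the query `["hepburn", "holiday"]`, a document mentioning both words dominates the mixture, so words that co-occur with them (such as "audrey") receive more probability mass than words from documents matching only one keyword.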
    9. Approach
    10. Overview of Approach
        Steps: (1) Query, (2) PRF, (3) Query RM, (4) Res. RM, (5) Res. Score, (6) Query Generation, (7) Structured Queries, (8) Top-k Query Proc., (9) Result Ranking
        (1) Query
          words   | p
          hepburn | 0.5
          holiday | 0.5
        (2) PRF
          Title             | Name
          Roman Holiday     | Audrey Hepburn
          Breakfast at Tiff. | Audrey Hepburn
          The Aviator       | Katharine Hepburn
          The Holiday       | Kate Winslet
        (3) Query RM and (4) Res. RM
          words     | p (Query RM) | p (Res. RM)
          hepburn   | 0.21         | 0.12
          holiday   | 0.15         | 0.18
          audrey    | 0.13         | 0.11
          katharine | 0.09         | 0.05
          princess  | 0.01         | 0.00
          roman     | 0.01         | 0.06
          …         | …            | …
        (5) Res. Score: D(RM_Q || RM_R)
    11. Data Model
        Different kinds of data
        • e.g. relational, XML and RDF data
        Data graph of nodes and edges (G = (V, E))
        Resource nodes, attribute nodes
        • Every resource is typed
        • Resources have unique ids (e.g. primary keys)
    12. Edge-Specific Relevance Models (steps 1 to 3)
        A set of feedback resources FR is retrieved from an inverted keyword index:
        • E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
        Edge-specific relevance model for each unique edge e:
        • [Equation not preserved in the transcript. Its two factors are labelled "probability of word at resource" and "importance of resource w.r.t. query".]
        Inverted index
        • princess → m1, c1
        • breakfast → m3
        • hepburn → m3, p1, p4, c2
        • melbourne → p2
        • iris → c3
        • holiday → m1, m2, m3
        • breakfast → m3
        • ann → m1, c2
        • …
        FR (resources shown in the figure)
        • p1: name "Audrey Hepburn", birthplace "Ixelles Belgium"
        • m3: title "The Holiday", plot "Iris swaps her cottage for the holiday along the next two …"
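
        The retrieval of feedback resources from an inverted keyword index can be sketched as follows. This is a minimal illustration, assuming resources are given as dictionaries from attribute names to text values and that FR is the union of the posting lists of all query keywords (which is consistent with the example FR above); the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(resources: dict) -> dict:
    """Map each lower-cased word to the set of resource ids whose attributes contain it."""
    index = defaultdict(set)
    for rid, attributes in resources.items():
        for text in attributes.values():
            for word in text.lower().split():
                index[word].add(rid)
    return index

def feedback_resources(index: dict, query: list) -> set:
    """Return the union of the posting lists of all query keywords."""
    fr = set()
    for keyword in query:
        fr |= index.get(keyword.lower(), set())
    return fr
```

        For example, indexing `p1` (name "Audrey Hepburn") and `m1` (title "Roman Holiday") and querying with `["Hepburn", "Holiday"]` returns both resource ids.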
    13. Edge-Specific Resource Models (steps 4 and 5)
        Each resource (a tuple) is also represented as an RM
        • … as final results (joint tuples) are obtained by combining resources
        Edge-specific resource model: [equation not preserved in the transcript]
        The score of a resource is the cross-entropy of the edge-specific RM and ResM: [equation not preserved in the transcript]
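
        Since the scoring equation is not preserved here, the following is only a sketch of how a cross-entropy score between a query-side relevance model and a resource model could be computed per edge. The summation over edges, the sign convention (negative cross-entropy, so that higher is better) and the floor `eps` for unseen words are assumptions, as are the function names.

```python
import math

def cross_entropy(rm: dict, res_m: dict, eps: float = 1e-9) -> float:
    """H(RM, ResM) = -sum_v P(v|RM) * log P(v|ResM), with a floor for unseen words."""
    return -sum(p * math.log(res_m.get(w, eps)) for w, p in rm.items())

def resource_score(rm_by_edge: dict, resm_by_edge: dict) -> float:
    """Sum the negative cross-entropies over all edges of the query-side model.

    Both arguments map an edge (attribute) name to a word distribution.
    A higher (less negative) score means the resource is closer to the query model.
    """
    total = 0.0
    for edge, rm in rm_by_edge.items():
        total += cross_entropy(rm, resm_by_edge.get(edge, {}))
    return -total
```

        A resource whose name attribute contains the words favoured by the query model scores higher than one whose words the model gives no mass to.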
    14. Smoothing
        Well-known technique to address data sparseness and improve the accuracy of RMs (and LMs)
        • [symbol not preserved in the transcript] is the core probability for both the query and the resource RM
        Local smoothing
        The neighborhood of attribute a is another attribute a':
        • a and a' share the same resources
        • resources of a and a' are of the same type
        • resources of a and a' are connected over a FK
        [Figure: neighborhood of a]
    15. Smoothing
        [Figure: data graph with types Person and Character; p1 (name "Audrey Hepburn", birthplace "Ixelles Belgium"); p4 (name "Katharine Hepburn", birthplace "Connecticut USA"); c1 (name "Princess Ann"), linked to a Person via pid_fk]
        P_name(v | p1), four columns of values as shown on the slide:
          words       | col 1 | col 2 | col 3 | col 4
          audrey      | 0.5   | 0.4   | 0.37  | 0.36
          hepburn     | 0.5   | 0.4   | 0.39  | 0.38
          ixelles     |       | 0.1   | 0.09  | 0.08
          belgium     |       | 0.1   | 0.09  | 0.08
          katharine   |       |       | 0.02  | 0.01
          connecticut |       |       | 0.02  | 0.01
          usa         |       |       | 0.02  | 0.01
          princess    |       |       |       | 0.035
          ann         |       |       |       | 0.035
        Smoothing of each type is controlled by weights: [equation not preserved in the transcript]
        where γ1, γ2, γ3 are control parameters set in experiments
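
        The weighting equation is missing from the transcript, so the following sketch assumes the simplest form such a scheme could take: a linear interpolation in which each neighborhood model receives one weight γ and the attribute's own model keeps the remaining mass. The interpolation form, the function name and the weight values are assumptions, not the formula from the talk.

```python
def smooth(own: dict, neighbors: list, gammas: list) -> dict:
    """Linearly interpolate an attribute model with its neighborhood models.

    own       -- word distribution of the attribute itself
    neighbors -- word distributions of the neighboring attributes
    gammas    -- one weight per neighbor; the own model keeps 1 - sum(gammas)
    """
    if len(neighbors) != len(gammas):
        raise ValueError("need exactly one weight per neighbor model")
    own_weight = 1.0 - sum(gammas)
    if own_weight < 0.0:
        raise ValueError("weights must sum to at most 1")
    vocabulary = set(own)
    for model in neighbors:
        vocabulary |= set(model)
    return {
        w: own_weight * own.get(w, 0.0)
        + sum(g * model.get(w, 0.0) for g, model in zip(gammas, neighbors))
        for w in vocabulary
    }
```

        For instance, smoothing the name model of p1 ({audrey: 0.5, hepburn: 0.5}) with its birthplace model ({ixelles: 0.5, belgium: 0.5}) and a hypothetical weight of 0.2 moves 20% of the probability mass to the birthplace words while the result still sums to 1.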
    16. Ranking JRTs (step 9)
        Ranking aggregated JRTs:
        • Cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResMs: [equation not preserved in the transcript]
        The proposed score is monotonic w.r.t. individual resource scores
        • … a desired property for most top-k algorithms
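
        The monotonicity claim can be checked numerically. The sketch below assumes an unnormalized, word-wise geometric mean of the resource models and a floor `eps` for unseen words; under these assumptions the cross-entropy against the combined model equals the average of the individual cross-entropies, so improving any single resource score can only improve the JRT score. Function names are hypothetical.

```python
import math

def cross_entropy(query_rm: dict, model: dict, eps: float = 1e-9) -> float:
    """H(RM, M) = -sum_v P(v|RM) * log P(v|M), with a floor for unseen words."""
    return -sum(p * math.log(model.get(w, eps)) for w, p in query_rm.items())

def geometric_mean_model(models: list, eps: float = 1e-9) -> dict:
    """Word-wise geometric mean of several resource models (not renormalized)."""
    vocabulary = set().union(*models)
    n = len(models)
    return {
        w: math.exp(sum(math.log(m.get(w, eps)) for m in models) / n)
        for w in vocabulary
    }

def jrt_cross_entropy(query_rm: dict, models: list) -> float:
    """Cross-entropy between the query model and the combined resource models."""
    return cross_entropy(query_rm, geometric_mean_model(models))
```

        Because the logarithm of a geometric mean is the arithmetic mean of the logarithms, `jrt_cross_entropy(q, [a, b])` equals `(cross_entropy(q, a) + cross_entropy(q, b)) / 2` up to rounding.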
    17. Query Translation* (steps 6 and 7)
        Mapping of keywords to data elements
        • Results in a set of keyword elements
        Data graph exploration
        • Search for substructures (query graphs) connecting keyword elements
        • Bi-directional exploration of query graphs operates on a summary of the data graph only
        Top-k computation
        • Search guided by a scoring function to output only the top-k queries
        Query graphs to be processed
        • Free vs. non-free variables
        [Figure: keyword elements: "Hepburn" matches the name of p4 and p1; "Holiday" matches the title of m1 and m3]
        [Figure: summary graph with nodes Character, Person, Movie, Location, Producer, Studio and edges pid_fk, mid_fk, bornIn, is-a, hasDist, hasLoc, worksFor]
        [Figure: query graph with ?p (type Person, name "Hepburn"), ?c (type Character), ?m (type Movie, title "Holiday"), connected by pid_fk and mid_fk]
        * [Tran et al. ICDE'09]
    18. Top-k Query Processing (step 8)
        Top-k query processing (TQP) is highly common in Web-accessible databases
        • Return the K highest-ranked answers
        • Avoid unnecessary accesses to the database
        TQP assumes
        • Scoring function and attribute values to be known a priori (e.g. RankJoin)
        • Attribute values are combined by an aggregation function
        • Sorted access (SA) and random access (RA) probes
        How to adapt TQP to return top-k relevant results?
        • Results are joined sets of resources
        • Scores are query-dependent
        • No indexing is possible
        Idea:
        • Retrieve resources for non-free variables and rank them
        • Use SA on those initially retrieved resources
        • Use RA to find other resources
    19. Top-k Query Processing
        Result candidate c = <(x1,…,xk), score>
        • Complete when all variables are bound to some resources
        • xi = * indicates that xi is unbound
        Binding operator
        • c' = (c, xi → ri)
        Threshold determines the upper bound for unseen resources
        • Scheduling between SA and RA
        • A tight bound is desired
        Query graph: ?p (type Person, name "Hepburn"), ?c (type Character), ?m (type Movie, title "Holiday"), connected by pid_fk and mid_fk
        Walk-through state (Output K = 1)
        • Threshold: 0.50
        • Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>
        Person
          id | name              | S(r)
          p1 | Audrey Hepburn    | 0.20
          p3 | Katharine Hepburn | 0.18
          p5 | Philip Hepburn    | 0.13
          p6 | Anna Hepburn      | 0.12
        Character (value shown beside the table: 0.11)
          id | name              | S(r)
          c1 | Princess Ann      |
          c2 | Katharine Hepburn |
          c3 | Iris Simpkins     |
          c4 | Louise            |
        Movie
          id | title          | S(r)
          m2 | The Holiday    | 0.19
          m1 | Roman Holiday  | 0.18
          m3 | Holiday Blues  | 0.09
          m4 | Family Holiday | 0.08
    20. Top-k Query Processing (walk-through, continued)
        • Threshold: 0.48
        • Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
        • Person and Movie scores as on slide 19; no Character scores yet (value shown beside the Character table: 0.11)
    21. Top-k Query Processing (walk-through, continued)
        • Threshold: 0.47
        • Priority queue: <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>
        • Character scores: c1 Princess Ann 0.10 (value shown beside the Character table: 0.10)
        • Person and Movie scores as on slide 19
    22. Top-k Query Processing (walk-through, continued)
        • Threshold: 0.46
        • Priority queue: <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>
        • Character scores: c1 Princess Ann 0.10, c3 Iris Simpkins 0.05 (value shown beside the Character table: 0.09)
        • Person and Movie scores as on slide 19
    23. Top-k Query Processing (walk-through, final state)
        • Threshold: 0.46
        • Priority queue: <(p3,*,*),0.48>, <(*,c3,m2),0.44>
        • Output (K = 1): <(p1,c1,m1),0.48>
        • Character scores: c1 Princess Ann 0.10, c3 Iris Simpkins 0.05 (value shown beside the Character table: 0.09)
        • Person and Movie scores as on slide 19
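
        The candidate scores in this walk-through can be reproduced with a small sketch. It assumes that a candidate's score is the sum of the scores of its bound resources plus, for each unbound variable, the best score still possible for that variable; this additive reading is inferred from the numbers on the slides (e.g. 0.20 + 0.11 + 0.19 = 0.50) and is not stated in the transcript. The function and variable names are hypothetical.

```python
import heapq

# Resource scores S(r) as shown on the slides; index 0 = ?p, 1 = ?c, 2 = ?m.
SCORES = [
    {"p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12},
    {"c1": 0.10, "c3": 0.05},
    {"m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08},
]

def upper_bound(candidate: tuple, scores: list, max_unseen: list) -> float:
    """Score of a (partial) candidate; None marks an unbound variable.

    Bound variables contribute their resource score, unbound ones the
    best score any resource could still have for that variable.
    """
    total = 0.0
    for var, rid in enumerate(candidate):
        total += scores[var][rid] if rid is not None else max_unseen[var]
    return round(total, 2)

def rank_candidates(candidates: list, scores: list, max_unseen: list) -> list:
    """Return candidates ordered by decreasing upper bound (priority queue)."""
    heap = []
    for order, cand in enumerate(candidates):
        bound = upper_bound(cand, scores, max_unseen)
        # `order` breaks ties so tuples containing None are never compared.
        heapq.heappush(heap, (-bound, order, cand))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

        With the per-variable maxima `[0.20, 0.11, 0.19]` of the first state, `upper_bound(("p1", None, None), SCORES, [0.20, 0.11, 0.19])` gives the 0.50 shown in the priority queue, and the complete candidate `("p1", "c1", "m1")` gives the 0.48 of the final output.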
    24. Experiments
    25. Experiments
        Datasets: Subsets of the Wikipedia, IMDB and Mondial Web databases
        Queries: 50 queries for each dataset, including "TREC style" queries and "single resource" queries
        Metrics: Three metrics are used: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) Mean Average Precision (MAP)
        Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
        RM-S: Our approach
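
        For readers unfamiliar with the metrics, here is a minimal sketch of reciprocal rank and (mean) average precision using their standard definitions. Each ranked result list is represented as a list of booleans (True = relevant); average precision is normalized by the number of relevant results found in the list, which is an assumption when the total number of relevant results is not known.

```python
def reciprocal_rank(relevance: list) -> float:
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def average_precision(relevance: list) -> float:
    """Mean of precision@k over the ranks k at which a relevant result occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(runs: list) -> float:
    """Average precision averaged over all queries."""
    return sum(average_precision(run) for run in runs) / len(runs)
```

        A query whose only relevant result is at rank 2 has a reciprocal rank of 0.5; a ranking with relevant results at ranks 1 and 3 has an average precision of (1/1 + 2/3) / 2.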
    26. Experiments
        [Figures: MAP scores for all queries; reciprocal rank for single-resource queries]
    27. Experiments
        [Figure: precision-recall for TREC-style queries on Wikipedia]
    28. Application
    29. Large amount of environmental data
        Environmental issues stir public interest
        • Increase transparency, awareness, responsibility, protection
        Growing amount of data
        • Public access through EU directive 2003/4/EC
        • PortalU (Germany) http://www.portalu.de/
        • EDP (UK) http://www.edp.nerc.ac.uk
        • Envirofacts (USA) http://www.epa.gov/enviro/index.html
        Linking data in an international context
        • Local government databases of environmental data as part of the LOD cloud
        • Linked environment data for the life sciences
    30. Opportunity: mass dissemination and consumption of environmental data
        The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!
        Complex results
        • CO emission values around the Karlsruhe area in Germany
        Analytics
        • CO emission values around the Karlsruhe area in Germany: sorted by year, bar chart
        • Emission values of the US and Germany: compare average, timeline visualization
    31. KOIOS – Overview
        A semantic search system
        • Exploits semantics in the data for keyword interpretation to hide the complexity of query languages and data representation
        • Keyword search for searching structured data
        • Lowers access barriers while enabling the richness of the data to be fully harnessed
        Contribution
        • Transfer research results to commercial EIS
        • Selector mechanism
        Process
        • Input: keywords
        • Facet-based refinement
        • Selector (result and view template) initialization
        • Output: query results embedded in specific views
    32. KOIOS – Architecture
        [Figure: architecture diagram]
    33. Facets generation
        Derive facets from query results (not from the query!) for refinement
        • Attributes serve as facet categories
        • Attribute values serve as facet values
        E.g. for ?s
        • Statistics.description: "CO-Emission, PKW", "CO-Emission, LKW", …
        • Value.year: 2005, 2006, …
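
        Deriving facets from query results can be sketched in a few lines. The representation of results as a list of dictionaries from attribute names to values, and the function name, are assumptions made for illustration only.

```python
from collections import defaultdict

def generate_facets(results: list) -> dict:
    """Collect, per attribute (facet category), the distinct values in the results."""
    facets = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    # Sort the values so the facet lists are stable for display.
    return {attribute: sorted(values) for attribute, values in facets.items()}
```

        Applied to two result rows with years 2005 and 2006, the facet category `Value.year` offers exactly those two values for refinement.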
    34. Selectors
        Selector: parameterized, predefined result and view templates
        • Data parameters: specify the scope of the information need, initialized to particular values based on facet categories and values
        • Query parameters: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
        • Presentation parameters: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)
    35. Selector initialization
        Selectors
        • capture templates for information needs and the presentation of their results
        Map facets to selectors and initialize them
        • Applicable selectors: cover the facet categories
        • Initialize selectors based on facet values
        • Initialized values are captured in the WHERE clause
        • Non-initialized parameters are included in the SELECT clause
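
        The mapping from facet values to a selector's query can be illustrated with a small sketch. It assumes a selector is just a table name plus a list of data parameters, that it is applicable when its parameters cover all chosen facet categories, and that the target is a parameterized SQL string; the table name, parameter names and function name are hypothetical and do not reflect the actual KOIOS implementation.

```python
def initialize_selector(table: str, parameters: list, facet_values: dict) -> tuple:
    """Build a parameterized query from a selector template and chosen facet values.

    Parameters with a chosen facet value go into the WHERE clause;
    the remaining (non-initialized) parameters go into the SELECT clause.
    """
    if not set(facet_values) <= set(parameters):
        raise ValueError("selector does not cover all facet categories")
    select_columns = [p for p in parameters if p not in facet_values]
    where_columns = [p for p in parameters if p in facet_values]
    sql = f"SELECT {', '.join(select_columns) or '*'} FROM {table}"
    if where_columns:
        sql += " WHERE " + " AND ".join(f"{p} = ?" for p in where_columns)
    return sql, [facet_values[p] for p in where_columns]
```

        With parameters `["description", "year", "value"]` and the facet value `year = 2005`, the sketch yields `SELECT description, value FROM emissions WHERE year = ?` with the bound value `[2005]`.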
    36. Deployment
        Hippolytos project (Theseus)
        • Easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration
        Data about
        • Emission and waste
        • From Baden-Württemberg
        • Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg
    37. Facets and selectors
        [Figure: screenshot]
    38. Chart-based visualization
    39. Map-based visualization
    40. Conclusions
        Keyword search on structured data is a popular problem for which various solutions exist.
        We focus on the aspect of result ranking, providing a principled approach that employs relevance models.
        Experiments show that RMs are promising for searching structured data.
        Top-k query processing is proposed to retrieve only the most relevant results.
        Application on environmental data enables intuitive
        • Access
        • Visualization
        • Analysis
        of environmental information!
    41. Thank you for your attention! Questions?
    42. Opportunity: mass dissemination and consumption of environmental data
        Increase transparency, awareness, responsibility, protection
    43. Challenges: intuitive access and visualization of structured environmental data and analytics
        The percentage of people who actively find environmental information is significantly lower than the percentage of those with frequent access to it!
        Complex structured queries
        • Knowledge of the underlying data / query language
        Complex structured data
        • Heterogeneity and distribution of environmental data is overwhelming
        Complex structured results
        • Understanding results and extracting relevant information / analytics are difficult tasks
    44. KOIOS
        Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information
        • Overview and architecture
        • Structured query generation from keywords
        • Facet-based browsing and refinement
        • Selector initialization for final result and view construction
        • Implementation and deployment
        • Conclusions
    45. Conclusions
        Replace predefined forms and hard-coded visualization
        Semantic search using lightweight semantics in data and schema to dynamically
        • Translate keywords to queries
        • Generate facets for results
        • Initialize result and presentation templates
        Enables intuitive
        • Access
        • Visualization
        • Analysis
        of environmental information!
    46. Inverted Index
        • princess → m1, c1
        • breakfast → m3
        • hepburn → m3, p1, p4, c2
        • melbourne → p2
        • iris → c3
        • holiday → m1, m2, m3
        • breakfast → m3
        • ann → m1, c2
        • …
    47. Ranking Schemes (source: SIGMOD09 Tutorial)
        Proximity between keyword nodes
        • EASE: [formula not preserved in the transcript]
        • XRank: [formula not preserved in the transcript]
        • w is the smallest text window in n that contains all search keywords
    48. Ranking Schemes
        Based on graph structure
        • BANKS
          Nodes: [formula not preserved in the transcript]
          Edges: [formula not preserved in the transcript]
        • PageRank-like methods
          XRank [Guo et al, SIGMOD03]
          ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank
    49. Ranking Schemes
        TF*IDF based:
        • Discover/EASE
        • [Liu et al, SIGMOD06]
        • SPARK
        • but not at the node level
        $Score(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl / avdl} \cdot \ln\frac{N + 1}{df}$
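
        The TF*IDF score on this slide can be computed directly. The sketch below reads the symbols in the usual way (tf: frequency of w in node n, df: number of nodes containing w, N: number of nodes, dl and avdl: length of n and average node length, s: a normalization constant); these readings, the default `s = 0.2` and the function name are assumptions, since the slide does not define them.

```python
import math

def tfidf_score(query: list, node_tf: dict, df: dict, n_nodes: int,
                dl: float, avdl: float, s: float = 0.2) -> float:
    """Sum the normalized tf * idf weight over query words that occur in the node."""
    length_norm = (1.0 - s) + s * dl / avdl
    score = 0.0
    for word in set(query):
        tf = node_tf.get(word, 0)
        if tf == 0 or df.get(word, 0) == 0:
            continue  # only words in both the query and the node contribute
        term_weight = 1.0 + math.log(1.0 + math.log(tf))
        idf = math.log((n_nodes + 1) / df[word])
        score += term_weight / length_norm * idf
    return score
```

        For a node of average length that contains the query word once, with the word occurring in 1 of 9 nodes, the score reduces to ln(10).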
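    50. Relevance Models
        [Figure: query words q1, q2, q3 (israeli, palestinian, raids) and an unknown word w (???), each linked to a model M under "Relevance Model"]
        Sample probabilities
          w           | P(w|Q)
          palestinian | .077
          israel      | .055
          jerusalem   | .034
          protest     | .033
          raid        | .027
          clash       | .011
          bank        | .010
          west        | .010
          troop       | .010
        $P(w \mid q_1 \ldots q_k) = \frac{P(w)\, P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$
        $P(q_1 \ldots q_k \mid w) = \prod_{q} P(q \mid w), \quad P(q \mid w) = \sum_{M} P(q \mid M)\, P(M \mid w)$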
