Keyword Search on Structured Data using Relevance Models
 

  • Top-K queries are a long-studied topic in the database and information retrieval communities. The main objective of these queries is to return the K highest-ranked answers quickly and efficiently. A Top-K query returns the subset of most relevant answers, instead of all answers, for two reasons: (i) to minimize the cost associated with retrieving all answers (e.g., disk and network access), and (ii) to maximize the quality of the answer set, so that the user is not overwhelmed with irrelevant results.

Presentation Transcript

  • Keyword Search on Structured Data using Relevance Models*
    Veli Bicer, FZI Research Center for Information Technology (FZI Forschungszentrum Informatik), Karlsruhe, Germany
    Joint work with Thanh Tran from the Semantic Search Group, AIFB Institute, KIT
    * Based on the papers at the 20th ACM Conference on Information and Knowledge Management (CIKM’11) and the 10th International Semantic Web Conference (ISWC’11)
    © FZI Forschungszentrum Informatik
  • About the presenter (10.04.2012)
    Veli Bicer
      - Research Scientist at FZI Research Center for Information Technology, Karlsruhe, Germany
      - Associated Researcher at Karlsruhe Service Research Institute (KSRI)
      - KSRI founded by IBM Germany
    Research Interests
      - Semantic Data Management/Search
      - Relational Learning
      - Software Engineering (for Services)
    Projects
      - German Internet Research Programme THESEUS
      - KOIOS Semantic Search in Core Technology Cluster
      - TEXO Internet-of-Services Use-case
      - Previously, EU ICT Artemis, Satine, Saphire and Ride
  • Agenda
    Introduction
      - Keyword search on structured data
      - Relevance models
    Approach
      - Ranking scheme using relevance models
      - Top-k query processing
    Experiments
    Application
      - Search on environmental data
    Conclusion
  • Introduction
  • Keyword Search on Structured Data: Rationale
      - 4 billion web searches daily
      - Data-driven websites have a relational database backend
      - Predefined search forms constrain retrieval
      - SQL is difficult to learn
      - Goal: simplify data retrieval by not using SQL
  • Keyword Search on Structured Data: Example
    Who is the character played by Audrey Hepburn in Roman Holiday?
    Query result
      - A tree of tuples that is reduced with respect to the query.
    Person
      id   name
      p1   Audrey Hepburn
      p3   Kate Winslet
      …    …
    Character
      id   name            pid   mid
      c1   Princess Ann    p1    m1
      c3   Iris Simpkins   p3    m2
      …    …
    Movie
      id   title           plot
      m1   Roman Holiday   Princess Ann is a royal princess …
      m2   The Holiday     Iris swaps her cottage for the holiday along the next two …
      m3   The Aviator     Hughes and Hepburn go to a holiday and fly together …
      …    …
    Which would you rather write?
      SELECT Character.name
      FROM Person, Character, Movie
      WHERE Person.id = Character.pid
        AND Character.mid = Movie.id
        AND Person.name = 'Audrey Hepburn'
        AND Movie.title = 'Roman Holiday';
    or “Hepburn Holiday”
  • Keyword Search on Structured Data
    Many approaches have been proposed recently
      - Performance focus
      - Less consideration of ranking
    Recent study (Coffman and Weaver, CIKM 2010)
      - The effectiveness of previous work is below expectations
      - The problem lies in the ranking strategies, not in performance
    Two major types of ranking schemes:
      - IR-inspired TF-IDF ranking: (Liu et al, 2006), (SPARK, 2007)
      - Proximity-based approaches: (Banks, 2002), (Bidirectional, 2005)
    Problem:
      - A robust and principled approach is missing!
  • Relevance Models
    Proposed by Lavrenko and Croft (SIGIR 01)
    Assumes that
      - queries and documents are samples from a hidden representation space and
      - are generated from the same generative model
    The initial representation of relevance R is unknown
      - It is estimated from the query
    [Figure: three diagrams relating query Q, document D and relevance R, labelled Classical Model, Language Model and Relevance Model]
  • Approach
  • Overview of Approach
    [Figure: pipeline with nine steps]
      1 Query
      2 PRF
      3 Query RM
      4 Res. RM
      5 Res. Score: D(RM_Q || RM_R)
      6 Query Generation
      7 Structured Queries
      8 Top-k Query Proc.
      9 Result Ranking
    Example word probabilities p:
      word        Query   Query RM   Res. RM
      hepburn     0.5     0.21       0.12
      holiday     0.5     0.15       0.18
      audrey              0.13       0.11
      katharine           0.09       0.05
      princess            0.01       0.00
      roman               0.01       0.06
      …                   …          …
    Example PRF results:
      Title               Name
      Roman Holiday       Audrey Hepburn
      Breakfast at Tiff.  Audrey Hepburn
      The Aviator         Katharine Hepburn
      The Holiday         Kate Winslet
  • Data Model
    Different kinds of data
      - e.g. relational, XML and RDF data
    Data graph of nodes and edges (G = (V, E))
    Resource nodes, attribute nodes
      - Every resource is typed
      - Resources have unique ids (e.g. primary keys)
  • Edge-Specific Relevance Models (steps 1, 2, 3)
    A set of feedback resources FR is retrieved from an inverted keyword index:
      - E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, c2, m3}
    Edge-specific relevance model for each unique edge e: [formula image not in transcript; it combines the probability of a word at a resource with the importance of the resource w.r.t. the query]
    Inverted index:
      princess → m1, c1
      breakfast → m3
      hepburn → m3, p1, p4, c2
      melbourne → p2
      iris → c3
      holiday → m1, m2, m3
      ann → m1, c2
      …
    Feedback resources with their edges:
      p1: name = Audrey Hepburn; birthplace = Ixelles Belgium
      m3: title = The Holiday; plot = Iris swaps her cottage for the holiday along the next two …
      …
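The retrieval of feedback resources on the slide above can be sketched in a few lines. This is only an illustration: the index contents are copied from the slide, while the function name `feedback_resources` and the choice to take the union of the postings are assumptions (the union is what reproduces the slide's FR for Q = {Hepburn, Holiday}).

```python
from typing import Dict, List, Set

# Toy inverted index copied from the slide (keyword -> resource ids).
INVERTED_INDEX: Dict[str, Set[str]] = {
    "princess": {"m1", "c1"},
    "breakfast": {"m3"},
    "hepburn": {"m3", "p1", "p4", "c2"},
    "melbourne": {"p2"},
    "iris": {"c3"},
    "holiday": {"m1", "m2", "m3"},
    "ann": {"m1", "c2"},
}


def feedback_resources(query: List[str], index: Dict[str, Set[str]]) -> Set[str]:
    """Return the union of the postings of every query keyword (hypothetical helper)."""
    resources: Set[str] = set()
    for keyword in query:
        # Keywords are matched case-insensitively; unknown keywords contribute nothing.
        resources |= index.get(keyword.lower(), set())
    return resources


fr = feedback_resources(["Hepburn", "Holiday"], INVERTED_INDEX)
print(sorted(fr))
```

The result contains the same six resources as the FR set listed on the slide.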
  • Edge-Specific Resource Models (steps 4, 5)
    Each resource (a tuple) is also represented as an RM
      - … as final results (joint tuples) are obtained by combining resources
    Edge-specific resource model: [formula image not in transcript]
    The score of a resource: cross-entropy of the edge-specific RM and ResM: [formula image not in transcript]
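Since the slide's scoring formula is not preserved, the following is only a minimal sketch of a cross-entropy between a query relevance model and a resource model over a shared vocabulary. The function name, the `eps` floor for unseen words, and the example distributions are all assumptions for illustration, not values from the paper.

```python
import math
from typing import Dict


def cross_entropy(query_rm: Dict[str, float],
                  resource_rm: Dict[str, float],
                  eps: float = 1e-9) -> float:
    """H(RM_Q, ResM) = -sum_v P_Q(v) * log P_R(v); lower means a closer match.

    eps is an assumed floor that avoids log(0) for words the resource model
    has never seen; a smoothed model would not need it.
    """
    return -sum(p * math.log(max(resource_rm.get(v, 0.0), eps))
                for v, p in query_rm.items())


# Hypothetical distributions, chosen only to show the behaviour.
query_rm = {"hepburn": 0.5, "holiday": 0.5}
close_resource = {"hepburn": 0.4, "holiday": 0.4, "audrey": 0.2}
distant_resource = {"hepburn": 0.1, "holiday": 0.05, "winslet": 0.85}

score_close = cross_entropy(query_rm, close_resource)
score_distant = cross_entropy(query_rm, distant_resource)
print(score_close < score_distant)
```

Under this sign convention a resource whose model puts more mass on the query's words receives the lower (better) value.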
  • Smoothing
    Well-known technique to address data sparseness and improve the accuracy of RMs (and LMs)
      - P_a(v | r) is the core probability for both query and resource RM
    Local smoothing
    The neighborhood of attribute a is another attribute a’ such that:
      - a and a’ share the same resource
      - the resources of a and a’ are of the same type
      - the resources of a and a’ are connected over a FK
    [Figure: neighborhood of a]
  • Smoothing
    [Figure: data graph. p1 (type Person): name = Audrey Hepburn, birthplace = Ixelles Belgium. p4 (type Person): name = Katharine Hepburn, birthplace = Connecticut USA. c1 (type Character): name = Princess Ann, connected to p1 via pid_fk.]
    P_name(v | p1), unsmoothed and after adding each neighborhood type in turn:
      word          unsmoothed   + same resource   + same type   + FK-connected
      audrey        0.5          0.4               0.37          0.36
      hepburn       0.5          0.4               0.39          0.38
      ixelles                    0.1               0.09          0.08
      belgium                    0.1               0.09          0.08
      katharine                                    0.02          0.01
      connecticut                                  0.02          0.01
      usa                                          0.02          0.01
      princess                                                   0.035
      ann                                                        0.035
    Smoothing of each type is controlled by weights: [formula image not in transcript], where γ1, γ2, γ3 are control parameters set in experiments
  • Ranking JRTs (step 9)
    Ranking aggregated JRTs:
      - Cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResM: [formula image not in transcript]
    The proposed score is monotonic w.r.t. individual resource scores
      - … a desired property for most top-k algorithms
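The monotonicity claim can be checked with a short derivation. The notation is assumed, because the slide's formula is not preserved: $P_Q$ stands for the edge-specific query RM, $P_{r_1},\dots,P_{r_n}$ for the resource models of the $n$ resources in a JRT, and the geometric mean is taken without renormalization.

```latex
\begin{align*}
\bar{P}(v) &= \Bigl(\prod_{i=1}^{n} P_{r_i}(v)\Bigr)^{1/n} \\
H(P_Q, \bar{P}) &= -\sum_{v} P_Q(v)\,\log \bar{P}(v) \\
  &= \frac{1}{n}\sum_{i=1}^{n}\Bigl(-\sum_{v} P_Q(v)\,\log P_{r_i}(v)\Bigr)
   = \frac{1}{n}\sum_{i=1}^{n} H(P_Q, P_{r_i})
\end{align*}
```

Under these assumptions the JRT score is the average of the individual resource scores, so improving the score of any single resource can never worsen the aggregate.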
  • Query Translation* (steps 6, 7)
    Mapping of keywords to data elements
      - Results in a set of keyword elements
    Data graph exploration
      - Search for substructures (query graphs) connecting keyword elements
      - Bi-directional exploration of query graphs operates on a summary of the data graph only
    Top-k computation
      - Search guided by a scoring function to output only the top-k queries
    Query graphs to be processed
      - Free vs. non-free variables
    [Figure: the keywords Hepburn and Holiday map to name and title attributes of p4, p1, m1 and m3; the summary graph contains Person, Character, Movie, Location, Producer and Studio with edges such as pid_fk, mid_fk, bornIn, is-a, hasDist, hasLoc and worksFor; the resulting query graph binds ?p (type Person, name Hepburn), ?c (type Character) and ?m (type Movie, title Holiday) via pid_fk and mid_fk]
    * [Tran et al. ICDE’09]
  • Top-k Query Processing (step 8)
    Top-k query processing (TQP) is highly common in Web-accessible databases
      - Return the K highest-ranked answers
      - Avoid unnecessary accesses to the database
    TQP assumes
      - Scoring function and attribute values to be known a priori (e.g. RankJoin)
      - Attribute values are combined by an aggregation function
      - Sorted access (SA) and random access (RA) probes
    How to adapt TQP to return the top-k relevant results?
      - Results are joined sets of resources
      - Scores are query-dependent
      - No indexing is possible
    Idea:
      - Retrieve resources for non-free variables and rank them
      - Use SA on those initially retrieved resources
      - Use RA to find other resources
  • Top-k Query Processing
    Result candidate c = <(x1,…,xk), score>
      - complete when all variables are bound to some resources
      - xi = * indicates an unbound variable
    Binding operator
      - c’ = (c, xi → ri)
    Threshold determines the upper bound for unseen resources
      - Scheduling between SA and RA
      - A tight bound is desired
    Query graph: ?p (type Person, name Hepburn), ?c (type Character), ?m (type Movie, title Holiday), joined via pid_fk and mid_fk
    Person
      id   name                S(r)
      p1   Audrey Hepburn      0.20
      p3   Katharine Hepburn   0.18
      p5   Philip Hepburn      0.13
      p6   Anna Hepburn        0.12
    Character (bound for unscored resources: 0.11)
      id   name                S(r)
      c1   Princess Ann
      c2   Katharine Hepburn
      c3   Iris Simpkins
      c4   Louise
    Movie
      id   title               S(r)
      m2   The Holiday         0.19
      m1   Roman Holiday       0.18
      m3   Holiday Blues       0.09
      m4   Family Holiday      0.08
    Threshold: 0.50
    Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>
    Output (K=1): none yet
  • Top-k Query Processing (continued)
    Threshold: 0.48
    Priority queue: <(p1,*,*),0.50>, <(*,*,m2),0.50>, <(p3,*,*),0.48>
    Character bound for unscored resources: 0.11
    Output (K=1): none yet
  • Top-k Query Processing (continued)
    Threshold: 0.47
    Priority queue: <(*,*,m2),0.50>, <(p1,c1,*),0.49>, <(p3,*,*),0.48>
    Newly scored: c1 Princess Ann, S(r) = 0.10; Character bound for unscored resources: 0.10
    Output (K=1): none yet
  • Top-k Query Processing (continued)
    Threshold: 0.46
    Priority queue: <(p1,c1,*),0.49>, <(p3,*,*),0.48>, <(*,c3,m2),0.44>
    Newly scored: c3 Iris Simpkins, S(r) = 0.05; Character bound for unscored resources: 0.09
    Output (K=1): none yet
  • Top-k Query Processing (continued)
    Threshold: 0.46
    Priority queue: <(p3,*,*),0.48>, <(*,c3,m2),0.44>
    Character bound for unscored resources: 0.09
    Output (K=1): <(p1,c1,m1),0.48>
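The candidate scores in the walkthrough above can be reproduced with a small sketch. It assumes, based only on the numbers shown on the slides, that a candidate's upper bound is the sum of the scores of its bound resources plus the best score still attainable for each unbound variable. The names `SCORES`, `BEST`, `upper_bound` and `is_complete` are hypothetical, and the threshold scheduling between SA and RA is not modelled.

```python
from typing import Dict, Optional, Tuple

# Scores S(r) read off the slide tables (higher is better).
SCORES: Dict[str, float] = {
    "p1": 0.20, "p3": 0.18, "p5": 0.13, "p6": 0.12,
    "c1": 0.10, "c3": 0.05,
    "m2": 0.19, "m1": 0.18, "m3": 0.09, "m4": 0.08,
}

# Best score still attainable for (?p, ?c, ?m). The 0.11 for ?c is the
# slide's initial bound for Character resources that are not yet scored.
BEST: Tuple[float, float, float] = (0.20, 0.11, 0.19)

# None plays the role of '*' (unbound variable).
Candidate = Tuple[Optional[str], Optional[str], Optional[str]]


def upper_bound(candidate: Candidate,
                best: Tuple[float, float, float] = BEST) -> float:
    """Assumed additive bound: known scores for bound variables, best case otherwise."""
    return sum(SCORES[r] if r is not None else b
               for r, b in zip(candidate, best))


def is_complete(candidate: Candidate) -> bool:
    """A candidate is complete when every variable is bound to a resource."""
    return all(r is not None for r in candidate)


# Order a queue of partial candidates by their upper bounds, best first.
queue = [("p3", None, None), ("p1", None, None), (None, None, "m2")]
queue.sort(key=upper_bound, reverse=True)
for cand in queue:
    print(cand, f"{upper_bound(cand):.2f}", "complete" if is_complete(cand) else "partial")
```

With these inputs the sketch yields 0.50 for (p1,*,*), 0.49 for (p1,c1,*), 0.44 for (*,c3,m2) and 0.48 for the complete candidate (p1,c1,m1), matching the values on the slides.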
  • Experiments
  • Experiments
    Datasets: subsets of the Wikipedia, IMDB and Mondial Web databases
    Queries: 50 queries for each dataset, including “TREC style” queries and “single resource” queries
    Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) Mean Average Precision (MAP)
    Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
    RM-S: our approach
  • Experiments
    [Figures: MAP scores for all queries; reciprocal rank for single resource queries]
  • Experiments
    [Figure: precision-recall for TREC-style queries on Wikipedia]
  • Application
  • Large amount of environmental data
    Environmental issues stir public interest
      - Increase transparency, awareness, responsibility, protection
    Growing amount of data
      - Public access through EU directive 2003/4/EC
      - PortalU (Germany) http://www.portalu.de/
      - EDP (UK) http://www.edp.nerc.ac.uk
      - Envirofacts (USA) http://www.epa.gov/enviro/index.html
    Linking data in an international context
      - Local government environmental databases as part of the LOD cloud
      - Linked environment data for the life sciences
  • Opportunity: mass dissemination and consumption of environmental data
    The percentage of people who actively look for environmental information is significantly lower than the percentage of those with frequent access to it!
    Complex results
      - CO emission values around the Karlsruhe area in Germany
    Analytics
      - CO emission values around the Karlsruhe area in Germany
          - Sorted by year
          - Bar chart
      - Emission values of the US and Germany
          - Compare average
          - Timeline visualization
  • KOIOS – Overview
    A semantic search system
      - Exploits semantics in the data for keyword interpretation, to hide the complexity of query languages and data representation
      - Keyword search for searching structured data
      - Lowers access barriers while enabling the richness of the data to be fully harnessed
    Contribution
      - Transfer research results to commercial EIS
      - Selector mechanism
    Process
      - Input: keywords
      - Facet-based refinement
      - Selector (result and view template) initialization
      - Output: query results embedded in specific views
  • KOIOS – Architecture
    [Figure: architecture diagram]
  • Facets generation
    Derive facets from query results (not from the query!) for refinement
      - Attributes serve as facet categories
      - Attribute values serve as facet values
    E.g. for ?s
      - Statistics.description: “CO-Emission , PKW”, “CO-Emission , LKW”, …
      - Value.year: 2005, 2006, …
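Deriving facets from query results, as described on the slide above, can be illustrated with a minimal sketch. The row layout (one dictionary per result, keyed by attribute name) and the function name `generate_facets` are assumptions; the attribute names and values are taken from the slide's example.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


def generate_facets(results: List[Dict[str, Any]]) -> Dict[str, Set[Any]]:
    """Attributes become facet categories; their distinct values become facet values."""
    facets: Dict[str, Set[Any]] = defaultdict(set)
    for row in results:
        for attribute, value in row.items():
            facets[attribute].add(value)
    return dict(facets)


# Hypothetical result rows built from the slide's example values.
rows = [
    {"Statistics.description": "CO-Emission , PKW", "Value.year": 2005},
    {"Statistics.description": "CO-Emission , PKW", "Value.year": 2006},
    {"Statistics.description": "CO-Emission , LKW", "Value.year": 2005},
]

facets = generate_facets(rows)
for category, values in sorted(facets.items()):
    print(category, sorted(values))
```

Because the facets are computed from the rows actually returned, every facet value offered for refinement is guaranteed to match at least one result.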
  • Selectors
    Selector: parameterized, predefined result and view templates
      - Data parameters: specify the scope of the information need; initialized to particular values based on facet categories and values
      - Query parameter: additional data processing for analysis tasks (GROUP-BY, SORT, MIN, MAX, AVERAGE etc.)
      - Presentation parameter: visualization types (data value, data series, data table, map-based, specific diagram type, etc.)
  • Selector initialization
    Selectors
      - capture templates for information needs and the presentation of their results
    Map facets to selectors and initialize them
      - Applicable selectors: cover facet categories
      - Initialize selectors based on facet values
          - Initialized values are captured in the WHERE clause
          - Non-initialized parameters are included in the SELECT clause
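The last two bullets above describe how a selector turns into a query. A rough sketch of that rule follows, with an invented function name, table name and parameter names, since the slides do not show the actual KOIOS templates. Values are passed as `?` placeholders rather than spliced into the string.

```python
from typing import Any, Dict, List, Tuple


def initialize_selector(table: str,
                        parameters: List[str],
                        facet_values: Dict[str, Any]) -> Tuple[str, List[Any]]:
    """Build a parameterized query from a selector (hypothetical helper).

    Parameters with a chosen facet value go into the WHERE clause;
    the remaining parameters are returned via the SELECT clause.
    """
    bound = [p for p in parameters if p in facet_values]
    free = [p for p in parameters if p not in facet_values]
    sql = f"SELECT {', '.join(free) if free else '*'} FROM {table}"
    if bound:
        sql += " WHERE " + " AND ".join(f"{p} = ?" for p in bound)
    return sql, [facet_values[p] for p in bound]


# 'emissions' and the column names are illustrative assumptions.
sql, args = initialize_selector(
    "emissions",
    ["description", "year", "value"],
    {"description": "CO-Emission , PKW"},
)
print(sql)
print(args)
```

Here the facet choice fixes `description`, so only `year` and `value` remain to be selected and displayed by the view template.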
  • Deployment
    Hippolytos project (Theseus)
      - Easy access to a spatial data warehouse (disy Cadenza) built for the domain of environmental administration
    Data about
      - Emission and waste
      - From Baden-Württemberg
      - Provided by: Umweltinformationssystem (UIS) Baden-Württemberg, Landesamt für Geoinformation und Landentwicklung (LGL) Baden-Württemberg and Statistisches Landesamt Baden-Württemberg
  • Facets and selectors
    [Figure: screenshot of facets and selectors]
  • Chart-based visualization
  • Map-based visualization
  • Conclusions
    Keyword search on structured data is a popular problem for which various solutions exist.
    We focus on the aspect of result ranking, providing a principled approach that employs relevance models.
    Experiments show that RMs are promising for searching structured data.
    Top-k query processing is proposed to retrieve only the most relevant results.
    Application on environmental data enables intuitive
      - Access
      - Visualization
      - Analysis
    of environmental information!
  • Thank you for your attention! Questions?
  • Opportunity: mass dissemination and consumption of environmental data
    Increase transparency, awareness, responsibility, protection
  • Challenges: intuitive access and visualization of structured environmental data and analytics
    The percentage of people who actively look for environmental information is significantly lower than the percentage of those with frequent access to it!
    Complex structured queries
      - Require knowledge of the underlying data / query language
    Complex structured data
      - Heterogeneity and distribution of environmental data is overwhelming
    Complex structured results
      - Understanding results and extracting relevant information / analytics are difficult tasks
  • KOIOS
    Semantic search system, KOIOS, for intuitive access, analysis, and visualization of structured environmental information
      - Overview and architecture
      - Structured query generation from keywords
      - Facet-based browsing and refinement
      - Selector initialization for final result and view construction
      - Implementation and deployment
      - Conclusions
  • Conclusions
    Replace predefined forms and hard-coded visualization
    Semantic search using lightweight semantics in data and schema to dynamically
      - Translate keywords to queries
      - Generate facets for results
      - Initialize result and presentation templates
    Enables intuitive
      - Access
      - Visualization
      - Analysis
    of environmental information!
  • Inverted Index (04.04.2011)
      princess → m1, c1
      breakfast → m3
      hepburn → m3, p1, p4, c2
      melbourne → p2
      iris → c3
      holiday → m1, m2, m3
      ann → m1, c2
      …
  • Ranking Schemes
    Proximity between keyword nodes
      - EASE: [formula image not in transcript]
      - XRank: [formula image not in transcript]
          - w is the smallest text window in n that contains all search keywords
    2012-4-10, SIGMOD09 Tutorial
  • Ranking Schemes
    Based on graph structure
      - BANKS
          - Nodes: [formula image not in transcript]
          - Edges: [formula image not in transcript]
      - PageRank-like methods
          - XRank [Guo et al, SIGMOD03]
          - ObjectRank [Balmin et al, VLDB04]: considers both Global ObjectRank and Keyword-specific ObjectRank
  • Ranking Schemes
    TF*IDF based:
    $Score(n, Q) = \sum_{w \in Q \cap n} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s \cdot dl/avdl} \cdot \ln\frac{N + 1}{df}$
      - Discover/EASE
      - [Liu et al, SIGMOD06]
      - SPARK
      - but not at the node level
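  • Relevance Models
    [Figure: the query words q1 = israeli, q2 = palestinian, q3 = raids are sampled from models M; the probability of a further word w (???) is to be estimated]
    Relevance model sample probabilities:
      P(w|Q)   w
      .077     palestinian
      .055     israel
      .034     jerusalem
      .033     protest
      .027     raid
      .011     clash
      .010     bank
      .010     west
      .010     troop
      …
    $P(w \mid q_1 \ldots q_k) = \frac{P(w)\,P(q_1 \ldots q_k \mid w)}{P(q_1 \ldots q_k)}$
    $P(q_1 \ldots q_k \mid w) = \prod_{q} P(q \mid w), \qquad P(q \mid w) = \sum_{M} P(q \mid M)\,P(M \mid w)$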