Published in: Technology, Design

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
50
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
1
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. KEYWORDS SEARCH ON STRUCTURED DATABASE Xiaoyu Chen, Min Li, Yihan Gao, Tianning Xu
  • 2. Introduction  Structured data  Schema as a summary of the data  Retrieve through structured language  What would big data bring to structured data retrieval?
  • 3. Introduction  In terms of high volume of data  Hadoop + Pig Latin came to the rescue  However, is this enough?  Recall how you write a selection. What do you need to know?  Can you remember this?
  • 4. Introduction  Big data -> big and complicated schema  Hard to remember and operate!  May not even fit in main memory!  What should we do about it?  How does information retrieval deal with this?
  • 5. Introduction  Search based on keywords  No need for schema  Efficiency guaranteed using index  All seems to be straightforward and easy  What are the challenges?
  • 6. Introduction  Search for “Apple + company”  Match to “apple (fruit)”, “Apple Inc.”, “Adam’s apple”  Which one is correct? How to filter?  Challenge1: Filtering and disambiguation
  • 7. Introduction  Search for “Steve Jobs + Apple”  Normalization. What to return?  Tables: (ID, Name, Gender, Employer, Location); (ID, Company, Location, Type, Product); (ID, Street, City, State, Country)  Challenge2: Automatic join back
  • 8. Introduction  Search for “Jordan”  Match “Jordan (brand)”, “Michael Jordan (player)”, “Michael Jordan (professor)” etc.  All of them should match. Which one is better?  Ranking  Challenge3: Ranking of the result
  • 9. Literature Overview  Two kinds of approaches  1. Interpretative approach  Reuse database query language and index  Translate the keywords into queries  Will introduce 3 papers  2. Un-interpretative approach (focus)  Typically build own index and data structure  Model as graph and use graph-based analysis  Will introduce 3 papers
  • 10. Literature Overview – Interpretative approach  DBXplorer Sanjay Agrawal et al.  General: two steps  Publish step: pre-computation, indexing etc.  Search step: lookup, enumerate over join tree, generate SQL etc.  Efficiency:  Symbol table (index) design  Symbol table compaction
  • 11. Literature Overview – Interpretative approach  Publish step:  1: A database is identified, along with the set of tables and columns within the database to be published.  2: Auxiliary tables are created for supporting keyword searches, e.g. an index table  But how to build an efficient index?
  • 12. Literature Overview – Interpretative approach  Index goal: for each keyword, find the row_id and column_id where it occurs.  If the column (attribute) already has an index, we need only a column_id index (reuse the database index)  [Figure: table with columns ID, Name, Gender, Addr, Org and rows 1, 2, 3, labelled “Column index” and “Row index”]
  • 13. Literature Overview – Interpretative approach  Compress index table  Foreign key constraint etc.  General Algorithm -- CP-Comp  [Figure: Sells table (Name, Product, …) and Person table (Name, Gender, …); Table 1. Compressed table; Table 2. Uncompressed table]
  • 14. Literature Overview – Interpretative approach  Search step  Step 1: Look up the index to find the columns/rows of the database that contain the query keywords.  Step 2: All potential subsets of tables in the database that, if joined, might contain rows having all keywords, are identified and enumerated as join trees.  Step 3: For each enumerated join tree, a SQL statement is constructed (and executed) that joins the tables in the tree and selects those rows that contain all keywords. The final rows are ranked and presented to the user.
  • 15. Literature Overview – Interpretative approach  Join Tree example:
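The publish and search steps above can be sketched in a few lines of Python. This is a minimal illustration, not DBXplorer's implementation: the symbol table contents, the table and column names, and the helper names (`join_path`, `generate_sql`) are all hypothetical, and the join tree is approximated by the union of shortest foreign-key paths from the first matched table. Real code should also pass keywords as query parameters instead of splicing them into the SQL text.

```python
from collections import deque
from itertools import product

# Hypothetical symbol table (built in the publish step): keyword -> columns containing it.
SYMBOL_TABLE = {
    "jobs": [("person", "name")],
    "apple": [("company", "name"), ("product", "name")],
}

# Hypothetical schema graph: each foreign key is an edge with its join condition.
FOREIGN_KEYS = {
    ("person", "company"): "person.employer_id = company.id",
    ("company", "product"): "company.id = product.company_id",
}

def neighbors(table):
    """Tables directly joinable with `table` through a foreign key."""
    for a, b in FOREIGN_KEYS:
        if a == table:
            yield b
        elif b == table:
            yield a

def join_path(start, goal):
    """Breadth-first search for the shortest chain of tables from start to goal."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def join_condition(a, b):
    return FOREIGN_KEYS.get((a, b)) or FOREIGN_KEYS[(b, a)]

def generate_sql(keywords):
    """Return one SQL statement per way of assigning every keyword to a column."""
    matches = [SYMBOL_TABLE.get(k, []) for k in keywords]  # step 1: index lookup
    statements = []
    for assignment in product(*matches):                   # step 2: enumerate
        root = assignment[0][0]
        tables, joins, connected = [root], [], True
        for table, _column in assignment[1:]:
            path = join_path(root, table)
            if path is None:
                connected = False
                break
            for a, b in zip(path, path[1:]):
                if b not in tables:
                    tables.append(b)
                cond = join_condition(a, b)
                if cond not in joins:
                    joins.append(cond)
        if not connected:
            continue
        filters = [f"{t}.{c} LIKE '%{k}%'" for k, (t, c) in zip(keywords, assignment)]
        where = " AND ".join(joins + filters)              # step 3: build the SQL
        statements.append(f"SELECT * FROM {', '.join(tables)} WHERE {where}")
    return statements
```

With the sample data, `generate_sql(["jobs", "apple"])` yields two statements: one joining `person` and `company`, and one joining all three tables for the case where "apple" matches a product name.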
  • 16. Literature Overview – Interpretative approach  Keyword Search in Databases: The Power of RDBMS  Lu Qin et al.  SIGMOD 09
  • 17. Integrating IR and DB  DB techniques provide users with efficient ways to access structured data in RDBMSs  IR techniques allow users to use keywords to access unstructured data  E.g. structural keyword search finds how tuples that contain keywords in an RDB are interconnected (the structure); three types:
  • 18. Schema-based approach Connected Tree Semantics: query results in minimal total joining network of tuples; adjacent tuples joined by foreign key reference, #tuples <= Tmax
  • 19. Connected Tree Semantics  1. Candidate Network (CN) generation: relational algebra expressions that create trees with all keywords up to a certain size  2. CN evaluation: evaluates generated CNs using SQL
  • 20. Schema-based approach Distinct Root Semantics: query results in collection of tuples all reachable from root; root uniquely defines tuples, distance(any tuple, root) <= Dmax
  • 21. Schema-based approach Distinct Core Semantics: query results in multi-center subgraphs (communities); keyword tuples uniquely define a community, distance(any keyword tuple, any center tuple) <= Dmax
  • 22. Distinct Core/Root Semantics  1. Create pairs between each tuple containing a keyword and every other tuple, recording the shortest distance between them  2. Generate graphs using SQL with distinct cores/roots
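The distinct root idea can be illustrated with a small sketch. The slides describe doing this inside the RDBMS with SQL; the version below instead uses plain Python breadth-first search, treats the tuple graph as undirected for simplicity, and assumes at least one keyword. The tuple identifiers and the function names are hypothetical.

```python
from collections import deque

def distances_from(sources, adjacency, d_max):
    """Multi-source BFS: hop count from the nearest source tuple, capped at d_max."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == d_max:
            continue  # do not expand beyond the distance bound
        for v in adjacency.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def distinct_roots(keyword_tuples, adjacency, d_max):
    """Map every root within d_max of all keywords to its summed distance."""
    per_keyword = [distances_from(t, adjacency, d_max) for t in keyword_tuples.values()]
    roots = set(per_keyword[0])
    for dist in per_keyword[1:]:
        roots &= set(dist)
    return {r: sum(dist[r] for dist in per_keyword) for r in roots}

# Hypothetical tuple graph: author a1 - writes w1 - paper p1 - writes w2 - author a2.
adjacency = {
    "a1": ["w1"], "w1": ["a1", "p1"], "p1": ["w1", "w2"],
    "w2": ["p1", "a2"], "a2": ["w2"],
}
keyword_tuples = {"sudarshan": ["a1"], "soumen": ["a2"]}
print(distinct_roots(keyword_tuples, adjacency, d_max=2))
```

With Dmax = 2 only the paper tuple reaches both authors, so it is the single root; lowering Dmax to 1 leaves no answer.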
  • 23. Literature Overview – Interpretative approach  Keyword search over relational databases: a metadata approach.  Bergamaschi et al.  SIGMOD 11
  • 24. Problem Definition  A database D is a collection of relational tables. Each relational table contains its name, attributes and value domains. All these elements together form the vocabulary.  A keyword query q is an ordered list of keywords. Each keyword specifies an element of interest.  A configuration of a keyword query on a database is an injective mapping from the keywords to the vocabulary of the database  Task: First derive the top configurations based on some metrics and then interpret each one as a SQL query (select-project-join interpretations)
  • 25. From Keywords to Queries  Need to consider inter-dependency of the query keywords: Introduce two different kinds of weights: the intrinsic weights, and the contextual weights  Need to give a ranked list of all the configurations: Develop an algorithm that is based on and extends the Hungarian (a.k.a. Munkres) algorithm  Need to separate the process of evaluating the schema terms and value terms: Evaluate the value weights based on the schema mapping
  • 26. Contributions and Insights  Formally define the problem of keyword querying over relational databases that lack a-priori access to the database instance  Introduce the notion of a weight as a measure of the likelihood that the semantics of a keyword are represented by a database structure. Need to consider both intrinsic weights and contextual weights  Extend and exploit the Hungarian (a.k.a., Munkres) algorithm to generate a ranking of different interpretations.
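To make the notion of a configuration concrete, here is a small sketch that ranks injective keyword-to-term mappings by summed intrinsic weight. It deliberately does not implement the extended Hungarian algorithm from the slides: it enumerates every mapping by brute force, which is only viable for tiny vocabularies, and it ignores contextual weights. The weights (on an arbitrary 0 to 10 scale), the vocabulary terms and the function name are all hypothetical.

```python
from itertools import permutations

def rank_configurations(keywords, vocabulary, weight, top_k=3):
    """Score every injective keyword -> term mapping; return the best top_k."""
    scored = []
    # permutations() never repeats a term, so each mapping is injective.
    for terms in permutations(vocabulary, len(keywords)):
        score = sum(weight.get((k, t), 0) for k, t in zip(keywords, terms))
        scored.append((score, dict(zip(keywords, terms))))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical intrinsic weights: how likely a keyword refers to a database term.
weight = {
    ("jobs", "person.name"): 9,
    ("jobs", "company.name"): 1,
    ("apple", "company.name"): 8,
    ("apple", "product.name"): 6,
}
vocabulary = ["person.name", "company.name", "product.name"]

for score, config in rank_configurations(["jobs", "apple"], vocabulary, weight):
    print(score, config)
```

Each returned configuration would then be interpreted as a select-project-join query over the mapped terms.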
  • 27. Literature Overview  Two kinds of approaches  1. Interpretative approach  Reuse database query language and index  Translate the keywords into queries  2. Un-interpretative approach  Typically build own index and data structure  Model as graph and use graph-based analysis
  • 28. Literature Overview – Un-interpretative approach  Effective Keyword Search in Relational Databases  Fang Liu et al.  SIGMOD 06
  • 29. Difficulties of Keyword Search  Keyword search in text databases only needs to compute a score for each document  Keyword search on an RDBMS is more complicated (relations, attributes, tuples):  1. Generate tuple trees (answers) by joining tuples from different tables  2. Rank the answers by computing a score
  • 30. Generate Answer Tuple Trees  Tuple tree answer rules: 1. Each leaf node in a tuple tree must contain at least one keyword 2. Each tuple appears at most once in the tree  Separate tuples into tuple sets that contain keywords and tuple sets that contain all tuples for each relation; join adjacent sets from the schema graph within the constraints of answer trees
  • 31. Ranking Tuple Trees  Treat the text of each tuple within an answer set as a “document”  Assign similarity rating between each document and query, normalizing for:  Term Frequency  Document Frequency  Document Length  Compute score for tuple tree as average over all documents
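The ranking idea above can be sketched as follows. The exact normalization formulas from the paper are not reproduced in the slides, so this uses a simplified TF-IDF weight with a plain length ratio as a stand-in; the statistics (`doc_freq`, `n_docs`, `avg_len`) and the function names are hypothetical.

```python
import math

def tuple_score(text, query, doc_freq, n_docs, avg_len):
    """Simplified TF-IDF similarity between one tuple's text and the query."""
    words = text.lower().split()
    if not words:
        return 0.0
    length_norm = len(words) / avg_len           # document length normalization
    score = 0.0
    for term in query:
        tf = words.count(term)
        if tf == 0:
            continue
        idf = math.log((n_docs + 1) / doc_freq.get(term, 1))  # document frequency
        score += (1 + math.log(tf)) * idf / length_norm       # damped term frequency
    return score

def tree_score(tuple_texts, query, doc_freq, n_docs, avg_len):
    """Score a tuple tree as the average score of its tuples ("documents")."""
    scores = [tuple_score(t, query, doc_freq, n_docs, avg_len) for t in tuple_texts]
    return sum(scores) / len(scores)

# Hypothetical collection statistics and two candidate answer trees.
doc_freq, n_docs, avg_len = {"jobs": 2, "apple": 4}, 100, 2
direct = ["steve jobs", "apple inc"]
indirect = ["steve jobs", "employment record 2005", "apple inc"]
query = ["jobs", "apple"]
print(tree_score(direct, query, doc_freq, n_docs, avg_len))
print(tree_score(indirect, query, doc_freq, n_docs, avg_len))
```

Because the score is an average, the tree that needs an extra connecting tuple without any keyword scores lower than the direct one, which is one way larger answers end up ranked below tighter ones.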
  • 32. Focused work  Keyword Searching and Browsing in Databases using BANKS  Gaurav Bhalotia et al.  ICDE 02
  • 33. BANKS (Browsing And Keyword Searching)  A system which enables keyword-based search on relational databases, together with data and schema browsing  [Figure: User -- HTTP -- BANKS System -- JDBC -- Database]
  • 34. Database and Query Model  Relational Database -> Directed Graph  Each Tuple in Database -> Node in Graph  Foreign Key -> Directed Edge
  • 35. Database and Query Model
  • 36. Database and Query Model  An answer to a query should be a subgraph connecting nodes matching the keywords.  The importance of a link depends upon the type of the link, i.e. what relations it connects, and on its semantics  Ignoring directionality would cause problems because of “hubs” which are connected to a large number of nodes.
  • 37. Database and Query Model  We may restrict the information node to be from a selected set of nodes of the graph  We incorporate another interesting feature, namely node weights, inspired by prestige rankings  Node weights and tree weights need to be combined to get an overall relevance score
  • 38. Formal Model  Node weight: N(u)  Depends on the prestige  Set the node prestige = the in-degree of the node  Nodes that have multiple pointers to them get a higher prestige
  • 39. Formal Model  Edge weights  Some popular tuples can be connected to many other tuples  Each edge has a forward and a backward edge weight  Weight of a forward link = the strength of the proximity relationship between two tuples (set to 1 by default)  Weight of a backward link = in-degree of edges pointing to the node
  • 40. Formal Model 
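The graph model and the weights described in the Formal Model slides can be sketched as below. It follows the simplified definitions on the slides (forward weight 1, backward weight equal to the in-degree of the referenced node, prestige equal to in-degree), which may differ from the exact formulas used by BANKS. The tuple identifiers and the function name are hypothetical, and the sketch assumes no two tuples reference each other in both directions.

```python
from collections import defaultdict

def build_banks_graph(foreign_keys):
    """Build weighted directed edges and node prestige from foreign-key references.

    foreign_keys: list of (referencing_tuple, referenced_tuple) pairs.
    """
    in_degree = defaultdict(int)
    for _src, dst in foreign_keys:
        in_degree[dst] += 1

    edges = defaultdict(dict)
    for src, dst in foreign_keys:
        edges[src][dst] = 1               # forward edge: default weight 1
        edges[dst][src] = in_degree[dst]  # backward edge: costly when dst is a hub

    nodes = {n for pair in foreign_keys for n in pair}
    prestige = {n: in_degree[n] for n in nodes}  # node weight N(u) = in-degree
    return dict(edges), prestige

# Hypothetical data: two "writes" tuples referencing one author and two papers.
foreign_keys = [("w1", "a1"), ("w1", "p1"), ("w2", "a1"), ("w2", "p2")]
edges, prestige = build_banks_graph(foreign_keys)
print(edges["a1"])     # backward edges out of the author cost 2 each
print(prestige["a1"])  # the author is referenced twice
```

Making the backward edge more expensive as the in-degree grows is what keeps a hub tuple from acting as a cheap shortcut between unrelated tuples.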
  • 41. Result Result of query “sudarshan soumen”
  • 42. Searching for the best answer  Backward Expanding Search Algorithm  Intuition: find vertices from which a forward path exists to at least one node from each Si.  Run concurrent single-source shortest-path algorithms from each node matching a keyword
  • 43. Searching for the best answer  [Figure: example answer tree with nodes S. Sudarshan, Prasan Roy, Charuta, writes, author, paper “BANKS: Keyword search…”]
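A minimal sketch of the backward expanding idea follows. It simplifies the algorithm in two ways that should be read as assumptions: it runs one multi-source Dijkstra per keyword to completion instead of interleaving one iterator per keyword node, and it ranks candidate roots by summed path cost only, leaving out the node-weight component. The graph, the node names and the function names are hypothetical.

```python
import heapq
from collections import defaultdict

def reverse_dijkstra(edges, sources):
    """Cheapest forward-path cost from every node to its nearest source node."""
    reverse = defaultdict(list)
    for u, targets in edges.items():
        for v, w in targets.items():
            reverse[v].append((u, w))     # walk edges backwards from the sources

    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue                      # stale heap entry
        for u, w in reverse[v]:
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

def backward_search(edges, keyword_nodes):
    """Return (total_cost, root) pairs for roots that reach every keyword set S_i."""
    per_keyword = [reverse_dijkstra(edges, s) for s in keyword_nodes]
    roots = set(per_keyword[0])
    for dist in per_keyword[1:]:
        roots &= set(dist)
    return sorted((sum(d[r] for d in per_keyword), r) for r in roots)

# Hypothetical graph: employees e1 and e2 reference department d1.
# Forward edges cost 1; backward edges cost 2 (the in-degree of d1).
edges = {"e1": {"d1": 1}, "e2": {"d1": 1}, "d1": {"e1": 2, "e2": 2}}
print(backward_search(edges, [{"e1"}, {"e2"}]))
```

Every node in this tiny graph reaches both keyword nodes, so all three come back as candidate roots, ordered by how cheaply they connect the two employees.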
  • 44. As an extension of BANKS  BLINKS: ranked keyword searches on graphs.  He H et al.  SIGMOD 07
  • 45. Introduction  Efficient ranked keyword searches on schemaless node-labeled graphs.  Challenges:  Lack of schema for optimization  Hard to guarantee strong performance  Proposed technique  Backward search algorithm  SLINKS: single-level index search *  Extension for scalability: BLINKS ( bi-level index search )  Contributions  Cost-balanced expansion based backward search  Combining indexing with search  Partition-based indexing (bi-level indexing)
  • 46. Problem Formulation 
  • 47. Backward search algorithm 
  • 48. A single level index 
  • 49. A single level index 
  • 50. SLINKS Algorithm 
  • 51. BLINKS (brief idea)  Is the index too large to store and too expensive to construct for large graphs? Use a divide-and-conquer approach to create a bi-level index  Partition the data graph into multiple subgraphs, or blocks.  Intra-Block Index  Indexes information inside a block  4 kinds of indexes, 2 for separator nodes (important, so specially considered)  Block Index  2 simple indexes
  • 52. Conclusion  Keywords search challenges:  Filtering and disambiguation  Automatic join back  Ranking of the result  Additional consideration:  Efficiency  Space
  • 53. Thank you and have fun