What to do when one size does not fit all?!
Keynote talk about "Search by Strategy" at the ESAIR 2011 workshop, held at CIKM 2011.

Speaker notes:
  • Viewing a BB as a function can be used later to sketch SpinQL.
  • Does “Entity-based ranking” make sense?
  • NOTE: MATERIALIZED VIEWs, where supported (not in MonetDB), can be used instead of TABLEs when stored relations (index) are expected to get updates.
  • This is how it should be done. How it is done at the moment: always append (like filters). Up/down means: upvote/downvote the selected bucket.

What to do when one size does not fit all?! Presentation Transcript

  • 1. What to do when one size does not fit all?! Arjen P. de Vries [email_address], Centrum Wiskunde & Informatica / Delft University of Technology / Spinque B.V.
  • 2. Core Questions
    • How to represent information?
      • The information need and search requests
      • The objects to be shown in response to an information request
    • How to match information representations
      • (Deductive) data retrieval, (inductive) information retrieval, or a mix?!
  • 3. Complications
    • Heterogeneous data sources
      • WWW, wikipedia, news, e-mail, patents, twitter, personal information, …
    • Varying result types
      • “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
    • Multiple dimensions of relevance
      • Topicality, recency, reading level, …
  • 4. Complications
    • Many search tasks require a mix within these dimensions:
      • News and patents
      • Companies and their CEOs
      • Recent and on topic
    • Many search tasks also require a mix across these dimensions:
      • Patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients
  • 5. Complications
    • System’s internal information representation
      • Linguistic annotations
        • Named entities, sentiment, dependencies, …
      • Knowledge resources
        • Wikipedia, Freebase, ICD-9, IPTC, …
      • Links to related documents
        • Citations, urls
  • 6. Complications
    • Anchors that describe the URI
      • Anchor text
    • Queries that lead to clicks on the URI
      • Session, user, dwell-time, …
    • Tweets that mention the URI
      • Time, location, user, …
    • Other social media that describe the URI
      • User, rating
      • Tag, organisation of ‘folksonomy’
  • 7. Tweets about blip.tv
    • E.g.: http://blip.tv/file/2168377
      • Amazing
      • Watching “World’s most realistic 3D city models?”
      • Google Earth/Maps killer
      • Ludvig Emgard shows how maps/satellite pics on web is done (learn Google and MS!)
            • and ~120 more Tweets
  • 8. Even More Complications
    • Uncertainty in matching process
      • Vocabulary mismatch
      • Incomplete relevance information
    • Imperfect and noisy representations of both documents and information need
      • OCR, multimedia analysis, NE taggers, HTML table extraction, …
  • 9. The one size fits all “semantically enhanced retrieval model”? BM25? BM25F? LM? RM? VSM? DFR? QIR? Learning to rank? [Diagram: document collection with anchors, entity types, sentiment, tweets, cited documents, …, plus context and user, producing a ranked list of answers]
  • 10. http://www.hellokids.com/c_19938/coloring-page/holiday-coloring-pages/easter-coloring-pages/jesus-coloring-pages/the-holy-grail-coloring-page
  • 11. Parameterised Search System: can we not ‘remove’ this IR engineer from the loop, like DBMS software removed the data engineer from the loop? (Cornacchia and De Vries, “A Parameterised Search System”, ECIR 2007)
  • 12. Search by Strategy
    • Visually construct search strategies by connecting building blocks
  • 13.  
  • 14.  
  • 15. Generate Search Engine!
  • 16. Search by Strategy
    • Visually construct search strategies by connecting building blocks
    • Each block describes either data or actions upon that data
  • 17. Strategy Builder
  • 18. From Patent to Inventor
  • 19. Reports Visits
  • 20.  
  • 21. BBs and typed pins
    • N input pins, 1 output pin
      • Pins represent data / result sets
    • M user parameters (u)
      • Instantiated at query time
    • A BB can be viewed as a function
      • out = BB(in_1, …, in_N, u_1, …, u_M)
    • Pins are typed
      • doc / sec / term / ne (named entity) / tuple
      • Only pins of the same type can be connected
    • No assumption on the underlying data store / data API
    • The content of the BB respects the type contract.
    [Diagram: BB_1(in_1, in_2, in_3, u_1, u_2) with three input pins and one output; BB_2(in_1) with one input pin and one output]
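The BB-as-function view can be made concrete. A minimal Python sketch, not Spinque's actual code (`ResultSet`, `connect`, and `filter_bb` are hypothetical names), of a typed building block whose pins only connect when their types match:

```python
# Sketch: a building block (BB) viewed as a function
# out = BB(in_1, ..., in_N, u_1, ..., u_M), with typed pins.
from dataclasses import dataclass

# Pin types from the slide: doc / sec / term / ne / tuple
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}

@dataclass
class ResultSet:
    pin_type: str   # the type carried on the pin
    rows: list      # (id, probability) pairs

def connect(upstream: ResultSet, expected_type: str) -> ResultSet:
    """Only pins of the same type can be connected (the type contract)."""
    if upstream.pin_type != expected_type:
        raise TypeError(f"cannot connect {upstream.pin_type} pin "
                        f"to {expected_type} pin")
    return upstream

def filter_bb(in1: ResultSet, threshold: float) -> ResultSet:
    """A hypothetical BB: one input pin (type doc), one user
    parameter u_1 = threshold, instantiated at query time."""
    docs = connect(in1, "doc")
    return ResultSet("doc", [(i, p) for i, p in docs.rows if p >= threshold])

out = filter_bb(ResultSet("doc", [("d1", 0.9), ("d2", 0.3)]), threshold=0.5)
# out.rows == [("d1", 0.9)]
```

Connecting a `term`-typed pin to `filter_bb` would raise a `TypeError`, which is the point of the type contract: strategies are checked at wiring time, before any query runs.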
  • 22. From Strategies to DB Queries
    • Database → Spinque: RDBMS (MonetDB)
    • Data flow → Spinque: strategy
    • Query (strategy made operational) → Spinque: PRA
    [Diagram: building blocks BB_1(in_1, in_2, in_3, u_1, u_2) and BB_2(in_1) of a strategy compiled to view definitions over the relational DB: CREATE VIEW a AS SELECT …; CREATE VIEW b AS SELECT …; CREATE VIEW c AS SELECT …]
  • 23. Probabilistic Relational Algebra Strategy Relational DB
    • SQL with explicit probabilities:
      CREATE VIEW x AS
      SELECT a1, a3, 1 - prod(1 - prob) AS prob
      FROM y
      GROUP BY a1, a3;
    • PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001):
      x = Project^DISTINCT[$1, $3](y);
  • 24. SpinQL, the sneak preview
    • PRA still too low-level; who writes algebraic plans?!
    • SpinQL: “See objects, generate SQL under the hood”
      • Understands Spinque data types (doc, sec, term, named-entity, tuple)
      • Allows building levels of abstraction, to describe:
      • access to probabilistic relations
      • domain-unaware typed data streams (e.g., person.name())
      • domain-aware data streams (e.g., person.inventor_of())
      • building blocks (building blocks are functions)
      • strategies (if building blocks are functions, strategies are as well)
    • SpinQL is not contained in a strategy; SpinQL is a strategy specification
      • SpinQL describes everything; the editor shows the desired granularity / expertise level
      • E.g. show patent-strategy, zoom in on “inventors”, then “persons”, “ne”, raw data access
  • 25. What’s in the DB?
    • Text-based ranking
      • term-doc-freq relations (inverted file)
        • One per language, stemming, section
      • Domain-independent, click and index
    • Entity ranking
      • Probabilistic triples
      • Domain-aware
        • Needs supervised indexing
    • Content-based (MM) retrieval
      • Feature vectors, click and index
    Example relations:
      Term-document frequencies (T, D, f):
        t0  d3  3
        t0  d5  10
        t1  d2  4
      Probabilistic triples (subj, pred/attr, obj/value, p):
        Arjen   speaks_to  you    0.95
        you     follow     Arjen  0.5
        speech  minutes    45     0.8
      Feature vectors (Img_id, f_1, …, f_N):
        0  0.12  …  0.84
        1  0.54  …  0.31
        2  0.23  …  0.1
  • 26. VIEWS and TABLES
    • BB content: sequence of VIEW definitions
    • A VIEW is pre-computable when
      • All the relations addressed are pre-computable / stored
      • No dependency on user parameters
    • Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
      • Query-independent computations are performed only once, then read from TABLEs at each query
      • Recognition of these patterns is fully automatic
      • Extends MonetDB’s per-session caching to cross-session caching
    Before precomputation:
      CREATE VIEW a AS SELECT … FROM term_doc …;         -- stored relation, no user parameter
      CREATE VIEW b AS SELECT … FROM a WHERE a.x = u_1;  -- user parameter
      CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42;   -- pre-computable relation
      CREATE VIEW d AS SELECT … FROM b …;
    After precomputation:
      CREATE TABLE a AS SELECT … FROM term_doc …;
      CREATE VIEW b AS SELECT … FROM a WHERE a.x = u_1;
      CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42;
      CREATE VIEW d AS SELECT … FROM b …;
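The "fully automatic" recognition amounts to a dependency check: a view is pre-computable when, transitively, it touches no user parameter. A minimal Python sketch under that reading (all names here are hypothetical illustrations, not Spinque code):

```python
# Sketch: decide which VIEWs can become TABLEs. A view is
# pre-computable when every relation it references is either a stored
# base relation or itself pre-computable, and no reference is a user
# parameter (directly or through other views).
def precomputable(view, views, stored, user_params, seen=None):
    """view: name; views: {name: set of referenced names};
    stored: names of stored base relations; user_params: {'u1', ...}."""
    seen = seen or set()
    if view in user_params:
        return False
    if view in stored:
        return True
    if view in seen:                 # guard against cyclic definitions
        return False
    seen.add(view)
    return all(precomputable(d, views, stored, user_params, seen)
               for d in views[view])

views = {"a": {"term_doc"},   # depends only on a stored relation
         "b": {"a", "u1"},    # depends on user parameter u_1
         "c": {"a"},          # constant predicate: pre-computable
         "d": {"b"}}          # inherits b's dependency on u_1
tables = {v for v in views if precomputable(v, views, {"term_doc"}, {"u1"})}
# tables == {"a", "c"}  -> become CREATE TABLE; b and d stay VIEWs
```

This reproduces the slide's example: `a` and `c` are materialized once, while `b` and `d` remain views evaluated per query.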
  • 27. Exploratory Search
    • Search & (Faceted) Browsing
      • Help discover schema, ontology, etc.
      • Help discover the relevant sources
        • Within-collection (by year/location, by type, …)
        • Across multiple collections (by source)
  • 28. Probabilistic faceted browsing
    • Traditional (boolean filters) vs. probabilistic
    [Both panels show the same facets: Price (100K-200K, 200K-300K, 300K-400K), Rooms (3, 4, 5), Size (100-150 m², 150-200 m², 200-250 m²)]
    • Traditional (boolean filters):
      • Good when the user knows exactly which filters to apply
      • Will see perfect-match results
      • Won’t see “interesting” results
    • Probabilistic:
      • Good for exploratory search
      • Will see perfect-match results
      • Will also see “interesting” results
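The contrast can be illustrated on a toy ranking over hypothetical houses: a boolean filter removes near-misses, while a probabilistic facet only re-weights them (here a simple score boost stands in for the actual probabilistic model):

```python
# Sketch: boolean vs. probabilistic faceting on a Size facet.
# Hypothetical data: (house id, retrieval score, floor area in m²).
houses = [("h1", 0.9, 140), ("h2", 0.8, 155), ("h3", 0.4, 210)]

def boolean_facet(results, lo, hi):
    """Traditional filter: results outside the bucket disappear."""
    return [(i, s) for i, s, m2 in results if lo <= m2 < hi]

def probabilistic_facet(results, lo, hi, boost=2.0):
    """Probabilistic facet: upvoting a bucket boosts its scores,
    but 'interesting' results outside the bucket stay visible."""
    ranked = [(i, s * (boost if lo <= m2 < hi else 1.0))
              for i, s, m2 in results]
    return sorted(ranked, key=lambda t: -t[1])

boolean_facet(houses, 150, 200)        # only h2 survives
probabilistic_facet(houses, 150, 200)  # h2 first; h1 and h3 still shown
```

The exploratory-search benefit is exactly the second behaviour: the high-scoring 140 m² house remains in view even though it narrowly misses the selected bucket.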
  • 29. Dynamic facets
    • Pre-indexed vs. dynamic
    [Both panels show the same facets: Price (100K-200K, 200K-300K, 300K-400K), Rooms (3, 4, 5), Size (100-150 m², 150-200 m², 200-250 m²)]
    • Pre-indexed:
      • Pre-defined ad-hoc indices intersected with the result set
      • Challenge: many indices to maintain
    • Dynamic:
      • Facets decided from the result set
      • Challenge: dynamically adapt granularity
        • Different price ranges for villa/garage!
      • Challenge: heavy concurrent queries to the DB
  • 30. Probabilistic facets and strategies (current)
    [Facet panel: Price (100K-200K, 200K-300K, 300K-400K), Rooms (3, 4, 5), Size (100-150 m², 150-200 m², 200-250 m²), applied on top of the original strategy]
    • Filter on Size (in/out)
      • No ranking!
    • Re-rank on
      • Rooms (up/down)
      • Price (up/down)
      • No filter!
    • BAD: the order of Rooms/Price matters!
    • BAD: the ranking function of Rooms/Price is internally smoothed with the previous ranking (done for efficiency)
    • BAD: possible to weight each facet, but not consistently with the others
  • 31. Probabilistic facets and strategies (better)
    [Facet panel: Price (100K-200K, 200K-300K, 300K-400K), Rooms (3, 4, 5), Size (100-150 m², 150-200 m², 200-250 m²); the original strategy is combined with the Rooms and Price rankings in a 20% / 50% / 30% Mix]
    • Filter on Size (in/out)
      • No ranking!
    • Re-rank on Rooms (up/down)
      • No filter!
    • Re-rank on Price (up/down)
      • No filter!
    • Mix 3 different rankings:
      • Rooms
      • Original (always present)
      • Price
    • Change the coefficients to explore
    • Challenge: use the algebraic SpinQL representation for rewritings
      • E.g., push filters up
  • 32. Mixing probabilistic data streams
    • N input streams S_1, …, S_N, one output stream
    • All streams: (id, p) tuples with the same id type (e.g., docs)
    • Linear combination: p(id) = α_1·p_1(id) + … + α_N·p_N(id)
    • p_1, …, p_N must be comparable, on the same scale
      • Take care in scripting the blocks
    • An expensive operation in relational algebra
    [Diagram: Mix block with weights 20% / 50% / 30%]
  • 33. Mixing probabilistic data streams in RA
    • Sum(p, GroupBy(id, Union(α_1·S_1, …, α_N·S_N)))
      • GroupBy is optimised for few large groups; the opposite holds here
      • The example shows 3 ids and 2 streams: 3 groups (could be millions), each of size at most 2 (usually < 5)
    Inputs:
      S_1: (id0, 0.1), (id1, 0.7), (id2, 0.9)
      S_2: (id0, 0.2), (id2, 1.0)
    Union(α_1·S_1, α_2·S_2), with α_1 = 20%, α_2 = 80%:
      (id0, 0.2·0.1 = 0.02), (id1, 0.2·0.7 = 0.14), (id2, 0.2·0.9 = 0.18),
      (id0, 0.8·0.2 = 0.16), (id2, 0.8·1.0 = 0.8)
    Sum(p, GroupBy(id)):
      (id0, 0.02 + 0.16 = 0.18), (id1, 0.14), (id2, 0.18 + 0.8 = 0.98)
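A Python sketch of the Union/GroupBy formulation, with the dictionary playing the role of the GroupBy over the unioned, weight-scaled tuples:

```python
# Sketch: Sum(p, GroupBy(id, Union(alpha_1*S_1, ..., alpha_N*S_N))).
# Scale each stream by its mixture weight, union all tuples, then sum
# the probabilities per id.
from collections import defaultdict

def mix_group(streams, alphas):
    acc = defaultdict(float)
    for s, a in zip(streams, alphas):
        for i, p in s:          # Union of the weighted streams
            acc[i] += a * p     # Sum(p, GroupBy(id))
    return dict(acc)

S1 = [("id0", 0.1), ("id1", 0.7), ("id2", 0.9)]
S2 = [("id0", 0.2), ("id2", 1.0)]
mix_group([S1, S2], [0.2, 0.8])
# ≈ {"id0": 0.18, "id1": 0.14, "id2": 0.98}
```

Note the group shape the slide warns about: millions of tiny groups (one per id), the opposite of what relational GroupBy implementations are tuned for.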
  • 34. Mixing probabilistic data streams in RA
    • Project(α_1·p_1 + … + α_N·p_N, OuterJoin(id_1 = … = id_N, (S_1, …, S_N)))
      • Explicit summation of a few values is more efficient than aggregation
      • The example omits handling of NULLs from the OuterJoin (not free)
      • Super-fast if the streams are ordered on id; impossible with Union/GroupBy
    Inputs (α_1 = 20%, α_2 = 80%):
      S_1: (id0, 0.1), (id1, 0.7), (id2, 0.9)
      S_2: (id0, 0.2), (id2, 1.0)
    OuterJoin(id_1 = … = id_N, (S_1, …, S_N)):
      (id0, p_1 = 0.1, p_2 = 0.2), (id1, p_1 = 0.7), (id2, p_1 = 0.9, p_2 = 1.0)
    Project(α_1·p_1 + … + α_N·p_N):
      (id0, 0.2·0.1 + 0.8·0.2 = 0.18), (id1, 0.2·0.7 = 0.14), (id2, 0.2·0.9 + 0.8·1.0 = 0.98)
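The OuterJoin formulation, sketched in Python: align the streams on id, let an absent id contribute probability 0 (the NULL handling mentioned above), then project the explicit weighted sum per row:

```python
# Sketch: Project(alpha_1*p_1 + ... + alpha_N*p_N,
#                 OuterJoin(id_1 = ... = id_N, (S_1, ..., S_N))).
def mix_join(streams, alphas):
    ids = sorted(set(i for s in streams for i, _ in s))  # full outer join keys
    cols = [dict(s) for s in streams]                    # one column per stream
    return {i: sum(a * c.get(i, 0.0)                     # NULL -> 0
                   for c, a in zip(cols, alphas))
            for i in ids}

S1 = [("id0", 0.1), ("id1", 0.7), ("id2", 0.9)]
S2 = [("id0", 0.2), ("id2", 1.0)]
mix_join([S1, S2], [0.2, 0.8])
```

Summing a handful of columns per row replaces per-group aggregation, which is why this plan can be much faster, especially when the streams arrive sorted on id and the join becomes a merge.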
  • 35. Limitations Search & Browse
    • Faceted exploration does not include joins
      • Cannot construct new data sources from existing ones!
      • Only the pre-defined paths through the information space can actually be traversed
  • 36. Who needs a Join?
    • You!!! … whenever ‘relevance cues’ are typed:
      • People (e.g., inventors)
      • Companies (e.g., assignees)
      • Categories (e.g., IPTC)
      • Time (e.g., expiry date)
      • Location (e.g., country)
    • … or whenever multiple sources are to be combined
      • E.g., patents & news, patents & Wikipedia, …
  • 37. Patents on X by Y(y)
  • 38. Real-life patent search example: which researchers associated with universities and colleges should our Human Resources manager know, to hire the right people on time? 1. Which universities/colleges hold patents? 2. Who are the inventors named in those patents? 3. Which inventors are active in the area of our company?
  • 39. How Strategies Help
    • Strategies improve communication between search intermediary and user
      • Encapsulate domain expert knowledge
      • Abstract representation of search expert knowledge
      • Analyze information seeking process at any stage
    • Strategies facilitate knowledge management
      • Store / share / publish / refine
    • Strategies mix exact (DB) and ranked (IR) searches
      • Avoid the need for “human (probabilistic) joins”
  • 40.  
  • 41. Conclusion
    • “ No idealized one-shot search engine”
    • Empower the user!
  • 42. Search Intermediaries
    • Travel agency
    • Real estate agents
    • Recruiters
    • Librarians
    • Archivists
    • Digital forensics detectives
    • Patent information specialists
    (listed in order of increasing task complexity)
  • 43.  
  • 44. Research Opportunities
    • Assist the user in making the most of their increased level of control
      • Integrate usage data from live system to help improve or adapt strategies
    • Handle “even larger” scale data
      • The patent demo runs fine on ~17GB of semi-structured data (Fairview Research’s Green Energy collection), without specific optimizations, even with fairly large strategies
    • Formalism
      • Score normalization
    • Close the loop!
  • 45. Current Situation
    index;                      (schema definition)
    repeat {
      specify;
      retrieve
    } until …                   (search & explore)
  • 46. Desirable Situation
    repeat {
      index;                    (schema definition)
      specify;
      retrieve                  (search & explore)
    } until …
    (mixed initiative: indexing moves inside the search & explore loop)
  • 47. Interactive Information Access
    • Feedback:
      • Interaction improves information representation
    • Faceted Browsing:
      • Interaction can let user take over where machine would fail
    • Search by Strategy:
      • Interaction can let user take over where system designer would fail