How to build the next 1000
    search engines?!

         Arjen P. de Vries
          arjen@acm.org
      Centrum Wiskunde & Informatica
       Delft University of Technology
                Spinque B.V.
Search is everywhere
 Yet it only works well on the web…
Complications
 Heterogeneous data sources
   WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
 Varying result types
   “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
 Multiple dimensions of relevance
   Topicality, recency, reading level, …
Complications
 Many search tasks require a mix within
  these dimensions:
   News and patents
   Companies and their CEOs
   Recent and on topic
 Many search tasks also require a mix
  across these dimensions:
   Patents assigned to our top 3 competitors in
    market segments mentioned in the recent
    press releases issued by our top 10 clients
 System’s internal information representation
   Linguistic annotations
      Named entities, sentiment, dependencies, …
   Knowledge resources
      Wikipedia, Freebase, ICD9, IPTC, …
   Links to related documents
      Citations, urls
 Anchors that describe the URI
   Anchor text
 Queries that lead to clicks on the URI
   Session, user, dwell-time, …
 Tweets that mention the URI
   Time, location, user, …
 Other social media that describe the URI
   User, rating
   Tag, organisation of ‘folksonomy’
     + UNCERTAINTY ALL OVER!
What goes in the black box?
 Input
   Document collection: anchors, entity types, sentiment, tweets, cited documents, …
   User
   Context
 Black box
   BM25, BM25F, LM, RM, VSM, DFR, QIR?, learning to rank?
 Output
   Ranked list of answers
 ECIR / CIKM / SIGIR / ICTIR / WSDM papers!
Rarely & scarcely addressed…

   Student: How do I build it?
   Professor: Who will build it for me?

   Last session of the conference…
Search System
Parameterised Search System
 Cornacchia, De Vries, ECIR 2007: A Parameterised Search System
Parameterised Search System
 Can we not ‘remove’ this IR engineer (or scientist!) from the loop, like DBMS software removes the data engineer from the loop?
And three (four?) children, a startup and 5 years later, a PhD defense!
Search by Strategy
 Visually construct search strategies by
  connecting building blocks
 Each block describes either data or actions
  upon that data
   Connection points (“pins”) are typed:
    doc / sec / term / ne (named entity) / tuple
   Actions are expressed as scripts (more on this later)
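
The typed pins can be pictured as a small type check at connection time. The following is a minimal sketch, not Spinque's implementation: the `Block` class, the `connect` function and the three example blocks are hypothetical; only the pin type names come from the slide.

```python
from dataclasses import dataclass

# Pin types named on the slide; everything else here is a hypothetical illustration.
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}


@dataclass
class Block:
    name: str
    inputs: dict   # input pin name -> pin type
    output: str    # type of the single output pin

    def __post_init__(self):
        # Reject pin types that are not part of the known vocabulary.
        for pin_type in list(self.inputs.values()) + [self.output]:
            if pin_type not in PIN_TYPES:
                raise ValueError(f"unknown pin type: {pin_type}")


def connect(source: Block, target: Block, pin: str) -> tuple:
    """Return an edge (source, target, pin) if the pin types match, else raise."""
    expected = target.inputs[pin]
    if source.output != expected:
        raise TypeError(
            f"{source.name} yields {source.output!r}, "
            f"but {target.name}.{pin} expects {expected!r}"
        )
    return (source.name, target.name, pin)


# A three-block strategy: a data block, a ranking action, and a doc-to-entity action.
patents = Block("patents", inputs={}, output="doc")
rank = Block("rank_by_text", inputs={"collection": "doc", "query": "term"}, output="doc")
inventors = Block("doc_to_inventor", inputs={"docs": "doc"}, output="ne")

edge1 = connect(patents, rank, "collection")
edge2 = connect(rank, inventors, "docs")
```

Connecting `inventors` (which yields `ne`) to the `collection` pin of `rank` (which expects `doc`) would raise a `TypeError` in this sketch, which is the kind of mistake a visual editor could refuse to draw.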
Strategy Builder
From Patent to Inventor
Reports




          Visits
Generate Search Engine!




Or, really, generate a REST API from the strategy specification!
Demo
(Showed demo of children’s search engine)
How Strategies Help
 Strategies improve communication between
  search intermediary and user
   Encapsulate domain expert knowledge
   Abstract representation of search expert knowledge
   Analyze information seeking process at any stage
 Strategies facilitate knowledge management
   Store / share / publish / refine
 Strategies mix exact (DB) and ranked (IR)
  searches
   Avoid the need for “human (probabilistic) joins”
Search Intermediaries
 Travel agency
 Real estate agents
 Recruiters
 Librarians
 Archivists
 Digital forensics detectives
 Patent information specialists
 [Figure: axis labelled “Task complexity”]
Exploratory Search
 Search & (Faceted) Browsing
   Help discover schema, ontology, etc.
   Help discover the relevant sources
     Within-collection (by year/location, by type, …)
     Across multiple collections (by source)
Probabilistic faceted browsing
 Example facets (shown for both variants):
   Price: 100K - 200K, 200K - 300K, 300K - 400K
   Rooms: 3, 4, 5
   Size: 100 - 150 m², 150 - 200 m², 200 - 250 m²
 Traditional (boolean filters)
   Good when user knows exactly which filters to apply
   Will see perfect-match results
   Won’t see “interesting” results
 Probabilistic
   Good for exploratory search
   Will see perfect-match results
   Will also see “interesting” results
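
The difference between the two variants can be sketched on toy data. This is an illustration under assumptions, not the method used in the demo: the listings, the linear decay in `closeness` and the room penalty are all hypothetical choices made only to show how near-misses can survive with a lower score.

```python
# Hypothetical listings: (id, price in thousands, rooms, size in m2).
LISTINGS = [
    ("a", 250, 4, 160),
    ("b", 310, 4, 155),
    ("c", 150, 2, 90),
]


def boolean_filter(listings, price, rooms, size):
    """Traditional facets: keep only listings inside every selected range."""
    return [
        ident
        for ident, p, r, s in listings
        if price[0] <= p <= price[1] and r == rooms and size[0] <= s <= size[1]
    ]


def closeness(value, lo, hi):
    """1.0 inside [lo, hi], decaying linearly to 0 at one range-width outside."""
    if lo <= value <= hi:
        return 1.0
    distance = lo - value if value < lo else value - hi
    return max(0.0, 1.0 - distance / (hi - lo))


def probabilistic_rank(listings, price, rooms, size):
    """Probabilistic facets: score every listing, keep those with a non-zero score."""
    scored = []
    for ident, p, r, s in listings:
        # Assumed penalty: each room of difference halves the remaining score budget.
        room_score = max(0.0, 1.0 - 0.5 * abs(r - rooms))
        score = closeness(p, *price) * room_score * closeness(s, *size)
        if score > 0:
            scored.append((ident, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


exact = boolean_filter(LISTINGS, price=(200, 300), rooms=4, size=(150, 200))
ranked = probabilistic_rank(LISTINGS, price=(200, 300), rooms=4, size=(150, 200))
```

With these numbers the boolean filter keeps only listing `a`, while the ranked variant also returns `b`, which is slightly over budget but otherwise matches.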
Dynamic facets
 Example facets (shown for both variants):
   Price: 100K - 200K, 200K - 300K, 300K - 400K
   Rooms: 3, 4, 5
   Size: 100 - 150 m², 150 - 200 m², 200 - 250 m²
 Pre-indexed
   Pre-defined ad-hoc indices intersected with result set
   Challenge: many indices to maintain
 Dynamic
   Facets decided from result set
   Challenge: dynamically adapt granularity
     Different price ranges for villa/garage!
   Challenge: heavy concurrent queries to DB
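
One way to adapt granularity to the result set is to derive the ranges from the values actually present. The sketch below uses equal-frequency buckets; that choice, the function name `dynamic_ranges` and the price lists are assumptions made for illustration, not a description of the demo.

```python
def dynamic_ranges(values, buckets=3):
    """Split the observed values into `buckets` ranges holding roughly equal counts."""
    ordered = sorted(values)
    if not ordered:
        return []
    buckets = min(buckets, len(ordered))
    ranges = []
    for i in range(buckets):
        # Integer arithmetic spreads any remainder evenly over the buckets.
        start = i * len(ordered) // buckets
        end = (i + 1) * len(ordered) // buckets - 1
        ranges.append((ordered[start], ordered[end]))
    return ranges


# Hypothetical prices (in thousands) found in two different result sets.
villa_prices = [450, 520, 610, 700, 880, 1200]
garage_prices = [12, 15, 18, 22, 25, 40]

villa_ranges = dynamic_ranges(villa_prices)
garage_ranges = dynamic_ranges(garage_prices)
```

The same facet then gets ranges in the hundreds of thousands for villas and in the tens of thousands for garages, without a pre-defined index per property type.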
Demo
(Showed Spinque’s real-estate search demo)
Limitations Search & Browse
 Faceted exploration does not include joins
   Cannot construct new data sources from
    existing ones!
   Only the pre-defined paths through the
    information space can actually be traversed
Who needs a Join?
 You!!!
  … whenever ‘relevance cues’ are typed:
   People (e.g., inventors)
   Companies (e.g., assignees)
   Categories (e.g., IPTC)
   Time (e.g., expiry date)
   Location (e.g., country)
 … or whenever multiple sources are to be combined
   E.g., patents & news, patents & Wikipedia, …
Patents on X by Y(y)

Interactive Information Access

 Feedback:
   Interaction improves information
    representation
 Faceted Browsing:
   Interaction can let user take over where
    machine would fail
 Search by Strategy:
   Interaction can let user take over where
    system designer would fail
Conclusion
 “No idealized one-shot search engine”
 Empower the user!
Under the Hood
From Strategies to DB Queries
 Strategy (Spinque: strategy)
    Data flow
    BB1(in1, in2, in3, u1, u2) -> out
    BB2(in1) -> out
 Query (Spinque: PRA)
    Strategy made operational
    CREATE VIEW a AS SELECT ..
    CREATE VIEW b AS SELECT ..
    CREATE VIEW c AS SELECT ..
 Database (Spinque: RDBMS (MonetDB))
    Relational DB
Probabilistic Relational Algebra
 Strategy
 PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
    x = Project DISTINCT [$1,$3](y);
 SQL with explicit probabilities
    CREATE VIEW x AS
    SELECT a1, a3,
           1-prod(1-prob) AS prob
    FROM y
    GROUP BY a1, a3;
 Relational DB
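
The aggregate `1-prod(1-prob)` combines duplicate tuples as a disjunction of events that are assumed independent. A minimal sketch of that projection on toy data follows, using SQLite from Python instead of MonetDB. SQLite has no built-in `prod`, so the `Prod` class below is a stand-in registered by hand; the rows in `y` are invented, and the query is run directly rather than wrapped in a view.

```python
import sqlite3


class Prod:
    """Product aggregate, registered because SQLite has no built-in prod()."""

    def __init__(self):
        self.value = 1.0

    def step(self, x):
        self.value *= x

    def finalize(self):
        return self.value


conn = sqlite3.connect(":memory:")
conn.create_aggregate("prod", 1, Prod)

# Hypothetical relation y(a1, a2, a3, prob).
conn.execute("CREATE TABLE y (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL)")
conn.executemany(
    "INSERT INTO y VALUES (?, ?, ?, ?)",
    [
        ("d1", "t1", "nl", 0.5),
        ("d1", "t2", "nl", 0.5),
        ("d2", "t1", "en", 0.8),
    ],
)

# Project DISTINCT [$1,$3](y): group on the kept columns, merge probabilities.
rows = conn.execute(
    """
    SELECT a1, a3, 1 - prod(1 - prob) AS prob
    FROM y
    GROUP BY a1, a3
    ORDER BY a1
    """
).fetchall()
```

The two `d1` tuples collapse into one with probability 1 - (0.5 × 0.5) = 0.75, while the single `d2` tuple keeps a probability of about 0.8.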
What’s in the DB?
 Text-based ranking
    term-doc-freq relations (inverted file)
       One per language, stemming, section
    Domain-independent, click and index

       T     D     f
       t0    d3    3
       t0    d5    10
       t1    d2    4

 Entity ranking
    Probabilistic triples
    Domain-aware
       Needs supervised indexing

       subj      pred/attr    obj/value    p
       Arjen     speaks_to    you          0.95
       you       follow       Arjen        0.5
       speech    minutes      45           0.8

 Content-based (MM) retrieval
    Feature vectors, click and index

       Img_id    f1      …    fN
       0         0.12    …    0.84
       1         0.54    …    0.31
       2         0.23    …    0.1
VIEWS and TABLES
   CREATE TABLE a AS SELECT … FROM term-doc … ;          (was VIEW; term-doc is a stored relation)
   CREATE VIEW  b AS SELECT … FROM a WHERE a.x = u1 ;    (u1 is a user parameter)
   CREATE TABLE c AS SELECT … FROM a WHERE a.x = 42 ;    (was VIEW; no user parameter, a is a pre-computable relation)
   CREATE VIEW  d AS SELECT … FROM b … ;
 BB content: sequence of VIEW definitions
 A VIEW is pre-computable when
    All the relations addressed are pre-computable / stored
    No dependency on user parameters
 Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
    Query-independent computations are performed only once, then
     read from TABLEs at each query
    Recognition of these patterns is fully automatic
    Extends MonetDB’s per-session caching to across-sessions caching
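
The two conditions for a pre-computable VIEW can be checked with a simple fixed-point pass over the view definitions. This is a sketch of the rule as stated on the slide, not the actual recognition code: the `precomputable` function and the dictionary encoding of views are hypothetical, and `term_doc` stands in for the stored term-doc relation.

```python
def precomputable(views, stored):
    """Return the names of views that use no user parameter and read only
    stored relations or other pre-computable views."""
    result = set()
    changed = True
    while changed:  # repeat until no further view qualifies
        changed = False
        for name, (sources, user_params) in views.items():
            if name in result or user_params:
                continue
            if all(src in stored or src in result for src in sources):
                result.add(name)
                changed = True
    return result


# The four definitions from the slide: name -> (relations read, user parameters).
views = {
    "a": (["term_doc"], []),
    "b": (["a"], ["u1"]),
    "c": (["a"], []),
    "d": (["b"], []),
}

tables = precomputable(views, stored={"term_doc"})
```

Here `a` and `c` qualify and can become TABLEs; `b` is excluded by its user parameter `u1`, and `d` is excluded because it reads `b`.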
What Next?
Current Situation
 index ;              Schema definition
 repeat {
      specify ;
      retrieve        Search & explore
 } until 
Traditional Indexing




 Preprocessing determines to a large extent how
  search requests will be processed
   Especially regarding tokenization, stemming, etc.
 Fast and scalable, but inflexible
   E.g., entity search hard-coded on top of engine,
    advertisements matched on different data, etc.
Search by Strategy




 Flexible: generate arbitrary engine on the fly
 Not as fast as highly optimized and very well
  engineered inverted file based systems
Desirable Situation
 repeat {             Mixed Initiative
      index ;         Schema definition
      specify ;
      retrieve        Search & explore
 } until 
Non-Indexed Search




 Grep
   Very flexible
      Use it all the time on my mh mail folders when gmail
       fails me!
   Not scalable, little or no structure
Minimal Indexing




 How to reduce the pre-processing necessary to
  create a search engine over a new collection?
   Can we do without a keyword index?
   Can we avoid hardwired decisions for tokenization,
    language detection, stemming, …?
Suffix Array
 Pros:
   provides many core search functions: term
    statistics, keyword search, phrase search.
   no upfront tokenization needed (access at
    character level)
   no upfront language detection needed
 Cons:
   difficult to build for large corpora
   expensive w.r.t. disk space
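
A toy suffix array shows how keyword and phrase search work at character level without a tokenizer. This is a deliberately naive sketch: the quadratic construction and the materialised prefix list are only acceptable for small texts, which is exactly the scalability concern listed under the cons. The function names and the sample text are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive construction: sort all suffix start positions by their suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, sa, pattern):
    """Return the sorted start positions of `pattern` using binary search."""
    # Truncating sorted suffixes keeps them sorted, so bisect applies.
    # The list is materialised for clarity, not for efficiency.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])


text = "search by strategy, search by example"
sa = build_suffix_array(text)

keyword_hits = find(text, sa, "search")      # keyword search
phrase_hits = find(text, sa, "search by")    # phrase search, same mechanism
term_count = len(find(text, sa, "strategy"))  # term statistics
```

Keywords, phrases and arbitrary substrings all go through the same lookup, so no decision about tokenization, language or stemming has to be made before indexing.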
Demo
(Showed patent search demo)
“Real Code”
Patents on X by Y(y)

PRA
s__STRATEGY___filter_DOC_with_NE_nes =
Project [$2,$3](
 Join [$1 = $2](
    s__STRATEGY___clef_ip_patents_DATA_result,
    Project [$1,$3](
       Select [$2 = "ipcr-classification"](
          s__STRATEGY___clef_ip_patents_DATA_ne_doc
       )
    )
 )
);
CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
  SELECT
    tmp_1814091754.a2 AS a1,
    tmp_1814091754.a3 AS a2,
    tmp_1814091754.prob AS prob
  FROM
    (
      SELECT
        s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
        tmp__1652836708.a1 AS a2,
        tmp__1652836708.a2 AS a3,
        s__STRATEGY___clef_ip_patents_DATA_result.prob
          * tmp__1652836708.prob AS prob
      FROM
        s__STRATEGY___clef_ip_patents_DATA_result,
        (
          SELECT
            tmp_1444787941.a1 AS a1,
            tmp_1444787941.a3 AS a2,
            tmp_1444787941.prob AS prob
          FROM
            (
              SELECT
                s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
                s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
                s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
                s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
              FROM
                s__STRATEGY___clef_ip_patents_DATA_ne_doc
              WHERE
                s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2
                  = 'ipcr-classification'
            ) AS tmp_1444787941
        ) AS tmp__1652836708
      WHERE
        s__STRATEGY___clef_ip_patents_DATA_result.a1
          = tmp__1652836708.a2
    ) AS tmp_1814091754
  ORDER BY a1
  WITH DATA;
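
The generated statement boils down to a select, a join and a projection in which the probabilities of joined tuples are multiplied. The sketch below replays that shape on toy data in SQLite from Python. The short table names, the rows and the reading of the columns (entity, type, document) are assumptions inferred from the relation names, and the nested subqueries are flattened by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE result (a1 TEXT, prob REAL);                    -- ranked documents (assumed)
    CREATE TABLE ne_doc (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL);  -- entity, type, document (assumed)
    INSERT INTO result VALUES ('EP-1', 0.5), ('EP-2', 0.25);
    INSERT INTO ne_doc VALUES
        ('G06F', 'ipcr-classification', 'EP-1', 1.0),
        ('H04L', 'ipcr-classification', 'EP-2', 0.5),
        ('Acme', 'applicant',           'EP-1', 1.0);
    """
)

# Select on the entity type, join on the document id, project entity and document.
rows = conn.execute(
    """
    SELECT ne_doc.a1 AS entity,
           ne_doc.a3 AS doc,
           result.prob * ne_doc.prob AS prob   -- join multiplies probabilities
    FROM result, ne_doc
    WHERE ne_doc.a2 = 'ipcr-classification'
      AND result.a1 = ne_doc.a3
    ORDER BY entity
    """
).fetchall()
```

Each classification code inherits the score of the document it is attached to, scaled by the confidence of the annotation; the `applicant` row is dropped by the selection.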
info@spinque.com
    www.spinque.com
facebook.com/spinque

 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

How to build the next 1000 search engines?!

  • 1. How to build the next 1000 search engines?! Arjen P. de Vries arjen@acm.org Centrum Wiskunde & Informatica Delft University of Technology Spinque B.V.
  • 3. Search is everywhere
      • Yet it only works well on the web…
  • 4. Complications
      • Heterogeneous data sources
          • WWW, Wikipedia, news, e-mail, patents, Twitter, personal information, …
      • Varying result types
          • “Documents”, tweets, courses, people, experts, gene expressions, temperatures, …
      • Multiple dimensions of relevance
          • Topicality, recency, reading level, …
  • 5. Complications
      • Many search tasks require a mix within these dimensions:
          • News and patents
          • Companies and their CEOs
          • Recent and on topic
      • Many search tasks also require a mix across these dimensions:
          • Patents assigned to our top 3 competitors in market segments mentioned in the recent press releases issued by our top 10 clients
  • 6.
      • System's internal information representation
          • Linguistic annotations
              • Named entities, sentiment, dependencies, …
          • Knowledge resources
              • Wikipedia, Freebase, IDC9, IPTC, …
          • Links to related documents
              • Citations, urls
      • Anchors that describe the URI
          • Anchor text
      • Queries that lead to clicks on the URI
          • Session, user, dwell-time, …
      • Tweets that mention the URI
          • Time, location, user, …
      • Other social media that describe the URI
          • User, rating
          • Tag, organisation of 'folksonomy'
      + UNCERTAINTY ALL OVER!
  • 7. What goes in the black box?
      • Input: document collection (anchors, entity types, sentiment, tweets, cited documents, …), user, context
      • Black box: BM25, BM25F, LM, RM, VSM, DFR, QIR? Learning to rank?
      • Output: ranked list of answers
      • ECIR / CIKM / SIGIR / ICTIR / WSDM papers!
  • 8. Rarely & scarcely addressed… Student: How do I build it? Professor: Who will build it for me? Last session of the conference…
  • 10. Parameterised Search System Cornacchia, De Vries, ECIR 2007 A Parametrised Search System
  • 11. Parameterised Search System Cannot we ‘remove’ this IR engineer (or scientist!) from the loop, like DBMS software removes the data engineer from the loop? Cornacchia, De Vries, ECIR 2007 A Parametrised Search System And three (four?) children, a startup and 5 years later, a PhD defense!
  • 12. Search by Strategy
      • Visually construct search strategies by connecting building blocks
  • 13.
  • 14. Search by Strategy
      • Visually construct search strategies by connecting building blocks
      • Each block describes either data or actions upon that data
      • Connection points (“pins”) are typed: doc / sec / term / ne (named entity) / tuple
      • Actions are expressed as scripts (more on this later)
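The typed pins can be illustrated with a small sketch. This is not Spinque's implementation: the `Block` and `Strategy` classes, the block names and the single-output simplification are all assumptions made for illustration. Only the pin types (doc / sec / term / ne / tuple) come from the slide.

```python
from dataclasses import dataclass, field

# Pin types named on the slide: doc / sec / term / ne (named entity) / tuple.
PIN_TYPES = {"doc", "sec", "term", "ne", "tuple"}


@dataclass
class Block:
    """A building block: a data source or an action on data (hypothetical model)."""
    name: str
    inputs: dict = field(default_factory=dict)   # input pin name -> pin type
    output: str = "doc"                          # type of the single output pin

    def __post_init__(self):
        for pin_type in [*self.inputs.values(), self.output]:
            if pin_type not in PIN_TYPES:
                raise ValueError(f"unknown pin type: {pin_type}")


@dataclass
class Strategy:
    """A set of type-checked connections between blocks (hypothetical model)."""
    edges: list = field(default_factory=list)    # (source, target, input pin)

    def connect(self, source: Block, target: Block, pin: str) -> None:
        expected = target.inputs[pin]
        if source.output != expected:
            raise TypeError(
                f"{source.name} produces '{source.output}' but "
                f"{target.name}.{pin} expects '{expected}'"
            )
        self.edges.append((source.name, target.name, pin))


# Illustrative blocks; the names are invented for this sketch.
patents = Block("patents", output="doc")
query = Block("query", output="term")
rank = Block("rank_by_text", inputs={"docs": "doc", "terms": "term"}, output="doc")
inventors = Block("doc_to_inventor", inputs={"docs": "doc"}, output="ne")

strategy = Strategy()
strategy.connect(patents, rank, "docs")
strategy.connect(query, rank, "terms")
strategy.connect(rank, inventors, "docs")
```

Connecting `query` (type term) to the `docs` pin of `doc_to_inventor` would raise a `TypeError`, which is the kind of mistake typed pins are meant to rule out while a strategy is being drawn.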
  • 16. From Patent to Inventor
  • 17. Reports Visits
  • 18. Generate Search Engine! Or, really, generate a REST API from the strategy specification!
  • 19. Demo (Showed demo of children's search engine)
  • 20. How Strategies Help
      • Strategies improve communication between search intermediary and user
          • Encapsulate domain expert knowledge
          • Abstract representation of search expert knowledge
          • Analyze information seeking process at any stage
      • Strategies facilitate knowledge management
          • Store / share / publish / refine
      • Strategies mix exact (DB) and ranked (IR) searches
          • Avoid the need for “human (probabilistic) joins”
  • 21.
  • 22. Search Intermediaries (figure axis: task complexity)
      • Travel agency
      • Real estate agents
      • Recruiters
      • Librarians
      • Archivists
      • Digital forensics detectives
      • Patent information specialists
  • 23. Exploratory Search
      • Search & (Faceted) Browsing
      • Help discover schema, ontology, etc.
      • Help discover the relevant sources
          • Within-collection (by year/location, by type, …)
          • Across multiple collections (by source)
  • 24. Probabilistic faceted browsing
      Both variants show the same facets:
          Price: 100K - 200K, 200K - 300K, 300K - 400K
          Rooms: 3, 4, 5
          Size:  100 - 150 m2, 150 - 200 m2, 200 - 250 m2
      • Traditional (boolean filters)
          • Good when user knows exactly which filters to apply
          • Will see perfect-match results
          • Won't see “interesting” results
      • Probabilistic
          • Good for exploratory search
          • Will see perfect-match results
          • Will also see “interesting” results
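The contrast between the two columns can be sketched in a few lines. The listings, the linear `membership` function and the `softness` widths are assumptions chosen for illustration; the slides do not say how the probabilistic variant scores near misses.

```python
# Toy listings; all values are made up for illustration.
houses = [
    {"id": "h1", "price": 180_000, "rooms": 4, "size": 120},
    {"id": "h2", "price": 205_000, "rooms": 4, "size": 140},
    {"id": "h3", "price": 390_000, "rooms": 3, "size": 110},
]
# Selected facet ranges: price 100K-200K, 4 rooms, 100-150 m2.
facets = {"price": (100_000, 200_000), "rooms": (4, 4), "size": (100, 150)}


def boolean_filter(items, facets):
    """Traditional facets: keep only items inside every selected range."""
    return [h["id"] for h in items
            if all(lo <= h[k] <= hi for k, (lo, hi) in facets.items())]


def membership(value, lo, hi, softness):
    """1.0 inside [lo, hi], decaying linearly to 0.0 at `softness` outside it."""
    distance = max(lo - value, value - hi, 0)
    return max(0.0, 1.0 - distance / softness)


def probabilistic_rank(items, facets, softness):
    """Probabilistic facets: score is the product of per-facet memberships."""
    scored = []
    for h in items:
        score = 1.0
        for k, (lo, hi) in facets.items():
            score *= membership(h[k], lo, hi, softness[k])
        if score > 0:
            scored.append((h["id"], round(score, 3)))
    return sorted(scored, key=lambda pair: -pair[1])


softness = {"price": 50_000, "rooms": 2, "size": 50}   # assumed tolerance per facet
exact = boolean_filter(houses, facets)
ranked = probabilistic_rank(houses, facets, softness)
```

With these toy values the boolean filter returns only `h1`, while the ranked variant also surfaces `h2`, which is 5K over the price limit: a perfect match first, followed by an “interesting” near miss.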
  • 25. Dynamic facets
      Both variants show the same facets:
          Price: 100K - 200K, 200K - 300K, 300K - 400K
          Rooms: 3, 4, 5
          Size:  100 - 150 m2, 150 - 200 m2, 200 - 250 m2
      • Pre-indexed
          • Pre-defined ad-hoc indices intersected with result set
          • Challenge: many indices to maintain
      • Dynamic
          • Facets decided from result set
          • Challenge: dynamically adapt granularity
              • Different price ranges for villa/garage!
          • Challenge: heavy concurrent queries to DB
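One way to adapt facet granularity to the result set is to cut the observed values into equal-count ranges. The sketch below assumes that approach; `dynamic_ranges` and the prices are invented for illustration and are not taken from the talk.

```python
def dynamic_ranges(values, buckets=3):
    """Split the values of a result set into `buckets` equal-count ranges."""
    ordered = sorted(values)
    if not ordered:
        return []
    buckets = min(buckets, len(ordered))
    ranges = []
    for i in range(buckets):
        start = i * len(ordered) // buckets
        end = (i + 1) * len(ordered) // buckets - 1
        ranges.append((ordered[start], ordered[end]))
    return ranges


# Made-up prices: the same facet needs very different granularity per result set.
villa_prices = [450_000, 520_000, 610_000, 700_000, 880_000, 1_200_000]
garage_prices = [8_000, 9_500, 12_000, 15_000, 21_000, 30_000]

villa_facet = dynamic_ranges(villa_prices)
garage_facet = dynamic_ranges(garage_prices)
```

Because the ranges are computed per result set, each query triggers an aggregation over its own results, which is where the slide's concern about heavy concurrent queries to the DB comes from.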
  • 27. Limitations Search & Browse
      • Faceted exploration does not include joins
          • Cannot construct new data sources from existing ones!
          • Only the pre-defined paths through the information space can actually be traversed
  • 28. Who needs a Join?
      • You!!!
      • … whenever 'relevance cues' are typed:
          • People (e.g., inventors)
          • Companies (e.g., assignees)
          • Categories (e.g., IPTC)
          • Time (e.g., expiry date)
          • Location (e.g., country)
      • … or whenever multiple sources are to be combined
          • E.g., patents & news, patents & Wikipedia, …
  • 29. Patents on X by Y(y) by Y(y)
  • 30. Interactive Information Access
      • Feedback:
          • Interaction improves information representation
      • Faceted Browsing:
          • Interaction can let user take over where machine would fail
      • Search by Strategy:
          • Interaction can let user take over where system designer would fail
  • 31. Conclusion
      • “No idealized one-shot search engine”
      • Empower the user!
  • 33. From Strategies to DB Queries
      • Strategy (Spinque: strategy): data flow between building blocks
            BB1(in1, in2, in3, u1, u2) → out
            BB2(in1) → out
      • Query (Spinque: PRA): strategy made operational
            CREATE VIEW a AS SELECT ..
            CREATE VIEW b AS SELECT ..
            CREATE VIEW c AS SELECT ..
      • Database (Spinque: RDBMS (MonetDB)): relational DB
  • 34. Probabilistic Relational Algebra
      • Strategy
      • PRA: probabilistic relational algebra (Fuhr and Roelleke, TOIS 2001)
            x = Project DISTINCT [$1,$3](y);
      • SQL: explicit probabilities
            CREATE VIEW x AS
              SELECT a1, a3, 1-prod(1-prob) AS prob
              FROM y
              GROUP BY a1, a3;
      • Relational DB
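The aggregate `1-prod(1-prob)` merges duplicate tuples as independent events: a projected tuple holds unless every one of its sources fails. A minimal Python sketch of that projection follows; the function name `project_distinct`, the toy relation `y` and the independence assumption are illustrative, not part of the slides.

```python
from collections import defaultdict


def project_distinct(relation, columns):
    """Probabilistic Project DISTINCT: merge duplicates with 1 - prod(1 - prob).

    `relation` is a list of (attribute_tuple, prob) pairs; `columns` holds
    1-based positions, mirroring the $1, $3 notation of the slide.
    Duplicates are treated as independent events (an assumption of this sketch).
    """
    miss = defaultdict(lambda: 1.0)          # probability that no duplicate holds
    for attrs, prob in relation:
        key = tuple(attrs[c - 1] for c in columns)
        miss[key] *= 1.0 - prob
    return {key: 1.0 - m for key, m in miss.items()}


# Toy relation y(a1, a2, a3) with made-up probabilities.
y = [
    (("d1", "title", "search"), 0.5),
    (("d1", "body", "search"), 0.5),
    (("d2", "body", "engine"), 0.8),
]
x = project_distinct(y, [1, 3])
```

Here the two `d1` tuples collapse into one whose probability is 1 - (0.5 × 0.5) = 0.75, whereas a plain relational DISTINCT would have discarded the evidence carried by the second tuple.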
  • 35. What's in the DB?
      • Text-based ranking
          • term-doc-freq relations (inverted file)
          • One per language, stemming, section
          • Domain-independent, click and index

            T    D    f
            t0   d3   3
            t0   d5   10
            t1   d2   4

      • Entity ranking
          • Probabilistic triples
          • Domain-aware
          • Needs supervised indexing

            subj     pred/attr   obj/value   p
            Arjen    speaks_to   you         0.95
            you      follow      Arjen       0.5
            speech   minutes     45          0.8

      • Content-based (MM) retrieval
          • Feature vectors, click and index

            Img_id   f1     …   fN
            0        0.12   …   0.84
            1        0.54   …   0.31
            2        0.23   …   0.1
  • 36. VIEWS and TABLES
        CREATE VIEW a AS SELECT … FROM term-doc … ;         -- reads a stored relation, no user parameter: becomes a TABLE
        CREATE VIEW b AS SELECT … FROM a WHERE a.x = u1 ;   -- depends on user parameter u1: stays a VIEW
        CREATE VIEW c AS SELECT … FROM a WHERE a.x = 42 ;   -- no user parameter, reads pre-computable relation a: becomes a TABLE
        CREATE VIEW d AS SELECT … FROM b … ;                -- reads b: stays a VIEW
      • BB content: sequence of VIEW definitions
      • A VIEW is pre-computable when
          • All the relations addressed are pre-computable / stored
          • No dependency on user parameters
      • Pre-computable VIEWs can become TABLEs (or MATERIALIZED VIEWs)
          • Query-independent computations are performed only once, then read from TABLEs at each query
          • Recognition of these patterns is fully automatic
          • Extends MonetDB's per-session caching to across-sessions caching
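The pre-computability rule is a fixed-point computation over the view dependency graph. The sketch below applies it to the four views of this slide. The function `precomputable` and the hand-written dependency sets are assumptions for illustration; how the real system extracts dependencies from the SQL is not shown in the slides.

```python
def precomputable(views, stored):
    """Return the names of views that can be materialised as tables.

    `views` maps a view name to (relations it reads, user parameters it uses);
    `stored` is the set of stored relations. A view qualifies when it uses no
    user parameter and everything it reads is stored or itself pre-computable.
    """
    result = set()
    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for name, (reads, params) in views.items():
            if name in result or params:
                continue
            if all(r in stored or r in result for r in reads):
                result.add(name)
                changed = True
    return result


# The four views of the slide, with dependencies written out by hand.
views = {
    "a": ({"term-doc"}, set()),
    "b": ({"a"}, {"u1"}),
    "c": ({"a"}, set()),
    "d": ({"b"}, set()),
}
tables = precomputable(views, stored={"term-doc"})
```

The result is `{"a", "c"}`: `b` is excluded because it uses `u1`, and `d` is excluded because it reads `b`, even though `d` has no user parameter of its own.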
  • 38. Current Situation
        index ;              (schema definition)
        repeat {
          specify ;
          retrieve           (search & explore)
        } until 
  • 39. Traditional Indexing
      • Preprocessing determines to a large extent how search requests will be processed
          • Especially regarding tokenization, stemming, etc.
      • Fast and scalable, but inflexible
          • E.g., entity search hard-coded on top of engine, advertisements matched on different data, etc.
  • 40. Search by Strategy
      • Flexible: generate arbitrary engine on the fly
      • Not as fast as highly optimized and very well engineered inverted file based systems
  • 41. Desirable Situation (mixed initiative)
        repeat {
          index ;            (schema definition)
          specify ;
          retrieve           (search & explore)
        } until 
  • 42. Non-Indexed Search
      • Grep
          • Very flexible
          • Use it all the time on my mh mail folders when gmail fails me!
          • Not scalable, little or no structure
  • 43. Minimal Indexing
      • How to reduce pre-processing necessary to create a search engine over a new collection?
          • Can we do without a keyword index?
          • Can we avoid hardwired decisions for tokenization, language detection, stemming, …?
  • 44. Suffix Array
      • Pros:
          • Provides many core search functions: term statistics, keyword search, phrase search
          • No upfront tokenization needed (access at character level)
          • No upfront language detection needed
      • Cons:
          • Difficult to build for large corpora
          • Expensive w.r.t. disk space
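A toy suffix array shows why no tokenization is needed: term counts and phrase search are both lookups over character offsets. This is a deliberately naive sketch (quadratic construction, and a prefix list built only for readability); the helper names and the sample text are invented, and corpus-scale construction needs specialised algorithms, as the cons above point out.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive construction: sort all suffix start offsets by their suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Return sorted offsets where `pattern` occurs, via binary search."""
    # Prefixes of the sorted suffixes are themselves sorted, so bisect applies.
    # Materialising them costs O(n * len(pattern)); done here for clarity only.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return sorted(suffix_array[lo:hi])


text = "search engines search the web; the next search engine"
sa = build_suffix_array(text)

term_frequency = len(find(text, sa, "search"))      # term statistics
phrase_hits = find(text, sa, "search engine")       # phrase search, no tokenizer
```

The same structure answers the single-term and the multi-word query, because both are just character sequences. The decision about what counts as a token is deferred to query time.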
  • 45.
  • 48. Patents on X by Y(y) by Y(y)
  • 49. PRA
        s__STRATEGY___filter_DOC_with_NE_nes =
          Project [$2,$3](
            Join [$1 = $2](
              s__STRATEGY___clef_ip_patents_DATA_result,
              Project [$1,$3](
                Select [$2 = "ipcr-classification"](
                  s__STRATEGY___clef_ip_patents_DATA_ne_doc
                )
              )
            )
          );
  • 50.
        CREATE TABLE s__STRATEGY___filter_DOC_with_NE_nes AS
        SELECT tmp_1814091754.a2 AS a1,
               tmp_1814091754.a3 AS a2,
               tmp_1814091754.prob AS prob
        FROM (
          SELECT s__STRATEGY___clef_ip_patents_DATA_result.a1 AS a1,
                 tmp__1652836708.a1 AS a2,
                 tmp__1652836708.a2 AS a3,
                 s__STRATEGY___clef_ip_patents_DATA_result.prob * tmp__1652836708.prob AS prob
          FROM s__STRATEGY___clef_ip_patents_DATA_result,
               (
                 SELECT tmp_1444787941.a1 AS a1,
                        tmp_1444787941.a3 AS a2,
                        tmp_1444787941.prob AS prob
                 FROM (
                   SELECT s__STRATEGY___clef_ip_patents_DATA_ne_doc.a1 AS a1,
                          s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 AS a2,
                          s__STRATEGY___clef_ip_patents_DATA_ne_doc.a3 AS a3,
                          s__STRATEGY___clef_ip_patents_DATA_ne_doc.prob AS prob
                   FROM s__STRATEGY___clef_ip_patents_DATA_ne_doc
                   WHERE s__STRATEGY___clef_ip_patents_DATA_ne_doc.a2 = 'ipcr-classification'
                 ) AS tmp_1444787941
               ) AS tmp__1652836708
          WHERE s__STRATEGY___clef_ip_patents_DATA_result.a1 = tmp__1652836708.a2
        ) AS tmp_1814091754
        ORDER BY a1
        WITH DATA;
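The generated SQL reduces to one idea: select the entities of one type, join them to the ranked documents, and multiply the probabilities of both sides. The sketch below reproduces that pattern with Python's `sqlite3` module instead of MonetDB. The short table names, the sample rows and the reading of the `ne_doc` columns as (entity, type, document) are assumptions inferred from the query, not confirmed by the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Short, hypothetical stand-ins for the generated relation names.
    CREATE TABLE result (a1 TEXT, prob REAL);                    -- ranked documents
    CREATE TABLE ne_doc (a1 TEXT, a2 TEXT, a3 TEXT, prob REAL);  -- entity, type, document
    INSERT INTO result VALUES ('doc1', 0.5), ('doc2', 0.25);
    INSERT INTO ne_doc VALUES
        ('G06F', 'ipcr-classification', 'doc1', 1.0),
        ('H04L', 'ipcr-classification', 'doc2', 0.5),
        ('ACME', 'assignee',            'doc1', 1.0);
""")

# Select the typed entities, join them to the ranked documents, and
# multiply the probabilities of both sides, as the generated SQL does.
rows = conn.execute("""
    SELECT n.a1 AS a1, n.a3 AS a2, r.prob * n.prob AS prob
    FROM result AS r
    JOIN ne_doc AS n ON r.a1 = n.a3
    WHERE n.a2 = 'ipcr-classification'
    ORDER BY n.a1
""").fetchall()
conn.close()
```

Each output row pairs a classification with a document, weighted by how well the document ranked and how certain the annotation is. The `assignee` row is filtered out by the type selection.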
  • 51. info@spinque.com www.spinque.com facebook.com/spinque

Editor's Notes

  1. Does “Entity-based ranking” make sense?
  2. NOTE: MATERIALIZED VIEWs, where supported (not in MonetDB), can be used instead of TABLEs when stored relations (index) are expected to get updates.