



                                 Usage of Solr at Trovit
                        A Search Engine For Classified Ads



                                                             Marc Sturlese
                                                                    Trovit

                                                           marc@trovit.com
                           Apache Lucene Eurocon 2010, Prague, 20 May 2010


Apache Lucene EuroCon                                                   4 May 2010
Agenda

              ● Trovit, a Solr use case
              ● Types of index
              ● Architecture overview
              ● Relevance tuning
              ● Out of the box features
              ● Custom features
              ● Sharding
              ● Future directions
              ● Questions


Apache Lucene EuroCon   05/16/10
What is Trovit? A Search Engine For Classified Ads




Types of index

              There are 3 different types of index
              ● Organic ads index
              ● Sponsored ads index
              ● Recommended searches index


              There is one index per country and per business category for
              each type, which means a total of 180 indexes.
              Some of them are sharded. All of them have replicas.




Types of index

                        [Screenshot showing the 3 types of index]




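Architecture overview

              [Diagram, top to bottom: crawling / parsing, warehouse, (indexing), Solr indexer, (replication), Solr slaves, load balancer, frontal, (load balancing), load balancer, (request). Side labels: "back end" next to the Solr indexer and "front end" next to the lower load balancer.]
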

Architecture overview

              Masters - Indexing
              ● 4 servers, continuously updating the indexes sequentially
                        ● 1 server to index organic ads for all countries/categories
                        ● 1 server to index powered ads for all countries/categories
                        ● 1 server to index recommended searches for all countries/categories


              Slaves – Serving search requests
              ● Indexes with high traffic have 4 replicas
              ● Indexes with less traffic have 3 replicas




Architecture overview

       ● Indexes are replicated using modified collection distribution
           scripts to allow multi-core
       ● Snapshooter and snappuller are executed sequentially
       ● Snapinstaller is executed at the same time on each slave
           so that all slaves serve exactly the same content at all times
       ● Load balancing started with Perlbal, which was producing
           high CPU loads
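
The replication sequence above can be sketched roughly as follows. This is only an illustration in Python, not the modified collection distribution scripts themselves: `take_snapshot`, `pull_snapshot` and `install_snapshot` are hypothetical stand-ins for invoking snapshooter, snappuller and snapinstaller, and the host names are made up. The snapshot and pull steps run one after another, and the install step is then fired on all slaves at once so that they switch to the new index together.

```python
from concurrent.futures import ThreadPoolExecutor


def take_snapshot(master, log):
    # Hypothetical stand-in for running snapshooter on the master.
    log.append(("snapshooter", master))


def pull_snapshot(slave, log):
    # Hypothetical stand-in for running snappuller on one slave.
    log.append(("snappuller", slave))


def install_snapshot(slave, log):
    # Hypothetical stand-in for running snapinstaller on one slave.
    log.append(("snapinstaller", slave))


def replicate(master, slaves):
    """Snapshot and pull sequentially, then install on every slave at once."""
    log = []
    take_snapshot(master, log)
    for slave in slaves:
        pull_snapshot(slave, log)
    # All installs are submitted together so the slaves change index
    # at (roughly) the same moment. Assumes at least one slave.
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        list(pool.map(lambda s: install_snapshot(s, log), slaves))
    return log


replication_log = replicate("master-organic", ["slave-1", "slave-2", "slave-3"])
```

In a real deployment the parallel step would be triggered across machines (for example by a scheduler or remote shell), which this single-process sketch does not attempt to model.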




Life of a user search request

           For every user search:
           ● A request is made to the organic and sponsored indexes
           ● For each result of the organic search, a request is made to the
                  recommended searches index


           ● 13 Solr requests per user search! And once this is done...
                  the user search request is batch processed to decide
                  whether it must be indexed in the similar user searches index




Life of a user search request




Relevance tuning

           ● Basic searches use the dismax qt, built on top of Lucene's
                        DisjunctionMaxQuery
           ● Boosting queries to make the latest ads more relevant
           ● Boost some ads at document level at indexing time to
                        make them more important than others
           ● Boost ads at field level at query time to make a match
                        more important in some fields than in others
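
As a rough illustration of the query-time part of this tuning, the sketch below builds the parameters of a dismax request in Python. The parameter names `defType`, `qf` and `bq` are standard Solr dismax parameters; the field names (`title`, `description`, `date`) and all boost values are assumptions made for the example, not Trovit's configuration. The slide mentions selecting dismax through `qt`; the sketch uses `defType=dismax` instead, which selects the same query parser on a standard handler.

```python
from urllib.parse import urlencode, parse_qs


def build_dismax_params(user_query):
    """Return a URL-encoded dismax query string for a user search."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Field-level boosts at query time: a match in the (hypothetical)
        # title field counts more than one in the description field.
        "qf": "title^3.0 description^1.0",
        # Boosting query: ads dated within the last week get extra weight.
        "bq": "date:[NOW/DAY-7DAYS TO NOW]^2.0",
        "rows": "10",
    }
    return urlencode(params)


query_string = build_dismax_params("home tennessee")
decoded = parse_qs(query_string)
```

Document-level boosts are applied when the document is sent to the indexer, so they do not appear among the query parameters.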




Relevance tuning

         User search: home tennessee
         ● Higher quality ad




         ● Lower quality ad




Out of the box Solr features

          ● Synonyms for USA states
          ● Per country and per business category stopwords
          ● MoreLikeThis request handler
          ● TrieFields to index housing latitude and longitude
          ● Facet fields, queries and dates.
          ● Warming queries from a specific file using an EventListener.
                  Issue SOLR-784




Out of the box Solr features: MoreLikeThis




Out of the box Solr features: Usage of TrieFields
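
The image for this slide did not survive, so the following is an assumed usage rather than the one shown in the talk. Trie-encoded numeric fields are designed for fast range queries, and a common use with latitude and longitude is a bounding-box filter. The sketch builds two Solr range filters in Python; the field names `latitude` and `longitude` and the box size are hypothetical, and it ignores wrap-around at the 180° meridian and at the poles.

```python
def bounding_box_filters(lat, lng, half_side_deg,
                         lat_field="latitude", lng_field="longitude"):
    """Return two range filter queries describing a square around a point."""
    lat_min, lat_max = lat - half_side_deg, lat + half_side_deg
    lng_min, lng_max = lng - half_side_deg, lng + half_side_deg
    return [
        f"{lat_field}:[{lat_min} TO {lat_max}]",
        f"{lng_field}:[{lng_min} TO {lng_max}]",
    ]


filters = bounding_box_filters(36.0, -86.5, 0.5)
```

Each string would be sent as a separate `fq` parameter so that Solr can cache the two filters independently.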




Custom features

          ● Duplicate detection
                  ● Coming from the same source: Indexing time
                  ● Coming from different sources: Indexing and search
                         time
          ● Pseudo field collapsing
          ● Custom ranking for sponsored ads
          ● Custom Data Import Handler for full indexing and updates




Custom features – Near-duplicate detection


          ● Ads coming from the same source
                 ● The last one to arrive is the one kept in the index
                 ● Deduplication using SignatureUpdateProcessor
                 ● Small hack to customize the TextProfileSignature


          ● Ads coming from different sources
                 ● Give the user the chance to decide which source to visit
                 ● Based on the field collapsing issue (SOLR-236) and the
                 SignatureUpdateProcessor used in deduplication
                 ● Done in 2 steps, one at index time and one at search time.
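
The same-source case ("the last one to arrive is kept") can be pictured with a small Python sketch. It is deliberately much simpler than Solr's TextProfileSignature: here the signature is just an MD5 over the sorted set of distinct tokens of two hypothetical fields, `title` and `description`, which is an assumption made for illustration. Documents that produce the same signature overwrite each other, so only the most recent one survives.

```python
import hashlib
import re


def signature(doc, fields=("title", "description"), min_token_len=3):
    """Simplified fuzzy signature: case, punctuation and word order are ignored."""
    text = " ".join(doc.get(f, "") for f in fields).lower()
    tokens = {t for t in re.findall(r"[a-z0-9]+", text) if len(t) >= min_token_len}
    return hashlib.md5(" ".join(sorted(tokens)).encode("utf-8")).hexdigest()


def index_with_overwrite(docs):
    """Keep one document per signature; a later document replaces an earlier one."""
    index = {}
    for doc in docs:
        index[signature(doc)] = doc
    return list(index.values())


ads = [
    {"id": 1, "title": "Nice home in Tennessee", "description": "Big garden"},
    {"id": 2, "title": "Flat in Prague", "description": "Central"},
    {"id": 3, "title": "nice home, in TENNESSEE!", "description": "big garden"},
]
kept = index_with_overwrite(ads)
```

Ads 1 and 3 differ only in case and punctuation, so they share a signature and ad 3 replaces ad 1.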
Near-duplicate detection
          Ads coming from different sources




Custom features – Near-duplicate detection
      Ads coming from different sources


      ● Why calculate them at index time?
              ● To avoid loading the FieldCache of a “big field” at search time,
                        which is very memory consuming!
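
One way to picture the two steps is sketched below in Python. It assumes that the index-time step has already stored a short `signature` value on every document, so that the search-time step only has to group results on that small field instead of loading a large text field. The field names and sample data are hypothetical, and this is a plain-Python analogy, not the SOLR-236 patch.

```python
def collapse_by_signature(results):
    """Group ranked results sharing a signature, keeping the best-ranked as head."""
    groups = {}
    for doc in results:
        groups.setdefault(doc["signature"], []).append(doc)
    collapsed = []
    for docs in groups.values():
        head = dict(docs[0])
        # The other sources are kept so the user can choose which one to visit.
        head["other_sources"] = [d["source"] for d in docs[1:]]
        collapsed.append(head)
    return collapsed


ranked_results = [
    {"id": "a1", "source": "site-a", "signature": "s1"},
    {"id": "b7", "source": "site-b", "signature": "s1"},
    {"id": "c3", "source": "site-c", "signature": "s2"},
]
collapsed_results = collapse_by_signature(ranked_results)
```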




Custom features – Pseudo field collapsing


      ● We don't want the first result pages to show only ads from the
                        same sources
      ● “Bad” results will be sent to the later pages
      ● SOLR-236 makes a double trip, which is not so good in performance
                        terms
      ● Core hack to avoid the double trip... SOLR-1311
      ● Does not support proper distributed search at the moment
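
The effect described above can be approximated in a single pass over the ranked list, as in this Python sketch. The limit of two ads per source and the `source` field name are assumptions for the example; the actual core hack is not shown in the slides.

```python
def demote_repeated_sources(ranked, max_per_source=2):
    """Move results beyond the per-source limit to the end, preserving order."""
    seen = {}
    front, back = [], []
    for doc in ranked:
        count = seen.get(doc["source"], 0)
        seen[doc["source"]] = count + 1
        (front if count < max_per_source else back).append(doc)
    return front + back


ranked_ads = [
    {"id": 1, "source": "a"},
    {"id": 2, "source": "a"},
    {"id": 3, "source": "a"},
    {"id": 4, "source": "b"},
    {"id": 5, "source": "a"},
    {"id": 6, "source": "c"},
]
reordered = demote_repeated_sources(ranked_ads)
```

Nothing is dropped: the surplus ads from source "a" simply land on later pages, behind the first ads from other sources.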




Custom features – Special ranking for Sponsored Ads
          ● Relevance is not the only thing that matters. External factors are
          important too.
          ● Implemented using a Solr SearchComponent
          ● External factors are loaded from a resource and used
                        in a Lucene FieldComparatorSource to alter the
                        score of the documents
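
The slides do not say how the external factors are combined with relevance, so the sketch below simply assumes a multiplication. It is a Python analogy of the idea, not a Lucene FieldComparatorSource: `external_factors` stands in for the resource loaded at startup, and the ad ids and values are invented.

```python
def rank_sponsored(ads, external_factors, default_factor=1.0):
    """Order ads by relevance score multiplied by a per-ad external factor."""
    def final_score(ad):
        return ad["score"] * external_factors.get(ad["id"], default_factor)
    return sorted(ads, key=final_score, reverse=True)


sponsored = [
    {"id": "a", "score": 2.0},
    {"id": "b", "score": 1.0},
    {"id": "c", "score": 1.5},
]
factors = {"b": 3.0, "c": 0.5}  # hypothetical externally supplied factors
ranked_sponsored = rank_sponsored(sponsored, factors)
```

Ad "b" has the lowest relevance score but ends up first because of its external factor.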




Custom features – Hacked DataImportHandler
      ● DIH is a tool to index data into Solr from different sources
      (XML, text, databases...)
      ● Extended transformers to alter data before it is indexed
      ● Delta imports are not meant for updating huge
      amounts of rows. Doing that can end up with memory
      problems
      ● If something crashes we have to reindex, which can sometimes
      take a long time. We want to keep going from the last indexed
      doc
      ● Hacks to allow us to use it as a distributed indexer.
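
The "keep going from the last indexed doc" requirement can be illustrated with a checkpoint file, as in the Python sketch below. This is not DataImportHandler code: the checkpoint format, the `send_batch` callback and the assumption that rows arrive sorted by an increasing `id` are all made up for the example.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the id of the last indexed row, or None if there is no checkpoint."""
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        return json.load(fh)["last_id"]


def save_checkpoint(path, last_id):
    # Write to a temporary file first so a crash never leaves a half-written file.
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"last_id": last_id}, fh)
    os.replace(tmp, path)


def index_rows(rows, send_batch, checkpoint_path, batch_size=2):
    """Index rows in batches, skipping everything up to the stored checkpoint."""
    last_id = load_checkpoint(checkpoint_path)
    pending = [r for r in rows if last_id is None or r["id"] > last_id]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        send_batch(batch)
        save_checkpoint(checkpoint_path, batch[-1]["id"])


# Usage: a second run only sends what the first run did not reach.
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
rows = [{"id": i} for i in range(1, 6)]
sent = []
index_rows(rows[:4], sent.append, checkpoint)  # first run stops after id 4
index_rows(rows, sent.append, checkpoint)      # resumed run sends only id 5
```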

Sharding

          First strategy
          ● No distributed IDFs at the moment, so it is better to choose
          randomly the shard where a doc is indexed:
                  SolrDocUniqueField.hashCode / NumberOfShards = ShardNumber

          ● Once we started keeping track of near duplicates among
          ads from different sources, this was no longer good enough.
                  Why? The dups system is based on SOLR-236: duplicated
                        documents must be indexed on the same shard to
                        be detected!
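
A minimal sketch of the first strategy, in Python. Two assumptions are made: the unique field is a string hashed the way Java's `String.hashCode` does it (reproduced here for characters in the Basic Multilingual Plane), and the `/` in the formula on the slide stands for taking the remainder, since a plain division would not yield a number between 0 and NumberOfShards - 1.

```python
def java_string_hash(s):
    """Emulate Java's String.hashCode for BMP characters (signed 32-bit result)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def shard_number(unique_id, number_of_shards):
    # Python's % never returns a negative value for a positive divisor,
    # so negative hash codes still map to a valid shard.
    return java_string_hash(unique_id) % number_of_shards


example_hash = java_string_hash("abc")
example_shard = shard_number("abc", 4)
```

Because the hash depends only on the document id, two near-duplicate ads with different ids will usually land on different shards, which is the limitation described in the second bullet.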


Sharding

       Second strategy
       ● The hash code of the signature field decides the shard number
       ● This forces the signature field to be calculated in the
                           warehouse, so that we already have it when the
                           indexing process starts


       Third and future strategy
       ● Calculate duplicates in the warehouse
       ● There will be no need for the dups to be in the same shard
                        anymore
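
The second strategy can be sketched as routing on the precomputed signature instead of the document id. In this Python illustration `zlib.crc32` is used as a stable stand-in for the hash code, and the document ids and signatures are invented; the point is only that documents sharing a signature always end up on the same shard.

```python
import zlib


def shard_for_signature(signature, number_of_shards):
    """Pick a shard from the signature so that duplicates are co-located."""
    return zlib.crc32(signature.encode("utf-8")) % number_of_shards


def route(docs, number_of_shards):
    """Return a mapping from shard number to the ids of the docs routed there."""
    shards = {n: [] for n in range(number_of_shards)}
    for doc in docs:
        target = shard_for_signature(doc["signature"], number_of_shards)
        shards[target].append(doc["id"])
    return shards


docs_to_route = [
    {"id": "site-a/17", "signature": "sig-home-tennessee"},
    {"id": "site-b/92", "signature": "sig-home-tennessee"},
    {"id": "site-c/05", "signature": "sig-flat-prague"},
]
routing = route(docs_to_route, 4)
```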
Future directions
         Proper distributed IDFs
         ● Allows absolute relevance across shards.
                More accurate results
         ● Issue SOLR-1632
         ● Still some bugs, especially when using boosting functions
         ● Allows better sharding strategies. No need to choose the
                shard number randomly anymore.
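
To see why per-shard IDF makes scores hard to compare, the Python sketch below evaluates the classic Lucene formula idf = 1 + ln(numDocs / (docFreq + 1)) on two shards separately and then on their combined statistics. The shard sizes and document frequencies are invented for the example.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene TF-IDF inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def local_idfs(shards, term):
    """IDF of a term as each shard would compute it on its own."""
    return [idf(s["doc_freq"].get(term, 0), s["num_docs"]) for s in shards]


def global_idf(shards, term):
    """IDF of a term computed from statistics summed over all shards."""
    total_df = sum(s["doc_freq"].get(term, 0) for s in shards)
    total_docs = sum(s["num_docs"] for s in shards)
    return idf(total_df, total_docs)


shard_stats = [
    {"num_docs": 1000, "doc_freq": {"tennessee": 9}},
    {"num_docs": 1000, "doc_freq": {"tennessee": 99}},
]
per_shard = local_idfs(shard_stats, "tennessee")
combined = global_idf(shard_stats, "tennessee")
```

The same term is weighted differently on each shard, so merged results are not ranked on a common scale. With randomly assigned documents the term statistics tend to be similar on every shard, which is why random placement is the safer choice until distributed IDF is available.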




Future directions
      Load balancing with ZooKeeper (SolrCloud)
      ● Use SolrCloud to manage sharding
      ● Currently being committed to trunk
      ● Replace the load balancer with ZooKeeper
      ● Let ZooKeeper handle distributed configuration




?
Thank you
                                    for your attention

                                                          Marc Sturlese
                                                                 Trovit

                                                        marc@trovit.com
                        Apache Lucene Eurocon 2010, Prague, 20 May 2010

Use of-solr-at-trovit-classified-ads marc-sturlese

  • 1. 1 U s a g e of S olr a t T r ov it A Search Engine For Classified Ads Marc Sturlese Trovit marc@trovit.com Apache Lucene Eurocon 2010, Prague, 20 May 2010 Apache Lucene EuroCon 4 May 2010
  • 2. Agenda ● Trovit, a Solr use case ● Types of index ● Architecture overview ● Relevance tuning ● Out of the box features ● Custom features ● Sharding ● Future directions ● Questions Apache Lucene EuroCon 05/16/10
  • 3. W h a t is T r o v it? A S e a r c h E n g in e F o r C la s s ifie d A d s Apache Lucene EuroCon 05/16/10
  • 4. T y pe s o f in de x There are 3 different types of index ● Organic ads index ● Sponsored ads index ● Recommended searches index There is an index per country and per business category for every type... what means a total of 180 index Some of them are sharded. All of them have replicas. Apache Lucene EuroCon 05/16/10
  • 5. T y pe s o f in de x Captura donde se vean los 3 tipos de índice Apache Lucene EuroCon 05/16/10
  • 6. A r qu ite ctu r e o v e r v ie w crawling / parsing wharehouse indexing Solr indexer back end replication Solr slaves load balancer frontal load balancing load balancer front end request Apache Lucene EuroCon 05/16/10 6
  • 7. A r ch ite ctu r e o v e r v ie w M a s te r s - I n de x in g ● 4 servers. Continuously updating index sequentially ● 1 server to index organic ads for all countries/categories ● 1 server to index powered ads for all countries/categories ● 1 server to index recommended searches for all countries/categories S la v e s – S e r v in g s e a r c h r e q u e s ts ● Index with high traffic have 4 replicas ● Indexs with less traffic have 3 replicas Apache Lucene EuroCon 05/16/10
  • 8. A r qu ite ctu r e o v e r v ir e w ● Index are replicated using modified c o l l e c t i o n d i s t r i b u t i o n scripts to allow multi core ● Snapshooter and snappuller are sequentially executed ● Snapinstaller is executed at the same time on each slave to preserve exactly the same content all the time ● Started load balancing with P e r l b a l . It was producing high CPU loads Apache Lucene EuroCon 05/16/10
  • 9. L ife o f a u s e r s e a r ch r e qu e s t For every user search: ● A request is done to the organic and sponsored index ● Per each result of the organic search, a request to the recommended searches ads is done ● 13 Solr request per user search! And once this is done... The user search request is going to be batch processed to decide if it must be indexed in the similar user searches index Apache Lucene EuroCon 05/16/10
  • 10. L ife o f a u s e r s e a r ch r e qu e s t Apache Lucene EuroCon 05/16/10
  • 11. R e le v a n c e tu n in g ● Basic searches use dismax qt. Build on top of Lucenes DisjunctionMaxQuery ● Boosting queries to make latest ads more relevant ● Boost some ads at document level at indexing time to make them more important than others ● Boost ads at field level at query time to make the match more important in some fields than in others Apache Lucene EuroCon 05/16/10
  • 12. Relevance tuning
      User search: home tennessee
      ● Higher quality ad
      ● Lower quality ad
  • 13. Out of the box Solr features
      ● Synonyms for USA states
      ● Per country and per business category stopwords
      ● MoreLikeThis request handler
      ● TrieFields to index housing latitude and longitude
      ● Facet fields, queries and dates
      ● Warming queries from a specific file using an EventListener. Issue SOLR-784
  • 14. Out of the box Solr features: MoreLikeThis
  • 15. Out of the box Solr features: Usage of TrieFields
  • 16. Custom features
      ● Duplicates detection
          ● Coming from the same source: indexing time
          ● Coming from different sources: indexing and search time
      ● Pseudo field collapsing
      ● Custom ranking for sponsored ads
      ● Custom Data Import Handler for full indexing and updates
  • 17. Custom features – Near duplicates detection
      ● Ads coming from the same source
          ● The last ad to arrive is the one kept in the index
          ● Deduplication method using SignatureUpdateProcessor
          ● Small hack to customize the TextProfileSignature
      ● Ads coming from different sources
          ● Give the user the chance to decide which source to visit
          ● Based on the field collapsing issue (SOLR-236) and the SignatureUpdateProcessor used in deduplication
          ● Done in 2 steps, one at index time and one at search time
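The "last one wins" behaviour for ads from the same source can be illustrated with a minimal sketch. The signature below is a deliberately simple stand-in (an MD5 hash of the normalized text), not Solr's TextProfileSignature, and the ad fields are hypothetical.

```python
import hashlib
import re


def signature(text):
    """Simplified stand-in signature: MD5 of lowercased, normalized words."""
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def index_ads(ads):
    """Keep one ad per signature; a later ad overwrites an earlier one."""
    index = {}
    for ad in ads:
        index[signature(ad["text"])] = ad
    return index


ads = [
    {"id": "a1", "text": "Cozy home in Tennessee, 3 bedrooms"},
    {"id": "a2", "text": "cozy home in tennessee,  3 BEDROOMS"},  # near duplicate of a1
    {"id": "a3", "text": "Studio flat in Prague"},
]
index = index_ads(ads)
kept_ids = sorted(ad["id"] for ad in index.values())
```

Here `a1` and `a2` normalize to the same text, so only the later one (`a2`) survives alongside `a3`. A fuzzy signature would also tolerate small wording differences, which an exact hash like this one does not.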
  • 18. Near duplicates detection: ads coming from different sources
  • 19. Custom features – Near duplicates detection: ads coming from different sources
      ● Why calculate them at index time?
      ● To avoid loading the FieldCache of a “big field” at search time, which consumes a lot of memory!
  • 20. Custom features – Pseudo field collapsing
      ● We don't want the first result pages to show only ads from the same sources
      ● “Bad” results are sent to the later pages
      ● SOLR-236 makes a double trip, which is not good for performance
      ● Core hack to avoid the double trip... SOLR-1311
      ● Does not support proper distributed search at the moment
  • 21. Custom features – Special ranking for sponsored ads
      ● Relevance is not the only thing that matters; external factors are important too
      ● Implemented using a Solr SearchComponent
      ● External factors are loaded from a resource and used in a Lucene FieldComparatorSource to alter the score of the documents
  • 22. Custom features – Hacked DataImportHandler
      ● DIH is a tool to index data into Solr from different sources (XML, text, databases...)
      ● Extended transformers to alter data before it is indexed
      ● Delta imports are not meant for updating huge numbers of rows; doing so can end in memory problems
      ● If something crashes we have to reindex, which can sometimes take a long time. We want to resume from the last indexed doc
      ● Hacks to allow us to use it as a distributed indexer
  • 23. Sharding
      First strategy
      ● No distributed IDFs at the moment, so it is better to choose randomly the shard where a doc is indexed:
        SolrDocUniqueField.hashCode % NumberOfShards = ShardNumber
      ● Once we started keeping track of near duplicates among ads from different sources, this was not good anymore. Why?
        The dups system is based on SOLR-236: duplicated documents must be indexed on the same shard to be detected!
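A minimal sketch of this first strategy, assuming the shard number is the hash of the document's unique key reduced modulo the number of shards. `zlib.crc32` is used here only because it gives a stable, non-negative hash in Python; it is a stand-in for Java's `hashCode`, and the document ids are hypothetical.

```python
import zlib


def shard_for(unique_key, number_of_shards):
    """Map a document's unique key to a shard number in [0, number_of_shards)."""
    return zlib.crc32(unique_key.encode("utf-8")) % number_of_shards


doc_ids = ["ad-1001", "ad-1002", "ad-1003", "ad-1004"]
assignments = {doc_id: shard_for(doc_id, 4) for doc_id in doc_ids}
```

Because the unique key has nothing to do with the ad's content, two near-duplicate ads from different sources can easily land on different shards, which is the problem described above.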
  • 24. Sharding
      Second strategy
      ● The hash code of the signature field decides the shard number
      ● This forces the signature field to be calculated in the warehouse, so that it is already available when the indexing process starts
      Third and future strategy
      ● Calculate duplicates in the warehouse
      ● There will be no need for the dups to be in the same shard anymore
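The second strategy can be sketched in the same spirit: the shard is derived from the precomputed signature rather than from the unique key, so documents that share a signature always meet on one shard. The `signature` values and ids below are hypothetical, and `zlib.crc32` again stands in for Java's `hashCode`.

```python
import zlib
from collections import defaultdict


def shard_for_signature(sig, number_of_shards):
    """Map a precomputed signature to a shard number."""
    return zlib.crc32(sig.encode("utf-8")) % number_of_shards


def route(docs, number_of_shards):
    """Group document ids by the shard chosen from their signature."""
    shards = defaultdict(list)
    for doc in docs:
        shard = shard_for_signature(doc["signature"], number_of_shards)
        shards[shard].append(doc["id"])
    return dict(shards)


docs = [
    {"id": "source-a/17", "signature": "9f2c"},
    {"id": "source-b/88", "signature": "9f2c"},  # same ad seen on another source
    {"id": "source-a/18", "signature": "41d0"},
]
routing = route(docs, 3)
```

Whatever shard the signature `9f2c` maps to, both of its documents end up there, so a per-shard duplicate check can see them together.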
  • 25. Future directions
      Proper distributed IDFs
      ● Allow absolute relevance across shards, giving more accurate results
      ● Issue SOLR-1632
      ● Still has some bugs, especially when using boosting functions
      ● Allow better sharding strategies: no need to choose the shard number randomly anymore
  • 26. Future directions
      Load balance with ZooKeeper (SolrCloud)
      ● Use SolrCloud to manage sharding
      ● Currently being committed to trunk
      ● Replace the load balancer with ZooKeeper
      ● Let ZooKeeper handle the distributed configuration
  • 28. Thank you for your attention
      Marc Sturlese, Trovit
      marc@trovit.com
      Apache Lucene Eurocon 2010, Prague, 20 May 2010