Use of Solr at Trovit classified ads, Marc Sturlese
Published in Technology
  • 1. Usage of Solr at Trovit: A Search Engine For Classified Ads. Marc Sturlese, Trovit. Apache Lucene Eurocon 2010, Prague, 20 May 2010. Apache Lucene EuroCon 4 May 2010
  • 2. Agenda
    ● Trovit, a Solr use case
    ● Types of index
    ● Architecture overview
    ● Relevance tuning
    ● Out of the box features
    ● Custom features
    ● Sharding
    ● Future directions
    ● Questions
    Apache Lucene EuroCon 05/16/10
  • 3. What is Trovit? A Search Engine For Classified Ads
  • 4. Types of index
    There are 3 different types of index:
    ● Organic ads index
    ● Sponsored ads index
    ● Recommended searches index
    There is an index per country and per business category for every type, which means a total of 180 indexes. Some of them are sharded. All of them have replicas.
  • 5. Types of index
    [Screenshot showing the 3 types of index]
  • 6. Architecture overview
    [Diagram labels: crawling / parsing; warehouse; indexing; Solr indexer; back end; replication; Solr slaves; load balancer; front-end load balancing; load balancer; front end; request]
  • 7. Architecture overview
    Masters - Indexing
    ● 4 servers. Continuously updating indexes sequentially
    ● 1 server to index organic ads for all countries/categories
    ● 1 server to index powered ads for all countries/categories
    ● 1 server to index recommended searches for all countries/categories
    Slaves – Serving search requests
    ● Indexes with high traffic have 4 replicas
    ● Indexes with less traffic have 3 replicas
  • 8. Architecture overview
    ● Indexes are replicated using modified collection distribution scripts to allow multi core
    ● Snapshooter and snappuller are executed sequentially
    ● Snapinstaller is executed at the same time on each slave, so that all slaves serve exactly the same content at all times
    ● Started load balancing with Perlbal. It was producing high CPU loads
  • 9. Life of a user search request
    For every user search:
    ● A request is sent to the organic and sponsored indexes
    ● For each result of the organic search, a request is sent to the recommended searches index
    ● 13 Solr requests per user search!
    And once this is done... the user search request is batch processed to decide whether it must be indexed in the similar user searches index
  • 10. Life of a user search request
  • 11. Relevance tuning
    ● Basic searches use the dismax query type (qt), built on top of Lucene's DisjunctionMaxQuery
    ● Boosting queries to make the latest ads more relevant
    ● Boost some ads at document level at indexing time to make them more important than others
    ● Boost ads at field level at query time to make a match more important in some fields than in others
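The query-time part of this tuning can be sketched as a set of dismax request parameters. This is a minimal illustration, not Trovit's configuration: the field names (`title`, `description`, `published_date`) and boost values are hypothetical, `defType=dismax` is used where the slide says "dismax qt", and recency is expressed with a boost function (`bf`) rather than the boosting queries (`bq`) the slide mentions.

```python
from urllib.parse import urlencode, parse_qs


def build_dismax_params(user_query, field_boosts, recency_field):
    """Build dismax request parameters with field boosts and a recency boost."""
    # qf lists the searched fields with their query-time boosts,
    # e.g. "title^3.0 description^1.0" (field names are hypothetical).
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    # bf adds a function boost that decays with document age;
    # 3.16e-11 is roughly 1 / (milliseconds in a year).
    bf = f"recip(ms(NOW,{recency_field}),3.16e-11,1,1)"
    return {"defType": "dismax", "q": user_query, "qf": qf, "bf": bf}


params = build_dismax_params(
    "home tennessee",
    {"title": 3.0, "description": 1.0},
    "published_date",
)
query_string = urlencode(params)
print(query_string)
```

Document-level boosts set at indexing time are not visible in the request; they would be applied when the ad is added to the index.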
  • 12. Relevance tuning
    User search: home tennessee
    ● Higher quality ad
    ● Lower quality ad
  • 13. Out of the box Solr features
    ● Synonyms for USA states
    ● Per country and per business category stopwords
    ● MoreLikeThis request handler
    ● TrieFields to index housing latitude and longitude
    ● Facet fields, queries and dates
    ● Warming queries from a specific file using an EventListener. Issue SOLR-784
  • 14. Out of the box Solr features: MoreLikeThis
  • 15. Out of the box Solr features: Usage of TrieFields
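The slide content here was an image, so the following is only a guess at the kind of query numeric trie fields make cheap: a bounding-box filter expressed as two range queries. The field names `latitude` and `longitude` and the box size are hypothetical, and the sketch ignores wrap-around at the poles and the date line.

```python
def bounding_box_filters(lat, lon, half_side_deg,
                         lat_field="latitude", lon_field="longitude"):
    """Return two Solr range filter queries enclosing a point."""
    south, north = lat - half_side_deg, lat + half_side_deg
    west, east = lon - half_side_deg, lon + half_side_deg
    # Each string can be sent as a separate fq parameter.
    return [
        f"{lat_field}:[{south:.4f} TO {north:.4f}]",
        f"{lon_field}:[{west:.4f} TO {east:.4f}]",
    ]


filters = bounding_box_filters(36.16, -86.78, 0.25)
print(filters)
```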
  • 16. Custom features
    ● Duplicates detection
        ● Coming from the same source: indexing time
        ● Coming from different sources: indexing and search time
    ● Pseudo field collapsing
    ● Custom ranking for sponsored ads
    ● Custom Data Import Handler for full indexing and updates
  • 17. Custom features – Near duplicates detection
    ● Ads coming from the same source
        ● The last one to arrive is the one kept in the index
        ● Deduplication method using SignatureUpdateProcessor
        ● Small hack to customize the TextProfileSignature
    ● Ads coming from different sources
        ● Give the user the chance to decide which source to visit
        ● Based on the field collapsing issue (SOLR-236) and the SignatureUpdateProcessor used in deduplication
        ● Done in 2 steps, one at index time and one at search time
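To give an idea of how a fuzzy signature lets near-identical ads collide, here is a simplified sketch loosely modelled on the text-profile approach: keep only the frequent tokens, quantize their counts, and hash the resulting profile. It is not Solr's TextProfileSignature code and not Trovit's customized version; the parameter names and sample ads are made up for illustration.

```python
import hashlib
import re
from collections import Counter


def text_profile_signature(text, min_token_len=2, quant_rate=0.01):
    """Hash a quantized profile of the frequent tokens in the text."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # Quantization step: rare tokens fall below it and are dropped.
    quant = round(max_freq * quant_rate)
    if quant < 2:
        quant = 2 if max_freq > 1 else 1
    profile = []
    for token, freq in counts.items():
        quantized = (freq // quant) * quant
        if quantized >= quant:
            profile.append((token, quantized))
    # Sort by decreasing frequency, then token, for a stable ordering.
    profile.sort(key=lambda item: (-item[1], item[0]))
    digest = hashlib.md5()
    for token, freq in profile:
        digest.update(f"{token} {freq}\n".encode("utf-8"))
    return digest.hexdigest()


ad_a = "Bright home in Tennessee. Home with garden, home near school. Garden and garage."
ad_b = "Bright home in Tennessee! Home with garden; home close to a school. Garden and big garage."
ad_c = "Red car for sale. Car in good condition, car with new tires. Tires included."

sig_a = text_profile_signature(ad_a)
sig_b = text_profile_signature(ad_b)
sig_c = text_profile_signature(ad_c)
print(sig_a == sig_b, sig_a == sig_c)
```

Because only the repeated tokens survive quantization, small wording changes leave the signature untouched, while an unrelated ad produces a different one.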
  • 18. Near duplicates detection: ads coming from different sources
  • 19. Custom features – Near duplicates detection: ads coming from different sources
    ● Why calculate them at index time?
    ● To avoid loading the FieldCache of a “big field” at search time. Very memory consuming!
  • 20. Custom features – Pseudo field collapsing
    ● We don't want the first results pages to show only ads from the same sources
    ● “Bad” results will be sent to the later pages
    ● SOLR-236 makes a double trip, not so good in performance terms
    ● Core hack to avoid the double trip... SOLR-1311
    ● Does not support proper distributed search at the moment
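The intended effect on the result list can be illustrated with a single-pass reordering that lets each source appear a limited number of times and pushes the overflow to the end. This is only a sketch of the behaviour described, not the SOLR-236 or SOLR-1311 implementation; the `source` key and the `max_per_source` limit are assumptions.

```python
def demote_repeated_sources(results, max_per_source=2):
    """Keep the first few ads per source in place and demote the rest."""
    seen = {}
    kept, demoted = [], []
    for doc in results:
        source = doc["source"]
        seen[source] = seen.get(source, 0) + 1
        if seen[source] <= max_per_source:
            kept.append(doc)
        else:
            # Overflow from an over-represented source goes to later pages.
            demoted.append(doc)
    return kept + demoted


ranked = [
    {"id": 1, "source": "a"},
    {"id": 2, "source": "a"},
    {"id": 3, "source": "a"},
    {"id": 4, "source": "b"},
]
reordered_ids = [doc["id"] for doc in demote_repeated_sources(ranked)]
print(reordered_ids)
```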
  • 21. Custom features – Special ranking for Sponsored Ads
    ● Not just relevance is important. External factors are important too.
    ● Implemented using a Solr SearchComponent
    ● External factors are loaded from a resource and used in a Lucene FieldComparatorSource to alter the score of the documents
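The slide does not say what the external factors are or how they are combined with relevance, so the following sketch assumes the simplest option: a per-ad multiplier applied to the relevance score before sorting. The ids, scores and factors are invented for illustration, and the real implementation runs inside Solr as a SearchComponent rather than as a post-processing step.

```python
def rank_sponsored(ads, external_factor, default_factor=1.0):
    """Sort ads by relevance score scaled by an external per-ad factor."""
    def adjusted_score(ad):
        # Ads without an external factor keep their plain relevance score.
        return ad["score"] * external_factor.get(ad["id"], default_factor)

    return sorted(ads, key=adjusted_score, reverse=True)


ads = [{"id": "a", "score": 2.0}, {"id": "b", "score": 1.5}]
factors = {"b": 2.0}  # hypothetical values loaded from an external resource
ranked_ids = [ad["id"] for ad in rank_sponsored(ads, factors)]
print(ranked_ids)
```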
  • 22. Custom features – Hacked DataImportHandler
    ● DIH is a tool to index data into Solr from different sources (XML, text, databases...)
    ● Extended transformers to alter data before it is indexed
    ● Delta imports are not meant for updating huge numbers of rows. Doing that can end up with memory problems
    ● If something crashes we have to reindex, which can sometimes take a long time. We want to keep going from the last indexed doc
    ● Hacks to allow us to use it as a distributed indexer
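The "keep going from the last indexed doc" idea can be sketched independently of DIH as a checkpoint that records the last row successfully indexed. This assumes rows arrive ordered by an increasing numeric id, which the slides do not state; `resume_indexing` and the checkpoint dictionary are hypothetical names, and a real setup would persist the checkpoint outside the process.

```python
def resume_indexing(rows, index_doc, checkpoint):
    """Index rows in id order, skipping those already covered by the checkpoint."""
    for row in rows:
        if row["id"] <= checkpoint["last_id"]:
            continue
        index_doc(row)
        # Only advance the checkpoint after the row was indexed successfully.
        checkpoint["last_id"] = row["id"]
    return checkpoint


rows = [{"id": i} for i in range(1, 6)]
checkpoint = {"last_id": 0}
indexed = []


def crash_on_four(row):
    if row["id"] == 4:
        raise RuntimeError("simulated crash")
    indexed.append(row["id"])


try:
    resume_indexing(rows, crash_on_four, checkpoint)
except RuntimeError:
    pass

# The second run picks up after the last row that was indexed.
resume_indexing(rows, lambda row: indexed.append(row["id"]), checkpoint)
print(indexed, checkpoint)
```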
  • 23. Sharding
    First strategy
    ● No distributed IDFs at the moment. Better to choose randomly the shard in which to index a doc: SolrDocUniqueField.hashCode / NumberOfShards = ShardNumber
    ● Once we started keeping track of near duplicates among ads from different sources, this was no longer good enough.
    Why? The dups system is based on SOLR-236: duplicated documents must be indexed on the same shard to be detected!
  • 24. Sharding
    Second strategy
    ● The hash code of the signature field decides the shard number
    ● This forces the signature field to be calculated in the warehouse, so that it is already available when the indexing process starts
    Third and future strategy
    ● Calculate duplicates in the warehouse
    ● There will be no need for the dups to be in the same shard anymore
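The difference between the first two strategies comes down to which value is hashed. The sketch below reads the "/" in the first strategy's formula as the remainder of the division, and substitutes `zlib.crc32` for Java's `hashCode` so the result is stable across runs; the document ids and the signature value are made up.

```python
import zlib


def shard_for(key, number_of_shards):
    """Map a string key to a shard number in [0, number_of_shards)."""
    return zlib.crc32(key.encode("utf-8")) % number_of_shards


NUMBER_OF_SHARDS = 4

# Two near-duplicate ads from different sources: distinct ids, same signature.
ad_1 = {"id": "source-a/123", "signature": "9f2c1e"}
ad_2 = {"id": "source-b/987", "signature": "9f2c1e"}

# First strategy: hash the unique key, so duplicates may land on different shards.
by_id = [shard_for(ad["id"], NUMBER_OF_SHARDS) for ad in (ad_1, ad_2)]

# Second strategy: hash the signature, so duplicates always share a shard.
by_signature = [shard_for(ad["signature"], NUMBER_OF_SHARDS) for ad in (ad_1, ad_2)]

print(by_id, by_signature)
```

Hashing the signature guarantees co-location, at the cost of needing the signature before indexing starts, which is why the second strategy moves its calculation into the warehouse.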
  • 25. Future directions
    Proper distributed IDFs
    ● Allows absolute relevance across shards. More accurate results
    ● Issue SOLR-1632
    ● Still some bugs, especially when using boosting functions
    ● Allows sharding strategies to be improved. No need to choose the shard number randomly anymore.
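A small worked example shows why per-shard IDF makes scores hard to compare. It uses the idf formula of Lucene's DefaultSimilarity, 1 + ln(numDocs / (docFreq + 1)), with invented shard statistics for a single term; the numbers are illustrative only.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency as in Lucene's DefaultSimilarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical statistics for one term on two shards of equal size.
shards = [
    {"num_docs": 1000, "doc_freq": 9},   # term is rare on this shard
    {"num_docs": 1000, "doc_freq": 99},  # term is common on this shard
]

local_idfs = [idf(s["doc_freq"], s["num_docs"]) for s in shards]
global_idf = idf(
    sum(s["doc_freq"] for s in shards),
    sum(s["num_docs"] for s in shards),
)

print(local_idfs, global_idf)
```

With local statistics the same term weighs more on the first shard than on the second, so equally good matches get different scores depending on where they were indexed. A global IDF removes that dependence, which is also why random shard assignment stops being necessary.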
  • 26. Future directions
    Load balance with ZooKeeper (SolrCloud)
    ● Use SolrCloud to manage sharding
    ● Currently being committed to trunk
    ● Replace the load balancer with ZooKeeper
    ● Let ZooKeeper handle distributed configuration
  • 27. ?
  • 28. Thank you for your attention. Marc Sturlese, Trovit. Apache Lucene Eurocon 2010, Prague, 20 May 2010