Building a big social network search system using Lucene


Presented by Aleksey Shevchuk, Lead developer,

We will explain how the search systems of the social network Odnoklassniki work. Each day 40 million people use Odnoklassniki to communicate and entertain themselves. These activities are hard to imagine without a proper search system. A dozen big indexes and thousands of small indexes respond to more than 4,000 searches per second at peak times. Users can search within specific sections of the site or across the whole site. The search system decides which indexes should be queried and which results to show. To improve relevance we use information from the social graph and various activity statistics available for indexed entities. Query log analysis? Again Lucene!

Published in: Education, Technology

  1. Building a big social network search system using Lucene. Aleksey Shevchuk, Lead developer @ Odnoklassniki
  2. Agenda
     • Functions and architecture
     • Problems & solutions
  3. About Odnoklassniki social network
     • Audience:
       – 200 million accounts
       – Up to 6 million users online
       – More than 40 million visitors a day
     • Within a second:
       – 290,000 web pages and 100,000 photos viewed
       – 4,000 search requests, average search time 70 ms
  4. Why did we choose Lucene?
     • Back in 2009 we had user search based on MS SQL; this simplified the initial requirements definition
     • We wanted an open-source solution written in Java
     • Tests showed that Solr underperformed for us
     • We developed our own server around Lucene
  5. Search system duties today: Users, Video, Music, Groups, Communities, Events, Gifts, Locations, Hobbies, Help, Group users
  6. Quick portal search
  7. Expanded portal search
  8. Architecture (diagram labels: Search facade, Event, Maker + DB, Search, Update, Query, Replication, Query, Services, Get, Entity cache, Presentation)
  9. Architecture: maker
     • Collects notifications about changed entities
     • Uses Cassandra to store additional entity data
     • Responsible for writing domain indexes
     • Controls index replication to query servers
  10. Architecture: query
      • Many servers in different hardware configurations
      • Unified application
      • Indexes are stored on disk for a quick start
      • Queries are executed in heap memory:
        – IndexReader rewritten to eliminate unnecessary operations
        – Own stored field retrieval method:
          • No garbage
          • Accesses values without actual deserialization
  11. Architecture: search facade
      • Creates and manages personal indexes
      • Schedules query execution
      • Reduces query results to search results
      • Loads data for result rendering
  12. Problems & solutions
  13. Problem: spelling vs performance
      • Most of the content is in Russian:
        – Proper Russian
        – Common misspellings
        – Misspellings made by people who try to write in Russian
        – Russian words written in Latin script (Translit & Crazy Russian)
        – Wrong keyboard layout
      • A few examples, with common misspellings omitted:
        – машина = мышына, масына, mashina, moshina
        – Кашин = kashin, кашен, ka6in
        – Kosheen = кошин, cosheen, koshin
  14. Solution: spelling vs performance
      • Reduce the number of terms using phonetics: MOSHINO = машина, мышына, масына, mashina, moshina
      • The query is expanded with a few phonetic keys:
        – Common misspellings
        – Synonyms we know
      • Distinguish the exact spelling using a 1-byte hash code per term
        – If possible, perform the hash check only for top documents
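The phonetic-key idea from slide 14 can be illustrated with a small sketch. Everything here is an assumption made for illustration: the transliteration table, the vowel-collapsing rule, the use of CRC32 for the 1-byte hash, and the names `phonetic_key`, `spelling_hash` and `PhoneticIndex` are hypothetical and are not the rules or code used at Odnoklassniki.

```python
import zlib
from collections import defaultdict

# Toy Cyrillic-to-Latin table; covers only the letters of the slide's example.
TRANSLIT = {"м": "m", "а": "a", "ш": "sh", "и": "i", "н": "n", "ы": "y", "с": "s"}
VOWELS = set("aeiouy")


def phonetic_key(term):
    """Map spelling variants to one key (toy rule, not the real algorithm)."""
    latin = "".join(TRANSLIT.get(ch, ch) for ch in term.lower())
    latin = latin.replace("sh", "s")  # treat "sh" and "s" as the same sound
    # Collapse every vowel into a single class so a/o/i/y variants coincide.
    return "".join("A" if ch in VOWELS else ch for ch in latin).upper()


def spelling_hash(term):
    """1-byte hash of the exact spelling (CRC32 low byte is an assumption)."""
    return zlib.crc32(term.lower().encode("utf-8")) & 0xFF


class PhoneticIndex:
    """Postings are keyed by phonetic key; each posting keeps a 1-byte hash."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, doc_id, term):
        self._postings[phonetic_key(term)].append((doc_id, spelling_hash(term)))

    def search(self, term, exact=False):
        hits = self._postings.get(phonetic_key(term), [])
        if exact:
            # Cheap filter on the stored byte; different spellings can collide.
            wanted = spelling_hash(term)
            hits = [hit for hit in hits if hit[1] == wanted]
        return [doc_id for doc_id, _ in hits]


index = PhoneticIndex()
for doc_id, word in enumerate(["машина", "мышына", "масына", "mashina", "moshina"], 1):
    index.add(doc_id, word)

print(index.search("moshina"))              # all five variants share one key
print(index.search("mashina", exact=True))  # narrowed by the 1-byte hash
```

Because the hash is only one byte, the exact check is a cheap filter rather than a guarantee, which fits the slide's suggestion to run it only on the top documents.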
  15. Problem: personal index availability
      • Queries take 5–100 ms
      • Personal index composition takes 50–300 ms
      • (Diagram labels: Cache, Cache, Service, Service, Service, *2, *2, *2)
      • Network load on cache servers quickly hit 700 Mb/s
      • Meanwhile, there was no CPU load on cache servers
  16. Solution: personal index availability
      • (Diagram labels: Service 0-19, Service 20-39, Service 40-59, Service 60-79, Service 80-99; example value 37)
      • Bind users to specific servers
      • Store personal indexes locally (in off-heap memory)
      • Determine the substitution order
      • Whole network load is under 100 Mb/s
      • CPU load is even across all servers
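One way to read slide 16 is that each user falls into one of 100 partitions, each server owns a contiguous range of partitions, and a fixed fallback order decides who takes over when a server is down. The sketch below assumes exactly that. The modulo rule, the ring-shaped substitution order and the names `server_order` and `pick_server` are illustrative guesses, not the scheme described in the talk.

```python
PARTITIONS = 100
# Server names mirror the ranges on the slide; the naming is hypothetical.
SERVERS = [
    "service-0-19",
    "service-20-39",
    "service-40-59",
    "service-60-79",
    "service-80-99",
]


def server_order(user_id):
    """Return servers in the order they should be tried for this user."""
    partition = user_id % PARTITIONS  # assumption: partition = id mod 100
    primary = partition * len(SERVERS) // PARTITIONS
    # Assumed substitution order: walk the ring starting at the primary.
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(len(SERVERS))]


def pick_server(user_id, alive):
    """Pick the first live server in the user's substitution order."""
    for server in server_order(user_id):
        if server in alive:
            return server
    raise RuntimeError("no search service available")


everyone = set(SERVERS)
print(pick_server(37, everyone))                      # service-20-39
print(pick_server(37, everyone - {"service-20-39"}))  # service-40-59
```

Since the same user always lands on the same server, that server can keep the personal index in local memory instead of fetching it from a shared cache for every query.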
  17. Problem: gender and country filters
      • Usually an index is split into shards until the average query time meets some bounds
        – This solves the response time problem
        – All possible documents are checked
      • There are 2 filters which make user queries slow:
        – Gender
        – One very popular country
  18. Solution: gender and country filters
      • Remove these condition checks: saves 17% CPU
      • Exclude documents which could not match these filters: saves another 12% CPU
      • (Diagram labels: Russian males, Russian females, Other males, Other females)
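The four labels on slide 18 suggest that documents are split by gender and by "popular country vs. everything else", so a filtered query only has to visit the partitions that can match. A minimal sketch under that assumption follows. The field names, the `"RU"` / `"OTHER"` codes and the functions `partition_of` and `partitions_for` are hypothetical.

```python
from itertools import product

COUNTRY_CLASSES = ["RU", "OTHER"]
GENDERS = ["M", "F"]


def country_class(country):
    """Collapse every country except the popular one into a single class."""
    return "RU" if country == "RU" else "OTHER"


def partition_of(doc):
    """Partition a user document is written to at indexing time."""
    return (country_class(doc["country"]), doc["gender"])


def partitions_for(gender=None, country=None):
    """Partitions a query must visit; None means the filter is not set."""
    countries = COUNTRY_CLASSES if country is None else [country_class(country)]
    genders = GENDERS if gender is None else [gender]
    return list(product(countries, genders))


print(partition_of({"name": "Ivan", "country": "RU", "gender": "M"}))
print(partitions_for(gender="F", country="RU"))  # one partition, no per-document check
print(partitions_for(gender="M"))                # two partitions
```

Note that a filter on a country other than the popular one still needs a per-document country check inside the "OTHER" partitions; only the gender check and the popular-country check disappear entirely.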
  19. Problem: users online search
      • People wish to quickly find a person they can talk to
      • At any given moment, only a small fraction of users are online
      • Standard solution: filter the general search results down to online users:
        + easy to implement
        + reliable
        – slow, especially for random user queries
        – wastes CPU
  20. Solution: users online search
      • Create a separate index with online users only:
        + works quickly
        + no tricks required
        – more than 200,000 changes/minute
        – correct results depend on index maker availability
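To make the trade-off on slide 20 concrete, here is a toy in-memory version of an "online users only" index that is updated on every login and logout. It is a sketch only: the class `OnlineUserIndex`, its method names and the whitespace tokenization are assumptions, and the real system is built on Lucene indexes rather than Python sets.

```python
from collections import defaultdict


class OnlineUserIndex:
    """Tiny inverted index that contains only users who are online now."""

    def __init__(self):
        self._terms_by_user = {}
        self._users_by_term = defaultdict(set)

    def user_online(self, user_id, name):
        """Index a user on login (re-indexes if already present)."""
        self.user_offline(user_id)
        terms = name.lower().split()
        self._terms_by_user[user_id] = terms
        for term in terms:
            self._users_by_term[term].add(user_id)

    def user_offline(self, user_id):
        """Remove a user on logout; unknown users are ignored."""
        for term in self._terms_by_user.pop(user_id, []):
            self._users_by_term[term].discard(user_id)

    def search(self, query):
        """Return ids of online users whose name contains every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self._users_by_term.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._users_by_term.get(term, set())
        return result


online = OnlineUserIndex()
online.user_online(1, "Ivan Petrov")
online.user_online(2, "Ivan Sidorov")
print(online.search("ivan"))  # both users
online.user_offline(1)
print(online.search("ivan"))  # only user 2 remains
```

Queries never touch offline users, which is why this is fast, but every missed login or logout event leaves the index wrong until it is corrected, matching the slide's note about dependence on index maker availability.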
  21. Problem: user search inside group
      • This kind of search is in demand from group owners
      • Some numbers:
        – 200 million users in 16 shards
        – 7 million groups in 8 shards
        – Each group has from 1 to several million users
        – Number of group-to-user connections: billions
      • “Dummy solutions” were not checked
  22. Solution: user search inside group
      • (Diagram labels: Portal services, Search façade, Groups, Users, Small groups, Heap memory, Off-heap memory)
      • We use the mechanics from personal indexes
      • Currently indexed groups are updated with changes
      • Small group indexes are discarded after 1 hour
      • Big group indexes are kept until the application restarts
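The lifetime rules on slide 22 (small group indexes dropped after an hour, big ones kept until restart) resemble a cache with a time-to-live that applies only to small entries. The sketch below assumes the hour is counted from the last access and uses a made-up size threshold. `GroupIndexCache`, `BIG_GROUP_MEMBERS` and the `build_index` callback are hypothetical names.

```python
import time

SMALL_TTL_SECONDS = 3600     # "1 hour" from the slide
BIG_GROUP_MEMBERS = 100_000  # hypothetical threshold for a "big" group


class GroupIndexCache:
    """Builds per-group indexes on demand and expires only the small ones."""

    def __init__(self, build_index, clock=time.monotonic):
        self._build_index = build_index
        self._clock = clock
        self._entries = {}  # group_id -> [index, is_big, last_used]

    def get(self, group_id, member_count):
        """Return the group's index, building it if it is not cached."""
        self.evict_expired()
        entry = self._entries.get(group_id)
        if entry is None:
            is_big = member_count >= BIG_GROUP_MEMBERS
            entry = [self._build_index(group_id), is_big, 0.0]
            self._entries[group_id] = entry
        entry[2] = self._clock()  # assumption: TTL counts from last access
        return entry[0]

    def evict_expired(self):
        """Drop small group indexes that were idle for the whole TTL."""
        now = self._clock()
        stale = [
            group_id
            for group_id, (_, is_big, last_used) in self._entries.items()
            if not is_big and now - last_used >= SMALL_TTL_SECONDS
        ]
        for group_id in stale:
            del self._entries[group_id]

    def __contains__(self, group_id):
        return group_id in self._entries


cache = GroupIndexCache(lambda group_id: f"index-for-group-{group_id}")
print(cache.get(7, member_count=50))
print(7 in cache)
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting an hour.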
  23. More information: Aleksey with Odnoklassniki.ru
  24. Aleksey