Building big social network search system using lucene
Presented by Aleksey Shevchuk, Lead developer, odnoklassniki.ru

We will explain how the search systems of the social network Odnoklassniki work. Each day 40 million people use Odnoklassniki to communicate and entertain themselves. These activities are hard to imagine without a proper search system. A dozen big indexes and thousands of small indexes respond to more than 4,000 searches per second at peak times. Users can search within specific sections of the site or across the whole site. The search system decides which indexes to query and which results to show. To improve relevance we use information from the social graph and various activity statistics available for indexed entities. Query log analysis? Again Lucene!

Transcript

  • 1. Building a big social network search system using Lucene
      Aleksey Shevchuk, Lead developer @ Odnoklassniki
  • 2. Agenda
      • Functions and architecture
      • Problems & solutions
  • 3. About Odnoklassniki social network
      • Audience:
          – 200 mln accounts
          – Up to 6 mln users online
          – More than 40 mln visitors a day
      • Within a second:
          – 290,000 web pages and 100,000 photos viewed
          – 4,000 search requests, average search time 70 ms
  • 4. Why did we choose Lucene?
      • Back in 2009 we had user search based on MS SQL; this simplified the initial requirement definition
      • We wanted an open-source solution written in Java
      • Tests showed that Solr underperformed for us
      • We developed our own server around Lucene
  • 5. Search system duties today
      Users, Video, Music, Groups, Communities, Events, Gifts, Locations, Hobbies, Help, Group users
  • 6. Quick portal search
  • 7. Expanded portal search
  • 8. Architecture
      (Diagram; labels in order: Search facade, Event, Maker + DB, Search, Update, Query, Replication, Query, Services, Get, Entity cache, Presentation)
  • 9. Architecture: maker
      • Collects notifications about changed entities
      • Uses Cassandra to store additional entity data
      • Responsible for domain index writing
      • Controls index replication to query servers
  • 10. Architecture: query
      • Many servers in different hardware configurations
      • Unified application
      • For a quick start, indexes are stored on disk
      • Queries are executed in heap memory
          – IndexReader rewritten to eliminate unnecessary operations
          – Own stored field retrieval method:
              • No garbage
              • Accessing values without actual deserialization
  • 11. Architecture: search facade
      • Creates & manages personal indexes
      • Schedules query execution
      • Reduces query results to search results
      • Loads data for result rendering
  • 12. Problems & solutions
  • 13. Problem: spelling vs performance
      • Most of the content is in Russian:
          – Proper Russian
          – Common misspellings
          – Misspellings made by people who try to write in Russian
          – Russian words written in Latin (Translit & Crazy Russian)
          – Wrong keyboard layout
      • A few examples, with common misspellings omitted:
          – машина = мышына, масына, mashina, moshina
          – Кашин = kashin, кашен, ka6in
          – Kosheen = кошин, cosheen, koshin
  • 14. Solution: spelling vs performance
      • Reduce the number of terms using phonetics:
          MOSHINO = машина, мышына, масына, mashina, moshina
      • The query is expanded with a few phonetic keys:
          – Common misspellings
          – Synonyms we know
      • Distinguish the exact writing using a 1-byte hash code per term
          – If possible, perform the hash check only for top documents
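The phonetic-key idea on slide 14 can be sketched as follows. This is a toy illustration, not the rules used at Odnoklassniki: the transliteration table, the vowel-collapsing rule, the function names, and the use of CRC32 for the 1-byte hash are all assumptions made for the example, and the key it produces (MASANA) differs from the MOSHINO key shown on the slide.

```python
import zlib

# Hypothetical, deliberately crude rules; the talk does not publish the real ones.
CYR_TO_LAT = {"а": "a", "е": "e", "и": "i", "о": "o", "у": "u", "ы": "y",
              "к": "k", "м": "m", "н": "n", "с": "s", "ш": "sh"}
VOWELS = set("aeiouy")

def phonetic_key(word):
    """Map different writings of a word to one coarse key."""
    latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in word.lower())
    latin = latin.replace("sh", "s")           # treat "sh" and "s" as one sound
    key = []
    for ch in latin:
        code = "A" if ch in VOWELS else ch.upper()   # all vowels collapse to "A"
        if not key or key[-1] != code:               # drop repeated codes
            key.append(code)
    return "".join(key)

def spelling_hash(word):
    """1-byte hash of the exact writing (low byte of CRC32, an assumption)."""
    return zlib.crc32(word.lower().encode("utf-8")) & 0xFF

def build_index(docs):
    """docs: {doc_id: word}. Postings are stored per phonetic key."""
    index = {}
    for doc_id, word in docs.items():
        index.setdefault(phonetic_key(word), []).append((doc_id, spelling_hash(word)))
    return index

def search(index, query):
    """All phonetic matches; documents with the query's exact-writing hash come first."""
    wanted = spelling_hash(query)
    postings = index.get(phonetic_key(query), [])
    ranked = sorted(postings, key=lambda p: (p[1] != wanted, p[0]))
    return [doc_id for doc_id, _ in ranked]
```

With this sketch the index holds one term per phonetic key instead of one per writing, and the stored byte is only compared when ranking candidates that already matched the key.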
  • 15. Problem: personal index availability
      • Queries take 5 to 100 ms
      • Personal index composition takes 50 to 300 ms
      (Diagram: two Cache servers and three Service groups, each marked *2)
      • Network load on cache servers quickly hit 700 Mb/s
      • Meanwhile, there was no CPU load on cache servers
  • 16. Solution: personal index availability
      (Diagram: Service 0-19, Service 20-39, Service 40-59, Service 60-79, Service 80-99; example value 37)
      • Bind users to specific servers
      • Store personal indexes locally (in off-heap memory)
      • Determine the substitution order
      • Whole network load is under 100 Mb/s
      • Even CPU load on all servers
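Binding users to servers with a fixed substitution order, as on slide 16, might look like the sketch below. The server names, the choice of `user_id % 100` as the partition number, and the "next server in the ring" substitution rule are assumptions for illustration; the slides only show five services covering ranges 0-19 to 80-99.

```python
# Hypothetical server names mirroring the ranges shown on the slide.
SERVERS = ["service-0-19", "service-20-39", "service-40-59",
           "service-60-79", "service-80-99"]
PARTITIONS = 100

def substitution_order(user_id):
    """Primary server first, then the fallbacks in a fixed, deterministic order."""
    partition = user_id % PARTITIONS                 # assumption: modulo partitioning
    primary = partition * len(SERVERS) // PARTITIONS
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(len(SERVERS))]

def pick_server(user_id, alive):
    """Return the first server in the user's substitution order that is alive."""
    for server in substitution_order(user_id):
        if server in alive:
            return server
    raise RuntimeError("no search facade server available")
```

Because every caller computes the same order, a user's personal index is built on one server in the normal case and on one well-known substitute when that server is down, which is what keeps the index local instead of pulling it over the network.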
  • 17. Problem: gender and country filters
      • Usually an index is split into shards until the average query time meets some bounds
          – This solves the response time problem
          – All possible documents are still checked
      • There are 2 filters which make user queries slow:
          – Gender
          – One very popular country
  • 18. Solution: gender and country filters
      • Remove these condition checks (saves 17% CPU)
      • Exclude documents which could not match these filters (saves another 12% CPU)
      (Diagram: index partitions Russian males, Russian females, Other males, Other females)
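The partitioning on slide 18 can be sketched as routing each user into one of four sub-indexes and selecting partitions from the filter values, so that the gender and popular-country conditions are never evaluated per document. The field names, the `"RU"`/`"m"`/`"f"` codes, and the substring name match are assumptions made for this example.

```python
from collections import defaultdict

COUNTRIES = ("RU", "other")
GENDERS = ("m", "f")

def partition_of(user):
    """Partition key: (popular country or 'other', gender)."""
    return ("RU" if user["country"] == "RU" else "other", user["gender"])

def build_partitions(users):
    parts = defaultdict(list)
    for user in users:
        parts[partition_of(user)].append(user)
    return parts

def partitions_for(gender=None, country=None):
    """Partitions that can contain matches; the others are skipped entirely."""
    if country is None:
        countries = COUNTRIES
    else:
        countries = ("RU",) if country == "RU" else ("other",)
    genders = GENDERS if gender is None else (gender,)
    return [(c, g) for c in countries for g in genders]

def search(parts, name, gender=None, country=None):
    hits = []
    for key in partitions_for(gender, country):
        for user in parts.get(key, []):
            # Only a specific non-RU country still needs a per-document check.
            if country not in (None, "RU") and user["country"] != country:
                continue
            if name in user["name"]:
                hits.append(user["id"])
    return sorted(hits)
```

A query for Russian males touches one partition out of four and performs no gender or country comparison inside it.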
  • 19. Problem: users online search
      • People want to quickly find a person they can talk to
      • At any given moment, only a small fraction of users are online
      • Standard solution: filter general search results down to online users
          + easy to implement
          + reliable
          – slow, especially for queries over random users
          – wastes CPU
  • 20. Solution: users online search
      • Create a separate index with online users only:
          + works quickly
          + no tricks required
          – more than 200,000 changes per minute
          – correct results depend on index maker availability
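A separate online-only index, as on slide 20, trades query-time filtering for a stream of small updates. The sketch below keeps such an index in a dictionary; the class name, the login/logout event methods, and prefix matching on the name are assumptions for illustration and stand in for a real Lucene index fed by the index maker.

```python
class OnlineIndex:
    """Holds only users who are currently online."""

    def __init__(self):
        self._names = {}          # user_id -> lowercased name

    def on_login(self, user_id, name):
        self._names[user_id] = name.lower()

    def on_logout(self, user_id):
        # Missing users are ignored so that duplicate events are harmless.
        self._names.pop(user_id, None)

    def search(self, prefix):
        prefix = prefix.lower()
        return sorted(uid for uid, name in self._names.items()
                      if name.startswith(prefix))
```

Every hit is online by construction, so no per-result check is needed, but results are only as fresh as the last event applied, which is the dependency on index maker availability noted on the slide.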
  • 21. Problem: user search inside group
      • This kind of search is in demand from group owners
      • Some numbers:
          – 200 million users in 16 shards
          – 7 million groups in 8 shards
          – Each group has from 1 to several million users
          – Number of group-to-user connections: billions
      • “Dummy solutions” were not checked
  • 22. Problem: user search inside group
      (Diagram; labels: Groups, Users, Search façade, Heap memory, Off-heap memory, Portal services, Small groups)
      • We use the mechanics from personal indexes
      • Currently indexed groups are updated with changes
      • Small group indexes are dropped after 1 hour
      • Big group indexes are kept until the application restarts
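The retention rules on slide 22 can be sketched as a cache that builds a group index on first use, expires small groups after an hour, and pins big groups. The size threshold, the class and parameter names, and the choice to count the hour from the last access (the slide does not say whether it is last access or creation) are assumptions.

```python
import time

class GroupIndexCache:
    """Builds per-group indexes on demand and applies the two retention rules."""

    def __init__(self, build_index, big_group_size=100_000,
                 small_ttl=3600.0, clock=time.monotonic):
        self._build = build_index          # callable: group_id -> index object
        self._big = big_group_size         # hypothetical threshold for a "big" group
        self._ttl = small_ttl
        self._clock = clock
        self._entries = {}                 # group_id -> [index, expires_at or None]

    def _evict_expired(self, now):
        expired = [g for g, (_, exp) in self._entries.items()
                   if exp is not None and exp <= now]
        for group_id in expired:
            del self._entries[group_id]

    def get(self, group_id, member_count):
        now = self._clock()
        self._evict_expired(now)
        entry = self._entries.get(group_id)
        if entry is None:
            entry = [self._build(group_id), None]
            self._entries[group_id] = entry
        # Big groups never expire; small groups live one TTL past the last access.
        entry[1] = None if member_count >= self._big else now + self._ttl
        return entry[0]
```

Pinning big groups avoids repeating an expensive build, while the one-hour expiry keeps memory from filling with small groups that were searched once.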
  • 23. More information
      • Aleksey Shevchuk, @AlekseyShevchuk, aleksey.shevchuk@odnoklassniki.ru, odnoklassniki.ru/mrSearch
      • Odnoklassniki.ru: http://v.ok.ru
      • Integration with Odnoklassniki.ru: http://connect.ok.ru
      • one-nio: slideshare.net/m0nstermind/presentations, github.com/odnoklassniki/one-nio
      • Cassandra: github.com/odnoklassniki/apache-cassandra
  • 24. Aleksey Shevchuk, @AlekseyShevchuk, aleksey.shevchuk@odnoklassniki.ru, odnoklassniki.ru/mrSearch