Search@Hyves
Published in: Technology

  • SPH_SORT_EXPR,"@weight + relevance*100"
  • Give an example of a recent job failure: Jenkins ran out of disk space, and this was also replicated to the secondary Jenkins machine; however, it worked
  • Transcript

    • 1. Search@Hyves
      • Anuj Ahuja | anuj@hyves.nl | anujahuja.hyves.nl | #anujmca
      female single amsterdam 20
    • 2. Old Search
        • MySQL Full text indexes
        • Hash Match - Combinations of the searchterms are stored
      • Limitations
        • Indexing is very slow - takes ~5h to index
        • Fragile state management - coordinated by daemons
        • Scalability - not transparent to the application
        • No support for indexing data from distributed databases
    • 3. Scale@Hyves?
        • MySQL Master-Slave architectures
          • 40 Masters, 284 Slaves
        • Store
          • ~64 Clusters, ~256 MySQL Hosts
        • How big is the dataset [Jan 2011]?
          • ~400G of indexable data
          • Includes reactions, photo titles, WWW, etc.
    • 4. Search Architecture - Ideas
      • Function
        • Enable search for everything on Hyves
        • Apply social relevance/weight to content
        • Make new data available for search within an hour
      • Tech
        • Combine data from multiple data sources
        • Attribute-based filtering - for example geo location
        • Abstract state management from data import jobs 
        • Scaling should be transparent to application layer
    • 5. Search Architecture - Decisions 
        • Pure data jobs Vs Leveraging Hyves application stack (PHP)
        • Listeners Vs Iterator
        • Handling deletes - Realtime updates Vs Ignore on select
    • 6. Search Architecture - Technology 
        • Search backend - Sphinx
        • Data Importers - PHP and Hadoop Job
        • Pre-Indexing database - MySQL on temp fs
        • State Management - MySQL (InnoDB)
        • Job Orchestration – Jenkins
        • Deploy – Puppet, Hyves Deploy Script
        • Monitoring - Ganglia, Realtime stats, Google Analytics
    • 7. Search Architecture - SearchTube
    • 8. Stage1 - Data Importers
        • Support two architectures
          • Master-Slaves [Runs on slow slave to reduce network traffic]
          • Store [Job runs in hadoop cluster]
        • Data is imported in batches
        • Used for both Main and Delta indexes
        • Easy to add new jobs by implementing the following methods
             abstract protected function getIndexType();
             abstract protected function getIndexSubType();
             abstract protected function getPrimaryKeyField();
             abstract protected function getTabletName();
             abstract protected function getDatabaseConnection();
             abstract protected function getDataFromMasterSlave( $startObjectId , $endObjectId );
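The batch-import contract on this slide can be sketched as follows. This is a minimal illustration in Python rather than the PHP used at Hyves; the class names, the `batch_size` value, and the half-open ID range are all assumptions, since the slides only show the abstract method signatures.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Iterator


class DataImporter(ABC):
    """Hypothetical Python analogue of the importer base class on the slide."""

    batch_size = 1000  # assumed value; the slides do not state a batch size

    @abstractmethod
    def get_index_type(self) -> str:
        """Name of the index this job feeds (e.g. members, hubs)."""

    @abstractmethod
    def get_primary_key_field(self) -> str:
        """Column used to slice the source table into batches."""

    @abstractmethod
    def get_data(self, start_object_id: int, end_object_id: int) -> list[dict]:
        """Return rows with start_object_id <= key < end_object_id (assumed half-open)."""

    def batches(self, min_id: int, max_id: int) -> Iterator[list[dict]]:
        """Walk the ID space in fixed-size windows and yield non-empty batches."""
        for start in range(min_id, max_id + 1, self.batch_size):
            end = min(start + self.batch_size, max_id + 1)
            rows = self.get_data(start, end)
            if rows:
                yield rows


class InMemoryMemberImporter(DataImporter):
    """Toy importer reading from a list instead of a MySQL slave."""

    batch_size = 2

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def get_index_type(self) -> str:
        return "member"

    def get_primary_key_field(self) -> str:
        return "member_id"

    def get_data(self, start_object_id: int, end_object_id: int) -> list[dict]:
        key = self.get_primary_key_field()
        return [r for r in self._rows if start_object_id <= r[key] < end_object_id]


importer = InMemoryMemberImporter([{"member_id": i} for i in (1, 2, 3, 5)])
batch_ids = [[r["member_id"] for r in batch] for batch in importer.batches(1, 5)]
```

A new job in this sketch only supplies the three small methods; the batching loop lives in the base class, which is one way to read the slide's "easy to add new jobs" claim.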
    • 9. Stage 2 - Updaters
        • Enrich information from the other architecture
        • Written in PHP to leverage existing infrastructure
        • Examples
          • Geo Location Updater
          • Hub Alias Updater
          • City Name Updater
    • 10. Stage 3 - Sphinx Indexer
        • Builds Main and Delta indexes on Sphinx Slaves
        • Data is pulled from Pre-Index database
        • Each slave holds a subset of the data (% n slaves)
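The "% n slaves" split can be illustrated with a small Python sketch. It assumes the partition key is a numeric object ID and that the shard is simply the ID modulo the number of Sphinx slaves; the slides do not say which column is used, and the function names are hypothetical.

```python
def shard_for(object_id: int, n_slaves: int) -> int:
    """Return the index of the slave that would hold this object (assumed: id % n)."""
    return object_id % n_slaves


def partition(object_ids, n_slaves: int) -> list[list[int]]:
    """Group object IDs by the slave that would index them."""
    shards: list[list[int]] = [[] for _ in range(n_slaves)]
    for object_id in object_ids:
        shards[shard_for(object_id, n_slaves)].append(object_id)
    return shards


shards = partition(range(1, 8), n_slaves=3)
```

With a modulo split every slave indexes roughly the same number of rows, so a query has to be fanned out to all slaves and the results merged.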
    • 11. Sphinx?
        • Sphinx is a full-text search server written in C++
        • Easy Distribution
        • Attribute-based filtering
        • Supports querying multiple indexes
        • Ranking - (BM25 + Phrase Proximity) + Social Relevance
        • Utilizes multi-core machines via distributed indexes
        • Benchmarking results
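The slide notes give the sort expression as `SPH_SORT_EXPR` with `"@weight + relevance*100"`. The effect of that expression can be sketched in plain Python. Here `weight` stands in for Sphinx's text-match weight and `relevance` for a social-relevance attribute; how that attribute is computed is not stated in the slides, so the sample values below are made up for illustration.

```python
def sort_key(weight: int, relevance: int) -> int:
    """Mirror of the expression '@weight + relevance*100' from the slide notes."""
    return weight + relevance * 100


def rank(matches: list[dict]) -> list[dict]:
    """Order matches by the expression, highest first."""
    return sorted(
        matches,
        key=lambda m: sort_key(m["weight"], m["relevance"]),
        reverse=True,
    )


# Illustrative matches only; not real Hyves data.
matches = [
    {"id": "a", "weight": 1500, "relevance": 2},
    {"id": "b", "weight": 1600, "relevance": 0},
    {"id": "c", "weight": 1400, "relevance": 5},
]
ranked_ids = [m["id"] for m in rank(matches)]
```

In this toy data the match with the weakest text score ranks first because its relevance attribute is highest, which shows how the factor of 100 lets social relevance outweigh moderate differences in text weight.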
    • 12. Search Tube - Job Orchestration
        • Responsible for executing and synchronizing various jobs
        • Jenkins plugins
          • Join Plugin - Job synchronization
          • Plot plugin - Reporting
          • Dependency Graph View Plugin - Visualization
        • Other servers are added as labeled nodes
          • slow slaves, hadoop node, search slaves, etc.
        • Puppetized and Jenkins API
        • https://github.com/salimfadhley/jenkinsapi
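The synchronization that the Join plugin provides (start a downstream job only after all of its upstream jobs have finished) can be sketched with Python's standard `graphlib`. This is an illustration of the ordering only, not of Jenkins itself; the job names and the `run_pipeline` helper are hypothetical, chosen to mirror the three stages described in the deck.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph: each job maps to the set of jobs it must wait for.
PIPELINE = {
    "import_master_slave": set(),
    "import_store": set(),
    "updaters": {"import_master_slave", "import_store"},
    "sphinx_indexer": {"updaters"},
}


def run_pipeline(graph: dict, run_job) -> list[str]:
    """Run jobs in dependency order; a job starts only after all its upstreams are done."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    executed = []
    while sorter.is_active():
        # Every job returned here has no unfinished upstream job.
        for job in sorted(sorter.get_ready()):
            run_job(job)
            executed.append(job)
            sorter.done(job)
    return executed


order = run_pipeline(PIPELINE, run_job=lambda name: None)
```

In a real setup `run_job` would trigger the corresponding Jenkins job on the appropriately labeled node; here it is a no-op so the ordering logic can be read on its own.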
    • 13. Search Tube - Jenkins
    • 14. Search Tube - Reporting
    • 15. Search - Failover Scenarios
        • [Diagram: failure scenarios labeled Failed 1 through Failed 8]
    • 16. Search - What is new?
        • Simplified user interface - single search field
          • “ivo utrecht 26” [first name + city + age]
          • “amsterdam female 20” [city + gender + age]
          • “ram* van alte* ams” [partial search]
          • “milea marius” [last name + first name]
          • “coumans amst” [last name + city]
          • “hyves hq” [hub name]
        • Improved Ranking
          • Member results are influenced by number of friends
          • Hub results are influenced by number of hub members
        • Snappy search
          • Server side it takes ~20ms
          • Enabled search on every keystroke
        • Refining results
          • Results can be further refined by type, for example member, hub, etc.
        • New content is indexed every hour
    • 17. Search Result [December]
        • Page Views - 8,599,572
        • Ajax Search Queries - 28,742,425
        • Search Slaves (2x3 slaves, 2 search masters)
          • During peak hours: 120 searches/sec
          • Average query ~20ms
        • Google Analytics shows click-through and relevance
      * Only 1% of traffic is measured by Google Analytics
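The December figures can be put in perspective with a short back-of-the-envelope calculation in Python. It assumes the Ajax query count covers the full 31 days and that the 120 searches/sec peak refers to the same kind of query; neither is stated on the slide.

```python
AJAX_QUERIES_DECEMBER = 28_742_425   # from the slide
SECONDS_IN_DECEMBER = 31 * 24 * 60 * 60
PEAK_QPS = 120                       # from the slide
AVG_LATENCY_S = 0.020                # ~20ms, from the slide

# Average query rate over the month.
avg_qps = AJAX_QUERIES_DECEMBER / SECONDS_IN_DECEMBER

# How much busier the peak is than the monthly average.
peak_to_avg = PEAK_QPS / avg_qps

# Little's law: queries in flight = arrival rate * time each query takes.
in_flight_at_peak = PEAK_QPS * AVG_LATENCY_S

print(f"average: {avg_qps:.1f} queries/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
print(f"~{in_flight_at_peak:.1f} queries in flight at peak")
```

Under these assumptions the monthly average is roughly 11 queries/sec, so the quoted peak is about an order of magnitude above it, while only a few queries are in flight at any instant across the search slaves.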
    • 18. Questions? Anuj Ahuja | anuj@hyves.nl | anujahuja.hyves.nl | #anujmca