Search@Hyves
  • SPH_SORT_EXPR, "@weight + relevance*100"
  • Give an example of a recent job failure: Jenkins ran out of disk space, and this was also replicated to the secondary Jenkins machine; however, it worked

Search@Hyves Presentation Transcript

  • Search@Hyves
    • Anuj Ahuja | anuj@hyves.nl | anujahuja.hyves.nl | #anujmca
    female single amsterdam 20
  • Old Search
      • MySQL full-text indexes
      • Hash match - combinations of the search terms are stored
    • Limitations
      • Indexing is very slow - takes ~5h to index
      • Fragile state management - coordinated by daemons
      • Scalability - not transparent to the application
      • No support for indexing data from distributed databases
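
The "hash match" idea above can be pictured as precomputing one key per combination of a record's terms, so that a query becomes a single key lookup. The Python sketch below is an illustration only: the slides do not show the old Hyves implementation, and the function names, the MD5 hashing, and the sample records are all assumptions.

```python
import hashlib
from itertools import combinations


def combo_keys(terms):
    """Return one hash key per non-empty, order-independent combination of terms."""
    unique = sorted(set(t.lower() for t in terms))
    keys = set()
    for size in range(1, len(unique) + 1):
        for combo in combinations(unique, size):
            keys.add(hashlib.md5(" ".join(combo).encode("utf-8")).hexdigest())
    return keys


def build_index(records):
    """Map each combination key to the ids of records containing all those terms."""
    index = {}
    for record_id, terms in records.items():
        for key in combo_keys(terms):
            index.setdefault(key, set()).add(record_id)
    return index


def lookup(index, query):
    """Hash the whole query as one combination and look it up directly."""
    unique = sorted(set(t.lower() for t in query.split()))
    key = hashlib.md5(" ".join(unique).encode("utf-8")).hexdigest()
    return index.get(key, set())


# Hypothetical sample data, not taken from the slides.
records = {1: ["Ivo", "Utrecht", "26"], 2: ["Anna", "Utrecht", "20"]}
index = build_index(records)
```

A record with n distinct terms produces 2^n - 1 keys in this sketch, which is one plausible reason an index of this kind is slow to build.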
  • Scale@Hyves?
      • MySQL master-slave architectures
        • 40 masters, 284 slaves
      • Store
        • ~64 clusters, ~256 MySQL hosts
      • How big is the dataset [Jan 2011]?
        • ~400 GB of indexable data
        • Includes reactions, photo titles, WWW, etc.
  • Search Architecture - Ideas
    • Function
      • Enable search for everything on Hyves
      • Apply social relevance/weight to content
      • Make new data available for search within an hour
    • Tech
      • Combine data from multiple data sources
      • Attribute-based filtering - for example, geo location
      • Abstract state management from data import jobs 
      • Scaling should be transparent to application layer
  • Search Architecture - Decisions 
      • Pure data jobs vs. leveraging the Hyves application stack (PHP)
      • Listeners vs. iterator
      • Handling deletes - realtime updates vs. ignore on select
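
The "ignore on select" option for deletes can be sketched as a filter applied to search results at query time, instead of updating the index when an object is deleted. This is a minimal Python illustration under assumptions: the slides do not say which option was chosen or how it was implemented, and `filter_deleted` and its arguments are hypothetical names.

```python
def filter_deleted(result_ids, deleted_ids):
    """Drop ids deleted since the index was built, keeping the ranked order."""
    return [rid for rid in result_ids if rid not in deleted_ids]


# Hypothetical example: object 3 was deleted after the last index build.
ranked_results = [5, 3, 9]
deleted_since_build = {3}
visible_results = filter_deleted(ranked_results, deleted_since_build)
```

The trade-off in this sketch is that the index stays stale until the next build, while every query pays for the extra membership check.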
  • Search Architecture - Technology 
      • Search backend - Sphinx
      • Data Importers - PHP and Hadoop jobs
      • Pre-Indexing database - MySQL on temp fs
      • State Management - MySQL (InnoDB)
      • Job Orchestration – Jenkins
      • Deploy – Puppet, Hyves Deploy Script
      • Monitoring - Ganglia, Realtime stats, Google Analytics
  • Search Architecture - SearchTube
  • Stage 1 - Data Importers
      • Support two architectures
        • Master-Slaves [runs on a slow slave to reduce network traffic]
        • Store [job runs in the Hadoop cluster]
      • Data is imported in batches
      • Used for both Main and Delta indexes
      • Easy to add new jobs by implementing the following methods:
           abstract protected function getIndexType();
           abstract protected function getIndexSubType();
           abstract protected function getPrimaryKeyField();
           abstract protected function getTabletName();
           abstract protected function getDatabaseConnection();
           abstract protected function getDataFromMasterSlave( $startObjectId , $endObjectId );
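
The batch import pattern behind these abstract methods can be sketched as a base class that walks an id range in fixed-size steps and delegates the actual fetch to each job. The sketch below is a Python analogue of the PHP interface, not the Hyves code: the class names, the `run` loop, the batch size, and the in-memory data are assumptions, and only some of the methods are mirrored.

```python
from abc import ABC, abstractmethod


class BatchImporter(ABC):
    """Hypothetical Python analogue of the PHP importer base class."""

    batch_size = 1000  # assumed; the slides give no batch size

    @abstractmethod
    def get_index_type(self): ...

    @abstractmethod
    def get_primary_key_field(self): ...

    @abstractmethod
    def get_data(self, start_object_id, end_object_id):
        """Return rows with start_object_id <= primary key < end_object_id."""

    def run(self, first_id, last_id):
        """Walk the id range in fixed-size batches and yield every row."""
        start = first_id
        while start <= last_id:
            end = min(start + self.batch_size, last_id + 1)
            yield from self.get_data(start, end)
            start = end


class MemberImporter(BatchImporter):
    """Toy importer backed by an in-memory list instead of MySQL."""

    batch_size = 2

    def __init__(self, rows):
        self.rows = rows
        self.batches = []  # records the ranges requested, for inspection

    def get_index_type(self):
        return "member"

    def get_primary_key_field(self):
        return "memberid"

    def get_data(self, start_object_id, end_object_id):
        self.batches.append((start_object_id, end_object_id))
        key = self.get_primary_key_field()
        return [r for r in self.rows if start_object_id <= r[key] < end_object_id]


rows = [{"memberid": i, "name": f"m{i}"} for i in (1, 2, 3, 5)]
importer = MemberImporter(rows)
imported = list(importer.run(1, 5))
```

Adding a new job in this sketch means writing one small subclass, which is the property the slide highlights.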
  • Stage 2 - Updaters
      • Enrich information from the other architecture
      • Written in PHP to leverage existing infrastructure
      • Examples
        • Geo Location Updater
        • Hub Alias Updater
        • City Name Updater
  • Stage 3 - Sphinx Indexer
      • Builds Main and Delta indexes on Sphinx Slaves
      • Data is pulled from Pre-Index database
      • Each slave holds a subset of the data (modulo n slaves)
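
Splitting the data "modulo n slaves" can be sketched as assigning each object to the slave whose number equals the object id modulo the slave count. This is a minimal Python illustration; that the key is a numeric object id, and the names `shard_for` and `partition`, are assumptions not stated on the slides.

```python
def shard_for(object_id, n_slaves):
    """Return the index of the slave that should hold this object."""
    return object_id % n_slaves


def partition(object_ids, n_slaves):
    """Group object ids into one list per slave."""
    shards = [[] for _ in range(n_slaves)]
    for object_id in object_ids:
        shards[shard_for(object_id, n_slaves)].append(object_id)
    return shards


# Hypothetical example: seven objects spread over three slaves.
shards = partition(range(1, 8), 3)
```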
  • Sphinx?
      • Sphinx is a full-text search server written in C++
      • Easy distribution
      • Attribute-based filtering
      • Supports querying multiple indexes
      • Ranking - (BM25 + Phrase Proximity) + Social Relevance
      • Utilizes multi-core machines through distributed indexes
      • Benchmarking results
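
The ranking formula can be sketched with the sort expression "@weight + relevance*100" that accompanies the slides: the text-match weight computed by Sphinx is added to a per-document social relevance attribute scaled by 100, and results are ordered by that sum, highest first. The Python below only mimics that arithmetic outside Sphinx; the sample weights and relevance values are invented for illustration.

```python
def score(match):
    """Combine the text-match weight with the social relevance attribute."""
    return match["weight"] + match["relevance"] * 100


def rank(matches):
    """Order matches by combined score, highest first."""
    return sorted(matches, key=score, reverse=True)


# Hypothetical matches: weight stands in for Sphinx's @weight.
matches = [
    {"id": 1, "weight": 1500, "relevance": 2},
    {"id": 2, "weight": 1400, "relevance": 4},
    {"id": 3, "weight": 1600, "relevance": 0},
]
ranked_ids = [m["id"] for m in rank(matches)]
```

In this example the document with the weakest text match ranks first because its relevance attribute outweighs the difference in text weight.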
  • Search Tube - Job Orchestration
      • Responsible for executing and synchronizing various jobs
      • Jenkins plugins
        • Join Plugin - Job synchronization
        • Plot plugin - Reporting
        • Dependency Graph View Plugin - Visualization
      • Other servers are added as labeled nodes
        • slow slaves, Hadoop nodes, search slaves, etc.
      • Puppetized and Jenkins API
      • https://github.com/salimfadhley/jenkinsapi
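
Job synchronization of the kind described here can be sketched as a dependency graph in which a job starts only after all of its upstream jobs have finished. The Python sketch below uses the standard library `graphlib` module (Python 3.9+) rather than Jenkins itself, and the job names are hypothetical stand-ins for the importer, updater, and indexer stages.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies):
    """Group jobs into waves; a wave starts only after the previous one is done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: each job maps to the set of jobs it waits for.
dependencies = {
    "geo_updater": {"import_members", "import_hubs"},
    "city_updater": {"import_members", "import_hubs"},
    "sphinx_indexer": {"geo_updater", "city_updater"},
}
waves = execution_waves(dependencies)
```

Jobs inside one wave have no dependencies on each other, so they could run in parallel on different labeled nodes.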
  • Search Tube - Jenkins
  • Search Tube - Reporting
  • Search - Failover Scenarios [diagram with failure points labeled Failed 1 to Failed 8]
  • Search - What is new?
      • Simplified user interface - a single search field
        • “ivo utrecht 26” [first name + city + age]
        • “amsterdam female 20” [city + gender + age]
        • “ram* van alte* ams” [partial search]
        • “milea marius” [last name + first name]
        • “coumans amst” [last name + city]
        • “hyves hq” [hub name]
      • Improved ranking
        • Member results are influenced by the number of friends
        • Hub results are influenced by the number of hub members
      • Snappy search
        • Server side, a query takes ~20 ms
        • Enabled search on every keystroke
      • Refining results
        • Results can be further refined by type, for example member, hub, etc.
      • New content is indexed every hour
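
One way to picture how a single search field can serve queries such as "ivo utrecht 26" is to classify each token as an age, a gender, a city, a prefix, or a name. The slides do not say how Hyves interprets the query (Sphinx may simply match every token across all indexed fields), so the Python below is a hypothetical sketch with made-up lookup sets.

```python
KNOWN_CITIES = {"amsterdam", "utrecht"}  # assumed sample data
GENDERS = {"male", "female"}


def classify(query):
    """Split a free-text query into typed parts based on simple token rules."""
    parsed = {"age": None, "gender": None, "city": None, "names": [], "prefixes": []}
    for token in query.lower().split():
        if token.isdigit():
            parsed["age"] = int(token)
        elif token in GENDERS:
            parsed["gender"] = token
        elif token in KNOWN_CITIES:
            parsed["city"] = token
        elif token.endswith("*"):
            parsed["prefixes"].append(token.rstrip("*"))
        else:
            parsed["names"].append(token)
    return parsed


first = classify("ivo utrecht 26")
second = classify("amsterdam female 20")
third = classify("ram* van alte* ams")
```

In this sketch an abbreviated city such as "ams" falls through to the name list, so a real system would need prefix matching on cities as well.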
  • Search Result [December]
      • Page views - 8,599,572
      • Ajax search queries - 28,742,425
      • Search slaves (2x3 slaves, 2 search masters)
        • During peak hours: 120 searches/sec
        • Average query: ~20 ms
      • Google Analytics shows click-through and relevance
    * Only 1% of traffic is measured by Google Analytics
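
The December figures can be put in proportion with a little arithmetic. The Python below assumes that the Ajax query count covers the full 31 days of the month and that the peak of 120 searches/sec refers to the same kind of query; neither assumption is stated on the slide.

```python
SECONDS_IN_DECEMBER = 31 * 24 * 60 * 60  # 2,678,400 seconds

ajax_queries = 28_742_425
peak_qps = 120
average_query_seconds = 0.020

# Average query rate over the whole month.
average_qps = ajax_queries / SECONDS_IN_DECEMBER

# How much higher the peak is than the monthly average.
peak_to_average = peak_qps / average_qps

# Queries one strictly sequential worker could serve per second at ~20 ms each.
sequential_qps_per_worker = 1 / average_query_seconds
```

Under these assumptions the average comes to roughly 10.7 queries/sec, the peak is about 11 times the average, and a single sequential worker at ~20 ms per query would handle about 50 queries/sec.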
  • Questions? Anuj Ahuja | anuj@hyves.nl | anujahuja.hyves.nl | #anujmca