Published in: Technology
  • SPH_SORT_EXPR,"@weight + relevance*100"
  • Give an example of a recent job failure: Jenkins ran out of disk space, and this was also replicated to the secondary Jenkins machine; however, it worked
  • Search@Hyves

    1. [email_address]
       • Anuj Ahuja | #anujmca
       female single amsterdam 20
    2. Old Search
       • MySQL full-text indexes
       • Hash match - combinations of the search terms are stored
       • Limitations
           • Indexing is very slow - takes ~5 h to index
           • Fragile state management - state is coordinated by daemons
           • Scalability - it is not transparent to the application
           • No support for indexing data from distributed databases
    3. Scale@Hyves?
       • MySQL master-slave architectures
           • 40 masters, 284 slaves
       • Store
           • ~64 clusters, ~256 MySQL hosts
       • How big is the dataset [Jan 2011]?
           • ~400 GB of indexable data
           • Includes reactions, photo titles, WWW, etc.
    4. Search Architecture - Ideas
       • Function
           • Enable search for everything on Hyves
           • Apply social relevance/weight to content
           • Make new data available for search within an hour
       • Tech
           • Combine data from multiple data sources
           • Attribute-based filtering - for example, geo location
           • Abstract state management from data import jobs
           • Scaling should be transparent to the application layer
    5. Search Architecture - Decisions
       • Pure data jobs vs. leveraging the Hyves application stack (PHP)
       • Listeners vs. iterator
       • Handling deletes - real-time updates vs. ignore on select
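The "ignore on select" option for deletes can be pictured as a filter applied to search hits at query time, instead of updating the index the moment something is deleted. The sketch below is only an illustration of that idea: the function name `ignore_deleted` and the in-memory set of deleted IDs are assumptions, not something the slides describe, and the deck does not say which of the two options was chosen.

```python
def ignore_deleted(result_ids, deleted_ids):
    """Drop hits whose IDs are known to be deleted, keeping the rank order.

    Hypothetical helper: the index itself is left untouched and stale
    entries are simply skipped when results are read.
    """
    deleted = set(deleted_ids)
    return [doc_id for doc_id in result_ids if doc_id not in deleted]


# Ranked hits as they might come back from the search backend.
hits = [101, 102, 103, 104]
visible = ignore_deleted(hits, deleted_ids={102, 104})
print(visible)
```

The trade-off is that deleted items stay in the index until the next rebuild, in exchange for not having to push every delete to the search slaves.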
    6. Search Architecture - Technology
       • Search backend - Sphinx
       • Data importers - PHP and Hadoop job
       • Pre-indexing database - MySQL on temp fs
       • State management - MySQL (InnoDB)
       • Job orchestration - Jenkins
       • Deploy - Puppet, Hyves deploy script
       • Monitoring - Ganglia, real-time stats, Google Analytics
    7. Search Architecture - SearchTube
    8. Stage 1 - Data Importers
       • Support two architectures
           • Master-Slaves [runs on a slow slave to reduce network traffic]
           • Store [job runs in the Hadoop cluster]
       • Data is imported in batches
       • Used for both Main and Delta indexes
       • Easy to add new jobs by implementing the following methods
           abstract protected function getIndexType();
           abstract protected function getIndexSubType();
           abstract protected function getPrimaryKeyField();
           abstract protected function getTabletName();
           abstract protected function getDatabaseConnection();
           abstract protected function getDataFromMasterSlave( $startObjectId , $endObjectId );
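To make the "implement a few methods, get batching for free" idea concrete, here is a minimal sketch in Python rather than the PHP used at Hyves. The method names loosely mirror some of the abstract methods listed above; the `run` loop, the `batch_size` value, and the in-memory `InMemoryMemberImporter` are assumptions made for illustration and are not taken from the slides.

```python
from abc import ABC, abstractmethod


class DataImporter(ABC):
    """Hypothetical base class: subclasses describe the data, run() batches it."""

    batch_size = 1000  # assumed value; the slides do not give one

    @abstractmethod
    def get_index_type(self):
        """Name of the index this importer feeds, e.g. 'member'."""

    @abstractmethod
    def get_primary_key_field(self):
        """Column used to walk the table in ID ranges."""

    @abstractmethod
    def get_data(self, start_object_id, end_object_id):
        """Return rows with start_object_id <= id < end_object_id."""

    def run(self, first_id, last_id):
        """Yield one list of rows per batch, covering first_id..last_id inclusive."""
        start = first_id
        while start <= last_id:
            end = min(start + self.batch_size, last_id + 1)
            yield self.get_data(start, end)
            start = end


class InMemoryMemberImporter(DataImporter):
    """Toy importer that reads from a list instead of a MySQL slave."""

    batch_size = 3

    def __init__(self, rows):
        self.rows = rows

    def get_index_type(self):
        return "member"

    def get_primary_key_field(self):
        return "id"

    def get_data(self, start_object_id, end_object_id):
        key = self.get_primary_key_field()
        return [r for r in self.rows if start_object_id <= r[key] < end_object_id]


rows = [{"id": i, "name": f"member{i}"} for i in range(1, 8)]
batches = list(InMemoryMemberImporter(rows).run(first_id=1, last_id=7))
print([len(batch) for batch in batches])
```

Under this sketch, a delta import could reuse the same loop by starting `run` just past the last ID covered by the main index; whether the Hyves importers worked that way is not stated in the deck.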
    9. Stage 2 - Updaters
       • Enrich information from the other architecture
       • Written in PHP to leverage existing infrastructure
       • Examples
           • Geo Location Updater
           • Hub Alias Updater
           • City Name Updater
    10. Stage 3 - Sphinx Indexer
       • Builds Main and Delta indexes on Sphinx slaves
       • Data is pulled from the pre-index database
       • Each slave has a subset of the data (% n slaves)
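The "% n slaves" note suggests that documents are assigned to search slaves by taking an ID modulo the number of slaves. A small sketch of that partitioning follows; the function names and the choice of the document ID as the sharding key are assumptions, since the slide only gives the modulo hint.

```python
def slave_for(doc_id, n_slaves):
    """Index of the slave that would hold this document (ID modulo slave count)."""
    return doc_id % n_slaves


def partition(doc_ids, n_slaves):
    """Group document IDs into one list per slave."""
    shards = [[] for _ in range(n_slaves)]
    for doc_id in doc_ids:
        shards[slave_for(doc_id, n_slaves)].append(doc_id)
    return shards


shards = partition(range(1, 11), n_slaves=3)
for index, shard in enumerate(shards):
    print(f"slave {index}: {shard}")
```

With roughly sequential IDs this spreads documents evenly, but changing the number of slaves moves most documents to a different shard, so each slave would need a full rebuild.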
    11. Sphinx?
       • Sphinx is a full-text search server written in C++
       • Easy distribution
       • Attribute-based filtering
       • Supports querying multiple indexes
       • Ranking - (BM25 + phrase proximity) + social relevance
       • Utilizes multi-core machines through distributed indexes
       • Benchmarking results
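The slide notes give the sort expression `SPH_SORT_EXPR,"@weight + relevance*100"`, where `@weight` is the full-text match weight computed by Sphinx. The sketch below reproduces that arithmetic outside Sphinx to show how a social-relevance value can outweigh a small difference in text weight. It assumes `relevance` is a per-document attribute holding the social relevance; the sample hits and their numbers are invented for illustration.

```python
def sort_score(weight, relevance):
    """Same arithmetic as the expression '@weight + relevance*100'."""
    return weight + relevance * 100


# Invented sample hits: 'weight' stands in for Sphinx's @weight.
hits = [
    {"id": 1, "weight": 1500, "relevance": 2},
    {"id": 2, "weight": 1400, "relevance": 5},
    {"id": 3, "weight": 1600, "relevance": 0},
]

ranked = sorted(
    hits,
    key=lambda hit: sort_score(hit["weight"], hit["relevance"]),
    reverse=True,  # highest score first
)
print([hit["id"] for hit in ranked])
```

Here the hit with the lowest text weight ranks first because its relevance term adds 500, which is the kind of effect "social relevance" is meant to have on ordering.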
    12. Search Tube - Job Orchestration
       • Responsible for executing and synchronizing various jobs
       • Jenkins plugins
           • Join Plugin - job synchronization
           • Plot Plugin - reporting
           • Dependency Graph View Plugin - visualization
       • Other servers are added as labeled nodes
           • slow slaves, Hadoop node, search slaves, etc.
       • Puppetized and Jenkins API
    13. Search Tube - Jenkins
    14. Search Tube - Reporting
    15. Search - Failover Scenarios
       [Diagram: failure points labeled Failed 1 through Failed 8]
    16. Search - What is new?
       • Simplified user interface - a single search field
           • “ivo utrecht 26” [first name + city + age]
           • “amsterdam female 20” [city + gender + age]
           • “ram* van alte* ams” [partial search]
           • “milea marius” [last name + first name]
           • “coumans amst” [last name + city]
           • “hyves hq” [hub name]
       • Improved ranking
           • Member results are influenced by number of friends
           • Hub results are influenced by number of hub members
       • Snappy search
           • Server side it takes ~20 ms
           • Enabled search on every keystroke
       • Refining results
           • Results can be further refined by type, for example member, hub, etc.
       • New content is indexed every hour
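One way a single search field could serve queries like “amsterdam female 20” is to split the input into free-text terms and attribute filters before querying the backend. The sketch below is a guess at such a step, not the Hyves implementation: the `GENDERS` vocabulary, the rule that a bare number means an age, and the function name `parse_query` are all assumptions.

```python
GENDERS = {"female", "male"}  # hypothetical vocabulary


def parse_query(query):
    """Split a query into full-text terms and attribute filters.

    Assumed rules: a bare number is an age, a gender word is a gender
    filter, everything else (names, cities, wildcards) stays full text.
    """
    terms = []
    filters = {}
    for token in query.lower().split():
        if token.isdigit():
            filters["age"] = int(token)
        elif token in GENDERS:
            filters["gender"] = token
        else:
            terms.append(token)
    return terms, filters


print(parse_query("amsterdam female 20"))
print(parse_query("ram* van alte* ams"))
```

Tokens with a trailing `*` pass through untouched here, so partial matching would be left to the search backend.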
    17. Search Result [December]
       • Page views - 8,599,572
       • Ajax search queries - 28,742,425
       • Search slaves (2x3 slaves, 2 search masters)
           • During peak hours, 120 searches/sec
           • Average query ~20 ms
       • Google Analytics shows click-through and relevance
       * Only 1% of traffic is measured by Google Analytics
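As a rough sanity check on these figures, the monthly Ajax query count can be converted to an average rate and compared with the quoted peak. This assumes the count covers all 31 days of December and takes the numbers at face value; the slide does not say how they were measured.

```python
AJAX_QUERIES = 28_742_425                # Ajax search queries in December (from the slide)
SECONDS_IN_DECEMBER = 31 * 24 * 60 * 60  # assumes the full month is covered
PEAK_QPS = 120                           # searches/sec during peak hours (from the slide)

avg_qps = AJAX_QUERIES / SECONDS_IN_DECEMBER
peak_to_avg = PEAK_QPS / avg_qps

print(f"average: {avg_qps:.1f} queries/sec")
print(f"peak is {peak_to_avg:.1f}x the average")
```

Under these assumptions the average works out to a little under 11 queries per second, so the peak is roughly eleven times the average load.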
    18. Questions? Anuj Ahuja | #anujmca