Scaling search &
content filtering
Search optimization



Netlog => social network
 • meet / connect to new people => search essential
 • localized content => content filtering essential

Types of searches
Content filtering
Search filtering
Daily search statistics on Netlog
How to handle this



Problem 1:
Large number of requests
+ each request is more or less unique

Problem 2:
Content to search on is spread out
• different distributions (nl, en, fr, ...)
 • each with its own database hosts / isolations:
   videos, photos, ...
• different shards, as explained previously
Solution #1



 Add fulltext indexes to the existing tables and
 aggregate the different data later on
 e.g. VIDEOS
   Fulltext index on title, tags, description;
   combine results at the end


 Problems
 • Large indexes
 • Not all indexes are effective
 • Table locking => searches have an impact on
   other things on the site
 • May work well for a small site, but otherwise => BAD
Solution #2



 Create separate tables with fulltext indexes specifically
 for search queries
 e.g. VIDEOS
 • Table SEARCH_VIDEOS (videoid (int), searchvideo (text))
    Combine title, tags, description, ... in 1 MySQL text field: "searchvideo".
    Add a fulltext index on it. Combine results at the end.


 Problems
 • Duplication of data may cause inconsistencies
 • Not easy to rebuild (takes a very long time)
 • Peak moments: updates of changes + a lot of
   searches => table locks (MyISAM)
Solution #3 ...almost there :)


Looking for non-MySQL-based alternatives
• Google
 • no control over results, or over what is being indexed and when it is
     being indexed
• Yahoo BOSS
 • promising, a great step towards making search more open.
     It is rather new, so it may suffer from bugs.
 •   you still rely on a third party for delivering your results,
     e.g. footnote on the site: * BOSS offers developers unlimited daily queries, though Yahoo!
     reserves the right to limit unintended usage

• Lucene
 • Java based, an Apache Software Foundation project
 • Our servers are not optimized for running Java / Tomcat +
     more custom coding is needed to build a PHP <-> Java bridge
• Sphinx
 • C++ based, more in-house expertise
 • fast results in test setup
Solution #3 ...sphinx!


How Sphinx works:
• Full-text search engine
   • two essential CLI tools:
   • indexer
      • creates indexes
   • searchd
      • daemon that serves indexes & handles search requests, delivers
          results in the form of document IDs & attributes
      •   uses a custom protocol for retrieving results => you need a Sphinx API
          in PHP, Java, ... to talk to this daemon (use the search command-line tool for debugging)
• Some Sphinx terminology
  • sphinx.conf: the basic config file, with two essential parts: sources &
      indexes
  •   documentid: ID that uniquely identifies a document in the Sphinx
      search index (must be unique!)
  •   attribute: each documentid can have additional attributes, these can
Indexing (1)



 • Indexing
    • We need to index a data source (SQL database, text files, HTML
      files, ...). Defining this in sphinx.conf can be as easy as
      source users
      {
         type = mysql
         sql_host = localhost
         sql_db = localdb
         sql_user = jayme
         sql_pass = *******
         sql_port = 3306
         sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS
         sql_attr_uint = counter_photos
      }


    • We define counter_photos as an attribute, because we want to sort/
      filter on it later on.
Indexing (2)



1. Define the index in the config, which searchd will
   serve. An index can have more than 1 source.
       index users
       {
          docinfo    = extern
          source     = users
          path       = /var/lib/sphinx/data/users
       }

2. When running the indexer, Sphinx splits up each
   document (SQL record in our case) into several words
    internally:
     a. creates a dictionary of all of these words (WordIDs)
     b. keeps references to documentIDs for each WordID
     c. stores attributes with references to documentIDs
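
The three internal steps above can be pictured with a toy inverted index. This is only a sketch in Python of the general idea, not Sphinx's actual data structures: the function names (build_index, search), the whitespace tokenizer, and the sample rows are assumptions made for illustration; the column names come from the users source shown earlier.

```python
from collections import defaultdict


def build_index(rows, text_fields, attr_fields):
    """Build a toy inverted index from rows shaped like the `users` source.

    rows: iterable of dicts, each with an "id" key (the document ID).
    Returns (dictionary, postings, attributes).
    """
    dictionary = {}              # a. word -> WordID
    postings = defaultdict(set)  # b. WordID -> set of document IDs
    attributes = {}              # c. document ID -> {attribute: value}

    for row in rows:
        doc_id = row["id"]
        for field in text_fields:
            # Naive tokenizer (assumption): lowercase and split on whitespace.
            for word in str(row[field]).lower().split():
                word_id = dictionary.setdefault(word, len(dictionary) + 1)
                postings[word_id].add(doc_id)
        attributes[doc_id] = {name: row[name] for name in attr_fields}
    return dictionary, postings, attributes


def search(word, dictionary, postings, attributes, min_photos=0):
    """Return sorted document IDs containing `word`, filtered on counter_photos."""
    word_id = dictionary.get(word.lower())
    if word_id is None:
        return []
    return sorted(
        doc_id
        for doc_id in postings[word_id]
        if attributes[doc_id]["counter_photos"] >= min_photos
    )


# Hypothetical sample data, not taken from the slides.
rows = [
    {"id": 1, "firstname": "Ann", "lastname": "Peeters", "counter_photos": 12},
    {"id": 2, "firstname": "Ann", "lastname": "Janssens", "counter_photos": 0},
    {"id": 3, "firstname": "Tom", "lastname": "Peeters", "counter_photos": 3},
]
dictionary, postings, attributes = build_index(
    rows, text_fields=["firstname", "lastname"], attr_fields=["counter_photos"]
)
```

Because counter_photos is kept as an attribute next to the document ID, a call such as search("ann", dictionary, postings, attributes, min_photos=1) can filter matches without going back to the database.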
Indexing (3) & searching



• Indexing
  ./indexer -c ../etc/sphinx.conf users or
  ./indexer -c ../etc/sphinx.conf users --rotate (when searchd is running)
• Searching
  using the PHP API:
Sphinx netlog setup



We use a main+delta scheme
main:
For each search type (people, video, photo, ...) we have a main index that is
rebuilt every night. It takes about 20 minutes to rebuild the index for the
largest table that we have.

delta:
Changes to videos, photos, ... are tracked in a table,
e.g. SPHINX_PHOTO_UPDATE, with 1 column: the ID of the photo.
Every half hour, Sphinx regenerates a delta index based on this table. The table
is truncated once a day.

When searching we use 2 indexes: $cl->Query('test', 'users users_delta')
When a document appears in both indexes, Sphinx prefers the match from the
index listed last, so newer content is found / returned when needed
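
The effect of querying main and delta together can be modelled in a few lines of Python. This is a toy model of the "later index wins" behaviour, not Sphinx code: merge_main_delta and the sample matches are hypothetical names and data chosen for illustration.

```python
def merge_main_delta(index_results):
    """Merge per-index matches; later indexes override earlier ones.

    index_results: list of {doc_id: match} dicts, in the order the indexes
    were named in the query (main first, delta last).
    """
    merged = {}
    for matches in index_results:
        merged.update(matches)  # same doc_id: the later index wins
    return merged


# Hypothetical matches: document 1 was edited after the nightly rebuild,
# document 3 was uploaded after it.
main_matches = {1: {"title": "old title"}, 2: {"title": "holiday"}}
delta_matches = {1: {"title": "new title"}, 3: {"title": "fresh upload"}}

merged = merge_main_delta([main_matches, delta_matches])
```

Listing the delta last is what makes the edited version of document 1 replace the stale copy from the main index, while unchanged documents still come from main.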
Future developments on Netlog with sphinx



 Indexing of shards (messages / friendships)
 • Running an indexer on each shard
 • Creating a main index for x shards
   (merge these shards into 1)
 • Running distributed searches on these indexes

 Generation of tag clouds
 ./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100 --buildfreqs
 => Sphinx has an option to generate the most used words in an index, which
 can be relevant for tags
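
What such a "most used words" list contains can be sketched in Python as follows. This only illustrates the counting idea behind a tag cloud; most_used_words, the whitespace tokenizer, and the sample texts are assumptions, and the real indexer works on the index's own tokenized data.

```python
from collections import Counter


def most_used_words(documents, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in documents:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        counts.update(text.lower().split())
    return counts.most_common(limit)


# Hypothetical photo titles, not taken from the slides.
documents = ["beach holiday spain", "holiday party", "beach party holiday"]
top_words = most_used_words(documents, 3)
```

The counts can then be mapped to font sizes when rendering the tag cloud.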
Some sphinx tips & tweaks



• Use range queries when indexing data
  always try to have an auto-increment field on MySQL tables when
  indexing. Sphinx has a mechanism that indexes ranges of data,
  thus avoiding table locks.
  (where id > 1000 AND id < 2000 etc.)
• Narrowest search first
  (e.g. when searching for users in Belgium with basketball as a hobby: @hobbies
  basket @country BE)
• Avoid searches on small words with OR (e.g. the|new|...)
• Define a charset table when indexing UTF-8 /
  foreign languages
• Check that there are no trailing spaces after \ in your sphinx.conf
   when using multi-line queries; otherwise this can cause weird errors.
• Cache results!
• More info/ advanced usage on: sphinxsearch.com
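
The range-query tip can be illustrated with a small Python sketch that splits an ID span into steps and builds one SELECT per step. In sphinx.conf the stepping is done by Sphinx itself through its ranged-query settings (sql_query_range, sql_range_step); the helper names id_ranges and ranged_queries, the step of 1000, and the ID bounds below are assumptions for illustration only.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) ID ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def ranged_queries(table, min_id, max_id, step):
    """Build one SELECT per range instead of one query over the whole table."""
    return [
        f"SELECT id, firstname, lastname, counter_photos FROM {table} "
        f"WHERE id >= {start} AND id <= {end}"
        for start, end in id_ranges(min_id, max_id, step)
    ]


# Hypothetical bounds; in practice they come from MIN(id) and MAX(id).
queries = ranged_queries("USERS", 1, 2500, 1000)
```

Each query touches only a bounded slice of the table, so no single long-running SELECT holds a lock on all rows while the index is being built.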
Questions?




 netlog.com/go/developer
jayme@netlog.com - jurriaan@netlog.com
