Scaling search & content filtering
Search optimization



Netlog => social network
 • meet / connect to new people => search essential
 • localized content => content filtering essential

Types of searches
Content filtering
Search filtering
Daily search statistics on Netlog
How to handle this



Problem 1:
Large number of requests
+ each request is kind of unique

Problem 2:
Content to search on is spread
 • different distributions (nl, en, fr, ..)
 • each with their own database hosts / isolations:
   videos, photos, ...
 • different shards as explained previously
Solution #1



 Add fulltext indexes to tables
 aggregate different data later on
 f.e. VIDEOS
   Full text index on title, tags, description,
   combine results at the end


 Problems
 • Large indexes
 • Not all indexes are effective
 • Locking of table => searches have an impact on
   other parts of the site
 • May work well for a small site, but otherwise => BAD
Solution #2



 Create separate tables with fulltext indexes, specifically
 for search queries
 f.e. VIDEOS
 • Table SEARCH_VIDEOS (videoid (int), searchvideo(text))
    Combine title, tags, description, .. in 1 mysql text field: “searchvideo”.
    Add a full text index on it. Combine results at the end.


 Problems
 • Duplication of data may cause inconsistencies
 • Not easy to rebuild (takes a very long time)
 • Peak moments: updates of changes + a lot of
   searches => table locks. (MyISAM)
Solution #3 ...almost there :)


Looking for non-MySQL-based alternatives
• Google
 • no control over results, or over what is being indexed and when it is
     being indexed.
• Yahoo BOSS
 • promising, a great step towards making search more open.
     Is rather new, so may suffer from bugs.
 • still relies on a third party for delivering your results,
     f.e. footnote on site: * BOSS offers developers unlimited daily queries, though Yahoo!
     reserves the right to limit unintended usage

• Lucene
 • Java based, an Apache Software Foundation project
 • Servers are not optimized for running Java / Tomcat +
     more custom coding is needed to build a PHP <-> Java bridge
• Sphinx
 • C++ based, more in-house expertise
 • fast results in test setup
Solution #3 ...sphinx!


How sphinx works:
• Full text search engine
   • two essential CLI tools:
      • indexer
         • creates indexes
      • searchd
         • daemon that serves indexes & handles search requests, delivers
           results in the form of documentids & attributes
         • uses a custom protocol for retrieving results => need a sphinx API
           in PHP, Java, .. to talk to this daemon (use the search CLI tool for debugging)
• Some sphinx terminology
   • sphinx.conf: the basic config file, with two essential parts: sources &
     indexes
   • documentid: id that uniquely identifies a document in the sphinx
     search index (must be unique!)
   • attribute: each documentid can have additional attributes, these can
Indexing (1)



 • Indexing
    • We need to index a data source (SQL database, text files, HTML
      files, ..). Defining this in sphinx.conf can be as easy as
      source users
      {
         type = mysql
         sql_host = localhost
         sql_db = localdb
         sql_user = jayme
         sql_pass = *******
         sql_port = 3306
         sql_query = SELECT id, firstname, lastname, counter_photos FROM USERS
         sql_attr_uint = counter_photos
      }


    • We define counter_photos as an attribute, because we want to sort/
      filter on it later on.
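
To illustrate why counter_photos is declared as an attribute, here is a minimal Python sketch of filtering and sorting matches on an attribute value. It is a toy model and not the sphinx API: the function name filter_and_sort, the match dictionaries and the sample values are assumptions made for illustration.

```python
def filter_and_sort(matches, attribute, minimum):
    """Keep matches whose attribute is at least `minimum`, highest value first.

    `matches` is a list of dicts, each holding a document id plus its
    attributes (hypothetical shape, chosen for this sketch only).
    """
    kept = [match for match in matches if match[attribute] >= minimum]
    return sorted(kept, key=lambda match: match[attribute], reverse=True)


# Hypothetical matches for a people search, with counter_photos stored per document.
matches = [
    {"id": 1, "counter_photos": 3},
    {"id": 2, "counter_photos": 40},
    {"id": 3, "counter_photos": 12},
]
popular = filter_and_sort(matches, "counter_photos", 10)
```

Because the attribute is stored next to the document ID, this kind of filtering and sorting needs no extra lookup in the source database.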
Indexing (2)



1. Define the index in the config, which searchd will
   serve. An index can have more than 1 source.
       index users
       {
          docinfo = extern
          source  = users
          path    = /var/lib/sphinx/data/users
       }

2. When running the indexer, sphinx splits up each
   document (SQL record in our case) into several words
   internally:
     a. creates a dictionary of all of these words. (WordIDs)
     b. keeps references to documentIDs for each WordID
     c. stores attributes with references to documentIDs
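
Steps a to c can be sketched in a few lines of Python. This is a simplified model of an inverted index, not sphinx's actual on-disk format; the names build_index and search and the sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Build a toy index from {document_id: (text, attributes)}."""
    word_ids = {}                # a. dictionary: word -> WordID
    postings = defaultdict(set)  # b. WordID -> set of documentIDs
    attributes = {}              # c. documentID -> attributes
    for doc_id, (text, attrs) in documents.items():
        for word in text.lower().split():
            # Assign the next free WordID the first time a word is seen.
            word_id = word_ids.setdefault(word, len(word_ids) + 1)
            postings[word_id].add(doc_id)
        attributes[doc_id] = attrs
    return word_ids, postings, attributes


def search(index, word):
    """Return (documentID, attributes) pairs for documents containing `word`."""
    word_ids, postings, attributes = index
    word_id = word_ids.get(word.lower())
    if word_id is None:
        return []
    return [(doc_id, attributes[doc_id]) for doc_id in sorted(postings[word_id])]


# Hypothetical rows, shaped like the USERS query from the source definition.
index = build_index({
    1: ("Jayme Smith", {"counter_photos": 12}),
    2: ("Jurriaan Smith", {"counter_photos": 3}),
})
hits = search(index, "smith")
```

A search only touches the postings for the requested word, which is why results come back as document IDs and attributes rather than full records.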
Indexing (3) & searching



• Indexing
  ./indexer -c ../etc/sphinx.conf users
  or, when searchd is running:
  ./indexer -c ../etc/sphinx.conf users --rotate
• Searching
  using the PHP API:
Sphinx netlog setup



We use a main+delta scheme
main:
For each search type (people, video, photo, ..) we have a main index that is
rebuilt every night. It takes about 20 minutes to rebuild the largest table
that we have.

delta:
Changes to videos, photos, .. are tracked in a table,
f.e. SPHINX_PHOTO_UPDATE, with 1 column: the ID of the photo.
Every half hour, sphinx regenerates a delta index based on this table. The table
is truncated once a day.

When searching we use 2 indexes: $cl->Query('test', 'users users_delta')
When the same document is found in both indexes, sphinx uses the match from
the last index listed, so newer content from the delta is what gets returned.
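
The effect of listing the delta index last can be modelled with a small Python sketch. This is a toy illustration under stated assumptions, not how searchd is implemented: the function search_indexes and the sample data are hypothetical, and each index is reduced to a mapping of document ID to text.

```python
def search_indexes(query, indexes):
    """Search each index in order; for a duplicate document ID, the later index wins."""
    found = {}
    term = query.lower()
    for index in indexes:
        for doc_id, text in index.items():
            if term in text.lower().split():
                # A later index overwrites an earlier match for the same ID.
                found[doc_id] = text
    return found


# Hypothetical contents: document 2 changed after the nightly main rebuild.
users = {1: "test user", 2: "old test"}
users_delta = {2: "new test profile", 3: "test newcomer"}
result = search_indexes("test", [users, users_delta])
```

Document 2 comes back with its delta text, and document 3, which only exists in the delta, is found as well.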
Future developments on Netlog with sphinx



 Indexing of shards (messages / friendships)
 • Running an indexer on each shard
 • Creating a main index for x shards
   (merge these shards into 1)
 • Running distributed searches on these indexes
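
A distributed search has to combine per-shard results into one ranked list. The Python sketch below shows one way to do that merge; it assumes each shard already returns its hits sorted by descending weight, and the function name merge_shard_results and the sample hits are hypothetical rather than part of sphinx.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, limit):
    """Merge per-shard hit lists into the overall top `limit` hits.

    Each hit is a (weight, document_id) tuple and every shard list must
    already be sorted by descending weight.
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical hits from two shards.
shard_a = [(9, 101), (4, 102)]
shard_b = [(7, 201), (1, 202)]
top_hits = merge_shard_results([shard_a, shard_b], 3)
```

Only the first `limit` hits are pulled from the merged stream, so the full result sets never need to be sorted again.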

 Generation of tag clouds
 ./indexer -c ../etc/sphinx.conf users --buildstops test.txt 100
 => sphinx has an option to generate the most used words in an index, which
 can be relevant for tags
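
The idea behind a tag cloud built from the most used words can be sketched in Python as follows. This only models the counting step and is not what the indexer does internally; the function top_words and the sample texts are assumptions made for illustration.

```python
from collections import Counter


def top_words(texts, limit):
    """Return the `limit` most frequent words as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts.most_common(limit)


# Hypothetical hobby fields taken from user profiles.
tags = top_words(["basket ball", "basket fan", "ball basket"], 2)
```

The counts can then drive the font size of each tag in the cloud.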
Some sphinx tips & tweaks



• Use range queries when indexing data
  Always try to have an autoincrement field on MySQL tables when
  indexing. Sphinx has a mechanism that indexes ranges of data,
  thus avoiding table locks.
  (where id > 1000 AND id < 2000 etc.. )
• Narrowest search first
  (f.e. when searching for users in Belgium with basketball as a hobby:
  @hobbies basket @country BE)
• Avoid searches on small words with OR (f.e. the|new|...)
• Define a charset table when indexing UTF-8,
  foreign languages
• Check that there are no trailing spaces after \ in your sphinx.conf
  when using multi-line queries; they can cause weird errors otherwise.
• Cache results!
• More info/ advanced usage on: sphinxsearch.com
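
The range query tip can be illustrated with a short Python sketch that splits an ID span into fixed-size batches, so that each SELECT only touches a limited number of rows. Sphinx does this through its own configuration; the helper names id_ranges and range_queries, the step size and the inclusive bounds used here are assumptions made for illustration.

```python
def id_ranges(min_id, max_id, step):
    """Yield inclusive (start, end) ID ranges that cover min_id..max_id."""
    if step <= 0:
        raise ValueError("step must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        yield start, end
        start = end + 1


def range_queries(table, min_id, max_id, step):
    """Build one SELECT per ID range (table name is a hypothetical example)."""
    return [
        f"SELECT * FROM {table} WHERE id >= {start} AND id <= {end}"
        for start, end in id_ranges(min_id, max_id, step)
    ]


batches = list(id_ranges(1, 2500, 1000))
queries = range_queries("USERS", 1, 2500, 1000)
```

Each batch holds its locks only briefly, which is what keeps indexing from blocking the rest of the site.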
Questions?




 netlog.com/go/developer
jayme@netlog.com - jurriaan@netlog.com

Scaling / optimizing search on netlog
