Building a Large Scale SEO/SEM 
Application with Apache Solr 
Rahul Jain 
Freelance Big-data/Search Consultant 
@rahuldausa 
dynamicrahul2020@gmail.com
About Me… 
• Freelance Big-data/Search Consultant based out of Hyderabad, India 
• Provide Consulting services and solutions for Solr, Elasticsearch and other Big 
data solutions (Apache Hadoop and Spark) 
• Organizer of two Meetup groups in Hyderabad 
• Hyderabad Apache Solr/Lucene 
• Big Data Hyderabad
What I am going to talk about 
Sharing our experience of working on Search in this application … 
• The issues we faced and the lessons we learned 
• How we do Database Import and Batch Indexing… 
• Techniques to Scale and improve Search latency 
• The System Architecture 
• Some tips for tuning Solr 
• Q/A
What does the Application do 
 Keyword Research and Competitor Analysis Tool for SEO (Search Engine Optimization) and SEM 
(Search Engine Marketing) Professionals 
 End users search for a keyword or a domain, and get all the insights about it. 
 Aggregate data for the top 50 results of Google and Bing across 3 countries for 80million+ keywords. 
 Provide key metrics like keywords, CPM (Cost per mille), CPC (Cost per click), competitor’s details etc. 
Data flow: Web crawling, Ad Network APIs and Databases → Data Collection → Data Processing & Aggregation 
*All trademarks and logos belong to their respective owners.
Technology Stack
High level Architecture 
Diagram: Internet → Load Balancer (HAProxy) → Web Server Farm (PHP app on Nginx) → App Server (Tomcat) → Search Head (Solr) → Apache Solr instances. 
Supporting components: Managed Cache / Cache Cluster (Redis), Database (MySQL), and a Cluster Manager (custom, built using Zookeeper). 
Search Head : 
• Is a Solr server which does not contain any data. 
• Makes a distributed search request and aggregates the search results. 
• Also works as a load balancer for search queries (see the request sketch below the diagram notes). 
Diagram notes: numbered arrows (1-8) trace a request through an IDs lookup in the cache and a fetch of the cluster mapping to find the right month’s cluster.
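To show how a search head fans a query out, here is a minimal SolrJ sketch of a distributed request. The host names, core names and field names are hypothetical assumptions, not the actual deployment.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchHeadClient {
    public static void main(String[] args) throws Exception {
        // The search head is an empty Solr core; it only coordinates the request.
        HttpSolrClient searchHead =
                new HttpSolrClient.Builder("http://searchhead-us:8983/solr/keywords_head").build();

        SolrQuery q = new SolrQuery("keyword_text:\"car insurance\"");
        // Data-bearing cores; the head fans out, merges and returns one result set.
        q.set("shards",
              "solr1:8983/solr/keywords_shard1,"
            + "solr2:8983/solr/keywords_shard2,"
            + "solr3:8983/solr/keywords_shard3");
        q.setRows(50);

        QueryResponse rsp = searchHead.query(q);
        System.out.println("Found " + rsp.getResults().getNumFound() + " docs");
        searchHead.close();
    }
}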
Search - Key challenges 
 After processing, we have ~40 billion records every month in the MySQL database, 
including 
 80+ Million Keywords 
 110+ Million Domains 
 1 billion+ URLs 
 Multiple keywords for a single URL, and vice-versa 
 Multiple tables, with sizes varying from 50 million to 12 billion records 
 Search is a core functionality, so all records (selected fields) must be indexed in Solr 
 Page load time (including all 24 widgets, max) < 1.5 sec (critical) 
 But… we need to load this data only once a month for all countries, so we can 
do Batch Indexing, and as this data never changes, we can apply caching.
Making Data Import and Batch Indexing Faster
Data Import from MySQL to Solr 
• Solr’s DataImportHandler is awesome but quickly becomes pretty slow for large volumes 
• We wrote our own Custom Data Importer that reads (pulls) documents from the Database and pushes (async) them into 
Solr. 
Data Importer (Custom) → Solr, Solr, Solr 
Source table: 
ID (Primary/Unique Key with Index) | Columns 
1 | Record 1 
2 | Record 2 
… | … 
5000 | Record 5000 
*6000 | Record 6000 (*non-sequential ID) 
… | … 
n | Record n 
The Importer combines these database batches into a bigger batch (10k documents) and 
flushes it to the selected Solr servers asynchronously, in a round-robin fashion. 
Rather than using the database’s “limit” function, it queries by a range of IDs 
(the Primary Key). The Importer fetches 10 batches at a time from the MySQL 
database, each having 2k records. Each call is stateless (see the importer 
sketch after this slide). 
“select * from table t where ID >= 1 and ID <= 2000” 
“select * from table t where ID >= 2001 and ID <= 4000” 
……… 
Downside: 
• We must have a primary key, and creating it in the database can be slow. 
• This approach requires more calls if the IDs are not sequential.
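A minimal sketch of this range-based, asynchronous import loop is shown below. The table name, column names, batch sizes, max ID and Solr URLs are illustrative assumptions, not the production importer.

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class RangeImporter {
    // Hypothetical Solr servers; documents are spread over them round-robin.
    static final List<String> SOLR_URLS = Arrays.asList(
            "http://solr1:8983/solr/keywords", "http://solr2:8983/solr/keywords");

    public static void main(String[] args) throws Exception {
        List<ConcurrentUpdateSolrClient> solrs = new ArrayList<>();
        for (String url : SOLR_URLS)
            solrs.add(new ConcurrentUpdateSolrClient.Builder(url).build());

        ExecutorService pool = Executors.newFixedThreadPool(10); // ~10 DB batches in flight
        final int BATCH = 2000;                                  // 2k rows per SELECT
        final long MAX_ID = 8_000_000L;                          // assumed max primary key

        for (long start = 1, i = 0; start <= MAX_ID; start += BATCH, i++) {
            final long lo = start, hi = start + BATCH - 1;
            final ConcurrentUpdateSolrClient target = solrs.get((int) (i % solrs.size()));
            pool.submit(() -> {
                // One connection per task keeps each range query stateless and thread-safe;
                // a pooled DataSource would normally replace DriverManager here.
                try (Connection con = DriverManager.getConnection("jdbc:mysql://db/seo", "user", "pass");
                     PreparedStatement ps = con.prepareStatement(
                         "select id, keyword, cpc from keywords where id >= ? and id <= ?")) {
                    ps.setLong(1, lo);
                    ps.setLong(2, hi);
                    ResultSet rs = ps.executeQuery();
                    List<SolrInputDocument> docs = new ArrayList<>();
                    while (rs.next()) {
                        SolrInputDocument d = new SolrInputDocument();
                        d.addField("id", rs.getLong("id"));
                        d.addField("keyword", rs.getString("keyword"));
                        d.addField("cpc", rs.getDouble("cpc"));
                        docs.add(d);
                    }
                    if (!docs.isEmpty()) target.add(docs); // queued and flushed asynchronously
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        for (ConcurrentUpdateSolrClient s : solrs) { s.blockUntilFinished(); s.close(); }
    }
}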
Batch Indexing
Indexing All tables into a Single Big Index 
• All tables in the same Index, distributed over multiple Solr cores 
and Solr servers (Java processes) 
• Commit after every 120 million records or every 15 
minutes, whichever is earlier (see the commit sketch below) 
• Disabled Soft-commit and updates (overwrite=false), as 
each call to addDocument calls updateDocument under 
the hood 
• But still… Indexing was slow (due to being sequential across all 
tables) and we had to stop it after 2 days. 
• Search was also awfully slow (on the order of minutes) 
(Screenshot: a bunch of shards, ~100; query times shown are from cache, after warm-up.)
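A rough sketch of the count-or-time based explicit commit, with per-document overwrite disabled, might look like the following. The threshold constants and the SolrJ client setup are assumptions, not the actual indexer code.

import java.util.concurrent.atomic.AtomicLong;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class CommitPolicy {
    private static final long DOC_THRESHOLD = 120_000_000L;        // commit every 120M docs
    private static final long TIME_THRESHOLD_MS = 15 * 60 * 1000L; // or every 15 minutes

    private final SolrClient solr;
    private final AtomicLong docsSinceCommit = new AtomicLong();
    private volatile long lastCommitMs = System.currentTimeMillis();

    public CommitPolicy(SolrClient solr) { this.solr = solr; }

    public void index(SolrInputDocument doc) throws Exception {
        UpdateRequest req = new UpdateRequest();
        // overwrite=false skips the uniqueKey lookup that a normal add would perform.
        req.add(doc, false);
        req.process(solr);

        long count = docsSinceCommit.incrementAndGet();
        long elapsed = System.currentTimeMillis() - lastCommitMs;
        if (count >= DOC_THRESHOLD || elapsed >= TIME_THRESHOLD_MS) {
            solr.commit(false, false);   // explicit hard commit, no soft commits in between
            docsSinceCommit.set(0);
            lastCommitMs = System.currentTimeMillis();
        }
    }
}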
Creating a Distributed Index for each table 
How many shards? 
• Each table has a varying number of records, from 50 million to 
12 billion. 
• If we choose 100 million per shard (core), then for 12 billion records we 
need to query 120 shards: awfully slow. 
• On the other side, if we choose 500 million/shard, a table with 500 million 
records will have only 1 shard: high latency, high memory usage 
(Heap) and no distributed search*. 
• Hybrid Approach: determine the number of shards based on the max 
number of records in the table (see the table and sketch below). 
• Did benchmarking to find the sweet spot for max documents 
(records) per shard with the most optimal search latency. 
• Commit at the end of each table. 
Records/Max Shards Table 
Max Number of Records in table | Max number of Shards (cores) Allowed 
< 100 million | 1 
100-300 million | 2 
< 500 million | 4 
< 1 billion | 6 
1-5 billion | 8 
> 5 billion | 16 
* Distributed Search improves latency but may not always be faster, as search latency is bounded by the time taken by the last shard to respond.
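For illustration, a small lookup of this sharding policy; the thresholds are copied from the table above, while the class and method names are made up:

public final class ShardPolicy {
    // Returns the maximum number of shards (cores) allowed for a table,
    // based on its total record count (thresholds from the table above).
    public static int maxShards(long records) {
        if (records < 100_000_000L)    return 1;
        if (records <= 300_000_000L)   return 2;
        if (records < 500_000_000L)    return 4;
        if (records < 1_000_000_000L)  return 6;
        if (records <= 5_000_000_000L) return 8;
        return 16;
    }

    public static void main(String[] args) {
        System.out.println(maxShards(12_000_000_000L)); // 16 shards for a 12-billion-row table
        System.out.println(maxShards(80_000_000L));     // 1 shard for an 80-million-row table
    }
}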
It worked fine but one day suddenly…. 
java.lang.OutOfMemoryError: 
Java heap Space 
• All Solr servers crashed. 
• We restarted them, but they kept crashing randomly every other day. 
• Took a heap dump and realized that it was due to the Field Cache. 
• Found a solution: Doc Values, and have never had this issue again to date.
Doc Values (Life Saver) 
• Disk-based Field Data, a.k.a. Doc Values 
• Document-to-value mapping built at index time 
• Stores field values on disk (or in memory) in a column-stride fashion 
• Better compression than the Field Cache 
• A replacement for the Field Cache (though not completely) 
• Quite suitable for Custom Scoring, Sorting and Faceting (see the sketch below the references) 
References: 
• Old article (but Good): http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/ 
• https://cwiki.apache.org/confluence/display/solr/DocValues 
• http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/doc-values.html
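As a hedged illustration of where doc values pay off, the sketch below sorts and facets on fields that would be declared with docValues="true" in the schema. The core name and field names are hypothetical, not the actual schema.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DocValuesQuery {
    public static void main(String[] args) throws Exception {
        // Assumes schema fields such as (illustrative):
        //   <field name="cpc"     type="pfloat" indexed="true" stored="true" docValues="true"/>
        //   <field name="country" type="string" indexed="true" stored="true" docValues="true"/>
        // Sorting and faceting on these fields then reads column-stride data from disk
        // instead of populating the heap-resident FieldCache.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://solr1:8983/solr/keywords").build()) {
            SolrQuery q = new SolrQuery("domain:example.com");
            q.setSort("cpc", SolrQuery.ORDER.desc); // sorting backed by doc values
            q.addFacetField("country");             // faceting backed by doc values
            q.setRows(50);
            QueryResponse rsp = solr.query(q);
            System.out.println(rsp.getResults().getNumFound() + " matching keywords");
        }
    }
}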
Scaling and Making Search Faster…
Partitioning 
• 3 Level Partitioning, by Month, Country and Table name 
• Each Month has its own Cluster and a Cluster Manager. 
• Latency and Throughput are a tradeoff; you can’t have both at the same time. 
Diagram: Internet → Load Balancer → Web Server Farm → App Servers → Search Heads (US, UK, AU) → per-month clusters (e.g. Cluster 1 for Jan, Cluster 2 for Feb), each cluster spanning multiple nodes that run several Solr instances. 
Each month’s cluster has its own Cluster Manager deployed as an Active/Passive pair, coordinated by a Master Cluster Manager. 
The App Server fetches the Cluster Mapping and makes a request to the Search Head with the respective Solr cores for that Country, Month and Table (a toy routing sketch follows). 
Fan-out: 1 user → 24 UI widgets → 24 Ajax requests → 41 search requests.
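A toy sketch of this routing step, purely to illustrate the idea; the mapping data, class name and key format are invented, and the real cluster manager is a custom component built on Zookeeper.

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;

public class ClusterRouter {
    // month|country|table -> comma-separated list of data-bearing cores (illustrative values).
    private final Map<String, String> clusterMapping = new HashMap<>();
    // country -> search head URL (illustrative values).
    private final Map<String, String> searchHeads = new HashMap<>();

    public ClusterRouter() {
        searchHeads.put("US", "http://searchhead-us:8983/solr/head");
        clusterMapping.put("2014-01|US|keywords",
                "solr1:8983/solr/keywords_jan_us_s1,solr2:8983/solr/keywords_jan_us_s2");
    }

    // Builds the distributed query that the app server would send to the Search Head.
    public SolrQuery route(String month, String country, String table, String userQuery) {
        String shards = clusterMapping.get(month + "|" + country + "|" + table);
        SolrQuery q = new SolrQuery(userQuery);
        q.set("shards", shards); // the search head aggregates results from these cores
        return q;
    }
}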
Index Optimization Strategy 
• Running optimization on ~200+ Solr cores is very time-consuming 
• Solr cores with a bigger index size (~70GB) have 2500+ segments due to the higher Merge Factor used while Indexing. 
• Can’t be run in parallel on all cores of a single machine, as it is heavily dependent on CPU and disk IO 
• Optimizing large segments down to a very small number is very time-consuming and can take up to 3x the index size on disk 
• On the other side, a small number of segments improves performance drastically, so we need a balance. 
Diagram: the Optimizer fetches the Cluster Mapping (list of all cores) from the Staging Cluster Manager and runs against the Solr nodes; once optimization and cache warm-up are done, it pushes the Cluster Mapping to the Production Cluster Manager, making all indices live. 
*As per our observation for our data, the optimization process takes ~42-46 seconds per 1GB of index. 
We need to do it for 4.6TB (across all boxes), the total Solr index size for a single month. 
Optimizing a Solr core down to a very small number of segments takes a huge amount of time, so we do it iteratively. 
Algorithm (see the sketch below): 
• Choose at most 3 cores per machine to optimize in parallel; start with the smallest index. 
• From the index size and the number of docs, determine the max segments allowed. 
• Reduce the segment count to ~0.90x of its current value in each run, until the target is reached.
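A simplified sketch of that iterative optimize loop, using SolrJ’s optimize call with a maxSegments target. The sizing heuristic, starting segment count, thread count and core names are assumptions, not the production optimizer.

import java.util.*;
import java.util.concurrent.*;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class IterativeOptimizer {
    // Rough heuristic: allow more segments for bigger indexes (made-up numbers).
    static int targetSegments(long indexSizeGb) {
        return (int) Math.max(1, indexSizeGb / 10);
    }

    public static void main(String[] args) throws Exception {
        // Cores on one machine, smallest index first (sizes in GB are illustrative).
        Map<String, Long> coresBySize = new LinkedHashMap<>();
        coresBySize.put("http://solr1:8983/solr/keywords_shard3", 12L);
        coresBySize.put("http://solr1:8983/solr/keywords_shard1", 40L);
        coresBySize.put("http://solr1:8983/solr/keywords_shard2", 70L);

        ExecutorService pool = Executors.newFixedThreadPool(3); // at most 3 cores in parallel
        for (Map.Entry<String, Long> core : coresBySize.entrySet()) {
            pool.submit(() -> {
                try (HttpSolrClient solr = new HttpSolrClient.Builder(core.getKey()).build()) {
                    int target = targetSegments(core.getValue());
                    int current = 2500;                     // would be read from the core's status
                    while (current > target) {
                        current = Math.max(target, (int) (current * 0.90)); // shrink ~10% per run
                        solr.optimize(true, true, current); // merge down to 'current' segments
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.DAYS);
    }
}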
Finally after optimization and cache warm-up… 
A shard looks like this (screenshot: max segments after optimization).
External Caching 
• In distributed search, every repeated query request hits all Solr servers, even 
though the result is served from Solr’s cache. This increases search latency 
and lowers throughput. 
• Solution: cache the most frequently accessed query results in the app layer 
(LRU-based eviction); a Jedis-based sketch follows this slide. 
• We use Redis for Caching 
• All complex aggregation queries’ results, once fetched from multiple 
Solr servers, are served from the cache on subsequent requests. 
Why Redis… 
• Advanced In-Memory key-value store 
• Insanely fast 
• Response times on the order of 5-10 ms 
• Provides cache behavior (set, get) with advanced data structures like 
hashes, lists, sets, sorted sets, bitmaps etc. 
• http://redis.io/
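A bare-bones sketch of this app-layer cache using Jedis. The key scheme, TTL and result serialization are assumptions; the production setup also uses Redis hashes and LRU eviction as noted in the takeaways.

import redis.clients.jedis.Jedis;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CachedSearch {
    private final Jedis redis = new Jedis("redis-host", 6379);
    private final HttpSolrClient searchHead =
            new HttpSolrClient.Builder("http://searchhead-us:8983/solr/head").build();

    public String search(SolrQuery query) throws Exception {
        // Cache key derived from the full query string (hypothetical scheme).
        String key = "q:" + Integer.toHexString(query.toString().hashCode());

        String cached = redis.get(key);
        if (cached != null) {
            return cached;                     // served from Redis; Solr is not touched
        }
        // Cache miss: run the expensive distributed/aggregation query once.
        String result = searchHead.query(query).getResults().toString();
        redis.setex(key, 24 * 3600, result);   // keep for a day; evicted earlier under memory pressure
        return result;
    }
}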
Hardware 
• We use Bare Metal, Dedicated servers for Solr for the reasons below: 
1. Performance gain (with virtual servers, performance dropped by ~18-20%) 
2. Better value of computing power per $ spent 
• 2.6GHz, 32 core (4x8 core), 384GB RAM, 6TB SAS 15k (RAID10) 
• 2.6GHz, 16 core (2x8 core), 192GB RAM, 4TB SAS 15k (RAID10) 
• Since Index size is 4.6TB/month, we want to cache more data in Disk Cache with bigger RAM. 
SSD vs SAS 
1. SSD: Indexing rate - peak (MySQL to Solr): 330k docs/sec (each doc: ~100-125 bytes) 
2. SAS 15k: 182k docs/sec (dropped by ~45%) 
3. SAS 15k is considerably cheaper than SSD for bigger hard disks. 
4. We are using SAS 15k as it is cost-effective, but plan to move to SSD in the future.
Conclusion : Key takeaways 
General: 
• Understand the characteristics of the data and partition it well. 
Cache: 
 Spend time analyzing cache usage and tune the caches; serving from cache is 10x-50x faster. 
 Always use Filter Query (fq) wherever possible, as it improves performance via the Filter cache (see the sketch after these tips). 
GC : 
 Keep your JVM heap size at a lower value (proportional to the machine’s RAM), leaving enough RAM for the kernel, as a bigger 
heap will lead to frequent GC. A 4GB to 8GB heap allocation is a good range, but we use 12GB/16GB. 
 Keep an eye on Garbage Collection (GC) logs, especially on Full GC. 
Tuning Params: 
 Don’t use Soft Commit if you don’t need it, especially during Batch Loading. 
 Always explore Solr tuning for high performance, e.g. ramBufferSizeMB, mergeFactor, and HttpShardHandler’s 
various configurations. 
 Use hash in Redis to minimize the memory usage. 
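To make the fq point concrete, here is a tiny SolrJ sketch; the field names and values are illustrative. The repeated constraints go into filter queries so they hit the filter cache and are reused across different keyword queries, while the main query stays free to change per request.

import org.apache.solr.client.solrj.SolrQuery;

public class FilterQueryExample {
    public static SolrQuery build(String keyword) {
        // Main relevance query changes per request...
        SolrQuery q = new SolrQuery("keyword_text:" + keyword);
        // ...while the repeated constraints go into fq, cached in Solr's filterCache.
        q.addFilterQuery("country:US");
        q.addFilterQuery("month:2014-01");
        q.setRows(50);
        return q;
    }
}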
Read the whole experience in more detail: http://rahuldausa.wordpress.com/2014/05/16/real-time-search-on-40-billion-records-month-with-solr/
Thank you! 
Twitter: @rahuldausa 
dynamicrahul2020@gmail.com 
http://www.linkedin.com/in/rahuldausa
