Scaling Massive ElasticSearch Clusters

    Rafał Kuć – Sematext International
   @kucrafal @sematext sematext.com
Who Am I
•   “Solr 3.1 Cookbook” author
•   Sematext software engineer
•   Solr.pl co-founder
•   Father and husband :-)




                Copyright 2012 Sematext Int’l. All rights reserved
What Will I Talk About?
•   ElasticSearch scaling
•   Indexing thousands of documents per second
•   Performing queries in tens of milliseconds
•   Controlling shard and replica placement
•   Handling multilingual content
•   Performance testing
•   Cluster monitoring

The Challenge
•   More than 50 million documents a day
•   Real time search
•   Less than 200ms average query latency
•   Throughput of at least 1000 QPS
•   Multilingual indexing
•   Multilingual querying



Why ElasticSearch?
• Written with NRT and cloud support in mind
• Uses Lucene and all its goodness
• Distributed indexing with document
  distribution control out of the box
• Easy creation of indices, shards and replicas on a live
  cluster



Index Design
• Several indices (at least one index for each day
  of data)
• Indices divided into multiple shards
• Multiple replicas of a single shard
• Real-time, synchronous replication
• Near-real-time index refresh (1 to 30 seconds)
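The one-index-per-day layout can be sketched as a small naming helper. The `logs` prefix and the date format are assumptions made for illustration; the slides do not say how the daily indices were named.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logs"):
    """Name of the index holding one day of data (hypothetical naming scheme)."""
    return f"{prefix}-{day.isoformat()}"


def indices_for_range(first_day, last_day, prefix="logs"):
    """List the daily indices covering an inclusive date range."""
    days = (last_day - first_day).days
    return [daily_index_name(first_day + timedelta(days=n), prefix)
            for n in range(days + 1)]


names = indices_for_range(date(2012, 2, 14), date(2012, 2, 16))
print(",".join(names))
```

ElasticSearch accepts a comma-separated list of index names in the search URL, so a query over a date range only needs to name the days it covers.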



Shard Deployment Problems
•   Multiple shards per node
•   Replicas on the same nodes as shards
•   Unevenly distributed shards and replicas
•   Some nodes being hot, while others are cold




Default Shard Deployment
[Diagram: ElasticSearch cluster]
• Node 1: Shard 1, Shard 2, Replica 2
• Node 2: Shard 3, Replica 1
• Node 3: Replica 3
What Can We Do With Shards Then?
• Control shard placement with node tags:
  – index.routing.allocation.include.tag
  – index.routing.allocation.exclude.tag
• Control shard placement with node IP
  addresses:
  – cluster.routing.allocation.include._ip
  – cluster.routing.allocation.exclude._ip
• Specified on index or cluster level
• Can be changed on a live cluster!
Shard Allocation Examples
• Cluster level:
curl -XPUT localhost:9200/_cluster/settings -d '{
   "persistent" : {
     "cluster.routing.allocation.exclude._ip" : "192.168.2.1"
   }
}'
• Index level:
curl -XPUT localhost:9200/sematext/_settings -d '{
   "index.routing.allocation.include.tag" : "nodeOne,nodeTwo"
}'
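The effect of include and exclude filters can be pictured with a toy model. This is only a sketch of the filtering idea, not ElasticSearch's allocation code; the node names, tags and addresses are made up.

```python
def eligible_nodes(nodes, include=None, exclude=None, attr="tag"):
    """Toy allocation filter: keep nodes whose attribute matches any
    include value (if given) and none of the exclude values."""
    include = set(include or [])
    exclude = set(exclude or [])
    result = []
    for name, attrs in nodes.items():
        value = attrs.get(attr)
        if include and value not in include:
            continue
        if value in exclude:
            continue
        result.append(name)
    return sorted(result)


# Hypothetical three-node cluster.
nodes = {
    "node1": {"tag": "nodeOne", "_ip": "192.168.2.1"},
    "node2": {"tag": "nodeTwo", "_ip": "192.168.2.2"},
    "node3": {"tag": "nodeThree", "_ip": "192.168.2.3"},
}

by_tag = eligible_nodes(nodes, include=["nodeOne", "nodeTwo"])
by_ip = eligible_nodes(nodes, exclude=["192.168.2.1"], attr="_ip")
print(by_tag)
print(by_ip)
```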

Number of Shards Per Node
• Limits how many shards of an index a single
  node may hold
• Specified on index level
• Can be changed on live indices
• Example:
curl -XPUT localhost:9200/sematext/_settings -d '{
   "index.routing.allocation.total_shards_per_node" : 2
}'
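A quick way to sanity-check a per-node limit is to compute the smallest value that still lets every shard copy find a home. A minimal sketch, assuming the limit counts primaries and replicas together and that all data nodes are eligible:

```python
import math


def min_shards_per_node(primaries, replicas_per_primary, data_nodes):
    """Smallest per-node limit that can still place every shard copy."""
    total_copies = primaries * (1 + replicas_per_primary)
    return math.ceil(total_copies / data_nodes)


# 3 primaries with 1 replica each on 3 data nodes.
limit = min_shards_per_node(3, 1, 3)
print(limit)
```

Setting the limit below this value would leave some shard copies unassigned, so it is worth recomputing whenever nodes are added or removed.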


Controlled Shard Deployment
[Diagram: ElasticSearch cluster]
• Node 1: Shard 1, Replica 2
• Node 2: Shard 3, Replica 1
• Node 3: Shard 2, Replica 3
Does Routing Matter?
• Controls target shard for each document
• Defaults to hash of a document identifier
• Can be specified explicitly (routing parameter) or
  as a field value (a bit less performant)
• Can take any value
• Example:
curl -XPUT 'localhost:9200/sematext/test/1?routing=1234' -d '{
  "title" : "Test routing document"
}'
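The idea of "hash of the routing value picks the shard" can be sketched as follows. `zlib.crc32` is a stand-in chosen for illustration; it is not the hash function ElasticSearch uses, so the shard numbers will not match a real cluster.

```python
import zlib

NUM_SHARDS = 3  # assumed shard count for the example


def pick_shard(routing_value, number_of_shards=NUM_SHARDS):
    """Map a routing value to a shard number (stand-in hash, see lead-in)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


# Without explicit routing, the document identifier is the routing value.
default_shard = pick_shard("1")

# Documents indexed with the same routing value end up on the same shard.
docs = [("1", "1234"), ("2", "1234"), ("3", "1234")]
shards_used = {pick_shard(routing) for _doc_id, routing in docs}
print(default_shard, shards_used)
```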


Indexing the Data
[Diagram: indexing application and ElasticSearch cluster]
• Node 1: Shard 1, Replica 2
• Node 2: Shard 3, Replica 1
• Node 3: Shard 2, Replica 3
How We Indexed Data
[Diagram: indexing application and ElasticSearch cluster]
• Node 1: Shard 1
• Node 2: Shard 2
• Node 3: no shards
Nodes Without Data
• Nodes used only to route data and queries to
  other nodes in the cluster
• Such nodes don’t suffer from I/O waits (of
  course, data nodes don’t suffer from I/O waits
  all the time either)
• Not the default ElasticSearch behavior
• Set up by setting node.data to false


Multilingual Indexing
• Detection of document's language before
  sending it for indexing
• Detection with, e.g., Sematext LangID or Apache Tika
• Set known language analyzers in configuration
  or mappings
• Set analyzer during indexing (_analyzer field)



Multilingual Indexing Example
{
 "test" : {
  "_analyzer" : { "path" : "langId" },
  "properties" : {
   "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
   "title" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
   "langId" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
  }
 }
}

curl -XPUT localhost:9200/sematext/test/10 -d '{
  "title" : "Test document",
  "langId" : "english"
}'
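The "detect first, then index" step might look like the sketch below. The stopword-based detector is a deliberately naive stand-in for a real detector such as Sematext LangID or Apache Tika, and the word lists are made up; only the `title` and `langId` field names come from the mapping above.

```python
import json

# Tiny illustrative stopword lists, not a real language model.
STOPWORDS = {
    "english": {"the", "and", "is", "of"},
    "german": {"der", "die", "und", "ist"},
}


def detect_language(text, default="english"):
    """Toy detector: pick the language whose stopwords occur most often."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words)
              for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def prepare_document(title):
    """Attach the detected language so the _analyzer path can pick it up."""
    return {"title": title, "langId": detect_language(title)}


doc = prepare_document("Der Hund ist müde und die Katze schläft")
print(json.dumps(doc, ensure_ascii=False))
```

The values written to `langId` have to be names of analyzers the cluster knows about, which is why the analyzers need to be set up in the configuration or mappings first.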

Multilingual Queries
• Identify the language of the query before executing it
  (can be problematic)
• The query analyzer can be specified per query
  (analyzer parameter):
  curl -XGET 'localhost:9200/sematext/_search?q=let+AND+me&analyzer=english'
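When the query string comes from user input, it is safer to let a library do the URL encoding. A minimal sketch that rebuilds the request above; the `search_url` helper and the default host are assumptions for the example.

```python
from urllib.parse import urlencode


def search_url(index, query, analyzer, host="localhost:9200"):
    """Build a URI search request with a per-query analyzer."""
    params = urlencode({"q": query, "analyzer": analyzer})
    return f"http://{host}/{index}/_search?{params}"


url = search_url("sematext", "let AND me", "english")
print(url)
```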




Query Performance Factors – Lucene Level
• Refresh interval
  – Defaults to 1 second
  – Can be specified on cluster or index level
  – curl -XPUT localhost:9200/_settings -d '{ "index" : {
    "refresh_interval" : "600s" } }'
• Merge factor
  – Defaults to 10
  – Can be specified on cluster or index level
  – curl -XPUT localhost:9200/_settings -d '{ "index" : {
    "merge.policy.merge_factor" : 30 } }'

Let’s Talk About Routing Once Again
• Routes a query to a particular shard
• Speeds up queries; the gain depends on the number of
  shards in the index
• Has to be specified manually with the routing
  parameter at query time
• The routing parameter can take any value:

curl -XGET 'localhost:9200/sematext/_search?q=test&routing=2012-02-16'
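The difference in fan-out can be shown with a small in-memory simulation. It is a sketch only: the shards are plain Python lists, the date-based routing values mirror the example above, and `zlib.crc32` stands in for the real routing hash.

```python
import zlib

NUM_SHARDS = 8  # assumed shard count for the example


def pick_shard(routing_value):
    """Stand-in routing hash, not the one ElasticSearch uses."""
    return zlib.crc32(routing_value.encode("utf-8")) % NUM_SHARDS


def index_doc(shards, doc, routing):
    shards[pick_shard(routing)].append(doc)


def search(shards, term, routing=None):
    """Return (hits, number of shards visited)."""
    targets = range(NUM_SHARDS) if routing is None else [pick_shard(routing)]
    hits = [d for s in targets for d in shards[s] if term in d["title"]]
    return hits, len(targets)


shards = [[] for _ in range(NUM_SHARDS)]
for day in ("2012-02-14", "2012-02-15", "2012-02-16"):
    index_doc(shards, {"title": f"test document {day}", "day": day}, routing=day)

all_hits, visited_all = search(shards, "test")
routed_hits, visited_one = search(shards, "test", routing="2012-02-16")
print(visited_all, visited_one)
```

Different routing values can hash to the same shard, so a routed query should still filter on the value it cares about (here, the day) instead of assuming the shard holds nothing else.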


Querying ElasticSearch – No Routing
[Diagram: application querying an ElasticSearch index made of 8 shards (Shard 1 to Shard 8)]
Querying ElasticSearch – With Routing
[Diagram: application querying an ElasticSearch index made of 8 shards (Shard 1 to Shard 8)]
Performance Numbers
Queries without routing (200 shards, 1 replica)
#threads   Avg response time   Throughput   90% line   Median    CPU utilization
1          3169ms              19.0/min     5214ms     2692ms    95 – 99%

Queries with routing (200 shards, 1 replica)
#threads   Avg response time   Throughput   90% line   Median    CPU utilization
10         196ms               50.6/sec     642ms      29ms      25 – 40%
20         218ms               91.2/sec     718ms      11ms      10 – 15%
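The reported numbers can be cross-checked with simple arithmetic. The sketch assumes a closed-loop test in which each thread sends one request at a time with no think time; the slides do not describe the test harness, so treat this as a plausibility check only.

```python
def expected_throughput(threads, avg_response_ms):
    """Requests per second if every thread always has one request in flight."""
    return threads / (avg_response_ms / 1000.0)


# Without routing: 1 thread, 3169 ms average (reported: 19.0/min).
no_routing_per_min = expected_throughput(1, 3169) * 60   # about 18.9/min

# With routing (reported: 50.6/sec and 91.2/sec).
routed_10 = expected_throughput(10, 196)                 # about 51.0/sec
routed_20 = expected_throughput(20, 218)                 # about 91.7/sec

# Reported throughput gain from routing at 10 threads.
speedup = 50.6 / (19.0 / 60)                             # about 160x

print(round(no_routing_per_min, 1), round(routed_10, 1),
      round(routed_20, 1), round(speedup))
```

The estimates land close to the reported throughput in all three rows, which suggests the table is internally consistent.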




Scaling Query Throughput – What Else?

• Increasing the number of shards for data
  distribution
• Increasing the number of replicas
• Using routing
• Avoiding always hitting the same node and
  turning it into a hotspot



FieldCache and OutOfMemory
• ElasticSearch default setup doesn’t limit field
  data cache size




FieldCache – What Can We Do With It?
• Keep its default type and set:
   – Maximum size (index.cache.field.max_size)
   – Expiration time (index.cache.field.expire)
• Change its type:
   – soft (index.cache.field.type)
• Change your data:
   – Make your fields less precise (e.g. dates)
   – If you sort or facet on a field, consider whether you can
     reduce its granularity
• Buy more servers :-)
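To see why less precise dates help, count how many distinct values a field holds at each granularity; fewer distinct values generally means less to keep in memory for sorting and faceting. The timestamps below are synthetic, generated only for this illustration.

```python
from datetime import datetime, timedelta

# 1000 synthetic timestamps, 90 seconds apart.
start = datetime(2012, 2, 16)
timestamps = [start + timedelta(seconds=90 * n) for n in range(1000)]


def distinct_values(values, fmt):
    """Number of distinct field values after truncating to the given format."""
    return len({v.strftime(fmt) for v in values})


per_second = distinct_values(timestamps, "%Y-%m-%dT%H:%M:%S")
per_hour = distinct_values(timestamps, "%Y-%m-%dT%H")
per_day = distinct_values(timestamps, "%Y-%m-%d")
print(per_second, per_hour, per_day)
```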

FieldCache After Changes




Additional Problems We Encountered
• Rebalancing after full cluster restarts
  – cluster.routing.allocation.disable_allocation
  – cluster.routing.allocation.disable_replica_allocation
• Long startup and initialization
• Faceting with strings vs faceting on numbers on
  high cardinality fields



JVM Optimization
• Remember to leave enough memory to OS for
  cache
• Make GC frequent and short vs. rare and long
  – -XX:+UseParNewGC
  – -XX:+UseConcMarkSweepGC
  – -XX:+CMSParallelRemarkEnabled
• -XX:+AlwaysPreTouch (for short performance
  tests)

Performance Testing
• Data
  – How much data do I need?
  – Choosing the right queries
• Make changes
  – One change at a time
  – Understand the impact of the change
• Monitor your cluster (jstat, dstat/vmstat,
  SPM)
• Analyze your results
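For the analysis step, the average, median and 90% line can be computed from raw latencies. A minimal sketch using the nearest-rank method for the percentile; the sample latencies are made up.

```python
import math
import statistics


def summarize(latencies_ms):
    """Average, median and nearest-rank 90th percentile of latency samples."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 90 / 100)  # 1-based nearest rank
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
    }


# Made-up sample with a single slow outlier.
sample = [12, 15, 11, 29, 640, 18, 22, 14, 13, 16]
stats = summarize(sample)
print(stats)
```

One slow request is enough to pull the average far above the median, which is why it helps to report all three figures side by side.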
ElasticSearch Cluster Monitoring
•   Cluster health
•   Indexing statistics
•   Query rate
•   JVM memory and garbage collector work
•   Cache usage
•   Node memory and CPU usage
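A basic health check can be built on the response of the cluster health API (`GET /_cluster/health`). The sketch below only parses a response body; the sample JSON is made up, and the alert rules are an assumption for illustration.

```python
import json


def health_alerts(health):
    """Turn a cluster health response into a list of human-readable alerts."""
    alerts = []
    if health["status"] != "green":
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > 0:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    return alerts


# Made-up response body with a subset of the fields the API returns.
sample = json.loads(
    '{"cluster_name": "demo", "status": "yellow", '
    '"number_of_nodes": 2, "unassigned_shards": 3}'
)
alerts = health_alerts(sample)
print(alerts)
```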



Cluster Health
[Graph: cluster health, annotated with a node restart]
Indexing Statistics




Query Rate




JVM Memory and GC




Cache Usage




CPU and Memory




Summary
• Controlling shard and replica placement
• Indexing and querying multilingual data
• How to use sharding and routing without
  tearing your hair out
• How to test your cluster performance to find
  bottlenecks
• How to monitor your cluster and find
  problems right away
We Are Hiring!
•   Dig Search?
•   Dig Analytics?
•   Dig Big Data?
•   Dig Performance?
•   Dig working with and in open source?
•   We’re hiring worldwide!
       http://sematext.com/about/jobs.html

How to Reach Us
• Rafał Kuć
  – Twitter: @kucrafal
  – E-mail: rafal.kuc@sematext.com
• Sematext
  – Twitter: @sematext
  – Website: http://sematext.com
• Graphs used in the presentation are from:
  – SPM for ElasticSearch (http://sematext.com/spm)

Thank You For Your Attention

Building a Database for the End of the World
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 

Recently uploaded

Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 

Recently uploaded (20)

Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 

Scaling massive elastic search clusters - Rafał Kuć - Sematext

  • 1. Scaling Massive ElasticSearch Clusters Rafał Kuć – Sematext International @kucrafal @sematext sematext.com
  • 2. Who Am I
      • „Solr 3.1 Cookbook” author
      • Sematext software engineer
      • Solr.pl co-founder
      • Father and husband :-)
      Copyright 2012 Sematext Int’l. All rights reserved
  • 3. What Will I Talk About?
      • ElasticSearch scaling
      • Indexing thousands of documents per second
      • Performing queries in tens of milliseconds
      • Controlling shard and replica placement
      • Handling multilingual content
      • Performance testing
      • Cluster monitoring
  • 4. The Challenge
      • More than 50 million documents a day
      • Real-time search
      • Average query latency below 200 ms
      • Throughput of at least 1000 QPS
      • Multilingual indexing
      • Multilingual querying
  • 5. Why ElasticSearch?
      • Written with near-real-time (NRT) search and cloud support in mind
      • Uses Lucene and all its goodness
      • Distributed indexing with document distribution control out of the box
      • Easy creation of indices, shards and replicas on a live cluster
  • 6. Index Design
      • Several indices (at least one index for each day of data)
      • Indices divided into multiple shards
      • Multiple replicas of a single shard
      • Real-time, synchronous replication
      • Near-real-time index refresh (1 to 30 seconds)
  • 7. Shard Deployment Problems
      • Multiple shards per node
      • Replicas on the same nodes as shards
      • Shards and replicas not evenly distributed
      • Some nodes hot while others are cold
  • 8. Default Shard Deployment
      [Diagram: ElasticSearch cluster of three nodes. Node 1: Shard 1, Shard 2, Replica 2. Node 2: Shard 3, Replica 1. Node 3: Replica 3.]
  • 9. What Can We Do With Shards Then?
      • Control shard placement with node tags:
          – index.routing.allocation.include.tag
          – index.routing.allocation.exclude.tag
      • Control shard placement with node IP addresses:
          – cluster.routing.allocation.include._ip
          – cluster.routing.allocation.exclude._ip
      • Can be specified at index or cluster level
      • Can be changed on a live cluster!
  • 10. Shard Allocation Examples
      • Cluster level:
          curl -XPUT localhost:9200/_cluster/settings -d '{
            "persistent" : {
              "cluster.routing.allocation.exclude._ip" : "192.168.2.1"
            }
          }'
      • Index level:
          curl -XPUT localhost:9200/sematext/ -d '{
            "index.routing.allocation.include.tag" : "nodeOne,nodeTwo"
          }'
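The include/exclude settings above can be read as a filter over nodes: with `include`, a node has to match at least one listed value; with `exclude`, it must match none. The sketch below is a simplified model of that rule in Python, not ElasticSearch code. The function name, node names and the extra IP addresses are hypothetical and used only for illustration.

```python
def eligible_nodes(nodes, include=None, exclude=None):
    """Return the names of nodes allowed to hold shards under the filters.

    nodes:   dict of node name -> attribute value (for example a tag or an IP)
    include: comma-separated values; a node must match at least one of them
    exclude: comma-separated values; a node must match none of them
    """
    def parse(values):
        return {v.strip() for v in values.split(",")} if values else set()

    included, excluded = parse(include), parse(exclude)
    allowed = []
    for name, value in nodes.items():
        if included and value not in included:
            continue  # an include filter is set and this node does not match it
        if value in excluded:
            continue  # the node is explicitly excluded
        allowed.append(name)
    return sorted(allowed)


# Hypothetical three-node cluster, keyed once by tag and once by IP address.
tags = {"node1": "nodeOne", "node2": "nodeTwo", "node3": "nodeThree"}
ips = {"node1": "192.168.2.1", "node2": "192.168.2.2", "node3": "192.168.2.3"}

print(eligible_nodes(tags, include="nodeOne,nodeTwo"))  # ['node1', 'node2']
print(eligible_nodes(ips, exclude="192.168.2.1"))       # ['node2', 'node3']
```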
  • 11. Number of Shards Per Node
      • Allows one to specify the number of shards per node
      • Specified at index level
      • Can be changed on live indices
      • Example:
          curl -XPUT localhost:9200/sematext -d '{
            "index.routing.allocation.total_shards_per_node" : 2
          }'
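A per-node shard limit also implies a minimum cluster size, because every primary and every replica counts against the limit. A short arithmetic sketch in Python follows; the function name is made up for illustration, and the example assumes three primary shards with one replica each, as in the deployment diagrams.

```python
import math


def min_nodes_required(primary_shards, replicas, total_shards_per_node):
    """Smallest node count that can host every shard copy under the limit."""
    copies = primary_shards * (1 + replicas)  # primaries plus all their replicas
    return math.ceil(copies / total_shards_per_node)


# 3 primaries with 1 replica each = 6 shard copies; at most 2 per node.
print(min_nodes_required(3, 1, 2))  # 3
```

With fewer nodes than this, some shard copies cannot be assigned anywhere, so the limit should be chosen together with the planned cluster size.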
  • 12. Controlled Shard Deployment
      [Diagram: ElasticSearch cluster of three nodes. Node 1: Shard 1, Replica 2. Node 2: Shard 3, Replica 1. Node 3: Shard 2, Replica 3.]
  • 13. Does Routing Matter?
      • Controls the target shard for each document
      • Defaults to a hash of the document identifier
      • Can be specified explicitly (routing parameter) or as a field value (a bit less performant)
      • Can take any value
      • Example:
          curl -XPUT 'localhost:9200/sematext/test/1?routing=1234' -d '{
            "title" : "Test routing document"
          }'
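The idea behind routing can be sketched as "hash the routing key, then take it modulo the number of shards". The Python below is a minimal illustration under that assumption; `zlib.crc32` is only a stand-in, since the hash function ElasticSearch actually uses is different, and `target_shard` is a hypothetical name.

```python
import zlib


def target_shard(doc_id, number_of_shards, routing=None):
    """Pick a shard from the routing value, or from the document ID if none is given."""
    key = routing if routing is not None else doc_id
    # crc32 is a stand-in hash for illustration, not the one ElasticSearch uses.
    return zlib.crc32(str(key).encode("utf-8")) % number_of_shards


# Documents with different IDs but the same routing value land on the same shard.
print(target_shard("1", 8, routing="1234") == target_shard("2", 8, routing="1234"))  # True
```

The useful property is determinism: any document indexed with a given routing value ends up on one known shard, which is what later allows a query to be sent to that shard only.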
  • 14. Indexing the Data
      [Diagram: an indexing application and an ElasticSearch cluster of three nodes. Node 1: Shard 1, Replica 2. Node 2: Shard 3, Replica 1. Node 3: Shard 2, Replica 3.]
  • 15. How We Indexed Data
      [Diagram: an indexing application and an ElasticSearch cluster with Node 1, Node 2 and Node 3; labels shown: Shard 1, Shard 2.]
  • 16. Nodes Without Data
      • Nodes used only to route data and queries to other nodes in the cluster
      • Such nodes don’t suffer from I/O waits (of course, data nodes don’t suffer from I/O waits all the time either)
      • Not the default ElasticSearch behavior
      • Set up by setting node.data to false
  • 17. Multilingual Indexing
      • Detect the document's language before sending it for indexing, e.g. with Sematext LangID or Apache Tika
      • Set known language analyzers in configuration or mappings
      • Set the analyzer during indexing (_analyzer field)
  • 18. Multilingual Indexing Example
      Mapping:
          {
            "test" : {
              "_analyzer" : { "path" : "langId" },
              "properties" : {
                "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" },
                "title" : { "type" : "string", "store" : "yes", "index" : "analyzed" },
                "langId" : { "type" : "string", "store" : "yes", "index" : "not_analyzed" }
              }
            }
          }
      Indexing a document:
          curl -XPUT localhost:9200/sematext/test/10 -d '{
            "title" : "Test document",
            "langId" : "english"
          }'
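On the application side, the mapping above means the indexing code has to put an analyzer name into the langId field of every document. A minimal Python sketch of that step follows. It assumes language detection has already produced a language code; the `ANALYZERS` table, the function name and the fallback to the `standard` analyzer are illustrative assumptions, not part of the talk.

```python
import json

# Hypothetical mapping from detected language codes to analyzer names.
ANALYZERS = {"en": "english", "de": "german", "fr": "french"}


def prepare_document(title, language_code, default_analyzer="standard"):
    """Build the document body, storing the analyzer name in langId."""
    return {
        "title": title,
        # Unknown languages fall back to a general-purpose analyzer.
        "langId": ANALYZERS.get(language_code, default_analyzer),
    }


print(json.dumps(prepare_document("Test document", "en")))
```

The resulting JSON can be sent as the request body of the indexing call shown in the slide.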
  • 19. Multilingual Queries
      • Identify the language of the query before its execution (can be problematic)
      • The query analyzer can be specified per query (analyzer parameter):
          curl -XGET 'localhost:9200/sematext/_search?q=let+AND+me&analyzer=english'
  • 20. Query Performance Factors – Lucene level
      • Refresh interval
          – Defaults to 1 second
          – Can be specified at cluster or index level
          – curl -XPUT localhost:9200/_settings -d '{ "index" : { "refresh_interval" : "600s" } }'
      • Merge factor
          – Defaults to 10
          – Can be specified at cluster or index level
          – curl -XPUT localhost:9200/_settings -d '{ "index" : { "merge.policy.merge_factor" : 30 } }'
  • 21. Let’s Talk About Routing Once Again
      • Routes a query to a particular shard
      • Speeds up queries; the gain depends on the number of shards in the index
      • Has to be specified manually with the routing parameter at query time
      • The routing parameter can take any value:
          curl -XGET 'localhost:9200/sematext/_search?q=test&routing=2012-02-16'
  • 22. Querying ElasticSearch – No Routing
      [Diagram: an application querying an ElasticSearch index of eight shards, Shard 1 to Shard 8.]
  • 23. Querying ElasticSearch – With Routing
      [Diagram: an application querying an ElasticSearch index of eight shards, Shard 1 to Shard 8.]
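The difference between the two query modes can be sketched as the set of shards a search has to visit. The Python below is an illustrative model only: without a routing value every shard is searched, with one the query goes to a single shard. `zlib.crc32` stands in for the real hash function, and `shards_to_search` is a hypothetical name.

```python
import zlib


def shards_to_search(number_of_shards, routing=None):
    """Return the shard numbers a query has to visit."""
    if routing is None:
        return list(range(number_of_shards))  # fan out to every shard
    # Stand-in hash for illustration; not the function ElasticSearch uses.
    return [zlib.crc32(routing.encode("utf-8")) % number_of_shards]


print(len(shards_to_search(200)))                        # 200
print(len(shards_to_search(200, routing="2012-02-16")))  # 1
```

The query-time routing value has to match the one used at indexing time, otherwise the single shard that is searched may not contain the documents.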
  • 24. Performance Numbers
      Queries without routing (200 shards, 1 replica)
        #threads | Avg response time | Throughput | 90% line | Median  | CPU utilization
        1        | 3169 ms           | 19.0/min   | 5214 ms  | 2692 ms | 95–99%
      Queries with routing (200 shards, 1 replica)
        #threads | Avg response time | Throughput | 90% line | Median  | CPU utilization
        10       | 196 ms            | 50.6/sec   | 642 ms   | 29 ms   | 25–40%
        20       | 218 ms            | 91.2/sec   | 718 ms   | 11 ms   | 10–15%
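Because the two tables use different throughput units (per minute and per second), a quick conversion makes the gap easier to see. The Python below only redoes arithmetic on the figures from the slide; the function name is made up. Note that the runs also differ in thread count (1 versus 10 and 20), so the ratios are a rough indication rather than a like-for-like comparison.

```python
def throughput_speedup(routed_per_sec, baseline_per_min):
    """Ratio of routed throughput to unrouted throughput, both in queries/sec."""
    return routed_per_sec / (baseline_per_min / 60.0)


# Figures taken from the two tables above.
print(round(throughput_speedup(50.6, 19.0)))  # 160
print(round(throughput_speedup(91.2, 19.0)))  # 288
print(round(3169 / 196, 1))                   # 16.2 (ratio of average response times)
```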
  • 25. Scaling Query Throughput – What Else?
      • Increasing the number of shards for data distribution
      • Increasing the number of replicas
      • Using routing
      • Avoiding always hitting the same node and turning it into a hot spot
  • 26. FieldCache and OutOfMemory
      • The default ElasticSearch setup doesn’t limit the field data cache size
  • 27. FieldCache – What Can We Do With It?
      • Keep its default type and set:
          – Maximum size (index.cache.field.max_size)
          – Expiration time (index.cache.field.expire)
      • Change its type:
          – soft (index.cache.field.type)
      • Change your data:
          – Make your fields less precise (e.g. dates)
          – If you sort or facet on fields, consider whether you can reduce their granularity
      • Buy more servers :-)
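"Make your fields less precise" pays off because sorting and faceting load the distinct field values into memory, so fewer distinct values means less cache. The Python sketch below illustrates the effect on made-up timestamps by truncating them to the hour; the data and the helper name are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def truncate_to_hour(ts):
    """Drop minutes, seconds and microseconds from a timestamp."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Made-up data: 10,000 timestamps, 37 seconds apart.
start = datetime(2012, 2, 16)
timestamps = [start + timedelta(seconds=37 * i) for i in range(10_000)]

print(len(set(timestamps)))                               # 10000 distinct values
print(len({truncate_to_hour(ts) for ts in timestamps}))   # 103 distinct values
```

Whether hourly precision is acceptable depends on how the field is used in queries, so this is a trade-off to check against the application's requirements.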
  • 28. FieldCache After Changes
  • 29. Additional Problems We Encountered
      • Rebalancing after full cluster restarts
          – cluster.routing.allocation.disable_allocation
          – cluster.routing.allocation.disable_replica_allocation
      • Long startup and initialization
      • Faceting on strings vs. faceting on numbers for high-cardinality fields
  • 30. JVM Optimization
      • Remember to leave enough memory to the OS for its cache
      • Make GC frequent and short rather than rare and long
          – -XX:+UseParNewGC
          – -XX:+UseConcMarkSweepGC
          – -XX:+CMSParallelRemarkEnabled
      • -XX:+AlwaysPreTouch (for short performance tests)
  • 31. Performance Testing
      • Data
          – How much data do I need?
          – Choosing the right queries
      • Make changes
          – One change at a time
          – Understand the impact of the change
      • Monitor your cluster (jstat, dstat/vmstat, SPM)
      • Analyze your results
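When analyzing test results, it helps to look at the same three figures the performance tables report: average, median and 90% line. A minimal Python sketch follows, using the nearest-rank method for the percentile; the function name and the sample latencies are made up for illustration.

```python
import math
import statistics


def summarize(latencies_ms):
    """Return average, median and 90th percentile (nearest rank) of response times."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(0.9 * len(ordered)))  # 1-based nearest-rank position
    return {
        "avg": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p90": ordered[rank - 1],
    }


# Made-up sample: mostly fast queries plus two slow outliers.
sample = [12, 15, 11, 29, 640, 14, 13, 18, 700, 16]
print(summarize(sample))  # avg 146.8, median 15.5, p90 640
```

A few slow queries pull the average far above the median, which is why the average alone says little about typical query latency.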
  • 32. ElasticSearch Cluster Monitoring
      • Cluster health
      • Indexing statistics
      • Query rate
      • JVM memory and garbage collector work
      • Cache usage
      • Node memory and CPU usage
  • 33. Cluster Health (graph annotation: Node restart)
  • 34. Indexing Statistics
  • 35. Query Rate
  • 36. JVM Memory and GC
  • 37. Cache Usage
  • 38. CPU and Memory
  • 39. Summary
      • Controlling shard and replica placement
      • Indexing and querying multilingual data
      • How to use sharding and routing without tearing your hair out
      • How to test your cluster performance to find bottlenecks
      • How to monitor your cluster and find problems right away
  • 40. We Are Hiring!
      • Dig Search?
      • Dig Analytics?
      • Dig Big Data?
      • Dig Performance?
      • Dig working with and in open source?
      • We’re hiring worldwide!
      http://sematext.com/about/jobs.html
  • 41. How to Reach Us
      • Rafał Kuć
          – Twitter: @kucrafal
          – E-mail: rafal.kuc@sematext.com
      • Sematext
          – Twitter: @sematext
          – Website: http://sematext.com
      • Graphs used in the presentation are from:
          – SPM for ElasticSearch (http://sematext.com/spm)
  • 42. Thank You For Your Attention