Elasticsearch from the trenches

Jai Jones
jaij@slalom.com

about me
solution architect at slalom
enjoy building search apps
7+ years Lucene
2+ years Hibernate Search
~2 years Elasticsearch

agenda
the ask
initial approach
problems
next steps
lessons learned
improvements
questions

the ask
search 6 billion docs in under 1.5 sec
index 2 million new docs / day
export billions of docs to CSV files
index and search docs in realtime
use search throughout the application
free text search
faceted navigation
suggestions
dashboards

free text search

faceted navigation
drill down

initial approach
hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks
cluster
5 dedicated nodes
all nodes store data
all nodes are master
all nodes sort & aggregate

shards
used the default shard count
5 primary + 1 replica
unlimited primary shards / node
indices
data was chronological
used the time-based index strategy
weekly indices for transaction logs
daily indices for audit logs

memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used G1 Garbage Collector
index mappings
indexed all fields
stored big documents with 60+ fields
nested documents
parent-child relationships

searches
searched all indices
used query_string searches
searched all fields
sorted & aggregated on any field
range queries
parent-child queries
GET /index-*/_search
{
  "query": {
    "query_string": {
      "query": "+(eggplant | potato)",
      "default_field": "_all",
      "default_operator": "and"
    }
  }
}

problems
early signs
OutOfMemoryError
field data exceeded JVM heap
shard count was in the thousands
garbage collector could not free memory
CircuitBreakerException
field data exceeded JVM heap
search results exceeded JVM heap
slow searches (latency increased from seconds to minutes)
nodes became unresponsive
frequent GC pauses

cluster down
index corruption
data loss
nodes failed to restart

next steps
identify and fix problems...
shard capacity
understand data & searches
size based on actual usage
field data
monitor
identify the producers
reduce usage
search
identify bottlenecks
optimize
cluster
find failure points
make topology changes
make hardware changes

shard capacity
learned...
1 shard can handle a lot of data
actually it held ~5x more data
didn't need 5 shards per index
didn't need weekly/daily indices
shard is the unit of scale
how much data can a single shard hold?
find the single shard breaking point
1. loaded a single shard with data
2. ran typical searches
3. recorded search response time
4. repeated until response time became unacceptable
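The four steps above can be sketched as a small loop. This is only an illustration, not the tooling used in the talk: `load_batch` and `time_searches` are hypothetical callables you would supply (one indexes more documents into the single-shard test index and returns how many it added, the other runs your typical searches and returns the slowest response time in seconds).

```python
from typing import Callable, Optional


def find_breaking_point(
    load_batch: Callable[[], int],
    time_searches: Callable[[], float],
    max_seconds: float,
    max_rounds: int = 1000,
) -> Optional[int]:
    """Return the document count at which searches first exceeded max_seconds.

    Returns None if the limit was never exceeded within max_rounds.
    Both callables are hypothetical hooks provided by the caller.
    """
    total_docs = 0
    for _ in range(max_rounds):
        total_docs += load_batch()     # step 1: load more data into the single shard
        elapsed = time_searches()      # steps 2-3: run typical searches, record the time
        if elapsed > max_seconds:      # step 4: stop once response time is unacceptable
            return total_docs
    return None
```

The returned document count is a rough upper bound for one shard; a real capacity target would presumably sit some way below it.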

field data
which fields and indices are using a lot of field data?
use the stats API to find out
culprits...
fields used for sorting & aggregation
high cardinality fields
id-cache for parent-child relationships
field data is loaded the first time a field is accessed
field data is maintained per-index
field data is not GC'd
# Node Stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human'
# Indices Stats
curl -XGET 'http://localhost:9200/_stats/fielddata/?human'
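Once the node stats response is in hand, ranking the fields by memory makes the culprits easy to spot. The sketch below assumes the response was requested with a per-field breakdown and has the shape `nodes -> <node id> -> indices -> fielddata -> fields -> <field> -> memory_size_in_bytes`; that shape is an assumption to verify against your Elasticsearch version. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def top_fielddata_fields(node_stats: dict, limit: int = 10) -> List[Tuple[str, int]]:
    """Sum field data bytes per field across all nodes, largest first.

    node_stats is the parsed JSON of a node stats response (assumed shape,
    see the note above). Missing keys are treated as zero usage.
    """
    totals: Dict[str, int] = defaultdict(int)
    for node in node_stats.get("nodes", {}).values():
        fields = node.get("indices", {}).get("fielddata", {}).get("fields", {})
        for name, stats in fields.items():
            totals[name] += stats.get("memory_size_in_bytes", 0)
    # Sort by bytes used, descending, and keep the top entries
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]
```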

search
what are the bottlenecks and resource killers?
searching all indices is slow and CPU intensive, and causes field data to be loaded for every index
# Searches all indices
/indexname-*/_search
# Searches a specific index
/indexname-2015/_search
query_string is flexible but allows inefficient searches such as leading wildcards, and it searches the _all field by default
{
  "query_string" : {
    "default_field" : "_all",
    "allow_leading_wildcard" : true,
    "query" : "this AND that OR thus"
  }
}

cluster
why is the cluster crashing?
field data used up 70-90% of the heap memory
not much heap left for node & shard management
stop-the-world garbage collector (GC) pauses made the cluster unresponsive
nodes dropped out of the cluster
the G1 GC had longer pauses than the CMS GC
sorting, aggregations, and the id-cache for parent-child relationships used up a lot of heap memory
managing too many shards used a lot of heap memory
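A simple check on the share of heap taken by field data could flag this condition before nodes start dropping out. The sketch below assumes a parsed node stats response that includes both the `jvm` and `indices` sections, with `jvm.mem.heap_max_in_bytes` and `indices.fielddata.memory_size_in_bytes` per node. The function names and the 0.5 default threshold are hypothetical, not values from the talk.

```python
from typing import List


def fielddata_heap_ratio(node: dict) -> float:
    """Fraction of a node's maximum heap currently used by field data."""
    heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
    fielddata = node["indices"]["fielddata"]["memory_size_in_bytes"]
    return fielddata / heap_max


def nodes_over_threshold(node_stats: dict, threshold: float = 0.5) -> List[str]:
    """Names of nodes whose field data share of heap exceeds the threshold.

    The threshold default is an illustrative assumption; pick one that
    leaves enough heap for node and shard management on your cluster.
    """
    return sorted(
        node.get("name", node_id)
        for node_id, node in node_stats["nodes"].items()
        if fielddata_heap_ratio(node) > threshold
    )
```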

lessons learned...
number of shards / node should not exceed the number of CPU cores
figure out the single shard capacity
monitor field data usage
field data usage is permanent and does not get garbage collected
excessive field data usage will bring down the cluster
search specific indices by target date range
tune and test all search API searches
split cluster into data, client and master nodes
use the default ES JVM settings and garbage collector

improvements
hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks
put master and client nodes on same servers
cluster
8 dedicated nodes (was 5)
dedicated master nodes (was: all nodes are master)
dedicated data nodes (was: all nodes store data)
dedicated client nodes (was: all nodes sort & aggregate)

shards
default shard count didn't work
1 primary + 1 replica (was 5 primary + 1 replica)
# of primary shards / node less than # of CPU cores (was unlimited)
indices
data was chronological
used the time-based index strategy
monthly indices for transaction logs (was weekly)
monthly indices for audit logs (was daily)
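The effect of these changes on total shard count is simple arithmetic: indices x primaries x (1 + replicas). The sketch below assumes one year of retained data and a single index family per log type; the talk does not state either, so treat the inputs as illustrative only.

```python
def shard_count(indices: int, primaries: int, replicas: int) -> int:
    """Total shards (primaries plus replica copies) for a set of indices."""
    return indices * primaries * (1 + replicas)


# Assumption: one year of data, one transaction log family, one audit log family.
# Before: weekly (52) and daily (365) indices, 5 primaries + 1 replica each.
before = shard_count(52, 5, 1) + shard_count(365, 5, 1)

# After: monthly (12) indices for both, 1 primary + 1 replica each.
after = shard_count(12, 1, 1) + shard_count(12, 1, 1)

print(before, after)  # 4170 48
```

Under these assumptions the original layout lands in the thousands of shards, which matches the problem reported earlier, while the revised layout stays in the tens.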

memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used stable CMS GC (was: new G1 GC)

index mappings
indexed 40 fields (was: all fields)
stored big documents with 60+ fields
nested documents
parent-child relationships
used field aliases to define alternate fields used in sorting and aggregation
used doc_values on sortable & aggregation fields
changed boolean data type to string
# not indexed (not searchable)
"field": {
  "index": "no"
}
# uses field data
"fieldA": {
  "type": "boolean"
}
# uses doc_values (no field data)
"fieldA": {
  "type": "string",
  "index": "analyzed",
  "fields": {
    "raw" : {
      "type" : "string",
      "index" : "not_analyzed",
      "fielddata": {
        "format": "doc_values"
      }
    }
  }
}

searches
target specific indices (was: search all indices)
simple_query_string (was: query_string)
search on some fields (was: all fields)
sorting & aggregations on low cardinality fields (was: all fields)
range filters (was: range queries)
nested queries (was: parent-child queries)
added query timeouts
GET /index-201501/_search
{
  "query": {
    "simple_query_string": {
      "query": "+(eggplant | potato)",
      "fields": ["field1", "field2"],
      "default_operator": "and"
    }
  }
}
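Targeting specific indices means turning the search's date range into index names. A minimal sketch, assuming monthly indices named `<prefix>YYYYMM` as in the `index-201501` example above; the helper names are hypothetical.

```python
from datetime import date
from typing import List


def monthly_indices(prefix: str, start: date, end: date) -> List[str]:
    """Names of the monthly indices covering start..end (inclusive)."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return names


def search_path(prefix: str, start: date, end: date) -> str:
    """Search path listing only the indices in range, comma-separated."""
    return "/" + ",".join(monthly_indices(prefix, start, end)) + "/_search"
```

For example, a search from mid-November 2014 to early January 2015 would hit three indices instead of every index matching `index-*`.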
