elasticsearch
from the trenches
Jai Jones
jaij@slalom.com
about me
solution architect at slalom
enjoy building search apps
7+ years Lucene
2+ years Hibernate Search
~2 years Elasticsearch
agenda
the ask
initial approach
problems
next steps
lessons learned
improvements
questions
the ask
search 6 billion docs in under 1.5 sec
index 2 million new docs / day
export billions of docs to CSV files
index and search docs in realtime
use search throughout the application
free text search
faceted navigation
suggestions
dashboards
free text search
faceted navigation
drill down
suggestions
dashboards
hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks
5 dedicated nodes
all nodes store data
all nodes are master
all nodes sort & aggregate
cluster
initial approach
shards
used the default shard count
5 primary + 1 replica
unlimited primary shards / node
indices
data was chronological
used the time-based index strategy
weekly indices for transaction logs
daily indices for audit logs
initial approach
memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used G1 Garbage Collector
initial approach
index mappings
indexed all fields
stored big documents with 60+ fields
nested documents
parent-child relationships
searches
searched all indices
used query_string searches
searched all fields
sorted & aggregated on any field
range queries
parent-child queries
GET /index-*/_search
{
  "query": {
    "query_string": {
      "query": "+(eggplant | potato)",
      "default_field": "_all",
      "default_operator": "and"
    }
  }
}
initial approach
problems
OutOfMemoryError
field data exceeded jvm heap
shard count was in the thousands
garbage collector could not free memory
CircuitBreakerException
field data exceeded jvm heap
search results exceeded jvm heap
slow searches (latency increased from seconds to minutes)
nodes became unresponsive
frequent GC pauses
early signs
cluster down
index corruption
data loss
nodes failed to restart
next steps
shard capacity
understand data & searches
size based on actual usage
field data
monitor
identify the producers
reduce usage
search
identify bottlenecks
optimize
cluster
find failure points
make topology changes
make hardware changes
identify and fix problems...
shard capacity
1 shard can handle a lot of data
actually it held ~5x more data
didn't need 5 shards per index
didn't need weekly/daily indices
learned...
shard is the unit of scale
how much data can a single shard hold?
find the single shard breaking point
1. loaded a single shard with data
2. ran typical searches
3. recorded search response time
4. repeated until response time became unacceptable
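The four steps above can be sketched as a small measurement loop. This is only an illustration: `load_batch` and `run_searches` are hypothetical hooks (not from the slides) that would index one batch into the single-shard test index and run the typical searches, and the latency limit is whatever "unacceptable" means for the application.

```python
from typing import Callable, List, Tuple


def find_shard_breaking_point(
    load_batch: Callable[[], int],
    run_searches: Callable[[], float],
    max_latency_s: float,
    max_rounds: int = 1000,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Grow a single shard until typical searches become too slow.

    load_batch   -- hypothetical hook: indexes one batch, returns docs added
    run_searches -- hypothetical hook: runs the typical searches and returns
                    the worst response time in seconds
    Returns the largest doc count that still met the limit, plus every
    (doc_count, latency) measurement taken along the way.
    """
    total_docs = 0
    last_ok = 0
    history: List[Tuple[int, float]] = []
    for _ in range(max_rounds):
        total_docs += load_batch()              # 1. load more data
        latency = run_searches()                # 2. run typical searches
        history.append((total_docs, latency))   # 3. record response time
        if latency > max_latency_s:             # 4. stop once unacceptable
            break
        last_ok = total_docs
    return last_ok, history
```

The returned doc count is a rough per-shard capacity for that data and those searches only; it would need to be re-measured when either changes.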
field data
which fields and indices are using a lot of field data?
use the stats API to find out
fields used for sorting & aggregation
high cardinality fields
id-cache for parent-child relationships
field data is loaded the first time a field is accessed
field data is maintained per-index
field data is not GC'd
culprits...
# Node Stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human'
# Indices Stats
curl -XGET 'http://localhost:9200/_stats/fielddata/?human'
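To find the heaviest fields, the node stats output can be summed per field. The sketch below assumes the response was requested with a per-field breakdown and has the shape `nodes -> <node id> -> indices -> fielddata -> fields -> <field> -> memory_size_in_bytes`; the sample numbers and field names are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def rank_fielddata(node_stats: Dict) -> List[Tuple[str, int]]:
    """Sum per-field fielddata bytes across all nodes, largest first."""
    totals: Counter = Counter()
    for node in node_stats.get("nodes", {}).values():
        fielddata = node.get("indices", {}).get("fielddata", {})
        for field, stats in fielddata.get("fields", {}).items():
            totals[field] += stats.get("memory_size_in_bytes", 0)
    return totals.most_common()


# Made-up sample in the assumed response shape (two nodes, two fields).
sample_stats = {
    "nodes": {
        "node-a": {"indices": {"fielddata": {"fields": {
            "user_id": {"memory_size_in_bytes": 500},
            "status": {"memory_size_in_bytes": 10},
        }}}},
        "node-b": {"indices": {"fielddata": {"fields": {
            "user_id": {"memory_size_in_bytes": 300},
        }}}},
    }
}

for field, size in rank_fielddata(sample_stats):
    print(field, size)
```

Fields at the top of this list are the first candidates for reducing usage.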
search
searching all indices is slow and CPU intensive, and causes field data to be loaded for every index
# Searches all indices
/indexname-*/_search
# Search specific indices
/indexname-2015/_search
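One way to target specific indices is to derive the index names from the date range of the search. A minimal sketch, assuming one index per month named `<prefix>-YYYYMM` (the naming scheme and the helper names are assumptions for illustration):

```python
from datetime import date
from typing import List


def monthly_indices(prefix: str, start: date, end: date) -> List[str]:
    """List the monthly index names that cover start..end (inclusive)."""
    if start > end:
        raise ValueError("start must not be after end")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def search_path(prefix: str, start: date, end: date) -> str:
    """Build a search path that names only the indices in the date range."""
    return "/" + ",".join(monthly_indices(prefix, start, end)) + "/_search"


print(search_path("indexname", date(2014, 11, 15), date(2015, 2, 1)))
```

Only the listed indices are searched, so field data for older indices is never loaded by the request.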
query_string is flexible but allows inefficient searches, such as leading wildcard searches, and it searches the _all field by default
{
  "query_string": {
    "default_field": "_all",
    "allow_leading_wildcard": true,
    "query": "this AND that OR thus"
  }
}
what are the bottlenecks and resource killers?
cluster
field data used up 70-90% of the heap memory
not much heap left for node & shard management
stop-the-world garbage collector (GC) pauses made the cluster unresponsive
nodes dropped out of the cluster
the G1 GC had longer pauses than the CMS GC
sorting, aggregations, id-cache for parent-child relationships used up a lot of heap memory
managing too many shards used a lot of heap memory
why is the cluster crashing?
lessons learned...
number of shards / node should not exceed the number of CPU cores
figure out the single shard capacity
monitor field data usage
field data usage is permanent and does not get garbage collected
too high field data usage will bring down the cluster
search specific indices by target date range
tune and test all search API searches
split cluster into data, client and master nodes
use the default ES JVM settings and garbage collector
hardware
used "large" servers
servers had lots of CPUs & RAM
non-RAIDed spinning disks
put master and client nodes on same servers
8 dedicated nodes (was: 5)
dedicated master nodes (was: all nodes are master)
dedicated data nodes (was: all nodes store data)
dedicated client nodes (was: all nodes sort & aggregate)
cluster
improvements
shards
default shard count didn't work
1 primary + 1 replica (was: 5 primary + 1 replica)
# of primary shards less than # of CPU cores (was: unlimited primary shards / node)
indices
data was chronological
used the time-based index strategy
monthly indices for transaction logs (was: weekly)
monthly indices for audit logs (was: daily)
improvements
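The effect of the shard and index changes on total shard count can be estimated with simple arithmetic. The sketch below assumes one year of data (the retention period is not stated in the slides) and counts every primary and replica copy as one shard.

```python
def shards_per_year(indices_per_year: int, primaries: int, replicas: int) -> int:
    """Shard copies created per year: each primary plus its replicas."""
    return indices_per_year * primaries * (1 + replicas)


# Initial approach: weekly + daily indices, 5 primaries + 1 replica each.
initial = shards_per_year(52, 5, 1) + shards_per_year(365, 5, 1)

# Improved approach: monthly indices for both logs, 1 primary + 1 replica.
improved = shards_per_year(12, 1, 1) + shards_per_year(12, 1, 1)

print(initial, improved)  # 4170 48
```

Under these assumptions the initial layout creates thousands of shards per year, while the monthly single-primary layout creates a few dozen.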
memory
dedicated 31 GB to the jvm heap
used remaining memory for file system cache
turned off linux process swapping
maxed out linux file descriptors
used stable CMS GC (was: new G1 GC)
improvements
index mappings
indexed 40 fields (was: all fields)
stored big documents with 60+ fields
nested documents
parent-child relationships
used field aliases to define alternate fields used in sorting and aggregation
used doc_values on sortable & aggregation fields
changed boolean data type to string
"field": {
  "index": "no"
}
# uses field data
"fieldA": {
  "type": "boolean"
}
# uses doc_values (no field data)
"fieldA": {
  "type": "string",
  "index": "analyzed",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": {
        "format": "doc_values"
      }
    }
  }
}
improvements
searches
target specific indices (was: search all indices)
simple_query_string (was: query_string)
search on some fields (was: all fields)
sorting & aggregations on low cardinality fields (was: all fields)
range filters (was: range queries)
nested queries (was: parent-child queries)
added query timeouts
GET /index-201501/_search
{
  "query": {
    "simple_query_string": {
      "query": "+(eggplant | potato)",
      "fields": ["field1", "field2"],
      "default_operator": "and"
    }
  }
}
improvements
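The search-side changes (explicit fields, simple_query_string, a query timeout) can be combined into one request body. A minimal sketch: the helper name and the `2s` default are assumptions for illustration, and the timeout is set through the search body's `timeout` setting.

```python
import json
from typing import Dict, List


def build_search_body(text: str, fields: List[str], timeout: str = "2s") -> Dict:
    """Build a search body that names its fields and carries a timeout."""
    if not fields:
        # Refuse to fall back to searching every field.
        raise ValueError("list the fields to search explicitly")
    return {
        "timeout": timeout,  # assumed value; tune per use case
        "query": {
            "simple_query_string": {
                "query": text,
                "fields": fields,
                "default_operator": "and",
            }
        },
    }


body = build_search_body("+(eggplant | potato)", ["field1", "field2"])
print(json.dumps(body, indent=2))
```

The body would be sent to a specific index such as `/index-201501/_search` rather than to a wildcard pattern.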
Questions?
Thank You

 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 

Elasticsearch from the trenches

  • 1. elasticsearch from the trenches. Jai Jones, jaij@slalom.com
  • 2. about me: solution architect at slalom; enjoy building search apps; 7+ years Lucene; 2+ years Hibernate Search; ~2 years Elasticsearch.
  • 3. agenda: the ask; initial approach; problems; next steps; lessons learned; improvements; questions.
  • 4. the ask: search 6 billion docs in under 1.5 sec; index 2 million new docs / day; export billions of docs to CSV files; index and search docs in realtime; use search throughout the application (free text search, faceted navigation, suggestions, dashboards).
  • 5. free text search
  • 9. initial approach. hardware: used "large" servers; servers had lots of CPUs & RAM; non-RAIDed spinning disks. cluster: 5 dedicated nodes; all nodes store data; all nodes are master; all nodes sort & aggregate.
  • 10. initial approach. shards: used the default shard count; 5 primary + 1 replica; unlimited primary shards / node. indices: data was chronological; used the time-based index strategy; weekly indices for transaction logs; daily indices for audit logs.
  • 11. initial approach. memory: dedicated 31 GB to the jvm heap; used remaining memory for file system cache; turned off linux process swapping; maxed out linux file descriptors; used G1 Garbage Collector. index mappings: indexed all fields; stored big documents with 60+ fields; nested documents; parent-child relationships.
  • 12. initial approach. searches: searched all indices; used query_string searches; searched all fields; sorted & aggregated on any field; range queries; parent-child queries. GET /index-*/_search "query_string" : { "query": "+(eggplant | potato)", "default_field": "_all", "default_operator": "and" }
  • 13. problems: early signs. OutOfMemoryError: field data exceeded jvm heap; shard count was in the thousands; garbage collector could not free memory. CircuitBreakerException: field data exceeded jvm heap; search results exceeded jvm heap. slow searches (latency increased from seconds to minutes); nodes became unresponsive; frequent GC pauses.
  • 14. cluster down: index corruption; data loss; nodes failed to restart.
  • 15. next steps: identify and fix problems... shard capacity: understand data & searches; size based on actual usage. field data: monitor; identify the producers; reduce usage. search: identify bottlenecks; optimize. cluster: find failure points; make topology changes; make hardware changes.
  • 16. shard capacity: shard is the unit of scale. how much data can a single shard hold? find the single shard breaking point: 1. loaded a single shard with data 2. ran typical searches 3. recorded search response time 4. repeated until response time became unacceptable. learned... 1 shard can handle a lot of data; actually it held ~5x more data; didn't need 5 shards per index; didn't need weekly/daily indices.
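The four-step procedure on this slide can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the author's tooling: `find_breaking_point`, `load_batch`, and `measure_search_seconds` are hypothetical names, and the two callables stand in for whatever bulk-loading and search-timing code talks to the single-shard test index. The latency model in the demo is invented purely so the loop can run without a cluster.

```python
from typing import Callable, List, Tuple


def find_breaking_point(
    load_batch: Callable[[int], None],
    measure_search_seconds: Callable[[], float],
    batch_size: int,
    max_acceptable_seconds: float,
    max_batches: int = 1000,
) -> Tuple[int, List[Tuple[int, float]]]:
    """Load one shard in batches until typical searches become too slow.

    Returns the document count at the last acceptable measurement and the
    full (docs_loaded, seconds) history for later inspection.
    """
    history: List[Tuple[int, float]] = []
    docs_loaded = 0
    last_acceptable = 0
    for _ in range(max_batches):
        load_batch(batch_size)                # step 1: load data
        docs_loaded += batch_size
        elapsed = measure_search_seconds()    # steps 2-3: search and record
        history.append((docs_loaded, elapsed))
        if elapsed > max_acceptable_seconds:  # step 4: stop when unacceptable
            break
        last_acceptable = docs_loaded
    return last_acceptable, history


if __name__ == "__main__":
    # Simulated shard: latency grows linearly with size (hypothetical model).
    state = {"docs": 0}

    def fake_load(n: int) -> None:
        state["docs"] += n

    def fake_measure() -> float:
        return state["docs"] / 1_000_000

    print(find_breaking_point(fake_load, fake_measure, 1_000_000, 1.5))
```

Passing the loader and the timer as callables keeps the loop independent of any particular client library, so the same function can drive a real test index or a simulation.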
  • 17. field data: which fields and indices are using a lot of field data? use the stats API to find out. culprits... fields used for sorting & aggregation; high cardinality fields; id-cache for parent-child relationships. field data is loaded the first time a field is accessed; field data is maintained per-index; field data is not GC'd. # Node Stats curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human' # Indices Stats curl -XGET 'http://localhost:9200/_stats/fielddata/?human'
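Once the node stats response is in hand, ranking the heaviest fields is a short aggregation. The sketch below is illustrative only. It assumes the per-field breakdown was requested (for example by adding a `fields=*` parameter to the node stats call) and that the JSON has the shape `nodes -> <node id> -> indices -> fielddata -> fields -> <field> -> memory_size_in_bytes`; check that shape against the Elasticsearch version in use. `rank_fielddata_fields` and the sample field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_fielddata_fields(node_stats: dict, top: int = 10) -> List[Tuple[str, int]]:
    """Sum field data bytes per field across all nodes, largest first."""
    totals: Dict[str, int] = defaultdict(int)
    for node in node_stats.get("nodes", {}).values():
        fielddata = node.get("indices", {}).get("fielddata", {})
        for field, stats in fielddata.get("fields", {}).items():
            totals[field] += stats.get("memory_size_in_bytes", 0)
    # Sort by bytes descending and keep only the heaviest fields.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    # Hand-written sample in the assumed response shape (not real cluster output).
    sample = {
        "nodes": {
            "node-a": {"indices": {"fielddata": {"fields": {
                "user_id": {"memory_size_in_bytes": 500},
                "status": {"memory_size_in_bytes": 10},
            }}}},
            "node-b": {"indices": {"fielddata": {"fields": {
                "user_id": {"memory_size_in_bytes": 300},
                "created": {"memory_size_in_bytes": 50},
            }}}},
        }
    }
    for name, size in rank_fielddata_fields(sample):
        print(name, size)
```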
  • 18. search: what are the bottlenecks and resource killers? searching all indices is slow, CPU intensive, and causes field data to be loaded for every index. # Searches all indices /indexname-*/_search # Searches specific indices /indexname-2015/_search. query_string is flexible but allows inefficient searches, such as leading wildcard searches, and searches the _all field by default. { "query_string" : { "default_field" : "_all", "allow_leading_wildcard" : "true", "query" : "this AND that OR thus" } }
  • 19. cluster: why is the cluster crashing? field data used up 70-90% of the heap memory; not much heap left for node & shard management; stop-the-world Garbage Collector (GC) pauses made the cluster unresponsive; nodes dropped out of the cluster; the G1 GC had longer pauses than the CMS GC; sorting, aggregations, and the id-cache for parent-child relationships used up a lot of heap memory; managing too many shards used a lot of heap memory.
  • 20. lessons learned... number of shards / node should not exceed the number of CPU cores; figure out the single shard capacity; monitor field data usage; field data usage is permanent and does not get garbage collected; field data usage that is too high will bring down the cluster; search specific indices by target date range; tune and test all search API searches; split cluster into data, client and master nodes; use the default ES JVM settings and garbage collector.
  • 21. improvements. hardware: used "large" servers; servers had lots of CPUs & RAM; non-RAIDed spinning disks; put master and client nodes on same servers. cluster: 8 dedicated nodes (was 5); dedicated master nodes (was: all nodes are master); dedicated data nodes (was: all nodes store data); dedicated client nodes (was: all nodes sort & aggregate).
  • 22. improvements. shards: default shard count didn't work; 1 primary + 1 replica (was 5 primary); # of primary shards less than # of CPU cores (was: unlimited primary shards / node). indices: data was chronological; used the time-based index strategy; monthly indices for transaction logs (was weekly); monthly indices for audit logs (was daily).
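The effect of these shard and index changes on total shard count is simple arithmetic: every index contributes its primaries plus one copy of each primary per replica. The sketch below works that through. The two-year retention window is a hypothetical figure chosen only to make the numbers concrete (the slides do not state how much history was kept), and `total_shards` is an illustrative helper name.

```python
def total_shards(index_count: int, primaries: int, replicas: int) -> int:
    """Shards in the cluster: each primary plus its replica copies, per index."""
    return index_count * primaries * (1 + replicas)


if __name__ == "__main__":
    # Hypothetical retention: two years of transaction logs and audit logs.
    weekly_indices = 104    # 52 weeks x 2 years
    daily_indices = 730     # 365 days x 2 years
    monthly_indices = 24    # 12 months x 2 years, per log type

    # Initial approach: weekly + daily indices, 5 primaries + 1 replica each.
    before = total_shards(weekly_indices + daily_indices, primaries=5, replicas=1)
    # Improved approach: monthly indices for both log types, 1 primary + 1 replica.
    after = total_shards(monthly_indices * 2, primaries=1, replicas=1)

    print("initial approach shards:", before)
    print("improved approach shards:", after)
```

Under that assumed retention the initial layout lands in the thousands of shards, which is consistent with the problem reported earlier in the deck.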
  • 23. improvements. memory: dedicated 31 GB to the jvm heap; used remaining memory for file system cache; turned off linux process swapping; maxed out linux file descriptors; used stable CMS GC (was: new G1 GC).
  • 24. improvements. index mappings: indexed 40 fields (was: all fields); stored big documents with 60+ fields; nested documents; parent-child relationships; used field aliases to define alternate fields used in sorting and aggregation; used doc_values on sortable & aggregation fields; changed boolean data type to string. "field": { "index": "no" } # uses field data "fieldA": { "type": "boolean" } # uses doc_values (no field data) "fieldA": { "type": "string", "index": "analyzed", "fields": { "raw" : { "type" : "string", "index" : "not_analyzed", "fielddata": { "format": "doc_values" } } } }
  • 25. improvements. searches: target specific indices (was: search all indices); simple_query_string (was: query_string); search on some fields (was: all); sorting & aggregations on low cardinality fields (was: all); filters (was: range queries); nested queries (was: parent-child); added query timeouts. GET /index-201501/_search "simple_query_string" : { "query": "+(eggplant | potato)", "fields": ["field1", "field2"], "default_operator": "and" }
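Several of these search improvements can be combined in one request builder: pick only the monthly indices that overlap the target date range, use `simple_query_string` against a named list of fields, and attach a timeout. The sketch below is one possible shape, not the author's code. It assumes indices are named `<prefix>-YYYYMM`, as in `/index-201501/_search`; `monthly_indices`, `build_search`, and the `2s` timeout value are hypothetical.

```python
import json
from datetime import date
from typing import Dict, List, Tuple


def monthly_indices(prefix: str, start: date, end: date) -> List[str]:
    """Names of the monthly indices covering start..end inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def build_search(prefix: str, start: date, end: date, text: str,
                 fields: List[str], timeout: str = "2s") -> Tuple[str, Dict]:
    """Return the request path and JSON body for a targeted search."""
    # Comma-separated index names limit the search to the chosen months.
    path = "/" + ",".join(monthly_indices(prefix, start, end)) + "/_search"
    body = {
        "timeout": timeout,
        "query": {
            "simple_query_string": {
                "query": text,
                "fields": fields,
                "default_operator": "and",
            }
        },
    }
    return path, body


if __name__ == "__main__":
    path, body = build_search("index", date(2014, 11, 15), date(2015, 1, 3),
                              "+(eggplant | potato)", ["field1", "field2"])
    print("GET", path)
    print(json.dumps(body, indent=2))
```

Building the index list from the date range keeps wildcard patterns like `index-*` out of routine searches, so field data is only loaded for the months actually queried.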