Elasticsearch in production

Video available at http://www.youtube.com/watch?v=gkdfNl0WL-A
Original slides at http://presentations.found.no/berlin-buzzwords-2013/

This talk covers some of the lessons we've learned from securing and herding hundreds of Elasticsearch clusters. It is applicable whether you operate Elasticsearch in your own infrastructure, in the cloud, or if you're a developer who wants a better understanding of Elasticsearch's various failure modes.

Elasticsearch easily lets you develop amazing things, and it has gone to great lengths to make Lucene's features readily available in a distributed setting. However, when it comes to running Elasticsearch in production, you still have a fairly complicated system on your hands: a system with high expectations on network stability, a huge appetite for memory, and a system that assumes all users are trustworthy.

Instead of delving deeply into a few specifics, we give a brief overview of problems you are likely to run into and suggested solutions to these problems. We cover topics that are applicable to both developers and users with Elasticsearch clusters of every shape and size – with an emphasis on resiliency and security. Basic familiarity with Elasticsearch is assumed.

Elasticsearch in production
Alex Brasetvik
@alexbrasetvik

Who?
Co-founder of Found AS
7+ years of search, 2+ Elasticsearch
We manage hundreds of Elasticsearch
clusters
… on Amazon's cloud

Agenda
Memory (and stability)
Security (and multi-tenancy)
Networking (and reliability)
Client (and resiliency)

Memory
Search engines crave memory
Caches, caches, caches
Field- and filter caches
Page cache
Index building

PostgreSQL
Verifies resource usage
Safe >>> fast
Uses disk if necessary

Elasticsearch trusts you
Built for speed
It'll jump if you ask it to
What could possibly go wrong?

OutOfMemoryError
Woah there
I ate all the memories
Your cluster may or may not work any more

May or may not work?
What else was happening at the time?
Corrupt cluster state, crashed Netty, …
In short: Don't end up there

Warning signs?
Monitor cache sizes and heap space
Outgrowing page cache: gradual slowdown
Outgrowing heap space: sudden crash
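
One way to watch for the heap warning sign is to compare used heap against maximum heap per node. The sketch below assumes a response shaped like the nodes stats API of recent Elasticsearch versions (`jvm.mem.heap_used_in_bytes` and `jvm.mem.heap_max_in_bytes`); older versions may name these fields differently. The threshold and the sample data are hypothetical.

```python
from typing import Dict, List

# Hypothetical alert threshold; choose one based on your own load tests.
HEAP_WARN_PERCENT = 75.0


def heap_used_percent(node_stats: Dict) -> float:
    """Return heap usage as a percentage for one node's stats entry."""
    mem = node_stats["jvm"]["mem"]
    return 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]


def nodes_over_threshold(stats: Dict, threshold: float = HEAP_WARN_PERCENT) -> List[str]:
    """List the names of nodes whose heap usage is at or above the threshold."""
    return sorted(
        node.get("name", node_id)
        for node_id, node in stats["nodes"].items()
        if heap_used_percent(node) >= threshold
    )


# Hand-written sample in the assumed response shape (not real cluster output).
GIB = 2**30
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"mem": {"heap_used_in_bytes": 3 * GIB, "heap_max_in_bytes": 4 * GIB}}},
        "b2": {"name": "node-2", "jvm": {"mem": {"heap_used_in_bytes": 1 * GIB, "heap_max_in_bytes": 4 * GIB}}},
    }
}
print(nodes_over_threshold(sample))
```

Because running out of heap shows up as a sudden crash rather than a gradual slowdown, a check like this is only useful if it alerts well before the limit.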

Understand the memory profile
Test realistically
Bound cache sizes and flush thresholds
v0.90+ takes you longer with field filters, etc.

Large heaps are expensive to garbage collect
Keep heap < 32GiB (But test!)
Lots of page cache is good, though!
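
The trade-off between heap and page cache can be sketched as a simple sizing helper. The 31 GiB ceiling and the "half of RAM" fraction below are illustrative assumptions, not figures from the talk; the slide's own advice is only to stay under 32 GiB and to test.

```python
GIB = 2**30

# Assumptions for illustration, not figures from the talk:
# give the heap at most half of RAM, and stay a little under 32 GiB.
HEAP_CEILING_BYTES = 31 * GIB
HEAP_FRACTION_OF_RAM = 0.5


def suggest_heap_bytes(total_ram_bytes: int) -> int:
    """Return a starting heap size; the rest of RAM is left to the page cache."""
    return min(int(total_ram_bytes * HEAP_FRACTION_OF_RAM), HEAP_CEILING_BYTES)


for ram_gib in (16, 64, 128):
    heap_gib = suggest_heap_bytes(ram_gib * GIB) // GIB
    print(f"{ram_gib} GiB RAM: {heap_gib} GiB heap, {ram_gib - heap_gib} GiB left for page cache")
```

Treat the result as a starting point for realistic load tests, not as a final setting.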

Security
Elasticsearch trusts everyone
Not its job to do auth(z)
You're the gatekeeper

_search
Read only?
Limit indexes / wrap with filters?
Protect the field caches
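
A gatekeeper in front of `_search` could enforce both ideas: an index allowlist and a mandatory filter around whatever query the user sends. This is a minimal sketch, assuming the `bool` query with a `filter` clause from recent Query DSL versions (older releases used a different wrapper query). The index names, the `tenant_id` field, and the helper names are hypothetical.

```python
from typing import Dict, Iterable

# Hypothetical per-tenant configuration.
ALLOWED_INDEXES = {"tenant-42-products", "tenant-42-articles"}
TENANT_FILTER = {"term": {"tenant_id": "42"}}


def check_indexes(requested: Iterable[str]) -> None:
    """Reject any index name that is not explicitly allowlisted.

    Wildcards and names such as "_all" fail because they are not in the set.
    """
    for name in requested:
        if name not in ALLOWED_INDEXES:
            raise PermissionError(f"index not allowed: {name}")


def wrap_with_filter(user_body: Dict, mandatory_filter: Dict) -> Dict:
    """Return a new search body whose query is AND-ed with a mandatory filter."""
    user_query = user_body.get("query", {"match_all": {}})
    wrapped = dict(user_body)  # shallow copy; the caller's body is not modified
    wrapped["query"] = {"bool": {"must": [user_query], "filter": [mandatory_filter]}}
    return wrapped


check_indexes(["tenant-42-products"])
safe_body = wrap_with_filter({"query": {"match": {"title": "shoes"}}, "size": 10}, TENANT_FILTER)
print(safe_body)
```

Wrapping limits which documents a query can match; it does not by itself limit how much memory the rest of the request body (sorting, faceting, and so on) can consume.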

Arbitrary code execution
Elasticsearch has powerful scripting
Not sandboxed
On by default
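
If untrusted request bodies must be accepted at all, a proxy can at least refuse anything that looks like a script. The sketch below walks a parsed JSON body and flags script-related keys. The key names in `SCRIPT_KEYS` are assumptions and will need adjusting for the Elasticsearch version in use; a denylist like this complements disabling dynamic scripting on the server and does not replace it.

```python
from typing import Any

# Assumed key names; extend the set for the Elasticsearch version you run.
SCRIPT_KEYS = {"script", "script_fields", "script_score"}


def contains_script(node: Any) -> bool:
    """Return True if any dict key anywhere in the parsed body is script-related."""
    if isinstance(node, dict):
        return any(
            key in SCRIPT_KEYS or contains_script(value)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(contains_script(item) for item in node)
    return False


plain = {"query": {"match": {"title": "shoes"}}}
scripted = {"query": {"match_all": {}}, "script_fields": {"x": {"script": "1 + 1"}}}
print(contains_script(plain), contains_script(scripted))
```

The check is deliberately conservative: a document field that happens to be named `script` would also be rejected.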

Any website can reach your machine
http://127.0.0.1:9200/_search?callback=capture&source=…
Run in a virtual machine

Networking
Elasticsearch is distributed
Easy (for a distributed system)
Supports many usage patterns.

Quite common topology
High availability, right?

Obey or risk split brains …
… and irrecoverable data-loss
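
The rule being alluded to is usually stated as a majority quorum: a node may only take part in electing a master if it can see more than half of the master-eligible nodes. In versions that use Zen discovery this is configured with `discovery.zen.minimum_master_nodes`. The sketch below only does the arithmetic, and the function name is hypothetical.

```python
def majority(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that is strictly more than half of the total."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (2, 3, 4, 5):
    quorum = majority(n)
    print(f"{n} nodes: quorum {quorum}, tolerates {n - quorum} failure(s)")
```

With two nodes the quorum is two, so the cluster tolerates no failures; going from two to three nodes is what first buys availability.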

Stormy clouds
Zone vs instance failure
Thundering herds
Optimizing MTTR is not HA
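
One common mitigation for thundering herds, not necessarily the one used in the talk, is for clients to retry with exponential backoff plus random jitter so they do not all reconnect at the same moment. The sketch below is a generic helper: the function names, the default parameters, and the choice of `ConnectionError` as the retryable failure are assumptions.

```python
import random
import time
from typing import Callable, Iterator


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield jittered delays: a random fraction of min(cap, base * 2**attempt)."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2**attempt)


def retry(operation: Callable[[], object], attempts: int = 5,
          sleep: Callable[[float], None] = time.sleep) -> object:
    """Call an idempotent operation, sleeping a jittered delay between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = list(backoff_delays(attempts - 1))
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last failure
            sleep(delays[attempt])
```

This only makes sense for requests that are safe to repeat; retrying a non-idempotent write can apply it twice.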

Client considerations
Idempotent/retry-able requests
Use a connection pool.
_bulk / _msearch
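
Idempotency and `_bulk` fit together: if every document carries an explicit ID chosen by the client, resending the same batch after a timeout overwrites the same documents instead of creating duplicates. The sketch below builds a newline-delimited `_bulk` body. The helper name and the sample index and documents are hypothetical, and older Elasticsearch versions also expect a `_type` in each action line.

```python
import json
from typing import Dict, Iterable, Tuple


def bulk_index_body(index: str, docs: Iterable[Tuple[str, Dict]]) -> str:
    """Build a _bulk request body: one action line, then one source line, per doc."""
    lines = []
    for doc_id, source in docs:
        # An explicit _id makes a repeated request overwrite rather than duplicate.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)


body = bulk_index_body("logs", [("a", {"n": 1}), ("b", {"n": 2})])
print(body, end="")
```

Sending such a body through a pooled connection keeps per-request overhead low as well.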

Have enough memory
Have a majority of nodes
Don't allow arbitrary search requests
Use retryable requests

Alex over Trondheim, Tore Helgedagsrud
Elephant, Roy Costello
Wingsuit, Richard Schneider
Lightning Storm and Stars, Justin Ennis
Wingsuit flock, Richard Schneider
Oh salad, you so funny, Eatliver

Elasticsearch in production

1. Elasticsearch in production Alex Brasetvik @alexbrasetvik

2. How marketing thinks our users feel

3. How we developers sometimes feel

4. Who? Co-founder of Found AS 7+ years of search, 2+ Elasticsearch We manage hundreds of Elasticsearch clusters … on Amazon's cloud

5. Agenda Memory (and stability) Security (and multi-tenancy) Networking (and reliability) Client (and resiliency)

6. Memory Search engines crave memory Caches, caches, caches Field- and filter caches Page cache Index building

7. PostgreSQL Verifies resource usage Safe >>> fast Uses disk if necessary

8. Elasticsearch trusts you Built for speed It'll jump if you ask it to What could possibly go wrong?

9. OutOfMemoryError Woah there I ate all the memories Your cluster may or may not work any more

10. May or may not work? What else was happening at the time? Corrupt cluster state, crashed Netty, … In short: Don't end up there

11. Warning signs? Monitor cache sizes and heap space Outgrowing page cache: gradual slowdown Outgrowing heap space: sudden crash

12. Understand the memory profile Test realistically Bound cache sizes and flush thresholds v0.90+ takes you longer with field filters, etc.

13. Large heaps are expensive to garbage collect Keep heap < 32GiB (But test!) Lots of page cache is good, though!

14. Security Elasticsearch trusts everyone Not its job to do auth(z) You're the gatekeeper

15. _search Read only? Limit indexes / wrap with filters? Protect the field caches

16. Arbitrary code execution Elasticsearch has powerful scripting Not sandboxed On by default

17. Any website can reach your machine http://127.0.0.1:9200/_search?callback=capture&source=… Run in a virtual machine

18. Networking Elasticsearch is distributed Easy (for a distributed system) Supports many usage patterns.

19. Quite common topology High availability, right?

20. Obey or risk split brains … … and irrecoverable data-loss

21. +1 is a "tie breaker"

22. Stormy clouds Zone vs instance failure Thundering herds Optimizing MTTR is not HA

23. Client considerations Idempotent/retry-able requests Use a connection pool. _bulk / _msearch

24. Have enough memory Have a majority of nodes Don't allow arbitrary search requests Use retryable requests

25.

26. Alex over Trondheim, Tore Helgedagsrud Elephant, Roy Costello Wingsuit, Richard Schneider Lightning Storm and Stars, Justin Ennis Wingsuit flock, Richard Schneider Oh salad, you so funny, Eatliver
