
Elastic 101 tutorial - Percona Europe 2018


Elasticsearch is well known as a highly scalable search engine that stores data in a structure optimized for language-based searches, but its capabilities and use cases don't stop there. In this tutorial, I'll give you a hands-on introduction to Elasticsearch and a glimpse of some of its fundamental concepts.
Database administration is challenging, and Elasticsearch is no exception to that rule. In this tutorial we will cover administrative topics such as installation and configuration, cluster/node management, index management, and monitoring cluster health. Building applications on top of Elasticsearch is also challenging and raises concerns about schema design, so we will also cover developer-oriented topics such as mappings and analysis, aggregations, and schema design that will help you build a robust application on top of Elasticsearch.
There will be lab sessions at the end of some chapters so please have your laptops with you.


  1. 1. Elastic 101 Antonios Giannopoulos DBA @ Rackspace/ObjectRocket Alex Cercel DBA @ Rackspace/ObjectRocket Mihai Aldoiu CDE @ Rackspace/ObjectRocket | | 1
  2. 2. Introduction 2 Antonios Giannopoulos Alex Cercel Mihai Aldoiu
  3. 3. Overview • Introduction • Working with data • Scaling the cluster • Operating the cluster • Troubleshooting the cluster • Upgrade the cluster • Security best practices • Working with data – Advanced operations • Best Practices 3
  4. 4. 4 Labs 1. Unzip the provided .vmdk file 2. Install and/or open VirtualBox 3. Select New 4. Enter a name 5. Select Type: Linux 6. Select Version: Red Hat (64-bit) 7. Set Memory to at least 4096 MB (more won’t hurt) 8. Select "Use an existing ... disk file", then select the provided .vmdk file 9. Select Create 10. Select Start 11. Login with username: elasticuser, password: elasticuser 12. Navigate to /Percona2018/Lab01 for the first lab.
  5. 5. Introduction ● Key Terms ● Installation ● Configuration files ● JVM fundamentals ● Lucene basics 5
  6. 6. What is elasticsearch? 6 Lucene: - A search engine library entirely written in Java - Developed in 1999 by Doug Cutting - Suitable for any application that requires full text indexing and searching capability But: - Challenging to use - Not originally designed for scaling Elasticsearch: - Built on top of Lucene - Provides scaling - Language independent
  7. 7. What is the ELK stack? 7 ElasticSearch: - The main datastore - Provides distributed search capabilities Logstash: - Parses & transforms data for ingestion - Ingests from multiple sources simultaneously Kibana: - An analytics and visualization platform - Search, visualize & interact with Elasticsearch data
  8. 8. Installing Elasticsearch 8 Download: links to the latest and older versions are on the elastic.co downloads page. The simplest way: 1) wget the elasticsearch-6.3.2.tar.gz tarball 2) wget the matching .sha512 checksum file 3) shasum -a 512 -c elasticsearch-6.3.2.tar.gz.sha512 (it should return elasticsearch-6.3.2.tar.gz: OK) 4) tar -xzf elasticsearch-6.3.2.tar.gz
  9. 9. Installing Java 9 ElasticSearch requires a JRE (Java SE Runtime Environment) or a JDK (Java Development Kit) - OpenJDK CentOS: yum install java-1.8.0-openjdk - OpenJDK Ubuntu: apt-get install openjdk-8-jre ES version 6 requires Java 8 or higher. Set JAVA_HOME appropriately: - Create a file under /etc/profile.d, for example - Add the following lines: export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-*" export PATH=$JAVA_HOME/bin:$PATH
  10. 10. Start the server 10 Create a user elasticuser* Using elasticuser execute: bin/elasticsearch After some noise: [INFO ][o.e.n.Node] [name] started How do I know it is up and running? *You can’t start ES as root $ curl -X GET "localhost:9200/" { "name" : "KG-_6s9", "cluster_name" : "elasticsearch", "cluster_uuid" : "T9uHpto6QtWRmsjzNFrReA", "version" : { "number" : "6.3.2", "build_flavor" : "default", "build_type" : "tar", "build_hash" : "053779d", "build_date" : "2018-07-20T05:20:23.451332Z", "build_snapshot" : false, "lucene_version" : "7.3.1", "minimum_wire_compatibility_version" : "5.6.0", "minimum_index_compatibility_version" : "5.0.0" }, "tagline" : "You Know, for Search" }
  11. 11. Explore the directories 11 Folder | Description | Setting: bin - contains the binary scripts, like elasticsearch; config - contains the configuration files (ES_PATH_CONF); data - holds the data, i.e. shards/indexes (; lib - contains JAR files; logs - contains the log files (path.logs); modules - contains the modules; plugins - contains the plugins, each plugin has its own subdirectory
  12. 12. Configuration files 12 elasticsearch.yml - The primary way of configuring a node - It’s a template that lists the most important settings for a production cluster jvm.options - JVM-related options log4j2.properties - Elasticsearch uses Log4j 2 for logging Variables can be set either: - using the configuration file jvm.options: -Xms512m - or on the command line: ES_JAVA_OPTS="-Xms512m" ./bin/elasticsearch
  13. 13. Elasticsearch.yml 13 - Every node should have a unique - Set it to something meaningful (aws-zone1-objectrocket-es-01) - A cluster is a set of nodes sharing the same - Set it to something meaningful (production, qa, staging) - Path to the directory where the data is stored (accepts multiple locations) path.logs - Path to the log files
  14. 14. Elasticsearch.yml 14 production dc1-prd-es1 /data/es1 path.logs: /logs/es1 Start it: bin/elasticsearch -d -p '' Then verify: $ curl -X GET "localhost:9200/" { "name" : "dc1-prd-es1", "cluster_name" : "production", …
  15. 15. jvm.Options 15 Each Elasticsearch node runs on its own JVM instance. The JVM is a virtual machine that enables a computer to run Java programs. The most important setting is the heap size: - Xms: the initial size of the total heap space - Xmx: the maximum size of the total heap space Best practices - Set Xms and Xmx to the same size - Set Xmx to no more than 50% of your physical RAM - Do not set Xms and Xmx over ~30 GiB - Use the server version of OpenJDK - Lock the RAM for the heap (bootstrap.memory_lock: true)
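The heap-sizing rules of thumb above can be captured in a small helper. This is an illustrative sketch (the function name and the exact ~31 GiB cutoff for compressed object pointers are my own choices, not from the slides):

```python
def recommended_heap_gib(physical_ram_gib: float) -> float:
    """Apply the two rules of thumb:
    - use no more than 50% of physical RAM for the heap,
    - never go above ~31 GiB (stays under the compressed-oops threshold)."""
    COMPRESSED_OOPS_CUTOFF_GIB = 31
    return min(physical_ram_gib / 2, COMPRESSED_OOPS_CUTOFF_GIB)

# For a 16 GiB host this suggests 8 GiB, i.e. -Xms8g -Xmx8g (same value for both)
print(recommended_heap_gib(16))   # 8.0
# For a 128 GiB host the cutoff wins: 31 GiB, not 64 GiB
print(recommended_heap_gib(128))  # 31
```

Setting Xms and Xmx to the same value avoids heap resizing pauses at runtime, which is why both flags get the computed number.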
  16. 16. jvm.Options 16 What lives on the heap vs. off heap. On the heap: indexing buffer, completion suggester, cluster state, … and more, plus caches: - query cache (10% of the heap) - field data cache (unbounded) - …
  17. 17. jvm.Options 17 Garbage collector - It is a form of automatic memory management - Gets rid of objects which are not being used by a Java application anymore - Automatically reclaims memory for reuse Garbage collectors - ConcMarkSweepGC (CMS) - G1GC (has some Issues with JDK 8) Elasticsearch uses -XX:+UseConcMarkSweepGC GC threads -XX:ParallelGCThreads=N, where N varies on the platform -XX:ParallelCMSThreads=N , where N varies on the platform
  18. 18. jvm.Options 18 JVM heap (-Xms/-Xmx) layout: New Gen (-Xmn) = Eden plus survivor spaces s0 and s1; Old Generation; Perm (XX:PermSize / XX:MaxPermSize). A Minor GC collects the new generation; a Major (full) GC collects the old generation. 1) A new object is allocated in Eden 2) If it survives a GC, it moves to s0/s1 3) After multiple GCs, when s0 or s1 fills up, surviving objects move to the Old Generation
  19. 19. OS settings 19 Disable swap - sysctl vm.swappiness=1 - Remove Swap File descriptors - Set nofile to 65536 - curl -X GET ”<host>:<port>/_nodes/stats/process?filter_path=**.max_file_descriptors” Virtual Memory - sysctl -w vm.max_map_count=262144 Max user process - nproc to 4096 DNS cache settings - networkaddress.cache.ttl=<timeout> - networkaddress.cache.negative.ttl=<timeout>
  20. 20. Network settings 20 Two network communication mechanisms in Elasticsearch - HTTP: which is how the Elasticsearch REST APIs are exposed - Transport: used for internal communication between nodes within the cluster Node 1 Client Node 2 HTTP Transport
  21. 21. Network settings 21 The REST APIs of Elasticsearch are exposed over HTTP - The HTTP module binds to localhost by default - Configure with http.host in elasticsearch.yml - Default port is the first available between 9200-9299 - Configure with http.port in elasticsearch.yml Each call that goes from one node to another uses the transport module - Transport binds to localhost by default - Configure with transport.host in elasticsearch.yml - Default port is the first available between 9300-9399 - Configure with transport.tcp.port in elasticsearch.yml
  22. 22. Network settings 22 network.host sets the bind host and the publish host at the same time network.bind_host - Defaults to network.host (can bind to multiple interfaces) network.publish_host - Defaults to the “best” address from network.host (one interface only) Special values: _[networkInterface]_ - addresses of a network interface, for example _en0_; _local_ - any loopback addresses on the system; _site_ - any site-local addresses on the system; _global_ - any globally-scoped addresses on the system
  23. 23. Network settings 23 Zen discovery - the built-in & default discovery module - Provides unicast discovery - Uses the transport module In elasticsearch.yml: discovery.zen.ping.unicast.hosts: ["node1", "node2"] How a node joins: 1) Retrieves IPs/hostnames from the list of hosts 2) Tries all hosts until it finds a reachable one 3) If the cluster name matches, it joins the cluster 4) If not, it starts its own cluster
  24. 24. Bootstrap tests 24 Development mode: if it does not bind transport to an external interface (the default) Production mode: if it does bind transport to an external interface Bypass production mode: Set discovery.type to single-node Bootstrap Tests - Inspect a variety of Elasticsearch and system settings - A node in production mode must pass all Bootstrap tests to start - es.enforce.bootstrap.checks=true on jvm.options - Highly recommended to have this setting enabled
  25. 25. Bootstrap tests 25 List of Bootstrap Tests - Heap size check - File descriptor check - Memory lock check - Maximum number of threads check - Max file size check - Maximum size virtual memory check - Maximum map count check - Client JVM check - Use serial collector check - System call filter check - OnError and OnOutOfMemoryError checks - Early-access check - G1GC check - All permission check
  26. 26. Lucene 26 Lucene uses a data structure called an Inverted Index. An inverted index inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages). It allows fast full-text searches, at the cost of increased processing when a document is added to the database. Example documents: 1) Give us your name 2) Give us your home number 3) Give us your home address. Term | Frequency | Location: give 3 1,2,3; us 3 1,2,3; your 3 1,2,3; name 1 1; number 1 2; home 2 2,3; address 1 3
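The frequency/location table above can be reproduced with a few lines of Python. This is a minimal sketch of building an inverted index (term -> set of document ids), using trivial whitespace analysis rather than Lucene's real tokenization:

```python
from collections import defaultdict

docs = {
    1: "Give us your name",
    2: "Give us your home number",
    3: "Give us your home address",
}

# term -> set of doc ids that contain it (the "Location" column);
# the set's size is the document frequency (the "Frequency" column)
postings = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():   # trivial analysis: lowercase + split
        postings[term].add(doc_id)

print(sorted(postings["home"]), len(postings["home"]))  # [2, 3] 2
print(sorted(postings["give"]), len(postings["give"]))  # [1, 2, 3] 3
```

A search for "home" now only has to look up one dictionary key instead of scanning every document, which is exactly the trade-off the slide describes: extra work at index time, fast lookups at query time.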
  27. 27. Lucene – Key Terms 27 A Document is the unit of search and index. A Document consists of one or more Fields. A Field is simply a name-value pair. An index consists of one or more Documents. Indexing: involves adding Documents to an Index Searching: - involves retrieving Documents from an index. - Searching requires an index to have already been built - Returns a list of Hits
  28. 28. Kibana 28 Download: the latest version is on the elastic.co downloads page. Simplest way to install it: wget the kibana-6.3.2-linux-x86_64.tar.gz tarball, verify it with shasum -a 512 kibana-6.3.2-linux-x86_64.tar.gz, then tar -xzf kibana-6.3.2-linux-x86_64.tar.gz Run Kibana: kibana-6.3.2-linux-x86_64/bin/kibana Access Kibana: http://localhost:5601
  29. 29. Kibana - Devtools 29
  30. 30. 30 Lab 1 Install and configure Elastic Objectives: Learn how to install and configure a standalone Elastic instance. Steps: 1. Navigate to /Percona2018/Lab01 2. Read the instructions on Lab01.txt
  31. 31. Working with Data ● Indexes ● Shards ● CRUD Operations ● Read Operations ● Mappings ● Analyzers 31
  32. 32. Working with Data - Index 32 • An index in Elasticsearch is a logical way of grouping data: ‒ an index has a mapping that defines the fields in the index ‒ an index is a logical namespace that maps to where its contents are stored in the cluster • There are two different concepts in this definition: ‒ an index has some type of data schema mechanism ‒ an index has some type of mechanism to distribute data across a cluster
  33. 33. An index means .... 33 In the Elasticsearch world, index is used as a: ‒ Noun: a document is put into an index in Elasticsearch ‒ Verb: to index a document is to put the document into an index in Elasticsearch { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," } { "type":"line", "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" } { "type":"line", "line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils"} My_index Documents are indexed to an index
  34. 34. Define an index 34 • Clients communicate with a cluster using Elasticsearch’s REST APIs • An index is defined using the Create Index API, which can be accomplished with a simple PUT command # curl -XPUT 'http://localhost:9200/my_index' -i HTTP/1.1 200 OK content-type: application/json; charset=UTF-8 content-length: 48 {"acknowledged":true,"shards_acknowledged":true}
  35. 35. Shard 35 • A shard is a single piece of an Elasticsearch index ‒ Indexes are partitioned into shards so they can be distributed across multiple nodes • Each shard is a standalone Lucene index ‒ The default number of shards for an index is 5. Number of shards can be changed at index creation time. My_index 0 2 4 3 1 Node 1 Node 2
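Which of those shards a given document lands on is decided by Elasticsearch's routing formula, shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document id. A minimal sketch follows; note the real implementation uses a murmur3 hash, and md5 here is just a deterministic stand-in for illustration:

```python
import hashlib

def shard_for(doc_id: str, number_of_primary_shards: int = 5) -> int:
    """Illustrative routing: hash the routing value (the doc id by default)
    and take it modulo the number of primary shards."""
    # Elasticsearch actually uses murmur3 on the _routing value
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % number_of_primary_shards

# The same id always routes to the same shard. This is also why the number
# of primary shards is fixed at index creation: changing the modulus would
# re-route existing documents to different shards.
assert shard_for("line-4") == shard_for("line-4")
print(0 <= shard_for("line-4") < 5)  # True
```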
  36. 36. Working with Data - Document 36 Documents must be JSON objects. • A document can be any text or numeric data you want to search and/or analyze ‒ Specifically, a document is a top-level object that is serialized into JSON and stored in Elasticsearch • Every document has a unique ID ‒ which either you provide, or Elasticsearch generates one for you { "type":"line", "line_id":4, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.1", "speaker":"KING HENRY IV", "text_entry":"So shaken as we are, so wan with care," }
  37. 37. Index compression 37 • Elasticsearch compresses your documents during indexing ‒ documents are grouped into blocks of 16KB, and then compressed together using LZ4 by default ‒ if your documents are larger than 16KB, you will have larger chunks that contain only one document • You can change the compression to DEFLATE using the index.codec setting: ‒ reduced storage size at slightly higher CPU usage PUT my_index { "settings": { "number_of_shards": 3, "number_of_replicas": 2, "index.codec" : "best_compression" } }
  38. 38. Index a document 38 The Index API is used to index a document ‒ use a PUT or a POST and add the document in the body request ‒ notice we specify the index, the type and an ID ‒ if no ID is provided, elasticsearch will generate one # curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d '{ "line_id":5, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.2", "speaker":"KING HENRY IV", "text_entry":"Find we a time for frighted peace to pant" }' {"_index":"my_index","_type":"my_type","_id":"1","_version":1,"result":"created","_shar ds":{"total":2,"successful":2,"failed":0},"created":true}
  39. 39. Index without specifying an ID 39 You can leave off the id and let Elasticsearch generate one for you: ‒ But notice that only works with POST, not PUT ‒ The generated id comes back in the response # curl -XPOST 'http://localhost:9200/my_index/my_type/' -H 'Content-Type: application/json' -d ' {"line_id":6, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.3", "speaker":"KING HENRY IV", "text_entry":"And breathe short-winded accents of new broils" }' {"_index":"my_index","_type":"my_type","_id":"AWZIq227Unvtccn4Vvrz","_version":1,"resul t":"created","_shards":{"total":2,"successful":2,"failed":0},"created":true}
  40. 40. Reindexing a document 40 What do you think happens if we add another document with the same ID? curl -XPUT 'http://localhost:9200/my_index/my_type/1' -H 'Content-Type: application/json' -d ' { "new_field" : "new_value" }'
  41. 41. ...Overwrites the document 41 • The old field/value pairs of the document are gone ‒ the old document is deleted, and the new one gets indexed • Notice every document has a _version that is incremented whenever the document is changed # curl -XGET http://localhost:9200/my_index/my_type/1?pretty -H 'Content-Type: application/json' { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 2, "found" : true, "_source" : { "new_field" : "new_value" } }
  42. 42. The _create endpoint 42 If you do not want a document to be overwritten if it already exists, use the _create endpoint ‒ no indexing occurs and returns a 409 error message: # curl -XPUT 'http://localhost:9200/my_index/my_type/1/_create' -H 'Content-Type: application/json' -d ' {"new_field" : "new_value"}' {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[my_type][ 1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"}],"type":"version_conflict_engine_exception","rea son":"[my_type][1]: version conflict, document already exists (current version [2])","index_uuid":"JGY3Q_9NRjWe-wU- MlK44Q","shard":"3","index":"my_index"},"status":409}
  43. 43. Locking? 43 - Every indexed document has a version number - Elasticsearch uses optimistic concurrency control, without locking # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=3' -d '{ ... }' # 200 OK # curl -XPUT 'http://localhost:9200/my_index/my_type/1?version=2' -d '{ ... }' # 409 Conflict
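The 200-vs-409 behaviour above can be sketched as a tiny in-memory store. This is a hypothetical class for illustration only, mirroring the version check Elasticsearch performs without holding any locks:

```python
class VersionedStore:
    """Optimistic concurrency control: a write succeeds only if the caller's
    expected version matches the document's current version."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def put(self, doc_id, body, expected_version=None):
        current = self.docs.get(doc_id, (0, None))[0]
        if expected_version is not None and expected_version != current:
            return 409                      # version conflict, write rejected
        self.docs[doc_id] = (current + 1, body)
        return 200

store = VersionedStore()
store.put("1", {"a": 1})                             # version becomes 1
print(store.put("1", {"a": 2}, expected_version=1))  # 200 (version becomes 2)
print(store.put("1", {"a": 3}, expected_version=1))  # 409, stale version
```

No reader or writer ever blocks; a concurrent writer that lost the race simply gets a conflict back and must re-read and retry, which is exactly what the 409 response signals to an Elasticsearch client.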
  44. 44. The _update endpoint 44 To update fields in a document use the _update endpoint. - Make sure to add the “doc” context curl -XPOST 'http://localhost:9200/my_index/my_type/1/_update' -H 'Content-Type: application/json' -d ' { "doc": { "line_id":10, "play_name":"Henry IV", "speech_number":1, "line_number":"1.1.7", "speaker":"KING HENRY IV", "text_entry":"Nor more shall trenching war channel her fields" } }' {"_index":"my_index","_type":"my_type","_id":"1","_version":3,"result":"updated","_shar ds":{"total":2,"successful":2,"failed":0}}
  45. 45. Retrieve a document 45 Use GET to retrieve an indexed document ‒ Notice we specify the index, the type and an ID ‒ Returns a 200 code if document found or a 404 error if the document is not found # curl -XGET http://localhost:9200/my_index/my_type/1?pretty { "_index" : "my_index", "_type" : "my_type", "_id" : "1", "_version" : 1, "found" : true, "_source" : { "line_id" : 5, "play_name" : "Henry IV", "speech_number" : 1, "line_number" : "1.1.2", "speaker" : "KING HENRY IV", "text_entry" : "Find we a time for frighted peace to pant" } }
  46. 46. Deleting a document 46 Use DELETE to delete an indexed document ‒ response code is 200 if the document is found, 404 if not # curl -XDELETE 'http://localhost:9200/my_index/my_type/1/' -H 'Content-Type: application/json' {"found":true,"_index":"my_index","_type":"my_type","_id":" 1","_version":7,"result":"deleted","_shards":{"total":2,"su ccessful":2,"failed":0}}
  47. 47. A simple search 47 Use a GET request sent to the _search endpoint ‒ every document is a hit for this search ‒ by default, Elasticsearch returns 10 hits curl -s -XGET 'http://localhost:9200/my_index/my_type/_search' -H 'Content-Type: application/json' { "took" : 1, "timed_out" : false, …. }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ ... ] } } Search for all docs in my_index Number of ms it took to process the query Number of documents that were hits for this query Array containing documents hit by the search criteria
  48. 48. CRUD Operations Summary 48 Index PUT my_index/my_type/4 Create PUT my_index/my_type/4/_create { "speaker":"KING HENRY IV", "text_entry":"To be commenced in strands afar remote." } Read GET my_index/my_type/4 Update POST my_index/my_type/4/_update { "my_type" : { "text_entry":"No more the thirsty entrance of this soil" } } Delete DELETE my_index/my_type/4
  49. 49. Mapping – what is it? 49 • Elasticsearch will index any document without knowing its details (number of fields, their data types, etc.) - dynamic mapping ‒ However, behind the scenes Elasticsearch assigns data types to your fields in a mapping. Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. A mapping is a schema definition that contains: ‒ names of fields ‒ data types of fields ‒ how the field should be indexed and stored by Lucene • Mappings map your complex JSON documents into the simple flat documents that Lucene expects.
  50. 50. Defining a mapping 50 • In most use cases, you will need to define your own mappings, but it is not required. When you index a document, Elasticsearch dynamically creates or updates the mapping • Mappings are defined in the "mappings" section of an index. You can: ‒ define mappings at index creation, or ‒ add to a mapping of an existing index PUT my_index { "mappings": { define mapping here } }
  51. 51. Let's view a mapping 51 GET my_index/_mapping { "my_index" : { "mappings" : { "my_type" : { "properties" : { "line_id" : { "type" : "long" }, "line_number" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, "play_name" : { "type" : "text", "fields" : { "keyword" : { "type" : "keyword", "ignore_above" : 256 } } }, ... The “properties” section contains the fields and data types in your documents
  52. 52. Elasticsearch data types for fields 52 • Simple types, including: ‒ text: for full text (analyzed) strings ‒ keyword: for exact value strings ‒ date: strings formatted as dates, or numeric dates ‒ integer types: like byte, short, integer, long ‒ floating-point numbers: float, double, half_float, scaled_float ‒ boolean ‒ ip: for IPv4 or IPv6 addresses • Hierarchical types: like object and nested • Specialized types: geo_point, geo_shape and percolator • Range types and more
  53. 53. Updating existing mapping 53 • Existing field mappings cannot be updated. Changing the mapping would mean invalidating already indexed documents. - Instead, you should create a new index with the correct mappings and reindex your data into that index. There are some exceptions to this rule: • new properties can be added to Object datatype fields. • new multi-fields can be added to existing fields. • the ignore_above parameter can be updated.
  54. 54. Prevent mapping explosion 54 • Defining too many fields in an index is a condition that can lead to a mapping explosion, which can cause out-of-memory errors and situations that are difficult to recover from. - For example, when using dynamic mapping, every newly inserted document can introduce new fields. • The following settings allow you to limit the number of field mappings that can be created manually or dynamically: index.mapping.total_fields.limit - maximum number of fields in an index, defaults to 1000 index.mapping.depth.limit - maximum depth for a field, measured as the number of inner objects, defaults to 20 index.mapping.nested_fields.limit - maximum number of nested fields in an index, defaults to 50
  55. 55. Analysis 55 • Analysis is the process of converting full text into terms (tokens) which are added to the inverted index for searching. - Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer defined per index. For example, at index time the built-in standard analyzer will first convert the sentence into distinct tokens: "Welcome to Percona Live - Open Source Database Conference 2018" [ welcome to percona live open source database conference 2018 ] The analyzer lowercases each token (and, if configured with stopwords, removes frequent stopwords)
  56. 56. The analyze api 56 • The _analyze api can be used to test what an analyzer will do to your text curl -s -XGET localhost:$ES_PORT/_analyze?analyzer=standard -d 'Welcome to Percona Live - Open Source Database Conference 2018' | python -m json.tool | grep token "tokens": [ "token": "welcome", "token": "to", "token": "percona", "token": "live", "token": "open", "token": "source", "token": "database", "token": "conference", "token": "2018",
  57. 57. Built-in analyzers 57 • Standard - the default analyzer • Simple – breaks text into terms whenever it encounters a character which is not a letter • Keyword – simply indexes the text exactly as-is • Others include: whitespace, stop, pattern, language, and more, described in the docs - plus custom analyzers built by you
  58. 58. Analyzer components 58 • An analyzer consists of three parts: 1. Character Filters 2. Tokenizer 3. Token Filters Pipeline: Input (string) -> Character Filters (string) -> Tokenizer (tokens) -> Token Filters (tokens) -> Output
  59. 59. Specifying an analyzer 59 • At index time: PUT my_index { "mappings": { "_doc": { "properties": { "title": { "type": "text", "analyzer": "standard" } } } } } • At search time: Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index. By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting: PUT my_index { "mappings": { "_doc": { "properties": { "text": { "type": "text", "analyzer": "autocomplete", "search_analyzer": "standard" }}}}}
  60. 60. Custom analyzer 60 • Best described with an example, let's create a custom analyzer based on standard one, but which also removes stop words PUT my_index { "settings": { "analysis": { "filter": { "my_stopwords": { "type": "stop", "stopwords": ["to", "and", "or", "is", "the"] } }, "analyzer": { "my_content_analyzer": { "type": "custom", "char_filter": [], "tokenizer": "standard", "filter": ["lowercase","my_stopwords"] } } }}}
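The my_content_analyzer chain above (no char filters, standard tokenizer, lowercase + stopword filters) can be simulated in a few lines. This is a rough sketch: the regex split only approximates Lucene's standard tokenizer:

```python
import re

STOPWORDS = {"to", "and", "or", "is", "the"}   # the "my_stopwords" filter

def my_content_analyzer(text: str) -> list:
    # char_filter: [] (none configured)
    # tokenizer "standard": roughly, split on non-word characters
    tokens = re.findall(r"\w+", text)
    # token filter "lowercase"
    tokens = [t.lower() for t in tokens]
    # token filter "my_stopwords"
    return [t for t in tokens if t not in STOPWORDS]

print(my_content_analyzer("Welcome to the Open Source Database Conference"))
# ['welcome', 'open', 'source', 'database', 'conference']
```

Running the real thing through the _analyze API against the index that defines my_content_analyzer should produce the same token stream, which is a handy way to check a custom analyzer before indexing data with it.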
  61. 61. Scaling the cluster ● 10 000ft view on scaling ● Node roles ● Adding a node to a cluster ● Understanding shards ● Replicas ● Read/Write model ● Sample Architectures 61
  62. 62. 10 000ft view on scaling 62 • ElasticSearch has the potential to be always available, as long as we take advantage of its scaling features. • With vertical scaling (better hardware) having its limitations, we’ll take a look at horizontal scaling (more nodes in the same cluster). • While horizontal scaling has its challenges with other datastores, such as sharding for MongoDB (Antonios has written an amazing tutorial on managing a sharded cluster; you must check it out), ElasticSearch is designed to be distributed by nature, so as long as replicas are being used, the application-development as well as the administration overhead of scaling out the cluster is minimal.
  63. 63. 10 000ft view on scaling 63 • We defined shards as the elements that compose an index; each shard is itself a Lucene index. • By default, ElasticSearch will create 5 per index, but what if everything sits on one node and that node goes down? We face disaster. This is where replicas come in. • A replica of a shard is an exact copy of that element that lives on another node. • A node is simply an ElasticSearch process. One or more nodes sharing the same name under the “cluster.name” directive in the config file make up a cluster.
  64. 64. 10 000ft view on scaling 64 • All nodes know about all others in the cluster and can also direct a request to another, if needed. • Nodes handle both HTTP (external) traffic as well as transport (inter-cluster) traffic, and can switch between these: if one node receives an HTTP request that should have been directed at another, it forwards it over transport. • Nodes can have one or more roles in the cluster.
  65. 65. Node Roles 65 • Master-eligible node: A node that has "node.master" set to true (default), which makes it eligible to be elected as the master node, which controls the cluster and carries out administrative functions such as deleting and creating indexes. • Data node: A node that has "" set to true (default). Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations. • Ingest node: A node that has "node.ingest" set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document, such as adding a field that wasn’t there before, before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as "node.ingest: false"
  66. 66. Node Roles 66 • Tribe node: A tribe node, configured via the tribe.* settings, is a special type of coordinating only node that can connect to multiple clusters and perform search and other operations across all connected clusters. In later versions of Elastic, this role became obsolete • Kibana node: In case Kibana is being used on a large scale with many users running complex queries, you can have a dedicated node or nodes for it. • To summarize, any node, by default, is master eligible, is acting as a data node as well as handling ingestions, including ingestion pipelines. As the cluster grows, in order to separate the overhead of different operations(maintaining the cluster, ingestion pipelines, connecting clusters etc), it makes sense to define roles.
  67. 67. Adding a node to a cluster 67 • To add a node or start a cluster, we need to set the directive "cluster.name" to a descriptive value in /etc/elasticsearch/elasticsearch.yml; all nodes need to have the same cluster.name • By default, ElasticSearch binds to the loopback interface, so we must edit the networking section of the config file and bind the daemon to a specific IP, or use for all interfaces: network.host • We must name our nodes, again, with descriptive values: node.name • Nodes running on the same host will be auto-discovered, but remote nodes use zen discovery, which takes a list of IPs that will assemble the cluster: discovery.zen.ping.unicast.hosts. The firewall must allow communication on ports 9200 and 9300.
  68. 68. Adding a node to a cluster 68 • Of course, there are more options that you can configure, but for the sake of this exercise these will be enough. • Once these are set, restart the daemon and a /_cluster/health?pretty should return something like: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", …. "number_of_nodes" : 2, "number_of_data_nodes" : 2, …… }
  69. 69. Understanding shards 69 A shard is a worker unit that holds data, can be assigned to nodes, and is itself a Lucene index. Think of a self-contained search engine that handles a portion of the data. ‒ An index is merely a virtual namespace which points to a number of shards (diagram: My_index on My_cluster, with its shards spread across Node1 and Node2) An index is "split" into shards before any documents are indexed
  70. 70. Primary and Replica 70 • There are two types of shards: - primary: the original shards of an index - replicas: copies of the primary • Documents are replicated between a primary and its replicas - a primary and all its replicas are guaranteed to be on different nodes (diagram: My_cluster with primaries P0-P4 and replicas R0-R4 spread across Node1 and Node2, each replica on a different node than its primary)
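That placement guarantee, a replica never sharing a node with its primary, can be sketched as a simple allocator. This is a hypothetical round-robin scheme for illustration; Elasticsearch's real allocator also balances shard counts, disk usage, and awareness attributes:

```python
def allocate(num_shards, num_replicas, nodes):
    """Assign each primary and its replicas to distinct nodes, round-robin.
    Returns {shard: [primary_node, replica_node, ...]}."""
    copies = num_replicas + 1
    if copies > len(nodes):
        # Not enough nodes to keep all copies apart; Elasticsearch would
        # leave such replicas unassigned (a yellow cluster)
        raise ValueError("not enough nodes to place all copies on distinct nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(copies)]
        for shard in range(num_shards)
    }

layout = allocate(num_shards=5, num_replicas=1, nodes=["node1", "node2"])
# Every shard's primary and replica sit on different nodes:
print(all(len(set(c)) == len(c) for c in layout.values()))  # True
```

With a single node and one replica the allocator refuses, which mirrors why a one-node cluster with replicas configured reports unassigned shards.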
  71. 71. Number of Primary shards 71 • Is fixed – the default number of primary shards for an index is 5 • You can specify a different number of shards when you create the index. • Changing the number of shards after the index has been created can be done with the split or shrink index API, but it’s NOT a trivial operation. It’s basically the same as reindexing. Plan accordingly. PUT my_new_index { "settings": { "number_of_shards": 3 } }
  72. 72. Replicas are good for 72 • High availability - We can lose a node and still have all the data available - After losing a primary, Elasticsearch will automatically promote a replica to a primary and start replicating unassigned replicas • Read throughput - Replicas can handle query/read requests from client applications - Allows you to scale your data and better utilize cluster resources You can change the number of replicas for an index at any time using the _settings endpoint: PUT my_index/_settings { "number_of_replicas": 2 }
  73. 73. Replicas 73 • Let's play a bit with replicas. In this example I've indexed Shakespeare's work again. Here is the cluster and the index: curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", ← …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … } curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size yellow open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 22.4mb 22.4mb Yellow indicates a problem. What do you think the problem is? What would be a solution here?
  74. 74. Replicas 74 • Replicas will get assigned automatically if the topology permits it. All I did was start a second node, and: • We can change the replica count dynamically in the index settings. This is a trivial operation, unlike changing the number of shards. curl -XGET localhost:9200/_cat/indices?v health status index uuid pri rep docs.count docs.deleted store.size green open shakespeare jkJ280IVT3mcfswXwBR1QA 5 1 111394 0 44.9mb 22.4mb curl -X PUT "localhost:9200/shakespeare/_settings" -H 'Content-Type: application/json' -d' > { > "index" : { > "number_of_replicas" : 0 > } > } > ' {"acknowledged":true}
  75. 75. Write Path 75 • The process of keeping the primary shard in sync with its replicas is called a data replication model. Elasticsearch's data replication model is based on the primary-backup model: one primary and n backups. • This model runs on top of replication groups. We've seen that by default we have 5 primary shards, each with 1 replica. In the graph above, we have 2 replication groups and each primary has 3 replicas. • Within a replication group, the primary shard is responsible for indexing and for keeping the replicas up to date. At any given point some replicas in a group might be offline, so the master node maintains an in-sync set: the copies that are online and have received all the writes the user has acknowledged. (diagram: two replication groups, each with one primary and three replicas)
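The primary-backup flow on this slide can be sketched as a toy model (illustration only, not Elasticsearch code; list appends stand in for indexing):

```python
# Toy model of the primary-backup flow: the primary executes the write
# locally, forwards it to every replica in the in-sync list, and acks only
# once all of them have applied it.
def replicate_write(doc, primary_store, in_sync_replicas):
    primary_store.append(doc)          # 1. execute locally on the primary
    for replica in in_sync_replicas:   # 2. forward to all in-sync copies
        replica.append(doc)
    return "acknowledged"              # 3. ack after every in-sync copy ran it

primary, r1, r2 = [], [], []
assert replicate_write({"line": "To be, or not to be"}, primary, [r1, r2]) == "acknowledged"
assert primary == r1 == r2   # all in-sync copies hold the acknowledged write
```

This is why an acknowledged write survives the loss of the primary: every copy in the in-sync set already has it.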
  76. 76. Write Path 76 • The primary follows this flow: validate the incoming operation and documents, execute it locally, forward the operation to all replicas in the in-sync list, and ack the write once all the replicas in the list have run the operation. • Some notes about failure handling: in case a primary fails, indexing will pause for up to 1 minute while the master promotes a new primary. The primary will also check with its replicas to make sure it is still the primary and wasn't demoted for whatever reason. An operation coming from a stale primary will be declined by the replicas.
  77. 77. Read Path 77 • The node that received the query (which is called the coordinating node) will find the relevant shards for the read request, select an active copy of the data (primary or replica, chosen round-robin) from each replication group, send the read request to the selected copies, combine the results and respond. • The request to each shard is single-threaded, but multiple shards can be queried in parallel.
  78. 78. Read Path 78 • Because the active copy is selected round-robin, this is where adding more replicas helps. Each new request hits a different replica, so the work is spread out. • Failure handling is much simpler here. If for some reason a response is not received, the coordinating node resubmits the read request to the relevant replication group, picks a different replica, and the same flow reapplies.
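The coordinating node's copy selection can be sketched as a round-robin cycle (our own toy helper, not the Elasticsearch implementation):

```python
import itertools

# Reads round-robin over the active copies of a replication group, so each
# extra replica absorbs a share of the read load.
def copy_picker(active_copies):
    cycler = itertools.cycle(active_copies)
    return lambda: next(cycler)

pick = copy_picker(["primary", "replica-1", "replica-2"])
assert [pick() for _ in range(4)] == ["primary", "replica-1", "replica-2", "primary"]
```

On a failed response, the coordinating node simply asks the picker for the next copy in the same group and retries.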
  79. 79. Sample Architectures 79 • For lightweight searches, and where the data can be reindexed without suffering loss, a single-node cluster is not uncommon. • A basic deployment with data resilience is the two-node cluster. Most SaaS providers start with this deployment. • The two-node model can be scaled as much as needed, and is usually recommended when you are running just basic indexing/search operations. If more granularity is needed, the data can be reindexed with a higher number of shards and replicas. • If the number of nodes in the cluster gets really high or the operations get complex, it's time to separate the roles. Separating the nodes also needs to take into consideration the cases where you would lose one or more nodes of a specific role. For instance, if you're using ingest-only nodes, data-only nodes and master-only nodes, you need to consider what happens if you lose one or more of each.
  80. 80. Sample Architectures 80 • ObjectRocket starts with 4 ingest nodes, 2 Kibana, 2 data and 3 master nodes. • We don't care how many client nodes we lose as long as we have 1 remaining. • The master nodes pick an active master based on quorum. This helps with split brain. • Of the data nodes, of course, we can lose at most one. • Consider redundant components as much as possible. • We will cover security in a later chapter. By default, in the community version, there is no built-in security. In this case, firewall restrictions are a must-have.
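The quorum rule the master nodes use can be sketched in one line (`quorum` is our own helper name, a minimal illustration):

```python
# An active master needs more than half of the master-eligible nodes, which
# is why 3 dedicated masters are the common minimum: one can fail without
# risking split brain.
def quorum(master_eligible: int) -> int:
    return master_eligible // 2 + 1

assert quorum(3) == 2   # with 3 masters we can lose one and still elect
assert quorum(2) == 2   # with 2 masters, losing either one blocks election
```

This is also why even numbers of master-eligible nodes buy no extra failure headroom over the next-lower odd number.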
  81. 81. 81 Lab 2 Scaling the cluster Objectives: Adding nodes to your cluster Change the number of Replicas Steps: 1. Navigate to /Percona2018/Lab02/ 2. Read the instructions on Lab02.txt
  82. 82. Operating the cluster ● Working with nodes ● Working with shards ● Reindex ● Backup/Restore ● Plugins 82
  83. 83. Cheatsheet 83 curl -X GET "<host>:<port>/_cluster/settings" curl -X GET "<host>:<port>/_cluster/settings?include_defaults=true" curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "persistent" : { "name of the setting" : value }}' curl -X PUT "<host>:<port>/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient" : { "name of the setting" : null }}'
  84. 84. Shard Allocation 84 Allow control over how and where shards are allocated Shard Allocation settings (cluster.routing.allocation) - enable - node_concurrent_incoming_recoveries - node_concurrent_outgoing_recoveries - node_concurrent_recoveries - Shard Rebalancing settings (cluster.routing.rebalance) - enable - allow_rebalance - cluster_concurrent_rebalance
  85. 85. Shard Allocation - Disk 85 cluster.routing.allocation.disk.threshold_enabled: Defaults to true Low: Do not allocate new shards. Defaults to 85% High: Try to relocate shards. Defaults to 90% Flood_stage: Enforces a read-only index block. Must be released manually. Defaults to 95% cluster.info.update.interval: How often Elasticsearch should check on disk usage (Defaults to 30s) cluster.routing.allocation.disk.include_relocations: Defaults to true – could lead to false alerts curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.allocation.disk.watermark.low": "100gb", "cluster.routing.allocation.disk.watermark.high": "50gb", "cluster.routing.allocation.disk.watermark.flood_stage": "10gb", "cluster.info.update.interval": "1m" }}'
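The three watermark reactions above can be sketched as a simple decision function, using the default thresholds (85% low, 90% high, 95% flood stage). This is our own helper for illustration, not an Elasticsearch API:

```python
# Decide how allocation reacts to a node's disk usage, per the watermark tiers.
def watermark_action(used_pct, low=85, high=90, flood=95):
    if used_pct >= flood:
        return "enforce read-only index block"
    if used_pct >= high:
        return "relocate shards away from the node"
    if used_pct >= low:
        return "stop allocating new shards to the node"
    return "allocate normally"

assert watermark_action(80) == "allocate normally"
assert watermark_action(86) == "stop allocating new shards to the node"
assert watermark_action(91) == "relocate shards away from the node"
assert watermark_action(96) == "enforce read-only index block"
```

Note that only the flood-stage block needs a manual release; the other two tiers relax automatically as disk usage falls.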
  86. 86. Shard Allocation – Rack/Zone 86 Make Elasticsearch aware of the topology - it can ensure that the primary shard and its replica shards are spread across different - Physical servers (node.attr.phy_host) - Racks (node.attr.rack_id) - Availability Zones (node.attr.zone) - Minimize the risk of losing all shard copies at the same time - Minimize latency Configuration: cluster.routing.allocation.awareness.attributes: zone,rack_id Force awareness: cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 cluster.routing.allocation.awareness.attributes: zone
  87. 87. Restart node(s) 87 Elasticsearch wants your data to be fully replicated and evenly balanced. When a node goes down: - The cluster immediately recognizes the change - Rebalancing begins - Rebalancing takes time and can become costly During a planned maintenance you should hold off on rebalancing
  88. 88. Restart node(s) 88 Steps: 1) Flush pending indexing operations POST /_flush/synced 2) Disable shard allocation 3) Shut down a single node 4) Perform the maintenance PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "none" } }
  89. 89. Restart node(s) 89 5) Restart the node, and confirm that it joins the cluster. 6) Re-enable shard allocation as follows: 7) Check the cluster health PUT /_cluster/settings { "transient" : { "cluster.routing.allocation.enable" : "all" } }
  90. 90. Restart node(s) 90 You can also make Elasticsearch less sensitive to changes. By default, the master waits 1m before reallocating the shards of a node that left. During restarts we can raise that delay. Useful setting for slow or unreliable networks. PUT _all/_settings { "settings": { "index.unassigned.node_left.delayed_timeout": "5m" } }
  91. 91. Remove a node 91 Elastic automatically detects topology changes. In order to remove a node you need to drain it and then stop it. Where attribute: _name : Match nodes by node names _ip : Match nodes by IP addresses (the IP address associated with the hostname) _host : Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "<value>" } }
  92. 92. Remove a node 92 Additional considerations: - Master-eligible node - Seed nodes - Space considerations - Performance considerations - If possible stop writes - Do not allow new allocations ("cluster.routing.allocation.enable" : "none") - Overhead from the shard drains - Throttle (indices.recovery.max_bytes_per_sec) - One node at a time (cluster.routing.allocation.disk.watermark) Move shards manually (Reroute API) - Flush and if possible stop writes - Safe for Replicas, not recommended for Primaries (may lead to data loss)
  93. 93. Remove a node 93 Cancel the drain of a node by removing the node or reset the attribute Where attribute: _name :Match nodes by node names _ip: Match nodes by IP addresses (the IP address associated with the hostname) _host: Match nodes by hostnames PUT _cluster/settings { "transient" : { "cluster.routing.allocation.exclude.{attribute}": "" } }
  94. 94. Replace a node 94 Similar to remove a node with the difference that you need to add a node as well. Simplest approach: add a new node and then drain the old node Additional considerations: - Master-eligible/Seed nodes - Do not allow new allocations (cluster.routing.allocation.exclude._name) - Overhead from drain/throttle (indices.recovery.max_bytes_per_sec) - Space considerations - Max amount of data each node can get. Watermark Alternatively use the reroute API to drain the node
  95. 95. Working with Shards 95 Number of Shards/Replicas - Defined on Index creation - Number of Replicas changes dynamically - Number of Shards can change using: - shrink API - split API - reindex API Why increase the number of shards: - Index size - Performance considerations - Hard limits (LUCENE-5843) Almost the same reasons apply when decreasing the number of shards
  96. 96. Shrink API 96 Shrinks an existing index into a new one with fewer primary shards: - The target index's shard count must be a factor of the number of shards in the source index - An index with a prime number of shards can only be shrunk into a single primary shard - Before shrinking, a (primary or replica) copy of every shard in the index must be present on the same node Works as follows: - First, it creates a new target index with the same definition as the source index, but with a smaller number of primary shards. - Then it hard-links segments from the source index into the target index. - Finally, it recovers the target index as though it were a closed index which had just been re-opened.
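The factor constraint above can be sketched as a quick check (`can_shrink` is our own helper for illustration):

```python
# The target primary count must evenly divide the source count; any count
# can shrink to 1.
def can_shrink(source_shards: int, target_shards: int) -> bool:
    return 0 < target_shards <= source_shards and source_shards % target_shards == 0

assert can_shrink(8, 4) and can_shrink(8, 2) and can_shrink(8, 1)
assert not can_shrink(8, 3)                       # 3 is not a factor of 8
assert can_shrink(5, 1) and not can_shrink(5, 2)  # prime: only down to 1
```

The factor rule exists so that each target shard can be built by merging a whole number of source shards.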
  97. 97. Shrink API 97 In order to shrink an index, the index must be marked as read-only, and a copy of every shard in the index must be relocated to the same node and have health green Note that it may take a while… Check progress using GET _cat/recovery?v curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.routing.allocation.require._name": "shrink_node_name", "index.blocks.write": true }}'
  98. 98. Shrink API 98 Finally, it's time to shrink the index: It is similar to the create index API – almost the same arguments Some constraints apply curl -X POST "<host>:<port>/my_source_index/_shrink/my_target_index" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_replicas": <number>, "index.number_of_shards": <number>, "index.routing.allocation.require._name": null, "index.blocks.write": null }}'
  99. 99. Split API 99 Splits an existing index into a new index: - The original primary shard is split into two or more primary shards. - The number of splits is determined by the index.number_of_routing_shards setting The _split API requires the source index to be created with a specific number_of_routing_shards in order to be split in the future. This requirement has been removed in Elasticsearch 7.0 Works as follows: - First, it creates a new target index with a larger number of primary shards. - Then it hard-links segments from the source index into the target index. - Once the low-level files are created, all documents are hashed again to delete documents that belong to a different shard. - Finally, it recovers the target index as though it were a closed index which had just been re-opened.
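The routing-shards constraint can be sketched the same way as the shrink factor check (`can_split` is our own illustrative helper):

```python
# The target shard count must be a multiple of the source count and must
# still divide number_of_routing_shards evenly.
def can_split(source, target, routing_shards):
    return target > source and target % source == 0 and routing_shards % target == 0

# e.g. a 5-shard index created with number_of_routing_shards = 30:
assert can_split(5, 10, 30) and can_split(5, 15, 30) and can_split(5, 30, 30)
assert not can_split(5, 20, 30)   # 20 does not divide 30
assert not can_split(5, 12, 30)   # 12 is not a multiple of 5
```

number_of_routing_shards is effectively the ceiling: it is the finest shard count the index's routing hash was pre-partitioned for.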
  100. 100. Split API 100 In order to split an index, the index must be marked as read-only (assuming the index has number_of_routing_shards set) Split the index: curl -X PUT "<host>:<port>/my_source_index/_settings" -H 'Content-Type: application/json' -d' { "settings": { "index.blocks.write": true }}' curl -X POST "<host>:<port>/my_source_index/_split/my_target_index?copy_settings=true" -H 'Content-Type: application/json' -d' { "settings": { "index.number_of_shards": 2 }}'
  101. 101. Reindex API - Definition 101 - Does not copy the settings of the source index - version_type : internal/external - source supports "query", multi-indexes & remote location - URL parameters: refresh, wait_for_completion, wait_for_active_shards, timeout, scroll and requests_per_second - Supports painless scripts to manipulate indexing curl -X POST "<host>:<port>/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "<source index>" }, "dest": { "index": "<destination index>" }}'
  102. 102. Reindex API – Response Body 102 "took": 1200, ← total milliseconds the entire operation took "timed_out": false, "total": 10, "updated": 0, "created": 10, ← counts of documents successfully processed "deleted": 0, "batches": 1, "noops": 0, "version_conflicts": 2, ← the number of version conflicts that reindex hit "retries": { "bulk": 0, "search": 0}, "throttled_millis": 0, ← throttling statistics "requests_per_second": 1, "throttled_until_millis": 0, "failures": [ ]
  103. 103. Reindex API 103 Active Reindex jobs: Cancel a Reindex job: Re-Throttle: Reindexing from a remote server: - Uses an on-heap buffer that defaults to a maximum size of 100mb - May need to use a smaller batch size - Configure socket_timeout and connect_timeout. Both default to 30 seconds POST _reindex/<id of the reindex>/_rethrottle?requests_per_second=-1 POST _tasks/<id of the reindex>/_cancel GET _tasks?detailed=true&actions=*reindex
  104. 104. Snapshots - Backup 104 A snapshot is a backup taken from a running Elasticsearch cluster Snapshots are taken incrementally Version compatibility – one major version behind You must register a snapshot repository before you can take a snapshot path.repo must exist in elasticsearch.yml curl -X GET "<host>:<port>/_snapshot/_all" curl -X PUT "<host>:<port>/_snapshot/my_backup" -H 'Content-Type: application/json' -d' { "type": "fs", "settings": { "location": "backup location" }}'
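Why snapshots are incremental can be sketched as a set difference over segment files (toy model; the segment names and `segments_to_copy` helper are made up for illustration):

```python
# Only segment files not already stored in the repository by earlier
# snapshots get copied for the new snapshot.
def segments_to_copy(index_segments, repository_segments):
    return sorted(set(index_segments) - set(repository_segments))

repo = {"seg_1", "seg_2"}                                  # stored by snapshot_1
assert segments_to_copy(["seg_1", "seg_2", "seg_3"], repo) == ["seg_3"]
assert segments_to_copy(["seg_1", "seg_2"], repo) == []    # nothing new to copy
```

This works because Lucene segments are immutable: a segment already in the repository never needs to be re-uploaded.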
  105. 105. Snapshots - Backup 105 Shared location: On elasticsearch.yml: path.repo: ["/mount/backups0", "/mount/backups1"] Don’t forget to register it!!! Registration options location: Location of the snapshots compress: Turns on compression of the snapshot files. Defaults to true. chunk_size: Big files can be broken down into chunks. Defaults to null (unlimited chunk size) max_restore_bytes_per_sec: Throttles per node restore rate. Defaults to 40mb/second max_snapshot_bytes_per_sec: Throttles per node snapshot rate. Defaults to 40mb/second readonly: Makes repository read-only. Defaults to false
  106. 106. Snapshots - Backup 106 wait_for_completion: whether the request should wait for the snapshot to finish before returning (otherwise it returns as soon as the snapshot initializes) ignore_unavailable: Ignores indices that don't exist include_global_state: When false, prevents the cluster global state from being stored as part of the snapshot curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_2?wait_for_completion=true" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2,index_3", "ignore_unavailable": true, "include_global_state": false }' curl -X PUT "<host>:<port>/_snapshot/my_backup/snapshot_1?wait_for_completion=true"
  107. 107. Snapshots - Backup 107 IN_PROGRESS: The snapshot is currently running. SUCCESS: The snapshot finished and all shards were stored successfully. FAILED: The snapshot finished with an error and failed to store any data. PARTIAL: The global cluster state was stored, but data of at least one shard wasn't stored successfully. INCOMPATIBLE: The snapshot was created with an old version of ES incompatible with the current version of the cluster. Delete snapshot: Unregister Repo: curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1" curl -X DELETE "<host>:<port>/_snapshot/my_backup/snapshot_2" curl -X DELETE "<host>:<port>/_snapshot/my_backup"
  108. 108. Snapshots - Restore 108 Check the progress: Also supported: - Partial restore - Restore with different settings - Restore to a different cluster curl -X POST "<host>:<port>/_snapshot/my_backup/snapshot_1/_restore" curl -X GET "<host>:<port>/_snapshot/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/_status" curl -X GET "<host>:<port>/_snapshot/my_backup/snapshot_1/_status"
  109. 109. Snapshots - Restore 109 Restore with different settings Select indices that should be restored Renames indices on restore using a regular expression that supports referencing the original text. Restore global state "index_settings": {"index.number_of_replicas": 0} "ignore_index_settings": ["index.refresh_interval"] curl -X POST "localhost:9200/_snapshot/my_backup/snapshot_1/_restore" -H 'Content-Type: application/json' -d' { "indices": "index_1,index_2", "ignore_unavailable": true, "include_global_state": true, "rename_pattern": "index_(.+)", "rename_replacement": "restored_index_$1" }'
  110. 110. Plugins 110 A way to enhance the basic Elasticsearch functionality in a custom manner. They range from: - Mapping and analysis - Discovery - Security - Management - Alerting - And many many more… Installation: bin/elasticsearch-plugin install [plugin_name] Considerations: - Security - Maintainability between versions We make heavy use of Cerebro in this tutorial
  111. 111. 111 Lab 3 Operating the cluster Objectives: Learn how to: o Remove a node from a cluster. o Use the ReIndex API Steps: 1. Navigate to /Percona2018/Lab03 2. Read the instructions on Lab03.txt 3. Execute ./ to begin
  112. 112. Troubleshooting ● Cluster health ● Improving Performance ● Diagnostics 112
  113. 113. Cluster health 113 • The cluster health API allows you to get a very simple status on the health of the cluster • The health status is either green, yellow or red and exists at three levels: shard, index, and cluster • Shard health ‒ red: at least one primary shard is not allocated in the cluster ‒ yellow: all primaries are allocated but at least one replica is not ‒ green: all shards are allocated • Index health ‒ status of the worst shard in that index • Cluster health ‒ status of the worst index in the cluster
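The "worst status wins" roll-up can be sketched in a couple of lines (`worst` and the severity table are our own illustrative helpers):

```python
# Index health is the worst shard status; cluster health is the worst
# index status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def worst(statuses):
    return max(statuses, key=SEVERITY.__getitem__)

assert worst(["green", "green"]) == "green"
assert worst(["green", "yellow"]) == "yellow"       # e.g. a missing replica
assert worst(["yellow", "red", "green"]) == "red"   # e.g. a missing primary
```

So a single unassigned replica anywhere is enough to turn the whole cluster yellow, even if every other index is green.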
  114. 114. Cluster health 114 { "cluster_name" : "my_cluster", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 5, "delayed_unassigned_shards": 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 50.0 } GET _cluster/health
  115. 115. Green status 115 • The state your cluster should have – All of your primary and replica shards are allocated and active My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 2 } }
  116. 116. Yellow status 116 It means all your primary shards are allocated, but one or more replicas are not. - you may not have enough nodes in the cluster, or a node may have failed My_cluster Node1 Node2 P0 R0 Node3 R0 PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 3 } } R0 Unassigned
  117. 117. Red status 117 • At least one primary shard is missing - searches will return partial results and indexing might fail PUT my_index { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } } My_cluster Node1 Node2 P0 R0 Node3
  118. 118. Resolve unassigned shards 118 Causes: • Shard allocation is purposefully delayed • Too many shards, not enough nodes • You need to re-enable shard allocation • Shard data no longer exists in the cluster • Low disk watermark • Multiple Elasticsearch versions The _cat endpoint will tell you which shards are unassigned, and why: curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
  119. 119. Resolve unassigned shards 119 • You can also use the cluster allocation explain API to get more information about shard allocation issues: curl -XGET localhost:9200/_cluster/allocation/explain?pretty { "index" : "testing", "shard" : 0, "primary" : false, "current_state" : "unassigned", … "can_allocate" : "no", "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes", "node_allocation_decisions" : [ { … { "decider" : "same_shard", "decision" : "NO", "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists" }]}]}
  120. 120. Reason 1 – Shard allocation delayed 120 • When a node leaves the cluster, the master node temporarily delays shard reallocation to avoid needlessly wasting resources on rebalancing shards, in the event the original node is able to recover within a certain period of time (one minute, by default) Modify the delay dynamically: curl -XPUT 'localhost:9200/my_index/_settings' -d '{ "settings": { "index.unassigned.node_left.delayed_timeout": "30s" } }'
  121. 121. Reason 2 – Not enough nodes 121 • As nodes join and leave the cluster, the master node reassigns shards automatically, ensuring that multiple copies of a shard aren’t assigned to the same node • A shard may linger in an unassigned state if there are not enough nodes to distribute the shards accordingly. • Make sure that every index in your cluster is initialized with fewer replicas per primary shard than the number of nodes in your cluster
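The rule of thumb above can be written as a one-line check (`replicas_fully_assignable` is our own helper name, a minimal sketch):

```python
# Every copy of a shard must live on a different node, so an index can only
# reach green when number_of_replicas <= number_of_nodes - 1.
def replicas_fully_assignable(num_replicas: int, num_nodes: int) -> bool:
    return num_replicas <= num_nodes - 1

assert replicas_fully_assignable(1, 2)        # 1 replica fits a 2-node cluster
assert not replicas_fully_assignable(2, 2)    # the 2nd replica stays unassigned
```

This is exactly the yellow-cluster scenario from the earlier Replicas lab: 1 replica configured on a 1-node cluster.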
  122. 122. Reason 3 – re-enable shard allocation 122 • Shard allocation is enabled by default on all nodes, but you may have disabled shard allocation at some point (for example, in order to perform a rolling restart) and forgotten to re-enable it. • To enable shard allocation, update the _cluster settings API: curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable" : "all" } }'
  123. 123. Reason 4 – Shard data no longer exists 123 • Primary shard is not available anymore because the index may have been created on a node without any replicas (a technique used to speed up the initial indexing process), and the node left the cluster before the data could be replicated. • Another possibility is that a node may have encountered an issue while rebooting or has storage issues • In this scenario, you have to decide how to proceed: try to get the original node to recover and rejoin the cluster (and do not force allocate the primary shard), or force allocate the shard using the _reroute API and reindex the missing data using the original data source, or from a backup.
  124. 124. Reason 4 – Shard data no longer exists 124 • To allocate an unassigned primary shard: curl -XPOST 'localhost:9200/_cluster/reroute' -d '{ "commands" : [ { "allocate" : { "index" : "my_index", "shard" : 0, "node": "<NODE_NAME>", "allow_primary": "true" } }] }' Warning! The caveat with forcing allocation of a primary shard is that you will be assigning an “empty” shard. If the node that contained the original primary shard data were to rejoin the cluster later, its data would be overwritten by the newly created (empty) primary shard, because it would be considered a “newer” version of the data.
  125. 125. Reason 5 – Low disk watermark 125 • Once a node has reached this level of disk usage, or what Elasticsearch calls a "low disk watermark", it will not be assigned more shards – default is 85% • You can check the disk space on each node in your cluster (and see which shards are stored on each of those nodes) by querying the _cat API: curl -s -XGET 'localhost:9200/_cat/allocation?v' shards disk.indices disk.used disk.avail disk.total disk.percent host ip node 5 260b 47.3gb 43.4gb 100.7gb 46 CSUXak2 Example response:
  126. 126. Reason 5 – Low disk watermark 126 Resolutions: - add more nodes - increase disk size - increase low watermark threshold, if safe: PUT /_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.watermark.low": "90%" } }'
  127. 127. Reason 6 – Multiple ES versions 127 • Usually encountered when in the middle of a rolling upgrade • The master node will not assign a primary shard’s replicas to any node running an older major version (1.x -> 2.x -> 5.x).
  128. 128. When nothing works 128 … or restore the affected index from an old snapshot
  129. 129. Poor performance 129 This can be a long discussion – see more in the `Best practices` chapter You want to start by: • Enabling slow logging so you can identify long-running queries • Running the identified searches through the Profile API to look at the timing of individual components • Filter, filter, filter
  130. 130. Enable slow log 130 • Send a put request to the _cluster API to define the level of slow log that you want to turn on: warn, info, debug, and trace PUT /_cluster/settings '{ "transient" : { "logger.index.search.slowlog" : "DEBUG", "logger.index.indexing.slowlog" : "DEBUG" } }' • All slow logging is enabled on the index level: PUT /my_index/_settings '{ "index.search.slowlog.threshold.query.warn" : "50ms", "index.search.slowlog.threshold.fetch.warn": "50ms", "index.indexing.slowlog.threshold.index.warn": "50ms" }'
  131. 131. Profile 131 • The Profile API provides detailed timing information about the execution of individual components in a search request and it can be very verbose, especially for complex requests executed across many shards • Usage: GET /my_index/_search { "profile": true, "query" : { "match" : { "speaker": "KING HENRY IV" } } }
  132. 132. Filters 132 • One way to improve the performance of your searches is with filters. The filtered query can be your best friend. It's important to filter first because filters do not affect the document score, so you use very few resources to cut the search field down to size. • A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries. • Also, filters can be cached.
  133. 133. Upgrade the cluster ● Generals ● Upgrade path ● Before upgrading ● Rolling upgrades ● On rolling upgrades ● Full cluster restart upgrades ● Upgrades by re-indexing ● Re-indexing in place ● Moving through the versions 133
  134. 134. Generals 134 • Elasticsearch can read indices created in the previous major version. Older indices must be re-indexed or deleted. • From version 5.0, Elasticsearch can usually be upgraded using rolling restarts so that the service is not interrupted. • Upgrades across major versions before 6.0 require a full cluster restart • Backup backup backup • Nodes will fail to start if incompatible indices are found • You can reindex from a remote location so that you can skip the backup/restore option
  135. 135. Upgrade path 135 • Any index created prior to 5.0 will need to be re-indexed into newer versions
  136. 136. Before upgrading 136 • Understand the changes that appeared in the new version by reviewing the Release highlights and Release notes. • Review the list of changes that can break your cluster. • Check the deprecation log to see if any of the features you use have been deprecated. • Check for updated versions of your current plugins, or their compatibility with the new version. • Upgrade your dev/QA/staging cluster before proceeding with the production cluster. • Back up your data by taking a snapshot before upgrading. If you want to roll back, you will need it. You can't roll back unless you have a backup.
  137. 137. Rolling upgrades 137 1. As we've seen before, ES adjusts the balancing of shards based on topology. If we just remove a node, the cluster will think the node crashed and will start redistributing the shards – and then once more when we bring the node back. To avoid this, we need to disable shard allocation • The shard recovery process is helped by stopping indexing and using "POST _flush/synced" • At this point the cluster is going to turn yellow: replicas on the remaining nodes get promoted to primary where needed, and the shard copies on the stopped node become unavailable, but this doesn't hurt the operation of the cluster. As we've discussed, as long as one copy from each replication group is available, the dataset is alive. • Depending on the number of nodes you have left, be careful not to take out another one :) curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
  138. 138. Rolling upgrades 138 2. Stop the node. This can be as easy as "service elasticsearch stop". 3. Carry out the needed maintenance (depending on the package manager, or the way ES has been installed, you might want to run a yum update or replace the binaries). Be careful with versions and plugins: - A newer-version node will join a cluster made of older-version nodes, but an older-version node won't join a cluster made of newer-version nodes; - /usr/share/elasticsearch/bin/elasticsearch-plugin is a script provided by ES to handle plugins. Upgrade these to the correct versions. - During a rolling upgrade, primary shards assigned to a node running the new version cannot have their replicas assigned to a node with the old version. The new version might have a different data format that is not understood by the old version. 4. Start the node;
  139. 139. Rolling upgrades 139 5. Make sure that everything has started correctly. Check the node's logs for messages of this sort: 6. Enable shard allocation (same command as at step 1, but use "null" – the value, not a string – to reset to the default, instead of "none") 7. Check the cluster status and make sure everything has recovered. It can take a bit for the shards to become available. 8. NEEEEEXT! curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "green", ← … } [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] initialized [2018-10-25T10:04:45,462][INFO ][o.e.n.Node ] [node2] starting ... [2018-10-25T10:04:45,729][INFO ][o.e.t.TransportService ] [node2] publish_address {}, bound_addresses {[::]:9300} [2018-10-25T10:04:50,465][INFO ][o.e.n.Node ] [node2] started ←
  140. 140. On Rolling upgrades 140 • As mentioned before, in a yellow state, the cluster continues to operate normally. • Because you might have a reduced number of replicas assigned, your performance might be impacted. Plan this outside the normal working hours. • New features will come into play when all the nodes are running the updated version. • Again, we can’t rollback. Lower version nodes won’t join the cluster.
  141. 141. On Rolling upgrades 141 • If you have a network partition that separates the newly updated nodes from the old ones, when it gets resolved the old nodes will fail with a message of this sort: • In this case, you have no choice other than to stop the nodes and upgrade them. It won't be rolling and you might have a service interruption, but there is no alternative. [2018-10-16T15:08:28,928][INFO ][o.e.d.z.ZenDiscovery ] [node3] failed to send join request to master [{node1}{bWKRUNFXTEy1kBgQ1y2LvA}{Gxzb3blaR86CUL3gKLhnXA}{}{}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[node1][][internal:discovery/zen/join]]; nested: IllegalStateException[node {node3}{Nt4eKRkvR6-SZ_gg22lqTQ}{dQRBgGDwSo2Zr7W866e64w}{}{}{ml.machine_memory=8196317184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} is on version [6.3.2] that cannot deserialize the license format [4], upgrade node to at least 6.4.0]; ]
  142. 142. Full Cluster restart upgrade 142 • It was needed before version 6 when major versions were involved. • v5.6 → v6 can be done with a rolling upgrade. • It involves shutting down the cluster, upgrading the nodes, then starting the cluster up. 1. Disable shard allocation so we don't have unnecessary IO after the nodes are stopped. 2. As briefly mentioned before, stop indexing and perform a "POST _flush/synced"; this will help with shard recovery. curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{ "transient": { "cluster.routing.allocation.enable": "none" }}'
  143. 143. Full Cluster restart upgrade 143 3. Shut down all nodes. "service elasticsearch stop" or whatever works :) 4. Use your package manager to update elasticsearch on each node. 5. Upgrade the plugins with "/usr/share/elasticsearch/bin/elasticsearch-plugin" 6. Start the nodes up 7. Wait for the nodes to join the cluster. 8. Enable shard allocation. 9. Check that the cluster is back to normal before enabling indexing. curl -X GET http://localhost:9200/_cluster/health?pretty { "cluster_name" : "democluster", "status" : "yellow", …. "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 5, "active_shards" : 5, "unassigned_shards" : 5, … }
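Step 9 above boils down to polling the health endpoint until the cluster recovers. A minimal Python sketch, where `get_health` is a stand-in callable for `GET /_cluster/health` and the simulated responses are illustrative, not real cluster output:

```python
import time

def wait_for_status(get_health, accept=("green", "yellow"), retries=10, delay=0.0):
    """Poll cluster health (step 9) until the status is acceptable.
    `get_health` stands in for an HTTP call to /_cluster/health."""
    for _ in range(retries):
        health = get_health()
        if health["status"] in accept:
            return health
        time.sleep(delay)
    raise TimeoutError("cluster did not recover in time")

# Simulated health responses: red while nodes rejoin, then yellow/green.
responses = iter([{"status": "red"}, {"status": "yellow"}, {"status": "green"}])
print(wait_for_status(lambda: next(responses))["status"])  # prints: yellow
```

In practice you would wrap a real HTTP client call in `get_health` and use a non-zero `delay` between polls.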
  144. 144. Upgrades by re-indexing 144 • Elasticsearch can read indices created in the previous major version. • V6 will read V5 indices but not V2 or below. V5 will read V2 indices but not V1 or below. • Older indices will need to be re-indexed or dropped. • If a node detects an incompatible index, it will fail to start. • Based on the above, trying to upgrade to a major version that is several versions ahead is a bit tricky if you don't have a spare cluster. If you do, it's actually quite easy.
  145. 145. Upgrades by re-indexing 145 • The easiest way to move to a new version is to create a cluster with that version and use the remote indexing feature. The new index will be created by the new version, for the new version. • To-do list for remote indexing: 1. Add the host and port of the old cluster to the new cluster's elasticsearch.yml under reindex.remote.whitelist: 2. Create an index on the new cluster with the correct mappings and settings. • Using a number_of_replicas of 0 and a refresh_interval of -1 will speed up the next operation. reindex.remote.whitelist: oldhost:oldport
  146. 146. Upgrades by re-indexing 146 3. Reindex from remote. Example: curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "remote": { "host": "http://oldhost:9200", "username": "user", "password": "pass" }, "index": "source", "query": { "match": { "test": "data" } } }, "dest": { "index": "dest" } } '
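The `_reindex` request body above can be assembled programmatically. A hedged sketch; the host name, index names, and credentials are the placeholders from the slide, not real endpoints:

```python
import json

def remote_reindex_body(old_host, index, dest_index, username=None, password=None):
    """Build the JSON body for POST /_reindex from a remote cluster.
    Credentials are only included when provided."""
    remote = {"host": old_host}
    if username:
        remote.update({"username": username, "password": password})
    return {"source": {"remote": remote, "index": index},
            "dest": {"index": dest_index}}

body = remote_reindex_body("http://oldhost:9200", "source", "dest", "user", "pass")
print(json.dumps(body, indent=2))
```

The resulting dict can be serialized and sent as the `-d` payload of the curl command on this slide.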
  147. 147. Re-indexing in place 147 • In order to make an older version index work on a newer version cluster, you will need to reindex to a new one. This will be done by the re-index API curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d' { "source": { "index": "twitter" }, "dest": { "index": "new_twitter" } } '
  148. 148. Re-indexing in place 148 1. If you want to maintain your mappings, create a new index and copy the mappings and settings; 2. You can again disable the refresh_interval and number_of_replicas to make the operation faster; 3. Reindex the documents to the new index; 4. Reset the refresh_interval and number_of_replicas to the wanted values; 5. Wait for the index to turn green; it will do so when the replicas get allocated
  149. 149. Re-indexing in place 149 6. In a single update, to avoid missed operations on the old index, you should: • Delete the old index (let's call it old_index) • Add an alias with the old index name to the new index • Add any aliases that existed on the old index to the new index. More aliases mean more "add" actions curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d' { "actions" : [ { "add": { "index": "new_index", "alias": "old_index" } }, { "remove_index": { "index": "old_index" } }, { "add" : { "index" : "new_index", "alias" : "any_other_aliases" } } ] } '
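Building the `_aliases` payload is mechanical enough to script. A sketch using the index names from the slide; Elasticsearch applies all the listed actions as one atomic step, which is what prevents the missed operations mentioned above:

```python
def alias_swap_actions(old_index, new_index, extra_aliases=()):
    """One _aliases request: point the old name at the new index, drop the
    old index, and re-add any other aliases it carried."""
    actions = [
        {"add": {"index": new_index, "alias": old_index}},
        {"remove_index": {"index": old_index}},
    ]
    for alias in extra_aliases:
        actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}

print(alias_swap_actions("old_index", "new_index", ["any_other_aliases"]))
```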
  150. 150. Moving through the versions 150 ElasticSearch V2 → perform a full cluster restart to version 5.6 → re-index the V2 indices in place so they work with 5.6 (fully on V5) → perform a rolling restart to 6.x
  151. 151. Moving through the versions 151 ElasticSearch V1 → perform a full cluster restart to V2.4.x → re-index the 1.x indices in place so they work on V2.4.x (fully on V2) → perform a full cluster restart to V5.6 → re-index the V2 indices so they work on V5 (fully on V5) → perform a rolling restart to V6.x
  152. 152. 152 Lab 4 Upgrading the cluster Objectives: Learn how to: o Upgrade an elasticsearch cluster. Steps: 1. Navigate to /Percona2018/Lab04 2. Read the instructions on Lab04.txt 3. Execute ./ to begin
  153. 153. Security ● Authentication ● Authorization ● Encryption ● Audit 153
  154. 154. Security 154 The open source version of Elasticsearch does not provide: - Authentication - Authorization - Encryption To overcome this we will use open-source tools: - Firewall - Reverse proxy - Encryption tools Alternatively, you can buy X-Pack, which provides these security features
  155. 155. Firewall 155 Client communication: Intra-cluster communication: iptables -I INPUT 1 -p tcp --dport 9200:9300 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9200:9300 -j REJECT iptables -I INPUT 1 -p tcp --dport 9300:9400 -s IP_1,IP_2 -j ACCEPT iptables -I INPUT 4 -p tcp --dport 9300:9400 -j REJECT
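The rule pairs above can be generated from a list of allowed client IPs. A minimal Python sketch; the IP addresses are examples, and the rule positions (1 and 4) come from the slide and may need adjusting for your own chain:

```python
def es_firewall_rules(allowed_ips, port_range="9200:9300"):
    """Generate the iptables commands from the slide: accept the listed
    sources on the ES port range, reject everyone else.
    (Sketch only -- the output must be run as root.)"""
    src = ",".join(allowed_ips)  # iptables accepts a comma-separated -s list
    return [
        f"iptables -I INPUT 1 -p tcp --dport {port_range} -s {src} -j ACCEPT",
        f"iptables -I INPUT 4 -p tcp --dport {port_range} -j REJECT",
    ]

for rule in es_firewall_rules(["10.0.0.1", "10.0.0.2"]):
    print(rule)
```

Calling it again with `port_range="9300:9400"` produces the intra-cluster pair.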
  156. 156. Firewall 156 DNS SSH Monitoring tools Allow whatever port your monitoring tool uses. iptables -A OUTPUT -p udp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p udp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A OUTPUT -p tcp --dport 53 -m state --state NEW,ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --sport 53 -m state --state ESTABLISHED -j ACCEPT iptables -A INPUT -p tcp --dport ssh -j ACCEPT iptables -A OUTPUT -p tcp --sport ssh -j ACCEPT
  157. 157. Reverse Proxy 157 client client client ES node ES node Reverse proxy - Nginx Advertise 9200 to 8080 HTTP request ES:8080 Rules HTTP request
  158. 158. Authentication 158 We are going to use nginx: ngx_http_auth_basic_module On nginx.conf 1 2 3 4 1) Listens on port 19200 2) Enables auth 3) Password file location 4) ES <host>:<port> server { listen *:19200; location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd; proxy_pass http://localhost:9200; proxy_read_timeout 90; } }
  159. 159. Authentication 159 Create users: - htpasswd -c /var/data/nginx/.htpasswd <username> - You will be prompted for the password - Alternatively, use the -b flag and provide the password on the command line Access Elasticsearch: curl <host> #Returns 301 curl <host>:19200 #Returns 401 Authorization Required curl <username>:<password>@<host>:19200 #Returns Elasticsearch output
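Under the hood, `curl user:pass@host` just sends an HTTP Basic Authorization header, which is what nginx's auth_basic verifies against the .htpasswd file. A quick illustration (the esuser credentials are the demo values used in this tutorial):

```python
import base64

def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header that curl sends for
    user:pass@host -- just base64 of "username:password"."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("esuser", "esuser"))
```

Note that base64 is encoding, not encryption: without TLS in front of nginx, the credentials travel in cleartext, which is why the next slide adds SSL.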
  160. 160. Adding SSL to the mix 160 Use nginx as a reverse proxy to encrypt client communication On nginx.conf Certificates: - Can be obtained from a commercial certificate authority - Or self-signed ssl on; ssl_certificate /etc/ssl/certs/<cert>.crt; ssl_certificate_key /etc/ssl/private/<key>.key; ssl_session_cache shared:SSL:10m;
  161. 161. Authorization 161 - Authentication alone is not enough. - Once allowed access, the client can do whatever it wants in the cluster. - The simplest form of authorization is to deny specific endpoints location / { auth_basic "Restricted"; auth_basic_user_file /var/data/nginx-elastic/.htpasswd; if ($request_filename ~ _shutdown) { return 403; break; } 1 2 1) If the user requests a shutdown 2) Return 403 curl -X GET -k "esuser:esuser@es-node1:19200/_cluster/nodes/_shutdown/" Produces a 403 Forbidden
  162. 162. Authorization 162 Assign roles using nginx. For example, a read-only user: 1 2 3 1) Listens on port 19500 2) Enables auth 3) Regex match for allowed endpoints 4) Forward to ES <host>:<port> 4 server { listen 19500; auth_basic "Restricted"; auth_basic_user_file /var/data/nginx/.htpasswd_users; location / { return 403; } location ~* ^(/_search|/_analyze) { proxy_pass http://<es_node>; proxy_redirect off; }}
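The `location ~*` block above is an allow-list regex: anything not matching the read-only endpoints falls through to the 403. The same decision, sketched in Python to make it explicit (the request paths are examples):

```python
import re

# Case-insensitive, mirroring nginx's `location ~*` modifier.
ALLOWED = re.compile(r"^(/_search|/_analyze)", re.IGNORECASE)

def authorize(path):
    """Return the status this nginx config would produce for a request path:
    read-only endpoints pass through, everything else gets a 403."""
    return 200 if ALLOWED.match(path) else 403

print(authorize("/_search"))                    # 200
print(authorize("/_cluster/nodes/_shutdown/"))  # 403
```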
  163. 163. Encryption & Co 163 Protecting the data on disk is also essential. LUKS (Linux Unified Key Setup) - encrypts entire block devices - CPUs with AES-NI (Advanced Encryption Standard instruction set) can accelerate dm-crypt - supports a limited number of key slots (passphrases) - keep the keys in a safe place Always audit: - Access logs - Ports - Backups - Physical access
  164. 164. Working with Data – Advanced Operations ● Alias ● Bulk API ● Aggregations ● … 164
  165. 165. Pagination 165 • By default, Elasticsearch will return the first 10 hits of your query. The size parameter is used to specify the number of hits. GET shakespeare/_search?pretty { "size": 20, "query": { "match": { "play_name": "Hamlet"} } } But this is just the first page of hits
  166. 166. Pagination - from 166 • Add the from parameter to a query to specify the offset from the first result you want to fetch (it defaults to 0). GET shakespeare/_search?pretty { "from": 20, "size": 20, "query": { "match": { "play_name": "Hamlet"} } } Get the next page of hits
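from/size is plain offset pagination. The sketch below mimics it with list slicing over made-up hits, which also hints at why deep pages get expensive: the cluster must collect and sort `from + size` hits before discarding the offset:

```python
def paginate(hits, page, size=10):
    """Mimic from/size pagination: page N is an offset of N*size into
    the (already sorted) hit list."""
    start = page * size  # this is the "from" parameter
    return hits[start:start + size]

docs = list(range(95))        # stand-in for sorted search hits
print(paginate(docs, 0, 20))  # first page:  from=0,  size=20
print(paginate(docs, 1, 20))  # second page: from=20, size=20
```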
  167. 167. Pagination - Scroll 167 • While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database • To initiate a scroll search, add the scroll parameter to your search query GET shakespeare/_search?scroll=1m { "size": 1000, "query": { "match_all": {} } } If the scroll is idle for more than 1 minute, then delete it Maximum number of hits to return
  168. 168. Pagination - Scroll 168 • The result from the above request includes the first page of results and a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results. POST /_search/scroll { "scroll" : "1m", "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9la VYtZndUQlNsdDcwakFMNjU1QQ==" } Note that the URL should not include the index name - this is specified in the original search request instead.
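A typical scroll consumer loops until an empty page comes back, passing each returned `_scroll_id` to the next call. A sketch with stubbed callables in place of the two HTTP requests; the page contents and scroll ids are simulated:

```python
def scroll_all(initial_search, scroll_fn):
    """Drain a scroll cursor: issue the initial search, then keep calling
    the scroll endpoint with the returned _scroll_id until a page is empty.
    Both callables stand in for HTTP calls to a real cluster."""
    results = []
    page = initial_search()
    while page["hits"]:
        results.extend(page["hits"])
        page = scroll_fn(page["_scroll_id"])
    return results

# Simulated pages of 2 hits each, then an empty page ending the scroll.
pages = iter([
    {"_scroll_id": "s1", "hits": [1, 2]},
    {"_scroll_id": "s2", "hits": [3, 4]},
    {"_scroll_id": "s3", "hits": []},
])
print(scroll_all(lambda: next(pages), lambda sid: next(pages)))  # [1, 2, 3, 4]
```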
  169. 169. Search multiple fields 169 • The multi_match query provides a convenient shorthand for running a match query against multiple fields ‒ by default, the _score from the best field is used (a best_fields search) GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker", "text_entry" ], "type": "best_fields" } } }' 3 fields are queried (which results in 3 scores) and the best score is used
  170. 170. Search – per-field boosting 170 • If we want to add more weight to hits on a particular field (in this example, let's say we're more interested in the speaker field than play_name), we can boost the score of a field using the caret (^) symbol GET shakespeare/_search?pretty -d '{ "query": { "multi_match": { "query": "Hamlet", "fields": [ "play_name", "speaker^2", "text_entry" ], "type": "best_fields" } } }' We get the same number of hits, but the top hits are different.
  171. 171. Misspelled words - fuzziness 171 • Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word - Fuzziness is something that can be assigned a value - It refers to the number of single-character modifications, known as edits, needed to make two words match - Can be set to 0, 1 or 2, or can be set to "auto" Fuzziness = 1: "Hamled" -> "Hamlet" (d -> t) Fuzziness = 2: "Hamlled" -> "Hamlet" (drop one l, then d -> t)
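Fuzziness is grounded in edit distance. The sketch below computes plain Levenshtein distance to show why "Hamled" needs fuzziness 1 and "Hamlled" needs 2. (Elasticsearch actually uses the Damerau-Levenshtein variant, which also counts a transposition of two adjacent characters as a single edit.)

```python
def edits(a, b):
    """Levenshtein distance: the number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edits("Hamled", "Hamlet"))   # 1: d -> t
print(edits("Hamlled", "Hamlet"))  # 2: drop one l, then d -> t
```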
  172. 172. Add fuzziness to a query 172 GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": "Hamled" } } }' GET shakespeare/_search?pretty -d '{ "query": { "match": { "play_name": { "query": "Hamled", "fuzziness": 1 }} }}' 0 hits 4244 hits
  173. 173. Search exact terms 173 • If we need to search for the exact text, we use the match query, which understands how the field has been analyzed, and search on the keyword field: GET shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry.keyword": "To be, or not to be: that is the question" } } }' Exactly 1 hit
  174. 174. Sorting 174 • The results of a query are returned in order of relevancy; _score descending is the default sorting for a query • A query can contain a sort clause that specifies one or more fields to sort on, as well as the order (asc or desc) GET /shakespeare/_search?pretty -d '{ "query": { "match": { "text_entry": "question" } }, "sort": [ {"play_name": {"order": "desc"} } ] }' "hits" : [ { "_index" : "shakespeare", "_type" : "doc", "_id" : "55924", "_score" : null, "_source" : {.....} If _score is not a field in the sort clause, it is not calculated => fewer compute resources
  175. 175. Highlighting 175 • A common use case for search results is to highlight the matched terms. GET /shakespeare/_search?pretty -d '{ "query": { "match_phrase": { "text_entry": "Hamlet" } }, "highlight": { "fields": { "text_entry": {} } } }' "_source" : { "type" : "line", "line_id" : 36184, "play_name" : "Hamlet", "speech_number" : 99, "line_number" : "5.1.269", "speaker" : "QUEEN GERTRUDE", "text_entry" : "Hamlet, Hamlet!" }, "highlight" : { "text_entry" : [ "<em>Hamlet</em>, <em>Hamlet</em>!" ] } } The response contains a highlight section
  176. 176. Range query 176 • Matches documents with fields that have terms within a certain range. The type of the Lucene query depends on the field type, for string fields, the TermRangeQuery, while for number/date fields, the query is a NumericRangeQuery • The range query accepts the following parameters: gte, gt, lte, lt, boost GET _search { "query": { "range" : { "age" : { "gte" : 10, "lte" : 20 } } } }
  177. 177. Exists query 177 • Returns documents that have at least one non-null value in the original field: • There isn't a missing query, instead use the exists query inside a must_not clause GET /_search { "query": { "exists" : { "field" : "user" } } } GET /_search { "query": { "bool": { "must_not": { "exists": { "field": "user" } } } } }
  178. 178. Wildcard query 178 • Matches documents that have fields matching a wildcard expression; • Supported wildcards are *, which matches any character sequence (including the empty one), and ?, which matches any single character. • Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ? GET shakespeare/_search?pretty -d { "query": { "wildcard" : { "play_name" : "Henry*" } } }
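The same matching semantics can be previewed locally with Python's fnmatch, keeping in mind that the wildcard query runs against indexed terms and must match a whole term, not a substring. The lowercase terms below are illustrative, not from the real index:

```python
import fnmatch

# Illustrative terms; a wildcard query matches whole terms.
terms = ["henry iv", "henry v", "hamlet", "henry viii"]

# * matches any character sequence (including empty), ? a single character.
matches = [t for t in terms if fnmatch.fnmatchcase(t, "henry*")]
print(matches)  # ['henry iv', 'henry v', 'henry viii']
```

This also illustrates the performance warning: with a leading `*` every term would have to be scanned, just like the list comprehension walks every element here.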
  179. 179. Regexp query 179 • The regexp query allows you to use regular expression term queries • The "term queries" in that first sentence means that Elasticsearch will apply the regexp to the terms produced by the tokenizer for that field, and not to the original text of the field • Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow as well as using lookaround regular expressions. GET shakespeare/_search?pretty -d { "query": { "regexp":{ "play_name": "H.*t"} } }
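The term-level behavior matters here: the pattern must match an entire term produced by the analyzer, not a substring of the original text. A quick Python illustration using `re.fullmatch` over illustrative lowercase terms (lowercase because the standard analyzer lowercases tokens):

```python
import re

# Illustrative analyzed terms, not real index contents.
terms = ["hamlet", "henry", "viii", "hart", "tempest"]

# Like the regexp query, the pattern must cover the whole term.
pattern = re.compile(r"h.*t")
print([t for t in terms if pattern.fullmatch(t)])  # ['hamlet', 'hart']
```

Note that "tempest" contains an "h.*t"-like substring nowhere and "henry" only starts with h; neither fully matches, which is exactly the anchored, whole-term semantics of the regexp query.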
  180. 180. Aggregations 180 • Aggregations are a way to perform analytics on your indexed data • There are four main types of aggregations: - Metric: aggregations that keep track and compute metrics over a set of documents. - Bucketing: aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to "fall in" the relevant bucket. - Pipeline: aggregations that aggregate the output of other aggregations and their associated metrics - Matrix: aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting and its functionality is currently experimental
  181. 181. Aggregations - Metric 181 • Most metrics are mathematical operations that output a single value: avg, sum, min, max, cardinality • Some metrics output multiple values: stats, percentiles, percentile_ranks • Example: what's the maximum value of the "age" field GET account/_search?pretty -d '{ "size": 0, "aggs": { "max_age": { "max": { "field": "age" } } } }' "aggregations" : { "max_age" : { "value" : 40.0 } } }
  182. 182. Aggregations - bucket 182 • Bucket aggregations don't calculate metrics over fields like the metrics aggregations do; instead, they create buckets of documents • Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their "parent" bucket aggregation • The terms aggregation is very handy: it will dynamically create a new bucket for every unique term it encounters in the specified field, helping you get a feel for what your data looks like
  183. 183. Aggregations 183 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5 } } } }' • Example: What are the unique play names we have in our index "size" - number of buckets to create (default is 10)
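Conceptually, a terms aggregation is a count per distinct field value, returned as buckets sorted by doc_count. A Python sketch over a handful of made-up documents (the counts below are illustrative, not the real shakespeare index numbers):

```python
from collections import Counter

docs = [{"play_name": "Hamlet"}, {"play_name": "Hamlet"},
        {"play_name": "Macbeth"}, {"play_name": "Othello"},
        {"play_name": "Hamlet"}]

# One bucket per distinct value, like the terms aggregation;
# most_common(5) plays the role of "size": 5.
counts = Counter(d["play_name"] for d in docs)
buckets = [{"key": k, "doc_count": n} for k, n in counts.most_common(5)]
print(buckets)
```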
  184. 184. Aggregations 184 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3045, "sum_other_doc_count" : 91399, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244 }, { "key" : "Coriolanus", "doc_count" : 3992 }, { "key" : "Cymbeline", "doc_count" : 3958 }, { "key" : "Richard III", "doc_count" : 3941 }, { "key" : "Antony and Cleopatra", "doc_count" : 3862 } ]}}} • Notice each bucket has a "key" that represents the distinct value of the field, • and a "doc_count" for the number of docs in the bucket
  185. 185. Nesting buckets 185 GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 1 }, "aggs": { "speakers": { "terms": { "field": "speaker", "size": 5 } } } } } }' The play names are bucketed, then, within each play bucket, our documents are bucketed by speaker.
  186. 186. Nesting buckets 186 "aggregations" : { "play_names" : { "doc_count_error_upper_bound" : 3395, "sum_other_doc_count" : 107152, "buckets" : [ { "key" : "Hamlet", "doc_count" : 4244, "speakers" : { "doc_count_error_upper_bound" : 48, "sum_other_doc_count" : 1698, "buckets" : [ { "key" : "HAMLET", "doc_count" : 1582 }, { "key" : "KING CLAUDIUS", "doc_count" : 594 }, { "key" : "LORD POLONIUS", "doc_count" : 370 The result of our nested aggregation Notice two special values returned in a terms aggregation: - “doc_count_error_upper_bound”: maximum number of missing documents that could potentially have appeared in a bucket - “sum_other_doc_count”: number of documents that do not appear in any of the buckets
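The nesting above is equivalent to a two-level group-by: first bucket by play, then count speakers inside each play bucket. A sketch with made-up (play, speaker) documents:

```python
from collections import Counter, defaultdict

# Illustrative (play, speaker) pairs, not real index contents.
docs = [("Hamlet", "HAMLET"), ("Hamlet", "HAMLET"),
        ("Hamlet", "KING CLAUDIUS"), ("Macbeth", "MACBETH")]

# Outer terms aggregation on play, inner terms aggregation on speaker.
nested = defaultdict(Counter)
for play, speaker in docs:
    nested[play][speaker] += 1

print(nested["Hamlet"].most_common(2))  # [('HAMLET', 2), ('KING CLAUDIUS', 1)]
```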
  187. 187. Bucket sorting 187 • Sorting can be specified using "order": ‒ _count sorts by doc_count (default in terms) ‒ _key sorts alphabetically (default in histogram and date_histogram) • Sorting can also be on a metric value in a nested aggregation GET shakespeare/_search?pretty -d '{ "size": 0, "aggs": { "play_names": { "terms": { "field": "play_name", "size": 5, "order": { "_count": "desc" } } } } }'
  188. 188. 188 Lab 5 Advanced Operation Objectives: Learn how to: o Work with mappings o Work with analyzers Steps: 1. Navigate to /Percona2018/Lab05 2. Read the instructions on Lab05.txt