
Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334-R1) - AWS re:Invent 2018

Amazon Elasticsearch Service (Amazon ES) makes it easy to deploy and use Elasticsearch in the AWS Cloud to search your data and analyze your logs. In this session, you get key insights into Elasticsearch, including how to optimize your expenditure, how to minimize your index sizes to lower costs, and best practices for keeping your data secure. Also hear from youth sports technology company SportsEngine about their experience engineering a member-management product of over 260 million documents on top of Elasticsearch. Relive their harrowing journey through tens of thousands of shards, crushed clusters, mountains of pending tasks, and never-ending snapshots. Hear how they went from disaster to delight with Amazon ES.

  1. Get the Most out of Your Amazon Elasticsearch Service Domain (ANT334)
     Jon Handler, Principal solutions architect, Amazon Web Services
     Andrew Fleener, Senior platform operations manager, SportsEngine
     AJ Stuyvenberg, Lead software engineer, SportsEngine
     © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Elasticsearch: popular, open-source database engine
     • Open source
     • Fast time to value
     • Easy ingestion
     • Easy visualization
     • High performance and distributed
     • Best analytics and search
  3. Built for search and analysis
     • Text search: natural language, Boolean queries, relevance
     • Streaming: high-volume ingest, near real time, distributed storage
     • Analysis: time-based visualizations, nestable statistics, time series tools
  4. What does it do?
     Application data and server, application, network, AWS, and other logs → Amazon Elasticsearch Service domain with index(es) → application users, analysts, DevOps, security
  5. Amazon Elasticsearch Service (Amazon ES)
     Amazon ES is a fully managed service that makes it easy to deploy, manage, and scale Elasticsearch and Kibana
  6. Benefits of Amazon ES
     • Tightly integrated with other AWS services: seamless data ingestion, security, auditing, and orchestration
     • Supports open-source APIs and tools: drop-in replacement with no need to learn new APIs or skills
     • Easy to use: deploy a production-ready Elasticsearch cluster in minutes
     • Scalable: resize your cluster with a few clicks or a single API call
     • Secure: deploy into your VPC and restrict access using security groups and IAM policies
     • Highly available: replicate across Availability Zones, with monitoring and automated self-healing
  7. Three things
     • Optimize storage to reduce cost
     • Use I3 instances at >3 TB of storage
     • Increase refresh_interval to increase throughput capacity
  8. Sizing for success
  9. Sizing correctly decreases cost and improves performance
     Properly sized cluster: improved performance, reduced cost
     Cost dimensions
     • Instance hours
     • Storage
     Cost drivers
     • Amount of data
     • Traffic volume and type
     • Tenancy/concurrency
     • Mapping
  10. Data is stored in indexes, distributed across shards (diagram: a document with an ID and Field: value pairs, an index, and its shards)
  11. In most situations, use 1 replica as well (diagram: a document with an ID and Field: value pairs, an index, and its shards)
  12. Handling storage
      Storage needed = Source * 1.1 * 2 * 7 * 1.15
      • On disk, an index is 1.1 x source
      • Each replica requires that much additional storage (the factor of 2 is the primary plus 1 replica)
      • For streaming data, multiply by the retention period, 7 in this example (ignore for single indexes)
      • Add 15% for overhead
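
The storage rule of thumb on this slide can be sketched as a small calculation. This is a minimal sketch, not an official sizing tool: the function name `estimate_storage_gb` and its parameter names are made up for illustration, and the defaults (1.1 expansion, 15% overhead) are simply the factors quoted on the slide.

```python
def estimate_storage_gb(source_gb, replicas=1, retention_periods=1,
                        index_expansion=1.1, overhead=0.15):
    """Apply the slide's rule: Source * 1.1 * (1 + replicas) * retention * 1.15.

    source_gb         -- size of the source data (per period for streaming data)
    replicas          -- number of replica copies of each primary shard
    retention_periods -- periods of data kept; leave at 1 for a single index
    index_expansion   -- on-disk index size relative to source (slide: 1.1)
    overhead          -- extra headroom as a fraction (slide: 15%)
    """
    copies = 1 + replicas  # the primary plus its replicas
    return source_gb * index_expansion * copies * retention_periods * (1 + overhead)


# Hypothetical example: 100 GB of source per period, 1 replica, 7 periods retained.
streaming_gb = estimate_storage_gb(100, replicas=1, retention_periods=7)
# Hypothetical example: a single 100 GB search index with 1 replica.
search_gb = estimate_storage_gb(100, replicas=1)
print(round(streaming_gb), round(search_gb))
```

With these hypothetical inputs the streaming case comes to roughly 1,771 GB and the single-index case to roughly 253 GB, which shows how strongly retention and replicas dominate the total.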
  13. Handling storage
      Storage needed = Source * 1.1 * 2 * 7 * 1.15
      • Remove unwanted data from your source records!
      • Logstash sends the entire message embedded in each document. Do you need it?
      • Only send what you intend to search
  14. Sizing: test data and conditions
      Testing 3 different data sets in a number of different conditions, with Logstash and by hand
      Conditions
      • Logstash: default template
      • Logstash: default compression
      • Logstash: removed @message
      • By hand baseline: hand-coded
      • No message: remove the @message
      • Best compression: enabled
      • Smallest: indexed TRUE, no @message, best compression, no doc_values, not stored, no field data, no norms, no term vectors
      Datasets
      • Boardgame Geek: text and metadata
      • Jargon: text
      • NASA: Apache weblogs
  16. Handling storage
      Storage needed = Source * 1.1 * 2 * 7 * 1.15
      • Disk usage is driven by your mapping! Disable features that you don't need
      • Disable _source where you can
      • Remove keyword fields from text fields if you don't need them
      • Disable norms for keyword fields
  18. Effective options in reducing index size
      • Remove the @message field: "@message": { "enabled": false }
      • Disable _source where you can: "_source": { "includes": [], "excludes": [] } or "_source": { "enabled": false }
      • Enable best_compression: "_settings" : { "index.codec": "best_compression" }
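
As a rough illustration of how these three options might be combined in one index-creation body, here is a minimal sketch. The helper name `build_index_body` and its parameters are hypothetical. The body uses the `settings` and `mappings` keys of the create-index request; on Elasticsearch versions that still use mapping types, the mapping would have to be nested under a type name, which is exposed here as the optional, assumed `mapping_type` argument.

```python
import json


def build_index_body(drop_message=True, disable_source=False,
                     best_compression=True, mapping_type=None):
    """Build a create-index request body that applies the slide's options."""
    mapping = {}
    if drop_message:
        # Mirrors the slide: the field is accepted but not parsed or indexed.
        mapping["properties"] = {"@message": {"enabled": False}}
    if disable_source:
        # Saves space, but the original document can no longer be returned.
        mapping["_source"] = {"enabled": False}

    body = {}
    if best_compression:
        body["settings"] = {"index": {"codec": "best_compression"}}
    if mapping:
        # Assumption: older versions need the mapping under a type name.
        body["mappings"] = {mapping_type: mapping} if mapping_type else mapping
    return body


# Example: print the JSON that would be sent when creating the index.
print(json.dumps(build_index_body(disable_source=True), indent=2))
```

Disabling `_source` is the most aggressive of the three: features that rely on the stored original document, such as reindexing from the index itself, stop working, so it is left off by default in this sketch.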
  19. Sizing for success: compute
  20. Index patterns and the active index
      Whether you have a streaming use case or a search use case, you'll most often have one index handling most traffic. That's the active index.
      • Streaming: one index per day (logs_11.26.2018, logs_11.25.2018, logs_11.24.2018, logs_11.23.2018, logs_11.22.2018, logs_11.21.2018, logs_11.20.2018)
      • Search: one index (product_catalog)
  21. Instances contain shards from all indexes (diagram: streaming with one index per day; search with one index)
  22. CPU usage for update processing (diagram comparing streaming and search; data sources shown: stream, database)
  23. CPU usage for query processing (diagram comparing streaming and search)
  24. Active shard to CPU ratio should be < 1:1
  25. Total shards not more than 25 per GB of JVM
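
The two shard guidelines above (fewer active shards than CPUs, and at most 25 shards per GB of JVM heap) can be checked with a few lines of arithmetic. This is a minimal sketch: the function name and argument names are hypothetical, and the slides do not say whether the limits apply per instance or across the domain, so the caller is assumed to pass figures for one consistent scope.

```python
def check_shard_limits(active_shards, vcpus, total_shards, jvm_heap_gb):
    """Check a shard layout against the two rules of thumb from the slides."""
    max_total_shards = 25 * jvm_heap_gb  # "not more than 25 per GB of JVM"
    return {
        # "Active shard to CPU ratio should be < 1:1"
        "active_shards_ok": active_shards < vcpus,
        "total_shards_ok": total_shards <= max_total_shards,
        "max_total_shards": max_total_shards,
    }


# Hypothetical figures: 10 active shards on 16 vCPUs, 500 shards, 32 GB of heap.
print(check_shard_limits(active_shards=10, vcpus=16,
                         total_shards=500, jvm_heap_gb=32))
```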
  26. Two ways to set shard count
  27. Use I3 instances
  28. I3s are cost effective for storage and compute
  30. Do you need PIOPS?
      Short answer: No
      • EBS GP2 provides 3 IOPS per GB of deployed storage
      Longer answer: The more indexes that are active concurrently, the more likely you are to need PIOPS.
  31. Increasing refresh_interval buys indexing capacity
  32. What is refresh_interval?
      • Documents are initially indexed in RAM, where they are not yet searchable
      • refresh_interval controls how often buffered documents are written out as a new segment and become searchable (time to searchable)
      • Each refresh creates a segment, and merging segments uses substantial cluster resources
      • Set refresh_interval to -1 to disable refresh
      • A higher refresh_interval buys substantial capacity
  34. Two ways to set refresh_interval
      • You can dynamically change the refresh_interval by PUTting to the index settings
      • You can set a _template so all indexes are created with the refresh_interval
      To support the highest capacity and fastest throughput, e.g. for a large data load, set refresh_interval to -1 and number_of_replicas to 0
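
The two approaches, plus the bulk-load settings, might look like the requests built below. This is a minimal sketch that only constructs the method, path, and JSON body; it does not send anything. The helper names, the template name, and the index pattern are hypothetical. The template body uses `index_patterns`, which is an assumption about the Elasticsearch version in use: older 5.x releases expect a `template` key instead.

```python
def refresh_settings_request(index, refresh_interval):
    """Change refresh_interval on an existing index (dynamic setting)."""
    return ("PUT", f"/{index}/_settings",
            {"index": {"refresh_interval": refresh_interval}})


def refresh_template_request(name, pattern, refresh_interval):
    """Create a template so new matching indexes start with the setting."""
    return ("PUT", f"/_template/{name}",
            {"index_patterns": [pattern],  # assumption: 6.x-style template
             "settings": {"index": {"refresh_interval": refresh_interval}}})


def bulk_load_settings_request(index):
    """Settings from the slide for a large data load: no refresh, no replicas."""
    return ("PUT", f"/{index}/_settings",
            {"index": {"refresh_interval": "-1", "number_of_replicas": 0}})


# Hypothetical usage: print the three requests for a daily logs index.
for request in (refresh_settings_request("logs_11.26.2018", "30s"),
                refresh_template_request("logs", "logs_*", "30s"),
                bulk_load_settings_request("logs_11.26.2018")):
    print(request)
```

After a bulk load with these settings, the refresh interval and replica count would need to be set back to their normal values, since with refresh disabled new documents do not become searchable and with zero replicas there is no redundancy.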
  35. Reduce data transfer costs
      • Bulk responses include one line per data line sent
      • To reduce data transfer out charges, use the filter_path parameter: POST <index>/_bulk?filter_path=-items
      This reduces data out to close to 0. The response shrinks to: { "took": 9, "errors": false }
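
A bulk request that uses this parameter could be assembled as follows. This is a minimal sketch that builds the path and newline-delimited body without sending them; the helper names are hypothetical, and `doc_type` is an assumed optional argument for Elasticsearch versions that require a mapping type on each action line.

```python
import json


def build_bulk_request(index, docs, doc_type=None):
    """Return (path, body) for a bulk index request with items filtered out."""
    action = {"index": {}}
    if doc_type is not None:
        action["index"]["_type"] = doc_type  # assumption: typed mappings in use
    lines = []
    for doc in docs:
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(doc))     # source line
    body = "\n".join(lines) + "\n"        # bulk bodies must end with a newline
    return f"/{index}/_bulk?filter_path=-items", body


def bulk_had_errors(response_text):
    """With items filtered out, only the top-level errors flag is available."""
    return bool(json.loads(response_text)["errors"])


path, body = build_bulk_request("logs_11.26.2018", [{"status": 200}, {"status": 404}])
print(path)
print(body, end="")
print(bulk_had_errors('{ "took": 9, "errors": false }'))
```

The trade-off is that the trimmed response says only whether any item failed, not which one, so a caller that sees `errors: true` would have to retry or resend the batch without the filter to find the failing documents.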
  36. Mini-wrap
      • Optimize storage by reducing data and managing mappings
      • Use I3 instances for best performance and price
      • Increase refresh_interval to increase throughput capacity
  38. SportsEngine
      Youth sports technology platform
      • Thousands of organizations
      • Millions of athletes
      100% on AWS
      • 60+ distinct services
      • 650+ EC2 instances
      • 20+ AWS accounts
  39. Here's what we'll cover
      Challenge 1
      • Tenancy and getting the right performance out of the cluster
      • Multiple iterations
      Challenge 2
      • Avoiding an explosion of fields from dynamic mapping
      Challenge 3
      • Zero-downtime reindexing
  40. How do you organize thousands of athletes?
  45. Workload characteristics
      • Each of 20,000 organizations has a distinct dataset
      • We never need to search more than 1 organization at a time
      • Should we have a single index? Multiple indexes? Multiple clusters?
  46. Where we started (Iteration 1): on-the-fly MySQL tables
      Why it failed:
      • No learning
      • No shared context
      • No shared data
      • No history
      • Downtime as data was rebuilt
      • RDS database / thousands of tables
  47. Iteration 2: Elasticsearch with one index
      Let's use Elasticsearch!
      • ActiveRecord pattern
      • Elasticsearch dynamic field mapping
      • One index with all tenants (Tenant 1, Tenant 2, Tenant n)
      • No real plans for expansion
      • 8 r4.4xlarge (existing ES infra)
  48. Iteration 2: Elasticsearch with one index
      Why it failed epically:
      • Impossible to reindex: no zero-downtime option
      • Mappings effectively permanent; our only option was to add new ones
      • Resistant to change
      This eventually fell over entirely due to the way we were syncing data from the registration product into Amazon ES
  49. Iteration 3: One index per tenant
      • Dynamic mappings for each organization, updated on demand by the organization
      • Support for multiple clusters
      • Money is cheap, time is expensive
      • 8 node r4.4xl shared cluster + 8 node r4.2xl dedicated cluster
      • Allowed us to rebuild data from source with zero downtime
  50. Iteration 3: One index per tenant
      Why it failed epically:
      • Thousands of cluster events eating all of the CPU of our master nodes
      • No writes to data nodes while mappings were being updated, which was almost constant
  51. Document:
        "answers": [
          { "first_name": "AJ" },
          { "last_name": "Stuyvenberg" },
          { "email": "aj@sports.engine" },
          { "qu_el_3909370": 0.5 },
          { "qu_el_3909372": 1.0 }
        ]
      Mapping:
        "properties": {
          "first_name": { "type": "string" },
          "last_name": { "type": "string" },
          "email": { "type": "string" },
          "qu_el_3909370": { "type": "decimal" },
          "qu_el_3909372": { "type": "decimal" }
        }
  52. Iteration 3.5: Static mapping files
      • Reduce mappings to their generic types for custom data
      • Use must-match queries to match records
      • Only use the insert-mapping API to add mappings; avoid dynamic mapping entirely
  53. Document:
        "answers": [
          { "key": "first_name-string", "string_value": "AJ" },
          { "key": "last_name-string", "string_value": "Stuyvenberg" },
          { "key": "email-string", "string_value": "aj@sports.engine" },
          { "key": "qu_el_3909370-decimal", "decimal_value": 0.5 }
        ]
      Mapping:
        "properties": {
          "key": { "type": "keyword" },
          "string_value": { "type": "keyword", "normalizer": "lower" },
          "decimal_value": { "type": "double" }
        }
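
The move from per-field mappings to generic key/value entries can be sketched as a small transform. This is an illustrative sketch only, not SportsEngine's code: the function name `to_generic_answers` is made up, and it covers just the two value types shown on the slides (strings and decimals), following the slide's `<name>-<type>` key convention.

```python
def to_generic_answers(answers):
    """Convert [{field: value}, ...] into generic key/value answer entries.

    Every custom field collapses onto the same three mapped fields
    (key, string_value, decimal_value), so new custom fields never
    add new entries to the index mapping.
    """
    generic = []
    for answer in answers:
        for name, value in answer.items():
            if isinstance(value, str):
                generic.append({"key": f"{name}-string", "string_value": value})
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                generic.append({"key": f"{name}-decimal",
                                "decimal_value": float(value)})
            else:
                raise TypeError(f"unsupported value type for field {name!r}")
    return generic


# The document from the dynamic-mapping slide, converted to the static shape.
dynamic_answers = [
    {"first_name": "AJ"},
    {"last_name": "Stuyvenberg"},
    {"email": "aj@sports.engine"},
    {"qu_el_3909370": 0.5},
    {"qu_el_3909372": 1.0},
]
for entry in to_generic_answers(dynamic_answers):
    print(entry)
```

The design point is that the mapping stays fixed at three fields no matter how many distinct questions organizations define, which removes the constant mapping updates described for Iteration 3.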
  54. Iteration 3.5
      PM: Great! Let's roll it out to everyone!
  55. Iteration 3.5 burns under load (charts annotated "Bad!" and "Also bad!")
  56. Iteration 3.5 post mortem
      Why it failed epically:
      • 40,000 shards in 2 clusters
      • Cost went through the roof. Money can't solve every problem
      • A bug in ES 5 can break snapshots for indices with high shard counts
  57. Iteration 4: Massive migration
      • Support for combined indices for multiple tenants
      • Single index for large tenants
      • Blue/green deployments to move orgs between indices
      • Third cluster for the new indices (8 node r4.2xlarge cluster)
      • Combined tenant indices massively shrinking the shard count (diagram: one index holding Tenant 1, Tenant 2, Tenant …, Tenant n; another holding Tenant 1, Tenant …, Tenant m)
  58. Zero-downtime reindexing
      Step 1: Lock the tenant
      • Prevent writes and re-queue
      • Allow reads from applications
  59. Zero-downtime reindexing
      Step 2: Hydrate the destination index
      • Determine destination index location
      • Aggregate data from several source applications
      • Use bulk-index API to insert data into Elasticsearch
      • Writes to the previous index are still disabled
  60. Zero-downtime reindexing
      Step 3: Unlock the tenant
      • Update the tenant_assignment with the new index
      • Remove tenant data from previous index
      • Process queued write operations in order
      • Reads and writes are enabled
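
The three steps above can be sketched with an in-memory stand-in for the cluster. This is a toy model under stated assumptions, not SportsEngine's implementation: the `TenantRouter` class, its method names, and the dictionary-backed "indices" are all hypothetical, and the hydrate step takes ready-made documents where the real system aggregates data from source applications and uses the bulk API.

```python
from collections import deque


class TenantRouter:
    """Toy model of lock, hydrate, unlock for moving a tenant between indices."""

    def __init__(self):
        self.assignment = {}  # tenant -> index currently serving it
        self.indices = {}     # index -> {tenant: {doc_id: doc}}
        self.locked = set()   # tenants whose writes are being queued
        self.queued = {}      # tenant -> deque of pending (doc_id, doc)

    def assign(self, tenant, index):
        self.assignment[tenant] = index
        self.indices.setdefault(index, {}).setdefault(tenant, {})

    def _apply(self, tenant, doc_id, doc):
        self.indices[self.assignment[tenant]][tenant][doc_id] = doc

    def write(self, tenant, doc_id, doc):
        if tenant in self.locked:
            # Step 1: writes are prevented and re-queued while locked.
            self.queued.setdefault(tenant, deque()).append((doc_id, doc))
            return "queued"
        self._apply(tenant, doc_id, doc)
        return "written"

    def read(self, tenant, doc_id):
        # Reads keep working from the current assignment throughout.
        return self.indices[self.assignment[tenant]][tenant].get(doc_id)

    def lock(self, tenant):
        self.locked.add(tenant)

    def hydrate(self, tenant, destination, source_docs):
        # Step 2: build the tenant's data in the destination index.
        self.indices.setdefault(destination, {})[tenant] = dict(source_docs)

    def unlock(self, tenant, destination):
        # Step 3: repoint, clean up, replay queued writes in order, unlock.
        previous = self.assignment[tenant]
        self.assignment[tenant] = destination
        if previous != destination:
            self.indices[previous].pop(tenant, None)
        pending = self.queued.pop(tenant, deque())
        while pending:
            self._apply(tenant, *pending.popleft())
        self.locked.discard(tenant)
```

A hypothetical run would call `lock("org1")`, then `hydrate("org1", "combined_1", docs)`, then `unlock("org1", "combined_1")`; any write issued between lock and unlock is returned as "queued" and lands in the new index once the tenant is unlocked.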
  61. Success!!!
      • Happy customers
      • Happy Ops
      • Happy Devs
      • 2 old giant clusters deprovisioned
      • New cluster significantly smaller
      50 indices - 1 cluster - 500 shards - 260 million documents - 1.3 TB of data
  62. Architecture recommendations
      • Use multiple indices, but not too many
      • Use static mappings, avoid dynamic mappings
      • Use index aliasing to support blue/green reindexing
      • Application-level support to move tenants between indices
      • Use Amazon Elasticsearch Service to add data nodes for write capacity, and replica shards for read capacity
      • Always have an odd number of master nodes (at least 3)
      • Use dedicated master nodes for production-scale deployments
      • Consider investing in multi-cluster application support to ease upgrades and experiments
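
The index-aliasing recommendation for blue/green reindexing usually comes down to one atomic alias update. The sketch below builds the body for a `POST /_aliases` request without sending it; the helper name and the alias and index names are hypothetical.

```python
import json


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases that repoints an alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names: the application always queries the "members" alias.
print(json.dumps(alias_swap_actions("members", "members_blue", "members_green"),
                 indent=2))
```

Because both actions travel in one request, clients querying the alias see either the old index or the new one, never a moment with no index behind it.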
  63. Thank you!