Elasticsearch in Netflix

15,684 views

Published on

Slides for the Elasticsearch Meetup in Netflix

Published in: Technology

Elasticsearch in Netflix

  1. 1. Elasticsearch In Netflix Danny Yuan, Jae Bae
  2. 2. Welcome Hashtag: #ES_in_Netflix @Elasticsearch - Elasticsearch ! @stonse - Sudhir Tonse ! @g9yuayon - Danny Yuan ! @metacret - Jae Bae
  3. 3. Who Are We?
  4. 4. Who Are We? Software engineers in Netflix’s Platform Engineering team, working on large scale data infrastructure
  5. 5. Who Are We? Software engineers in Netflix’s Platform Engineering team, working on large scale data infrastructure Building and operating Netflix’s cloud real-time query service
  6. 6. Why Are We Here?
  7. 7. Why Are We Here? How We Use Elasticsearch
  8. 8. Why Are We Here? How We Use Elasticsearch Why Elasticsearch
  9. 9. Why Are We Here? How We Use Elasticsearch Why Elasticsearch How We Run Elasticsearch
  10. 10. Why Are We Here? How We Use Elasticsearch Why Elasticsearch How We Run Elasticsearch To Seek Your Feedback
  11. 11. How We Use Elasticsearch
  12. 12. Querying Log Events Tracking Service Deployments
  13. 13. Querying Log Events
  14. 14. A Little Historical Perspective
  15. 15. Netflix is a log generating company that also happens to stream movies - Adrian Cockroft photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/
  16. 16. A Humble Beginning
  17. 17. A Humble Beginning
  18. 18. A Humble Beginning
  19. 19. A Humble Beginning
  20. 20. Things Changed
  21. 21. Application Application Application Application Application Application Application Application Application Application
  22. 22. 70,000,000,000
  23. 23. 1,500,000
  24. 24. Making Sense of Billions of Events
  25. 25. So We Evolved
  26. 26. So We Evolved
  27. 27. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  28. 28. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  29. 29. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  30. 30. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  31. 31. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket select * from log_events where dateint=20140101
  32. 32. Field Name Field Value Client “API” Server “Cryptex” StatusCode 200 ResponseTime 73
  33. 33. Log data Server Farm Log data Server Farm Log Collectors Log data Server Farm
  34. 34. What Could Go Wrong?
  35. 35. You thought parallelization would save the day? Think again
  36. 36. You thought parallelization would save the day? Think again
  37. 37. What Is Missing?
  38. 38. Interactive Exploration
  39. 39. Functional Requirements Arbitrary Boolean Queries Aggregated Query - Top N Query - Trend - Distribution
  40. 40. Non-Functional Requirements - Interactive (response within seconds) ! - Quickly locates the right log events
 - Minimal programming effort
  41. 41. It’s All about Extracting Small Data Out of Big Data
  42. 42. Now Back to the Use Case
  43. 43. Intelligent Alerts
  44. 44. Guided Debugging in the Right Context
  45. 45. Guided Debugging in the Right Context
  46. 46. Guided Debugging in the Right Context
  47. 47. Guided Debugging in the Right Context
  48. 48. Guided Debugging in the Right Context
  49. 49. Guided Debugging in the Right Context
  50. 50. Guided Debugging in the Right Context
  51. 51. Guided Debugging in the Right Context
  52. 52. A Useful Pattern
  53. 53. Aggregated Query -> Individual Query
  54. 54. Examples - S3 diagnostics ! - Tracking email campaigns 
 - Request traces
  55. 55. Status:200 RequestId Parent Id Node Id Service Name Status 4965-4a74 0 123 Edge Service 200 4965-4a74 123 456 Gateway 200 4965-4a74 456 789 Service A 200 4965-4a74e 456 abc Service B 200
  56. 56. Edge Service (456) ---> Gateway (789) Data Name Value Request ID 4965-4a74 Response Time 25 ms Endpoints /rest/service Status Code 200
  57. 57. Why Elasticsearch?
  58. 58. Automatic Sharding and Replication
  59. 59. Flexible Schema
  60. 60. Flexible Schema - Schemaless
  61. 61. Flexible Schema - Schemaless - Reasonable defaults
  62. 62. Nice Extension Model
  63. 63. Nice Extension Model - Customizable REST Actions
  64. 64. Nice Extension Model - Customizable REST Actions - Site Plugins
  65. 65. Nice Extension Model - Customizable REST Actions - Site Plugins - River Plugins
  66. 66. Nice Extension Model - Customizable REST Actions - Site Plugins - River Plugins - Discovery Module
  67. 67. Ecosystem - Plugins, Kibana
  68. 68. Tracking Service Deployments
  69. 69. ! { edda }
  70. 70. Built by Netflix Monitoring Eng Team
  71. 71. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments

  72. 72. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments
 Keeps Many Revisions
  73. 73. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments
 Keeps Many Revisions Tracks Dozens of Document Types
  74. 74. Why Elasticsearch?
  75. 75. Schemas may change at any time
  76. 76. Schemas may change at any time Go schemaless
  77. 77. Users may search for any combination of fields

  78. 78. Users may search for any combination of fields
 This is what search engine is designed for
  79. 79. Users often needs only a few fields
  80. 80. Users often needs only a few fields Projection via “fields” query
  81. 81. Need range queries on date and revisions
  82. 82. Need range queries on date and revisions Natively supported by Elasticsearch
  83. 83. Need range queries on date and revisions Natively supported by Elasticsearch Route by document ID
  84. 84. Running ES in Netflix
  85. 85. Operational Challenges
  86. 86. Operational Challenges Back pressure when indexing
  87. 87. Operational Challenges Back pressure when indexing Diverse configurations and data
  88. 88. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events
  89. 89. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events Needs extensive monitoring and alerting
  90. 90. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events Needs extensive monitoring and alerting Tolerating outage at different scales
  91. 91. Favor Pulling Over Pushing
  92. 92. Choose Config with Data
  93. 93. Integrating ES
  94. 94. AMI for Deployment by Asgard
  95. 95. Archaius for Configuration
  96. 96. Eureka for Server Discovery
  97. 97. Suro for Data Delivery
  98. 98. Servo for Monitoring Metrics
  99. 99. Zone-aware Replication
  100. 100. Multi-region Deployment
  101. 101. Multi-region Deployment Discovery over Cassandra
 Region-aware replication
  102. 102. Favor Index Rolling Over TTL
  103. 103. Favor Index Rolling Over TTL A dedicated service manages index rolling
 Uses index template and routing
  104. 104. Worth Trying G1
  105. 105. Worth Trying G1 Not recommended by ES team, but

  106. 106. Worth Trying G1 Not recommended by ES team, but
 Has fewer and shorter GC pauses

  107. 107. Worth Trying G1 Not recommended by ES team, but
 Has fewer and shorter GC pauses
 Occasional SIGSEGV, but it’s okay
  108. 108. Simple Majority for Master Election
  109. 109. Simple Majority for Master Election Split-brain problem
  110. 110. Simple Majority for Master Election Split-brain problem 
 discovery.zen.minimum_master_nodes

  111. 111. Simple Majority for Master Election Split-brain problem 
 discovery.zen.minimum_master_nodes
 Dynamically updated
  112. 112. Future Work
  113. 113. Future Work Automatic incremental backup and restore
  114. 114. Future Work Automatic incremental backup and restore 
 Auto scaling

  115. 115. Future Work Automatic incremental backup and restore 
 Auto scaling
 Fully automated deployment

  116. 116. Future Work Automatic incremental backup and restore 
 Auto scaling
 Fully automated deployment
 Support more use cases
  117. 117. We’re Hiring
  118. 118. Thank You!

×