Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Elasticsearch In Netflix
Danny Yuan, Jae Bae
Welcome
Hashtag: #ES_in_Netflix
@Elasticsearch - Elasticsearch
!

@stonse - Sudhir Tonse
!

@g9yuayon - Danny Yuan
!

@meta...
Who Are We?
Who Are We?
Software engineers in Netflix’s
Platform Engineering team,
working on large scale data
infrastructure
Who Are We?
Software engineers in Netflix’s
Platform Engineering team,
working on large scale data
infrastructure
Building ...
Why Are We Here?
Why Are We Here?
How We Use Elasticsearch
Why Are We Here?
How We Use Elasticsearch
Why Elasticsearch
Why Are We Here?
How We Use Elasticsearch
Why Elasticsearch
How We Run Elasticsearch
Why Are We Here?
How We Use Elasticsearch
Why Elasticsearch
How We Run Elasticsearch
To Seek Your Feedback
How We Use Elasticsearch
Querying Log Events
Tracking Service Deployments
Querying Log Events
A Little Historical Perspective
Netflix is a log generating company
that also happens to stream movies
- Adrian Cockroft

photo credit: http://www.flickr.co...
A Humble Beginning
A Humble Beginning
A Humble Beginning
A Humble Beginning
Things Changed
Application

Application

Application
Application

Application

Application

Application

Application

Application

Applic...
70,000,000,000
1,500,000
Making Sense of Billions of Events
So We Evolved
So We Evolved
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
So We Evolved

hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
select * from log_events where dateint=2014...
Field Name

Field Value

Client

“API”

Server

“Cryptex”

StatusCode

200

ResponseTime

73
Log data
Server Farm

Log data
Server Farm
Log Collectors

Log data
Server Farm
What Could Go Wrong?
You thought parallelization would save the day?
Think again
You thought parallelization would save the day?
Think again
What Is Missing?
Interactive Exploration
Functional Requirements
Arbitrary Boolean Queries
Aggregated Query
- Top N Query
- Trend
- Distribution
Non-Functional Requirements
- Interactive (response within seconds)
!

- Quickly locates the right log events

- Minimal p...
It’s All about Extracting Small Data
Out of Big Data
Now Back to the Use Case
Intelligent Alerts
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
Guided Debugging in the Right Context
A Useful Pattern
Aggregated Query -> Individual Query
Examples
- S3 diagnostics
!

- Tracking email campaigns 

-	 Request traces
Status:200
RequestId Parent Id Node Id Service Name

Status

4965-4a74

0

123

Edge Service

200

4965-4a74

123

456

Ga...
Edge Service (456) ---> Gateway (789)

Data Name

Value

Request ID

4965-4a74

Response Time

25 ms

Endpoints

/rest/ser...
Why Elasticsearch?
Automatic Sharding and Replication
Flexible Schema
Flexible Schema
- Schemaless
Flexible Schema
- Schemaless
- Reasonable defaults
Nice Extension Model
Nice Extension Model
- Customizable REST Actions
Nice Extension Model
- Customizable REST Actions

- Site Plugins
Nice Extension Model
- Customizable REST Actions

- Site Plugins
- River Plugins
Nice Extension Model
- Customizable REST Actions

- Site Plugins
- River Plugins
- Discovery Module
Ecosystem - Plugins, Kibana
Tracking Service Deployments
!

{ edda }
Built by Netflix Monitoring Eng Team
Built by Netflix Monitoring Eng Team
Tracks History and Changes to Service
Deployments

Built by Netflix Monitoring Eng Team
Tracks History and Changes to Service
Deployments

Keeps Many Revisions
Built by Netflix Monitoring Eng Team
Tracks History and Changes to Service
Deployments

Keeps Many Revisions
Tracks Dozens ...
Why Elasticsearch?
Schemas may change at any time
Schemas may change at any time
Go schemaless
Users may search for any combination of fields

Users may search for any combination of fields

This is what search engine is designed for
Users often needs only a few fields
Users often needs only a few fields
Projection via “fields” query
Need range queries on date and revisions
Need range queries on date and revisions
Natively supported by Elasticsearch
Need range queries on date and revisions
Natively supported by Elasticsearch
Route by document ID
Running ES in Netflix
Operational Challenges
Operational Challenges
Back pressure when indexing
Operational Challenges
Back pressure when indexing
Diverse configurations and data
Operational Challenges
Back pressure when indexing
Diverse configurations and data
Dynamic flow of log events
Operational Challenges
Back pressure when indexing
Diverse configurations and data
Dynamic flow of log events
Needs extensiv...
Operational Challenges
Back pressure when indexing
Diverse configurations and data
Dynamic flow of log events
Needs extensiv...
Favor Pulling Over Pushing
Choose Config with Data
Integrating ES
AMI for Deployment by Asgard
Archaius for Configuration
Eureka for Server Discovery
Suro for Data Delivery
Servo for Monitoring Metrics
Zone-aware Replication
Multi-region Deployment
Multi-region Deployment
Discovery over Cassandra

Region-aware replication
Favor Index Rolling Over TTL
Favor Index Rolling Over TTL
A dedicated service manages index rolling

Uses index template and routing
Worth Trying G1
Worth Trying G1
Not recommended by ES team, but

Worth Trying G1
Not recommended by ES team, but

Has fewer and shorter GC pauses

Worth Trying G1
Not recommended by ES team, but

Has fewer and shorter GC pauses

Occasional SIGSEGV, but it’s okay
Simple Majority for Master Election
Simple Majority for Master Election
Split-brain problem
Simple Majority for Master Election
Split-brain problem


discovery.zen.minimum_master_nodes

Simple Majority for Master Election
Split-brain problem


discovery.zen.minimum_master_nodes

Dynamically updated
Future Work
Future Work
Automatic incremental backup and restore
Future Work
Automatic incremental backup and restore


Auto scaling

Future Work
Automatic incremental backup and restore


Auto scaling

Fully automated deployment

Future Work
Automatic incremental backup and restore


Auto scaling

Fully automated deployment

Support more use cases
We’re Hiring
Thank You!
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Elasticsearch in Netflix
Upcoming SlideShare
Loading in …5
×

Elasticsearch in Netflix

17,737 views

Published on

Slides for the Elasticsearch Meetup in Netflix

Published in: Technology

Elasticsearch in Netflix

  1. 1. Elasticsearch In Netflix Danny Yuan, Jae Bae
  2. 2. Welcome Hashtag: #ES_in_Netflix @Elasticsearch - Elasticsearch ! @stonse - Sudhir Tonse ! @g9yuayon - Danny Yuan ! @metacret - Jae Bae
  3. 3. Who Are We?
  4. 4. Who Are We? Software engineers in Netflix’s Platform Engineering team, working on large scale data infrastructure
  5. 5. Who Are We? Software engineers in Netflix’s Platform Engineering team, working on large scale data infrastructure Building and operating Netflix’s cloud real-time query service
  6. 6. Why Are We Here?
  7. 7. Why Are We Here? How We Use Elasticsearch
  8. 8. Why Are We Here? How We Use Elasticsearch Why Elasticsearch
  9. 9. Why Are We Here? How We Use Elasticsearch Why Elasticsearch How We Run Elasticsearch
  10. 10. Why Are We Here? How We Use Elasticsearch Why Elasticsearch How We Run Elasticsearch To Seek Your Feedback
  11. 11. How We Use Elasticsearch
  12. 12. Querying Log Events Tracking Service Deployments
  13. 13. Querying Log Events
  14. 14. A Little Historical Perspective
  15. 15. Netflix is a log generating company that also happens to stream movies - Adrian Cockroft photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/o/in/photostream/
  16. 16. A Humble Beginning
  17. 17. A Humble Beginning
  18. 18. A Humble Beginning
  19. 19. A Humble Beginning
  20. 20. Things Changed
  21. 21. Application Application Application Application Application Application Application Application Application Application
  22. 22. 70,000,000,000
  23. 23. 1,500,000
  24. 24. Making Sense of Billions of Events
  25. 25. So We Evolved
  26. 26. So We Evolved
  27. 27. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  28. 28. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  29. 29. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  30. 30. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket
  31. 31. So We Evolved hgrep -C 10 -k 5,2,3 'users.*[1-9]{3}' *catalina.out s3//bucket select * from log_events where dateint=20140101
  32. 32. Field Name Field Value Client “API” Server “Cryptex” StatusCode 200 ResponseTime 73
  33. 33. Log data Server Farm Log data Server Farm Log Collectors Log data Server Farm
  34. 34. What Could Go Wrong?
  35. 35. You thought parallelization would save the day? Think again
  36. 36. You thought parallelization would save the day? Think again
  37. 37. What Is Missing?
  38. 38. Interactive Exploration
  39. 39. Functional Requirements Arbitrary Boolean Queries Aggregated Query - Top N Query - Trend - Distribution
  40. 40. Non-Functional Requirements - Interactive (response within seconds) ! - Quickly locates the right log events
 - Minimal programming effort
  41. 41. It’s All about Extracting Small Data Out of Big Data
  42. 42. Now Back to the Use Case
  43. 43. Intelligent Alerts
  44. 44. Guided Debugging in the Right Context
  45. 45. Guided Debugging in the Right Context
  46. 46. Guided Debugging in the Right Context
  47. 47. Guided Debugging in the Right Context
  48. 48. Guided Debugging in the Right Context
  49. 49. Guided Debugging in the Right Context
  50. 50. Guided Debugging in the Right Context
  51. 51. Guided Debugging in the Right Context
  52. 52. A Useful Pattern
  53. 53. Aggregated Query -> Individual Query
  54. 54. Examples - S3 diagnostics ! - Tracking email campaigns 
 - Request traces
  55. 55. Status:200 RequestId Parent Id Node Id Service Name Status 4965-4a74 0 123 Edge Service 200 4965-4a74 123 456 Gateway 200 4965-4a74 456 789 Service A 200 4965-4a74e 456 abc Service B 200
  56. 56. Edge Service (456) ---> Gateway (789) Data Name Value Request ID 4965-4a74 Response Time 25 ms Endpoints /rest/service Status Code 200
  57. 57. Why Elasticsearch?
  58. 58. Automatic Sharding and Replication
  59. 59. Flexible Schema
  60. 60. Flexible Schema - Schemaless
  61. 61. Flexible Schema - Schemaless - Reasonable defaults
  62. 62. Nice Extension Model
  63. 63. Nice Extension Model - Customizable REST Actions
  64. 64. Nice Extension Model - Customizable REST Actions - Site Plugins
  65. 65. Nice Extension Model - Customizable REST Actions - Site Plugins - River Plugins
  66. 66. Nice Extension Model - Customizable REST Actions - Site Plugins - River Plugins - Discovery Module
  67. 67. Ecosystem - Plugins, Kibana
  68. 68. Tracking Service Deployments
  69. 69. ! { edda }
  70. 70. Built by Netflix Monitoring Eng Team
  71. 71. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments

  72. 72. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments
 Keeps Many Revisions
  73. 73. Built by Netflix Monitoring Eng Team Tracks History and Changes to Service Deployments
 Keeps Many Revisions Tracks Dozens of Document Types
  74. 74. Why Elasticsearch?
  75. 75. Schemas may change at any time
  76. 76. Schemas may change at any time Go schemaless
  77. 77. Users may search for any combination of fields

  78. 78. Users may search for any combination of fields
 This is what search engine is designed for
  79. 79. Users often needs only a few fields
  80. 80. Users often needs only a few fields Projection via “fields” query
  81. 81. Need range queries on date and revisions
  82. 82. Need range queries on date and revisions Natively supported by Elasticsearch
  83. 83. Need range queries on date and revisions Natively supported by Elasticsearch Route by document ID
  84. 84. Running ES in Netflix
  85. 85. Operational Challenges
  86. 86. Operational Challenges Back pressure when indexing
  87. 87. Operational Challenges Back pressure when indexing Diverse configurations and data
  88. 88. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events
  89. 89. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events Needs extensive monitoring and alerting
  90. 90. Operational Challenges Back pressure when indexing Diverse configurations and data Dynamic flow of log events Needs extensive monitoring and alerting Tolerating outage at different scales
  91. 91. Favor Pulling Over Pushing
  92. 92. Choose Config with Data
  93. 93. Integrating ES
  94. 94. AMI for Deployment by Asgard
  95. 95. Archaius for Configuration
  96. 96. Eureka for Server Discovery
  97. 97. Suro for Data Delivery
  98. 98. Servo for Monitoring Metrics
  99. 99. Zone-aware Replication
  100. 100. Multi-region Deployment
  101. 101. Multi-region Deployment Discovery over Cassandra
 Region-aware replication
  102. 102. Favor Index Rolling Over TTL
  103. 103. Favor Index Rolling Over TTL A dedicated service manages index rolling
 Uses index template and routing
  104. 104. Worth Trying G1
  105. 105. Worth Trying G1 Not recommended by ES team, but

  106. 106. Worth Trying G1 Not recommended by ES team, but
 Has fewer and shorter GC pauses

  107. 107. Worth Trying G1 Not recommended by ES team, but
 Has fewer and shorter GC pauses
 Occasional SIGSEGV, but it’s okay
  108. 108. Simple Majority for Master Election
  109. 109. Simple Majority for Master Election Split-brain problem
  110. 110. Simple Majority for Master Election Split-brain problem 
 discovery.zen.minimum_master_nodes

  111. 111. Simple Majority for Master Election Split-brain problem 
 discovery.zen.minimum_master_nodes
 Dynamically updated
  112. 112. Future Work
  113. 113. Future Work Automatic incremental backup and restore
  114. 114. Future Work Automatic incremental backup and restore 
 Auto scaling

  115. 115. Future Work Automatic incremental backup and restore 
 Auto scaling
 Fully automated deployment

  116. 116. Future Work Automatic incremental backup and restore 
 Auto scaling
 Fully automated deployment
 Support more use cases
  117. 117. We’re Hiring
  118. 118. Thank You!

×