Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018

Managing your black friday logs
Antonio Bonuccelli
ROME - APRIL 13/14 2018

WHOAMI
• Antonio Bonuccelli
• Support Engineer @ Elastic
• Background in dev and SIEM
• Product supportability

Agenda
• Elastic Stack
• Logging Architectures
• About Shards and Cluster Sizing
• Optimal Bulk Size
• Distribute the Load
• Tips on optimising Disk IO

Elastic Stack
(Diagram: the Elastic Stack: Beats, Logstash, Elasticsearch, Kibana)

Beats
(Diagram: Beats highlighted within the stack; sources: log files, metrics, wire data, your{beat})
‣ Endpoint data collection
‣ Written in Go, low footprint
‣ Framework to build your own

Logstash
(Diagram: Logstash highlighted within the stack; Logstash Nodes (X))
‣ Data Collector/Processor
‣ Powerful ETL
‣ Server Side Component

Elasticsearch
(Diagram: Elasticsearch highlighted within the stack; Master Nodes (3), Data Nodes, Ingest Nodes (X), Machine Learning (X))
‣ Data Platform
‣ Really Fast
‣ HTTP + JSON

Kibana
(Diagram: Kibana highlighted within the stack; Kibana Instances (X))
‣ Data Visualization
‣ Stack Configuration
‣ Elastic Stack UI

Elastic Stack
(Diagram: the full stack. Beats (log files, metrics, wire data, your{beat}); Logstash Nodes (X); Elasticsearch with Master Nodes (3), Data Nodes, Ingest Nodes (X) and Machine Learning (X); Kibana Instances (X))

Elastic Stack
(Diagram: the full stack with X-Pack added to Logstash, Elasticsearch and Kibana)

X-Pack
(Diagram: X-Pack spans Kibana, Elasticsearch, Beats and Logstash)
• Security
• Alerting
• Monitoring
• Reporting
• Graph
• Machine Learning
• SQL
• Kibana Canvas
• Query Profiler
• Remote Management
https://www.elastic.co/products/x-pack

X-Pack will open - v6.3
https://www.elastic.co/blog/doubling-down-on-open

Elastic Cloud
https://www.elastic.co/products/cloud
• SaaS: cloud.elastic.co
• Elastic Cloud Enterprise: run your own cloud!

Elastic APM
• elastic.co/solutions/apm

The Elastic Journey of an Event
(Architecture diagram, built up step by step)
• Beats (log files, metrics, wire data, your{beat}) ship events to Elasticsearch
• Elasticsearch: Master Nodes (3), Ingest Nodes (X), Data Nodes Hot (X), Data Nodes Warm (X), X-Pack
• Kibana Instances (X) with X-Pack: OOTB dashboards
• Logstash Nodes (X) with X-Pack join the pipeline
• More sources: data stores, web APIs, social, sensors
• More outputs: notification, queues, storage, metrics
• Persistent Queues
• Kafka

About Shards and Cluster Sizing

Terminology
(Diagram: cluster my_cluster runs on Server 1 as Node A and holds two indices, twitter and logs, each made of documents d1, d2, ...)

Partition
(Diagram: the same cluster, with the documents of each index partitioned into shards, numbered 0-4 for one index and 0-1 for the other)

Shard: the basic working unit
• Each shard is a Lucene index
• Shards are not free
• Each shard adds some overhead

Not all shards are created equal
Example: index twitter (primary: 1 / rep. factor: 1)
(Diagram sequence: Node A and Node B each hold a copy of documents d1, d2, d6; one copy is the primary, the other the replica)
• Write: goes to the primary
• Replicate: the primary forwards the write to the replica
• Read: can be served by either copy
• If the primary is lost, the replica is promoted to primary
• Cluster health changes from green to yellow

• You can change the # of replicas at any time
PUT /twitter/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}

• You can’t do exactly the same with primaries
(number_of_shards is fixed when the index is created; this settings update is rejected)
{
  "index" : {
    "number_of_shards" : 2
  }
}

Scaling
• More data -> more shards
• But how many shards?

Shard Size
• Generally depends on many different factors
‒ document size, mapping, hardware, use case, kinds of queries being executed, desired response time, peak indexing rate, budget…
• Rules of thumb (logging use case only):
‒ shards have overhead: avoid ending up with a gazillion small (~KB, MB) shards
‒ average shard size in the order of gigabytes
‒ max ~30-40GB per shard

Sizing exercise
• ~1000 events per second
• 60s * 60m * 24h * 1000 events => ~86M events per day
• 1KB per event => ~82GB per day
• 3 months => ~7TB
• Simplification: actual indexed data will take more space
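The arithmetic on this slide can be reproduced with a short Python sketch. It assumes "3 months" means 90 days and uses binary units (1 GB = 1024 * 1024 KB); the function names are illustrative only, and the result covers raw data, not the actual indexed size.

```python
# Raw log volume estimate; ignores indexing overhead and replicas.

def events_per_day(events_per_second: int) -> int:
    """Events produced in 24 hours at a constant rate."""
    return events_per_second * 60 * 60 * 24


def daily_volume_gb(events_per_second: int, event_size_kb: float) -> float:
    """Daily raw volume in GB, using binary units (1 GB = 1024 * 1024 KB)."""
    return events_per_day(events_per_second) * event_size_kb / (1024 * 1024)


def retained_volume_tb(events_per_second: int, event_size_kb: float, days: int) -> float:
    """Raw volume kept for `days` days, in TB (1 TB = 1024 GB)."""
    return daily_volume_gb(events_per_second, event_size_kb) * days / 1024


print(events_per_day(1000))                       # 86400000
print(round(daily_volume_gb(1000, 1), 1))         # 82.4
print(round(retained_volume_tb(1000, 1, 90), 2))  # 7.24
```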

Sizing exercise
• 3 months of logs in cluster my_cluster
• Data size: ~7TB
• Shard size: ~10GB
• Total primary shards: ~716
• Replica factor: 1 -> ~1432 shards in total
• Total store size: ~14TB
• Assuming 16GB heap per node and ~15 shards per GB of heap
• 1432 / (16 x 15) = ~5.97
• Total servers: ~6 (data nodes)
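A minimal sketch of the same shard and node estimate in Python. The figure of 15 shards per GB of heap is taken from the slide's formula and is a rule of thumb, not a hard limit; the function names are hypothetical. Rounding up gives 717 primaries where the slide rounds down to ~716, and the node count comes out the same.

```python
import math


def primary_shards(data_gb: float, shard_size_gb: float) -> int:
    """Primaries needed to hold data_gb at the target shard size (rounded up)."""
    return math.ceil(data_gb / shard_size_gb)


def total_shards(primaries: int, replicas: int) -> int:
    """Primaries plus `replicas` copies of each primary."""
    return primaries * (1 + replicas)


def data_nodes(shards: int, heap_gb: int, shards_per_heap_gb: int) -> int:
    """Data nodes needed if each node carries heap_gb * shards_per_heap_gb shards."""
    return math.ceil(shards / (heap_gb * shards_per_heap_gb))


primaries = primary_shards(7 * 1024, 10)   # ~7TB of data, ~10GB per shard
shards = total_shards(primaries, 1)        # replica factor 1
print(primaries, shards, data_nodes(shards, 16, 15))  # 717 1434 6
```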

More about shard sizing
• https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing
• https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

Time-Based Data
• Logs, social media streams, time-based events
• Timestamp + Data
• Documents do not change
• Typically search for recent events
• Older documents become less important
• Hard to predict the data size
• How do we handle all of this in terms of indices?

Daily Indices (default)
(Diagram: cluster my_cluster gains one index per day: logs-2018-10-19, then logs-2018-10-20, then logs-2018-10-21)

Templates
• Every newly created index whose name starts with 'logs-' will have
‒ 2 shards
‒ 1 replica (for each primary shard)
‒ 60 seconds refresh interval
PUT _template/logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "refresh_interval": "60s"
  }
}
More on that later
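To see how daily index names line up with the `logs-*` pattern, here is a small Python sketch. The helper names are hypothetical, and `fnmatchcase` only approximates Elasticsearch's wildcard matching (it also understands `?` and `[]`, which are not needed here).

```python
from datetime import date
from fnmatch import fnmatchcase

TEMPLATE_PATTERN = "logs-*"  # pattern from the template above


def daily_index_name(day: date, prefix: str = "logs") -> str:
    """Build a daily index name such as logs-2018-10-19."""
    return f"{prefix}-{day:%Y-%m-%d}"


def template_applies(index_name: str, pattern: str = TEMPLATE_PATTERN) -> bool:
    """Approximate check of whether a new index would pick up the template."""
    return fnmatchcase(index_name, pattern)


name = daily_index_name(date(2018, 10, 19))
print(name, template_applies(name))  # logs-2018-10-19 True
```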

Alias
(Diagram: users and the application go through the aliases logs-read and logs-write instead of addressing the daily indices logs-2018-10-19, logs-2018-10-20 and logs-2018-10-21 directly)

Rollover API
• Create a new index when a condition is met
‒ document count
‒ index age OR size
PUT /logs-000001
{
  "aliases": {
    "logs_write": {}
  }
}
# Add > 1000 documents to logs-000001
POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "7d",
    "max_docs": 1000,
    "max_size": "5gb"
  }
}
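The conditions in a rollover request are combined with OR: the index rolls over as soon as any one of them is met. A minimal Python sketch of that decision follows; `IndexStats` and `should_rollover` are illustrative names, not part of any Elasticsearch client, and the defaults mirror the request above.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    """Hypothetical snapshot of the index currently behind the write alias."""
    age_days: float
    docs: int
    size_gb: float


def should_rollover(stats: IndexStats, max_age_days: float = 7,
                    max_docs: int = 1000, max_size_gb: float = 5) -> bool:
    """True as soon as ANY condition is met (age, document count or size)."""
    return (stats.age_days >= max_age_days
            or stats.docs >= max_docs
            or stats.size_gb >= max_size_gb)


print(should_rollover(IndexStats(age_days=1, docs=1001, size_gb=0.1)))  # True
print(should_rollover(IndexStats(age_days=1, docs=10, size_gb=0.1)))    # False
```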

Rollover API
• Today can be automated through Curator
• Tomorrow will be part of Index Lifecycle Management

Do not Overshard
• 3 different logs (access-..., application-..., mysql-...)
• 1 index per day each
• 1GB each
• 5 shards (default)
• 6 months retention
• ~900 shards for just 180GB of data, per log type (~2700 shards for ~540GB across all three)
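The shard count on this slide can be checked with a few lines of Python. This sketch assumes 6 months is about 180 days; `shard_count` is a hypothetical helper, and the monthly single-shard layout at the end is only an illustrative alternative, not a recommendation from the talk.

```python
def shard_count(log_types: int, indices_per_type: int,
                shards_per_index: int, replicas: int = 0) -> int:
    """Total shards for a set of time-based indices."""
    return log_types * indices_per_type * shards_per_index * (1 + replicas)


# Daily indices, 5 primaries each, 180 days of retention
print(shard_count(1, 180, 5))  # 900 primaries for one log type
print(shard_count(3, 180, 5))  # 2700 primaries for all three log types

# Illustrative alternative: monthly indices with 1 primary each
print(shard_count(3, 6, 1))    # 18 primaries
```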

Recovering from a gazillion-shards scenario

Reindexing
• Gazillion indices with small shards
PUT _template/reduce_by_reindex
{
  "template": "logs-*-reindexed",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
POST _reindex
{
  "source": {
    "index": "logs-*"
  },
  "dest": {
    "index": "logs-q1-2018-reindexed"
  }
}

Reindexing by query
POST _reindex
{
  "source": {
    "index": "logs-*",
    "query": {
      "range": {
        "@timestamp": {
          "gte": "now-1M",
          "lte": "now"
        }
      }
    }
  },
  "dest": {
    "index": "logs-q1-2018-reindexed"
  }
}

Cleaning up
#data in logs-2018-04-reindexed
DELETE logs-2018-04-01
…
…
…

(Diagram: cluster my_cluster before and after the cleanup, showing the access-..., application-... and mysql-... indices)

Shrink API
• Shrink an existing index into a new one with fewer primaries
• Index must be marked as read-only
• A copy of every source shard (primary or replica) must be on the same node
PUT /my_source_index/_settings
{
  "settings": {
    "index.routing.allocation.require._name": "my_node",
    "index.blocks.write": true
  }
}

Shrink API
• Target index # of primaries must be a factor of source # of primaries
• 15 primaries? Shrink to 5, 3 or 1
• Example shrinking down to 1 primary with replica factor 1
POST my_source_index/_shrink/my_target_index
{
  "settings": {
    "index.number_of_replicas": 1,
    "index.number_of_shards": 1,
    "index.codec": "best_compression"
  }
}
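The "must be a factor" rule is easy to check before calling the API. A small Python sketch, with `valid_shrink_targets` as an illustrative helper name:

```python
from typing import List


def valid_shrink_targets(source_primaries: int) -> List[int]:
    """Primary counts smaller than the source that divide it evenly."""
    return [n for n in range(1, source_primaries) if source_primaries % n == 0]


print(valid_shrink_targets(15))  # [1, 3, 5]
print(valid_shrink_targets(8))   # [1, 2, 4]
print(valid_shrink_targets(7))   # [1]  (a prime count can only shrink to 1)
```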

Undersharded?
• Remember we write only to primaries
(Diagram: cluster my_cluster with 3 master nodes on servers 1-3 and 4 data nodes on servers 4-7; only data nodes 1 and 2 hold shards, data nodes 3 and 4 sit idle)

Split API
• The inverse operation compared to the Shrink API
• Follows similar requirements
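As a rough illustration of those requirements, the sketch below lists possible split targets. It assumes the 6.x behaviour in which the target primary count must be a multiple of the source count and a factor of the index's `index.number_of_routing_shards` setting; the helper name is hypothetical, so check the Split API documentation for your version.

```python
from typing import List


def valid_split_targets(source_primaries: int, routing_shards: int) -> List[int]:
    """Primary counts larger than the source that are a multiple of the
    source count and evenly divide the number of routing shards."""
    return [n for n in range(source_primaries + 1, routing_shards + 1)
            if n % source_primaries == 0 and routing_shards % n == 0]


# 5 primaries with 30 routing shards
print(valid_split_targets(5, 30))  # [10, 15, 30]
```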

Post Splitting
(Diagram: cluster my_cluster after the split: 3 dedicated master nodes on servers 1-3 and 4 data nodes on servers 4-7, with shards now on all four data nodes)

Scaling reads
• Big data read by 1M users: what happens if we have 2M users?
• 2M users -> replica factor: 2
• 3M users -> replica factor: 3

What is Bulk?
(Diagram: Beats, Logstash or an application sending 1000 log events to Elasticsearch)
• 1000 index requests with 1 document each, or
• 1 bulk request with 1000 documents

Index vs Bulk APIs
PUT twitter/_doc/1
{
  "user" : "antonio",
  "post_date" : "2018-04-14T15:32:12",
  "message" : "I love spaghetti Carbonara"
}
POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
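A bulk body is newline-delimited JSON: one action line, then (for index actions) one source line, with a trailing newline at the end. A minimal Python sketch that builds such a body for index actions only; `bulk_body` is an illustrative helper, not a client library call.

```python
import json
from typing import Dict, Iterable, Tuple


def bulk_body(index: str, docs: Iterable[Tuple[str, Dict]],
              doc_type: str = "_doc") -> str:
    """Build an NDJSON body of index actions from (id, document) pairs."""
    lines = []
    for doc_id, doc in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))  # action line
        lines.append(json.dumps(doc))     # source line
    return "\n".join(lines) + "\n"        # the body must end with a newline


body = bulk_body("test", [("1", {"field1": "value1"}),
                          ("3", {"field1": "value3"})])
print(body, end="")
```

The resulting string would be sent as the request body of `POST _bulk` with the `application/x-ndjson` content type.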

What is the optimal bulk size?
(Diagram: Beats, Logstash or an application sending 1000 log events to Elasticsearch)
‒ 4 * 250?
‒ 1 * 1000?
‒ 2 * 500?

It depends...
• on your application (language, libraries, ...)
• document size (100b, 1kb, 100kb, 1mb, ...)
• number of nodes
• node size
• number of shards
• shards distribution

Test it ;)
(Diagram: Beats, Logstash or an application sending 1000000 log events to Elasticsearch in bulk requests)
‒ 4000 * 250 -> 160s
‒ 1000 * 1000 -> 155s
‒ 2000 * 500 -> 164s
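When testing from your own application, the events need to be cut into batches of the size under test. A small Python sketch of such a batching helper; `batches` is a hypothetical name, and sending and timing each batch is left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(events: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` events."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(events)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# 1000000 events in batches of 250 gives 4000 bulk requests
print(sum(1 for _ in batches(range(1_000_000), 250)))  # 4000
```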

Test it ;)
# Test script
DATE=$(date +%Y.%m.%d)
LOG=logs/logs.txt

# Delete today's index, then time one full Logstash run with the given batch size
exec_test () {
  curl -s -XDELETE "http://USER:PASS@HOST:9200/logstash-$DATE"
  sleep 10
  export SIZE=$1
  time cat /home/ubuntu/dataset.txt | ./bin/logstash -f logstash.conf
}

# 20 runs for each batch size
for SIZE in 100 500 1000 3000 5000 10000; do
  for i in {1..20}; do
    exec_test $SIZE
  done
done

# logstash.conf
input { stdin {} }
filter {}
output {
  elasticsearch {
    hosts => ["10.12.145.189"]
    flush_size => "${SIZE}"
  }
}

In Beats set "bulk_max_size" in the output.elasticsearch section

Test it ;)
• 2 node cluster (m3.large)
‒ 2 vCPU, 7.5GB memory, 1x32GB SSD
• 1 index server (m3.large)
‒ logstash
‒ kibana

# docs    100    500    1000   3000   5000   10000
time (s)  191.7  161.9  163.5  160.7  160.7  161.5

Avoid Bottlenecks
(Diagram: 1000000 log events from Beats, Logstash or an application, sent to a single node vs. round robin across Node 1 and Node 2)

Distributing the Load
• Clients
• Load Balancer
• Coordinating-only Nodes

Clients
• Most client APIs implement round robin
‒ you specify a seed list
‒ the client sniffs the cluster
‒ the client implements different selectors
• Logstash allows an array
‒ it can sniff the cluster
• Beats allows an array (no sniffing)
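What a round-robin selector does can be sketched in a few lines of Python. This is not the code of any official client; `RoundRobinSelector` and the host names are illustrative, and real clients add sniffing and dead-node handling on top.

```python
from itertools import cycle
from typing import Iterable


class RoundRobinSelector:
    """Hand out hosts from a fixed seed list in rotation."""

    def __init__(self, hosts: Iterable[str]) -> None:
        host_list = list(hosts)
        if not host_list:
            raise ValueError("at least one host is required")
        self._hosts = cycle(host_list)

    def next_host(self) -> str:
        """Return the host that should receive the next request."""
        return next(self._hosts)


selector = RoundRobinSelector(["node1:9200", "node2:9200"])
print([selector.next_host() for _ in range(4)])
# ['node1:9200', 'node2:9200', 'node1:9200', 'node2:9200']
```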

Load Balancer
(Diagram: 1000000 log events from Beats, Logstash or an application pass through a load balancer (LB) in front of Node 1 and Node 2)

Coordinating-only Node
(Diagram: 1000000 log events from Beats, Logstash or an application go to a coordinating-only node, which fans out to data node 1 and data node 2)
• Also offloads heavy query-related memory pressure from data nodes

Increasing write throughput - Knobs to turn
• Don’t need data immediately searchable?
‒ increase refresh_interval to 30s or 60s
‒ defaults to 1s
• Heavy-indexing data node (hot)?
‒ consider increasing the indexing buffer (divided among all active shards)
‒ defaults to 10% of total heap
‒ increase index.translog.flush_threshold_size (defaults to 512mb)
• Can afford data loss on node hardware failure?
‒ set index.translog.durability to async (defaults to request)
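The index-level knobs above can be collected into one settings body. A hedged Python sketch: the function name and the "1gb" flush threshold are illustrative assumptions, and the indexing buffer is left out because it is a node-level setting rather than an index setting.

```python
import json
from typing import Dict


def write_heavy_settings(refresh_interval: str = "30s",
                         flush_threshold: str = "1gb",
                         async_translog: bool = False) -> Dict:
    """Build an index settings body tuned for write throughput."""
    settings = {
        "index": {
            "refresh_interval": refresh_interval,
            "translog": {"flush_threshold_size": flush_threshold},
        }
    }
    if async_translog:
        # Only when losing recent writes on a node failure is acceptable
        settings["index"]["translog"]["durability"] = "async"
    return settings


print(json.dumps(write_heavy_settings(refresh_interval="60s"), indent=2))
```

The resulting JSON could be sent to an index's `_settings` endpoint, as in the replica example earlier in the deck.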

We are hiring
• Work with a disruptive technology
• Engineering not an afterthought
• Diverse, inclusive and thriving environment
• High level of independence
• Work from anywhere (yes)
elastic.co/careers
