real-time log search &
analysis
ELKstack@weibo.com
about me
• Perler, SA @ weibo.com, renren.com,
china.com...
• Writer of 《网站运维技术与实践》 (Website Operations: Technology and Practice)
• Translator of 《 Puppet 3 Cookbook 》
• weibo account : @ARGV
agenda
• ELKstack situation
• ELKstack usecase
• from ELK to ERK
• performance tuning of LERK
ERK situation
• datanode * 26:
• 2.4GHz*8, 42G RAM, 300G*10 RAID5
• logtype * 25, 7 days, 65 billion events, 60k fields
• size 8TB/day, indexing 190k eps
• rsyslog/logstash * 10
• custom plugins for rsyslog/logstash/kibana
• users: qa team, app/server dev team, are team
• ops: ME*0.8
kopf
cluster stats monitoring & settings modification
bigdesk
real-time node stats
zabbix trapper
monitors and alerts on ELK KPIs
But, Why ELK ?
First, what can log do?
• Identify problem
• data-driven develop/test/operate
• audit
• Laws of Marcus J. Ranum
• Monitor
• Monitoring is the aggregation of health and performance data, events,
and relationships delivered via an interface that provides an holistic view
of a system's state to better understand and address failure scenarios.
@etsy
difficulties of LA(1)
• timestamp + data = log
• OK, what happened between 23:12 and 23:29
yesterday?
difficulties of LA(2)
• text is unstructured data
difficulties of LA(3)
• grep/awk only run on a single host
difficulties of LA(4)
• complex formats are hard to visualize
So...
• We need a real-time big-data search platform.
• But Splunk is expensive.
• So: OSS, please.
ELKstack Beginner
Hello World
# bin/logstash -e 'input{stdin{}}output{stdout{codec=>rubydebug}}'
Hello World
{
       "message" => "Hello World",
      "@version" => "1",
    "@timestamp" => "2014-08-07T10:30:59.937Z",
          "host" => "raochenlindeMacBook-Air.local"
}
How Powerful
• $ ./bin/logstash -e 'input{generator{count=>100000000000}}output{stdout{codec=>dots}}' | pv -abt > /dev/null
• 15.1MiB 0:02:21 [ 112kiB/s]
How scaling
Talk is cheap,
show me the case!
application log by php
logstash.conf
Kibana3
backend devs and ops use it to identify errors in APIs and apps
and Kibana4
OK, K4 still needs a prettier color scheme for now
PHP slowlog
after multiline codec
ops use it to check PHP slow-function stacks by IDC and by host
drill-down one host
Nginx errorlog
grok {
    match => { "message" => "(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: \*\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$" }
}
mutate {
    gsub => [ "errmsg", "too large body: \d+ bytes", "too large body" ]
}
if [errinfo] {
    ruby {
        code => "event.append(Hash[event['errinfo'].split(', ').map{|l| l.split(': ')}])"
    }
}
grok {
    match => { "request" => '"%{WORD:verb} %{URIPATH:urlpath}(?:\?%{NGX_URIPARAM:urlparam})?(?: HTTP/%{NUMBER:httpversion})"' }
}
kv {
    prefix => "url_"
    source => "urlparam"
    field_split => "&"
}
date {
    locale => 'en'
    match => [ "datetime", "yyyy/MM/dd HH:mm:ss" ]
}
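The filter/ruby one-liner above turns the comma-separated errinfo tail into event fields. The same trick standalone, in plain Ruby (the sample errinfo string is invented for illustration):

```ruby
# Split "key: value, key: value" pairs into a hash, as the filter/ruby
# code above does with event['errinfo'].
errinfo = 'client: 1.2.3.4, server: localhost, host: "api.example.com"'
fields  = Hash[errinfo.split(', ').map { |l| l.split(': ') }]
# fields['client'] => "1.2.3.4"
```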
performance tuning and troubleshooting based
on multi dimensions reports
comparing the top-N lists over a different time range
app crash
app devs focus on crash stacks, with system functions filtered out.
New release, Ad-hoc filter, Focus
crash
Query helper for QA and NOC decreases MTTI for complaints
H5 devs focus on the performance
timeline of index.html
probability distribution of response
time
no more average, no more guess
from ELK to ERK
someone's children
My Poor Child
WHY?
compare
logstash
• Design : multithreads + SizedQueue
• Lang : JRuby
• Syntax : DSL
• ENV : jre1.7
• Queue : rely on external system
• regexp : ruby
• output : java to ES
• plugin : 182
• monitor : NO!
rsyslog
• multithreads + mainQ
• C
• rainerscript
• within rhel6
• async queue
• ERE
• HTTP to ES
• 57
• pstats
problem of Logstash
• poor performance of input/syslog: use input/tcp + filter/grok instead
• poor performance of filter/geoip: we developed filter/geoip2
• high CPU cost of filter/grok: use filter/ruby with split instead
• OOM in input/tcp (prior to 1.4.2)
• OOM in output/elasticsearch (prior to 1.5.0)
• retry in output/elasticsearch duplicates the SizedQueue retry in stud (still open)
problem of LogStash(1)
• LogStash::Inputs::Syslog
• logstash pipeline :
• input thread
-> filterworker threads * Num
-> output thread
• But What's in Inputs::Syslog :
• TCPServer/accept
-> client thread -> filter/grok -> filter/date
-> filterworker threads
• So grok and date run in a single thread!
• A pure TCPServer can process 50k eps, but only 6k after filter/grok, and just 700 after filter/date!
problem of LogStash(1)
• LogStash::Inputs::Syslog
• Solution:
input {
tcp { port => 514 }
}
filter {
grok { match => ["message", "%{SYSLOGLINE}"] }
syslog_pri { }
date { match => ["timestamp", "ISO8601"] }
}
• 30k eps in `logstash -w 20` testing.
problem of LogStash(2)
• LogStash::Filters::Grok
• What's Grok:
• pre-defined patterns: NUMBER \d+
use %{NUMBER:score} instead of (?<score>\d+)
• regexp cost LOTS of CPU.
problem of LogStash(2)
• LogStash::Filters::Grok
• solution:
• avoid grok, if you can add a separator to your log format:
filter {
ruby {
init => "@kname = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']"
code => "event.append(Hash[@kname.zip(event['message'].split('|'))])"
}
mutate {
convert => ["request_time", "float"]
}
}
• Result: CPU utilization drops by about 20%
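What the init/code pair above computes, standalone in plain Ruby (the field names come from the config above; the sample log line is made up):

```ruby
# Zip a fixed field-name list with a '|'-separated log line into a hash,
# as the filter/ruby init/code pair above does with event['message'].
kname   = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']
message = '2015-05-20 12:00:00|12345|api|uid|ios|1.2.3.4|0.052|/2/statuses/friends_timeline'
fields  = Hash[kname.zip(message.split('|'))]
# fields['request_time'] => "0.052" (then convert to float via mutate, as above)
```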
problem of LogStash(3)
• LogStash::Filters::GeoIP
• 7k eps, even with `logstash -w 30`
• The new MaxMindDB format has a great
performance improvement. But LogStash can't
distribute it for some license reason.
problem of LogStash(3)
• LogStash::Filters::GeoIP
• solution:
• use MaxMind::DB::Writer, change the internal
ip.db into ip.mmdb, 300MB->50MB
• JRuby can java_import maxminddb-java.
• 28k eps with LogStash::Filters::MaxMindDB
problem of LogStash(4)
• LogStash::Outputs::Elasticsearch
• 3 bugs so far:
1. OOM in logstash 1.4.2 (ftw-0.0.39)
2. retry by Manticore (logstash 1.5.0beta1) duplicated the stud retry in the pipeline, which could cause an infinite resend loop
3. logstash 1.5.0rc1 can't record the 429 code; who knows what "got response of . source:" means?
• 1 and 3 were fixed in the latest logstash 1.5.0rc3.
problem of LogStash(5)
• LogStash::Pipeline
• no supervisor for filterworkers. If every filterworker dies from an exception, logstash blocks but stays alive!
• If you reference `event['field']` in filter/ruby as introduced before, check the field first:
if [url] {
ruby { code => "event['urlpath']=event['url'].split('?')[0]" }
}
problem of LogStash(6)
• LogStash::Pipeline
• new events produced by `yield` should go through the remaining filters, but actually went straight to the output thread (prior to logstash 1.5.0).
• `yield` is used in filter-split and filter-clone
Rsyslog tuning
• action with linkedlist queues
• imfile with an appropriate persistStateInterval (avoids too many duplicates after restart)
• omfwd with a small rebindInterval (when the target sits behind LVS)
• an appropriate global maxMessageSize
• appropriate queue.size and queue.highwatermark
• CEE log format recommended, used together with mmjsonparse
• separator-based log formats can be processed with mmfields
• make the best use of rainerscript
• concat JSON strings with the property replacer
• developed rsyslog-mmdblookup for IP lookup
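A minimal rainerscript sketch combining several of the settings above (the file path, target host, and queue sizes are illustrative, not our production values):

```
module(load="imfile")
input(type="imfile" file="/var/log/app/access.log" tag="app:"
      persistStateInterval="1000")              # limit duplicates after restart
main_queue(queue.size="1000000" queue.highwatermark="800000")
action(type="omfwd" target="lb.example.com" port="514" protocol="tcp"
       rebindInterval="10000"                   # reconnect periodically behind an LVS VIP
       queue.type="linkedlist" queue.size="100000")
```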
problem of rsyslog(1)
• I found an experimental `foreach` in rsyslog 8.7, great! But when processing my JSON-array logs from apps, I hit 3 bugs:
1. foreach doesn't check the type of its parameters;
2. action() doesn't copy the msg, only a reference. If you omfwd each item inside foreach, it crashes... the test suite only uses omfile, which is synchronous.
3. omelasticsearch has an uninitialized variable when the errorfile option is enabled.
There will be a new copymsg option for action() in rsyslog 8.10, expected to ship around May 20.
problem of rsyslog(2)
• Not many message-modification plugins.
• mmexternal could fork too many subprocesses in v8 (but not in v7), and its processing speed is only 2k eps!
• We finished a new rsyslog-mmdblookup plugin; it goes to production on May 15.
input( type="imtcp" port="514" )
template( name="clientlog" type="list" ) {
    constant(value="{\"@timestamp\":\"")  property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")      property(name="hostname")
    constant(value="\",\"mmdb\":")        property(name="!iplocation")
    constant(value=",")                   property(name="$.line" position.from="2")
}
ruleset( name="clientlog" ) {
    action(type="mmjsonparse")
    if ($parsesuccess == "OK") then {
        foreach ($.line in $!msgarray) do {
            if ($.line!rtt == "-") then {
                set $.line!rtt = 0;
            }
            set $.line!urlpath = field($.line!url, 63, 1);
            set $.line!urlargs = field($.line!url, 63, 2);
            set $.line!from = "";
            if ( $.line!urlargs != "***FIELD NOT FOUND***" ) then {
                reset $.line!from = re_extract($.line!urlargs, "from=([0-9]+)", 0, 1, "");
            } else {
                unset $.line!urlargs;
            }
            action(type="mmdb" key=".line!clientip" fields=["city","isp","country"] mmdbfile="./ip.mmdb")
            action(type="omelasticsearch" server="1.1.1.1" bulkmode="on"
                   template="clientlog" queue.size="10000" queue.dequeuebatchsize="2000")
        }
    }
}
if ($programname startswith "mweibo_client") then {
    call clientlog
    stop
}
ES tuning
• DO NOT believe the articles online!!
• DO test with your own dataset; start from one node, one index, one shard, zero replicas.
• use unicast with a bigger fd.ping_timeout
• doc_values, doc_values, doc_values!!!
• increase the gateway, recovery and allocation settings
• increase refresh_interval and flush_threshold_size
• increase store.throttle.max_bytes_per_sec
• upgrade to at least 1.5.1
• scale out: use index.routing.allocation.total_shards_per_node
• use bulk! no multithreaded clients, no async
• use curator for _optimize
• no _all for fixed-format logs
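A minimal ES 1.x index-template sketch pulling the mapping points above together (the template name/pattern are illustrative; apply via PUT /_template/...):

```json
{
  "template": "logstash-*",
  "settings": {
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [{
        "strings_as_doc_values": {
          "match_mapping_type": "string",
          "mapping": { "index": "not_analyzed", "doc_values": true }
        }
      }]
    }
  }
}
```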
problem of ES(1)
• OOM:
• Kibana3 uses facet_filter, which means lots of hits in the QUERY phase.
• There is a circuit breaker in newer versions, so you may see errors like:
Data too large, data for field [@timestamp] would be larger than limit of [639015321/609.4mb]]
problem of ES(1)
• OOM:
• solution:
• doc_values,doc_values,doc_values!
• No more heap needed, 31GB is enough.
problem of ES(2)
• very long downtime during relocation and recovery.
• default strategy:
• recovery starts immediately after restart
• only one shard relocates at a time
• throttled to 20MB/s
• a replica copies all files from its primary shard!
problem of ES(2)
• very long downtime during relocation and recovery.
• solution:
• gateway.*: recover only after the cluster has enough nodes
• cluster.routing.allocation.*: higher concurrency
• indices.recovery.*: higher limits
• red to yellow: 20 min for a full restart.
• Note: a bug may block the recovery process in the translog phase (prior to 1.5.1).
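The corresponding elasticsearch.yml fragment (values match the full config shown later in this deck):

```yaml
# wait for most nodes before starting recovery, then recover aggressively
gateway.recover_after_nodes: 30
gateway.recover_after_time: 5m
gateway.expected_nodes: 30
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
indices.recovery.max_bytes_per_sec: 2gb
indices.recovery.concurrent_streams: 30
```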
problem of ES(3)
• new nodes die.
• default shard-allocation strategy:
• tries to balance the total shard count per node.
• no new shard if disk usage is over 90%.
• On the second day after scaling out, all new shards get allocated to the new node! That means it takes all the indexing load.
problem of ES(3)
• new nodes die.
• solution:
1. finish relocation before the next new index is created.
2. set index.routing.allocation.total_shards_per_node
• note 1: set a slightly larger value, to leave room for recovery after a fault...
• note 2: do NOT set this on old indices; your new node is busy enough.
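For example, via the settings API on each newly created index (the index name is illustrative; 3 matches the value in the full config later in this deck):

```
PUT /logstash-mweibo-2015.05.21/_settings
{ "index.routing.allocation.total_shards_per_node": 3 }
```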
problem of ES(4)
• async replica
• CPU util% rises violently if one segment deviates; async replication does NOT validate the indexed data.
• ES will remove the async replication parameter.
ES performance(1)
• 429, 429, 429...
• one "client_net_fatal_error" log line may be larger than 1MB.
• the max HTTP body ES accepts is 100MB. Be careful with bulk_size.
ES performance(2)
• index size is several times the raw message size.
• _source: the raw JSON
• _all: terms from every field, for full-text search
• multi-field: a .raw sub-field for every field in the logstash template
• So:
• no _all for nginx accesslogs.
• no _source for metrics/tsdb logs.
• no analyzed fields for most fields; only the raw message is analyzed.
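A sketch of per-type mapping overrides implementing the three rules above (the type names are illustrative):

```json
{
  "nginx_accesslog": { "_all": { "enabled": false } },
  "metrics_tsdb":    { "_source": { "enabled": false } },
  "generic_log": {
    "properties": {
      "message": { "type": "string", "index": "analyzed" }
    }
  }
}
```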
ES performance(3)
• constant CPU utilization from segment merging (merge threads show in hot_threads forever).
• max segment: 5GB
• min segment: 2MB
• increase refresh_interval (default 1s) and the flush threshold size (default 200MB).
cluster.name: es1003
cluster.routing.allocation.node_initial_primaries_recoveries: 30
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.enable: all
node.name: esnode001
node.master: false
node.data: true
node.max_local_storage_nodes: 1
index.routing.allocation.total_shards_per_node : 3
index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
index.number_of_shards: 26
index.number_of_replicas: 1
index.translog.flush_threshold_size : 5000mb
index.translog.flush_threshold_ops: 50000
index.search.slowlog.threshold.query.warn: 30s
index.search.slowlog.threshold.fetch.warn: 1s
index.indexing.slowlog.threshold.index.warn: 10s
indices.store.throttle.max_bytes_per_sec: 1000mb
indices.cache.filter.size: 10%
indices.fielddata.cache.size: 10%
indices.recovery.max_bytes_per_sec: 2gb
indices.recovery.concurrent_streams: 30
path.data: /data1/elasticsearch/data
path.logs: /data1/elasticsearch/logs
bootstrap.mlockall: true
http.max_content_length: 400mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"
gateway.type: local
gateway.recover_after_nodes: 30
gateway.recover_after_time: 5m
gateway.expected_nodes: 30
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 100s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.19.0.97","10.19.0.98","10.19.0.99"]
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
problem of ES(1)
• different result in search and store:
curl es.domain.com:9200/logstash-accesslog-2015.04.03/nginx/_search?q=_id:AUx-QvSBS-dhpiB8_1f1\&pretty -d '{
    "fields": ["requestTime"],
    "script_fields" : {
        "test1" : {
            "script" : "doc[\"requestTime\"].value"
        },
        "test2" : {
            "script" : "_source.requestTime"
        },
        "test3" : {
            "script" : "doc[\"requestTime\"].value * 1000"
        }
    }
}'
NOT schema free!
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [ {
"_index" : "logstash-accesslog-2015.04.03",
"_type" : "nginx",
"_id" : "AUx-QvSBS-dhpiB8_1f1",
"_score" : 1.0,
"fields" : {
"test1" : [ 4603039107142836552 ],
"test3" : [ -8646911284551352000 ],
"requestTime" : [ 0.54 ],
"test2" : [ 0.54 ],
}
} ]
}
problem of ES(2)
• some data can't be found!
• ES requires the same mapping type for a given field name within the same _type of the same index.
• My "client_net_fatal_error" log data changed after one release:
• {"reqhdr":{"Host":"api.weibo.cn"}}
• {"reqhdr":"{\"Host\":\"api.weibo.cn\"}"}
• Set the mapping of the "reqhdr" object to {"enabled":false}: the string can then only be viewed in the _source JSON, not searched.
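The mapping change looks roughly like this (index and type names are illustrative):

```
PUT /logstash-mweibo-2015.05.20/_mapping/client_net_fatal_error
{
  "client_net_fatal_error": {
    "properties": {
      "reqhdr": { "type": "object", "enabled": false }
    }
  }
}
```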
problem of ES(3)
• some data can't be found! Again!
• There was a default `ignore_above: 256` setting in the logstash template.
curl 10.19.0.100:9200/logstash-mweibo-2015.05.18/mweibo_client_crash/_search?q=_id:AU1ltyTCQC8tD04iYBIe\&pretty -d '{
    "fielddata_fields" : ["jsoncontent.content", "jsoncontent.platform"],
    "fields" : ["jsoncontent.content","jsoncontent.platform"]
}'
...
"fields" : {
"jsoncontent.content" : [ "dalvik.system.NativeStart.main(Native Method)nCaused by:
java.lang.ClassNotFoundException: Didn't find class "com.sina.weibo.hc.tracking.manager.TrackingService" on path:
DexPathList[[zip file "/data/app/com.sina.weibo-1.apk", zip file "/data/data/com.sina.weibo/code_cache/secondary-
dexes/com.sina.weibo-1.apk.classes2.zip", zip
file "/data/data/com.sina.weibo/app_dex/dbcf1705b9ffbc30ec98d1a76ada120909.jar"],nativeLibraryDirectories=[/data/a
pp-lib/com.sina.weibo-1, /vendor/lib, /system/lib]]" ],
"jsoncontent.platform" : [ "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI", "Android_4.4.4_MX4 Pro_Weibo_5.3.0
Beta_WIFI" ]
}
kibana custom develop
• upgrade the elastic.js version in K3 to support the ES 1.2 API, so we can use the aggs API to implement new panels (percentile panel, range panel, and cardinality histogram panel).
• "export as csv" for table panel.
• map provider setting for bettermap.
• term_stats for map.
• china map.
• query helper.
• script field for terms panel.
• OR filtering.
• more in <https://github.com/chenryn/kibana>
see also
•《Elasticsearch Server (2nd edition)》
•《Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management》
•《Data Analysis with Open Source Tools》
•《Web Operations: Keeping the Data On Time》
•《The Art of Capacity Planning》
•《大规模 Web 服务开发技术》 (Large-Scale Web Service Development)
•https://codeascraft.com/
•http://calendar.perfplanet.com
•http://kibana.logstash.es
"If a newbie has a bad time, it's a bug."
– Jordan Sissel @logstash.net
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

ELK stack at weibo.com

  • 1. real-time log search & analysis ELKstack@weibo.com
  • 2. about me • Perler, SA @ weibo.com, renren.com, china.com... • Writer of 《网站运维技术与实践》 • Translator of 《 Puppet 3 Cookbook 》 • weibo account : @ARGV
  • 3. agenda • ELKstack situation • ELKstack usecase • from ELK to ERK • performance tuning of LERK
  • 4. ERK situation • datanode * 26: • 2.4Ghz*8, 42G, 300G *10 RAID5 • logtype * 25 , 7days , 65 billion events , 60k fields • size 8TB /day , indexing 190k eps • rsyslog/logstash * 10 • custom plugins of rsyslog/logstash/kibana • user : qa team, app/server dev team, are team • ops : ME*0.8
  • 5. kopf stats monitoring & settings modification
  • 7. zabbix trapper monitors and alerts on ELK KPIs
  • 9. First, what can log do? • Identify problems • data-driven develop/test/operate • audit • Laws of Marcus J. Ranum • Monitor • Monitoring is the aggregation of health and performance data, events, and relationships delivered via an interface that provides a holistic view of a system's state to better understand and address failure scenarios. @etsy
  • 10. difficulties of LA(1) • timestamp + data = log • OK, what happened between 23:12 and 23:29 yesterday?
  • 11. difficulties of LA(2) •text is un-structured data
  • 12. difficulties of LA(2) •grep/awk only run at single host
  • 13. difficulties of LA(3) • complex formats are hard to visualize effectively
  • 14. So... • We need a real-time big- data search platform. • But, splunk is expensive. • So, spell OSS pls.
  • 16. Hello World # bin/logstash -e 'input{stdin{}}output{stdout{codec=>rubydebug}}' Hello World { "message" => "Hello World", "@version" => "1", "@timestamp" => "2014-08-07T10:30:59.937Z", "host" => "raochenlindeMacBook-Air.local" }
  • 17. How Powerful • $ ./bin/logstash -e 'input{generator{count=>100000000000}}output{stdout{codec=>dots}}' | pv -abt > /dev/null • 15.1MiB 0:02:21 [ 112kiB/s]
  • 19. Talk is cheap, show me the case!
  • 22. Kibana3 • backend dev and ops use it to identify errors of APIs and apps
  • 23. and Kibana4 • OK, K4 still needs a prettier color scheme for now
  • 25. after multiline codec • ops use it to check PHP slow-function stacks across IDCs and hosts
  • 28. grok {
    match => { "message" => "(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: *\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$" }
  }
  mutate {
    gsub => [ "errmsg", "too large body: \d+ bytes", "too large body" ]
  }
  if [errinfo] {
    ruby { code => "event.append(Hash[event['errinfo'].split(', ').map{|l| l.split(': ')}])" }
  }
  grok {
    match => { "request" => '"%{WORD:verb} %{URIPATH:urlpath}(?:\?%{NGX_URIPARAM:urlparam})?(?: HTTP/%{NUMBER:httpversion})"' }
  }
  kv { prefix => "url_" source => "urlparam" field_split => "&" }
  date { locale => 'en' match => [ "datetime", "yyyy/MM/dd HH:mm:ss" ] }
  • 29. performance tuning and troubleshooting based on multi-dimension reports
  • 30. different topN results in another time range
  • 31. app crash • app dev focus on crash stacks, from which system library functions were filtered out.
  • 32. New release, Ad-hoc filter, Focus crash
  • 33. Query helper for QA and NOC, decrease MTTI for complaints
  • 34. H5 devs focus on the performance timeline of index.html
  • 35. probability distribution of response time no more average, no more guess
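  The point of this slide can be sketched in a few lines of Python (the response-time samples below are made up for illustration, not real weibo data): look at percentiles of the whole distribution instead of one average, and the "typical" request stops being hidden behind outliers.

```python
import statistics

# Hypothetical response times in seconds (illustrative, not real data).
samples = [0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.08, 0.30, 2.50]

mean = statistics.mean(samples)
# quantiles(n=100) returns the 1st..99th percentiles.
p = statistics.quantiles(samples, n=100)
p50, p95, p99 = p[49], p[94], p[98]

# The single 2.5s outlier drags the mean far above the median,
# which is exactly why "no more average" is the slogan here.
print(f"mean={mean:.3f}s p50={p50:.3f}s p95={p95:.3f}s p99={p99:.3f}s")
```

  With the full percentile array from two time ranges you can also run a distribution test (t-test, Shapiro-Wilk) to judge whether a peak-hour shift is normal load growth or an anomaly.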
  • 36. from ELK to ERK
  • 39. WHY?
  • 40. compare Logstash vs rsyslog • Design: multithreads + SizedQueue vs multithreads + mainQ • Lang: JRuby vs C • Syntax: DSL vs rainerscript • ENV: jre1.7 vs within rhel6 • Queue: rely on external system vs async queue • regexp: ruby vs ERE • output: java to ES vs HTTP to ES • plugins: 182 vs 57 • monitor: NO! vs pstats
  • 41. problems of Logstash • poor performance of Input/syslog, use input/tcp + filter/grok instead; • poor performance of Filter/geoip, so we developed filter/geoip2 • high CPU cost of Filter/grok, use filter/ruby with split by myself • OOM in Input/tcp (prior to 1.4.2) • OOM in Output/elasticsearch (prior to 1.5.0) • retry in Output/elasticsearch duplicates the SizedQueue retry in stud (as of now)
  • 42. problem of LogStash(1) • LogStash::Inputs::Syslog • logstash pipeline: • input thread -> filterworker threads * Num -> output thread • But what's inside Inputs::Syslog: • TCPServer/accept -> client thread -> filter/grok -> filter/date -> filterworker threads • So grok and date run in only one thread! • A pure TCPServer can process 50k qps, but only 6k after filter/grok, and then 700 after filter/date!
  • 43. problem of LogStash(1) • LogStash::Inputs::Syslog • Solution: input { tcp { port => 514 } } filter { grok { match => ["message", "%{SYSLOGLINE}"] } syslog_pri { } date { match => ["timestamp", "ISO8601"] } } • 30k eps in `logstash -w 20` testing.
  • 44. problem of LogStash(2) • LogStash::Filters::Grok • What's Grok: • pre-defined patterns: NUMBER \d+, so you write %{NUMBER:score} instead of (?<score>\d+) • regexp costs LOTS of CPU.
  • 45. problem of LogStash(2) • LogStash::Filters::Grok • solution: • avoid grok if you can define a separator for your log format: filter { ruby { init => "@kname = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']" code => "event.append(Hash[@kname.zip(event['message'].split('|'))])" } mutate { convert => ["request_time", "float"] } } • Result: CPU util reduced by about 20%
  • 46. problem of LogStash(3) • LogStash::Filters::GeoIP • 7k eps, even with `logstash -w 30` • The new MaxMindDB format brings a great performance improvement, but LogStash can't distribute it for license reasons.
  • 47. problem of LogStash(3) • LogStash::Filters::GeoIP • solution: • use MaxMind::DB::Writer, change the internal ip.db into ip.mmdb, 300MB->50MB • JRuby can java_import maxminddb-java. • 28k eps with LogStash::Filters::MaxMindDB
  • 48. problem of LogStash(4) • LogStash::Outputs::Elasticsearch • 3 bugs as of now: 1. OOM in logstash1.4.2 (ftw-0.0.39) 2. retry by Manticore (logstash1.5.0beta1) duplicated stud's retry in the pipeline, which could cause an infinite resend loop 3. logstash1.5.0rc1 can't record the 429 code; who knows what the "got response of . source:" message means? • 1 and 3 were solved in the newest logstash1.5.0rc3.
  • 49. problem of LogStash(5) • LogStash::Pipeline • no supervisor for filterworkers. If all filterworkers die on exceptions, logstash blocks but stays alive! • If you use filter/ruby to reference `event['field']` as introduced before, check the field first! if [url] { ruby { code => "event['urlpath']=event['url'].split('?')[0]" } }
  • 50. problem of LogStash(6) • LogStash::Pipeline • a new event created by `yield` should go through the remaining filters, but it went straight to the output thread (prior to logstash1.5.0). • yield is used in filter-split and filter-clone
  • 51. Rsyslog tuning • action with linkedlist • imfile with an appropriate persistStateInterval (avoid too many duplicates after restart) • omfwd with a small rebindinterval (when the target is behind LVS) • an appropriate global.maxmessagesize • an appropriate queue.size and queue.highwatermark • recommended CEE log format, used with mmjsonparse • separator-based log formats can be processed with mmfields • make the best use of rainerscript • concat JSON strings with the property replacer • developed rsyslog-mmdblookup for ip lookup
  • 52. problem of rsyslog(1) • I found an experimental `foreach` in rsyslog8.7, great! But when I processed my JSON array logs from apps, there were 3 bugs: 1. foreach doesn't check the type of its parameters; 2. action() doesn't copy msg, only a reference. If you omfwd each item in foreach, crash... The test suite only uses omfile, which is synchronous. 3. omelasticsearch has an uninitialized variable when the errorfile option is enabled. There will be a new copymsg option for action() in rsyslog8.10, expected to be published on May 20.
  • 53. problem of rsyslog(2) • Not so many message modification plugins. • mmexternal could fork too many subprocesses in v8 (but not in v7). And the processing speed is 2k eps! • We have finished a new rsyslog-mmdblookup plugin; it will run in the production env on May 15.
  • 54. input( type="imtcp" port="514" )
  template( name="clientlog" type="list" ) {
    constant(value="{\"@timestamp\":\"")
    property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")
    property(name="hostname")
    constant(value="\",\"mmdb\":")
    property(name="!iplocation")
    constant(value=",")
    property(name="$.line" position.from="2")
  }
  ruleset( name="clientlog" ) {
    action(type="mmjsonparse")
    if ($parsesuccess == "OK") then {
      foreach ($.line in $!msgarray) {
        if ($.line!rtt == "-") then { set $.line!rtt = 0; }
        set $.line!urlpath = field($.line!url, 63, 1);
        set $.line!urlargs = field($.line!url, 63, 2);
        set $.line!from = "";
        if ( $.line!urlargs != "***FIELD NOT FOUND***" ) then {
          reset $.line!from = re_extract($.line!urlargs, "from=([0-9]+)", 0, 1, "");
        } else {
          unset $.line!urlargs;
        }
        action(type="mmdb" key=".line!clientip" fields=["city","isp","country"] mmdbfile="./ip.mmdb")
        action(type="omelasticsearch" server="1.1.1.1" bulkmode="on" template="clientlog"
               queue.size="10000" queue.dequeuebatchsize="2000")
      }
    }
  }
  if ($programname startswith "mweibo_client") then { call clientlog stop }
  • 55. ES tuning • DO NOT believe the articles online!! • DO test with your own dataset; start from one node, one index, one shard, zero replicas. • use unicast with a bigger fd.ping_timeout • doc_values, doc_values, doc_values!!! • increase the gateway, recovery and allocation settings • increase refresh_interval and flush_threshold_size • increase store.throttle.max_bytes_per_sec • upgrade to 1.5.1 at least • scale: use total_shards_per_node • use bulk! no multithreaded clients, no async • use curator for _optimize • no _all for fixed-format logs
  • 56. problem of ES(1) • OOM: • Kibana3 uses facet_filter, which means lots of hits in the QUERY phase. • There is a circuit breaker in newer versions, so you may see the following error: Data too large, data for field [@timestamp] would be larger than limit of [639015321/609.4mb]]
  • 57. problem of ES(1) • OOM: • solution: • doc_values,doc_values,doc_values! • No more heap needed, 31GB is enough.
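  A minimal mapping fragment (field names illustrative, ES 1.x syntax) showing how doc_values is switched on for not_analyzed fields so fielddata lives on disk instead of the JVM heap, which is what keeps a 31GB heap sufficient:

```json
{
  "mappings": {
    "nginx": {
      "properties": {
        "@timestamp": { "type": "date",    "doc_values": true },
        "status":     { "type": "integer", "doc_values": true },
        "clientip":   { "type": "string",  "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}
```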
  • 58. ES stability problem (2) • long downtime during relocation and recovery. • default strategy: • recovery starts immediately after restart • only one shard relocation at a time • limit 20MB • a replica needs to copy all files from the primary shard!
  • 59. ES stability problem (2) • long downtime during relocation and recovery. • solution: • gateway.*: recover only after the cluster has enough nodes • cluster.routing.allocation.*: larger concurrency • indices.recovery.*: larger limits • red to yellow: 20 min for a full restart. • Note: there is a bug that may cause the recovery process to block in the translog phase (prior to 1.5.1).
  • 60. problem of ES(3) • new nodes die. • default strategy of shard allocation: • try to balance the total shard count per node. • no new shard if the disk is over 90% full. • On the second day after scaling, all new shards get allocated to the new node! That means it takes all the indexing load.
  • 61. ES stability problem (3) • new nodes die. • solution: 1. finish relocation before the creation of the next new index. 2. set index.routing.allocation.total_shards_per_node • note1: set a slightly larger value, in case of recovery after a fault... • note2: DO NOT set this on old indices; your new node is busy now.
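  Applying that cap to one new index might look like this (index name illustrative, ES 1.x `_settings` API):

```
curl -XPUT 'es.domain.com:9200/logstash-accesslog-2015.05.20/_settings' -d '{
  "index.routing.allocation.total_shards_per_node": 3
}'
```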
  • 62. problem of ES(4) • async replica • CPU util% rises violently if one segment has some deviation; async replication does NOT validate the indexed data. • ES will remove the async replication parameter.
  • 63. ES performance(1) • 429, 429, 429... • a single "client_net_fatal_error" log line may be larger than 1MB. • the max HTTP body of ES is 100MB. Be careful with bulk_size.
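  Since bulk_size counts events rather than bytes, a client that wants to stay under the HTTP body limit has to batch by bytes itself. A minimal sketch of that idea (function name and limits are illustrative, not from any real client library):

```python
def batch_by_bytes(lines, max_bytes=10 * 1024 * 1024):
    """Group serialized log lines into bulk bodies no larger than max_bytes.

    A single line larger than max_bytes still gets its own batch; that is
    exactly the oversized-body case the slide warns about.
    """
    batch, size = [], 0
    for line in lines:
        n = len(line)
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += n
    if batch:
        yield batch

# Three 4MB events with a 10MB cap -> two bulk requests, not one.
events = [b"x" * (4 * 1024 * 1024)] * 3
batches = list(batch_by_bytes(events))
print([len(b) for b in batches])  # [2, 1]
```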
  • 64. ES performance(2) • index size is several times larger than the raw message size. • _source: raw JSON • _all: terms from every field, for full-text search • multi-field: .raw for all fields in the logstash template • So: • no _all for nginx accesslog. • no _source for metrics tsdb logs. • no analyzed fields for most fields; only the raw message stays analyzed.
  • 65. ES performance(3) • constant CPU util% from segment merges (hot threads forever). • max segment: 5GB • min segment: 2MB • increase refresh_interval (1s) and flush_threshold_size (200MB).
  • 66. cluster.name: es1003
  cluster.routing.allocation.node_initial_primaries_recoveries: 30
  cluster.routing.allocation.node_concurrent_recoveries: 5
  cluster.routing.allocation.cluster_concurrent_rebalance: 5
  cluster.routing.allocation.enable: all
  node.name: esnode001
  node.master: false
  node.data: true
  node.max_local_storage_nodes: 1
  index.routing.allocation.total_shards_per_node: 3
  index.merge.scheduler.max_thread_count: 1
  index.refresh_interval: 30s
  index.number_of_shards: 26
  index.number_of_replicas: 1
  index.translog.flush_threshold_size: 5000mb
  index.translog.flush_threshold_ops: 50000
  index.search.slowlog.threshold.query.warn: 30s
  index.search.slowlog.threshold.fetch.warn: 1s
  index.indexing.slowlog.threshold.index.warn: 10s
  indices.store.throttle.max_bytes_per_sec: 1000mb
  indices.cache.filter.size: 10%
  indices.fielddata.cache.size: 10%
  indices.recovery.max_bytes_per_sec: 2gb
  indices.recovery.concurrent_streams: 30
  path.data: /data1/elasticsearch/data
  path.logs: /data1/elasticsearch/logs
  bootstrap.mlockall: true
  http.max_content_length: 400mb
  http.enabled: true
  http.cors.enabled: true
  http.cors.allow-origin: "*"
  gateway.type: local
  gateway.recover_after_nodes: 30
  gateway.recover_after_time: 5m
  gateway.expected_nodes: 30
  discovery.zen.minimum_master_nodes: 3
  discovery.zen.ping.timeout: 100s
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["10.19.0.97","10.19.0.98","10.19.0.99"]
  monitor.jvm.gc.young.warn: 1000ms
  monitor.jvm.gc.old.warn: 10s
  monitor.jvm.gc.old.info: 5s
  monitor.jvm.gc.old.debug: 2s
  • 67. problem of ES(1) • different result in search and store:
  curl es.domain.com:9200/logstash-accesslog-2015.04.03/nginx/_search?q=_id:AUx-QvSBS-dhpiB8_1f1&pretty -d '{
    "fields": ["requestTime"],
    "script_fields" : {
      "test1" : { "script" : "doc[\"requestTime\"].value" },
      "test2" : { "script" : "_source.requestTime" },
      "test3" : { "script" : "doc[\"requestTime\"].value * 1000" }
    }
  }'
  • 68. NOT schema free!
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "logstash-accesslog-2015.04.03",
      "_type" : "nginx",
      "_id" : "AUx-QvSBS-dhpiB8_1f1",
      "_score" : 1.0,
      "fields" : {
        "test1" : [ 4603039107142836552 ],
        "test3" : [ -8646911284551352000 ],
        "requestTime" : [ 0.54 ],
        "test2" : [ 0.54 ]
      }
    } ]
  }
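  The odd test1/test3 numbers are not random: the script read the stored double 0.54 back as a long whose bits are its IEEE-754 encoding, and test3 is that long times 1000 wrapped into a signed 64-bit integer. A quick check in Python reproduces both values:

```python
import struct

# Reinterpret the double 0.54 as a signed 64-bit integer: this is test1.
bits = struct.unpack('<q', struct.pack('<d', 0.54))[0]
print(bits)  # 4603039107142836552

# Multiply by 1000 and wrap to signed 64-bit: this is test3.
wrapped = struct.unpack('<q', struct.pack('<Q', (bits * 1000) % 2**64))[0]
print(wrapped)  # -8646911284551352000
```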
  • 69. problem of ES(2) • some data can't be found! • ES needs the same mapping type for the same field name in the same _type of the same index. • My "client_net_fatal_error" log data changed after one release: • {"reqhdr":{"Host":"api.weibo.cn"}} • {"reqhdr":"{\"Host\":\"api.weibo.cn\"}"} • Set the mapping of the "reqhdr" object to {"enabled":false}; the string can then only be viewed in the _source JSON, not searched.
  • 70. problem of ES(3) • some data can't be found! Again! • There was a default setting `ignore_above: 256` in the logstash template.
  curl 10.19.0.100:9200/logstash-mweibo-2015.05.18/mweibo_client_crash/_search?q=_id:AU1ltyTCQC8tD04iYBIe&pretty -d '{
    "fielddata_fields" : ["jsoncontent.content", "jsoncontent.platform"],
    "fields" : ["jsoncontent.content","jsoncontent.platform"]
  }'
  ...
  "fields" : {
    "jsoncontent.content" : [ "dalvik.system.NativeStart.main(Native Method)\nCaused by: java.lang.ClassNotFoundException: Didn't find class \"com.sina.weibo.hc.tracking.manager.TrackingService\" on path: DexPathList[[zip file \"/data/app/com.sina.weibo-1.apk\", zip file \"/data/data/com.sina.weibo/code_cache/secondary-dexes/com.sina.weibo-1.apk.classes2.zip\", zip file \"/data/data/com.sina.weibo/app_dex/dbcf1705b9ffbc30ec98d1a76ada120909.jar\"], nativeLibraryDirectories=[/data/app-lib/com.sina.weibo-1, /vendor/lib, /system/lib]]" ],
    "jsoncontent.platform" : [ "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI", "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI" ]
  }
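  The `ignore_above` trap can be fixed per field; a mapping fragment (field path illustrative, ES 1.x multi-field syntax) that raises the limit on the not_analyzed subfield so long values are still indexed:

```json
{
  "jsoncontent": {
    "properties": {
      "content": {
        "type": "string",
        "fields": {
          "raw": { "type": "string", "index": "not_analyzed", "ignore_above": 4096 }
        }
      }
    }
  }
}
```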
  • 71. kibana custom development • upgraded the elastic.js version in K3 to support the ES 1.2 API, so we can use the aggs API to implement new panels (percentile panel, range panel, and cardinality histogram panel). • "export as csv" for the table panel. • map provider setting for bettermap. • term_stats for map. • china map. • query helper. • script field for the terms panel. • OR filtering. • more in <https://github.com/chenryn/kibana>
  • 72. see also •《 Elasticsearch Server(2 edition) 》 •《 Logging and Log Management the Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management 》 •《 Data Analysis with Open Source Tools 》 •《 Web Operations: Keeping the data on time 》 •《 The Art of Capacity Planning 》 •《大规模 Web 服务开发技术》 •https://codeascraft.com/ •http://calendar.perfplanet.com •http://kibana.logstash.es
  • 73. –JordanSissel@logstash.net “If a newbie has a bad time, it's a bug.”

Editor's Notes

  1. Reading the /_cluster/stats API is very resource-heavy and not suitable for long-term monitoring. For long-term monitoring, use the /_nodes/_local API to fetch only the local node's stats.
  2. Whether it is log analysis or metric monitoring, the goal should be to become an interface that boosts ops productivity.
  3. Timestamp + double = metric monitoring. Monitoring samples at fixed points; offline analysis platforms need predefined rules. But when ops face this kind of question, all they have is a rough scope, with no rules and no concrete failure point. What they need is the ability to drill down quickly within that scope to find the actual point. This is a big difference between log analysis for ops and other "similar" monitoring systems.
  4. On a single machine this job is done by awk, grep, sort, uniq, one pipe after another.
  5. But on a large cluster you can't play it that way.
  6. A more complicated case is multiple log lines mapping to one event.
  7. And the platform must be fine-grained and easy to use, because the users may be customer support. Splunk already has a market cap of over ten billion dollars; that is the commercial prospect of the machine-data analytics field.
  8. This is test data on my personal MacBook Air.
  9. Every layer scales out statelessly.
  10. Ops people should find this config syntax acceptable: Puppet, also written in Ruby, uses the same style of DSL.
  11. In the query, filters such as _type, urlpath, and errorcode produce different histogram aggregation results.
  12. This example shows the two most basic visualization panels: Histogram (time trend) and Terms (topN ranking). The first example walks through how a log goes step by step from text to charts. Time trends are the most common thing in metric systems, so why use ELK? Suppose a site has 2000 APIs, and common monitoring dimensions include status code, response-time range, UA, geographic region, and carrier; multiply them together and how many tens of thousands of items is that? In ELK, every Kibana refresh is computed from ES in real time, and you can change the query statement at will. In the previous K3 screenshot there are 8 query boxes at the top, each with a different query statement; whenever needed you can keep modifying them and adding more. That is the flexibility. As for K4, each panel is bound to one query; clicking the pencil icon jumps back to the discover page to modify the query. The functionality is still there; only the page layout changed.
  13. This uses a "layered pie" chart. Each layer is a topN count of the functions at the same depth of the slow-function stack, a layered-stats idea similar to agentzh's flame graphs. With this you not only know which bottom-level function is most frequent, but also which call chain has the biggest impact. In this screenshot, most slow requests are the recommendation page calling the platform via curl too slowly. In the host ranking at the bottom left, nine of the top ten hosts are in the yhg IDC, but the first one is in the xd IDC. Clearly something is wrong; click that hostname and the page refreshes into the next view.
  14. The hostname is added to the dashboard as a filter condition applied to every visualization panel. The layered-pie result is now completely different from the global stats: looking at the details, the most frequent slow operation becomes gethostbyname being too slow when connecting to memcached. The problem is located directly. This is an example of a single dimension change revealing the root cause. As the dashboard gains panels, the dimensions available for judging problems grow too, so you can locate issues from multiple dimensions. The next example is the nginx error log.
  15. Going back one time period, every dimension changed a lot. So it is obvious what caused the abnormal surge of nginx errors in the previous period.
  16. The previous examples are all server-side logs, but any log works the same way, for example client-side logs. This is our client crash log. When a new version is released, developers care a lot about where the new version's problems are. When collecting this log, we use a few lines of ruby in logstash to strip system library functions from the stack, then rank only our company's own code. You can see there is a beta version; click it.
  17. The ranking changes, and the new version's situation is clear at a glance: you know which function is the problem. Many dedicated tools handle crash logs; my point is not that this approach is superior, but that it is a fairly general way of doing log analysis.
  18. This is a small UI change to make queries easier for customer support (so they don't have to write uid:"123"). Filtering by uid shows which stages this user's logs went through on the timeline and what errors were reported.
  19. The previous examples are text-processing tricks; ELK also has more statistics-oriented uses. How are an API's response times distributed across all requests? The average is unreliable, everyone knows that. You might count by ranges, 0-100, 100-1000; are these ranges right? And range counts surely change at peak time: is that normal variation from rising request volume, or an actual anomaly? With this histogram analysis, the inflection point (i.e. the alert threshold) becomes obvious. Then, with these arrays from different time ranges, you can run a t-test or Shapiro-Wilk test to decide whether the two distributions are similar, and thus whether peak-time behavior is normal fluctuation or an anomaly.
  20. ELK provides logstash-forwarder written in golang, but it only supports SSL TCP transport, cannot send directly to a queue, and has no compression.
  21. Tuning is easier because logstash names its threads; the top command shows which thread is the bottleneck: input, filter, or output. Geoip2 uses the maxminddb-java package; in JRuby just java_import it.
  22. Likewise, each action's name shows up in top, and each ruleset's consumption appears in the pstats records, which we use for monitoring and alerting.
  23. Doc_values pre-generates fielddata onto disk; saving heap both ensures stability and improves performance. Multicast does not cross switches, and on public clouds it may be flagged as malicious scanning. For bulk, a POST body of 10-15MB per request is advisable; watch your per-event size, because bulk_size counts events, not bytes.
  24. Fields with the same name in different types of the same index are actually handled with the first written mapping, so searches get confused.