Speakers: Chris Larsen (Limelight Networks) and Benoit Sigoure (Arista Networks)
The OpenTSDB community continues to grow, with users looking to store massive amounts of time-series data in a scalable manner. In this talk, we will discuss a number of use cases and best practices around naming schemas and HBase configuration. We will also review OpenTSDB 2.0's new features, including the HTTP API, plugins, annotations, millisecond support, and metadata, as well as what's next on the roadmap.
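As a concrete taste of the 2.0 HTTP API mentioned above, here is a minimal Python sketch that writes one data point through the /api/put endpoint. The host, metric name, and tags are illustrative placeholders, not values from the talk.

```python
# Minimal sketch of writing a data point via OpenTSDB 2.0's HTTP API.
# The host, metric name, and tag values below are illustrative placeholders.
import json
import urllib.request

datapoint = {
    "metric": "sys.cpu.user",           # hypothetical metric name
    "timestamp": 1356998400,            # Unix epoch seconds (2.0 also accepts ms)
    "value": 42.5,
    "tags": {"host": "web01", "dc": "lga"},  # tag names follow your naming schema
}

req = urllib.request.Request(
    "http://tsdb-host:4242/api/put",    # 4242 is OpenTSDB's default port
    data=json.dumps(datapoint).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)             # 204 No Content on success
```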
Another year, another talk about OpenTSDB running on HBase.
We'll discuss topics like:
Yahoo's append co-processor, which saves CPU resources by resolving atomic appends at compaction or query time (see the toy sketch after this entry).
The pros and cons of HBASE-15181, Date Tiered Compaction, for time series data.
Yahoo's experiments with an unbounded secondary index on HBase.
OpenTSDB 3.0, featuring a new query engine and API.
Chris Larsen (Yahoo!)
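To make the append idea in the list above concrete, here is a toy Python sketch (not OpenTSDB or Yahoo code): write-time appends just accumulate (offset, value) fragments, and the merge work is deferred to compaction or query time. The fragment encoding is an assumption for illustration.

```python
# Toy sketch of append-style writes resolved lazily (not actual OpenTSDB code).
# Each write appends an (offset, value) fragment; merging is deferred until
# compaction or query time, saving read-modify-write CPU on the write path.
import struct

cell = bytearray()  # simulates the single HBase cell receiving appends

def append_point(offset_seconds: int, value: float) -> None:
    """Append one fragment; cheap, no read of existing data required."""
    cell.extend(struct.pack(">Hd", offset_seconds, value))

def resolve(cell_bytes: bytes) -> list[tuple[int, float]]:
    """At compaction/query time: decode, de-duplicate, and sort fragments."""
    fragments = [struct.unpack_from(">Hd", cell_bytes, i)
                 for i in range(0, len(cell_bytes), 10)]  # 10 bytes per fragment
    return sorted(dict(fragments).items())  # last write wins per offset

append_point(30, 0.25)
append_point(10, 0.75)
print(resolve(bytes(cell)))  # [(10, 0.75), (30, 0.25)]
```

The point of the pattern is that the write path never reads existing data, which is where the CPU savings come from.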
This year we'll talk about the joys of the HBase Fuzzy Row Filter, new TSDB filters, expression support, Graphite functions, and running OpenTSDB on top of Google’s hosted Bigtable. AsyncHBase now includes per-RPC timeouts, append support, Kerberos auth, and a beta implementation in Go.
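HBase's FuzzyRowFilter matches row keys against a template plus a byte mask, which is what lets OpenTSDB skip over variable tag-value bytes server-side. Below is a minimal pure-Python sketch of the matching rule (mask byte 0 = must match, 1 = wildcard, mirroring the HBase semantics); the key layout and UID values are illustrative, not taken from the talk.

```python
# Pure-Python sketch of FuzzyRowFilter-style matching (mask 0 = fixed, 1 = wildcard).
# OpenTSDB uses this to scan one metric across all tag values without touching
# every row: metric UID bytes stay fixed while time and tag-value bytes float.

def fuzzy_match(row: bytes, template: bytes, mask: bytes) -> bool:
    if len(row) < len(template):
        return False
    return all(m == 1 or r == t for r, t, m in zip(row, template, mask))

# Hypothetical OpenTSDB-style key: 3-byte metric UID, 4-byte base time,
# 3-byte tag-key UID, 3-byte tag-value UID.
template = bytes([0, 0, 1]) + bytes(4) + bytes([0, 0, 7]) + bytes(3)
mask     = bytes([0, 0, 0]) + bytes([1] * 4) + bytes([0, 0, 0]) + bytes([1] * 3)

row = (bytes([0, 0, 1]) + (1478000000).to_bytes(4, "big")
       + bytes([0, 0, 7]) + bytes([0, 0, 42]))
print(fuzzy_match(row, template, mask))  # True: any time, any tag value
```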
A monitoring system is arguably the most crucial system to have in place when administering and tuning the performance of any database system. DBAs also find themselves with a variety of monitoring systems and plugins to use, ranging from small cron scripts to complex data collection systems. In this talk, I’ll discuss how Box shifted from the Cacti monitoring system and various other shell scripts to OpenTSDB, and the changes made to our servers and our daily interaction with monitoring to increase our agility in identifying and addressing changes in database behavior.
Ted Dunning – Very High Bandwidth Time Series Database Implementation – NoSQL matters
This talk will describe our work in creating time series databases with very high ingest rates (over 100 million points/second) on very small clusters. Starting with openTSDB and the off-the-shelf version of MapR-DB, we were able to accelerate ingest by >1000x. I will describe our techniques in detail and talk about the architectural changes required. We are also working to allow access to openTSDB data using SQL via Apache Drill. In addition, I will talk about how this work has implications for the much-fabled Internet of Things, and tell some stories about the origins of open source big data in the 19th century at sea.
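The talk's actual techniques aren't reproduced here, but one widely used ingredient for this kind of speedup is batching many samples per series into a single compressed blob cell instead of writing one cell per sample. The sketch below illustrates that general pattern under assumed window sizes and encodings; it is not the MapR-DB implementation.

```python
# Sketch of "many samples per cell" ingestion: buffer points per series and
# flush one compressed blob per time window, cutting per-point write overhead.
# Window size and encoding are illustrative assumptions, not the talk's code.
import struct, zlib
from collections import defaultdict

WINDOW = 3600  # seconds of data per blob; an assumed value
buffers: dict[tuple[str, int], list[tuple[int, float]]] = defaultdict(list)

def ingest(series: str, ts: int, value: float) -> None:
    buffers[(series, ts // WINDOW)].append((ts, value))

def flush() -> dict[tuple[str, int], bytes]:
    """Encode each window's points as one compressed blob (one DB cell)."""
    out = {}
    for key, points in buffers.items():
        payload = b"".join(struct.pack(">id", ts, v) for ts, v in sorted(points))
        out[key] = zlib.compress(payload)
    buffers.clear()
    return out

for i in range(10_000):
    ingest("sensor.42", 1478000000 + i, float(i % 7))
blobs = flush()
print(len(blobs), "blob(s),", sum(map(len, blobs.values())), "compressed bytes")
```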
openTSDB - Metrics for a distributed world – Oliver Hankeln
These are the slides for my talk at the IPC13/WTC13 in Munich on openTSDB. openTSDB is the software that we at gutefrage.net use to store about 200 million data points in several thousand time series per day.
I will talk about how openTSDB stores the data to efficiently query them afterwards. Some cultural issues and some myths are also covered.
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon – Cloudera, Inc.
OpenTSDB was built on the belief that, through HBase, a new breed of monitoring systems could be created: one that can store and serve billions of data points forever without the need for destructive downsampling, one that can scale to millions of metrics, and one where plotting real-time graphs is easy and fast. In this presentation we’ll review some of the key points of OpenTSDB’s design, some of the mistakes that were made, how they were or will be addressed, and some of the lessons learned while writing and running OpenTSDB as well as asynchbase, the asynchronous high-performance thread-safe client for HBase. Specific topics will center on the schema: how it impacts performance, and how it allows concurrent writes without any need for coordination across a distributed cluster of OpenTSDB instances.
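For readers unfamiliar with the schema being referenced: OpenTSDB packs a metric UID, an hourly base timestamp, and sorted tag UID pairs into the HBase row key, so each key is fully determined by the data point itself and writers never need to coordinate. A simplified Python sketch of that key construction follows (3-byte UIDs as in the default configuration; the UID values themselves are made up).

```python
# Simplified sketch of OpenTSDB's row-key layout:
#   [metric UID (3 bytes)][base hour (4 bytes)][tagk UID (3)][tagv UID (3)]...
# Concurrent writers need no coordination because the key is derived purely
# from metric, hour, and tags. UID values here are invented for illustration.
import struct

def row_key(metric_uid: int, timestamp: int, tag_uids: dict[int, int]) -> bytes:
    base_hour = timestamp - (timestamp % 3600)      # each row spans one hour
    key = metric_uid.to_bytes(3, "big") + struct.pack(">I", base_hour)
    for tagk in sorted(tag_uids):                   # sorted tags -> canonical key
        key += tagk.to_bytes(3, "big") + tag_uids[tagk].to_bytes(3, "big")
    return key

key = row_key(metric_uid=1, timestamp=1478000123, tag_uids={2: 42})
print(key.hex())  # 3-byte metric, 4-byte hour, then tagk/tagv pairs
```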
Bucket your partitions wisely - Cassandra summit 2016 – Markus Höfer
When we talk about bucketing, we are essentially talking about ways to split Cassandra partitions into several smaller parts rather than having a single large partition.
Bucketing Cassandra partitions can be crucial for optimizing queries, preventing oversized partitions, and avoiding the TombstoneOverwhelmingException that can occur when too many tombstones are created.
In this talk I will show how to recognize large partitions during data modeling. I will also show different strategies we used in our projects to create, use, and maintain buckets for our partitions (a minimal sketch of the idea follows below).
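As a minimal illustration of the bucketing idea (not the talk's exact strategies): adding a derived bucket column to the partition key splits one unbounded partition into many bounded ones. The sketch below computes a day-based bucket in Python; the table layout in the comment is an assumed example.

```python
# Minimal sketch of time-based bucketing for Cassandra partition keys.
# Assumed table:  CREATE TABLE readings (
#                   sensor_id text, bucket int, ts timestamp, value double,
#                   PRIMARY KEY ((sensor_id, bucket), ts));
# Putting `bucket` in the partition key caps each partition at one day of data.
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> int:
    """Derive a daily bucket id (days since epoch) from the event timestamp."""
    return int(ts.timestamp()) // 86400

def partition_key(sensor_id: str, ts: datetime) -> tuple[str, int]:
    return (sensor_id, day_bucket(ts))

ts = datetime(2016, 9, 8, 12, 30, tzinfo=timezone.utc)
print(partition_key("sensor-17", ts))   # ('sensor-17', 17052)
```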
About the Speaker
Markus Höfer, IT Consultant, codecentric AG
Markus Höfer works as an IT consultant for codecentric AG in Münster, Germany. He works on microservice architectures backed by DSE and/or Apache Cassandra, and supports and trains customers building Cassandra-based applications.
Yahoo has long been involved in HBase and its community. In 2013, HBase was offered as a hosted service at Yahoo. Since then, adoption has grown rapidly, and today HBase is used by numerous teams across the company, helping to enable a diverse set of use cases ranging from near-real-time processing to data warehousing.
This was made possible thanks to HBase along with some enhancements to support multi-tenancy and scale. As our clusters continue to grow and use cases become more demanding we are working towards supporting a million regions in a single cluster.
In this keynote, we’ll paint a picture of where Yahoo! is today and the enhancements we have been working on to reach today’s scale as well as supporting a million regions and beyond.
HBaseCon 2017 | gohbase: Pure Go HBase Client – HBaseCon
gohbase is an implementation of an HBase client in pure Go: https://github.com/tsuna/gohbase. In this presentation we'll talk about its architecture, compare its performance against the native Java HBase client as well as AsyncHBase (http://opentsdb.github.io/asynchbase/), and discuss some nice characteristics of golang that resulted in a simpler implementation.
A lot of data is best represented as time series: Operational data, financial data, and even in data warehouses the dominant dimension is often time. We present Chronix, a time series database based on Apache Solr and Spark which is able to handle trillions of time series data points and perform interactive queries. Chronix Spark is open source software and battle-proven at a German car manufacturer and an international telco.
We demonstrate several real-life use cases of Chronix. Afterwards we lift the curtain and deep-dive into the Chronix architecture, especially how we're using Solr to store time series data and how we've hooked up Solr with Spark. We provide some benchmarks showing how Chronix has outperformed other time series databases in both performance and storage efficiency.
Chronix is open source under the Apache License (http://chronix.io).
A Fast and Efficient Time Series Storage Based on Apache Solr – QAware GmbH
OSDC 2016, Berlin: Talk by Florian Lautenschlager (@flolaut, Senior Software Engineer at QAware)
Abstract: How do you store billions of time series points and access them within a few milliseconds? Chronix! Chronix is a young but mature open source project that allows one, for example, to store about 15 GB (CSV) of time series data in 238 MB, with average query times of 21 ms. Chronix is built on top of Apache Solr, a bulletproof distributed NoSQL database with impressive search capabilities. In this code-intense session we show how Chronix achieves its efficiency in both respects: by means of ideal chunking, by selecting the best compression technique, by enhancing the stored data with (pre-computed) attributes, and by specialized query functions.
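The numbers above come from chunking plus compression. As a hedged sketch of that general approach (not Chronix's actual code), the Python below splits a series into fixed-size chunks, compresses each, and keeps pre-computed attributes alongside so queries can prune chunks without decompressing them. The chunk size and attribute set are assumptions.

```python
# Sketch of the chunk-and-compress pattern behind Chronix-style storage
# (illustrative, not the project's implementation): fixed-size chunks, one
# compressed blob each, plus pre-computed attributes for query-time pruning.
import json, zlib

CHUNK_POINTS = 128  # assumed chunk size

def chunk_series(points: list[tuple[int, float]]) -> list[dict]:
    chunks = []
    for i in range(0, len(points), CHUNK_POINTS):
        part = points[i:i + CHUNK_POINTS]
        values = [v for _, v in part]
        chunks.append({
            "start": part[0][0], "end": part[-1][0],       # time-range attribute
            "min": min(values), "max": max(values),        # pre-computed aggregates
            "data": zlib.compress(json.dumps(part).encode()),  # the compressed blob
        })
    return chunks

series = [(1478000000 + i, float(i)) for i in range(1000)]
chunks = chunk_series(series)
hot = [c for c in chunks if c["max"] >= 900.0]   # prune using attributes only
print(len(chunks), "chunks;", len(hot), "needs decompressing for this query")
```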
In the age of NoSQL, big data storage engines such as HBase have given up ACID semantics of traditional relational databases, in exchange for high scalability and availability. However, it turns out that in practice, many applications require consistency guarantees to protect data from concurrent modification in a massively parallel environment. In the past few years, several transaction engines have been proposed as add-ons to HBase; three different engines, namely Omid, Tephra, and Trafodion were open-sourced in Apache alone. In this talk, we will introduce and compare the different approaches from various perspectives including scalability, efficiency, operability and portability, and make recommendations pertaining to different use cases.
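All three engines layer transactions over HBase using some form of centrally issued timestamps plus conflict detection. The toy Python sketch below shows the bare snapshot-isolation skeleton (begin timestamp, write-set, commit-time conflict check) that this family of designs builds on; it is a didactic simplification, not any engine's actual protocol.

```python
# Toy snapshot-isolation skeleton of the kind Omid/Tephra/Trafodion build on:
# a central oracle hands out timestamps; a transaction commits only if no key
# in its write-set was committed by another transaction after it began.
# This is a didactic simplification, not any engine's real protocol.
import itertools

class TimestampOracle:
    def __init__(self):
        self._counter = itertools.count(1)
        self.last_commit: dict[str, int] = {}   # key -> last commit timestamp

    def begin(self) -> int:
        return next(self._counter)

    def try_commit(self, start_ts: int, write_set: set[str]) -> int | None:
        if any(self.last_commit.get(k, 0) > start_ts for k in write_set):
            return None                          # write-write conflict: abort
        commit_ts = next(self._counter)
        for k in write_set:
            self.last_commit[k] = commit_ts
        return commit_ts

oracle = TimestampOracle()
t1, t2 = oracle.begin(), oracle.begin()
print(oracle.try_commit(t1, {"row1"}))  # commits with a fresh timestamp
print(oracle.try_commit(t2, {"row1"}))  # None: row1 committed after t2 began
```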
Back to Basics Webinar 6: Production Deployment – MongoDB
This is the final webinar of a Back to Basics series that will introduce you to the MongoDB database. This webinar will guide you through production deployment.
Using Databases and Containers From Development to Deployment – Aerospike, Inc.
We cover the following topics:
Using Docker to Orchestrate a multi container application (Flask + Aerospike)
Injecting HAProxy and other production requirements as we deploy to production
Scaling the Web and Aerospike clusters to grow to meet demand
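A minimal sketch of the kind of Flask + Aerospike pairing listed above, using the aerospike Python client. The host name assumes Docker service discovery (an "aerospike" service on the default port 3000), and the namespace/set names are placeholders.

```python
# Minimal Flask + Aerospike sketch for a containerized setup: the client
# connects to a Docker service named "aerospike" on the default port 3000.
# Namespace/set names are placeholders. Requires: pip install flask aerospike
import aerospike
from aerospike import exception as aero_ex
from flask import Flask, jsonify

app = Flask(__name__)
client = aerospike.client({"hosts": [("aerospike", 3000)]}).connect()

@app.route("/user/<user_id>")
def get_user(user_id):
    key = ("test", "users", user_id)        # (namespace, set, primary key)
    try:
        _, _, bins = client.get(key)
        return jsonify(bins)
    except aero_ex.RecordNotFound:
        return jsonify({"error": "not found"}), 404

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)      # bind on all interfaces in the container
```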
Dyn delivers exceptional Internet performance. Enabling high-quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple data centers, enabling sub-50 ms query responses for hundreds of billions of data points. From granular DNS traffic data to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks, and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements that led them to choose DSE as their go-to Big Data solution, the path which led to Spark, and the lessons learned in the process.
Argus Production Monitoring at Salesforce – HBaseCon
Tom Valine and Bhinav Sura (Salesforce)
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
These slides were part of a presentation given at HushCon East 2017. The talk covered how we can use big data to improve the effectiveness of offensive security tools.
Performance Optimizations in Apache Impala – Cloudera, Inc.
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop not delivered by batch frameworks such as Hive or Spark. Impala is written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2lGNybu.
Stefan Krawczyk discusses how his team at StitchFix uses the cloud to enable over 80 data scientists to be productive. He also talks about prototyping ideas, algorithms, and analyses; how they set up and keep schemas in sync between Hive, Presto, Redshift, and Spark; and how they make access easy for their data scientists. Filmed at qconsf.com.
Stefan Krawczyk is Algo Dev Platform Lead at StitchFix, where he’s leading development of the algorithm development platform. He spent formative years at Stanford, LinkedIn, Nextdoor & Idibon, working on everything from growth engineering, product engineering, data engineering, to recommendation systems, NLP, data science and business intelligence.
Presto talk @ Global AI Conference 2018 Boston – kbajda
Presented at Global AI Conference in Boston 2018:
http://www.globalbigdataconference.com/boston/global-artificial-intelligence-conference-106/speaker-details/kamil-bajda-pawlikowski-62952.html
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto has experienced unprecedented growth in popularity in both on-premises and cloud deployments over the last few years. Presto is truly a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMSs, NoSQL stores, and custom data stores. This talk will cover some of the best use cases for Presto, recent advancements in the project such as the Cost-Based Optimizer and geospatial functions, and the roadmap going forward.
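To make the SQL-on-Anything point concrete, here is a hedged sketch using the presto-python-client (prestodb) package: one query joins a Hive table with a MySQL table through two catalogs. The coordinator host, catalogs, schemas, and table names are placeholders.

```python
# Sketch of a federated Presto query from Python (pip install presto-python-client).
# A single statement reads from two catalogs (hive and mysql); the coordinator
# host, catalog, schema, and table names below are placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, c.name, o.total
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    WHERE o.order_date >= DATE '2018-01-01'
""")
for row in cur.fetchall():
    print(row)
```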
Avoiding Common Pitfalls: Spark Structured Streaming with Kafka – HostedbyConfluent
"Unlock the full potential of your streaming applications with Kafka! As a data engineer, are you eager to supercharge the performance of your streaming workflows? Join us in this session where we dive deep into the intricate integration of Kafka and Spark Structured Streaming. Explore the inner workings, discover control options, and unravel the anatomy of seamless data flow.
In this engaging presentation, we'll unravel the inner workings of Kafka, explore its collaboration with Structured Streaming, and scrutinize the various options for stream control. What sets this session apart is our dedicated focus on the common pitfalls – we'll extensively discuss and dissect these challenges. From practical tips to proven techniques, we'll guide you through overcoming these challenges in your data pipelines.
Join us for a session filled with insights that not only highlight the challenges but empower you to turn them into opportunities for exceptional results in your streaming applications."
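For reference, a minimal PySpark sketch of the Kafka integration discussed above, with comments flagging pitfalls of the kind the session targets (offset handling, unbounded first batches, missing checkpoints). Broker addresses, topic names, and paths are placeholders.

```python
# Minimal Spark Structured Streaming + Kafka sketch; comments mark common
# pitfalls. Broker, topic, and checkpoint values are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-demo").getOrCreate()

stream = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    # Pitfall: the default is "latest"; earlier records are silently skipped.
    .option("startingOffsets", "earliest")
    # Pitfall: without this cap, the first batch reads the whole topic backlog.
    .option("maxOffsetsPerTrigger", 10000)
    .load())

# Pitfall: Kafka keys/values arrive as bytes; cast before parsing.
events = stream.select(col("key").cast("string"), col("value").cast("string"))

query = (events.writeStream.format("console")
    # Pitfall: no checkpoint means no offset recovery across restarts.
    .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")
    .start())
query.awaitTermination()
```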
Grokking TechTalk #29: Building Realtime Metrics Platform at LinkedIn – Grokking VN
This tech talk by Khai Tran covers LinkedIn's data pipeline, which is used to collect tens of billions of messages per day, and how they run a real-time processing system to aggregate this data for metrics monitoring.
Some of the points the talk will cover:
- An introduction to LinkedIn's unified metrics platform
- How LinkedIn sets up its big data pipeline using Kafka, HDFS, Apache Calcite, and Apache Samza
- The concept of nearline storage, and how LinkedIn moved from an offline architecture to a nearline architecture
Speaker: Khai Tran, Staff Software Engineer - LinkedIn.
- Currently a staff software engineer at LinkedIn, responsible for the metrics monitoring system. Previously worked at Amazon AWS and Oracle.
- PhD, University of Wisconsin-Madison, with research in database systems.
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by... – NETWAYS
Open source is at the heart of what we do at Grafana Labs, and there is so much happening! The intent of this talk is to update everyone on the latest developments when it comes to Grafana, Pyroscope, Faro, Loki, Mimir, Tempo, and more. Everyone has at least heard of Grafana, but maybe some of the other projects mentioned above are new to you? Welcome to this talk 😉 Besides the update on what is new, we will also quickly introduce each project during this talk.
Stream Processing with Apache Kafka and .NET – confluent
Presentation from South Bay.NET meetup on 3/30.
Speaker: Matt Howlett, Software Engineer at Confluent
Apache Kafka is a scalable streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix, Walmart, Airbnb, Goldman Sachs and LinkedIn. In this talk Matt will give a technical overview of Kafka, discuss some typical use cases (from surge pricing to fraud detection to web analytics) and show you how to use Kafka from within your C#/.NET applications.
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... – DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each cluster ranging from 3 to 24 nodes. For a team of three, maintaining this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Supersized PostgreSQL: Postgres-XL for Scale-Out OLTP and Big Data Analytics – mason_s
In this talk we introduce Postgres-XL for scaling out PostgreSQL. We cover its architecture and how tables are distributed, and include a sample configuration for a small local test cluster. Finally, we discuss the differences from PostgreSQL and cover Postgres-XL community building.
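As a small hedged illustration of how table distribution looks in Postgres-XL (issued via psycopg2 here purely for demonstration; the DSN and table names are made up): fact tables are typically hash-distributed across datanodes, while small dimension tables are replicated to every node.

```python
# Sketch of Postgres-XL table distribution DDL, issued via psycopg2.
# DSN and table names are illustrative. DISTRIBUTE BY is Postgres-XL syntax
# (not plain PostgreSQL): HASH spreads rows across datanodes; REPLICATION
# keeps a full copy on every node for cheap local joins.
import psycopg2

conn = psycopg2.connect("host=coordinator1 dbname=demo user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE measurements (
            sensor_id  int,
            ts         timestamptz,
            value      double precision
        ) DISTRIBUTE BY HASH (sensor_id)
    """)
    cur.execute("""
        CREATE TABLE sensors (
            sensor_id int PRIMARY KEY,
            location  text
        ) DISTRIBUTE BY REPLICATION
    """)
```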
Lambda Architecture: The Best Way to Build Scalable and Reliable Applications! – Tugdual Grall
Lambda Architecture is a useful framework for thinking about the design of big data applications. The framework was initially built at Twitter. In this presentation you will learn, based on concrete examples, how to build and deploy scalable, fault-tolerant applications, with a focus on Big Data and Hadoop.
This presentation was delivered at the OOP conference, Munich, Feb 2016
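As a compact sketch of the core Lambda idea (a batch layer plus a speed layer, merged at query time), the Python below serves a count query by combining a precomputed batch view with a realtime increment view. The data shapes are invented for illustration.

```python
# Toy sketch of the Lambda Architecture's query-time merge: a batch view
# (precomputed, complete up to a cutoff) plus a speed-layer view (increments
# since the cutoff). Data shapes here are invented for illustration.

batch_view = {"page/home": 10_000, "page/about": 1_200}   # recomputed nightly
batch_cutoff = 1478000000                                  # last batch run (epoch s)
realtime_events = [("page/home", 1478000100), ("page/home", 1478000200),
                   ("page/about", 1477999000)]             # last event: pre-cutoff

def query_count(page: str) -> int:
    """Merge both layers; the speed layer only covers events after the cutoff."""
    recent = sum(1 for p, ts in realtime_events if p == page and ts > batch_cutoff)
    return batch_view.get(page, 0) + recent

print(query_count("page/home"))   # 10002
print(query_count("page/about"))  # 1200 (the pre-cutoff event is already in batch)
```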
A Day in the Life of a Druid Implementor and Druid's Roadmap – Itai Yaffe
Benjamin Hopp (Solutions Architect) @ Imply:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets.
This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit.
Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics.
Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.
The most important contributor to a fast analytical setup is getting the data model right.
The talk will center around various choices you can make to prepare your data to get best possible query performance.
We’ll look at some general best practices to model your data before ingestion such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes.
We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.
We’ll also look at Druid-specific optimizations that take advantage of approximations; where you can trade accuracy for performance and reduced storage.
You’ll get introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more.
And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.
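Since roll-up is one of the biggest levers named above, here is a small Python sketch of what Druid's ingestion-time roll-up conceptually does: truncate timestamps to the query granularity, group by dimensions, and pre-aggregate measures. The event shape and granularity are invented for illustration.

```python
# Conceptual sketch of Druid-style ingestion roll-up: truncate event time to
# the query granularity, group by (time, dimensions), and pre-aggregate.
# The event fields below are invented for illustration.
from collections import defaultdict

GRANULARITY = 60  # roll up to one-minute buckets

events = [
    {"ts": 1478000003, "country": "US", "clicks": 1},
    {"ts": 1478000041, "country": "US", "clicks": 2},
    {"ts": 1478000071, "country": "US", "clicks": 1},
]

rollup: dict[tuple[int, str], int] = defaultdict(int)
for e in events:
    bucket = e["ts"] - e["ts"] % GRANULARITY         # minute-truncated time
    rollup[(bucket, e["country"])] += e["clicks"]    # summed at ingestion

for (bucket, country), clicks in sorted(rollup.items()):
    print(bucket, country, clicks)   # fewer stored rows than raw events
```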
First Steps with Globus Compute Multi-User Endpoints – Globus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researchers' workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we encountered were that each researcher had to set up and manage their own single-user Globus Compute endpoint, and that the workloads had varying resource requirements (CPUs, memory, and wall time) between runs. We hope that the multi-user endpoint will help to address these challenges, and we share an update on our progress here.
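For readers new to Globus Compute, here is a minimal sketch of the offload pattern described above, using the globus_compute_sdk Executor. The endpoint UUID is a placeholder, and the function is a toy stand-in for the computationally expensive pharmacology steps.

```python
# Minimal Globus Compute sketch (pip install globus-compute-sdk): submit a
# Python function from a local machine (e.g. Windows) to run on a remote
# endpoint such as an HPC cluster. The endpoint UUID is a placeholder and
# simulate_step stands in for the real computationally expensive workload.
from globus_compute_sdk import Executor

ENDPOINT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

def simulate_step(dose_mg: float, hours: int) -> float:
    """Toy stand-in for an expensive model step; runs on the remote endpoint."""
    return dose_mg * 0.5 ** (hours / 6.0)             # toy half-life decay

with Executor(endpoint_id=ENDPOINT_ID) as ex:
    future = ex.submit(simulate_step, 100.0, 12)
    print(future.result())                            # 25.0, computed remotely
```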
May Marketo Masterclass, London MUG May 22 2024.pdf – Adele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ... – Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc. I didn't get rich from it, but my extensions had 63K downloads (powering possibly tens of thousands of websites).
Understanding Globus Data Transfers with NetSage – Globus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G... – Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Globus Compute with IRI Workflows - GlobusWorld 2024 – Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Globus Connect Server Deep Dive - GlobusWorld 2024 – Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Cyaniclab: Software Development Agency Portfolio.pdf – Cyanic lab
CyanicLab, an offshore custom software development company based in Sweden, India, and Finland, is your go-to partner for startup development and innovative web design solutions. Our expert team specializes in crafting cutting-edge software tailored to meet the unique needs of startups and established enterprises alike. From conceptualization to execution, we offer comprehensive services including web and mobile app development, UI/UX design, and ongoing software maintenance. Ready to elevate your business? Contact CyanicLab today and let us propel your vision to success with our top-notch IT solutions.
Top 7 Unique WhatsApp API Benefits | Saudi Arabia – Yara Milbes
Discover the transformative power of the WhatsApp API in our latest SlideShare presentation, "Top 7 Unique WhatsApp API Benefits." In today's fast-paced digital era, effective communication is crucial for both personal and professional success. Whether you're a small business looking to enhance customer interactions or an individual seeking seamless communication with loved ones, the WhatsApp API offers robust capabilities that can significantly elevate your experience.
In this presentation, we delve into the top 7 distinctive benefits of the WhatsApp API, provided by the leading WhatsApp API service provider in Saudi Arabia. Learn how to streamline customer support, automate notifications, leverage rich media messaging, run scalable marketing campaigns, integrate secure payments, synchronize with CRM systems, and ensure enhanced security and privacy.
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam – takuyayamamoto1800
In these slides, we show a simulation example and how to compile the solver.
The Helmholtz equation can be solved with helmholtzFoam; the Helmholtz equation with uniformly dispersed bubbles can be simulated with helmholtzBubbleFoam.
Developing Distributed High-performance Computing Capabilities of an Open Sci... – Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart... – Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Custom Healthcare Software for Managing Chronic Conditions and Remote Patient... – Mind IT Systems
Healthcare providers often struggle with the complexities of chronic conditions and remote patient monitoring, as each patient requires personalized care and ongoing monitoring. Off-the-shelf solutions may not meet these diverse needs, leading to inefficiencies and gaps in care. This is where custom healthcare software offers a tailored solution, ensuring improved care and effectiveness.