This document discusses lessons learned from building a scalable, self-serve, real-time, multi-tenant monitoring service at Yahoo. It describes transitioning from a classical architecture to one based on real-time big data technologies like Storm and Kafka. Key lessons include properly handling producer-consumer problems at scale, challenges of debugging skewed data, strategically managing multi-tenancy and resources, issues optimizing asynchronous systems, and not neglecting assumptions outside the application.
The NameNode was experiencing high load and instability after being restarted. Graphs showed unexplained spikes in load between checkpoints on the NameNode. DataNode logs showed repeated 60,000 ms timeouts when communicating with the NameNode. Thread dumps revealed NameNode server handlers waiting on the same lock, indicating a bottleneck. Source code analysis pointed to repeated block reports from DataNodes to the NameNode as the likely cause of the high load.
Tl;dr: How do you make Apache Spark process data efficiently? Lessons learned from running a petabyte-scale Hadoop cluster and from optimising dozens of Spark jobs, including the most spectacular: from 2500 GB of RAM down to 240.
Apache Spark is extremely popular for processing data on Hadoop clusters. If your Spark executors go down, the memory allocation is increased. If processing is too slow, the number of executors is increased. This works for a while, but sooner or later you end up with a fully utilized cluster running inefficiently.
During the presentation, we will share our lessons learned and the performance improvements we made to Spark jobs, including the most spectacular: from 2500 GB of RAM down to 240. We will also answer questions like:
- How do PySpark jobs differ from Scala jobs in terms of performance?
- How does caching affect dynamic resource allocation?
- Why is it worth using mapPartitions?
and many more.
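On the mapPartitions point, the usual argument is that per-record setup costs (opening a connection, loading a model) are paid once per partition instead of once per record. A minimal sketch in plain Python (not Spark's actual API; the cost constants are illustrative assumptions):

```python
# Conceptual sketch of why mapPartitions can beat map: a fixed setup cost
# is paid once per partition rather than once per record.
SETUP_COST = 100   # e.g. opening a DB connection or loading a model
PER_RECORD = 1     # cost to process one record

def cost_with_map(partitions):
    # map-style: setup repeated for every single record
    return sum(SETUP_COST + PER_RECORD for part in partitions for _ in part)

def cost_with_map_partitions(partitions):
    # mapPartitions-style: setup once per partition, then stream the records
    return sum(SETUP_COST + PER_RECORD * len(part) for part in partitions)

partitions = [list(range(1000)) for _ in range(8)]  # 8 partitions x 1000 rows
print(cost_with_map(partitions))             # 808000
print(cost_with_map_partitions(partitions))  # 8800
```

The two orders of magnitude between the totals are exactly the kind of gap that shows up when heavyweight initialization is accidentally placed inside a per-record map.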
Unified Batch & Stream Processing with Apache Samza (DataWorks Summit)
The traditional lambda architecture has been a popular solution for joining offline batch operations with real-time operations. This setup incurs significant developer and operational overhead, since it involves maintaining code that must produce the same result in two potentially different distributed systems. To alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experience running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) pluggable data sources and sinks; b) a deployment model supporting different execution environments such as YARN or VMs; c) a unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of the lambda architecture. We will use real production use cases to illustrate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
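The core idea of a unified processing API can be sketched in a few lines of plain Python (this is not Samza's API; the source/pipeline names are illustrative): the same transformation logic runs unchanged over a bounded batch source or an unbounded stream source.

```python
# Sketch of a unified processing layer: one pipeline definition,
# pluggable bounded (batch) and unbounded (stream) sources.
def batch_source():
    yield from [1, 2, 3]        # bounded: e.g. files in HDFS

def stream_source():
    yield from iter([4, 5, 6])  # unbounded in reality; finite here for the sketch

def pipeline(source):
    """The same transformation logic, regardless of source type."""
    return [x * 10 for x in source()]

print(pipeline(batch_source))   # [10, 20, 30]
print(pipeline(stream_source))  # [40, 50, 60]
```

In a lambda architecture, `pipeline` would exist twice, once per system; the unified framework keeps a single definition and swaps only the source.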
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... (Databricks)
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
Jump Start on Apache Spark 2.2 with Databricks (Anyscale)
Apache Spark 2.0 and the subsequent Spark 2.1 and 2.2 releases have laid the foundation for many new features and much new functionality. Its three main themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
I gave this talk at the Highload++ 2015 conference in Moscow. The slides have been translated into English. They cover the Apache HAWQ components, its architecture, its query processing logic, and competitive information.
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez) (Sudhir Mallem)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
using storage formats: Parquet, ORC, RCFile and Avro
Compression: Snappy, zlib and default compression (gzip)
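The compression trade-off the POC compares can be sanity-checked with Python's standard library (Snappy is omitted here since it is not in the stdlib; the sample data is an illustrative assumption). gzip and zlib share the same deflate algorithm, so at the same level they differ only by container overhead:

```python
# Compare zlib vs gzip output sizes on repetitive, columnar-style data.
import gzip
import zlib

data = b"row_value,row_value,row_value\n" * 10_000

zlib_out = zlib.compress(data, level=6)
gzip_out = gzip.compress(data, compresslevel=6)  # gzip = deflate + header/CRC trailer

print(len(data), len(zlib_out), len(gzip_out))
# gzip output is slightly larger than raw zlib because of its bigger
# header and checksum; the deflate payload itself is the same.
```

In practice, CPU cost matters as much as size, which is why fast-but-lighter codecs such as Snappy are also in the comparison.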
How to Use Parquet as a Basis for ETL and Analytics (DataWorks Summit)
Parquet is a columnar storage format that provides efficient compression and querying capabilities. It aims to store data efficiently for analysis while supporting interoperability across systems. Parquet uses column-oriented storage with efficient encodings and statistics to enable fast querying of large datasets. It integrates with many query engines and frameworks like Hive, Impala, Spark and MapReduce to allow projection and predicate pushdown for optimized queries.
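The two optimizations named above, projection pushdown (read only needed columns) and predicate pushdown (skip chunks using min/max statistics), can be sketched in plain Python. This is a conceptual model, not the Parquet format itself; the column names and chunking are illustrative:

```python
# Column-oriented layout: each column stored separately, split into chunks
# with min/max stats, loosely mimicking Parquet row groups.
columns = {
    "price": [[3, 7, 9], [12, 15, 18]],   # two chunks
    "qty":   [[1, 2, 3], [4, 5, 6]],
}
stats = {col: [(min(c), max(c)) for c in chunks] for col, chunks in columns.items()}

def scan(column, predicate, lo, hi):
    """Read one column only (projection), skipping chunks whose
    [min, max] range cannot satisfy the predicate (predicate pushdown)."""
    out = []
    for chunk, (cmin, cmax) in zip(columns[column], stats[column]):
        if cmax < lo or cmin > hi:
            continue  # whole chunk skipped without reading its values
        out.extend(v for v in chunk if predicate(v))
    return out

# Query: price BETWEEN 10 AND 20 -> the first chunk is skipped entirely
print(scan("price", lambda v: 10 <= v <= 20, 10, 20))  # [12, 15, 18]
```

The `qty` column is never touched at all, which is the projection half of the optimization.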
Building a Large-Scale Transactional Data Lake Using Apache Hudi (Bill Liu)
Data is a critical piece of infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business-critical data pipelines at low latency and high efficiency and to help distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what Apache Hudi is and its architectural design, and then deep dive into how it improves data operations through features such as data versioning and time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
Hoodie, an open-source incremental processing framework, is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS within a few minutes and chaining of incremental processing in Hadoop.
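The two primitives named in the summary, upsert and incremental pull, can be sketched in plain Python (this is not Hudi's actual API; the table layout and function names are illustrative assumptions):

```python
# Minimal model of a Hudi-style table: each record key carries the commit
# time of its last mutation, so consumers can pull only what changed.
table = {}          # record_key -> (commit_time, value)
commit_time = 0

def upsert(records):
    """Insert new keys and update existing ones, stamping a commit time."""
    global commit_time
    commit_time += 1
    for key, value in records.items():
        table[key] = (commit_time, value)
    return commit_time

def incremental_pull(since):
    """Return only records committed after `since` -- no full table scan."""
    return {k: v for k, (t, v) in table.items() if t > since}

upsert({"a": 1, "b": 2})
upsert({"b": 20, "c": 3})            # update b, insert c
print(incremental_pull(since=1))     # {'b': 20, 'c': 3}
```

A downstream job that remembers the last commit time it consumed can thus reprocess only mutated records, which is what makes chained incremental pipelines cheap compared to repeated full scans.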
Cisco Connect Toronto 2015: Big Data (Sean McKeown, Cisco Canada)
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Speeding Up Spark with Data Compression on Xeon+FPGA with David Ojika (Databricks)
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach affords more performance and energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programming interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
Designing and Building Next Generation Data Pipelines at Scale with Structure... (Databricks)
This document discusses the evolution of data pipelines at Databricks over time from 2014 to present day. Early pipelines involved copying data from S3 hourly, which did not scale. Later pipelines used Amazon Kinesis but led to performance issues with many small files. The document then introduces structured streaming and Delta Lake as better solutions. Structured streaming provides correctness while Delta Lake improves performance, scalability, and makes data management and GDPR compliance easier through features like ACID transactions, automatic schema management, and built-in deletion/update support.
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBase (HBaseCon)
As the operator of the dominant messenger application in South Korea, KakaoTalk has more than 170 million users, and our ever-growing graph has more than 10B edges and 200M vertices. This scale presents several technical challenges for storing and querying the graph data, but we have resolved them by creating a new distributed graph database with HBase. Here you'll learn the methodology and architecture we used to solve the problems, compare it with another well-known graph database, Titan, and explore the HBase issues we encountered.
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S... (Cloudera, Inc.)
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
Hadoop meets Agile! - An Agile Big Data Model (Uwe Printz)
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
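The Pareto-optimality criterion the framework searches for can be sketched in plain Python (the actual work uses a multi-objective genetic algorithm over emulated container runs; the configurations and their time/cost numbers below are made up for illustration):

```python
# A configuration is Pareto-optimal if no other configuration is at least
# as good on both execution time and deployment cost and strictly better
# on at least one of them.
configs = [
    {"name": "small",  "time": 120, "cost": 10},
    {"name": "medium", "time": 80,  "cost": 18},
    {"name": "large",  "time": 60,  "cost": 35},
    {"name": "waste",  "time": 90,  "cost": 40},  # dominated by "medium"
]

def dominates(a, b):
    return (a["time"] <= b["time"] and a["cost"] <= b["cost"]
            and (a["time"] < b["time"] or a["cost"] < b["cost"]))

pareto = [c for c in configs if not any(dominates(o, c) for o in configs)]
print([c["name"] for c in pareto])  # ['small', 'medium', 'large']
```

The genetic algorithm's job is to explore the configuration space efficiently; the result it hands back is a front like `pareto` above, from which operators pick a time/cost trade-off.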
Python in the Hadoop Ecosystem (Rock Health presentation) (Uri Laserson)
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Foundations of streaming SQL: stream & table theory (DataWorks Summit)
The document provides an overview of streaming SQL and time-varying relations. It discusses:
1) How relations evolve over time in streaming SQL, with data divided into time intervals. This allows querying the relation at any point in time.
2) The closure properties of relational algebra still apply to time-varying relations. Operations like filtering and grouping can be performed on intervals of the relation.
3) Streaming SQL extends classic SQL to handle continuous queries over streaming data, represented as time-varying relations divided into time-based intervals.
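The idea of a time-varying relation can be made concrete with a small plain-Python sketch (the event data and helper names are illustrative, not from the talk): the relation is a function of time, and an ordinary relational query can be evaluated against its contents at any chosen point.

```python
# A time-varying relation: an append-only stream of timestamped events,
# snapshotted at a point in time and then queried with classic semantics.
events = [  # (event_time, user, score)
    (1, "amy", 3),
    (2, "bob", 4),
    (4, "amy", 5),
]

def relation_at(t):
    """The relation's contents as of time t."""
    return [(user, score) for (et, user, score) in events if et <= t]

def total_by_user(rows):
    """An ordinary grouped aggregation, applied to one snapshot."""
    totals = {}
    for user, score in rows:
        totals[user] = totals.get(user, 0) + score
    return totals

print(total_by_user(relation_at(2)))  # {'amy': 3, 'bob': 4}
print(total_by_user(relation_at(4)))  # {'amy': 8, 'bob': 4}
```

A streaming SQL engine effectively keeps re-evaluating the same query as the relation advances through these time points, rather than querying a single frozen table.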
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop (DataWorks Summit)
Hadoop Eagle is a full-stack realtime monitoring framework for eBay's Hadoop clusters. It uses task failure ratios to detect node anomalies, and monitors jobs, performance, and metrics across clusters in real-time. The framework addresses challenges of monitoring eBay's large Hadoop environment, which includes 10+ clusters, 10,000+ data nodes, and processing of 50 million+ tasks per day.
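The task-failure-ratio heuristic can be sketched in plain Python (the summary does not specify Eagle's exact algorithm, so the threshold rule and the node counts below are illustrative assumptions):

```python
# Flag nodes whose task failure ratio is far above the cluster-wide rate.
node_tasks = {  # node -> (failed_tasks, total_tasks)
    "node-a": (2, 1000),
    "node-b": (3, 1200),
    "node-c": (150, 1100),  # suspicious outlier
}

cluster_failed = sum(f for f, _ in node_tasks.values())
cluster_total = sum(t for _, t in node_tasks.values())
cluster_ratio = cluster_failed / cluster_total

def anomalous(threshold_factor=2.0):
    """Nodes whose failure ratio exceeds threshold_factor x cluster rate."""
    return [n for n, (f, t) in node_tasks.items()
            if (f / t) > threshold_factor * cluster_ratio]

print(anomalous())  # ['node-c']
```

At eBay's scale (50M+ tasks per day), even this simple ratio comparison has to run as a streaming aggregation rather than a batch query, which is what the "real-time" in the framework's name refers to.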
This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk was held at the JavaLand conference in Brühl, Germany, on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... (DataStax)
At Knewton we operate a total of 29 clusters across five different VPCs, each cluster ranging from 3 to 24 nodes. Maintaining this with a team of three is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Presentation given for the SQLPass community at SQLBits XIV in London. The presentation is an overview of the performance improvements provided to Hive by the Stinger initiative.
LLAP enables sub-second analytical queries in Hive by running query fragments directly in memory on compute nodes using a long-running daemon process. It provides high performance scans and execution through an in-memory columnar cache shared across queries. LLAP queries are coordinated independently by Tez while utilizing Hive operators for processing and Tez for data transfers. It improves upon traditional MapReduce and Tez by keeping intermediate query results in memory rather than writing to disk.
C19013010: The Tutorial to Build Shared AI Services, Session 2 (Bill Liu)
This document provides an agenda and overview for a tutorial on building shared AI services. The session will cover AI engineering platforms, data pipelines, traditional AI roles and their challenges, skills required for AI engineers, and benchmarking machine learning and deep learning approaches. It includes a live demo of building an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, Preset (Hosted by Confluent)
Streaming data systems have been growing rapidly in importance in the modern data stack. Kafka's ksqlDB provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases, including anomaly detection, operational monitoring, and online data integration.
Building large scale transactional data lake using apache hudiBill Liu
Data is a critical infrastructure for building machine learning systems. From ensuring accurate ETAs to predicting optimal traffic routes, providing safe, seamless transportation and delivery experiences on the Uber platform requires reliable, performant large-scale data storage and analysis. In 2016, Uber developed Apache Hudi, an incremental processing framework, to power business critical data pipelines at low latency and high efficiency, and helps distributed organizations build and manage petabyte-scale data lakes.
In this talk, I will describe what is APache Hudi and its architectural design, and then deep dive to improving data operations by providing features such as data versioning, time travel.
We will also go over how Hudi brings kappa architecture to big data systems and enables efficient incremental processing for near real time use cases.
Speaker: Satish Kotha (Uber)
Apache Hudi committer and Engineer at Uber. Previously, he worked on building real time distributed storage systems like Twitter MetricsDB and BlobStore.
website: https://www.aicamp.ai/event/eventdetails/W2021043010
An Open Source Incremental Processing Framework called Hoodie is summarized. Key points:
- Hoodie provides upsert and incremental processing capabilities on top of a Hadoop data lake to enable near real-time queries while avoiding costly full scans.
- It introduces primitives like upsert and incremental pull to apply mutations and consume only changed data.
- Hoodie stores data on HDFS and provides different views like read optimized, real-time, and log views to balance query performance and data latency for analytical workloads.
- The framework is open source and built on Spark, providing horizontal scalability and leveraging existing Hadoop SQL query engines like Hive and Presto.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction which enables applying mutations to data in HDFS on the order of few minutes and chaining of incremental processing in hadoop
Cisco connect toronto 2015 big data sean mc keownCisco Canada
The document provides an overview of big data concepts and architectures. It discusses key topics like Hadoop, HDFS, MapReduce, NoSQL databases, and MPP relational databases. It also covers network design considerations for big data, common traffic patterns in Hadoop, and how to optimize performance through techniques like data locality and quality of service policies.
Speeding Up Spark with Data Compression on Xeon+FPGA with David OjikaDatabricks
Data compression is a key aspect in big data processing frameworks, such as Apache Hadoop and Spark, because compression enables the size of the input, shuffle and output data to be reduced, thus potentially speeding up overall processing time by orders of magnitude, especially for large-scale systems. However, since many compression algorithms with good compression ratio are also very CPU-intensive, developers are often forced to use algorithms that are less CPU-intensive at the cost of reduced compression ratio.
In this session, you’ll learn about a field-programmable gate array (FPGA)-based approach for accelerating data compression in Spark. By opportunistically offloading compute-heavy, compression tasks to the FPGA, the CPU is freed to perform other tasks, resulting in an improved overall performance for end-user applications. In contrast to existing GPU methods for acceleration, this approach affords more performance/energy efficiency, which can translate to significant savings in power and cooling costs, especially for large datacenters. In addition, this implementation offers the benefit of reconfigurability, allowing for the FPGA to be rapidly reprogrammed with a different algorithm to meet system or user requirements.
Using the Intel Xeon+FPGA platform, Ojika will share how they ported Swif (simplified workload-intuitive framework) to Spark, and the method used to enable an end-to-end, FPGA-aware Spark deployment. Swif is an in-house framework developed to democratize and simplify the deployment of FPGAs in heterogeneous datacenters. Using Swif’s application programmable interface (API), he’ll describe how system architects and software developers can seamlessly integrate FPGAs into their Spark workflow, and in particular, deploy FPGA-based compression schemes that achieve improved performance compared to software-only approaches. In general, Swif’s software stack, along with the underlying Xeon+FPGA hardware platform, provides a workload-centric processing environment that streamlines the process of offloading CPU-intensive tasks to shared FPGA resources, while providing improved system throughput and high resource utilization.
Designing and Building Next Generation Data Pipelines at Scale with Structure...Databricks
This document discusses the evolution of data pipelines at Databricks over time from 2014 to present day. Early pipelines involved copying data from S3 hourly, which did not scale. Later pipelines used Amazon Kinesis but led to performance issues with many small files. The document then introduces structured streaming and Delta Lake as better solutions. Structured streaming provides correctness while Delta Lake improves performance, scalability, and makes data management and GDPR compliance easier through features like ACID transactions, automatic schema management, and built-in deletion/update support.
HBaseCon 2015: S2Graph - A Large-scale Graph Database with HBaseHBaseCon
As the operator of the dominant messenger application in South Korea, KakaoTalk has more than 170 million users, and our ever-growing graph has more than 10B edges and 200M vertices. This scale presents several technical challenges for storing and querying the graph data, but we have resolved them by creating a new distributed graph database with HBase. Here you'll learn the methodology and architecture we used to solve the problems, compare it another famous graph database, Titan, and explore the HBase issues we encountered.
A sharing in a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
HBaseCon 2013: Apache Drill - A Community-driven Initiative to Deliver ANSI S...Cloudera, Inc.
Apache Drill is an interactive SQL query engine for analyzing large scale datasets. It allows for querying data stored in HBase and other data sources. Drill uses an optimistic execution model and late binding to schemas to enable fast queries without requiring metadata definitions. It leverages recent techniques like vectorized operators and late record materialization to improve performance. The project is currently in alpha stage but aims to support features like nested queries, Hive UDFs, and optimized joins with HBase.
Hadoop meets Agile! - An Agile Big Data ModelUwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
This document proposes a container-based sizing framework for Apache Hadoop/Spark clusters that uses a multi-objective genetic algorithm approach. It emulates container execution on different cloud platforms to optimize configuration parameters for minimizing execution time and deployment cost. The framework uses Docker containers with resource constraints to model cluster performance on various public clouds and instance types. Optimization finds Pareto-optimal configurations balancing time and cost across objectives.
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
A presentation covering the use of Python frameworks on the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop created by eBay to address limitations of existing tools in handling large volumes of metrics and logs from Hadoop clusters. It provides data activity monitoring, job performance monitoring, and unified monitoring. Eagle detects anomalies using machine learning algorithms and notifies users through alerts. It has been deployed across multiple eBay clusters with over 10,000 nodes and processes hundreds of thousands of events per day.
Foundations of streaming SQL: stream & table theoryDataWorks Summit
The document provides an overview of streaming SQL and time-varying relations. It discusses:
1) How relations evolve over time in streaming SQL, with data divided into time intervals. This allows querying the relation at any point in time.
2) The closure properties of relational algebra still apply to time-varying relations. Operations like filtering and grouping can be performed on intervals of the relation.
3) Streaming SQL extends classic SQL to handle continuous queries over streaming data, represented as time-varying relations divided into time-based intervals.
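The snapshot idea behind time-varying relations can be sketched in a few lines of Python; the row layout and names here are illustrative, not anything from the talk:

```python
from dataclasses import dataclass

# A time-varying relation: each row carries a validity interval
# [start, end). Querying "the relation at time t" selects the rows
# whose interval contains t -- ordinary relational operators then
# apply unchanged (the closure property mentioned above).
@dataclass
class VersionedRow:
    start: int   # event time the row became valid
    end: int     # event time it stopped being valid (exclusive)
    key: str
    value: int

def relation_at(rows, t):
    """Snapshot of the time-varying relation at event time t."""
    return [r for r in rows if r.start <= t < r.end]

def group_sum(snapshot):
    """Classic relational grouping applied to a snapshot."""
    out = {}
    for r in snapshot:
        out[r.key] = out.get(r.key, 0) + r.value
    return out

rows = [
    VersionedRow(0, 10, "a", 1),
    VersionedRow(5, 10, "a", 2),   # a second 'a' row appears at t=5
    VersionedRow(0, 5,  "b", 7),   # the 'b' row is retracted at t=5
]

print(group_sum(relation_at(rows, 2)))  # {'a': 1, 'b': 7}
print(group_sum(relation_at(rows, 7)))  # {'a': 3}
```

A continuous query is then just this snapshot-plus-aggregate evaluated at every interval boundary instead of once.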
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopDataWorks Summit
Hadoop Eagle is a full-stack realtime monitoring framework for eBay's Hadoop clusters. It uses task failure ratios to detect node anomalies, and monitors jobs, performance, and metrics across clusters in real-time. The framework addresses challenges of monitoring eBay's large Hadoop environment, which includes 10+ clusters, 10,000+ data nodes, and processing of 50 million+ tasks per day.
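A minimal sketch of the failure-ratio heuristic, with hypothetical node names and a made-up threshold (the abstract does not specify Eagle's actual detection logic):

```python
def failure_ratios(task_counts):
    """task_counts: node -> (failed, total). Returns node -> failure ratio."""
    return {node: failed / total
            for node, (failed, total) in task_counts.items() if total > 0}

def anomalous_nodes(task_counts, threshold=0.5):
    """Flag nodes whose task failure ratio exceeds the threshold --
    a simplified stand-in for using failure ratios to spot bad nodes."""
    return sorted(node for node, ratio in failure_ratios(task_counts).items()
                  if ratio > threshold)

counts = {
    "node-01": (2, 100),    # 2% failures: healthy
    "node-02": (80, 100),   # 80% failures: likely a bad disk or NIC
    "node-03": (0, 50),
}
print(anomalous_nodes(counts))  # ['node-02']
```

At eBay's scale (50 million+ tasks per day), the interesting engineering is computing these ratios incrementally over a stream rather than batch-scanning counters.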
This talk takes you on a rollercoaster ride through Hadoop 2 and explains the most significant changes and components.
The talk has been held on the JavaLand conference in Brühl, Germany on 25.03.2014.
Agenda:
- Welcome Office
- YARN Land
- HDFS 2 Land
- YARN App Land
- Enterprise Land
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. Maintaining this with a team of three is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Presentation given for the SQLPass community at SQLBits XIV in Londen. The presentation is an overview about the performance improvements provided to Hive with the Stinger initiative.
LLAP enables sub-second analytical queries in Hive by running query fragments directly in memory on compute nodes using a long-running daemon process. It provides high performance scans and execution through an in-memory columnar cache shared across queries. LLAP queries are coordinated independently by Tez while utilizing Hive operators for processing and Tez for data transfers. It improves upon traditional MapReduce and Tez by keeping intermediate query results in memory rather than writing to disk.
Similar to Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
C19013010 the tutorial to build shared ai services session 2Bill Liu
This document provides an agenda and overview for a tutorial on building shared AI services. The session will cover AI engineering platforms, data pipelines, traditional AI roles and their challenges, skills required for AI engineers, and benchmarking machine learning and deep learning approaches. It includes a live demo of building an end-to-end AI pipeline with Kafka, NiFi, Spark Streaming and Keras on Spark.
Streaming Data Analytics with ksqlDB and Superset | Robert Stolz, PresetHostedbyConfluent
Streaming data systems have been growing rapidly in importance to the modern data stack. Kafka’s kSQL provides an interface for analytic tools that speak SQL. Apache Superset, the most popular modern open-source visualization and analytics solution, plugs into nearly any data source that speaks SQL, including Kafka. Here, we review and compare methods for connecting Kafka to Superset to enable streaming analytics use cases including anomaly detection, operational monitoring, and online data integration.
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
Many have dubbed the 2020s the decade of data. This is indeed an era of data zeitgeist.
From code-centric software development 1.0, we are entering software development 2.0, a data-centric and data-driven approach, where data plays a central theme in our everyday lives.
As the volume and variety of data garnered from myriad data sources continue to grow at an astronomical scale and as cloud computing offers cheap computing and data storage resources at scale, the data platforms have to match in their abilities to process, analyze, and visualize at scale and speed and with ease — this involves data paradigm shifts in processing and storing and in providing programming frameworks to developers to access and work with these data platforms.
In this talk, we will survey some emerging technologies that address the challenges of data at scale, how these tools help data scientists and machine learning developers with their data tasks, why they scale, and how they facilitate the future data scientists to start quickly.
In particular, we will examine in detail two open-source tools MLflow (for machine learning life cycle development) and Delta Lake (for reliable storage for structured and unstructured data).
Other emerging tools such as Koalas help data scientists to do exploratory data analysis at scale in a language and framework they are familiar with as well as emerging data + AI trends in 2021.
You will understand the challenges of machine learning model development at scale, why you need reliable and scalable storage, and what other open source tools are at your disposal to do data science and machine learning at scale.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Databricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Real-Time Analytics With StarRocks (DWH+DL).pdfAlbert Wong
StarRocks, an open-source distributed columnar engine for real-time analytics, is carving its niche in the big data landscape. Its ability to handle high-velocity data streams and deliver blazing-fast query responses makes it a compelling choice for modern analytics workloads. Let's delve into the intricacies of real-time analytics with StarRocks and explore its capabilities.
Data Ingestion:
The journey begins with ingesting data into StarRocks. It supports a variety of real-time data sources, including Kafka, Pulsar, and custom streaming protocols. These integrations allow seamless data flow from streaming sources to StarRocks, ensuring minimal latency.
Stream Processing and Storage:
StarRocks employs a hybrid architecture for real-time processing. Incoming data streams are first processed by lightweight stream engines like Flink or Spark. These engines perform initial aggregations and transformations, preparing the data for efficient storage in StarRocks' columnar format. This format facilitates rapid data retrieval and filtering, crucial for real-time querying.
Real-time Querying:
The true power of StarRocks lies in its real-time query engine. Once data lands in StarRocks, users can leverage SQL-like queries to analyze it with minimal lag. StarRocks optimizes queries by exploiting its columnar storage and parallel processing capabilities. This enables sub-second response times for even complex queries, empowering users to gain immediate insights from their data.
Advanced Features:
StarRocks packs several features that further enhance its real-time analytics prowess. Materialized views act as pre-computed summaries of data, accelerating frequently used queries. Additionally, StarRocks' automatic tiered storage seamlessly migrates less frequently accessed data to cost-effective storage solutions, optimizing resource utilization.
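Why a materialized view helps can be shown with a toy sketch: the aggregation cost is paid incrementally at ingest time, so reads become constant-time lookups instead of full scans. This is plain Python, not StarRocks syntax:

```python
class MaterializedSum:
    """Toy materialized view: maintains a running SUM(value) GROUP BY key,
    updated incrementally as rows arrive, so a frequent query is an O(1)
    dictionary lookup rather than a scan over the base table."""
    def __init__(self):
        self.totals = {}

    def ingest(self, key, value):
        # The view is refreshed as part of the write path.
        self.totals[key] = self.totals.get(key, 0) + value

    def query(self, key):
        return self.totals.get(key, 0)

view = MaterializedSum()
for key, value in [("us", 3), ("eu", 5), ("us", 4)]:
    view.ingest(key, value)
print(view.query("us"))  # 7
```

Real engines generalize this to arbitrary aggregates and rewrite incoming queries to hit the view automatically; the cost model is the same trade of write-time work for read-time speed.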
Powering Interactive BI Analytics with Presto and Delta LakeDatabricks
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains as well as integration with other big data technologies such as Apache Spark, Druid, and Kafka. The talk will also provide a glimpse of what is expected to come in the near future.
Speeding Time to Insight with a Modern ELT ApproachDatabricks
The availability of new tools in the modern data stack is changing the way data teams operate. Specifically, the modern data stack supports an “ELT” approach for managing data, rather than the traditional “ETL” approach. In an ELT approach, data sources are automatically loaded in a normalized state into Delta Lake and opinionated transformations happen in the data destination using dbt. This workflow allows data analysts to move more quickly from raw data to insight, while creating repeatable data pipelines robust to changes in the source datasets. In this presentation, we’ll illustrate how easy it is for even a data analytics team of one to develop an end-to-end data pipeline. We’ll load data from GitHub into Delta Lake, then use pre-built dbt models to feed a daily Redash dashboard on sales performance by manager, and use the same transformed models to power the data science team’s predictions of future sales by segment.
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
Element Fleet has the largest benchmark database in our industry and we needed a robust and linearly scalable platform to turn this data into actionable insights for our customers. The platform needed to support advanced analytics, streaming data sets, and traditional business intelligence use cases.
In this presentation, we will discuss how we built a single, unified platform for both Advanced Analytics and traditional Business Intelligence using Cassandra on DSE. With Cassandra as our foundation, we are able to plug in the appropriate technology to meet varied use cases. The platform we’ve built supports real-time streaming (Spark Streaming/Kafka), batch and streaming analytics (PySpark, Spark Streaming), and traditional BI/data warehousing (C*/FiloDB). In this talk, we are going to explore the entire tech stack and the challenges we faced trying to support the above use cases. We will specifically discuss how we ingest and analyze IoT (vehicle telematics data) in real-time and batch, combine data from multiple data sources into a single data model, and support standardized and ad-hoc reporting requirements.
About the Speaker
Jim Peregord Vice President - Analytics, Business Intelligence, Data Management, Element Corp.
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
Discover how to avoid common pitfalls when shifting to an event-driven architecture (EDA) in order to boost system recovery and scalability. We cover Kafka Schema Registry, in-broker transformations, event sourcing, and more.
Architecting next generation big data platformhadooparchbook
A tutorial on architecting next generation big data platform by the authors of O'Reilly's Hadoop Application Architectures book. This tutorial discusses how to build a customer 360 (or entity 360) big data application.
Audience: Technical.
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
"Data mesh is a relatively recent architectural innovation, espoused as one of the best ways to fix analytic data. We renegotiate aged social conventions by focusing on treating data as a product, with a clearly defined data product owner, akin to that of any other product. In addition, we focus on building out a self-service platform with integrated governance, letting consumers safely access and use the data they need to solve their business problems.
Data mesh is prescribed as a solution for _analytical data_, so that conventionally analytical results (think weekly sales or monthly revenue reports) can be more accurately and predictably computed. But what about non-analytical business operations? Would they not also benefit from data products backed by self-service capabilities and dedicated owners? If you've ever provided a customer with an analytical report that differed from their operational conclusions, then this talk is for you.
Adam discusses the resounding successes he has seen from applying data mesh _off-label_ to both analytical and operational domains. The key? Event streams. Well-defined, incrementally updating data products that can power both real-time and batch-based applications, providing a single source of data for a wide variety of application and analytical use cases. Adam digs into the common areas of success seen across numerous clients and customers and provides you with a set of practical guidelines for implementing your own minimally viable data mesh.
Finally, Adam covers the main social and technical hurdles that you'll encounter as you implement your own data mesh. Learn about important data use cases, data domain modeling techniques, self-service platforms, and building an iteratively successful data mesh."
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. It begins by introducing the speaker and GigaSpaces. It then discusses how modern applications require both online transaction processing and real-time operational intelligence. The document presents examples from retail and IoT and the goals of minimizing latency while maximizing data analytics locality. It provides an overview of in-memory computing options and describes how GigaSpaces uses an in-memory data grid combined with Spark to achieve HTAP. The document includes deployment diagrams and discusses data grid RDDs and pushing predicates to the data grid. It describes how this was productized as InsightEdge and provides additional innovations and reference architectures.
Architecting a Next Generation Data Platformhadooparchbook
This document discusses a presentation on architecting Hadoop application architectures for a next generation data platform. It provides an overview of the presentation topics which include a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high level architecture including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka streams and storage in Hadoop.
During this webinar, we will review best practices and lessons learned from working with large and mid-size companies on their deployment of PostgreSQL. We will explore the practices that helped industry leaders move through these stages quickly, and get as much value out of PostgreSQL as possible without incurring undue risk.
We have identified a set of levers that companies can use to accelerate their success with PostgreSQL:
- Application Tiering
- Collaboration between DBAs and Development Teams
- Evangelizing
- Standardization and Automation
- Balance of Migration and New Development
Architecting a next-generation data platformhadooparchbook
This document discusses a high-level architecture for analyzing taxi trip data in real-time and batch using Apache Hadoop and streaming technologies. The architecture includes ingesting data from multiple sources using Kafka, processing streaming data using stream processing engines, storing data in data stores like HDFS, and enabling real-time and batch querying and analytics. Key considerations discussed are choosing data transport and stream processing technologies, scaling and reliability, and processing both streaming and batch data.
This document discusses Hadoop at Yahoo, including:
- Yahoo has built a large multi-tenant Apache Hadoop deployment that powers many of its businesses and use cases.
- Over the years, Yahoo has scaled its Hadoop infrastructure significantly, now consisting of over 50,000 servers and 50PB of storage.
- Yahoo uses Hadoop for a wide range of use cases across advertising, search, personalization, anti-spam, and more, processing data at massive scales of billions of records daily.
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Sumeet Singh
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform so special at Yahoo.
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
Over the past year, a lot of progress has been made in advancing the Apache Hadoop platform at Yahoo. We underwent a massive infrastructure consolidation to lower the platform TCO. CaffeOnSpark was open-sourced for distributed deep learning on existing infrastructure with a combination of CPU and GPU-based computing. Traditional compute on MapReduce continues to shift to Apache Tez and Apache Spark for lower processing time. Our internal security, multi-tenancy, and scale changes to Apache Storm got pushed to the community in Storm 0.10. Omid was open-sourced for managing transactions reliably on Apache HBase. Multi-tenancy with region groups, splittable META, ZooKeeper-less assignment manager, favored nodes with HDFS block placement, and support for humongous tables have taken Apache HBase scale to new heights. Dependency management in Apache Oozie for combinatorial, conditional, and optional processing gives increased flexibility to our data pipelines teams in maintaining SLAs. Focus on ease of use and onboarding improvements have brought in a whole new class of use cases and users to the platform. In this talk, we will provide a comprehensive overview of the platform technology stack, recent developments, metrics, and share thoughts on where things are headed when it comes to big data at Yahoo.
With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform capabilities by pushing the boundaries of what the platform can accomplish for the entire organization. In this talk, Sumeet Singh will present some of the recent innovations, open source contributions, and where things are headed when it comes to Hadoop at Yahoo.
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
The Yahoo! Hadoop grid makes use of a managed service to get data pulled into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and underlying storage format of data for the users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplished the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about the files, partitions and their location. We will also demo the data-out capabilities, and go through other nice properties of the data-out feature.
Presenter(s):
Sumeet Singh, Senior Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!
Operating multi-tenant clusters requires careful planning of capacity for on-time launch of big data projects and applications within expected budget and with appropriate SLA guarantees. Making such guarantees with a set of standard hardware configurations is key to operate big data platforms as a hosted service for your organization.
This talk highlights the tools, techniques and methodology applied on a per-project or user basis across three primary multi-tenant deployments in the Apache Hadoop ecosystem, namely MapReduce/YARN and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively. We will demo the estimation tools developed for these deployments that can be used for capital planning and forecasting, and cluster resource and SLA management, including making latency and throughput guarantees to individual users and projects.
As we discuss the tools, we will share considerations that got incorporated to come up with the most appropriate calculation across these three primary deployments. We will discuss the data sources for calculations, resource drivers for different use cases, and how to plan for optimum capacity allocation per project with respect to given standard hardware configurations.
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
As organizations begin to make use of large data sets, approaches to understand and manage true costs of big data will become an important facet with increasing scale of operations.
Whether an on-premise or cloud-based platform is used for storing, processing and analyzing data, our approach explains how to calculate the total cost of ownership (TCO), develop a deeper understanding of compute and storage resources, and run the big data operations with its own P&L, full transparency in costs, and with metering and billing provisions. While our approach is generic, we will illustrate the methodology with three primary deployments in the Apache Hadoop ecosystem, namely MapReduce and HDFS, HBase, and Storm due to the significance of capital investments with increasing scale in data nodes, region servers, and supervisor nodes respectively.
As we discuss our approach, we will share insights gathered from the exercise conducted on one of the largest data infrastructures in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs of usage, measure resources consumed, optimize for higher utilization and ROI, and benchmark the cost.
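The unit-cost derivation described above reduces to simple arithmetic once TCO and utilization are known; a sketch with entirely hypothetical figures:

```python
def unit_cost(annual_tco, capacity, utilization):
    """Derive a unit cost (e.g. $ per GB-year of storage) by spreading the
    cluster's total cost of ownership over the capacity actually consumed.
    All figures below are hypothetical, not Yahoo's real numbers."""
    effective = capacity * utilization   # capacity actually used
    return annual_tco / effective

# Hypothetical HDFS tier: $1.2M/year TCO, 10 PB raw (in GB), 60% utilized.
storage_cost = unit_cost(1_200_000, 10_000_000, 0.60)
print(f"${storage_cost:.4f} per GB-year")        # $0.2000 per GB-year

# A project storing 50 TB can then be metered and billed its share:
project_gb = 50_000
print(f"${project_gb * storage_cost:,.0f} per year")  # $10,000 per year
```

Note how the utilization term matters: billing against raw rather than effective capacity understates unit cost and hides the incentive to raise utilization.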
Hadoop Summit San Jose 2014: Data Discovery on Hadoop Sumeet Singh
In the last eight years, the Hadoop grid infrastructure has allowed us to move towards a unified source of truth for all data at Yahoo that now accounts for over 450 petabytes of raw HDFS and 1.1 billion data files. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs have become critical with the increasing scale of operations.
In this talk, we will share our approach in tackling the above challenges with Apache HCatalog, a table and storage management layer for Hadoop. We will explain how to register existing HDFS files into HCatalog, provide broader but controlled access to data through a data discovery tool, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data into HCatalog. In addition, the approach leverages ever-improving Hive performance to open up easy ad-hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools.
As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop Sumeet Singh
Hadoop has allowed us to move towards a unified source of truth for all of organization’s data. Managing data location, schema knowledge and evolution, fine-grained business rules based access control, and audit and compliance needs will become critical with increasing scale of operations.
In this talk, we will share an approach in tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components like Pig, Hive, HBase and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach allows us to open up easy ad-hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how our approach minimizes data duplication, eliminates wasteful data retention, and solves for data provenance, lineage and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. Hadoop’s scalability, efficiency, built-in reliability, and cost effectiveness have made it an enterprise-wide platform that web-scale cloud operations run on. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers on a daily basis, including the challenges that come from scale, security and multi-tenancy we have dealt with in the last several years of operating one of the largest Hadoop footprints in the world. We will cover the current technology stack that Yahoo has built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base at Yahoo. Throughout the talk, we will highlight relevant use cases from Yahoo’s Mobile, Search, Advertising, Personalization, Media, and Communications businesses that may make these considerations more pertinent to your situation.
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
Since 2006, Hadoop and its ecosystem components have evolved into a platform that Yahoo has begun to trust for running its businesses globally. In this talk, we will take a broad look at some of the top software, hardware, and services considerations that have gone in to make the platform indispensable for nearly 1,000 active developers, including the challenges that come from scale, security and multi-tenancy. We will cover the current technology stack that we have built or assembled, infrastructure elements such as configurations, deployment models, and network, and what it takes to offer hosted Hadoop services to a large customer base.
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
In this talk, we look at YARN scheduler choices available today for Apache Hadoop 2 and discuss their pros and cons. We dive deeper into Capacity Scheduler by providing a comprehensive overview of its various settings with examples from real large-scale Hadoop clusters to promote a broader understanding of schedulers’ current state and best practices in place today when it comes to queue nomenclature, planning, allocations, and ongoing management. We present detailed cluster, queue, and job behaviors from several different capacity management philosophies.
We then propose practical solutions without any change to the scheduler or core Hadoop that allow managing queue creations and capacity allocations while optimizing for cluster utilization and maintaining SLA guarantees. A unified queue nomenclature, admission and capacity re-allocation policies across BUs, applications, and clusters make service automation possible. Transparency in resources consumed allows for defining realistic SLA expectations. Finally, consistent application tagging completes the feedback loop with SLAs observed through application-level reporting.
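One invariant such automation must enforce is that sibling queue capacities under each Capacity Scheduler parent sum to 100%; a minimal validator, with purely illustrative queue names:

```python
def validate_queues(tree, parent="root"):
    """Check the Capacity Scheduler invariant that sibling queue
    capacities under each parent sum to 100%. `tree` maps a parent
    queue path to {child: capacity_percent}."""
    errors = []
    children = tree.get(parent, {})
    if children and abs(sum(children.values()) - 100.0) > 1e-6:
        errors.append(f"{parent}: children sum to {sum(children.values())}")
    for child in children:
        errors.extend(validate_queues(tree, f"{parent}.{child}"))
    return errors

queues = {
    "root": {"bu1": 40.0, "bu2": 40.0, "adhoc": 20.0},
    "root.bu1": {"prod": 70.0, "dev": 30.0},
}
print(validate_queues(queues))  # [] -- allocations are consistent
```

Running a check like this before pushing scheduler config changes catches the most common queue-planning mistake before it takes a cluster's capacity model out of balance.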
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Sumeet Singh
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo's Hadoop clusters. A key component that enables this efficient operation is data compression.
With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpuses of data are presented.
The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on "Big Data" who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
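The ratio-versus-speed tension is easy to observe with Python's standard-library codecs (gzip, bzip2, and LZMA stand in here for the cluster codecs; Snappy and LZ4 require third-party bindings):

```python
import bz2
import gzip
import lzma
import time

# Repetitive sample payload, standing in for a typical log file.
data = (b"2016-09-27 INFO org.apache.hadoop.hdfs.DataNode heartbeat ok\n") * 20_000

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    out = compress(data)
    elapsed = time.perf_counter() - start
    # Higher ratio generally costs more CPU time -- the core tradeoff.
    print(f"{name:>5}: ratio {len(data) / len(out):7.1f}x in {elapsed * 1000:6.1f} ms")
```

Absolute numbers depend on the machine and payload, but on text-like data bzip2 and LZMA typically win on ratio while gzip (and, outside the stdlib, Snappy/LZ4) wins on throughput, which is exactly why Hadoop lets the codec be chosen per job and per file format.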
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
The Hadoop project is an integral part of Yahoo!'s cloud infrastructure and is at the heart of many of Yahoo!'s important business processes. Sumeet Singh, the Head of Products for Cloud Services and Hadoop at Yahoo!, explains how Yahoo! leverages Hadoop and Cloud Platforms to process and serve Internet-scale data.
Yahoo! operates one of the world's largest private cloud infrastructures. Learn how technologies scale out for building enterprise-wide trusted platforms with tight SLAs.
URL: http://www.saptechnologyservice.com/track1.html
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
Cloud-based architectures of Hadoop have made it attractive for public cloud service providers to offer hosted Hadoop services and charge customers on a pay-for-what-you-use basis. For enterprises that have already adopted Hadoop, the data infrastructure has long been seen as a cost element in their budgets. As a result, enterprises thinking of adopting Hadoop are increasingly debating between on-premise and cloud-based models for their data processing needs.
We lay out a set of criteria and methodical approaches to help enterprises that have not yet adopted Hadoop evaluate their options, and discuss the pros and cons of both models. For enterprises that have already made significant investments or have plans to build a Hadoop-based infrastructure, we present an approach to manage Hadoop as a Service with a P&L, transparency in costs, and metering & billing provisions.
As we discuss these approaches, we will share insights gathered from the exercise conducted on one of the largest Hadoop footprints in the world. We will illustrate how to organize cluster resources, compile data required and typical sources, develop TCO models tailored for individual situations, derive unit costs for usage, measure the resource usage for services, optimize for higher utilization, and benchmark costs.
URL: http://strataconf.com/stratany2013/public/schedule/detail/30824
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
Yahoo! has been using HBase for a long time in isolated instances, most notably for the personalization platform powering its homepage experiences. The introduction of multi-tenancy has lowered the barriers for all Hadoop users to use HBase. We will cover traditional use cases for HBase at Yahoo!, and the new use cases that have emerged as a result in content management, advertising, log processing, analytics and reporting, recommendation graphs, and dimension data stores.
We will then talk about the deployment strategy and enhancements made to facilitate multi-tenancy. Region Server groups provide a coarse level of isolation among tenants by designating a subset of region servers to serve designated tables, while Namespaces provide logical grouping of resources (region servers, tables) and privileges (quotas, ACLs).
We'll also share our experiences in operating HBase with security enabled and contributions made in this area, and results from performance runs conducted to validate customer expectations in a multi-tenant environment.
URL: http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--multi-tenant-apache-hbase-at-yahoo-video.html
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable self-serve, real-time, multitenant monitoring service at Yahoo
1. Lessons Learned Building A Scalable
Self-serve, Real-time, Multi-tenant
Monitoring Service
PRESENTED BY Mridul Jain, Sumeet Singh⎪ March 31, 2016
Strata Conference + Hadoop World 2016, San Jose
2. Introduction
2
§ Big ML at Yahoo
§ Has used Storm and Kafka for real-time trend
analysis in search and central monitoring
§ Co-authored Pig on Storm
§ Co-authored CaffeOnSpark for distributed deep
learning
Mridul Jain
Senior Principal Architect
Big Data and Machine Learning
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@mridul_jain
§ Manages Hadoop products team at Yahoo
§ Responsible for Product Management, Strategy and
Customer Engagements
§ Managed Cloud Services products team and headed
strategy functions for the Cloud Platform Group at
Yahoo
§ MBA from UCLA and MS from Rensselaer
Polytechnic Institute (RPI)
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Science and Technology
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
3. Acknowledgement
3
We want to acknowledge the contributions from Kapil Gupta and Arun Gupta,
Principal Architects with the Yahoo Monitoring team to this presentation as well
as the monitoring platform.
We would also like to thank the entire Yahoo Monitoring and Hadoop and
Big Data Platforms teams for making the next generation monitoring services
a reality at Yahoo.
4. Agenda
4
1 Overview
2 Transitioning from Classical to Real-time Big Data Architecture
3 Lessons Learned Scaling the Real-time Big Data Stack
4 Lessons Learned Optimizing for System Performance
5 Q&A
5. Introduction to Yahoo’s Monitoring as a Service
5
...
...
Infra Monitoring
CPU, disk, network
Host uptime
HTTP sess. errors
Hosts
Apps
App Monitoring
Req. per second
Avg. latency
API access errors
Hosted Multi-tenant
Monitoring
Service
Collection
Storage
Scheduling
Coordination
Alerts /
Thresholds
Dashboards
Aggregation
6. Classical Architecture – Pre Real-time Big Data Tech
6
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
7. Classical Architecture – Pre Real-time Big Data Tech
7
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
8. Classical Architecture – Pre Real-time Big Data Tech
8
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
9. Classical Architecture – Pre Real-time Big Data Tech
9
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
3 Massive Query Federation
10. Classical Architecture – Pre Real-time Big Data Tech
10
Hosts
200,000
Aggregators
60
DB Shards
2,400
Collectors
43
Frontend / Query
1 Large Fan-out
2 Manually Sharded DBs
3 Massive Query Federation
✗ Manageability Challenges
11. Classical Architecture – Pre Real-time Big Data Tech
11
H1
H2
H3
H4
H5
Collector Aggregator
Server
DB Server
Dashboard
12. Classical Architecture – Pre Real-time Big Data Tech
12
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
13. Classical Architecture – Pre Real-time Big Data Tech
13
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
14. Classical Architecture – Pre Real-time Big Data Tech
14
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
15. Classical Architecture – Pre Real-time Big Data Tech
15
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
4 Seq. fetch for federated queries
16. Classical Architecture – Pre Real-time Big Data Tech
16
H1
H2
H3
H4
H5
Collector
Dashboard
Aggregator
Server
DB Server
A A A
B B B
1 Manual partitioning of hosts
2 Single threaded agg. per cluster
Seq. processing of rules
4M DP/min per agg.
3 1 shard per cluster
1.5M DP/min
4 Seq. fetch for federated queries
✗ Scale Challenges ✗ Availability Challenges
17. Architecture Based on Real-time Big Data Tech
17
Hosts Collectors Data
Highway
UI
Dashboard
&
Graphs
18. Architecture Based on Real-time Big Data Tech
18
Hosts Collectors Data
Highway
UI
Dashboard
&
Graphs
No manual partitioning / sharding
Built-in horizontal scalability
Built-in High-availability
✔ Manageability
✔ Scalability
✔ Availability
Standard Big Data Frameworks
24. Lessons Learned
24
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
25. Lessons Learned
25
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
26. Storm + Kafka Based Architecture
26
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Product N
133 topics
Storm
Kafka
HTTP POST
27. Scale of an Online Monitoring Solution
27
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Product N
133 topics
Storm
Kafka
HTTP POST
§ 400 bolt tasks in 40
workers
TSDB_1
TSDB_2
TSDB_3
§ 450 topologies
§ 15 topics /topology
§ 3 partitions /topic
§ 3 TSDB topics
§ 222 partitions per
topic
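Taken together, the figures on this slide imply a very large partition count. A quick back-of-the-envelope check, using only the numbers above:

```python
# Back-of-the-envelope partition math from the slide's figures.
topologies = 450
topics_per_topology = 15
partitions_per_topic = 3
tsdb_topics = 3
partitions_per_tsdb_topic = 222

product_partitions = topologies * topics_per_topology * partitions_per_topic
tsdb_partitions = tsdb_topics * partitions_per_tsdb_topic
total = product_partitions + tsdb_partitions

print(product_partitions)  # 20250
print(tsdb_partitions)     # 666
print(total)               # 20916
```

Tens of thousands of partitions is the scale at which the Kafka metadata and consumer-coordination problems discussed later start to bite.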
29. A Producer - Consumer Pipeline
29
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
§ Excellent E2E Synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components to scale, if used smartly
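The "breather" a queue provides can be pictured with a minimal sketch (illustrative only, not the actual Storm/Kafka code): a bounded buffer keeps the producer making progress while the consumer stalls, and data is only lost once the buffer, the last line of defense, is exhausted.

```python
import queue

# A bounded queue between producer and consumer absorbs short consumer
# stalls: the producer keeps making progress until the buffer fills.
buf = queue.Queue(maxsize=1000)

def produce(events):
    dropped = 0
    for e in events:
        try:
            buf.put_nowait(e)   # non-blocking: fail fast when full
        except queue.Full:
            dropped += 1        # last line of defense exhausted
    return dropped

# While the consumer is stalled, up to maxsize events are buffered.
dropped = produce(range(1500))
print(buf.qsize(), dropped)  # 1000 500
```

Sizing that buffer (and what happens on overflow) is exactly the "choose wisely" decision.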
30. Monitoring Time Roll-ups
30
Topic in-mem state
Kafka Cluster
Spout Bolt
Storm
Topic in-mem state
Topic in-mem state
§ Huge in-memory state
§ 220 million/min * 60
§ Trident issues
§ High network → high CPU
31. Monitoring Time Roll-ups
31
Topic in-mem state
Kafka Cluster
Spout
Storm
Topic in-mem state
Topic in-mem state
§ Aggregate in Spout
§ 220 million/min * 60
§ Fields grouping in Kafka for a time series
Producer
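Aggregating in the spout works because most of the state collapses at the source: one value per metric and time bucket instead of every raw datapoint. A minimal sketch of the roll-up idea (function name and bucket size are illustrative):

```python
from collections import defaultdict

# Rolling up datapoints at the source (here: per metric, per minute)
# before emitting shrinks the downstream in-memory state: one value
# per (metric, minute) instead of every raw datapoint.
def rollup(datapoints):
    # datapoints: iterable of (metric, epoch_seconds, value)
    acc = defaultdict(float)
    for metric, ts, value in datapoints:
        acc[(metric, ts // 60)] += value   # sum within the minute bucket
    return dict(acc)

raw = [("cpu", 0, 1.0), ("cpu", 30, 2.0), ("cpu", 61, 4.0)]
print(rollup(raw))  # {('cpu', 0): 3.0, ('cpu', 1): 4.0}
```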
32. Kafka Refresh
32
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
33. Kafka Refresh
33
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
34. Kafka Refresh
34
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
35. Kafka Refresh
35
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
36. Kafka Refresh
36
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
§ Effectively hangs the producer
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
37. Kafka Refresh
37
Each of the brokers may have
different topics, but each of
them has metadata about
every other broker in the
cluster
§ A producer contacts any broker
to get the topic list across the
cluster every 10 mins
§ For each topic fetch call there is
a timeout of 10 secs which is a
blocking call on main
producer thread
§ If there are 100 topics and a
broker is down (socket timeout),
the refresh blocks for 1000s, longer
than the next refresh cycle (10 mins)
§ Effectively hangs the producer
Broker 2
Broker 3
Broker 1
topic 1
topic 2
topic 4
topic 5
topic 6
Kafka
topic 3
Disable refresh
If a broker is down, the producer APIs
get the metadata from an alternate
broker anyway
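The arithmetic behind the hang, as stated on the slides:

```python
# With a per-topic metadata fetch that can block for the full socket
# timeout when a broker is down, refreshes start to overlap.
topics = 100
fetch_timeout_s = 10
refresh_interval_s = 10 * 60   # refresh every 10 minutes

worst_case_refresh_s = topics * fetch_timeout_s
print(worst_case_refresh_s)                       # 1000
print(worst_case_refresh_s > refresh_interval_s)  # True: one refresh takes
# longer than the refresh interval, refreshes queue up behind each other,
# and the main producer thread is effectively hung.
```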
38. A Producer - Consumer Pipeline
38
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
§ Excellent E2E Synchronization
§ Provides a breather against individual component failures
§ Reasonably good performance in spite of transient failures
§ Can help individual components to scale, if used smartly
§ Queuing system is your last line of defense, choose wisely
39. Lessons Learned
39
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
40. Skewed Ingestion per Task
40
Spout
bolt
A1
bolt
A2
bolt
A3
bolt
B1
bolt
B2
22 M / min
High rate of ingestion with a “Group By” on limited dimensions will direct all
events for a specific dimension to one task
41. Skewed Ingestion per Task
41
Spout
bolt
A1
bolt
A2
bolt
A3
bolt
B1
bolt
B2
22 M / min
Overall state per task reduces because the combiners share the original big state and
aggregate it before forwarding to the final bolts, thus reducing their overall state
Each combiner maintains local state for
each of the dimensions and forwards
the aggregated count to B1 or B2
com 1
com 2
com 3
Shuffle Partition By
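The combiner pattern above can be sketched in a few lines (a simplified model, not Storm code; the round-robin shuffle and Counter-based partials are illustrative):

```python
from collections import Counter

# Two-stage aggregation: events are first shuffled evenly across
# combiners, each of which pre-aggregates locally; only the (much
# smaller) partial counts are then grouped by key at the final stage.
def combine_stage(events, n_combiners=3):
    combiners = [Counter() for _ in range(n_combiners)]
    for i, key in enumerate(events):        # shuffle: round-robin
        combiners[i % n_combiners][key] += 1
    return combiners

def final_stage(combiners):
    total = Counter()
    for partial in combiners:               # partition by key
        total.update(partial)
    return total

# A skewed stream: one hot key dominates.
events = ["hot"] * 9 + ["cold"]
partials = combine_stage(events)
print(final_stage(partials))  # Counter({'hot': 9, 'cold': 1})
# The final task receives at most 3 partial counts for "hot"
# instead of all 9 raw events.
```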
42. Abuse
42
§ Max ingestion per TSDB - 120k/s
§ UID table hit hard due to high cardinality data
§ Lots of in-memory states created in Storm bolts
43. Lessons Learned
43
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
44. ZooKeeper Scaling
44
Data
Highway
Data Ingest
Topology
Tenant 1
Tenant 2
Tenant 3
Tenant 1
Tenant 2
Tenant 3
Topics
Tenant 1
Tenant 2
Tenant 3
Aggregation
Topologies UI
Dashboard
&
Graphs
ZK - Storm
§ Kafka consumers swapping in and out create heavy churn in the ZK state for Kafka brokers
§ Every time a consumer enters/leaves, all consumers query the group state from ZK
§ Same for Kafka rolling upgrades, restarts, and any bad behaviour by consumers
ZK - Kafka
Single Cluster
for Agg.
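A simplified model of why this churn is so expensive: if every membership change makes all current members re-read the group state from ZK, then N consecutive changes (a rolling restart, for example) cost O(N²) reads. The function below is illustrative, not the actual rebalance protocol:

```python
# Simplified model of the old ZK-based consumer rebalance: every
# join/leave makes *all* current members re-read the group state,
# so N membership changes cost O(N^2) ZK reads.
def zk_reads_for_rolling_restart(n_consumers):
    reads = 0
    for joined in range(1, n_consumers + 1):
        reads += joined          # every current member re-reads state
    return reads

print(zk_reads_for_rolling_restart(10))   # 55
print(zk_reads_for_rolling_restart(100))  # 5050
```

This is why consumer churn, rolling upgrades, and misbehaving consumers all show up as ZK load first.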
48. Re-queue Pipeline – Solution for Write Stability
48
Data Queue
6 Hrs
Requeue queue
24 Hrs
Kafka
Kafka
consumer
TSDB Async HBase lib HBase
UID Lookups
UID table unavailable
No response
NSRE
§ Region splits & hotspots
§ NSREs & GCs
§ Region unresponsive
§ Region unavailability
§ Load rebalancing
§ Region queue size max-out
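The re-queue pattern can be sketched as follows (a toy model: in-process queues stand in for the Kafka topics, and the flaky write stands in for NSREs and region unavailability):

```python
import queue

# Writes that fail against the store are not retried inline on the hot
# path -- they go to a longer-retention "requeue" topic and are retried
# later, keeping the main write pipeline stable.
data_q = queue.Queue()     # stands in for the 6-hr data topic
requeue_q = queue.Queue()  # stands in for the 24-hr requeue topic

def write_batch(storage_write):
    written, requeued = 0, 0
    while not data_q.empty():
        dp = data_q.get()
        try:
            storage_write(dp)
            written += 1
        except IOError:
            requeue_q.put(dp)   # retry later, off the hot path
            requeued += 1
    return written, requeued

for dp in range(5):
    data_q.put(dp)

def flaky_write(dp):
    # Simulate the region being unavailable (NSREs, GCs) for some writes.
    if dp % 2:
        raise IOError("region unavailable")

result = write_batch(flaky_write)
print(result)             # (3, 2): 3 written, 2 parked for later retry
print(requeue_q.qsize())  # 2
```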
49. Lessons Learned
49
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
53. Auto Retries
53
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
Failed/success
Given the additional job of handling the
removed / expired entry
Timed-out RPCs
54. Auto Retries
54
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
retry
Failed/success
Timed-out RPCs
Given the additional job of handling the
removed / expired entry
Put it back in cache
55. Auto Retries
55
HBase
Guava Cache
(Inflight RPC queue)
Writer Thread Pool
Inserts
Netty Thread Pool
Callback
Given the additional job of
removing expired entry
retry
Failed/success
Stack Overflow!!
Timed-out RPCs
56. Auto Retries
56
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Overflow!!
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Response
✓ ✓
Timed-out RPCs
57. Auto Retries
57
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Response
No space in stack!!
Throws exception
✓ ✓
Timed-out RPCs
58. Auto Retries
58
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Lock
Response
No space in stack!!
Throws exception
✓ ✓
Timed-out RPCs
59. Auto Retries
59
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Unlock
Lock
Response
No space in stack!!
Throws exception
Lock
✓ ✓
Timed-out RPCs
60. Auto Retries
60
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Stack Unwind
Lock
Lock
Lock
Lock
Lock
Lock
Response
As the stack has unwound to some extent,
we get space to call Unlock now
Lock
Lock
✓ ✓
Timed-out RPCs
61. Auto Retries
61
HBase
Writer Thread Pool
Inserts
Netty Thread Pool
Hangup !!
Thread
dies
Lock
Response
Lock
Lock
§ Thread is dead
§ 3 locks remaining
§ No thread can write/insert as the cache is locked
§ Guava cache hung, TSDB hung!!
✓ ✓
Timed-out RPCs
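The root cause generalizes: retrying inside the removal callback recurses back into the same code path, while a flat retry loop (or handing the entry to a separate retry queue) keeps stack depth constant. A minimal illustration of the difference, not the AsyncHBase/Guava code itself:

```python
import sys

# The failure above in miniature: retrying *inside* the expiry/removal
# callback re-enters the same code path recursively, so a long run of
# failures grows the stack until it overflows (and an exception thrown
# mid-unlock leaves locks held, hanging the cache).
def retry_in_callback(attempts_left):
    if attempts_left == 0:
        return "ok"
    return retry_in_callback(attempts_left - 1)   # callback re-enters

# Draining a queue in a flat loop keeps the stack depth constant no
# matter how many retries pile up.
def retry_via_queue(attempts):
    pending = [attempts]
    while pending:
        left = pending.pop()
        if left > 0:
            pending.append(left - 1)   # re-queue instead of recursing
        else:
            return "ok"

sys.setrecursionlimit(1000)
try:
    retry_in_callback(10_000)
except RecursionError:
    print("stack overflow")            # the recursive retry blows up
print(retry_via_queue(10_000))         # ok -- constant stack depth
```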
62. Lessons Learned
62
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application
63. Broker 3
Broker 1
Storm and Kafka – Broker Slowness
63
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Storm
Kafka
HTTP POST
§ Bolt thread writes to the in-memory Kafka queue asynchronously
§ During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
TSDB_1
TSDB_2
§ 133 topologies
§ 15 topics per
topology
§ 3 partitions per
topic
§ 3 TSDB topics
§ 222 partitions per
topic
§ 22 Kafka brokers
§ If we have no spooling, we lose the data even if the broker recovers; with spooling, replay saves the day
Broker 2
Product2
Product 3
64. Broker 3
Broker 1
Storm and Kafka – Broker Slowness
64
Central
Collector
(no spooling)
Spout
with Jetty
Servlet Bolt
Product1
Product 2
Storm
Kafka
HTTP POST
§ Bolt thread writes to the in-memory Kafka queue asynchronously
§ During slowness of even one broker, if this queue fills up it blocks the producer bolt thread, which in turn back-pressures upstream
TSDB_1
TSDB_2
§ 133 topologies
§ 15 topics per
topology
§ 3 partitions per
topic
§ 3 TSDB topics
§ 222 partitions per
topic
§ 22 Kafka brokers
§ If we have no spooling, we lose the data even if the broker recovers; with spooling, replay saves the day
Broker 2
Product2
Product 3
✓ Better Monitoring
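A toy model of the spooling point above (in-process objects stand in for the slow broker's bounded send queue and the on-disk spool):

```python
import queue

# When one broker's bounded in-memory send queue fills, spool instead
# of blocking the bolt thread (which back-pressures everything) or
# silently dropping data.
broker_q = queue.Queue(maxsize=2)   # slow broker's async send queue
spool = []                          # stands in for on-disk spooling

def send(msg):
    try:
        broker_q.put_nowait(msg)    # fast path
    except queue.Full:
        spool.append(msg)           # without spooling, msg is lost --
                                    # even if the broker later recovers

for m in range(5):
    send(m)

print(broker_q.qsize(), len(spool))  # 2 3 -- the spool is replayed
                                     # once the broker recovers
```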
65. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Writes from
producer
Reads from consumer
Storm and Kafka – Broker Slowness
66. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
UNUSED
Contents
swapped to disk
67. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Maximize page
cache
UNUSED
68. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Contents swapped back
from disk
GC kicks in for swapped
out objects
69. Disk | JVM | OS Page Cache
Kafka Broker
§ broker code
§ read variables
§ filehandlers
§ writes from producers
§ metadata
§ partition information
§ Topic information
Storm and Kafka – Broker Slowness
Contents swapped back
from disk
GC kicks in for swapped
out objects
Writes
High RPS pipeline will see heavy backpressure
and data will get dropped
vm.swappiness
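Since vm.swappiness decides whether the broker JVM's cold pages get swapped out under page-cache pressure, it is worth checking as part of monitoring. A tiny checker over `sysctl vm.swappiness` output (parsing a sample string rather than probing a live host; the threshold is illustrative):

```python
# Kafka deployments typically pin vm.swappiness near zero so the JVM
# heap is not paged out in favor of the page cache.
def swappiness_ok(sysctl_line, threshold=1):
    # e.g. "vm.swappiness = 60"
    value = int(sysctl_line.split("=")[1])
    return value <= threshold

print(swappiness_ok("vm.swappiness = 60"))  # False: broker heap may be
                                            # swapped out under pressure
print(swappiness_ok("vm.swappiness = 1"))   # True
```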
70. Lessons Learned
70
1 Producer-consumer problem at scale requires the right balance in architecture
2 Skewness in data is hard to debug
3 E2E multi-tenancy and resourcing should be handled strategically
4 Optimizations made in async systems are hard to debug
5 Do not neglect the assumptions/optimizations outside your application