Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop

HADOOP EAGLE
Full-stack realtime monitoring framework for eBay hadoop
Edward Zhang yonzhang@ebay.com , @yonzhang2012
Hao Chen hchen9@ebay.com, @ihaoch

Use case: Detect node anomaly by analyzing task failure ratio across all nodes
Assumption : task failure ratio for every node should be approximately equal
Algorithm : node by node compare (symmetry violation) and per node trend
HADOOP EAGLE – EBAY INC 2
HADOOP EAGLE
Background – initial use cases

3
Host: Task failure based anomaly host detection
HADOOP EAGLE – EBAY INC
HADOOP EAGLE
Anomaly Detection & Alerting Analysis Auto-Remediation

4
Scale Challenges @ eBay Hadoop Monitoring
HADOOP EAGLE
• 10+ large Hadoop clusters
• 10,000+ data nodes
• 50,000+ jobs per day
• 50,000,000+ tasks per day
• 500+ types of Hadoop/Hbase native metrics
• Billions of audit events, metrics per day

5
Use cases challenges @ eBay Hadoop Monitoring
HADOOP EAGLE
• Host
• Task failure ratio based machine anomaly detection
• Job monitoring across its lifetime
• Real-time running job performance analysis
• Near real-time job history analytics
• Data skew detection
• Hadoop native metrics
• Hdfs
• Hbase
• M/R
• Logs
• GC log
• Hadoop daemon log
• Audit log
• HDFS image file
• Yarn Framework
• Queue

HADOOP EAGLE
Engineering Challenges @ eBay Hadoop Monitoring
• Varieties of data sources
M/R history job, running, GC log, namenode log, hadoop native metrics, YARN
queue, audit log, hdfs image file etc.
• Varieties of data collectors
pull form hdfs, pull YARN API, ship logs, …
• Complex business logic
join outside data, pre-aggregations, memory window …
• Alert rules can’t be hot deployed
• Scalability issue with single process

7
Job History Performance Analyzer
HADOOP EAGLE
• Monitor job history files in near real-time
• Crawl job history files immediately after it is completed
• Apply expertise rules for job performance suggestions
• Job history trend for the same type of job
Job
Start
Event
Task
Start
Event
Task
End
Event
Task
roll-up
Task2
Start
Event
Task2
End
Event
Task
roll-
up
Job
End
Event
Job
Suggestion
Rules

8
Job real-time monitoring
HADOOP EAGLE
• Monitoring running job in real time
• Minute-level job progress snapshots
• Minute-level resource usage
snapshots
• CPU, HDFS I/O, Disk I/O, slot
seconds
• Roll up to user/queue/cluster level
• Slide window based alert

9
Service: GC Log / Server Log
HADOOP EAGLE
• GC event detection and prediction
• Log metrics statistics
• Real-time log anomaly detection

Why Eagle Monitoring Framework
HADOOP EAGLE

11
• Data collector -> data processing -> metric pre-agg/alert engine -> storage -> dashboards
• We need create framework to cover full stack in monitoring system
Programming Paradigm and Abstraction
HADOOP EAGLE

12
As a framework, Eagle does not
assume :
• Data source (where, what)
• Business logic execution path (how)
• Policy engine implementation (how)
• Data sink (where, what)
Eagle as a Framework
As a framework, Eagle does the
following:
• SQL-like service API
• High-performing query framework
• Lightweight streaming process java API
• Extensible policy engine implementation
• Scalable and distributed rule evaluation
• Native HBase data storage support
• Metadata driven stream processing
• Data source extensibility
• Data sink extensibility
• Interactive dashboard
HADOOP EAGLE

Eagle Overall Architecture
13HADOOP EAGLE – EBAY INC
HADOOP EAGLE

Eagle Monitoring Framework Internals
• Lightweight Streaming Process Framework
• Extensible & Scalable Policy Framework for Alert
• Eagle Query Framework
• Interactive Dashboards
HADOOP EAGLE

15
Facts
• Computation is based on single
event which constitutes endless
continuous stream
• Computation can be
aggregation, time-window,
length-window or join outside
data etc.
• Filter design pattern is used for
modularizing code at the
beginning
Lightweight Streaming Process Framework
HADOOP EAGLE
Abstraction
 Inspired by cascading framework, we
abstract a light-weight streaming
programing API which is independent of
execution environment
 Streaming process is directed acyclic
graph
 This layer of indirection is for code
modularization, code reuse and prevention
of coupling with specific execution
environment
 Runs on single process, Storm or other
streaming technology like Spark

16
Step 1: Task DAG graph setup
Eagle Stream Data Processing API
HADOOP EAGLE
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
Task uppertask = Task.newTask("uppercase").setExecutor(new
UppercaseExecutor()).connectFrom(header).completeBuild();
Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new
GroupbyCountExecutor()).connectFrom(uppertask).completeBuild();
def.endBy(groupbyUppercaseTask);
}
Step 2: Inter-task data exchange protocol
@Override
protected void buildDependency(FlowDef def, DataProcessConfig config) {
Task header = Task.newTask("wordgenerator").setExecutor(source).completeBuild();
Task uppertask = Task.newTask("uppercase").setExecutor(new
UppercaseExecutor()).connectFrom(header).completeBuild();
Task groupbyUppercaseTask = Task.newTask("groupby_uppercase").setExecutor(new
GroupbyCountExecutor()).connectFrom(uppertask).completeBuild();
def.endBy(groupbyUppercaseTask);
}

17
Execution Graph development, compile and deploy
Development / Compile Phase Deployment / Runtime Phase
HADOOP EAGLE

HADOOP EAGLE

19
Extensible & Scalable Policy framework
HADOOP EAGLE
Scalability
• Dynamic policy partitioning across compute nodes based on configurable partition class
• Dynamic policy deployment
• Event partitioning by storm and policy partitioning by Eagle (N events * M policies)
Extensibility
• Support new policy evaluation engine, for example Siddhi, Esper, Machine learning etc.
Features
• Policy CRUD
• Stream metadata (event attribute name, attribute type, attribute value resolver, …)

20
Dynamic Policy Partitioning
HADOOP EAGLE

21
Scalability of Policy Evaluation
HADOOP EAGLE

22
Extensibility of policy framework
HADOOP EAGLE
public interface PolicyEvaluatorServiceProvider {
public String getPolicyType();
public Class<? extends PolicyEvaluator> getPolicyEvaluator();
public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
public List<Module> getBindingModules();
}
Policy Evaluator Provider use SPI to register policy engine implementations

HADOOP EAGLE

Eagle Query Framework
HADOOP EAGLE
Persistence
• Metric
• Event
• Metadata
• Alert
• Log
• Customized
Structure
• …
Query
• Search
• Filter
• Aggregation
• Sort
• Expression
• ….
Features
• Simple API
• Powerful query
• High performance
• Scalability
• Pluggable
• …
The light-weight metadata-driven store layer to serve
commonly shared storage & query requirements of most monitoring system

HADOOP EAGLE

HADOOP EAGLE
• Metadata definition ORM
• High performance RESTful API supporting CRUD
• SQL-like declarative query syntax
• Generic service client library
• Native support HBase and RDBMS
• Interactive and customizable dashboard

27
• Annotations are metadata to entity
• Metadata driven query compiling and
response rendering
• Metadata driven ser/deser
• Rename column to shorter string(hbase)
• Entity metadata primitives
• Table
• ColumnFamily
• Prefix(the very first partition key)
• Service(entity identifier)
• Partition
• Tags
• Indexes
• Column
Metadata definition ORM
HADOOP EAGLE
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@TimeSeries(false)
@Partition({"cluster", "datacenter"})
@Tags({"programId", "alertExecutorId", "policyId", "policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" })
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
@Column("d")
private String notificationDef;
@Column("e")
private String remediationDef;
@Column("f")
private boolean enabled;

28
Generic RESTful API & Query
HADOOP EAGLE
::=
<EntityName>
“[" <FilterCondition> "]"
"<" <GroupbyFields> ">"
"{" <AggregatedFunctions> "}” [ "." "{" <SortbyOptions> "}" ]
eagle-service/rest/entities?query=

29
Generic RESTful API Query Syntax
HADOOP EAGLE
query=JobExecutionService[@cluster=“xyz” AND @datacenter=“abc”]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100
Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{ AggregatedFunctions}.{SortbyOptions}
query=JobExecutionService[@cluster=“xyz” AND @datacenter=“abc”]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100
query=TaskFailureCountService[@cluster=“xyz” AND @datacenter=“abc” AND
@failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100
CONTAINS, IN, !=, =, <, <=, >, >=
query=TaskFailureCountService[@cluster=“xyz” AND @datacenter=“abc” AND
@failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….
Search Query
Aggregate Query
TimeSeries Histogram Query
query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc} &timeSeries=true&intervmin=1440
&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster
Operators
Numeric Filters
Paginations

30
Generic Eagle Service Client Library
HADOOP EAGLE
• Basic CRUD
• Fluent DSL
• Metric Builder API
• Parallel Client
• Asynchronous Client
client.metric("unit.test.metrics")
.batch(5)
.tags(tags)
.send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
.send(System.currentTimeMillis(), 0.1)
.send(System.currentTimeMillis(),0.1,0.2)
.send(System.currentTimeMillis(),tags,0.1,0.2,0.3)
.send("unit.test.anothermetrics",System.currentTimeMillis(),tags,0.1,0.2,0.3)
.flush();
client.search("GenericMetricService[@cluster="cluster4ut" AND @datacenter =
"datacenter4ut"]<@cluster>{sum(value)}")
.startTime(0)
.endTime(System.currentTimeMillis()+24 * 3600 * 1000)
.metricName("unit.test.metrics")
.pageSize(1000)
.send();

com.ebay.eagle.coprocessor.AggregateProtocol
32
HBase Coprocessor
HADOOP EAGLE
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
20000
avg count max min sum
nocoprocesso in single
region
coprocessor in single
region
estimated in cluster

33
• Uniform HBASE row-key design for all types of monitoring data sources
• Logically partition data by tags which is defined in annotation @Partition({“cluster”,
“datacenter”})
• Physically shard data by HBASE native feature: rowkey range and region mapping
• Write throughput optimized by using HBASE multi-put
• Co-processor to maximize query performance
• Push evaluation of numeric filters down to HBase
• Secondary index support
• Inspection of RESTful resources and entity metadata
• Numeric filters
• Expression evaluation in output fields
• Rowkey inspection
Tuning for HBase Storage
HADOOP EAGLE

HADOOP EAGLE

35
• Interactive: IPython notebook-
like interactive visualization
analysis and troubleshooting.
• Dashboard: Customizable
dashboard layout and drill-down
path, persist and share.
Generic Dashboard Analytics for Eagle Store
HADOOP EAGLE

36
Open Source Soon …
HADOOP EAGLE
• First use case: Eagle to secure
Hadoop platform based on Eagle
framework
• Work closely with Hortonworks,
Dataguise, …
• Share with community and get
community’s support
• Continue to open source job
monitoring, GC monitoring etc.

37
Q & A
HADOOP EAGLE

Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop

Similar to Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop (20)

More from DataWorks Summit

More from DataWorks Summit (20)

Recently uploaded

Recently uploaded (20)

Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop

Editor's Notes