Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop

Hadoop Summit 2015

Published in: Technology

  1. 1. HADOOP EAGLE: Full-stack real-time monitoring framework for eBay Hadoop
     Edward Zhang, yonzhang@ebay.com, @yonzhang2012
     Hao Chen, hchen9@ebay.com, @ihaoch
  2. 2. Background – initial use cases
     • Use case: Detect node anomalies by analyzing the task failure ratio across all nodes
     • Assumption: The task failure ratio of every node should be approximately equal
     • Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
     HADOOP EAGLE – EBAY INC
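
The node-by-node comparison on this slide can be illustrated with a minimal Java sketch. It covers only the symmetry check, not the per-node trend. The class name, the `TaskStats` shape, the use of the cluster median as the baseline, and the `maxDeviation` threshold are all assumptions made for illustration; the slides do not say how Eagle measures deviation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NodeAnomalyDetector {

    /** Failed and total task counts for one node (hypothetical record shape). */
    static final class TaskStats {
        final long failed;
        final long total;

        TaskStats(long failed, long total) {
            this.failed = failed;
            this.total = total;
        }

        /** Failure ratio in [0, 1]; a node with no tasks counts as 0. */
        double failureRatio() {
            return total == 0 ? 0.0 : (double) failed / total;
        }
    }

    /** Median of the given values; the input array is not modified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /**
     * Returns the hosts whose failure ratio exceeds the cluster median by more
     * than maxDeviation (an assumed, tunable threshold), sorted by host name.
     */
    static List<String> findAnomalousNodes(Map<String, TaskStats> statsByHost, double maxDeviation) {
        List<String> anomalous = new ArrayList<>();
        if (statsByHost.isEmpty()) {
            return anomalous;
        }
        double[] ratios = statsByHost.values().stream()
                .mapToDouble(TaskStats::failureRatio)
                .toArray();
        double baseline = median(ratios);
        // TreeMap gives a deterministic, host-name-ordered result.
        for (Map.Entry<String, TaskStats> e : new TreeMap<>(statsByHost).entrySet()) {
            if (e.getValue().failureRatio() - baseline > maxDeviation) {
                anomalous.add(e.getKey());
            }
        }
        return anomalous;
    }
}
```

The median is used here instead of the mean so that one badly failing node does not shift the baseline it is compared against.
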
  3. 3. Host: Task-failure-based anomalous host detection
     • Anomaly Detection & Alerting
     • Analysis
     • Auto-Remediation
  4. 4. Scale Challenges @ eBay Hadoop Monitoring
     • 10+ large Hadoop clusters
     • 10,000+ data nodes
     • 50,000+ jobs per day
     • 50,000,000+ tasks per day
     • 500+ types of Hadoop/HBase native metrics
     • Billions of audit events and metrics per day
  5. 5. Use Case Challenges @ eBay Hadoop Monitoring
     • Host
       • Task-failure-ratio-based machine anomaly detection
     • Job monitoring across its lifetime
       • Real-time running job performance analysis
       • Near real-time job history analytics
       • Data skew detection
     • Hadoop native metrics
       • HDFS
       • HBase
       • M/R
     • Logs
       • GC log
       • Hadoop daemon log
       • Audit log
     • HDFS image file
     • YARN framework
       • Queue
  6. 6. Engineering Challenges @ eBay Hadoop Monitoring
     • Variety of data sources: M/R job history, running jobs, GC log, NameNode log, Hadoop native metrics, YARN queue, audit log, HDFS image file, etc.
     • Variety of data collectors: pull from HDFS, pull from the YARN API, ship logs, …
     • Complex business logic: joins with outside data, pre-aggregations, in-memory windows, …
     • Alert rules can't be hot deployed
     • Scalability issues with a single process
  7. 7. Job History Performance Analyzer
     • Monitor job history files in near real time
     • Crawl each job history file immediately after the job completes
     • Apply expert rules to produce job performance suggestions
     • Track the job history trend for the same type of job
     [Diagram labels: Job Start Event, Task Start Event, Task End Event, Task roll-up, Task2 Start Event, Task2 End Event, Task roll-up, Job End Event, Job Suggestion Rules]
  8. 8. Job Real-time Monitoring
     • Monitor running jobs in real time
     • Minute-level job progress snapshots
     • Minute-level resource usage snapshots
       • CPU, HDFS I/O, disk I/O, slot seconds
     • Roll up to user/queue/cluster level
     • Sliding-window-based alerts
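
A sliding-window alert over minute-level snapshots could look roughly like the Java sketch below. It is an illustration only: the class name, the fixed-count window, and the "average above threshold" condition are assumptions, since the slide does not describe the actual alert rules.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class SlidingWindowAlert {
    private final int windowSize;     // number of most recent samples kept
    private final double threshold;   // alert when the window average exceeds this
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowAlert(int windowSize, double threshold) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.windowSize = windowSize;
        this.threshold = threshold;
    }

    /**
     * Adds one sample (for example, a minute-level CPU snapshot) and returns
     * true when the window is full and its average exceeds the threshold.
     */
    boolean offer(double sample) {
        window.addLast(sample);
        sum += sample;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // evict the oldest sample
        }
        return window.size() == windowSize && sum / windowSize > threshold;
    }
}
```

Requiring a full window before alerting avoids firing on the first one or two samples of a job that has just started.
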
  9. 9. Service: GC Log / Server Log
     • GC event detection and prediction
     • Log metrics statistics
     • Real-time log anomaly detection
  10. 10. Why Eagle Monitoring Framework
  11. 11. Programming Paradigm and Abstraction
     • Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards
     • We need to create a framework that covers the full stack of a monitoring system
  12. 12. Eagle as a Framework
     As a framework, Eagle does not assume:
     • Data source (where, what)
     • Business logic execution path (how)
     • Policy engine implementation (how)
     • Data sink (where, what)
     As a framework, Eagle provides the following:
     • SQL-like service API
     • High-performance query framework
     • Lightweight streaming process Java API
     • Extensible policy engine implementation
     • Scalable and distributed rule evaluation
     • Native HBase data storage support
     • Metadata-driven stream processing
     • Data source extensibility
     • Data sink extensibility
     • Interactive dashboard
  13. 13. Eagle Overall Architecture
  14. 14. Eagle Monitoring Framework Internals
     • Lightweight Streaming Process Framework
     • Extensible & Scalable Policy Framework for Alert
     • Eagle Query Framework
     • Interactive Dashboards
  15. 15. Lightweight Streaming Process Framework
     Facts
     • Computation is based on single events, which together form an endless continuous stream
     • Computation can be aggregation, time-window, length-window, joins with outside data, etc.
     • The filter design pattern was used at the beginning to modularize code
     Abstraction
     • Inspired by the Cascading framework, we abstracted a lightweight streaming programming API that is independent of the execution environment
     • A streaming process is a directed acyclic graph
     • This layer of indirection enables code modularization and code reuse, and prevents coupling with a specific execution environment
     • Runs in a single process, on Storm, or on other streaming technology such as Spark
  16. 16. Eagle Stream Data Processing API
     Step 1: Task DAG graph setup

        @Override
        protected void buildDependency(FlowDef def, DataProcessConfig config) {
            // Source task that emits words
            Task header = Task.newTask("wordgenerator")
                    .setExecutor(source)
                    .completeBuild();
            // Upper-cases each word received from the source task
            Task uppertask = Task.newTask("uppercase")
                    .setExecutor(new UppercaseExecutor())
                    .connectFrom(header)
                    .completeBuild();
            // Groups the upper-cased words and counts them
            Task groupbyUppercaseTask = Task.newTask("groupby_uppercase")
                    .setExecutor(new GroupbyCountExecutor())
                    .connectFrom(uppertask)
                    .completeBuild();
            // Mark the last task of the flow
            def.endBy(groupbyUppercaseTask);
        }

     Step 2: Inter-task data exchange protocol
  17. 17. Execution Graph: Development, Compile, and Deploy
     • Development / Compile Phase
     • Deployment / Runtime Phase
  18. 18. Eagle Monitoring Framework Internals
     • Lightweight Streaming Process Framework
     • Extensible & Scalable Policy Framework for Alert
     • Eagle Query Framework
     • Interactive Dashboards
  19. 19. Extensible & Scalable Policy Framework
     Scalability
     • Dynamic policy partitioning across compute nodes, based on a configurable partition class
     • Dynamic policy deployment
     • Event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
     Extensibility
     • Support for new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
     Features
     • Policy CRUD
     • Stream metadata (event attribute name, attribute type, attribute value resolver, …)
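
Policy partitioning through a pluggable partition class might be sketched as follows. This is a hedged illustration, not Eagle's implementation: `PolicyPartitioner`, `HASH`, and `policiesFor` are hypothetical names, and hashing the policy id is just one plausible default strategy.

```java
import java.util.ArrayList;
import java.util.List;

final class PolicyPartitionSketch {

    /** Hypothetical stand-in for the "configurable partition class" on the slide. */
    interface PolicyPartitioner {
        /** Returns a partition index in [0, numPartitions). */
        int partition(String policyId, int numPartitions);
    }

    /** Assumed default: non-negative hash of the policy id modulo the partition count. */
    static final PolicyPartitioner HASH =
            (policyId, numPartitions) -> Math.floorMod(policyId.hashCode(), numPartitions);

    /** Returns the policies that the evaluator at partitionIndex is responsible for. */
    static List<String> policiesFor(List<String> policyIds, int partitionIndex,
                                    int numPartitions, PolicyPartitioner partitioner) {
        List<String> owned = new ArrayList<>();
        for (String id : policyIds) {
            if (partitioner.partition(id, numPartitions) == partitionIndex) {
                owned.add(id);
            }
        }
        return owned;
    }
}
```

With events partitioned by the stream engine and policies partitioned this way, each evaluator sees only its share of the N events and evaluates only its share of the M policies.
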
  20. 20. Dynamic Policy Partitioning
  21. 21. Scalability of Policy Evaluation
  22. 22. Extensibility of Policy Framework
     Policy evaluator providers use SPI to register policy engine implementations.

        public interface PolicyEvaluatorServiceProvider {
            // Identifies the policy engine type this provider registers
            public String getPolicyType();
            public Class<? extends PolicyEvaluator> getPolicyEvaluator();
            public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
            public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
            public List<Module> getBindingModules();
        }
  23. 23. Eagle Monitoring Framework Internals
     • Lightweight Streaming Process Framework
     • Extensible & Scalable Policy Framework for Alert
     • Eagle Query Framework
     • Interactive Dashboards
  24. 24. Eagle Query Framework
     A lightweight, metadata-driven store layer that serves the storage and query requirements shared by most monitoring systems.
     • Persistence: Metric, Event, Metadata, Alert, Log, Customized Structure, …
     • Query: Search, Filter, Aggregation, Sort, Expression, …
     • Features: Simple API, Powerful query, High performance, Scalability, Pluggable, …
  25. 25. Eagle Query Framework
  26. 26. Eagle Query Framework
     • Metadata definition ORM
     • High-performance RESTful API supporting CRUD
     • SQL-like declarative query syntax
     • Generic service client library
     • Native support for HBase and RDBMS
     • Interactive and customizable dashboard
  27. 27. Metadata Definition ORM
     • Annotations are the metadata of an entity
     • Metadata-driven query compilation and response rendering
     • Metadata-driven serialization/deserialization
     • Columns are renamed to shorter strings (HBase)
     • Entity metadata primitives
       • Table
       • ColumnFamily
       • Prefix (the very first partition key)
       • Service (entity identifier)
       • Partition
       • Tags
       • Indexes
       • Column

        @Table("alertdef")
        @ColumnFamily("f")
        @Prefix("alertdef")
        @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
        @TimeSeries(false)
        @Partition({"cluster", "datacenter"})
        @Tags({"programId", "alertExecutorId", "policyId", "policyType"})
        @Indexes({
            @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" })
        })
        public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
            // Each field maps to a short HBase column qualifier
            @Column("a")
            private String desc;
            @Column("b")
            private String policyDef;
            @Column("c")
            private String dedupeDef;
            @Column("d")
            private String notificationDef;
            @Column("e")
            private String remediationDef;
            @Column("f")
            private boolean enabled;
  28. 28. Generic RESTful API & Query
     eagle-service/rest/entities?query=<Query>
     <Query> ::= <EntityName> "[" <FilterCondition> "]" "<" <GroupbyFields> ">" "{" <AggregatedFunctions> "}" [ "." "{" <SortbyOptions> "}" ]
  29. 29. Generic RESTful API Query Syntax
     • Search query
       query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]{@startTime,@numTotalMaps}&startTime=&endTime=&pageSize=100
     • Aggregation query
       Aggregation Query ::= <EntityName> [QueryCondition]<GroupbyFields>{AggregatedFunctions}.{SortbyOptions}
       query=JobExecutionService[@cluster="xyz" AND @datacenter="abc"]<@user>{count, min(endTime-startTime)}&startTime=&endTime=&pageSize=100
     • Numeric filters
       query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100
     • Operators
       CONTAINS, IN, !=, =, <, <=, >, >=
     • Pagination
       query=TaskFailureCountService[@cluster="xyz" AND @datacenter="abc" AND @failureCount>10]{@startTime,@failureCount}&startTime=&endTime=&pageSize=100&startRowkey=BgVz-9R…….
     • Time-series histogram query
       query=GenericMetricService[@cluster="ares" AND @datacenter="lvs"]<@user>{sum(value)}.{sum(value) desc}&timeSeries=true&intervmin=1440&pageSize=10000000&startTime=2014-07-01 00:00:00&endTime=2014-08-01 00:00:00&metricName=eagle.hdfs.spacesize.cluster
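
To make the query grammar concrete, here is a small Java sketch that assembles a query string and appends it to the REST endpoint. The class and method names are hypothetical, the host in the usage note is a placeholder, and URL-encoding the `query` value is an assumption about how a client would send it; only the bracket layout comes from the slides.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class EagleQuerySketch {

    /**
     * Assembles EntityName[filter]<groupBy>{functions}.{sort}.
     * groupBy and sort are optional and may be null.
     */
    static String build(String entity, String filter, String groupBy,
                        String functions, String sort) {
        StringBuilder sb = new StringBuilder(entity);
        sb.append('[').append(filter).append(']');
        if (groupBy != null) {
            sb.append('<').append(groupBy).append('>');
        }
        sb.append('{').append(functions).append('}');
        if (sort != null) {
            sb.append(".{").append(sort).append('}');
        }
        return sb.toString();
    }

    /** Appends the URL-encoded query and a page size to the given endpoint URL. */
    static String toUrl(String endpoint, String query, int pageSize) {
        try {
            return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8")
                    + "&pageSize=" + pageSize;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `build("JobExecutionService", "@cluster=\"xyz\" AND @datacenter=\"abc\"", "@user", "count, min(endTime-startTime)", null)` reproduces the query part of the aggregation example on the slide, and `toUrl("http://localhost/eagle-service/rest/entities", query, 100)` turns it into a request URL against a placeholder host.
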
  30. 30. Generic Eagle Service Client Library
     • Basic CRUD
     • Fluent DSL
     • Metric Builder API
     • Parallel Client
     • Asynchronous Client

        // Metric builder: batch metric points and flush them to the service
        client.metric("unit.test.metrics")
              .batch(5)
              .tags(tags)
              .send("unit.test.metrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
              .send(System.currentTimeMillis(), 0.1)
              .send(System.currentTimeMillis(), 0.1, 0.2)
              .send(System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
              .send("unit.test.anothermetrics", System.currentTimeMillis(), tags, 0.1, 0.2, 0.3)
              .flush();

        // Fluent search: inner double quotes must be escaped inside the Java string
        client.search("GenericMetricService[@cluster=\"cluster4ut\" AND @datacenter = \"datacenter4ut\"]<@cluster>{sum(value)}")
              .startTime(0)
              .endTime(System.currentTimeMillis() + 24 * 3600 * 1000)
              .metricName("unit.test.metrics")
              .pageSize(1000)
              .send();
  31. 31. HBase Storage Design
     Uniform rowkey design
       Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
     • Metric
       Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
     • Entity
       Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
     • Log
       Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
       Rowvalue ::= Log Content
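
The uniform rowkey layout can be illustrated with a short Java sketch that concatenates the components in the order shown on the slide. The byte encoding is an assumption made for illustration (a 4-byte `String.hashCode()` per string component and an 8-byte big-endian timestamp); the slides do not state how Eagle encodes each field, and `RowkeySketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

final class RowkeySketch {

    /**
     * Builds Prefix | Partition Keys | timestamp | tagName | tagValue | ...
     * Assumed encoding: 4-byte hash per string, 8-byte big-endian timestamp.
     */
    static byte[] rowkey(String prefix, List<String> partitionValues,
                         long timestamp, SortedMap<String, String> tags) {
        int size = 4 + 4 * partitionValues.size() + 8 + 8 * tags.size();
        ByteBuffer buf = ByteBuffer.allocate(size); // big-endian by default
        buf.putInt(prefix.hashCode());
        for (String value : partitionValues) {
            buf.putInt(value.hashCode());
        }
        buf.putLong(timestamp);
        // A sorted map keeps the tag order stable, so equal tag sets yield equal keys.
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            buf.putInt(tag.getKey().hashCode());
            buf.putInt(tag.getValue().hashCode());
        }
        return buf.array();
    }
}
```

Because the prefix and partition keys come first, rows for the same entity type and partition share a common key prefix and sit next to each other in HBase's sorted key space, and with a big-endian encoding of non-negative timestamps they are further ordered by time within that range.
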
  32. 32. HBase Coprocessor
     com.ebay.eagle.coprocessor.AggregateProtocol
     [Chart: avg, count, max, min, and sum compared across three series: no coprocessor in a single region, coprocessor in a single region, and estimated in cluster; axis from 0 to 20,000]
  33. 33. Tuning for HBase Storage
     • Uniform HBase rowkey design for all types of monitoring data sources
     • Data is logically partitioned by tags, which are defined in the annotation @Partition({"cluster", "datacenter"})
     • Data is physically sharded by a native HBase feature: rowkey range and region mapping
     • Write throughput is optimized by using HBase multi-put
     • Coprocessor to maximize query performance
     • Evaluation of numeric filters is pushed down to HBase
     • Secondary index support
     • Inspection of RESTful resources and entity metadata
     • Numeric filters
     • Expression evaluation in output fields
     • Rowkey inspection
  34. 34. Eagle Monitoring Framework Internals
     • Lightweight Streaming Process Framework
     • Extensible & Scalable Policy Framework for Alert
     • Eagle Query Framework
     • Interactive Dashboards
  35. 35. Generic Dashboard Analytics for Eagle Store
     • Interactive: IPython notebook-like interactive visualization, analysis, and troubleshooting
     • Dashboard: Customizable dashboard layout and drill-down path that can be persisted and shared
  36. 36. Open Source Soon …
     • First use case: Eagle to secure the Hadoop platform, built on the Eagle framework
     • Work closely with Hortonworks, Dataguise, …
     • Share with the community and get the community's support
     • Continue to open source job monitoring, GC monitoring, etc.
  37. 37. Q & A