Eagle from eBay at China Hadoop Summit 2015

Eagle from eBay at China Hadoop Summit 2015 Shanghai.

http://www.chinahadoop.com/english.html

Published in: Technology

  1. HADOOP: Full-stack real-time monitoring framework for eBay Hadoop. Hao Chen | 陈浩, eBay Cloud Service
  2. $ whoami
     Hao Chen | 陈浩, Software Engineer, Analytics Data Infrastructure, Cloud Services, eBay Inc.
     hchen9@ebay.com, linkedin.com/in/haozch, twitter.com/haozch, weibo.com/haochencn
  3. eBay’s Challenges in Monitoring
     • Large Scale in Real Time: 10+ large Hadoop clusters; 10,000+ nodes; 50,000+ jobs per day; 50,000,000+ tasks per day; 500+ types of Hadoop/HBase metrics; billions of audit events per day
     • Various Business Logic: Hadoop HBase Spark Data Security Hardware Cloud Database
     • Complex and Scalable Policy: join multiple data sources; threshold-based and window-based; multiple-metric correlation; metric pre-aggregation; machine-learning-based
     • Engineering Modularization: varieties of data sources; varieties of data collectors; complex business logic; alert rules can’t be hot deployed; scalability issues with a single process
  4. What’s Eagle: a uniform monitoring and alerting framework for monitoring large-scale distributed systems such as Hadoop, Spark, and cloud in real time. Eagle = Eagle Framework + Eagle Apps
  5. Eagle Ecosystem
     • Eagle Framework: provides a full-stack monitoring framework for efficiently developing highly scalable real-time monitoring applications.
     • Eagle Apps (DAM, JPA, HBase, Spark): provide built-in monitoring applications for domains like Hadoop, Spark, HBase, Storm, and cloud.
     • Eagle Integration (Kafka, Storm, HBase, Druid, Elastic Search): integrates with distributed real-time execution environments like Storm, message buses like Kafka, and storage layers like HBase, and also supports extensions.
     • Eagle Interface (Web Portal, REST Services, Ambari Plugin): allows accessing or managing Eagle through a REST service, web UI, or Ambari plugin.
  6. Eagle App Highlights: JPA (Job Performance Analyzer); DAM (Security Data Activity Monitoring)
  7. JPA: Job Performance Analyzer. Monitor and analyze job performance in real time.
     • Historical job analysis
     • Running job analysis
     • Anomaly host detection
     • Job data skew detection
     • Job performance suggestion
     • Anomaly prediction based on machine learning
  8. Historical Job Analyzer
     • Job historical performance trend
     • Task and attempt distribution
     • Resource utilization at various levels (cluster/job/user/host)
     • Historical performance anomaly detection: TooLowBytesConsumedPerCPUSecond, JobStatisticLongDuration, TooLargeReduceNumAlert, TooLargeShuffleSizeAlert
  9. Running Job Analyzer
     • Monitoring running jobs in real time
     • Minute-level job progress snapshots
     • Minute-level resource usage snapshots: CPU, HDFS I/O, disk I/O, slot seconds
     • Roll up to user/queue/cluster level
     • Running status anomaly detection: TooLongJobDuration, NoProgressForLong, TooManyTaskFailure
  10. Task Failure based Anomaly Host Detection
      • Use Case: detect node anomalies by analyzing the task failure ratio across all nodes
      • Assumption: the task failure ratio should be approximately equal for every node
      • Algorithm: node-by-node comparison (symmetry violation) and per-node trend
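The node-by-node comparison on slide 10 can be sketched roughly as follows. This is a minimal illustration, not Eagle's implementation: the class `AnomalyHostSketch`, the method `findAnomalousHosts`, the use of the cluster median as the reference, and the `tolerance` parameter are all assumptions made for the example. It covers only the symmetry check, not the per-node trend.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of Eagle. */
final class AnomalyHostSketch {

    /**
     * tasksByHost maps host -> {failedTasks, totalTasks}.
     * Returns, in host-name order, the hosts whose task failure ratio
     * exceeds the cluster median ratio by more than tolerance.
     */
    static List<String> findAnomalousHosts(Map<String, long[]> tasksByHost, double tolerance) {
        // Failure ratio per host; hosts that ran no tasks are skipped.
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, long[]> e : tasksByHost.entrySet()) {
            long failed = e.getValue()[0];
            long total = e.getValue()[1];
            if (total > 0) {
                ratios.put(e.getKey(), (double) failed / total);
            }
        }
        List<String> anomalous = new ArrayList<>();
        if (ratios.isEmpty()) {
            return anomalous;
        }
        // Assumed reference point: the median ratio across all hosts.
        double[] sorted = ratios.values().stream()
                .mapToDouble(Double::doubleValue).sorted().toArray();
        int n = sorted.length;
        double median = (n % 2 == 1)
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        // A host violates the "all nodes fail about equally" assumption
        // when it sits well above the median.
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            if (e.getValue() - median > tolerance) {
                anomalous.add(e.getKey());
            }
        }
        return anomalous;
    }
}
```

The median is used here because a single bad host barely moves it, whereas a mean would be pulled toward the outlier it is supposed to expose.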
  11. Task Failure based Anomaly Host Detection. Alerting: Anomaly Detection & Alerting. Insight: Task failure drill-down
  12. Real-time Data Skew Detection
      • Use Case: detect data skew from statistics and distributions of attempt execution durations and counters
      • Assumption: durations and counters should follow a normal distribution
      • Counters & Features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead
      • Modeling & Statistics: Avg, Min, Max, Distributions, Max z-score, Top-N, Correlation
      • Threshold & Detection: Counters Correlation > 0.9 & Max(Z-Score) > 90%
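The threshold rule on slide 12 combines a counter correlation with a maximum z-score. A rough sketch of that combination is below. The class and method names are hypothetical, the choice of attempt durations against one input-record counter is just one possible pairing, and because the slide's "> 90%" cutoff is ambiguous, the sketch takes a plain z-score cutoff (`minMaxZ`) as an assumption.

```java
/** Hypothetical helper for illustration; not part of Eagle. */
final class DataSkewSketch {

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    /** Population standard deviation. */
    static double stdDev(double[] x) {
        double m = mean(x);
        double sq = 0;
        for (double v : x) {
            sq += (v - m) * (v - m);
        }
        return Math.sqrt(sq / x.length);
    }

    /** Largest z-score among the samples; 0 when all samples are equal. */
    static double maxZScore(double[] x) {
        double m = mean(x);
        double sd = stdDev(x);
        if (sd == 0) {
            return 0;
        }
        double max = 0;
        for (double v : x) {
            max = Math.max(max, (v - m) / sd);
        }
        return max;
    }

    /** Pearson correlation of two equally long series; 0 if either is constant. */
    static double correlation(double[] x, double[] y) {
        double mx = mean(x);
        double my = mean(y);
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - mx;
            double dy = y[i] - my;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0 || syy == 0) {
            return 0;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /**
     * Flags skew when attempt durations track the counter closely AND at
     * least one attempt is an extreme outlier in duration.
     */
    static boolean isSkewed(double[] durations, double[] inputRecords,
                            double minCorrelation, double minMaxZ) {
        return correlation(durations, inputRecords) > minCorrelation
                && maxZScore(durations) > minMaxZ;
    }
}
```

Requiring both conditions separates skew (one attempt is slow because it received far more records) from a slow attempt whose input size was ordinary, which would point at the host instead.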
  13. Real-time Data Skew Detection
  14. Anomaly Prediction based on Machine Learning
      • Anomaly Metric Predictive Detection
      • Offline: analyze and combine 500+ metrics for causal anomaly detection (IG -> PCA -> GMM -> MCC)
      • Online: predictively alert on anomalous metrics
      Figures: Normal (Green) and Abnormal (Red) Data and Probability Distribution and Threshold Selection; PCA (Principal Component Analysis)
  15. Anomaly Prediction based on Machine Learning • Anomaly Metric Predictive Detection
  16. DAM: Data Activity Monitoring. Secure Hadoop in real time.
      • Security Use Cases
      • Security Architecture Overview
      • Security Component Highlights
      • Security Machine Learning Integration
  17. Security Use Cases
      • Data Loss Prevention: get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
      • Malicious Logins: detect logins when a malicious user tries to guess a password. Eagle creates user profiles using a machine learning algorithm to detect anomalies.
      • Unauthorized Access: detect and stop a malicious user trying to access classified data without privilege.
      • Malicious User Operation: detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
  18. Security Architecture Overview
  19. Security Component Highlights
      • Policy Manager: expressive language to create and modify policies for alerting and remediation on certain data activity monitoring events.
      • Data Classification: integrates with Dataguise and Apache Ranger.
      • Policy-based Remediation: ability to detect and stop a threat, improve operational efficiency, and reduce regulatory compliance costs.
      • User Profiling: based on machine learning, to automatically generate anomaly detection policies.
      • User Activity Exploration: ability to drill down into alert details to understand the data security threat.
  20. Security Machine Learning Integration
      • User Activity Profiling
      • Offline: determine the kernel density function parameters (the KDE bandwidth) from the training dataset
      • Online: if a test data point lies outside the trained bandwidth, it is an anomaly (policy)
      Figures: PCs (Principal Components) in EVD (Eigenvalue Decomposition); Kernel Density Function
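The offline/online split on slide 20 can be illustrated with a one-dimensional Gaussian kernel density estimate. This is a sketch under stated assumptions, not Eagle's user-profiling code: the class `KdeProfileSketch` is hypothetical, the bandwidth comes from the common normal-reference rule of thumb h = 1.06 · σ · n^(-1/5), and "outside the trained bandwidth" is interpreted here as "estimated density below a chosen minimum", which is a substitution for whatever test Eagle actually applies. The EVD step is not covered.

```java
/** Hypothetical one-dimensional user-activity profile; not part of Eagle. */
final class KdeProfileSketch {
    private final double[] samples;
    private final double bandwidth;

    /** Offline step: keep the training samples and derive a bandwidth from them. */
    KdeProfileSketch(double[] training) {
        this.samples = training.clone();
        int n = samples.length;
        double mean = 0;
        for (double v : samples) {
            mean += v;
        }
        mean /= n;
        double var = 0;
        for (double v : samples) {
            var += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(var / n);
        // Normal-reference rule of thumb; the floor avoids a zero bandwidth
        // when all training samples are identical.
        this.bandwidth = Math.max(1e-9, 1.06 * sd * Math.pow(n, -0.2));
    }

    /** Gaussian kernel density estimate at x. */
    double density(double x) {
        double sum = 0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (samples.length * bandwidth * Math.sqrt(2 * Math.PI));
    }

    /** Online step: a point in a low-density region is treated as anomalous. */
    boolean isAnomaly(double x, double minDensity) {
        return density(x) < minDensity;
    }
}
```

For example, a profile trained on a user's typical hourly operation counts would flag an hour with several times the usual volume, because almost no training mass lies near that value.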
  21. Security Machine Learning Integration • User Activity Profiling on Spark
      • Offline training on Spark: historical audit events (archived data, HDFS); batch preprocess; user profile model generation (KDE + EVD algorithm); persist model; Eagle storage
      • Online detection on Storm: real-time audit events (real-time stream, Kafka); stream preprocess; policy engine; dynamically load models & policies; alert consumer; persist alert
      • Eagle Security Plugins
  22. Eagle Monitoring Framework: full-stack real-time monitoring framework. Eagle = Eagle Framework + Eagle Apps
  23. Monitoring Programming Paradigm
      • Data collector -> data processing -> metric pre-aggregation/alert engine -> storage -> dashboards
      • We need to create a framework that covers the full stack of a monitoring system
  24. Eagle Monitoring Framework
  25. Eagle Monitoring Framework Highlights (Eagle = Eagle Framework + Eagle Apps)
      • Lightweight Streaming Process Framework
      • Extensible & Scalable Policy Framework
      • Eagle Query Framework
      • Customizable Dashboards
  26. Eagle Stream Data Processing API
      Step 1: Task DAG graph setup
          @Override
          protected void buildDependency(FlowDef def, DataProcessConfig config) {
              Task header = Task.newTask("wordgenerator")
                      .setExecutor(source)
                      .completeBuild();
              Task uppertask = Task.newTask("uppercase")
                      .setExecutor(new UppercaseExecutor())
                      .connectFrom(header)
                      .completeBuild();
              Task groupbyUppercaseTask = Task.newTask("groupby_uppercase")
                      .setExecutor(new GroupbyCountExecutor())
                      .connectFrom(uppertask)
                      .completeBuild();
              def.endBy(groupbyUppercaseTask);
          }
      Step 2: Inter-task data exchange protocol
          @Override
          protected void buildDependency(FlowDef def, DataProcessConfig config) {
              Task header = Task.newTask("wordgenerator")
                      .setExecutor(source)
                      .completeBuild();
              Task uppertask = Task.newTask("uppercase")
                      .setExecutor(new UppercaseExecutor())
                      .connectFrom(header)
                      .completeBuild();
              Task groupbyUppercaseTask = Task.newTask("groupby_uppercase")
                      .setExecutor(new GroupbyCountExecutor())
                      .connectFrom(uppertask)
                      .completeBuild();
              def.endBy(groupbyUppercaseTask);
          }
  27. Execution Graph: development, compile, and deploy. Development / Compile Phase; Deployment / Runtime Phase
  28. Extensible & Scalable Policy Framework
      • Usability: declarative policy definition syntax; stream metadata (event attribute name, attribute type, attribute value resolver, …)
      • Scalability: dynamic policy partitioning across compute nodes based on a configurable partition class; dynamic policy deployment; event partitioning by Storm and policy partitioning by Eagle (N events * M policies)
      • Extensibility: supports new policy evaluation engines, for example Siddhi, Esper, machine learning, etc.
  29. Usability of Policy Framework
      • Case: HBase region server high call queue length
      • Policy: in the past 30 minutes, call queue length > 2000 occurred at least 20 times
          from RegionCallQueueLength[value>2000]#window.Extension:messageTimeWindow(30 min)
          select host, value, avg(value) as avgValue, count(*) as count
          group by host
          having count >= 20
          insert into HighRegionServerCallQueueLengthStream;
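To make the semantics of the slide 29 policy concrete, here is a plain-Java sketch of the same rule: per host, count samples above a threshold inside a sliding time window and alert once the count reaches a minimum. It is not how Eagle or Siddhi evaluates the query. The class `CallQueueWindowSketch` and its method names are hypothetical, and the sketch assumes events for a given host arrive in timestamp order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sliding-window counter for illustration; not part of Eagle. */
final class CallQueueWindowSketch {
    private final long windowMillis;
    private final double threshold;
    private final int minCount;
    // Per host: timestamps of samples that exceeded the threshold.
    private final Map<String, Deque<Long>> hitsByHost = new HashMap<>();

    CallQueueWindowSketch(long windowMillis, double threshold, int minCount) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
        this.minCount = minCount;
    }

    /**
     * Feeds one metric sample and returns true when the host has at least
     * minCount over-threshold samples within the window ending at this sample.
     */
    boolean onEvent(String host, long timestampMillis, double value) {
        Deque<Long> hits = hitsByHost.computeIfAbsent(host, h -> new ArrayDeque<>());
        if (value > threshold) {
            hits.addLast(timestampMillis);
        }
        // Evict hits that have fallen out of the window.
        long cutoff = timestampMillis - windowMillis;
        while (!hits.isEmpty() && hits.peekFirst() <= cutoff) {
            hits.removeFirst();
        }
        return hits.size() >= minCount;
    }
}
```

With a 30-minute window, a threshold of 2000, and a minimum count of 20, this mirrors the `[value>2000]` filter, the `group by host`, and the `having count >= 20` clause of the policy; the `avg(value)` projection is omitted.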
  30. Scalability of Policy Evaluation: Dynamic Policy Partition
      • N users with 3 partitions and M policies with 2 partitions give 3*2 physical tasks
      • Physical partition + policy-level partition
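The 3*2 arithmetic on slide 30 can be sketched as a two-level hash: one partition index from the event key (the user) and one from the policy, combined into a single physical task index. The class `PolicyPartitionSketch`, the hash-modulo scheme, and the index layout are assumptions for illustration; the slides say only that the partition class is configurable.

```java
/** Hypothetical two-level partitioner for illustration; not part of Eagle. */
final class PolicyPartitionSketch {

    /** Maps a key to a partition in [0, partitions). */
    static int partition(String key, int partitions) {
        // floorMod keeps the result non-negative for negative hash codes.
        return Math.floorMod(key.hashCode(), partitions);
    }

    /**
     * Physical task that evaluates the given policy for the given user's
     * events: event partition selects the row, policy partition the column.
     */
    static int taskIndex(String user, String policyId,
                         int eventPartitions, int policyPartitions) {
        return partition(user, eventPartitions) * policyPartitions
                + partition(policyId, policyPartitions);
    }
}
```

With 3 event partitions and 2 policy partitions the index ranges over 0..5, so each of the six tasks sees one third of the users and evaluates one half of the policies, instead of every task evaluating all M policies on its events.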
  31. Extensibility of Policy Framework
      The policy evaluator provider uses SPI to register policy engine implementations:
          public interface PolicyEvaluatorServiceProvider {
              public String getPolicyType();
              public Class<? extends PolicyEvaluator> getPolicyEvaluator();
              public Class<? extends PolicyDefinitionParser> getPolicyDefinitionParser();
              public Class<? extends PolicyEvaluatorBuilder> getPolicyEvaluatorBuilder();
              public List<Module> getBindingModules();
          }
      Built-in supported policy engines:
      • Siddhi Complex Event Processing Engine
      • Machine Learning based Policy Engine
  32. Eagle Query Framework: a lightweight, metadata-driven store layer that serves the storage and query requirements shared by most monitoring systems
      • Persistence: Metric, Event, Metadata, Alert, Log, Customized Structure, …
      • Query: Search, Filter, Aggregation, Sort, Expression, …
  33. Customizable Dashboard: provides real-time interactive visualization and analytics, supporting a variety of data sources such as Eagle and Druid.
      • Interactive: IPython-notebook-like interactive visualization, analysis, and troubleshooting.
      • Dashboard: customizable dashboard layout and drill-down path, which can be persisted and shared.
  34. Eagle in Future: the general monitoring platform for eBay's large-scale systems
  35. Open Source
      • First Use Case: Eagle to secure Hadoop in real time, based on the Eagle framework
      • External Partners: Hortonworks, Dataguise, PayPal, and Apache Ranger
      • Following Components to Open Source: JPA (“Job Performance Analyzer”), HBase and GC monitoring, and others will be open-sourced soon
  36. Reference
      • Eagle at Hadoop Summit 2015, San Jose: http://2015.hadoopsummit.org (Slides | Video)
      • Eagle at Big Data Summit 2014, Shanghai: http://2014ebay.csdn.net/m/zone/ebay_en (Slides | Video)
  37. The End & Thanks. "If you want to go fast, go alone. If you want to go far, go together." -- African Proverb. Hao Chen, hchen9@ebay.com | @haozch
  38. We are Hiring Now: https://careers.ebayinc.com, or contact me: hchen9@ebay.com
