Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Eagle: Secure Hadoop in Real Time

769 views

Published on

Apache Eagle: Secure Hadoop in Real Time

Published in: Technology
  • Be the first to comment

Apache Eagle: Secure Hadoop in Real Time

  1. 1. Apache Eagle: Secure Hadoop in Real Time http://eagle.incubator.apache.org/ Hao Chen (@haozch) | Ralph Su (@raphaelsu)
  2. 2. About Us Apache Eagle Co-creator,PMC & Committer hao@apache.org Sr. Software Engineer at eBay hchen9@ebay.com Apache Eagle PMC & Committer ralphsu@apache.org MTS. Software Engineer at eBay liasu@ebay.com Ralph Su / 苏良飞 Hao Chen / 陈浩
  3. 3. Agenda • Introduction • Architecture • Case Study • Q & A
  4. 4. is a distributed real-time monitoring and alerting engine for hadoop from eBay Open sourced as Apache Incubator Project on Oct 26th 2015 Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks/ malicious activity and block access in real time. See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle Apache Eagle
  5. 5. Eagle  was  initialized  by  end  of  2013  for  hadoop ecosystem  monitoring  as  any  existing   tool  like  zabbix,  ganglia  can  not  handle  the  huge  volume  of  metrics/logs  generated  by   hadoop system  in  eBay. 2014/2015 10,000  nodes 150,000+  cores 200  PB 2000+  user 3000+  nodes 10,000+  cores 50+  PB 2012 2011 1000+  nodes 10,000+  cores 10+  PB 100+  nodes 1000  +  cores 1  PB 2010 2009 50+  nodes 2007 1-­‐10  nodes Hadoop  Data   • Security • Activity Hadoop  Platform   • Heath • Availability • Performance Hadoop  @  eBay  Inc Initiative  – Why  build  Eagle? 7+   CLUSTERS 10000+ NODES 200+  PB DATA 10  B+   EVENTS  /  DAY 500+   METRIC  TYPES 50,000+   JOBS  /  DAY 50,000,000+   TASKS  /  DAY
  6. 6. Initiative  – Core  Challenges Large  Scale  Processing  and  Storage • Scale  IO  Complexity  (Stream) • Scale  Computation  Complexity  (Policy) • Scale  Storage  &  Query  (Event,  Log,  Metric) 1 Real-­‐Time  Alerting • Real-­‐time  Data  Collection • Real-­‐time  Stream  Processing • Real-­‐time  Alerting 2 Expressive Correlation  Model • Complex  policy  model • Stream  GroupBy,  Join,  Window • Machine  Learning 3 Hadoop  Ecosystem  Integration • Data  Source  Integration • Anomaly  Detection  Model 4
  7. 7. 1 Eagle  Framework Distributed  real-­‐time  framework  for  efficiently  developing   highly   scalable  monitoring   applications Eagle  Ecosystem 2 Eagle  Apps Security/  Hadoop/  Operational  Intelligence  /  … 3 Eagle  Interface REST  Service  /  Management  UI  /  Customizable  Analytics   Visualization 4 Eagle  Integration Ambari /  Docker /  Ranger  /  Dataguise Apps à Security à Hadoop à Cloud à Database Interface à Web Portal à REST Services à Analytics Visualization Integration à Ambari à Docker à Ranger à Dataguise Eagle Framework Open  Source Community-­‐driven   and  Cross-­‐community  cooperation 5
  8. 8. Eagle  @  eBay  Inc. 7+   CLUSTERS 10000+ NODES 200+  PB DATA 10  B+   EVENTS  /  DAY 500+   METRIC  TYPES 50,000+   JOBS  /  DAY 50,000,000+   TASKS  /  DAY Eagle  Deployment  at  eBay  Production • 100+  security  policies • 8  nodes • 30  worker  process • 64  kafka partition Eagle  Performance • Avg Latency:   ~  50  ms • Max  Throughput/Cluster:  300  k  /s
  9. 9. Eagle  Architecture
  10. 10. Distributed  Policy  Engine METADATA   MANAGER AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real  Time  Alerts Alerts Policy   Management Policy Dynamical   Policy   Deployment Real-­‐time   Event  Stream Stream_{1} Stream_{*} Dynamical   Stream  Schema Stream   Processing • Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus) • Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy • Linear  Scalability:  Data  volume scale +  Computation  scale • Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle
  11. 11. METADATA   MANAGER AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real  Time  Alerts Alerts Policy   Management Policy Dynamical   Policy   Deployment Real-­‐time   Event  Stream Stream_{1} Stream_{*} Dynamical   Stream  Schema Stream   Processing from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream; • Real-­‐time  Streaming: Apache  Storm  (Execution  Engine)  +  Kafka  (Message  Bus) • Declarative  Policy: SQL  (CEP)  on  Streaming  +  Hot  Deploy • Linear  Scalability:  Data  volume scale +  Computation  scale • Metadata-­‐Driven:  Schema  Management  and  Dynamical  Policy  Lifecycle Distributed  Policy  Engine
  12. 12. Declarative  Policy • Filter • Join • Aggregation:  Avg,  Sum  ,  Min,  Max,  etc • Group by • Having • Stream handlers  for  window:  TimeWindow,  Batch  Window,   Length  Window   • Conditions  and  Expressions:  and,  or,  not,  ==,!=,  >=,  >,  <=,  <,   and  arithmetic  operations • Pattern  Processing • Sequence  processing • Event  Tables:  intergrate historical  data  in  realtime processing • SQL-­‐Like  Query:  Query,  Stream  Definition   and  Query  Plan   compilation Distributed  SQL  on  Streaming  :   Siddhi  CEP  +  Storm  by  default from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  13. 13. Declarative  Policy  -­‐ Examples from hadoopJmxMetricEventStream [metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9] select metric, host, value, timestamp, component, site insert into alertStream; Example  1:  Alert  if  hadoop namenode capacity  usage  exceed  90  percentages from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and a.value != value)] within 10 min select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site insert into alertStream; Example  2:  Alert  if  hadoop namenode HA  switches
  14. 14. 1 Distributed  Policy  Engine  -­‐ Scalability 2 Distributed   Streaming   Cluster  Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Stream_{1} Stream_{*} Stream   Processing Dynamic  policy  partition  by  {event}  *  {policy} • N  Users  with  3  partitions,  M  policies  with  2  partitions,  then  3*2  physical  tasks • Physical  partition  +  policy-­‐level  partition Linear  Scalability Principle
  15. 15. 3 Algorithm Weights  of  Executors   By  Partition   User Random 0.0484 0.152 0.3535 0.105 0.203 0.072 0.042 0.024 Greedy 0.0837 0.0837 0.0837 0.0837 0.0737 0.0637 0.0437 0.0837 Stream  Partition  Skew  (15:1) Distributed  Policy  Engine – Optimization Stream  Partition  Problem     https://en.wikipedia.org/wiki/Partition_problem
  16. 16. Distributed  Real-­time    Policy  Engine Siddhi  CEP  Policy   Evaluator Machine  Learning   Policy  Evaluator• Support  WSO2  Siddhi  CEP  as  first  class • Extensible  Policy  Engine  Implementation • Extensible  Policy  Lifecycle  Management • Metadata-­‐based  Module  Management Extensible   Policy   Evaluator public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); // literal string to identify one type of policy public Class getPolicyEvaluator(); // get policy evaluator implementation public List getBindingModules(); // policy text with json format to object mapping } METADATA  MANAGER Policy/Metadata Distributed  Policy  Engine – Extensibility Policy  Engine  Extensibility
  17. 17. Stream  Processing  Framework Optimizer 1.  Development 2.  Optimization 3.  Compile  to  native  app Use  eagle  alert  framework  as  library
  18. 18. • Light-­‐weight  ORM  Framework  for  HBase/RDMBS • Full-­‐function   SQL-­‐Like  REST  Query   • Optimized  Rowkey design  for  time-­‐series  data • Native  HBase Coprocessor • Secondary  Index  Support @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef} Large  Scale  Storage  and  Query
  19. 19. Uniform  HBase rowkey design • Metric • Entity • Log Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | … Rowvalue ::= Log Content Large  Scale  Storage  and  Query http://opentsdb.net
  20. 20. Multi-­‐Tenants  – Topology  Scheduler • Dynamical  Topology  Management • No-­‐downtime  Topology  Maintenance • Topology  High  Availability  &  Balance • Resource  Scheduling  &  Isolation:  Runtime,Woker,  Topology  or  Cluster
  21. 21. Multi-­‐Tenants  -­‐ Dynamical  Correlation • Dynamical  Correlation  on  Runtime: • Sort,  Groupby,  Join,  Window • Hot  Deploy  Logic • Policy  Management • Multi-­‐Correlation  on  the  Single  Stream • Group  by  different  fields  of  same  stream • Resort  same  stream  by  different  order • Join  certain  stream  in  different  way • Multi  Correlation  on  Multi  Streams • Cross  Streams  Join • Real-­‐time  &  Historical  Stream  Join
  22. 22. Multi-­‐Tenants  -­‐ Dynamical  Correlation
  23. 23. Eagle  Use  Cases Data  Activity  Monitoring Secure  Hadoop  in  Realtime a  data  activity  monitoring  solution  to  instantly  identify  access  to  sensitive  data,    recognize   attacks/  malicious  activity  and  block  access  in  real  time. Job  Performance  Monitoring Hadoop,  Spark  Job  Profiling  &  Performance  Monitoring,  Cluster  Health  Anomaly  Detection 1 2 4 eBay  Unified  Monitoring  Platform Provide  unified  monitoring-­‐as-­‐service  for  everything  around  infrastructure  or  business. 3
  24. 24. Eagle  Data  Activity  Monitoring Data  Loss  Prevention Get  alerted  and  stop  a  malicious  user  trying  to  copy,  delete,  move   sensitive  data  from  the  Hadoop  cluster. Malicious  Logins Detect  login  when  malicious  user  tries  to  guess  password.  Eagle   creates  user  profiles  using  machine  learning  algorithm  to  detect   anomalies Unauthorized  access Detect  and  stop  a  malicious  user  trying  to  access  classified  data   without  privilege.   Malicious  user  operation Detect  and  stop  a  malicious  user  trying  to  delete  large  amount  of   data.  Operation  type  is  one  parameter  of  Eagle  user  profiles.  Eagle   supports  multiple  native  operation  types. User Privileges Common   Data  Sets Patterns CommandsZones Query Columns
  25. 25. § HDFS  Policies § Access  to  Sensitive  files § HDFS  Commands  used  (read,  write,  update…) § Client  host   § Destination § Security  Zones § Hive  Policies § Access  to  tables  with  PII  Data § SQL  Query  Profiles § Client  Host § Security  Zone Eagle  Data  Activity  Monitoring Hadoop  Data  Security:  Detect  anomalies  in  accessing  HDFS  and  Hive
  26. 26. Offline:  Determine  bandwidth  from  training  dataset  the  kernel  density  function   parameters  (KDE) Online:  If  a  test  data  point  lies  outside  the  trained  bandwidth,   it  is  anomaly  (Policy) PCs(Principle  Components)  in  EVD (Eigenvalue  Value  Decomposition) Kernel  Density  Function Eagle  Machine  Learning  User  Profile
  27. 27. Use  Case Detect  node  anomaly  by  analyzing  task  failure  ratio  across  all  nodes Assumption Task  failure  ratio  for  every  node  should  be  approximately  equal Algorithm Node  by  node  compare  (symmetry  violation)  and  per  node  trend Eagle  Job  Performance  Monitoring
  28. 28. Alerting:   Anomaly  Detection  Alerting Insight:   Task  failure  drill-­‐down Insight:   Task  failure  drill-­‐down Task  Failure  based  Anomaly  Host  Detection  
  29. 29. Counters  &  Features Use  Case Detect  data  skew  by  statistics  and  distributions  for  attempt  execution  durations  and   counters Assumption      Duration  and  counters  should  be  in  normal  distribution mapDuration reduceDuration mapInputRecords reduceInputRecords combineInputRecords mapSpilledRecords reduceShuffleRecords mapLocalFileBytesRead reduceLocalFileBytesRead mapHDFSBytesRead reduceHDFSBytesRead Modeling  &  Statistics Avg Min Max   Distributions Max  z-­score Top-­N Correlation Threshold  &  Detection Counters Correlation  >  0.9   &  Max(Z-­Score)  >  90%   Hadoop  Job  Data  Skew  Detection
  30. 30. Counters  &  Features Use  Case Detect  data  skew  by  statistics  and  distributions  for  attempt  execution  durations  and   counters Assumption      Duration  and  counters  should  be  in  normal  distribution mapDuration reduceDuration mapInputRecords reduceInputRecords combineInputRecords mapSpilledRecords reduceShuffleRecords mapLocalFileBytesRead reduceLocalFileBytesRead mapHDFSBytesRead reduceHDFSBytesRead Modeling  &  Statistics Avg Min Max   Distributions Max  z-­score Top-­N Correlation Threshold  &  Detection Counters Correlation  >  0.9   &  Max(Z-­Score)  >  90%   Hadoop  Job  Data  Skew  Detection
  31. 31. Learn  More  about  Apache  Eagle Community • Website:  http://eagle.incubator.apache.org • Github:  http://github.com/apache/incubator-­‐eagle • Mailing  list:  dev@eagle.incubator.apache.org Resources • Documentation:  http://eagle.incubator.apache.org/docs/ • Docker images:  https://hub.docker.com/r/apacheeagle/sandbox/ Publications  &  Patents • EAGLE:  USER  PROFILE-­‐BASED  ANOMALY  DETECTION  IN  HADOOP  CLUSTER  (IEEE) • EAGLE:  DISTRIBUTED  REALTIME  MONITORING  FRAMEWORK  FOR  HADOOP  CLUSTER
  32. 32. Q  &  A http://eagle.incubator.apache.org The  slide  is  licensed  under  Creative  Commons  Attribution  4.0  International  license.

×