SlideShare a Scribd company logo
1 of 38
Apache in Action
Secure Hadoop in Real-time
Hao Chen / 陈浩 / hao@apache.org / @haozch
Who is the Guy
2
Co-creator, Committer and PMC @ Apache Eagle
hao@apache.org
Hao Chen / 陈浩
Sr. Software Engineer @ eBay Cloud Service
hchen9@ebay.com
Speaker @ Hadoop Summit (SJC, SHA, BJ) ...
http://people.apache.org/~hao
Agenda
3
• About Eagle
• Architecture
• Ecosystem
• Q & A
What’s Apache Eagle
4
Apache Eagle is a distributed real-time monitoring and
alerting engine for hadoop from eBay
Open sourced as Apache Incubator Project on Oct 26th 2015
Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access
to sensitive data, recognize attacks/ malicious activity and block access in real time.
See http://eagle.incubator.apache.org or http://goeagle.io
Apache Eagle History
Donated to Apache Software Foundation (ASF) from eBay at Oct 26th, 2015
5
Dec 2013 Oct 23 2015 Oct 26 2015
Hadoop Eagle
Project Initiative
Apache Incubator
eagle.incubator.apache.org
Github Open Source
github.com/apache/incubator-eagle
Hadoop Eagle
Production Release
May 2014
Why build Apache Eagle
6
Eagle was initialized by end of 2013 for hadoop ecosystem monitoring as any
existing tool like zabbix, ganglia can not handle the huge volume of metrics/logs
generated by hadoop system in eBay.
2013/2014
10,000 nodes
150,000+ cores
170 PB
2000+ user
3000+ nodes
10,000+ cores
50+ PB
2012
2011
1000+ nodes
10,000+ cores
10+ PB
100+ nodes
1000 + cores
1 PB
2010
2009
50+ nodes
2007
1-10 nodes
Hadoop Data
• Security
• Activity
Hadoop Platform
• Heath
• Availability
• Performance
Hadoop @ eBay Inc
Apache Eagle @ eBay
7
7 CLUSTERS
7427 NODES
160 PB DATA
10 B+ EVENTS / DAY
500+ METRIC TYPES
50,000+ JOBS / DAY
50,000,000+ TASKS / DAY
MONITOR
PROCESS
Agenda
8
• About Eagle
• Architecture
• Ecosystem
• Q & A
Apache Eagle Architecture Overview
9
Scalable
Scales to monitor thousands of policies and
billions of access events
Machine Learning
Create dynamic user profiles based on
user behavior
Real-time
Generates alerts in real time and blocks
users with malicious intent
Extensible
Eagle can be easily extended to monitor
other data sources
Apache Eagle Architecture Overview
10
STREAM PROCESSING
ENGINE
DataCollector
Kafka
HDFS, Audit, Security
METADATA MANAGER
DATASTORES
REMEDIATION ENGINE
Apache
Ranger
MACHINE LEARNING
MODULE
Custom
module
Actionable Alerts
Activities
Actionable
Alerts
PolicyThresholdsUser properties
MLThresholds
Real Time Alert
Dashboard
Security Analyst
Admin Console
Security Engineer
Insights
Metadata
Management
MACHINE LEARNING TRAINING MODULE
Policy Engine
Apache Eagle Architecture Features
11
• Real-time Data Collection
• Distributed Policy Engine
• Stream Processing DSL
• Scalable Data Storage & Query
• Machine Learning Intergration
NOTE {NAME}-{NUMBER} like HDFS-6914 means open source project ticket id contributed by us
Apache Eagle - Data Collection
12
Decoupling with Message Bus
• Apache Kafka: high-throughput distributed
messaging
• Partition: balance between logic and throughput
Cross-Platform Integration
• Community Kafka Client (18+)
• Python/Go/C/C++/JAVA ..
• Enhanced Log4j-kafka
• KAFKA-2041: Extensible Partition Key
• KAFKA-2077: Advanced Topic Selector
Apache Eagle - Data Collection
13
Availability: Filebeat + Logstash
Logstash-1
Logstash-2
Logstash-…
Logstash-
Ligh-weight collector (golang) with daemon Logstash instances cluster
Shuffle Grouping
KafkaField Grouping
Distributed Message Bus
Resource consumption balance
Message throughput balance ( LOGSTASH-179)
Storm Spout: Distributed crawling for hadoop job, node jmx and service logs, etc.
Zookeeper: Centralized state management and distributed locking
Apache Eagle - Data Collection
14
Scalability: Distributed Real-time Ingestion
Zookeeper
METRIC
Centralized State Management
JOB EVENT LOG
Apache Eagle - Distributed Real-time Policy Engine
15
METADATA MANAGER
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Real Time
Alerts
Alerts
Policy
Management
Policy
Dynamical Policy Deployment
Real-time
Event Stream
Stream_{1}
Stream_{*}
Dynamical Stream Schema
Stream
Processing
Highlights
• Real-time
• Usability
• Scalability
• Extensibility
• Metadata-driven
Apache Eagle - Distributed Real-time Policy Engine
16
METADATA MANAGER
Real Time
Alerts
Alerts
Policy
Management
Policy
Event Stream
(Kafka)
Dynamical Stream Schema
Dynamical Policy Deployment
Real-time
• Kafka-based Distributed
Message Bus (Extensible)
• Storm-based Real-time
Execution Environment
(Extensible)
• Stream events are
processed and alerts are
evaluated during
streaming
Apache Eagle - Distributed Real-time Policy Engine
17
METADATA MANAGER
Distributed Streaming Cluster Environment
Real Time
Alerts
Alerts
Policy
Management
Policy
Dynamical Policy Deployment
Usability
• Powerful SQL-Like CEP CQL for
Policy Definition
• Dynamical Poilcy Metadata Lifecycle
Management (Deployment/Update)
• Easy-to-use Policy management and
Alert analytics UI
from metricStream[(name == 'ReplLag')
and (value > 1000)] select * insert into
outputStream;
Apache Eagle - Distributed Real-time Policy Engine
18
Full-function Streaming CEP CQL: Siddhi on Storm by default
hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user,
count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream;
• Filter
• Join
• Aggregation: Avg, Sum , Min, Max, etc
• Group by
• Having
• Stream handlers for window: TimeWindow, Batch Window, Length Window
• Conditions and Expressions: and, or, not, ==,!=, >=, >, <=, <, and arithmetic operations
• Pattern processing
• Sequence processing
• Event Tables: intergrate historical data in realtime processing
• SQL-Like Query: Query, Stream Definition and Query Plan compilation
Apache Eagle - Distributed Real-time Policy Engine
19
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Stream_{1}
Stream_{*}
Stream
Processing
Scalability: dynamic policy partition by {event} * {policy}
• N Users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks
• Physical partition + policy-level partition
Apache Eagle - Distributed Real-time Policy Engine
20
Distributed Streaming Partition Problem
https://en.wikipedia.org/wiki/Partition_problem
S = {3,1,1,2,2,1,1}
S1 = {1,1,1,1,1}
S2 = {2,2}
S3 = {3}
Apache Eagle - Distributed Real-time Policy Engine
21
Distributed Streaming Partition Strategy
groupBy[ GreedyStrategy ]((_.key1,_.key2 ))
HBase
Key Distribution Statistics
(Online/Offline)
Realtime Partition
Strategy
Key Statistics Cache
Async
Strategy
• Greedy (Online/Offline)
• PoTC
• PKG
• Hashing
Apache Eagle - Distributed Real-time Policy Engine
22
Distributed Real-time Policy Engine
Siddhi CEP
Policy
Evaluator
Machine
Learning Policy
Evaluator
Extensibility
• Support WSO2 Siddhi CEP as first class
• Extensible policy engine implementation
• Extensible policy lifecycle management
Extensible
Policy Evaluator
public interface PolicyEvaluatorServiceProvider {
public String getPolicyType(); // literal string to identify one type of policy
public Class getPolicyEvaluator(); // get policy evaluator implementation
public List getBindingModules(); // policy text with json format to object mapping
}
public interface PolicyEvaluator {
public void evaluate(ValuesArray input) throws Exception; // evaluate input event
public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);// policy update
public void onPolicyDelete(); // invoked when policy is deleted
}
METADATA MANAGER
Policy/Metadata
Apache Eagle - Distributed Real-time Policy Engine
23
Metadata-Driven
• Stream Schema: AlertStreamSchemaEntity
• Policy Definition: AlertDefinitionAPIEntity
• Central metadata management
• Dynamic metadata deployment
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
METADATA MANAGER
Distributed Real-time Policy Engine
Dynamic Metadata Loading
.flatMap(AuditLogTransformer)
.groupBy(_.user)
.flatMap(UserProfileAggregator);
Apache Eagle - Fluent Stream Processing DSL
24
env.fromKafka (KafkaConfig)
.alert.persistAndEmail
val env = ExecutionEnvironment.getStorm()
env.execute();
Distributed Streaming Cluster Environment
AlertExecutor_{1}
AlertExecutor_{2}
…
AlertExecutor_{N}
Alerts
Real-time
Event Stream
Stream_{1}
Stream_{*}
Stream
Processing
env.execute()
Apache Eagle - Fluent Stream Processing DSL
25
• Physical execution platform independent
• Easily assemble data transformation, filtering,
join and alerting DAG in fluent way
• DAG rewrite and optimization
• StreamUnionExpansion
• StreamGroupbyExpansion
• StreamNameExpansion
• StreamAlertExpansion
• StreamParallelismConfigExpansion
trait StreamProducer{
filter
flatMap
map{1,2,3,4}
groupBy
streamUnion // stream join is hard, not implemented for storm
alertWithConsumer
}
StormExecutionEnvironment env =
ExecutionEnvironmentFactory.getStorm(config);
env.newSource(new
KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1)
.flatMap(new AuditLogTransformer())
.groupBy(0)
.flatMap(new UserProfileAggregatorExecutor());
.alertWithConsumer(“userActivity“,”userProfileExecutor“)
env.execute();
Optimizer
1. Development 2. Optimization 3. Compile to native app
Apache Eagle - Scalable Data Storage and Query
26
• Entity Metadata on large-scale NoSQL
storage like HBase
• Full-function SQL-Like REST Query
• Optimized rowkey design for time-series
monitoring data
• HBase Coprocessor
• Secondary Index
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
@Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{
@Column("a")
private String desc;
@Column("b")
private String policyDef;
@Column("c")
private String dedupeDef;
query=
AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Uniform rowkey design
• Metric
• Entity
• Log
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
Rowvalue ::= Log Content
27
Apache Eagle – Uniform HBase Rowkey Design
Apache Eagle - Machine Learning Intergration
28
29
User Activity Profiling
Offline: Determine bandwidth from training dataset the kernel density function parameters (KDE)
Online: If a test data point lies outside the trained bandwidth, it is anomaly (Policy)
PCs(Principle Components) in EVD
(Eigenvalue Value Decomposition)
Kernel Density Function
Apache Eagle – User/System Activity Profiling
30
Anomaly Metric Predictive Detection
Offline: Analyzing and combining 500+ metrics together for causal anomaly detections (IG -> PCA ->
GMM -> MCC)
Online: Predictively alert for anomaly metrics
Normal (Green) and Abnormal (Red)
Data and Probability Distribution and Threshold Selection
PCA (Principal Component Analysis)
Apache Eagle - Anomaly Metric Predictive Detection
Anomaly Metric Predictive Detection Case Study
Agenda
• About Apache Eagle
• Architecture
• Ecosystem
• Q & A
31
32
Apps
 Security
 Hadoop
 Cloud
 Database
Interface
 Web Portal
 REST Services
 Analytics Visualization
Integration
 Ambari
 Docker
 Ranger
 Dataguise
Eagle Framework
Distributed real-time framework for efficiently
developing highly scalable monitoring applications
Eagle Apps
Securiy / Hadoop / Cloud / Database
Eagle Interface
REST Service / Management UI / Customizable Analytics
Visualization
Eagle Integration
Ambari / Docker / Ranger / Dataguise
Open Source
Community-driven and Cross-community cooperation
Eagle
Framework
Apache Eagle Ecosystem
33
Apache Eagle Ecosystem - Security
How to Secure Hadoop in Realtime?
• Apache Eagle
• Apache Ranger
• Apache Knox
• Dataguise
34
Apache Eagle Ecosystem - Hadoop
Eagle in Apache Amabri: natively be part of hadoop ecosystem
35
Apache Eagle Ecosystem - Docker
Eagle in Docker: natively fly on Cloud/Container
STORM
KAFKA
ZOOKEEPER
HBASE
HADOOP
…
Powered of
git clone apache/incubator-eagle
eagle-docker
15 + 1 = 1
docker pull apacheeagle
36
Apache Eagle Ecosystem - Open Source
If you want to go fast, go alone.
If you want to go far, go together.
-- African Proverb
Learn more about Apache Eagle
37
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER
(IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP
CLUSTER
Q & A
apache/incubator-eagle
@TheApacheEagle
@ApacheEagle
http://eagle.incubator.apache.org
The slide is licensed under Creative Commons Attribution 4.0 International license.

More Related Content

What's hot

Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
DataWorks Summit
 

What's hot (20)

Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Cloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep DiveCloudbreak - Technical Deep Dive
Cloudbreak - Technical Deep Dive
 
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Future of Apache Storm
Future of Apache StormFuture of Apache Storm
Future of Apache Storm
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Building large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twillBuilding large scale applications in yarn with apache twill
Building large scale applications in yarn with apache twill
 
Spark Technology Center IBM
Spark Technology Center IBMSpark Technology Center IBM
Spark Technology Center IBM
 
Building and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache OozieBuilding and managing complex dependencies pipeline using Apache Oozie
Building and managing complex dependencies pipeline using Apache Oozie
 
Enabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government dataEnabling Modern Application Architecture using Data.gov open government data
Enabling Modern Application Architecture using Data.gov open government data
 
Security From The Big Data and Analytics Perspective
Security From The Big Data and Analytics PerspectiveSecurity From The Big Data and Analytics Perspective
Security From The Big Data and Analytics Perspective
 
IoT:what about data storage?
IoT:what about data storage?IoT:what about data storage?
IoT:what about data storage?
 
Integrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query EnginesIntegrating Apache Phoenix with Distributed Query Engines
Integrating Apache Phoenix with Distributed Query Engines
 
Trend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache BigtopTrend Micro Big Data Platform and Apache Bigtop
Trend Micro Big Data Platform and Apache Bigtop
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014Storm – Streaming Data Analytics at Scale - StampedeCon 2014
Storm – Streaming Data Analytics at Scale - StampedeCon 2014
 

Viewers also liked

Afl presentation
Afl presentationAfl presentation
Afl presentation
annacb19
 

Viewers also liked (13)

Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Improving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of ServiceImproving HDFS Availability with Hadoop RPC Quality of Service
Improving HDFS Availability with Hadoop RPC Quality of Service
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
MSII service global
MSII service globalMSII service global
MSII service global
 
Leanforms folder panterra
Leanforms folder panterraLeanforms folder panterra
Leanforms folder panterra
 
あいにきて IoT
あいにきて IoTあいにきて IoT
あいにきて IoT
 
2do boletin emancipacion de la mujer
2do boletin   emancipacion de la mujer2do boletin   emancipacion de la mujer
2do boletin emancipacion de la mujer
 
Walden3 twin slideshare 01
Walden3 twin slideshare 01Walden3 twin slideshare 01
Walden3 twin slideshare 01
 
Science and Nature Portfolio
Science and Nature PortfolioScience and Nature Portfolio
Science and Nature Portfolio
 
9789740333616
97897403336169789740333616
9789740333616
 
Afl presentation
Afl presentationAfl presentation
Afl presentation
 

Similar to Apache Eagle in Action

Similar to Apache Eagle in Action (20)

Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Apache Eagle Architecture Evolvement
Apache Eagle Architecture EvolvementApache Eagle Architecture Evolvement
Apache Eagle Architecture Evolvement
 
Achieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloudAchieve big data analytic platform with lambda architecture on cloud
Achieve big data analytic platform with lambda architecture on cloud
 
MLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to productionMLOps pipelines using MLFlow - From training to production
MLOps pipelines using MLFlow - From training to production
 
Security and Advanced Automation in the Enterprise
Security and Advanced Automation in the EnterpriseSecurity and Advanced Automation in the Enterprise
Security and Advanced Automation in the Enterprise
 
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay HadoopHadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
Hadoop Eagle - Real Time Monitoring Framework for eBay Hadoop
 
Log Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin KropovLog Data Analysis Platform by Valentin Kropov
Log Data Analysis Platform by Valentin Kropov
 
Log Data Analysis Platform
Log Data Analysis PlatformLog Data Analysis Platform
Log Data Analysis Platform
 
Monitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp DockerMonitoring in 2017 - TIAD Camp Docker
Monitoring in 2017 - TIAD Camp Docker
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot Architectures
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
Netflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open SourceNetflix Cloud Architecture and Open Source
Netflix Cloud Architecture and Open Source
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
 
ThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptxThroughTheLookingGlass_EffectiveObservability.pptx
ThroughTheLookingGlass_EffectiveObservability.pptx
 
MuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHubMuleSoft Meetup Roma - Processi di Automazione su CloudHub
MuleSoft Meetup Roma - Processi di Automazione su CloudHub
 
Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...Build an AI/ML-driven image archive processing workflow: Image archive, analy...
Build an AI/ML-driven image archive processing workflow: Image archive, analy...
 
Integrating Splunk into your Spring Applications
Integrating Splunk into your Spring ApplicationsIntegrating Splunk into your Spring Applications
Integrating Splunk into your Spring Applications
 
Real time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solrReal time cloud native open source streaming of any data to apache solr
Real time cloud native open source streaming of any data to apache solr
 
Prototyping applications with heroku and elasticsearch
 Prototyping applications with heroku and elasticsearch Prototyping applications with heroku and elasticsearch
Prototyping applications with heroku and elasticsearch
 

Recently uploaded

Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
David Celestin
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
Kayode Fayemi
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Hung Le
 

Recently uploaded (20)

Dreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video TreatmentDreaming Marissa Sánchez Music Video Treatment
Dreaming Marissa Sánchez Music Video Treatment
 
lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.lONG QUESTION ANSWER PAKISTAN STUDIES10.
lONG QUESTION ANSWER PAKISTAN STUDIES10.
 
Dreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio IIIDreaming Music Video Treatment _ Project & Portfolio III
Dreaming Music Video Treatment _ Project & Portfolio III
 
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINESBIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
BIG DEVELOPMENTS IN LESOTHO(DAMS & MINES
 
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORNLITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
LITTLE ABOUT LESOTHO FROM THE TIME MOSHOESHOE THE FIRST WAS BORN
 
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
Proofreading- Basics to Artificial Intelligence Integration - Presentation:Sl...
 
My Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle BaileyMy Presentation "In Your Hands" by Halle Bailey
My Presentation "In Your Hands" by Halle Bailey
 
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptxBEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
BEAUTIFUL PLACES TO VISIT IN LESOTHO.pptx
 
Introduction to Artificial intelligence.
Introduction to Artificial intelligence.Introduction to Artificial intelligence.
Introduction to Artificial intelligence.
 
Uncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac FolorunsoUncommon Grace The Autobiography of Isaac Folorunso
Uncommon Grace The Autobiography of Isaac Folorunso
 
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven CuriosityUnlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
Unlocking Exploration: Self-Motivated Agents Thrive on Memory-Driven Curiosity
 
Report Writing Webinar Training
Report Writing Webinar TrainingReport Writing Webinar Training
Report Writing Webinar Training
 
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait Cityin kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
in kuwait௹+918133066128....) @abortion pills for sale in Kuwait City
 
Call Girls Near The Byke Suraj Plaza Mumbai »¡¡ 07506202331¡¡« R.K. Mumbai
Call Girls Near The Byke Suraj Plaza Mumbai »¡¡ 07506202331¡¡« R.K. MumbaiCall Girls Near The Byke Suraj Plaza Mumbai »¡¡ 07506202331¡¡« R.K. Mumbai
Call Girls Near The Byke Suraj Plaza Mumbai »¡¡ 07506202331¡¡« R.K. Mumbai
 
History of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth deathHistory of Morena Moshoeshoe birth death
History of Morena Moshoeshoe birth death
 
Lions New Portal from Narsimha Raju Dichpally 320D.pptx
Lions New Portal from Narsimha Raju Dichpally 320D.pptxLions New Portal from Narsimha Raju Dichpally 320D.pptx
Lions New Portal from Narsimha Raju Dichpally 320D.pptx
 
Digital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of DrupalDigital collaboration with Microsoft 365 as extension of Drupal
Digital collaboration with Microsoft 365 as extension of Drupal
 
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdfSOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
SOLID WASTE MANAGEMENT SYSTEM OF FENI PAURASHAVA, BANGLADESH.pdf
 
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdfAWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
AWS Data Engineer Associate (DEA-C01) Exam Dumps 2024.pdf
 
Zone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptxZone Chairperson Role and Responsibilities New updated.pptx
Zone Chairperson Role and Responsibilities New updated.pptx
 

Apache Eagle in Action

  • 1. Apache in Action Secure Hadoop in Real-time Hao Chen / 陈浩 / hao@apache.org / @haozch
  • 2. Who is the Guy 2 Co-creator, Committer and PMC @ Apache Eagle hao@apache.org Hao Chen / 陈浩 Sr. Software Engineer @ eBay Cloud Service hchen9@ebay.com Speaker @ Hadoop Summit (SJC, SHA, BJ) ... http://people.apache.org/~hao
  • 3. Agenda 3 • About Eagle • Architecture • Ecosystem • Q & A
  • 4. What’s Apache Eagle 4 Apache Eagle is a distributed real-time monitoring and alerting engine for hadoop from eBay Open sourced as Apache Incubator Project on Oct 26th 2015 Secure Hadoop in Realtime a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks/ malicious activity and block access in real time. See http://eagle.incubator.apache.org or http://goeagle.io
  • 5. Apache Eagle History Donated to Apache Software Foundation (ASF) from eBay at Oct 26th, 2015 5 Dec 2013 Oct 23 2015 Oct 26 2015 Hadoop Eagle Project Initiative Apache Incubator eagle.incubator.apache.org Github Open Source github.com/apache/incubator-eagle Hadoop Eagle Production Release May 2014
  • 6. Why build Apache Eagle 6 Eagle was initialized by end of 2013 for hadoop ecosystem monitoring as any existing tool like zabbix, ganglia can not handle the huge volume of metrics/logs generated by hadoop system in eBay. 2013/2014 10,000 nodes 150,000+ cores 170 PB 2000+ user 3000+ nodes 10,000+ cores 50+ PB 2012 2011 1000+ nodes 10,000+ cores 10+ PB 100+ nodes 1000 + cores 1 PB 2010 2009 50+ nodes 2007 1-10 nodes Hadoop Data • Security • Activity Hadoop Platform • Heath • Availability • Performance Hadoop @ eBay Inc
  • 7. Apache Eagle @ eBay 7 7 CLUSTERS 7427 NODES 160 PB DATA 10 B+ EVENTS / DAY 500+ METRIC TYPES 50,000+ JOBS / DAY 50,000,000+ TASKS / DAY MONITOR PROCESS
  • 8. Agenda 8 • About Eagle • Architecture • Ecosystem • Q & A
  • 9. Apache Eagle Architecture Overview 9 Scalable Scales to monitor thousands of policies and billions of access events Machine Learning Create dynamic user profiles based on user behavior Real-time Generates alerts in real time and blocks users with malicious intent Extensible Eagle can be easily extended to monitor other data sources
  • 10. Apache Eagle Architecture Overview 10 STREAM PROCESSING ENGINE DataCollector Kafka HDFS, Audit, Security METADATA MANAGER DATASTORES REMEDIATION ENGINE Apache Ranger MACHINE LEARNING MODULE Custom module Actionable Alerts Activities Actionable Alerts PolicyThresholdsUser properties MLThresholds Real Time Alert Dashboard Security Analyst Admin Console Security Engineer Insights Metadata Management MACHINE LEARNING TRAINING MODULE Policy Engine
  • 11. Apache Eagle Architecture Features 11 • Real-time Data Collection • Distributed Policy Engine • Stream Processing DSL • Scalable Data Storage & Query • Machine Learning Intergration NOTE {NAME}-{NUMBER} like HDFS-6914 means open source project ticket id contributed by us
  • 12. Apache Eagle - Data Collection 12 Decoupling with Message Bus • Apache Kafka: high-throughput distributed messaging • Partition: balance between logic and throughput Cross-Platform Integration • Community Kafka Client (18+) • Python/Go/C/C++/JAVA .. • Enhanced Log4j-kafka • KAFKA-2041: Extensible Partition Key • KAFKA-2077: Advanced Topic Selector
  • 13. Apache Eagle - Data Collection 13 Availability: Filebeat + Logstash Logstash-1 Logstash-2 Logstash-… Logstash- Ligh-weight collector (golang) with daemon Logstash instances cluster Shuffle Grouping KafkaField Grouping Distributed Message Bus Resource consumption balance Message throughput balance ( LOGSTASH-179)
  • 14. Storm Spout: Distributed crawling for hadoop job, node jmx and service logs, etc. Zookeeper: Centralized state management and distributed locking Apache Eagle - Data Collection 14 Scalability: Distributed Real-time Ingestion Zookeeper METRIC Centralized State Management JOB EVENT LOG
  • 15. Apache Eagle - Distributed Real-time Policy Engine 15 METADATA MANAGER Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Real Time Alerts Alerts Policy Management Policy Dynamical Policy Deployment Real-time Event Stream Stream_{1} Stream_{*} Dynamical Stream Schema Stream Processing Highlights • Real-time • Usability • Scalability • Extensibility • Metadata-driven
  • 16. Apache Eagle - Distributed Real-time Policy Engine 16 METADATA MANAGER Real Time Alerts Alerts Policy Management Policy Event Stream (Kafka) Dynamical Stream Schema Dynamical Policy Deployment Real-time • Kafka-based Distributed Message Bus (Extensible) • Storm-based Real-time Execution Environment (Extensible) • Stream events are processed and alerts are evaluated during streaming
  • 17. Apache Eagle - Distributed Real-time Policy Engine 17 METADATA MANAGER Distributed Streaming Cluster Environment Real Time Alerts Alerts Policy Management Policy Dynamical Policy Deployment Usability • Powerful SQL-Like CEP CQL for Policy Definition • Dynamical Poilcy Metadata Lifecycle Management (Deployment/Update) • Easy-to-use Policy management and Alert analytics UI from metricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  • 18. Apache Eagle - Distributed Real-time Policy Engine 18 Full-function Streaming CEP CQL: Siddhi on Storm by default hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user, count(timestamp) as aggValue group by user having aggValue >= 5 insert into outputStream; • Filter • Join • Aggregation: Avg, Sum , Min, Max, etc • Group by • Having • Stream handlers for window: TimeWindow, Batch Window, Length Window • Conditions and Expressions: and, or, not, ==,!=, >=, >, <=, <, and arithmetic operations • Pattern processing • Sequence processing • Event Tables: intergrate historical data in realtime processing • SQL-Like Query: Query, Stream Definition and Query Plan compilation
  • 19. Apache Eagle - Distributed Real-time Policy Engine 19 Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Stream_{1} Stream_{*} Stream Processing Scalability: dynamic policy partition by {event} * {policy} • N Users with 3 partitions, M policies with 2 partitions, then 3*2 physical tasks • Physical partition + policy-level partition
  • 20. Apache Eagle - Distributed Real-time Policy Engine 20 Distributed Streaming Partition Problem https://en.wikipedia.org/wiki/Partition_problem S = {3,1,1,2,2,1,1} S1 = {1,1,1,1,1} S2 = {2,2} S3 = {3}
  • 21. Apache Eagle - Distributed Real-time Policy Engine 21 Distributed Streaming Partition Strategy groupBy[ GreedyStrategy ]((_.key1,_.key2 )) HBase Key Distribution Statistics (Online/Offline) Realtime Partition Strategy Key Statistics Cache Async Strategy • Greedy (Online/Offline) • PoTC • PKG • Hashing
  • 22. Apache Eagle - Distributed Real-time Policy Engine 22 Distributed Real-time Policy Engine Siddhi CEP Policy Evaluator Machine Learning Policy Evaluator Extensibility • Support WSO2 Siddhi CEP as first class • Extensible policy engine implementation • Extensible policy lifecycle management Extensible Policy Evaluator public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); // literal string to identify one type of policy public Class getPolicyEvaluator(); // get policy evaluator implementation public List getBindingModules(); // policy text with json format to object mapping } public interface PolicyEvaluator { public void evaluate(ValuesArray input) throws Exception; // evaluate input event public void onPolicyUpdate(AlertDefinitionAPIEntity newAlertDef);// policy update public void onPolicyDelete(); // invoked when policy is deleted } METADATA MANAGER Policy/Metadata
  • 23. Apache Eagle - Distributed Real-time Policy Engine 23 Metadata-Driven • Stream Schema: AlertStreamSchemaEntity • Policy Definition: AlertDefinitionAPIEntity • Central metadata management • Dynamic metadata deployment @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; METADATA MANAGER Distributed Real-time Policy Engine Dynamic Metadata Loading
  • 24. .flatMap(AuditLogTransformer) .groupBy(_.user) .flatMap(UserProfileAggregator); Apache Eagle - Fluent Stream Processing DSL 24 env.fromKafka (KafkaConfig) .alert.persistAndEmail val env = ExecutionEnvironment.getStorm() env.execute(); Distributed Streaming Cluster Environment AlertExecutor_{1} AlertExecutor_{2} … AlertExecutor_{N} Alerts Real-time Event Stream Stream_{1} Stream_{*} Stream Processing env.execute()
  • 25. Apache Eagle - Fluent Stream Processing DSL 25 • Physical execution platform independent • Easily assemble data transformation, filtering, join and alerting DAG in fluent way • DAG rewrite and optimization • StreamUnionExpansion • StreamGroupbyExpansion • StreamNameExpansion • StreamAlertExpansion • StreamParallelismConfigExpansion trait StreamProducer{ filter flatMap map{1,2,3,4} groupBy streamUnion // stream join is hard, not implemented for storm alertWithConsumer } StormExecutionEnvironment env = ExecutionEnvironmentFactory.getStorm(config); env.newSource(new KafkaSourcedSpoutProvider().getSpout(config)).renameOutputFields(1) .flatMap(new AuditLogTransformer()) .groupBy(0) .flatMap(new UserProfileAggregatorExecutor()); .alertWithConsumer(“userActivity“,”userProfileExecutor“) env.execute(); Optimizer 1. Development 2. Optimization 3. Compile to native app
  • 26. Apache Eagle - Scalable Data Storage and Query 26 • Entity Metadata on large-scale NoSQL storage like HBase • Full-function SQL-Like REST Query • Optimized rowkey design for time-series monitoring data • HBase Coprocessor • Secondary Index @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; query= AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
  • 27. Uniform rowkey design • Metric • Entity • Log Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | … Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | … Rowvalue ::= Log Content 27 Apache Eagle – Uniform HBase Rowkey Design
  • 28. Apache Eagle - Machine Learning Intergration 28
  • 29. 29 User Activity Profiling Offline: Determine bandwidth from training dataset the kernel density function parameters (KDE) Online: If a test data point lies outside the trained bandwidth, it is anomaly (Policy) PCs(Principle Components) in EVD (Eigenvalue Value Decomposition) Kernel Density Function Apache Eagle – User/System Activity Profiling
  • 30. 30 Anomaly Metric Predictive Detection Offline: Analyzing and combining 500+ metrics together for causal anomaly detections (IG -> PCA -> GMM -> MCC) Online: Predictively alert for anomaly metrics Normal (Green) and Abnormal (Red) Data and Probability Distribution and Threshold Selection PCA (Principal Component Analysis) Apache Eagle - Anomaly Metric Predictive Detection Anomaly Metric Predictive Detection Case Study
  • 31. Agenda • About Apache Eagle • Architecture • Ecosystem • Q & A 31
  • 32. 32 Apps  Security  Hadoop  Cloud  Database Interface  Web Portal  REST Services  Analytics Visualization Integration  Ambari  Docker  Ranger  Dataguise Eagle Framework Distributed real-time framework for efficiently developing highly scalable monitoring applications Eagle Apps Securiy / Hadoop / Cloud / Database Eagle Interface REST Service / Management UI / Customizable Analytics Visualization Eagle Integration Ambari / Docker / Ranger / Dataguise Open Source Community-driven and Cross-community cooperation Eagle Framework Apache Eagle Ecosystem
  • 33. 33 Apache Eagle Ecosystem - Security How to Secure Hadoop in Realtime? • Apache Eagle • Apache Ranger • Apache Knox • Dataguise
  • 34. 34 Apache Eagle Ecosystem - Hadoop Eagle in Apache Amabri: natively be part of hadoop ecosystem
  • 35. 35 Apache Eagle Ecosystem - Docker Eagle in Docker: natively fly on Cloud/Container STORM KAFKA ZOOKEEPER HBASE HADOOP … Powered of git clone apache/incubator-eagle eagle-docker 15 + 1 = 1 docker pull apacheeagle
  • 36. 36 Apache Eagle Ecosystem - Open Source If you want to go fast, go alone. If you want to go far, go together. -- African Proverb
  • 37. Learn more about Apache Eagle 37 • EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE) • EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
  • 38. Q & A apache/incubator-eagle @TheApacheEagle @ApacheEagle http://eagle.incubator.apache.org The slide is licensed under Creative Commons Attribution 4.0 International license.

Editor's Notes

  1. 特色
  2. Tech stack decision about Message Bus
  3. Cross-Nodes Resource Balanace in production environment
  4. A good ecosystem will continiously keep the architecture renew and advanced no matter in technical stack iteration, human resource or business growth