© Verizon 2016 All Rights Reserved
Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of
their respective owners.
Near real-time network anomaly detection and traffic analysis
Pankaj Rastogi, Tech Manager
Debasish Das, Data Scientist
Agenda
• Network data overview
• DDoS as network anomaly
• Design challenges
• Trapezium overview
• Results
• Q&A
Network: Aggregated data overview
• Simple Network Management Protocol (SNMP)
> Network management console
> Network devices (routers, bridges, intelligent hubs)
• Data collection: Aggregated per router interface
• Inbound and outbound traffic statistics sampled at regular intervals
- Bits per second (bps)
- Packets per second (pps)
- CPU
- Memory
[Diagram: an SNMP manager polls routers over the SNMP protocol and collects SNMP statistics]
Network: Flow data overview
[Diagram: a web browser (192.168.1.10) and a web server (10.1.2.3) share one TCP connection, made up of request flow #1 and response flow #2]
• Flow #1
- Source address 192.168.1.10
- Destination address 10.1.2.3
- Source port 1025
- Destination port 80
- Protocol TCP
• Flow #2
- Source address 10.1.2.3
- Destination address 192.168.1.10
- Source port 80
- Destination port 1025
- Protocol TCP
• A single flow may consist of several packets and many bytes
• A TCP connection consists of two flows
- Each flow will mirror the other
- Can use TCP flags to determine the client and the server
• ICMP, UDP and other IP protocol streams may contain one or two flows
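The flow fields above can be sketched as a small data structure. This is an illustrative Python sketch, not code from the talk: the names `FlowKey`, `mirrored` and `connection_id` are hypothetical, and it only shows how the two flows of one TCP connection mirror each other.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The five fields that identify one unidirectional flow."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str

    def mirrored(self) -> "FlowKey":
        """Key of the opposite-direction flow: addresses and ports swap."""
        return FlowKey(self.dst_addr, self.src_addr,
                       self.dst_port, self.src_port, self.protocol)


def connection_id(key: FlowKey) -> tuple:
    """Direction-independent identifier shared by both flows of a connection."""
    a = (key.src_addr, key.src_port)
    b = (key.dst_addr, key.dst_port)
    return (min(a, b), max(a, b), key.protocol)


# Request flow from the browser to the web server, and its mirror.
request = FlowKey("192.168.1.10", "10.1.2.3", 1025, 80, "TCP")
response = request.mirrored()
```

Grouping flow records by `connection_id` is one way to pair a request flow with its response flow before computing per-connection features.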
DDoS as network anomaly
[Diagram: an attacker directs bots through remote command & control; bot traffic reaches the customer through a router. Diagram labels: NetFlow, SNMP, "Attacker + Bots + Customer locations", "Attacker + Bots + Customer IPs", "Customer + Volumetric attack magnitude"]
SNMP
Anomaly detection on time series
Nonparametric models for SNMP DDoS detection
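The slide does not say which nonparametric model was used. As one common illustration of the idea, the Python sketch below flags samples that sit far from a rolling median, measured in median absolute deviation (MAD) units. The function name, window size and threshold are assumptions made for this example.

```python
from statistics import median


def mad_anomalies(series, window=12, threshold=3.5):
    """Flag points far from the rolling median, measured in MAD units.

    Returns a list of (index, value) pairs for the flagged samples.
    """
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        center = median(history)
        mad = median(abs(x - center) for x in history)
        if mad == 0:
            # Simplification: a perfectly flat history gives no scale
            # to compare against, so the sample is skipped.
            continue
        # 0.6745 rescales MAD so the score is comparable to a z-score
        # when the data happens to be normally distributed.
        score = 0.6745 * abs(series[i] - center) / mad
        if score > threshold:
            flagged.append((i, series[i]))
    return flagged


# Made-up bits-per-second samples with one spike at index 13.
bps = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 100, 950, 101]
spikes = mad_anomalies(bps)
```

No distribution is fitted here, which is what makes the approach nonparametric: only the order statistics of the recent window are used.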
SNMP
Network Analysis on SNMP
• Usage of each router/interface
• Find routers that have high packet flow
NetFlow
Anomaly detection on high-frequency data
Parametric models for NetFlow DDoS detection
• Generate customer-IP-focused features based on the DDoS definition
[Plot: flow count (y-axis, 0 to 300,000) versus time (x-axis, 9/14/15 0:00 to 3:36)]
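The parametric model itself is not specified on the slide. A minimal Python sketch of the general idea, assuming a single feature (flows per interval) modeled as Gaussian, keeps a running mean and variance and scores each new sample against them. The class name, the threshold and the sample values are hypothetical.

```python
import math


class GaussianModel:
    """Running mean and variance of one feature (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0  # not enough data to estimate a spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else (x - self.mean) / std


def detect(flow_counts, threshold=4.0):
    """Score each sample against the model built so far, then learn from it."""
    model = GaussianModel()
    anomalies = []
    for i, x in enumerate(flow_counts):
        if abs(model.zscore(x)) > threshold:
            anomalies.append(i)
        else:
            model.update(x)  # only learn from samples that look normal
    return anomalies


# Made-up flow counts for one customer IP, with a burst at index 6.
counts = [1000, 1100, 900, 1050, 950, 1000, 250000, 1020]
burst_indices = detect(counts)
```

Because the model is just three numbers per feature, it is cheap to store and to update at every interval.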
NetFlow
Network Analysis on NetFlow
• Find customer with maximum upload bytes
• Find customer with maximum download bytes
• Find peak usage for given customer
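These three queries are simple aggregations over flow records. The Python sketch below uses a made-up record layout (customer, time bucket, direction, bytes) to show the shape of each query. The field names and values are hypothetical, and at scale the same logic would run as distributed group-by operations.

```python
from collections import defaultdict

# (customer, minute, direction, bytes); all values are made up for illustration
records = [
    ("cust-a", "00:00", "upload", 500),
    ("cust-a", "00:01", "download", 9000),
    ("cust-b", "00:00", "upload", 7000),
    ("cust-b", "00:01", "upload", 1000),
    ("cust-a", "00:01", "upload", 300),
    ("cust-b", "00:00", "download", 2000),
]


def top_customer(records, direction):
    """Customer with the largest total byte count in one direction."""
    totals = defaultdict(int)
    for customer, _, d, nbytes in records:
        if d == direction:
            totals[customer] += nbytes
    return max(totals.items(), key=lambda kv: kv[1])


def peak_usage(records, customer):
    """Time bucket in which the customer moved the most bytes (both directions)."""
    per_minute = defaultdict(int)
    for c, minute, _, nbytes in records:
        if c == customer:
            per_minute[minute] += nbytes
    return max(per_minute.items(), key=lambda kv: kv[1])
```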
Why we chose Apache Spark
• Good support for machine learning algorithms
• Spark’s micro-batching capabilities
> Sufficient for our streaming requirements
• Vibrant Spark community
• Excellent talent availability within our group
Lessons learned -- Spark
• Coalesce partitions when writing to HDFS
• A seemingly harmless action like take(1) can result in huge costs
• Multiple actions on a DataFrame/DStream result in multiple jobs
• Spark DStream checkpointing with RDD models
• spark.sql.parquet.compression.codec – snappy
• spark.sql.shuffle.partitions – 2000+ when partition block size crosses 2 GB
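The points about take(1) and multiple actions both follow from lazy evaluation. The toy Python class below is not the Spark API. It is a hypothetical stand-in that counts how often the full lineage is re-executed, assuming every action needs the complete upstream result (as it would after an expensive wide transformation).

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset; this is not the Spark API."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the full data
        self._cache_enabled = False
        self._cached = None
        self.runs = 0                  # how many times the lineage was executed

    def cache(self):
        self._cache_enabled = True
        return self

    def _materialize(self):
        if self._cache_enabled and self._cached is not None:
            return self._cached
        self.runs += 1
        data = self._compute()
        if self._cache_enabled:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def take(self, n):
        return self._materialize()[:n]


def expensive():
    return sorted([3, 1, 2])           # imagine a large shuffle here


uncached = LazyDataset(expensive)
uncached.take(1)
uncached.count()                        # second action, second full run

cached = LazyDataset(expensive).cache()
cached.take(1)
cached.count()                          # served from the cached result
```

In this toy model `uncached.runs` ends at 2 and `cached.runs` at 1, which is the cost difference the bullets warn about.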
Design challenges
NFS/GFS
Data source?
Algorithms?
Persistence?
Design challenges -- SNMP
Near-real-time model updates needed: Lambda architecture
• Batch job MUST process data at a fixed interval (e.g., 15 min)
• Stream job MUST
> Handle hot starts (e.g., 90 days of data)
> Analyze data and generate anomalies
> Update the model every sampling interval
> Start from the last model timestamp on restart
Coordination between Batch and Stream processes NEEDED
• Batch job updates a ZooKeeper node at a fixed interval (e.g., 15 min)
• Stream job uses the same ZooKeeper node to load features
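The coordination pattern can be sketched without a real ZooKeeper ensemble. In the Python sketch below, `SharedNode` is an in-memory stand-in for the ZooKeeper node, and `StreamJob`, `batch_job` and the timestamp values are hypothetical names chosen for illustration. The batch job publishes the timestamp of its latest run, and the stream job reloads features only when that timestamp changes.

```python
class SharedNode:
    """In-memory stand-in for the ZooKeeper node; not a ZooKeeper client."""

    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value

    def get(self):
        return self.value


class StreamJob:
    def __init__(self, node, load_features):
        self.node = node
        self.load_features = load_features
        self.loaded_at = None   # batch timestamp of the features in use
        self.features = None

    def on_micro_batch(self):
        """Reload features only when the batch job published a newer timestamp."""
        published = self.node.get()
        if published is not None and published != self.loaded_at:
            self.features = self.load_features(published)
            self.loaded_at = published
        return self.features


def batch_job(node, run_timestamp):
    """Hypothetical batch run: persist features elsewhere, then publish the timestamp."""
    node.set(run_timestamp)


node = SharedNode()
stream = StreamJob(node, load_features=lambda ts: {"as_of": ts})
batch_job(node, 900)                    # e.g. first 15-minute batch run
current = stream.on_micro_batch()       # stream picks up the new features
```

Keeping `loaded_at` in durable storage would also let the stream job resume from the last model timestamp after a restart.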
Design challenges -- NetFlow
Seed the model with good parameter estimates
• Batch job populates the initial model parameters
• Stream job hot-starts with the model and detects anomalies
• Stream job updates the model and persists it to Cassandra
Model maintained in Cassandra
• Stream job reads the model into Spark partitions from Cassandra
• Each Spark partition updates the model
• Each Spark partition generates anomalies
• Models across partitions are combined using Spark
• Anomalies are persisted to Cassandra
Network analysis
• Find peak usage for a given customer
• Find customer with highest network usage
• Find number of distinct source IPs connected to a destination IP
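Combining models across partitions works when the model is a mergeable summary. The slides do not give the model's form, so the Python sketch below assumes a per-partition model of (count, mean, sum of squared deviations) and merges two such models with the standard pairwise formula. The function names and data are hypothetical.

```python
from functools import reduce


def fit(values):
    """Per-partition model: (count, mean, sum of squared deviations)."""
    n = len(values)
    mean = sum(values) / n if n else 0.0
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def combine(a, b):
    """Merge two partition models without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


# Three made-up partitions of one feature.
partitions = [[10.0, 12.0, 11.0], [9.0, 13.0], [10.0, 10.0, 12.0, 11.0]]
merged = reduce(combine, (fit(p) for p in partitions))
```

Because `combine` is associative, the same function could serve as the reduce step over partition-level results, and the merged model matches one fitted on all the data at once.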
Network anomaly flow design
Design challenges – multiple applications
Trapezium
What is Trapezium?
• Ability to read data
> From multiple data sources, e.g., HDFS, NFS, Kafka
> In Batch and Streaming modes to support lambda architecture
• Ability to write data
> To multiple data sources, e.g., HDFS, NFS, Kafka
• Plug and Play architecture
> Evaluate multiple algorithms
> Evaluate different features of same algorithm
• Break down a complex analytics problem into Transactions
• Build a workflow pipeline combining different Transactions
• Validation and filtering of input data
• Embedded ZooKeeper, Kafka, Cassandra (C*), HBase, etc. available for unit tests
• Enable real time query processing capability
> Akka HTTP server provides Spark as a Service
Trapezium architecture
[Diagram: data sources D1, D2 and D3 enter Trapezium, pass through validation (V1), feed various transactions, and produce outputs O1, O2 and O3]
Workflow
hdfsFileBatch = {
  batchTime = 5
  batchInfo = [{
    name = "hdfs_source"
    dataDirectory = {prod = "/prod/data/files"}
  }]
}
transactions = [{
  transactionName="com.verizon.bda.DataAggregator"
  inputData=[{ name="hdfs_source" }]
  persistDataName="aggregatedOutput"
},{
  transactionName="com.verizon.bda.DataAligner"
  inputData=[{ name="aggregatedOutput" }]
  persistDataName="alignedOutput"
},{
  transactionName="com.verizon.bda.AnomalyFinder"
  inputData=[{ name="aggregatedOutput" }, { name="alignedOutput" }]
  persistDataName="anomalyOutput"
}]
• Workflow is a collection of transactions in batch or streaming mode
• Each transaction can take multiple data sources as input
• Output of one transaction can be input to another transaction
• Output of each transaction could be persisted or kept only in memory
• Single place to handle exceptions and raise failure events
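Since the output of one transaction can feed another, a workflow implies a dependency order. The Python sketch below is not how Trapezium schedules transactions. It is a hypothetical illustration that mirrors the three transactions from the config as plain dictionaries and orders them so that each one runs after its inputs exist.

```python
# Transactions from the workflow config, reduced to names, inputs and output.
workflow = [
    {"name": "DataAggregator", "inputs": ["hdfs_source"],
     "output": "aggregatedOutput"},
    {"name": "DataAligner", "inputs": ["aggregatedOutput"],
     "output": "alignedOutput"},
    {"name": "AnomalyFinder", "inputs": ["aggregatedOutput", "alignedOutput"],
     "output": "anomalyOutput"},
]


def execution_order(transactions, sources):
    """Order transactions so each runs after everything it reads is available."""
    available = set(sources)
    pending = list(transactions)
    ordered = []
    while pending:
        ready = [t for t in pending
                 if all(i in available for i in t["inputs"])]
        if not ready:
            missing = {i for t in pending for i in t["inputs"]} - available
            raise ValueError(f"unresolvable inputs: {sorted(missing)}")
        for t in ready:
            ordered.append(t["name"])
            available.add(t["output"])
            pending.remove(t)
    return ordered


order = execution_order(workflow, sources=["hdfs_source"])
```

A missing or misspelled input name surfaces as a single `ValueError`, which fits the idea of one place to handle workflow failures.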
Transaction Traits
Support data sources
• Trapezium can read data from HDFS, Kafka, NFS, GFS
• Config entry for reading data from HDFS/NFS/GFS
dataSource="HDFS"
dataDirectory = {
  local="/local/data/files"
  dev="/dev/data/files"
  prod="/prod/data/files"
}
• Config entry for defining protocol
fileSystemPrefix="hdfs://"
fileSystemPrefix="file://"
fileSystemPrefix="s3://"
• Trapezium can read data in various formats including text, gzip, json, avro and parquet
• Config entry for reading fileFormat
fileFormat="avro"
fileFormat="json"
fileFormat="parquet"
• Config entry for reading from Kafka topics
kafkaTopicInfo = {
  consumerGroup = "KafkaStreamGroup"
  maxRatePerPartition = 970
  batchTime = "5"
  streamsInfo = [{
    name = "queries"
    topicName = "deviceanalyzer"
  }]
}
Run modes
• Trapezium supports reading data in batch as well as streaming mode
• Config entry for reading in batch mode
runMode="BATCH"
batchTime=5
• Config entry for reading in stream mode
runMode="STREAM"
batchTime=5
• Read data by timestamp
offset=2
• Process historical data in sequence of smaller data sets
fileSplit=true
• Process same data multiple times
oneTime=true
Data validation
• Validates data at the source
• Filters out all invalid rows
• Validates schema of the input data
• Config entry for data validation
validation = {
  columns = ["name", "age", "birthday", "location"]
  datatypes = ["String", "Int", "Timestamp", "String"]
  dateFormat = "yyyy-MM-dd HH:mm:ss"
  delimiter = "|"
  minimumColumn = 4
  rules = {
    name=[maxLength(30),minLength(1)]
    age=[maxValue(100),minValue(1)]
  }
}
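To make the effect of such a config concrete, the Python sketch below applies the same columns, types, delimiter and rules to individual rows. It is one reading of the config, not Trapezium's implementation: the dictionary layout, the `parse_row` name and the treatment of `minimumColumn` as a minimum field count are assumptions.

```python
from datetime import datetime

# Hypothetical Python mirror of the validation config.
validation = {
    "columns": ["name", "age", "birthday", "location"],
    "datatypes": ["String", "Int", "Timestamp", "String"],
    "dateFormat": "%Y-%m-%d %H:%M:%S",  # Python form of yyyy-MM-dd HH:mm:ss
    "delimiter": "|",
    "minimumColumn": 4,
    "rules": {
        "name": {"maxLength": 30, "minLength": 1},
        "age": {"maxValue": 100, "minValue": 1},
    },
}


def parse_row(line, cfg):
    """Return a dict for a valid row, or None if the row should be filtered out."""
    fields = line.split(cfg["delimiter"])
    if len(fields) < cfg["minimumColumn"]:
        return None
    row = {}
    for name, dtype, raw in zip(cfg["columns"], cfg["datatypes"], fields):
        try:
            if dtype == "Int":
                value = int(raw)
            elif dtype == "Timestamp":
                value = datetime.strptime(raw, cfg["dateFormat"])
            else:
                value = raw
        except ValueError:
            return None  # value does not match the declared type
        rule = cfg["rules"].get(name, {})
        if isinstance(value, str):
            low = rule.get("minLength", 0)
            high = rule.get("maxLength", float("inf"))
            if not low <= len(value) <= high:
                return None
        elif isinstance(value, int):
            low = rule.get("minValue", value)
            high = rule.get("maxValue", value)
            if not low <= value <= high:
                return None
        row[name] = value
    return row
```

Rows that fail any check come back as `None`, so filtering invalid rows is a matter of dropping the `None` results.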
Plug and play capability
• Any transaction can be added/removed by modifying workflow config file
• Output from multiple algorithms can be compared in real time
• Multiple features can be evaluated in different transactions
• Data sources can be switched with config change
• Model training can be done on different time windows to achieve best results
Trapezium – GitHub URL
https://github.com/Verizon/trapezium
Version: 1.0.0-SNAPSHOT
Release: 14-Oct-2016
Results
SNMP
Spark runtime with Hive/C* read/write
Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day
Compute: 10 executors, 4 cores
Memory: 16 GB per executor, 4 GB driver
With sampling rate of 2 min:
• 2 nodes with 20 cores each for 10 routers
• 200 nodes for 1000 routers
With sampling rate of 4 min:
• 2 nodes can process 20 routers
• 100 nodes for 1000 routers
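The 1000-router figures follow from the measured 2-node results by linear extrapolation. The short Python sketch below reproduces that arithmetic. It assumes capacity scales linearly with node count, which the slide implies but does not demonstrate, and the function name is hypothetical.

```python
import math


def nodes_needed(routers, routers_per_group, nodes_per_group=2):
    """Linear extrapolation: every group of routers needs its own pair of nodes."""
    return math.ceil(routers / routers_per_group) * nodes_per_group


two_minute = nodes_needed(1000, routers_per_group=10)   # 2-minute sampling
four_minute = nodes_needed(1000, routers_per_group=20)  # 4-minute sampling
```

Doubling the sampling interval doubles the number of routers each pair of nodes can handle, which halves the cluster size for the same router count.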
SNMP
Spark shuffle – read/write
Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day
Compute: 10 executors, 4 cores
Memory: 16 GB per executor, 4 GB driver
NetFlow
Spark + C* read/write runtime
Data volume: 2 routers, 50 MB per min, 70 GB per day
Compute: 10 executors, 4 cores
Memory: 16 GB per executor, 4 GB driver
• Due to the parametric model, run time is better than SNMP
• NetFlow data is X times more than SNMP data
Routers        2     4     8     16    32
Runtime (s)    16    18    32    47    94.8
NetFlow
Spark + C* shuffle write
Routers                   2      4       8       16      32
Spark shuffle (MB)        71.2   150.5   275.7   612.1   1261.4
Cassandra shuffle (MB)    30.2   64.4    115.6   263.7   545.1
Summary
• Reuse code across multiple applications
• Improve developer efficiency
• Encourage standard coding practices
• Provide unit-test framework for better code coverage
• Decouple ETL, analytics and algorithms in different Transactions
• Distribute query processing using Spark as a service
• Easy integration provided by configuration driven architecture
Thank you

Конференция Brocade. 3. Повышение гибкости и эффективности применения баланси...
 
ITN_Module_17.pptx
ITN_Module_17.pptxITN_Module_17.pptx
ITN_Module_17.pptx
 
IPv6 and IP Multicast… better together?
IPv6 and IP Multicast… better together?IPv6 and IP Multicast… better together?
IPv6 and IP Multicast… better together?
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
IPv6 Security - Myths and Reality
IPv6 Security - Myths and RealityIPv6 Security - Myths and Reality
IPv6 Security - Myths and Reality
 
Is IPv6 Security Still an Afterthought?
Is IPv6 Security Still an Afterthought?Is IPv6 Security Still an Afterthought?
Is IPv6 Security Still an Afterthought?
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark based Lambda Architecture

  • 1. © Verizon 2016 All Rights Reserved Information contained herein is provided AS IS and subject to change without notice. All trademarks used herein are property of their respective owners. 1 Near real-time network anomaly detection and traffic analysis Pankaj Rastogi Tech Manager Debasish Das Data Scientist
  • 2. Agenda
      • Network data overview
      • DDoS as network anomaly
      • Design challenges
      • Trapezium overview
      • Results
      • Q&A
  • 3. Network: Aggregated data overview
      • Simple Network Management Protocol (SNMP)
        - Network management console
        - Network devices (routers, bridges, intelligent hubs)
      • Data collection: aggregated per router interface
      • Inbound and outbound traffic statistics sampled at regular intervals
        - Bits per second (bps)
        - Packets per second (pps)
        - CPU
        - Memory
      [Diagram: routers report SNMP statistics to the SNMP manager over the SNMP protocol]
  • 4. Network: Flow data overview
      [Diagram: web browser 192.168.1.10 and web server 10.1.2.3 share one TCP connection; request flow #1, response flow #2]
      • Flow #1
        - Source address 192.168.1.10
        - Destination address 10.1.2.3
        - Source port 1025
        - Destination port 80
        - Protocol TCP
      • Flow #2
        - Source address 10.1.2.3
        - Destination address 192.168.1.10
        - Source port 80
        - Destination port 1025
        - Protocol TCP
      • A single flow may consist of several packets and many bytes
      • A TCP connection consists of two flows
        - Each flow mirrors the other
        - TCP flags can be used to determine the client and the server
      • ICMP, UDP and other IP protocol streams may contain one or two flows
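The mirroring of the two flows can be sketched in a few lines of Python. This is only an illustration: the `Flow` type, `mirror` and `connection_key` are hypothetical names, and the response's ports are taken to be the request's ports swapped, as mirroring implies.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One direction of traffic, identified by the usual 5-tuple."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str


def mirror(flow: Flow) -> Flow:
    """Return the flow in the opposite direction of the same connection."""
    return Flow(flow.dst_addr, flow.src_addr,
                flow.dst_port, flow.src_port, flow.protocol)


def connection_key(flow: Flow) -> tuple:
    """Direction-independent key: both flows of a connection map to it."""
    a = (flow.src_addr, flow.src_port)
    b = (flow.dst_addr, flow.dst_port)
    return (flow.protocol, min(a, b), max(a, b))


request = Flow("192.168.1.10", "10.1.2.3", 1025, 80, "TCP")
response = mirror(request)
```

Grouping flow records by `connection_key` is one way to pair a request flow with its response flow before computing per-connection features.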
  • 5. DDoS as network anomaly
      [Diagram: an attacker with remote command & control, bots, a router and the customer]
      Diagram labels: Attacker + Bots + Customer locations; Attacker + Bots + Customer IPs; NetFlow; SNMP; Customer + Volumetric attack magnitude
  • 6. SNMP: Anomaly detection on time series
      • Nonparametric models for SNMP DDoS detection
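The slide does not say which nonparametric model is used. As one possible illustration, and not the authors' method, the sketch below flags a sample when it lies far from the median of the preceding window, scaled by the median absolute deviation (MAD). The function name, window size, threshold and sample values are all assumptions.

```python
from statistics import median


def mad_anomalies(series, window=12, threshold=5.0):
    """Return indices of points far from the median of the preceding window."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        centre = median(history)
        # Median absolute deviation: a robust, distribution-free spread estimate.
        mad = median(abs(x - centre) for x in history)
        scale = mad if mad > 0 else 1e-9  # avoid division by zero on flat data
        if abs(series[i] - centre) / scale > threshold:
            flagged.append(i)
    return flagged


# Made-up bits-per-second samples for one router interface, with one spike.
bps = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 99, 102, 100, 950, 101]
spikes = mad_anomalies(bps)
```

Because the median and MAD are barely moved by a single extreme value, the spike does not mask the samples that follow it.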
  • 7. SNMP: Network analysis on SNMP
      • Usage of each router/interface
      • Find routers with high packet flow
  • 8. NetFlow: Anomaly detection on high-frequency data
      • Parametric models for NetFlow DDoS detection
      • Generate customer-IP-focused features based on the DDoS definition
      [Chart: flow count (axis 0 to 300,000) over time, 9/14/15 0:00 to 3:36]
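The slide does not name the parametric model. A minimal sketch of the general idea, assuming a Gaussian model of the per-interval flow count for one customer IP, keeps a running mean and variance (Welford's update) and flags counts with a large z-score. The class, the threshold, the warm-up length and the counts are illustrative assumptions.

```python
import math


class RunningGaussian:
    """Running mean and variance of a stream (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0


def detect(counts, threshold=3.0, warmup=5):
    """Flag indices whose count is far above the model; learn from the rest."""
    model = RunningGaussian()
    flagged = []
    for i, x in enumerate(counts):
        if model.n >= warmup and model.zscore(x) > threshold:
            flagged.append(i)  # anomalous points are not fed into the model
        else:
            model.update(x)
    return flagged


# Made-up flow counts per interval for one customer IP.
flow_counts = [10, 12, 11, 9, 13, 10, 11, 12, 300, 11]
anomalies = detect(flow_counts)
```

The model state is only three numbers per customer IP, which is what makes it cheap to persist and to update on every micro-batch.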
  • 9. NetFlow: Network analysis on NetFlow
      • Find the customer with maximum upload bytes
      • Find the customer with maximum download bytes
      • Find peak usage for a given customer
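These three queries amount to simple group-by aggregations. The sketch below runs them over a handful of in-memory records; the record layout, customer names and byte counts are made up for illustration, and a production version would express the same logic over Spark DataFrames.

```python
from collections import defaultdict

# (customer, minute, upload_bytes, download_bytes) -- made-up values
records = [
    ("cust-a", "00:00", 500, 2000),
    ("cust-a", "00:01", 700, 9000),
    ("cust-b", "00:00", 3000, 100),
    ("cust-b", "00:01", 200, 400),
]

UPLOAD, DOWNLOAD = 2, 3  # positions of the byte counters in a record


def max_by_total(records, field):
    """Return (customer, total) for the customer with the largest total."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[0]] += rec[field]
    return max(totals.items(), key=lambda kv: kv[1])


def peak_usage(records, customer):
    """Return (minute, bytes) of the busiest minute for one customer."""
    usage = defaultdict(int)
    for cust, minute, up, down in records:
        if cust == customer:
            usage[minute] += up + down
    return max(usage.items(), key=lambda kv: kv[1])


top_uploader = max_by_total(records, UPLOAD)
top_downloader = max_by_total(records, DOWNLOAD)
busiest = peak_usage(records, "cust-a")
```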
  • 10. Why we chose Apache Spark
      • Good support for machine learning algorithms
      • Spark's micro-batching capabilities
        > Sufficient for our streaming requirements
      • Vibrant Spark community
      • Excellent talent availability within our group
  • 11. Lessons learned -- Spark
      • Coalesce partitions when writing to HDFS
      • A seemingly harmless action like take(1) can result in huge costs
      • Multiple actions on a DataFrame/DStream result in multiple jobs
      • Spark DStream checkpointing with RDD models
      • spark.sql.parquet.compression.codec: snappy
      • spark.sql.shuffle.partitions: 2000+ when partition block size crosses 2 GB
  • 12. Design challenges
      • Data source? (NFS/GFS)
      • Algorithms?
      • Persistence?
  • 13. Design challenges -- SNMP
      Near real-time model updates needed
      Lambda architecture
      • Batch job MUST process data at a fixed interval (e.g., 15 min)
      • Stream job MUST
        > Handle hot starts (e.g., 90 days of data)
        > Analyze data and generate anomalies
        > Update the model every sampling interval
        > Start from the last model timestamp on restart
      Coordination between batch and stream processes NEEDED
      • Batch job updates a ZooKeeper node at a fixed interval (e.g., 15 min)
      • Stream job uses the same ZooKeeper node to load features
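The batch/stream handshake can be sketched as follows. To keep the example runnable without a ZooKeeper ensemble, `FakeZkNode` is an in-memory stand-in and not a real client; the class names and the choice of storing a timestamp in the node are assumptions made for illustration.

```python
class FakeZkNode:
    """In-memory stand-in for a ZooKeeper node (not a real ZooKeeper client)."""

    def __init__(self):
        self._value = None

    def set(self, value):
        self._value = value

    def get(self):
        return self._value


class BatchJob:
    def __init__(self, node):
        self.node = node

    def run(self, timestamp):
        # ... compute and persist features for the interval ending at timestamp ...
        self.node.set(timestamp)  # publish "features are ready up to here"


class StreamJob:
    def __init__(self, node):
        self.node = node
        self.loaded = None  # timestamp of the features currently in use
        self.reloads = 0

    def on_micro_batch(self):
        latest = self.node.get()
        if latest is not None and latest != self.loaded:
            # ... load the features written by the batch job ...
            self.loaded = latest
            self.reloads += 1
        return self.loaded


node = FakeZkNode()
batch, stream = BatchJob(node), StreamJob(node)
batch.run(900)            # first 15-minute batch, in seconds
stream.on_micro_batch()   # picks up the new features
stream.on_micro_batch()   # nothing new, no reload
batch.run(1800)
stream.on_micro_batch()   # picks up the second batch
```

The stream job polls the node on every micro-batch and reloads only when the published value changes, so the two jobs never need to talk to each other directly.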
  • 14. Design challenges -- NetFlow
      Seed the model with good parameter estimates
      • Batch job populates the initial model parameters
      • Stream job hot-starts with the model and detects anomalies
      • Stream job updates the model and persists it to Cassandra
      Model maintained in Cassandra
      • Stream job reads the model from Cassandra into Spark partitions
      • Each Spark partition updates the model
      • Each Spark partition generates anomalies
      • Models across partitions are combined using Spark
      • Anomalies are persisted to Cassandra
      Network analysis
      • Find peak usage for a given customer
      • Find the customer with the highest network usage
      • Find the number of distinct source IPs connected to a destination IP
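The slides do not describe how models from different partitions are combined. As a hedged illustration, suppose each partition's model is a (count, mean, sum of squared deviations) triple; such triples can be merged exactly with the pairwise formula of Chan et al., which is the kind of associative operation a Spark reduce needs. The function names and sample values are hypothetical.

```python
from functools import reduce


def summarize(values):
    """Model for one partition: (count, mean, sum of squared deviations)."""
    n = len(values)
    if n == 0:
        return (0, 0.0, 0.0)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return (n, mean, m2)


def merge(a, b):
    """Combine two partition models without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)


# Made-up feature values split across three partitions.
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
combined = reduce(merge, (summarize(p) for p in partitions))
```

Because `merge` is associative, the partitions can be combined in any grouping and still give the same model as a single pass over all the data.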
  • 15. Network anomaly flow design
  • 16. Design challenges – multiple applications
  • 17. Trapezium
  • 18. What is Trapezium?
  • 19. What is Trapezium?
      • Ability to read data
        > From multiple data sources, e.g., HDFS, NFS, Kafka
        > In batch and streaming modes to support lambda architecture
      • Ability to write data
        > To multiple data sources, e.g., HDFS, NFS, Kafka
      • Plug and play architecture
        > Evaluate multiple algorithms
        > Evaluate different features of the same algorithm
      • Break down a complex analytics problem into Transactions
      • Build a workflow pipeline combining different Transactions
      • Validation and filtering of input data
      • Embedded ZooKeeper, Kafka, C*, HBase, etc. available for unit tests
      • Enable real-time query processing capability
        > Akka HTTP server provides Spark as a Service
  • 20. Trapezium architecture
      [Diagram: data sources D1, D2, D3 enter Trapezium, pass through validation (V1) and various transactions, and produce outputs O1, O2, O3]
  • 21. Workflow
        hdfsFileBatch = {
          batchTime = 5
          batchInfo = [{
            name = "hdfs_source"
            dataDirectory = {prod = "/prod/data/files"}
          }]
        }
        transactions = [{
          transactionName = "com.verizon.bda.DataAggregator"
          inputData = [{ name = "hdfs_source" }]
          persistDataName = "aggregatedOutput"
        }, {
          transactionName = "com.verizon.bda.DataAligner"
          inputData = [{ name = "aggregatedOutput" }]
          persistDataName = "alignedOutput"
        }, {
          transactionName = "com.verizon.bda.AnomalyFinder"
          inputData = [{ name = "aggregatedOutput" }, { name = "alignedOutput" }]
          persistDataName = "anomalyOutput"
        }]
      • A workflow is a collection of transactions in batch or streaming mode
      • Each transaction can take multiple data sources as input
      • The output of one transaction can be the input to another transaction
      • The output of each transaction can be persisted or kept only in memory
      • Single place to handle exceptions and raise failure events
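Since one transaction's output can feed another, the workflow forms a dependency graph. The sketch below derives a valid execution order from the three transactions in the config above using Python's standard `graphlib`. It illustrates the dependency idea only; whether Trapezium orders transactions this way or simply runs them as listed is not stated in the slides, and the dictionary layout is an assumption.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# The three transactions from the workflow config, deliberately out of order.
transactions = [
    {"name": "com.verizon.bda.AnomalyFinder",
     "inputs": ["aggregatedOutput", "alignedOutput"],
     "output": "anomalyOutput"},
    {"name": "com.verizon.bda.DataAligner",
     "inputs": ["aggregatedOutput"],
     "output": "alignedOutput"},
    {"name": "com.verizon.bda.DataAggregator",
     "inputs": ["hdfs_source"],
     "output": "aggregatedOutput"},
]


def execution_order(transactions):
    """Order transactions so each runs after the ones producing its inputs."""
    producer = {t["output"]: t["name"] for t in transactions}
    # Map each transaction to the transactions it depends on; inputs such as
    # "hdfs_source" come from an external source and add no dependency.
    graph = {
        t["name"]: {producer[i] for i in t["inputs"] if i in producer}
        for t in transactions
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(transactions)
```

`TopologicalSorter` raises `CycleError` if two transactions depend on each other, which would be a natural place to reject an invalid workflow.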
  • 22. Transaction Traits
  • 23. Transaction Traits
  • 24. Supported data sources
      • Trapezium can read data from HDFS, Kafka, NFS, GFS
      • Config entry for reading data from HDFS/NFS/GFS
          dataSource = "HDFS"
          dataDirectory = {
            local = "/local/data/files"
            dev = "/dev/data/files"
            prod = "/prod/data/files"
          }
      • Config entry for defining protocol (alternatives)
          fileSystemPrefix = "hdfs://"
          fileSystemPrefix = "file://"
          fileSystemPrefix = "s3://"
      • Trapezium can read data in various formats including text, gzip, json, avro and parquet
      • Config entry for reading from Kafka topics
          kafkaTopicInfo = {
            consumerGroup = "KafkaStreamGroup"
            maxRatePerPartition = 970
            batchTime = "5"
            streamsInfo = [{
              name = "queries"
              topicName = "deviceanalyzer"
            }]
          }
      • Config entry for reading fileFormat (alternatives)
          fileFormat = "avro"
          fileFormat = "json"
          fileFormat = "parquet"
  • 25. Run modes
      • Trapezium supports reading data in batch as well as streaming mode
      • Config entry for reading in batch mode
          runMode = "BATCH"
          batchTime = 5
      • Config entry for reading in stream mode
          runMode = "STREAM"
          batchTime = 5
      • Read data by timestamp
          offset = 2
      • Process historical data in sequence of smaller data sets
          fileSplit = true
      • Process same data multiple times
          oneTime = true
  • 26. Data validation
      • Validates data at the source
      • Filters out all invalid rows
      • Validates schema of the input data
      • Config entry for data validation
          validation = {
            columns = ["name", "age", "birthday", "location"]
            datatypes = ["String", "Int", "Timestamp", "String"]
            dateFormat = "yyyy-MM-dd HH:mm:ss"
            delimiter = "|"
            minimumColumn = 4
            rules = {
              name = [maxLength(30), minLength(1)]
              age = [maxValue(100), minValue(1)]
            }
          }
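To make the effect of this validation config concrete, the sketch below applies the same column list, delimiter, date format and rules to a few made-up rows in plain Python. It is an illustration of the behaviour the slide describes (invalid rows are filtered out), not Trapezium's implementation; the function names and sample rows are hypothetical.

```python
from datetime import datetime

COLUMNS = ["name", "age", "birthday", "location"]
DELIMITER = "|"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of yyyy-MM-dd HH:mm:ss


def validate_row(line):
    """Return the parsed row as a dict, or None if any check fails."""
    fields = line.split(DELIMITER)
    if len(fields) < len(COLUMNS):           # minimumColumn = 4
        return None
    name, age, birthday, location = fields[:4]
    if not 1 <= len(name) <= 30:             # minLength(1), maxLength(30)
        return None
    try:
        age_value = int(age)                 # datatype Int
        born = datetime.strptime(birthday, DATE_FORMAT)  # datatype Timestamp
    except ValueError:
        return None
    if not 1 <= age_value <= 100:            # minValue(1), maxValue(100)
        return None
    return {"name": name, "age": age_value,
            "birthday": born, "location": location}


def filter_valid(lines):
    """Keep only the rows that pass every check."""
    return [row for row in map(validate_row, lines) if row is not None]


rows = [
    "Ada|36|1980-12-10 08:30:00|London",   # valid
    "|36|1980-12-10 08:30:00|London",      # empty name
    "Bob|150|1980-12-10 08:30:00|Paris",   # age out of range
    "Eve|30|not-a-date|Paris",             # bad timestamp
    "Tom|40",                              # too few columns
]
valid = filter_valid(rows)
```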
  • 27. Plug and play capability
      • Any transaction can be added or removed by modifying the workflow config file
      • Output from multiple algorithms can be compared in real time
      • Multiple features can be evaluated in different transactions
      • Data sources can be switched with a config change
      • Model training can be done on different time windows to achieve the best results
  • 28. Trapezium – github url https://github.com/Verizon/trapezium Version: 1.0.0-SNAPSHOT Release: 14-Oct-2016
  • 29. Results
  • 30. SNMP Spark runtime with Hive/C* read/write • Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day • Compute: 10 executors, 4 cores • Memory: 16 GB per executor, 4 GB driver • With sampling rate of 2 min: 2 nodes with 20 cores each for 10 routers; 200 nodes for 1000 routers • With sampling rate of 4 min: 2 nodes can process 20 routers; 100 nodes for 1000 routers
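The node counts on this slide follow from linear scaling of the measured 2-node setup. The sketch below reproduces that arithmetic under the assumption that capacity scales linearly with router count; the function name `nodes_needed` is hypothetical and not part of Trapezium.

```python
import math


def nodes_needed(total_routers, routers_per_group, nodes_per_group=2):
    """Estimate cluster size, assuming each group of routers needs a fixed
    number of nodes and that capacity scales linearly (an assumption)."""
    groups = math.ceil(total_routers / routers_per_group)
    return groups * nodes_per_group


# 2 min sampling: 2 nodes handle 10 routers.
nodes_2min = nodes_needed(1000, routers_per_group=10)
# 4 min sampling: 2 nodes handle 20 routers.
nodes_4min = nodes_needed(1000, routers_per_group=20)
print(nodes_2min, nodes_4min)
```

Doubling the sampling interval halves the estimated cluster size because the same two nodes cover twice as many routers.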
  • 31. SNMP Spark shuffle – read/write Data volume: 10 routers, 2.2 MB per 5 min, 650 MB per day Compute: 10 executors, 4 cores Memory: 16 GB per executor, 4 GB driver
  • 32. NetFlow Spark + C* read/write runtime • Data volume: 2 routers, 50 MB per min, 70 GB per day • Compute: 10 executors, 4 cores • Memory: 16 GB per executor, 4 GB driver • Due to parametric model, run time is better than SNMP • NetFlow data is X times more than SNMP data • Chart: runtime (s) by number of routers: 2 routers: 16; 4 routers: 18; 8 routers: 32; 16 routers: 47; 32 routers: 94.8
  • 33. NetFlow Spark + C* shuffle write • Shuffle (MB) by number of routers (2, 4, 8, 16, 32) • Spark: 71.2, 150.5, 275.7, 612.1, 1261.4 • Cassandra: 30.2, 64.4, 115.6, 263.7, 545.1
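One way to read the shuffle figures on this slide is to normalize them per router. The sketch below uses only the numbers from the slide; the per-router normalization and the name `shuffle_per_router` are illustrative assumptions, not part of the original analysis.

```python
# Shuffle write (MB) from the slide, keyed by series, ordered by router count.
ROUTERS = [2, 4, 8, 16, 32]
SHUFFLE_MB = {
    "Spark": [71.2, 150.5, 275.7, 612.1, 1261.4],
    "Cassandra": [30.2, 64.4, 115.6, 263.7, 545.1],
}


def shuffle_per_router(routers, shuffle_mb):
    """Divide each shuffle volume by its router count."""
    return {
        series: [mb / n for mb, n in zip(values, routers)]
        for series, values in shuffle_mb.items()
    }


per_router = shuffle_per_router(ROUTERS, SHUFFLE_MB)
for series, values in per_router.items():
    print(series, [round(v, 1) for v in values])
```

If the per-router values stay in a narrow band as the router count grows, shuffle volume is growing roughly linearly with the number of routers.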
  • 34. Summary • Reuse code across multiple applications • Improve developer efficiency • Encourage standard coding practices • Provide unit-test framework for better code coverage • Decouple ETL, analytics and algorithms in different Transactions • Distribute query processing using Spark as a service • Easy integration provided by configuration driven architecture
  • 35. Thank you