ANOMALY DETECTION AT SCALE:
A CYBERSECURITY STREAMING DATA PIPELINE USING KAFKA AND AKKA CLUSTERING
O'Reilly Security Conference NYC, November 2, 2016
Jeff Henrikson
Groovescale
http://www.groovescale.com
OUTLINE
framing
problem statement
streaming tech concepts
outline of solution
architecture, learnings
FRAMING
Why build predictive models?
Models continue to do useful work after humans are no longer looking
Models are based on assumptions
Only humans can make assumptions
INTRUSION DETECTION
1) Log Data
2) Configure rules
3) Human awareness examines alarms and logs
4) Quick action taken (e.g. deauthorize)
5) Re-authorize once human awareness deems longer-term mitigation is adequate
Sometimes for high-confidence rules we allow 2) to trigger 4) without human intervention
HOW CAN A SKILLED PERSON'S AWARENESS BE MORE EFFECTIVELY GUIDED?
1) Matching of network behavior against localized rules
2) Predictive modeling of the aggregate network behavior
Hypothesis: Let's see if 2 is better.
AI Artificial Intelligence
"IA" Intelligence Augmented
From "Building practical AI systems", Adam Cheyer (Siri, Sentient, and Viv Labs), Strata 2016
INTRUSION DETECTION TOOLS AS "INTELLIGENCE AUGMENTED"
Intruders are trying to evade detection.
Let's not worry about making the human protector of the network go away. That is probably not possible given the intruders' evasive response.
PROBLEM STATEMENT
NETWORK PACKET BROKER
CAPTURE SERVER
dumpcap (from Wireshark)
NETFLOW (V5) BASICS
Attributes:
Source/Destination IP
Source/Destination Port
Input interface
Metrics: Number of Packets, Sum of Bytes, Start Time, End Time.
IPv4 only
https://nsrc.org/workshops/2015/sanog25-nmm-tutorial/materials/netflow.pdf
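As a rough illustration of how the attributes and metrics above fit together, here is a minimal Python sketch that folds captured packets into flow records. The names Packet, FlowKey, FlowRecord and aggregate_flows are hypothetical and not from the talk. The key follows the attribute list on this slide only; real NetFlow v5 records also carry protocol and type of service, and exporters expire flows on timeouts, neither of which is modeled here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, NamedTuple


class Packet(NamedTuple):
    """One captured packet, reduced to the fields used on the slide."""
    ts: float        # capture timestamp in seconds
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    in_iface: int    # input interface index
    length: int      # packet size in bytes


class FlowKey(NamedTuple):
    """Attributes that identify a flow (slide's list; assumption: no protocol/ToS)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    in_iface: int


@dataclass
class FlowRecord:
    """Metrics accumulated per flow."""
    n_packets: int = 0
    n_bytes: int = 0
    start: float = float("inf")
    end: float = float("-inf")


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, FlowRecord]:
    """Group packets by flow key and accumulate the four metrics."""
    flows: Dict[FlowKey, FlowRecord] = {}
    for p in packets:
        key = FlowKey(p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.in_iface)
        rec = flows.setdefault(key, FlowRecord())
        rec.n_packets += 1
        rec.n_bytes += p.length
        # Use the capture timestamp carried in the data, not wall clock time.
        rec.start = min(rec.start, p.ts)
        rec.end = max(rec.end, p.ts)
    return flows
```

Taking start and end times from the packet timestamps keeps the result independent of when the aggregation happens to run.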
Functional Requirements
Produce netflow from PCAP
Score netflow for anomalies
Control the number of anomalous events brought to the human expert's attention
Nonfunctional Requirements
Process line rate 10Gb/s
Be within 2x perf of tcpdump
Be within 4x of netflow latency
Do not add single points of failure
SOLUTION OUTLINE
OVERVIEW OF SERVICES
EXTERNAL DESIGN
System coupling:
Do not prescribe deploying kafka upstream or downstream
(Which Kafka version? Which language binding?)
External APIs:
Ingress HTTP POST octet encoding
Egress HTTP GET Long Polling
INTERNAL DESIGN
Record state only in:
Kafka
Pcap temporary files on local fs
Need to write block id to EFH and dedupe for sums to be correct in the presence of retries
Prefer late delivery to dropping data
Prefer reading capture time in data stream to wall clock time
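The point about block ids and retries can be sketched as follows. This is a minimal in-memory illustration, not the talk's implementation: IdempotentSummer and its method names are hypothetical, and a real pipeline would keep the set of seen ids in durable state (and bound its size) rather than in a Python set.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


class IdempotentSummer:
    """Accumulates per-key byte sums, applying each block at most once."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()           # block ids already applied
        self.sums: Dict[Hashable, int] = defaultdict(int)

    def apply(self, block_id: Hashable,
              records: Iterable[Tuple[Hashable, int]]) -> bool:
        """Add (key, n_bytes) records from one block.

        Returns False and changes nothing if block_id was seen before,
        so a retried delivery cannot inflate the sums.
        """
        if block_id in self._seen:
            return False
        self._seen.add(block_id)
        for key, n_bytes in records:
            self.sums[key] += n_bytes
        return True
```

With at-least-once delivery, the same block can arrive twice. Checking the id before adding is what keeps the totals correct.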
Akka-cluster in one slide:
Framework for Actor-based concurrency
Program in Scala or Java
Akka-cluster is more general than map-reduce or data pipelines
Makes the use of local and remote resources work the same way
MINIMUM VIABLE PREDICTIVE MODEL
1) Take Netflow metrics: sum(bytes), sum(packets), count
2) For each metric, compute mean and variance
3) Emit an "anomaly" when signal exceeds (mean + 3.0*sqrt(variance))
Meets minimum requirement: controls the number of events brought to the human expert's attention
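The three steps above can be sketched in Python as shown below. Several choices are assumptions made for illustration and are not stated in the talk: the statistics are updated online with Welford's algorithm, population variance is used, each sample is compared against the statistics of the samples seen before it, and a warm-up count suppresses alarms until some history exists. The names RunningStats and detect_anomalies are hypothetical.

```python
import math
from typing import Iterable, Iterator, Tuple


class RunningStats:
    """Online mean and variance (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one sample is seen.
        return self._m2 / self.n if self.n > 0 else 0.0


def detect_anomalies(values: Iterable[float], k: float = 3.0,
                     warmup: int = 10) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) when value > mean + k*sqrt(variance) of prior samples."""
    stats = RunningStats()
    for i, x in enumerate(values):
        if stats.n >= warmup and x > stats.mean + k * math.sqrt(stats.variance):
            yield i, x
        stats.update(x)
```

One detector instance would be run per metric (sum of bytes, sum of packets, count). Raising k lowers the number of events surfaced to the analyst, which is the control the requirement asks for.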
EXERCISE FOR THE READER
Model for periodicity:
Ihler et al, Adaptive Event Detection with Time–Varying Poisson Processes, ACM SIGKDD 2006
http://www.datalab.uci.edu/papers/event_detection_kdd06.pdf
DEPLOYMENT
Symmetrical mapping of docker containers to hosts
RESULTS
Qualitatively, users can find relevant anomalies in a reasonably sized stream
System operates reliably
Numbers are correct within assumptions
ARCHITECTURE, LEARNINGS
SO WHY KAFKA VS ANY OTHER STREAMING COMPONENT?
https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/comment-page-1/
HOW DOES YOUR ORGANIZATION PICK COMPONENTS?
STREAMING DATA LITERATURE:
A data entity is created by one module, is passed from module to module until it is no longer needed and is then destroyed. . . . Punched card accounting systems exemplify this environment.
J. P. Morrison, "Data Stream Linkage Mechanism", IBM Systems Journal, 1978.
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=45DED06EC91474F5938A9E05CC3D5A61?doi=10.1.1.89.2601&rep=rep1&type=pdf
BIND ARCHITECTURAL COUPLINGS EARLY SO THAT ARCHITECTURAL COMPONENTS CAN BE CHOSEN WITH AMPLE EVIDENCE
Examples of components:
which database
which streaming engine
Examples of couplings:
format of data (e.g. newline delimited json)
how to notify
how to checkpoint
HTTP COUPLING: WINS
Win #1: Can't get access to pcap over API
Win #2: Only RHEL-distributed reqs (perl-core, curl) required for ingress
Win #3: Upgrade kafka when improved
HTTP COUPLING: WIN #3: UPGRADE WHEN READY
Kafka Version                0.9.0   0.10.0.1   0.10.1.0
Partition by Hash            x       x          x
Write timestamp to message           x          x
Read seek by timestamp                          x
LEARNING #1
https://github.com/akka/reactive-kafka
Using this library in place of KafkaConsumer
LEARNING #2, HIDING IN PLAIN SIGHT
http://www.reactive-streams.org/
FAVOR INTEGRATION TESTING OVER UNIT TESTING
Ingress, egress have optional flag placebo={true,false}. Default to true.
Every deployment simulates low volume placebo sinks, sources.
Transmit heartbeats when each component is sure to have made forward progress.
ON EVALUATING FAULT TOLERANCE AND SCALABILITY
My smart buddy
LinkedIn runs it in production
The NSA
Can we do better?
ON EVALUATING FAULT TOLERANCE AND SCALABILITY:
The idea:
Create linked containers for app
Use tc (with the netem queueing discipline) to drop and/or delay packets
Run simulated data source
ON EVALUATING FAULT TOLERANCE AND SCALABILITY:
Hands on create container:
docker run -it --rm ubuntu:14.04.2 bash
Hands on with the container:
root@07e330775e98:/# apt-get update && apt-get install -y ethtool
root@07e330775e98:/# ethtool -S eth0
NIC statistics:
peer_ifindex: 875
Hands on with the host:
(docker-machine's boot2docker has tc built-in)
# look up the name of the host-side interface whose index is 875
dev=$(ip -o link | awk -F': ' '$1 == 875 {print $2}' | cut -d@ -f1)
# "change" assumes a netem qdisc is already attached; use "add" the first time
tc qdisc change dev $dev root netem delay 100ms 20ms distribution normal
tc qdisc change dev eth0 root netem loss 0.1%
Myth: Code should always go into docker containers through an image
Alternative:
docker run -v $dirSrc:$dirSrc # to convey source code
docker exec # to restart program
Myth: A docker image is something that came from a Dockerfile:
Alternative
docker run
ansible-playbook -c local
docker commit
ACKNOWLEDGEMENTS
Ilya Levner
Gunjan Gupta, Lightsphere AI
Trey Blalock, Firewall Consulting
RECOMMENDED READING
I Heart Logs, Jay Kreps (creator of Kafka)
https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
Akka in Action, Roestenburg et al
Released Sept 30, 2016
https://www.amazon.com/Akka-Action-Raymond-Roestenburg/dp/1617291013
Scala for the Impatient, 1e, Cay Horstmann
Second edition coming December 2016
https://www.amazon.com/Scala-Impatient-Cay-S-Horstmann/dp/0321774094
READINGS ON LOW LATENCY DATA ENGINEERING
(ORGANIZED BY COMMUNITY)
Community | Title | URL
Reactive | The Reactive Manifesto | http://www.reactivemanifesto.org/
Reactive | Reactive Streams | http://www.reactive-streams.org/
Kafka | I Heart Logs, Jay Kreps, 2014 | https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
Kafka | Kafka: The Definitive Guide, prerelease/2017 | https://www.amazon.com/Kafka-Definitive-Real-time-stream-processing/dp/1491936169
NiFi | The core concepts of NiFi | http://nifi.apache.org/docs/nifi-docs/html/overview.html#the-core-concepts-of-nifi
Flow Based Programming | Flow-Based Programming, J. Paul Morrison, 2010 | https://www.amazon.com/Flow-Based-Programming-2nd-Application-Development/dp/1451542321
Storm | Big Data, Nathan Marz, 2015 | https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343
QUESTIONS?

More Related Content

What's hot

Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Josh Patterson
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
Jen Aman
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
SigOpt
 
DLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep LearningDLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep Learning
Brodmann17
 
Anomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) EnglishAnomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) English
Adam Gibson
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
Seunghyun Hwang
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
Jen Aman
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow
Rajiv Shah
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
Basic ideas on keras framework
Basic ideas on keras frameworkBasic ideas on keras framework
Basic ideas on keras framework
Alison Marczewski
 
[Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network [Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network
KIMMINHA3
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
Spark Summit
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrellzukun
 
Recent developments in Deep Learning
Recent developments in Deep LearningRecent developments in Deep Learning
Recent developments in Deep Learning
Brahim HAMADICHAREF
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
MLconf
 
Deep Learning Primer: A First-Principles Approach
Deep Learning Primer: A First-Principles ApproachDeep Learning Primer: A First-Principles Approach
Deep Learning Primer: A First-Principles Approach
Maurizio Calo Caligaris
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit Recap
Jiang Jun
 
Learning where to look: focus and attention in deep vision
Learning where to look: focus and attention in deep visionLearning where to look: focus and attention in deep vision
Learning where to look: focus and attention in deep vision
Universitat Politècnica de Catalunya
 
Introduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at GalvanizeIntroduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at Galvanize
Intel Nervana
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
MLconf
 

What's hot (20)

Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
 
DLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep LearningDLD meetup 2017, Efficient Deep Learning
DLD meetup 2017, Efficient Deep Learning
 
Anomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) EnglishAnomaly detection in deep learning (Updated) English
Anomaly detection in deep learning (Updated) English
 
End-to-End Object Detection with Transformers
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
 
Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
 
Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow Image Classification Done Simply using Keras and TensorFlow
Image Classification Done Simply using Keras and TensorFlow
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Basic ideas on keras framework
Basic ideas on keras frameworkBasic ideas on keras framework
Basic ideas on keras framework
 
[Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network [Seminar arxiv]fake face detection via adaptive residuals extraction network
[Seminar arxiv]fake face detection via adaptive residuals extraction network
 
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander UlanovA Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
A Scaleable Implementation of Deep Learning on Spark -Alexander Ulanov
 
Fcv rep darrell
Fcv rep darrellFcv rep darrell
Fcv rep darrell
 
Recent developments in Deep Learning
Recent developments in Deep LearningRecent developments in Deep Learning
Recent developments in Deep Learning
 
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016Daniel Shank, Data Scientist, Talla at MLconf SF 2016
Daniel Shank, Data Scientist, Talla at MLconf SF 2016
 
Deep Learning Primer: A First-Principles Approach
Deep Learning Primer: A First-Principles ApproachDeep Learning Primer: A First-Principles Approach
Deep Learning Primer: A First-Principles Approach
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit Recap
 
Learning where to look: focus and attention in deep vision
Learning where to look: focus and attention in deep visionLearning where to look: focus and attention in deep vision
Learning where to look: focus and attention in deep vision
 
Introduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at GalvanizeIntroduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at Galvanize
 
Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016Kaz Sato, Evangelist, Google at MLconf ATL 2016
Kaz Sato, Evangelist, Google at MLconf ATL 2016
 

Viewers also liked

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
Georg Heiler
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
Hakky St
 
Predictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine IntelligencePredictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine Intelligence
Numenta
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming Data
Numenta
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
Ted Dunning
 
A Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOpsA Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOps
BigPanda
 
Detecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking DataDetecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking Data
James Sirota
 
Internship_presentation
Internship_presentationInternship_presentation
Internship_presentationAditya Gautam
 
HawkEye : A Real-time Anomaly Detection System
HawkEye : A Real-time Anomaly Detection SystemHawkEye : A Real-time Anomaly Detection System
HawkEye : A Real-time Anomaly Detection System
Satnam Singh
 
Analytics for large-scale time series and event data
Analytics for large-scale time series and event dataAnalytics for large-scale time series and event data
Analytics for large-scale time series and event data
Anodot
 
Science of Anomaly Detection
Science of Anomaly Detection Science of Anomaly Detection
Science of Anomaly Detection
Numenta
 
Big data lambda architecture - Streaming Layer Hands On
Big data lambda architecture - Streaming Layer Hands OnBig data lambda architecture - Streaming Layer Hands On
Big data lambda architecture - Streaming Layer Hands On
hkbhadraa
 
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A..."Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
Dataconomy Media
 
VSSML16 L6. Feature Engineering
VSSML16 L6. Feature EngineeringVSSML16 L6. Feature Engineering
VSSML16 L6. Feature Engineering
BigML, Inc
 
Anomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learnAnomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learn
agramfort
 
Data Mining with Splunk
Data Mining with SplunkData Mining with Splunk
Data Mining with SplunkDavid Carasso
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsSrinath Perera
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overview
Monal Daxini
 
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of ToyotaBig Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
Data Con LA
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteAlois Reitbauer
 

Viewers also liked (20)

Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Reducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networksReducing the dimensionality of data with neural networks
Reducing the dimensionality of data with neural networks
 
Predictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine IntelligencePredictive Analytics with Numenta Machine Intelligence
Predictive Analytics with Numenta Machine Intelligence
 
Detecting Anomalies in Streaming Data
Detecting Anomalies in Streaming DataDetecting Anomalies in Streaming Data
Detecting Anomalies in Streaming Data
 
Anomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine LearningAnomaly Detection - New York Machine Learning
Anomaly Detection - New York Machine Learning
 
A Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOpsA Practical Guide to Anomaly Detection for DevOps
A Practical Guide to Anomaly Detection for DevOps
 
Detecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking DataDetecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking Data
 
Internship_presentation
Internship_presentationInternship_presentation
Internship_presentation
 
HawkEye : A Real-time Anomaly Detection System
HawkEye : A Real-time Anomaly Detection SystemHawkEye : A Real-time Anomaly Detection System
HawkEye : A Real-time Anomaly Detection System
 
Analytics for large-scale time series and event data
Analytics for large-scale time series and event dataAnalytics for large-scale time series and event data
Analytics for large-scale time series and event data
 
Science of Anomaly Detection
Science of Anomaly Detection Science of Anomaly Detection
Science of Anomaly Detection
 
Big data lambda architecture - Streaming Layer Hands On
Big data lambda architecture - Streaming Layer Hands OnBig data lambda architecture - Streaming Layer Hands On
Big data lambda architecture - Streaming Layer Hands On
 
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A..."Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
"Building Anomaly Detection For Large Scale Analytics", Yonatan Ben Shimon, A...
 
VSSML16 L6. Feature Engineering
VSSML16 L6. Feature EngineeringVSSML16 L6. Feature Engineering
VSSML16 L6. Feature Engineering
 
Anomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learnAnomaly/Novelty detection with scikit-learn
Anomaly/Novelty detection with scikit-learn
 
Data Mining with Splunk
Data Mining with SplunkData Mining with Splunk
Data Mining with Splunk
 
ACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics PatternsACM DEBS 2015: Realtime Streaming Analytics Patterns
ACM DEBS 2015: Realtime Streaming Analytics Patterns
 
Real Time Data Infrastructure team overview
Real Time Data Infrastructure team overviewReal Time Data Infrastructure team overview
Real Time Data Infrastructure team overview
 
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of ToyotaBig Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
Big Data Day LA 2015 - Feature Engineering by Brian Kursar of Toyota
 
The Dark of Building an Production Incident Syste
The Dark of Building an Production Incident SysteThe Dark of Building an Production Incident Syste
The Dark of Building an Production Incident Syste
 

Similar to Anomaly Detection at Scale

Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
Trayan Iliev
 
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICESKEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
Mykola Novik
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
Docker, Inc.
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
ConfluentInc1
 
IoT meets Big Data
IoT meets Big DataIoT meets Big Data
IoT meets Big Data
ratthaslip ranokphanuwat
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
Christian Esteve Rothenberg
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed Systems
Gokhan Boranalp
 
Microsoft Dryad
Microsoft DryadMicrosoft Dryad
Microsoft Dryad
Colin Clark
 
Deep learning with kafka
Deep learning with kafkaDeep learning with kafka
Deep learning with kafka
Nitin Kumar
 
The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]The power of linux advanced tracer [POUG18]
The power of linux advanced tracer [POUG18]
Mahmoud Hatem
 
Automating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency SpreadsAutomating the Hunt for Non-Obvious Sources of Latency Spreads
Automating the Hunt for Non-Obvious Sources of Latency Spreads
ScyllaDB
 
Akshay Sanjay Kale Resume LinkedIn
Akshay Sanjay Kale Resume LinkedInAkshay Sanjay Kale Resume LinkedIn
Akshay Sanjay Kale Resume LinkedInAkshay Kale
 
YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
Brendan Gregg
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
SDN approach.pptx
SDN approach.pptxSDN approach.pptx
SDN approach.pptx
TrongMinhHoang1
 
How to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALABHow to over-engineer things and have fun? | Oto Brglez, OPALAB
How to over-engineer things and have fun? | Oto Brglez, OPALAB
HostedbyConfluent
 
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Exploring the Final Frontier of Data Center Orchestration: Network Elements -...
Puppet
 
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCONMicroservices Application Tracing Standards and Simulators - Adrians at OSCON
Microservices Application Tracing Standards and Simulators - Adrians at OSCON
Adrian Cockcroft
 

Similar to Anomaly Detection at Scale (20)

Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
 
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICESKEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
KEY CONCEPTS FOR SCALABLE STATEFUL SERVICES
 
Proactive ops for container orchestration environments
Proactive ops for container orchestration environmentsProactive ops for container orchestration environments
Proactive ops for container orchestration environments
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
IoT meets Big Data
IoT meets Big DataIoT meets Big Data
IoT meets Big Data
 
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud using Intel Analytics Zoo & Alluxio
 
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
IEEE HPSR 2017 Keynote: Softwarized Dataplanes and the P^3 trade-offs: Progra...
 
ZCloud Consensus on Hardware for Distributed Systems
ZCloud Consensus on Hardware for Distributed SystemsZCloud Consensus on Hardware for Distributed Systems
Anomaly Detection at Scale

  • 1. ANOMALY DETECTION AT SCALE: A CYBERSECURITY STREAMING DATA PIPELINE USING KAFKA AND AKKA CLUSTERING O'Reilly Security Conference NYC, November 2, 2016 Jeff Henrikson Groovescale http://www.groovescale.com
  • 2. OUTLINE framing problem statement streaming tech concepts outline of solution architecture, learnings
  • 4. Why build predictive models? Models continue to do useful work after humans are no longer looking. Models are based on assumptions. Only humans can make assumptions.
  • 5. INTRUSION DETECTION 1) Log Data 2) Configure rules 3) Human awareness examines alarms and logs 4) Quick action taken (e.g. deauthorize) 5) Re-authorize once human awareness deems longer-term mitigation is adequate Sometimes for high-confidence rules we allow 2) to trigger 4) without human intervention
  • 6. HOW CAN A SKILLED PERSON'S AWARENESS BE MORE EFFECTIVELY GUIDED? 1) Matching of network behavior against localized rules 2) Predictive modeling of the aggregate network behavior
  • 7. HOW CAN A SKILLED PERSON'S AWARENESS BE MORE EFFECTIVELY GUIDED? 1) Matching of network behavior against localized rules 2) Predictive modeling of the aggregate network behavior Hypothesis: Let's see if 2 is better.
  • 8. AI Artificial Intelligence "IA" Intelligence Augmented From Building practical AI systems Adam Cheyer, (Siri, Sentient, and Viv Labs) Strata 2016
  • 9. INTRUSION DETECTION TOOLS AS "INTELLIGENCE AUGMENTED" Intruders are trying to evade detection. Let's not worry about making the human protector of the network go away. That is probably not possible given evasive response.
  • 13. NETFLOW (V5) BASICS Attributes: Source/Destination IP, Source/Destination Port, Input interface. Metrics: Number of Packets, Sum of Bytes, Start Time, End Time. IPv4 only. https://nsrc.org/workshops/2015/sanog25-nmm-tutorial/materials/netflow.pdf
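The attributes and metrics on slide 13 can be read as a grouping key and a set of per-group aggregates. Below is a minimal Python sketch of that reading, not the pipeline's actual code: `FlowKey`, `Packet`, `FlowRecord`, and `aggregate_flows` are hypothetical names, the key uses only the attributes listed on the slide (real NetFlow v5 records carry more fields, such as protocol), and flow expiry on timeouts is omitted.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, NamedTuple


class FlowKey(NamedTuple):
    # Grouping attributes taken from the slide (hypothetical field names).
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    input_if: int


class Packet(NamedTuple):
    # Hypothetical, simplified view of one captured packet.
    key: FlowKey
    timestamp: float  # capture time from the data stream, in seconds
    length: int       # packet size in bytes


@dataclass
class FlowRecord:
    packets: int
    bytes: int
    start_time: float
    end_time: float


def aggregate_flows(packets: Iterable[Packet]) -> Dict[FlowKey, FlowRecord]:
    """Group packets by flow key and accumulate the slide's four metrics."""
    flows: Dict[FlowKey, FlowRecord] = {}
    for p in packets:
        rec = flows.get(p.key)
        if rec is None:
            flows[p.key] = FlowRecord(1, p.length, p.timestamp, p.timestamp)
        else:
            rec.packets += 1
            rec.bytes += p.length
            # Packets may arrive out of order, so track min and max explicitly.
            rec.start_time = min(rec.start_time, p.timestamp)
            rec.end_time = max(rec.end_time, p.timestamp)
    return flows
```

Start and end times come from the packet timestamps rather than the wall clock, in line with the preference stated on the internal design slide.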
  • 14. Functional Requirements Produce netflow from PCAP Score netflow for anomalies Control the number of anomalous events brought to the human expert's attention
  • 15. Nonfunctional Requirements Process line rate 10Gb/s Be within 2x perf of tcpdump Be within 4x of netflow latency Do not add single points of failure
  • 19. EXTERNAL DESIGN System coupling: Do not prescribe deploying kafka upstream or downstream (Which Kafka version? Which language binding?) External APIs: Ingress HTTP POST octet encoding Egress HTTP GET Long Polling
  • 21. INTERNAL DESIGN Record state only in: Kafka; Pcap temporary files on local fs. Need to write block id to EFH and dedupe for sums to be correct in the presence of retries. Prefer late delivery to dropping data. Prefer reading capture time in data stream to wall clock time.
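The point about block ids and retries can be illustrated with a small Python sketch. It assumes each block of data carries a unique id and that a retry resends the same id; `DedupingSum` and its in-memory set of seen ids are hypothetical. The slide says state lives only in Kafka and local pcap files, so a real version would have to persist the ids there rather than in process memory.

```python
from typing import Dict, Mapping, Set


class DedupingSum:
    """Accumulates per-metric sums, counting each block id at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()      # block ids already applied
        self.totals: Dict[str, int] = {}  # running sum per metric name

    def apply(self, block_id: str, values: Mapping[str, int]) -> bool:
        """Add a block's values unless the block was already applied.

        Returns True if the block was counted, False if it was a duplicate.
        """
        if block_id in self._seen:
            return False
        self._seen.add(block_id)
        for metric, value in values.items():
            self.totals[metric] = self.totals.get(metric, 0) + value
        return True
```

With this shape, an upstream sender can retry freely after a timeout: a block that was in fact already delivered is ignored the second time, so the sums are not inflated.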
  • 22. Akka-cluster in one slide: Framework for Actor-based concurrency. Program in Scala or Java. Akka-cluster is more general than map reduce or data pipelines. Makes the use of local and remote resources work the same.
  • 23. MINIMUM VIABLE PREDICTIVE MODEL 1) Take Netflow metrics: sum(bytes), sum(packets), count 2) For each metric, compute mean and variance 3) Emit an "anomaly" when signal exceeds (mean + 3.0*sqrt(variance)) Meets minimum requirement: controls the number of events brought to the human expert's attention
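The three steps on slide 23 can be sketched in Python as follows. This is one possible reading, not the talk's implementation: it uses Welford's online algorithm for the running mean and variance (a choice made here so the statistics can be updated one value at a time), uses the population variance, and adds a hypothetical `warmup` parameter so that no anomalies are emitted before a few samples have been seen.

```python
import math
from typing import Iterable, List


class RunningStats:
    """Online mean and population variance (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.n if self.n > 0 else 0.0


def detect(values: Iterable[float], k: float = 3.0, warmup: int = 5) -> List[int]:
    """Return indices where a value exceeds mean + k * sqrt(variance).

    Each value is tested against statistics of the values seen before it.
    """
    stats = RunningStats()
    anomalies: List[int] = []
    for i, x in enumerate(values):
        if stats.n >= warmup and x > stats.mean + k * math.sqrt(stats.variance):
            anomalies.append(i)
        stats.update(x)  # anomalous values still feed the statistics
    return anomalies
```

One instance of `RunningStats` would be kept per metric (sum of bytes, sum of packets, count). Raising `k` lowers the number of events emitted, which is the control the slide asks for. Whether anomalous values should update the statistics is a design choice this sketch does not settle.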
  • 24. EXERCISE FOR THE READER Model for periodicity: Ihler et al, Adaptive Event Detection with Time–Varying Poisson Processes, ACM SIGKDD 2006 http://www.datalab.uci.edu/papers/event_detection_kdd06.pdf
  • 25. DEPLOYMENT Symmetrical mapping of docker containers to hosts
  • 26. RESULTS Qualitatively, users can find relevant anomalies in a reasonably sized stream. System operates reliably. Numbers are correct within assumptions.
  • 28. SO WHY KAFKA VS ANY OTHER STREAMING COMPONENT? https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/comment-page-1/
  • 29. HOW DOES YOUR ORGANIZATION PICK COMPONENTS?
  • 30. STREAMING DATA LITERATURE: A data entity is created by one module, is passed from module to module until it is no longer needed and is then destroyed. . . . Punched card accounting systems exemplify this environment. J. P. Morrison, "Data Stream Linkage Mechanism", IBM Systems Journal, 1978. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=45DED06EC91474F5938A9E05CC3D5A61?doi=10.1.1.89.2601&rep=rep1&type=pdf
  • 31. BIND ARCHITECTURAL COUPLINGS EARLY SO THAT ARCHITECTURAL COMPONENTS CAN BE CHOSEN WITH AMPLE EVIDENCE Examples of components: which database, which streaming engine. Examples of couplings: format of data (e.g. newline delimited json), how to notify, how to checkpoint.
  • 32. HTTP COUPLING: WINS Win #1: Can't get access to pcap over API Win #2: Only RHEL-distributed reqs (perl-core, curl) required for ingress Win #3: Upgrade kafka when improved
  • 33. HTTP COUPLING: WIN #3: UPGRADE WHEN READY Kafka feature by version: Partition by hash: 0.9.0, 0.10.0.1, 0.10.1.0. Write timestamp to message: 0.10.0.1, 0.10.1.0. Read seek by timestamp: 0.10.1.0.
  • 35. LEARNING #2, HIDING IN PLAIN SIGHT http://www.reactive-streams.org/
  • 36. FAVOR INTEGRATION TESTING OVER UNIT TESTING Ingress, egress have optional flag placebo={true,false}. Default to true. Every deployment simulates low volume placebo sinks, sources. Transmit heartbeats when each component is sure to have made forward progress.
  • 37. ON EVALUATING FAULT TOLERANCE AND SCALABILITY My smart buddy; LinkedIn runs it in production; The NSA. Can we do better?
  • 38. ON EVALUATING FAULT TOLERANCE AND SCALABILITY: The idea: Create linked containers for app. Use tc to have the kernel's netem queueing discipline drop and/or delay packets. Run simulated data source.
  • 39. ON EVALUATING FAULT TOLERANCE AND SCALABILITY:
      Hands on create container:
        docker run -it --rm ubuntu:14.04.2 bash
      Hands on with the container:
        root@07e330775e98:/# apt-get update && apt-get install -y ethtool
        root@07e330775e98:/# ethtool -S eth0
        NIC statistics: peer_ifindex: 875
      Hands on with the host (docker-machine's boot2docker has tc built-in):
        # Look up the name of the host-side interface whose index is 875
        dev=$(ip -o link | awk -F': ' '$1 == 875 {print $2}' | cut -d@ -f1)
        # "change" assumes a netem qdisc was already attached with "tc qdisc add"
        tc qdisc change dev $dev root netem delay 100ms 20ms distribution normal
        tc qdisc change dev eth0 root netem loss 0.1%
  • 40. Myth: Code should always go into docker containers through an image
  • 41. Myth: Code should always go into docker containers through an image Alternative: docker run -v $dirSrc:$dirSrc # to convey source code docker exec # to restart program
  • 42. Myth: A docker image is something that came from a Dockerfile:
  • 43. Myth: A docker image is something that came from a Dockerfile: Alternative docker run ansible-playbook -c local docker commit
  • 44. ACKNOWLEDGEMENTS Ilya Levner Gunjan Gupta, Lightsphere AI Trey Blalock, Firewall Consulting
  • 45. RECOMMENDED READING
      I Heart Logs, Jay Kreps (creator of Kafka). https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
      Akka in Action, Roestenburg et al. Released Sept 30, 2016. https://www.amazon.com/Akka-Action-Raymond-Roestenburg/dp/1617291013
      Scala for the Impatient, 1e, Cay Horstmann. Second edition coming December 2016. https://www.amazon.com/Scala-Impatient-Cay-S-Horstmann/dp/0321774094
  • 46. READINGS ON LOW LATENCY DATA ENGINEERING (ORGANIZED BY COMMUNITY)
      Reactive: The Reactive Manifesto. http://www.reactivemanifesto.org/
      Reactive: Reactive Streams. http://www.reactive-streams.org/
      Kafka: I Heart Logs, Jay Kreps, 2014. https://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
      Kafka: Kafka: The Definitive Guide, prerelease/2017. https://www.amazon.com/Kafka-Definitive-Real-time-stream-processing/dp/1491936169
      NiFi: The core concepts of NiFi. http://nifi.apache.org/docs/nifi-docs/html/overview.html#the-core-concepts-of-nifi
      Flow Based Programming: Flow-Based Programming, J. Paul Morrison, 2010. https://www.amazon.com/Flow-Based-Programming-2nd-Application-Development/dp/1451542321
      Storm: Big Data, Nathan Marz, 2015. https://www.amazon.com/Big-Data-Principles-practices-scalable/dp/1617290343