www.thalesgroup.com
OPEN
Large scale anomaly detection in cyber-security
YVES MABIALA
This document may not be reproduced, modified, adapted, published, translated, in any way, in whole or in part or disclosed to a third party without the prior written consent of Thales - © Thales 2015 All rights reserved.
Agenda
▌The CENTAI lab
▌Challenges in cyber-attack detection
▌How to improve detection capabilities?
▌Batch anomaly detection in Pig
▌From batch to real-time
▌Conclusion
CENTAI ecosystem
CENTAI focus areas: Big Data, Big Analytics, Visual Analytics
Maturity pipeline (TRL = Technology Readiness Level)
- Concepts: TRL 1-4
- Prototypes: TRL 4-6
- Products / Solutions / Applications: TRL 7-9
Partners
- University of Bordeaux, LABRI
- UPMC – LIP6 (Paris)
- LIRIS (Lyon)
- TRT & Innovation Hubs
- Start-ups & SMEs
- Thales Business Lines
- Customers
Activities: algorithm co-development and transfer, Proof of Concept / Proof of Technos
Challenges in cyber-attack detection
Two kinds of attacks
Non-targeted
- Aimed at the general public, not very complex, using common attack patterns
Targeted (APT, advanced persistent threats)
- Using highly customized attack patterns
- Carried out over long periods of time
How to improve detection capabilities?
Traditional approaches
Based on expert rules (detection or correlation)
Often not sized to handle massive data
Innovative tools are needed to
Detect abnormal behaviors without relying on known patterns
Perform analysis over long periods of time
What about the data?
Data collected at multiple levels
Application / Network / System
The data is Big Data
Massive and heterogeneous (hundreds of TBs of logs)
Highly dynamic (tens of GB/s for network flows/streams)
Batch analysis of application/network logs
Context
Audit of logs
Data
Network logs (proxies and firewalls)
Hundreds of billions of events
Unlabeled
Objective
Detection of abnormal events in these logs (unsupervised anomaly detection)
Anomaly detection challenges
Open-source algorithm libraries will not help much
No unsupervised anomaly detection is implemented in most big data analytics frameworks
Data is massive and high-dimensional (hundreds of variables)
Linear complexity is mandatory
A very low false-positive rate must be ensured
Outputs must be as “understandable” as possible
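To see why the false-positive rate matters at this scale, a short back-of-the-envelope calculation helps. The event count below takes 100 billion as an assumed lower bound for "hundreds of billions of events", and the three rates are purely illustrative; neither figure comes from the talk.

```python
# Assumed lower bound for "hundreds of billions of events".
EVENTS = 100_000_000_000

def expected_false_alerts(n_events, one_in):
    """Expected false alerts if one normal event in `one_in` is wrongly flagged."""
    return n_events // one_in

# Illustrative false-positive rates: 1 in a thousand, a million, a billion.
for one_in in (1_000, 1_000_000, 1_000_000_000):
    alerts = expected_false_alerts(EVENTS, one_in)
    print(f"1 false positive per {one_in:,} events -> {alerts:,} false alerts")
```

Even a rate that looks small on paper produces far more alerts than analysts can review, which is also why the outputs need to be understandable.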
Copula anomaly detection
Development of a set of tools in Pig
Probabilistic anomaly detection algorithm
- Based on the notion of copula
[Figure: processing pipeline. Event (Variable 1, Variable 2, ..., Variable n) -> density estimation (Density 1, Density 2, ..., Density n) -> copula (joint density) -> thresholding -> anomaly / normal event]
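The pipeline above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation presented in the talk: it uses Gaussian kernel density estimates for the marginals, the independence copula (copula density equal to 1) as a stand-in for the talk's copula, and an empirical low quantile of the training scores as the threshold. All function names, the bandwidth and the quantile level are hypothetical.

```python
import math
import random

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` (Gaussian kernel)."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def fit(events, bandwidth=0.5, alpha=0.01):
    """Learn one marginal density per variable and a log-density threshold."""
    n_vars = len(events[0])
    marginals = [gaussian_kde([e[i] for e in events], bandwidth) for i in range(n_vars)]

    def log_joint(event):
        # Assumption: independence copula, so the joint density is the
        # product of the marginal densities (sum of their logarithms).
        return sum(math.log(max(m(x), 1e-300)) for m, x in zip(marginals, event))

    scores = sorted(log_joint(e) for e in events)
    threshold = scores[int(alpha * len(scores))]  # empirical alpha-quantile
    return log_joint, threshold

def is_anomaly(event, log_joint, threshold):
    """Flag events whose estimated joint log-density falls below the threshold."""
    return log_joint(event) < threshold

random.seed(0)
train = [(random.gauss(0, 1), random.gauss(5, 1)) for _ in range(500)]
log_joint, threshold = fit(train)
print(is_anomaly((0.1, 5.2), log_joint, threshold))   # event close to the training data
print(is_anomaly((8.0, -3.0), log_joint, threshold))  # event far from the training data
```

The naive kernel estimate here costs one pass over the training sample per evaluation, so it only illustrates the data flow; it does not meet the linear-complexity requirement.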
A few words on copula
Express a multivariate cumulative distribution function as a function of its univariate marginal distributions
$$\Pi(u_1, u_2) = \mathbb{P}(U_1 \le u_1, U_2 \le u_2) = C(F(u_1), G(u_2))$$
Many types of copulas
Parametric: Archimedean (e.g. $C(u, v) = uv$), Gaussian
Non-parametric: empirical
Development of a new copula: indetermination criteria
$$C(u_1, u_2) = u_2 \frac{F^{-1}(u_1)}{A} + u_1 \frac{G^{-1}(u_2)}{B} - \frac{F^{-1}(u_1)\, G^{-1}(u_2)}{AB}$$
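As a small numerical illustration of the first identity, the sketch below checks that, for two independent variables, the empirical joint CDF is close to the product copula C(u, v) = uv applied to the empirical marginals. The distributions, sample size and evaluation point are arbitrary assumptions, and the helper names are hypothetical. The indetermination copula is not reproduced because the slide does not define the constants A and B.

```python
import bisect
import random

def ecdf(sample):
    """Empirical CDF of a one-dimensional sample."""
    ordered = sorted(sample)
    n = len(ordered)
    def cdf(x):
        # bisect_right counts the observations that are <= x
        return bisect.bisect_right(ordered, x) / n
    return cdf

def empirical_joint_cdf(pairs):
    """Empirical joint CDF of a two-dimensional sample."""
    n = len(pairs)
    def cdf(x, y):
        return sum(1 for a, b in pairs if a <= x and b <= y) / n
    return cdf

def product_copula(u, v):
    """Independence copula C(u, v) = uv."""
    return u * v

random.seed(1)
# Two independent variables with different marginal distributions.
pairs = [(random.expovariate(1.0), random.gauss(0.0, 2.0)) for _ in range(20000)]
F = ecdf([a for a, _ in pairs])
G = ecdf([b for _, b in pairs])
H = empirical_joint_cdf(pairs)

x, y = 1.0, 0.5
gap = abs(H(x, y) - product_copula(F(x), G(y)))
print(gap)  # small, up to sampling noise
```

For dependent variables the gap would not vanish, and a different copula would be needed to link the marginals to the joint distribution.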
Implementation in Pig
Implementation as Pig algebraic UDFs (fully incremental)
Marginal estimation using kernel density estimation (KDE)
Joint distribution estimation using copula
Threshold estimation with quantile regression / extreme value
Integration of the UDFs in two scripts
Learning
Scoring
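The "algebraic" property can be illustrated outside Pig. The sketch below mimics the three stages of an algebraic aggregation (initial, intermediate, final) in plain Python, with a fixed-width histogram standing in for the kernel density estimate. It is an assumption-laden illustration rather than the UDFs from the talk: the function names and the bin width are hypothetical.

```python
from collections import Counter

BIN_WIDTH = 0.5  # hypothetical bin width

def initial(value):
    """Map one value to a partial summary (a one-bin histogram)."""
    return Counter({int(value // BIN_WIDTH): 1})

def intermediate(partials):
    """Merge partial summaries; the merge is associative and commutative."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

def final(partials):
    """Turn the merged counts into a normalized density estimate per bin."""
    merged = intermediate(partials)
    total = sum(merged.values())
    return {b: c / (total * BIN_WIDTH) for b, c in merged.items()}

values = [0.1, 0.2, 0.7, 1.4, 0.3, 0.6]
# Simulate two data splits, each combined locally before the final merge.
split_a = intermediate(initial(v) for v in values[:3])
split_b = intermediate(initial(v) for v in values[3:])
density = final([split_a, split_b])
print(density)
```

Because partial summaries merge in any order, the same result is obtained however the data is split, which is what makes the computation incremental.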
Learning phase
Testing phase
From pure batch analysis to batch and stream
Very satisfied with this first implementation
Quite easy to create customized UDFs
Very good performance
But
Real-time processing capabilities need to be added
- While avoiding the burden of coding the algorithm twice
How to combine both capabilities?
Lambda architecture
1 - Collect
2 - Batch layer
3 - Serving layer
4 - Speed layer
5 - Query layer
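The interaction between the five layers can be sketched with a toy in-memory example that counts events per source host. This is only an illustration of the general pattern: the class, its method names and the counting use case are hypothetical, and clearing the real-time view right after a batch run is a simplification of what a production system would do.

```python
from collections import defaultdict

class LambdaCounts:
    """Toy lambda architecture counting events per source host."""

    def __init__(self):
        self.master = []                    # 1 - collect: append-only event log
        self.batch_view = {}                # 3 - serving layer: batch results
        self.speed_view = defaultdict(int)  # 4 - speed layer: recent events only

    def collect(self, host):
        """Store the raw event and update the real-time view."""
        self.master.append(host)
        self.speed_view[host] += 1

    def run_batch(self):
        """2 - batch layer: recompute the view from the full master dataset."""
        view = defaultdict(int)
        for host in self.master:
            view[host] += 1
        self.batch_view = dict(view)
        self.speed_view.clear()  # these events are now covered by the batch view

    def query(self, host):
        """5 - query layer: merge the batch and real-time views."""
        return self.batch_view.get(host, 0) + self.speed_view.get(host, 0)

system = LambdaCounts()
for host in ["10.0.0.1", "10.0.0.1", "10.0.0.2"]:
    system.collect(host)
system.run_batch()
system.collect("10.0.0.1")       # arrives after the last batch run
print(system.query("10.0.0.1"))  # batch count plus real-time count
```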
From Pig to Spark
Implementation as Spark Accumulator
Very easy to transfer the code from Pig UDFs
Pros
An order of magnitude faster
Easier to share information between functions
Same implementation for batch and real-time
Cons
Loss of data typing and variable names
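The point about sharing one implementation between batch and real-time can be illustrated with a mergeable summary. The sketch below is a standalone Python stand-in, not the code from the talk: the class only borrows the method names `zero` and `addInPlace` from PySpark's `AccumulatorParam` interface and deliberately avoids any Spark import so that it runs on its own. The choice of running moments as the summary, and the helper names, are assumptions.

```python
class MomentsParam:
    """Mergeable (count, sum, sum of squares) summary.

    Method names follow PySpark's AccumulatorParam interface (zero, addInPlace);
    the class itself has no Spark dependency.
    """

    def zero(self, value):
        return (0, 0.0, 0.0)

    def addInPlace(self, a, b):
        return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def summarize(values, param):
    """Fold a collection of values into one summary."""
    acc = param.zero(None)
    for v in values:
        acc = param.addInPlace(acc, (1, v, v * v))
    return acc

def mean_and_variance(acc):
    """Derive the mean and (population) variance from a summary."""
    n, s, ss = acc
    mean = s / n
    return mean, ss / n - mean * mean

param = MomentsParam()
batch = summarize([1.0, 2.0, 3.0, 4.0], param)   # historical data
micro_batch = summarize([5.0, 6.0], param)       # newly arrived events
combined = param.addInPlace(batch, micro_batch)  # same merge in both modes
print(mean_and_variance(combined))
```

The same merge function updates the summary whether the new values come from a full batch run or from a small streaming micro-batch.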
Anomaly detection in Spark Streaming
Conclusion
Big Data analytics is becoming essential in cyber-security
Cyber-attack detection
Investigation and forensics
Situation awareness
Spark greatly helps to leverage the power of big data analytics for batch and real-time applications
What is next?
Provide decision aid and support tools
Thank you for your attention
▌Any questions ?
Yves MABIALA
yves.mabiala@thalesgroup.com
