Needles, Haystacks and Algorithms: Using Machine Learning to detect complex threats

Needles &
Haystacks:
Machine Learning powered
Threat Detection
Ioan Constantin,
Orange Romania

Text with illustrationHow do you go about finding a
needle in a haystack?
It’s simple, actually.
You bring a magnet.

Agenda
Complex Threats
A brief walk trough
some of the new
challenges most
companies face now
with the advent of
APTs.
Machine Learning
Beyond buzzwords: how
do we improve our
existing detection
methods and
technologies with the
context offered by M.L.?
‘Post-hash’
Context
Threat Intel,
Proactive Security,
Threat Hunting,
SOCMINT, HUMINT,
OSINT
Threat Detection
Are we trying to
replace our existing
SIEMs, FWs and
IPSs? Spoiler: No.
We’re building around
it.
1 2 3 4 5SIEM much?
Are we trying to build
yet another Open
Source SIEM?

Advan
ceduses sophisticated
exploits, 0-Days, Social
Engineering. Stealthy.
Compromised end-points
Can infect multiple end-
points in large networks,
of various type such as
smartphones,
workstations,
applications
Highly Dependant on
Social Engineering,
surveillance, supply
chain-compromise
Persistent
Low-and-slow approach
to attacks
Targeting is conducted
trough continuous
monitoring
Task-specific rather than
opportunistic
The goal is to maintain
long-term access
Complex
Threats

Keep a
door
open
Exfiltrate data
Delete all track
Break stuff
06
Complete
mission
Maintain
presence, use
stealthy
(encrypted)
channels to
communicate
05
Keep a low
profile
Expand control
to other
endpoints,
servers, active
equipment,
harvest data
and configs
04
Lateral
Movement
Look around,
find other
vulnerable
victims, gather
as much intel as
you can
03
Reconnaiss
ance.
Create C2C
controllable
victim end-
points in the
network,
escalate
privileges
02
Establish
foothold
Using social
engineering,
spear phishing,
media infestation
with zero-day
malware
01
Initial
compromise
Works in stages
APTs are a complex species
of threats
And we have no definitive guide
or ‘how-to’ detect it or mitigate it.
We’re constantly learning from
the experience of others and
we’re relying on human
intelligence to better understand
these threats

77%Network Technologies
(FWs, Routers,
Switches)
64 %
Log Monitoring / Event
Correlation / SIEMs / A.I.
/ M.L.
Technical Controls
used to Protect
Against APTs*
*according to a TrendMicro Survey, H1 2017
Mobile Security
Gateways, Mobile Anti-
Malware, Mobile Device
Management
37%
83%Antivirus & Antimalware
66%
Zoning Off (Network
Segregation)

Malware Variety
There are hundreds of
million of malware variations
that can be used in a APT.
This makes it challenging to
detect.
41 2 3
Traditional security is
ineffective
As it needs more than
signature or rule-based
detection. It needs context
and some form of
automated learning method
to expand the context
Log analysis and Log
correlation has its limits
Because, of course, it also
needs context and
‘perspective’.
There’s lots of noise
It’s extremely hard to
separate noise from
legitimate traffic and there’s
lots of noise being ‘ingested’
by security equipment
Challenges in mitigation

The average
company takes 170
days to detect an
advanced threat, 39
days to mitigate and
43 days to recover*
*According to the Ponemon Institute
170
days
39
days
43
days
0%
To Mitigation
For Detection
To Recovery
Challenges in mitigation

Obligatory timeline slide
1998 2003 2009 2010 2011 2018
Moonlight Maze
Some guys in Russia breaks
into systems at the Pentagon,
NASA, U.S. Department of
Energy. Exfiltrates thousands
of docs. Attack lasted 2 years
Ghost Net
Large-scale attack,
initiated from phishing
e-mails, downloaded
trojans
RSA Attack
Adobe Flash exploit +
phishing e-mail =
pwned SecurID
Titan Rain
Attacks on large
defense contractors
from various Chinese
sources.
Stuxnet
Fun fact: there’s
chatter out there
about a not-so-long-
awaited sequel 
Stuxnet 2.0
VPNFilter /
BlackEnergy
And, of course the
youngest of the
bunch…

Proactive
Security
Bug Bounty
Programs
Machine Learning,
maybe?
Threat
Intelligence
OSINT
SOCMINT
HUMINT
Counter
surveillance
Threat
Hunting
Solutions?
Provide
Context.

Machine Learning is the science of getting
computers to act without being explicitly
programmed (Stanford University, Artificial
Intelligence Laboratory)
Machine Learning (…) algorithms can figure out how
to perform important tasks by generalizing from
examples (University of Washington)
Machine Learning

Machine Learning
Learn from available data
Using representation, evaluation
and optimisation. Use training
data
Establish baseline
Define normal behaviour, define
normal data.
Don’t over use the word ‘normal’ 
Find Anomalies
Identify deviations from the
baseline. Correlate deviations
with other input data
Act upon finding
Classify deviations as
anomalies. Notify / Take action

Available data
Network
Devices
Endpoints Applications
Security
Products
Cool stuff
Active network
equipment
Routers, switches,
printers, APs and the
like. They all generate
tons of data
PCs, Laptops,
Smartphones,
Servers
Metrics, Operating
Systems Logs, Hardware
logs, Security logs.
Interesting to say the
least
Software clients
Client-side and server-
side. E-Mail clients, Web
Browsers, VPN clients.
In all shapes and sizes,
from ubiquitous MS
Windows to iOS.
Networked, in the
cloud or on endpoints
Everything from FWs,
IPSs, Sandboxes, AVs /
Anti-Malware, Anti-Spam
& E-Mail security
Gateways, MdM etc.
Our secret sauce
We generate huge
amounts of data from
TELCO-specific
technologies and
equipment. We add it to
the mix to better help
with context.

Sneak &
Peek at our
Analytics & Threat
Detection Platform

Pentest
Results
Infobyte’s Faraday
Vulnerability Scans
We’re keeping an eye on
vulnerable systems that
can be used as entry
points in larger networks
Rapid7 OpenData
Secrepo / Censys / etc.
Threatmap & IoCs
We use data from our
own Threatmap
service to detect
vulnerabilities and
malware in
compromised
websites
Kaggel /
Virustotal / VXHeaven
Malware
analysis
Security Data
We gather anonymized
statistical data about app
usage, website
browsing, detected
attacks and malware
activity from our
Managed Security
Services
Threat intel feeds
IoCs, hashes, tweets and
posts.
Criticalstack / MDL /
Kiran Bandla’s
APTNotes / ISC
Suspicious Domains /
ThreatMiner /
Threat
Crowd
Cellular data
Wi-FiWe operate large Wi-Fi
networks both for
consumer and business
customers.

Architecture
KafkaLogstashElastic SearchKibanaWe

Architecture - II
Logstash Servers
Event Parsing
Event Enrichment
Data Ingestion
Kafka Servers
Load Balancing
Data Collection
Full Packet Data
Endpoints & Servers
Cloud Services
Machine & Equipment Logs
Service Bus
Everything Else
Kafka Servers
Caching
Kibana Servers
Event View
Custom Dashboards
Reporting
Long Term Storage
Hadoop Servers
ElasticSearch Servers
Machine Learning
Analytics
Caching & Short Term
Storage
Alerting
High-level overview
Of our proposed
architecture for our
Threat Detection &
Analytics Platform
Beats
Syslog
Cloud APIs
OSINT
HUMINT
SOCMINT

Machine Learning Algorithms
Sweet spot UnsupervisedSupervised
Supervised Machine
Learning relies on feeding
the machines with labeled
data, be it ‘good’ data or
‘bad’ data.
This helps create a baseline of
expected behaviour, in threat
detection and, in turn, will yield
results in detecting anomalies
On the Unsupervised side
you have to consider several
approaches such as
dimensionality reduction and
association rule learning.
These approaches are useful in
making large data sets easier to
analyse or understand. They can
be used to reduce the complexity
(dimensionality) of fields of data
to look at or group things together
(clustering)

Machine Learning Algorithms
Supervised
Unsupervised
Labeled Data
Unlabeled Data
Malware Identification
Spam Detection
Anomaly Detection
Risk (Reputation) Scoring
Clustering
Association – Rule Learning
Dimensionality Reduction
Entity Classification
Anomaly Detection
Data learning (exploration)
Machine
Learning

(More than) Text Mining
Twitter is a great data source for text mining. This has been done before and
we’re using what we’re learning from both commercial solutions like Sintelix and
Bitext and open source projects.
Twitter
Scrap
Engine /
Text
Mining
Local
Storage
Noise
Removal
Feature
Extraction
Feature
Selection
ElasticSearch
Machine
Learning
Anaylitics
Classification

Circling back to APTs
04
Lateral
Movement
02
Establish
foothold
Noise removal:
Exclude known good
traffic patterns and
legitimate pre-
labeled traffic
Gather Logs:
And traffic
dumps
Known domains:
Ingest data from
known malware
domain lists,
SOCMINT
Look for anomalies:
Large amount of request,
single requests, large
DNS queries etc. Malware
can be noisy
Visualize
Alert
Report
Gather Logs:
Windows
Machines,
mostly
Psexec DCOM QuarkPwD
RDP Logon Scripts WCE
WMIC WinRM MS14-058
mimikatz PWDumpX timestomp
Analyze logs
For specific event IDs,
attribute changes etc.
Visualize
Alert
Report

Expected Output
Dashboards
Configurable dashboard that can be customized to show both live data
and analytics for any number of categories such as Malware
Detections, C2 Server activity, Phishing Attacks reported trough
SOCMINT and our own sensors etc.
Alerting
Customizable alerts per events such as live attacks, threshold
breaking, Social Sentiment Analysis and -possibly- any delta over a
preset baseline
Reporting
Detailed reporting on all incidents, incident categories, sources of
attacks, types of threats etc.

Are we building yet another
Open-Source SIEM?
In short: NO. We’re not looking to
replace existing ones. We’re not
looking into compliance and
automated mitigation.
We’re trying to teach our existing
open-source log management
systems some new tricks so they
can help us in detecting complex
threats.

In closing, some nice stats*:
*From a Ponemon survey conducted in H1 2018
Savings
Companies using M.L. to
detect threats save an
estimate of 2.5 million US$
in operating costs
60%$2.5m 60% 69%
Improves Productivity
Companies are positive that
deploying M.L.-based
security tech improves the
productivity of their security
personnel.
More speed
The most significant benefit
of using M.L. for threat
analysis is increased speed.
64% say that the most
significant advantage is the
acceleration in the
containment of infected
endpoints, devices, hosts.
Identify Vulnerabilities
Sixty percent of the
respondents stated that M.L.
identified their application
security vulnerabilties

Needles, Haystacks and Algorithms: Using Machine Learning to detect complex threats

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Needles, Haystacks and Algorithms: Using Machine Learning to detect complex threats

Similar to Needles, Haystacks and Algorithms: Using Machine Learning to detect complex threats (20)

More from DefCamp

More from DefCamp (20)

Recently uploaded

Recently uploaded (20)

Needles, Haystacks and Algorithms: Using Machine Learning to detect complex threats

Editor's Notes