This document discusses applying data science techniques to indicator of compromise (IOC)-based detection. It describes challenges with using IOCs, including quality issues and high volume. The document proposes using data enrichment and aggregation to identify relationships between IOCs and measure their "maliciousness ratio" and "rating". This provides context to better determine if an IOC represents a true compromise. The approach aims to make threat intelligence more useful at scale.
2. Who am I?
• Security Data Scientist
• Capybara Enthusiast
• Co-Founder and Chief Data Scientist at Niddel
(@NiddelCorp)
• Lead of MLSec Project (@MLSecProject)
• What is a Niddel?
• Niddel provides a SaaS-based Autonomous Threat Hunting System
• Research from this talk was performed using anonymized Niddel data and
uses concepts implemented in its products.
• Not a vendor-centric talk; the focus is on learning and on enabling you to reproduce this.
3. Agenda
• The Promise of IOCs
• 7 Habits of Highly Effective Analysts (ok, only 3)
• Nation-State APT Detection Deluxe Recipe
• Data Science to Assist on Pivoting
• Maliciousness Ratio
• Maliciousness Rating
• Revisiting TIQ-TEST – Telemetry Test
5. Promise - Some Definitions First
• IOCs: Indicators of compromise
• CTI: Cyber Threat Intelligence
• Will be using them interchangeably
during this presentation
• IOCs -> technical data that allows for
"tactical" discovery of a potential
compromise on a system
• We will be focusing on network IOCs in
this talk
Little Bobby Comics by @RobertMLee and Jeff Haas
6. Promise – Sounds Great! Sign me up!
• Not so fast, my friend
• Main challenges with IOC consumption:
• Quality and Curation
• Vetting and quality control
• Open feeds vs Paid feeds
• Manual vs Automated
• Velocity and Volume
• How to operationalize?
• Add to SIEM?
• Block in Firewall / Web Proxy?
7. Promise – Quality and Velocity at Odds
• AIS (Automated Indicator Sharing) – threat
intel sharing initiative from the US Department of Homeland Security
• I fully support sharing (see previous
intel sharing decks from 2015)
• But if we are resigned to this level of
quality ("it is what it is"), how can CTI /
IOCs be shaped into a useful tool at
scale?
8. Promise – Current Implementation Strategies
1. Alerting based on matching with IOC data:
• By being careful, only matching on more "precise" indicators (URLs >> IPs),
you can reduce the number of False Positives, but it is still challenging
2. Using IOC data to build context for existing alerts:
• Safer bet, but you are not adding any detection power to existing controls
SPOILER ALERT:
Everyone starts with (1) because "the FPs can't be that bad", and then begrudgingly
moves to (2) because there is not enough time in the world to go through all the
noise that (1) generates.
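Strategy (1) can be sketched as a simple set-membership lookup against telemetry. This is a hypothetical minimal example: the log field names and sample data are assumptions, not from any real product.

```python
# Strategy (1): alert on exact matches between telemetry and an IOC feed.
# The feed contents and proxy-log schema below are illustrative only.
ioc_domains = {"evil-cdn.example", "rig-gate.example"}  # vetted IOC feed

proxy_logs = [
    {"src": "10.0.0.5", "dst_domain": "evil-cdn.example"},
    {"src": "10.0.0.9", "dst_domain": "intranet.corp.example"},
]

# Every match becomes an alert -- which is exactly why noisy feeds
# overwhelm analysts: alert volume scales with feed (low) quality.
alerts = [log for log in proxy_logs if log["dst_domain"] in ioc_domains]
print(alerts)
```

Strategy (2) would instead attach the IOC hit as context to an alert an existing control has already raised, rather than generating a new one.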
13. Data Science to Assist on Pivoting
• Doing it ourselves – begin with data collection:
1. Get IOCs from your favorite / available providers – there are a few options
that are fairly good. Please do select according to your collection criteria.
2. "Enrich" the data to gather the "pivot points" and find the connections.
Combine (https://github.com/mlsecproject/combine) can help with IOC gathering
and enrichment for ASN data and pDNS (if you have a Farsight pDNS key)
• IP Addresses:
• AS number
• BGP prefix
• Country
• pDNS relationship to domains
• Domain names:
• pDNS relationship to IPs
• WHOIS Registrations
• SOA
• NS Servers
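Grouping enriched IOCs by each pivot point is what surfaces shared infrastructure. A minimal sketch, assuming enrichment has already produced records like those below (the IPs, ASNs, and field names are made up; in practice the values would come from tools such as Combine or a pDNS provider):

```python
from collections import defaultdict

# Hypothetical enrichment output: one record per IOC with its pivot points.
enriched_iocs = [
    {"ioc": "203.0.113.10", "asn": "AS48096", "country": "RU"},
    {"ioc": "203.0.113.25", "asn": "AS48096", "country": "RU"},
    {"ioc": "198.51.100.7", "asn": "AS16276", "country": "FR"},
]

# Group IOCs by a pivot key (here: ASN) to find indicators that share
# infrastructure. The same pattern works for BGP prefix, NS servers, etc.
by_asn = defaultdict(list)
for rec in enriched_iocs:
    by_asn[rec["asn"]].append(rec["ioc"])

print(dict(by_asn))
```

The same grouping, repeated per pivot point, is the raw material for the graphs built in the aggregation step.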
19. Data Aggregation – Rig EK Example
In summary: let's create a separate graph for each of the pivot points and measure the
cardinality of the node connectedness (i.e., node degree)
AS48096 - ITGRAD
AS16276 – OVH SAS
AS14576 – Hosting Solution Ltd
(actually king-servers.com)
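The "cardinality of node connectedness" can be computed as the degree of each pivot node in a bipartite IOC-to-pivot graph. A sketch using plain counting (the ASN names are from the slide; the IOC IPs are illustrative placeholders):

```python
from collections import Counter

# Bipartite edges: (IOC IP, ASN pivot node). IPs here are made up;
# the ASNs are the ones observed in the Rig EK example.
edges = [
    ("203.0.113.10", "AS48096 - ITGRAD"),
    ("203.0.113.25", "AS48096 - ITGRAD"),
    ("198.51.100.7", "AS16276 - OVH SAS"),
    ("192.0.2.44", "AS14576 - Hosting Solution Ltd"),
    ("192.0.2.45", "AS14576 - Hosting Solution Ltd"),
]

# Degree of each ASN node = how many known-bad IOCs connect to it.
# Higher degree suggests infrastructure more heavily used by the campaign.
degree = Counter(asn for _, asn in edges)
for asn, count in degree.most_common():
    print(asn, count)
```

A graph library (e.g. networkx) would give the same degrees plus richer pivoting, but the counting above captures the core measurement.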