Beyond Matching: Applying data science techniques to IOC-based detection
(#BeyondMatching)
Alex Pinto - Chief Data Scientist – Niddel
@alexcpsec
@NiddelCorp
• Security Data Scientist
• Capybara Enthusiast
• Co-Founder and Chief Data Scientist at Niddel (@NiddelCorp)
• Lead of MLSec Project (@MLSecProject)
Who am I?
• What is a Niddel?
• Niddel provides a SaaS-based Autonomous Threat Hunting System
• Research from this talk was performed using anonymized Niddel data and uses concepts implemented in its products.
• Not a vendor-centric talk; the focus is on learning and on y’all being able to reproduce this.
• The Promise of IOCs
• 7 Habits of Highly Effective Analysts (ok, only 3)
• Nation-State APT Detection Deluxe Recipe
• Data Science to Assist on Pivoting
• Maliciousness Ratio
• Maliciousness Rating
• Revisiting TIQ-TEST – Telemetry Test
Agenda
The Promise of IOCs
If you haven’t implemented Threat Intelligence feeds in your organization,
I will reveal the ending of your upcoming grueling journey.
Apologies in advance
Promise - Some Definitions First
• IOCs: Indicators of Compromise
• CTI: Cyber Threat Intelligence
• Will be using them interchangeably during this presentation
• IOCs -> technical data that allows for ”tactical” discovery of a potential compromise on a system
• We will be focusing on network IOCs in this talk
Little Bobby Comics by @RobertMLee and Jeff Haas
Promise – Sounds Great! Sign me up!
• Not so fast, my friend
• Main challenges with IOC consumption:
• Quality and Curation
• Vetting and quality control
• Open feeds vs paid feeds
• Manual vs automated
• Velocity and Volume
• How to operationalize?
• Add to SIEM?
• Block in Firewall / Web Proxy?
Promise – Quality and Velocity at Odds
• AIS – Threat Intel sharing initiative from the US Department of Homeland Security
• I fully support sharing (see previous intel sharing decks from 2015)
• But if we are resigned to this level of quality (”it is what it is”), how can CTI / IOCs be shaped into a useful tool at scale?
Promise – Current Implementation Strategies
1. Alerting based on matching with IOC data:
• By being careful and only matching on more ”precise” indicators (URLs >> IPs), you can reduce the number of False Positives, but it is still challenging
2. Using IOC data to build context for existing alerts:
• Safer bet, but you are not adding any detection power to existing controls
SPOILER ALERT:
Everyone starts with (1) because ”the FPs can’t be that bad”, and then begrudgingly moves to (2) because there is not enough time in the world to go through all the noise that (1) generates.
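As a minimal sketch of what strategy (1) boils down to in practice (Python; file paths and field names such as "domain" and "url" are hypothetical), this is essentially all the ”detection” raw matching gives you:

```python
# Minimal sketch of strategy (1): alert on every raw IOC match in proxy logs.
# File paths and field names ("domain", "url") are hypothetical.
import csv

def load_iocs(path):
    """One network indicator per line: domains, IPs or full URLs."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def match_proxy_log(log_path, iocs):
    """Yield every log row whose destination matches an indicator."""
    with open(log_path) as f:
        for row in csv.DictReader(f):
            if row.get("domain", "").lower() in iocs or row.get("url", "").lower() in iocs:
                yield row  # with low-quality feeds this quickly becomes a firehose of FPs

if __name__ == "__main__":
    alerts = list(match_proxy_log("proxy_log.csv", load_iocs("iocs.txt")))
    print(f"{len(alerts)} raw IOC matches to triage")
```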
Sad Intermission
DISCLAIMER: Could not find a picture of a sad capybara. Not sure there is one.
What makes analysts effective?
• They learn from the examples!!
• They don’t look at IOCs as a ”finished product”, but as a way to learn from the attacker infrastructure.
• After understanding and researching samples of the data, they can extrapolate the TTPs (Tactics, Techniques and Procedures) of the attackers to build defenses.
Pyramid of Pain from @DavidJBianco
Internet Infrastructure 101
Actually, ”everything” is connected
Nation-State APT Detection Deluxe Recipe
When your ”favorite IR company” blames FROSTY PENGUIN for an attack:
1. Find a piece of malware on the compromised organization
2. Extract the ”non-benign” places it connects to (real work here, BTW)
3. Pivot on Internet infrastructure to find related IPs / Domains / URLs
4. Search for these on the org, find more malware (Hunting, FTW!)
5. Repeat steps 1-4 until no more new malware
6. Remediate the organization (hopefully!)
7. Publish report or blog post to great fanfare
8. PROFIT (or at least media attention and sales leads)
Data Science to Assist on Pivoting
• Doing it ourselves: begin with data collection
1. Get IOCs from your favorite / available providers – there are a few options that are fairly good. Please do select according to your collection criteria.
2. ”Enrich” the data to gather the ”pivot points” and find the connections.
Combine (https://github.com/mlsecproject/combine) can help with IOC gathering and enrichment for ASN data and pDNS (if you have a Farsight pDNS key).
• IP Addresses:
  • AS number
  • BGP prefix
  • Country
  • pDNS relationship to domains
• Domain names:
  • pDNS relationship to IPs
  • WHOIS registrations
  • SOA
  • NS servers
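A rough sketch of this enrichment step is below (Python). It assumes the MaxMind GeoLite2 ASN and Country databases plus the python-whois package are available locally; the pDNS lookups are omitted since they require a Farsight API key.

```python
# Sketch of IOC enrichment into pivot points. Assumes local MaxMind GeoLite2
# databases (GeoLite2-ASN.mmdb, GeoLite2-Country.mmdb) and the python-whois
# package; passive DNS enrichment is left out (needs a Farsight API key).
import geoip2.database
import whois

asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")
country_reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def enrich_ip(ip):
    """Pivot points for an IP indicator: AS number/org and country."""
    asn = asn_reader.asn(ip)
    country = country_reader.country(ip)
    return {
        "ioc": ip,
        "asn": asn.autonomous_system_number,
        "as_org": asn.autonomous_system_organization,
        "country": country.country.iso_code,
    }

def enrich_domain(domain):
    """Pivot points for a domain indicator: registrant e-mail and NS servers."""
    rec = whois.whois(domain)
    return {
        "ioc": domain,
        "registrant_email": rec.emails,
        "name_servers": rec.name_servers,
    }
```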
Data Collection – Example With RIG EK
WHOIS registrant e-mail on a small sample of RIG EK domains in Oct 2016:
Data Collection – Example With RIG EK
This one is NOT Domain Shadowing – an active actor registering e-mails:
Data Collection – Example With RIG EK
Autonomous System / Country where the IPs are located, RIG EK sample – Oct 2016:
Data Collection – Example With RIG EK
Autonomous System where the IPs are located, RIG EK sample – Oct 2016:
Data Aggregation – RIG EK Example
In summary: let’s create a different graph for each one of the pivoting points and measure the cardinality of the node connectedness, as in the sketch below.
AS48096 - ITGRAD
AS16276 – OVH SAS
AS14576 – Hosting Solution Ltd (actually king-servers.com)
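A minimal sketch of that aggregation (Python, assuming networkx and the enrichment output sketched earlier): one bipartite graph per pivot type, where a pivot node’s degree is the number of distinct indicators connected to it.

```python
# Sketch of the aggregation step: one bipartite graph per pivot type
# (ASN, country, registrant e-mail, ...). The degree of a pivot node is
# simply how many distinct indicators are connected to it.
import networkx as nx

def build_pivot_graphs(enriched_iocs, pivot_fields=("asn", "country")):
    graphs = {field: nx.Graph() for field in pivot_fields}
    for ioc in enriched_iocs:
        for field in pivot_fields:
            value = ioc.get(field)
            if value is not None:
                graphs[field].add_edge(("ioc", ioc["ioc"]), (field, value))
    return graphs

def pivot_cardinality(graph):
    """Number of distinct IOCs connected to each pivot point value."""
    return {node[1]: deg for node, deg in graph.degree() if node[0] != "ioc"}
```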
Data Aggregation – Context Matters
• What if my favorite websites are actually hosted at those pivoting points?
• I mean, there are a few ”ok” things on .com and .org
Maliciousness Ratio
Let’s build similar aggregation metrics for the ”good places” your organization visits.
We propose a ratio that compares the cardinality of the node connectedness:
• Bpp – count of ”bad entities” connected to a specific pivoting point
• Gpp – count of ”good entities” connected to a specific pivoting point
$$MR_{pp} = \frac{B_{pp}}{G_{pp} + B_{pp}}$$
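In code form (a small sketch reusing the cardinality dictionaries from the aggregation sketch above; the example counts are purely illustrative):

```python
# Maliciousness Ratio: fraction of the entities connected to a pivot point
# value that are known-bad. bad_counts / good_counts map pivot values to the
# cardinalities computed during aggregation.
def maliciousness_ratio(bad_counts, good_counts):
    ratios = {}
    for pivot, bad in bad_counts.items():
        good = good_counts.get(pivot, 0)
        ratios[pivot] = bad / (good + bad)
    return ratios

# Illustrative numbers only:
print(maliciousness_ratio({"AS48096": 7}, {"AS48096": 1}))  # {'AS48096': 0.875}
```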
Hold on!! Good Places on the Internet?
• Creating and maintaining whitelists is MUCH HARDER than blacklists
• Some tips:
• Use your own telemetry - given the base rate fallacy, places that ”everyone” goes to are more likely to be benign
• Rarity does not mean bad (shut up, UEBA people), but high visitation almost always means good
• Harvest data from your own security tools, like web filters (if you trust them)
• Very shallow scoops of Alexa Top Sites. Very. Shallow.
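A toy sketch of the telemetry tip (field names and the threshold are illustrative): destinations visited by many distinct internal hosts go on the ”good places” side of the ratio.

```python
# Sketch of a telemetry-driven "good places" list: destinations visited by
# many distinct internal hosts are, per the base rate argument, far more
# likely to be benign. Field names and the threshold are illustrative.
from collections import defaultdict

def likely_benign_destinations(proxy_rows, min_distinct_hosts=50):
    visitors = defaultdict(set)
    for row in proxy_rows:
        visitors[row["domain"]].add(row["src_ip"])
    return {dom for dom, hosts in visitors.items() if len(hosts) >= min_distinct_hosts}
```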
Maliciousness Ratio – Examples
• Telemetry from a pool of Niddel customers:
  • AS48096 – ITGRAD 87.5%
  • Country RU 5.2%
  • .org TLD 2.9%
• Looking at the base rate:
  • ASN Base Rate 0.6%
  • Country Base Rate 0.58%
  • TLD Base Rate 1.9%
• Severe outliers below the base rate may indicate that the IOC is invalid
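Back-of-the-envelope, treating each base rate as the average ratio for that pivot type, these examples sit very different distances above their base rates:
• 0.875 / 0.006 ≈ 146x for AS48096 (ITGRAD)
• 0.052 / 0.0058 ≈ 9x for country RU
• 0.029 / 0.019 ≈ 1.5x for the .org TLD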
Maliciousness Rating
• A ratio from 0 to 1 can be cool for math people, but how risky are those things anyway?
• We need to compare it to the base rate to have a good measure
• We propose a maliciousness rating, which expresses how much more likely a connection to a specific pivoting point is to be bad than a connection to an average pivoting point of that kind on the Internet.
$$MRT_{pp} = \frac{MR_{pp}}{\frac{1}{n}\sum_{i=1}^{n} MR_{pp}(i)}$$
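As a sketch in code (dividing each ratio by the mean ratio across all observed pivot points of that kind, consistent with the formula above):

```python
# Maliciousness Rating: how many times more "bad" a pivot point value is than
# the average pivot point of the same kind (its base rate).
def maliciousness_rating(ratios):
    base_rate = sum(ratios.values()) / len(ratios)
    return {pivot: mr / base_rate for pivot, mr in ratios.items()}
```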
Maliciousness Rating – Sample Distributions
Challenges with the Approach
• How can we best define the cut-off scores on all those potential maliciousness ratings?
• How do we combine and weight the multivariate composition of these pivoting points? (A naive starting point is sketched below.)
• The solution is probably unique per company, including understanding telemetry patterns, risk appetite for FPs / FNs, and decision points on when to block and when to alert on something.
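One naive way to start experimenting with that combination is a weighted score per indicator over its pivot-point ratings; the weights below are placeholders, and whether a linear combination is even appropriate is exactly the per-company decision the slide refers to.

```python
# Naive sketch of combining per-pivot ratings into one score per indicator.
# Weights are placeholders; every organization will tune or replace this.
WEIGHTS = {"asn": 0.4, "country": 0.1, "registrant_email": 0.5}

def indicator_score(pivots, ratings):
    """pivots: {pivot_type: value}; ratings: {pivot_type: {value: MRT}}."""
    return sum(
        WEIGHTS.get(ptype, 0.0) * ratings.get(ptype, {}).get(value, 1.0)
        for ptype, value in pivots.items()
    )
```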
What if the challenges had been solved?
A More Involved Example (1)
A More Involved Example (2)
Build the campaign based on the relationships - they all share the same support infrastructure on the IP Address and Name Servers.
Shia LaBeouf Approves
One more thing…
Going back to TIQ-Test
• Biggest criticism of TIQ-Test (mostly self-inflicted) is that it was always relative, not absolute.
• How can one define what a ”good” feed is?
• Does that even make sense?
• It is easy to tell if a feed is bad (lots of FPs, low curation)
• My thought process:
• Maybe with telemetry, you can identify an ”applicable” feed
• Or ”actionable” if you like your Cybersecurity with extra camo
• Actual alert IOC accounting: percentage of the matches of a specific feed that were actual alerts or incidents at an organization.
• Actual alert UNIQUE IOC accounting: percentage of the UNIQUE matches (contributed only by that feed) that were actual alerts or incidents at an organization.
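A sketch of how those two numbers could be computed per feed (Python; feed_matches maps each feed to the set of its indicators that matched telemetry, incidents is the set of indicators tied to confirmed alerts or incidents):

```python
# Sketch of the two "Telemetry Test" metrics per feed.
def feed_accounting(feed_matches, incidents):
    scores = {}
    for feed, matches in feed_matches.items():
        # Indicators also contributed by any other feed
        others = set().union(*(m for f, m in feed_matches.items() if f != feed))
        unique = matches - others
        scores[feed] = {
            "alert_pct": len(matches & incidents) / len(matches) if matches else 0.0,
            "unique_alert_pct": len(unique & incidents) / len(unique) if unique else 0.0,
        }
    return scores
```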
Challenges with the Approach (2)
• How does one define a valid alert or incident?
• There are not many options other than improving the understanding and growth of your IR practice:
  • Your own incident history (for the 1%-ers)
  • Your own CTI / IOC creation processes (for the 0.01%-ers)
• The ”Telemetry Test” has been INVALUABLE for Niddel on partnership and feed selection
• ”My Threat Intelligence Can Beat Up Your Threat Intelligence” (h/t Rick Holland)
• How much value does a feed add anyway? Look for unique contributions.
No magic this time – Improve your IR processes
Takeaways
• Lots of ideas to implement, go go go!!
• IOCs (and CTI in general, for that matter) are not a complete waste of time. They are just raw data, and need to be refined in order to be used properly
• Bringing automation (and simplicity of use) to threat intelligence and threat hunting is paramount to take its usability from the 1% of orgs to a broader audience at scale
Thanks!
• Share, like, subscribe, EDM outro
• Q&A and Feedback please!
Alex Pinto – alexcp@niddel.com
@alexcpsec
@NiddelCorp
Little Bobby Comics by @RobertMLee and Jeff Haas
