CTI crawling and classification in Cyber-Trust

Advanced Cyber-Threat Intelligence, Detection and
Mitigation Platform for a Trusted Internet of Things
Mining for cyber-threat intelligence to
improve cyber-security risk mitigation
Panel on Cyber-security Intelligence
2019 Community of Users Workshop
Nicholas Kolokotronis
Department of Informatics and Telecommunications
University of Peloponnese • nkolok@uop.gr

Cyber-threat intelligence
▪ From unstructured (textual)
high-volume data to
o Vulnerabilities/exploits
o Links to CVE/other VDB IDs
o Threat actors’ TTPs
o Specific products/platforms
o Popularity, price, …
o CVSS => measurable
▪ CTI needs to be compliant
with legal requirements

Cyber-defense goals
▪ Accurate modelling of the
attack strategies
▪ Determine the attackers’
capabilities
o constrained resources (budget,
tools, etc.)
▪ The attackers’ goals vary
depending on the target
o access level, degrade QoS, …
▪ Define the defender’s
available actions
o possible counter-measures
o highlight parameters
▪ Cyber-defense needs to minimize
the attack surface

Dynamic risk analysis
▪ Security properties should be measurable

Dynamic risk analysis: attack models

Example: exploitation probability
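▪ Needs to be measurable
o Estimated from CVSS metrics
o $P(e_i) = 2 \times AV \times AC \times Au$, where $AV$, $AC$ and $Au$ are the
CVSS access vector, access complexity and authentication scores
▪ Likewise for an attack’s attempt probability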
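The formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming the metrics are CVSS v2 base metrics (the `Au` metric only exists in v2) and using the numeric weights from the CVSS v2 specification; the function name and the vector-string parsing are illustrative choices, not part of the slides.

```python
# CVSS v2 base-metric weights (values taken from the CVSS v2 specification).
ACCESS_VECTOR = {"L": 0.395, "A": 0.646, "N": 1.0}        # Local, Adjacent, Network
ACCESS_COMPLEXITY = {"H": 0.35, "M": 0.61, "L": 0.71}     # High, Medium, Low
AUTHENTICATION = {"M": 0.45, "S": 0.56, "N": 0.704}       # Multiple, Single, None


def exploitation_probability(vector: str) -> float:
    """Return P(e_i) = 2 * AV * AC * Au for a CVSS v2 vector string.

    Only the AV, AC and Au components are used; other components
    (C, I, A, ...) are parsed but ignored.
    """
    metrics = dict(part.split(":") for part in vector.split("/"))
    av = ACCESS_VECTOR[metrics["AV"]]
    ac = ACCESS_COMPLEXITY[metrics["AC"]]
    au = AUTHENTICATION[metrics["Au"]]
    return 2 * av * ac * au


# Remotely exploitable, low complexity, no authentication: close to 1.
print(exploitation_probability("AV:N/AC:L/Au:N/C:P/I:P/A:P"))
# Local access, high complexity, multiple authentications: much lower.
print(exploitation_probability("AV:L/AC:H/Au:M/C:P/I:P/A:P"))
```

The factor 2 keeps the result in [0, 1]: the CVSS v2 exploitability subscore is 20 × AV × AC × Au on a 0 to 10 scale, so dividing it by 10 gives the expression used here.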

ML – from CTI to structured TTPs
▪ Conversion of CTIs to a semi-structured format (JSON, XML)
▪ Filtering for specific information (TTPs, exploits) has the following benefits:
o More easily processed in an automated way
o Only condensed information will be available
o Reports will still be readable
▪ A known format for attack patterns is STIX v2.1
▪ The conversion of CTIs into actionable information can be
achieved using ML techniques

Threat actions identification

CTI generation process

Data pre-processing
1. Need a crawler that gathers all
pages from the web
o CTI vendors (e.g. Symantec)
o Forums, blogs, etc.
2. Sanitize content and keep all
textual information as articles
o Remove HTML tags, images,
etc.
3. Automated decision on the
CTI value of each article
o articles without CTI value
are dropped
Classifier needed with a
number of features, like:
▪ Word count (CTIs with
elaborate TTPs tend to be
longer)
▪ Security action word
density (security-correlated
verbs)
▪ Security target word
density (security-correlated
nouns)
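Steps 2 and 3 can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the two vocabularies (`ACTION_WORDS`, `TARGET_WORDS`) are tiny hypothetical stand-ins for the security-correlated verbs and nouns, and the function names are not taken from Cyber-Trust. The resulting feature dictionary is the kind of input a classifier would consume; the classifier itself is not shown.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of enclosing <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def sanitize(html: str) -> str:
    """Strip HTML markup and collapse whitespace, keeping only text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical vocabularies; real ones would be far larger.
ACTION_WORDS = {"overwrite", "terminate", "inject", "exploit", "execute"}
TARGET_WORDS = {"file", "process", "registry", "credentials", "firmware"}


def features(text: str) -> dict:
    """Compute word count and the two word densities for one article."""
    words = re.findall(r"[a-z]+", text.lower())
    n = len(words)
    if n == 0:
        return {"word_count": 0, "action_density": 0.0, "target_density": 0.0}
    return {
        "word_count": n,
        "action_density": sum(w in ACTION_WORDS for w in words) / n,
        "target_density": sum(w in TARGET_WORDS for w in words) / n,
    }


page = "<html><script>var x=1;</script><p>Malware can <b>overwrite</b> a file</p></html>"
article = sanitize(page)
print(article)
print(features(article))
```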

[CT] CTI crawling and classification
▪ Crawling components used in Cyber-Trust

[CT] CTI crawling and classification
▪ Clear/Deep/Forum web crawling in Cyber-Trust
o Implement topic-specific crawling on publicly available web sites
▶︎ focus on Deep/Dark web sites that don’t require authentication
o Model Builder is responsible for creating the classification
model; needs a set of positive and negative URLs.
o Seed Finder identifies the initial seed of URLs to crawl based on
a user-defined query, e.g. on “IoT vulnerabilities”
o The crawled websites go through the Article/Forum Parser,
which extracts the useful text part of each one
▶︎ internally, forums are structured differently from regular websites
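The interplay of seed URLs, a relevance model and topic-specific crawling can be sketched as a priority-driven crawl loop. This is a simplified illustration, not the Cyber-Trust implementation: the keyword-overlap `relevance` function is a stand-in for the trained classification model, `fetch` is a hypothetical callable returning `(text, links)`, and the in-memory `WEB` dictionary replaces real HTTP requests so the sketch runs offline.

```python
import heapq


def relevance(text, query_terms):
    """Fraction of words that belong to the topic vocabulary."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in query_terms for w in words) / len(words)


def focused_crawl(seeds, fetch, query_terms, threshold=0.1, max_pages=100):
    """Visit pages best-first; expand links only from on-topic pages."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    kept = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        score = relevance(text, query_terms)
        if score < threshold:
            continue  # off-topic page: neither kept nor expanded
        kept.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page that referenced it.
                heapq.heappush(frontier, (-score, link))
    return kept


# Hypothetical in-memory "web": url -> (text, outgoing links).
WEB = {
    "seed": ("iot vulnerabilities in routers", ["a", "b"]),
    "a": ("new iot botnet exploits camera vulnerabilities", ["c"]),
    "b": ("cooking recipes for the weekend", ["d"]),
    "c": ("iot firmware exploits", []),
    "d": ("iot vulnerabilities", []),
}

kept = focused_crawl(["seed"], WEB.get, {"iot", "vulnerabilities", "exploits"})
print([url for url, _ in kept])
```

Page `d` is never visited here because it is only reachable through the off-topic page `b`; that is the trade-off of focused crawling, which saves bandwidth at the cost of missing relevant pages behind irrelevant ones.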

Dynamic risk analysis (enhanced)
▪ Security properties should be measurable

Data pre-processing
▪ Security correlated verbs/nouns are extracted from CVEs,
CAPEC, CWE repositories using NLP techniques
o Used on each article to find all OVS (Object, Verb, Subject) triplets;
these are candidate threat actions
▪ CTIs contain strings that an NLP parser may not understand,
such as IoCs
o To remedy this, we temporarily substitute them with
placeholders using RegEx
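The temporary substitution of IoCs can be sketched as follows. This is a minimal illustration: the four patterns (URL, IPv4 address, SHA-256 and MD5 hashes) and the placeholder naming scheme are assumptions chosen for the example, not the expressions used in Cyber-Trust, and they are deliberately loose (for instance, the IPv4 pattern does not check octet ranges).

```python
import re

# Order matters: URLs first (they may contain IP addresses),
# longer hashes before shorter ones.
IOC_PATTERNS = [
    ("URL", re.compile(r"\bhttps?://[^\s\"'<>]+", re.IGNORECASE)),
    ("IPV4", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("SHA256", re.compile(r"\b[a-fA-F0-9]{64}\b")),
    ("MD5", re.compile(r"\b[a-fA-F0-9]{32}\b")),
]


def mask_iocs(text):
    """Replace each IoC with a word-like placeholder; return text and mapping."""
    mapping = {}

    def make_repl(kind):
        def repl(match):
            token = f"{kind}_{len(mapping)}"
            mapping[token] = match.group(0)
            return token
        return repl

    for kind, pattern in IOC_PATTERNS:
        text = pattern.sub(make_repl(kind), text)
    return text, mapping


def unmask_iocs(text, mapping):
    """Restore the original IoCs after NLP parsing."""
    if not mapping:
        return text
    # Longest tokens first so that e.g. IPV4_10 is not mistaken for IPV4_1.
    tokens = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


sentence = ("The dropper connects to 192.0.2.10 and downloads "
            "http://malware.example/a.exe with hash d41d8cd98f00b204e9800998ecf8427e")
masked, mapping = mask_iocs(sentence)
print(masked)
print(unmask_iocs(masked, mapping) == sentence)
```

After masking, each IoC reads like an ordinary noun, so a parser can still identify the verb and its object; the mapping allows the original values to be restored once the triplets have been extracted.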

TTP specific ontology
▪ An ontology built from the TTPs provided by the ATT&CK and
CAPEC repositories (MITRE)
Class name       | Class description                      | Example
Kill chain phase | Phase information, e.g. name or order  | Control or 5
Tactic           | Description of how to achieve a phase  | Privilege escalation
Technique        | Description of how to achieve a tactic | DLL injection
Threat action    | Verb associated with malicious action  | Overwrite, Terminate
Object           | The action’s target                    | File, Process
Pre-condition    | Action prerequisites that have to hold | User access
Intent           | Goal/subgoal of an action              | Run malicious code

Towards threat actions
▪ Compute the similarity of each candidate action with all records in the ontology
▪ Information Retrieval (IR) score is compared against a threshold
▪ Vocabulary based on synonyms (e.g. from WordNet) or custom-built
▪ The best-scoring class is assigned to the threat action
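The matching step can be sketched as below. This is a simplified illustration under several assumptions: Jaccard overlap of term sets stands in for a proper IR scoring function, the small `SYNONYMS` dictionary stands in for a WordNet-based or custom vocabulary, and the two ontology records with their labels (`record-A`, `record-B`) are hypothetical, built only from the example verbs and objects in the ontology table.

```python
# Hypothetical synonym vocabulary: surface form -> canonical term.
SYNONYMS = {
    "erase": "overwrite",
    "kill": "terminate",
    "stop": "terminate",
    "document": "file",
}

# Hypothetical ontology records (threat action + object).
ONTOLOGY = [
    {"label": "record-A", "action": "overwrite", "object": "file"},
    {"label": "record-B", "action": "terminate", "object": "process"},
]


def canonical(term):
    """Map a term to its canonical form via the synonym vocabulary."""
    term = term.lower()
    return SYNONYMS.get(term, term)


def score(candidate, record):
    """Jaccard similarity between candidate and record term sets."""
    cand_terms = {canonical(t) for t in candidate}
    rec_terms = {record["action"], record["object"]}
    union = cand_terms | rec_terms
    return len(cand_terms & rec_terms) / len(union) if union else 0.0


def classify(candidate, threshold=0.5):
    """Return the label of the best-scoring record, or None if below threshold."""
    best = max(ONTOLOGY, key=lambda record: score(candidate, record))
    return best["label"] if score(candidate, best) >= threshold else None


print(classify(("kill", "process")))     # verb/object pair from an article
print(classify(("erase", "document")))
print(classify(("read", "email")))       # no sufficiently similar record
```

The threshold prevents weak matches from being labelled as threat actions: a candidate that shares only one term with a record scores 1/3 here and is rejected.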

[CT] CTI classification
▪ Topic vocabulary in Cyber-Trust
o XML docs converted into text
via XML Data Retriever
o Normalizer drops symbols,
converts to lowercase, etc.
o Collected tags are multi-word
terms given to Multi-Word
Expression Tokenizer
▶︎ “exploit kits” => “exploit-kits”
o Word2Vec finds the similarity
between terms
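The normalization and multi-word tokenization steps can be sketched in plain Python. This is a minimal illustration: the exact normalization rules and the function names are assumptions, and the Word2Vec training step is left out because it requires an external library.

```python
import re


def normalize(text):
    """Lowercase the text and drop symbols, keeping letters, digits and hyphens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s-]", " ", text)
    return " ".join(text.split())


def mwe_tokenize(text, expressions, sep="-"):
    """Split on whitespace, merging known multi-word expressions into one token."""
    tokens = text.split()
    # Try longer expressions first so they win over their own prefixes.
    mwes = sorted((tuple(e.split()) for e in expressions), key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for mwe in mwes:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                out.append(sep.join(mwe))
                i += len(mwe)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


tags = ["exploit kits", "denial of service"]
sentence = normalize("New Exploit Kits enable Denial of Service on IoT devices!")
print(mwe_tokenize(sentence, tags))
```

Merging each multi-word tag into a single token lets the embedding model learn one vector for the whole expression; token lists like the one printed above are the kind of input a Word2Vec implementation is trained on.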

[CT] CTI classification
▪ Example top terms in Cyber-Trust collection for tag ddos

CTI sharing: using STIX
▪ Structured language for
any CTI
o supports a wide range of use cases
o can focus on relevant aspects
▪ High level of recognition by
CSIRTs and LEAs
▪ Combined with TAXII 2.0
o OSS implementations
▪ Supported by MISP
Attack pattern SDO
{
  "type": "attack-pattern",
  "id": "attack-pattern--xyz…",
  "created": "2017-06-08T08:17:27.000Z",
  "modified": "2017-06-08T08:17:27.000Z",
  "name": "Input Capture",
  "description": "Adversary logs keystrokes to obtain credentials",
  "kill_chain_phases": "Maintain",
  "external_references": [ {
    "source_name": "ATT&CK",
    "external_id": "T1056"
  } ]
}
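An attack pattern object like the one above can be assembled programmatically. The sketch below uses only the Python standard library and assumes the STIX 2.1 property names (`spec_version`, `kill_chain_name`/`phase_name` inside `kill_chain_phases`, `external_id` inside `external_references`). The kill chain name `example-kill-chain` is a placeholder that does not come from the slides, and the helper names are illustrative.

```python
import json
import uuid
from datetime import datetime, timezone


def stix_timestamp(dt):
    """Format an aware datetime as a UTC timestamp with millisecond precision."""
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{dt.microsecond // 1000:03d}Z"


def make_attack_pattern(name, description, kill_chain_name, phase_name,
                        source_name, external_id, now=None):
    """Build a dictionary shaped like a STIX 2.1 attack-pattern SDO."""
    ts = stix_timestamp(now or datetime.now(timezone.utc))
    return {
        "type": "attack-pattern",
        "spec_version": "2.1",
        "id": f"attack-pattern--{uuid.uuid4()}",  # random UUID for this sketch
        "created": ts,
        "modified": ts,
        "name": name,
        "description": description,
        "kill_chain_phases": [
            {"kill_chain_name": kill_chain_name, "phase_name": phase_name}
        ],
        "external_references": [
            {"source_name": source_name, "external_id": external_id}
        ],
    }


sdo = make_attack_pattern(
    name="Input Capture",
    description="Adversary logs keystrokes to obtain credentials",
    kill_chain_name="example-kill-chain",  # placeholder, not from the slides
    phase_name="maintain",
    source_name="ATT&CK",
    external_id="T1056",
    now=datetime(2017, 6, 8, 8, 17, 27, tzinfo=timezone.utc),
)
print(json.dumps(sdo, indent=2))
```

For production use, a dedicated STIX library that validates objects against the specification would be preferable to hand-built dictionaries.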

CTI sources’ quality aspects
▪ Existence of conflicting data among sources
▪ Several techniques can be used to assess the credibility of a source
o Using special-purpose ranking engines (e.g. SimilarWeb)
▶︎ A combination of metrics (page views, unique site users, web traffic, etc.)
▶︎ Include some Dark Web sites
o Number of users (useful for Dark Web sites)
o Number of posts per day
o Number of CVEs per day
▶︎ More than 3/4 of vulnerabilities are publicly reported online about 7 days before they appear in the NVD
▶︎ Mainly concerns Dark Web, paste sites, and cyber-criminal forums
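One simple way to combine such metrics into a single credibility score is a weighted sum of min-max normalized values. The sketch below is only an illustration of that idea: the source names, metric values and weights are all hypothetical, and the slides do not prescribe this particular scoring scheme.

```python
def normalize_column(values):
    """Min-max normalize a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sources(sources, weights):
    """Return (name, score) pairs sorted from most to least credible."""
    names = list(sources)
    scores = dict.fromkeys(names, 0.0)
    for metric, weight in weights.items():
        column = normalize_column([sources[name][metric] for name in names])
        for name, value in zip(names, column):
            scores[name] += weight * value
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sources and metric values.
SOURCES = {
    "forum-a": {"users": 1000, "posts_per_day": 50, "cves_per_day": 2},
    "blog-b": {"users": 200, "posts_per_day": 5, "cves_per_day": 1},
    "paste-c": {"users": 50, "posts_per_day": 100, "cves_per_day": 0},
}
# Hypothetical weights; they sum to 1 so scores stay in [0, 1].
WEIGHTS = {"users": 0.3, "posts_per_day": 0.3, "cves_per_day": 0.4}

for name, value in rank_sources(SOURCES, WEIGHTS):
    print(f"{name}: {value:.3f}")
```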

Use of CTI in Cyber-Trust
[Figure labels: CTI sharing; dark web, deep web, clear web]

Conclusions - challenges
▪ ML can be used to convert CTI into structured and
actionable formats
▪ Technical challenges for coping with heterogeneity and
volume of cyber-threat data
o Need for (semi-)automated means of processing
o Focused and topic-based crawling can improve performance
o Deep/dark web exploration presents additional challenges
o Big data management and NoSQL stores for efficiency
▪ Legal compliance and privacy-preserving data mining?
