Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain, described encyclopedically in textual knowledge bases. Identifying TTPs in cybersecurity writing, often called TTP mapping, is an important and challenging task. Conventional learning approaches often target the problem in the classical multi-class or multi-label classification setting. This setting hinders the learning ability of the model due to a large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.
1. Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Tu Nguyen, Nedim Šrndić, Alexander Neth
Huawei R&D Munich
{tu.nguyen, nedim.srndic, alexander.neth}@huawei.com
2. Cyber Threat Intelligence (CTI)
“Cyber threat intelligence (CTI) is knowledge, skills and experience-
based information concerning the occurrence and assessment of both
cyber and physical threats and threat actors that is intended to help
mitigate potential attacks and harmful events occurring in cyberspace.”
“Cyber threat intelligence sources include open source intelligence, social
media intelligence, human intelligence, technical intelligence, device log
files, forensically acquired data or intelligence from internet traffic and
data derived from the deep and dark web.”
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
3. Cyber Threat Intelligence (CTI)
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
[1.1] https://attack.mitre.org/techniques/T1566/
[1.2] https://attack.mitre.org/techniques/T1486/
Phishing (T1566) Data Encrypted for Impact (T1486)
4. Cyber Threat Intelligence (CTI)
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
[1.1] https://attack.mitre.org/techniques/T1566/
[1.2] https://attack.mitre.org/techniques/T1486/
[1.3] https://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html
Phishing (T1566) Data Encrypted for Impact (T1486)
5. Tactics, Techniques and Procedures (TTP)
[2] https://attack.mitre.org/#
The most high-level (and valuable) concept of CTI is called attack pattern. It
abstracts a security attack by describing its goal ("tactic"), algorithm
("technique") and potential implementations ("procedures").
Over 600 such techniques, 14 tactics and thousands of procedures are curated by
the MITRE ATT&CK [2] ontology, which denotes attack patterns as TTPs (tactics,
techniques and procedures).
6. Title
Description
Metadata
Procedure examples
Mitigations
Detection
References
Scarce information
“Adversaries may steal data by exfiltrating it over a
different protocol than that of the existing command and
control channel.”
Abstract formulation
Implications:
1. Concepts:
a. Adversary
b. (Valuable) data
c. Communication protocol
d. C&C channel
2. Prerequisites:
a. Adversary has infiltrated
b. Adversary established C&C channel
over protocol A
3. Actions:
a. Adversary exfiltrates data over
protocol B != A
Tactics, Techniques and Procedures (TTP)
[3] screenshot from https://attack.mitre.org/techniques/T1048/
7. Semantic understanding → TTP recognition → Attack graph
(diagram; example TTPs in the attack graph: External Remote Services (T1133), Exploit Public-Facing Application (T1190), …)
• CTI entities and relations are very technical and complex
• Understanding and mapping text to TTPs is challenging even for humans
• Learning-based TTP mapping is promising but lacks datasets
• TTPs are numerous and complex
• MITRE ATT&CK refines / extends the KB frequently as attacks evolve, thus
the system needs to be able to adapt / extend to new TTPs
Tactics, Techniques and Procedures (TTP) Mapping
[4] screenshot from https://www.mandiant.com/resources/blog/apt41-initiates-global-intrusion-campaign-using-multiple-exploits
16. Datasets
• ATT&CK procedure examples
• Source: MITRE ATT&CK website
• Large and very cleanly labeled but too short and too simple
• Short text (summarized from threat reports)
• TRAM
• Source: crowdsourced by MITRE
• Short text, noisy labels
• Expert
• Source: diverse and paragraph-level text from threat
reports labeled by security experts
• High quality but relatively small
• Derived ATT&CK procedure examples
• Source: reference links from MITRE ATT&CK
• Paragraph-level text but relatively noisy labels
• Evaluation protocol
• Train on 72.5%, validate on 12.5% and test on 15% of each dataset
• We combine the training splits across datasets.
Experiment Settings
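This protocol can be sketched as follows; the dataset names and sizes are placeholders for illustration, not the actual datasets:

```python
import random

def split_dataset(examples, seed=0):
    """Split one dataset into 72.5% train / 12.5% validation / 15% test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.725)
    n_val = int(len(shuffled) * 0.125)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Each dataset is split independently; the training splits are then pooled.
datasets = {"procedures": list(range(1000)),
            "tram": list(range(500)),
            "expert": list(range(200))}
splits = {name: split_dataset(data) for name, data in datasets.items()}
combined_train = [x for train, _, _ in splits.values() for x in train]
```

Validation and test splits stay per-dataset, so results can still be reported on each dataset separately.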
18. Model Analysis
• As the negative sample size increases, the model tends to converge
faster and exhibit better performance.
• It appears that there are no additional benefits beyond a size of 60,
which corresponds to 10% of the label space.
• Our models exhibit a more pronounced skewness in their distribution,
resembling that of a pure classification model like NAPKINXC.
• Broader distribution at the head, indicating inclination to assign
comparable probabilities to multiple labels.
Hello everyone, I’m very pleased to present our recent work in the field of AI and cybersecurity. This is joint work by Nedim Šrndić, Alexander Neth and myself. We belong to the AI4Sec team at the Huawei Research Center Munich.
The title of the paper is ‘Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition’. We provide an efficient learning method to recognize attack patterns in textually described cyber attack reports. This is a crucial task for Cyber Threat Intelligence. Let’s get started.
We start with the definition of Cyber Threat Intelligence, quoted from Wikipedia. Cyber Threat Intelligence (CTI), an essential pillar of cybersecurity, involves collecting and analyzing information on cyber threats, including threat actors, their campaigns and malware, supporting timely threat detection and defense efforts.
Textual threat reports or blogs, published over the web by security vendors, are considered an important source of CTI, where security vendors diligently investigate and promptly detail intricate attacks.
We show here a made-up example of what text in a threat report looks like; we also highlight the CTI entities in the text.
There are at least two different attack patterns, associated with different parts of the text. The first is a phishing activity, and the second, described by “the botnet distributed a file encryption module”, is a ‘Data Encrypted for Impact’ attack pattern.
Widely popular in the security community is this pyramid of pain, indicating how challenging and also how valuable it is to figure out which CTI elements are used in a particular attack. The elements seen in our example, the domain name, the tool and the TTPs, all appear here. TTPs, or attack patterns, are considered the tip of this pyramid, indicating that recognizing them is both the hardest and the most valuable part of CTI.
The widely adopted knowledge base, where TTPs are conceptualized, standardized and pre-defined, is called ATT&CK, provided by the MITRE organization. You can see in this visualization how the collection, or ontology, of TTPs looks in the knowledge base.
We show in this slide how an actual technique looks in the knowledge base. The example is ‘Exfiltration Over Alternative Protocol’. Certain metadata is provided together with the TTP: the title, the textual description, the procedure examples and the suggestions on mitigations and detection. At the bottom of the page are the references from which the procedure examples are derived.
For the textual description, we can see that it has an overall abstract formulation. There are concepts to be understood, i.e., the adversary, the valuable data, the communication protocol and the C&C channel. There are also prerequisites, and there are actions that need to be satisfied.
The task we introduce in this work is TTP mapping, essentially extracting the pre-defined TTPs from the text. Here is an illustration of the steps. From a threat report, we first need a semantic understanding of the text; from that, we try to map the interactions of the entities to the attack patterns, or TTPs, in the second step. Ideally, an attack graph can then be represented by the extracted TTPs.
Different steps of the TTP mapping process pose different challenges. For semantic understanding, or text encoding, CTI entities and relations are naturally very technical and complex. For TTP mapping, the task is challenging even for humans. Learning-based TTP mapping is promising, but we lack sizable, high-quality datasets, as TTPs are also numerous and complex. Lastly, MITRE ATT&CK refines and extends the KB frequently as attacks evolve, so the system needs to be able to adapt and extend to new TTPs.
Conventionally, TTP mapping is designed as a vanilla multi-class classification task, where a text is encoded and then classified using a softmax function over the labels, which are the TTPs.
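This conventional setup can be sketched as follows; the random encoder and the dimensions here are hypothetical stand-ins, not an actual trained architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
num_ttps = 600    # roughly the size of the ATT&CK technique label space
hidden_dim = 128  # hypothetical encoder dimension

def encode(text):
    """Stand-in for a neural text encoder; returns a fixed-size vector."""
    return rng.standard_normal(hidden_dim)

# A single linear classification head competing over the whole label space.
W = rng.standard_normal((num_ttps, hidden_dim)) * 0.01

def classify(text):
    """Softmax over all TTP classes at once."""
    logits = W @ encode(text)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

p = classify("the botnet was spread via mass phishing")
```

The point is that every text must compete against all 600+ classes simultaneously, which is exactly what becomes fragile under label skew and noise.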
However, there are certain problems with this softmax-based learning. First, there is the complex, technical writing style of the input text. Then come the challenges of (1) the large label space, (2) the noisy and missing labels, as TTPs are especially hard to annotate precisely, and (3) the typical long-tail label distribution. Together, these challenges pose a great obstacle to this way of learning.
In this work we propose an alternative learning setting where we avoid directly optimizing for discrimination between data points in the large label space. Concretely, we transform the task into a text matching problem between the input text and the TTP. This allows us to utilize the direct semantic similarity between input-label pairs to derive a calibrated assignment score.
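A minimal sketch of this matching view, assuming a simple bi-encoder with cosine similarity as the matching function; the encoder here is a random stand-in, and the actual neural matching architecture in the paper is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

def encode(text):
    """Random stand-in for a neural text encoder."""
    return rng.standard_normal(dim)

def match_score(u, v):
    """Direct semantic similarity (cosine) between a text and a TTP description."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

report = encode("attachment used to download the second stage")
ttp_embeddings = {
    "T1566": encode("Adversaries may send phishing messages ..."),
    "T1486": encode("Adversaries may encrypt data on target systems ..."),
}
# Every TTP is scored against the input text, yielding a natural ranking.
ranked = sorted(ttp_embeddings,
                key=lambda t: match_score(report, ttp_embeddings[t]),
                reverse=True)
```

Because the label enters the model as text rather than as a class index, new or rare TTPs can be scored the same way as frequent ones.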
To support low-resource learning with limited labels, we further introduce Noise Contrastive Estimation (NCE) to add further learning signals to our optimization paradigm.
We provide a quick recap of Noise Contrastive Estimation (NCE). NCE is a powerful parameter estimation method for log-linear models that avoids calculating the partition function (see the formula) at each training step; with cross entropy, this calculation is computationally very demanding for a large label space.
With the NCE estimation method, the target is instead transformed into discrimination between positive and negative examples.
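A minimal sketch of such a sampled contrastive objective, in the InfoNCE form: one positive (text, TTP) pair is discriminated against K sampled negatives, so no sum over the full label space is needed. The scores below are hypothetical matching scores:

```python
import numpy as np

def info_nce_loss(score_pos, scores_neg):
    """Discriminate one positive (text, TTP) pair against K sampled negatives,
    avoiding normalization over the full label space."""
    logits = np.concatenate(([score_pos], scores_neg))
    # numerically stable log-sum-exp over the K+1 candidates
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return float(log_z - score_pos)  # -log p(positive | candidates)

rng = np.random.default_rng(0)
# One positive pair against 60 sampled negatives (~10% of the label space).
loss = info_nce_loss(score_pos=2.0, scores_neg=rng.standard_normal(60))
```

The loss decreases as the positive pair's matching score rises above the sampled negatives, which is the learning signal the matching model is trained on.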
The new text-TTP matching paradigm is very similar to the classification one, except that the scoring function is now also the matching function of the pair, which allows us to naturally rank the pair candidates.
Furthermore, the matching function introduces an inherent ‘inductive bias’, allowing it to perform well also for long-tail or even unseen labels.
Here we provide an example of the characteristics of the ‘ranking’-like TTP distribution. We can see that without NCE the TTPs tend to concentrate more at the top of the ranking, while with NCE the distribution tends to be polarized, with very few TTPs occurring at the top-k ranking, which is ideal for our case.
We now present our proposed NCE-based learn-to-compare framework. The NCE mechanism alleviates the complexity of the moderately sized label space and helps the matching model learn distinctive representations of the labels (TTPs). We further introduce two re-focusing remedies on top of NCE to allow the framework to reduce the level of contrasting against missing or noisy labels.
Overall, our proposed NCE-based models greatly outperform the baselines. In particular, the asymmetric loss-based model achieves the best performance across most metrics and datasets. We also observe significant improvements of the two loss variants (i.e., α-balanced and asymmetric) over the vanilla InfoNCE. In addition, the models demonstrate a substantial improvement at the cutoff threshold @1 (∼10%) compared to @3 (∼5%). This supports the effectiveness of our matching network in classification settings.
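As an illustration of such cutoff metrics, recall@k over a model's ranking can be computed as below; the ranking and ground-truth labels are made up for illustration:

```python
def recall_at_k(ranked_labels, true_labels, k):
    """Fraction of ground-truth TTPs recovered within the top-k of a ranking."""
    return len(set(ranked_labels[:k]) & set(true_labels)) / len(true_labels)

# Hypothetical model ranking and ground truth for a single report.
ranked = ["T1566", "T1486", "T1048", "T1190"]
truth = ["T1566", "T1048"]
r_at_1 = recall_at_k(ranked, truth, 1)  # 0.5: only T1566 is in the top-1
r_at_3 = recall_at_k(ranked, truth, 3)  # 1.0: both TTPs are within the top-3
```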
Interestingly, the Dynamic Triplet loss, which is also a common contrastive learning loss, underperformed in our experiments, indicating that it lacks the probabilistic characteristics required for the ranking task.
We provide further analysis of (1) the size of the negative samples and (2) the properties of our ranking distributions.
Finally, we also compare against ChatGPT 4.0 in our empirical studies. We observe rather interesting answers from the LLM; however, it still underperforms on this task.