Tactics, Techniques and Procedures (TTPs) represent sophisticated attack patterns in the cybersecurity domain, described encyclopedically in textual knowledge bases. Identifying TTPs in cybersecurity writing, often called TTP mapping, is an important and challenging task. Conventional learning approaches often target the problem in the classical multi-class or multi-label classification setting. This setting hinders the learning ability of the model due to a large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. We formulate the problem in a different learning paradigm, where the assignment of a text to a TTP label is decided by the direct semantic similarity between the two, thus reducing the complexity of competing solely over the large labeling space. To that end, we propose a neural matching architecture with an effective sampling-based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.
1. Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Tu Nguyen, Nedim Šrndić, Alexander Neth
Huawei R&D Munich
{tu.nguyen, nedim.srndic, alexander.neth}@huawei.com
2. Cyber Threat Intelligence (CTI)
“Cyber threat intelligence (CTI) is knowledge, skills and experience-
based information concerning the occurrence and assessment of both
cyber and physical threats and threat actors that is intended to help
mitigate potential attacks and harmful events occurring in cyberspace.”
“Cyber threat intelligence sources include open source intelligence, social
media intelligence, human intelligence, technical intelligence, device log
files, forensically acquired data or intelligence from internet traffic and
data derived from the deep and dark web.”
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
3. Cyber Threat Intelligence (CTI)
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
[1.1] https://attack.mitre.org/techniques/T1566/
[1.2] https://attack.mitre.org/techniques/T1486/
Phishing (T1566) Data Encrypted for Impact (T1486)
4. Cyber Threat Intelligence (CTI)
[...] We witnessed that the botnet was spread via mass phishing, using a VB-scripted Excel
attachment to download the second stage from xx.warez22.info. The same domain was used
for C&C via HTTP. The botnet distributed a file encryption module we named VBenc. [...]
[1] https://en.wikipedia.org/wiki/Cyber_threat_intelligence
[1.1] https://attack.mitre.org/techniques/T1566/
[1.2] https://attack.mitre.org/techniques/T1486/
[1.3] https://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html
Phishing (T1566) Data Encrypted for Impact (T1486)
5. Tactics, Techniques and Procedures (TTP)
[2] https://attack.mitre.org/#
The most high-level (and valuable) concept of CTI is called attack pattern. It
abstracts a security attack by describing its goal ("tactic"), algorithm
("technique") and potential implementations ("procedures").
Over 600 such techniques, 14 tactics and thousands of procedures are curated by
the MITRE ATT&CK [2] ontology, which denotes attack patterns as TTPs (tactics,
techniques and procedures).
6. Title
Description
Metadata
Procedure examples
Mitigations
Detection
References
Scarce information
“Adversaries may steal data by exfiltrating it over a
different protocol than that of the existing command and
control channel.”
Abstract formulation
Implications:
1. Concepts:
a. Adversary
b. (Valuable) data
c. Communication protocol
d. C&C channel
2. Prerequisites:
a. Adversary has infiltrated
b. Adversary established C&C channel
over protocol A
3. Actions:
a. Adversary exfiltrates data over
protocol B != A
Tactics, Techniques and Procedures (TTP)
[3] screenshot from https://attack.mitre.org/techniques/T1048/
7. Semantic understanding → TTP recognition → Attack graph
(diagram; example TTPs in the attack graph: External Remote Services (T1133), Exploit Public-Facing Application (T1190), …)
• CTI entities and relations are very technical and complex
• Understanding and mapping text to TTPs is challenging even for humans
• Learning-based TTP mapping is promising but lacks datasets
• TTPs are numerous and complex
• MITRE ATT&CK refines / extends the KB frequently as attacks evolve, thus
the system needs to be able to adapt / extend to new TTPs
Tactics, Techniques and Procedures (TTP) Mapping
[4] screenshot from https://www.mandiant.com/resources/blog/apt41-initiates-global-intrusion-campaign-using-multiple-exploits
16. Datasets
• ATT&CK procedure examples
• Source: MITRE ATT&CK website
• Large and very cleanly labeled but too short and too simple
• Short text (summarized from threat reports)
• TRAM
• Source: crowdsourced by MITRE
• Short text, noisy labels
• Expert
• Source: diverse and paragraph-level text from threat
reports labeled by security experts
• High quality but relatively small
• Derived ATT&CK procedure examples
• Source: reference links from MITRE ATT&CK
• Paragraph-level text but relatively noisy labels
• Evaluation protocol
• Train on 72.5%, validate on 12.5% and test on 15% of each dataset
• We combine the training splits across datasets.
Experiment Settings
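This protocol can be sketched as follows; the dataset names and sizes are placeholders for illustration, not the actual datasets:

```python
import random

def split_dataset(examples, seed=0):
    """Split one dataset into 72.5% train / 12.5% validation / 15% test."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.725)
    n_val = int(len(shuffled) * 0.125)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Each dataset is split independently; the training splits are then pooled.
datasets = {"procedures": list(range(1000)),
            "tram": list(range(500)),
            "expert": list(range(200))}
splits = {name: split_dataset(data) for name, data in datasets.items()}
combined_train = [x for train, _, _ in splits.values() for x in train]
```

Validation and test splits stay per-dataset, so results can still be reported on each dataset separately.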
18. Model Analysis
• As the negative sample size increases, the model tends to converge
faster and exhibit better performance.
• It appears that there are no additional benefits beyond a size of 60,
which corresponds to 10% of the label space.
• Our models exhibit a more pronounced skewness in their distribution,
resembling that of a pure classification model like NAPKINXC.
• Broader distribution at the head, indicating inclination to assign
comparable probabilities to multiple labels.
Hello everyone, I’m very pleased to present our recent work in the field of AI and cybersecurity. This is joint work by Nedim Šrndić, Alexander Neth and myself. We belong to the AI4Sec team at the Huawei Research Center Munich.
The title of the paper is ‘Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition’. We provide an efficient learning method to recognize attack patterns in textually described cyber attack reports. This is a crucial task for Cyber Threat Intelligence. Let’s get started.
We start with the definition of Cyber Threat Intelligence, quoted from Wikipedia. Cyber Threat Intelligence (CTI), an essential pillar of cybersecurity, involves collecting and analyzing information on cyber threats, including threat actors, their campaigns and malware, supporting timely threat detection and defense efforts.
Textual threat reports or blogs, published over the web by security vendors, are considered an important source of CTI, where security vendors diligently investigate and promptly detail intricate attacks.
We show here a made-up example of what text in a threat report looks like; we also highlight the CTI entities in the text.
There are at least two different attack patterns, associated with different parts of the text. The first is a phishing activity, and the second, described by “the botnet distributed a file encryption module”, is a ‘Data Encrypted for Impact’ attack pattern.
Widely popular in the security community is this pyramid of pain, indicating how challenging and also how valuable it is to figure out which CTI elements are used in a particular attack. The elements seen in our example, the domain name, the tool and the TTPs, all appear here. TTPs, or attack patterns, are considered the tip of this pyramid, indicating that recognizing them is both the hardest and the most valuable part of CTI.
The widely adopted knowledge base, where TTPs are conceptualized, standardized and pre-defined, is called ATT&CK, provided by the MITRE organization. You can see in this visualization how the collection, or ontology, of TTPs looks in the knowledge base.
We show in this slide how an actual technique looks in the knowledge base. The example is ‘Exfiltration Over Alternative Protocol’. Certain metadata is provided together with the TTP: the title, the textual description, the procedure examples and the suggestions on mitigations and detection. At the bottom of the page are the references from which the procedure examples are derived.
For the textual description, we can see that it has an overall abstract formulation. There are concepts to be understood, i.e., the adversary, the valuable data, the communication protocol and the C&C channel. There are also prerequisites, and there are actions that need to be satisfied.
The task we introduce in this work is TTP mapping, essentially extracting the pre-defined TTPs from the text. Here is an illustration of the steps. From a threat report, we first need a semantic understanding of the text; from that, we try to map the interactions of the entities to the attack patterns, or TTPs, in the second step. Ideally, an attack graph can then be represented by the extracted TTPs.
Different steps of the TTP mapping process pose different challenges. For semantic understanding, or text encoding, CTI entities and relations are naturally very technical and complex. For TTP mapping, the task is challenging even for humans. Learning-based TTP mapping is promising, but we lack sizable, high-quality datasets, as TTPs are also numerous and complex. Lastly, MITRE ATT&CK refines and extends the KB frequently as attacks evolve, so the system needs to be able to adapt and extend to new TTPs.
Conventionally, TTP mapping is designed as a vanilla multi-class classification task, where a text is encoded and then classified using a softmax function over the labels, which are the TTPs.
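This conventional setup can be sketched as follows; the random encoder and the dimensions here are hypothetical stand-ins, not an actual trained architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
num_ttps = 600    # roughly the size of the ATT&CK technique label space
hidden_dim = 128  # hypothetical encoder dimension

def encode(text):
    """Stand-in for a neural text encoder; returns a fixed-size vector."""
    return rng.standard_normal(hidden_dim)

# A single linear classification head competing over the whole label space.
W = rng.standard_normal((num_ttps, hidden_dim)) * 0.01

def classify(text):
    """Softmax over all TTP classes at once."""
    logits = W @ encode(text)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

p = classify("the botnet was spread via mass phishing")
```

The point is that every text must compete against all 600+ classes simultaneously, which is exactly what becomes fragile under label skew and noise.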
However, there are certain problems with this softmax-based learning. First, there is the complex, technical writing style of the input text. Then come the challenges of (1) the large label space, (2) the noisy and missing labels, as TTPs are especially hard to annotate precisely, and (3) the typical long-tail label distribution. Together, these challenges pose a great obstacle to this way of learning.
In this work we propose an alternative learning setting where we avoid directly optimizing for discrimination between data points in the large label space. Concretely, we transform the task into a text matching problem between the input text and the TTP. This allows us to utilize the direct semantic similarity between input-label pairs to derive a calibrated assignment score.
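A minimal sketch of this matching view, assuming a simple bi-encoder with cosine similarity as the matching function; the encoder here is a random stand-in, and the actual neural matching architecture in the paper is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

def encode(text):
    """Random stand-in for a neural text encoder."""
    return rng.standard_normal(dim)

def match_score(u, v):
    """Direct semantic similarity (cosine) between a text and a TTP description."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

report = encode("attachment used to download the second stage")
ttp_embeddings = {
    "T1566": encode("Adversaries may send phishing messages ..."),
    "T1486": encode("Adversaries may encrypt data on target systems ..."),
}
# Every TTP is scored against the input text, yielding a natural ranking.
ranked = sorted(ttp_embeddings,
                key=lambda t: match_score(report, ttp_embeddings[t]),
                reverse=True)
```

Because the label enters the model as text rather than as a class index, new or rare TTPs can be scored the same way as frequent ones.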
To support low-resource learning with limited labels, we further introduce Noise Contrastive Estimation (NCE) to add further learning signals to our optimization paradigm.
We provide a quick recap of Noise Contrastive Estimation (NCE). NCE is a powerful parameter estimation method for log-linear models that avoids calculating the partition function (see the formula) at each training step; with cross entropy, this calculation is computationally very demanding for a large label space.
With the NCE estimation method, the target is instead transformed into discrimination between positive and negative examples.
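A minimal sketch of such a sampled contrastive objective, in the InfoNCE form: one positive (text, TTP) pair is discriminated against K sampled negatives, so no sum over the full label space is needed. The scores below are hypothetical matching scores:

```python
import numpy as np

def info_nce_loss(score_pos, scores_neg):
    """Discriminate one positive (text, TTP) pair against K sampled negatives,
    avoiding normalization over the full label space."""
    logits = np.concatenate(([score_pos], scores_neg))
    # numerically stable log-sum-exp over the K+1 candidates
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return float(log_z - score_pos)  # -log p(positive | candidates)

rng = np.random.default_rng(0)
# One positive pair against 60 sampled negatives (~10% of the label space).
loss = info_nce_loss(score_pos=2.0, scores_neg=rng.standard_normal(60))
```

The loss decreases as the positive pair's matching score rises above the sampled negatives, which is the learning signal the matching model is trained on.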
The new text-TTP matching paradigm is very similar to the classification one, except that the scoring function is now also the matching function of the pair, which allows us to naturally rank the pair candidates.
Furthermore, the matching function introduces an inherent ‘inductive bias’, allowing it to perform well also for long-tail or even unseen labels.
Here we provide an example of the characteristics of the ‘ranking’-like TTP distribution. We can see that without NCE the TTPs tend to concentrate more at the top of the ranking, while with NCE the distribution tends to be polarized, with very few TTPs occurring at the top-k ranking, which is ideal for our case.
We now present our proposed NCE-based learn-to-compare framework. The NCE mechanism alleviates the complexity of the moderately sized label space and helps the matching model learn distinctive representations of the labels (TTPs). We further introduce two re-focusing remedies on top of NCE to allow the framework to reduce the level of contrasting against missing or noisy labels.
Overall, our proposed NCE-based models greatly outperform the baselines. In particular, the asymmetric loss-based model achieves the best performance across most metrics and datasets. We also observe significant improvements of the two loss variants (i.e., α-balanced and asymmetric) over the vanilla InfoNCE. In addition, the models demonstrate a substantial improvement at the cutoff threshold @1 (∼10%) compared to @3 (∼5%). This supports the effectiveness of our matching network in classification settings.
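As an illustration of such cutoff metrics, recall@k over a model's ranking can be computed as below; the ranking and ground-truth labels are made up for illustration:

```python
def recall_at_k(ranked_labels, true_labels, k):
    """Fraction of ground-truth TTPs recovered within the top-k of a ranking."""
    return len(set(ranked_labels[:k]) & set(true_labels)) / len(true_labels)

# Hypothetical model ranking and ground truth for a single report.
ranked = ["T1566", "T1486", "T1048", "T1190"]
truth = ["T1566", "T1048"]
r_at_1 = recall_at_k(ranked, truth, 1)  # 0.5: only T1566 is in the top-1
r_at_3 = recall_at_k(ranked, truth, 3)  # 1.0: both TTPs are within the top-3
```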
Interestingly, the Dynamic Triplet loss, which is also a common contrastive learning loss, underperformed in our experiments, indicating that it lacks the probabilistic characteristics required for the ranking task.
We provide further analysis of (1) the size of the negative samples and (2) the properties of our ranking distributions.
Finally, we also compare against ChatGPT 4.0 in our empirical studies. We observe rather interesting answers from the LLM; however, it still underperforms on this task.