HACT_Fast_Search_COINS_pub.pdf

HAC-T and Fast Search for
Similarity in Security
Jonathan Oliver
Muqeet Ali
Josiah Hagen
Trend Micro Research

Hashes in Security
q Cryptographic hashes (SHA256, MD5) are
very convenient
q Similarity Digests retain the convenience of
hashes
q Can measure the distance (or similarity) between
two files

Purpose
q Goodware
q Fareit
q Emotet
q Ursnif

TLSH: Quick Intro
q 2 versions of chrome.exe
SHA256:
c70b8cbb2ac962b343535454e4f2bcb3e48d83a04792c64bc768d59b3c1bf403
723aa4a407160bd99430de690f1f0d34af4a6622e2c44fe95be3bda3d7c344b3
TLSH
1c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db
c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db
Distance = 9
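As a rough illustration of the contrast above, the sketch below counts how many hex positions differ between the two SHA256 values and between the two TLSH digests from this slide. Counting differing positions is only a crude stand-in chosen for illustration; it is not the TLSH distance computation, and the helper name `differing_positions` is hypothetical.

```python
def differing_positions(a: str, b: str) -> int:
    """Count positions at which two equal-length hex strings differ."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

# Values copied from the slide (two versions of chrome.exe).
SHA256_A = "c70b8cbb2ac962b343535454e4f2bcb3e48d83a04792c64bc768d59b3c1bf403"
SHA256_B = "723aa4a407160bd99430de690f1f0d34af4a6622e2c44fe95be3bda3d7c344b3"
TLSH_A = "1c159d11f445c1b7e5b211b2d879ba71467cbc28832641db63987e1a3db03d23a3b6db"
TLSH_B = "c4159d11f445c1b7d5b211b2d47dba71467cbc28832a40db63987e1a3eb43d22a3b6db"

sha_diff = differing_positions(SHA256_A, SHA256_B)
tlsh_diff = differing_positions(TLSH_A, TLSH_B)

# This is NOT the TLSH distance; it only shows that the similarity digests
# stay close to each other while the cryptographic hashes do not.
print(f"SHA256: {sha_diff} of {len(SHA256_A)} hex positions differ")
print(f"TLSH:   {tlsh_diff} of {len(TLSH_A)} hex positions differ")
```

The SHA256 values differ almost everywhere, while the TLSH digests share most of their characters, which is what makes a small distance such as 9 possible.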

Previous Work
q Ssdeep
DFRWS 2006: “Identifying Almost Identical Files Using Context Triggered Piecewise Hashing”
q Sdhash
Research Advances in Digital Forensics VI, 2010: “Data Fingerprinting with Similarity Digests”
q TLSH
CTC 2013: “TLSH: A Locality Sensitive Hash”
https://github.com/trendmicro/tlsh
q TLSH compared independently
CODASPY 2018: “Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity
Hashes in Binary Analysis”
MTD 2018: “Quantifying the Effectiveness of Software Diversity using Near-Duplicate Detection
Algorithms”
TLSH is a part of STIX 2.1 https://docs.oasis-open.org/cti/stix/v2.1/cs01/stix-v2.1-cs01.html

Criteria
q General purpose similarity measure
§ Domain agnostic
q Accuracy
q Resistant to attack
q Search / Clustering
§ Efficient
§ Scalable

Indexes and Trees
q KD trees and regression trees use features / coordinates
at the nodes
q Problems with high dimensional data
Ref: https://stackoverflow.com/questions/28028618/2d-kd-tree-and-nearest-neighbour-search

TLSH Tree (type of Metric Tree)
Nodes contain (item, distance)
[Figure: a tree node holding TLSH1 with a Left Subtree and a Right Subtree; distance labels 125 and 225. Triangle inequality: Dist(SearchItem, Right Subtree) >= 100]
Ref: Vantage Point Trees
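The bound in the figure can be written out from the triangle inequality. In this sketch, v is the item stored at a node, s is the search item, and x is any item in the right subtree; these symbols are introduced here for illustration and do not appear in the slides.

```latex
d(v, x) \le d(v, s) + d(s, x)
\quad\Longrightarrow\quad
d(s, x) \ge d(v, x) - d(v, s)
```

Under one assumed reading of the figure's labels, with d(v, s) = 125 and every item in the right subtree at distance at least 225 from v, this gives d(s, x) >= 225 - 125 = 100, which matches the bound shown.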

Using TLSH Trees
Either
q Use backtracking (based on the triangle
inequality) to optimally search a tree
OR
q Search through a forest of trees for an
approximate nearest neighbor
q Similarity search in O(log N) comparisons
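The backtracking option can be sketched with a small vantage-point style tree. This is a minimal illustration under stated assumptions, not the TLSH library's implementation: `hamming` is a stand-in metric rather than the TLSH distance, and `Node`, `build`, and `nearest` are hypothetical names.

```python
import random
from statistics import median

def hamming(a: str, b: str) -> int:
    """Stand-in metric (NOT the TLSH distance): number of differing positions."""
    return sum(x != y for x, y in zip(a, b))

class Node:
    def __init__(self, item, radius, left, right):
        self.item = item      # vantage item stored at this node
        self.radius = radius  # split distance: left holds d <= radius, right holds d > radius
        self.left = left
        self.right = right

def build(items, dist):
    """Recursively split items by their distance to the first item."""
    if not items:
        return None
    vantage, rest = items[0], items[1:]
    if not rest:
        return Node(vantage, 0, None, None)
    radius = median(dist(vantage, x) for x in rest)
    left = [x for x in rest if dist(vantage, x) <= radius]
    right = [x for x in rest if dist(vantage, x) > radius]
    return Node(vantage, radius, build(left, dist), build(right, dist))

def nearest(node, query, dist, best=(None, float("inf"))):
    """Exact nearest-neighbour search with backtracking."""
    if node is None:
        return best
    d = dist(query, node.item)
    if d < best[1]:
        best = (node.item, d)
    # Descend into the side the query falls on first.
    if d <= node.radius:
        near, far = node.left, node.right
    else:
        near, far = node.right, node.left
    best = nearest(near, query, dist, best)
    # Backtrack into the other side only when the triangle inequality
    # cannot rule out a closer item there.
    if abs(d - node.radius) <= best[1]:
        best = nearest(far, query, dist, best)
    return best

# Usage on random 16-character hex strings (made-up data).
random.seed(0)
HEX = "0123456789abcdef"
items = ["".join(random.choice(HEX) for _ in range(16)) for _ in range(200)]
query = "".join(random.choice(HEX) for _ in range(16))
tree = build(items, hamming)
found_item, found_dist = nearest(tree, query, hamming)
print(found_item, found_dist)
```

Dropping the backtracking step and querying several differently built trees instead would correspond to the approximate forest option, trading exactness for fewer comparisons.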

Using Trees for
Ssdeep / Sdhash
Creates unbalanced trees
Reason: Ssdeep / Sdhash
q have a limited range (0, 100)
q are not metric-like
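One way to see how a capped score range could unbalance a tree is to split a set of distances at their median, as a vantage-point style node might. The numbers below are hypothetical values chosen for illustration, not measurements, and the sketch assumes that a 0 to 100 similarity score is turned into a distance that saturates at 100 for unrelated files.

```python
from statistics import median

def split_sizes(distances):
    """Split at the median distance and return (left size, right size)."""
    radius = median(distances)
    left = sum(1 for d in distances if d <= radius)
    return left, len(distances) - left

# Hypothetical distances from one vantage item (illustration only).
capped = [100] * 90 + [35, 40, 55, 60, 62, 70, 75, 80, 88, 95]  # most pairs saturate at 100
spread = list(range(10, 410, 4))                                  # 100 distinct, uncapped values

print(split_sizes(capped))  # every item falls on one side
print(split_sizes(spread))  # an even split
```

When most distances sit at the cap, the median equals the cap and the split sends everything to one subtree, so the tree degenerates toward a list.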

The HAC-T
Clustering Algorithm

Clustering
Hierarchical Agglomerative Clustering (HAC) is normally
infeasible at scale, as it requires O(N^2) comparisons
HAC-T (single linkage clustering)
Inputs: N items
Threshold T
1. ListPair = Pre-process the N items to identify the
closest item for each item in O(N log N)
2. Put each item in a cluster of size 1
3. Merge clusters (using ListPair) if two items have
dist(Item1, Item2) ≤ T
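The three steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: step 1 uses a brute-force nearest-neighbour scan where HAC-T would use the tree search, absolute difference on integers stands in for the TLSH distance, and the names `hac_t_sketch` and `list_pair` are hypothetical.

```python
def hac_t_sketch(items, dist, threshold):
    """Merge each item with its closest item when their distance is <= threshold."""
    n = len(items)

    # Step 1 (stand-in): find the closest other item for each item.
    # A brute-force scan is O(N^2); the slides use a tree search instead.
    list_pair = []
    for i in range(n):
        best_j, best_d = None, float("inf")
        for j in range(n):
            if i != j:
                d = dist(items[i], items[j])
                if d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            list_pair.append((best_d, i, best_j))

    # Step 2: every item starts in its own cluster (union-find forest).
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step 3: merge the clusters of each close-enough pair.
    for d, i, j in sorted(list_pair):
        if d <= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(items[i])
    return sorted(clusters.values())

# Usage with made-up one-dimensional data.
points = [1, 2, 3, 50, 52, 200]
result = hac_t_sketch(points, lambda a, b: abs(a - b), threshold=5)
print(result)
```

In this toy run, 200 has no neighbour within the threshold and stays in a cluster of size 1, which illustrates how T controls which items are left unclustered.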

Experimental Setup and Data
q Commodity cloud 32-core machine with 128 GB of
memory and an AMD EPYC 7000 series processor
q Data was sourced from the VirusTotal data feed.
§ Consists of PE files
§ Includes scan results from five major antivirus
vendors
(Kaspersky, Microsoft, Symantec, Sophos, McAfee)
§ Scan results include labels (whether the file was
detected by the AV vendor or not)
§ We compute the TLSH hash of the PE files

Evaluation Methodology
q We want to measure cluster quality based on
labels provided by VirusTotal.
q We used purity to measure cluster quality:
§ Purity score varies from 0 to 1 (0 for worst and
1 for best cluster quality)
§ Computing the purity score scales linearly with
the sample size (N)
q We investigated other cluster quality measures, such as
the silhouette coefficient, but they do not scale well.
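A minimal sketch of a purity computation, assuming the common definition: for each cluster take the count of its most frequent label, sum these counts, and divide by the number of samples. The cluster assignments and labels below are made up for illustration, and the function name `purity` is hypothetical.

```python
from collections import Counter

def purity(cluster_ids, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    if len(cluster_ids) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    by_cluster = {}
    # One pass over the samples, so the cost grows linearly with N.
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)

# Made-up example: three clusters, one of them mixed.
example_clusters = [0, 0, 0, 1, 1, 2]
example_labels = ["emotet", "emotet", "goodware", "ursnif", "ursnif", "fareit"]
score = purity(example_clusters, example_labels)
print(score)
```

Here five of the six samples match their cluster's majority label, so the score is 5/6. A silhouette computation, by contrast, needs distances between many pairs of samples, which is one reason it scales poorly.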

Comparison of Different
Clustering Methods (cont.)
q Simple K-Medoid computes the medoid based on the
average TLSH distance computed for each sample within
a cluster
q Other techniques compared include DBSCAN and
CLARANS (another K-Medoid based algorithm)
q We compared the silhouette coefficient and run
times at various sample sizes
q Conclusion: K-Means/K-Medoid approaches produce
poor cluster quality while DBSCAN produces good
cluster quality but has scalability issues
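The medoid step mentioned above can be sketched as picking the cluster member with the smallest average distance to the cluster. This is an assumed reading of the slide, with absolute difference on integers standing in for the TLSH distance and `medoid` as a hypothetical helper name.

```python
def medoid(cluster, dist):
    """Return the member with the smallest average distance to all members."""
    if not cluster:
        raise ValueError("cluster must not be empty")

    def average_distance(candidate):
        # The distance from a member to itself is 0, so including it
        # does not change which member has the smallest average.
        return sum(dist(candidate, other) for other in cluster) / len(cluster)

    return min(cluster, key=average_distance)

# Made-up example with one outlier.
sample_cluster = [1, 2, 3, 4, 20]
center = medoid(sample_cluster, lambda a, b: abs(a - b))
print(center)
```

Each medoid update compares every member of a cluster with every other member, so the cost grows quadratically with cluster size.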

Comparison of Different
Clustering Methods
q We compared several well-known clustering techniques (K-Means,
K-Medoid, CLARANS, DBSCAN) and show that they are
either not scalable or do not produce good-quality clusters
that are useful for security analysis
q We adapted K-Means to cluster TLSH hashes as follows:
§ We convert each TLSH hash digest to a 70-character
vector and calculate the mean of the vectors (m)
§ We find the TLSH hash digest (d) that is closest to
m by computing the TLSH distance
§ We use d as the mean in the K-Means algorithm
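The adaptation above can be sketched as follows. Several details are assumptions, since the slides do not specify them: each hex character is mapped to its integer value, the mean vector is rounded back to a hex string so that a digest distance can be applied to it, and a simple per-character difference (`l1_hex_distance`) stands in for the TLSH distance. All function names are hypothetical, and the short strings are made-up stand-ins for 70-character digests.

```python
def digest_to_vector(digest):
    """Map each hex character of a digest to its integer value (0 to 15)."""
    return [int(ch, 16) for ch in digest]

def mean_vector(digests):
    """Component-wise mean (m) of the digest vectors."""
    vectors = [digest_to_vector(d) for d in digests]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def l1_hex_distance(a, b):
    """Stand-in distance (NOT the TLSH distance): sum of per-character differences."""
    return sum(abs(int(x, 16) - int(y, 16)) for x, y in zip(a, b))

def representative(digests, dist):
    """Return the digest (d) closest to the rounded mean vector."""
    m = mean_vector(digests)
    # Assumption: round the mean back to hex characters so dist() can compare it.
    m_digest = "".join(format(round(v), "x") for v in m)
    return min(digests, key=lambda d: dist(d, m_digest))

# Made-up 4-character strings standing in for 70-character TLSH digests.
toy_digests = ["1c15", "1c17", "1d15", "f00d"]
chosen = representative(toy_digests, l1_hex_distance)
print(chosen)
```

Because the returned value is always one of the input digests, the K-Means loop can keep using a digest-to-digest distance at every iteration.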

Clustering Quality and Run Time
for K-Means/K-Medoid/DBSCAN

Clustering Quality and
Run-Time for HAC-T
q We first report cluster quality for HAC-T using
the silhouette coefficient for 10,000 samples, then
scale to millions…
q Parameter T affects clustering quality and the
percentage of noise (samples not clustered)

Scaling HAC-T to Tens of
Millions and Beyond…
q HAC-T scales as O(N log N)
q HAC-T produces good cluster quality (measured
by purity)
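To give a feel for the difference in growth, the sketch below compares the number of pairwise comparisons with N log N for a few sample sizes. Constant factors are ignored and the base-2 logarithm is an assumption, so these are orders of magnitude only, not measured run times.

```python
import math

def comparison_counts(n):
    """Return (all-pairs comparisons, N * log2(N)) for n items."""
    all_pairs = n * (n - 1) // 2
    n_log_n = n * math.log2(n)
    return all_pairs, n_log_n

for n in (10_000, 1_000_000, 10_000_000):
    all_pairs, n_log_n = comparison_counts(n)
    print(f"N={n:>10,}  all pairs={all_pairs:.2e}  N log N={n_log_n:.2e}")
```

At 10 million items the all-pairs count is on the order of 10^13, while N log N is on the order of 10^8.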

Conclusion
q HAC-T was evaluated on up to 10 million samples,
and could easily scale to tens of millions of
samples or more
q We have shown that it produces good-quality
clusters, with purity scores in the range 0.97 to
0.98
q Unlike K-Means/K-Medoid approaches, it does not
need k as an input parameter, and it produces
well-formed clusters useful for security analysis

Thank You
q Open Source Library
https://github.com/trendmicro/tlsh
q Jonathan Oliver: jon_oliver@trendmicro.com
q Muqeet Ali: muqeet_ali@trendmicro.com
q Josiah Hagen: josiah_hagen@trendmicro.com
