Similarity digests have gained popularity for many
security applications like blacklisting/whitelisting, and finding
similar variants of malware. TLSH has been shown to be
particularly good at hunting similar malware, and is resistant to
evasion as compared to other similarity digests like ssdeep and
sdhash. Searching and clustering are fundamental tools that
help security analysts and security operations center (SOC)
operators hunt and analyze malware. Current approaches
to clustering malware do not scale well enough to keep
up with the vast amount of malware and goodware available
in the wild. In this paper, we present techniques for
fast search and clustering of TLSH hash digests, which
can help analysts inspect large amounts of malware/goodware.
Our approach builds on fast nearest neighbor search techniques
to build a tree-based index which performs fast search based
on TLSH hash digests. The tree-based index is used in our
threshold based Hierarchical Agglomerative Clustering (HAC-T)
algorithm which is able to cluster digests in a scalable manner.
Our clustering technique can cluster digests in O(n log n) time on
average. We performed an empirical evaluation by comparing our
approach with many standard and recent clustering techniques.
We demonstrate that our approach is much more scalable while
still producing good cluster quality. We measured
cluster quality using purity on 10 million samples obtained from
VirusTotal. We obtained a high purity score in the range from
0.97 to 0.98 using labels from five major anti-virus vendors
(Kaspersky, Microsoft, Symantec, Sophos, and McAfee) which
demonstrates the effectiveness of the proposed method.
HACT_Fast_Search_COINS_pub.pdf
1. HAC-T and Fast Search for Similarity in Security
Jonathan Oliver
Muqeet Ali
Josiah Hagen
Trend Micro Research
2. Hashes in Security
• Cryptographic hashes (SHA256, MD5) are very convenient
• Similarity digests retain the convenience of hashes
• Similarity digests can also measure the distance (or similarity) between 2 files
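The contrast on this slide can be illustrated with a minimal sketch using only Python's standard `hashlib`. The byte strings are made-up placeholders, not real samples. A one-byte change produces an unrelated SHA256 value, so comparing cryptographic hashes says nothing about how similar two files are, which is the gap a similarity digest such as TLSH is meant to fill.

```python
import hashlib

def hamming_bits(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two hypothetical "files" that differ only in their last byte.
original = b"MZ" + b"example payload " * 64
variant = original[:-1] + b"!"

h1 = hashlib.sha256(original).digest()
h2 = hashlib.sha256(variant).digest()

# The digests are unrelated even though the inputs are nearly identical;
# for a good cryptographic hash, roughly half of the 256 bits are expected to differ.
differing = hamming_bits(h1, h2)
print(f"{differing} of 256 digest bits differ")
```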
5. Previous Work
• Ssdeep
  DFRWS 2006: "Identifying Almost Identical Files Using Context Triggered Piecewise Hashing"
• Sdhash
  Research Advances in Digital Forensics VI, 2010: "Data Fingerprinting with Similarity Digests"
• TLSH
  CTC 2013: "TLSH: A Locality Sensitive Hash"
  https://github.com/trendmicro/tlsh
• TLSH compared independently
  CODASPY 2018: "Beyond Precision and Recall: Understanding Uses (and Misuses) of Similarity Hashes in Binary Analysis"
  MTD 2018: "Quantifying the Effectiveness of Software Diversity using Near-Duplicate Detection Algorithms"
TLSH is a part of STIX 2.1: https://docs.oasis-open.org/cti/stix/v2.1/cs01/stix-v2.1-cs01.html
8. Indexes and Trees
• KD trees and regression trees use features / coordinates at the nodes
• Problems with high dimensional data
Ref: https://stackoverflow.com/questions/28028618/2d-kd-tree-and-nearest-neighbour-search
9. TLSH Tree (type of Metric Tree)
Nodes contain
(item, distance)
12. Using TLSH Trees
Either
• Use backtracking (based on the triangle inequality) to search a tree optimally
OR
• Search through a forest of trees for an approximate nearest neighbor
• Similarity search in O(log N) comparisons
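The backtracking option can be sketched as follows. This is a generic vantage-point-style metric tree, not the TLSH library's own tree layout, which the slides do not specify. Each node keeps an item and a split distance, and the triangle inequality decides whether the other branch can be skipped. The `hamming` function on integers is a stand-in for the TLSH distance, and all names here are hypothetical. The pruning is exact only for a distance that obeys the triangle inequality.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Dist = Callable[[int, int], int]

def hamming(a: int, b: int) -> int:
    """Stand-in metric (NOT the TLSH distance): number of differing bits."""
    return bin(a ^ b).count("1")

@dataclass
class Node:
    item: int                       # pivot item stored at this node
    radius: int = 0                 # inner side holds items with dist <= radius
    inner: Optional["Node"] = None
    outer: Optional["Node"] = None

def build(items: List[int], dist: Dist = hamming) -> Optional[Node]:
    """Recursively split the items around a pivot at the median distance."""
    if not items:
        return None
    pivot, rest = items[0], items[1:]
    node = Node(pivot)
    if rest:
        scored = [(dist(pivot, x), x) for x in rest]
        node.radius = sorted(d for d, _ in scored)[len(scored) // 2]
        node.inner = build([x for d, x in scored if d <= node.radius], dist)
        node.outer = build([x for d, x in scored if d > node.radius], dist)
    return node

def nearest(node: Optional[Node], query: int, dist: Dist = hamming,
            best: Tuple[Optional[int], float] = (None, float("inf"))
            ) -> Tuple[Optional[int], float]:
    """Exact nearest-neighbor search; returns (item, distance)."""
    if node is None:
        return best
    d = dist(query, node.item)
    if d < best[1]:
        best = (node.item, d)
    # Descend first into the side of the split that contains the query.
    if d <= node.radius:
        first, second = node.inner, node.outer
    else:
        first, second = node.outer, node.inner
    best = nearest(first, query, dist, best)
    # Triangle inequality: every item on the other side is at least
    # |d - radius| away from the query, so backtrack only if that bound
    # still leaves room for something closer than the current best.
    if abs(d - node.radius) < best[1]:
        best = nearest(second, query, dist, best)
    return best
```

For example, `nearest(build(items), query)` returns the closest item and its distance. The forest option on the slide trades this exactness for speed by skipping the backtracking step and searching several trees instead.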
13. Using Trees for Ssdeep / Sdhash
Creates unbalanced trees
Reason: Ssdeep / Sdhash scores
• have a limited range (0 to 100)
• are not metric-like
15. Clustering
Hierarchical Agglomerative Clustering (HAC): normally
infeasible, as it requires O(N^2) comparisons
HAC-T (single linkage clustering)
Inputs: N items
        Threshold T
1. ListPair = Pre-process the N items to identify the closest item for each item, in O(N log N)
2. Put each item in a cluster of size 1
3. Merge clusters (using ListPair) if two items have dist(Item1, Item2) ≤ T
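The three steps above can be sketched as follows. This illustrates only what the slide lists, not the authors' implementation. The closest-item step is done here by brute force for brevity, where the slide uses the tree index to stay within O(N log N). The `hamming` function on integers stands in for the TLSH distance, and the union-find bookkeeping and all names are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

def hamming(a: int, b: int) -> int:
    """Stand-in metric (NOT the TLSH distance): number of differing bits."""
    return bin(a ^ b).count("1")

def closest_pairs(items: Sequence[int],
                  dist: Callable[[int, int], int]) -> List[Tuple[int, int, int]]:
    """Step 1: for each item index i, find its closest other item j.

    Brute force (quadratic) for clarity; a tree index would replace this.
    """
    pairs = []
    for i, a in enumerate(items):
        j = min((k for k in range(len(items)) if k != i),
                key=lambda k: dist(a, items[k]))
        pairs.append((i, j, dist(a, items[j])))
    return pairs

def hac_t(items: Sequence[int], threshold: int,
          dist: Callable[[int, int], int] = hamming) -> List[int]:
    """Return one cluster id per item; unmerged items keep their own id."""
    # Step 2: every item starts in its own cluster (union-find forest).
    parent = list(range(len(items)))
    if len(items) < 2:
        return parent

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    # Step 3: merge the clusters of each close pair within the threshold.
    for i, j, d in closest_pairs(items, dist):
        if d <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(items))]
```

With `threshold=1`, the items `[0b0000, 0b0001, 0b0011, 0b11110000, 0b11110001, 0xAAAA]` form two clusters, and the last item stays alone as noise because its closest neighbor is farther than T.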
16. Experimental Setup and Data
• Commodity cloud 32-core machine with 128 GB memory and an AMD EPYC 7000 series processor
• Data has been sourced from the VirusTotal data feed.
  ◦ Consists of PE files
  ◦ Includes scan results from five major antivirus vendors (Kaspersky, Microsoft, Symantec, Sophos, McAfee)
  ◦ Scan results include labels (whether the file was detected by the AV vendor or not)
  ◦ We compute the TLSH hash of the PE files
17. Evaluation Methodology
• We want to measure cluster quality based on labels provided by VirusTotal.
• We used purity to measure cluster quality:
  ◦ Purity score varies from 0 to 1 (0 for worst and 1 for best cluster quality)
  ◦ The cost of computing purity scales linearly with sample size (N)
• We investigated other cluster quality measures, such as the silhouette coefficient, but they do not scale well.
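For reference, here is a minimal sketch of the standard purity definition: the fraction of samples that carry the majority label of their cluster. The slides do not spell out the formula or how labels from the five vendors are combined, so the single-label input and the function name are assumptions. The computation is a single pass over the samples, which is why it scales linearly in N.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence

def purity(cluster_ids: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Fraction of samples whose label matches their cluster's majority label."""
    if len(cluster_ids) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    counts_by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts_by_cluster[cluster][label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(labels)

# Toy example with made-up labels: clusters 0 and 1 each contribute
# 2 majority samples, cluster 2 contributes 1, giving 5/6.
score = purity([0, 0, 0, 1, 1, 2], ["a", "a", "b", "c", "c", "a"])
```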
18. Comparison of Different Clustering Methods (cont.)
• Simple K-Medoid computes the medoid based on the average TLSH distance computed for each sample within a cluster
• Other techniques compared include DBSCAN and CLARANS (another K-Medoid based algorithm)
• We compared the silhouette coefficient and run times at various sample sizes
• Conclusion: K-Means/K-Medoid approaches produce poor cluster quality, while DBSCAN produces good cluster quality but has scalability issues
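The medoid step in the first bullet can be sketched as picking the sample with the smallest average distance to the other members of its cluster. This is an illustration under assumptions, not the code used in the evaluation: `hamming` on integers stands in for the TLSH distance, and the function names are hypothetical.

```python
from typing import Callable, Sequence

def hamming(a: int, b: int) -> int:
    """Stand-in metric (NOT the TLSH distance): number of differing bits."""
    return bin(a ^ b).count("1")

def medoid(cluster: Sequence[int],
           dist: Callable[[int, int], int] = hamming) -> int:
    """Return the member with the lowest average distance to the cluster."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    return min(cluster,
               key=lambda c: sum(dist(c, x) for x in cluster) / len(cluster))

center = medoid([0b000, 0b001, 0b011, 0b101, 0b111])
```

Each medoid update compares every member with every other member, so the cost grows quadratically with cluster size.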
19. Comparison of Different Clustering Methods
• We compared different well-known clustering techniques (K-Means, K-Medoid, CLARANS, DBSCAN) and show that they are either not scalable or do not produce good-quality clusters that are useful for security analysis
• We adapted K-Means to cluster TLSH hashes as follows:
  ◦ We convert each TLSH hash digest to a 70-character vector and calculate the mean of the vectors (m)
  ◦ We find the TLSH hash digest (d) that is closest to m by computing the TLSH distance
  ◦ We use d as the mean in the K-Means algorithm
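One possible reading of this adaptation is sketched below. It assumes each of the 70 hex characters is mapped to its numeric value (0 to 15), which the slide does not state. It also replaces the TLSH distance with a simple sum of absolute differences, because the slide does not say how the TLSH distance is computed against a real-valued mean. The digests in the example are synthetic placeholders, not real TLSH values, and all function names are hypothetical.

```python
from typing import List, Sequence

DIGEST_LEN = 70  # length stated on the slide

def to_vector(digest: str) -> List[int]:
    """Map each hex character of a digest to its value 0..15 (assumed encoding)."""
    if len(digest) != DIGEST_LEN:
        raise ValueError(f"expected {DIGEST_LEN} hex characters")
    return [int(ch, 16) for ch in digest]

def mean_vector(digests: Sequence[str]) -> List[float]:
    """Component-wise mean (m) of the digest vectors."""
    vectors = [to_vector(d) for d in digests]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def stand_in_distance(vec: Sequence[float], other: Sequence[float]) -> float:
    """Placeholder for the TLSH distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(vec, other))

def center_digest(digests: Sequence[str]) -> str:
    """Pick the existing digest (d) closest to the mean vector m."""
    m = mean_vector(digests)
    return min(digests, key=lambda d: stand_in_distance(to_vector(d), m))

# Synthetic placeholder digests, one repeated hex character each.
samples = ["0" * 70, "1" * 70, "2" * 70, "F" * 70]
d = center_digest(samples)   # mean per position is 4.5, so "2" * 70 is closest
```

Using an existing digest as the center keeps every cluster center a valid digest, so distances to it can still be computed with the digest distance function.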
21. Clustering Quality and Run-Time for HAC-T
• We first report cluster quality for HAC-T using the silhouette coefficient for 10,000 samples, then scale to millions…
• Parameter T affects clustering quality and % noise (samples not clustered)
22. Scaling HAC-T to Tens of Millions and Beyond…
• HAC-T scales in an N log N manner
• HAC-T produces good cluster quality (measured by purity)
23. Conclusion
• HAC-T is evaluated on up to 10 million samples and could easily scale to tens of millions of samples or more
• We have shown that it produces good-quality clusters, with purity scores in the range from 0.97 to 0.98
• Unlike K-Means/K-Medoid approaches, it does not need k as an input parameter, and it produces well-formed clusters useful for security analysis
24. Thank You
• Open Source Library
  https://github.com/trendmicro/tlsh
• Jonathan Oliver: jon_oliver@trendmicro.com
• Muqeet Ali: muqeet_ali@trendmicro.com
• Josiah Hagen: josiah_hagen@trendmicro.com