Bytewise approximate matching, searching and clustering
Liwei Ren, Ph.D.
Ray Cheng, Ph.D.
Trend Micro Inc.
DFRWS USA 2015, August 2015, Philadelphia, PA
Agenda
• Background
• Six Matching Problems and Bytewise Relevance
• Current Work: A Framework of Theory, Algorithms, and Technologies
• Future Work
Background
• Similarity digesting schemes:
– Problem: Given two binary strings s1 and s2, measure their similarity.
• Do a hash that preserves similarity property of strings.
• Measure similarity by comparing two hash values.
– Example: TLSH, ssdeep, sdhash
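Illustration: a minimal sketch of digest-and-compare, assuming the python-ssdeep binding's documented hash() and compare() functions; the two sample documents are made up.

```python
# A minimal sketch, assuming the python-ssdeep binding exposes
# ssdeep.hash() and ssdeep.compare() as documented; the documents
# below are made-up illustrations.
import ssdeep

doc_v1 = b"Quarterly report: revenue grew, costs were flat. " * 40
doc_v2 = b"Quarterly report: revenue grew, costs rose a bit. " * 40

h1 = ssdeep.hash(doc_v1)        # similarity-preserving digest of s1
h2 = ssdeep.hash(doc_v2)        # similarity-preserving digest of s2

score = ssdeep.compare(h1, h2)  # 0 = unrelated, 100 = near-identical
print(score)
```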
Background
• The NIST specification document NIST.SP.800-168 introduces the concept of bytewise approximate matching:
– The NIST document lists four cases to describe this concept:
• Object similarity detection: identify related artifacts, e.g. different versions of a document.
• Cross Correlation: identify artifacts sharing a common object.
• Embedded Object Detection: identify a given object inside an artifact.
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
• Dr. Liwei Ren’s talk at DFRWS EU 2015:
– A Theoretic Framework for Evaluating Similarity Digesting Tools
– Using a mathematical model to describe binary similarity.
Six Matching Problems and Bytewise Relevance
• The NIST document does not cover all bytewise approximate matching cases.
• We generalized the NIST cases to six cases:
– EM1 (identicalness), EM2 (containment), EM3 (cross-sharing), AM1 (similarity), AM2 (approximate containment) and AM3 (approximate cross-sharing); see the Matching section for details.
Six Matching Problems and Bytewise Relevance
• Continued:
Classification of NIST approximate matching cases
• Similarity Detection: identify related artifacts.
– AM1 (approximate match)
• Cross Correlation: identify artifacts sharing a common object.
– EM3 (exact match cross-sharing)
• Embedded Object Detection: identify a given object inside an artifact.
– EM2 (exact match containment)
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
– EM2 (one or more exact match containments)
Six Matching Problems and Bytewise Relevance
• Definition 1: Given two strings R[1,…,n] and T[1,…,m], if one of the six cases holds, we say R and T are bytewise relevant.
– We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.
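Illustration: a minimal sketch of Definition 1 in code. The six case predicates are hypothetical placeholders; only the two exact cases are filled in (EM2 is assumed symmetric here), and the rest are stubs so that BR stays runnable.

```python
# Sketch of BR(R, T) from Definition 1: R and T are bytewise relevant
# iff at least one of the six cases EM1..AM3 holds.

def em1(r: bytes, t: bytes) -> bool:
    """EM1 identicalness: the two strings are equal."""
    return r == t

def em2(r: bytes, t: bytes) -> bool:
    """EM2 containment: one string occurs inside the other (assumed symmetric here)."""
    return r in t or t in r

# Placeholder stubs for the remaining cases, so BR stays runnable; the
# cross-sharing and approximate cases are not sketched here.
def em3(r: bytes, t: bytes) -> bool: return False
def am1(r: bytes, t: bytes) -> bool: return False
def am2(r: bytes, t: bytes) -> bool: return False
def am3(r: bytes, t: bytes) -> bool: return False

def BR(r: bytes, t: bytes) -> int:
    """Return 1 if any of the six cases holds, otherwise 0."""
    return 1 if any(case(r, t) for case in (em1, em2, em3, am1, am2, am3)) else 0
```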
A Framework of Theory, Algorithms and Technologies
• Define three fundamental problems using bytewise relevance:
– Matching: given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
– Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B such that BR(o, b) = 1.
– Clustering: given a bag B of objects, partition B into groups {G1, G2,…,Gm} based on BR.
• S: an object space.
• O: an object in the object space S.
• BR: the bytewise relevance relation for objects in S.
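Illustration: a brute-force sketch of the matching and searching problems on top of a BR function such as the stub above; the clustering problem is sketched later under Future Work, and a scalable searching approach appears in the Searching section.

```python
# Brute-force sketch of the matching and searching problems over BR.
# BR is assumed to be available (e.g. the stub sketched earlier).

def match(o1, o2, BR) -> bool:
    """Matching: decide whether two objects are bytewise relevant."""
    return BR(o1, o2) == 1

def search(o, bag, BR) -> list:
    """Searching (brute force): every b in the bag with BR(o, b) = 1."""
    return [b for b in bag if BR(o, b) == 1]

# Example: with the BR stub above, search(b"abc", [b"xxabcxx", b"zzz"], BR)
# returns [b"xxabcxx"] because of the containment case EM2.
```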
A Framework of Theory, Algorithms and Technologies
• Our bytewise relevance framework:
Matching
• The Six Matching Problems EM1–AM3:
– Identicalness EM1: the solution is trivial.
– Containment EM2: the solution is the Rabin-Karp algorithm (see the sketch after this list).
– Cross-sharing EM3:
• We established a theory on this interesting problem: how to measure cross-sharing.
• We developed an algorithmic solution with theoretical analysis.
– Similarity AM1:
• TLSH, ssdeep and sdhash.
• Dr. Ren delivered a talk at DFRWS EU 2015: there are eight approaches to solving this problem.
– We designed a novel similarity digesting scheme, TSFP.
– Approximate containment AM2: two heuristic algorithms.
– Approximate cross-sharing AM3: one heuristic algorithm.
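As referenced above, a minimal Rabin-Karp sketch for the containment case EM2; the rolling-hash base and modulus are arbitrary illustrative choices.

```python
# Rabin-Karp sketch for EM2 containment: does pattern p occur in text t?
def rabin_karp_contains(p: bytes, t: bytes, base: int = 257,
                        mod: int = (1 << 61) - 1) -> bool:
    m, n = len(p), len(t)
    if m == 0:
        return True
    if m > n:
        return False

    # Hash of the pattern and of the first length-m window of the text.
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + p[i]) % mod
        ht = (ht * base + t[i]) % mod
    high = pow(base, m - 1, mod)          # weight of the window's leading byte

    for i in range(n - m + 1):
        # On a hash match, verify the bytes to rule out collisions.
        if hp == ht and t[i:i + m] == p:
            return True
        if i < n - m:                     # roll the window one byte to the right
            ht = ((ht - t[i] * high) * base + t[i + m]) % mod
    return False

# EM2 check: is the shorter string embedded in the longer one?
print(rabin_karp_contains(b"shared block", b"header ... shared block ... tail"))
```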
Searching
• For the relationship BR, the searching problem:
– B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.
Searching
• How to solve the searching problem?
– Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can we scale to millions or billions of strings?
– Candidate selection approach: a two-step approach.
• STEP 1: quickly select a few candidates {s1, s2,…,sm}.
• STEP 2: evaluate each BR(T, sk).
– How to select good candidates?
• String fingerprinting: generate fingerprints from each string in B.
• Indexing process: index the fingerprints along with the string IDs to create an index database, the FP-DB.
• Searching process: given T, generate fingerprints {FP1, FP2,…,FPq} and use them to retrieve candidate strings from the FP-DB.
– NOTE:
• This is similar to a keyword-based search engine where the keywords are the fingerprints.
• The fingerprinting procedure is actually a special tokenization method.
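Illustration: a minimal sketch of the candidate-selection approach. The fingerprinting choice here (hashes of fixed-length byte windows kept under a modulus condition) and the in-memory FP-DB are assumptions made for illustration, not the scheme used by the actual system.

```python
# Sketch of fingerprint indexing and candidate selection for searching.
from collections import defaultdict

W, KEEP_MOD = 16, 64            # window length and sampling rate (illustrative)

def fingerprints(s: bytes) -> set:
    """Hash every W-byte window; keep a sparse sample as the string's fingerprints."""
    fps = set()
    for i in range(max(len(s) - W + 1, 0)):
        h = 2166136261
        for b in s[i:i + W]:                  # FNV-1a over the window
            h = ((h ^ b) * 16777619) & 0xFFFFFFFF
        if h % KEEP_MOD == 0:
            fps.add(h)
    return fps

def build_fp_db(bag: dict) -> dict:
    """Indexing process: FP-DB maps fingerprint -> set of string IDs."""
    fp_db = defaultdict(set)
    for sid, s in bag.items():
        for fp in fingerprints(s):
            fp_db[fp].add(sid)
    return fp_db

def candidates(t: bytes, fp_db: dict, top: int = 10) -> list:
    """Searching process, STEP 1: rank stored strings by shared fingerprints."""
    votes = defaultdict(int)
    for fp in fingerprints(t):
        for sid in fp_db.get(fp, ()):
            votes[sid] += 1
    return sorted(votes, key=votes.get, reverse=True)[:top]

# STEP 2 (not shown): evaluate BR(T, s_k) on each returned candidate.
```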
Future Work: Clustering Problem
• For the relationship BR, one has a clustering problem:
– B is a bag of strings; partition B into groups of strings based on BR.
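Illustration: a brute-force clustering sketch that treats BR as an edge relation and takes connected components with union-find; scalable algorithms remain future work.

```python
# Clustering sketch: strings land in the same group when they are connected
# through a chain of BR-related pairs (O(n^2) BR evaluations, brute force).

def cluster(bag: list, BR) -> list:
    parent = list(range(len(bag)))

    def find(i: int) -> int:              # find the root of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if BR(bag[i], bag[j]) == 1:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, s in enumerate(bag):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```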
Future Work: Library and tools
• Analyze algorithms and measure performance.
– Verify they can scale.
• For bytewise approximate matching, searching and clustering, we plan to provide:
– A library of functions
– APIs
– Tools
Application examples of Approximate Matching, Searching, Clustering
• E-Discovery
– Comparing near duplicate documents
– Grouping near duplicate documents
• Digital forensic analysis
– Identifying similar objects or files
• Malware analysis
– Identifying similar malware or mutated malware
• Anti-plagiarism
– Detection of copyright violations
• Source code governance
• Spam filtering
• Data Loss Prevention
Q&A
• Thank you.
• Any questions?
• Email:
– liwei_ren@trendmicro.com
– ray_cheng@trendmicro.com
Application Example
• A search problem in a DLP (Data Loss Prevention) system:
– Problem: S = {d1, d2,…, dn} is a collection of confidential documents. Given any document T and 0 < δ ≤ 1, find a document d ∊ S such that RLV(d, T) ≥ δ.
• RLV is a function that measures the relevance of two documents.
• Challenges: how to construct RLV and choose δ? How to make the search scalable?
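Illustration: a minimal sketch of this search problem. The RLV used here (Jaccard similarity over word sets) and the sample documents are stand-ins for illustration only; constructing the real RLV and choosing δ are the stated challenges.

```python
# DLP search sketch: find a confidential document d in S with RLV(d, T) >= delta.

def rlv(d: str, t: str) -> float:
    """Stand-in relevance: Jaccard similarity of the two documents' word sets."""
    a, b = set(d.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_relevant(S: list, T: str, delta: float):
    """Return the first confidential document whose relevance to T reaches delta."""
    for d in S:
        if rlv(d, T) >= delta:
            return d
    return None

confidential = ["project falcon design spec rev 3", "payroll summary q2"]
leaked_draft = "DRAFT project falcon design spec rev 3 with edits"
print(find_relevant(confidential, leaked_draft, delta=0.6))
```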
Application Example
• A clustering problem in e-Discovery:
– Data are identified as potentially relevant by attorneys.
– De-duplication technology groups near-duplicate documents.
– Problem: S is the set of identified documents; partition S into groups based on textual relevance.
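Illustration: a greedy near-duplicate grouping sketch. The relevance function (e.g. the word-set rlv above) and the 0.8 threshold are illustrative assumptions, and each group is represented by its first member; this is a stand-in, not the actual de-duplication technology.

```python
# Greedy near-duplicate grouping sketch for e-Discovery.
def group_documents(S: list, rlv, threshold: float = 0.8) -> list:
    groups = []                              # each group is a list of documents
    for doc in S:
        for g in groups:
            if rlv(g[0], doc) >= threshold:  # compare against the group representative
                g.append(doc)
                break
        else:
            groups.append([doc])             # no relevant group found: start a new one
    return groups
```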
Background
• Similarity digesting schemes:
– A family of similarity preserving hashing techniques & tools
– Problem: Given two binary strings s1 and s2, measure the similarity by s = SIM(H(s1), H(s2)).
• H is a hash function that preserves string similarity.
• SIM is a function that measures the similarity of two hash values.
– Examples: TLSH, ssdeep, sdhash.
– Challenge: how to evaluate the pros & cons among them?
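Illustration: a toy stand-in for H and SIM (not TLSH, ssdeep or sdhash) that makes the shape of s = SIM(H(s1), H(s2)) concrete: H keeps a sparse sample of hashed byte 4-grams and SIM is the Jaccard index of two such samples.

```python
# Toy similarity digesting sketch for s = SIM(H(s1), H(s2)).
# H and SIM below are illustrative stand-ins, not TLSH/ssdeep/sdhash.
import hashlib

def H(s: bytes, keep_mod: int = 8) -> frozenset:
    """Digest: a sparse sample of hashed byte 4-grams (roughly similarity preserving)."""
    grams = set()
    for i in range(max(len(s) - 3, 0)):
        h = int.from_bytes(hashlib.blake2b(s[i:i + 4], digest_size=4).digest(), "big")
        if h % keep_mod == 0:
            grams.add(h)
    return frozenset(grams)

def SIM(h1: frozenset, h2: frozenset) -> float:
    """Similarity of two digests: Jaccard index in [0, 1]."""
    return len(h1 & h2) / len(h1 | h2) if (h1 | h2) else 1.0

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20
print(SIM(H(s1), H(s2)))
```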
Six Matching Problems and Bytewise Relevance
• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ↪ Y.
• We have the following relationships:
[Diagram: the special-case relation ↪ among the six problems EM1, EM2, EM3, AM1, AM2 and AM3.]