Bytewise approximate matching, searching and clustering
Liwei Ren, Ph.D.
Ray Cheng, Ph.D.
Trend Micro Inc.
DFRWS USA 2015, August 2015, Philadelphia, PA
Agenda
• Background
• Six Matching Problems and Bytewise Relevance
• Current Work: A Framework of Theory, Algorithms, and Technologies
• Future Work
Background
• Similarity digesting schemes:
– Problem: Given two binary strings s1 and s2, measure their similarity.
• Do a hash that preserves similarity property of strings.
• Measure similarity by comparing two hash values.
– Example: TLSH, ssdeep, sdhash
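Illustration: a minimal sketch of digest-and-compare, assuming the python-ssdeep binding's documented hash() and compare() functions; the two sample documents are made up.

```python
# A minimal sketch, assuming the python-ssdeep binding exposes
# ssdeep.hash() and ssdeep.compare() as documented; the documents
# below are made-up illustrations.
import ssdeep

doc_v1 = b"Quarterly report: revenue grew, costs were flat. " * 40
doc_v2 = b"Quarterly report: revenue grew, costs rose a bit. " * 40

h1 = ssdeep.hash(doc_v1)        # similarity-preserving digest of s1
h2 = ssdeep.hash(doc_v2)        # similarity-preserving digest of s2

score = ssdeep.compare(h1, h2)  # 0 = unrelated, 100 = near-identical
print(score)
```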
Background
• The NIST specification document NIST.SP.800-168 introduces the concept of bytewise approximate matching:
– The NIST document lists four cases to describe this concept:
• Object similarity detection: identify related artifacts, e.g. different versions of a document.
• Cross Correlation: identify artifacts sharing a common object.
• Embedded Object Detection: identify a given object inside an artifact.
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
• Dr. Liwei Ren’s talk at DFRWS EU 2015:
– A Theoretic Framework for Evaluating Similarity Digesting Tools
– Using a mathematical model to describe binary similarity.
Six Matching Problems and Bytewise Relevance
• The NIST document does not cover all bytewise approximate matching cases.
• We generalized the NIST cases to six cases:
– EM1 (identicalness), EM2 (containment), EM3 (cross-sharing), AM1 (similarity), AM2 (approximate containment) and AM3 (approximate cross-sharing); see the Matching section for details.
Six Matching Problems and Bytewise Relevance
• Continued:
Classification of NIST approximate matching cases
• Similarity Detection: identify related artifacts.
– AM1 (approximate match)
• Cross Correlation: identify artifacts sharing a common object.
– EM3 (exact match cross-sharing)
• Embedded Object Detection: identify a given object inside an artifact.
– EM2 (exact match containment)
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
– EM2 (one or more exact match containments)
Six Matching Problems and Bytewise Relevance
• Definition 1: Given two strings R[1,…,n] and T[1,…,m], if one of the six cases holds, we say R and T are bytewise relevant.
– We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.
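Illustration: a minimal sketch of Definition 1 in code. The six case predicates are hypothetical placeholders; only the two exact cases are filled in (EM2 is assumed symmetric here), and the rest are stubs so that BR stays runnable.

```python
# Sketch of BR(R, T) from Definition 1: R and T are bytewise relevant
# iff at least one of the six cases EM1..AM3 holds.

def em1(r: bytes, t: bytes) -> bool:
    """EM1 identicalness: the two strings are equal."""
    return r == t

def em2(r: bytes, t: bytes) -> bool:
    """EM2 containment: one string occurs inside the other (assumed symmetric here)."""
    return r in t or t in r

# Placeholder stubs for the remaining cases, so BR stays runnable; the
# cross-sharing and approximate cases are not sketched here.
def em3(r: bytes, t: bytes) -> bool: return False
def am1(r: bytes, t: bytes) -> bool: return False
def am2(r: bytes, t: bytes) -> bool: return False
def am3(r: bytes, t: bytes) -> bool: return False

def BR(r: bytes, t: bytes) -> int:
    """Return 1 if any of the six cases holds, otherwise 0."""
    return 1 if any(case(r, t) for case in (em1, em2, em3, am1, am2, am3)) else 0
```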
A Framework of Theory, Algorithms and Technologies
• Define three fundamental problems using bytewise relevance:
– Matching: given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
– Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B such that BR(o, b) = 1.
– Clustering: given a bag B of objects, partition B into groups {G1, G2,…,Gm} based on BR.
• S: an object space.
• O: an object in the object space S.
• BR: the bytewise relevance relation for objects in S.
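Illustration: a brute-force sketch of the matching and searching problems on top of a BR function such as the stub above; the clustering problem is sketched later under Future Work, and a scalable searching approach appears in the Searching section.

```python
# Brute-force sketch of the matching and searching problems over BR.
# BR is assumed to be available (e.g. the stub sketched earlier).

def match(o1, o2, BR) -> bool:
    """Matching: decide whether two objects are bytewise relevant."""
    return BR(o1, o2) == 1

def search(o, bag, BR) -> list:
    """Searching (brute force): every b in the bag with BR(o, b) = 1."""
    return [b for b in bag if BR(o, b) == 1]

# Example: with the BR stub above, search(b"abc", [b"xxabcxx", b"zzz"], BR)
# returns [b"xxabcxx"] because of the containment case EM2.
```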
A Framework of Theory, Algorithms and Technologies
• Our bytewise relevance framework:
Matching
• The Six Matching Problems EM1–AM3:
– Identicalness EM1: the solution is trivial.
– Containment EM2: the solution is the Rabin-Karp algorithm (see the sketch after this list).
– Cross-sharing EM3:
• We established a theory on this interesting problem: how to measure cross-sharing.
• We developed an algorithmic solution with theoretical analysis.
– Similarity AM1:
• TLSH, ssdeep and sdhash.
• Dr. Ren delivered a talk at DFRWS EU 2015: there are eight approaches to solving this problem.
– We designed a novel similarity digesting scheme, TSFP.
– Approximate containment AM2: two heuristic algorithms.
– Approximate cross-sharing AM3: one heuristic algorithm.
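As referenced above, a minimal Rabin-Karp sketch for the containment case EM2; the rolling-hash base and modulus are arbitrary illustrative choices.

```python
# Rabin-Karp sketch for EM2 containment: does pattern p occur in text t?
def rabin_karp_contains(p: bytes, t: bytes, base: int = 257,
                        mod: int = (1 << 61) - 1) -> bool:
    m, n = len(p), len(t)
    if m == 0:
        return True
    if m > n:
        return False

    # Hash of the pattern and of the first length-m window of the text.
    hp = ht = 0
    for i in range(m):
        hp = (hp * base + p[i]) % mod
        ht = (ht * base + t[i]) % mod
    high = pow(base, m - 1, mod)          # weight of the window's leading byte

    for i in range(n - m + 1):
        # On a hash match, verify the bytes to rule out collisions.
        if hp == ht and t[i:i + m] == p:
            return True
        if i < n - m:                     # roll the window one byte to the right
            ht = ((ht - t[i] * high) * base + t[i + m]) % mod
    return False

# EM2 check: is the shorter string embedded in the longer one?
print(rabin_karp_contains(b"shared block", b"header ... shared block ... tail"))
```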
Searching
• For the relationship BR, the searching problem:
– B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.
Searching
• How to solve the searching problem?
– Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can we scale to millions or billions of strings?
– Candidate selection approach: a two-step approach.
• STEP 1: quickly select a few candidates {s1, s2,…,sm}.
• STEP 2: evaluate each BR(T, sk).
– How to select good candidates?
• String fingerprinting: generate fingerprints from each string in B.
• Indexing process: index the fingerprints along with the string IDs to create an index database, the FP-DB.
• Searching process: given T, generate fingerprints {FP1, FP2,…,FPq} and use them to retrieve candidate strings from the FP-DB.
– NOTE:
• This is similar to a keyword-based search engine where the keywords are the fingerprints.
• The fingerprinting procedure is actually a special tokenization method.
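Illustration: a minimal sketch of the candidate-selection approach. The fingerprinting choice here (hashes of fixed-length byte windows kept under a modulus condition) and the in-memory FP-DB are assumptions made for illustration, not the scheme used by the actual system.

```python
# Sketch of fingerprint indexing and candidate selection for searching.
from collections import defaultdict

W, KEEP_MOD = 16, 64            # window length and sampling rate (illustrative)

def fingerprints(s: bytes) -> set:
    """Hash every W-byte window; keep a sparse sample as the string's fingerprints."""
    fps = set()
    for i in range(max(len(s) - W + 1, 0)):
        h = 2166136261
        for b in s[i:i + W]:                  # FNV-1a over the window
            h = ((h ^ b) * 16777619) & 0xFFFFFFFF
        if h % KEEP_MOD == 0:
            fps.add(h)
    return fps

def build_fp_db(bag: dict) -> dict:
    """Indexing process: FP-DB maps fingerprint -> set of string IDs."""
    fp_db = defaultdict(set)
    for sid, s in bag.items():
        for fp in fingerprints(s):
            fp_db[fp].add(sid)
    return fp_db

def candidates(t: bytes, fp_db: dict, top: int = 10) -> list:
    """Searching process, STEP 1: rank stored strings by shared fingerprints."""
    votes = defaultdict(int)
    for fp in fingerprints(t):
        for sid in fp_db.get(fp, ()):
            votes[sid] += 1
    return sorted(votes, key=votes.get, reverse=True)[:top]

# STEP 2 (not shown): evaluate BR(T, s_k) on each returned candidate.
```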
Future Work: Clustering Problem
• For the relationship BR, one has a clustering problem:
– B is a bag of strings; partition B into groups of strings based on BR.
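Illustration: a brute-force clustering sketch that treats BR as an edge relation and takes connected components with union-find; scalable algorithms remain future work.

```python
# Clustering sketch: strings land in the same group when they are connected
# through a chain of BR-related pairs (O(n^2) BR evaluations, brute force).

def cluster(bag: list, BR) -> list:
    parent = list(range(len(bag)))

    def find(i: int) -> int:              # find the root of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if BR(bag[i], bag[j]) == 1:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, s in enumerate(bag):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```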
Future Work: Library and tools
• Analyze algorithms and measure performance.
– Verify they can scale.
• For bytewise approximate matching, searching and clustering, we plan to provide:
– A library of functions
– APIs
– Tools
Application examples of Approximate Matching, Searching, Clustering
• E-Discovery
– Comparing near duplicate documents
– Grouping near duplicate documents
• Digital forensic analysis
– Identifying similar objects or files
• Malware analysis
– Identifying similar malware or mutated malware
• Anti-plagiarism
– Detection of copyright violations
• Source code governance
• Spam filtering
• Data Loss Prevention
Q&A
• Thank you.
• Any questions?
• Email:
– liwei_ren@trendmicro.com
– ray_cheng@trendmicro.com
Application Example
• A search problem in a DLP (Data Loss Prevention) system:
– Problem: S = {d1, d2,…, dn} is a collection of confidential documents. Given any document T and 0 < δ ≤ 1, find a document d ∊ S such that RLV(d, T) ≥ δ.
• RLV is a function that measures the relevance of two documents.
• Challenges: how to construct RLV and choose δ? How to make the search scalable?
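Illustration: a minimal sketch of this search problem. The RLV used here (Jaccard similarity over word sets) and the sample documents are stand-ins for illustration only; constructing the real RLV and choosing δ are the stated challenges.

```python
# DLP search sketch: find a confidential document d in S with RLV(d, T) >= delta.

def rlv(d: str, t: str) -> float:
    """Stand-in relevance: Jaccard similarity of the two documents' word sets."""
    a, b = set(d.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def find_relevant(S: list, T: str, delta: float):
    """Return the first confidential document whose relevance to T reaches delta."""
    for d in S:
        if rlv(d, T) >= delta:
            return d
    return None

confidential = ["project falcon design spec rev 3", "payroll summary q2"]
leaked_draft = "DRAFT project falcon design spec rev 3 with edits"
print(find_relevant(confidential, leaked_draft, delta=0.6))
```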
Application Example
• A clustering problem in e-Discovery:
– Data are identified as potentially relevant by attorneys.
– De-duplication technology groups near-duplicate documents.
– Problem: S is the set of identified documents; partition S into groups based on textual relevance.
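Illustration: a greedy near-duplicate grouping sketch. The relevance function (e.g. the word-set rlv above) and the 0.8 threshold are illustrative assumptions, and each group is represented by its first member; this is a stand-in, not the actual de-duplication technology.

```python
# Greedy near-duplicate grouping sketch for e-Discovery.
def group_documents(S: list, rlv, threshold: float = 0.8) -> list:
    groups = []                              # each group is a list of documents
    for doc in S:
        for g in groups:
            if rlv(g[0], doc) >= threshold:  # compare against the group representative
                g.append(doc)
                break
        else:
            groups.append([doc])             # no relevant group found: start a new one
    return groups
```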
Background
• Similarity digesting schemes:
– A family of similarity preserving hashing techniques & tools
– Problem: Given two binary strings s1 and s2, measure the similarity by s = SIM(H(s1), H(s2)).
• H is a hash function that preserves string similarity.
• SIM is a function that measures the similarity of two hash values.
– Examples: TLSH, ssdeep, sdhash.
– Challenge: how to evaluate the pros & cons among them?
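Illustration: a toy stand-in for H and SIM (not TLSH, ssdeep or sdhash) that makes the shape of s = SIM(H(s1), H(s2)) concrete: H keeps a sparse sample of hashed byte 4-grams and SIM is the Jaccard index of two such samples.

```python
# Toy similarity digesting sketch for s = SIM(H(s1), H(s2)).
# H and SIM below are illustrative stand-ins, not TLSH/ssdeep/sdhash.
import hashlib

def H(s: bytes, keep_mod: int = 8) -> frozenset:
    """Digest: a sparse sample of hashed byte 4-grams (roughly similarity preserving)."""
    grams = set()
    for i in range(max(len(s) - 3, 0)):
        h = int.from_bytes(hashlib.blake2b(s[i:i + 4], digest_size=4).digest(), "big")
        if h % keep_mod == 0:
            grams.add(h)
    return frozenset(grams)

def SIM(h1: frozenset, h2: frozenset) -> float:
    """Similarity of two digests: Jaccard index in [0, 1]."""
    return len(h1 & h2) / len(h1 | h2) if (h1 | h2) else 1.0

s1 = b"the quick brown fox jumps over the lazy dog " * 20
s2 = b"the quick brown fox jumps over the lazy cat " * 20
print(SIM(H(s1), H(s2)))
```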
Six Matching Problems and Bytewise Relevance
• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ↪ Y.
• We have the following relationships:
[Diagram: the special-case relation ↪ among the six problems EM1, EM2, EM3, AM1, AM2 and AM3.]