Copyright 2011 Trend Micro Inc. 1
Binary Similarity: Theory, Algorithms and Tool Evaluation
Liwei Ren, Ph.D, Trend Micro™
University of Houston-Downtown, Houston, Texas, October, 2015
Agenda
• What is binary similarity?
• Similarity Digesting: 3 Algorithms
• A Mathematical Model
• Tool Evaluation
• A Novel Fuzzy Hashing
• Summary and Further Research
Classification 10/2/2015 2
What Is Binary Similarity?
• Binary similarity or approximate matching.
– What is binary similarity?
• 4 Use Cases specified by a NIST document:
Similarity Digesting : 3 Algorithms
• Similarity digesting (aka, fuzzy hashing):
– A class of hash techniques or tools that preserve similarity.
– Typical steps for digest generation:
– Detecting similarity with similarity digesting:
• Three similarity digesting algorithms and tools:
– ssdeep, sdhash & TLSH
Similarity Digesting : 3 Algorithms
• ssdeep
– Two steps for digesting:
– Edit Distance: Levenshtein distance
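The Levenshtein distance named above can be sketched as follows. This is the plain textbook edit distance only; how ssdeep turns a distance between two digests into a similarity score is not shown on the slide and is not reproduced here. The function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

Two digests that differ in only a few characters yield a small distance, which is what lets a digest comparison stand in for a comparison of the original strings.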
Similarity Digesting : 3 Algorithms
• sdhash by Dr. Vassil Roussev
– Two steps for digesting:
– Edit Distance: Hamming distance
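As a minimal illustration of the Hamming distance named above, the sketch below counts differing bits between two equal-length bitmaps. It assumes the bitmaps are given as `bytes`; the way sdhash converts such a comparison into a final score is not described on the slide and is not modeled here.

```python
def hamming_distance(x: bytes, y: bytes) -> int:
    """Number of bit positions in which two equal-length bitmaps differ."""
    if len(x) != len(y):
        raise ValueError("bitmaps must have equal length")
    # XOR leaves a 1 exactly where the two bytes disagree; count those bits.
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))
```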
Similarity Digesting : 3 Algorithms
• TLSH
– Two steps for digesting :
– Edit Distance: A diff based evaluation function
A Mathematical Model
• Summary of Three Similarity Digesting Schemes:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
is able to preserve similarity to a certain degree.
• ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
A Mathematical Model
• Three approaches for similarity evaluation:
• The 1st model plays a critical role in similarity comparison.
• Let's focus on discussing various 1st models today.
• Based on a unified format.
• The 2nd model saves space but further reduces accuracy.
A Mathematical Model
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
A Mathematical Model
• Four general types of tokens from binary strings:
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• Eight different models to describe a string for similarity: 2 data structures × 4 token types.
• Analogical thinking:
– we define different distances to describe a metric space.
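The four token types and two data structures listed above can be sketched in a few lines. Everything here is illustrative: the function names, the fixed chunk size, the 5-byte window for k-subsequences, and the toy `keep` selection function are assumptions, not the tools' actual parameters. In particular, real tools may pick chunk boundaries and blocks very differently.

```python
from collections import Counter
from itertools import combinations

def k_grams(data: bytes, k: int = 3) -> list[bytes]:
    """All contiguous substrings of length k, in order of appearance."""
    return [data[i:i + k] for i in range(len(data) - k + 1)]

def k_subsequences(data: bytes, k: int = 3, window: int = 5) -> list[bytes]:
    """Length-k subsequences (order kept, gaps allowed) of each sliding window."""
    tokens = []
    for i in range(len(data) - window + 1):
        w = data[i:i + window]
        for idx in combinations(range(window), k):
            tokens.append(bytes(w[j] for j in idx))
    return tokens

def chunks(data: bytes, size: int = 4) -> list[bytes]:
    """Non-overlapping pieces covering every byte (fixed size for simplicity)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def blocks(data: bytes, length: int = 4,
           keep=lambda b: len(set(b)) == len(b)) -> list[bytes]:
    """Fixed-length substrings accepted by a selection function.
    The default `keep` (all bytes distinct) is a hypothetical stand-in."""
    candidates = (data[i:i + length] for i in range(len(data) - length + 1))
    return [b for b in candidates if keep(b)]

def as_sequence(tokens) -> list:
    """Sequence model: token order is part of the description."""
    return list(tokens)

def as_bag(tokens) -> Counter:
    """Bag model: only token multiplicities are kept, order is ignored."""
    return Counter(tokens)
```

Pairing either data structure with any of the four token generators gives the eight combinations, for example `as_sequence(chunks(data))` or `as_bag(blocks(data))`.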
Tool Evaluation
• Data Structure:
– Bag: a bag ignores the order of tokens. It is good at handling content
swapping.
– Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
• Tokens:
– k-grams: Due to the small k (3, 4, 5, …), this fine granularity is good at
handling fragmentation.
– k-subsequences: Due to the small k (3, 4, 5, …), this fine granularity is good at
handling fragmentation.
– Chunks: This approach takes every byte into account at coarse granularity. It
should be OK at handling containment and cross sharing.
– Blocks: Depending on the selection function, this approach does not take
every byte into account, but it may represent a string more efficiently, which
is good for generating similarity digests. Because the blocks have fixed
length, it is good at handling containment and cross sharing.
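A small experiment can make the bag-versus-sequence point concrete. The sketch below compares a string with a copy whose two halves are swapped, once with a bag of 3-grams and once with a deliberately crude sequence comparison (aligned fixed-size chunks). Both scoring functions are assumptions made for illustration; a real sequence-based tool that uses edit distance tolerates swaps better than this positional stand-in.

```python
from collections import Counter

def k_grams(data: bytes, k: int = 3) -> list[bytes]:
    """All contiguous substrings of length k."""
    return [data[i:i + k] for i in range(len(data) - k + 1)]

def bag_similarity(s: bytes, t: bytes, k: int = 3) -> float:
    """Overlap of the two k-gram bags, scaled to 0..100."""
    a, b = Counter(k_grams(s, k)), Counter(k_grams(t, k))
    common = sum((a & b).values())  # multiset intersection
    total = sum(a.values()) + sum(b.values())
    return 200.0 * common / total if total else 100.0

def positional_similarity(s: bytes, t: bytes, size: int = 4) -> float:
    """Share of aligned fixed-size chunks that are identical, scaled to 0..100."""
    cs = [s[i:i + size] for i in range(0, len(s), size)]
    ct = [t[i:i + size] for i in range(0, len(t), size)]
    n = max(len(cs), len(ct))
    same = sum(1 for x, y in zip(cs, ct) if x == y)
    return 100.0 * same / n if n else 100.0

original = b"AAAABBBBCCCCDDDD"
swapped = b"CCCCDDDDAAAABBBB"  # the two halves exchanged

print(bag_similarity(original, swapped))         # stays high
print(positional_similarity(original, swapped))  # drops to zero
```

The bag only loses the few 3-grams that straddle the swap boundary, while the order-dependent comparison finds no aligned chunk in common.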
Tool Evaluation
Tool           Model   Minor Changes  Containment  Cross sharing  Swap    Fragmentation
ssdeep         M1.3    High           Medium       Medium         Medium  Low
sdhash         M2.4    High           High         High           High    Low
TLSH           M2.2    High           Low          Medium         High    High
sdhash + TLSH  Hybrid  High           High         High           High    High
Tool Evaluation
• Note: vulnerability is outside the scope of this evaluation, but it is worth
mentioning.
• My co-worker Dr. Jon Oliver shows in one of his papers:
– Both ssdeep & sdhash are vulnerable to adversarial attacks.
– TLSH is not!
A Novel Fuzzy Hashing
• We would like to design a novel fuzzy hashing scheme based on
M2.4:
– A string is represented by a bag of blocks.
– Two steps: (1) Feature selection; (2) Digest generation.
A Novel Fuzzy Hashing
• Continuing:
A Novel Fuzzy Hashing
• This is TSFP
– Trend String Fingerprint
• Similarity measurement of TSFP:
– Given two TSFPs H and G, where H = h1h2…hn and G = g1g2…gm.
– Similarity is measured by the function:
• SIMH(H,G) = 200*|S ∩ T| / (|S| + |T|)
– Where S = {h1, h2, …, hn} and T = {g1, g2, …, gm}
– 0 ≤ SIMH(H,G) ≤ 100
• Similarity measurement of two strings:
– SIM(s,t) = SIMH(TSFP(s), TSFP(t))
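The SIMH measure can be written down directly from its definition. In this sketch a fingerprint is assumed to be any iterable of hashable tokens; how TSFP actually produces and encodes those tokens is not given in the text, so no `TSFP` function is implemented. The handling of two empty fingerprints is also an assumption.

```python
def simh(h_tokens, g_tokens) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|), a value between 0 and 100."""
    s, t = set(h_tokens), set(g_tokens)
    if not s and not t:
        # Assumption: two empty fingerprints are treated as identical.
        return 100.0
    return 200.0 * len(s & t) / (len(s) + len(t))
```

Because |S ∩ T| can be at most min(|S|, |T|), the value never exceeds 100, and it reaches 100 only when the two token sets are equal.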
A Novel Fuzzy Hashing
• Why do we need TSFP ?
• We need to solve the following problems:
1. Similarity search problem:
• B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s,
find t ϵ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
2. Similarity-based clustering problem:
• B is a bag of binary strings {t1, t2, …, tn}. Partition B into groups based
on their binary similarity.
• Why not {ssdeep, sdhash & TLSH}?
– With them, an obvious solution is to apply a brute-force algorithm.
– NOTE: Jon Oliver uses random forest to solve the search problem
without brute force. I will try to prove its feasibility mathematically.
A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2 , …, tn }. Given δ >0 and a binary string
s, find t ϵ {t1, t2 , …, tn} such that SIM(s, t) ≥δ .
• How does keyword based search engine work?
– Extracting keywords from documents
– Indexing keywords & documents
– Searching via keywords.
• Solution:
– Given a string s, we get its fuzzy hash TSFP(s)= h1h2… hn .
– Let S = {h1, h2, …, hn}. Each hj is a token of s that we treat as a
keyword, so we can create the indices TSFP-Index(B).
– We can solve the search problem above in two steps.
A Novel Fuzzy Hashing
• Similarity search problem:
• B is a bag of binary strings {t1, t2 , …, tn }. Given δ >0 and a binary string
s, find t ϵ {t1, t2 , …, tn} such that SIM(s, t) ≥δ .
• STEP 1:
– Candidate selection
• Let TSFP(s)= h1h2… hn to create the bag of tokens S={h1, h2,…, hn}.
• Use this bag of tokens to search the indices TSFP-Index(B) so that we
retrieve a list of candidates {s1, s2 , …, sm} ⊂ {t1, t2 , …, tn } ranked by
number of common tokens.
• STEP 2:
– Brute force method at smaller scale
• For each t ϵ {s1, s2, …, sm}, if SIM(s, t) ≥ δ, t is what we are searching for.
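The two steps above can be sketched with an inverted index from tokens to string names. This is a minimal illustration under stated assumptions: fingerprints are precomputed lists of tokens (TSFP itself is not specified in the text), and `build_index`, `search`, and the `max_candidates` cut-off are hypothetical names and parameters.

```python
from collections import Counter, defaultdict

def simh(h_tokens, g_tokens) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|)."""
    s, t = set(h_tokens), set(g_tokens)
    if not s and not t:
        return 100.0  # assumption: empty fingerprints count as identical
    return 200.0 * len(s & t) / (len(s) + len(t))

def build_index(fingerprints: dict) -> dict:
    """Inverted index: token -> names of the strings whose fingerprint has it."""
    index = defaultdict(set)
    for name, tokens in fingerprints.items():
        for token in set(tokens):
            index[token].add(name)
    return index

def search(query_tokens, fingerprints, index, delta, max_candidates=100):
    """Return (name, score) pairs with score >= delta."""
    # STEP 1: candidate selection, ranked by number of common tokens.
    hits = Counter()
    for token in set(query_tokens):
        for name in index.get(token, ()):
            hits[name] += 1
    candidates = [name for name, _ in hits.most_common(max_candidates)]
    # STEP 2: brute force, but only over the candidates.
    results = []
    for name in candidates:
        score = simh(query_tokens, fingerprints[name])
        if score >= delta:
            results.append((name, score))
    return results

# Toy fingerprints; real tokens would come from TSFP.
fingerprints = {
    "t1": ["a", "b", "c", "d"],
    "t2": ["a", "x", "y", "z"],
    "t3": ["p", "q"],
}
index = build_index(fingerprints)
result = search(["a", "b", "c", "e"], fingerprints, index, delta=50.0)
```

Strings that share no token with the query never reach step 2, which is where the saving over a full brute-force scan comes from.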
Summary and Further Research
• My practice of academic research in industry:
Summary and Further Research
Framework of approximate matching, searching and clustering:
Q&A
• Thank you for your interest.
• Any questions?
• My Information:
– Email: liwei_ren@trendmicro.com
– Academic Page: https://pitt.academia.edu/LiweiRen
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Binary Similarity : Theory, Algorithms and Tool Evaluation

  • 1. Binary Similarity: Theory, Algorithms and Tool Evaluation. Liwei Ren, Ph.D, Trend Micro™. University of Houston-Downtown, Houston, Texas, October 2015. Copyright 2011 Trend Micro Inc.
  • 2. Agenda
      • What is binary similarity?
      • Similarity Digesting: 3 Algorithms
      • A Mathematical Model
      • Tool Evaluation
      • A Novel Fuzzy Hashing
      • Summary and Further Research
  • 3. What Is Binary Similarity?
      • Binary similarity or approximate matching.
          – What is binary similarity?
      • 4 use cases specified by a NIST document:
  • 4. What Is Binary Similarity?
  • 5. Similarity Digesting: 3 Algorithms
      • Similarity digesting (aka fuzzy hashing):
          – A class of hash techniques or tools that preserve similarity.
          – Typical steps for digest generation:
          – Detecting similarity with similarity digesting:
      • Three similarity digesting algorithms and tools:
          – ssdeep, sdhash & TLSH
  • 6. Similarity Digesting: 3 Algorithms
      • ssdeep
          – Two steps for digesting:
          – Edit Distance: Levenshtein distance
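The Levenshtein distance named on this slide can be illustrated with the standard dynamic-programming recurrence. This is a minimal sketch, not ssdeep's actual scoring code; the function name `levenshtein` is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

Two digests that differ in only a few characters have a small distance, which is what lets a sequence-style digest tolerate minor changes to the input.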
  • 7. Similarity Digesting: 3 Algorithms
      • sdhash by Dr. Vassil Roussev
          – Two steps for digesting:
          – Edit Distance: Hamming distance
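The Hamming distance named on this slide counts the bit positions at which two equal-length bitmaps differ. The sketch below shows only that count; it is not sdhash's scoring code, and the function name `hamming_distance` is an assumption made for illustration.

```python
def hamming_distance(x: bytes, y: bytes) -> int:
    """Number of bit positions at which two equal-length byte strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs inputs of equal length")
    # XOR leaves a 1 bit exactly where the inputs disagree; count those bits.
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))
```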
  • 8. Similarity Digesting: 3 Algorithms
      • TLSH
          – Two steps for digesting:
          – Edit Distance: A diff-based evaluation function
  • 9. A Mathematical Model
      • Summary of Three Similarity Digesting Schemes:
          – A first model describes a binary string by selected features:
              • ssdeep model: a string is a sequence of chunks (split from the string).
              • sdhash model: a string is a bag of 64-byte blocks (selected with entropy values).
              • TLSH model: a string is a bag of triplets (selected from all 5-grams).
          – A second model maps the selected features into a digest that preserves similarity to a certain degree:
              • ssdeep model: a sequence of chunks is mapped into an 80-byte digest.
              • sdhash model: a bag of blocks is mapped into one or more 256-byte Bloom filter bitmaps.
              • TLSH model: a bag of triplets is mapped into a 32-byte container.
  • 10. A Mathematical Model
      • Three approaches for similarity evaluation:
      • The 1st model plays a critical role in similarity comparison.
      • Let us focus on discussing the various 1st models today.
      • Based on a unified format.
      • The 2nd model saves space but further reduces accuracy.
  • 11. A Mathematical Model
      • Unified format for 1st model:
          – A string is described as a collection of tokens (aka features) organized by a data structure:
              • ssdeep: a sequence of chunks.
              • sdhash: a bag of 64-byte blocks with high entropy values.
              • TLSH: a bag of selected triplets.
          – Two types of data structures: sequence, bag.
          – Three types of tokens: chunks, blocks, triplets.
      • Analogical comparison:
  • 12. A Mathematical Model
      • Four general types of tokens from binary strings:
          – k-grams, where k is small (3, 4, …).
          – k-subsequences: any subsequence of length k. The triplet in TLSH is an example.
          – Chunks: the whole string is split into non-overlapping chunks.
          – Blocks: selected substrings of fixed length.
      • Eight different models to describe a string for similarity (two data structures × four token types).
      • Analogical thinking:
          – we define different distances to describe a metric space.
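The four token types can be sketched as small generators over a byte string. This is an illustrative sketch only: all function names and parameters are assumptions, `chunks` splits at fixed offsets for simplicity, `k_subsequences` enumerates every subsequence of each window rather than a selected subset, and the selection function passed to `blocks` is left to the caller. None of this reproduces how ssdeep, sdhash or TLSH actually pick their features.

```python
from itertools import combinations
from typing import Callable, Iterator


def kgrams(data: bytes, k: int) -> Iterator[bytes]:
    """Every contiguous, overlapping substring of length k."""
    for i in range(len(data) - k + 1):
        yield data[i:i + k]


def k_subsequences(data: bytes, k: int, window: int) -> Iterator[bytes]:
    """Every order-preserving choice of k bytes from each sliding window."""
    for w in kgrams(data, window):
        for positions in combinations(range(window), k):
            yield bytes(w[p] for p in positions)


def chunks(data: bytes, size: int) -> Iterator[bytes]:
    """Non-overlapping pieces that together cover the whole string."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


def blocks(data: bytes, length: int,
           keep: Callable[[bytes], bool]) -> Iterator[bytes]:
    """Fixed-length substrings accepted by a caller-supplied selection function."""
    for candidate in kgrams(data, length):
        if keep(candidate):
            yield candidate
```

For example, `list(kgrams(b"abcde", 3))` gives the three overlapping 3-grams of the input, while `list(chunks(b"abcde", 2))` splits it into `b"ab"`, `b"cd"` and `b"e"`.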
  • 13. Tool Evaluation
      • Data Structure:
          – Bag: a bag ignores the order of tokens, so it is good at handling content swapping.
          – Sequence: a sequence keeps tokens in order, which makes it weak at handling content swapping.
      • Tokens:
          – k-grams: because k is small (3, 4, 5, …), this fine granularity is good at handling fragmentation.
          – k-subsequences: because k is small (3, 4, 5, …), this fine granularity is good at handling fragmentation.
          – Chunks: this approach takes account of every byte at coarse granularity. It should be adequate at handling containment and cross sharing.
          – Blocks: depending on the selection function, blocks do not take account of every byte, but they may represent a string more efficiently, which is good for generating similarity digests. Because the blocks have a fixed length, this approach is good at handling containment and cross sharing.
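The difference between the two data structures under content swapping can be seen on a toy example. The sketch below is an assumption-laden illustration: the bag score is a multiset overlap, and the order-sensitive score uses Python's `difflib.SequenceMatcher` as a stand-in for a sequence comparison; neither is the scoring function of any tool discussed here.

```python
from collections import Counter
from difflib import SequenceMatcher


def bag_score(a: list, b: list) -> float:
    """Share of tokens two non-empty token lists have in common, ignoring order."""
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / (len(a) + len(b))


def sequence_score(a: list, b: list) -> float:
    """Order-sensitive similarity of two token lists (difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()


original = ["AA", "BB", "CC", "DD"]
swapped = ["CC", "DD", "AA", "BB"]  # same tokens, two halves exchanged

swap_bag = bag_score(original, swapped)
swap_seq = sequence_score(original, swapped)
```

The bag score is unaffected by the swap, while the order-sensitive score drops because only one of the two moved halves can be aligned.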
  • 14. Tool Evaluation
      Tool           Model   Minor Changes  Containment  Cross sharing  Swap    Fragmentation
      ssdeep         M1.3    High           Medium       Medium         Medium  Low
      sdhash         M2.4    High           High         High           High    Low
      TLSH           M2.2    High           Low          Medium         High    High
      sdhash + TLSH  Hybrid  High           High         High           High    High
  • 15. Tool Evaluation
  • 16. Tool Evaluation
      • Note: vulnerability is outside the scope of this evaluation, but it is worth mentioning.
      • My co-worker Dr. Jon Oliver shows in one of his papers:
          – Both ssdeep and sdhash are vulnerable to adversarial attacks.
          – TLSH is not!
  • 17. A Novel Fuzzy Hashing
      • We would like to design a novel fuzzy hashing scheme based on model M2.4:
          – A string is represented by a bag of blocks.
          – Two steps: (1) feature selection; (2) digest generation.
  • 18. A Novel Fuzzy Hashing
      • Continuing:
  • 19. A Novel Fuzzy Hashing
      • This is TSFP – Trend String Fingerprint
      • Similarity measurement of TSFP:
          – Given two TSFPs H and G, where H = h1h2…hn and G = g1g2…gm.
          – Similarity is measured by the function:
              • SIMH(H,G) = 200 * |S ∩ T| / (|S| + |T|)
          – where S = {h1, h2, …, hn} and T = {g1, g2, …, gm}.
          – 0 ≤ SIMH(H,G) ≤ 100
      • Similarity measurement of two strings s and t:
          – SIM(s,t) = SIMH(TSFP(s), TSFP(t))
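The SIMH formula translates directly into a set computation. In this minimal sketch a fingerprint is assumed to be given as an iterable of hashable tokens (how TSFP produces those tokens is not covered here), and treating two empty fingerprints as identical is an assumption, since the formula is undefined when |S| + |T| = 0.

```python
from typing import Hashable, Iterable


def simh(h: Iterable[Hashable], g: Iterable[Hashable]) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|), a score from 0 to 100."""
    s, t = set(h), set(g)
    if not s and not t:
        return 100.0  # assumption: two empty fingerprints count as identical
    return 200 * len(s & t) / (len(s) + len(t))
```

For instance, two fingerprints of three tokens each that share two tokens score 200 * 2 / 6, roughly 66.7; identical fingerprints score 100 and disjoint ones score 0.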
  • 20. A Novel Fuzzy Hashing
      • Why do we need TSFP?
      • We need to solve the following problems:
          1. Similarity search problem:
              • B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
          2. Similarity based clustering problem:
              • B is a bag of binary strings {t1, t2, …, tn}. Partition B into groups based on their binary similarity.
      • Why not {ssdeep, sdhash & TLSH}?
          – An obvious solution is applying a brute-force algorithm.
          – NOTE: Jon Oliver uses random forest to solve the search problem without brute force. I will try to prove its feasibility mathematically.
  • 21. A Novel Fuzzy Hashing
      • Similarity search problem:
          – B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
      • How does a keyword based search engine work?
          – Extracting keywords from documents.
          – Indexing keywords & documents.
          – Searching via keywords.
      • Solution:
          – Given a string s, we get its fuzzy hash TSFP(s) = h1h2…hn.
          – Let S = {h1, h2, …, hn}. Each hj is a token of s that we treat as a keyword, so we can create the index TSFP-Index(B).
          – The search problem above can then be solved in two steps.
  • 22. A Novel Fuzzy Hashing
      • Similarity search problem:
          – B is a bag of binary strings {t1, t2, …, tn}. Given δ > 0 and a binary string s, find t ∈ {t1, t2, …, tn} such that SIM(s, t) ≥ δ.
      • STEP 1:
          – Candidate selection
              • Let TSFP(s) = h1h2…hn and create the bag of tokens S = {h1, h2, …, hn}.
              • Use this bag of tokens to search the index TSFP-Index(B), retrieving a list of candidates {s1, s2, …, sm} ⊂ {t1, t2, …, tn} ranked by number of common tokens.
      • STEP 2:
          – Brute force method at smaller scale
              • For each t ∈ {s1, s2, …, sm}, if SIM(s, t) ≥ δ, t is what we are searching for.
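The two steps can be sketched with an inverted index from tokens to string names. This is a minimal sketch under stated assumptions: fingerprints are supplied as lists of tokens keyed by a name, and `build_index`, `search` and the `max_candidates` cutoff are hypothetical names and parameters introduced for illustration, not part of TSFP as presented.

```python
from collections import Counter, defaultdict


def simh(h, g) -> float:
    """SIMH(H, G) = 200 * |S ∩ T| / (|S| + |T|); empty inputs score 0 here."""
    s, t = set(h), set(g)
    total = len(s) + len(t)
    return 200 * len(s & t) / total if total else 0.0


def build_index(fingerprints: dict) -> dict:
    """Inverted index: token -> names of the strings whose fingerprint has it."""
    index = defaultdict(set)
    for name, tokens in fingerprints.items():
        for token in set(tokens):
            index[token].add(name)
    return index


def search(query, fingerprints, index, delta, max_candidates=10):
    """Names t with SIMH(query, fingerprint of t) >= delta."""
    # STEP 1: candidate selection, ranked by number of common tokens.
    common = Counter()
    for token in set(query):
        for name in index.get(token, ()):
            common[name] += 1
    candidates = [name for name, _ in common.most_common(max_candidates)]
    # STEP 2: brute-force comparison, but only over the candidates.
    return [n for n in candidates if simh(query, fingerprints[n]) >= delta]


# Toy data: the tokens stand in for real fingerprint tokens.
fingerprints = {
    "t1": ["a", "b", "c", "d"],
    "t2": ["a", "x", "y", "z"],
    "t3": ["p", "q"],
}
index = build_index(fingerprints)
matches = search(["a", "b", "c", "e"], fingerprints, index, delta=50)
```

In the toy data, `t3` shares no token with the query and is never scored, `t2` is scored but falls below the threshold, and only `t1` is returned.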
  • 23. Summary and Further Research
      • My practice of academic research in industry:
  • 24. Summary and Further Research
      • Framework of approximate matching, searching and clustering:
  • 25. Q&A
      • Thank you for your interest.
      • Any questions?
      • My Information:
          – Email: liwei_ren@trendmicro.com
          – Academic Page: https://pitt.academia.edu/LiweiRen