SlideShare a Scribd company logo
Copyright 2011 Trend Micro Inc. 1
A Theoretic Framework for Evaluating
Similarity Digesting Tools
Liwei Ren, Ph.D, Trend Micro™
DFRWS EU 2015, Dublin, Ireland, March, 2015
Copyright 2011 Trend Micro Inc.
Agenda
• Byte-wise Approximate Matching
• Similarity Digesting Tools
• Mathematical Models for Byte-wise Similarity
• Tool Evaluation with Theoretic Analysis
• Tool Evaluation with Data Experiment
• Further Research for Approximate Matching
Classification 3/25/2015 2
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Matching
• Byte-wise similarity & approximate matching.
– What is byte-wise similarity ?
• 4 Use Cases specified by NIST :
Classification 3/25/2015 3
Copyright 2011 Trend Micro Inc.
Byte-wise Approximate Matching
Classification 3/25/2015 4
Copyright 2011 Trend Micro Inc.
Similarity Digesting Tools
• Similarity digesting :
– A class of hash techniques or tools that preserve similarity.
– Typical steps for digest generation:
– Detecting similarity with similarity digesting:
• Three similarity digesting algorithms and tools:
– ssdeep, sdhash & TLSH
Classification 3/25/2015 5
Copyright 2011 Trend Micro Inc.
Similarity Digesting Tools
• ssdeep
– Two steps for digesting:
– Edit Distance: Levenshtein distance
Classification 3/25/2015 6
Copyright 2011 Trend Micro Inc.
Similarity Digesting Tools
• Sdhash by Dr Vassil Roussev
– Two steps for digesting:
– Edit Distance: Hamming distance
Classification 3/25/2015 7
Copyright 2011 Trend Micro Inc.
Similarity Digesting Tools
• TLSH
– Two steps for digesting :
– Edit Distance: A diff based evaluation function
Classification 3/25/2015 8
Copyright 2011 Trend Micro Inc.
Mathematical Models for Byte-wise Similarity
• Summary of Three Similarity Digesting Schemes:
– Using a first model to describe a binary string with selected features:
• ssdeep model: a string is a sequence of chunks (split from the string).
• sdhash model: a string is a bag of 64-byte blocks (selected with entropy
values).
• TLSH model: a string is a bag of triplets (selected from all 5-grams).
– Using a second model to map the selected features into a digest which
is able to preserve similarity to certain degree.
• ssdeep model: a sequence of chunks is mapped into a 80-byte digest.
• sdhash model: a bag of blocks is mapped into one or multiple 256-byte
bloom filter bitmaps.
• TLSH model: a bag of triplets is mapped into a 32-byte container.
Classification 3/25/2015 9
Copyright 2011 Trend Micro Inc.
Mathematical Models for Byte-wise Similarity
• Three approaches for similarity evaluation:
Classification 3/25/2015 10
• 1st model plays critical role for similarity comparison.
• Let focus on discussing various 1st models today.
• Based on a unified format.
• 2nd model saves space but further reduces accuracy.
Copyright 2011 Trend Micro Inc.
Mathematical Models for Byte-wise Similarity
• Unified format for 1st model:
– A string is described as a collection of tokens (aka, features)
organized by a data structure:
• ssdeep: a sequence of chunks.
• sdhash: a bag of 64-byte blocks with high entropy values.
• TLSH: a bag of selected triplets.
– Two types of data structures: sequence, bag.
– Three types of tokens: chunks, blocks, triplets.
• Analogical comparison:
Classification 3/25/2015 11
Copyright 2011 Trend Micro Inc.
Mathematical Models for Byte-wise Similarity
• Four general types of tokens from binary strings:
– k-grams where k is as small as 3,4,…
– k-subsequences: any subsequence with length k. The triplet in TLSH
is an example.
– Chunks: whole string is split into non-overlapping chunks.
– Blocks: selected substrings of fixed length.
• Eight different models to describe a string for similarity.
• Analogical thinking:
– we define different distances to describe a metric space.
Classification 3/25/2015 12
Copyright 2011 Trend Micro Inc.
Tool Evaluation with Theoretic Analysis
• Data Structure:
– Bag: a bag ignores the order of tokens. It is good at handling content
swapping.
– Sequence: a sequence organizes tokens in an order. This is weak for handling
content swapping.
• Tokens:
– k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at
handling fragmentation.
– k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at
handling fragmentation .
– Chunks: This approach takes account of every byte in raw granularity. It
should be OK at handling containment and cross sharing
– Blocks: Depending on different selection functions, even though it does not
take account of every byte, but it may present a string more efficiently and
that is good for generating similarity digests. Due to the nature of fixed
length blocks, it is good at handling containment and cross sharing.
13
Copyright 2011 Trend Micro Inc.
Tool Evaluation with Theoretic Analysis
Classification 3/25/2015 14
Tool Model Minor
Changes
Containment Cross
sharing
Swap Fragmentation
ssdeep M1.3 High Medium Medium Medium Low
sdhash M2.4 High High High High Low
TLSH M2.2 High Low Medium High High
Sdhash
+ TLSH
Hybrid High High High High High
Copyright 2011 Trend Micro Inc.
Tool Evaluation with Data Experiment
Classification 3/25/2015 15
Copyright 2011 Trend Micro Inc.
Further Research for Approximate Matching
• A Roadmap for Further Research :
Classification 3/25/2015 16
Copyright 2011 Trend Micro Inc.
Q&A
• Thank you for your interest.
• Any questions?
• My Contact Information:
– Email: liwei_ren@trendmicro.com
– Linkedin: https://www.linkedin.com/in/drliweiren
– Academic Page: https://pitt.academia.edu/LiweiRen
Classification 3/25/2015 17

More Related Content

What's hot

Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpediaUnsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Marco Fossati
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
Standard template library
Standard template libraryStandard template library
Standard template library
ThamizhselviKrishnam
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
Sudarsun Santhiappan
 
Introduction into meta data model 2010-07-02
Introduction into meta data model 2010-07-02Introduction into meta data model 2010-07-02
Introduction into meta data model 2010-07-02
Trinity College Dublin
 
Presentation
PresentationPresentation
Presentation
Inna Таrasyan
 
Clustering
ClusteringClustering
Clustering
M Rizwan Aqeel
 
Searching Techniques and Analysis
Searching Techniques and AnalysisSearching Techniques and Analysis
Searching Techniques and Analysis
AkashBorse2
 
Data types & object
Data types & objectData types & object
Data types & object
Alvaro Anguiano
 
Data Clustering Using Swarm Intelligence Algorithms An Overview
Data Clustering Using  Swarm Intelligence Algorithms  An OverviewData Clustering Using  Swarm Intelligence Algorithms  An Overview
Data Clustering Using Swarm Intelligence Algorithms An Overview
Aboul Ella Hassanien
 
msf566-syllabus
msf566-syllabusmsf566-syllabus
msf566-syllabus
Andrew Paul Acosta
 
Hendrik introduction into meta data model 2010-06-20
Hendrik introduction into meta data model 2010-06-20Hendrik introduction into meta data model 2010-06-20
Hendrik introduction into meta data model 2010-06-20
Trinity College Dublin
 
Information retrieval 16 latent semantic indexing model
Information retrieval 16  latent semantic indexing modelInformation retrieval 16  latent semantic indexing model
Information retrieval 16 latent semantic indexing model
Vaibhav Khanna
 
Contextual Definition Generation
Contextual Definition GenerationContextual Definition Generation
Contextual Definition Generation
Sergey Sosnovsky
 
C Omega
C OmegaC Omega
C Omega
iradarji
 

What's hot (15)

Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpediaUnsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
Unsupervised Learning of an Extensive and Usable Taxonomy for DBpedia
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Standard template library
Standard template libraryStandard template library
Standard template library
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Introduction into meta data model 2010-07-02
Introduction into meta data model 2010-07-02Introduction into meta data model 2010-07-02
Introduction into meta data model 2010-07-02
 
Presentation
PresentationPresentation
Presentation
 
Clustering
ClusteringClustering
Clustering
 
Searching Techniques and Analysis
Searching Techniques and AnalysisSearching Techniques and Analysis
Searching Techniques and Analysis
 
Data types & object
Data types & objectData types & object
Data types & object
 
Data Clustering Using Swarm Intelligence Algorithms An Overview
Data Clustering Using  Swarm Intelligence Algorithms  An OverviewData Clustering Using  Swarm Intelligence Algorithms  An Overview
Data Clustering Using Swarm Intelligence Algorithms An Overview
 
msf566-syllabus
msf566-syllabusmsf566-syllabus
msf566-syllabus
 
Hendrik introduction into meta data model 2010-06-20
Hendrik introduction into meta data model 2010-06-20Hendrik introduction into meta data model 2010-06-20
Hendrik introduction into meta data model 2010-06-20
 
Information retrieval 16 latent semantic indexing model
Information retrieval 16  latent semantic indexing modelInformation retrieval 16  latent semantic indexing model
Information retrieval 16 latent semantic indexing model
 
Contextual Definition Generation
Contextual Definition GenerationContextual Definition Generation
Contextual Definition Generation
 
C Omega
C OmegaC Omega
C Omega
 

Similar to A Theoretic Framework for Evaluating Similarity Digesting Tools

Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
Liwei Ren任力偉
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
Liwei Ren任力偉
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
SowmyaJyothi3
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
Stefan Schlobach
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Spark Summit
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
Nandakumar P
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
Editor IJMTER
 
K044055762
K044055762K044055762
K044055762
IJERA Editor
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...
pathsproject
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
Polytechnic University of Bari
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
Praxitelis Nikolaos Kouroupetroglou
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
Frank Kelly
 
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
PyData
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
Ernesto Reig
 
Data mining
Data miningData mining
Data mining
EmaSushan
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
bintis1
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
rathnaarul
 
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEEFINALYEARSTUDENTPROJECTS
 
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
IEEEMEMTECHSTUDENTSPROJECTS
 

Similar to A Theoretic Framework for Evaluating Similarity Digesting Tools (20)

Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
CLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdfCLUSTERING IN DATA MINING.pdf
CLUSTERING IN DATA MINING.pdf
 
Keynote at AImWD
Keynote at AImWDKeynote at AImWD
Keynote at AImWD
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
UNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data MiningUNIT - 4: Data Warehousing and Data Mining
UNIT - 4: Data Warehousing and Data Mining
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
Textual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative AnalysisTextual Data Partitioning with Relationship and Discriminative Analysis
Textual Data Partitioning with Relationship and Discriminative Analysis
 
K044055762
K044055762K044055762
K044055762
 
Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...Evaluating the Use of Clustering for Automatically Organising Digital Library...
Evaluating the Use of Clustering for Automatically Organising Digital Library...
 
Recommender Systems and Linked Open Data
Recommender Systems and Linked Open DataRecommender Systems and Linked Open Data
Recommender Systems and Linked Open Data
 
Incremental clustering in search engines
Incremental clustering in search enginesIncremental clustering in search engines
Incremental clustering in search engines
 
Hierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyondHierarchical clustering in Python and beyond
Hierarchical clustering in Python and beyond
 
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
An Example of Predictive Analytics: Building a Recommendation Engine Using Py...
 
Elasticsearch - basics and beyond
Elasticsearch - basics and beyondElasticsearch - basics and beyond
Elasticsearch - basics and beyond
 
Data mining
Data miningData mining
Data mining
 
algoritma klastering.pdf
algoritma klastering.pdfalgoritma klastering.pdf
algoritma klastering.pdf
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
IEEE 2014 JAVA DATA MINING PROJECTS A similarity measure for text classificat...
 
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
2014 IEEE JAVA DATA MINING PROJECT A similarity measure for text classificati...
 

More from Liwei Ren任力偉

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
Liwei Ren任力偉
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
Liwei Ren任力偉
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
Liwei Ren任力偉
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
Liwei Ren任力偉
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
Liwei Ren任力偉
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
Liwei Ren任力偉
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
Liwei Ren任力偉
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
Liwei Ren任力偉
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
Liwei Ren任力偉
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
Liwei Ren任力偉
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Liwei Ren任力偉
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Liwei Ren任力偉
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Liwei Ren任力偉
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
Liwei Ren任力偉
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
Liwei Ren任力偉
 
Math stories
Math storiesMath stories
Math stories
Liwei Ren任力偉
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
Liwei Ren任力偉
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
Liwei Ren任力偉
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
Liwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
Liwei Ren任力偉
 

More from Liwei Ren任力偉 (20)

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
 
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and AlgorithmsNear Duplicate Document Detection: Mathematical Modeling and Algorithms
Near Duplicate Document Detection: Mathematical Modeling and Algorithms
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
 
Math stories
Math storiesMath stories
Math stories
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
 

Recently uploaded

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 

Recently uploaded (20)

Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 

A Theoretic Framework for Evaluating Similarity Digesting Tools

  • 1. Copyright 2011 Trend Micro Inc. 1 A Theoretic Framework for Evaluating Similarity Digesting Tools Liwei Ren, Ph.D, Trend Micro™ DFRWS EU 2015, Dublin, Ireland, March, 2015
  • 2. Copyright 2011 Trend Micro Inc. Agenda • Byte-wise Approximate Matching • Similarity Digesting Tools • Mathematical Models for Byte-wise Similarity • Tool Evaluation with Theoretic Analysis • Tool Evaluation with Data Experiment • Further Research for Approximate Matching Classification 3/25/2015 2
  • 3. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Matching • Byte-wise similarity & approximate matching. – What is byte-wise similarity ? • 4 Use Cases specified by NIST : Classification 3/25/2015 3
  • 4. Copyright 2011 Trend Micro Inc. Byte-wise Approximate Matching Classification 3/25/2015 4
  • 5. Copyright 2011 Trend Micro Inc. Similarity Digesting Tools • Similarity digesting : – A class of hash techniques or tools that preserve similarity. – Typical steps for digest generation: – Detecting similarity with similarity digesting: • Three similarity digesting algorithms and tools: – ssdeep, sdhash & TLSH Classification 3/25/2015 5
  • 6. Copyright 2011 Trend Micro Inc. Similarity Digesting Tools • ssdeep – Two steps for digesting: – Edit Distance: Levenshtein distance Classification 3/25/2015 6
  • 7. Copyright 2011 Trend Micro Inc. Similarity Digesting Tools • Sdhash by Dr Vassil Roussev – Two steps for digesting: – Edit Distance: Hamming distance Classification 3/25/2015 7
  • 8. Copyright 2011 Trend Micro Inc. Similarity Digesting Tools • TLSH – Two steps for digesting : – Edit Distance: A diff based evaluation function Classification 3/25/2015 8
  • 9. Copyright 2011 Trend Micro Inc. Mathematical Models for Byte-wise Similarity • Summary of Three Similarity Digesting Schemes: – Using a first model to describe a binary string with selected features: • ssdeep model: a string is a sequence of chunks (split from the string). • sdhash model: a string is a bag of 64-byte blocks (selected with entropy values). • TLSH model: a string is a bag of triplets (selected from all 5-grams). – Using a second model to map the selected features into a digest which is able to preserve similarity to certain degree. • ssdeep model: a sequence of chunks is mapped into a 80-byte digest. • sdhash model: a bag of blocks is mapped into one or multiple 256-byte bloom filter bitmaps. • TLSH model: a bag of triplets is mapped into a 32-byte container. Classification 3/25/2015 9
  • 10. Copyright 2011 Trend Micro Inc. Mathematical Models for Byte-wise Similarity • Three approaches for similarity evaluation: Classification 3/25/2015 10 • 1st model plays critical role for similarity comparison. • Let focus on discussing various 1st models today. • Based on a unified format. • 2nd model saves space but further reduces accuracy.
  • 11. Copyright 2011 Trend Micro Inc. Mathematical Models for Byte-wise Similarity • Unified format for 1st model: – A string is described as a collection of tokens (aka, features) organized by a data structure: • ssdeep: a sequence of chunks. • sdhash: a bag of 64-byte blocks with high entropy values. • TLSH: a bag of selected triplets. – Two types of data structures: sequence, bag. – Three types of tokens: chunks, blocks, triplets. • Analogical comparison: Classification 3/25/2015 11
  • 12. Copyright 2011 Trend Micro Inc. Mathematical Models for Byte-wise Similarity • Four general types of tokens from binary strings: – k-grams where k is as small as 3,4,… – k-subsequences: any subsequence with length k. The triplet in TLSH is an example. – Chunks: whole string is split into non-overlapping chunks. – Blocks: selected substrings of fixed length. • Eight different models to describe a string for similarity. • Analogical thinking: – we define different distances to describe a metric space. Classification 3/25/2015 12
  • 13. Copyright 2011 Trend Micro Inc. Tool Evaluation with Theoretic Analysis • Data Structure: – Bag: a bag ignores the order of tokens. It is good at handling content swapping. – Sequence: a sequence organizes tokens in an order. This is weak for handling content swapping. • Tokens: – k-grams: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation. – k-sequences: Due to the small k ( 3,4,5,…), this fine granularity is good at handling fragmentation . – Chunks: This approach takes account of every byte in raw granularity. It should be OK at handling containment and cross sharing – Blocks: Depending on different selection functions, even though it does not take account of every byte, but it may present a string more efficiently and that is good for generating similarity digests. Due to the nature of fixed length blocks, it is good at handling containment and cross sharing. 13
  • 14. Copyright 2011 Trend Micro Inc. Tool Evaluation with Theoretic Analysis Classification 3/25/2015 14 Tool Model Minor Changes Containment Cross sharing Swap Fragmentation ssdeep M1.3 High Medium Medium Medium Low sdhash M2.4 High High High High Low TLSH M2.2 High Low Medium High High Sdhash + TLSH Hybrid High High High High High
  • 15. Copyright 2011 Trend Micro Inc. Tool Evaluation with Data Experiment Classification 3/25/2015 15
  • 16. Copyright 2011 Trend Micro Inc. Further Research for Approximate Matching • A Roadmap for Further Research : Classification 3/25/2015 16
  • 17. Copyright 2011 Trend Micro Inc. Q&A • Thank you for your interest. • Any questions? • My Contact Information: – Email: liwei_ren@trendmicro.com – Linkedin: https://www.linkedin.com/in/drliweiren – Academic Page: https://pitt.academia.edu/LiweiRen Classification 3/25/2015 17