Copyright 2011 Trend Micro Inc. 1
Bytewise approximate matching,
searching and clustering
Liwei Ren, Ph.D
Ray Cheng, Ph.D
Trend Micro Inc.
DFRWS USA 2015, August 2015, Philadelphia, PA
Agenda
• Background
• Six Matching Problems and Bytewise Relevance
• Current Work: A Framework of Theory, Algorithms, and
Technologies
• Future Work
Classification 8/17/2015 2
Background
• Similarity digesting schemes:
– Problem: Given two binary strings s1 and s2, measure their similarity.
• Compute a hash that preserves the similarity property of the strings.
• Measure similarity by comparing the two hash values.
– Examples: TLSH, ssdeep, sdhash
Background
• The NIST specification document NIST.SP.800-168 introduces the
concept of bytewise approximate matching:
– The NIST document lists four cases to describe this concept:
• Object similarity detection: identify related artifacts, e.g. different versions
of a document.
• Cross Correlation: identify artifacts sharing a common object.
• Embedded Object Detection: identify a given object inside an artifact.
• Fragment Detection: identify the presence of traces/fragments of a known
artifact.
• Dr. Liwei Ren's talk at DFRWS EU 2015:
– A Theoretic Framework for Evaluating Similarity Digesting Tools
– Using a mathematical model to describe binary similarity.
Six Matching Problems and Bytewise Relevance
• The NIST document does not cover all bytewise approximate
matching cases.
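• We generalized the NIST cases to six cases: identicalness (EM1), containment (EM2), cross-sharing (EM3), similarity (AM1), approximate containment (AM2), and approximate cross-sharing (AM3).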
Classification of NIST approximate
matching cases
• Similarity Detection: identify related artifacts.
– AM1 (approximate match)
• Cross Correlation: identify artifacts sharing a
common object.
– EM3 (exact match cross-sharing)
• Embedded Object Detection: identify a given
object inside an artifact.
– EM2 (exact match containment)
• Fragment Detection: identify the presence of
traces/fragments of a known artifact.
– EM2 (one or more exact match containment)
Six Matching Problems and Bytewise Relevance
• Definition 1: Given two strings R[1,…,n] and T[1,…,m], if one of the
six cases holds, we say R and T are bytewise relevant.
– We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.
A Framework of Theory, Algorithms and Technologies
• Define three fundamental problems using Bytewise
Relevance:
– Matching: Given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
– Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B
such that BR(o, b) = 1.
– Clustering: Given a bag B of objects, partition B into groups {G1,
G2,…,Gm} based on BR.
• S = the object space,
• O = an object in the object space S,
• BR = the bytewise relevance relationship for objects in S.
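The matching and searching problems above can be sketched with BR as a pluggable predicate. This is a minimal illustration, not the authors' implementation: the names `contains`, `match`, and `search` are hypothetical, and exact containment in either direction is used only as a stand-in for BR.

```python
from typing import Callable, Iterable, List

# A bytewise-relevance predicate: returns True when BR(o1, o2) = 1.
BRPredicate = Callable[[bytes, bytes], bool]


def contains(o1: bytes, o2: bytes) -> bool:
    """Stand-in BR (an assumption): exact containment in either direction."""
    return o1 in o2 or o2 in o1


def match(o1: bytes, o2: bytes, br: BRPredicate = contains) -> bool:
    """Matching: decide whether BR(o1, o2) = 1."""
    return br(o1, o2)


def search(o: bytes, bag: Iterable[bytes], br: BRPredicate = contains) -> List[bytes]:
    """Searching: return every b in the bag with BR(o, b) = 1 (linear scan)."""
    return [b for b in bag if br(o, b)]
```

Any of the six cases could be supplied as `br`; the linear scan in `search` is the simplest possible strategy and costs one BR evaluation per object in the bag.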
A Framework of Theory, Algorithms and Technologies
• Our bytewise relevance framework: [Figure: bytewise relevance framework diagram]
Matching
• The six matching problems, EM1 to AM3:
– Identicalness (EM1): the solution is trivial.
– Containment (EM2): solved by the Rabin-Karp algorithm.
– Cross-sharing (EM3):
• We established a theory for this interesting problem: how to measure cross-sharing.
• We developed an algorithmic solution with theoretical analysis.
– Similarity (AM1):
• TLSH, ssdeep and sdhash
• Dr. Ren's talk at DFRWS EU 2015 described eight approaches to
solving this problem.
– We designed a novel similarity digesting scheme, TSFP.
– Approximate containment (AM2): two heuristic algorithms
– Approximate cross-sharing (AM3): one heuristic algorithm
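For the containment case (EM2), a textbook Rabin-Karp search looks roughly like the sketch below. The function name `rabin_karp` and the `base` and `mod` values are illustrative choices, not taken from the slides.

```python
def rabin_karp(pattern: bytes, text: bytes,
               base: int = 257, mod: int = 1_000_000_007) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    high = pow(base, m - 1, mod)  # weight of the leading byte in a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        t_hash = (t_hash * base + text[i]) % mod

    for i in range(n - m + 1):
        # Compare bytes only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Slide the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - text[i] * high) * base + text[i + m]) % mod
    return -1
```

With this helper, R is contained in T exactly when `rabin_karp(R, T) != -1`. The rolling hash keeps the expected cost close to linear in the combined length of the two strings.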
Searching
• For the relationship BR, the searching problem is:
– B is a bag of strings. Given a string T, find s ∊ B such that
BR(T, s) = 1.
Searching
• How do we solve the searching problem?
– Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can
this scale to millions or billions of strings?
– Candidate selection approach: a two-step approach
• STEP 1: quickly select a few candidates {s1, s2,…,sm}.
• STEP 2: evaluate BR(T, sk) for each candidate.
– How do we select good candidates?
• String fingerprinting: generate fingerprints from each string in B.
• Indexing process: index the fingerprints along with the string ID to create
an index database, FP-DB.
• Searching process: given T, generate fingerprints {FP1, FP2,…,FPq} and
use them to look up possible candidates in FP-DB.
– NOTE:
• This is similar to a keyword-based search engine where the keywords are
the fingerprints.
• The fingerprinting procedure is effectively a special tokenization method.
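The two-step candidate selection approach can be sketched as an inverted index over fingerprints. The slides do not say how fingerprints are generated, so this sketch assumes the simplest option, every k-byte window of a string. The names `fingerprints`, `build_index`, `select_candidates`, and `search_with_index` are hypothetical, and a plain dictionary stands in for FP-DB.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Set


def fingerprints(s: bytes, k: int = 4) -> Set[bytes]:
    """Assumed fingerprinting scheme: the set of all k-byte windows of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def build_index(bag: List[bytes], k: int = 4) -> Dict[bytes, Set[int]]:
    """Indexing process: map each fingerprint to the IDs of strings containing it."""
    index: Dict[bytes, Set[int]] = defaultdict(set)
    for sid, s in enumerate(bag):
        for fp in fingerprints(s, k):
            index[fp].add(sid)
    return index


def select_candidates(t: bytes, index: Dict[bytes, Set[int]],
                      k: int = 4, limit: int = 10) -> List[int]:
    """STEP 1: rank string IDs by the number of fingerprints shared with t."""
    votes: Counter = Counter()
    for fp in fingerprints(t, k):
        for sid in index.get(fp, ()):
            votes[sid] += 1
    return [sid for sid, _ in votes.most_common(limit)]


def search_with_index(t: bytes, bag: List[bytes], index: Dict[bytes, Set[int]],
                      br: Callable[[bytes, bytes], bool],
                      k: int = 4, limit: int = 10) -> List[int]:
    """STEP 2: evaluate BR only on the selected candidates."""
    return [sid for sid in select_candidates(t, index, k, limit) if br(t, bag[sid])]
```

Only strings that share at least one fingerprint with T are ever passed to BR, which is what lets the search avoid a full scan of B. A production scheme would likely sample or hash the windows to keep the index small.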
Future Work: Clustering Problem
• For the relationship BR, one has a clustering problem:
– B is a bag of strings; partition B into groups of strings based on BR.
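One possible reading of "partition B based on BR" is to treat BR as the edges of a graph and take its connected components. The slides leave the clustering method open as future work, so the sketch below is only an assumption; the name `cluster` is hypothetical and the pairwise loop is quadratic in the size of B.

```python
from typing import Callable, Dict, List


def cluster(bag: List[bytes],
            br: Callable[[bytes, bytes], bool]) -> List[List[int]]:
    """Group indices of bag into connected components of the BR graph."""
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Assumption: any pair with BR = 1 belongs to the same group.
    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if br(bag[i], bag[j]):
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(bag)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Because BR need not be transitive, two strings can land in the same group without being directly relevant to each other. Whether that is acceptable depends on the application.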
Future Work: Library and tools
• Analyze algorithms and measure performance.
– Verify they can scale.
• For bytewise approximate matching, searching and clustering,
– Library of functions
– API
– Tools
Application examples of Approximate
Matching, Searching, Clustering
• E-Discovery
– Comparing near duplicate documents
– Grouping near duplicate documents
• Digital forensic analysis
– Identifying similar objects or files
• Malware analysis
– Identifying similar malware or mutated malware
• Anti-plagiarism
– Detection of copyright violations
• Source code governance
• Spam filtering
• Data Loss Prevention
Q&A
• Thank you.
• Any questions?
• Email:
– liwei_ren@trendmicro.com
– ray_cheng@trendmicro.com
Application Example
• A search problem in a DLP (Data Loss Prevention) system:
– Problem: S = {d1, d2,…, dn} is a collection of confidential documents.
Given any document T and 0<δ≤1, find a document d ∊ S such that
RLV(d,T) ≥ δ.
• RLV is a function that measures the relevance of two documents.
• Challenges: how to construct RLV and choose δ? How to make the search scalable?
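To make the problem statement concrete, the sketch below uses Jaccard similarity over word shingles as a stand-in for RLV. The slides leave the construction of RLV open, so this choice, along with the names `shingles`, `rlv`, and `find_relevant`, is an assumption for illustration only.

```python
from typing import List, Optional, Set


def shingles(doc: str, k: int = 3) -> Set[str]:
    """All runs of k consecutive lowercased words in doc."""
    words = doc.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def rlv(d: str, t: str, k: int = 3) -> float:
    """Stand-in relevance: Jaccard similarity of the two shingle sets, in [0, 1]."""
    a, b = shingles(d, k), shingles(t, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_relevant(docs: List[str], t: str, delta: float, k: int = 3) -> Optional[int]:
    """Return the index of the first d in docs with RLV(d, T) >= delta, else None."""
    if not 0 < delta <= 1:
        raise ValueError("delta must satisfy 0 < delta <= 1")
    for i, d in enumerate(docs):
        if rlv(d, t, k) >= delta:
            return i
    return None
```

This version scans every document, so it illustrates the definition rather than the scalability challenge. A fingerprint index would be one way to narrow the scan to a few candidates.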
Application Example
• A clustering problem in e-Discovery:
– Data are identified as potentially relevant by attorneys
– De-duplication technology.
– Problem: partition the document set S into groups based on textual relevance.
Background
• Similarity digesting schemes:
– A family of similarity-preserving hashing techniques and tools
– Problem: Given two binary strings s1 and s2, measure their similarity
as s = SIM(H(s1), H(s2)).
• H is a hash function that preserves string similarity.
• SIM is a function that measures the similarity of two hash values.
– Examples: TLSH, ssdeep, sdhash
– Challenge: how to evaluate their relative pros and cons?
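The shape s = SIM(H(s1), H(s2)) can be illustrated with a toy digest. The sketch below uses MinHash over k-byte windows, which is not the algorithm behind TLSH, ssdeep, or sdhash. The parameters `slots` and `k` are arbitrary illustrative values.

```python
import hashlib
from typing import List


def H(s: bytes, slots: int = 64, k: int = 5) -> List[int]:
    """Toy similarity-preserving digest: MinHash over the k-byte windows of s."""
    grams = {s[i:i + k] for i in range(len(s) - k + 1)} or {s}
    digest = []
    for slot in range(slots):
        salt = slot.to_bytes(2, "big")  # a different hash function per slot
        digest.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big")
            for g in grams))
    return digest


def SIM(h1: List[int], h2: List[int]) -> float:
    """Fraction of digest slots that agree, in [0, 1]."""
    return sum(a == b for a, b in zip(h1, h2)) / len(h1)
```

Two strings that share most of their k-byte windows agree on most slots, so the score estimates the overlap between the two window sets. The digest has a fixed size regardless of input length, which is the property that makes digest comparison cheap.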
Six Matching Problems and Bytewise Relevance
• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If
problem X is a special case of problem Y, we denote this as X ↪ Y.
• We have the following relationships:
EM1 ↪ EM2 ↪ EM3
AM1 ↪ AM2 ↪ AM3
EM1 ↪ AM1, EM2 ↪ AM2, EM3 ↪ AM3

More Related Content

What's hot

Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Saeedeh Shekarpour
 
An Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning ChallengeAn Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning ChallengeTraian Rebedea
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Andre Freitas
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsMarina Santini
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsFrancesco Osborne
 
From Story-Telling to Production
From Story-Telling to ProductionFrom Story-Telling to Production
From Story-Telling to ProductionKwan-yuet Ho
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Riccardo Albertoni
 
Supporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesSupporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesFrancesco Osborne
 
Tensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningTensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningKwan-yuet Ho
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSaeedeh Shekarpour
 
ICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and mediaICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and mediagjhouben
 
Topic Extraction using Machine Learning
Topic Extraction using Machine LearningTopic Extraction using Machine Learning
Topic Extraction using Machine LearningSanjib Basak
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT ecij
 
An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...Paolo Tomeo
 
Datacamp - Networkx datacamp chapter 1
Datacamp - Networkx datacamp chapter 1 Datacamp - Networkx datacamp chapter 1
Datacamp - Networkx datacamp chapter 1 ChienNguyen124
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델guesta34d441
 

What's hot (20)

Tutorial on Question Answering Systems
Tutorial on Question Answering Systems Tutorial on Question Answering Systems
Tutorial on Question Answering Systems
 
An Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning ChallengeAn Evolution of Deep Learning Models for AI2 Reasoning Challenge
An Evolution of Deep Learning Models for AI2 Reasoning Challenge
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
 
Towards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology ApplicationsTowards a Quality Assessment of Web Corpora for Language Technology Applications
Towards a Quality Assessment of Web Corpora for Language Technology Applications
 
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic PublicationsEKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
 
Atlas.ti tutorial
Atlas.ti tutorialAtlas.ti tutorial
Atlas.ti tutorial
 
From Story-Telling to Production
From Story-Telling to ProductionFrom Story-Telling to Production
From Story-Telling to Production
 
Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...Semantic Similarity and Selection of Resources Published According to Linked ...
Semantic Similarity and Selection of Resources Published According to Linked ...
 
Supporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic TechnologiesSupporting Springer Nature Editors by means of Semantic Technologies
Supporting Springer Nature Editors by means of Semantic Technologies
 
Tensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine LearningTensor Networks and Their Applications on Machine Learning
Tensor Networks and Their Applications on Machine Learning
 
Semantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked DataSemantic Interpretation of User Query for Question Answering on Interlinked Data
Semantic Interpretation of User Query for Question Answering on Interlinked Data
 
ICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and mediaICWE2013 - Discovering links between political debates and media
ICWE2013 - Discovering links between political debates and media
 
Topic Extraction using Machine Learning
Topic Extraction using Machine LearningTopic Extraction using Machine Learning
Topic Extraction using Machine Learning
 
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
USING MACHINE LEARNING TO BUILD A SEMI-INTELLIGENT BOT
 
An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...An evaluation of SimRank and Personalized PageRank to build a recommender sys...
An evaluation of SimRank and Personalized PageRank to build a recommender sys...
 
Datacamp - Networkx datacamp chapter 1
Datacamp - Networkx datacamp chapter 1 Datacamp - Networkx datacamp chapter 1
Datacamp - Networkx datacamp chapter 1
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
Predicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systemsPredicting the relevance of search results for e-commerce systems
Predicting the relevance of search results for e-commerce systems
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Vsm 벡터공간모델
Vsm 벡터공간모델Vsm 벡터공간모델
Vsm 벡터공간모델
 

Viewers also liked

SATI MED Publicidad alternativa
SATI MED Publicidad alternativa  SATI MED Publicidad alternativa
SATI MED Publicidad alternativa SATI-MED
 
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...fetopax
 
Las telecomunicaciones multimedia (telefònica)
Las telecomunicaciones  multimedia (telefònica)Las telecomunicaciones  multimedia (telefònica)
Las telecomunicaciones multimedia (telefònica)mameluco2
 
Programa mes xabia-marc-2012
Programa mes xabia-marc-2012Programa mes xabia-marc-2012
Programa mes xabia-marc-2012Xabia_Democratica
 
презентация ооо бтех English финал
презентация ооо бтех English финалпрезентация ооо бтех English финал
презентация ооо бтех English финалshabalin-bargan
 
coupon réponse à retourner pour le 16 octobre 2015
coupon réponse à retourner pour le 16 octobre 2015coupon réponse à retourner pour le 16 octobre 2015
coupon réponse à retourner pour le 16 octobre 2015Herve FITE
 
Mansfield U3A Newsletter: December 2015
Mansfield U3A Newsletter: December 2015Mansfield U3A Newsletter: December 2015
Mansfield U3A Newsletter: December 2015dlpruk
 
Legalwise presentation
Legalwise   presentationLegalwise   presentation
Legalwise presentationSymphony3
 
Business communication 5 steps to create mutual understanding
Business communication  5 steps to create mutual understandingBusiness communication  5 steps to create mutual understanding
Business communication 5 steps to create mutual understandingUte Sommer
 
Cómo elaborar una tortilla de patata española
Cómo elaborar una tortilla de patata españolaCómo elaborar una tortilla de patata española
Cómo elaborar una tortilla de patata españolaLuis Miguel Alcon Gonzalez
 
Oray screens and home cinema seats catalog 2014
Oray screens and home cinema seats catalog 2014 Oray screens and home cinema seats catalog 2014
Oray screens and home cinema seats catalog 2014 AV-PRO
 
Ficha Técnica Diplomado E Learning en Salud Infantil Ambulatoria
Ficha Técnica Diplomado E Learning  en Salud Infantil Ambulatoria Ficha Técnica Diplomado E Learning  en Salud Infantil Ambulatoria
Ficha Técnica Diplomado E Learning en Salud Infantil Ambulatoria OTEC Innovares
 
Sexualidad
SexualidadSexualidad
Sexualidadmoira_IQ
 

Viewers also liked (20)

聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
 
Tarifario lt-2014-altoimpacto
Tarifario lt-2014-altoimpactoTarifario lt-2014-altoimpacto
Tarifario lt-2014-altoimpacto
 
SATI MED Publicidad alternativa
SATI MED Publicidad alternativa  SATI MED Publicidad alternativa
SATI MED Publicidad alternativa
 
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...
PROYECTO III ENCUENTRO PORLA PAZ DE ECOLOGIA ACTIVA "ECOEMPLEOS, PONEMOS SEMI...
 
Las telecomunicaciones multimedia (telefònica)
Las telecomunicaciones  multimedia (telefònica)Las telecomunicaciones  multimedia (telefònica)
Las telecomunicaciones multimedia (telefònica)
 
PUERTO FINGER
PUERTO FINGERPUERTO FINGER
PUERTO FINGER
 
Programa mes xabia-marc-2012
Programa mes xabia-marc-2012Programa mes xabia-marc-2012
Programa mes xabia-marc-2012
 
Sanamed 10(1) 2015
Sanamed 10(1) 2015Sanamed 10(1) 2015
Sanamed 10(1) 2015
 
Webs de Telefonica
Webs de TelefonicaWebs de Telefonica
Webs de Telefonica
 
презентация ооо бтех English финал
презентация ооо бтех English финалпрезентация ооо бтех English финал
презентация ооо бтех English финал
 
coupon réponse à retourner pour le 16 octobre 2015
coupon réponse à retourner pour le 16 octobre 2015coupon réponse à retourner pour le 16 octobre 2015
coupon réponse à retourner pour le 16 octobre 2015
 
Ptpm002 Pt Mgmt Of Limb Amputees
Ptpm002 Pt  Mgmt Of Limb AmputeesPtpm002 Pt  Mgmt Of Limb Amputees
Ptpm002 Pt Mgmt Of Limb Amputees
 
Mansfield U3A Newsletter: December 2015
Mansfield U3A Newsletter: December 2015Mansfield U3A Newsletter: December 2015
Mansfield U3A Newsletter: December 2015
 
Apoteosis de claudio
Apoteosis de claudioApoteosis de claudio
Apoteosis de claudio
 
Legalwise presentation
Legalwise   presentationLegalwise   presentation
Legalwise presentation
 
Business communication 5 steps to create mutual understanding
Business communication  5 steps to create mutual understandingBusiness communication  5 steps to create mutual understanding
Business communication 5 steps to create mutual understanding
 
Cómo elaborar una tortilla de patata española
Cómo elaborar una tortilla de patata españolaCómo elaborar una tortilla de patata española
Cómo elaborar una tortilla de patata española
 
Oray screens and home cinema seats catalog 2014
Oray screens and home cinema seats catalog 2014 Oray screens and home cinema seats catalog 2014
Oray screens and home cinema seats catalog 2014
 
Ficha Técnica Diplomado E Learning en Salud Infantil Ambulatoria
Ficha Técnica Diplomado E Learning  en Salud Infantil Ambulatoria Ficha Técnica Diplomado E Learning  en Salud Infantil Ambulatoria
Ficha Técnica Diplomado E Learning en Salud Infantil Ambulatoria
 
Sexualidad
SexualidadSexualidad
Sexualidad
 

Similar to Bytewise approximate matching, searching and clustering

Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool EvaluationLiwei Ren任力偉
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsLiwei Ren任力偉
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsLiwei Ren任力偉
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Jonathon Hare
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Rich Heimann
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown BagDataTactics
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Kira
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisJonathan Stray
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1Dave King
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Noemi Derzsy
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsPyData
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsLiwei Ren任力偉
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用台灣資料科學年會
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?Samet KILICTAS
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataHang Dong
 
Recommender Systems, Matrices and Graphs
Recommender Systems, Matrices and GraphsRecommender Systems, Matrices and Graphs
Recommender Systems, Matrices and GraphsRoelof Pieters
 

Similar to Bytewise approximate matching, searching and clustering (20)

Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
DLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and AlgorithmsDLP Systems: Models, Architecture and Algorithms
DLP Systems: Models, Architecture and Algorithms
 
Mathematical Modeling for Practical Problems
Mathematical Modeling for Practical ProblemsMathematical Modeling for Practical Problems
Mathematical Modeling for Practical Problems
 
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati...
 
Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)Data Tactics Data Science Brown Bag (April 2014)
Data Tactics Data Science Brown Bag (April 2014)
 
Data Science and Analytics Brown Bag
Data Science and Analytics Brown BagData Science and Analytics Brown Bag
Data Science and Analytics Brown Bag
 
Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)Tutorial 1 (information retrieval basics)
Tutorial 1 (information retrieval basics)
 
Frontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text AnalysisFrontiers of Computational Journalism week 2 - Text Analysis
Frontiers of Computational Journalism week 2 - Text Analysis
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
Data Science Keys to Open Up OpenNASA Datasets - PyData New York 2017
 
Data Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA DatasetsData Science Keys to Open Up OpenNASA Datasets
Data Science Keys to Open Up OpenNASA Datasets
 
Presentation at MTSR 2012
Presentation at MTSR 2012Presentation at MTSR 2012
Presentation at MTSR 2012
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting ToolsA Theoretic Framework for Evaluating Similarity Digesting Tools
A Theoretic Framework for Evaluating Similarity Digesting Tools
 
[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用[系列活動] 人工智慧與機器學習在推薦系統上的應用
[系列活動] 人工智慧與機器學習在推薦系統上的應用
 
How Graph Databases used in Police Department?
How Graph Databases used in Police Department?How Graph Databases used in Police Department?
How Graph Databases used in Police Department?
 
Rules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging dataRules for inducing hierarchies from social tagging data
Rules for inducing hierarchies from social tagging data
 
Recommender Systems, Matrices and Graphs
Recommender Systems, Matrices and GraphsRecommender Systems, Matrices and Graphs
Recommender Systems, Matrices and Graphs
 

Bytewise approximate matching, searching and clustering

  • 1. Bytewise approximate matching, searching and clustering
      Liwei Ren, Ph.D
      Ray Cheng, Ph.D
      Trend Micro Inc.
      DFRWS USA 2015, August 2015, Philadelphia, PA
      Copyright 2011 Trend Micro Inc.
  • 2. Agenda
      • Background
      • Six Matching Problems and Bytewise Relevance
      • Current Work: A Framework of Theory, Algorithms, and Technologies
      • Future Work
      Classification 8/17/2015
  • 3. Background
      • Similarity digesting schemes:
          – Problem: Given two binary strings s1 and s2, measure their similarity.
              • Compute a hash that preserves the similarity property of the strings.
              • Measure similarity by comparing the two hash values.
          – Examples: TLSH, ssdeep, sdhash
  • 4. Background
      • NIST specification document NIST.SP.800-168 introduces the concept of bytewise approximate matching:
          – The NIST document lists four cases to describe this concept:
              • Object similarity detection: identify related artifacts, e.g. different versions of a document.
              • Cross correlation: identify artifacts sharing a common object.
              • Embedded object detection: identify a given object inside an artifact.
              • Fragment detection: identify the presence of traces/fragments of a known artifact.
      • Dr. Liwei Ren's talk at DFRWS EU 2015:
          – A Theoretic Framework for Evaluating Similarity Digesting Tools
          – Using a mathematical model to describe binary similarity.
  • 5. Six Matching Problems and Bytewise Relevance
      • The NIST document does not cover all bytewise approximate matching cases.
      • We generalized the NIST cases to six cases:
  • 6. Six Matching Problems and Bytewise Relevance
      • Continued:
  • 7. Classification of NIST approximate matching cases
      • Similarity Detection: identify related artifacts.
          – AM1 (approximate match)
      • Cross Correlation: identify artifacts sharing a common object.
          – EM3 (exact match cross-sharing)
      • Embedded Object Detection: identify a given object inside an artifact.
          – EM2 (exact match containment)
      • Fragment Detection: identify the presence of traces/fragments of a known artifact.
          – EM2 (one or more exact match containments)
  • 8. Six Matching Problems and Bytewise Relevance
      • Definition 1: Given two strings R[1,…,n] and T[1,…,m], if one of the six cases holds, we say R and T are bytewise relevant.
          – We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.
  • 9. A Framework of Theory, Algorithms and Technologies
      • Define three fundamental problems using Bytewise Relevance:
          – Matching: Given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
          – Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B such that BR(o, b) = 1.
          – Clustering: Given a bag B of objects, partition B into groups {G1, G2, …, Gm} based on BR.
      • S = an object space,
      • O = an object in the object space S,
      • BR = the Bytewise Relevance relationship for objects in S.
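As a rough illustration of how the three problems on slide 9 relate to a single BR predicate, the sketch below takes BR as a pluggable function. The stand-in `br` covers only exact containment, and treating clusters as connected components of the BR graph is an assumption made here for illustration; the slides do not specify how groups are formed.

```python
from typing import Callable, List, Sequence

Relation = Callable[[bytes, bytes], int]


def br(a: bytes, b: bytes) -> int:
    """Hypothetical stand-in for BR: exact containment in either direction."""
    return 1 if (a in b or b in a) else 0


def match(o1: bytes, o2: bytes, rel: Relation = br) -> bool:
    """Matching: decide whether BR(o1, o2) = 1."""
    return rel(o1, o2) == 1


def search(o: bytes, bag: Sequence[bytes], rel: Relation = br) -> List[bytes]:
    """Searching: brute-force scan for every b in the bag with BR(o, b) = 1."""
    return [b for b in bag if rel(o, b) == 1]


def cluster(bag: Sequence[bytes], rel: Relation = br) -> List[List[bytes]]:
    """Clustering: group objects into connected components of the BR graph.

    Assumption: BR need not be transitive, so two objects share a group
    whenever a chain of relevant pairs links them (union-find).
    """
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if rel(bag[i], bag[j]) == 1:
                parent[find(i)] = find(j)

    groups: dict = {}
    for i, item in enumerate(bag):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())
```

Both `search` and `cluster` evaluate BR exhaustively here, so they are only meant to pin down the problem statements, not to scale.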
  • 10. A Framework of Theory, Algorithms and Technologies
      • Our bytewise relevance framework:
  • 11. Matching
      • The Six Matching Problems EM1 to AM3
          – Identicalness EM1: the solution is trivial.
          – Containment EM2: the solution is the Rabin-Karp algorithm.
          – Cross-sharing EM3:
              • We established a theory on this interesting problem: how to measure cross-sharing.
              • We developed an algorithmic solution with theoretic analysis.
          – Similarity AM1:
              • TLSH, ssdeep and sdhash
              • Dr. Ren delivered a talk at DFRWS EU 2015: there are eight approaches to solve this problem.
                  – We designed a novel similarity digesting scheme, TSFP.
          – Approximate containment AM2: two heuristic algorithms
          – Approximate cross-sharing AM3: one heuristic algorithm
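Slide 11 names the Rabin-Karp algorithm as the solution to the containment problem EM2. A minimal textbook sketch of that algorithm over byte strings follows; the function name and the `base` and `mod` constants are illustrative choices, not values taken from the talk.

```python
def rabin_karp(pattern: bytes, text: bytes,
               base: int = 257, mod: int = 1_000_000_007) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Weight of the byte that leaves the window on each slide.
    high = pow(base, m - 1, mod)

    # Hash the pattern and the first window of the text.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        t_hash = (t_hash * base + text[i]) % mod

    for i in range(n - m + 1):
        # Compare bytes only when hashes agree, to rule out collisions.
        if p_hash == t_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll the hash: drop text[i], append text[i + m].
            t_hash = ((t_hash - text[i] * high) * base + text[i + m]) % mod
    return -1


def contains(r: bytes, t: bytes) -> bool:
    """EM2-style check: is R contained in T?"""
    return rabin_karp(r, t) != -1
```

The rolling hash lets each window be checked in constant time on average, with the byte-wise comparison guarding against hash collisions.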
  • 12. Searching
      • For the relationship BR, the searching problem:
          – B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.
  • 13. Searching
      • How to solve the searching problem?
          – Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can this scale to millions or billions of strings?
          – Candidate selection approach: a two-step approach
              • STEP 1: quickly select a few candidates {s1, s2, …, sm}.
              • STEP 2: evaluate each BR(T, sk).
          – How to select good candidates?
              • String fingerprinting: generate fingerprints from each string in B.
              • Indexing process: index the fingerprints along with the string IDs to create an index database, FP-DB.
              • Searching process: given T, generate fingerprints {FP1, FP2, …, FPq} and use them to look up possible candidates in FP-DB.
          – NOTE:
              • This is similar to a keyword-based search engine where the keywords are the fingerprints.
              • The fingerprinting procedure is actually a special tokenization method.
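The two-step candidate selection approach on slide 13 can be sketched as an inverted index from fingerprints to string IDs. The fingerprinting used here (every overlapping n-byte substring) is a deliberately naive assumption, since the slides do not say how fingerprints are chosen; the class name, `n`, and `top_k` are likewise hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Set


def fingerprints(data: bytes, n: int = 8) -> Set[bytes]:
    """Naive fingerprinting (an assumption): all overlapping n-byte substrings."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class FingerprintIndex:
    """Toy FP-DB: maps each fingerprint to the IDs of strings containing it."""

    def __init__(self, n: int = 8) -> None:
        self.n = n
        self.postings: Dict[bytes, Set[str]] = defaultdict(set)
        self.strings: Dict[str, bytes] = {}

    def add(self, string_id: str, data: bytes) -> None:
        """Indexing process: store the fingerprints along with the string ID."""
        self.strings[string_id] = data
        for fp in fingerprints(data, self.n):
            self.postings[fp].add(string_id)

    def candidates(self, query: bytes, top_k: int = 3) -> List[str]:
        """STEP 1: pick the IDs that share the most fingerprints with the query."""
        votes: Counter = Counter()
        for fp in fingerprints(query, self.n):
            for string_id in self.postings.get(fp, ()):
                votes[string_id] += 1
        return [string_id for string_id, _ in votes.most_common(top_k)]

    def search(self, query: bytes, br: Callable[[bytes, bytes], int],
               top_k: int = 3) -> List[str]:
        """STEP 2: evaluate BR only on the selected candidates."""
        return [sid for sid in self.candidates(query, top_k)
                if br(query, self.strings[sid]) == 1]
```

For example, with a containment predicate standing in for BR:

```python
index = FingerprintIndex()
index.add("doc1", b"the quick brown fox jumps over the lazy dog")
index.add("doc2", b"completely unrelated content here")
hits = index.search(b"quick brown fox", lambda t, s: 1 if t in s else 0)
```

A production index would typically hash and sample the fingerprints to bound its size; that selection step is left out of this sketch.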
  • 14. Future Work: Clustering Problem
      • For the relationship BR, one has a clustering problem:
          – B is a bag of strings; partition B into groups of strings based on BR.
  • 15. Future Work: Library and tools
      • Analyze algorithms and measure performance.
          – Verify they can scale.
      • For bytewise approximate matching, searching and clustering,
          – Library of functions
          – API
          – Tools
  • 16. Application examples of Approximate Matching, Searching, Clustering
      • E-Discovery
          – Comparing near duplicate documents
          – Grouping near duplicate documents
      • Digital forensic analysis
          – Identifying similar objects or files
      • Malware analysis
          – Identifying similar malware or mutated malware
      • Anti-plagiarism
          – Detection of copyright violations
      • Source code governance
      • Spam filtering
      • Data Loss Prevention
  • 17. Q&A
      • Thank you.
      • Any questions?
      • Email:
          – liwei_ren@trendmicro.com
          – ray_cheng@trendmicro.com
  • 18. Application Example
      • A search problem in a DLP (Data Loss Prevention) system:
          – Problem: S = {d1, d2, …, dn} is a collection of confidential documents. Given any document T and 0 < δ ≤ 1, find a document d ∊ S such that RLV(d,T) ≥ δ.
              • RLV is a function that measures the relevance of two documents.
              • Challenges: how to construct RLV and choose δ? How to make the search scalable?
  • 19. Application Example
      • A clustering problem in e-Discovery:
          – Data are identified as potentially relevant by attorneys
          – De-duplication technology.
          – Problem: partition S into groups based on the textual relevance.
  • 20. Background
      • Similarity digesting schemes:
          – A family of similarity-preserving hashing techniques and tools
          – Problem: Given two binary strings s1 and s2, measure the similarity by s = SIM(H(s1), H(s2)).
              • H is a hash function that preserves string similarity.
              • SIM is another function that measures the similarity of two hash values.
          – Examples: TLSH, ssdeep, sdhash
          – Challenge: how to evaluate the pros and cons of each?
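To make the s = SIM(H(s1), H(s2)) formulation on slide 20 concrete, here is a toy digest. It is not TLSH, ssdeep, or sdhash: as an assumption for illustration, H keeps the k smallest hashes of the string's n-byte substrings, and SIM is the Jaccard overlap of two such sets. The parameters `n` and `k` are hypothetical.

```python
import hashlib
from typing import FrozenSet


def H(s: bytes, n: int = 5, k: int = 64) -> FrozenSet[int]:
    """Toy similarity-preserving digest: the k smallest n-gram hashes of s."""
    grams = {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    hashes = sorted(
        int.from_bytes(hashlib.sha1(g).digest()[:8], "big") for g in grams
    )
    return frozenset(hashes[:k])


def SIM(h1: FrozenSet[int], h2: FrozenSet[int]) -> float:
    """Jaccard overlap of two digests, a score in [0, 1]."""
    if not h1 and not h2:
        return 1.0
    return len(h1 & h2) / len(h1 | h2)
```

Strings that share most of their n-byte substrings keep most of the same hashes, so a small edit lowers the score only slightly while unrelated strings score near zero. How well any real scheme preserves similarity is exactly the evaluation challenge the slide raises.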
  • 21. Six Matching Problems and Bytewise Relevance
      • Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ↪ Y.
      • We have the following relationships:
          [Diagram: EM1, EM2, EM3, AM1, AM2, AM3 connected by seven ↪ arrows]