SlideShare a Scribd company logo
1 of 25
Expediting MRSH-v2 Approximate
Matching with Hierarchical Bloom
Filter Trees ICDF2C October 10th 2017
DAVID LILLIS, FRANK BREITINGER AND MARK SCANLON
Approximate Matching
 Scenario: Collection of “known illegal” files. Want to search
for these on a seized device.
 Finding exact matches is easy (hashing).
 Approximate matching (a.k.a. “fuzzy hashing”) aims to find
similar files on the byte level, e.g.
 Files that have been extended/truncated.
 Files within files.
 Partial files.
2
 Initially proposed by Breitinger & Baier (2012).
 Generates a similarity digest for each file.
 Consists of one or more Bloom Filters: probabilistic data structure that can say
whether it probably contains an item, or definitely does not contain it.
 These can be compared to calculate a similarity score.
 File divided into “chunks”: file read byte-by-byte and a rolling hash identifies
the end of a chunk.
 Each chunk is hashed using FNV (a fast, noncryptographic hashing function).
 Hash used to set 5 bits of the Bloom Filter.
 When Bloom Filter is full, a new, empty Bloom Filter is added to the
digest, and further inserts are in this.
MRSH-v2 3
Motivations
 Problem: Similarity score comes from a pairwise comparison of two
similarity digests. Not scalable.
 Aim to explore alternative data structures that can achieve the same
results in less time.
 Hierarchical Bloom Filter Trees initially proposed theoretically by
Breitinger et al. (2014).
 This work gathers some empirical data on the performance of this
approach.
 i.e. Can we do the same thing, but faster?
4
Hierarchical Bloom Filter Trees (HBFTs):
Building
 Binary tree of Bloom Filters.
 Each parent is twice the size
of its children.
 Files allocated to leaf
nodes: round robin.
5
File is processed in the same way as
for MRSH-v2.
When each chunk is hashed, this is
used to set bits in the relevant leaf
node.
Hierarchical Bloom Filter Trees (HBFTs):
Building
 Binary tree of Bloom Filters.
 Each parent is twice the size
of its children.
 Files allocated to leaf
nodes: round robin.
6
The same hash values are used to set
bits in the parent node also.
A similar process is followed for all
ancestor nodes.
Hierarchical Bloom Filter Trees (HBFTs):
Building
 Binary tree of Bloom Filters.
 Each parent is twice the size
of its children.
 Files allocated to leaf
nodes: round robin.
7
The file’s MRSH-v2 similarity digest is
stored in association with the
appropriate leaf node.
Hierarchical Bloom Filter Trees (HBFTs):
Building
 Binary tree of Bloom Filters.
 Each parent is twice the size
of its children.
 Files allocated to leaf
nodes: round robin.
8
Every leaf node has a set of similarity
digests associated with it.
Each represents1/L of the collection,
where L is the number of leaf nodes
HBFTs: Searching
 To search for a file, it is also
processed in a similar way.
9
Initially search at the root node.
For each hash of a file chunk, we
check if it is contained in the root
Bloom Filter.
If the number of consecutive matches
exceeds a threshold, it is considered
to be a successful match.
We call this threshold min_run.
HBFTs: Searching
 To search for a file, it is also
processed in a similar way.
10
If a match is found, the search
continues at the next level.
Both child nodes must be
searched.
✓
HBFTs: Searching
 To search for a file, it is also
processed in a similar way.
11
The search continues until one or
more leaf nodes are reached.
✓
✓
✓
✓
✗
✗
✗
HBFTs: Searching
 To search for a file, it is also
processed in a similar way.
12
Bloom Filters can give false
positive results, so it is possible
for searches to reach leaves even
where there are no similar results.
✓
✓
✓
✓
✗
✗✗
✓
✓
HBFTs: Searching
 To search for a file, it is also
processed in a similar way.
13
To calculate the similarity scores,
the existing MRSH-v2 algorithm is
used to make pairwise
comparisons.
A similarity digest is created for
the file that we are searching for.
This must be compared with all
the digest stored at any leaf that
the search reaches.
✓
✓
✓
✓
✗
✗✗
✓
✓
HBFT: Some Questions 14
 How many nodes in the tree?
 More nodes: fewer pairwise comparisons.
 Fewer nodes: larger Bloom Filters (fewer false positives).
 What constitutes a positive match for a node in the tree?
 i.e. what threshold should be used for min_run?
 When comparing two datasets, which should the tree represent?
 t5*: 4,457 files (~1.8GiB)
 Gathered from US government websites, often used for approximate
matching.
 Plain text, HTML, PDF, Images, MS Office documents.
 win7: 48,384 files excluding empty files and symlinks (~10GiB)
 Fresh install of Windows 7.
 Varied file types.
* Obtainable from http://roussev.net/t5
Datasets 15
Experiment #1
 Datasets: Tree represents t5, search for t5.
 Goals:
 Measure effectiveness for exact matching.
 Identify appropriate value for min_run parameter.
 Investigate relationship between size of tree and time to build &
search tree.
 Investigate relationship between size of tree and number of pairwise
comparisons required to calculate similarity scores.
16
Experiment #1: Results
 Exact matching:
 When min_run = 4, all identical files are found.
 With higher values, some files are missed.
17
min_run Recall
4 100%
6 99.98%
8 99.93%
Experiment #1: Results 18
Time to build tree and search for all files.
(excluding pairwise comparisons)
Number of pairwise comparisons required
at leaves.
Experiment #2
 Datasets:
 Tree represents win7, search for t5.
 Tree represents t5, search for win7.
 Investigate whether HBFT should represent the smaller or larger corpus.
 Measure effect on overall running time.
19
Experiment #2: Results 20
Time to search for t5 in a win7 tree.
(excluding pairwise comparisons)
Time to search for win7 in a t5 tree.
(excluding pairwise comparisons)
Experiment #2 21
 Combination of build time + search time is lower when the HBFT
represents the smaller corpus.
 Also, less memory usage.
 Total time (including pairwise comparisons): 1,094 seconds.
 Tree models t5 with one file per leaf node (i.e. 4,457 leaves).
 Search for all files in win7.
 MRSH-v2 takes 2,858 seconds.
Experiment #3
 Datasets:
 4,000 files from t5 represent set of “known-illegal” files.
 win7 represents seized disk image, with 140 “planted” files from t5 added:
 100 files that are also in the “known-illegal” set.
 40 files with high similarity to files in the “known-illegal” set:
 10 that have ≥ 80% similarity.
 10 that have ≥ 60% and < 80% similarity.
 10 that have ≥ 40% and < 60% similarity.
 10 that have ≥ 20% and < 40% similarity.
 Aims:
 Compare time to MRSH-v2
 Evaluate effectiveness of finding planted files.
22
Experiment #3: Results
MRSH-v2
similarity
Files
planted
Files
found
Similar
recall
80%-100% 10 10 100%
60%-79% 10 10 100%
40%-59% 10 10 100%
20%-39% 10 8 80%
Overall 40 38 95%
23
Time to search for planted evidence.
(including pairwise comparisons)
Running time (4,000 leaves):
• MRSH-v2: 2,592 seconds.
• HBFT: 1,182 seconds.
Conclusions
 More leaf nodes lead to fewer pairwise comparisons.
 min_run of 4 looks like a reasonable value.
 If corpora are different sizes, use the tree to represent the smaller one.
 Final experiment: all files with ≥ 20% similarity were found, with time
reduction of 54%.
 Likely to scale better than existing approach using pairwise comparisons.
24
DAVID.LILLIS@UCD.IE
WWW.FORENSICSANDSECURITY.COM
@FORSECRESEARCH
25

More Related Content

What's hot

Indexing structure for files
Indexing structure for filesIndexing structure for files
Indexing structure for filesZainab Almugbel
 
Introduction to the design and specification of file structures
Introduction to the design and specification of file structuresIntroduction to the design and specification of file structures
Introduction to the design and specification of file structuresDevyani Vaidya
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - IndexingSean Golliher
 
Dynamic multi level indexing Using B-Trees And B+ Trees
Dynamic multi level indexing Using B-Trees And B+ TreesDynamic multi level indexing Using B-Trees And B+ Trees
Dynamic multi level indexing Using B-Trees And B+ TreesPooja Dixit
 
Indexing and-hashing
Indexing and-hashingIndexing and-hashing
Indexing and-hashingAmi Ranjit
 
Data indexing presentation
Data indexing presentationData indexing presentation
Data indexing presentationgmbmanikandan
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsThomas Lee
 
File handling in qbasic
File handling in qbasicFile handling in qbasic
File handling in qbasicSmritiGurung4
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMSkoolkampus
 
11. Hashing - Data Structures using C++ by Varsha Patil
11. Hashing - Data Structures using C++ by Varsha Patil11. Hashing - Data Structures using C++ by Varsha Patil
11. Hashing - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Outlook PST Files Serving Role of Outlook Data Files
Outlook PST Files Serving Role of Outlook Data FilesOutlook PST Files Serving Role of Outlook Data Files
Outlook PST Files Serving Role of Outlook Data FilesforensicEmailAnalysis
 

What's hot (20)

Data storage and indexing
Data storage and indexingData storage and indexing
Data storage and indexing
 
ISDD Database Structure N5
ISDD Database Structure N5ISDD Database Structure N5
ISDD Database Structure N5
 
File structures
File structuresFile structures
File structures
 
Indexing structure for files
Indexing structure for filesIndexing structure for files
Indexing structure for files
 
CS215 - Lec 2 file organization
CS215 - Lec 2   file organizationCS215 - Lec 2   file organization
CS215 - Lec 2 file organization
 
Introduction to the design and specification of file structures
Introduction to the design and specification of file structuresIntroduction to the design and specification of file structures
Introduction to the design and specification of file structures
 
Hashing
HashingHashing
Hashing
 
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6  - IndexingInformation Retrieval, Encoding, Indexing, Big Table. Lecture 6  - Indexing
Information Retrieval, Encoding, Indexing, Big Table. Lecture 6 - Indexing
 
File handling
File handlingFile handling
File handling
 
Dynamic multi level indexing Using B-Trees And B+ Trees
Dynamic multi level indexing Using B-Trees And B+ TreesDynamic multi level indexing Using B-Trees And B+ Trees
Dynamic multi level indexing Using B-Trees And B+ Trees
 
Indexing and-hashing
Indexing and-hashingIndexing and-hashing
Indexing and-hashing
 
Data indexing presentation
Data indexing presentationData indexing presentation
Data indexing presentation
 
Automating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic DatasetsAutomating Relational Database Schema Design for Very Large Semantic Datasets
Automating Relational Database Schema Design for Very Large Semantic Datasets
 
File handling in qbasic
File handling in qbasicFile handling in qbasic
File handling in qbasic
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
 
11. Hashing - Data Structures using C++ by Varsha Patil
11. Hashing - Data Structures using C++ by Varsha Patil11. Hashing - Data Structures using C++ by Varsha Patil
11. Hashing - Data Structures using C++ by Varsha Patil
 
Lec 1 indexing and hashing
Lec 1 indexing and hashing Lec 1 indexing and hashing
Lec 1 indexing and hashing
 
Outlook PST Files Serving Role of Outlook Data Files
Outlook PST Files Serving Role of Outlook Data FilesOutlook PST Files Serving Role of Outlook Data Files
Outlook PST Files Serving Role of Outlook Data Files
 
Indexing
IndexingIndexing
Indexing
 
Handling computer files
Handling computer filesHandling computer files
Handling computer files
 

Similar to Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees

lecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxlecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxpeter1097
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositoriesfeiwin
 
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugeFile handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugevsol7206
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHost It Smart
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHost It Smart
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHost It Smart
 
Advances in File Carving
Advances in File CarvingAdvances in File Carving
Advances in File CarvingRob Zirnstein
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsShivansh Gaur
 
DIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGDIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGIJCI JOURNAL
 
File Organization in Database
File Organization in DatabaseFile Organization in Database
File Organization in DatabaseA. S. M. Shafi
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxsunithachphd
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingZainab Almugbel
 
File System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuFile System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuJayesh Tambe
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxAnkitChauhan817826
 

Similar to Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees (20)

lecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptxlecture 2 notes indexing in application of database systems.pptx
lecture 2 notes indexing in application of database systems.pptx
 
B tree
B treeB tree
B tree
 
Unit08 dbms
Unit08 dbmsUnit08 dbms
Unit08 dbms
 
IPFSNov5.pptx
IPFSNov5.pptxIPFSNov5.pptx
IPFSNov5.pptx
 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositories
 
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reugeFile handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
File handling3 (1).pdf uhgipughserigrfiogrehpiuhnfi;reuge
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdf
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdf
 
How To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdfHow To Search Files In FileZilla.pdf
How To Search Files In FileZilla.pdf
 
Advances in File Carving
Advances in File CarvingAdvances in File Carving
Advances in File Carving
 
Fota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity AlgorithmsFota Delta Size Reduction Using FIle Similarity Algorithms
Fota Delta Size Reduction Using FIle Similarity Algorithms
 
DIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVINGDIGITAL INVESTIGATION USING HASHBASED CARVING
DIGITAL INVESTIGATION USING HASHBASED CARVING
 
File Organization in Database
File Organization in DatabaseFile Organization in Database
File Organization in Database
 
module 2.pptx
module 2.pptxmodule 2.pptx
module 2.pptx
 
Introduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptxIntroduction_to_HDFS sun.pptx
Introduction_to_HDFS sun.pptx
 
Ch 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashingCh 17 disk storage, basic files structure, and hashing
Ch 17 disk storage, basic files structure, and hashing
 
File System Comparison on Linux Ubuntu
File System Comparison on Linux UbuntuFile System Comparison on Linux Ubuntu
File System Comparison on Linux Ubuntu
 
Unit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptxUnit-1 Introduction to Big Data.pptx
Unit-1 Introduction to Big Data.pptx
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Chapter 2
Chapter 2Chapter 2
Chapter 2
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees

  • 1. Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees ICDF2C October 10th 2017 DAVID LILLIS, FRANK BREITINGER AND MARK SCANLON
  • 2. Approximate Matching  Scenario: Collection of “known illegal” files. Want to search for these on a seized device.  Finding exact matches is easy (hashing).  Approximate matching (a.k.a. “fuzzy hashing”) aims to find similar files on the byte level, e.g.  Files that have been extended/truncated.  Files within files.  Partial files. 2
  • 3.  Initially proposed by Breitinger & Baier (2012).  Generates a similarity digest for each file.  Consists of one or more Bloom Filters: probabilistic data structure that can say whether it probably contains an item, or definitely does not contain it.  These can be compared to calculate a similarity score.  File divided into “chunks”: file read byte-by-byte and a rolling hash identifies the end of a chunk.  Each chunk is hashed using FNV (a fast, noncryptographic hashing function).  Hash used to set 5 bits of the Bloom Filter.  When Bloom Filter is full, a new, empty Bloom Filter is added to the digest, and further inserts are in this. MRSH-v2 3
  • 4. Motivations  Problem: Similarity score comes from a pairwise comparison of two similarity digests. Not scalable.  Aim to explore alternative data structures that can achieve the same results in less time.  Hierarchical Bloom Filter Trees initially proposed theoretically by Breitinger et al. (2014).  This work gathers some empirical data on the performance of this approach.  i.e. Can we do the same thing, but faster? 4
  • 5. Hierarchical Bloom Filter Trees (HBFTs): Building  Binary tree of Bloom Filters.  Each parent is twice the size of its children.  Files allocated to leaf nodes: round robin. 5 File is processed in the same way as for MRSH-v2. When each chunk is hashed, this is used to set bits in the relevant leaf node.
  • 6. Hierarchical Bloom Filter Trees (HBFTs): Building  Binary tree of Bloom Filters.  Each parent is twice the size of its children.  Files allocated to leaf nodes: round robin. 6 The same hash values are used to set bits in the parent node also. A similar process is followed for all ancestor nodes.
  • 7. Hierarchical Bloom Filter Trees (HBFTs): Building  Binary tree of Bloom Filters.  Each parent is twice the size of its children.  Files allocated to leaf nodes: round robin. 7 The file’s MRSH-v2 similarity digest is stored in association with the appropriate leaf node.
  • 8. Hierarchical Bloom Filter Trees (HBFTs): Building  Binary tree of Bloom Filters.  Each parent is twice the size of its children.  Files allocated to leaf nodes: round robin. 8 Every leaf node has a set of similarity digests associated with it. Each represents1/L of the collection, where L is the number of leaf nodes
  • 9. HBFTs: Searching  To search for a file, it is also processed in a similar way. 9 Initially search at the root node. For each hash of a file chunk, we check if it is contained in the root Bloom Filter. If the number of consecutive matches exceeds a threshold, it is considered to be a successful match. We call this threshold min_run.
  • 10. HBFTs: Searching  To search for a file, it is also processed in a similar way. 10 If a match is found, the search continues at the next level. Both child nodes must be searched. ✓
  • 11. HBFTs: Searching  To search for a file, it is also processed in a similar way. 11 The search continues until one or more leaf nodes are reached. ✓ ✓ ✓ ✓ ✗ ✗ ✗
  • 12. HBFTs: Searching  To search for a file, it is also processed in a similar way. 12 Bloom Filters can give false positive results, so it is possible for searches to reach leaves even where there are no similar results. ✓ ✓ ✓ ✓ ✗ ✗✗ ✓ ✓
  • 13. HBFTs: Searching  To search for a file, it is also processed in a similar way. 13 To calculate the similarity scores, the existing MRSH-v2 algorithm is used to make pairwise comparisons. A similarity digest is created for the file that we are searching for. This must be compared with all the digest stored at any leaf that the search reaches. ✓ ✓ ✓ ✓ ✗ ✗✗ ✓ ✓
  • 14. HBFT: Some Questions 14  How many nodes in the tree?  More nodes: fewer pairwise comparisons.  Fewer nodes: larger Bloom Filters (fewer false positives).  What constitutes a positive match for a node in the tree?  i.e. what threshold should be used for min_run?  When comparing two datasets, which should the tree represent?
  • 15.  t5*: 4,457 files (~1.8GiB)  Gathered from US government websites, often used for approximate matching.  Plain text, HTML, PDF, Images, MS Office documents.  win7: 48,384 files excluding empty files and symlinks (~10GiB)  Fresh install of Windows 7.  Varied file types. * Obtainable from http://roussev.net/t5 Datasets 15
  • 16. Experiment #1  Datasets: Tree represents t5, search for t5.  Goals:  Measure effectiveness for exact matching.  Identify appropriate value for min_run parameter.  Investigate relationship between size of tree and time to build & search tree.  Investigate relationship between size of tree and number of pairwise comparisons required to calculate similarity scores. 16
  • 17. Experiment #1: Results  Exact matching:  When min_run = 4, all identical files are found.  With higher values, some files are missed. 17 min_run Recall 4 100% 6 99.98% 8 99.93%
  • 18. Experiment #1: Results 18 Time to build tree and search for all files. (excluding pairwise comparisons) Number of pairwise comparisons required at leaves.
  • 19. Experiment #2  Datasets:  Tree represents win7, search for t5.  Tree represents t5, search for win7.  Investigate whether HBFT should represent the smaller or larger corpus.  Measure effect on overall running time. 19
  • 20. Experiment #2: Results 20 Time to search for t5 in a win7 tree. (excluding pairwise comparisons) Time to search for win7 in a t5 tree. (excluding pairwise comparisons)
  • 21. Experiment #2 21  Combination of build time + search time is lower when the HBFT represents the smaller corpus.  Also, less memory usage.  Total time (including pairwise comparisons): 1,094 seconds.  Tree models t5 with one file per leaf node (i.e. 4,457 leaves).  Search for all files in win7.  MRSH-v2 takes 2,858 seconds.
  • 22. Experiment #3  Datasets:  4,000 files from t5 represent set of “known-illegal” files.  win7 represents seized disk image, with 140 “planted” files from t5 added:  100 files that are also in the “known-illegal” set.  40 files with high similarity to files in the “known-illegal” set:  10 that have ≥ 80% similarity.  10 that have ≥ 60% and < 80% similarity.  10 that have ≥ 40% and < 60% similarity.  10 that have ≥ 20% and < 40% similarity.  Aims:  Compare time to MRSH-v2  Evaluate effectiveness of finding planted files. 22
  • 23. Experiment #3: Results MRSH-v2 similarity Files planted Files found Similar recall 80%-100% 10 10 100% 60%-79% 10 10 100% 40%-59% 10 10 100% 20%-39% 10 8 80% Overall 40 38 95% 23 Time to search for planted evidence. (including pairwise comparisons) Running time (4,000 leaves): • MRSH-v2: 2,592 seconds. • HBFT: 1,182 seconds.
  • 24. Conclusions  More leaf nodes lead to fewer pairwise comparisons.  min_run of 4 looks like a reasonable value.  If corpora are different sizes, use the tree to represent the smaller one.  Final experiment: all files with ≥ 20% similarity were found, with time reduction of 54%.  Likely to scale better than existing approach using pairwise comparisons. 24

Editor's Notes

  1. Win7 tree with ~48k leaves has 8KiB leaves & 512MiB root. Largest tree: t5 tree needs ~98k pairwise comparisons, win7 needs ~101k Memory constraints of storing full digests of bigger dataset.