Finding Similar Files in Large Document Repositories
KDD'05, August 21-24, 2005, Chicago, Illinois, USA. Copyright 2005 ACM.
George Forman, Hewlett-Packard Labs
Kave Eshghi, Hewlett-Packard Labs
Stephane Chiocchetti, Hewlett-Packard France
Agenda
- Introduction
- Method
- Results
- Related work
- Conclusions
Introduction
- Millions of technical support documents.
  - Covering many different products, solutions, and phases of support.
- The content of a new document may duplicate existing content.
  - Authors prefer to copy content rather than link to it by reference,
  - to avoid the possibility of dead links.
  - By mistake or limited authorization, the copied version is not updated.
- Solution
  - Use chunking technology to break each document into paragraph-like pieces.
  - Detect collisions among the hash signatures of these chunks.
  - Efficiently determine which files are related in a large repository.
Method
- Step 1:
  - Use a 'content-based chunking algorithm' to break each file into a sequence of chunks.
- Step 2:
  - Compute the hash of each chunk.
- Step 3:
  - Find the files that share chunk hashes,
  - reporting only those pairs whose intersection is above some threshold.
  (A minimal end-to-end sketch of these three steps follows below.)
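To make the three-step flow concrete, here is a minimal Python sketch. The paper's actual implementation was in C++ and Perl; the function name, the `chunk_file` helper, and the 50% threshold below are illustrative assumptions, not the paper's settings.

```python
import hashlib
from itertools import combinations

def similar_file_pairs(files, chunk_file, threshold=0.5):
    """files: {file_name: bytes}; chunk_file: content-based chunker (see TTTD sketch later)."""
    # Steps 1 and 2: chunk each file and record (hash, byte length) per chunk.
    file_chunks = {
        name: [(hashlib.md5(c).hexdigest(), len(c)) for c in chunk_file(data)]
        for name, data in files.items()
    }
    # Step 3: report file pairs whose shared chunk bytes exceed the threshold.
    # (A naive all-pairs loop; the bipartite-graph steps later avoid this.)
    pairs = []
    for a, b in combinations(files, 2):
        hashes_b = {h for h, _ in file_chunks[b]}
        common = sum(length for h, length in file_chunks[a] if h in hashes_b)
        if common >= threshold * min(len(files[a]), len(files[b])):
            pairs.append((a, b))
    return pairs
```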
Hashing background
- Use the 'compare by hash' method to compare chunks occurring in different files.
  - Hashes are short, fixed-size byte sequences; it is computationally infeasible to find two different chunks with the same hash.
  - Use the MD5 algorithm, which generates 128-bit hashes.
- Two advantages of comparing hashes rather than the chunks themselves:
  - Comparison time is shorter.
  - Being short and fixed size, hashes lend themselves to efficient data structures for lookup and comparison.
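A one-function illustration of compare-by-hash using Python's standard hashlib; MD5 is named in the slide, everything else here is an illustrative choice.

```python
import hashlib

def chunk_hash(chunk: bytes) -> bytes:
    # 128-bit MD5 digest of the chunk contents.
    return hashlib.md5(chunk).digest()

# Digests are short and fixed-size (16 bytes), so comparison is cheap regardless
# of chunk size, and accidental collisions between different chunks are
# computationally infeasible to find.
assert chunk_hash(b"the same text") == chunk_hash(b"the same text")
assert chunk_hash(b"the same text") != chunk_hash(b"different text")
```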
Chunking
- Break a file into a sequence of chunks.
- Chunk boundaries are determined by the local contents of the file.
- Basic sliding window algorithm:
  - A pair of pre-determined integers D and r, with r < D.
  - A fixed-width sliding window of width W.
  - F_k is the fingerprint of the window ending at position k.
  - Position k is a chunk boundary if F_k mod D = r.
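A minimal sketch of the basic sliding-window chunker. A simple polynomial rolling hash stands in for the fingerprint function, and the W, D, r values are illustrative assumptions; with a uniform fingerprint the expected chunk size is roughly D bytes.

```python
def sliding_window_chunks(data: bytes, W=48, D=1024, r=0):
    BASE, MOD = 257, (1 << 61) - 1
    top = pow(BASE, W - 1, MOD)              # weight of the byte leaving the window
    chunks, start, fp = [], 0, 0
    for k, byte in enumerate(data):
        if k >= W:
            fp = (fp - data[k - W] * top) % MOD   # drop the byte sliding out
        fp = (fp * BASE + byte) % MOD             # add the byte sliding in
        if k + 1 >= W and fp % D == r:            # position k is a chunk boundary
            chunks.append(data[start:k + 1])
            start = k + 1
    if start < len(data):
        chunks.append(data[start:])               # trailing chunk without a boundary
    return chunks
```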
Chunking and file similarity
- Requirement on the content-based chunking algorithm:
  - When two sequences R and R' share a contiguous sub-sequence larger than the average chunk size,
  - there should be a good probability that at least one shared chunk falls within the shared sub-sequence.
- Use the TTTD algorithm to satisfy this while keeping chunk sizes bounded.
  - The Two Thresholds, Two Divisors algorithm.
  - Four parameters:
    - D, the main divisor
    - D', the backup divisor
    - T_min, the minimum chunk size threshold
    - T_max, the maximum chunk size threshold
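A rough sketch of the TTTD idea: ignore boundaries before T_min, remember a backup boundary from the backup divisor D', and force a cut by T_max if the main divisor never fires. CRC32 of the last W bytes stands in for the rolling fingerprint, and the parameter values are illustrative, not the paper's; a real implementation uses a rolling hash and rescans from each cut point.

```python
import zlib

def tttd_chunks(data: bytes, D=540, D2=270, Tmin=460, Tmax=2800, r=0, W=48):
    chunks, start, backup = [], 0, -1
    for k in range(len(data)):
        if k - start + 1 < Tmin:                 # enforce the minimum chunk size
            continue
        f = zlib.crc32(data[max(0, k - W + 1):k + 1])
        if f % D2 == r:
            backup = k                           # remember a backup boundary
        if f % D == r or k - start + 1 >= Tmax:  # main divisor fired, or forced cut
            cut = k if f % D == r else (backup if backup >= start else k)
            chunks.append(data[start:cut + 1])
            start, backup = cut + 1, -1
    if start < len(data):
        chunks.append(data[start:])              # trailing chunk
    return chunks
```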
File similarity algorithm
- Step 1
  - Break each file's content into chunks.
  - For each chunk, record its byte length and its hash code.
  - The bit-length of the hash code must be sufficiently long
  - to avoid accidental hash collisions among truly different chunks.
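A small sketch of the per-file metadata recorded in Step 1, reusing a chunker like the ones above; the record layout and the use of MD5 digests here are illustrative choices.

```python
import hashlib

def file_chunk_records(path: str, chunker) -> list[tuple[bytes, int]]:
    # One (chunk hash, chunk byte length) record per chunk of the file.
    with open(path, "rb") as f:
        data = f.read()
    return [(hashlib.md5(c).digest(), len(c)) for c in chunker(data)]
```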
File similarity algorithm (cont.)
- Step 2
  - Optional step for scalability.
  - Prune and partition the above metadata into independent sub-problems,
  - each small enough to fit in memory.
- Step 3
  - Construct a bipartite graph
  - with an edge between a file vertex and a chunk vertex
  - iff the chunk occurs in the file.
  - File nodes are annotated with their file length;
  - chunk nodes are annotated with their chunk length.
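A sketch of the Step 3 bipartite file-chunk graph as two adjacency maps plus length annotations; the dict-based representation is an assumption made for illustration, not the paper's data structure.

```python
from collections import defaultdict

def build_bipartite_graph(records):
    """records: {file_name: [(chunk_hash, chunk_len), ...]}"""
    file_to_chunks = defaultdict(set)   # file vertex -> incident chunk vertices
    chunk_to_files = defaultdict(set)   # chunk vertex -> incident file vertices
    chunk_len, file_len = {}, {}
    for name, chunks in records.items():
        file_len[name] = sum(length for _, length in chunks)  # file length annotation
        for h, length in chunks:
            file_to_chunks[name].add(h)
            chunk_to_files[h].add(name)
            chunk_len[h] = length                              # chunk length annotation
    return file_to_chunks, chunk_to_files, file_len, chunk_len
```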
File similarity algorithm (cont.)
- Step 4
  - Construct a separate file-file similarity graph.
  - For each file A:
    - (a) Look up the chunks AC that occur in file A.
    - (b) For each chunk in AC, look up the files it appears in, accumulating the set of other files BS that share any chunks with file A. (As an optimization due to symmetry, exclude files that have previously been considered as file A in step 4.)
    - (c) For each file B in set BS, determine its chunks in common with file A, and add an A-B edge to the file similarity graph if the total chunk bytes in common exceed some threshold, or some percentage of the file length.
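A sketch of Step 4 built on the bipartite graph above; the absolute-bytes threshold and the 50% length fraction are illustrative assumptions.

```python
def build_similarity_graph(file_to_chunks, chunk_to_files, file_len, chunk_len,
                           min_bytes=2000, min_fraction=0.5):
    edges, done = [], set()
    for a in file_to_chunks:                      # each file takes a turn as file A
        bs = set()
        for h in file_to_chunks[a]:               # (a) + (b): gather candidate files
            bs |= chunk_to_files[h]
        bs -= done | {a}                          # symmetry: skip files already used as A
        for b in bs:                              # (c): score each candidate file B
            shared = file_to_chunks[a] & file_to_chunks[b]
            common = sum(chunk_len[h] for h in shared)
            if common >= min_bytes or common >= min_fraction * min(file_len[a], file_len[b]):
                edges.append((a, b, common))
        done.add(a)
    return edges
```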
File similarity algorithm (cont.)
- Step 5
  - Output the file-file similarity pairs as desired.
  - Use a union-find algorithm to determine clusters of interconnected files.
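A sketch of Step 5: clustering files connected by similarity edges with a simple union-find (path compression only). This is a generic implementation, not the paper's Perl module.

```python
def cluster_files(edges):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path compression
            x = parent[x]
        return x

    for a, b, _ in edges:
        parent[find(a)] = find(b)             # union the two components

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())
```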
Handling identical files
- Multiple files may have identical content.
- Use the same metadata, with a small enhancement.
- While loading the file-chunk data:
  - Compute a hash over all of a file's chunk hashes.
  - Maintain a hash table that references file nodes by their unique content hashes.
  - If a file with the same content hash has already been loaded,
  - note the duplicate file name and avoid duplicating the chunk data in memory.
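A sketch of this enhancement: hash the concatenated chunk hashes to get a whole-content hash, and fold exact duplicates into a single node. The FileNode class is an assumed illustration.

```python
import hashlib

class FileNode:
    def __init__(self, name, chunks):
        self.names, self.chunks = [name], chunks        # all names sharing this content

def load_file_nodes(records):
    """records: {file_name: [(chunk_hash, chunk_len), ...]}"""
    by_content = {}                                     # content hash -> FileNode
    for name, chunks in records.items():
        content = hashlib.md5(b"".join(h for h, _ in chunks)).digest()
        if content in by_content:
            by_content[content].names.append(name)      # duplicate: reuse the node
        else:
            by_content[content] = FileNode(name, chunks)
    return list(by_content.values())
```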
Handling identical files (cont.)
Complexity analysis
- Chunking the files is linear in the total size N of the content.
- The similarity analysis is O(C log C), where C is the number of chunks in the repository (including duplicates).
- Since C is linear in N, the overall cost is O(N log N).
Results
- Implemented the chunking algorithm in C++ (~1200 lines of code).
- Used Perl to implement
  - the similarity analysis algorithm (~500 LOC),
  - the bipartite partitioning algorithm (~250 LOC),
  - a shared union-find module (~300 LOC).
- Performance on a given repository ranges widely depending on the average chunk size (a controllable parameter).
  - 52,125 technical support documents in 347 folders,
  - comprising 327 MB of HTML content.
  - On a 3 GHz Intel processor with 1 GB RAM:
    - chunk size set to 5000 bytes -> took 25 minutes and generated 88,510 chunks;
    - chunk size set to 100 bytes -> took 39 minutes and generated 3.8 million chunks.
Related work
- Brin et al., "Copy detection mechanisms for digital documents"
  - Maintains a large indexed database of existing documents.
  - Detects whether a new document contains material that already exists in the database.
  - It is a 1-vs-N document method; this paper is all-to-all.
  - Its chunk boundaries are based on the hash of 'text units':
    - paragraphs
    - sentences
    - These text units do not handle technical documentation well.
  - This paper uses the TTTD chunking algorithm instead.
Conclusions
- The method identifies pieces of content that may have been duplicated.
  - It relies on chunking technology rather than paragraph boundary detection.
- The bottleneck is the human attention needed to review the many results.
- Future work
  - Reducing false alarms and missed detections.
  - Making the human review process as productive as possible.
