The document evaluates and compares two state-of-the-art algorithms for finding near-duplicate web pages: Broder et al.'s shingling algorithm (Algorithm B) and Charikar's random projection based approach (Algorithm C). It tests the algorithms on a dataset of 1.6 billion web pages. The results show that Algorithm C has higher overall precision than Algorithm B at finding near-duplicates, especially for pairs on different sites. The document also presents a combined algorithm that achieves a precision of 0.79.
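The random-projection idea behind Algorithm C (Charikar's simhash) can be sketched in a few lines: each document is reduced to a short bit fingerprint, and near-duplicates end up with fingerprints that differ in only a few bit positions. This is an illustrative sketch only; the 64-bit width, MD5-based token hashing, and uniform token weights are assumptions, not the paper's exact parameters.

```python
import hashlib

def simhash(text, bits=64):
    """Charikar-style random-projection fingerprint (sketch)."""
    v = [0] * bits
    for token in text.lower().split():
        # Hash each token to a stable integer (illustrative choice of hash).
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps only the sign of each accumulated component.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("the quick brown fox jumps over the lazy dog")
b = simhash("the quick brown fox jumped over the lazy dog")
c = simhash("completely unrelated text about databases")
# Near-duplicate pairs yield a much smaller Hamming distance than
# unrelated pairs, which is what makes fingerprint comparison cheap.
```

In practice the comparison threshold on the Hamming distance plays the role the similarity cutoff plays in the shingling approach.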
IJCER (www.ijceronline.com) International Journal of Computational Engineerin... - ijceronline
This paper proposes three algorithms (ScanCount, MergeSkip, and DivideSkip) to efficiently search for approximate string matches from a collection of strings. The algorithms utilize inverted indexes of substrings (grams) to find candidate strings. The paper also studies how to integrate filtering techniques with the merging algorithms to further improve performance. Experiments show the proposed techniques significantly outperform existing algorithms and that filters need to be judiciously combined with merging to achieve the best results.
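The gram-based candidate generation that ScanCount performs can be sketched as follows: build an inverted index from q-grams to string ids, then scan the inverted lists of the query's grams and keep every id that appears on at least T lists. The threshold T, the choice q = 2, and the toy strings below are illustrative assumptions, not the paper's settings.

```python
from collections import defaultdict

def grams(s, q=2):
    """All q-grams (length-q substrings) of a string."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Inverted index: gram -> list of string ids containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)
    return index

def scan_count(index, query, threshold, q=2):
    """ScanCount: scan every inverted list of the query's grams and
    count how many lists each string id appears on; ids reaching the
    threshold become candidates for later verification."""
    counts = defaultdict(int)
    for g in set(grams(query, q)):
        for sid in index.get(g, []):
            counts[sid] += 1
    return [sid for sid, c in counts.items() if c >= threshold]

strings = ["kaushik", "kaushic", "caushik", "kavshik", "database"]
index = build_index(strings)
cands = scan_count(index, "kaushik", threshold=4)
```

MergeSkip and DivideSkip refine this same list-merging step by skipping portions of long inverted lists rather than scanning them fully.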
On Improving the Performance of Data Leak Prevention using White-list Approach - Patrick Nguyen
This document proposes improving data leak prevention performance using a white-list approach and Bloom filters. It summarizes previous related work using blacklists and keywords to detect leaks. The authors then improve upon prior work by Fang Hao et al. that used CRC to create fingerprints, by using hash functions with Bloom filters instead to generate fingerprints faster while maintaining accuracy. Experiments test five hash functions on a 9.3GB dataset to evaluate system throughput and percentage of leaked files.
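The Bloom-filter fingerprinting idea can be sketched in a few lines: each item sets k bit positions in a bit array, and membership is reported when all k positions are set, so false positives are possible but false negatives are not. The single-digest hash derivation, the filter size, and k = 4 below are illustrative assumptions, not necessarily among the five hash functions the authors evaluated.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; k index functions carved from one SHA-256
    digest (an illustrative construction, not the paper's exact one)."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            # Use 4 bytes of the digest per index function.
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
bf.add("confidential_report.pdf")
```

Because insertion and lookup touch only k bit positions, fingerprint generation stays fast regardless of how many sensitive files the white-list holds, which is the throughput advantage over per-file CRC fingerprints.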
A PERMISSION BASED TREE-STRUCTURED APPROACH FOR REPLICATED DATABASES - ijp2p
Data replication is gaining increased importance due to the increasing demand for availability, performance and fault tolerance in databases. The main challenge for deploying replicated databases on a large scale is to resolve conflicting update requests. In this paper, we propose a permission based dynamic primary copy algorithm to resolve conflicting requests among various sites of replicated databases. The contribution of this work is the reduction in the number of messages per update request. Further, we propose a novel approach for handling single-site fault tolerance including algorithms for detection, removal, and restoration of a faulty site in the database system.
Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming - TELKOMNIKA JOURNAL
Genomic repeat detection, i.e., searching DNA strings for repeated base-pair patterns, requires a long processing time. This research builds a big-data computational model for pattern searching in strings by modifying and implementing the Boyer-Moore algorithm on Apache Spark Streaming, applied to human DNA sequences from the Ensembl site. Moreover, we perform experiments on cloud computing, varying the specifications of the computer clusters and the human DNA sequence datasets involved. The results show that the proposed computational model on Apache Spark Streaming is faster than standalone computing and parallel multicore computing. Therefore, the main contribution of this research, a computational model that reduces computational costs, has been achieved.
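The string-matching core that the paper adapts can be sketched with a Boyer-Moore variant using only the bad-character rule; the full algorithm adds the good-suffix rule, and the paper's Spark Streaming modifications are not reproduced here.

```python
def boyer_moore(text, pattern):
    """Boyer-Moore search, bad-character rule only (simplified sketch).
    Returns the start indices of all occurrences of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Last index of each character in the pattern.
    last = {c: i for i, c in enumerate(pattern)}
    hits, s = [], 0
    while s <= n - m:
        j = m - 1
        # Compare right-to-left, which is what enables long skips.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1
        else:
            # Shift so the mismatched text character aligns with its
            # last occurrence in the pattern (or past it if absent).
            s += max(1, j - last.get(text[s + j], -1))
    return hits

dna = "ACGTACGTTAGACGTACGT"
print(boyer_moore(dna, "ACGT"))  # → [0, 4, 11, 15]
```

On a 4-letter DNA alphabet the bad-character skips are short, which is one reason repeat detection over long sequences remains expensive and benefits from the distributed streaming setup described above.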
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Adversarial and reinforcement learning-based approaches to information retrieval - Bhaskar Mitra
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of recent works in this area that we presented at SIGIR 2018.
5 Lessons Learned from Designing Neural Models for Information Retrieval - Bhaskar Mitra
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate applications of neural approaches to IR that go beyond text matching.
Generator Tricks for Systems Programmers - Hiroshi Ono
This document is the introduction to a tutorial about generator tricks for systems programmers presented by David Beazley. It provides biographical information about the presenter, outlines the goals and structure of the tutorial, and introduces some key concepts around iterators and generators in Python. The tutorial will cover practical uses of generators with a focus on files, file systems, parsing, networking, threads, and other systems programming tasks. It aims to provide compelling examples of using generators and get attendees thinking about how to apply them.
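The kind of generator pipeline the tutorial builds can be sketched as composable stages, each yielding items lazily instead of materializing intermediate lists; the log-line example below is illustrative and not taken from the tutorial itself.

```python
import gzip

def gen_lines(path):
    """Yield lines lazily from a plain or gzipped file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            yield line

def gen_grep(pattern, lines):
    """Filter a line stream, like grep, without buffering it."""
    return (line for line in lines if pattern in line)

# Stages compose; nothing is read or filtered until iteration starts.
lines = ["GET /a 200\n", "GET /b 404\n", "GET /c 200\n"]
ok = list(gen_grep("200", iter(lines)))
```

Chaining `gen_lines` into `gen_grep` and further stages processes arbitrarily large files in constant memory, which is the systems-programming payoff the tutorial emphasizes.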
This document provides guidance on using the OpenSolaris operating system on Amazon Elastic Compute Cloud (EC2). It discusses the AMI tools and APIs needed to access EC2, prerequisites for using the Solaris AMI such as Java and SSH setup, how to launch and connect to OpenSolaris instances on EC2, and how to rebundle and register customized OpenSolaris AMIs. It also covers limitations and references for further information.
This document proposes a framework for modeling tagging systems and user tagging behavior to combat tag spam. It introduces methods for ranking documents matching a tag based on taggers' reliability. The authors study how existing approaches perform under malicious attacks and the impact of moderation. They define models of good and bad tagging behavior to simulate tag spam and evaluate different query answering schemes and moderator strategies.
Communications is a versatile field that develops skills like critical thinking, problem solving, and writing which are valuable for a wide range of careers from journalism to public relations. The growth of social media and new telecommunication technologies have rapidly changed the communications industry and created more opportunities in areas like crisis management. A communications degree cultivates skills applicable to many career options including internal/external communications, sales, education, research, management, and consulting.
This document discusses culture and different approaches to culture within organizations. It identifies four main tribes or approaches: the "Change Your Culture" tribe which uses a top-down command approach; the "You Can't Change Culture" tribe which believes culture cannot be changed and to work with the existing culture; the "Nurture It" tribe which focuses on understanding the existing culture and nurturing it; and the "Deal With It" tribe which is not described. It provides examples of high performance cultures at GitHub and Etsy that focus on collaboration, sharing, enabling people, and reinforcing culture through tools and structure.
1. The document discusses the history and applications of sentiment analysis, which is a technique used to categorize texts as positive or negative based on opinions and emotions expressed.
2. Machine learning (ML) techniques have increasingly been used for sentiment analysis since the 2000s, allowing for automated classification of texts like product reviews and social media posts.
3. ML approaches to sentiment analysis include classifying texts based on sentiment-bearing features, assessing sentiments towards specific aspects or topics, and automatically labeling text snippets as positive or negative based on linguistic patterns.
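The simplest baseline behind these approaches counts sentiment-bearing words against a polarity lexicon; trained ML classifiers replace the hand-made lexicon with learned features. The five-word lexicons below are purely illustrative, not from any of the systems discussed.

```python
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

def classify(text):
    """Label a snippet by counting sentiment-bearing words from a tiny
    illustrative lexicon (a baseline, not a trained ML model)."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

label = classify("great product, I love it")  # → "positive"
```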
This document discusses using Python for easy artificial intelligence. It provides examples of using Python to:
1. Solve puzzles like the eight queens problem and alphametics puzzles with techniques like exhaustive search and constraint propagation.
2. Build a neural network model of a database to perform tasks like generalization, handling missing data, and extrapolation based on incomplete information.
3. Implement the game Mastermind to experiment with different guessing strategies and make the problem of deducing a hidden code as efficient as possible.
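The eight queens example can be sketched as an exhaustive depth-first search with constraint checking, one of the techniques mentioned above; this is a generic formulation, not necessarily the document's own code.

```python
def queens(n=8):
    """Count solutions to the n-queens puzzle by exhaustive
    depth-first search with constraint checking."""
    def place(cols):
        row = len(cols)
        if row == n:
            return 1
        total = 0
        for col in range(n):
            # A queen is safe if no earlier queen shares its column
            # or either diagonal.
            if all(col != c and abs(col - c) != row - r
                   for r, c in enumerate(cols)):
                total += place(cols + [col])
        return total
    return place([])

print(queens(8))  # → 92
```

Pruning each row by the constraint check is a minimal form of the constraint propagation the document describes; a full propagation scheme would also shrink the remaining domains after each placement.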
This document compares approaches to large-scale data analysis using MapReduce and parallel database management systems (DBMSs). It presents results from running a benchmark of tasks on an open-source MapReduce system (Hadoop) and two parallel DBMSs using a cluster of 100 nodes. The parallel DBMSs showed significantly better performance than MapReduce for the tasks, but took much longer to load data and tune executions. The document discusses architectural differences between the approaches and their performance implications.
This document brings together opposites like rappers and forest animals. It combines creative concepts like Loctite glue sticking things together that would never be joined otherwise. Objects that are totally opposed or antithetical to each other are united.
A sql implementation on the map reduce framework - eldariof
Tenzing is a SQL query engine built on top of MapReduce for analyzing large datasets at Google. It provides low latency queries over heterogeneous data sources at petabyte scale. Tenzing supports a full SQL implementation with extensions and optimizations to achieve high performance comparable to commercial parallel databases by leveraging MapReduce and traditional database techniques. It is used by over 1,000 Google employees to query over 1.5 petabytes of data and serve 10,000 queries daily.
Parallel Data Processing with MapReduce: A Survey - Kyong-Ha Lee
This document summarizes a survey on parallel data processing with MapReduce. It provides an overview of the MapReduce framework, including its architecture, key concepts of Map and Reduce functions, and how it handles parallel processing. It also discusses some inherent pros and cons of MapReduce, such as its simplicity but also performance limitations. Finally, it outlines approaches studied in recent literature to improve and optimize the MapReduce framework.
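The Map and Reduce functions at the heart of the framework can be illustrated with the classic word-count example, with an explicit shuffle step standing in for the framework's grouping; this is a single-process sketch, not a distributed implementation.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input record."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does
    between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(groups):
    """Reduce: aggregate the values associated with each key."""
    return {key: sum(values) for key, values in groups}

docs = ["map reduce map", "reduce shuffle reduce"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

In a real deployment the map and reduce calls run on different machines and the shuffle moves data across the network, which is where most of the performance trade-offs surveyed above arise.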
M phil-computer-science-secure-computing-projects - Vijay Karan
The document contains details of several M.Phil Computer Science Secure Computing Projects implemented in C#. The projects deal with topics related to privacy, anonymity, and security in databases, distributed systems, wireless sensor networks, and social networks. Some of the project titles and brief descriptions are listed. The projects involve developing algorithms and protocols to privately update databases, assign anonymous IDs, detect clone attacks in wireless sensor networks, and design secure encounter-based mobile social networks.
M phil-computer-science-secure-computing-projects - Vijay Karan
The document contains details of several M.Phil Computer Science Secure Computing Projects implemented in C#. The projects deal with topics related to privacy, anonymity, and security in distributed systems and networks. Some of the project titles and brief descriptions included in the document are:
- Privacy-Preserving Updates to Anonymous and Confidential Databases
- Replica Placement for Route Diversity in Distributed Hash Tables
- Robust Correlation of Encrypted Attack Traffic Through Stepping Stones Using Flow Watermarking
- A Policy Enforcing Mechanism for Trusted Ad Hoc Networks
- Adaptive Fault Tolerant QoS Control Algorithms for Maximizing System Lifetime of Query-Based Wireless Sensor Networks
This document discusses using cloud computing to address challenges in genome informatics posed by exponentially growing genomic data. It outlines how the traditional ecosystem is threatened as DNA sequencing costs decrease faster than storage and computing capacity can grow. Cloud computing provides an alternative by allowing users to rent vast computing resources on demand. The document examines applying MapReduce frameworks like Hadoop and DryadLINQ to bioinformatics applications like EST assembly and Alu clustering. Experiments showed these approaches can simplify processing large genomic datasets with performance comparable to local clusters, though virtual machines introduce around 20% overhead. Overall cloud computing may become preferred for its flexibility and ability to move computation to data.
Data mining projects topics for java and dot net - redpel dot com
This document discusses several papers related to data mining and machine learning techniques. It begins with a brief summary of each paper, discussing the key contributions and findings. The summaries cover topics such as differential privacy-preserving data anonymization, fault detection in power systems using decision trees, temporal pattern searching in event data, high dimensional indexing for similarity search, landmark-based approximate shortest path computation, feature selection for high dimensional data, temporal pattern mining in data streams, data leakage detection, keyword search in spatial databases, analyzing relationships on Wikipedia, improving recommender systems using user-item subgroups, decision trees for uncertain data, and building confidential query services in the cloud using data perturbation.
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERING - IJDKP
Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in a large collection of data like the internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources, since such pages increase either the index storage space or the serving costs. Detecting these pages also has many potential applications; for example, it may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to perform document clustering. We demonstrate our approach in the domain of web news articles. The experimental results show that our algorithm performs well on the similarity measures used, and the identification of duplicate and near-duplicate documents has reduced memory usage in repositories.
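The shingle-based resemblance measure commonly used for near-duplicate detection can be sketched as the Jaccard similarity of k-word shingle sets; k = 3 and the toy documents below are illustrative choices, and the paper's exact similarity measures may differ.

```python
def shingles(text, k=3):
    """The set of k-word shingles (word n-grams) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Resemblance of two shingle sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = shingles("the cat sat on the mat in the house")
d2 = shingles("the cat sat on the mat in the garden")
d3 = shingles("parallel databases outperform mapreduce on benchmarks")
print(jaccard(d1, d2))  # → 0.75 (near-duplicates)
print(jaccard(d1, d3))  # → 0.0  (unrelated)
```

A pair of documents whose resemblance exceeds a chosen threshold is treated as a near-duplicate, and one member can be dropped before clustering, which is what yields the memory reduction reported above.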
BRA: a bidirectional routing abstraction for asymmetric mobile ad hoc networks... - Mumbai Academisc
This document summarizes a paper that presents a framework called BRA that provides a bidirectional abstraction of asymmetric mobile ad hoc networks to enable off-the-shelf routing protocols to work. BRA maintains multi-hop reverse routes for unidirectional links, improves connectivity by using unidirectional links, enables reverse route forwarding of control packets, and detects packet loss on unidirectional links. Simulations show packet delivery increases substantially when AODV is layered on BRA in asymmetric networks compared to regular AODV.
This document summarizes several papers on document clustering techniques. It discusses hierarchical clustering and similarity measures, as well as multi-representation clustering. Several clustering algorithms are examined, including K-means clustering and graph-based clustering. The document also analyzes similarity measures like multi-viewpoint similarity and evaluates the performance of different clustering methods on document collections.
Generator Tricks for Systems ProgrammersHiroshi Ono
This document is the introduction to a tutorial about generator tricks for systems programmers presented by David Beazley. It provides biographical information about the presenter, outlines the goals and structure of the tutorial, and introduces some key concepts around iterators and generators in Python. The tutorial will cover practical uses of generators with a focus on files, file systems, parsing, networking, threads, and other systems programming tasks. It aims to provide compelling examples of using generators and get attendees thinking about how to apply them.
This document provides guidance on using the OpenSolaris operating system on Amazon Elastic Compute Cloud (EC2). It discusses the AMI tools and APIs needed to access EC2, prerequisites for using the Solaris AMI such as Java and SSH setup, how to launch and connect to OpenSolaris instances on EC2, and how to rebundle and register customized OpenSolaris AMIs. It also covers limitations and references for further information.
This document proposes a framework for modeling tagging systems and user tagging behavior to combat tag spam. It introduces methods for ranking documents matching a tag based on taggers' reliability. The authors study how existing approaches perform under malicious attacks and the impact of moderation. They define models of good and bad tagging behavior to simulate tag spam and evaluate different query answering schemes and moderator strategies.
Communications is a versatile field that develops skills like critical thinking, problem solving, and writing which are valuable for a wide range of careers from journalism to public relations. The growth of social media and new telecommunication technologies have rapidly changed the communications industry and created more opportunities in areas like crisis management. A communications degree cultivates skills applicable to many career options including internal/external communications, sales, education, research, management, and consulting.
This document discusses culture and different approaches to culture within organizations. It identifies four main tribes or approaches: the "Change Your Culture" tribe which uses a top-down command approach; the "You Can't Change Culture" tribe which believes culture cannot be changed and to work with the existing culture; the "Nurture It" tribe which focuses on understanding the existing culture and nurturing it; and the "Deal With It" tribe which is not described. It provides examples of high performance cultures at GitHub and Etsy that focus on collaboration, sharing, enabling people, and reinforcing culture through tools and structure.
1. The document discusses the history and applications of sentiment analysis, which is a technique used to categorize texts as positive or negative based on opinions and emotions expressed.
2. Machine learning (ML) techniques have increasingly been used for sentiment analysis since the 2000s, allowing for automated classification of texts like product reviews and social media posts.
3. ML approaches to sentiment analysis include classifying texts based on sentiment-bearing features, assessing sentiments towards specific aspects or topics, and automatically labeling text snippets as positive or negative based on linguistic patterns.
This document discusses using Python for easy artificial intelligence. It provides examples of using Python to:
1. Solve puzzles like the eight queens problem and alphametics puzzles with techniques like exhaustive search and constraint propagation.
2. Build a neural network model of a database to perform tasks like generalization, handling missing data, and extrapolation based on incomplete information.
3. Implement the game Mastermind to experiment with different guessing strategies and make the problem of deducing a hidden code as efficient as possible.
This document compares approaches to large-scale data analysis using MapReduce and parallel database management systems (DBMSs). It presents results from running a benchmark of tasks on an open-source MapReduce system (Hadoop) and two parallel DBMSs using a cluster of 100 nodes. The parallel DBMSs showed significantly better performance than MapReduce for the tasks, but took much longer to load data and tune executions. The document discusses architectural differences between the approaches and their performance implications.
This document brings together opposites like rappers and forest animals. It combines creative concepts like Loctite glue sticking things together that would never be joined otherwise. Objects that are totally opposed or antithetical to each other are united.
A sql implementation on the map reduce frameworkeldariof
Tenzing is a SQL query engine built on top of MapReduce for analyzing large datasets at Google. It provides low latency queries over heterogeneous data sources at petabyte scale. Tenzing supports a full SQL implementation with extensions and optimizations to achieve high performance comparable to commercial parallel databases by leveraging MapReduce and traditional database techniques. It is used by over 1,000 Google employees to query over 1.5 petabytes of data and serve 10,000 queries daily.
Parallel Data Processing with MapReduce: A SurveyKyong-Ha Lee
This document summarizes a survey on parallel data processing with MapReduce. It provides an overview of the MapReduce framework, including its architecture, key concepts of Map and Reduce functions, and how it handles parallel processing. It also discusses some inherent pros and cons of MapReduce, such as its simplicity but also performance limitations. Finally, it outlines approaches studied in recent literature to improve and optimize the MapReduce framework.
M phil-computer-science-secure-computing-projectsVijay Karan
The document contains details of several M.Phil Computer Science Secure Computing Projects implemented in C#. The projects deal with topics related to privacy, anonymity, and security in databases, distributed systems, wireless sensor networks, and social networks. Some of the project titles and brief descriptions are listed. The projects involve developing algorithms and protocols to privately update databases, assign anonymous IDs, detect clone attacks in wireless sensor networks, and design secure encounter-based mobile social networks.
M phil-computer-science-secure-computing-projectsVijay Karan
The document contains details of several M.Phil Computer Science Secure Computing Projects implemented in C#. The projects deal with topics related to privacy, anonymity, and security in distributed systems and networks. Some of the project titles and brief descriptions included in the document are:
- Privacy-Preserving Updates to Anonymous and Confidential Databases
- Replica Placement for Route Diversity in Distributed Hash Tables
- Robust Correlation of Encrypted Attack Traffic Through Stepping Stones Using Flow Watermarking
- A Policy Enforcing Mechanism for Trusted Ad Hoc Networks
- Adaptive Fault Tolerant QoS Control Algorithms for Maximizing System Lifetime of Query-Based Wireless Sensor Networks
This document discusses using cloud computing to address challenges in genome informatics posed by exponentially growing genomic data. It outlines how the traditional ecosystem is threatened as DNA sequencing costs decrease faster than storage and computing capacity can grow. Cloud computing provides an alternative by allowing users to rent vast computing resources on demand. The document examines applying MapReduce frameworks like Hadoop and DryadLINQ to bioinformatics applications like EST assembly and Alu clustering. Experiments showed these approaches can simplify processing large genomic datasets with performance comparable to local clusters, though virtual machines introduce around 20% overhead. Overall cloud computing may become preferred for its flexibility and ability to move computation to data.
Data mining projects topics for java and dot netredpel dot com
This document discusses several papers related to data mining and machine learning techniques. It begins with a brief summary of each paper, discussing the key contributions and findings. The summaries cover topics such as differential privacy-preserving data anonymization, fault detection in power systems using decision trees, temporal pattern searching in event data, high dimensional indexing for similarity search, landmark-based approximate shortest path computation, feature selection for high dimensional data, temporal pattern mining in data streams, data leakage detection, keyword search in spatial databases, analyzing relationships on Wikipedia, improving recommender systems using user-item subgroups, decision trees for uncertain data, and building confidential query services in the cloud using data perturbation.
A NEAR-DUPLICATE DETECTION ALGORITHM TO FACILITATE DOCUMENT CLUSTERINGIJDKP
Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near-duplicates is very difficult in large collections of data like the internet. The presence of these web pages plays an important role in performance degradation when integrating data from heterogeneous sources. These pages either increase the index storage space or increase the serving costs. Detecting these pages has many potential applications; for example, it may indicate plagiarism or copyright infringement. This paper concerns detecting, and optionally removing, duplicate and near-duplicate documents in order to perform clustering of documents. We demonstrated our approach in the web news articles domain. The experimental results show that our algorithm outperforms existing approaches in terms of similarity measures, and the identification of duplicate and near-duplicate documents has resulted in reduced memory use in repositories.
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Mumbai Academisc
This document summarizes a paper that presents a framework called BRA that provides a bidirectional abstraction of asymmetric mobile ad hoc networks to enable off-the-shelf routing protocols to work. BRA maintains multi-hop reverse routes for unidirectional links, improves connectivity by using unidirectional links, enables reverse route forwarding of control packets, and detects packet loss on unidirectional links. Simulations show packet delivery increases substantially when AODV is layered on BRA in asymmetric networks compared to regular AODV.
This document summarizes several papers on document clustering techniques. It discusses hierarchical clustering and similarity measures, as well as multi-representation clustering. Several clustering algorithms are examined, including K-means clustering and graph-based clustering. The document also analyzes similarity measures like multi-viewpoint similarity and evaluates the performance of different clustering methods on document collections.
Hierarchal clustering and similarity measures along with multi representationeSAT Journals
Abstract: All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Graph-based document clustering works with frequent senses rather than the frequent keywords used in traditional text mining techniques. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we analyze an existing multi-viewpoint based similarity measure and two related clustering methods. The main difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many viewpoints: objects assumed not to be in the same cluster as the two objects being measured. Using multiple viewpoints, a more informative assessment of similarity can be achieved. Theoretical analysis and an empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections, confirming the advantages of our proposal. Keywords: multiview cluster, document id, cluster distance
Analysis of different similarity measures: SimrankAbhishek Mungoli
SimRank exploits object-to-object relationships very well and finds out the similarity between two objects.
We have used it in our project to find similar research papers from the DBLP dataset (the DBLP dataset provides a comprehensive list of research papers in the computer science domain).
SimRank is a generic approach and its basic idea can also be applied to other domain of interests as well.
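The SimRank recurrence itself is compact; as a rough illustration (not from the original slides — the toy citation graph and decay constant C = 0.8 below are made up), a naive fixed-point iteration looks like this:

```python
def simrank(in_neighbors, C=0.8, iters=10):
    """Naive SimRank: s(a, a) = 1, and otherwise
    s(a, b) = C / (|I(a)| * |I(b)|) * sum of s(x, y) over
    in-neighbor pairs (x, y); 0 if either node has no in-neighbors."""
    nodes = list(in_neighbors)
    s = {(a, b): float(a == b) for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a in nodes:
            for b in nodes:
                ia, ib = in_neighbors[a], in_neighbors[b]
                if a == b:
                    nxt[a, b] = 1.0
                elif ia and ib:
                    total = sum(s[x, y] for x in ia for y in ib)
                    nxt[a, b] = C * total / (len(ia) * len(ib))
                else:
                    nxt[a, b] = 0.0
        s = nxt
    return s

# Hypothetical mini citation graph: papers p1 and p2 are each cited by p3 and p4.
g = {"p1": ["p3", "p4"], "p2": ["p3", "p4"], "p3": [], "p4": []}
s = simrank(g)
print(round(s["p1", "p2"], 3))  # 0.4: p1 and p2 share exactly the same citers
```

The quadratic all-pairs iteration shown here is fine for small graphs; on something the size of DBLP one would prune node pairs or use an approximation.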
Zhao huang deep sim deep learning code functional similarityitrejos
Measuring code similarity is fundamental for many software engineering tasks, e.g., code search, refactoring and reuse. However, most existing techniques focus on code syntactical similarity only, while measuring code functional similarity remains a challenging problem. In this paper, we propose a novel approach that encodes code control flow and data flow into a semantic matrix in which each element is a high dimensional sparse binary feature vector, and we design a new deep learning model that measures code functional similarity based on this representation. By concatenating hidden representations learned from a code pair, this new model transforms the problem of detecting functionally similar code into binary classification, which can effectively learn patterns between functionally similar code with very different syntactics.
This document discusses techniques for analyzing unstructured text data from computer data inspection. It discusses using clustering algorithms like K-means and hierarchical clustering to automatically group related documents without supervision. The goal is to help computer examiners analyze large amounts of text data more efficiently. Prior work on clustering ensembles, evolving gene expression clusters, self-organizing maps, and thematically clustering search results is reviewed as relevant to this problem. The problem is how to identify and cluster documents stored across multiple remote locations during computer inspections when existing algorithms make this difficult.
Automatically inferring structure correlated variable set for concurrent atom...ijseajournal
The atomicity of correlated variables is quite tedious and error prone for programmers to explicitly infer and specify in a multi-threaded program. Researchers have studied the automatic discovery of the atomic sets programmers intended, such as by frequent itemset mining and rule-based filters. However, due to the lack of inspection of inner structure, some implicitly semantics-independent variables intended by the user are mistakenly classified as correlated. In this paper, we present a novel simplification method for the program dependency graph and a corresponding graph mining approach to detect the set of variables correlated by logical structures in the source code. This approach formulates the automatic inference of correlated variables as mining frequent subgraphs of the simplified data and control flow dependency graph. The presented simplified graph representation of program dependency is not only robust to coding-style varieties, but also essential to recognize logical correlation. We implemented our method and compared it with previous methods on open source program repositories. The experimental results show that our method has a lower false positive rate than previous methods in the initial development stage. It is concluded that our presented method can provide programmers during development with a sufficiently precise correlated variable set for checking atomicity.
This document summarizes research analyzing over 15,000 large language models (LLMs) uploaded to Hugging Face. The authors collected data on each LLM's name, parameters, downloads, and likes. They performed hierarchical clustering and graph analysis to identify relationships between LLMs. Key findings include the identification of popular LLM families (e.g. Wizard, Pythia) and a positive but weak correlation between downloads and likes. The research was used to build an interactive web application called Constellation that visualizes the LLM data through dendrograms, word clouds and graphs.
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsIRJET Journal
This document summarizes research on plagiarism detection in computer programming assignments. It provides an abstract of the research topic and reviews several existing approaches for detecting plagiarism, including similarity-based, logic-based, and machine learning-based methods. Feature-based, string-matching, and algorithm-based techniques are discussed. The review identifies areas for further development, such as improving accuracy and supporting additional programming languages. The goal is to determine the best algorithm and methodology for developing a system to detect plagiarism in code.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
This document discusses event driven architecture (EDA) and domain driven design. It begins with an introduction to the speaker and an overview of EDA basics. It then describes problems with traditional SOA implementations, where domain logic gets split across many systems. The document proposes that exposing domain events on a shared event bus allows isolating cross-cutting functions to separate systems while keeping domain logic together. It provides examples of how this approach improves scalability and decouples systems. Finally, it outlines potential business benefits of using EDA like enabling complex event processing, business process management, and business activity monitoring on top of the domain events.
The document discusses trends and challenges facing information technology, including building a civic semantic web and waiving rights over linked data. It also discusses whether semantic technologies could permit meaningful brand relationships. The document contains a chart showing government department spending in the UK, with the Department of Health spending £105.7 billion, the NHS £90.7 billion, and local and regional government £34.3 billion.
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
This document summarizes an Erlang meeting held on July 3, 2009 in Tokyo. It discusses the gen_paxos Erlang module, which implements the Paxos consensus algorithm. Paxos is needed to solve problems like split-brains where data could become inconsistent without coordination between nodes. The document explains the key aspects of Paxos like its phases, data model in gen_paxos, and how nodes communicate through message passing in Erlang. It also provides references to related works and papers about Paxos.
pragmaticrealworldscalajfokus2009-1233251076441384-2.pdfHiroshi Ono
The document discusses Scala and functional programming concepts. It provides examples of building a chat application in 30 lines of code using Lift, defining case classes and actors for messages. It summarizes that Scala is a pragmatically oriented, statically typed language that runs on the JVM and has a unique blend of object-oriented and functional programming. Functional programming concepts like immutable data structures, functions as first-class values, and for-comprehensions are demonstrated with examples in Scala.
This document is the introduction to "The Little Book of Semaphores" by Allen B. Downey. It provides an overview of the book, which uses examples and puzzles to teach synchronization concepts and patterns. The book aims to give students more practice with these challenging concepts than a typical operating systems course allows. It also discusses the book's licensing as free and open source documentation.
This document provides style guidelines for Scala developers at Twitter. It outlines recommendations for imports, implicit usage, reflection, comments, whitespace, logging, project layout, variable naming conventions, and ends by thanking people for attending.
This document introduces developing a Scala DSL for Apache Camel. It discusses using Scala features like implicit conversions, passing functions as parameters, and by-name parameters to build a DSL. It provides examples of simple routes in the Scala DSL and compares them to Java. It also covers tooling for Scala in Maven and Eclipse and caveats like interacting with Java generics. The goal is to learn basic Scala concepts and syntax for building a Scala DSL, using Camel as an example.
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
The document discusses alternative concurrency paradigms to shared-state concurrency for the JVM, including software transactional memory which allows transactions over shared memory, message passing concurrency using the actor model where actors communicate asynchronously via message passing, and dataflow concurrency where variables can only be assigned once. It provides examples of how these paradigms can be used to implement solutions like transferring funds between bank accounts more elegantly than with shared-state concurrency and locks.
This document discusses using TCP/IP for high performance computing (HPC) applications. It finds that while TCP/IP can achieve bandwidth of 1 Gbps over short distances with low latency, the bandwidth degrades significantly over wide area networks with higher latency. It investigates tuning TCP parameters like socket buffer sizes to improve performance over high latency networks.
Martin Odersky outlines the growth and adoption of Scala over the past 6 years and discusses Scala's future direction over the next 5 years. Key points include:
- Scala has grown from its first classroom use in 2003 to filling a full day of talks at JavaOne in 2009 and developing a large user community.
- Scala 2.8 will include new collections, package objects, named/default parameters, and improved tool support.
- Over the next 5 years, Scala will focus on concurrency and parallelism features at all levels from primitives to tools.
- Other areas of focus include extended libraries, performance improvements, and standardized compiler plugin architecture.
stateyouredoingitwrongjavaone2009-090617031310-phpapp02.pdfHiroshi Ono
This document discusses alternative concurrency paradigms for the Java Virtual Machine (JVM). It begins with an agenda and discusses how Moore's Law no longer solves concurrency problems as processors are becoming multi-core. It then discusses the problems with shared-state concurrency and how separating identity and value can help. The document introduces software transactional memory, message passing concurrency using actors, and dataflow concurrency as alternative paradigms. It uses examples of bank account transfers to demonstrate how these paradigms can be implemented and discusses their advantages over shared-state concurrency.
This document contains the schedule for a conference with sessions on various topics in natural language processing and computational linguistics. The conference will take place from September 14-16. Each day consists of morning and afternoon sessions split into parallel tracks (1a and 1b). Sessions cover areas like semantics, parsing, sentiment analysis, and more. Keynote speakers include Ricardo Baeza-Yates, Kevin Bretonnel Cohen, Mirella Lapata, Shalom Lappin, and Massimo Poesio. Presentations are 20 minutes each with coffee breaks in the mornings and poster sessions in the afternoons.
The article discusses the Guardian's Datastore project, which makes data of public interest freely available online for reuse. Some key points:
- The Datastore contains datasets on topics like MPs' expenses, carbon emissions, and public opinion polls. This data was previously hard to access but the web now allows easy access to billions of statistics.
- Making this data open and machine-readable supports the Guardian's tradition of fact-checking and transparency. It also encourages others to analyze and build upon the data in new ways.
- An early example involved crowdsourcing the review of 500,000 pages of MPs' expenses, revealing new insights. Other Guardian datasets like music recommendations and university rankings are now available for others
genpaxospublic-090703114743-phpapp01.pdfHiroshi Ono
This document summarizes a presentation on Paxos and gen_paxos. It introduces Paxos as a distributed consensus algorithm that is robust to network failures and allows data replication across multiple nodes. It then describes the gen_paxos Erlang implementation of Paxos, including its data model, state machine approach, and messaging between nodes. Key aspects of Paxos like the prepare and propose phases are explained through examples. The document also provides context on applications of Paxos and references for further reading.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
High performance Serverless Java on AWS- GoTo Amsterdam 2024Vadym Kazulkin
Java is for many years one of the most popular programming languages, but it used to have hard times in the Serverless community. Java is known for its high cold start times and high memory footprint, comparing to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption, cold start times for Java Serverless development on AWS including GraalVM (Native Image) and AWS own offering SnapStart based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients and measure their impact on cold and warm start times.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
Session 1 - Intro to Robotic Process Automation.pdf
nearduplicates2006
Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms
Monika Henzinger
Google Inc. & École Fédérale de Lausanne (EPFL)
monika@google.com
ABSTRACT of comparisons. Both algorithms work on sequences of ad-
Broder et al.’s [3] shingling algorithm and Charikar’s [4] ran- jacent characters. Brin et al. [1] started to use word se-
dom projection based approach are considered “state-of-the- quences to detect copyright violations. Shivakumar and
art” algorithms for finding near-duplicate web pages. Both Garcia-Molina [13, 14] continued this research and focused
algorithms were either developed at or used by popular web on scaling it up to multi-gigabyte databases [15]. Broder
search engines. We compare the two algorithms on a very et al. [3] also used word sequences to efficiently find near-
large scale, namely on a set of 1.6B distinct web pages. The duplicate web pages. Later, Charikar [4] developed an ap-
results show that neither of the algorithms works well for proach based on random projections of the words in a doc-
finding near-duplicate pairs on the same site, while both ument. Recently Hoad and Zobel [10] developed and com-
achieve high precision for near-duplicate pairs on different pared methods for identifying versioned and plagiarised doc-
sites. Since Charikar’s algorithm finds more near-duplicate uments.
pairs on different sites, it achieves a better precision overall, Both Broder et al. ’s and Charikar’s algorithm have ele-
namely 0.50 versus 0.38 for Broder et al. ’s algorithm. We gant theoretical justifications, but neither has been exper-
present a combined algorithm which achieves precision 0.79 imentally evaluated and it is not known which algorithm
[...] with 79% of the recall of the other algorithms.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia

General Terms
Algorithms, Measurement, Experimentation

Keywords
Near-duplicate documents, content duplication, web pages

1. INTRODUCTION

Duplicate and near-duplicate web pages create large problems for web search engines: they increase the space needed to store the index, either slow down or increase the cost of serving results, and annoy the users. Thus, algorithms for detecting these pages are needed.

A naive solution is to compare all pairs of documents. Since this is prohibitively expensive on large datasets, Manber [11] and Heintze [9] proposed the first algorithms for detecting near-duplicate documents with a reduced number of comparisons.

[...] performs better in practice. In this paper we evaluate both algorithms on a very large real-world data set, namely on 1.6B distinct web pages. We chose these two algorithms as both were developed at or used by successful web search engines and are considered "state-of-the-art" in finding near-duplicate web pages. We call them Algorithm B and Algorithm C.

We set all parameters in Alg. B as suggested in the literature. Then we chose the parameters in Alg. C so that it uses the same amount of space per document and returns about the same number of correct near-duplicate pairs, i.e., has about the same recall. We compared the algorithms according to three criteria: (1) precision on a random subset, (2) the distribution of the number of term differences per near-duplicate pair, and (3) the distribution of the number of near-duplicates per page.

The results are: (1) Alg. C has precision 0.50 and Alg. B 0.38. Both algorithms perform about the same for pairs on the same site (low precision) and for pairs on different sites (high precision). However, 92% of the near-duplicate pairs found by Alg. B belong to the same site, but only 74% of those found by Alg. C. Thus, Alg. C finds more of the pairs for which precision is high and hence has an overall higher precision. (2) The distributions of the number of term differences per near-duplicate pair are very similar for the two algorithms, but Alg. B returns fewer pairs with extremely large term differences. (3) The distribution of the number of near-duplicates per page follows a power-law for both algorithms. However, Alg. B has a higher "spread" around the power-law curve. A possible reason for that "noise" is that the bit string representing a page in Alg. B is based on a randomly selected subset of the terms in the page. Thus, there might be "lucky" and "unlucky" choices, leading to pages with an artificially high or low number of near-duplicates. Alg. C does not select a subset of terms but is based on all terms in the page.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'06, August 6-11, 2006, Seattle, Washington, USA.
Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00.
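The simplified definition of a page's site used later in the paper (the domain name itself when it has at most two levels, otherwise the domain name with the label before its first dot stripped) can be sketched in Python; this is an illustration, and the function name is made up:

```python
def site_of(url: str) -> str:
    """The paper's simplified 'site' of a page: the domain name if it has
    at most one dot; otherwise the domain name with everything before the
    first dot removed."""
    # Strip an optional scheme and the path to isolate the domain name.
    domain = url.split("//")[-1].split("/")[0]
    if domain.count(".") <= 1:
        return domain
    return domain.split(".", 1)[1]
```

On the paper's own example, `site_of("www.cs.berkeley.edu/index.html")` yields `cs.berkeley.edu`.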
Finally, we present a combined algorithm that allows for different precision-recall tradeoffs. The precision of one tradeoff is 0.79 with 79% of the recall of Alg. B.

It is notoriously hard to determine which pages belong to the same site. Thus we use the following simplified approach. The site of a page is (1) the domain name of the page if the domain name has at most one dot, i.e., at most two levels; and (2) the domain name minus the string before the first dot if the domain name has two or more dots, i.e., three or more levels. For example, the site of www.cs.berkeley.edu/index.html is cs.berkeley.edu.

The paper is organized as follows: Section 2 describes the algorithms in detail. Section 3 presents the experiments and the evaluation results. We conclude in Section 4.

2. DESCRIPTION OF THE ALGORITHMS

For both algorithms every HTML page is converted into a token sequence as follows: All HTML markup in the page is replaced by white space or, in the case of formatting instructions, ignored. Then every maximal alphanumeric sequence is considered a term and is hashed using Rabin's fingerprinting scheme [12, 2] to generate tokens, with two exceptions: (1) Every URL contained in the text of the page is broken at slashes and dots and is treated like a sequence of individual terms. (2) In order to distinguish pages with different images, the URL in an IMG tag is considered to be a term in the page. More specifically, if the URL points to a different host, the whole URL is considered to be a term. If it points to the host of the page itself, only the filename of the URL is used as a term. Thus if a page and its images on the same host are mirrored on a different host, the URLs of the IMG tags generate the same tokens in the original and mirrored version.

Both algorithms generate a bit string from the token sequence of a page and use it to determine the near-duplicates for the page. We compare a variant of Broder et al.'s algorithm as presented by Fetterly et al. [7]^1 and a slight modification of the algorithm in [4] as communicated by Charikar [5]. We explain these algorithms next.

^1 The only difference is that we omit the wrapping of the shingling "window" from end to beginning described in [7].

Let n be the length of the token sequence of a page. For Alg. B every subsequence of k tokens is fingerprinted using 64-bit Rabin fingerprints, which results in a sequence of n - k + 1 fingerprints, called shingles. Let S(d) be the set of shingles of page d. Alg. B makes the assumption that the percentage of unique shingles on which the two pages d and d' agree, i.e., |S(d) ∩ S(d')| / |S(d) ∪ S(d')|, is a good measure of the similarity of d and d'. To approximate this percentage every shingle is fingerprinted with m different fingerprinting functions f_i for 1 ≤ i ≤ m that are the same for all pages. This leads to n - k + 1 values for each f_i. For each i the smallest of these values is called the i-th minvalue and is stored at the page. Thus, Alg. B creates an m-dimensional vector of minvalues. Note that multiple occurrences of the same shingle have the same effect on the minvalues as a single occurrence. Broder et al. showed that the expected percentage of entries in the minvalues vectors on which two pages d and d' agree is equal to the percentage of unique shingles on which d and d' agree. Thus, to estimate the similarity of two pages it suffices to determine the percentage of agreeing entries in the minvalues vectors. To save space and speed up the similarity computation, the m-dimensional vector of minvalues is reduced to an m'-dimensional vector of supershingles by fingerprinting non-overlapping sequences of minvalues: Let m be divisible by m' and let l = m/m'. The concatenation of minvalues j*l, ..., (j+1)*l - 1 for 0 ≤ j < m' is fingerprinted with yet another fingerprinting function and is called a supershingle.^2 This creates a supershingle vector. The number of identical entries in the supershingle vectors of two pages is their B-similarity. Two pages are near-duplicates of Alg. B, or B-similar, iff their B-similarity is at least 2.

^2 Megashingles were introduced in [3] to speed up the algorithm even further. Since they do not improve precision or recall, we did not implement them.

The parameters to be set are m, l, m', and k. Following prior work [7, 8] we chose m = 84, l = 14, and m' = 6. We set k = 8 as this lies between the k = 10 used in [3] and the k = 5 used in [7, 8]. For each page its supershingle vector is stored, which requires m' 64-bit values, or 48 bytes.

Next we describe Alg. C. Let b be a constant. Each token is projected into b-dimensional space by randomly choosing b entries from {-1, 1}. This projection is the same for all pages. For each page a b-dimensional vector is created by adding the projections of all the tokens in its token sequence. The final vector for the page is created by setting every positive entry in the vector to 1 and every non-positive entry to 0, resulting in a random projection for each page. It has the property that the cosine similarity of two pages is proportional to the number of bits in which the two corresponding projections agree. Thus, the C-similarity of two pages is the number of bits their projections agree on. We chose b = 384 so that both algorithms store a bit string of 48 bytes per page. Two pages are near-duplicates of Alg. C, or C-similar, iff the number of agreeing bits in their projections lies above a fixed threshold t. We set t = 372; see Section 3.3.

We briefly compare the two algorithms. In both algorithms the same bit string is assigned to pages with the same token sequence. Alg. C ignores the order of the tokens, i.e., two pages with the same set of tokens have the same bit string. Alg. B takes the order into account, as the shingles are based on the order of the tokens. Alg. B ignores the frequency of shingles, while Alg. C takes the frequency of terms into account. For both algorithms there can be false positives (non-near-duplicate pairs returned as near-duplicates) as well as false negatives (near-duplicate pairs not returned as near-duplicates). Let T be the sum of the number of tokens in all pages and let D be the number of pages. Alg. B takes time O(Tm + Dm') = O(Tm). Alg. C needs time O(Tb) to determine the bit string for each page. As described in Section 3.3 the C-similar pairs are computed using a trick similar to supershingles. It takes time O(D), so that the total time for Alg. C is O(Tb).
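The two fingerprint constructions described in this section can be sketched as follows. This is a minimal illustration, not the paper's implementation: a generic 64-bit keyed hash stands in for the Rabin fingerprinting functions, and Alg. C's per-token +/-1 projection is derived from a seeded RNG. The parameter values (k = 8, m = 84, m' = 6, b = 384) follow the paper.

```python
import hashlib
import random

def _h(data: bytes, seed: int) -> int:
    """64-bit keyed hash; stands in for the Rabin fingerprinting functions."""
    return int.from_bytes(
        hashlib.blake2b(data, digest_size=8, key=seed.to_bytes(8, "big")).digest(),
        "big")

def b_fingerprint(tokens, k=8, m=84, m_prime=6):
    """Alg. B sketch: k-token shingles -> m minvalues -> m' supershingles."""
    shingles = [_h(" ".join(tokens[i:i + k]).encode(), 0)
                for i in range(len(tokens) - k + 1)]
    # i-th minvalue: smallest image of any shingle under the i-th hash function.
    minvalues = [min(_h(s.to_bytes(8, "big"), i + 1) for s in shingles)
                 for i in range(m)]
    l = m // m_prime  # 84 / 6 = 14 minvalues per supershingle
    return [_h(b"".join(v.to_bytes(8, "big") for v in minvalues[j * l:(j + 1) * l]), 0)
            for j in range(m_prime)]

def b_similarity(u, v):
    """Number of agreeing entries in two supershingle vectors (0..6)."""
    return sum(a == b for a, b in zip(u, v))

def c_fingerprint(tokens, b=384):
    """Alg. C sketch: sum per-token random +/-1 projections, keep sign bits."""
    acc = [0] * b
    for t in tokens:
        rng = random.Random(_h(t.encode(), 99))  # same projection for all pages
        for i in range(b):
            acc[i] += 1 if rng.random() < 0.5 else -1
    return [1 if x > 0 else 0 for x in acc]

def c_similarity(u, v):
    """Number of agreeing bits in two projections (0..384)."""
    return sum(a == b for a, b in zip(u, v))
```

Note that shuffling a token sequence leaves the Alg. C bits unchanged (the sum over tokens is order-independent), while the Alg. B shingles depend on token order, matching the comparison above.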
3. EXPERIMENTS

Both algorithms were implemented using the MapReduce framework [6]. MapReduce is a programming model for simplified data processing on machine clusters. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The algorithms were executed on a set of 1.6B unique pages collected by Google's crawler. A preprocessing step grouped pages with the same token sequence into identity sets and removed for every identity set all but one page. About 25-30% of the pages were removed in this step; we do not know the exact number, as the preprocessing was done before we received the pages.

Alg. B and C found that between 1.7% and 2.2% of the pages after duplicate removal have a near-duplicate. Thus the total number of near-duplicates and duplicates is roughly the same as the one reported by Broder et al. [3] (41%) and by Fetterly et al. [7] (29.2%) on their collections of web pages. The exact percentage of duplicates and near-duplicates depends on the crawler used to gather the web pages, and especially on its handling of session ids, which frequently lead to exact duplicates. Note that the focus of this paper is not on determining the percentage of near-duplicates on the web, but on comparing Alg. B and C on the same large real-world data set. The web pages resided on 46.2M hosts, with an average of 36.5 pages per host. The distribution of the number of pages per host follows a power-law. We believe that the pages used in our study are fairly representative of the publicly available web and thus form a useful large-scale real-world data set for the comparison.

As it is not possible to determine by hand all near-duplicate pairs in a set of 1.6B pages, we cannot determine the recall of the algorithms. Instead we chose the threshold t in Alg. C so that both algorithms returned about the same number of correct near-duplicate pairs, i.e., they have about the same recall (without actually knowing what it is). Then we compared the algorithms based on (1) precision, (2) the distribution of the number of term differences in the near-duplicate pairs, and (3) the distribution of the number of near-duplicates per page. Comparison (1) required human evaluation and will be explained next.

3.1 Human Evaluation

We randomly sampled B-similar and C-similar pairs and had them evaluated by a human,^3 who labeled each near-duplicate pair either as correct, incorrect, or undecided. We used the following definition for a correct near-duplicate: Two web pages are correct near-duplicates if (1) their text differs only by the following: a session id, a timestamp, an execution time, a message id, a visitor count, a server name, and/or all or part of their URL (which is included in the document text); (2) the difference is invisible to the visitors of the pages; (3) the difference is a combination of the items listed in (1) and (2); or (4) the pages are entry pages to the same site. The most common example of URL-only differences are "parked domains", i.e., domains that are for sale. In this case the URL is a domain name and the HTML page retrieved by the URL is an advertisement page for buying that domain. Pages of domains for sale by the same organization usually differ only by the domain name, i.e., the URL. Examples of Case (4) are entry pages to the same porn site with some different words.

^3 A different approach would be to check the correctness of near-duplicate pages, not pairs, i.e., sample pages for which the algorithm found at least one near-duplicate and then check whether at least one of its near-duplicates is correct. However, this approach seemed to require too many human comparisons, since a page with at least one B-similar page has on average 135 B-similar pages.

A near-duplicate pair is incorrect if the main item(s) of the page was (were) different. For example, two shopping pages with common boilerplate text but a different product in the page center are an incorrect near-duplicate pair.

The remaining near-duplicate pairs were rated undecided. The following three reasons covered 95% of the undecided pairs: (1) prefilled forms with different but erasable values, such that erasing the values results in the same form; (2) a different "minor" item, like a different text box on the side or the bottom; (3) pairs which could not be evaluated. To evaluate a pair, the pages as stored at the time of the crawl were visually compared and a Linux diff operation was performed on the two token sequences. The diff output was used to easily find the differences in the visual comparison. However, for some pages the diff output did not agree with visual inspection. This happened, e.g., because one of the pages automatically refreshed and the fresh page was different from the crawled page. In this case the pair was labeled as "cannot evaluate". A pair was also labeled as "cannot evaluate" when the evaluator could not discern whether the difference in the two pages was major or minor. This happened mostly for Chinese, Japanese, or Korean pages.

3.2 The Results for Algorithm B

Alg. B generated 6 supershingles per page, for a total of 10.1B supershingles. They were sorted, and for each pair of pages with an identical supershingle we determined its B-similarity. This resulted in 1.8B B-similar pairs, i.e., pairs with B-similarity at least 2.

Let us define the following B-similarity graph: Every page is a node in the graph. There is an edge between two nodes iff the pair is B-similar. The label of an edge is the B-similarity of the pair, i.e., 2, 3, 4, 5, or 6. The graph has 1.8B edges; about half of them have label 2 (see Table 1). A node is considered a near-duplicate page iff it is incident to at least one edge. Alg. B found 27.4M near-duplicate pages. The average degree of the B-similarity graph is about 135. Figure 1 shows the degree distribution in log-log scale. It follows a power-law with exponent about -1.3.

  B-similarity   Number of near-duplicates   Percentage
  2              958,899,366                 52.4%
  3              383,076,019                 20.9%
  4              225,454,277                 12.3%
  5              158,989,276                  8.7%
  6              104,628,248                  5.7%

Table 1: Number of near-duplicate pairs found for each B-similarity value.

Figure 1: The degree distribution in the B-similarity graph in log-log scale.

We randomly sampled 96,556 B-similar pairs. In 91.9% of the cases both pages belonged to the same site. We then
subsampled these pairs and checked each of the resulting 1910 pairs for correctness (see Table 2). The overall precision is 0.38. However, the errors arise mostly for pairs on the same site: There the precision drops to 0.34, while for pairs on different sites the precision is 0.86. The reason is that very often pages on the same site use the same boilerplate text and differ only in the main item in the center of the page. If there is a large amount of boilerplate text, chances are good that the algorithm cannot distinguish the main item from the boilerplate text and classifies the pair as a near-duplicate.

  Near dups    Number of pairs   Correct   Not correct   Undecided
  all          1910              0.38      0.53          0.09 (0.04)
  same site    1758              0.34      0.57          0.09 (0.04)
  diff. sites   152              0.86      0.06          0.08 (0.01)
  B-sim 2      1032              0.24      0.68          0.08 (0.03)
  B-sim 3       389              0.42      0.48          0.1  (0.04)
  B-sim 4       240              0.55      0.36          0.09 (0.05)
  B-sim 5       143              0.71      0.23          0.06 (0.02)
  B-sim 6       106              0.85      0.05          0.1  (0.08)

Table 2: Alg. B: Fraction of correct, not correct, and undecided pairs, with the fraction of prefilled, erasable forms out of all pairs in that row in parenthesis.

Precision improves for pairs with larger B-similarity. This is expected, as larger B-similarity means more agreement in supershingles. While the correctness of B-similarity 2 is only 0.24, this value increases to 0.42 for B-similarity 3, and to 0.66 for B-similarity larger than 3. However, less than half of the pairs have B-similarity 3 or more.

Table 3 analyzes the correct B-similar pairs. It shows that URL-only differences account for 41% of the correct pairs. For pairs on different sites, 108 out of the 152 B-similar pairs differ only in the URL. This largely explains the high precision in this case. Time stamp-only differences, execution time-only differences, and combinations of differences are about equally frequent. The remaining cases account for less than 4% of the correct pairs.

                        All pairs    Same site
  URL only              302 (41%)    194 (32%)
  Time stamp only       123 (17%)    119 (20%)
  Combination           161 (22%)    145 (24%)
  Execution time only   118 (16%)    114 (19%)
  Visitor count only     22 (3%)      20 (3%)
  Rest                    5 (0%)       5 (1%)

Table 3: The distribution of differences for correct B-similar pairs.

Only 9% of all pairs are labeled undecided. Table 4 shows that 92% of them are on the same site. Almost half the cases are pairs that could not be evaluated. Prefilled, erasable forms are the reason for 41% of the cases. Differences in minor items account for only 11%.

                         All pairs   Same site
  Cannot evaluate        78 (47%)    66 (43%)
  Form                   69 (41%)    67 (43%)
  Different minor item   19 (11%)    19 (12%)
  Other                   1 (1%)      1 (1%)

Table 4: Reasons for undecided B-similar pairs.

Next we analyze the distribution of term differences for the 1910 B-similar pairs. To determine the term difference of a pair we executed the Linux diff command over the two token sequences and used the number of tokens that were returned. The average term difference is 24, the median is 11; 21% of the pairs have term difference 2, and 90% have term difference less than 42. For 17 pairs the term difference is larger than 200. None of them are correct near-duplicates. They mostly consist of repeated text in one page (like repeated lists of countries) that is completely missing in the other page and could probably be avoided if the frequency of shingles was taken into account.

Figure 2 shows the distribution of term differences up to 200. The spike around 19 to 36 consists of 569 pages and is mostly due to two databases on the web. It contains 326 pairs from the NIH Nucleotide database and 103 pairs from the Historic Herefordshire database. These two databases are a main source of errors for Alg. B. All of the evaluated pairs between pages of one of these databases were rated as incorrect. However, 19% of the pairs in the random sample of 96,556 pairs came from the NIH database and 9% came from the Herefordshire database.^4 The pages in the NIH database consist of 200-400 tokens and differ in 10-30 consecutive tokens; the pages in the Herefordshire database consist of 1000-2000 tokens and differ in about 20 consecutive tokens. In both cases Alg. B has a good chance of picking two out of the six supershingles from the long common token sequences originating from boilerplate text. However, the number of different tokens is large enough that Alg. C returned only three of the pairs in the sample as near-duplicate pairs.

^4 A second independent sample of 170,318 near-duplicate pairs confirmed these percentages.

Figure 2: The distribution of term differences in the sample of Alg. B.

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11 bits, i.e., C-similarity at least 373, but might miss some pairs with larger differences.

Alg. C returns all pairs with C-similarity at least t as near-duplicate pairs. As discussed above we chose t so that both algorithms find about the same number of correct near-duplicate pairs.
Alg. B found 1,831M near-duplicate pairs containing about 1,831M * 0.38 ≈ 696M correct near-duplicate pairs. For t = 372 Alg. C found 1,630M near-duplicate pairs containing about 1,630M * 0.5 = 815M correct near-duplicate pairs. Thus, we set t = 372. In a slight abuse of notation we call a pair C-similar iff it was returned by our implementation. The difference to before is that there might be some pairs with C-similarity 372 that are not C-similar because they were not returned by our implementation.

We define the C-similarity graph analogously to the B-similarity graph. There are 35.5M nodes with at least one incident edge, i.e., near-duplicate pages. This is almost 30% more than for Alg. B. The average degree in the C-similarity graph is almost 92. Figure 3 shows the degree distribution in log-log scale. It follows a power-law with exponent about -1.4.

Figure 3: The degree distribution in the C-similarity graph in log-log scale.

We randomly sampled 172,464 near-duplicate pairs. Out of them 74% belonged to the same site. In a random subsample of 1872 near-duplicate pairs Alg. C achieves an overall precision of 0.50, with 27% incorrect pairs and 23% undecided pairs (see Table 5). For pairs on different sites the precision is 0.9, with only 5% incorrect pairs and 5% undecided pairs. For pairs on the same site the precision is only 0.36, with 34% incorrect pairs and 30% undecided pairs. The number in parenthesis gives the percentage of pairs out of all pairs that were marked undecided because of prefilled but erasable forms. It shows that these pages are the main reason for the large number of undecided pairs of Alg. C.

  Near dups            Number of pairs   Correct   Not correct   Undecided
  all                  1872              0.50      0.27          0.23 (0.18)
  same site            1393              0.36      0.34          0.30 (0.25)
  different site        479              0.90      0.05          0.05 (0)
  C-sim ≥ 382           179              0.47      0.37          0.16 (0.10)
  382 > C-sim ≥ 379     407              0.40      0.37          0.23 (0.18)
  379 > C-sim ≥ 376     532              0.37      0.27          0.35 (0.30)
  C-sim < 376           754              0.62      0.19          0.19 (0.12)

Table 5: Alg. C: Fraction of correct, not correct, and undecided pairs, with the fraction of prefilled, erasable forms out of all pairs in that row in parenthesis.

Table 5 also lists the precision for different C-similarity ranges. Surprisingly, precision is highest for C-similarity between 372 and 375. This is due to the way we break URLs at slashes and dots. Two pages that differ only in the URL usually differ in 2 to 4 tokens, since these URLs are frequently domain names. This often places the pair in the range between 372 and 375. Indeed 57% of the pairs that differ only in the URL fall into this range. This explains 387 out of the 470 correct near-duplicates for this C-similarity range.

Table 6 analyzes the correct near-duplicate pairs. URL-only differences account for 72%, combinations of differences for 11%, and time stamps and execution time together for 9%, with about half each. The remaining reasons account for 8%. For near-duplicate pairs on the same site only 53% of the correct near-duplicate pairs are caused by URL-only differences, while 19% are due to a combination of reasons and 8% to time stamps and execution time. For pairs on different sites 406 of the 479 C-similar pairs differ only in the URL. This explains the high precision in that case.

                        All pairs    Same site
  URL only              676 (72%)    270 (53%)
  Combination           100 (11%)     95 (19%)
  Time stamp only        42 (4%)      40 (8%)
  Execution time only    41 (4%)      39 (8%)
  Message id only        21 (2%)      18 (4%)
  Rest                   57 (6%)      43 (8%)

Table 6: The distribution of differences for correct C-similar pairs.

Table 7 shows that 95% of the undecided pairs are on the same site and 80% of the undecided pairs are due to forms. Only 15% could not be evaluated. The total number of such cases (64) is about the same as for Alg. B (78).

We also analyzed the term difference in the 1872 near-duplicate pairs sampled from Alg. C. Figure 4 shows the number of pairs for a given term difference up to 200. The average term difference is 94, but this is only due to outliers; the median is 7, 24% had a term difference of 2, and 90% had a term difference smaller than 44. There were 90 pairs with term differences larger than 200.

The site which causes the largest number of incorrect C-similar pairs is http://www.businessline.co.uk/, a UK business directory that lists in the center of each page the phone number for a type of business for a specific area code. Two such pages differ in about 1-5 not necessarily consecutive tokens and agree in about 1000 tokens. In a sample of 172,464 C-similar pairs, 9.2% were pairs that came from this site;^5 all such pairs that were evaluated in the subsample of 1872 were incorrect. Since the token differences are non-consecutive, shingling helps: If x different tokens are separated by y common tokens such that any consecutive sequence of common tokens has length less than k, then x + y + k - 1 different shingles are generated. Indeed, none of these pairs in the sample was returned by Alg. B.

^5 In a second independent random sample of 23,475 near-duplicate pairs we found 9.4% such pairs.

3.4 Comparison of Algorithms B and C

3.4.1 Graph Structure

The average degree in the C-similarity graph is smaller than in the B-similarity graph (92 vs. 135) and there are fewer high-degree nodes in the C-similarity graph: Only
7.5% of the nodes in the C-similarity graph have degree at least 100, vs. 10.8% in the B-similarity graph. The plots in Figures 1 and 3 follow a power-law distribution, but the one for Alg. B "spreads" around the power-law curve much more than the one for Alg. C. This can be explained as follows: Each supershingle depends on 112 terms (14 shingles with 8 terms each). Two pages are B-similar iff two of their supershingles agree. If the page is "lucky", two of its supershingles consist of common sequences of terms (like lists of countries in alphabetical order) and the corresponding node has a high degree. If the page is "unlucky", all supershingles consist of uncommon sequences of terms, leading to low degree. In a large enough set of pages there will always be a certain percentage of pages that are "lucky" and "unlucky", leading to a deviation from the power-law. Alg. C does not select subsets of terms; its bit string depends on all terms in the sequence, and thus the power-law is followed more strictly.

                         All pairs    Same site
  Form                   345 (80%)    343 (84%)
  Cannot evaluate         64 (15%)     54 (13%)
  Other                   21 (5%)      12 (3%)
  Different minor item     2 (0%)       1 (0%)

Table 7: Reasons for undecided C-similar pairs.

Figure 4: The distribution of term differences in the sample of Alg. C.

3.4.2 Manual Evaluation

Alg. C outperforms Alg. B with a precision of 0.50 versus 0.38 for Alg. B. A more detailed analysis shows a few interesting differences.

Pairs on different sites: Both algorithms achieve high precision (0.86 resp. 0.9); 8% of the B-similar pairs and 26% of the C-similar pairs are on different sites. Thus, Alg. C is superior to Alg. B (higher precision and recall) for pairs on different sites. The main reason for the higher recall is that Alg. C found 406 pairs with URL-only differences on different sites, while Alg. B returned only 108 such pairs.

Pairs on the same site: Neither algorithm achieves high precision for pages on the same site. Alg. B returned 1752 pairs with 597 correct pairs (precision 0.34), while Alg. C returns 1386 pairs with 500 correct pairs (precision 0.37). Thus, Alg. B has 20% higher recall, while Alg. C achieves slightly higher precision for pairs on the same site. However, a combination of the two algorithms as described in the next section can achieve a much higher precision for pairs on the same site without sacrificing much recall.

All pairs: Alg. C found more near-duplicates with URL-only differences (676 vs. 302), while Alg. B found more correct near-duplicates whose differences lie only in the time stamp or execution time (241 vs. 83). Alg. C found many more undecided pairs due to prefilled, erasable forms than Alg. B (345 vs. 69). In many applications these forms would be considered correct near-duplicate pairs. If so, the overall precision of Alg. C would increase to 0.68, while the overall precision of Alg. B would be 0.42.

3.4.3 Term Differences

The results for term differences are quite similar, except for the larger number (19 vs. 90) of pairs with term differences larger than 200. To explain this we analyzed each of these 109 pairs (see Table 8).

  Evaluation   Alg. B   Alg. C
  Incorrect      11       20
  Undecided       8       53
  Correct         0       17

Table 8: The evaluations for the near-duplicate pairs with term difference larger than 200.

However, there are mistakes in our counting methods due to a large number of image URLs that were used for layout improvements and are counted by diff, but are invisible to the user. They affect 9 incorrect pairs and 25 undecided pairs of Alg. C, but no pair of Alg. B. Subtracting them leaves both algorithms with 11 incorrect pairs with large term differences, and Alg. B with 8 and Alg. C with 28 undecided pairs with large term differences. Seventeen pairs of Alg. C with large term differences were rated as correct; they are entry pages to the same porn site.

Altogether we conclude that the performance of the two algorithms with respect to term differences is quite similar, but Alg. B returns fewer pairs with very large term differences. However, with 20 incorrect and 17 correct pairs with large term difference, the precision of Alg. C is barely influenced by its performance on large term differences.

3.4.4 Correlation

To study the correlation of B-similarity and C-similarity we determined the C-similarity of each pair in a random sample of 96,556 B-similar pairs. About 4% had C-similarity at least 372. The average C-similarity was almost 341. Table 9 gives for different B-similarity values the average C-similarity.

  B-similarity   C-similarity average
  2              331.5
  3              342.4
  4              352.4
  5              358.9
  6              370.9

Table 9: For a given B-similarity the average C-similarity.

As can be seen, the larger the B-similarity, the larger the average C-similarity. To show the relationship more clearly, Figure 5 plots for each B-similarity level the distribution of C-similarity values. For B-similarity 2, most of the pairs have C-similarity below 350 with a peak at 323. For higher B-similarity values the peaks are above 350.
7. 0.9
1200
quot;B-similarity=6quot;
quot;B-similarity=5quot;
quot;B-similarity=4quot; 0.8
quot;B-similarity=3quot;
1000 quot;B-similarity=2quot;
0.7
Precision
0.6
800
Number of pages
0.5
600
0.4
0.3
400 0 50 100 150 200 250 300 350 400
C-similarity
200 Figure 6: For each C-similarity threshold the corre-
sponding precision of the combined algorithm.
0
150 200 250 300 350 400
C-similarity 1
Figure 5: The C-similarity distribution for various 0.9
fixed B-similarities. 0.8
Precision
C-similarity B-similarity average 0.7
372-375 0.10 0.6
376-378 0.22
0.5
379-381 0.32
382-384 0.32 0.4
0.3
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Table 10: For a given C-similarity range the average Percentage of correct near-duplicate pairs returned
B-similarity.
Figure 7: For the training set S1 the R-value versus
precision for different cutoff thresholds t.
We also determined the B-similarity for a random sample of 169,757 C-similar pairs. Again about 4% of the pairs were B-similar, but for 95% of the pairs the B-similarity was 0. Table 10 gives the details for various C-similarity ranges.

3.5 The Combined Algorithm

The algorithms wrongly identify pairs as near-duplicates either (1) because a small difference in tokens causes a large semantic difference or (2) because of unlucky random choices. As the bad cases for Alg. B showed, pairs with a large amount of boilerplate text and a not very small number (like 10 or 20) of different tokens that are all consecutive are at risk of being wrongly identified as near-duplicates by Alg. B, but are at much less risk by Alg. C. Thus we studied the following combined algorithm: First compute all B-similar pairs. Then filter out those pairs whose C-similarity falls below a certain threshold. To choose a threshold we plotted the precision of the combined algorithm for different threshold values in Figure 6. It shows that precision can improve significantly if a fairly high value of C-similarity, like 350, is used.

To study the impact of different threshold values on recall, let us define R to be the number of correct near-duplicate pairs returned by the combined algorithm divided by the number of correct near-duplicate pairs returned by Alg. B. We chose Alg. B because the combined algorithm tries to filter out the false positives of Alg. B. We randomly subsampled 948 pairs out of the 1910 pairs that were scored for Alg. B, creating the sample S1. The remaining pairs from the 1910 pairs form the sample S2. We used S1 to choose a cutoff threshold and S2 as a testing set to determine the resulting precision and R-value. Figure 7 plots for S1 precision versus R for all C-similarity thresholds between 0 and 384. As expected, precision decreases with increasing R. The long flat range corresponds to different thresholds between 351 and 361 for which precision stays roughly the same while recall increases significantly. For the range between 351 and 359 Table 11 gives the resulting precision and R-values. It shows that a C-similarity threshold of 353, 354, or 355 would be a good choice, achieving a precision of 0.77 while keeping R over 0.8. We picked 355.

Percentage of correct   Precision   C-similarity
pairs returned                      threshold
0.869                   0.754       351
0.858                   0.763       352
0.836                   0.768       353
0.825                   0.771       354
0.808                   0.770       355
0.789                   0.776       356
0.775                   0.784       357
0.747                   0.784       358
0.736                   0.789       359

Table 11: On the training set S1 for different C-similarity thresholds the corresponding precision and percentage of returned correct pairs.
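The filtering step and the threshold selection on S1 can be sketched as follows. This is a minimal illustration with invented names, not the paper's code: it assumes the B-similar pairs and the 384-bit Charikar fingerprints have already been computed, and that a labeled set of correct pairs is available for the training sample.

```python
# Sketch of the combined algorithm (illustrative names, not the paper's code):
# keep only B-similar pairs whose C-similarity reaches a cutoff, then
# measure precision and R on a labeled sample to choose the cutoff.

C_BITS = 384

def c_similarity(fp1: int, fp2: int) -> int:
    """Number of agreeing bits between two 384-bit fingerprints (as ints)."""
    return C_BITS - bin(fp1 ^ fp2).count("1")

def combined(b_similar_pairs, fps, threshold):
    """Filter B-similar pairs by C-similarity; fps maps page -> fingerprint."""
    return [(p, q) for p, q in b_similar_pairs
            if c_similarity(fps[p], fps[q]) >= threshold]

def precision_and_r(b_similar_pairs, fps, correct, threshold):
    """Precision of the filtered pairs, and R = kept correct / all correct."""
    kept = combined(b_similar_pairs, fps, threshold)
    kept_correct = sum(1 for pair in kept if pair in correct)
    total_correct = sum(1 for pair in b_similar_pairs if pair in correct)
    precision = kept_correct / len(kept) if kept else 0.0
    r = kept_correct / total_correct if total_correct else 0.0
    return precision, r

# Toy data: pair (a, b) agrees on 380 bits and is correct; pair (a, c)
# agrees on only 334 bits and is filtered out at threshold 355.
ones = (1 << C_BITS) - 1
fps = {"a": ones, "b": ones ^ 0b1111, "c": ones ^ ((1 << 50) - 1)}
pairs = [("a", "b"), ("a", "c")]
prec, r = precision_and_r(pairs, fps, {("a", "b")}, threshold=355)
```

In practice one would sweep `threshold` over the candidate range (351-361 above) and pick the value with the best precision/R trade-off, as done with Table 11.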
The resulting algorithm returns on the testing set S2 363 out of the 962 pairs as near-duplicates with a precision of 0.79 and an R-value of 0.79. For comparison, consider using Alg. B with a B-similarity cutoff of 3. That algorithm returns only 244 correct near-duplicates with a precision of 0.56, while the combined algorithm returns 287 correct near-duplicates with a precision of 0.79. Thus, the combined algorithm is superior in both precision and recall to using a stricter cutoff for Alg. B. Note that the combined algorithm can be implemented with the same amount of space, since the bit strings for Alg. C can be computed "on the fly" during filtering.

Near-dups        Number     Correct   Incorrect   Undecided   R
                 of pairs
all              363        0.79      0.15        0.06        0.79
same site        296        0.74      0.19        0.07        0.73
different site   65         0.99      0.00        0.01        0.97

Table 12: The combined algorithm on the testing set S2: fraction of correct, incorrect, and undecided pairs and R-value.

Table 12 also shows that 82% of the returned pairs are on the same site and that the precision improvement is mostly achieved for these pairs. With 0.74 this number is much better than for either of the individual algorithms.

A further improvement could be achieved by running both Alg. C and the combined algorithm and returning the pairs on different sites from Alg. C and the pairs on the same site from the combined algorithm. This would generate 1.6B ∗ 0.26 ≈ 416M pairs on the same site with 374M correct pairs and 1.6B ∗ 0.919 ∗ 0.79 ≈ 1163M pairs on different sites with about 919M correct pairs. Thus approximately 1335M correct pairs would be returned with a precision of 0.85, i.e., both recall and precision would be superior to the combined algorithm alone.

4. CONCLUSIONS

We performed an evaluation of two near-duplicate algorithms on 1.6B web pages. Neither performed well on pages from the same site, but a combined algorithm did, without sacrificing much recall.

Two changes might improve the performance of Alg. B and deserve further study: (1) a weighting of shingles by frequency and (2) using a different number k of tokens in a shingle. For example, following [7, 8] one could try k = 5. However, recall that 28% of the incorrect pairs are caused by pairs of pages in two databases on the web. In these pairs the difference is formed by one consecutive sequence of tokens. Thus, reducing k would actually increase the chances that pairs of pages in these databases are incorrectly identified as near-duplicates.

Note that Alg. C could also work with much less space. It would be interesting to study how this affects its performance. Additionally, it would be interesting to explore whether applying Alg. C to sequences of tokens, i.e., shingles, instead of individual tokens would increase its performance.

As our results show, both algorithms perform poorly on pairs from the same site, mostly due to boilerplate text. Using a boilerplate detection algorithm would probably help. Another approach would be to use a different, potentially slower algorithm for pairs on the same site and apply (one of) the presented algorithms to pairs on different sites.

5. ACKNOWLEDGMENTS

I want to thank Mike Burrows, Lars Engebretsen, Abhay Puri, and Oren Zamir for their help and useful discussions.

6. REFERENCES

[1] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In 1995 ACM SIGMOD International Conference on Management of Data (May 1995), 398–409.
[2] A. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 1993, 143–152.
[3] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic Clustering of the Web. In 6th International World Wide Web Conference (Apr. 1997), 393–404.
[4] M. S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In 34th Annual ACM Symposium on Theory of Computing (May 2002).
[5] M. S. Charikar. Private communication.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In 6th Symposium on Operating System Design and Implementation (Dec. 2004), 137–150.
[7] D. Fetterly, M. Manasse, and M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. In 1st Latin American Web Congress (Nov. 2003), 37–45.
[8] D. Fetterly, M. Manasse, and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. To appear in 28th Annual International ACM SIGIR Conference (Aug. 2005).
[9] N. Heintze. Scalable Document Fingerprinting. In Proc. 2nd USENIX Workshop on Electronic Commerce (Nov. 1996).
[10] T. C. Hoad and J. Zobel. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54(3):203–215, 2003.
[11] U. Manber. Finding similar files in a large file system. In Proc. USENIX Winter 1994 Technical Conference (Jan. 1994).
[12] M. Rabin. Fingerprinting by random polynomials. Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[13] N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital documents. In Proc. International Conference on Theory and Practice of Digital Libraries (June 1995).
[14] N. Shivakumar and H. Garcia-Molina. Building a scalable and accurate copy detection mechanism. In Proc. ACM Conference on Digital Libraries (Mar. 1996), 160–168.
[15] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In Proc. Workshop on Web Databases (Mar. 1998), 204–212.