Near Duplicate Document Detection: Mathematical
Modeling and Algorithms
Liwei Ren
Trend Micro
10101 North De Anza Boulevard
Cupertino, CA 95014, USA
1-408-850-1048
liwei_ren@trendmicro.com
Qiuer Xu
Trend Micro
Building B, Soho International Plaza
Nanjing, 210012, P.R. China
86-25-52386123
fallson_xu@trendmicro.com.cn
ABSTRACT
Near-duplicate document detection is a well-known problem in
the area of information retrieval, and an important one for many
applications in the IT industry. It has been studied in an extensive
research literature. This article provides a novel solution to this
classic problem. We present the problem with abstract models,
along with additional concepts such as text models, document
fingerprints and document similarity. With these concepts, the
problem can be transformed into a keyword-like search problem
whose results are ranked by document similarity. Two major
techniques are involved: the first is to extract robust and unique
fingerprints from a document; the second is to calculate document
similarity effectively. Algorithms for both fingerprint extraction
and document similarity calculation are introduced as a complete
solution.
Categories and Subject Descriptors
H.3.3: Information Search and Retrieval – information filtering,
retrieval models, search process.
General Terms
Algorithms, Experimentation.
Keywords
Duplicate Document, Near Duplicate Detection, Document
Fingerprint, Document Similarity, Retrieval Model, Information
Retrieval, Asymmetric Architecture
1. INTRODUCTION
Near duplicate document detection (NDDD) is a well-known
problem in the area of information retrieval. It is defined to
identify whether a given document is a near duplicate of one or
more documents from a well-defined document set. This problem
can be found in many technical areas such as crawling and
indexing optimization of web search engines, copy detection
systems, email archival, spam filtering, and data leak prevention
systems. There is an extensive research literature discussing this
subject, with numerous use cases and solutions [1-6]. Recently,
Kumar et al. [7] provided a thorough review of the most
significant works of recent decades, covering more than 60 papers.
The following sections are organized as problem definition,
mathematical modeling and algorithmic solutions. We introduce a
formal definition of the problem, followed by three text models
used to represent documents. One text model is selected for
constructing the algorithmic solution. By introducing concepts
such as document fingerprint and document similarity, the
problem can be decomposed into three independent sub-problems:
(a) document fingerprint extraction; (b) document similarity
calculation; (c) a fingerprint-based search engine. Two algorithms
are constructed to extract fingerprints from documents and to
measure the similarity between documents. A keyword-based
search engine can be used to solve sub-problem (c). Finally, an
architecture of asymmetric fingerprint generation is proposed to
reduce the number of fingerprints; a smaller number of
fingerprints is critical for the success of some special applications
such as data leak prevention systems.
2. Problem Definition and Modeling
The problem proposed in the introduction is not well-defined
from the perspective of practical implementation. In practice, we
need a quantitative measurement of how “near duplicate” two
documents are, and hence a more rigorous definition of NDDD.
Definition 1: Assume that we have a set of documents S. For
any given document d and a percentile X%, one needs to identify
the documents D1, D2, …, Dm from S such that SIM(d, Dj) ≥
X% for 1 ≤ j ≤ m, where SIM is a well-defined function that
calculates the similarity of two documents. The result {D1, D2,
…, Dm} is presented in descending order of similarity.
There are several challenges in solving this problem:
(a) The document set may be huge, at the scale of millions
or even billions of documents. One certainly cannot
compare d with each document of S to calculate the
similarity. How can the reference documents be
identified efficiently from such a huge set?
(b) How should the similarity function SIM be constructed?
Before we can answer these questions, we need to propose
text models to represent a document. A text model allows us to
exclude irrelevant textual elements so that we can focus on the
essence.
Documents can be in any format such as Word, PowerPoint,
Excel, PDF, PostScript and many others. The individual words or
sentences can be in different styles (bold, italic, underline) and in
a variety of fonts. These are not important textual elements when
we discuss “near duplicate”. Fundamentally, we are more
interested in the textual content that carries semantic significance.
A document can be written in any language, and texts in
different languages can be encoded differently: for example,
English texts can be encoded in ASCII, Chinese in GB, and
Japanese in SJIS. However, all languages can be encoded in the
UTF-8 standard, which is able to represent all languages in one text.
For documents in English or any western language, most authors
view a text as a string of words [2-6]. Words can be extracted
from texts with a tokenization technique that uses spaces to
separate words (or tokens) in sentences.
Some languages such as Chinese and Japanese do not use spaces
between words; in those eastern languages, a sentence is a string
of characters with no spaces between them. Since the characters
of all languages can be encoded as UTF-8 characters, a text in any
language can be considered a string of UTF-8 characters.
Depending on the language, each UTF-8 character consists of
one or more bytes; for example, a Chinese character typically
consists of three bytes while an ASCII character is one byte.
Therefore, one can also view a text as a string of bytes once it is
converted from its original encoding into UTF-8.
Definition 2: We have three text models to represent a document:
 Model 1: A text is a string of tokens (or a sequence
of tokens).
 Model 2: A text is a string of UTF-8 characters.
 Model 3: A text is a string of bytes when the text is
encoded in UTF-8.
In summary, a text is a string of basic textual units, where a basic
textual unit is a token, a UTF-8 character or a byte.
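For illustration, the following Python snippet (a minimal sketch of
ours, not part of the solution itself) views one short bilingual text
under each of the three models:

text = "data 安全"               # a short text mixing English and Chinese

tokens = text.split()            # Model 1: tokens separated by spaces
chars = list(text)               # Model 2: UTF-8 (Unicode) characters
raw = text.encode("utf-8")       # Model 3: bytes of the UTF-8 encoding

print(tokens)       # ['data', '安全'] -- 2 tokens
print(len(chars))   # 7 characters; '安' and '全' count as one each
print(len(raw))     # 11 bytes; each Chinese character takes 3 bytes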
Besides these three models, there exist other text models whose
basic textual units are sentences [5], textual lines, or even pages.
Those models are not of interest in this article.
Numerous articles study NDDD using text model 1. While this
model is adequate for studying NDDD for documents in western
languages, it faces obstacles with non-western languages: model 1
requires tokenization, and tokenization is a daunting task for
processing documents in Chinese and Japanese especially.
There are few works in the academic world adopting text models
2 and 3. Manber [1] discussed duplicate detection in terms of
pairwise matching of ASCII files, which is a special case of
models 2 and 3. In contrast, it has become common practice in
industry to apply text model 2 or 3 to many document
management problems such as DLP [8-10], spam filtering and e-
Discovery. In this article, we use text model 2 to extract
fingerprints from documents and to calculate the similarity
between two documents. Both text models 2 and 3 are language
independent while model 1 is not. Therefore, the techniques
developed in this article apply equally to documents in any
language, and even to a document written in multiple languages.
Definition 3: A document normalization is a process that consists
of three sub-processes applied sequentially:
(a) Converting a document in any format, such as Word,
Excel and PDF, into a plain text encoded in UTF-8;
(b) Converting any plain text in other encodings into a plain
text encoded in UTF-8;
(c) Removing trivial characters such as white spaces,
delimiters and control characters from the UTF-8 text.
Definition 4: The result of document normalization is a string
of UTF-8 characters that contains the most significant information
of the original document. It is called a normalized text or
normalized document.
There are many software tools available for document
normalization. Without loss of generality, we consider all
documents as normalized texts in the rest of this article unless
specified otherwise.
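As a minimal sketch, sub-processes (b) and (c) of Definition 3
might be implemented as follows; sub-process (a), format
conversion, is assumed to be delegated to an external extraction
tool, and the exact set of trivial characters removed is an
implementation choice:

import unicodedata

def normalize(raw: bytes, encoding: str = "utf-8") -> str:
    # (b) decode from the source encoding into Unicode (UTF-8 capable)
    text = raw.decode(encoding, errors="ignore")
    # (c) drop trivial characters: whitespace, control characters and
    # punctuation delimiters (treating punctuation as delimiters is our choice)
    kept = []
    for c in text:
        category = unicodedata.category(c)
        if c.isspace() or category.startswith("C") or category.startswith("P"):
            continue
        kept.append(c)
    return "".join(kept)

print(normalize("Hello,  world!\r\n".encode("utf-8")))   # -> Helloworld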
It is now time to tackle the two challenges of Definition 1. To
meet the first challenge, let us introduce the concept of a
document fingerprint.
Definition 5: A document fingerprint is an integer or a binary
string of fixed length. Fingerprints can be generated from
documents by a function GEN. The fingerprints have the
following characteristics:
(a) A document D has multiple fingerprints { F1, F2, …,
Fn}, i.e., GEN(D) = { F1, F2, …, Fn}.
(b) Two irrelevant documents d and D do not have a
common fingerprint, i.e., GEN(d) ∩ GEN(D) = ϕ.
This is called the uniqueness.
(c) A fingerprint can survive moderate document changes,
i.e., GEN(d) ∩ GEN(D) ≠ ϕ if d is a near-duplicate
copy of D. This is the robustness.
(d) In summary, a fingerprint is a unique invariant across
document variants.
A document D can be represented by multiple fingerprints; let
us denote this relationship as D ↔ { F1, F2, …, Fn}. For any
document D from the document set S in Definition 1, we can
assign a unique document ID, establishing a mapping between the
ID and the fingerprints, likewise denoted ID ↔ { F1, F2, …, Fn}.
This is reminiscent of the keyword-based search problem, since
we can index the relationship ID ↔ { F1, F2, …, Fn} into index
files by treating the fingerprints as keywords. We can restate the
NDDD problem of Definition 1 with the following model,
supported by two procedures, indexer and searcher.
NDDD Model: Assume we have two functions: (a) a fingerprint
generation function GEN; (b) a document similarity function
SIM. The NDDD problem is then reduced to a fingerprint-based
indexing and searching problem:
 Indexer: Given a set of documents S, each document
is assigned a unique ID. We extract multiple
fingerprints { F1, F2, …, Fn} from each document D
with the function GEN. The indexer indexes them
together with the document ID, i.e., ID ↔ { F1, F2,
…, Fn}. The indexing results are saved into index
files.
 Searcher: For any query document d and percentile
X%, we extract fingerprints { f1, f2, …, fn} from d
with the function GEN. The searcher uses them to
retrieve relevant document IDs from the index files:
if a reference document shares any of { f1, f2, …,
fn}, its ID is retrieved, and with the ID the reference
document D itself. Then we calculate SIM(d,D) to
measure the similarity. There may be multiple
reference documents retrieved; we calculate the
similarity for all of them and rank the results in
descending order of similarity.
With the model above, the NDDD problem is decomposed into
three independent sub-problems.
Three Sub-Problems:
1. Fingerprint generation --- Generate multiple
fingerprints from a given document D by a fingerprint
generation function GEN(D).
2. Similarity measurement --- Calculate the similarity
between two documents d and D by the similarity
function SIM(d,D).
3. Indexing/Searching --- The indexer indexes each
document ID with its fingerprints { F1, F2, …, Fn}. The
searcher retrieves document IDs against the indices with
given fingerprints. This is similar to a keyword-based
search engine such as Google or Lucene.
One can use a general search engine framework or even a
relational database system to solve the third problem. Therefore,
we will propose algorithmic solutions to the first and second
problems only.
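As a sketch of how the third sub-problem maps onto a generic
inverted index, the following Python fragment indexes fingerprints
exactly as a keyword engine indexes terms. GEN and SIM are
assumed here; the two algorithms of the next section construct them:

from collections import defaultdict

index = defaultdict(set)    # inverted index: fingerprint -> set of document IDs
docs = {}                   # document ID -> normalized text

def indexer(doc_id, text, GEN):
    docs[doc_id] = text
    for fp in GEN(text):                    # ID <-> { F1, ..., Fn }
        index[fp].add(doc_id)

def searcher(query, threshold, GEN, SIM):
    candidates = set()
    for fp in GEN(query):                   # any shared fingerprint retrieves the ID
        candidates |= index.get(fp, set())
    scored = ((doc_id, SIM(query, docs[doc_id])) for doc_id in candidates)
    hits = [(doc_id, sim) for doc_id, sim in scored if sim >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)   # rank by similarity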
3. Algorithms
This section provides algorithms to construct the two functions
GEN and SIM respectively.
The function GEN extracts fingerprints from a given
normalized document. A fingerprint is an invariant of the text
that can survive document changes. What can survive changes?
Changes of text are caused by document modification with
editing operations such as insertion, deletion and copy/paste.
However, many pieces of the original remain in the new text;
these unchanged pieces merely shift their relative positions. If
we can identify some unchanged text pieces, we can use them as
text invariants to generate fingerprints. How do we locate these
unchanged yet shifting pieces?
First of all, we use text model 2 to represent a text as a string of
UTF-8 characters, denoted T = c1 c2… cL, where L is the string
length. Hence, we can discuss strings of characters instead of
texts or documents. Secondly, we introduce the concept of
“anchoring points”, which is briefly discussed in [1] without
implementation suggestions. An anchoring point is a character in
the string that stays the same relative to its neighborhood when
the string changes. One can use the neighborhood around an
anchoring point to generate a fingerprint with a good hash
function H. With multiple anchoring points, we have multiple
fingerprints for the document. Two issues must be solved. The
first is how to select robust anchoring points, since the string can
change. The second is that there may be too many anchoring
points, so that too many fingerprints would be generated from a
given string. We propose Algorithm 1 to construct the function
GEN, which handles both issues.
Definition 6: We need some notations for Algorithm 1:
 The alphabet A of UTF-8 characters appearing in the
string.
 Two numbers M and N that select the most robust
anchoring points for generating fingerprints. M is
fixed for any text string while N is selected according to
the string size. Table 1 shows an example of how M and
N are configured.
 The width W of anchoring neighborhoods.
 A hash function H that generates a fingerprint from a
sub-string of size W. There is no specific requirement
for the hash function.
 A character score function defined as
score(C) = n · (Pn − P1) / Σ1≤i<n (Pi+1 − Pi)²
where P1 ≤ P2 ≤ … ≤ Pn are the occurrence offsets of
the character C in the string.
Table 1: M and N are configured accordingly
Text Size Range M N
0-10K 4 128
10-20K 4 256
20-30K 4 256
30-50K 4 512
50-70K 4 1024
70-80K 4 1024
80-100K 4 1024
100-500K 4 1024
> 500K 4 1024
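As a worked illustration of the score function (under our reading
of the formula above), a character whose occurrences are spread
evenly scores higher than one whose occurrences are clustered,
for the same frequency and span:

def char_score(offsets):
    # score(C) = n * (Pn - P1) / sum of squared gaps, per Definition 6
    n = len(offsets)
    if n < 2:
        return 0.0
    span = offsets[-1] - offsets[0]                                  # Pn - P1
    gaps_sq = sum((offsets[i + 1] - offsets[i]) ** 2 for i in range(n - 1))
    return n * span / gaps_sq

print(char_score([10, 20, 30, 40]))   # evenly spread: 4*30/300 = 0.4
print(char_score([10, 11, 12, 40]))   # clustered: 4*30/786 ≈ 0.15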
Algorithm 1:
Input: String T as c1 c2… cL
Output: Fingerprint set.
Procedure:
Step 1: Select the number N from Table 1 according to the string
length L.
Step 2: Run through the string T while counting the occurrences
of each unique UTF-8 character in A and saving the offsets.
Step 3: For each C ∈ A, the character C has one or multiple
occurrences in T, with offsets denoted P1, P2,… Pn. We use the
score function to calculate the score of C.
Step 4: Pick the M characters from A that have the highest scores,
i.e., B = { C1, C2,… CM }.
Step 5: For each C ∈ B, do steps 6 to 9.
Step 6: For each occurrence of C in T, we have an anchoring
neighborhood which has C as its center. Each neighborhood is a
sub-string of size W. We denote these neighborhoods as S1, S2,…
Sn with respect to the occurrence offsets P1, P2, … Pn .
Step 7: Sort the list of sub-strings S1, S2,… Sn . Without loss of
generality, we can still denote the sorted list as S1, S2,… Sn .
Step 8: Select the first K items from the sorted list, where K =
MIN(N, n). They are {S1, S2,… SK }.
Step 9: Apply hash function H to {S1, S2,… SK} to generate K
fingerprints and add them to the fingerprint set.
The algorithm is stated in terms of text model 2; however, it
works for the other two models as well if “character” is replaced
by “token” or “byte”. The idea of the algorithm is
straightforward. First, it selects the most significant characters
from the alphabet of the input string, using a scoring function to
measure significance. When calculating the score of a given
character, we consider both the frequency and the distribution of
the character across the string; this is reflected in the score
function. Secondly, for each picked character, it chooses robust
anchoring points by sorting the neighborhoods and picking the top
items from the list. Sorting is a mechanism that turns randomness
into order. The result is a set of at most M*N fingerprints. For
example, when the normalized text size is less than 10KB, which
is typical in the real world, we get at most 4*128=512 fingerprints.
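A compact sketch of Algorithm 1 follows, under simplifying
assumptions: N is passed in directly instead of being looked up in
Table 1, Python's built-in hash stands in for H (Definition 6 places
no specific requirement on it), and char_score is the function from
the earlier snippet:

def GEN(T, M=4, N=128, W=8):
    # Steps 2-3: collect the occurrence offsets of every character and score it.
    offsets = {}
    for i, c in enumerate(T):
        offsets.setdefault(c, []).append(i)
    scores = {c: char_score(p) for c, p in offsets.items()}
    # Step 4: the M highest-scoring characters form the set B.
    B = sorted(scores, key=scores.get, reverse=True)[:M]
    fingerprints = set()
    for c in B:                                       # steps 5-9
        # Step 6: a width-W neighborhood centered on each occurrence;
        # windows that would fall off either end of T are simply skipped.
        half = W // 2
        windows = [T[p - half: p - half + W] for p in offsets[c]
                   if p - half >= 0 and p - half + W <= len(T)]
        # Steps 7-9: sort, keep the first K = min(N, n) windows, hash them.
        for s in sorted(windows)[:N]:
            fingerprints.add(hash(s))
    return fingerprints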
The function SIM calculates the similarity between two
normalized documents. We use text model 2 to represent a
document, so that we actually compare two strings of characters.
What does similarity mean for strings? If there are common sub-
strings between two strings and their total length is long enough,
we consider the strings similar to each other. We also expect
similarity to be measured as a percentile. We propose Algorithm 2
to calculate the similarities between one given document and a set
of reference documents. The main idea is to identify common
sub-strings with a hash-based greedy matching strategy.
Definition 7: We need some notations to present Algorithm 2:
 A number M that defines the minimum length of
common sub-strings. Common sub-strings must have a
minimum length to avoid triviality; otherwise a single
character could count as a common sub-string.
 A hash function H that generates a hash value from a
sub-string of size M. There is no specific requirement
for the hash function; however, due to the nature of the
algorithm, a rolling hash function is recommended for
good performance.
 A hash table HT, with chaining capability to resolve
collisions.
 For a string T, its substring can be denoted as T[s,…,e]
where s and e are the starting and ending offsets.
 The algorithm is stated with text model 2; however, it
can be applied to the other two models as well.
Algorithm 2:
Input: Query string d, and multiple reference strings {D1, D2,
…, Dm}
Output: The similarities {SIM1, SIM2, …, SIMm }
Procedure:
Step 1: Create the hash table HT, sized according to L, the length
of the input string d.
Step 2: For j = 0 to L-M:
 Apply the hash function H to the sub-string d[j,…,j+M-
1] of d to calculate the hash value h.
 Store the offset j in HT[h] or its chained linked list.
Step 3: For each k in {1,2,…,m}, do steps 4 to 12.
Step 4: Let Lk be the length of Dk; set P = 0 and SUM = 0.
Step 5: Let h = H(Dk[P,…,P+M-1]).
Step 6: If HT[h] is empty, there is no sub-string match at offset P;
let P = P+1 and go to step 11.
Step 7: For each sub-string offset s stored in the chained linked
list at HT[h], do step 8.
Step 8: If d[s,…,s+M-1] ≠ Dk[P,…,P+M-1], set V(s) = 0;
otherwise, extend the two equal sub-strings forward with as many
common characters as possible, arriving at the maximum common
sub-string length V(s).
Step 9: Let V be the largest of all V(s) that we get from step 8.
Step 10: If V>0, let SUM = SUM + V, P = P + V, otherwise let
P = P + 1
Step 11: If P < Lk-M, go to Step 5
Step 12: Let SIMk = SUM / Lk
Algorithm 2 calculates all of SIM(d,D1), SIM(d,D2), …,
SIM(d,Dm) in one pass. Steps 1 and 2 pre-process d, and steps 4
to 12 calculate each individual SIM(d,Dj) in turn.
For the normalized query document d and a reference document
D, Algorithm 2 identifies a set of common sub-strings and sums
up all their lengths as SUM. The similarity SIM is then measured
as SUM / Length(D). One may ask why the length of d is not
included in the similarity. This is because we care more about
how much of D is duplicated in the query document d than about
how much of d consists of content from D. One can certainly
design another formula that calculates the similarity from SUM
and both lengths. Finally, we need to make sure SIM measures
similarity meaningfully. This is guaranteed by the following
theorem.
Theorem 1: The function SIM defined by algorithm 2 satisfies
the following properties for two normalized documents d and D:
1. 0 ≤ SIM(d,D) ≤ 1.
2. If d and D are the same document, SIM(d,D) = 1.
3. If d and D have no common sub-strings at all,
SIM(d,D) = 0.
Proof: From steps 4 to 11 of Algorithm 2, we have 0 ≤ SUM ≤
Length(D), which proves 0 ≤ SIM(d,D) ≤ 1. If d = D, it is not
difficult to prove that SUM = Length(D), i.e., SIM(d,D) = 1. The
last assertion is trivial.
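A sketch of Algorithm 2 follows. It takes two shortcuts relative to
the stated algorithm: a Python dict keyed by the M-gram itself
stands in for the chained hash table HT (which makes the equality
check of step 8 implicit), and each window is hashed directly
rather than with the recommended rolling hash:

def SIM_all(d, refs, M=4):
    L = len(d)
    HT = {}
    for j in range(L - M + 1):          # steps 1-2: index every M-gram of d
        HT.setdefault(d[j:j + M], []).append(j)
    sims = []
    for D in refs:                      # steps 3-12, one reference at a time
        Lk, P, SUM = len(D), 0, 0
        while P <= Lk - M:
            best = 0
            for s in HT.get(D[P:P + M], []):   # steps 5-9: greedy extension
                v = M
                while s + v < L and P + v < Lk and d[s + v] == D[P + v]:
                    v += 1
                best = max(best, v)
            if best > 0:                # step 10: consume the matched block
                SUM += best
                P += best
            else:
                P += 1
        sims.append(SUM / Lk if Lk else 0.0)   # step 12: SIM = SUM / Length(D)
    return sims

print(SIM_all("abcdefghij", ["abcdefghij", "abcdefXYZj"]))   # [1.0, 0.6]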
4. Asymmetric Fingerprint Generation
For some special applications such as DLP (data loss prevention)
endpoint products, the indexed fingerprint files created on servers
must be delivered to remote machines that host searchers. It is
necessary to use fewer fingerprints to represent a document in
order to save network bandwidth and cost. In Algorithm 1, there
are two important parameters for generating the fingerprints: the
numbers M and N, where M is fixed and N is configured
according to the text size as defined by a table.
Based on recent experimental research, we can reduce the
number of fingerprints while keeping almost the same recall rate
if we apply a smaller number N to the function GEN at the
indexer side while the N at the searcher side is kept the same. In
other words, we can solve the NDDD problem even if the indexer
generates a much smaller number of fingerprints than the
searcher. Table 2 is an example of defining different N's for the
indexer and the searcher.
Table 2 : Different N for Indexer and Searcher
Text Size Range M N for Indexer N for Searcher
0-10K 4 8 128
10-20K 4 16 256
20-30K 4 32 256
30-50K 4 32 512
50-70K 4 64 1024
70-80K 4 128 1024
80-100K 4 256 1024
100-500K 4 512 1024
> 500K 4 1024 1024
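As a sketch, each side could select its N from Table 2 as follows
(M stays fixed at 4 on both sides; the GEN of the earlier snippet
would then be called with the side-specific N):

# Table 2 as a lookup: (upper size bound in KB, N for indexer, N for searcher).
TABLE2 = [(10, 8, 128), (20, 16, 256), (30, 32, 256), (50, 32, 512),
          (70, 64, 1024), (80, 128, 1024), (100, 256, 1024),
          (500, 512, 1024), (float("inf"), 1024, 1024)]

def pick_N(text_size_kb, side):
    for bound, n_indexer, n_searcher in TABLE2:
        if text_size_kb <= bound:
            return n_indexer if side == "indexer" else n_searcher

# Asymmetric generation: fewer fingerprints registered than queried with.
# fps_indexed = GEN(T, M=4, N=pick_N(size_kb, "indexer"))
# fps_query   = GEN(T, M=4, N=pick_N(size_kb, "searcher"))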
This method is referred to as asymmetric fingerprint generation,
while Algorithm 1 as originally configured is symmetric
fingerprint generation. Its capability to keep almost the same
recall rate is supported by the following theoretical results.
Definition 8: Let us assume M is a constant. For any
normalized document T, let us denote by S(T, N) the set of
fingerprints extracted from T with the number N.
Theorem 2: Let T be any normalized document, and n and m be
two positive integers. If n < m, we have S(T, n) ⊆ S(T, m),
i.e., the set S(T, n) is a subset of S(T, m).
Proof: This is a natural outcome of step 8 of Algorithm 1.
Theorem 3: Let D and d be two versions of the same normalized
document, and n and m be two positive integers. If n < m, we have
S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m).
Proof: Since n < m, we have S(d, n) ⊆ S(d, m) and S(D, n)
⊆ S(D, m) by Theorem 2. Therefore, S(D, n) ∩ S(d, n) ⊆ S(D, n)
∩ S(d, m) and S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). Together,
S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). This
completes the proof.
Theorem 3 implies that the recall rate of asymmetric fingerprint
generation lies between the recall rates of symmetric fingerprint
generation with the smaller and the larger numbers of
fingerprints. As a matter of fact, the experimental data shows it is
closer to the latter, while generating far fewer fingerprints at the
indexer.
5. Experiments
In this section, we report a data experiment implemented with
the asymmetric architecture of fingerprint generation defined by
the parameters of Table 2. Both indexer and searcher reside on a
server with Windows Server 2003, an Intel Xeon E5405 @
2.0GHz, and 8GB of RAM.
We prepared experimental data sets as follows:
 Normalized document for indexing:
 Corpus 1: this set consists of 1 million plain text
files in UTF-8 encoding. Let us denote corpus 1
as S1.
 Corpus 2: this set consists of 2115 plain text
files in many different languages and with
different file sizes. They are totally irrelevant to
the files in S1. Let us denote corpus 2 as S2.
 Let S = S1 ∪ S2. All files in S are registered for
fingerprint generation and indexing.
 Normalized documents for querying:
 Corpus 3: this set consists of 6*6*2115 = 76140
files. This corpus consists of documents made
from S2 with 6 editing operations and 6 levels of
changes expressed as percentiles. Corpus 3 is
used for the querying experiment.
 The 6 levels of changes are defined as 5%, 10%,
20%, 30%, 40% and 50%. For example, level 1
means we alter 5% of the content of an original
file.
 The 6 editing operations are ADD, ADH, ADE,
DEL, CHG and MOV.
The 6 editing operations can be defined specifically as follows:
 ADD: add a randomly generated block of chars at a
random position in the file.
 ADH: add a randomly generated block of chars at a
random position in the file. Also add a randomly
generated block of chars with block size randomly
selected between 50-100 at the beginning of the file.
 ADE: add a randomly generated block of chars at a
randomly selected position in the file. Also add a
randomly generated block of chars, with block size
randomly selected between 50-100, at the end of the
file.
 DEL: delete a block of chars from the file. The start
point of deletion is randomly selected.
 CHG : replace a randomly selected block of chars in the
file with a randomly generated block of chars.
 MOV: move a randomly selected block of chars in the
file to a random position in the file.
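The corpus construction details are specific to our tooling, but a
sketch of the six editing operations at a given change level might
look like this (block content, sizes and positions chosen randomly,
as described above):

import random
import string

def random_block(size):
    return "".join(random.choice(string.ascii_letters) for _ in range(size))

def apply_edit(text, op, level):
    # level is the change level, e.g. 0.05 alters about 5% of the content
    n = max(1, int(len(text) * level))
    p = random.randrange(len(text))
    if op == "ADD":
        return text[:p] + random_block(n) + text[p:]
    if op in ("ADH", "ADE"):            # ADD plus a 50-100 char block at an end
        extra = random_block(random.randint(50, 100))
        body = text[:p] + random_block(n) + text[p:]
        return extra + body if op == "ADH" else body + extra
    if op == "DEL":
        return text[:p] + text[p + n:]
    if op == "CHG":
        return text[:p] + random_block(n) + text[p + n:]
    if op == "MOV":
        block, rest = text[p:p + n], text[:p] + text[p + n:]
        q = random.randrange(len(rest) + 1)
        return rest[:q] + block + rest[q:]
    raise ValueError(op)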
Table 3: Querying time in seconds
Change Level   Total Files   Total Time (s)   Avg. Sec per File
5% 12690 1727 0.136
10% 12690 1776 0.139
20% 12690 1680 0.132
30% 12690 1709 0.134
40% 12690 1699 0.133
50% 12690 1649 0.129
Table 4: Numbers of files matched at each change level
Change Level   ADD   ADH   ADE   DEL   CHG   MOV
5% 2080 2079 2082 2074 2071 2055
10% 2079 2069 2079 2073 2067 2055
20% 2045 2047 2055 2063 2029 2046
30% 2027 2019 2023 2058 1979 2041
40% 1993 2000 1998 2021 1924 2049
50% 1969 1977 1978 2020 1894 2049
Table 5: Total recall rate at each change level
Change level Total Files Recall Rate
5% 12441 98.03%
10% 12422 97.88%
20% 12285 96.80%
30% 12147 95.72%
40% 11985 94.44%
50% 11887 93.67%
Figure 1: Recall vs change level for different operations.
Experiment steps:
1. Fingerprint and index all the files in S.
2. Set X% = 20%. Each file from corpus 3 is used as a
query document for the NDDD problem. Recall and
precision are measured from the query results, and
querying speed is measured in seconds.
The experimental results are shown in Tables 3-5 and Figure 1.
Table 3 shows the performance of executing searches for
6*2115=12690 query files at each change level, giving the total
time and the average time per file. For example, at change level
5% the total time is 1727 seconds, i.e., 0.136 seconds per file on
average. This is quite fast considering that the set S has more
than 1 million fingerprinted documents.
Table 4 shows the number of successful queries for each change
level and editing operation. For example, at change level 5% with
the ADD operation, 2080 of the 2115 query files were matched
successfully, i.e., a 98.3% recall. Figure 1 illustrates the recall
rate vs change level for each operation.
Table 5 shows the recall rates for all change levels. As the
document changes increase, the recall rate drops. The worst recall
rate is 93.67% when the change is around 50%.
We should mention that there were no false positives among all
76140 query files. This is a natural outcome for the following
reasons:
 GEN and SIM are two string matching functions that
are constructed independently.
 Even when fingerprint matching yields false positives,
the threshold X% filters them out.
6. Conclusion
This article has examined and solved the problem of near
duplicate document detection. Our study can be summarized as
follows:
 A formal definition of the NDDD problem.
 Text models are discussed for effective representation;
a language-independent text model is selected to
represent the documents.
 An NDDD model is proposed that refines the problem
definition and decomposes the NDDD problem into
three separate sub-problems that can be solved
independently.
 Algorithms are introduced to extract document
fingerprints and calculate document similarity.
 An architecture of asymmetric fingerprint generation is
introduced to reduce the number of fingerprints for
some special applications.
 The data experiment shows that our algorithmic solution
has good performance, near-zero false positives and a
high recall rate even when documents change by up
to 50%.
The problem definition and algorithmic solution in this article
have advantages over other approaches. There are near-zero false
positives, since the similarity calculation is independent of
fingerprint generation. The recall rate is good because the
fingerprints are robust under moderate document changes.
Finally, the solution is language independent: we can apply it to
documents written in any language, and even to documents
written in multiple languages.
7. REFERENCES
[1] Manber, U. 1994. Finding Similar Files in a Large File
System. Proceedings of the USENIX Winter 1994 Technical
Conference, San Francisco, California.
[2] Shivakumar, N. and Garcia Molina, H. 1999. Finding near-
replicas of documents on the web. Lecture Notes in Computer
Science, Springer Berlin / Heidelberg, 1590, 204-212.
[3] Lopresti, D. P. 1999. Models and Algorithms for Duplicate
Document Detection. Proceedings of the Fifth International
Conference on Document Analysis and Recognition, Bangalore,
India, 297-300, September, 1999
[4] Broder, A. Z. 2000. Identifying and Filtering Near-Duplicate
Documents. Proceedings of the 11th Annual Symposium on
Combinatorial Pattern Matching, UK. Springer-Verlag, pp. 1-10,
2000.
[5] Campbell, D. M., Chen, W. R. and Smith, R. D. 2000. Copy
detection systems for digital documents. Proceedings of
Advances in Digital Libraries, pp. 78-88, 2000.
[6] Ignatov, D. I. and Jánosi-Rancz, K. T. 2009. Towards a
framework for near-duplicate detection in a document collections
based on closed sets of attributes. ACTA Univ. Sapientiae,
Informatica, 1, 2 (2009), 215-233
[7] Kumar, J.P. and Govindarajulu, P. 2009. Duplicate and Near
Duplicate Documents Detection: A Review. European Journal of
Scientific Research, 32, 4 (2009), 514-527.
[8] Ren, L.,Tan, D., Huang, F., Huang S. and Dong, A. 2009.
Matching engine with signature generation. US patent 7,516,130.
[9] Ren, L., Huang S, Huang, F., Dong, A. and Tan, D. 2010.
Matching engine for querying relevant documents . US patent
7,747,642.
[10] Ren, L., Huang S., Huang, F. and Lin, Y. 2010. Document
matching engine using asymmetric signature generation. US
patent 7,860,853.

More Related Content

What's hot

Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL DevelopmentProposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Jorge Barreto
 

What's hot (20)

O01741103108
O01741103108O01741103108
O01741103108
 
PB ITC
PB ITCPB ITC
PB ITC
 
Ceis 3
Ceis 3Ceis 3
Ceis 3
 
Text independent speaker identification system using average pitch and forman...
Text independent speaker identification system using average pitch and forman...Text independent speaker identification system using average pitch and forman...
Text independent speaker identification system using average pitch and forman...
 
A Comparative Result Analysis of Text Based Steganographic Approaches
A Comparative Result Analysis of Text Based Steganographic Approaches A Comparative Result Analysis of Text Based Steganographic Approaches
A Comparative Result Analysis of Text Based Steganographic Approaches
 
Introduction to ‘C’ Language
Introduction to ‘C’ LanguageIntroduction to ‘C’ Language
Introduction to ‘C’ Language
 
The recognition system of sentential
The recognition system of sententialThe recognition system of sentential
The recognition system of sentential
 
Relational Database Design - Lecture 4 - Introduction to Databases (1007156ANR)
Relational Database Design - Lecture 4 - Introduction to Databases (1007156ANR)Relational Database Design - Lecture 4 - Introduction to Databases (1007156ANR)
Relational Database Design - Lecture 4 - Introduction to Databases (1007156ANR)
 
Developing an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyDeveloping an architecture for translation engine using ontology
Developing an architecture for translation engine using ontology
 
Data Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn HubData Types | CS8251- Programming in c | Learn Hub
Data Types | CS8251- Programming in c | Learn Hub
 
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL DevelopmentProposal of an Ontology Applied to Technical Debt on PL/SQL Development
Proposal of an Ontology Applied to Technical Debt on PL/SQL Development
 
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech DisambiguationElena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
Pattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to DatabasePattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to Database
 
C program structure
C program structureC program structure
C program structure
 
Being Professional
Being ProfessionalBeing Professional
Being Professional
 
Using ontology based context in the
Using ontology based context in theUsing ontology based context in the
Using ontology based context in the
 
ijcai11
ijcai11ijcai11
ijcai11
 
A syntactic analysis model for vietnamese questions in v dlg~tabl system
A syntactic analysis model for vietnamese questions in v dlg~tabl systemA syntactic analysis model for vietnamese questions in v dlg~tabl system
A syntactic analysis model for vietnamese questions in v dlg~tabl system
 
Identifying the semantic relations on
Identifying the semantic relations onIdentifying the semantic relations on
Identifying the semantic relations on
 

Viewers also liked (7)

novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
Ufone
Ufone Ufone
Ufone
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Ufone
UfoneUfone
Ufone
 
Ufone Presentation
Ufone PresentationUfone Presentation
Ufone Presentation
 
Template transfer or change of ownership – no objection letter - mobile number
Template   transfer or change of ownership – no objection letter - mobile numberTemplate   transfer or change of ownership – no objection letter - mobile number
Template transfer or change of ownership – no objection letter - mobile number
 

Similar to Near Duplicate Document Detection: Mathematical Modeling and Algorithms

semantic text doc clustering
semantic text doc clusteringsemantic text doc clustering
semantic text doc clustering
Souvik Roy
 
Computing with Directed Labeled Graphs
Computing with Directed Labeled GraphsComputing with Directed Labeled Graphs
Computing with Directed Labeled Graphs
Marko Rodriguez
 
Review of research on devnagari character recognition
Review of research on devnagari character recognitionReview of research on devnagari character recognition
Review of research on devnagari character recognition
Vikas Dongre
 

Similar to Near Duplicate Document Detection: Mathematical Modeling and Algorithms (20)

A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGA TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
IRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF MetricIRJET - Document Comparison based on TF-IDF Metric
IRJET - Document Comparison based on TF-IDF Metric
 
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSCONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS
 
semantic text doc clustering
semantic text doc clusteringsemantic text doc clustering
semantic text doc clustering
 
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORKSENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
SENTIMENT ANALYSIS IN MYANMAR LANGUAGE USING CONVOLUTIONAL LSTM NEURAL NETWORK
 
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural NetworkSentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
Sentiment Analysis In Myanmar Language Using Convolutional Lstm Neural Network
 
Computing with Directed Labeled Graphs
Computing with Directed Labeled GraphsComputing with Directed Labeled Graphs
Computing with Directed Labeled Graphs
 
Group4 doc
Group4 docGroup4 doc
Group4 doc
 
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPSANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
 
International Journal of Computer Science, Engineering and Applications (IJCSEA)
International Journal of Computer Science, Engineering and Applications (IJCSEA)International Journal of Computer Science, Engineering and Applications (IJCSEA)
International Journal of Computer Science, Engineering and Applications (IJCSEA)
 
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPSANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
ANOMALY DETECTION IN ARABIC TEXTS USING NGRAMS AND SELF ORGANIZING MAPS
 
Cs8391 notes rejinpaul
Cs8391 notes rejinpaulCs8391 notes rejinpaul
Cs8391 notes rejinpaul
 
Python reading and writing files
Python reading and writing filesPython reading and writing files
Python reading and writing files
 
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGFEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
 
Review of research on devnagari character recognition
Review of research on devnagari character recognitionReview of research on devnagari character recognition
Review of research on devnagari character recognition
 
Ju3517011704
Ju3517011704Ju3517011704
Ju3517011704
 
Text Segmentation for Online Subjective Examination using Machine Learning
Text Segmentation for Online Subjective Examination using Machine   LearningText Segmentation for Online Subjective Examination using Machine   Learning
Text Segmentation for Online Subjective Examination using Machine Learning
 
Text Document Classification System
Text Document Classification SystemText Document Classification System
Text Document Classification System
 

More from Liwei Ren任力偉

More from Liwei Ren任力偉 (20)

信息安全领域里的创新和机遇
信息安全领域里的创新和机遇信息安全领域里的创新和机遇
信息安全领域里的创新和机遇
 
企业安全市场综述
企业安全市场综述 企业安全市场综述
企业安全市场综述
 
Introduction to Deep Neural Network
Introduction to Deep Neural NetworkIntroduction to Deep Neural Network
Introduction to Deep Neural Network
 
聊一聊大明朝的火器
聊一聊大明朝的火器聊一聊大明朝的火器
聊一聊大明朝的火器
 
防火牆們的故事
防火牆們的故事防火牆們的故事
防火牆們的故事
 
移动互联网时代下创新的思维
移动互联网时代下创新的思维移动互联网时代下创新的思维
移动互联网时代下创新的思维
 
硅谷的那点事儿
硅谷的那点事儿硅谷的那点事儿
硅谷的那点事儿
 
非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究非齐次特征值问题解存在性研究
非齐次特征值问题解存在性研究
 
世纪猜想
世纪猜想世纪猜想
世纪猜想
 
Arm the World with SPN based Security
Arm the World with SPN based SecurityArm the World with SPN based Security
Arm the World with SPN based Security
 
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching ProblemExtending Boyer-Moore Algorithm to an Abstract String Matching Problem
Extending Boyer-Moore Algorithm to an Abstract String Matching Problem
 
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
Monotonicity of Phaselocked Solutions in Chains and Arrays of Nearest-Neighbo...
 
Phase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillatorsPhase locking in chains of multiple-coupled oscillators
Phase locking in chains of multiple-coupled oscillators
 
On existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problemOn existence of the solution of inhomogeneous eigenvalue problem
On existence of the solution of inhomogeneous eigenvalue problem
 
Math stories
Math storiesMath stories
Math stories
 
Binary Similarity : Theory, Algorithms and Tool Evaluation
Binary Similarity :  Theory, Algorithms and  Tool EvaluationBinary Similarity :  Theory, Algorithms and  Tool Evaluation
Binary Similarity : Theory, Algorithms and Tool Evaluation
 
IoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and SolutionsIoT Security: Problems, Challenges and Solutions
IoT Security: Problems, Challenges and Solutions
 
Taxonomy of Differential Compression
Taxonomy of Differential CompressionTaxonomy of Differential Compression
Taxonomy of Differential Compression
 
Bytewise Approximate Match: Theory, Algorithms and Applications
Bytewise Approximate Match:  Theory, Algorithms and ApplicationsBytewise Approximate Match:  Theory, Algorithms and Applications
Bytewise Approximate Match: Theory, Algorithms and Applications
 
Overview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) TechnologyOverview of Data Loss Prevention (DLP) Technology
Overview of Data Loss Prevention (DLP) Technology
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Navigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern EnterpriseNavigating Identity and Access Management in the Modern Enterprise
Navigating Identity and Access Management in the Modern Enterprise
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 

Near Duplicate Document Detection: Mathematical Modeling and Algorithms

  • 1. Near Duplicate Document Detection: Mathematical Modeling and Algorithms Liwei Ren Trend Micro 10101 North De Anza Boulevard Cupertino, CA 95014, USA 1-408-850-1048 liwei_ren@trendmicro.com Qiuer Xu Trend Micro Building B, Soho International Plaza Nanjing, 210012, P.R. China 86-25-52386123 fallson_xu@trendmicro.com.cn ABSTRACT Near-duplicate document detection is a well-known problem in the area of information retrieval. It is an important problem to be solved for many applications in IT industry. It has been studied with profound research literatures. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints and document similarity. With these concepts, the problem can be transformed into keyword like search problem with results ranked by document similarity. There are two major techniques. The first technique is to extract robust and unique fingerprints from a document. The second one is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution. Categories and Subject Descriptors H.3.3: Information Search and Retrieval – information filtering, retrieval models, search process . General Terms Algorithms, Experimentation. Keywords Duplicate Document, Near Duplicate Detection, Document Fingerprint, Document Similarity, Retrieval Model, Information Retrieval, Asymmetric Architecture 1. INTRODUCTION Near duplicate document detection (NDDD) is a well-known problem in the area of information retrieval. It is defined to identify whether a given document is a near duplicate of one or more documents from a well-defined document set. This problem can be found in many technical areas such as crawling and indexing optimization of web search engines, copy detection systems, email archival, spam filtering, and data leak prevention systems. There are profound research literatures discussing this subject with numerous use cases and solutions [1-6]. Recently, Kumar et al. [7] provided a thorough review of the most significant works in decades that covers more than 60 papers. We organize the following sections in the fashion of problem definition, mathematical modeling and algorithmic solutions. We will introduce formal problem definition to describe the problem followed by three text models that are used to present documents. One text model is selected for constructing algorithmic solution. By introducing some concepts like document fingerprint and document similarity, the problem can be decomposed into three independent problems: (a) document fingerprint extraction; (b) document similarity calculation; (c) fingerprint based search engine. Two algorithms are constructed to extract fingerprints from documents and measure the similarity between documents. One can use utility of keyword based search engine for solving the problem (c). Finally, an architecture of asymmetric fingerprint generation is proposed to reduce the number of fingerprints. Less number of fingerprints is critical for the success of some special applications such as data leak prevention systems. 2. Problem Definition and Modeling The problem proposed in the introduction section is not well- defined from the perfective of practical implementation. In practice, we need a quantitative measurement of how “near duplicated” two documents are. We can need a more rigorous definition for NDDD. 
Definition 1 : Assume that we have a set of documents S. For any given document d and a percentile X% , one needs to identify multiple documents D1, D2, …, Dm from S such that SIM(d, Dj) ≥ X% for 1 ≤j ≤m, where SIM is a well-defined function to calculate the similarity of two documents. The result {D1, D2, …, Dm} is shown in the descending order of percentiles. There are several challenges to solve this problem: (a) The document set may be huge. It could be a scale in multiples of millions or even billions of documents. One certainly cannot compare d with each document of S to calculate the similarity. How to efficiently identify the reference document D from a huge document set ? (b) How to construct the similarity function SIM? Before we are able to answer the questions, we need to propose text models to present a document. A text model allows us to exclude irrelevant textual elements so that we can focus on the essence .
  • 2. Documents can be in any document format such as Word, Power Point, Excel, PDF, Post Script and many others. The individual words or sentences can be in different styles (bold, italic, underline) and with varieties of fonts. These are not important textual elements when we discuss “near duplicate”. Fundamentally, we are more interested in the textual content that carries semantic significance. A document can be written in any writing language. The texts in different languages can be encoded differently, for example, English texts can be encoded in ASCII, Chinese in GB, and Japanese in SJIS. However, all languages can be encoded in the UTF-8 standard which is able to present all languages in one text. For documents in English or any western language, most authors view a text as a string of words [2-6]. Words can be extracted from texts with tokenization technique that uses spaces to separate words (or tokens) in sentences. Some languages such as Chinese and Japanese do not use spaces between words. In those eastern languages, a sentence is a string of characters without spaces between them. All characters of different languages can be encoded in UTF-8 characters. As such, a text in all languages can be considered as a string of UTF-8 characters. Depending on the languages, each UTF-8 character consists of one or multiple bytes, for example, a Chinese character typically consists of three bytes while an ASCII character is one byte. Therefore, one can view a text also as a string of bytes if we convert them from its original encoding into UTF-8. Definition 2: We have three text models to present a document:  Model 1: A text is a string of tokens ( or a sequence of tokens)  Model 2: A text is a string of UTF-8 characters.  Model 3: A text is a string of bytes when the text is encoded in UTF-8.  In summary, a text is a string of basic textual items where a basic textual unit item means a token, UTF- 8 character or byte. Besides three models, there exist other text models that basic textual units are sentences [5], textual lines, or even pages. Those models are not interests to the authors of this article. Numerous articles study NDDD using the text model 1. While this model is good enough to study NDDD for documents in western languages, it has obstacles when dealing with non- western languages. Model 1 needs tokenization techniques. Tokenization is a taunting task for processing documents in Chinese and Japanese, especially. There are few works adopting the text model 2 and 3 in academic world. Manber [1] discussed duplicate detection in terms of pair wise file matching of ASCII files. This is a special case of the model 2 and 3. In contrast, it has become a common practice in industry to apply text model 2 or 3 to many document management problems such as DLP [8-10] , spam filtering and e- Discovery. In this article, we use text model 2 to extract fingerprints from documents and calculate the similarity between two documents. Both text model 2 and 3 are language independent while model 1 is not. Therefore, the techniques developed in this article apply equally to documents in any languages, and even apply to a document written in multiple languages. 
Definition 3: Document normalization is a process consisting of three sub-processes applied sequentially: (a) converting a document in any format, such as Word, Excel or PDF, into a plain text encoded in UTF-8; (b) converting any plain text in another encoding into a plain text encoded in UTF-8; (c) removing trivial characters, such as white spaces, delimiters and control characters, from the UTF-8 text.

Definition 4: The result of document normalization is a string of UTF-8 characters that contains the most significant information of the original document. It is called a normalized text or normalized document.

There are many software tools available for document normalization. Without loss of generality, we consider all documents as normalized texts in the rest of this article unless specified otherwise.
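As a minimal sketch of sub-processes (b) and (c) of Definition 3 (ours; sub-process (a), format conversion, is assumed to be handled by an external extraction tool, and treating punctuation as trivial delimiter characters is our assumption):

    import unicodedata

    def normalize(raw: bytes, encoding: str = "utf-8") -> str:
        """Decode a plain text to UTF-8 and strip trivial characters
        (white space, delimiters, control characters) per Definition 3."""
        text = raw.decode(encoding, errors="ignore")
        # Drop separators (Z*), control/format characters (C*) and
        # punctuation (P*); keep letters, digits and symbols.
        return "".join(ch for ch in text
                       if unicodedata.category(ch)[0] not in ("Z", "C", "P"))

    # normalize("Hello, world!\n".encode("utf-8")) -> "Helloworld"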
It is now time to tackle the two challenges of Definition 1. To meet the first challenge, let us introduce the concept of a document fingerprint.

Definition 5: A document fingerprint is an integer or a binary string of fixed length. Fingerprints are generated from documents by a function GEN and have the following characteristics: (a) a document D has multiple fingerprints {F1, F2, …, Fn}, i.e., GEN(D) = {F1, F2, …, Fn}; (b) two irrelevant documents d and D do not share a common fingerprint, i.e., GEN(d) ∩ GEN(D) = ϕ; this is called uniqueness; (c) a fingerprint can survive moderate document changes, i.e., GEN(d) ∩ GEN(D) ≠ ϕ if d is a near-duplicate copy of D; this is called robustness. In summary, a fingerprint is a unique invariant across document variants.

A document D can thus be represented by its fingerprints; we denote this relationship as D ↔ {F1, F2, …, Fn}. For any document D from the document set S of Definition 1, we can assign a unique document ID, establishing a mapping between the ID and the fingerprints, denoted ID ↔ {F1, F2, …, Fn}. This is reminiscent of the keyword-based searching problem: we can index the relationship ID ↔ {F1, F2, …, Fn} into index files, treating the fingerprints as keywords. We can now present the NDDD problem of Definition 1 with the following model, supported by two procedures, an indexer and a searcher.

NDDD Model: Assume we have two functions, a fingerprint generation function GEN and a document similarity function SIM. The NDDD problem then reduces to a fingerprint-based indexing and searching problem:
- Indexer: Given a set of documents S, each document is assigned a unique ID. We extract the fingerprints {F1, F2, …, Fn} from each document D with the function GEN. The indexer indexes them together with the document ID, i.e., ID ↔ {F1, F2, …, Fn}, and saves the results into index files.
- Searcher: For a query document d and a percentile X%, we extract the fingerprints {f1, f2, …, fn} from d with the function GEN. The searcher uses them to retrieve relevant document IDs from the index files: if a reference document shares any of {f1, f2, …, fn}, its ID is retrieved, and with the ID the reference document D itself. We then calculate SIM(d, D) to measure the similarity. When multiple reference documents are retrieved, we calculate the similarity for all of them and rank the results in descending order of similarity.

With this model, the NDDD problem decomposes into three independent sub-problems:
1. Fingerprint generation --- generate multiple fingerprints from a given document D with a function GEN(D).
2. Similarity measurement --- calculate the similarity between two documents d and D with a function SIM(d, D).
3. Indexing/searching --- the indexer indexes each document ID with its fingerprints {F1, F2, …, Fn}; the searcher retrieves document IDs from the indices for given fingerprints. This is similar to a keyword-based search engine such as Google or Lucene.

One can use a general search-engine framework or even a relational database system to solve the third problem. Therefore, we propose algorithmic solutions to the first and second problems only.
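The following minimal sketch (ours; an in-memory illustration of the NDDD model, with gen and sim standing for the GEN and SIM functions constructed in Section 3, and a Python dictionary standing in for a real search engine's index files) shows how the indexer and searcher fit together:

    from collections import defaultdict

    class NDDDIndex:
        """Toy indexer/searcher: fingerprints play the role of keywords."""
        def __init__(self, gen, sim):
            self.gen, self.sim = gen, sim
            self.postings = defaultdict(set)   # fingerprint -> {document IDs}
            self.docs = {}                     # document ID -> normalized text

        def index(self, doc_id, text):         # Indexer: ID <-> {F1, ..., Fn}
            self.docs[doc_id] = text
            for fp in self.gen(text):
                self.postings[fp].add(doc_id)

        def search(self, query, threshold):    # Searcher
            candidates = set()
            for fp in self.gen(query):         # IDs sharing any fingerprint
                candidates |= self.postings.get(fp, set())
            hits = [(i, self.sim(query, self.docs[i])) for i in candidates]
            return sorted((h for h in hits if h[1] >= threshold),
                          key=lambda h: h[1], reverse=True)

Note that only documents sharing at least one fingerprint with the query are ever scored by SIM, which is what makes the scheme practical over millions of documents.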
3. Algorithms
This section provides algorithms to construct the two functions GEN and SIM. The function GEN extracts fingerprints from a given normalized document. A fingerprint is a possible invariant of the text that can survive document changes. What can survive changes? Changes of text are caused by document modification with editing operations such as insertion, deletion and copy/paste. However, many pieces of the old text remain in the new text; these unchanged pieces merely shift their relative positions. If we can identify some unchanged text pieces, we can use them as text invariants to generate fingerprints. How do we locate these unchanged yet shifting pieces? First, we use text model 2 to represent a text as a string of UTF-8 characters, denoted T = c1 c2 … cL where L is the string length; hence we can discuss strings of characters instead of texts or documents. Second, we introduce the concept of "anchoring points", which is briefly discussed in [1] without implementation suggestions. An anchoring point is a character of the string that remains the same relative to its neighborhood when the string changes. One can use the neighborhood around an anchoring point to generate a fingerprint with a good hash function H. With multiple anchoring points, we obtain multiple fingerprints for the document.

There are two issues to be solved. The first is how to select robust anchoring points, since the string can change. The second is that there may be too many anchoring points, so that we generate too many fingerprints from a given string. We propose Algorithm 1 to construct the function GEN in a way that handles both issues.

Definition 6: We need some notation for Algorithm 1:
- The alphabet A of UTF-8 characters appearing in the string.
- Two numbers N and M that select the most robust anchoring points for generating fingerprints. M can be fixed for any text string while N is selected according to the string size; Table 1 shows an example of how M and N are configured.
- The width W of the anchoring neighborhoods.
- A hash function H that generates a fingerprint from a sub-string of size W. There is no specific requirement for the hash function.
- The character score function: for a character C whose occurrence offsets in the string are P1, P2, …, Pn,

  score(C) = n (Pn − P1) / Σ1≤i<n (Pi+1 − Pi)²

Table 1: Example configuration of M and N by text size

  Text Size Range    M    N
  0-10K              4    128
  10-20K             4    256
  20-30K             4    256
  30-50K             4    512
  50-70K             4    1024
  70-80K             4    1024
  80-100K            4    1024
  100-500K           4    1024
  > 500K             4    1024
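To illustrate the score function as reconstructed above, here is a small sketch (ours): given the occurrence offsets of a character, frequent and evenly spread occurrences score higher than clustered ones, so the selected anchor characters cover the whole string.

    def char_score(offsets):
        """score(C) = n*(Pn - P1) / sum of (P(i+1) - Pi)^2 over 1 <= i < n."""
        n = len(offsets)
        if n < 2:
            return 0.0
        span = offsets[-1] - offsets[0]
        gaps = sum((offsets[i + 1] - offsets[i]) ** 2 for i in range(n - 1))
        return n * span / gaps

    char_score([0, 100, 200, 300])  # 4*300/30000  = 0.040 (evenly spread)
    char_score([0, 1, 2, 300])      # 4*300/88806 ~= 0.014 (clustered)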
Algorithm 1:
Input: string T = c1 c2 … cL.
Output: a fingerprint set.
Procedure:
Step 1: Select the number N from Table 1 according to the string length L.
Step 2: Run through the string T, counting the occurrences of each unique UTF-8 character of A and saving their offsets.
Step 3: For each C ∈ A, the character C has one or more occurrences in T, with offsets denoted P1, P2, … Pn. Use the score function to calculate the score of C.
Step 4: Pick the M characters of A with the highest scores, B = {C1, C2, … CM}.
Step 5: For each C ∈ B, do steps 6 to 9.
Step 6: Each occurrence of C in T defines an anchoring neighborhood with C at its center; each neighborhood is a sub-string of size W. Denote these neighborhoods S1, S2, … Sn, corresponding to the occurrence offsets P1, P2, … Pn.
Step 7: Sort the list of sub-strings S1, S2, … Sn. Without loss of generality, we still denote the sorted list S1, S2, … Sn.
Step 8: Select the first K items of the sorted list, where K = MIN(N, n): {S1, S2, … SK}.
Step 9: Apply the hash function H to {S1, S2, … SK} to generate K fingerprints and add them to the fingerprint set.

The algorithm is stated in terms of text model 2; it works for the other two models as well if "character" is replaced by "token" or "byte". The idea of the algorithm is straightforward. First, it selects the most significant characters from the alphabet of the input string, using the scoring function to measure significance: when calculating the score of a given character, we consider both the frequency and the distribution of the character across the string, as reflected in the score function. Second, for each picked character, it chooses the robust anchoring points by sorting the neighborhoods and picking the top items of the list; sorting is a mechanism that turns randomness into order. The result is a set of at most M*N fingerprints. For example, when the normalized text size is less than 10KB, which is typical in the real world, we get at most 4*128 = 512 fingerprints.
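A minimal Python sketch of Algorithm 1 follows (ours; the neighborhood width W = 16, the use of Python's built-in hash as H, the truncated neighborhoods at string boundaries, and passing N as a parameter rather than looking it up in Table 1 are all our assumptions):

    from collections import defaultdict

    def gen(text, M=4, N=128, W=16):
        """Algorithm 1: extract at most M*N fingerprints from a normalized text.
        N would normally be chosen from Table 1 by the caller (Step 1)."""
        offsets = defaultdict(list)                # Step 2: offsets per character
        for i, ch in enumerate(text):
            offsets[ch].append(i)

        def score(pos):                            # Step 3: Definition 6 score
            if len(pos) < 2:
                return 0.0
            gaps = sum((pos[i + 1] - pos[i]) ** 2 for i in range(len(pos) - 1))
            return len(pos) * (pos[-1] - pos[0]) / gaps

        # Step 4: the M highest-scoring characters become anchor characters.
        anchors = sorted(offsets, key=lambda c: score(offsets[c]), reverse=True)[:M]

        fingerprints = set()
        for c in anchors:                          # Steps 5-9
            # Step 6: W-wide neighborhoods centered on each occurrence of c.
            hoods = [text[max(0, p - W // 2):p - W // 2 + W] for p in offsets[c]]
            # Steps 7-8: sort, then keep the first K = min(N, n) neighborhoods.
            for s in sorted(hoods)[:N]:
                fingerprints.add(hash(s))          # Step 9: hash each neighborhood
        return fingerprints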
The function SIM calculates the similarity between two normalized documents. Using text model 2 to represent a document, we actually compare two strings of characters. What does similarity mean for strings? If there are common sub-strings between two strings and their total length is long enough, we consider the strings similar to each other. We also expect the similarity to be measurable as a percentile. We propose Algorithm 2 to calculate the similarities between one given document and a set of reference documents. The main idea is to identify common sub-strings with a hash-based greedy matching strategy.

Definition 7: We need some notation for Algorithm 2:
- A number M that defines the minimum length of common sub-strings. Common sub-strings must have a minimum length to avoid triviality; otherwise a single character could count as a common sub-string.
- A hash function H that generates a hash value from a sub-string of size M. There is no specific requirement for the hash function; however, due to the nature of the algorithm, a rolling hash function is recommended for good performance.
- A hash table HT with chaining to resolve collisions.
- For a string T, the substring from starting offset s to ending offset e is denoted T[s, …, e].
The algorithm is stated with text model 2 but applies to the other two models as well.

Algorithm 2:
Input: a query string d and multiple reference strings {D1, D2, …, Dm}.
Output: the similarities {SIM1, SIM2, …, SIMm}.
Procedure:
Step 1: Create the hash table HT based on L, the length of the query string d.
Step 2: For j = 0 to L−M, apply the hash function H to the sub-string d[j, …, j+M−1] of d to calculate the hash value h, and store the offset j in HT[h] or its chained linked list.
Step 3: For each k in {1, 2, …, m}, do steps 4 to 12.
Step 4: Let Lk be the length of Dk; set P = 0 and SUM = 0.
Step 5: Let h = H(Dk[P, …, P+M−1]).
Step 6: If HT[h] is empty, there is no matching sub-string at offset P; let P = P+1 and go to step 11.
Step 7: For each sub-string offset s stored in the chained linked list at HT[h], do step 8.
Step 8: If d[s, …, s+M−1] ≠ Dk[P, …, P+M−1], set V(s) = 0; otherwise extend the two equal sub-strings forward with as many common characters as possible, arriving at the maximal common sub-string length V(s).
Step 9: Let V be the largest of all the V(s) obtained in step 8.
Step 10: If V > 0, let SUM = SUM + V and P = P + V; otherwise let P = P + 1.
Step 11: If P < Lk − M, go to step 5.
Step 12: Let SIMk = SUM / Lk.

Algorithm 2 calculates all of SIM(d, D1), SIM(d, D2), … SIM(d, Dm) in one construction. Steps 1 and 2 pre-process d, and steps 4 to 12 calculate each individual SIM(d, Dk) in turn. For the normalized query document d and a reference document D, Algorithm 2 identifies a set of common sub-strings and sums their lengths up as SUM; the similarity is then measured as SUM / Length(D).
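A minimal Python sketch of Algorithm 2 follows (ours, for a single reference string; the default M = 8 is our assumption, and using the M-grams themselves as dictionary keys is equivalent to a hash table with chaining while sidestepping explicit collision checks, at the cost of the rolling-hash speedup the algorithm recommends):

    def sim(d, D, M=8):
        """Algorithm 2 for one reference string: SIM(d, D) = SUM / len(D)."""
        if len(d) < M or len(D) < M:
            return 0.0
        table = {}                                  # Steps 1-2: index M-grams of d
        for j in range(len(d) - M + 1):
            table.setdefault(d[j:j + M], []).append(j)

        total, p = 0, 0                             # Steps 4-11: greedy matching
        while p <= len(D) - M:
            best = 0
            for s in table.get(D[p:p + M], []):     # candidate match offsets in d
                v = M                               # Step 8: extend the match
                while s + v < len(d) and p + v < len(D) and d[s + v] == D[p + v]:
                    v += 1
                best = max(best, v)
            if best:                                # Step 10
                total, p = total + best, p + best
            else:
                p += 1
        return total / len(D)                       # Step 12

    # sim("abcdefghij", "abcdefghij") == 1.0; unrelated strings give 0.0.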
One may ask why we do not include the length of d in the similarity. This is because we care more about how much of D is duplicated in the query document d than about how much of d consists of content from D. One can certainly design another formula that calculates the similarity from SUM and both lengths. Finally, we need to make sure SIM measures the similarity meaningfully; this is guaranteed by the following theorem.

Theorem 1: The function SIM defined by Algorithm 2 satisfies the following properties for two normalized documents d and D:
1. 0 ≤ SIM(d, D) ≤ 1.
2. If d and D are the same document, SIM(d, D) = 1.
3. If d and D have no common sub-strings at all, SIM(d, D) = 0.
Proof: From steps 4 to 11 of Algorithm 2, we have 0 ≤ SUM ≤ Length(D), which proves 0 ≤ SIM(d, D) ≤ 1. If d = D, it is not difficult to show that SUM = Length(D), i.e., SIM(d, D) = 1. The last assertion is trivial.

4. Asymmetric Fingerprint Generation
For some special applications such as DLP (data loss prevention) endpoint products, indexed fingerprint files created on servers must be delivered to remote machines that host searchers. It is necessary to use fewer fingerprints to represent a document in order to save network bandwidth and storage cost. In Algorithm 1, there are two important parameters for generating fingerprints: the numbers M and N, where M is fixed and N is configured according to text size by a table. Based on recent experimental research, we can reduce the number of fingerprints while keeping almost the same recall rate if we apply a smaller number N to the function GEN at the indexer side while the N at the searcher side is kept the same. In other words, we can solve the NDDD problem even if the indexer generates far fewer fingerprints than the searcher. Table 2 is an example of defining different N's for indexer and searcher.

Table 2: Different N for indexer and searcher

  Text Size Range    M    N for Indexer    N for Searcher
  0-10K              4    8                128
  10-20K             4    16               256
  20-30K             4    32               256
  30-50K             4    32               512
  50-70K             4    64               1024
  70-80K             4    128              1024
  80-100K            4    256              1024
  100-500K           4    512              1024
  > 500K             4    1024             1024

This method is referred to as asymmetric fingerprint generation, while Algorithm 1 as originally configured is symmetric fingerprint generation. Its capability to keep almost the same recall rate is supported by the following theoretical results.

Definition 8: Assume M is a constant. For any normalized document T, denote by S(T, N) the set of fingerprints extracted from T with the number N.

Theorem 2: Let T be any normalized document, and let n and m be two positive integers. If n < m, then S(T, n) ⊆ S(T, m), i.e., the set S(T, n) is a subset of S(T, m).
Proof: This is a natural outcome of step 8 of Algorithm 1.

Theorem 3: Let D and d be two versions of the same normalized document, and let n and m be two positive integers. If n < m, then S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m).
Proof: Since n < m, we have S(d, n) ⊆ S(d, m) and S(D, n) ⊆ S(D, m) by Theorem 2. Therefore S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) and S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). Together, these give S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). This completes the proof.

Theorem 3 implies that the recall rate of asymmetric fingerprint generation lies between the two cases of symmetric fingerprint generation with the smaller and the larger number of fingerprints. As a matter of fact, the experimental data shows it is closer to the latter case while generating far fewer fingerprints at the indexer.
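As a sketch of how the asymmetric configuration might be wired up (ours; it assumes the gen function sketched in Section 3 with N as a parameter, the hypothetical helper pick_n, and sizes measured on the normalized text):

    # Table 2 as (upper size bound, N) pairs; sizes above the last bound get 1024.
    INDEXER_N  = [(10_000, 8), (20_000, 16), (30_000, 32), (50_000, 32),
                  (70_000, 64), (80_000, 128), (100_000, 256), (500_000, 512)]
    SEARCHER_N = [(10_000, 128), (20_000, 256), (30_000, 256), (50_000, 512)]

    def pick_n(size, table, default=1024):
        """Select N for a normalized text of the given size."""
        for upper, n in table:
            if size <= upper:
                return n
        return default

    # Indexer registers few fingerprints; searcher queries with many:
    # index_fps  = gen(text,  N=pick_n(len(text),  INDEXER_N))
    # search_fps = gen(query, N=pick_n(len(query), SEARCHER_N))

By Theorem 2, the fingerprints registered by the indexer form a subset of those the searcher generates for the same document, so every indexed fingerprint remains findable.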
5. Experiments
In this section, we report a data experiment implemented with the asymmetric architecture of fingerprint generation defined by the parameters of Table 2. Both indexer and searcher reside on a server running Windows Server 2003 with an Intel Xeon E5405 @ 2.0GHz and 8GB of RAM. We prepared the experimental data sets as follows:

Normalized documents for indexing:
- Corpus 1: 1 million plain text files in UTF-8 encoding. Let us denote corpus 1 as S1.
- Corpus 2: 2115 plain text files in many different languages and of varying sizes, entirely unrelated to the files in S1. Let us denote corpus 2 as S2.
- Let S = S1 ∪ S2. All files in S are registered for fingerprint generation and indexing.

Normalized documents for querying:
- Corpus 3: 6*6*2115 = 76140 files, made from S2 by applying 6 editing operations at 6 levels of change expressed in percentiles. Corpus 3 is used for the querying experiment.
- The 6 levels of change are 5%, 10%, 20%, 30%, 40% and 50%. For example, level 1 means we alter 5% of the content of an original file.
- The 6 editing operations are ADD, ADH, ADE, DEL, CHG and MOV, defined as follows:
  - ADD: add a randomly generated block of characters at a random position in the file.
  - ADH: add a randomly generated block of characters at a random position in the file; also add a randomly generated block, with size randomly selected between 50 and 100, at the beginning of the file.
  - ADE: add a randomly generated block of characters at a randomly selected position in the file; also add a randomly generated block, with size randomly selected between 50 and 100, at the end of the file.
  - DEL: delete a block of characters from the file; the starting point of the deletion is randomly selected.
  - CHG: replace a randomly selected block of characters in the file with a randomly generated block.
  - MOV: move a randomly selected block of characters in the file to a random position in the file.

Table 3: Querying time in seconds

  Change level    Total file number    Total time (s)    Avg. seconds per file
  5%              12690                1727              0.136
  10%             12690                1776              0.139
  20%             12690                1680              0.132
  30%             12690                1709              0.134
  40%             12690                1699              0.133
  50%             12690                1649              0.129

Table 4: Numbers of files matched at each change level

  Change level    ADD     ADH     ADE     DEL     CHG     MOV
  5%              2080    2079    2082    2074    2071    2055
  10%             2079    2069    2079    2073    2067    2055
  20%             2045    2047    2055    2063    2029    2046
  30%             2027    2019    2023    2058    1979    2041
  40%             1993    2000    1998    2021    1924    2049
  50%             1969    1977    1978    2020    1894    2049

Table 5: Total recall rate at each change level

  Change level    Total files matched    Recall rate
  5%              12441                  98.03%
  10%             12422                  97.88%
  20%             12285                  96.80%
  30%             12147                  95.72%
  40%             11985                  94.44%
  50%             11887                  93.67%

Figure 1: Recall vs. change level for different operations.

Experiment steps:
1. Fingerprint and index all the files in S.
2. Set X% = 20%. Each file from corpus 3 is used as a query document for the NDDD problem. Recall and precision are measured from the query results, and querying speed is measured in seconds.

The experimental results are shown in Tables 3 to 5 and Figure 1. Table 3 shows the performance of executing searches for the 6*2115 = 12690 query files at each change level, with the total time and the average time per file. For example, for change level 5%, the total time is 1727 seconds, i.e., 0.136 seconds per file on average. This is quite fast considering that the set S contains more than 1 million fingerprinted documents. Table 4 shows the number of files matched for each change level and editing operation. For example, for change level 5% and the ADD
operation, 2080 of the 2115 query files were matched successfully, a recall of 98.3%. Figure 1 illustrates the recall rate versus change level for each operation. Table 5 shows the recall rates over all operations at each change level. As the document changes increase, the recall rate drops; the worst recall rate is 93.67% when the change is around 50%. We should mention that there were no false positives across all 76140 query files. This is a natural outcome for the following reasons:
- GEN and SIM are two string matching functions that are constructed independently.
- Even if fingerprint matching produces false-positive candidates, the threshold X% filters them out.

6. Conclusion
This article has examined and solved the problem of near duplicate document detection. What we have studied can be summarized as follows:
- A formal definition of the NDDD problem.
- Text models for effective representation, with a language-independent text model selected to represent the documents.
- An NDDD model that refines the problem definition and decomposes the NDDD problem into three sub-problems that can be solved independently.
- Algorithms to extract document fingerprints and to calculate document similarity.
- An architecture of asymmetric fingerprint generation that reduces the number of fingerprints for special applications.
- A data experiment showing that our algorithmic solution has good performance, near-zero false positives and a high recall rate even when documents change by up to 50%.

The problem definition and algorithmic solution in this article have advantages over other approaches. The solution has near-zero false positives, since the similarity calculation is independent of fingerprint generation. The recall rate is good because the fingerprints are robust to moderate document changes. Finally, the solution is language independent: it applies to documents written in any language and even to documents written in multiple languages.

7. REFERENCES
[1] Manber, U. 1994. Finding similar files in a large file system. Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, California.
[2] Shivakumar, N. and Garcia-Molina, H. 1999. Finding near-replicas of documents on the web. Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 1590, 204-212.
[3] Lopresti, D. P. 1999. Models and algorithms for duplicate document detection. Proceedings of the Fifth International Conference on Document Analysis and Recognition, Bangalore, India, 297-300, September 1999.
[4] Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, UK, Springer-Verlag, 1-10, 2000.
[5] Campbell, D. M., Chen, W. R. and Smith, R. D. 2000. Copy detection systems for digital documents. Proceedings of Advances in Digital Libraries, 78-88, 2000.
[6] Ignatov, D. I. and Jánosi-Rancz, K. T. 2009. Towards a framework for near-duplicate detection in document collections based on closed sets of attributes. Acta Univ. Sapientiae, Informatica, 1, 2 (2009), 215-233.
[7] Kumar, J. P. and Govindarajulu, P. 2009. Duplicate and near duplicate documents detection: a review. European Journal of Scientific Research, 32, 4 (2009), 514-527.
[8] Ren, L., Tan, D., Huang, F., Huang, S. and Dong, A. 2009. Matching engine with signature generation. US patent 7,516,130.
[9] Ren, L., Huang, S., Huang, F., Dong, A. and Tan, D. 2010. Matching engine for querying relevant documents.
US patent 7,747,642.
[10] Ren, L., Huang, S., Huang, F. and Lin, Y. 2010. Document matching engine using asymmetric signature generation. US patent 7,860,853.