Near Duplicate Document Detection: Mathematical Modeling and Algorithms
Liwei Ren
Trend Micro
10101 North De Anza Boulevard
Cupertino, CA 95014, USA
1-408-850-1048
liwei_ren@trendmicro.com
Qiuer Xu
Trend Micro
Building B, Soho International Plaza
Nanjing, 210012, P.R. China
86-25-52386123
fallson_xu@trendmicro.com.cn
ABSTRACT
Near-duplicate document detection is a well-known problem in the area of information retrieval, and an important one for many applications in the IT industry. It has been studied extensively in the research literature. This article provides a novel solution to this classic problem. We present the problem with abstract models along with additional concepts such as text models, document fingerprints and document similarity. With these concepts, the problem can be transformed into a keyword-like search problem whose results are ranked by document similarity. Two major techniques are involved: the first is to extract robust and unique fingerprints from a document; the second is to calculate document similarity effectively. Algorithms for both fingerprint extraction and document similarity calculation are introduced as a complete solution.
Categories and Subject Descriptors
H.3.3: Information Search and Retrieval – information filtering, retrieval models, search process.
General Terms
Algorithms, Experimentation.
Keywords
Duplicate Document, Near Duplicate Detection, Document
Fingerprint, Document Similarity, Retrieval Model, Information
Retrieval, Asymmetric Architecture
1. INTRODUCTION
Near duplicate document detection (NDDD) is a well-known problem in the area of information retrieval. It is defined as identifying whether a given document is a near duplicate of one or more documents from a well-defined document set. This problem arises in many technical areas such as crawling and indexing optimization of web search engines, copy detection systems, email archival, spam filtering, and data leak prevention systems. There is a substantial body of research literature discussing this subject with numerous use cases and solutions [1-6]. Recently, Kumar et al. [7] provided a thorough review of the most significant works over the past decades, covering more than 60 papers.
We organize the following sections in the fashion of problem definition, mathematical modeling and algorithmic solutions. We will introduce a formal problem definition, followed by three text models that are used to represent documents. One text model is selected for constructing the algorithmic solution. By introducing concepts such as document fingerprint and document similarity, the problem can be decomposed into three independent problems: (a) document fingerprint extraction; (b) document similarity calculation; (c) a fingerprint based search engine. Two algorithms are constructed to extract fingerprints from documents and measure the similarity between documents. One can use the utility of a keyword based search engine for solving problem (c). Finally, an architecture of asymmetric fingerprint generation is proposed to reduce the number of fingerprints. A smaller number of fingerprints is critical for the success of some special applications such as data leak prevention systems.
2. Problem Definition and Modeling
The problem proposed in the introduction section is not well-defined from the perspective of practical implementation. In practice, we need a quantitative measurement of how "near duplicated" two documents are. We need a more rigorous definition for NDDD.
Definition 1: Assume that we have a set of documents S. For any given document d and a percentile X%, one needs to identify the documents D1, D2, …, Dm from S such that SIM(d, Dj) ≥ X% for 1 ≤ j ≤ m, where SIM is a well-defined function to calculate the similarity of two documents. The result {D1, D2, …, Dm} is shown in descending order of similarity.
There are several challenges in solving this problem:
(a) The document set may be huge, on a scale of millions or even billions of documents. One certainly cannot compare d with each document of S to calculate the similarity. How can we efficiently identify the reference documents from a huge document set?
(b) How do we construct the similarity function SIM?
Before we are able to answer these questions, we need to propose text models to represent a document. A text model allows us to exclude irrelevant textual elements so that we can focus on the essence. Documents can be in any document format such as Word, PowerPoint, Excel, PDF, PostScript and many others. The individual words or sentences can be in different styles (bold, italic, underline) and with a variety of fonts. These are not important textual elements when we discuss "near duplicate". Fundamentally, we are more interested in the textual content that carries semantic significance.
A document can be written in any language. Texts in different languages can be encoded differently; for example, English texts can be encoded in ASCII, Chinese in GB, and Japanese in SJIS. However, all languages can be encoded in the UTF-8 standard, which is able to represent all languages in one text.
For documents in English or any western language, most authors
view a text as a string of words [2-6]. Words can be extracted
from texts with tokenization technique that uses spaces to separate
words (or tokens) in sentences.
Some languages such as Chinese and Japanese do not use spaces between words. In those eastern languages, a sentence is a string of characters without spaces between them. Characters of all languages can be encoded as UTF-8 characters. As such, a text in any language can be considered as a string of UTF-8 characters.
Depending on the language, each UTF-8 character consists of one or multiple bytes; for example, a Chinese character typically consists of three bytes while an ASCII character is one byte. Therefore, one can also view a text as a string of bytes if we convert it from its original encoding into UTF-8.
Definition 2: We have three text models to represent a document:
Model 1: A text is a string of tokens (or a sequence of tokens).
Model 2: A text is a string of UTF-8 characters.
Model 3: A text is a string of bytes when the text is encoded in UTF-8.
In summary, a text is a string of basic textual units, where a basic textual unit is a token, a UTF-8 character, or a byte.
Besides these three models, there exist other text models whose basic textual units are sentences [5], textual lines, or even pages. Those models are not of interest in this article.
Numerous articles study NDDD using text model 1. While this model is good enough to study NDDD for documents in western languages, it has obstacles when dealing with non-western languages. Model 1 needs tokenization techniques, and tokenization is a daunting task for processing documents in Chinese and Japanese especially.
There are few works adopting text models 2 and 3 in the academic world. Manber [1] discussed duplicate detection in terms of pairwise matching of ASCII files, a special case of models 2 and 3. In contrast, it has become common practice in industry to apply text model 2 or 3 to many document management problems such as DLP [8-10], spam filtering and e-Discovery. In this article, we use text model 2 to extract fingerprints from documents and calculate the similarity between two documents. Both text models 2 and 3 are language independent while model 1 is not. Therefore, the techniques developed in this article apply equally to documents in any language, and even to a document written in multiple languages.
Definition 3: A document normalization is a process that consists of three sub-processes applied sequentially:
(a) Converting a document in any format, such as Word, Excel and PDF, into a plain text encoded in UTF-8;
(b) Converting any plain text in other encodings into a plain text encoded in UTF-8;
(c) Removing trivial characters such as white spaces, delimiters and control characters from the UTF-8 text.
Definition 4: The result of the document normalization is a string
of UTF-8 characters that contains the most significant information
of the original document. It is called a normalized text or
normalized document.
There are many software tools available for the document
normalization. Without loss of generality, we can consider all
documents as normalized texts in the rest of this article unless we
specify otherwise.
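To make this concrete, here is a minimal Python sketch of sub-processes (b) and (c) of the normalization. Sub-process (a), format conversion from Word or PDF to plain text, is assumed to be handled by an external extraction tool, and treating all Unicode separator, control and punctuation characters as "trivial" is our assumption rather than a prescription from this article:

```python
import unicodedata

def normalize_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Sketch of Definition 3, sub-processes (b) and (c): decode a plain
    text into Unicode, then strip trivial characters. Which characters
    count as trivial is an assumption here (Unicode separators Z*,
    control/format characters C*, and punctuation P*)."""
    text = raw.decode(encoding, errors="ignore")
    return "".join(ch for ch in text
                   if unicodedata.category(ch)[0] not in ("Z", "C", "P"))
```

For example, `normalize_text(b"Hello, world!\n")` yields the normalized string `Helloworld`: the comma, space, exclamation mark and newline are dropped while the semantically significant characters survive.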
It is now time to tackle the two challenges of Definition 1. To meet the first challenge, let us introduce the concept of document fingerprint.
Definition 5: A document fingerprint is an integer or a binary
string with fixed length. Fingerprints can be generated from
documents by a function GEN. The fingerprints have the
following characteristics:
(a) A document D has multiple fingerprints { F1, F2, …,
Fn}, i.e., GEN(D) = { F1, F2, …, Fn}.
(b) Two irrelevant documents d and D do not have a
common fingerprint. That is GEN(d) ∩ GEN(D) = ϕ.
This is called the uniqueness.
(c) A fingerprint can survive moderate document changes. That means GEN(d) ∩ GEN(D) ≠ ϕ if d is a near-duplicate copy of D. This is called the robustness.
(d) In summary, a fingerprint is a unique invariant across document variants.
A document D can be represented by multiple fingerprints; let us denote this relationship as D ↔ { F1, F2, …, Fn}. For any document D from the document set S in Definition 1, we can assign a unique document ID to it so that we establish a mapping between the ID and the fingerprints. We also denote this as ID ↔ { F1, F2, …, Fn}. This reminds us of the keyword based searching problem, as we can index the relationship ID ↔ { F1, F2, …, Fn} into indexing files when treating the fingerprints as keywords. We can present the NDDD problem of Definition 1 with the following model, supported by two procedures, indexer and searcher.
NDDD Model: Assume we have two functions: (a) a fingerprint generation function GEN; (b) a document similarity measurement function SIM. The NDDD problem is then reduced to a fingerprint based indexing and searching problem:
Indexer: Given a set of documents S, each document
is assigned a unique ID. We extract multiple
fingerprints { F1, F2, …, Fn} from each document D
with the function GEN. The indexer indexes them
together with the document ID, i.e., ID ↔ { F1, F2,
…, Fn}. The indexing results are saved into indexing
files.
Searcher: For any query document d and the
percentile X%, we extract multiple fingerprints { f1,
f2, …, fn} from the query document d with the
function GEN . The searcher uses them to retrieve
relevant document IDs from the indexing files. If a
reference document contains any of { f1, f2, …, fn},
its ID will be retrieved. With the ID, the reference
document D is retrieved as result. Then, we calculate
SIM(d,D) to measure the similarity. There may be
multiple reference documents retrieved. We
calculate the similarity for all of them, and rank the
results in descending order of similarity.
With the model shown as above, the NDDD problem actually is
decomposed into three independent problems.
Three Sub-Problems:
1. Fingerprint generation --- Generate multiple
fingerprints from a given document D by a fingerprint
generation function GEN(D).
2. Similarity measurement --- Calculate the similarity
between two documents d and D by the similarity
function SIM(d,D).
3. Indexing/Searching --- The indexer indexes document
ID and its fingerprints { F1, F2, …, Fn}. The searcher
retrieves document IDs against indices with given
fingerprints { F1, F2, …, Fn}. This is similar to keyword
based search engine such as Google or Lucene.
One can use a general search engine framework or even a relational database system for solving the third problem. Therefore, we will propose algorithmic solutions to the first and second problems only.
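The decomposition above can be sketched as a tiny in-memory engine. The `gen` and `sim` callables below are placeholders for the functions GEN and SIM constructed in the next section, and a Python dictionary stands in for the indexing files of a real search engine:

```python
from collections import defaultdict

class NDDDEngine:
    """Minimal in-memory sketch of the NDDD model: fingerprints are
    indexed like keywords, then candidate documents are ranked by
    similarity. `gen` and `sim` are stand-ins for GEN and SIM."""

    def __init__(self, gen, sim):
        self.gen = gen
        self.sim = sim
        self.index = defaultdict(set)   # fingerprint -> set of document IDs
        self.docs = {}                  # document ID -> normalized text

    def index_document(self, doc_id, text):
        # Indexer: record ID <-> {F1, F2, ..., Fn}.
        self.docs[doc_id] = text
        for fp in self.gen(text):
            self.index[fp].add(doc_id)

    def search(self, query, threshold):
        # Searcher: retrieve every document sharing a fingerprint with
        # the query document ...
        candidates = set()
        for fp in self.gen(query):
            candidates |= self.index.get(fp, set())
        # ... then score the candidates with SIM, keep those at or above
        # the X% threshold, and rank by descending similarity.
        scored = [(doc_id, self.sim(query, self.docs[doc_id]))
                  for doc_id in candidates]
        return sorted([(i, s) for i, s in scored if s >= threshold],
                      key=lambda pair: -pair[1])
```

The point of the sketch is the division of labor: the inverted index finds candidates cheaply, and the (more expensive) similarity function runs only on those candidates.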
3. Algorithms
This section provides algorithms to construct the two functions
GEN and SIM respectively.
The function GEN is to extract fingerprints from a given normalized document. A fingerprint is a possible invariant of text that can survive document changes. What can survive changes? Changes of text are caused by document modification with editing operations such as insertion, deletion and copy/paste. However, many pieces remain in the new text; these unchanged pieces merely shift relatively within the text. If we can identify some unchanged text pieces, we can use them as text invariants to generate fingerprints. How do we locate these unchanged yet shifting pieces?
First of all, we use text model 2 to represent a text as a string of UTF-8 characters; let us denote this as T = c1 c2 … cL where L is the string length. Hence, we can discuss strings of characters instead of texts or documents. Secondly, we introduce the concept of "anchoring points", which is briefly discussed in [1] without implementation suggestions. An anchoring point is a character in the string that remains the same relative to its neighborhood when the string changes. One can use the neighborhood around the anchoring point to generate a fingerprint with a good hash function H. With multiple anchoring points, we have multiple fingerprints for the document. There are two issues to be solved. The first issue is how to select robust anchoring points, since the string can change. The second issue is that there may be too many anchoring points, so that we generate too many fingerprints from a given string. We propose Algorithm 1 to construct the function GEN, which handles both issues.
Definition 6: We need some notations for writing up Algorithm 1:
The alphabet A of UTF-8 characters appearing in the string.
Two numbers M and N that select the most robust anchoring points for generating fingerprints. M can be fixed for any text string while N is selected according to the string size. Table 1 shows an example of how M and N are configured.
The width W of anchoring neighborhoods.
A hash function H that generates a fingerprint from a sub-string of size W. There is no specific requirement for the hash function.
Character score function defined as
𝑛 ∗ (𝑃𝑛 − 𝑃1) (𝑃𝑖+1 − 𝑃𝑖)2
1≤𝑖<𝑛
Table 1: M and N are configured accordingly

Text Size Range    M    N
0-10K              4    128
10-20K             4    256
20-30K             4    256
30-50K             4    512
50-70K             4    1024
70-80K             4    1024
80-100K            4    1024
100-500K           4    1024
> 500K             4    1024
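As a worked illustration, assume the score of a character with sorted occurrence offsets P1, …, Pn takes the form n · (Pn − P1) / Σ1≤i<n (Pi+1 − Pi)²; this exact form is our reading of the formula, so treat it as an assumption. Under this form, a frequent character whose occurrences spread evenly over the string scores higher than one whose occurrences cluster:

```python
def char_score(offsets):
    """Score a character from its sorted occurrence offsets P1..Pn.
    Assumed form: n * (Pn - P1) / sum over i of (P_{i+1} - P_i)^2.
    High frequency and even spread raise the score; clustering lowers it."""
    n = len(offsets)
    if n < 2:
        return 0.0
    span = offsets[-1] - offsets[0]
    gap_sq = sum((b - a) ** 2 for a, b in zip(offsets, offsets[1:]))
    return n * span / gap_sq

# Evenly spread occurrences beat clustered ones over the same span:
print(char_score([0, 10, 20, 30]))   # 4*30/300 = 0.4
print(char_score([0, 1, 2, 30]))     # 4*30/786, roughly 0.153
```

For a fixed span Pn − P1, the sum of squared gaps is minimized when the gaps are equal, so the score peaks for evenly distributed characters, matching the stated intent of selecting stable, well-spread anchors.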
Algorithm 1:
Input: String T as c1 c2… cL
Output: Fingerprint set.
Procedure:
Step 1: Select the number N from Table 1 according to the string
length L.
Step 2: Run through the string T while counting the occurrences
of each unique UTF-8 character in A and saving the offsets.
Step 3: For each C ∈ A, the character C has one or multiple occurrences in T. Their offsets can be denoted as P1, P2,… Pn. We use the score function to calculate the score for C.
Step 4: Pick the M characters from A that have the highest scores. That is B = { C1, C2,… CM }.
Step 5: For each C ∈ B, do step 6 to step 9
Step 6: For each occurrence of C in T, we have an anchoring
neighborhood which has C as its center. Each neighborhood is a
sub-string of size W. We denote these neighborhoods as S1, S2,…
Sn with respect to the occurrence offsets P1, P2, … Pn .
Step 7: Sort the list of sub-strings S1, S2,… Sn . Without loss of
generality, we can still denote the sorted list as S1, S2,… Sn .
Step 8: Select first K items from the sorted list where K =
MIN(N , n). They are {S1, S2,… SK }.
Step 9: Apply hash function H to {S1, S2,… SK} to generate K
fingerprints and add them to the fingerprint set.
The algorithm is stated based on text model 2. However, it works for the other two models as well by replacing "character" with either "token" or "byte". The idea of the algorithm is straightforward. First of all, it selects the most significant characters from the alphabet of the input string, with a scoring function to measure the significance. When calculating the score of a given character, we consider both the frequency and the distribution of the character across the string. This is reflected in the score function. Secondly, for each picked character, it chooses the robust anchoring points by sorting and picking the top items from the list. Sorting is a mechanism to change randomness into order. The result is a set of at most M*N fingerprints. For example, when the normalized text size is less than 10KB, which is typical in the real world, we get at most 4*128=512 fingerprints.
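The steps of Algorithm 1 can be sketched as follows. SHA-1 here is only a stand-in for the unspecified hash function H, the score formula is our assumed reading of Definition 6, and the defaults M=4, N=128, W=16 are illustrative values (M and N matching the first row of Table 1):

```python
import hashlib
from collections import defaultdict

def gen_fingerprints(text, M=4, N=128, W=16):
    """Sketch of Algorithm 1 (the function GEN). The hash and the exact
    score formula are assumptions; M, N, W follow Definition 6."""
    # Steps 2-3: record the occurrence offsets of each character,
    # then score each character by frequency and spread.
    offsets = defaultdict(list)
    for i, ch in enumerate(text):
        offsets[ch].append(i)

    def score(ps):
        if len(ps) < 2:
            return 0.0
        gap_sq = sum((b - a) ** 2 for a, b in zip(ps, ps[1:]))
        return len(ps) * (ps[-1] - ps[0]) / gap_sq

    # Step 4: B = the M highest-scoring characters of the alphabet A.
    B = sorted(offsets, key=lambda c: score(offsets[c]), reverse=True)[:M]

    fingerprints = set()
    half = W // 2
    for ch in B:
        # Step 6: the W-wide neighborhood centered at each occurrence.
        neighborhoods = [text[max(0, p - half): p - half + W]
                         for p in offsets[ch]]
        # Steps 7-8: sort the neighborhoods and keep the first K = MIN(N, n).
        # Step 9: hash each selected neighborhood into a fingerprint.
        for s in sorted(neighborhoods)[:min(N, len(neighborhoods))]:
            fingerprints.add(hashlib.sha1(s.encode("utf-8")).hexdigest())
    return fingerprints
```

Because fingerprints come from sorted, locally anchored neighborhoods, a moderately edited copy of a text still shares most of its fingerprints with the original, which is the robustness property of Definition 5.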
The function SIM is to calculate the similarity between two normalized documents. We use text model 2 to represent a document, so that we actually compare two strings of characters. What does similarity mean for two strings? If there are common sub-strings between the two strings and their total length is long enough, we would consider the strings similar to each other. We also expect that similarity can be measured as a percentile. We propose Algorithm 2 to calculate the similarities between one given document and a set of reference documents. The main idea is to identify common sub-strings with a hash based greedy matching strategy.
Definition 7: We need some notations to present Algorithm 2:
A number M that defines the minimum length of common sub-strings. Common sub-strings must have a minimum length to avoid triviality; otherwise, a single character could be a common sub-string.
A hash function H that generates a hash value from a sub-string of size M. There is no specific requirement for the hash function; however, due to the nature of the algorithm, a rolling hash function is recommended for good performance.
A hash table HT with chaining capability to resolve collisions.
For a string T, its substring can be denoted as T[s,…,e]
where s and e are the starting and ending offsets.
The algorithm is stated with text model 2. However, it
can be applied to other two models as well.
Algorithm 2:
Input: Query string d, and multiple reference strings {D1, D2,
…, Dm}
Output: The similarities {SIM1, SIM2, …, SIMm }
Procedure:
Step 1: Create the hash table HT based on L, the length of the input string d.
Step 2: For j = 0 to L-M
Apply the hash function H to the sub-string d[j,…,j+M-1] of d to calculate the hash value h.
Store offset j in HT[h] or its chained linked-list.
Step 3: For each k in {1,2,…,m}, do step 4 to step 12
Step 4: Let Lk be the length of Dk , set P = 0 and SUM=0.
Step 5: Let h = H(Dk[P,…,P+M-1])
Step 6: If HT[h] is empty, we have no match of sub-strings at this offset P; let P=P+1 and go to step 11.
Step 7: For each sub-string offset s stored in the chained linked-list at HT[h], do step 8.
Step 8: If d[s,…,s+M-1] ≠ Dk[P,…,P+M-1], set V(s)=0; otherwise, extend the two equal sub-strings forward with as many common characters as possible, which yields the maximum common sub-string size V(s).
Step 9: Let V be the largest of all V(s) that we get from step 8.
Step 10: If V>0, let SUM = SUM + V, P = P + V, otherwise let
P = P + 1
Step 11: If P < Lk-M, go to Step 5
Step 12: Let SIMk = SUM / Lk
Algorithm 2 actually calculates all of SIM(d,D1), SIM(d,D2), …, SIM(d,Dm) in one pass. Steps 1 and 2 pre-process d, while steps 4 to 12 calculate each individual SIM(d,Dj) one at a time.
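The steps above can be sketched in Python as below. The built-in string hash stands in for the recommended rolling hash, so this illustrates the greedy matching logic rather than an optimal implementation:

```python
from collections import defaultdict

def similarities(d, refs, M=8):
    """Sketch of Algorithm 2 (the function SIM), computing SIM(d, Dk) for
    every reference string Dk in one pass over a shared hash table HT."""
    L = len(d)
    # Steps 1-2: index every length-M sub-string of d by its hash value;
    # the lists in HT play the role of the chained linked-lists.
    HT = defaultdict(list)
    for j in range(L - M + 1):
        HT[hash(d[j:j + M])].append(j)

    sims = []
    for Dk in refs:
        Lk, P, SUM = len(Dk), 0, 0
        # Steps 5-11: scan Dk, greedily extending any match found via HT.
        while P <= Lk - M:
            V = 0
            for s in HT.get(hash(Dk[P:P + M]), []):
                if d[s:s + M] != Dk[P:P + M]:
                    continue        # step 8: hash collision, not a match
                v = M               # extend the equal sub-strings forward
                while s + v < L and P + v < Lk and d[s + v] == Dk[P + v]:
                    v += 1
                V = max(V, v)       # step 9: keep the longest extension
            if V > 0:               # step 10
                SUM, P = SUM + V, P + V
            else:
                P += 1
        sims.append(SUM / Lk if Lk else 0.0)  # step 12
    return sims
```

An identical reference scores 1.0, an unrelated one scores 0.0, and an edited copy lands in between, consistent with Theorem 1 below.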
For the normalized query document d and a reference document D, Algorithm 2 identifies a set of common sub-strings and sums up their lengths as SUM. Then the similarity SIM is measured by SUM / Length(D). One may ask why we do not include the length of d in the similarity. This is because we care more about how much of D is duplicated in the query document d than about how much of d is content of D. One can certainly design another formula to calculate the similarity from SUM and both lengths. Finally, we need to make sure SIM measures the similarity meaningfully.
This is guaranteed by the following theorem.
Theorem 1: The function SIM defined by algorithm 2 satisfies
the following properties for two normalized documents d and D:
1. 0 ≤SIM(d,D)≤ 1
2. If d and D are the same document, SIM(d,D)=1
3. If d and D have no common sub-strings at all,
SIM(d,D)=0.
Proof: From steps 4 to 11 of Algorithm 2, we have 0 ≤ SUM ≤ Length(D). That proves 0 ≤ SIM(d,D) ≤ 1. If d=D, it is not difficult to prove that SUM = Length(D), i.e., SIM(d,D)=1. The last assertion is trivial.
4. Asymmetric Fingerprint Generation
For some special applications such as DLP (data loss prevention) endpoint products, indexed fingerprint files created on servers must be delivered to remote machines that host searchers. It is necessary to use fewer fingerprints to represent a document in order to save network bandwidth and cost. In Algorithm 1, there are two important parameters for generating the fingerprints: the numbers M and N, where M is fixed and N is configured according to the text size as defined by a table.
Based on recent experimental research, we can reduce the fingerprints and keep almost the same recall rate if we apply a smaller number N to the function GEN at the indexer side while the N at the searcher side is kept the same. In other words, we can solve the NDDD problem even if the indexer generates far fewer fingerprints than the searcher. Table 2 is an example of defining different N's for indexer and searcher.
Table 2: Different N for Indexer and Searcher

Text Size Range    M    N for Indexer    N for Searcher
0-10K              4    8                128
10-20K             4    16               256
20-30K             4    32               256
30-50K             4    32               512
50-70K             4    64               1024
70-80K             4    128              1024
80-100K            4    256              1024
100-500K           4    512              1024
> 500K             4    1024             1024
This method is referred to as asymmetric fingerprint generation, while Algorithm 1 alone is the symmetric fingerprint generation. Its capability to keep almost the same recall rate is supported by the following theoretical results.
Definition 8: Let us assume M is a constant number. For any normalized document T, let us denote S(T, N) as the set of fingerprints that is extracted from T with the number N.
Theorem 2: Let T be any normalized document, and n and m be two positive integers. If n < m, we have S(T, n) ⊆ S(T, m), which means the set S(T, n) is a subset of S(T, m).
Proof: This is a natural outcome from the step 8 of algorithm 1.
Theorem 3: Let D and d be two versions of same normalized
document, and n and m be two positive integers. If n < m, we have
S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m).
Proof: Since n < m, we have S(d, n) ⊆ S(d, m) and S(D, n) ⊆ S(D, m) by Theorem 2. Therefore, we have S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) and S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). Together we have S(D, n) ∩ S(d, n) ⊆ S(D, n) ∩ S(d, m) ⊆ S(D, m) ∩ S(d, m). This completes the proof.
Theorem 3 implies that the recall rate of asymmetric fingerprint generation is between the two cases of symmetric fingerprint generation with the smaller and the larger number of fingerprints. As a matter of fact, the experimental data shows it is closer to the latter case while generating far fewer fingerprints at the indexer.
5. Experiments
In this section, we report a data experiment that we implemented with the asymmetric architecture of fingerprint generation defined by the parameters of Table 2. Both indexer and searcher reside on a server running Windows Server 2003 with an Intel Xeon E5405 @ 2.0GHz and 8GB of RAM.
We prepared the experimental data sets as follows:
Normalized documents for indexing:
Corpus 1: this set consists of 1 million plain text files in UTF-8 encoding. Let us denote corpus 1 as S1.
Corpus 2: this set consists of 2115 plain text files in many different languages and with different file sizes. They are totally irrelevant to the files in S1. Let us denote corpus 2 as S2.
Let S = S1 ∪ S2. All files in S are registered for fingerprint generation and indexing.
Normalized documents for querying:
Corpus 3: this set consists of 6*6*2115 = 76140 files. This corpus consists of documents that are made from S2 with 6 editing operations and 6 levels of changes presented in percentiles. Corpus 3 will be used for the querying experiment.
The 6 levels of changes are defined as 5%, 10%, 20%, 30%, 40% and 50%. For example, level 1 means we alter 5% of the content of an original file.
The 6 editing operations are ADD, ADH, ADE, DEL, CHG and MOV.
The 6 editing operations can be defined specifically as follows:
ADD: add a randomly generated block of chars at a
random position in the file.
ADH: add a randomly generated block of chars at a random position in the file. Also add a randomly generated block of chars, with block size randomly selected between 50 and 100, at the beginning of the file.
ADE: add a randomly generated block of chars at a randomly selected position in the file. Also add a randomly generated block of chars, with block size randomly selected between 50 and 100, at the end of the file.
DEL: delete a block of chars from the file. The start
point of deletion is randomly selected.
CHG : replace a randomly selected block of chars in the
file with a randomly generated block of chars.
MOV: move a randomly selected block of chars in the
file to a random position in the file.
Table 3: Querying time in seconds

Change level    Total file number    Total time (s)    Sec per file (avg)
5%              12690                1727              0.136
10%             12690                1776              0.139
20%             12690                1680              0.132
30%             12690                1709              0.134
40%             12690                1699              0.133
50%             12690                1649              0.129
Table 4: Numbers of files matched at each change level

Change level    ADD     ADH     ADE     DEL     CHG     MOV
5%              2080    2079    2082    2074    2071    2055
10%             2079    2069    2079    2073    2067    2055
20%             2045    2047    2055    2063    2029    2046
30%             2027    2019    2023    2058    1979    2041
40%             1993    2000    1998    2021    1924    2049
50%             1969    1977    1978    2020    1894    2049
Table 5: Total recall rate at each change level
Change level Total Files Recall Rate
5% 12441 98.03%
10% 12422 97.88%
20% 12285 96.80%
30% 12147 95.72%
40% 11985 94.44%
50% 11887 93.67%
Figure 1: Recall vs change level for different operations.
Experiment steps:
1. Fingerprint and index all the files in S.
2. Set X% = 20%. Use each file from corpus 3 as a query document for the NDDD problem. The recall and precision are measured according to the query results, and the querying speed is measured in seconds.
The experimental results are shown in tables 3, 4 and 5 and figure 1.
Table 3 shows the performance when executing searches for the 6*2115=12690 query files at each change level, with the total time and the average time per file. For example, for change level 5%, the total time is 1727 seconds, which means 0.136 seconds per file on average. This is quite fast considering that the set S has more than 1 million fingerprinted documents.
Table 4 shows the number of matched files for each change level and editing operation. For example, for change level 5% and the ADD operation, 2080 out of 2115 query files were matched successfully, i.e., a recall of 98.3%. Figure 1 illustrates recall rate vs change level for each operation.
Table 5 shows the recall rates for all change levels. As the
document changes increase, the recall rate drops. The worst recall
rate is 93.67% when the change is around 50%.
We should mention that there are no false positives among all 76140 query files. This is a natural outcome for the following reasons:
GEN and SIM are two string matching functions that are independently constructed.
Even if fingerprint matching produces false positives, the X% similarity threshold stops them.
6. Conclusion
This article has examined and solved the problem of near
duplicate document detection. What we have studied can be
summarized as follows:
Formal definition for the problem NDDD.
Text models are discussed for effective representation. A language independent text model is selected to represent the documents.
An NDDD model is proposed to refine the problem definition, decomposing the NDDD problem into three separate sub-problems that can be solved independently.
Algorithms are introduced to extract document
fingerprints and calculate document similarity.
An architecture of asymmetric fingerprint generation is introduced to reduce the number of fingerprints for some special applications.
The data experiment shows that our algorithmic solution has good performance, near-zero false positives and a high recall rate even when documents change by up to 50%.
The problem definition and algorithmic solution in this article have advantages over other approaches. The solution has near-zero false positives, since the similarity calculation is independent of the fingerprint generation. The recall rate is high because the fingerprints are robust under moderate document changes. Finally, the solution is language independent, which means we can apply it to documents written in any language and even to documents written in multiple languages.
7. REFERENCES
[1] Manber, U.1994. Finding Similar Files In A Large File
System. Proceedings of the USENIX Winter 1994 Technical
Conference, San Francisco, California
[2] Shivakumar, N. and Garcia-Molina, H. 1999. Finding near-replicas of documents on the web. Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 1590, 204-212.
[3] Lopresti, D. P. 1999. Models and Algorithms for Duplicate
Document Detection. Proceedings of the Fifth International
Conference on Document Analysis and Recognition, Bangalore,
India, 297-300, September, 1999
[4] Broder, A. Z. 2000. Identifying and Filtering Near-Duplicate Documents. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, UK. Springer-Verlag, pp. 1-10, 2000.
[5] Campbell, D. M. , Chen,W.R. and Smith, R. D.. 2000. Copy
detection systems for digital documents. Proceedings of
Advances in Digital Libraries , pp. 78-88, 2000
[6] Ignatov, D. I. and Jánosi-Rancz, K. T. 2009. Towards a
framework for near-duplicate detection in a document collections
based on closed sets of attributes. ACTA Univ. Sapientiae,
Informatica, 1, 2 (2009), 215-233
[7] Kumar, J.P. and Govindarajulu, P. 2009. Duplicate and Near
Duplicate Documents Detection: A Review. European Journal of
Scientific Research, 32, 4 (2009), 514-527.
[8] Ren, L.,Tan, D., Huang, F., Huang S. and Dong, A. 2009.
Matching engine with signature generation. US patent 7,516,130.
[9] Ren, L., Huang S, Huang, F., Dong, A. and Tan, D. 2010.
Matching engine for querying relevant documents . US patent
7,747,642.
[10] Ren, L., Huang S., Huang, F. and Lin, Y. 2010. Document
matching engine using asymmetric signature generation. US
patent 7,860,853.