3. ABSTRACT
This project aims to create an algorithm that can effectively detect duplicate records in a dataset. The proposed algorithm involves preprocessing the data, identifying potential duplicates using similarity measures, and clustering the records into groups for review and removal. The project will use Java along with existing libraries and frameworks for data preprocessing and analysis. The goal is to develop an accurate and efficient algorithm that can be integrated into data management systems to improve data quality and integrity.
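For illustration, one way to realize the "similarity measures" step mentioned above is token-based Jaccard similarity. The sketch below is only one possible form of that step; the class name, example titles, and the 0.8 threshold are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: Jaccard similarity over word tokens as one possible
// measure for flagging potential duplicate records.
public class JaccardSimilarity {

    // Returns |A ∩ B| / |A ∪ B| over the lower-cased word tokens of two records.
    static double jaccard(String a, String b) {
        Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        tokensA.retainAll(tokensB); // tokensA now holds the intersection
        return union.isEmpty() ? 1.0 : (double) tokensA.size() / union.size();
    }

    public static void main(String[] args) {
        // Pairs scoring above a chosen threshold (e.g. 0.8) could be
        // clustered together for review, as described in the abstract.
        System.out.println(jaccard("Introduction to Data Mining",
                                   "introduction to  data mining")); // prints 1.0
    }
}
```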
5. DUPLICATE DETECTION IS IMPORTANT FOR SEVERAL REASONS
ACCURACY: Duplicate records can lead to inaccurate data analysis and reporting.
EFFICIENCY: Duplicate records can lead to inefficiencies in data storage and processing.
ECONOMY: Duplicate records can be costly, particularly in industries where data is a critical asset.
COMPLIANCE: In certain industries, such as healthcare, duplicate records can lead to compliance issues.
DATA MERGING: Duplicate records can make it difficult to merge data from different sources.
6. LITERATURE REVIEW
The literature on duplicate record detection is extensive and varied, with many different techniques and approaches used to address this important problem in data management.
7. ISSUES IN EXISTING ALGORITHMS
SCALABILITY: The algorithm should be able to handle large datasets efficiently without consuming excessive processing power or time.
ALGORITHM BIAS: The algorithm may be biased towards certain types of duplicates or certain types of documents.
HUMAN REVIEW: The algorithm should include a process for human review of identified duplicates to ensure accuracy and minimize errors.
SPEED: The algorithm should be able to identify duplicates quickly and efficiently, especially in real-time systems where speed is critical.
PRIVACY CONCERNS: The algorithm should protect sensitive data and preserve privacy while still being effective at detecting duplicates.
9. PROPOSAL
Our proposal is to develop a website that takes strings, such as PDF titles, names, words, or sentences, as input, stores them in a database, and reports any duplicate entries present in the database.
10. HASH BASED ALGORITHMS
Among these approaches, hash-based algorithms are the most efficient, and hashing underlies many of the other techniques. We therefore decided to use the MD5 algorithm, a hash-based algorithm designed by Ronald Rivest in 1991.
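As a minimal sketch, an MD5 digest can be computed in Java with the standard java.security.MessageDigest API; the example string below is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketch: compute the MD5 digest of a record string and render it
// in the usual 32-character hex form.
public class Md5Demo {

    static String md5Hex(String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b)); // two hex characters per byte
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Identical strings produce identical hashes, which is the property
        // a hash-based duplicate check relies on.
        System.out.println(md5Hex("Introduction to Data Mining"));
        System.out.println(md5Hex("Introduction to Data Mining"));
    }
}
```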
14. SIMULATION
Tools used:
We are using Java as the programming language for our algorithm.
For reading the files we are using Java's built-in BufferedReader class to read the text files.
For the MD5 algorithm we are using Java's built-in MessageDigest class (from java.security), which provides the MD5 hash value for a string input.
For storing the database we will use the Mongoose library for MongoDB. Mongoose enables us to store the data and to edit, delete, and retrieve it whenever needed.
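Putting these pieces together, a minimal end-to-end sketch under the stated assumptions (the file name titles.txt is hypothetical) reads one title per line with BufferedReader, hashes it with MessageDigest, and reports a title as a duplicate when its hash has been seen before:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: flag duplicate titles by comparing MD5 digests.
public class DuplicateTitleFinder {

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("titles.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Hash the trimmed title; identical titles collide on purpose.
                byte[] digest = md.digest(line.trim().getBytes(StandardCharsets.UTF_8));
                StringBuilder hex = new StringBuilder();
                for (byte b : digest) hex.append(String.format("%02x", b));
                if (!seen.add(hex.toString())) {
                    System.out.println("Duplicate title: " + line);
                }
            }
        }
    }
}
```

In the full system, the hash-to-title mapping would live in the MongoDB database rather than an in-memory set.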
19. FUTURE WORK
So far, our algorithm is able to find the duplicate titles among the given files.
Our future work will focus on website design and development.
First we will design the website in Figma, and then we will build the front-end using HTML, CSS, and JavaScript.
We will then use Mongoose in the back-end to store the titles and return any duplicate titles found.