SlideShare a Scribd company logo
1 of 6
Download to read offline
DOI: 10.23883/IJRTER.2017.3404.4SNDK
196
Examination of Document Similarity Using Rabin-Karp Algorithm
Ranti Eka Putri1
, Andysah Putera Utama Siahaan2
1
Faculty of Computer Science, Universitas Pembanguan Panca Budi, Medan, Indonesia
2
Ph.D. Student of School of Computer and Communication Engineering, Universiti Malaysia Perlis, Kangar,
Malaysia
Abstract β€” Documents do not always have the same content. However, the similarity between
documents often occurs in the world of writing scientific papers. Some similarities occur because of a
coincidence, but something happens because of the element of intent. On documents that have little
content, this can be checked by the eyes. However, on documents that have thousands of lines and
pages, of course, it is impossible. To anticipate it, it takes a way that can analyze plagiarism techniques
performed. Many methods can examine the resemblance of documents, one of them by using the
Rabin-Karp algorithm. The algorithm is very well since it has a determination for syllable cuts (K-
Grams). This algorithm looks at how many hash values are the same in both documents. The
percentage of plagiarism can also be adjusted up to a few percent according to the need for examination
of the document. Implementation of this algorithm is beneficial for an institution to do the filtering of
incoming documents. It is usually done at the time of receipt of a scientific paper to be published.
Keywords β€”Text Mining, Plagiarism, Rabin-Karp
I. INTRODUCTION
Information is essential in the world of education, especially scientific information. Since
information can be accessed online, this results in information being easily modified. Files downloaded
from the internet allow users to edit. This file can then be saved using a new or even renamed name
with the new user. This process happens so quickly without having to use certain techniques. The
development of this technology has a positive value. Along with the advancement of the era, the
progress of this technology can not be separated from the negative impact it produces. Modification
of information without listing the main source is an action that violates the rules. The modification is
plagiarism [1]. It is an act of abuse, theft or robbery, of publication, of a declaration, or of declaring it
as a property of one's thoughts, ideas, writings, or creations that are not the author idea.
Performing a plagiarism is an easy thing especially with using internet connection. Plagiarism
can kill one's creativity in developing new ideas. It is a fun activity because it can be done easily and
quickly because this action does not require energy and not have to think hard. Plagiarism can be
prevented by using the help of string matching methods. The algorithm can be modified to analyze
text, images, and even sound. This study attempts to match the document matching using the Rabin-
Karp algorithm. This algorithm is known quickly regarding comparing documents [3][4]. Also, the
parameters in this algorithm can be adjusted to the target to be achieved. The author hopes that by
running this system, the action of plagiarism can be avoided.
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 197
II. THEORIES
2.1 Plagiarism
Information retrieval is part of computer science related to important documents which will
then be processed in conjunction with other data. It is an information search based on a query that is
expected to meet the previous goal. However, returning documents of plagiarism action may occur.
Plagiarism is a process of plagiarism or recognition of articles, opinions, papers and so on that are not
their own. It is to make the property of another person self-owned without the name of the source. The
person doing the plagiarism is called a plagiarist. It is including a criminal act that is falsifying the
work of others. It is also called copyright theft. Any quotation of words or ideas, the author must
include the name of the original owner. It is also like a book owned by the author may not be reprinted
without the permission of the author or publisher of the essay [2].
In practicing plagiarism, it is not always based on the element of intent. Some have become
plagiarism due to lack of information or reference in making a scientific work. Below are the most
common types of plagiarism:
- Accidental
It occurs since a lack of knowledge of plagiarism and understanding of reference writing. It usually
happens when writing a scientific paper is not based on literature review.
- Unintentional
Information that has frequently been discussed and rewritten again with words that are almost the
same. The same idea can produce different writing if designed well so plagiarism can be avoided.
- Intentional
The act of deliberately quoting a sentence or the whole of another person's work without the
citation of the person's work.
- Self-plagiarism
The use of self-made work in other forms without developing the values or variables present in the
previous work.
The detector of plagiarism is divided into two parts, fingerprinting and full-text comparison.
- Fingerprinting Comparison
It is a technique used to check the relationships between documents whether all the text contained
in a document or text. This technique will break the words on the paper to form a syllable or row
of characters of a certain length. This technique is called hashing. The most commonly used
algorithm is Rabin-Karp.
- Full-text Comparison
This technique performs a content comparison of two documents. It does text comparisons one by
one on each document content. The downside is that it takes longer to compare large documents.
However, the results obtained are quite satisfactory because the results will be used and stored in
a database. Complete text comparison methods can not be applied to documents that are not on the
same storage. The algorithms used in this approach are Brute-Force, Boyer Moore, and
Levenshtein Distance.
2.2 Rabin-Karp
Rabin-Karp algorithm is a search algorithm that searches for a substring pattern in a text using
hashing. It is very effective for multi-pattern matching words [5][7]. One of the practical applications
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 198
of Rabin-Karp's algorithm is plagiarism detection. Rabin-Karp relies on a hash function to determine
the percentage of plagiarism. The accuracy level can be adjusted based on this feature. The hash
function is a function that determines the feature value of a particular syllable fraction. It converts each
string into a number, called a hash value. Rabin-Karp algorithm determines hash value based on the
same word (Figure 1) [6]. There are two barriers in determining the hash value. First, many different
strings are in a particular sentence. This problem can be solved by assigning multiple strings with the
same hash value. The next problem is not necessarily the string that has the same hash value match to
overcome it for each string is assigned to brute-force technique. Rabin-Karp requires a large prime
number to avoid possible hash values similar to different words.
Figure 1 Rabin-Karp hash example
III.IMPLEMENTATION
3.1 Rabin-Karp Process
This stage performs semantic and syntactic analysis of the text. The purpose of the initial
processing is to prepare the text for data that will undergo further processing. The operations that can
be performed at this stage include the process of removing unnecessary parts of the testing process. It
is done to select the data that has been eligible for execution. Filtering is a classification process to
determine the words that will be used in the process of finding the common word. Each sentence will
be broken down into words that will ultimately be a waste of useless words. The document index is a
set of terms that indicate the content or topic contained by the document. Usually, this will be divided
according to need. The index will distinguish a document from other documents that are in the
collection.
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 199
The steps that occur in the Rabin-Karb process are as follows:
- Tokenizing
That is to convert a document into a collection of words by entering the words in an array and
separating the punctuation and numbers that are not included in the important words. This process
will also change to lower case.
- Stopword Removal
The process of removing basic words that always exist in the document such as: because, with,
and, or, not and others.
- Stemming
The process of changing the words that still have the prefix and suffix so that it becomes a basic
word.
- Hashing
The process of weighting each word in a document with a value based on a predetermined formula.
3.2 Rabin-Karp Calculation
Hashing is the most important value in the Rabin-Karp algorithm. The result of hashing
letters of k-gram with a certain number of bases is obtained by multiplying the ASCII value with
predetermined numbers where the base is prime. Rabin-Karp method has provisions if two strings are
same then the hash value must be the same as well. Here is an example calculation on Rabin-Karp
algorithm. Assume the text is MEDAN.
K-Gram = 5
Basis = 7
A = MEDAN
A(1) = 77
A(2) = 69
A(3) = 68
A(4) = 65
A(5) = 78
Hash = (77 βˆ— 74) + (69 βˆ— 73) + (68 βˆ— 73) + (65 βˆ— 72) + (78 βˆ— 71)
= 235599
Tokenizing
Stemming
Stopword Removal
Hashing
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 200
The hash calculation result is 235599. This action is done until all the words on the list are fulfilled.
The following tables 1 and 2 are examples of comparison of documents after the hash values are
obtained. The hash value in the first table will be computed by the hash value of the second table.
Table 1. Hash value of document one
19875 16830 23124 17433 20546
21489 26753 13498 23846 16528
21848 28447 29994 10301 13009
18832 27217 23157 25854 22492
14952 14337 29348 19978 28809
13485 14188 13131 21215 12053
25669 13809 26508 19455 25356
29964 17723 26633 17445 11803
19477 27142 24814 15155 26266
28432 19007 21896 16625 20681
Table 2. Hash value of document two
28432 26406 28424 13930 19187
18049 10867 18516 26753 19975
10152 13053 24120 21896 18351
12605 25101 21215 20750 15513
22949 26006 25045 25932 10695
13254 21504 20286 22492 10615
25565 29941 17403 23018 22666
19744 19769 19877 29535 13139
25669 16830 14297 20916 24640
16960 20681 13131 13009 18947
There are ten pieces of the same hash that both tables have. After calculating the similar hash value,
the next step is to calculate the percentage of similarity of the two documents. The formula used is as
follows:
P =
2 βˆ— SH
THA + THB
βˆ— 100%
Where:
P = Plagiarism Rate
SH = Identical Hash
THA = Total Hash in Document A
THB = Total Hash in Document B
In the previous calculation there are ten values that have the similar value. So the plagiarism level
calculation is as follows.
P =
2βˆ—10
50+50
βˆ— 100%
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 201
=
20
100
βˆ— 100%
= 0.5 * 100%
= 20%
The percentage of plagiarism held by both documents is 20%.
IV.CONCLUSION
Rabin-Karp algorithm is very well done to calculate the percentage of document similarity. In
addition to the fast process, this algorithm has adjustable parameters to adjust the accuracy of the
assessment. Calculation of hash value greatly affects the result of this algorithm. Adjustments should
still be made when selecting the K-Gram value to be used. Each analyst can determine the feasibility
tolerance for each document whether he belongs to the category of plagiarism or not. The disadvantage
of this algorithm is that the system can never know which documents came first. The algorithm can
only determine that there are similarities that occur in the comparable documents.
REFERENCES
[1] S. K. Shivaji and P. S., "Plagiarism Detection by using Karp-Rabin and String Matching Algorithm
Together," International Journal of Computer Applications, vol. 116, no. 23, pp. 37-41, 2015.
[2] A. Parker and J. O. Hamblen, "Computer Algorithm for Plagiarism Detection," IEEE Trans.
Education, vol. 32, no. 2, pp. 94-99, 1989.
[3] Sunita, R. Malik and M. Gulia, "Rabin-Karp Algorithm with Hashing a String Matching Tool,"
International Journal of Advanced Research in Computer Science and Software Engineering, vol.
4, no. 3, pp. 389-392, 2014.
[4] A. P. Gope and R. N. Behera, "A Novel Pattern Matching Algorithm in Genome," International
Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5450-5457, 2014.
[5] A. P. U. Siahaan, Mesran, R. Rahim and D. Siregar, "K-Gram As A Determinant Of Plagiarism
Level In Rabin-Karp Algorithm," International Journal of Scientific & Technology Research, vol.
6, no. 7, pp. 350-353, 2017.
[6] S. Popov, "Algorithm of the Week: Rabin-Karp String Searching," DZone / Java Zone, 3 April
2012. [Online]. Available: https://dzone.com/articles/algorithm-week-rabin-karp. [Accessed 20
August 2017].
[7] J. Sharma and M. Singh, "CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection
on a Multicore GPU," International Journal of Computer Network and Information Security, vol.
10, no. 8, pp. 70-77, 2015.

More Related Content

What's hot

Computing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engineComputing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search enginecsandit
Β 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywordsswathi78
Β 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsShubhangi Tandon
Β 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean modelVaibhav Khanna
Β 
A framework for plagiarism
A framework for plagiarismA framework for plagiarism
A framework for plagiarismcsandit
Β 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositoriesfeiwin
Β 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserWaqas Tariq
Β 
Independent Study_Final Report
Independent Study_Final ReportIndependent Study_Final Report
Independent Study_Final ReportShikha Swami
Β 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
Β 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetSameera Horawalavithana
Β 
Using Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic WebUsing Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic WebIJwest
Β 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case StudyIJERA Editor
Β 
In3415791583
In3415791583In3415791583
In3415791583IJERA Editor
Β 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
Β 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibEl Habib NFAOUI
Β 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. ElasticsearchSelecto
Β 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationcsandit
Β 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...cscpconf
Β 

What's hot (20)

Computing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engineComputing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engine
Β 
BoTLRet: A Template-based Linked Data Information Retrieval
 BoTLRet: A Template-based Linked Data Information Retrieval BoTLRet: A Template-based Linked Data Information Retrieval
BoTLRet: A Template-based Linked Data Information Retrieval
Β 
ACL-IJCNLP 2015
ACL-IJCNLP 2015ACL-IJCNLP 2015
ACL-IJCNLP 2015
Β 
ast nearest neighbor search with keywords
ast nearest neighbor search with keywordsast nearest neighbor search with keywords
ast nearest neighbor search with keywords
Β 
TextRank: Bringing Order into Texts
TextRank: Bringing Order into TextsTextRank: Bringing Order into Texts
TextRank: Bringing Order into Texts
Β 
Information retrieval 7 boolean model
Information retrieval 7 boolean modelInformation retrieval 7 boolean model
Information retrieval 7 boolean model
Β 
A framework for plagiarism
A framework for plagiarismA framework for plagiarism
A framework for plagiarism
Β 
Finding Similar Files in Large Document Repositories
Finding Similar Files in Large Document RepositoriesFinding Similar Files in Large Document Repositories
Finding Similar Files in Large Document Repositories
Β 
Implementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic ParserImplementation of Urdu Probabilistic Parser
Implementation of Urdu Probabilistic Parser
Β 
Independent Study_Final Report
Independent Study_Final ReportIndependent Study_Final Report
Independent Study_Final Report
Β 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
Β 
Duplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy DatasetDuplicate Detection on Hoaxy Dataset
Duplicate Detection on Hoaxy Dataset
Β 
Using Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic WebUsing Page Size for Controlling Duplicate Query Results in Semantic Web
Using Page Size for Controlling Duplicate Query Results in Semantic Web
Β 
Document Retrieval System, a Case Study
Document Retrieval System, a Case StudyDocument Retrieval System, a Case Study
Document Retrieval System, a Case Study
Β 
In3415791583
In3415791583In3415791583
In3415791583
Β 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Β 
Information_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_HabibInformation_Retrieval_Models_Nfaoui_El_Habib
Information_Retrieval_Models_Nfaoui_El_Habib
Β 
Search engine. Elasticsearch
Search engine. ElasticsearchSearch engine. Elasticsearch
Search engine. Elasticsearch
Β 
Semantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' informationSemantic tagging for documents using 'short text' information
Semantic tagging for documents using 'short text' information
Β 
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
EVALUATION OF THE SHAPD2 ALGORITHM EFFICIENCY IN PLAGIARISM DETECTION TASK US...
Β 

Similar to Examination of Document Similarity Using Rabin-Karp Algorithm

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
Β 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsIRJET Journal
Β 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Β 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSgerogepatton
Β 
An Application of Pattern matching for Motif Identification
An Application of Pattern matching for Motif IdentificationAn Application of Pattern matching for Motif Identification
An Application of Pattern matching for Motif IdentificationCSCJournals
Β 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346IJERA Editor
Β 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach IJCSIS Research Publications
Β 
Online Plagiarism Checker
Online Plagiarism CheckerOnline Plagiarism Checker
Online Plagiarism CheckerIRJET Journal
Β 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
Β 
Nature Inspired Models And The Semantic Web
Nature Inspired Models And The Semantic WebNature Inspired Models And The Semantic Web
Nature Inspired Models And The Semantic WebStefan Ceriu
Β 
Sentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersSentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersMOHDSAIFWAJID1
Β 
Efficient instant fuzzy search with proximity ranking
Efficient instant fuzzy search with proximity rankingEfficient instant fuzzy search with proximity ranking
Efficient instant fuzzy search with proximity rankingShakas Technologies
Β 
Efficient Parallel Pruning of Associative Rules with Optimized Search
Efficient Parallel Pruning of Associative Rules with Optimized  SearchEfficient Parallel Pruning of Associative Rules with Optimized  Search
Efficient Parallel Pruning of Associative Rules with Optimized SearchIOSR Journals
Β 
A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism DetectionKarla Adamson
Β 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
Β 

Similar to Examination of Document Similarity Using Rabin-Karp Algorithm (20)

International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
Β 
Ijetcas14 624
Ijetcas14 624Ijetcas14 624
Ijetcas14 624
Β 
A Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming AssignmentsA Literature Review on Plagiarism Detection in Computer Programming Assignments
A Literature Review on Plagiarism Detection in Computer Programming Assignments
Β 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
Β 
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMSA COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
Β 
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
Β 
An Application of Pattern matching for Motif Identification
An Application of Pattern matching for Motif IdentificationAn Application of Pattern matching for Motif Identification
An Application of Pattern matching for Motif Identification
Β 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
Β 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
Β 
Online Plagiarism Checker
Online Plagiarism CheckerOnline Plagiarism Checker
Online Plagiarism Checker
Β 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
Β 
Nature Inspired Models And The Semantic Web
Nature Inspired Models And The Semantic WebNature Inspired Models And The Semantic Web
Nature Inspired Models And The Semantic Web
Β 
Sentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clustersSentence similarity-based-text-summarization-using-clusters
Sentence similarity-based-text-summarization-using-clusters
Β 
Measure Term Similarity Using a Semantic Network Approach
Measure Term Similarity Using a Semantic Network ApproachMeasure Term Similarity Using a Semantic Network Approach
Measure Term Similarity Using a Semantic Network Approach
Β 
Efficient instant fuzzy search with proximity ranking
Efficient instant fuzzy search with proximity rankingEfficient instant fuzzy search with proximity ranking
Efficient instant fuzzy search with proximity ranking
Β 
Efficient Parallel Pruning of Associative Rules with Optimized Search
Efficient Parallel Pruning of Associative Rules with Optimized  SearchEfficient Parallel Pruning of Associative Rules with Optimized  Search
Efficient Parallel Pruning of Associative Rules with Optimized Search
Β 
A Survey On Plagiarism Detection
A Survey On Plagiarism DetectionA Survey On Plagiarism Detection
A Survey On Plagiarism Detection
Β 
At33264269
At33264269At33264269
At33264269
Β 
At33264269
At33264269At33264269
At33264269
Β 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
Β 

More from Universitas Pembangunan Panca Budi

Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...Universitas Pembangunan Panca Budi
Β 
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa Universitas Pembangunan Panca Budi
Β 
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...Universitas Pembangunan Panca Budi
Β 
Insecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed SecurityInsecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed SecurityUniversitas Pembangunan Panca Budi
Β 
Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on GraphPrim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on GraphUniversitas Pembangunan Panca Budi
Β 
Multi-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose DecisionMulti-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose DecisionUniversitas Pembangunan Panca Budi
Β 
Mobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny AlgorithmMobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny AlgorithmUniversitas Pembangunan Panca Budi
Β 
Super-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE AlgorithmSuper-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE AlgorithmUniversitas Pembangunan Panca Budi
Β 
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...Universitas Pembangunan Panca Budi
Β 
Prototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic EnglishPrototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic EnglishUniversitas Pembangunan Panca Budi
Β 
TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...Universitas Pembangunan Panca Budi
Β 
Violations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in IndonesiaViolations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in IndonesiaUniversitas Pembangunan Panca Budi
Β 
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...Universitas Pembangunan Panca Budi
Β 
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory PlanningPrim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory PlanningUniversitas Pembangunan Panca Budi
Β 
A Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network SecurityA Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network SecurityUniversitas Pembangunan Panca Budi
Β 
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor MethodExpert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor MethodUniversitas Pembangunan Panca Budi
Β 

More from Universitas Pembangunan Panca Budi (20)

Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Β 
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
Β 
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Β 
Insecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed SecurityInsecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed Security
Β 
Online Shoppers Acceptance: An Exploratory Study
Online Shoppers Acceptance: An Exploratory StudyOnline Shoppers Acceptance: An Exploratory Study
Online Shoppers Acceptance: An Exploratory Study
Β 
Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on GraphPrim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Β 
Multi-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose DecisionMulti-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Β 
Mobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny AlgorithmMobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny Algorithm
Β 
Super-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE AlgorithmSuper-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE Algorithm
Β 
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Β 
Prototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic EnglishPrototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic English
Β 
TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...
Β 
Violations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in IndonesiaViolations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in Indonesia
Β 
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Β 
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory PlanningPrim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Β 
Image Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface CalculationImage Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface Calculation
Β 
Data Compression Using Elias Delta Code
Data Compression Using Elias Delta CodeData Compression Using Elias Delta Code
Data Compression Using Elias Delta Code
Β 
A Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network SecurityA Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network Security
Β 
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor MethodExpert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Β 
Threats of Computer System and its Prevention
Threats of Computer System and its PreventionThreats of Computer System and its Prevention
Threats of Computer System and its Prevention
Β 

Recently uploaded

Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
Β 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
Β 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
Β 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
Β 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
Β 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docxPoojaSen20
Β 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfakmcokerachita
Β 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Celine George
Β 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
Β 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
Β 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
Β 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting DataJhengPantaleon
Β 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
Β 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
Β 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
Β 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
Β 

Recently uploaded (20)

Model Call Girl in Tilak Nagar Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at πŸ”9953056974πŸ”Model Call Girl in Tilak Nagar Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at πŸ”9953056974πŸ”
Β 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
Β 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
Β 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
Β 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Β 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
Β 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Β 
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Β 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
Β 
Class 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdfClass 11 Legal Studies Ch-1 Concept of State .pdf
Class 11 Legal Studies Ch-1 Concept of State .pdf
Β 
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Incoming and Outgoing Shipments in 1 STEP Using Odoo 17
Β 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
Β 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
Β 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
Β 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
Β 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
Β 
Model Call Girl in Bikash Puri Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Bikash Puri  Delhi reach out to us at πŸ”9953056974πŸ”Model Call Girl in Bikash Puri  Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Bikash Puri Delhi reach out to us at πŸ”9953056974πŸ”
Β 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
Β 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
Β 
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
Β 

Examination of Document Similarity Using Rabin-Karp Algorithm

  • 1. DOI: 10.23883/IJRTER.2017.3404.4SNDK 196 Examination of Document Similarity Using Rabin-Karp Algorithm Ranti Eka Putri1 , Andysah Putera Utama Siahaan2 1 Faculty of Computer Science, Universitas Pembanguan Panca Budi, Medan, Indonesia 2 Ph.D. Student of School of Computer and Communication Engineering, Universiti Malaysia Perlis, Kangar, Malaysia Abstract β€” Documents do not always have the same content. However, the similarity between documents often occurs in the world of writing scientific papers. Some similarities occur because of a coincidence, but something happens because of the element of intent. On documents that have little content, this can be checked by the eyes. However, on documents that have thousands of lines and pages, of course, it is impossible. To anticipate it, it takes a way that can analyze plagiarism techniques performed. Many methods can examine the resemblance of documents, one of them by using the Rabin-Karp algorithm. The algorithm is very well since it has a determination for syllable cuts (K- Grams). This algorithm looks at how many hash values are the same in both documents. The percentage of plagiarism can also be adjusted up to a few percent according to the need for examination of the document. Implementation of this algorithm is beneficial for an institution to do the filtering of incoming documents. It is usually done at the time of receipt of a scientific paper to be published. Keywords β€”Text Mining, Plagiarism, Rabin-Karp I. INTRODUCTION Information is essential in the world of education, especially scientific information. Since information can be accessed online, this results in information being easily modified. Files downloaded from the internet allow users to edit. This file can then be saved using a new or even renamed name with the new user. This process happens so quickly without having to use certain techniques. The development of this technology has a positive value. Along with the advancement of the era, the progress of this technology can not be separated from the negative impact it produces. Modification of information without listing the main source is an action that violates the rules. The modification is plagiarism [1]. It is an act of abuse, theft or robbery, of publication, of a declaration, or of declaring it as a property of one's thoughts, ideas, writings, or creations that are not the author idea. Performing a plagiarism is an easy thing especially with using internet connection. Plagiarism can kill one's creativity in developing new ideas. It is a fun activity because it can be done easily and quickly because this action does not require energy and not have to think hard. Plagiarism can be prevented by using the help of string matching methods. The algorithm can be modified to analyze text, images, and even sound. This study attempts to match the document matching using the Rabin- Karp algorithm. This algorithm is known quickly regarding comparing documents [3][4]. Also, the parameters in this algorithm can be adjusted to the target to be achieved. The author hopes that by running this system, the action of plagiarism can be avoided.
  • 2. International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved 197 II. THEORIES 2.1 Plagiarism Information retrieval is part of computer science related to important documents which will then be processed in conjunction with other data. It is an information search based on a query that is expected to meet the previous goal. However, returning documents of plagiarism action may occur. Plagiarism is a process of plagiarism or recognition of articles, opinions, papers and so on that are not their own. It is to make the property of another person self-owned without the name of the source. The person doing the plagiarism is called a plagiarist. It is including a criminal act that is falsifying the work of others. It is also called copyright theft. Any quotation of words or ideas, the author must include the name of the original owner. It is also like a book owned by the author may not be reprinted without the permission of the author or publisher of the essay [2]. In practicing plagiarism, it is not always based on the element of intent. Some have become plagiarism due to lack of information or reference in making a scientific work. Below are the most common types of plagiarism: - Accidental It occurs since a lack of knowledge of plagiarism and understanding of reference writing. It usually happens when writing a scientific paper is not based on literature review. - Unintentional Information that has frequently been discussed and rewritten again with words that are almost the same. The same idea can produce different writing if designed well so plagiarism can be avoided. - Intentional The act of deliberately quoting a sentence or the whole of another person's work without the citation of the person's work. - Self-plagiarism The use of self-made work in other forms without developing the values or variables present in the previous work. The detector of plagiarism is divided into two parts, fingerprinting and full-text comparison. - Fingerprinting Comparison It is a technique used to check the relationships between documents whether all the text contained in a document or text. This technique will break the words on the paper to form a syllable or row of characters of a certain length. This technique is called hashing. The most commonly used algorithm is Rabin-Karp. - Full-text Comparison This technique performs a content comparison of two documents. It does text comparisons one by one on each document content. The downside is that it takes longer to compare large documents. However, the results obtained are quite satisfactory because the results will be used and stored in a database. Complete text comparison methods can not be applied to documents that are not on the same storage. The algorithms used in this approach are Brute-Force, Boyer Moore, and Levenshtein Distance. 2.2 Rabin-Karp Rabin-Karp algorithm is a search algorithm that searches for a substring pattern in a text using hashing. It is very effective for multi-pattern matching words [5][7]. One of the practical applications
  • 3. International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved 198 of Rabin-Karp's algorithm is plagiarism detection. Rabin-Karp relies on a hash function to determine the percentage of plagiarism. The accuracy level can be adjusted based on this feature. The hash function is a function that determines the feature value of a particular syllable fraction. It converts each string into a number, called a hash value. Rabin-Karp algorithm determines hash value based on the same word (Figure 1) [6]. There are two barriers in determining the hash value. First, many different strings are in a particular sentence. This problem can be solved by assigning multiple strings with the same hash value. The next problem is not necessarily the string that has the same hash value match to overcome it for each string is assigned to brute-force technique. Rabin-Karp requires a large prime number to avoid possible hash values similar to different words. Figure 1 Rabin-Karp hash example III.IMPLEMENTATION 3.1 Rabin-Karp Process This stage performs semantic and syntactic analysis of the text. The purpose of the initial processing is to prepare the text for data that will undergo further processing. The operations that can be performed at this stage include the process of removing unnecessary parts of the testing process. It is done to select the data that has been eligible for execution. Filtering is a classification process to determine the words that will be used in the process of finding the common word. Each sentence will be broken down into words that will ultimately be a waste of useless words. The document index is a set of terms that indicate the content or topic contained by the document. Usually, this will be divided according to need. The index will distinguish a document from other documents that are in the collection.
  • 4. International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved 199 The steps that occur in the Rabin-Karb process are as follows: - Tokenizing That is to convert a document into a collection of words by entering the words in an array and separating the punctuation and numbers that are not included in the important words. This process will also change to lower case. - Stopword Removal The process of removing basic words that always exist in the document such as: because, with, and, or, not and others. - Stemming The process of changing the words that still have the prefix and suffix so that it becomes a basic word. - Hashing The process of weighting each word in a document with a value based on a predetermined formula. 3.2 Rabin-Karp Calculation Hashing is the most important value in the Rabin-Karp algorithm. The result of hashing letters of k-gram with a certain number of bases is obtained by multiplying the ASCII value with predetermined numbers where the base is prime. Rabin-Karp method has provisions if two strings are same then the hash value must be the same as well. Here is an example calculation on Rabin-Karp algorithm. Assume the text is MEDAN. K-Gram = 5 Basis = 7 A = MEDAN A(1) = 77 A(2) = 69 A(3) = 68 A(4) = 65 A(5) = 78 Hash = (77 βˆ— 74) + (69 βˆ— 73) + (68 βˆ— 73) + (65 βˆ— 72) + (78 βˆ— 71) = 235599 Tokenizing Stemming Stopword Removal Hashing
  • 5. International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved 200 The hash calculation result is 235599. This action is done until all the words on the list are fulfilled. The following tables 1 and 2 are examples of comparison of documents after the hash values are obtained. The hash value in the first table will be computed by the hash value of the second table. Table 1. Hash value of document one 19875 16830 23124 17433 20546 21489 26753 13498 23846 16528 21848 28447 29994 10301 13009 18832 27217 23157 25854 22492 14952 14337 29348 19978 28809 13485 14188 13131 21215 12053 25669 13809 26508 19455 25356 29964 17723 26633 17445 11803 19477 27142 24814 15155 26266 28432 19007 21896 16625 20681 Table 2. Hash value of document two 28432 26406 28424 13930 19187 18049 10867 18516 26753 19975 10152 13053 24120 21896 18351 12605 25101 21215 20750 15513 22949 26006 25045 25932 10695 13254 21504 20286 22492 10615 25565 29941 17403 23018 22666 19744 19769 19877 29535 13139 25669 16830 14297 20916 24640 16960 20681 13131 13009 18947 There are ten pieces of the same hash that both tables have. After calculating the similar hash value, the next step is to calculate the percentage of similarity of the two documents. The formula used is as follows: P = 2 βˆ— SH THA + THB βˆ— 100% Where: P = Plagiarism Rate SH = Identical Hash THA = Total Hash in Document A THB = Total Hash in Document B In the previous calculation there are ten values that have the similar value. So the plagiarism level calculation is as follows. P = 2βˆ—10 50+50 βˆ— 100%
  • 6. International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved 201 = 20 100 βˆ— 100% = 0.5 * 100% = 20% The percentage of plagiarism held by both documents is 20%. IV.CONCLUSION Rabin-Karp algorithm is very well done to calculate the percentage of document similarity. In addition to the fast process, this algorithm has adjustable parameters to adjust the accuracy of the assessment. Calculation of hash value greatly affects the result of this algorithm. Adjustments should still be made when selecting the K-Gram value to be used. Each analyst can determine the feasibility tolerance for each document whether he belongs to the category of plagiarism or not. The disadvantage of this algorithm is that the system can never know which documents came first. The algorithm can only determine that there are similarities that occur in the comparable documents. REFERENCES [1] S. K. Shivaji and P. S., "Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together," International Journal of Computer Applications, vol. 116, no. 23, pp. 37-41, 2015. [2] A. Parker and J. O. Hamblen, "Computer Algorithm for Plagiarism Detection," IEEE Trans. Education, vol. 32, no. 2, pp. 94-99, 1989. [3] Sunita, R. Malik and M. Gulia, "Rabin-Karp Algorithm with Hashing a String Matching Tool," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 3, pp. 389-392, 2014. [4] A. P. Gope and R. N. Behera, "A Novel Pattern Matching Algorithm in Genome," International Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5450-5457, 2014. [5] A. P. U. Siahaan, Mesran, R. Rahim and D. Siregar, "K-Gram As A Determinant Of Plagiarism Level In Rabin-Karp Algorithm," International Journal of Scientific & Technology Research, vol. 6, no. 7, pp. 350-353, 2017. [6] S. Popov, "Algorithm of the Week: Rabin-Karp String Searching," DZone / Java Zone, 3 April 2012. [Online]. Available: https://dzone.com/articles/algorithm-week-rabin-karp. [Accessed 20 August 2017]. [7] J. Sharma and M. Singh, "CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection on a Multicore GPU," International Journal of Computer Network and Information Security, vol. 10, no. 8, pp. 70-77, 2015.