DOI: 10.23883/IJRTER.2017.3404.4SNDK
Examination of Document Similarity Using Rabin-Karp Algorithm
Ranti Eka Putri¹, Andysah Putera Utama Siahaan²
¹ Faculty of Computer Science, Universitas Pembangunan Panca Budi, Medan, Indonesia
² Ph.D. Student of School of Computer and Communication Engineering, Universiti Malaysia Perlis, Kangar, Malaysia
Abstract — Documents do not always have the same content. However, similarity between documents
occurs often in scientific writing. Some similarities are coincidental, while others are
intentional. For short documents, similarity can be checked by eye; for documents with thousands of
lines and pages, this is impractical. A method is therefore needed to analyze the plagiarism
techniques that have been applied. Many methods can examine the resemblance between documents, one
of which is the Rabin-Karp algorithm. The algorithm works well because the length of the text
fragments (K-Grams) can be set. It counts how many hash values the two documents have in common.
The plagiarism threshold can also be adjusted to suit the needs of the examination. Implementing
this algorithm helps an institution filter incoming documents, typically when a scientific paper is
received for publication.
Keywords — Text Mining, Plagiarism, Rabin-Karp
I. INTRODUCTION
Information is essential in the world of education, especially scientific information. Because
information can be accessed online, it can also be modified easily. Files downloaded from the
internet can be edited and then saved under a new name by a new user. This happens quickly and
requires no special technique. The development of this technology has positive value, but its
progress cannot be separated from the negative impact it produces. Modifying information without
citing the original source violates the rules; such modification is plagiarism [1]. Plagiarism is
the abuse or theft of a publication or statement by presenting thoughts, ideas, writings, or
creations as one's own when they are not the author's own.
Plagiarizing is easy, especially with an internet connection. Plagiarism can kill one's creativity
in developing new ideas. It is tempting because it can be done easily and quickly, without much
effort or hard thinking. Plagiarism can be prevented with the help of string matching methods. Such
algorithms can be adapted to analyze text, images, and even sound. This study matches documents
using the Rabin-Karp algorithm, which is known to compare documents quickly [3][4]. In addition,
the parameters of this algorithm can be adjusted to the target to be achieved. The authors hope
that running this system will help prevent plagiarism.
International Journal of Recent Trends in Engineering & Research (IJRTER)
Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457]
@IJRTER-2017, All Rights Reserved 197
II. THEORIES
2.1 Plagiarism
Information retrieval is the part of computer science concerned with retrieving important
documents, which are then processed together with other data. It is an information search based on
a query that is expected to meet the user's goal. However, the documents returned may include
plagiarized work. Plagiarism is the act of copying or claiming articles, opinions, papers, and so
on that are not one's own. It makes another person's property one's own without naming the source.
A person who commits plagiarism is called a plagiarist. Plagiarism can be a criminal act because it
falsifies the work of others; it is also called copyright theft. Whenever words or ideas are
quoted, the author must include the name of the original owner. Likewise, a book may not be
reprinted without the permission of its author or publisher [2].
Plagiarism is not always intentional. Some cases arise from a lack of information or references
when producing a scientific work. The most common types of plagiarism are:
- Accidental
It occurs because of a lack of knowledge about plagiarism and a poor understanding of how to write
references. It usually happens when a scientific paper is written without a literature review.
- Unintentional
Information that has been discussed frequently is rewritten in words that are almost the same. The
same idea can produce different writing if it is expressed carefully, so plagiarism can be avoided.
- Intentional
The act of deliberately quoting a sentence, or the whole of another person's work, without citing
that work.
- Self-plagiarism
The reuse of one's own work in another form without developing the values or variables present in
the previous work.
Plagiarism detection is divided into two approaches: fingerprinting and full-text comparison.
- Fingerprinting Comparison
This technique checks the relationship between documents, that is, whether text contained in one
document also appears in another. It breaks the words of the paper into syllables or rows of
characters of a certain length, and each fragment is then converted to a value by hashing. The most
commonly used algorithm is Rabin-Karp.
- Full-text Comparison
This technique compares the full content of two documents, piece by piece. The downside is that
comparing large documents takes longer. However, the results are quite satisfactory because they
can be stored in a database and reused. Full-text comparison cannot be applied to documents that
are not in the same storage. The algorithms used in this approach are Brute-Force, Boyer-Moore, and
Levenshtein Distance.
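As an illustration of the full-text approach, the following is a minimal Python sketch of the Levenshtein distance, one of the algorithms named above. The function name and the row-by-row dynamic-programming layout are choices made for this sketch, not details taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Every character of one text is compared with every character of the other, so the cost grows with the product of the two lengths. This is consistent with the remark above that full-text comparison is slow on large documents.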
2.2 Rabin-Karp
The Rabin-Karp algorithm is a search algorithm that looks for a substring pattern in a text using
hashing. It is very effective for multi-pattern matching [5][7]. One of the practical applications
of the Rabin-Karp algorithm is plagiarism detection. Rabin-Karp relies on a hash function to
determine the percentage of plagiarism, and the accuracy level can be adjusted through this
function. The hash function determines the feature value of a particular text fragment: it converts
each string into a number, called a hash value. The Rabin-Karp algorithm gives identical words
identical hash values (Figure 1) [6]. There are two obstacles in determining the hash value. First,
a sentence contains many different strings. This is handled by allowing multiple strings to share
the same hash value. Second, strings with the same hash value do not necessarily match; to overcome
this, strings with equal hash values are verified with a brute-force comparison. Rabin-Karp uses a
large prime number to reduce the chance that different words receive the same hash value.
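The search described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the base of 256, and the prime modulus 1,000,000,007 are assumptions chosen for the example.

```python
def rabin_karp_search(text: str, pattern: str,
                      base: int = 256, prime: int = 1_000_000_007) -> list[int]:
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Weight of the leading character of a window: base^(m-1) mod prime.
    high = pow(base, m - 1, prime)

    # Hash the pattern and the first window of the text.
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % prime
        t_hash = (t_hash * base + ord(text[i])) % prime

    matches = []
    for start in range(n - m + 1):
        # Equal hashes only suggest a match, so confirm it with a direct
        # (brute-force) comparison of the characters.
        if p_hash == t_hash and text[start:start + m] == pattern:
            matches.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % prime
    return matches


print(rabin_karp_search("abracadabra", "abra"))
```

The direct comparison after a hash match corresponds to the brute-force verification mentioned above, and taking the hash modulo a large prime keeps the values bounded while making collisions between different strings unlikely.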
Figure 1 Rabin-Karp hash example
III. IMPLEMENTATION
3.1 Rabin-Karp Process
This stage performs semantic and syntactic analysis of the text. The purpose of this initial
processing is to prepare the text as data for further processing. The operations at this stage
include removing parts of the text that are not needed for testing, so that only eligible data is
processed. Filtering is a classification process that determines which words will be used when
searching for common words. Each sentence is broken down into words, and the words that carry no
useful meaning are discarded. The document index is a set of terms that indicate the content or
topic of the document. Usually, it is divided according to need. The index distinguishes a document
from the other documents in the collection.
The steps in the Rabin-Karp process are as follows:
- Tokenizing
Converts a document into a collection of words by placing the words in an array and separating out
punctuation and numbers, which are not counted as important words. This process also converts the
text to lower case.
- Stopword Removal
Removes common words that always appear in a document, such as: because, with, and, or, not, and
others.
- Stemming
Reduces words that still carry a prefix or suffix to their base form.
- Hashing
Assigns each word in a document a value based on a predetermined formula.
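The first three steps above, followed by the split into k-grams that the hashing step operates on, can be sketched in Python as follows. Several details are assumptions made for this sketch. The stopword set contains only the examples named in the text. The suffix list is a toy stand-in for a real stemmer. Joining the remaining words before cutting k-grams is one possible reading of the process, not a detail confirmed by the paper.

```python
import re

# Only the stopwords named in the text; a real list would be much longer.
STOPWORDS = {"because", "with", "and", "or", "not"}
# Hypothetical toy suffix list; a real stemmer is language-specific.
SUFFIXES = ("ing", "ed", "s")


def tokenize(document: str) -> list[str]:
    """Lower-case the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", document.lower())


def remove_stopwords(words: list[str]) -> list[str]:
    return [w for w in words if w not in STOPWORDS]


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def k_grams(text: str, k: int) -> list[str]:
    """All substrings of length k, taken one character apart."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def preprocess(document: str, k: int) -> list[str]:
    words = [stem(w) for w in remove_stopwords(tokenize(document))]
    return k_grams("".join(words), k)


print(preprocess("Copying and pasting", 5))
```

Each k-gram returned by `preprocess` would then be passed to the hash function to obtain the values that are compared between documents.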
3.2 Rabin-Karp Calculation
The hash value is the most important quantity in the Rabin-Karp algorithm. The hash of the letters
of a k-gram is obtained by multiplying each ASCII value by a power of a predetermined base, where
the base is prime. The Rabin-Karp method requires that if two strings are the same, their hash
values must also be the same. The following is an example calculation for the Rabin-Karp algorithm.
Assume the text is MEDAN.
K-Gram = 5
Basis = 7
A = MEDAN
A(1) = 77
A(2) = 69
A(3) = 68
A(4) = 65
A(5) = 78
$$\text{Hash} = (77 \times 7^{4}) + (69 \times 7^{3}) + (68 \times 7^{3}) + (65 \times 7^{2}) + (78 \times 7^{1}) = 235599$$
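The calculation above can be checked with a short Python sketch. `weighted_hash` and `polynomial_hash` are hypothetical helper names. The first call uses the exponents as they appear in the worked example (4, 3, 3, 2, 1), which reproduce 235599. The second shows the conventional polynomial weighting, with exponents running from k-1 down to 0. That weighting is an assumption about common practice rather than the authors' method, and it gives a different value for the same text.

```python
def weighted_hash(gram: str, base: int, exponents: list[int]) -> int:
    """Sum of ASCII code * base^exponent over the characters of gram."""
    if len(gram) != len(exponents):
        raise ValueError("one exponent is required per character")
    return sum(ord(ch) * base ** e for ch, e in zip(gram, exponents))


def polynomial_hash(gram: str, base: int) -> int:
    """Conventional weighting (assumed): exponents k-1, k-2, ..., 0."""
    k = len(gram)
    return weighted_hash(gram, base, list(range(k - 1, -1, -1)))


# Exponents exactly as printed in the worked example for MEDAN.
print(weighted_hash("MEDAN", 7, [4, 3, 3, 2, 1]))  # 235599
# Conventional strictly descending exponents, for comparison.
print(polynomial_hash("MEDAN", 7))                 # 212409
```

Whichever weighting is used, it must be applied identically to both documents, since only equal strings hashed in the same way are guaranteed to produce equal values.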
The hash calculation result is 235599. This calculation is repeated until every word on the list
has been processed. Tables 1 and 2 show examples of the hash values obtained for two documents to
be compared. The hash values in the first table are compared with the hash values in the second
table.
Table 1. Hash value of document one
19875 16830 23124 17433 20546
21489 26753 13498 23846 16528
21848 28447 29994 10301 13009
18832 27217 23157 25854 22492
14952 14337 29348 19978 28809
13485 14188 13131 21215 12053
25669 13809 26508 19455 25356
29964 17723 26633 17445 11803
19477 27142 24814 15155 26266
28432 19007 21896 16625 20681
Table 2. Hash value of document two
28432 26406 28424 13930 19187
18049 10867 18516 26753 19975
10152 13053 24120 21896 18351
12605 25101 21215 20750 15513
22949 26006 25045 25932 10695
13254 21504 20286 22492 10615
25565 29941 17403 23018 22666
19744 19769 19877 29535 13139
25669 16830 14297 20916 24640
16960 20681 13131 13009 18947
Both tables have ten hash values in common. After counting the identical hash values, the next step
is to calculate the percentage of similarity between the two documents. The formula used is as
follows:
$$P = \frac{2 \times SH}{THA + THB} \times 100\%$$
Where:
P = Plagiarism rate
SH = Number of identical hash values
THA = Total number of hash values in Document A
THB = Total number of hash values in Document B
In the previous comparison, ten hash values are identical. The plagiarism level is therefore
calculated as follows.
$$P = \frac{2 \times 10}{50 + 50} \times 100\% = \frac{20}{100} \times 100\% = 0.2 \times 100\% = 20\%$$
The percentage of plagiarism held by both documents is 20%.
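The comparison of Tables 1 and 2 can be reproduced with the following Python sketch. The hash values are copied from the two tables. The function name `similarity_percent` and the use of a set intersection to count identical hashes are assumptions of this sketch; a set counts each distinct value once, which suits these tables because neither contains a repeated value.

```python
doc_a = [  # Table 1: hash values of document one
    19875, 16830, 23124, 17433, 20546, 21489, 26753, 13498, 23846, 16528,
    21848, 28447, 29994, 10301, 13009, 18832, 27217, 23157, 25854, 22492,
    14952, 14337, 29348, 19978, 28809, 13485, 14188, 13131, 21215, 12053,
    25669, 13809, 26508, 19455, 25356, 29964, 17723, 26633, 17445, 11803,
    19477, 27142, 24814, 15155, 26266, 28432, 19007, 21896, 16625, 20681,
]
doc_b = [  # Table 2: hash values of document two
    28432, 26406, 28424, 13930, 19187, 18049, 10867, 18516, 26753, 19975,
    10152, 13053, 24120, 21896, 18351, 12605, 25101, 21215, 20750, 15513,
    22949, 26006, 25045, 25932, 10695, 13254, 21504, 20286, 22492, 10615,
    25565, 29941, 17403, 23018, 22666, 19744, 19769, 19877, 29535, 13139,
    25669, 16830, 14297, 20916, 24640, 16960, 20681, 13131, 13009, 18947,
]


def similarity_percent(hashes_a: list[int], hashes_b: list[int]) -> float:
    """P = 2 * SH / (THA + THB) * 100, with SH the count of shared hashes."""
    shared = len(set(hashes_a) & set(hashes_b))
    total = len(hashes_a) + len(hashes_b)
    return 2 * shared * 100 / total if total else 0.0


print(len(set(doc_a) & set(doc_b)))      # 10 identical hash values
print(similarity_percent(doc_a, doc_b))  # 20.0
```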
IV. CONCLUSION
The Rabin-Karp algorithm is well suited to calculating the percentage of document similarity. In
addition to being fast, the algorithm has adjustable parameters for tuning the accuracy of the
assessment. The calculation of the hash value greatly affects the result of the algorithm. The
K-Gram value must still be chosen with care. Each analyst can set the tolerance that determines
whether a document falls into the category of plagiarism or not. The disadvantage of this algorithm
is that the system can never know which document came first. The algorithm can only determine that
similarities exist between the documents being compared.
REFERENCES
[1] S. K. Shivaji and P. S., "Plagiarism Detection by using Karp-Rabin and String Matching Algorithm
Together," International Journal of Computer Applications, vol. 116, no. 23, pp. 37-41, 2015.
[2] A. Parker and J. O. Hamblen, "Computer Algorithm for Plagiarism Detection," IEEE Trans.
Education, vol. 32, no. 2, pp. 94-99, 1989.
[3] Sunita, R. Malik and M. Gulia, "Rabin-Karp Algorithm with Hashing a String Matching Tool,"
International Journal of Advanced Research in Computer Science and Software Engineering, vol.
4, no. 3, pp. 389-392, 2014.
[4] A. P. Gope and R. N. Behera, "A Novel Pattern Matching Algorithm in Genome," International
Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5450-5457, 2014.
[5] A. P. U. Siahaan, Mesran, R. Rahim and D. Siregar, "K-Gram As A Determinant Of Plagiarism
Level In Rabin-Karp Algorithm," International Journal of Scientific & Technology Research, vol.
6, no. 7, pp. 350-353, 2017.
[6] S. Popov, "Algorithm of the Week: Rabin-Karp String Searching," DZone / Java Zone, 3 April
2012. [Online]. Available: https://dzone.com/articles/algorithm-week-rabin-karp. [Accessed 20
August 2017].
[7] J. Sharma and M. Singh, "CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection
on a Multicore GPU," International Journal of Computer Network and Information Security, vol.
10, no. 8, pp. 70-77, 2015.

Universitas Pembangunan Panca Budi
 
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory PlanningPrim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Universitas Pembangunan Panca Budi
 
Image Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface CalculationImage Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface Calculation
Universitas Pembangunan Panca Budi
 
Data Compression Using Elias Delta Code
Data Compression Using Elias Delta CodeData Compression Using Elias Delta Code
Data Compression Using Elias Delta Code
Universitas Pembangunan Panca Budi
 
A Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network SecurityA Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network Security
Universitas Pembangunan Panca Budi
 
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor MethodExpert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Universitas Pembangunan Panca Budi
 
Threats of Computer System and its Prevention
Threats of Computer System and its PreventionThreats of Computer System and its Prevention
Threats of Computer System and its Prevention
Universitas Pembangunan Panca Budi
 

More from Universitas Pembangunan Panca Budi (20)

Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
Application of Data Encryption Standard and Lempel-Ziv-Welch Algorithm for Fi...
 
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
An Implementation of a Filter Design Passive LC in Reduce a Current Harmonisa
 
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
Simultaneous Response of Dividend Policy and Value of Indonesia Manufacturing...
 
Insecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed SecurityInsecure Whatsapp Chat History, Data Storage and Proposed Security
Insecure Whatsapp Chat History, Data Storage and Proposed Security
 
Online Shoppers Acceptance: An Exploratory Study
Online Shoppers Acceptance: An Exploratory StudyOnline Shoppers Acceptance: An Exploratory Study
Online Shoppers Acceptance: An Exploratory Study
 
Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on GraphPrim and Genetic Algorithms Performance in Determining Optimum Route on Graph
Prim and Genetic Algorithms Performance in Determining Optimum Route on Graph
 
Multi-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose DecisionMulti-Attribute Decision Making with VIKOR Method for Any Purpose Decision
Multi-Attribute Decision Making with VIKOR Method for Any Purpose Decision
 
Mobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny AlgorithmMobile Application Detection of Road Damage using Canny Algorithm
Mobile Application Detection of Road Damage using Canny Algorithm
 
Super-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE AlgorithmSuper-Encryption Cryptography with IDEA and WAKE Algorithm
Super-Encryption Cryptography with IDEA and WAKE Algorithm
 
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
Technique for Order Preference by Similarity to Ideal Solution as Decision Su...
 
Prototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic EnglishPrototype Application Multimedia Learning for Teaching Basic English
Prototype Application Multimedia Learning for Teaching Basic English
 
TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...TOPSIS Method Application for Decision Support System in Internal Control for...
TOPSIS Method Application for Decision Support System in Internal Control for...
 
Violations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in IndonesiaViolations of Cybercrime and the Strength of Jurisdiction in Indonesia
Violations of Cybercrime and the Strength of Jurisdiction in Indonesia
 
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
Marketing Strategy through Markov Optimization to Predict Sales on Specific P...
 
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory PlanningPrim's Algorithm for Optimizing Fiber Optic Trajectory Planning
Prim's Algorithm for Optimizing Fiber Optic Trajectory Planning
 
Image Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface CalculationImage Similarity Test Using Eigenface Calculation
Image Similarity Test Using Eigenface Calculation
 
Data Compression Using Elias Delta Code
Data Compression Using Elias Delta CodeData Compression Using Elias Delta Code
Data Compression Using Elias Delta Code
 
A Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network SecurityA Review of IP and MAC Address Filtering in Wireless Network Security
A Review of IP and MAC Address Filtering in Wireless Network Security
 
Expert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor MethodExpert System of Catfish Disease Determinant Using Certainty Factor Method
Expert System of Catfish Disease Determinant Using Certainty Factor Method
 
Threats of Computer System and its Prevention
Threats of Computer System and its PreventionThreats of Computer System and its Prevention
Threats of Computer System and its Prevention
 

Recently uploaded

Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
MIRIAMSALINAS13
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
Ashokrao Mane college of Pharmacy Peth-Vadgaon
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
Vivekanand Anglo Vedic Academy
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
GeoBlogs
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
Thiyagu K
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
TechSoup
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
timhan337
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
joachimlavalley1
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
Jisc
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
Levi Shapiro
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
Anna Sz.
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
Celine George
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
Vikramjit Singh
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
DhatriParmar
 

Recently uploaded (20)

Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXXPhrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
Phrasal Verbs.XXXXXXXXXXXXXXXXXXXXXXXXXX
 
Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.Biological Screening of Herbal Drugs in detailed.
Biological Screening of Herbal Drugs in detailed.
 
The French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free downloadThe French Revolution Class 9 Study Material pdf free download
The French Revolution Class 9 Study Material pdf free download
 
The geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideasThe geography of Taylor Swift - some ideas
The geography of Taylor Swift - some ideas
 
Unit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdfUnit 8 - Information and Communication Technology (Paper I).pdf
Unit 8 - Information and Communication Technology (Paper I).pdf
 
Introduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp NetworkIntroduction to AI for Nonprofits with Tapp Network
Introduction to AI for Nonprofits with Tapp Network
 
Honest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptxHonest Reviews of Tim Han LMA Course Program.pptx
Honest Reviews of Tim Han LMA Course Program.pptx
 
Additional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdfAdditional Benefits for Employee Website.pdf
Additional Benefits for Employee Website.pdf
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...How libraries can support authors with open access requirements for UKRI fund...
How libraries can support authors with open access requirements for UKRI fund...
 
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
June 3, 2024 Anti-Semitism Letter Sent to MIT President Kornbluth and MIT Cor...
 
Polish students' mobility in the Czech Republic
Polish students' mobility in the Czech RepublicPolish students' mobility in the Czech Republic
Polish students' mobility in the Czech Republic
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17How to Make a Field invisible in Odoo 17
How to Make a Field invisible in Odoo 17
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Digital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and ResearchDigital Tools and AI for Teaching Learning and Research
Digital Tools and AI for Teaching Learning and Research
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
The Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptxThe Accursed House by Émile Gaboriau.pptx
The Accursed House by Émile Gaboriau.pptx
 

Examination of Document Similarity Using Rabin-Karp Algorithm

DOI: 10.23883/IJRTER.2017.3404.4SNDK

Ranti Eka Putri (1), Andysah Putera Utama Siahaan (2)
(1) Faculty of Computer Science, Universitas Pembangunan Panca Budi, Medan, Indonesia
(2) Ph.D. Student, School of Computer and Communication Engineering, Universiti Malaysia Perlis, Kangar, Malaysia

Abstract: Documents do not always share the same content, but similarity between documents is common in scientific writing. Some similarities are coincidental; others are intentional. Short documents can be checked by eye, but this is impractical for documents with thousands of lines and pages, so a method is needed that can analyze the plagiarism techniques used. Many methods can examine the resemblance of documents; one of them is the Rabin-Karp algorithm. The algorithm suits this task because the length of the character sequences it compares (K-Grams) can be set. It counts how many hash values the two documents have in common. The percentage that counts as plagiarism can also be adjusted to the needs of the examination. Implementing this algorithm helps an institution filter incoming documents, typically when a scientific paper is received for publication.

Keywords: Text Mining, Plagiarism, Rabin-Karp

I. INTRODUCTION

Information is essential in education, especially scientific information. Because information can be accessed online, it can also be modified easily. Files downloaded from the internet can be edited and then saved under a new name by a new user. This happens quickly and requires no special technique. The development of this technology has positive value, but its progress cannot be separated from the negative impact it produces. Modifying information without citing the original source violates the rules; such modification is plagiarism [1]. Plagiarism is the misuse or theft of another person's publication, statement, thoughts, ideas, writings, or creations by presenting them as one's own. It is easy to commit, especially with an internet connection, and it can kill a person's creativity in developing new ideas. It is tempting because it can be done easily and quickly, with little effort or thought. Plagiarism can be countered with the help of string matching methods, which can be adapted to analyze text, images, and even sound. This study examines document matching using the Rabin-Karp algorithm, which is known to be fast at comparing documents [3][4]. The parameters of the algorithm can also be adjusted to the target to be achieved. The authors hope that running such a system will help prevent plagiarism.
International Journal of Recent Trends in Engineering & Research (IJRTER) Volume 03, Issue 08; August - 2017 [ISSN: 2455-1457] @IJRTER-2017, All Rights Reserved

II. THEORIES

2.1 Plagiarism

Information retrieval is the part of computer science concerned with finding relevant documents, which are then processed together with other data. It is a search for information based on a query that is expected to meet the searcher's goal. However, the documents returned may include plagiarized ones. Plagiarism is the act of copying or claiming articles, opinions, papers, and so on that are not one's own, that is, making another person's work one's own without naming the source. A person who plagiarizes is called a plagiarist. Plagiarism counts as a criminal act because it falsifies the work of others; it is also called copyright theft. Any quotation of words or ideas must include the name of the original owner. Likewise, a book may not be reprinted without the permission of its author or publisher [2].

Plagiarism is not always intentional. Some cases arise from a lack of information or references when producing a scientific work. The most common types of plagiarism are:

- Accidental
  It occurs through a lack of knowledge about plagiarism and about how to write references. It usually happens when a scientific paper is written without a literature review.
- Unintentional
  Information that has been discussed frequently is rewritten in words that are almost the same. The same idea can produce different writing if it is composed well, so plagiarism can be avoided.
- Intentional
  The act of deliberately quoting a sentence, or the whole of another person's work, without citing that work.
- Self-plagiarism
  The reuse of one's own work in another form without developing the values or variables present in the previous work.
Plagiarism detection methods fall into two groups: fingerprinting and full-text comparison.

- Fingerprinting Comparison
  This technique checks the relationship between documents, whether the comparison covers a whole document or part of a text. It breaks the text into rows of characters of a certain length and converts them into values, a process called hashing. The most commonly used algorithm is Rabin-Karp.
- Full-text Comparison
  This technique compares the content of two documents directly, matching the text piece by piece. The downside is that it takes longer for large documents. However, the results are quite satisfactory because they can be stored in a database for later use. Full-text comparison cannot be applied to documents that are not in the same storage. The algorithms used in this approach include Brute-Force, Boyer-Moore, and Levenshtein Distance.

2.2 Rabin-Karp

The Rabin-Karp algorithm is a search algorithm that looks for a substring pattern in a text using hashing. It is very effective for multi-pattern matching [5][7]. One of the practical applications of the Rabin-Karp algorithm is plagiarism detection. Rabin-Karp relies on a hash function to determine the percentage of plagiarism, and the accuracy level can be adjusted through it. The hash function maps each string (a fragment of the text) to a number, called a hash value. Identical strings receive identical hash values (Figure 1) [6]. Two difficulties arise in determining the hash value. First, a sentence contains many different strings; this is handled by allowing several strings to share the same hash value. Second, strings with the same hash value do not necessarily match; to handle this, strings whose hash values are equal are compared directly with a brute-force check. Rabin-Karp uses a large prime number to reduce the chance that different words produce the same hash value.

Figure 1. Rabin-Karp hash example

III. IMPLEMENTATION

3.1 Rabin-Karp Process

This stage performs semantic and syntactic analysis of the text. The purpose of the initial processing is to prepare the text as data for further processing. The operations at this stage include removing parts that are unnecessary for the test, so that only eligible data are processed. Filtering is a classification step that determines which words are used when searching for common words. Each sentence is broken down into words, and words that carry no useful meaning are discarded. The document index is a set of terms that indicate the content or topic of the document. It is usually divided according to need. The index distinguishes a document from the other documents in the collection.
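The preparation and filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` and the small `STOPWORDS` set are hypothetical, the text is assumed to be plain ASCII English, and no stemming is attempted because the paper does not specify a stemmer.

```python
import re

# Hypothetical stop-word set, for illustration only; a real system
# would use a much larger, language-specific list.
STOPWORDS = {"because", "with", "and", "or", "not"}


def preprocess(text, stopwords=STOPWORDS):
    """Lower-case the text, keep alphabetic words only, drop stop words."""
    # [a-z]+ discards punctuation and digits (ASCII letters assumed).
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]


print(preprocess("Plagiarism, with or without intent, is NOT acceptable in 2017."))
# ['plagiarism', 'without', 'intent', 'is', 'acceptable', 'in']
```

The list that comes out of such a step is what the later hashing stage would consume.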
The steps in the Rabin-Karp process are as follows:

- Tokenizing
  Converts a document into a collection of words by placing the words in an array and separating out punctuation and numbers, which are not counted as important words. This step also converts the text to lower case.
- Stopword Removal
  Removes common words that always appear in documents, such as: because, with, and, or, not, and others.
- Stemming
  Strips the prefixes and suffixes from words so that each becomes its base word.
- Hashing
  Assigns each word in a document a value based on a predetermined formula.

3.2 Rabin-Karp Calculation

The hash value is the most important quantity in the Rabin-Karp algorithm. The hash of the letters of a k-gram for a given base is obtained by multiplying each ASCII value by a power of the base, where the base is prime. The method guarantees that if two strings are the same, their hash values are the same as well. Here is an example calculation with the Rabin-Karp algorithm. Assume the text is MEDAN.

K-Gram = 5
Base = 7
A = MEDAN
A(1) = 77
A(2) = 69
A(3) = 68
A(4) = 65
A(5) = 78

$$\text{Hash} = (77 \times 7^4) + (69 \times 7^3) + (68 \times 7^3) + (65 \times 7^2) + (78 \times 7^1) = 235599$$
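The calculation above can be checked with a short Python sketch. The function name `kgram_hash` and its `exponents` parameter are hypothetical. The first call uses the exponents exactly as printed in the example (4, 3, 3, 2, 1) and reproduces the stated total. The second call assumes the more common convention of strictly decreasing exponents k-1, ..., 0, which is an assumption and not something the paper states; it gives a different value.

```python
def kgram_hash(kgram, base, exponents=None):
    """Weighted sum of character codes: sum(ord(c) * base**e)."""
    if exponents is None:
        # Assumed convention: exponents k-1, k-2, ..., 0.
        exponents = range(len(kgram) - 1, -1, -1)
    exponents = list(exponents)
    if len(exponents) != len(kgram):
        raise ValueError("need exactly one exponent per character")
    return sum(ord(c) * base ** e for c, e in zip(kgram, exponents))


# Exponents as printed in the worked example.
print(kgram_hash("MEDAN", 7, [4, 3, 3, 2, 1]))  # 235599

# Strictly decreasing exponents 4, 3, 2, 1, 0 (assumed convention).
print(kgram_hash("MEDAN", 7))  # 212409
```

Whichever weighting is chosen, it has to be applied identically to both documents, since only equal hash values are counted as matches.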
The hash calculation result is 235599. This calculation is repeated until every word on the list has been processed. Tables 1 and 2 below are examples of the hash values obtained for two documents to be compared. The hash values in the first table are compared with the hash values in the second table.

Table 1. Hash values of document one

19875 16830 23124 17433 20546 21489 26753 13498 23846 16528
21848 28447 29994 10301 13009 18832 27217 23157 25854 22492
14952 14337 29348 19978 28809 13485 14188 13131 21215 12053
25669 13809 26508 19455 25356 29964 17723 26633 17445 11803
19477 27142 24814 15155 26266 28432 19007 21896 16625 20681

Table 2. Hash values of document two

28432 26406 28424 13930 19187 18049 10867 18516 26753 19975
10152 13053 24120 21896 18351 12605 25101 21215 20750 15513
22949 26006 25045 25932 10695 13254 21504 20286 22492 10615
25565 29941 17403 23018 22666 19744 19769 19877 29535 13139
25669 16830 14297 20916 24640 16960 20681 13131 13009 18947

The two tables have ten hash values in common. After counting the identical hash values, the next step is to calculate the percentage of similarity between the two documents. The formula used is as follows:

$$P = \frac{2 \times SH}{THA + THB} \times 100\%$$

Where:
P = Plagiarism Rate
SH = Identical Hash
THA = Total Hash in Document A
THB = Total Hash in Document B

In the previous comparison, ten values are identical, so the plagiarism level is calculated as follows.

$$P = \frac{2 \times 10}{50 + 50} \times 100\% = \frac{20}{100} \times 100\% = 0.2 \times 100\% = 20\%$$

The percentage of plagiarism between the two documents is 20%.

IV. CONCLUSION

The Rabin-Karp algorithm is well suited to calculating the percentage of document similarity. Besides being fast, it has adjustable parameters for tuning the accuracy of the assessment. The calculation of hash values greatly affects the result of the algorithm, so the K-Gram value must still be chosen with care. Each analyst can set the tolerance that decides whether a document falls into the category of plagiarism or not. A disadvantage of the algorithm is that it cannot know which document came first; it can only determine that similarities exist between the compared documents.

REFERENCES

[1] S. K. Shivaji and P. S., "Plagiarism Detection by using Karp-Rabin and String Matching Algorithm Together," International Journal of Computer Applications, vol. 116, no. 23, pp. 37-41, 2015.
[2] A. Parker and J. O. Hamblen, "Computer Algorithm for Plagiarism Detection," IEEE Trans. Education, vol. 32, no. 2, pp. 94-99, 1989.
[3] Sunita, R. Malik and M. Gulia, "Rabin-Karp Algorithm with Hashing a String Matching Tool," International Journal of Advanced Research in Computer Science and Software Engineering, vol. 4, no. 3, pp. 389-392, 2014.
[4] A. P. Gope and R. N. Behera, "A Novel Pattern Matching Algorithm in Genome," International Journal of Computer Science and Information Technologies, vol. 5, no. 4, pp. 5450-5457, 2014.
[5] A. P. U. Siahaan, Mesran, R. Rahim and D. Siregar, "K-Gram As A Determinant Of Plagiarism Level In Rabin-Karp Algorithm," International Journal of Scientific & Technology Research, vol. 6, no. 7, pp. 350-353, 2017.
[6] S. Popov, "Algorithm of the Week: Rabin-Karp String Searching," DZone / Java Zone, 3 April 2012. [Online]. Available: https://dzone.com/articles/algorithm-week-rabin-karp. [Accessed 20 August 2017].
[7] J. Sharma and M. Singh, "CUDA based Rabin-Karp Pattern Matching for Deep Packet Inspection on a Multicore GPU," International Journal of Computer Network and Information Security, vol. 10, no. 8, pp. 70-77, 2015.
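
As a check on the similarity calculation of Section 3.2, the following Python sketch applies the plagiarism-rate formula to the hash values of Tables 1 and 2. The function name `plagiarism_rate` is hypothetical, and counting identical hashes through a set intersection is an assumption: it treats a hash value that repeats inside one document as a single match, a case the paper does not address.

```python
DOC_ONE = [
    19875, 16830, 23124, 17433, 20546, 21489, 26753, 13498, 23846, 16528,
    21848, 28447, 29994, 10301, 13009, 18832, 27217, 23157, 25854, 22492,
    14952, 14337, 29348, 19978, 28809, 13485, 14188, 13131, 21215, 12053,
    25669, 13809, 26508, 19455, 25356, 29964, 17723, 26633, 17445, 11803,
    19477, 27142, 24814, 15155, 26266, 28432, 19007, 21896, 16625, 20681,
]
DOC_TWO = [
    28432, 26406, 28424, 13930, 19187, 18049, 10867, 18516, 26753, 19975,
    10152, 13053, 24120, 21896, 18351, 12605, 25101, 21215, 20750, 15513,
    22949, 26006, 25045, 25932, 10695, 13254, 21504, 20286, 22492, 10615,
    25565, 29941, 17403, 23018, 22666, 19744, 19769, 19877, 29535, 13139,
    25669, 16830, 14297, 20916, 24640, 16960, 20681, 13131, 13009, 18947,
]


def plagiarism_rate(hashes_a, hashes_b):
    """P = 2 * SH / (THA + THB) * 100, returned as a percentage."""
    shared = len(set(hashes_a) & set(hashes_b))  # SH: identical hash values
    total = len(hashes_a) + len(hashes_b)        # THA + THB
    return 100 * 2 * shared / total if total else 0.0


print(len(set(DOC_ONE) & set(DOC_TWO)))   # 10
print(plagiarism_rate(DOC_ONE, DOC_TWO))  # 20.0
```

The result agrees with the ten shared values and the 20% rate obtained by hand.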