This document summarizes a research paper that proposes a method to semantically detect plagiarism in research papers using text mining techniques. It introduces the problem of plagiarism in research and the need for automated detection methods. The proposed method uses TF-IDF to encode documents and LSI for semantic indexing. It collects research papers, preprocesses text, encodes documents with TF-IDF, and indexes them semantically using LSI to find similar papers and detect plagiarism.
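The TF-IDF encoding and similarity comparison mentioned in the summary above can be illustrated with a minimal sketch. The whitespace tokenizer, the smoothed IDF formula, the function names and the sample texts are all assumptions made for illustration, not the paper's implementation. The LSI step (a truncated singular value decomposition of the term-document matrix) is deliberately left out, so this only compares raw TF-IDF vectors.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed preprocessing: lowercase and split on whitespace.
    # A real pipeline would also remove stop words and apply stemming.
    return text.lower().split()

def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (a dict term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency of each term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical sample texts standing in for collected research papers.
papers = [
    "plagiarism detection with text mining",
    "text mining for plagiarism detection in papers",
    "fuzzy queries on relational databases",
]
vecs = tfidf_vectors(papers)
sim_close = cosine(vecs[0], vecs[1])  # papers sharing most of their terms
sim_far = cosine(vecs[0], vecs[2])    # papers sharing no terms
```

A pair of papers whose similarity exceeds some chosen threshold would then be flagged for closer inspection; the threshold itself is a tuning choice that the summary does not specify.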
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Elevating forensic investigation system for file clustering
eSAT Journals
Abstract: In computer forensic investigation, thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is very difficult for computer examiners to analyze. Clustering is the unsupervised organization of patterns (data items, observations, or feature vectors) into groups (clusters). Finding a good solution through automated methods of analysis is therefore of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the discovery of new and valuable information from the documents under investigation. This paper presents an approach that applies text clustering algorithms to the forensic examination of computers seized in police investigations, using a multithreading technique for data clustering. Keywords: clustering, forensic computing, text mining, multithreading.
A novel approach for text extraction using effective pattern matching technique
eSAT Journals
Abstract
Many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns remains open for future research, especially in the field of text mining. Because most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy (a word has multiple meanings) and synonymy (multiple words have the same meaning). People have long held the hypothesis that pattern-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique, which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP.
Data mining is knowledge discovery in databases, and its goal is to extract patterns and knowledge from
large amounts of data. An important area within data mining is text mining. Text mining extracts
high-quality information from text. Statistical pattern learning is used to derive this information. High quality in
text mining refers to a combination of relevance, novelty and interestingness. Tasks in text mining include text
categorization, text clustering, entity extraction and sentiment analysis. Applications of natural language
processing and analytical methods are highly preferred to turn
Answer extraction and passage retrieval for
Waheeb Ahmed
Question Answering systems (QASs) retrieve text portions from a
collection of documents that contain the answer to the user's
questions. These QASs use a variety of linguistic tools that are
able to deal with small fragments of text. Therefore, to retrieve
the documents that contain the answer from a large document
collection, QASs employ Information Retrieval (IR) techniques to
reduce the document collection to a tractable amount of relevant
text. In this paper, we propose a passage retrieval model that
performs this task with better performance for Arabic QASs. We
first segment each of the top five ranked documents returned by
the IR module into passages. Then, we compute the similarity score
between the user's question terms and each passage. The top five
passages (those with the highest similarity scores) are retrieved.
Finally, Answer Extraction techniques are applied to extract the
final answer. Our method achieved, on average, a precision of
87.25%, a recall of 86.2% and an F1-measure of 87%.
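The passage retrieval steps in the abstract above (segment documents into passages, score each passage against the question terms, keep the top five) can be sketched as follows. The abstract does not say how passages are segmented or which similarity measure is used, so the fixed-size sentence windows, the term-overlap score, the function names and the English sample data are all assumptions for illustration.

```python
def split_passages(document, size=3):
    # Assumption: a passage is a fixed-size window of sentences,
    # with sentences split naively on full stops.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + size]) + "."
        for i in range(0, len(sentences), size)
    ]

def overlap_score(question, passage):
    # Assumed similarity: fraction of question terms that occur in the passage.
    q_terms = set(question.lower().split())
    p_terms = set(passage.lower().replace(".", " ").split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0

def top_passages(question, documents, k=5, size=3):
    # Segment every retrieved document, then rank all passages by score.
    passages = [p for d in documents for p in split_passages(d, size)]
    ranked = sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)
    return ranked[:k]

# Hypothetical documents standing in for the top-ranked IR results.
documents = [
    "Cairo is the capital of Egypt. It lies on the Nile.",
    "Paris is in France. The Seine flows through Paris.",
]
best = top_passages("what is the capital of Egypt", documents, k=2, size=1)
```

An answer extraction step would then run only on the returned passages, which is what keeps the amount of text to analyze small.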
Text mining attempts to discover new, previously unknown or hidden information by automatically extracting
information from various written resources. Applying knowledge discovery methods to
unstructured text is known as Knowledge Discovery in Text, Text Data Mining, or Text Mining.
Most of the techniques used in text mining are founded on the statistical study of a term, either a word or
a phrase. Different text mining algorithms have been used in previous methods. For example, the
Single-Link algorithm and the Self-Organizing Map (SOM) provide an approach for visualizing
high-dimensional data and a very useful tool for processing textual data based on a projection method.
Genetic and sequential algorithms provide the capability for multiscale representation of datasets and
are fast to compute, with less CPU time, based on the Isolet reduced subsets in unsupervised feature
selection. We propose a Vector Space Model and concept-based analysis algorithm that
improves text clustering quality, so that a better text clustering result may be achieved. We consider the
proposed algorithm to behave well in terms of robustness and stability with respect to the formation of
the neural network.
Most text classification problems are associated with multiple class labels, and hence automatic text
classification is one of the most challenging and prominent research areas. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document may be associated with more than one label. The real challenge in multi-label
classification is labelling a large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
that performs automatic text classification. This paper describes the multi-label classification of product
review documents using a Structured Support Vector Machine.
The classical or traditional information system provides an answer only after a user submits a complete query. At
present, almost all relational database systems rely on queries whose syntax and semantics are completely
defined in order to access data. Often, however, we want to use vague terms in a query. The main
objective of a database management system is to provide an environment that is both convenient and efficient for people
to use in storing and retrieving information. The recent trend of supporting autocomplete is a first step toward coping with
this problem. Both classical and fuzzy databases can be designed, and fuzzy queries can be used effectively on
them. Fuzzy databases are developed to manipulate incomplete, unclear and vague data such as "low", "fast", "very
high" and "about". The primary focus of fuzzy logic is on natural language. This paper gives users the flexibility
to query a database using natural language and implements "interactive fuzzy search". This
framework for interactive fuzzy search permits users to explore the data as they type, even in the presence of some
minor errors. The paper applies fuzzy queries to a relational database so that it is possible to obtain precise results as
well as output for the uncertain terms we generally use, based on some membership function.
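To make the idea of answering a vague term through a membership function concrete, here is a minimal sketch. The abstract above does not specify the membership functions or the schema, so the trapezoidal shape, the salary breakpoints, the 0.5 threshold, the function names and the sample rows are all hypothetical.

```python
def trapezoid(x, a, b, c, d):
    # Trapezoidal membership: 0 outside (a, d), 1 on [b, c], linear in between.
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def high_salary(salary):
    # Hypothetical definition of the vague term "high": membership rises
    # from 0 at 40000 to 1 at 60000 and stays at 1 above that.
    return trapezoid(salary, 40000, 60000, float("inf"), float("inf"))

def fuzzy_select(rows, membership, threshold=0.5):
    # Return (name, degree) for rows whose membership degree meets the threshold.
    scored = [(name, membership(value)) for name, value in rows]
    return [(name, degree) for name, degree in scored if degree >= threshold]

# Hypothetical rows standing in for a relational table (name, salary).
employees = [("Asha", 30000), ("Bilal", 50000), ("Chen", 75000)]
result = fuzzy_select(employees, high_salary)
```

A crisp query such as `salary >= 60000` would return only the last row, whereas the fuzzy version also returns partial matches together with their degree of membership.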
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Extraction of Data Using Comparable Entity Mining
iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S...
TELKOMNIKA JOURNAL
Most scientific publishers encourage authors to provide keyphrases for their published articles.
Hence, the need to automate keyphrase extraction has increased. However, it is not a trivial task,
considering that keyphrase characteristics may overlap with those of non-keyphrases. To date, the accuracy of
automatic keyphrase extraction approaches is still considerably low. In response to this gap, this paper
proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to
strengthen keyphrase characteristics since, according to manual observation, most keyphrases are
mentioned with neutral-to-positive sentiment. Second, a combination of supervised and unsupervised
approaches is proposed to take the benefits of both. It will enable automatic hidden pattern
detection while keeping candidate importance comparable to each other. According to the evaluation,
fact-based sentiment is quite effective for representing keyphraseness, and the semi-supervised approach is
considerably effective at extracting keyphrases from scientific articles.
Using data mining methods knowledge discovery for text mining
eSAT Journals
Abstract: Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. The proposed work presents an innovative and effective pattern discovery technique, which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords: text mining, text classification, pattern mining, pattern evolving, information filtering.
Arabic text categorization algorithm using vector evaluation method
ijcsit
Text categorization is the process of grouping documents into categories based on their contents. This
process is important for making information retrieval easier, and it has become more important due to the huge
amount of textual information available online. The main problem in text categorization is how to improve
classification accuracy. Although Arabic text categorization is a promising new field, there is little
research in it. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a corpus of categorized Arabic documents; the weights of the
tested document's words are calculated to determine the document's keywords, which are then compared with
the keywords of the corpus categories to determine the tested document's best category.
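The keyword comparison described in the abstract above can be sketched in a few lines. The abstract does not give the weighting scheme or the comparison rule, so raw term frequency as the word weight, set overlap as the comparison, the function names and the English sample data are assumptions for illustration only.

```python
from collections import Counter

def top_keywords(text, k=3):
    # Assumption: a word's weight is its raw frequency in the document,
    # and the k heaviest words are taken as the document keywords.
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}

def best_category(text, category_keywords, k=3):
    # Pick the category whose keyword set shares the most keywords
    # with the tested document.
    keywords = top_keywords(text, k)
    return max(category_keywords, key=lambda c: len(keywords & category_keywords[c]))

# Hypothetical category keywords standing in for those derived from a corpus.
categories = {
    "sports": {"goal", "match", "team"},
    "economy": {"market", "bank", "price"},
}
label = best_category("goal match goal team goal match", categories)
```

For Arabic text, the tokenization step would need language-specific normalization and stemming, which this sketch does not attempt.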
Text document clustering and similarity detection are a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on their similarity, documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but those systems are either term based or pattern based, and they suffer from several problems. To address this, the proposed system presents an innovative model for document similarity that applies the back propagation time stamp (BPTT) algorithm. It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also detects the most appropriate patterns based on their weight, and BPTT computes the document similarity measures. Using this approach, documents can be categorized easily, and the training process problems are reduced. The framework, named BPTT, has been implemented and evaluated on the .NET platform with different datasets.
Automatically finding domain-specific key terms from a given set of research papers is a challenging task, and assigning research papers to a particular area of research is a concern for many people, including students, professors and researchers. A domain classification of papers facilitates that search process. That is, given a list of domains in a research field, we try to find out to which domain(s) a given paper is most related. Besides, reading and processing a whole paper takes a long time, and using domain knowledge requires much human effort, e.g., manually labeling a large corpus. In particular, we use the abstract and keywords of a research paper as seed terms to identify similar terms from a domain corpus, which are then filtered by checking their appearance in the research papers. Experiments show that the TF-IDF measure and the classification step make this method more precise across domains. The results show that our approach can extract the terms effectively while being domain independent.
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
Advantages of Query Biased Summaries in Information Retrieval
Onur Yılmaz
Presentation of the paper:
Advantages of query biased summaries in information retrieval (1998)
Anastasios Tombros and Mark Sanderson.
In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '98).
ACM, New York, NY, USA, 2-10.
DOI=10.1145/290941.290947
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Text mining efforts to innovate new, previous unknown or hidden data by automatically extracting
collection of information from various written resources. Applying knowledge detection method to
formless text is known as Knowledge Discovery in Text or Text data mining and also called Text Mining.
Most of the techniques used in Text Mining are found on the statistical study of a term either word or
phrase. There are different algorithms in Text mining are used in the previous method. For example
Single-Link Algorithm and Self-Organizing Mapping(SOM) is introduces an approach for visualizing
high-dimensional data and a very useful tool for processing textual data based on Projection method.
Genetic and Sequential algorithms are provide the capability for multiscale representation of datasets and
fast to compute with less CPU time based on the Isolet Reduces subsets in Unsupervised Feature
Selection. We are going to propose the Vector Space Model and Concept based analysis algorithm it will
improve the text clustering quality and a better text clustering result may achieve. We think it is a good
behavior of the proposed algorithm is in terms of toughness and constancy with respect to the formation of
Neural Network.
Most of the text classification problems are associated with multiple class labels and hence automatic text
classification is one of the most challenging and prominent research area. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document is associated may have more than one label. The real challenge in the multi-label
classification is the labelling of large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
which performs automatic text classification. This paper describes the multi-label classification of product
review documents using Structured Support Vector Machine.
The classical or traditional information system provides answer after a user submits a complete query. It is even
noticed that presently, almost all the relational database systems rely on the query which has syntax and semantics
defined completely to access data. But often it is the case that we are willing to use vague terms in our query. The main
objective of database management system is to provide an environment that is both convenient and efficient for people
to use in storing and retrieving information. A recent trend of supporting auto complete is a first step to cope up with
this problem. We can have design of both classical and fuzzy database and can use effectively fuzzy queries on these
databases. Fuzzy databases are developed to manipulate the incomplete, unclear and vague data such as low, fast, very
high, about etc. The primary focus of fuzzy logic is on the natural language. This Paper provides the users the flexibility
or freedom to query database using natural language. Here this paper implements âinteractive fuzzy searchâ. This
framework for interactive fuzzy search permits the user to explore the data as they type even in the presence of some
minor errors. This paper applies fuzzy queries on relational database so that it is possible to have the precise result as
well as the output for the uncertain terms we generally use based on some membership function
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Extraction of Data Using Comparable Entity Miningiosrjce
Â
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Semi-Supervised Keyphrase Extraction on Scientific Article using Fact-based S...TELKOMNIKA JOURNAL
Â
Most scientific publishers encourage authors to provide keyphrases on their published article.
Hence, the need to automatize keyphrase extraction is increased. However, it is not a trivial task
considering keyphrase characteristics may overlap with the non-keyphraseâs. To date, the accuracy of
automatic keyphrase extraction approaches is still considerably low. In response to such gap, this paper
proposes two contributions. First, a feature called fact-based sentiment is proposed. It is expected to
strengthen keyphrase characteristics since, according to manual observation, most keyphrases are
mentioned in neutral-to-positive sentiment. Second, a combination of supervised and unsupervised
approach is proposed to take the benefits of both approaches. It will enable automatic hidden pattern
detection while keeping candidate importance comparable to each other. According to evaluation, factbased
sentiment is quite effective for representing keyphraseness and semi-supervised approach is
considerably effective to extract keyphrases from scientific articles.
Using data mining methods knowledge discovery for text miningeSAT Journals
Â
Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
Arabic text categorization algorithm using vector evaluation methodijcsit
Â
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it became more important due to the huge
textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new promising field, there are a few
researches in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic documents corpus, and then the weights of the
tested document's words are calculated to determine the document keywords which will be compared with
the keywords of the corpus categorizes to determine the tested document's best category.
Text document clustering and similarity detection is the major part of document management, where every document should be identified by its key terms and domain knowledge. Based on the similarity, the documents are grouped into clusters. For document similarity calculation there are several approaches were proposed in the existing system. But the existing system is either term based or pattern based. And those systems suffered from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity by applying back propagation time stamp algorithm. It discovers patterns in text documents as higher level features and creates a network for fast grouping. It also detects the most appropriate patterns based on its weight and BPTT performs the document similarity measures. Using this approach, the document can be categorized easily. In order to perform the above, a new approach is used. This helps to reduce the training process problems. The above framework is named as BPTT. The BPTT has implemented and evaluated using dot net platform with different set of datasets.
Automatically finding domain specific key terms from a given set of research paper is a challenging task and research papers to a particular area of research is a concern for many people including students, professors and researchers. A domain classification of papers facilitates that search process. That is, having a list of domains in a research field, we try to find out to which domain(s) a given paper is more related. Besides, processing the whole paper to read take a long time. In this paper, using domain knowledge requires much human effort, e.g., manually composing a set of labeling a large corpus. In particular, we use the abstract and keyword in research paper as the seeing terms to identify similar terms from a domain corpus which are then filtered by checking their appearance in the research papers. Experiments show the TF âIDF measure and the classification step make this method more precisely to domains. The results show that our approach can extract the terms effectively, while being domain independent.
P33077080
1. Reena Kharat, Preeti M. Chavan, Vaibhav Jadhav, Kuldeep Rakibe / International Journal of
Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 3, May-Jun 2013, pp.077-080
Semantically Detecting Plagiarism for Research Papers
Reena Kharat, Preeti M. Chavan, Vaibhav Jadhav, Kuldeep Rakibe
Department of Computer Engg, Pimpri Chinchwad College Of Engg., Pune.
ABSTRACT
Plagiarism means copying published work
without proper acknowledgement of the source. It
is a major concern in academic environments
because it affects both the credibility of
institutions and their ability to ensure the
quality of their students. Plagiarism detection
for research papers involves checking a paper's
similarity with other research papers. Manual
checking is unreliable, because the assigned
reviewer may have inadequate knowledge of the
research discipline, and different reviewers may
hold different subjective views, leading to
possible misinterpretations. There is therefore
an urgent need for an effective and feasible
approach that checks submitted research papers
with the support of automated software. Text
mining offers a way to check research papers
semantically and automatically. Our proposed
system uses Term Frequency-Inverse Document
Frequency (TF-IDF) and Latent Semantic Indexing
(LSI) to find plagiarism semantically.
Keywords - Decision Support Systems, Latent
Semantic Indexing (LSI), Term Frequency-Inverse
Document Frequency (TF-IDF), Text Mining
I. INTRODUCTION
Plagiarism by students, professors,
industrialists, or researchers is considered
academic fraud. Plagiarism is defined in several
ways, for example as copying another person's
original work without acknowledging the author or
source. Original work may be code, formulas,
ideas, research, strategies, writing, or other
forms. Punishment for plagiarism ranges from
suspension to termination, along with loss of
credibility. Detecting plagiarism is therefore
essential. Research paper selection is a recurring
activity for any conference or journal in
academia. It is a multi-step process that begins
with a call for papers. Fig. 1 shows the research
paper selection process. The call for papers is
distributed to communities such as universities
or research institutions. Submitted papers are
then assigned to experts for peer review. The
review results are collected, and the papers are
then ranked based on the aggregation of the
experts' review results.
Figure 1: Research paper selection process
An expert reviewer may have inadequate
knowledge of the research discipline. Plagiarism
detection software helps the reviewer detect
plagiarism quickly. The proposed system requires a
database of existing research papers. When an
end-user responds to the call for papers (CFP) [5],
the system accepts the research paper submitted
by that end-user. The system then finds the
similarity between the submitted paper and the
existing research papers.
The proposed method aims to computerise the
manual process of checking research papers for
plagiarism. The system allows an agency to verify
the originality of the research paper submitted
by an end-user, and it helps the agency find
semantically similar research papers. The proposed
method makes use of TF-IDF and LSI.
II. LITERATURE REVIEW
There are many existing formal methods
for plagiarism detection. W. M. Wang and C. F.
Cheung [6] proposed a semantic-based intellectual
property management system, an automated system
for assisting inventors in patent analysis. It
incorporated semantic analysis and text mining
techniques for processing and analysing patent
documents. However, that method proposed a hybrid
knowledge-based approach for assigning clustered
research papers to reviewers.
MOSS, which stands for "Measure Of Software
Similarity", was developed by Alex Aiken at UC
Berkeley. MOSS employs a document fingerprinting
technique to detect textual similarity. MOSS is a
command-line tool and is not easy to use. YAP [8],
which stands for Yet Another Plague, was proposed
by Wise and tries to find a maximal set of common
contiguous substrings to detect plagiarism. It has
three versions: YAP1, YAP2, and YAP3. Chen et al.
discussed SID [9], which stands for Shared
Information Distance or Software Integrity
Detection and detects similarity between programs
by computing the shared information between them.
Prechelt, Malpohl, and Philippsen discussed
JPlag [10],
which finds plagiarism in source code written in
Java, C, C++, and Scheme. The use of a minimal
match length in JPlag causes it to miss some
matches.
Apiratikul focused on Document
Fingerprinting Using Graph Grammar Induction
(DFGGI) [11], which uses a graph-based data
mining technique to find fingerprints in source
code. The authors analysed the advantages and
limitations of currently available plagiarism
detection systems and concluded that text-mining
techniques [3] can be used to check research
papers based on their similarities.
III. TECHNICAL PRELIMINARIES
A. TF-IDF
TF-IDF encoding is a weighting method that
combines the inverse document frequency (IDF) [7]
with the term frequency (TF) to produce the
feature vector $v$, such that
$$ v_i = tf_i \cdot \log\left(\frac{N}{df_i}\right) $$
The weights are assigned using the above
formula.
Here, $N$ is the total number of papers in the
discipline, $tf_i$ is the term frequency of the
feature word $w_i$, and $df_i$ is the number of
papers containing the word $w_i$.
A higher TF increases the weight of a term,
whereas IDF decreases the weight of terms that
appear in many papers. The term-document matrix
is created in this step, as shown in the figure.
A =
        d1      d2      d3
t1      0.58    0       0
t2      0.1    -0.3     0
t3      0       0       0.98
Weighted term-document matrix
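The weighting formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy corpus and the function name `tfidf_matrix` are hypothetical, and the natural logarithm is assumed because the paper does not state a log base.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a term-document matrix with entries tf * log(N / df).

    `docs` is a list of token lists. Rows of the result are terms
    (sorted alphabetically) and columns are documents.
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    # df: number of documents that contain each term
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    matrix = [
        [c[t] * math.log(n_docs / df[t]) for c in counts]
        for t in terms
    ]
    return terms, matrix


# Hypothetical toy corpus, already tokenized and filtered
docs = [
    ["plagiarism", "detection", "text"],
    ["text", "mining", "text"],
    ["semantic", "indexing"],
]
terms, A = tfidf_matrix(docs)
for term, row in zip(terms, A):
    print(term, [round(w, 3) for w in row])
```

Under this formula, a term that appears in every paper receives weight zero, since log(N/N) = 0.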
B. LSI
LSI is mathematical technique which uses
Singular value decomposition SVD. It accepts
Term-Document Matrix from TF-IDF and applies
SVD on that matrix has many applications like
clustering, vector dimension reduction and it is used
in making search engines.
D1 D2 t1 t2
D1 2 5 t1 3 5
D2 3 3 t2 2 6
From matrix A, two further matrices, B and C, are
derived:
$$ B = A A^{T}, \qquad C = A^{T} A $$
Because the rows of A are terms and its columns are
documents, B is the term-by-term matrix, which
stores the weights of terms that occur together in
the same document, and C is the
document-by-document matrix, which stores the
weights of the terms common to both documents.
One more matrix, denoted $\Sigma$, is derived from
B: it is the diagonal matrix whose principal
diagonal holds the square roots of the eigenvalues
of B. The final matrix is built using
$$ A = S \Sigma U^{T} $$
where S is the matrix whose columns are the
eigenvectors of B, and U is the matrix whose
columns are the eigenvectors of C.
A can now be used for weighting document
vectors and term vectors.
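As a small worked illustration of this decomposition, consider a made-up 2 x 2 matrix (not data from the paper) in which two terms occur equally often in two documents. Here $\Sigma$ denotes the diagonal matrix of square roots of the eigenvalues of $A A^{T}$, and S and U hold the unit eigenvectors of $A A^{T}$ and $A^{T} A$ as columns.

```latex
\begin{align*}
A &= \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
A A^{T} = A^{T} A = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}
\\[4pt]
\text{eigenvalues: } & \lambda_1 = 4,\ \lambda_2 = 0
\quad\Rightarrow\quad
\Sigma = \begin{pmatrix} \sqrt{4} & 0 \\ 0 & \sqrt{0} \end{pmatrix}
       = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}
\\[4pt]
S = U &= \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\\[4pt]
S \Sigma U^{T}
&= \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
   \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}
   \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} = A
\end{align*}
```

Only one singular value is nonzero, so a single dimension is enough to represent both documents in this toy case.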
C. Detecting Plagiarism
The formula used for detecting plagiarism is:
$$ \text{Plagiarism} = \frac{\text{total number of matched sentences in the matched proposal}}{\text{total number of sentences in the input proposal}} $$
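A minimal sketch of this ratio in Python is given below. It assumes that sentences are split on terminal punctuation and that a sentence counts as matched when it appears verbatim (after lowercasing and whitespace normalization) in the matched paper. The splitting rule, the exact-match criterion, and all function names are illustrative assumptions; the paper does not define how a sentence match is decided.

```python
import re

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def normalize(sentence):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(sentence.lower().split())

def plagiarism_score(input_text, matched_text):
    """Matched sentences of the input divided by total input sentences."""
    input_sentences = [normalize(s) for s in split_sentences(input_text)]
    if not input_sentences:
        return 0.0
    known = {normalize(s) for s in split_sentences(matched_text)}
    matched = sum(1 for s in input_sentences if s in known)
    return matched / len(input_sentences)

# Hypothetical texts: two of the four submitted sentences are copied.
existing = "Text mining finds patterns. LSI reduces dimensions. SVD is used."
submitted = ("LSI reduces dimensions. We propose a new method. "
             "SVD is used. Results follow.")
print(plagiarism_score(submitted, existing))
```

A real system would likely need a softer match criterion than exact equality, for example a similarity threshold between sentence vectors.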
IV. PROPOSED METHOD TO DETECT
PLAGIARISM
The plagiarism detection system checks the
similarity of the input paper submitted by the
end-user against the existing research papers of
the respective discipline. The system finally
outputs the Best Matching Unit (BMU): it provides
the five best-matched papers with respect to the
input research paper, in descending order of
similarity, as shown in Figure 2.
After the research papers are submitted by
the end-users, the papers in the given discipline
are checked using the text-mining technique, as
shown in Figure 3. The main plagiarism detection
process consists of four steps:
Step 1) Text document collection:
The existing research papers are stored in
text format in the database.
Step 2) Text document preprocessing:
The contents of papers are usually
unstructured. The preprocessing step analyzes the
full text of the papers, extracts and identifies
the keywords, and tokenizes them. A further
reduction in the vocabulary size is achieved by
removing frequently occurring words, referred to
as stop-words, which are listed in a stop file.
This is called the filtering phase.
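The tokenization and filtering phase can be sketched as follows. The regular expression, the tiny stop-word set, and the function name `preprocess` are assumptions for illustration; the paper reads its stop-words from a stop file whose contents are not given.

```python
import re

# Illustrative stop-word set; a real system would load these from a stop file.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize text into lowercase alphanumeric words and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("The contents of papers are usually non-structured."))
# ['contents', 'papers', 'usually', 'non', 'structured']
```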
Step 3) Text document encoding:
After filtering, each text document is
converted into a feature vector. This step uses the
TF-IDF algorithm. Each token is assigned a weight
based on its frequency (TF) within a single
research paper. IDF considers all the papers
stored in the database and calculates the inverse
of the number of research papers in which the
token appears. So, TF is a local weighting
function, while IDF is a global weighting function.
Figure 2: System Architecture
Figure 3: Main process of text mining
Step 4) Semantically indexing research papers:
After the feature vectors are obtained, LSI
creates the semantic relations among the keywords.
LSI is a technique for substituting the original
data vectors with shorter vectors in which the
semantic information is preserved. A
term-by-document matrix is formed, which is
decomposed into a set of eigenvectors using
singular value decomposition. The eigenvectors
that have the least impact on the matrix (those
with the smallest singular values) are then
discarded. Thus, the document vector formed from
the terms of the remaining eigenvectors has a much
smaller dimension and retains almost all of the
relevant original features. Hence, the system
outputs the semantically best-matched research
papers in descending order of similarity.
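As a hedged sketch of the final ranking step, the snippet below orders stored papers by cosine similarity to the input paper and returns the five best matches in descending order. It assumes the reduced LSI document vectors have already been computed. The vectors, the paper identifiers, and the choice of cosine similarity are hypothetical, since the paper does not name its similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def best_matches(query_vec, paper_vecs, k=5):
    """Return up to k (paper_id, similarity) pairs, most similar first."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical 2-dimensional LSI vectors for six stored papers.
papers = {
    "P1": [0.9, 0.1], "P2": [0.1, 0.9], "P3": [0.7, 0.3],
    "P4": [0.5, 0.5], "P5": [0.0, 1.0], "P6": [1.0, 0.0],
}
for pid, score in best_matches([1.0, 0.0], papers):
    print(pid, round(score, 3))
```

The default `k=5` mirrors the five best-matched papers that the system reports.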
V. CONCLUSION
Today, competition requires timely and
sophisticated analysis of an integrated view of
data. A new technology leap is needed to structure
and prioritize information for specific end-user
problems.
Our method uses text-mining and
optimization techniques to cluster research papers
based on their similarities. The proposed method can
be used to expedite and improve the paper grouping
process.
The plagiarism detection method for checking
papers can make this leap. It uses a text-mining
technique to check research papers based on their
similarities. It can be used in colleges and
universities to find ambiguity in the SRS submitted
by students. It can also be used for patent
analysis, in support of intellectual property
rights. Thus, the future of the proposed system
lies in constructing an automated decision-making
system for detecting plagiarism.
REFERENCES
[1] J. Vesanto and E. Alhoniemi,
"Clustering of the self-organizing
map", IEEE Trans. Neural Netw., vol.
11, no. 3, May 2000, pp. 586-600.
[2] Juan Ramos, "Using TF-IDF to
Determine Word Relevance in
Document Queries", Department of
Computer Science, Rutgers University,
23515 BPO Way, Piscataway, NJ,
08855.
[3] R. Feldman and J. Sanger, "The Text
Mining Handbook: Advanced
Approaches in Analyzing Unstructured
Data". New York: Cambridge Univ.
Press, 2007.
[4] Zukas, Anthony, Price, Robert J.,
"Document Categorization Using Latent
Semantic Indexing White Paper",
Content Analyst Company, LLC.
[5] Jian Ma, Wei Xu, Yong-hong Sun,
Efraim Turban, Shouyang Wang, and
Ou Liu, "An Ontology-Based Text-
Mining Method to Cluster Papers for
Research Project Selection", IEEE
Transactions on Systems, Man, and
Cybernetics, Part A: Systems and
Humans, vol. 42, no. 3, May 2012.
[6] W.M. Wang, C.F. Cheung, "A
Semantic-based Intellectual Property
Management System (SIPMS) for
supporting patent analysis", Knowledge
Management Research Centre,
Department of Industrial and Systems
Engineering, The Hong Kong
Polytechnic University, Hung Hom,
Kowloon, Hong Kong.
[7] Milic-Frayling, N., "Text
processing and information retrieval", in
Zanasi, A. (Ed.), Text Mining and its
Applications to Intelligence, CRM and
Knowledge Management. WIT Press,
Southampton, Boston, 2005, pp. 1-45.
[8] Wise, M., "YAP3: improved detection
of similarities in computer program and
other texts", Proceedings of the twenty-
seventh SIGCSE technical symposium
on computer science education,
Philadelphia, USA, pp. 130-134, 1996.
[9] Chen, X., B. Francia, M. Li, B.
McKinnon and A. Seker, "Shared
Information and Program Plagiarism
Detection", IEEE Transactions on
Information Theory, vol. 50, pp. 1545-
1551, 2004.
[10] Prechelt, Lutz, Guido Malpohl, Michael
Philippsen, "JPlag: Finding plagiarisms
among a set of programs", Technical
Report 2000-1, March 28, 2000.
[11] Apiratikul, P., "Document
Fingerprinting Using Graph Grammar
Induction", Master's Thesis submitted to
the Department of Computer Sciences,
Oklahoma State University, 2004.