This document summarizes a research paper on text classification for authorship attribution. It discusses using statistical features such as word length, sentence length, and vocabulary richness to differentiate writing styles numerically. The paper applies fuzzy learning classifiers and support vector machines (SVM) to classify texts by author; SVM achieved higher accuracy than the fuzzy classifiers alone, and combining the two classifiers yielded even greater accuracy than either individually.
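The stylometric measurements the summary mentions can be computed directly; a minimal sketch (the naive tokenizer and sentence splitter below are illustrative assumptions, not the paper's method) might look like:

```python
from statistics import mean

def stylometric_features(text):
    """Compute simple stylometric features of the kind the paper describes:
    average word length, average sentence length, and vocabulary richness
    (type-token ratio)."""
    # Naive sentence split on terminal punctuation.
    sentences = [s.strip()
                 for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    # Naive word tokenization: whitespace split, punctuation stripped.
    words = [w.strip(".,;:!?\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    return {
        "avg_word_len": mean(len(w) for w in words),
        "avg_sentence_len": mean(len(s.split()) for s in sentences),
        "type_token_ratio": len(set(words)) / len(words),
    }

feats = stylometric_features("The cat sat. The cat ran fast.")
```

Feature vectors like this would then be fed to the fuzzy classifier and the SVM for training.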
Evolving Swings (topics) from Social Streams using Probability Model (IJERA Editor)
Evolving swings (topics) from social streams is receiving renewed interest, motivated by the growth of social media and social streams. Non-conventional approaches, which include text, images, URLs and videos, can be appropriate. The focus is on detecting evolving topics through the social aspects of the network: the links between users that are generated, intentionally or unintentionally, through replies, mentions and retweets. A probability model of mentioning behaviour is proposed, and the model detects an evolving topic from the anomalies it measures. Several experiments show that the mention-anomaly-based approach detects an evolving swing at least as early as text-anomaly-based approaches.
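The abstract does not specify the mention-probability model; as an illustrative stand-in, fitting a Poisson distribution to a user's past mention counts and scoring a new observation by its negative log-likelihood gives a simple anomaly measure of the same flavour:

```python
import math

def anomaly_score(history, observed):
    """Score how surprising `observed` mention count is given past counts,
    using -log P(observed) under a Poisson fit to the history.
    A toy stand-in for the paper's mention-probability model."""
    rate = sum(history) / len(history)  # MLE of the Poisson rate
    log_p = observed * math.log(rate) - rate - math.lgamma(observed + 1)
    return -log_p  # larger = more anomalous

calm = anomaly_score([2, 3, 2, 3], 3)    # a typical count
burst = anomaly_score([2, 3, 2, 3], 20)  # a sudden mention burst
```

A topic burst shows up as a spike in this score across the users mentioning it.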
Ethical and Unethical Methods of Plagiarism Prevention in Academic Writing (Nader Ale Ebrahim)
K. Bakhtiyari, H. Salehi, M. A. Embi, M. Shakiba, A. Zavvari, M. Shahbazi-Moghadam, N. Ale Ebrahim, and M. Mohammadjafari, “Ethical and Unethical Methods of Plagiarism Prevention in Academic Writing,” International Education Studies, vol. 7, no. 7, pp. 52-62, 19 June 2014.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it will go to waste if not stored in digital form. Searching these scanned images for relevant information would ideally require converting the document images to text by optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach, word spotting, can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching word images in Devanagari script. Shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated on Marathi document images.
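The integer shape codes are described only at a high level; a hedged toy analogue quantizes a word image's vertical projection profile into digits (the quantization scheme below is an assumption, not the paper's actual encoding):

```python
def shape_code(image):
    """Derive an integer code from a binary word image (list of rows of 0/1)
    via its vertical projection profile: count ink pixels per column and
    quantize each count to one digit. A toy analogue of the paper's codes."""
    cols = len(image[0])
    profile = [sum(row[c] for row in image) for c in range(cols)]
    peak = max(profile) or 1
    digits = [min(9, (9 * p) // peak) for p in profile]
    return int("".join(map(str, digits)))

# A tiny 3x4 binary "word image".
img = [[0, 1, 1, 0],
       [1, 1, 1, 0],
       [0, 1, 0, 0]]
code = shape_code(img)
```

Retrieval then reduces to comparing integer codes instead of matching pixels, which is why the approach scales to large collections.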
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
Segmentation of Handwritten Chinese Character Strings Based on improved Algor... (ijeei-iaes)
Algorithm Liu has attracted attention for its high accuracy in segmenting Japanese postal addresses, but its complexity and difficult implementation have hindered its popularization and application. In this paper, the author applies the principles of algorithm Liu to handwritten Chinese character segmentation, adapted to the characteristics of handwritten Chinese characters and based on a deep study of the algorithm. The author also puts forward judgment criteria for classifying segmentation blocks and for the adhering (touching) modes of handwritten Chinese characters. During segmentation, a text image is treated as a sequence of connected components (CCs), each made up of several horizontal runs of black pixels. The method decides whether these parts should be merged by analyzing the connected components, then segments touching characters according to their adhering mode based on an analysis of outline edges, and finally cuts the text image into character segments. Experimental results show that the improved algorithm Liu achieves high segmentation accuracy and produces satisfactory segmentation results.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Abstract: This paper presents a Semantic Analyzer for checking the semantic correctness of a given input text. We describe our system as one that analyzes the text by comparing it with the meanings of words given in WordNet. The Semantic Analyzer thus developed not only detects and displays semantic errors in the text but also corrects them. Keywords: Part of Speech (POS) Tagger, Morphological Analyzer, Syntactic Analyzer, Semantic Analyzer, Natural Language (NL)
Nowadays, Twitter provides a way to collect and understand users' opinions about many private and public organizations. Such organizations create and monitor targeted Twitter streams to understand users' views about them. Usually, user-defined selection criteria are used to filter and construct a targeted Twitter stream. Applications for early crisis detection and response built on such streams require a good Named Entity Recognition (NER) system for Twitter, able to automatically discover emerging named entities that are potentially linked to the crisis. However, many applications suffer severely from the short, noisy nature of tweets. We present a framework called HybridSeg, which extracts and preserves linguistic meaning and context information by first splitting tweets into meaningful segments. The optimal segmentation of a tweet is the one that maximizes the sum of the stickiness scores of its candidate segments. The stickiness score reflects the probability that a segment belongs to the global context (i.e., is a valid English phrase) or to the local context (i.e., occurs within a batch of tweets); the framework learns from both contexts, and can also learn from pseudo-feedback. Finally, from the result of the semantic analysis, the proposed system also provides sentiment analysis.
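Finding the segmentation that maximizes the summed stickiness scores is a classic dynamic program; a sketch under an assumed score table (the real stickiness function is learned from the global and local contexts, and the default score and length cap below are assumptions) could be:

```python
def best_segmentation(tokens, stickiness):
    """Find the segmentation of `tokens` maximizing the summed stickiness of
    its segments by dynamic programming. `stickiness` maps a candidate phrase
    to a score; unknown phrases get a small default so single words stay viable."""
    n = len(tokens)
    best = [0.0] * (n + 1)   # best[i]: max score for tokens[:i]
    back = [0] * (n + 1)     # split point achieving best[i]
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # cap segments at 4 tokens
            seg = " ".join(tokens[j:i])
            score = best[j] + stickiness.get(seg, 0.1)
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Recover the segments from the backpointers.
    segs, i = [], n
    while i > 0:
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

scores = {"new york": 2.0, "city": 0.5, "new": 0.3, "york city": 0.8}
segments = best_segmentation(["new", "york", "city"], scores)
```

Here the phrase "new york" scores highly in the (hypothetical) global context, so the tweet splits into `["new york", "city"]` rather than three single words.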
Electronic mail is the most widely used and convenient method of transferring messages electronically from one person to another, to and from any part of the world. Its main features, speed, dependability, well-equipped storage options and a large number of added services, make it highly popular among people from all sectors of business and society. But its popularity has a negative side too: email is the preferred medium for a large number of attacks over the internet, among the most common of which is spam. Some methods can detect spam-related mail, but they have high false-positive rates. Filters such as checksum-based filters, Bayesian filters, machine-learning-based filters and memory-based filters are commonly used to recognize spam. As spammers constantly find ways to evade existing filters, new filters need to be developed to catch spam. This paper proposes an efficient spam mail filtering method using a user-profile-based ontology. Ontologies allow machine-understandable semantics of data, and exchanging this information is key to more efficient spam filtering. Thus, it is essential to build an ontology and a framework for capable email filtering. Using an ontology specifically designed to filter spam, useless bulk email can be filtered out of the system. We propose a user-profile-based spam filter that classifies email based on the likelihood that the user-profile terms within it have appeared in spam or in valid email.
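Among the filter families the passage lists, a Bayesian filter is the easiest to sketch. The following is a minimal naive Bayes classifier with add-one smoothing, offered only as an illustration of that family, not as the paper's ontology-based method:

```python
import math
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Count word occurrences in known spam and known legitimate (ham) mail."""
    spam = Counter(w for m in spam_msgs for w in m.lower().split())
    ham = Counter(w for m in ham_msgs for w in m.lower().split())
    return spam, ham

def is_spam(msg, spam, ham):
    """Compare the message's log-likelihood under each class; add-one
    smoothing keeps unseen words from zeroing out a probability."""
    vocab = set(spam) | set(ham)
    s_total, h_total = sum(spam.values()), sum(ham.values())
    log_s = log_h = 0.0
    for w in msg.lower().split():
        log_s += math.log((spam[w] + 1) / (s_total + len(vocab)))
        log_h += math.log((ham[w] + 1) / (h_total + len(vocab)))
    return log_s > log_h

spam, ham = train(["win money now", "free money offer"],
                  ["meeting at noon", "project report attached"])
flag = is_spam("free money", spam, ham)
ok = is_spam("project meeting", spam, ham)
```

The ontology-based approach the paper proposes would replace the flat word counts with user-profile concepts, but the likelihood comparison works the same way.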
Sentence Extraction Based on Sentence Distribution and Part of Speech Tagging... (TELKOMNIKA JOURNAL)
Automatic multi-document summarization needs to find representative sentences not only by
sentence distribution to select the most important sentence but also by how informative a term is in a
sentence. Sentence distribution is suitable for obtaining important sentences by determining frequent and
well-spread words in the corpus but ignores the grammatical information that indicates instructive content.
The presence or absence of informative content in a sentence can be indicated by grammatical information
which is carried by part of speech (POS) labels. In this paper, we propose a new sentence weighting
method by incorporating sentence distribution and POS tagging for multi-document summarization.
Similarity-based Histogram Clustering (SHC) is used to cluster sentences in the data set. Cluster ordering
is based on cluster importance to determine the important clusters. Sentence extraction based on
sentence distribution and POS tagging is introduced to extract the representative sentences from the
ordered clusters. The results of the experiment on the Document Understanding Conferences (DUC) 2004
are compared with those of the Sentence Distribution Method. Our proposed method achieved better
results with an increasing rate of 5.41% on ROUGE-1 and 0.62% on ROUGE-2.
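The paper's exact weighting formula is not given here; as an illustration of the idea, one could scale a sentence's distribution score by the fraction of its tokens carrying informative POS labels. The combination rule and the tag set below are assumptions for the sketch:

```python
def sentence_weight(tagged_sentence, dist_score,
                    informative=("NOUN", "VERB", "ADJ")):
    """Combine a sentence-distribution score with grammatical information:
    scale the distribution score by the fraction of tokens whose POS label
    indicates informative content. Illustrative, not the paper's formula."""
    hits = sum(1 for _, tag in tagged_sentence if tag in informative)
    return dist_score * (hits / len(tagged_sentence))

# A pre-tagged sentence: (token, POS label) pairs.
sent = [("summarization", "NOUN"), ("improves", "VERB"),
        ("the", "DET"), ("results", "NOUN")]
w = sentence_weight(sent, dist_score=0.8)
```

A sentence full of determiners and particles is thus down-weighted even if its words are frequent and well-spread in the corpus.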
Sentiment Analysis of Document Based on Annotation (dannyijwest)
I present a tool that assesses the quality or usefulness of a document based on its annotations. Annotations may include comments, notes, observations, highlights, underlines, explanations, questions, help requests, etc. Comments are used for evaluation, while the others are used for summarization or expansion. Further, a comment may itself be attached to another annotation; such annotations are referred to as meta-annotations. Not all annotations receive equal weight. The tool considers highlights and underlines as well as comments to infer the collective sentiment of the annotators, classified as positive, negative, or objective. The tool computes the collective sentiment of annotations in two ways: it counts all the annotations present on the document, and it also computes sentiment scores of all annotations, including comments, to obtain the collective sentiment about the document and judge its quality. I demonstrate the use of the tool on research papers.
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION (ijnlc)
Word Sense Disambiguation (WSD) is an important area with an impact on the performance of computational-linguistics applications such as machine translation, information retrieval, text summarization, and question answering systems. We present a brief history of WSD and discuss the supervised, unsupervised, and knowledge-based approaches to it. Though many WSD algorithms exist, we consider optimal and portable WSD algorithms the most appropriate, since they can be embedded easily in applications of computational linguistics. This paper also gives an idea of some WSD algorithms and their performance, and compares and assesses the need for word sense disambiguation.
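As one concrete example of the knowledge-based family such a survey covers, the simplified Lesk algorithm picks the sense whose dictionary gloss overlaps most with the surrounding context. The glosses below are toy stand-ins, not real WordNet entries:

```python
def lesk(context, senses):
    """Simplified Lesk: choose the sense whose gloss shares the most words
    with the context. `senses` maps a sense label to its gloss text."""
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(ctx & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a body of water",
}
sense = lesk("she sat on the land beside the river water", senses)
```

Its portability is exactly the property the survey values: it needs only a sense inventory with glosses, so it embeds easily in downstream applications.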
Rule-based Prosody Calculation for Marathi Text-to-Speech Synthesis (IJERA Editor)
This research paper presents two empirical studies that examine the influence of different linguistic aspects on prosody in Marathi. First, we analyzed a Marathi corpus with respect to the effect of syntax and information status on prosody. Second, we conducted a listening test that investigated the prosodic realisation of constituents in Marathi depending on their information status. The results were used to improve prosody prediction in the Marathi text-to-speech synthesis system MARY.
Mining Opinion Features in Customer Reviews (IJCERT JOURNAL)
Nowadays, e-commerce systems have become extremely important, and large numbers of customers choose online shopping for its convenience, reliability, and cost. Client-generated information, and especially item reviews, is a significant source of data for consumers making informed purchase choices and for manufacturers keeping track of customers' opinions. It is difficult for customers to make purchasing decisions based only on pictures and short product descriptions. Meanwhile, mining product reviews has become a hot research topic, and prior research is mostly based on pre-specified product features for analysing opinions. Natural Language Processing (NLP) techniques, such as NLTK for Python, can be applied to raw customer reviews to extract keywords. This paper presents a survey of the techniques used to design software that mines opinion features in reviews: eleven IEEE papers are selected and compared, representative of the significant improvements in opinion mining over the past decade.
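Candidate opinion features can be surfaced from raw reviews even with plain token frequency; real systems in such surveys use POS tagging (e.g. via NLTK) to keep only noun phrases, so this stdlib-only version, with its hand-picked stopword list, is only an illustration of the first step:

```python
from collections import Counter

# A tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "is", "it", "and", "but", "i", "very", "this"}

def candidate_features(reviews, top_n=3):
    """Rank frequent non-stopword tokens across reviews as candidate
    product features. Frequency alone also surfaces opinion words like
    'great'; POS filtering would remove those in a real system."""
    counts = Counter(
        w.strip(".,!?").lower()
        for review in reviews
        for w in review.split()
    )
    for sw in STOPWORDS:
        counts.pop(sw, None)
    return [w for w, _ in counts.most_common(top_n)]

reviews = ["The battery life is great.",
           "Battery drains fast but the screen is sharp.",
           "Great screen, poor battery."]
features = candidate_features(reviews)
```

Here "battery" dominates the ranking, which matches the intuition that frequently discussed aspects are the product features worth mining opinions about.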
This paper addresses the extraction of important keywords from conversations, with the objective of using these keywords to retrieve, for each short audio fragment, a small number of potentially relevant documents that can be recommended to participants just in time. However, even a short audio fragment contains a mixed bag of words potentially related to several topics; moreover, an automatic speech recognition (ASR) system introduces errors into the output. It is therefore hard to infer precisely the information needs of the conversation participants. We first propose an algorithm to extract keywords from the output of an ASR system (or from a manual transcript, for testing) that covers the potential diversity of topics and reduces ASR noise. We then use a technique that builds multiple implicit queries from the selected keywords, which in turn produce lists of relevant documents. The scores show that our proposal improves over previous systems that consider only word frequency or topic similarity, and represents a promising solution for a document recommender system to be used in conversations.
E-TEXT in E-FL : FOUR FLAVOURS
Dr. Przemysław Kaszubski : IFAConc - web-concordancing with EAP writing students
Mgr Joanna Jendryczka-Wierszycka : E-text annotation - why bother?
Dr. Michał Remiszewski : Towards competence mapping in language teaching/learning
Prof. Włodzimierz Sobkowiak : E-text in Second Life: reification of text?
[ http://ifa.amu.edu.pl/fa/node/1144 ]
[ http://ifa.amu.edu.pl/fa/node/1123 ]
Text mining helps users find useful information in large collections of text documents on the web or in databases. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods describe user preferences. This review paper analyses how text mining works at three levels: the sentence level, the document level, and the feature level. We review related prior work, demonstrate the problems that arise when text mining is done at the feature level, and present a technique for text mining over compound sentences.
Text mining attempts to discover new, previously unknown or hidden information by automatically
extracting it from various written resources. Applying knowledge discovery methods to
unstructured text is known as Knowledge Discovery in Text, text data mining, or simply text mining.
Most text mining techniques are based on the statistical analysis of a term, either a word or a
phrase. Several algorithms have been used in earlier text mining work. For example, the
Single-Link algorithm and Self-Organizing Maps (SOM) offer approaches to this problem: SOM visualizes
high-dimensional data and is a very useful projection-based tool for processing textual data.
Genetic and sequential algorithms provide multiscale representations of datasets and are
fast to compute, with low CPU time, using Isolet-reduced subsets in unsupervised feature
selection. We propose a Vector Space Model together with a concept-based analysis algorithm, which we expect to
improve text clustering quality and achieve better clustering results. The proposed algorithm also
behaves well in terms of robustness and stability with respect to the formation of the
neural network.
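The vector space model the abstract proposes can be sketched minimally: documents become term-frequency vectors and are compared with cosine similarity. The clustering step and the concept-based analysis are omitted; the documents are invented for illustration.

```python
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Term-frequency vector of a document in the vector space model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

d1 = tf_vector("text mining finds patterns in text")
d2 = tf_vector("text mining extracts patterns")
d3 = tf_vector("genetic algorithms evolve programs")
# d1 is far closer to d2 than to d3, so a clusterer would group d1 with d2
```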
Textual Document Categorization using Bigram Maximum Likelihood and KNNRounak Dhaneriya
In recent years, text mining has evolved into a vast field of research in machine learning and artificial intelligence. Text mining is difficult to conduct on unstructured data. This research work focuses on classifying textual data from three literature books: Oliver Twist, Don Quixote, and Pride and Prejudice. We used two algorithms, KNN and bigram-based maximum likelihood, and evaluated accuracy using a confusion matrix. The results suggest that text classification using bigram-based maximum likelihood performs well.
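A bigram maximum-likelihood classifier of the kind described can be sketched as follows: each author's training text yields a bigram model, and a test sentence is assigned to the author under whose model it is most likely. The add-one (Laplace) smoothing is an assumption added to avoid zero probabilities, and the toy snippets stand in for the actual books.

```python
import math
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return list(zip(words, words[1:]))

def train(text):
    """Bigram counts and unigram counts for one author's training text."""
    return Counter(bigrams(text)), Counter(text.lower().split())

def log_likelihood(sentence, model, vocab_size):
    """Log P(sentence | author) under the bigram model, with add-one smoothing."""
    bi, uni = model
    return sum(
        math.log((bi[(w1, w2)] + 1) / (uni[w1] + vocab_size))
        for w1, w2 in bigrams(sentence)
    )

def classify(sentence, models):
    """Maximum likelihood: pick the author whose model scores the sentence highest."""
    vocab = {w for _, uni in models.values() for w in uni}
    return max(models, key=lambda a: log_likelihood(sentence, models[a], len(vocab)))

models = {
    "A": train("please sir i want some more please sir"),
    "B": train("the windmill turned as the knight charged the windmill"),
}
```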
AN AUTHORSHIP IDENTIFICATION EMPIRICAL EVALUATION OF WRITING STYLE FEATURES I...CSIT8
In this paper, an investigation was done to identify writing style features that can be used for cross-topic
and cross-genre documents in the Authorship Identification task from 2003 to 2015. Different writing style
features were empirically evaluated that were previously used in single topic and single genre documents
for Authorship Identification to determine whether they can be used effectively for cross-topic and cross-genre Authorship Identification using an ablation process. The dataset used was taken from the 2015 PAN
CLEF Forum English collection consisting of 100 sets. Furthermore, it was investigated whether
combining some of these feature sets can help improve the authorship identification task. Three different
classifiers were used: Naïve Bayes, Support Vector Machine, and Random Forest. The results suggest that
a combination of lexical, syntactical, structural, and content feature sets can be used effectively for
cross-topic and cross-genre authorship identification, as it achieved an AUC result of 0.837.
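The AUC metric reported above can be computed from classifier scores without any external library, using the rank-statistic (Mann–Whitney) formulation: AUC is the probability that a randomly chosen positive example outranks a randomly chosen negative one. The labels and scores below are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # each positive/negative pair contributes 1 for a correct ordering,
    # 0.5 for a tie, 0 otherwise
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# toy authorship-verification scores: 1 = same author, 0 = different author
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
```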
International Journal of Computational Engineering Research(IJCER) ijceronline
International Journal of Computational Engineering Research (IJCER) is an international, monthly, online journal published in English. The journal publishes original research that contributes significantly to scientific knowledge in engineering and technology.
Sentiment analysis is context-based text mining that extracts and identifies subjective information from a text or sentence. The main idea is to extract the sentiment of the text using machine learning techniques such as LSTM (Long Short-Term Memory). This text classification method analyses incoming text and determines whether the underlying emotion is positive or negative, along with an associated probability. The probability depicts the strength of the sentiment: if it is close to 0, the sentiment is strongly negative, and if it is close to 1, the statement is strongly positive. A web application was created to deploy the model using Flask, a Python-based micro framework. Other methods, such as RNN and CNN, proved less efficient than LSTM. Dirash A R | Dr. S K Manju Bargavi, "LSTM Based Sentiment Analysis," published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 5, Issue 4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42345.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-processing/42345/lstm-based-sentiment-analysis/dirash-a-r
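The probability interpretation described (near 0 = strongly negative, near 1 = strongly positive) can be sketched as a small post-processing helper. The LSTM itself is omitted, and the 0.8 strength threshold and the output wording are assumptions of this sketch, not the paper's.

```python
def interpret(prob: float) -> str:
    """Map a model's positive-class probability to a label with strength."""
    label = "positive" if prob >= 0.5 else "negative"
    # distance from the 0.5 decision boundary, rescaled to [0, 1]
    strength = abs(prob - 0.5) * 2
    qualifier = "strongly " if strength > 0.8 else ""
    return f"{qualifier}{label} (p={prob:.2f})"
```

In a deployment like the one described, a Flask route would run the model on the request text and return `interpret(model_probability)` as the response.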
Data mining is knowledge discovery in databases; its goal is to extract patterns and knowledge from
large amounts of data. An important branch of data mining is text mining, which extracts high-quality
information from text. Statistical pattern learning is used to obtain this high-quality information,
where "high quality" refers to some combination of relevance, novelty and interestingness. Tasks in text mining include text
categorization, text clustering, entity extraction and sentiment analysis. Natural language
processing and analytical methods are widely preferred to turn
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of engineering and technology.
Elevating forensic investigation system for file clusteringeSAT Journals
Abstract: In computer forensic investigation, thousands of files are usually surveyed. Much of the data in those files consists of unstructured text, which is very hard for computer examiners to analyse. Clustering is the unsupervised organization of patterns (data items, observations, or feature vectors) into groups (clusters). Finding a good solution for this automated kind of analysis is of great interest. In particular, algorithms such as K-means, K-medoids, Single Link, Complete Link and Average Link can simplify the discovery of new and valuable information in the documents under investigation. This paper presents an approach that applies text clustering algorithms, using multithreading, to the forensic examination of computers seized in police investigations. Keywords: clustering, forensic computing, text mining, multithreading.
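Of the algorithms the abstract lists, single-link clustering is the simplest to sketch: documents are merged into the same cluster whenever some pair of them is more similar than a threshold. The shared-word Jaccard similarity, the threshold value, and the toy "evidence" strings are illustrative assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two documents."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def single_link(docs: list[str], threshold: float = 0.3) -> list[set[int]]:
    """Merge any two clusters containing a document pair more similar
    than the threshold (the single-link criterion)."""
    clusters = [{i} for i in range(len(docs))]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(jaccard(docs[a], docs[b]) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i] |= clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters

evidence = [
    "transfer funds to offshore account",
    "offshore account transfer confirmed",
    "birthday party photos attached",
]
groups = single_link(evidence)
```

The multithreading the paper proposes would parallelize the pairwise similarity computations, which dominate the cost.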
This paper proposes a natural-language Discourse Analysis method for extracting
information from news articles in different domains. The discourse analysis uses Rhetorical Structure
Theory (RST), which finds coherent groups of text that are most prominent for extracting information
from the text. RST uses the nucleus-satellite concept to find the most prominent text in a
document. After the discourse analysis, text analysis is performed to extract domain-related objects
and relate them. A knowledge-based system is used for the extraction; it contains a
domain dictionary, a bag of words for each domain. The system is
evaluated against a gold standard and human judgment of the extracted information.
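The domain-dictionary matching step can be sketched as follows; the RST discourse parsing itself is far beyond a snippet. The dictionary entries and the sentence-selection rule are assumptions made for illustration.

```python
DOMAIN_DICTIONARY = {  # bag of words per domain (hypothetical entries)
    "finance": {"bank", "shares", "market", "profit"},
    "sports": {"match", "goal", "team", "coach"},
}

def extract(article: str):
    """Tag each sentence with the domain whose dictionary it overlaps most."""
    results = []
    for sentence in article.split("."):
        words = set(sentence.lower().split())
        best = max(DOMAIN_DICTIONARY,
                   key=lambda d: len(words & DOMAIN_DICTIONARY[d]))
        if words & DOMAIN_DICTIONARY[best]:
            results.append((sentence.strip(), best))
    return results

hits = extract("The bank reported record profit. The team scored a late goal.")
```

In the paper's pipeline this matching would run only on the nucleus spans that the RST analysis has already singled out as prominent.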
Similar to TEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSIS (20)
Advanced Computing: An International Journal (ACIJ) is a peer-reviewed, open access journal that publishes articles which contribute new results in all areas of advanced computing. The journal focuses on all technical and practical aspects of high performance computing, green computing, pervasive computing, cloud computing etc. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on understanding advances in computing and establishing new collaborations in these areas.
Authors are solicited to contribute to the journal by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the areas of computing.
Call for Papers - Advanced Computing An International Journal (ACIJ) (2).pdfacijjournal
Submit your Research Papers!!!
Advanced Computing: An International Journal ( ACIJ )
ISSN: 2229 -6727 [Online] ; 2229 - 726X [Print]
Webpage URL: http://airccse.org/journal/acij/acij.html
Submission URL: http://coneco2009.com/submissions/imagination/home.html
Submission Deadline : April 08, 2023
Here's where you can reach us : acijjournal@yahoo.com or acij@aircconline.com
7th International Conference on Data Mining & Knowledge Management (DaKM 2022)acijjournal
7th International Conference on Data Mining & Knowledge Management (DaKM 2022) provides a forum for researchers who address these issues to present their work in a peer-reviewed venue.
4th International Conference on Machine Learning & Applications (CMLA 2022)acijjournal
4th International Conference on Machine Learning & Applications (CMLA 2022) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning. The aim of the conference is to provide a platform for researchers and practitioners from both academia and industry to meet and share cutting-edge developments in the field.
3rd International Conference on Natural Language Processing and Applications (N...acijjournal
3rd International Conference on Natural Language Processing and Applications (NLPA 2022) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Natural Language Computing. The conference looks for significant contributions to all major fields of Natural Language Processing, in theoretical and practical aspects.
Graduate School Cyber Portfolio: The Innovative Menu For Sustainable Developmentacijjournal
In today’s milieu, new demands and trends emerge in the field of education, giving teachers of Higher Education Institutions (HEIs) no choice but to innovate to cope with fast-changing technology. To be naturally innovative, a graduate school teacher needs to be technologically and pedagogically competent. One way to reach this level is by creating a cyber portfolio to support students’ e-portfolios for lifelong learning. A cyber portfolio is an innovative menu for teachers who seek strategies to integrate technology into their lessons. This paper presents a straightforward preparation on how to build a cyber portfolio: a practical, breakthrough alternative to the expensive and inflexible vended software that often saddles many universities. Additionally, this cyber portfolio is free, and it addresses the 21st-century skills of graduate students, blended with higher-order thinking skills, multiple intelligences, technology and multimedia.
Genetic Algorithms and Programming - An Evolutionary Methodologyacijjournal
Genetic programming (GP) is an automated method for creating a working computer program from a high-level statement of a problem: GP starts from "what needs to be done" and automatically creates a computer program to solve it. In artificial intelligence, GP is an evolutionary-algorithm-based methodology, inspired by biological evolution, for finding computer programs that perform a user-defined task. It is a specialization of genetic algorithms (GA) in which each individual is a computer program, and it is a machine learning technique used to optimize a population of computer programs according to a fitness measure determined by a program's ability to perform a given computational task. This paper presents the various principles of genetic programming, including the relative effectiveness of mutation, crossover, breeding of computer programs, and the fitness test. The literature on traditional genetic algorithms contains related studies, but GP saves time by freeing the human from having to design complex algorithms, and it can create algorithms that outperform their traditional counterparts in noteworthy ways.
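The mutation, crossover, and fitness-test loop described can be illustrated with the simplest genetic algorithm (OneMax: evolve a bit string toward all ones). Full genetic programming evolves program trees rather than bit strings, but the evolutionary loop is identical. Population size, rates, and the fitness function here are illustrative choices, not taken from the paper.

```python
import random

random.seed(42)  # deterministic run for illustration
LENGTH, POP, GENERATIONS = 20, 30, 60

def fitness(bits):            # fitness test: count of 1-bits
    return sum(bits)

def crossover(a, b):          # one-point crossover of two parents
    point = random.randrange(1, LENGTH)
    return a[:point] + b[point:]

def mutate(bits, rate=0.02):  # flip each bit with small probability
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]           # truncation selection (elitist)
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

best = max(population, key=fitness)
```

Because the best half of each generation survives unchanged, fitness never decreases, mirroring the "breeding plus fitness test" cycle GP uses on program trees.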
Data Transformation Technique for Protecting Private Information in Privacy P...acijjournal
Data mining is the process of extracting patterns from data. It is seen by modern business as an increasingly important tool for transforming data into an informational advantage, and it can be utilized in any organization that needs to find patterns or relationships in its data: a group of techniques that discover relationships not previously known. In many situations the extracted patterns are highly private and should not be disclosed. To maintain the secrecy of the data,
several techniques and algorithms are needed for modifying the original data so as to limit the extraction of confidential patterns. There are two types of privacy in data mining. In the first, the data is altered so that the mining result preserves certain privacy. In the second, the data is manipulated so that the mining result is unaffected or only minimally affected. The aim of privacy-preserving data mining research is to develop data mining techniques that can be
applied to databases without violating the privacy of individuals. Many techniques for privacy-preserving data mining have emerged over the last decade, including statistical and cryptographic methods, randomization, the k-anonymity model, and l-diversity. In this work, we propose a new perturbative masking approach, a data transformation technique that can be used to protect sensitive information. Experimental
results show that the proposed technique gives better results than the existing technique.
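Perturbative masking of the kind proposed can be sketched as additive noise: each value is shifted by a small random amount, hiding individual records while keeping aggregate statistics close to the original. The uniform noise model and the scale parameter are illustrative assumptions, not the paper's specific transformation.

```python
import random

def perturb(values, scale=0.05, seed=0):
    """Additive-noise masking: shift each value by uniform noise
    proportional to the data range, hiding individual records."""
    rng = random.Random(seed)
    spread = (max(values) - min(values)) * scale
    return [v + rng.uniform(-spread, spread) for v in values]

salaries = [30_000, 45_000, 52_000, 61_000, 75_000]
masked = perturb(salaries)
# a miner sees the masked values; aggregate patterns survive,
# but no individual salary is disclosed exactly
```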
E-Maintenance: Impact Over Industrial Processes, Its Dimensions & Principlesacijjournal
During the Industry 4.0 era, companies have developed exponentially and have
digitized almost their whole business systems to meet their performance targets and to keep, or even
enlarge, their market share. The maintenance function has naturally followed this trend, as it is considered one
of the most important processes in every enterprise: it affects some of the most critical performance
indicators, such as cost, reliability, availability, safety and productivity. E-maintenance emerged in the early
2000s and is now a common term in the maintenance literature, representing the digitalized side of maintenance
whereby assets are monitored and controlled over the internet. According to the literature, e-maintenance has
a remarkable impact on maintenance KPIs and aims at ambitious objectives such as zero downtime.
10th International Conference on Software Engineering and Applications (SEAPP...acijjournal
10th International Conference on Software Engineering and Applications (SEAPP 2021) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Software Engineering and Applications. The goal of this Conference is to bring together researchers and practitioners from academia and industry to focus on understanding Modern software engineering concepts and establishing new collaborations in these areas.
10th International conference on Parallel, Distributed Computing and Applicat...acijjournal
10th International Conference on Parallel, Distributed Computing and Applications (IPDCA 2021) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of parallel and distributed computing. Original papers are invited on algorithms and applications, computer networks, cyber trust and security, wireless networks and mobile computing, and bioinformatics. The aim of the conference is to provide a platform for researchers and practitioners from both academia and industry to meet and share cutting-edge developments in the field.
DETECTION OF FORGERY AND FABRICATION IN PASSPORTS AND VISAS USING CRYPTOGRAPH...acijjournal
In this paper, we present a novel solution to detect forgery and fabrication in passports and visas using
cryptography and QR codes. The solution requires that the passport and visa issuing authorities obtain a
cryptographic key pair and publish their public key on their website. Further they are required to encrypt
the passport or visa information with their private key, encode the ciphertext in a QR code and print it on
the passport or visa they issue to the applicant.
The issuing authorities are also required to create a mobile or desktop QR code scanning app and place it
for download on their website or Google Play Store and iPhone App Store. Any individual or immigration
authority that needs to check the passport or visa for forgery and fabrication can scan its QR code, which
will decrypt the ciphertext encoded in the QR code using the public key stored in the app memory and
display the passport or visa information on the app screen. The details on the app screen can be
compared with the actual details printed on the passport or visa. Any mismatch between the two is a clear
indication of forgery or fabrication.
We also discuss the need for a universal desktop and mobile app that can be used by immigration authorities
and consulates all over the world to enable fast checking of passports and visas at ports of entry for forgery
and fabrication.
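The encrypt-with-private-key / decrypt-with-public-key flow described is, in effect, a digital signature. The sketch below illustrates it with textbook RSA over tiny primes purely for exposition; a real system would use 2048-bit keys through a vetted cryptography library, and hashing the document before signing is an assumption added here.

```python
import hashlib

# Toy RSA key pair over tiny primes (never use sizes like this in practice)
p, q = 61, 53
n = p * q                     # public modulus
phi = (p - 1) * (q - 1)
e = 17                        # public exponent (published by the authority)
d = pow(e, -1, phi)           # private exponent (Python 3.8+ modular inverse)

def digest(message: bytes) -> int:
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

def sign(message: bytes) -> int:
    """Issuing authority: 'encrypt' the document digest with the private key.
    The result would be encoded in the QR code printed on the visa."""
    return pow(digest(message), d, n)

def verify(message: bytes, signature: int) -> bool:
    """Border app: 'decrypt' with the public key and compare digests.
    A mismatch indicates forgery or fabrication."""
    return pow(signature, e, n) == digest(message)

visa = b"name=A. Traveller; passport=X1234567; expiry=2030-01-01"
sig = sign(visa)
```

Because only the authority holds `d`, nobody else can produce a QR code that verifies against the published public key, which is exactly the tamper-evidence property the paper relies on.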
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Advanced Computing: An International Journal (ACIJ), Vol. 4, No. 5, September 2013
DOI: 10.5121/acij.2013.4501
TEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSIS
M. Sudheep Elayidom¹, Chinchu Jose², Anitta Puthussery³, Neenu K Sasi⁴
¹Division of Computer Science and Engineering, School of Engineering, Cochin University, Kalamassery, India
²,³,⁴Adi Shankara Institute of Engineering and Technology, Kalady, India
ABSTRACT
Authorship attribution deals mainly with literary texts of uncertain authorship. It is useful for resolving disputed authorship, recognizing the authors of unknown texts, spotting plagiarism, and so on. Statistical quantitative techniques can be used to characterize an author's style numerically. The basic measures used in computational stylometry are word length, sentence length, vocabulary richness, word frequencies, and the like. Each author has an inborn style of writing that is peculiar to him or her. The problem can be broken down into three sub-problems: author identification, author characterization, and similarity detection. The steps involved are pre-processing, feature extraction, classification, and author identification. Different classifiers can be used for this; here a fuzzy learning classifier and an SVM are used. In the author identification experiments the SVM was found to be more accurate than the fuzzy classifier. The two classifiers were then combined, yielding better accuracy than either the SVM or the fuzzy classifier alone.
KEYWORDS
Authorship attribution, Text pre-processing, Stemming, Feature extraction, Machine learning classifier
1. INTRODUCTION
Authorship attribution is the process of determining the likely author of a given text document when it is ambiguous who wrote it. Its applications include plagiarism detection, resolving disputed authorship, and so on. It is useful when two or more people claim to have written something, or when no one can say who wrote a document. The complexity of the authorship problem obviously grows rapidly with the number of likely authors. The availability of author text samples is another major constraint when tackling this problem.
Text authorship attribution involves the following three problems:
1. The one-out-of-many problem – identifying the author of a text from a group of probable or suspected authors, where the author is known to be in the group of suspects.
2. The none-or-one-out-of-many problem – identifying the author of a text from a group of probable or suspected authors, where the author may not be in the group of suspects.
3. The sole-author problem – estimating the likelihood that a given text was or was not written by a given author.
In this work the main focus is on identifying the author of a given text through several steps: first data pre-processing, then feature extraction, then classification, and finally author identification. Data pre-processing involves tokenizing and stemming the text. Feature extraction covers features such as the top-k most frequent words, the numbers of punctuation marks, symbols, characters, sentences, and words, and the ratio of character count to sentence count. Classification uses three classifiers: a fuzzy classifier, an SVM, and a combination of the two. Finally, a performance analysis determines the accuracy of each classifier.
2. LITERATURE REVIEW
2.1. Creating author fuzzy fingerprints for authorship identification
In this work [1] a fingerprint is extracted from a set of texts and then used to identify the author of another text. This fingerprint is unlike a biological fingerprint. In computer science, fingerprints are usually used to avoid comparing and transmitting massive data. For example, to verify efficiently whether a remote file has been altered, a web browser or proxy server can simply fetch its fingerprint and compare it with the fingerprint of the previously fetched copy. Fingerprints are a swift and compact way to recognize items. To serve author identification, a fingerprint must capture the identity of its author; in other words, the probability of a conflict, that is, two authors having the same fingerprint, must be small. The fingerprint must also be robust, so that a text can still be identified even if the author changes some aspects of the writing style. The idea of identifying text authorship from an author fingerprint is a very appealing one.
When word frequencies are used as a proxy for the individual behind a particular text, one can collect information on the author and identify other texts by the same hand. The use of word frequencies is a well-known method. Its first step is to gather the top-k word frequencies over all known texts of each known author. An approximate algorithm is used for this, since classical exact top-k algorithms are inefficient and require the full list of distinct elements to be kept. The Filtered Space-Saving algorithm is used because it provides a fast and compact, albeit approximate, answer to the top-k problem.
After determining the top-k word frequencies, the fingerprint is created by applying a fuzzifying function to them. The set of features is fuzzified based on their rank in the top-k list rather than their raw frequency values. Finally, the same computations are performed on the text to be identified, and its fuzzy fingerprint is compared with all the available author fuzzy fingerprints. The closest fingerprint is chosen and the text is assigned to that author.
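The rank-based fuzzifying and fingerprint comparison described above can be sketched as follows. This is an illustrative Python sketch, not the exact method of [1]: the linear rank decay and the min-overlap similarity are assumptions.

```python
# Illustrative sketch of a rank-based fuzzy fingerprint.
# The linear membership decay and the min-overlap similarity are
# assumptions for illustration; [1] defines its own fuzzifying function.

def fuzzy_fingerprint(topk_words, k):
    """Map each of the top-k words to a membership degree that
    decays with its rank in the list (rank 0 -> 1.0, last -> 1/k)."""
    return {w: (k - i) / k for i, w in enumerate(topk_words)}

def fingerprint_similarity(fp_a, fp_b):
    """Overlap of two fingerprints: sum of the minimum memberships
    over the words they share."""
    return sum(min(mu, fp_b.get(w, 0.0)) for w, mu in fp_a.items())
```

An unknown text's fingerprint would be compared against every author's fingerprint, and the author with the highest similarity chosen.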
2.2. Efficient computation of frequent and top-k elements in data streams
An approximate integrated approach is used to solve two problems at once: determining the frequent elements and finding the k most frequent elements in a data stream [2]. It is space efficient and reports both frequent and top-k elements. The top-k algorithm returns k elements that have approximately the highest frequencies, and it uses limited space for calculating the frequent elements.
For this purpose a counter-based Space-Saving algorithm and its associated Stream-Summary data structure are used. The underlying idea is to maintain only partial information of interest: just the required m elements are monitored. The counters are updated in such a way that the frequencies of the significant elements are estimated accurately. A lightweight data structure keeps the elements sorted by their approximate frequencies.
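The counter-based Space-Saving idea can be sketched in a few lines of Python. This is an illustrative sketch only; the published algorithm also tracks per-counter error bounds via the Stream-Summary structure, which is omitted here.

```python
def space_saving(stream, m):
    """Counter-based Space-Saving sketch (after Metwally et al. [2]).

    Only m counters are kept. When an unmonitored element arrives and
    all counters are occupied, the element with the minimum count is
    evicted and the newcomer inherits its count, so counts are
    overestimates but heavy hitters are retained.
    """
    counters = {}  # element -> approximate count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < m:
            counters[item] = 1
        else:
            # Evict the minimum counter; inheriting its count is the
            # source of overestimation in this algorithm.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return sorted(counters.items(), key=lambda kv: -kv[1])
```

Because only m counters are stored regardless of stream length, the space cost is constant, which is exactly what makes the approach attractive for top-k word frequencies.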
2.3. Role of statistical text analysis in authorship attribution
In statistical analysis of literary texts, an objective methodology is applied to works that have long received only impressionistic treatment. In the subjective type of analysis, the literary style of the text is the basis of judgment; but literary style is not quantifiable, and quantifiability is an important norm for judgment. The subjective approach can rarely lead to a unique solution that scholars can all accept. Statistical quantitative methods provide objective grounds for judgment [3].
In the quantitative analysis approach, the style of an author is characterized numerically through careful analysis of the text. The sets of features that most accurately describe the author's style are then determined. Authorship attribution is one of the main applications of stylometry, which can be defined as the science of measuring literary style. Many previous studies assume that each author has an inborn style of writing peculiar to that specific author. An established literary scholar can capture the peculiarities of an author's style by impression. What statisticians contribute to this field is help in quantifying style, thereby turning a subjective method into an objective technique referred to as "non-traditional stylometry".
The characteristic style of an author can be determined using features such as word length, vocabulary richness, sentence length, and function words. Thomas Corwin Mendenhall was the first to undertake extensive work showing that simple statistical methods may help resolve questions of disputed authorship. He suggested they might also be used in comparative language studies, in tracing the growth of a language, in studying the growth of vocabulary from childhood to adulthood, and in other directions. Mendenhall proposed forming the relative frequency curve of the number of letters per word (word length), which he called the "word spectrum" or "characteristic curve", as a method of analysis leading to the identification or discrimination of authorship. He constructed word spectra for the works of two contemporary novelists, Charles Dickens and Thackeray, and a few other writers, to show that texts with the same average word length might possess different spectra. He assumed that every writer uses a vocabulary specific to himself, whose character persists over time.
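Mendenhall's characteristic curve is straightforward to compute. A minimal Python sketch (the rule for stripping surrounding punctuation is an assumption):

```python
from collections import Counter

def word_length_spectrum(text):
    """Relative frequency of word lengths: Mendenhall's
    "characteristic curve" as a dictionary length -> proportion.
    Surrounding punctuation is stripped so "mat," counts as length 3."""
    words = [w.strip('.,;:!?"\'()') for w in text.split()]
    words = [w for w in words if w]
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}
```

Comparing such spectra across texts is the discrimination method Mendenhall proposed: two texts with the same average word length can still have visibly different curves.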
2.4. Use of Authorship analysis in cybercrime investigation
Authorship analysis is the process of examining the characteristics of a piece of writing in order to draw conclusions about its authorship. More specifically, the problem can be broken down into three sub-fields [4]:
• Author identification determines the likelihood that a particular author wrote a piece of work by examining other works produced by the same author.
• Author characterization summarizes the characteristics of an author and generates an author profile from that person's work. Such characteristics include gender, culture, educational background, and language familiarity.
• Similarity detection compares several pieces of work and determines whether or not they were produced by a single author, without actually identifying that author.
In this mechanism, three types of message features are extracted: style markers, content-specific features, and structural features. After extraction, inductive learning algorithms are used to build feature-based models that identify the authorship of illicit messages. First, the feature extractor runs over the documents and generates a set of style features, which is used as input to the learning engine. The learning engine then produces a feature-based model, which can identify whether a newly found illicit document was written by a given suspect operating under different usernames or names. Three learning algorithms, or classifiers, were used for comparison: a back-propagation neural network, a decision tree, and a support vector machine.
2.5. The Combination of Text Classifiers
A more accurate classification procedure can be developed by combining the outputs of several classifiers [5]. Earlier studies of classifier combination were motivated mostly by the intuition that superimposed classifiers working in related but qualitatively different ways could leverage the distinct strengths of each method. Classifiers can be combined in different ways. In one method, a text classifier is created from multiple distinct classifiers by selecting the best classifier to use in each situation or context. Other procedures consider the inputs generated by the contributing classifiers. In yet another method, the scores generated by the contributing classifiers are taken as inputs to a combination function. Whichever approach is employed, the creation of enhanced classifiers from a set of text classifiers relies on understanding how the different classifiers perform in different informational contexts.
3. TECHNOLOGIES USED
The main processes involved here are feature extraction, classification, and identification. Certain programming platforms were used for this purpose.
3.1. Java
Data pre-processing and feature extraction are performed in Java, using the Java Development Kit (JDK). The JDK contains a Java compiler, a full copy of the Java Runtime Environment (JRE), and many other important development tools. The MySQL JDBC Driver and edu.mit.jwi_2.1.4 are the two main libraries used for pre-processing the data.
3.2. NetBeans IDE
NetBeans IDE is used to develop the Java application quickly and easily. It provides support for Java Development Kit 7. The NetBeans Integrated Development Environment runs on the Java SE Development Kit (JDK), which consists of the Java Runtime Environment plus developer tools for compiling, debugging, and running applications written in the Java language.
3.3. MATLAB
MATLAB is used for matrix manipulation, implementing algorithms, plotting functions and data, creating user interfaces, and so on. It can also interface with other programming languages such as C, C++, Java, and Fortran. Although MATLAB is intended primarily for numerical computing, an optional toolbox based on the MuPAD symbolic engine gives access to symbolic computing capabilities. Here MATLAB is used to classify the extracted features: the features extracted by the Java programs are written to a text file, which the MATLAB program then reads to train the classifiers and perform author identification.
4. SYSTEM ANALYSIS
4.1. Data Pre- processing
Data pre-processing is a very important step in authorship attribution. Text documents in their original form are not suitable for learning and must be converted into a suitable input format, typically a vector space, since most learning algorithms use the attribute-value representation. This step is crucial to the quality of the later stages, namely feature extraction and classification. Here data pre-processing involves tokenization and stemming.
4.1.1. Tokenization
Tokenization is the process of splitting a stream of text into meaningful elements called tokens, such as symbols, phrases, and words. The extracted tokens serve as input to further processing such as parsing and text mining. Tokenization is part of lexical analysis. In languages that use inter-word spaces the approach is fairly straightforward. It is particularly difficult for languages such as Chinese, which have no word boundaries, but it is easy in the case of English.
Usually, tokenization occurs at the word level, yet defining what is meant by a "word" is sometimes difficult, so a tokenizer often relies on simple heuristics, for example:
• Adjacent strings of alphabetic characters form one token; the same applies to numbers.
• Tokens may be separated by whitespace or punctuation characters, such as a space or a line break.
• The resulting list of tokens may or may not include punctuation and whitespace.
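Heuristics of this kind amount to a one-line regular-expression tokenizer. A Python sketch for illustration (the paper's Java implementation is not shown):

```python
import re

def tokenize(text):
    """Heuristic word-level tokenizer: runs of letters or of digits
    form one token, every other non-space character is its own
    token, and whitespace is discarded."""
    return re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)
```

This variant keeps punctuation as separate tokens, matching the choice in the experiments below of always including punctuation.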
4.1.2. Stemming
Stemming is the process of reducing inflected words to their root or base form, known as the stem. The stem need not be identical to the morphological root of the word; it is enough that related words map to the same stem, even if that stem is not a valid root. The programs used for stemming are usually referred to as stemming algorithms or stemmers. A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the same base form, the word "fish".
Here the WordNet stemmer is used. It adds functionality to the simple pattern-based stemmer (SimpleStemmer) by checking whether candidate stems are actually present in WordNet. Stems are returned only if valid candidates are found; otherwise the word is considered unknown, and the result is the same as that of the SimpleStemmer class. A WordNet dictionary is required to construct a WordnetStemmer.
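A minimal pattern-based stemmer in the spirit of SimpleStemmer can be sketched as below. The suffix list and the minimum stem length are assumptions, and the WordNet dictionary check that WordnetStemmer adds is omitted:

```python
def simple_stem(word):
    """Pattern-based stemmer in the spirit of SimpleStemmer.

    Strips the first matching suffix, keeping a stem of at least
    three letters. WordnetStemmer would additionally verify the
    candidate stem against the WordNet dictionary; that check is
    omitted here, so this sketch can over-stem.
    """
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

On the example from the text, all four inflections map to the same stem.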
4.2. Feature Extraction
The features and their extraction process depend strongly on the language of the text. These features capture the peculiarities of an author's writing and are extracted from the author's texts. Some of the important features extracted here are:
1. Number of periods.
2. Number of commas.
3. Number of question marks.
4. Number of colons.
5. Number of semi- colons.
6. Number of blanks.
7. Number of exclamation marks.
8. Number of dashes.
9. Number of underscores.
10. Number of brackets.
11. Number of quotations.
12. Number of slashes.
13. Number of words.
14. Number of sentences.
15. Number of characters.
16. Ratio of character count to sentence count.
17. Top-k word frequencies.
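The feature list above can be sketched as a single extraction function. This is illustrative Python rather than the paper's Java extractor, and the sentence-boundary rule (runs of ".", "!" or "?") is an assumption:

```python
import re
from collections import Counter

def extract_features(text, k=5):
    """Stylometric feature vector covering counts like those listed
    above (a subset, for brevity)."""
    features = {
        "periods": text.count("."),
        "commas": text.count(","),
        "questions": text.count("?"),
        "colons": text.count(":"),
        "semicolons": text.count(";"),
        "exclamations": text.count("!"),
        "blanks": text.count(" "),
        "words": len(re.findall(r"[A-Za-z]+", text)),
        # Assumed rule: a run of terminal punctuation ends a sentence.
        "sentences": max(1, len(re.findall(r"[.!?]+", text))),
        "characters": len(text),
    }
    features["chars_per_sentence"] = features["characters"] / features["sentences"]
    tokens = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    features["top_k_words"] = [w for w, _ in Counter(tokens).most_common(k)]
    return features
```

Each author's texts yield such vectors, which then feed the classifiers of the next section.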
4.3. Applying Classifiers
After the feature extraction process, the extracted features are used to classify the input text data. Two main classifiers are used: a fuzzy learning classifier and an SVM classifier. These two classifiers are then combined to form a new classifier.
4.3.1. Fuzzy learning classifier
Fuzzy classification is the process of grouping elements into a fuzzy set; that is, grouping individuals with the same characteristics into a fuzzy set. In this project the texts of different authors are grouped based on the characteristics of each author. In the fuzzy classification technique, a membership function μ indicates the degree to which an individual is a member of a class. Naturally, a class is a set defined by a specific property, and all objects having that property are elements of that class. The classification process evaluates a given set of objects and checks whether they satisfy the classification property; if so, the object is a member of the corresponding class.
Here the fuzzy classifier is used to classify the given input text and identify its author. Numerous texts of each author are taken and the features specific to each author are extracted, so each author has a set of features unique to him. Whenever the author of a new or unknown text is to be identified, its features are extracted and matched against the pool of author features, and the most likely author is assigned.
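One plausible realization of such a membership-based assignment is sketched below. The inverse-distance membership function is an assumption; the paper does not specify its μ:

```python
def fuzzy_classify(features, author_profiles):
    """Assign a text to the author with the highest membership degree.

    `author_profiles` maps each author to a reference feature vector.
    The membership mu = 1 / (1 + d), with d the Euclidean distance to
    the author's profile, is an assumed form in (0, 1]; the paper's
    classifier may use a different membership function.
    """
    memberships = {}
    for author, profile in author_profiles.items():
        d = sum((f - p) ** 2 for f, p in zip(features, profile)) ** 0.5
        memberships[author] = 1.0 / (1.0 + d)
    best = max(memberships, key=memberships.get)
    return best, memberships
```

Returning the whole membership dictionary, not just the winner, is what later allows the fuzzy output to be weighed against the SVM's in the combined classifier.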
4.3.2. Support Vector Machine classifier
The Support Vector Machine is an advanced supervised modelling technique for classifying both linear and nonlinear data. SVMs have become a default approach to classification problems because they are well suited to very high-dimensional spaces and very large datasets. In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns; they are used for classification and regression analysis. The basic SVM takes a set of input data and, for each input, predicts which of two possible classes it belongs to, which makes it a non-probabilistic binary linear classifier. Since a group of authors must be handled here, a multi-class SVM with an RBF kernel is used.
First the support vector machine is trained on training sets prepared from our authors and their texts. A multi-class SVM is used because there is a pool of authors, so more than one class is required. After training, the trained machine is used to classify, or predict the author of, new input texts. Various SVM kernel functions are available for obtaining satisfactory predictive accuracy; the kernel used here is the RBF kernel.
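The training step can be sketched with scikit-learn's SVC in place of the paper's MATLAB multi-SVM; this substitution is an assumption, but both use an RBF kernel, and SVC handles the multi-class case via one-vs-one decomposition:

```python
# Sketch only: scikit-learn's SVC stands in for the paper's MATLAB
# multi-SVM implementation.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_author_svm(X, y):
    """X: one stylometric feature vector per text; y: author labels.
    Features are standardized first, which the RBF kernel expects."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
    model.fit(X, y)
    return model
```

Prediction on a new text is then `model.predict([feature_vector])`, returning the most likely author label.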
4.3.3. Combined classifier
Combining the outputs of two classifiers yields a more accurate result than either single classifier. Here the two individual classifiers, the fuzzy classifier and the Support Vector Machine classifier, are combined to create a new combined classifier. Both classifiers are executed and their results are compared, which achieves better accuracy than either individual classifier's results.
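One plausible combination rule is sketched below. The paper only states that the individual results are compared; the confidence tie-break here is an assumption:

```python
def combined_classify(fuzzy_pred, fuzzy_conf, svm_pred, svm_conf):
    """Combine the two classifier outputs (one plausible scheme).

    When the classifiers agree, return the shared author; otherwise
    back the more confident classifier. Both confidences are assumed
    to be rescaled to [0, 1] so that they are comparable.
    """
    if fuzzy_pred == svm_pred:
        return fuzzy_pred
    return fuzzy_pred if fuzzy_conf > svm_conf else svm_pred
```

Agreement cases are unchanged from either classifier alone; the accuracy gain comes from resolving the disagreement cases in favour of the stronger signal.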
4.4. Author Identification
In the author identification step, as the name suggests, the author is identified: for a given input text, the name of its author is returned. The same steps performed in the initial stages to characterize the features of authors from their known texts are applied here. When an unknown or disputed input text is given, three main steps are performed: first data pre-processing, which involves tokenizing and stemming the input text; second, feature extraction; and third, classification. Any of the classifiers, namely fuzzy, SVM, or their combination, can be chosen to identify the author. The combined classifier gives the most accurate result, followed by the SVM, which performs better than the fuzzy classifier.
5. RESULTS AND DISCUSSION
The experiments used a set of writings by distinct authors, each containing many words. The texts are written in English. A set of random paragraphs from each author was taken and processed.
The set of texts underwent distinct tokenization techniques. The following tokenization methods were used:
• Words and punctuation
• Words, punctuation and additional stylometric features.
Punctuation significantly improves the identification rate, so it was always included. The additional stylometric features used were:
• Number of words
• Number of sentences
• Number of characters
Text classification for authorship attribution based on the fuzzy and SVM classifiers was performed, and their performance in terms of CPU time was measured. The following graph shows the time taken by each classifier.
Figure 1. CPU time comparison.
The text classification was performed using different texts drawn from each author's writings: 20 texts from each of 10 authors, for a total of 200 texts. Accuracy was determined as the percentage of correctly identified authors over all trials. The following table shows the accuracy of the different classifiers.
Table 1. Accuracy of the classifiers.

Sl. No   Classifier Type            Accuracy (%)
1        Fuzzy classifier           58
2        SVM classifier             70
3        Fuzzy + SVM classifier     76
Text classification for authorship attribution based on the fuzzy and SVM classifiers has been implemented, followed by a combined classifier of the two. The input is a randomly selected text by one of the authors; the output is the name of that author. In this work the SVM, at 70% accuracy, proved more accurate than the fuzzy technique, while the combined classifiers achieved an accuracy of 76%.
6. CONCLUSIONS
The proposed work successfully identifies the author of a given input text. Different texts of various authors are selected, tokenized, and stemmed. The frequency of each word in the stemmed text is then determined and the top-k elements are chosen. Other features, such as the numbers of characters, words, and sentences and their ratios, are also extracted, as are the counts of different punctuation marks and symbols. Analysis of these features confirms that each author has certain features peculiar to himself. These features were used to train the fuzzy and SVM classifiers, and the experiments showed that the SVM is more accurate than the fuzzy classifier, while the combined classifier is more accurate than either individual classifier.
ACKNOWLEDGEMENTS
We would like to thank our project guide Mr. M. Sudheep Elayidom, Associate Professor in the Division of Computer Science and Engineering, School of Engineering, Cochin University, for his utmost guidance in our project work.
REFERENCES
[1] Nuno Homem, Joao Paulo Carvalho (2011) "Authorship Identification and Author Fuzzy Fingerprints", 978-1-61284-968-3/11, IEEE.
[2] A. Metwally, D. Agrawal, A. El Abbadi (2005) "Efficient Computation of Frequent and Top-k Elements in Data Streams", Technical Report 2005-23, University of California, Santa Barbara, September.
[3] Rohangiz Modaber Dabagh (2007) "Authorship Attribution and Statistical Text Analysis", Metodološki zvezki, Vol. 4, No. 2, pp. 149-163.
[4] Rong Zheng, Yi Qin, Zan Huang, Hsinchun Chen (2003) "Authorship Analysis in Cybercrime Investigation", in H. Chen et al. (Eds.): ISI 2003, LNCS 2665, pp. 59-73, Springer-Verlag Berlin Heidelberg.
[5] Paul N. Bennett, Susan T. Dumais, Eric Horvitz (2002) "The Combination of Text Classifiers Using Reliability Indicators", in Proceedings of SIGIR 2002.
[6] Ilker Nadi Bozkurt, Ozgur Baghoglu, Erkan Uyar (2007) "Authorship Attribution: Performance of Various Features and Classification Methods".
[7] F. Sebastiani (2002) "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, pp. 1-47.
[8] Dinesh Kavuri, Pallikonda Anil Kumar, Doddapaneni Venkata Subba Rao (2012) "Text and Image Classification Using Fuzzy Similarity Based Self Constructing Algorithm", IJESAT, ISSN: 2250-3676, Vol. 2, No. 6, pp. 1572-1576.
Authors
M. Sudheep Elayidom
The author is an Associate Professor in the Computer Science and Engineering division of the School of Engineering, Cochin University of Science and Technology, Kerala, India. He received his PhD in computer science from Cochin University, and his master's degree in computer and information science, with a first rank, from the same university. He obtained his B.Tech, with a first rank, from M.G. University, Kerala, India. He has published many international papers in the domain of data mining and is active in research, guiding research scholars in the fields of big data, cloud databases, and so on.
Chinchu Jose
The author holds a B.Tech in Computer Science and Engineering from M.G. University, Kerala, India, and is now pursuing an M.Tech in Computer Science and Engineering at the same university.
Anitta Puthussery
The author holds a B.E. in Computer Science and Engineering from Anna University, Tamil Nadu, India, and is now pursuing an M.Tech in Computer Science and Engineering at M.G. University.
Neenu K Sasi
The author holds a B.Tech in Computer Science and Engineering from M.G. University, Kerala, India, and is now pursuing an M.Tech in Computer Science and Engineering at the same university.