Stemming is the process of reducing words to their root forms using a stemming algorithm. It is one of the main steps in text analytics: text data normally goes through stemming before further analysis. Text analytics is now a very common practice for analyzing the content of text data from sources such as the mass media and social media. In this study, two stemming techniques, Porter and Lancaster, are evaluated. The differences in their outputs are discussed in terms of stemming errors and the resulting visualizations. The findings show that Porter stemming performs better than Lancaster stemming by 43% in terms of stemming errors produced. Visualization can still be built on stemmed text data, but tool users need some background understanding of the text data to ensure the visualization outputs are interpreted correctly.
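A minimal sketch of the comparison described above using NLTK's stemmers; the word list is illustrative, not the study's actual corpus.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
words = ["analysis", "analytics", "media", "sources", "understanding"]

for w in words:
    # Lancaster is more aggressive, so it over-stems more often,
    # which is one source of the stemming errors compared here.
    print(f"{w:15} porter={porter.stem(w):12} lancaster={lancaster.stem(w)}")
```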
Detection and Privacy Preservation of Sensitive Attributes Using Hybrid Approach - rahulmonikasharma
Privacy Preserving Record Linkage (PPRL) is a major area of database research that involves linking multiple large heterogeneous data sets holding disjoint or complementary information about an entity while concealing that entity's private information. This paper gives an enhanced algorithm for merging two datasets using a Sorted Neighborhood deterministic approach, together with an improved preservation algorithm that automatically selects sensitive attributes and mines patterns over dynamic queries. The approach guarantees strong privacy, low computational complexity and scalability, and addresses legitimate concerns over data security and privacy.
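A minimal sketch of Sorted Neighborhood blocking, one building block of the merging step described above; the blocking key and window size are illustrative assumptions.

```python
def sorted_neighborhood(records, key, window=3):
    """Sort records by a blocking key, then compare only records that
    fall within a sliding window, instead of all O(n^2) pairs."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))  # candidate matches for linkage
    return pairs

people = [{"name": "smith, j"}, {"name": "smyth, j"}, {"name": "jones, a"}]
print(sorted_neighborhood(people, key=lambda r: r["name"][:3]))
```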
Document Retrieval System, a Case Study - IJERA Editor
In this work we propose a method for automatic indexing and retrieval that returns the documents most likely to be related to an input query. The technique used is singular value decomposition: a large term-by-document matrix is analyzed and decomposed into 100 factors, and each document is represented by a 100-element vector of factor weights. Queries, in turn, are represented as pseudo-document vectors formed from weighted combinations of term vectors.
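A minimal latent semantic indexing sketch in NumPy; a real system would keep ~100 factors of a large term-by-document matrix, while this toy example keeps 2.

```python
import numpy as np

# toy term-by-document matrix (rows = terms, columns = documents)
A = np.array([[2., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T      # rank-k factors

# fold a query in as a pseudo-document: q_hat = q^T U_k S_k^{-1}
q = np.array([1., 0., 1.])                     # weighted term vector
q_hat = q @ Uk @ np.diag(1.0 / sk)

# rank documents by cosine similarity in the factor space
docs = Vk * sk                                 # document factor weights
scores = docs @ q_hat / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(-scores))                     # most relevant document first
```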
An efficient approach to query reformulation in web search - eSAT Journals
Abstract: A wide range of problems in natural language processing, data mining, bioinformatics and information retrieval can be cast as string transformation: given an input string, the system generates the top-k output strings most closely related to it. This paper proposes a novel probabilistic method for string transformation that is both accurate and efficient. The approach uses a log-linear model, a method for training the model, and an algorithm that generates the top-k candidates. The log-linear model is defined as a conditional probability distribution over an output string and a set of transformation rules, conditioned on the input string. A pruning-based string generation algorithm guarantees that the exact top-k list is produced. The proposed technique is applied to spelling error correction of queries and to query reformulation in web search; previous work did not address both tasks together, nor did it focus on improving accuracy and efficiency jointly in string transformation. Experimental results on large-scale data show that the proposed method is highly accurate and efficient. Keywords: log-linear model, query reformulation, spelling error correction.
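A heavily simplified sketch of the top-k idea: candidates are generated by substitution rules and scored additively, as a stand-in for the log-linear model; the rules and weights are made up for illustration, and the real method also normalizes into a conditional probability and prunes during generation.

```python
import heapq

rules = {("ie", "ei"): 1.2, ("teh", "the"): 2.0}   # toy rule weights

def candidates(s):
    yield s, 0.0                                   # identity transformation
    for (src, dst), w in rules.items():
        if src in s:
            yield s.replace(src, dst, 1), w        # apply one rule, add weight

def top_k(s, k=2):
    # log-linear flavour: score = sum of weights of applied rules
    return heapq.nlargest(k, candidates(s), key=lambda c: c[1])

print(top_k("teh recieve"))  # [('the recieve', 2.0), ('teh receive', 1.2)]
```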
Measures of semantic similarity, and semantic relatedness in particular, are very important at present because of the huge demand for natural-language-processing-based applications such as chatbots and information retrieval systems, for example knowledge-base-driven FAQ systems. Current approaches generally use similarity measures that ignore the context-sensitive relationships between words. This leads to erroneous similarity predictions and limits their usefulness in real-life applications. This work proposes a novel approach that gives an accurate relatedness measure for any two words in a sentence by taking their context into consideration. This context correction yields more accurate similarity predictions, and in turn higher accuracy for information retrieval systems.
International Journal of Computational Engineering Research (IJCER) - ijceronline
The International Journal of Computational Engineering Research (IJCER) is an international online journal published monthly in English. The journal publishes original research work that contributes significantly to scientific knowledge in engineering and technology.
Conceptual similarity measurement algorithm for domain specific ontology - Zac Darcy
This paper presents a similarity measurement algorithm for domain-specific terms collected in an ontology-based data integration system. The algorithm can be used in the ontology mapping and query services of such a system; in this paper we focus on applying it to the web query service. Concept similarity is important for a web query service because the words in a user's query do not coincide exactly with the concepts in the ontology, so we need to extract the concepts that match or relate to the input words with the help of the machine-readable dictionary WordNet. For words whose similarity cannot be confirmed by WordNet, we use generated mapping rules in the query generation procedure. We demonstrate the effectiveness of the algorithm on two-degree semantic web mining results by generating concept results from the input query.
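A minimal sketch of using WordNet as the machine-readable dictionary to relate a query word to ontology concepts, in the spirit described above; the ontology concepts are illustrative.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def best_similarity(query_word, concept):
    """Maximum path similarity over all sense pairs of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(query_word)
              for s2 in wn.synsets(concept)]
    return max(scores, default=0.0)

ontology_concepts = ["vehicle", "person", "building"]   # illustrative ontology
query = "car"
ranked = sorted(ontology_concepts,
                key=lambda c: best_similarity(query, c), reverse=True)
print(ranked[0])   # 'vehicle' relates most closely to 'car'
```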
A legal text is usually long and complicated and has characteristics that distinguish it from other everyday texts, so translating a legal text is generally considered difficult. This paper introduces an approach that splits a legal sentence based on its logical structure and selects appropriate translation rules to improve phrase reordering in legal translation. We segment an English legal sentence by dividing it according to its logical structure, and we propose new features with rich linguistic and contextual information about the split sentences for rule selection. We apply maximum entropy to combine this linguistic and contextual information and integrate the features of split sentences into a tree-based SMT system for legal translation. We obtain performance improvements for English-to-Japanese legal translation over the Moses and Moses-chart systems.
In recent years digital data has grown dramatically, and knowledge discovery and data mining have attracted immense attention given the need to turn such data into useful information and knowledge. Keyword extraction is an essential task in natural language processing (NLP) that maps documents to a concise set of representative single- and multi-word phrases. This paper investigates the use of Word2Vec and a Decision Tree for keyword extraction from textual documents, with the SemEval 2010 dataset as the main input. After pre-processing, the words are represented as vectors with the Word2Vec technique. The method is based on word similarity between candidate keywords, computed both from the keywords collected for each label and from one sample of the same label. An appropriate threshold is determined, and the percentages that exceed it are passed to the Decision Tree so that an appropriate classification can be assigned to the text document.
Several similarity measures were used in the classification process, and the efficiency and accuracy of the algorithm were measured using precision, recall and F-score. The results indicate that a vector representation for each keyword is an effective way to identify the most similar words, which increases the chance of classifying a document correctly. With Word2Vec CBOW, the F-score was 64% using the Gini method and the WordNet lemmatizer; with Word2Vec SG (skip-gram), the F-score reached 82% using the Gini index and English Porter stemming, the highest of all our experiments.
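A minimal sketch of the pipeline described above, with a toy corpus instead of SemEval 2010: skip-gram Word2Vec vectors feed a Decision Tree via word-similarity features. The reference words, labels and dimensions are illustrative assumptions.

```python
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics.pairwise import cosine_similarity

sentences = [["neural", "networks", "learn", "representations"],
             ["keyword", "extraction", "maps", "documents", "to", "phrases"]]
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, seed=0)

# features: similarity of each candidate word to a per-label reference word
candidates = ["networks", "extraction", "documents", "learn"]
labels = [0, 1, 1, 0]                        # toy labels: 0 = ML, 1 = IR
ref = {0: "neural", 1: "keyword"}
X = [[cosine_similarity([w2v.wv[w]], [w2v.wv[ref[c]]])[0, 0] for c in ref]
     for w in candidates]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, labels)
print(clf.predict(X))                        # reproduces the training labels
```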
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION - cscpconf
The internet has caused humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document: they reflect the document's content and act as indexes for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) with TF (Term Frequency) to extract keywords and later use these for summarization. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences requested by the user, a summary is generated.
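A minimal sketch of TF-IDF keyword scoring, one component of the TF + IDF + GSS combination described above; the documents are illustrative English stand-ins for the pre-categorized Kannada corpus.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cricket match score wicket batsman",
        "election vote parliament minister",
        "rain temperature monsoon forecast"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                   # rows = docs, cols = terms
terms = np.array(vec.get_feature_names_out())

for i, doc in enumerate(docs):
    row = X[i].toarray().ravel()
    top = terms[np.argsort(-row)[:2]]         # top-2 keywords per document
    print(f"doc {i}: {', '.join(top)}")
```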
Performance analysis on secured data method in natural language steganography - journalBEEI
The rapid growth in exchanged information caused by the expansion of the internet during the last decade has motivated research in this field, and steganography approaches have recently received considerable attention. The aim of this paper is to review different performance metrics covering decoding, decrypting and extracting. Data decoding interprets the received hidden message into a code word; data encryption is the best way to provide secure communication; decrypting takes an encrypted text and converts it back into the original text; and data extracting is the reverse of the data embedding process. The effectiveness evaluation is mainly determined by the performance metrics used, and researchers aim to improve these metric characteristics. The objective of this paper is to present a review of natural language steganography against these performance analysis criteria. The findings clarify the preferred performance metric aspects, and the review is intended to help future research evaluate the performance of natural language steganography in general and of the proposed secured data method in particular.
Much previous research has shown that rhetorical relations can enhance applications such as text summarization, question answering and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploit the rhetorical relations that exist between sentences to group similar sentences into clusters that identify themes of common information, from which candidate summary sentences are extracted. Cluster-based summarization is then performed using a Conditional Markov Random Walk model to measure the saliency scores of the candidates. We evaluate the method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that the method performs well, indicating the promising potential of applying rhetorical relations to text clustering for multi-document summarization.
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES - ijnlc
A large amount of information lies dormant in historical documents and manuscripts, and it would be wasted if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text using optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word spotting technique based on codes for matching the word images of Devanagari script: shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
This work concerns natural language text summarization by machine. It is very difficult for a human being to manually summarize large amounts of text, and the greatest challenge for text summarization is to condense content from many textual and semi-structured sources, including plain text, HTML pages and PDF files; the resulting summaries can be assessed with internal and external measures. Our proposed work, sentence-similarity-based text summarization using clusters, helps in finding subjective questions and answers on the internet by providing short units of text that carry similar information. Extractive summarization chooses a subset of similar sentences from the text. The proposed work uses parts of speech, namely proper nouns, verbs and pronouns such as he, she and they; with their help we find important sentences using statistical methods based on proper nouns and sentence similarity, selecting the salient sentences from internet text.
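A minimal sketch of similarity-based extractive summarization: sentences are embedded with TF-IDF, scored by their average similarity to the other sentences, and the top scorers form the summary. The scoring rule and data are illustrative and differ in detail from the work above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The cat sat on the mat.",
             "A cat was sitting on a mat.",
             "Stock prices rose sharply today.",
             "Markets rallied as stock prices climbed."]

X = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(X)
scores = (sim.sum(axis=1) - 1) / (len(sentences) - 1)  # drop self-similarity

top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:2]
print([sentences[i] for i in sorted(top)])             # 2-sentence summary
```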
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm - journalBEEI
Indonesia ranks eighth in the world, relative to country population, as a source of global spam. A web-based spam filter service with a REST API can be used to detect Indonesian-language email spam on an email server or in various email server applications; with a REST API, the applications exchange JSON data using standard HTTP commands. One commonly used type of spam filter is Bayesian filtering, where the Naïve Bayes algorithm serves as the classification algorithm; in this study, the N-gram method is used to increase the accuracy of the Naïve Bayes implementation. The N-gram and Naïve Bayes algorithms for detecting Indonesian-language spam email were successfully implemented with accuracy ranging from 0.615 to 0.94, precision from 0.566 to 0.924, recall from 0.96 to 1.00, and F-measure from 0.721 to 0.942. The best solution uses the 5-gram method, with the highest accuracy of 0.94, precision of 0.924, recall of 0.96, and F-measure of 0.942.
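A minimal sketch of a 5-gram Naïve Bayes spam filter in scikit-learn, in the spirit of the study above; character 5-grams and the toy Indonesian messages are illustrative assumptions.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["selamat anda memenangkan hadiah besar klik link ini",
         "rapat tim dijadwalkan ulang besok pagi",
         "dapatkan pinjaman cepat tanpa jaminan hubungi kami",
         "laporan bulanan sudah saya kirim lewat email"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = ham

model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(5, 5)),  # 5-grams
    MultinomialNB())
model.fit(texts, labels)
print(model.predict(["klik link ini untuk hadiah"]))  # should flag as spam
```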
A NEW TOP-K CONDITIONAL XML PREFERENCE QUERIES - ijaia
Preference querying technology is a very important issue in a variety of applications ranging from e-commerce to personalized search engines, and much recent research has been dedicated to this topic in the Artificial Intelligence and Database fields. Several formalisms allowing preference reasoning and specification have been proposed in the Artificial Intelligence domain. In the Database field, on the other hand, interest has focused mainly on extending standard Structured Query Language (SQL), and also eXtensible Markup Language (XML), with preference facilities in order to provide personalized query answering. More precisely, interest in the database context centres on the notion of Top-k preference queries and on the development of efficient methods for evaluating them: a Top-k preference query returns the k data tuples that are most preferred according to the user's preferences. Of course, Top-k preference query answering depends closely on the particular preference model underlying the semantics of the operators responsible for selecting the best tuples. In this paper, we consider Conditional Preference queries (CP-queries), where preferences are specified by a set of rules expressed in a logical formalism. We introduce Top-k conditional preference queries (Top-k CP-queries) and present the operators BestK-Match and Best-Match for evaluating them.
Analysis of Opinionated Text for Opinion Mining - mlaij
In sentiment analysis, the polarities of the opinions expressed on an object or feature are determined to assess whether the sentiment of a sentence or document is positive, negative or neutral. Naturally, the object or feature is a noun that refers to a product or a component of a product, say the "lens" of a camera, and the opinions about it are carried by adjectives, verbs, adverbs and nouns themselves. Beyond such words, other meta-information and diverse effective features also play an important role in influencing sentiment polarity and contribute significantly to system performance. In this paper, some of this associated information and meta-data is explored and investigated in sentiment text. The analysis results presented here show scope for further assessment and use of meta-information as features in text categorization, ranking of text documents, identification of spam documents and polarity classification.
Performance Evaluation of Query Processing Techniques in Information Retrieval - idescitation
The first element of the search process is the query. Because the average user query is restricted to two or three keywords, it is ambiguous to the search engine. Given the user query, the goal of an Information Retrieval (IR) system is to retrieve information that might be useful or relevant to the user's information need, so query processing plays an important role in an IR system. Query processing can be divided into four categories: query expansion, query optimization, query classification and query parsing. In this paper an attempt is made to evaluate the performance of query processing algorithms in each category. The evaluation was based on a dataset specified by the Forum for Information Retrieval [FIRE15], with precision and relative recall as the evaluation criteria; the analysis rests on the importance of each step in query processing. The experimental results show the significance of each step in query processing, as well as the relevance of web semantics and spelling correction in the user query.
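A minimal sketch of the two evaluation criteria named above. Relative recall compares one system's relevant results against the pool of relevant results retrieved by all compared systems; the result sets here are illustrative.

```python
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def relative_recall(retrieved_relevant, relevant_pool):
    # recall relative to all relevant documents found by any system
    return len(retrieved_relevant) / len(relevant_pool)

system_a = {"d1", "d2", "d3", "d7"}
system_b = {"d1", "d4", "d9"}
relevant = {"d1", "d2", "d4"}                     # judged relevant
pool = (system_a | system_b) & relevant           # relevant found by anyone

print("precision A:", precision(system_a, relevant))            # 0.5
print("relative recall A:",
      relative_recall(system_a & relevant, pool))               # 2/3
```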
Application Of Extreme Value Theory To Bursts Prediction - CSCJournals
Bursts and extreme events in quantities such as connection durations, file sizes and throughput may produce undesirable consequences in computer networks, a major one being deterioration in the quality of service. Predicting these extreme events and bursts is therefore important: it helps in reserving the right resources for a better quality of service. We applied extreme value theory (EVT) to predict bursts in network traffic, taking a deeper look at its application through EVT-based exploratory data analysis. We found that traffic naturally divides into two categories, internal and external. Internal traffic follows a generalized extreme value (GEV) model with a negative shape parameter, which is the Weibull distribution; external traffic follows a GEV with a positive shape parameter, which is the Fréchet distribution. These findings are of great value for quality of service in data networks, especially when included in service level agreements as traffic descriptor parameters.
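A minimal sketch of fitting a GEV model to block maxima of a traffic quantity with SciPy; the data are simulated, not real measurements.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)
throughput = rng.exponential(scale=10.0, size=(200, 50))   # 200 blocks
block_maxima = throughput.max(axis=1)                      # maxima per block

shape, loc, scale = genextreme.fit(block_maxima)
# Note: SciPy's shape c is the negative of the usual xi, so Weibull-type
# (bounded tail) is c > 0 and Frechet-type (heavy tail) is c < 0 here.
print(f"shape={shape:.3f}, loc={loc:.2f}, scale={scale:.2f}")
print("99th percentile burst estimate:",
      genextreme.ppf(0.99, shape, loc=loc, scale=scale))
```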
Supreme court dialogue classification using machine learning models - IJECEIAES
Legal classification models help lawyers identify the relevant documents required for a study. This study focuses on sentence-level classification, specifically of conversations in the Supreme Court between the justices and other participants. Both a naïve Bayes classifier and logistic regression are used to classify the conversations at the sentence level, with performance measured by the area-under-the-curve (AUC) score. The study found that a model trained on a specific case yielded better results than a model trained on a larger number of conversations: case specificity proved crucial to getting better results from the classifier.
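A minimal sketch of sentence-level classification with logistic regression and AUC scoring, as in the study; the utterances and labels are invented stand-ins for the court transcripts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

utterances = ["counsel, please address the jurisdictional question",
              "we believe the statute clearly applies here",
              "does the record support that factual claim",
              "our position is that the lower court erred"] * 10
labels = [1, 0, 1, 0] * 10                # 1 = justice, 0 = correspondent

X_tr, X_te, y_tr, y_te = train_test_split(
    utterances, labels, test_size=0.25, stratify=labels, random_state=0)
model = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]   # probability of the "justice" class
print("AUC:", roc_auc_score(y_te, probs))
```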
Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important area within data mining is text mining, which extracts high-quality information from text using statistical pattern learning. "High quality" in text mining refers to a combination of relevance, novelty and interestingness. Text mining tasks include text categorization, text clustering, entity extraction and sentiment analysis, and natural language processing and analytical methods are commonly applied to turn text into data for analysis. This survey covers the various techniques and algorithms used in text mining.
Seeds Affinity Propagation Based on Text Clustering - IJRES Journal
The objective is to find, among all partitions of the data set, the best partition according to some quality measure. Affinity propagation is a low-error, high-speed, flexible and remarkably simple clustering algorithm that may be used for forming teams of participants in business simulations and experiential exercises, and for organizing participants' preferences for the parameters of simulations. This paper proposes an efficient affinity propagation algorithm that guarantees the same clustering result as the original algorithm after convergence. The heart of our approach is (1) to prune unnecessary message exchanges during the iterations and (2) to compute the convergence values of the pruned messages after the iterations to determine the clusters.
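A minimal sketch of clustering with scikit-learn's affinity propagation, which also chooses the number of clusters itself; the points are toy data.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1],     # one dense group
              [8, 8], [8.1, 7.9], [7.9, 8.2]])    # another dense group

ap = AffinityPropagation(random_state=0).fit(X)
print("exemplars:", ap.cluster_centers_indices_)  # messages converge on these
print("labels:   ", ap.labels_)                   # cluster id per point
```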
An Approach To Automatic Text Summarization Using Simplified Lesk Algorithm A... - ijctcm
Text summarization is a way to produce a text that contains the significant portion of the information in the original text(s). The methodologies developed to date depend on several parameters to find the summary: the position, format and type of the sentences in the input text, the formats of different words, the frequency of a particular word in the text, and so on. But these parameters vary with the language and the input source, and the performance of such algorithms is greatly affected as a result. The proposed approach summarizes a text without depending on those parameters. Here, the relevance of the sentences within the text is derived using the Simplified Lesk algorithm and WordNet, an online dictionary. The approach is independent of the format of the text and the position of a sentence within it, and because the sentences are first ranked according to their relevance, the percentage of summarization can be varied according to need. The proposed approach gives around 80% accurate results at 50% summarization of the original text with respect to manually produced summaries, evaluated on 50 texts of different types and lengths. We achieved satisfactory results even down to 25% summarization of the original text.
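A minimal sketch of Lesk-based sense disambiguation with NLTK, the WordNet building block the summarizer above relies on for sentence relevance.

```python
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
# requires: nltk.download('punkt') and nltk.download('wordnet')

sentence = "I went to the bank to deposit my money"
tokens = word_tokenize(sentence)

sense = lesk(tokens, "bank")            # pick the sense whose gloss
print(sense, "-", sense.definition())   # overlaps the context the most
```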
A survey of Stemming Algorithms for Information Retrieval - iosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews and high-quality technical notes are invited for publication.
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING - kevig
Applying natural-language-processing-related algorithms is currently popular in legal applications, for instance document classification of legal documents, contract review and machine translation. All of these machine learning algorithms need to encode the words in a document as vectors. The word embedding model is a modern distributed word representation approach and the most common unsupervised word encoding method; it feeds other algorithms that then perform the downstream tasks of natural language processing. The most common and practical approach to evaluating the accuracy of a word embedding model uses a benchmark set built from linguistic rules or relationships between words to perform analogical reasoning via algebraic calculation. This paper proposes establishing a 1,256-question Legal Analogical Reasoning Questions Set (LARQS) from a 2,388-document Chinese Codex corpus using five kinds of legal relations, which are then used to evaluate the accuracy of Chinese word embedding models. Moreover, we discovered that legal relations may be ubiquitous in word embedding models.
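A minimal sketch of analogy-based evaluation of a word embedding model via vector algebra, the mechanism LARQS uses; the vector file name, the general-domain analogy, and the one-question benchmark are illustrative placeholders, not the actual LARQS data.

```python
from gensim.models import KeyedVectors

# any word2vec-format vector file works here; the path is an assumption
kv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# analogy: a is to b as c is to ?  ->  argmax cosine(b - a + c)
result = kv.most_similar(positive=["woman", "king"], negative=["man"], topn=1)
print(result)   # expected near ('queen', ...) for general-domain vectors

# accuracy over a question set: count questions whose top-1 answer matches
questions = [("man", "king", "woman", "queen")]   # placeholder benchmark
correct = sum(kv.most_similar(positive=[b, c], negative=[a], topn=1)[0][0] == d
              for a, b, c, d in questions)
print(correct / len(questions))
```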
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS - gerogepatton
Document similarity is an important part of natural language processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field. This report sets out to examine the numerous document similarity algorithms and determine which are the most useful. It addresses this by grouping them into three types: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are then compared in our work using a series of benchmark datasets and evaluations that test every area in which each algorithm could be used.
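A minimal sketch of one statistical baseline from the comparison above: TF-IDF vectors with cosine similarity between documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["The quick brown fox jumps over the lazy dog.",
        "A fast brown fox leaps over a sleepy dog.",
        "Quarterly revenue grew across all segments."]

X = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(X)
print(sim.round(2))   # docs 0 and 1 score far higher than either does with 2
```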
Square transposition: an approach to the transposition process in block cipher - journalBEEI
A transposition process is needed in cryptography to create a diffusion effect in the data encryption standard (DES) and advanced encryption standard (AES) algorithms, the standard information security algorithms of the National Institute of Standards and Technology. The problem with the DES and AES algorithms is that their transposition index values form patterns rather than random values. This condition makes it easier for a cryptanalyst to look for relationships between ciphertexts, because some processes are predictable. This research designs a transposition algorithm called square transposition. Each process uses an 8 × 8 square into which 64 bits are inserted and from which they are retrieved. Pairing an insertion scheme and a retrieval scheme with unequal flow is an important factor in producing a good transposition. Square transposition can generate random, pattern-free indices, so transposition can be done better than in DES and AES.
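A minimal sketch of the square-transposition mechanism: write 64 bits into an 8 × 8 square row by row, read them back column by column. The actual algorithm pairs less regular insertion and retrieval schemes; this shows only the basic write/read idea.

```python
def square_transpose(bits64):
    assert len(bits64) == 64
    square = [bits64[r * 8:(r + 1) * 8] for r in range(8)]      # write rows
    return [square[r][c] for c in range(8) for r in range(8)]   # read columns

block = list(range(64))                   # stand-in for a 64-bit block
scrambled = square_transpose(block)
restored = square_transpose(scrambled)    # transposing twice is the identity
print(restored == block)                  # True
```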
Hyper-parameter optimization of convolutional neural network based on particle swarm optimization - journalBEEI
Deep neural networks have made enormous progress in tackling many problems. In particular, the convolutional neural network (CNN) is a category of deep network that has become a dominant technique in computer vision tasks. Although these deep neural networks are highly effective, the ideal structure is still an issue that needs much investigation: deep CNN models are usually designed manually through trial and repeated testing, which enormously constrains their application. Many hyper-parameters of a CNN affect model performance, among them the depth of the network, the number of convolutional layers, and the number and sizes of kernels. It is therefore a huge challenge to design an appropriate CNN model that uses optimized hyper-parameters while reducing reliance on manual involvement and domain expertise. In this paper, an architecture design method for CNNs is proposed that uses the particle swarm optimization (PSO) algorithm to learn optimal CNN hyper-parameter values. In the experiments, we used the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits. The experiments showed that our proposed approach can find an architecture that is competitive with state-of-the-art models, with a testing error of 0.87%.
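A minimal PSO sketch over two CNN hyper-parameters (number of filters, kernel size); the "validation error" here is a made-up surrogate function, since the paper actually trains a CNN per particle.

```python
import numpy as np

rng = np.random.default_rng(0)

def val_error(p):                    # surrogate for CNN validation error
    filters, kernel = p
    return (filters - 64) ** 2 / 1000 + (kernel - 3) ** 2

n, dims, iters = 10, 2, 30
pos = rng.uniform([8, 1], [128, 7], size=(n, dims))   # particle positions
vel = np.zeros((n, dims))
pbest = pos.copy()                                    # personal bests
pbest_err = np.array([val_error(p) for p in pos])
gbest = pbest[pbest_err.argmin()].copy()              # global best

for _ in range(iters):
    r1, r2 = rng.random((n, dims)), rng.random((n, dims))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos += vel
    err = np.array([val_error(p) for p in pos])
    improved = err < pbest_err
    pbest[improved], pbest_err[improved] = pos[improved], err[improved]
    gbest = pbest[pbest_err.argmin()].copy()

print("best filters/kernel:", np.round(gbest))   # converges near [64, 3]
```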
Supervised machine learning based liver disease prediction approach with LASSO feature selection - journalBEEI
In this contemporary era, machine learning techniques are used increasingly in medical science for detecting diseases such as liver disease (LD), from which a large number of people die around the globe. Diagnosing the disease at a primary stage allows early treatment that can help cure the patient. In this research paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also applied a least absolute shrinkage and selection operator (LASSO) feature selection technique to the dataset to identify the attributes most highly correlated with LD. The predictions made by the algorithms with 10-fold cross-validation (CV) are evaluated in terms of accuracy, sensitivity, precision and F1-score. The decision tree algorithm gives the best performance, with accuracy, precision, sensitivity and F1-score of 94.295%, 92%, 99% and 96% respectively when LASSO is included. Furthermore, a comparison with recent studies demonstrates the significance of the proposed system.
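A minimal sketch of LASSO-based feature selection feeding a decision tree with 10-fold cross-validation, mirroring the pipeline above; synthetic data stands in for the clinical dataset.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

model = make_pipeline(
    SelectFromModel(LassoCV()),          # keep features with nonzero weights
    DecisionTreeClassifier(random_state=0))
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f}")
```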
A secure and energy saving protocol for wireless sensor networksjournalBEEI
Research on wireless sensor networks (WSN) has been extensively conducted due to innovative technologies and research directions that address the usability of WSN under various schemes. This domain permits dependable tracking of a diversity of environments for both military and civil applications. The key management mechanism is a primary protocol for keeping the privacy and confidentiality of the data transmitted among different sensor nodes in WSNs. Since nodes are small, they are intrinsically limited by inadequate resources such as battery life-time and memory capacity. The proposed secure and energy saving protocol (SESP) for wireless sensor networks has a significant impact on the overall network life-time and energy dissipation. To encrypt sent messages, the SESP uses the concept of public-key cryptography. It depends on sensor nodes' identities (IDs) to prevent repeated messages, allowing the security goals of authentication, confidentiality, integrity, availability, and freshness to be achieved. Finally, simulation results show that the proposed approach produced better energy consumption and network life-time compared to the LEACH protocol: sensors are dead after 900 rounds in the proposed SESP protocol, while in the low-energy adaptive clustering hierarchy (LEACH) scheme the sensors are dead after 750 rounds.
Plant leaf identification system using convolutional neural networkjournalBEEI
This paper proposes a leaf identification system using a convolutional neural network (CNN). The proposed system can identify five types of local Malaysian leaves: acacia, papaya, cherry, mango, and rambutan. Using CNN-based deep learning, the network is trained for image classification on a database of leaf images captured by mobile phone. ResNet-50 was the architecture used for neural network image classification and for training the network for leaf identification. The recognition of leaf photographs requires several steps, starting with image pre-processing, feature extraction, plant identification, matching and testing, and finally extracting the results in MATLAB. The testing sets of the system consist of three types of images: white background, noise added, and random background. Finally, an interface for the leaf identification system was developed as the end software product using MATLAB App Designer. As a result, the accuracy achieved for each training set on the five leaf classes was above 98%, so the recognition process was successfully implemented.
Customized moodle-based learning management system for socially disadvantaged...journalBEEI
This study aims to develop a Moodle-based LMS with customized learning content and a modified user interface to facilitate pedagogical processes during the covid-19 pandemic, and to investigate how teachers of socially disadvantaged schools perceived its usability and technology acceptance. A co-design process was conducted with two activities: 1) a needs assessment phase using an online survey and interview sessions with the teachers, and 2) the development phase of the LMS. The system was evaluated by 30 teachers from socially disadvantaged schools for relevance to their distance learning activities. We employed the computer software usability questionnaire (CSUQ) to measure perceived usability and the technology acceptance model (TAM) with the insertion of 3 original variables (i.e., perceived usefulness, perceived ease of use, and intention to use) and 5 external variables (i.e., attitude toward the system, perceived interaction, self-efficacy, user interface design, and course design). The average CSUQ rating exceeded 5.0 on a 7-point scale, indicating that teachers agreed that the information quality, interaction quality, and user interface quality were clear and easy to understand. The TAM results concluded that the LMS design was judged to be usable, interactive, and well-developed. Teachers reported an effective user interface that allows effective teaching operations and leads to prompt adoption of the system.
Understanding the role of individual learner in adaptive and personalized e-l...journalBEEI
The dynamic learning environment has emerged as a powerful platform in modern e-learning systems. The constantly changing learning situation has forced learning platforms to adapt and personalize their learning resources for students. Evidence suggests that adaptation and personalization of e-learning systems (APLS) can be achieved by utilizing learner modeling, domain modeling, and instructional modeling. In the APLS literature, questions have been raised about the role of individual characteristics that are relevant for adaptation. With several options, a new problem arises where the attributes of students in APLS often overlap and are not related across studies. Therefore, this study proposes a list of learner model attributes in dynamic learning to support adaptation and personalization. The study was conducted by exploring concepts from literature selected based on defined criteria. We then describe the important concepts in student modeling and provide definitions and examples of data values that researchers have used. In addition, we discuss the implementation of the selected learner model in providing adaptation in dynamic learning.
Prototype mobile contactless transaction system in traditional markets to sup...journalBEEI
One way to prevent and reduce the spread of the covid-19 pandemic is through a physical distancing program. This research aims to develop a prototype contactless transaction system using digital payment mechanisms and QR code technology to be applied in traditional markets. The method used in the development of the electronic market system is a prototyping approach. The application of QR codes and digital payments is used as a solution to minimize the money exchange contacts that are common in traditional markets. The results showed that the system built was able to accelerate and facilitate the buying and selling transaction process in the traditional market environment. Alpha testing shows that all system functions run well. Meanwhile, beta testing shows that users can very well accept the system that was built. The results of the study also show acceptance of the usefulness of the system, as well as the optimism of its users about taking advantage of it both technologically and functionally, so it can be part of the digital transformation of the traditional market to the electronic market and one of the solutions for reducing the spread of the current covid-19 pandemic.
Wireless HART stack using multiprocessor technique with laxity algorithmjournalBEEI
A real-time operating system (RTOS) is required for the demarcation of industrial wireless sensor network (IWSN) stacks. In the industrial world, a vast number of sensors are utilised to gather various types of data. The data gathered by the sensors cannot be prioritised ahead of time, because all of the information is equally essential. As a result, a protocol stack is employed to guarantee that data is acquired and processed fairly. In IWSN, the protocol stack is implemented using an RTOS. The data collected from IWSN sensor nodes is processed using non-preemptive scheduling and the protocol stack, and then sent in parallel to the IWSN's central controller. The RTOS mediates between hardware and software. Packets must be sent at a certain time, and some packets may collide during transmission; this project is undertaken to get around such collisions. As a prototype, the project is divided into two parts. The first uses an RTOS and the LPC2148 as a master node, while the second serves as a standard data collection node to which sensors are attached. Any controller may be used in the second part, depending on the situation. WirelessHART allows the two nodes to communicate with each other.
Implementation of double-layer loaded on octagon microstrip yagi antennajournalBEEI
A double layer loaded on the octagon microstrip yagi antenna (OMYA) at the 5.8 GHz industrial, scientific and medical (ISM) band is investigated in this paper. The double layer consists of two double-positive (DPS) substrates. The OMYA overlaid with the double-layer configuration was simulated, fabricated, and measured. A good agreement was observed between the computed and measured gain results for this antenna. The comparison results show that a 2.5 dB improvement of the OMYA gain can be obtained by applying the double layer on top of the OMYA. Meanwhile, the bandwidth of the measured OMYA with the double layer is 14.6%. This indicates that the double layer can be used to increase the OMYA performance in terms of gain and bandwidth.
The calculation of the field of an antenna located near the human headjournalBEEI
In this work, a numerical calculation was carried out in one of the universal programs for automatic electrodynamic design. The calculation is aimed at obtaining numerical values of the specific absorption rate (SAR). It is the SAR value that can be used to determine the effect of the antenna of a wireless device on biological objects; the dipole parameters were selected for GSM1800. Investigation of the influence of the distance to a cell phone on radiation absorbed in a person's head shows that the effect of electromagnetic radiation on the brain decreases by almost three times. This is a very important result: the SAR value decreased by almost three times, which is an acceptable result.
Exact secure outage probability performance of uplinkdownlink multiple access...journalBEEI
In this paper, we study uplink-downlink non-orthogonal multiple access (NOMA) systems by considering the secure performance at the physical layer. In the considered system model, the base station acts as a relay to allow two users on the left side to communicate with two users on the right side. Considering imperfect channel state information (CSI), the secure performance needs to be studied since an eavesdropper wants to overhear signals processed at the downlink. To provide a secure performance metric, we derive exact expressions for the secrecy outage probability (SOP) and evaluate the impacts of the main parameters on the SOP metric. The important finding is that higher secrecy performance can be achieved at high signal to noise ratio (SNR). Moreover, the numerical results demonstrate that the SOP tends to a constant at high SNR. Finally, our results show that the power allocation factors and target rates are the main factors affecting the secrecy performance of the considered uplink-downlink NOMA systems.
Design of a dual-band antenna for energy harvesting applicationjournalBEEI
This report presents an investigation into how to improve a current dual-band antenna to obtain better antenna parameters for energy harvesting applications, and to develop a new design and validate antenna frequencies operating at 2.4 GHz and 5.4 GHz. At 5.4 GHz, more data can be transmitted compared to 2.4 GHz. However, 2.4 GHz has a longer radiation range, so it can be used far from the antenna module, compared to the 5 GHz band, which has a shorter radiation range. The development of this project includes the design and testing of the antenna using computer simulation technology (CST) 2018 software and vector network analyzer (VNA) equipment. In the design process, fundamental antenna parameters are measured and validated in order to identify the better antenna performance.
Transforming data-centric eXtensible markup language into relational database...journalBEEI
eXtensible markup language (XML) appeared internationally as the format for data representation over the web. Yet, most organizations are still utilising relational databases as their database solutions. As such, it is crucial to provide seamless integration via effective transformation between these database infrastructures. In this paper, we propose XML-REG to bridge these two technologies based on node-based and path-based approaches. The node-based approach is good to annotate each positional node uniquely, while the path-based approach provides summarised path information to join the nodes. On top of that, a new range labelling is also proposed to annotate nodes uniquely by ensuring the structural relationships are maintained between nodes. If a new node is to be added to the document, re-labelling is not required as the new label will be assigned to the node via the new proposed labelling scheme. Experimental evaluations indicated that the performance of XML-REG exceeded XMap, XRecursive, XAncestor and Mini-XML concerning storing time, query retrieval time and scalability. This research produces a core framework for XML to relational databases (RDB) mapping, which could be adopted in various industries.
Key performance requirement of future next wireless networks (6G)journalBEEI
Given the massive potential of 5G communication networks and their foreseeable evolution, what should there be in 6G that is not in 5G or its long-term evolution? 6G communication networks are expected to integrate terrestrial, aerial, and maritime communications into a robust network that would be faster, more reliable, and able to support a massive number of devices with ultra-low latency requirements. This article presents a complete overview of potential 6G communication networks. The major contribution of this study is a broad overview of the key performance indicators (KPIs) of 6G networks, covering the latest progress in the principal areas of research application and the associated challenges.
Noise resistance territorial intensity-based optical flow using inverse confi...journalBEEI
This paper presents the use of the inverse confidential technique on a bilateral function with territorial intensity-based optical flow to prove its effectiveness in noise-resistant environments. In general, an image's motion vector is coded by the technique called optical flow, where sequences of images are used to determine the motion vector. However, the accuracy of the motion vector is reduced when the source image sequences are contaminated by noise. This work proves that the inverse confidential technique on a bilateral function can increase the accuracy of motion vector determination by territorial intensity-based optical flow in noisy environments. We performed testing with several kinds of non-Gaussian noise on several patterns of standard image sequences, analyzing the resulting motion vector in the form of the error vector magnitude (EVM) and comparing it with several noise resistance techniques in the territorial intensity-based optical flow method.
Modeling climate phenomenon with software grids analysis and display system i...journalBEEI
This study aims to model climate change based on rainfall, air temperature, pressure, humidity, and wind with the GrADS software and to create a global warming module. The research uses the define, design, and develop (3D) model. The modeling results for the five climate elements show that the annual average temperature in Indonesia in 2009-2015 was between 29 °C and 30.1 °C; the horizontal distribution of the annual average pressure in Indonesia in 2009-2018 was between 800 mBar and 1000 mBar; and the horizontal distribution of the average annual humidity in Indonesia ranged between 27-57 in 2009 and 2011, and between 30-60 in 2012-2015, 2017, and 2018. During the east monsoon, the wind circulation moves from northern Indonesia to the southern region of Indonesia; during the west monsoon, it moves from the southern part of Indonesia to the northern part. The global warming module produced for SMA/MA is feasible to use: this accords with the validator score of 69, which is in the appropriate category, and with the 91% positive questionnaire response from teachers and students.
An approach of re-organizing input dataset to enhance the quality of emotion ...journalBEEI
The purpose of this paper is to propose an approach of re-organizing input data to recognize emotion based on short signal segments and increase the quality of emotional recognition using physiological signals. MIT's long physiological signal set was divided into two new datasets, with shorter and overlapped segments. Three different classification methods (support vector machine, random forest, and multilayer perceptron) were implemented to identify eight emotional states based on statistical features of each segment in these two datasets. By re-organizing the input dataset, the quality of recognition results was enhanced. The random forest shows the best classification result among three implemented classification methods, with an accuracy of 97.72% for eight emotional states, on the overlapped dataset. This approach shows that, by re-organizing the input dataset, the high accuracy of recognition results can be achieved without the use of EEG and ECG signals.
Parking detection system using background subtraction and HSV color segmentationjournalBEEI
In a manual vehicle parking system, finding a vacant parking lot is difficult, as the vacant spaces have to be checked directly. When many people park, this takes a great deal of time or requires many people to handle it. This research develops a real-time parking system to detect parking. The system is designed using the HSV color segmentation method to determine the background image, and the detection process uses the background subtraction method. Applying these two methods requires image pre-processing using several methods such as grayscaling and blurring (low-pass filter), followed by thresholding and filtering processes to get the best image for detection. In the process, an ROI is determined to set the focus area of the object identified as an empty parking space. The parking detection process produces a best average accuracy of 95.76%. The minimum threshold value of 255 pixels is 0.4. This value is the best value from 33 test data across several criteria, such as the time of capture, the composition and color of the vehicle, the shape of the shadow of the object's environment, and the intensity of light. This parking detection system can be implemented in real time to determine the position of an empty space.
Quality of service performances of video and voice transmission in universal ...journalBEEI
The universal mobile telecommunications system (UMTS) has distinct benefits in that it supports a wide range of quality of service (QoS) criteria that users require in order to fulfill their requirements. The transmission of video and audio in real-time applications places a high demand on the cellular network, therefore QoS is a major problem in these applications. Providing QoS in the UMTS backbone network necessitates an active QoS mechanism in order to maintain the necessary level of convenience on UMTS networks. For UMTS networks, investigation models for end-to-end QoS, total transmitted and received data, packet loss, and throughput-provisioning techniques are run and assessed, and the simulation results are examined. According to the results, appropriate QoS adaptation allows for specific voice and video transmission. Finally, by analyzing existing QoS parameters, the QoS performance of 4G/UMTS networks may be improved.
Water scarcity is the lack of fresh water resources to meet the standard water demand. There are two types of water scarcity: physical and economic.
Explore the innovative world of trenchless pipe repair with our comprehensive guide, "The Benefits and Techniques of Trenchless Pipe Repair." This document delves into the modern methods of repairing underground pipes without the need for extensive excavation, highlighting the numerous advantages and the latest techniques used in the industry.
Learn about the cost savings, reduced environmental impact, and minimal disruption associated with trenchless technology. Discover detailed explanations of popular techniques such as pipe bursting, cured-in-place pipe (CIPP) lining, and directional drilling. Understand how these methods can be applied to various types of infrastructure, from residential plumbing to large-scale municipal systems.
Ideal for homeowners, contractors, engineers, and anyone interested in modern plumbing solutions, this guide provides valuable insights into why trenchless pipe repair is becoming the preferred choice for pipe rehabilitation. Stay informed about the latest advancements and best practices in the field.
Overview of the fundamental roles in Hydropower generation and the components involved in wider Electrical Engineering.
This paper presents the design and construction of hydroelectric dams, from the hydrologist's survey of the valley before construction through all involved disciplines, including fluid dynamics, structural engineering, generation, and mains frequency regulation, to the transmission of power through the network in the United Kingdom.
Author: Robbie Edward Sayers
Collaborators and co-editors: Charlie Sims and Connor Healey.
(C) 2024 Robbie E. Sayers
Student information management system project report ii.pdfKamal Acharya
Our project concerns student management. It mainly covers the various actions related to student details, making it easy to add, edit, and delete student details. It also provides a less time-consuming process for viewing, adding, editing, and deleting students' marks.
About
Indigenized remote control interface card suitable for MAFI system CCR equipment and compatible with the IDM8000 CCR. Backplane-mounted serial and TCP/Ethernet communication module for CCR remote access, providing IDM8000 CCR remote control over serial and TCP protocols.
• Remote control: Parallel or serial interface.
• Compatible with MAFI CCR system.
• Compatible with IDM8000 CCR.
• Compatible with Backplane mount serial communication.
• Compatible with commercial and Defence aviation CCR system.
• Remote control system for accessing CCR and allied system over serial or TCP.
• Indigenized local Support/presence in India.
• Easy configuration using DIP switches.
Cosmetic shop management system project report.pdfKamal Acharya
Buying new cosmetic products is difficult. It can even be scary for those who have sensitive skin and are prone to skin trouble. The information needed to alleviate this problem is on the back of each product, but it is hard to interpret those ingredient lists unless you have a background in chemistry.
Instead of buying and hoping for the best, we can use data science to help us predict which products may be good fits for us. It includes various function programs to do the above mentioned tasks.
Data file handling has been effectively used in the program.
The automated cosmetic shop management system should deal with the automation of the general workflow and administration process of the shop. The main processes of the system focus on the customer's request, where the system is able to search for the most appropriate products and deliver them to the customer. It should help the employees to quickly identify the cosmetic products that have reached the minimum quantity and also keep track of the expiry date of each cosmetic product. It should help the employees to find the rack number in which a product is placed. It is also a faster and more efficient way of working.
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxR&R Consult
CFD analysis is incredibly effective at solving mysteries and improving the performance of complex systems!
Here's a great example: At a large natural gas-fired power plant, where they use waste heat to generate steam and energy, they were puzzled that their boiler wasn't producing as much steam as expected.
R&R and Tetra Engineering Group Inc. were asked to solve the issue with reduced steam production.
An inspection had shown that a significant amount of hot flue gas was bypassing the boiler tubes, where the heat was supposed to be transferred.
R&R Consult conducted a CFD analysis, which revealed that 6.3% of the flue gas was bypassing the boiler tubes without transferring heat. The analysis also showed that the flue gas was instead being directed along the sides of the boiler and between the modules that were supposed to capture the heat. This was the cause of the reduced performance.
Based on our results, Tetra Engineering installed covering plates to reduce the bypass flow. This improved the boiler's performance and increased electricity production.
It is always satisfying when we can help solve complex challenges like this. Do your systems also need a check-up or optimization? Give us a call!
Work done in cooperation with James Malloy and David Moelling from Tetra Engineering.
More examples of our work https://www.r-r-consult.dk/en/cases-en/
Planning of procurement of different goods and services
Visualizing stemming techniques on online news articles text analytics
Bulletin of Electrical Engineering and Informatics
Vol. 10, No. 1, February 2021, pp. 365~373
ISSN: 2302-9285, DOI: 10.11591/eei.v10i1.2504
Journal homepage: http://beei.org
Visualizing stemming techniques on online news articles text analytics
Nurul Atiqah Razmi, Muhammad Zharif Zamri, Sharifah Syafiera Syed Ghazalli, Noraini Seman
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Malaysia
Article Info
Article history:
Received Mar 31, 2020
Revised Jun 16, 2020
Accepted Jul 23, 2020
ABSTRACT
Stemming is the process to convert words into their root words
by the stemming algorithm. It is one of the main processes in text analytics
where the text data needs to go through stemming process before proceeding
to further analysis. Text analytics is a very common practice nowadays that
is practiced to analyze contents of text data from various sources such as
the mass media and social media. In this study, two different stemming
techniques, Porter and Lancaster, are evaluated. The differences in the outputs
that are resulted from the different stemming techniques are discussed based
on the stemming error and the resulted visualization. The finding from this
study shows that Porter stemming performs better than Lancaster stemming,
by 43%, based on the stemming error produced. Visualization can still be
accommodated by the stemmed text data but some understanding
of the background on the text data is needed by the tool users to ensure that
correct interpretation can be made on the visualization outputs.
Keywords:
Lancaster stemmer
Porter stemmer
Stemming
Text analytics
Visualization
This is an open access article under the CC BY-SA license.
Corresponding Author:
Noraini Seman,
Department of Computer Science,
Faculty of Computer & Mathematical Sciences,
Universiti Teknologi MARA,
Shah Alam, Selangor, Malaysia.
Email: aini@tmsk.uitm.edu.my
1. INTRODUCTION
Text analytics has gained a great deal of attention and development in the last decade due
to the tremendous growth of text data from the electronic media. A study from IDC has stated that
the volume of text data is expanding by 50 times within the period of 2010 to 2020, thanks to the changes
and advancements in electronics technology [1]. Text analytics or text mining refers to the process
of knowledge discovery from text (KDT) from sources such as structured sources (RDBMS) and unstructured sources (XML, JSON, word documents, and images) [2]. It is practiced for data from different areas such as social media, electronic news, websites, blogs, electronic health records, academia, etc. [3]. In text analytics,
the pre-processing of text data is a very critical process. In the pre-processing phase, when heavy stemming algorithms are applied to the text data and aggressively remove affixes, the result is incorrect stem words. Aggressive algorithms reduce under-stemming errors but increase over-stemming. Under-stemming is the condition where words that refer to the same root are not reduced to the same stem word, while over-stemming is the condition where words are reduced to the same stem word even though they have different meanings. Due to these stemming errors, the end analytics on the stemmed text may not be valid [4, 5]. In this study, the visualization process is a means to evaluate the performance of text data pre-processing, and the Voyant tool is utilized from its webpage [6]. This tool is widely used for text data visualization due to its user-friendly interface and the ease to
interpret its resulting outcomes for public usage [7, 8]. This study utilizes text data on the topics of petrol prices and blockchain, two popular topics in the country. Text analytics has been applied in the petrol prices field for price forecasting, that is, whether prices are going to increase or decrease [9]. Meanwhile, the blockchain topic is also emerging in the country as its application grows in various industries [10, 11].
The text analytics approach is multidisciplinary, covering text data acquisition and retrieval, text pre-processing, the analysis of the text itself, information extraction, the visualization process, and machine learning approaches for identification and classification [12, 13]. In the pre-processing of text data, the complex structure of the text data is processed so that it can be used in further analysis [3]. The common tasks in text pre-processing are tokenization, filtering, lemmatization, and stemming of the words. In this study, two steps of pre-processing are performed, namely filtering and stemming. Filtering is the process where some words, usually words that add no content such as prepositions and conjunctions, are removed from the text data. The removed words are known as 'stop words'. As for stemming, it is the process of transforming words into their root words by applying stemming algorithms [14]. The filtering of the text data is conducted manually by removing the stop words, followed by stemming of the text data both manually and with the Python NLTK tool [15]. Stemming can be defined as the removal of affixes (prefixes, suffixes) from a word to generate its root word for analytical purposes [16]. It is an operation that relates morphological variants of words through the removal of affixes, either prefixes or suffixes, producing a root form of the word called a stem that, most of the time, closely approximates the root morpheme [17, 18]. Its main objective is to reduce the distinctive syntactic structures/word types of a word. There are two famous stemming techniques, Porter and Lancaster. Porter stemming is the most generally utilized stemmer, while Lancaster stemming is known as one of the most aggressive stemming algorithms. Stemming algorithms can be classified into three main categories: affix removal methods, statistical methods, and mixed methods [19, 20].
The objective of this study is to compare the stemming errors resulting from the Porter stemmer and the Lancaster stemmer, and thus to identify which of them has the better stemming performance. Next, the resulting stemmed text data are visualized, and the visualizations are compared to identify any significant differences between the visualization outputs of the manually stemmed, Porter stemmed, and Lancaster stemmed text data.
- Porter stemmer
This most popular algorithm for stemming English text data was proposed in 1980 [21]. The initial algorithm is based on the idea that the suffixes in the English language are constructed from smaller and simpler suffixes. At present, the algorithm consists of about 60 rules. When a rule is applied and accepted, the suffix is removed appropriately, and the next rule is then applied. Porter's general rule for removing a suffix has the following form [22]:
(condition) S1 -> S2
This means that if a word ends with the suffix S1 and the stem before S1 satisfies the given condition, S2 will replace S1. For example, the rule (m > 0) EED -> EE means that if the stem before 'eed' contains at least one vowel-consonant sequence (measure m > 0), the ending is replaced with 'ee'. Therefore, 'agreed' turns into 'agree'. Meanwhile, 'feed' remains unchanged since, aside from the 'eed', it contains only one consonant and no other vowel [18, 21, 23]. The Porter algorithm provides a simple approach to conflation that works well in practice and is applicable to a range of languages. It generally produces a lower error rate than other stemmers [21, 24, 25].
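To make this rule-based behaviour concrete, the following minimal sketch applies NLTK's PorterStemmer to a few words that appear in the studied articles; the word list and the NLTK calls are ours, and exact outputs can vary slightly across NLTK versions.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
# Words drawn from the articles studied in this paper.
for word in ["saving", "prices", "weekly", "financial"]:
    print(word, "->", porter.stem(word))
# Along the lines reported in this paper: saving -> save,
# prices -> price, weekly -> weekli, financial -> financi.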
- Lancaster stemmer
The Lancaster stemmer, also known as the Paice/Husk stemmer, is a conflation-based iterative algorithm developed by Chris D. Paice in 1990 [23, 26]. The algorithm consists of a table containing 120 rules provided in a rule file. Each rule specifies either the removal or the replacement of an ending. The stemming terminates if no rule is applicable to a word, or if the word starts with a vowel and only two letters are left, or if the word starts with a consonant and only three characters are left. Otherwise, the stemming rules are applied and the process repeats. For example, the word 'agreed', which starts with a vowel, is reduced correctly to 'agree'. However, this algorithm has a tendency to produce excessive stemming: the word 'agree' is reduced to 'agr' while 'are' is reduced to 'ar'. It is a very heavy algorithm, but at the same time it is easy to implement, as the iteration of every rule takes care of the removal and replacement of characters in a document as per the rules applied.
2. RESEARCH METHOD
The methodology of this study is as shown in Figure 1. There are four designated phases: (i) data
collection, (ii) data extraction, (iii) data processing, and (iv) visualization.
Figure 1. The general flow of text visualization
This study applies a sample of 10 text documents selected randomly from online newspaper articles. The topics covered in the articles are related to petrol prices and blockchain technology in Malaysia: five articles are related to petrol price issues and five to blockchain issues. Only five sentences are extracted from each article to limit the size of this study. In pre-processing, the stop words in the documents are filtered out. Stop words are words that give no contextual meaning or added information to the whole document, such as 'a', 'the', 'and', 'while', and 'is'. These words are irrelevant to the context but most of the time appear frequently in documents, which may affect the representation of the text data in visualization. Next, manual stemming is conducted, followed by stemming with the Porter and Lancaster methods using Python NLTK [15]. The 2D visualization of the stemmed text data is then conducted. The performance of the stemming techniques is evaluated based on the stemming errors and the visualization outputs.
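As an illustration of this pipeline, the following minimal sketch filters stop words and then stems with both methods. NLTK's built-in English stop-word list stands in here for the manual stop-word filtering used in the study (an assumption), and the sample sentence is ours.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, LancasterStemmer

nltk.download("stopwords", quiet=True)  # one-time fetch of the stop-word list

stop_words = set(stopwords.words("english"))
porter, lancaster = PorterStemmer(), LancasterStemmer()

sentence = "The petrol prices in Malaysia are reviewed on a weekly basis"
tokens = [t for t in sentence.lower().split() if t not in stop_words]
print([porter.stem(t) for t in tokens])     # longer roots, e.g. petrol, price
print([lancaster.stem(t) for t in tokens])  # typically much shorter stems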
3. RESULTS AND ANALYSIS
3.1. Stemming by Porter method vs Lancaster method
All documents are uploaded into Python NLTK as separate documents. Excerpts of the resulting stemmed articles are shown in Figure 2 (a) and 2 (b), from which significant differences can be seen. Porter stems words into longer roots, while Lancaster stems words into much shorter root words. For example, the original word 'saving' from the articles is reduced to the correct root word 'save' by Porter but to 'sav' by Lancaster. This is an under-stemming error, where the word is not reduced to the correct root. Other examples are 'similar' reduced to 'simil', 'Africa' reduced to 'Afric', and 'China' reduced to 'chin' by Lancaster. However, errors are also observed for Porter stemming. The summary of stemming errors for both stemmers on the 10 articles, compared against our manual stemming as the baseline, is presented in Table 1.
Figure 2. Example of stemmed text results: (a) Porter stemmer, (b) Lancaster stemmer
For all the 10 articles tested, Lancaster stemming results in a higher stemming error compared to Porter stemming. This results from the aggressiveness of Lancaster, which under-stems words, resulting in
the words being reduced to incorrect root words. The average stemming error for Porter is 21.89%, while Lancaster's error is almost 10 percentage points higher at 31.23%. Comparing the average stemming errors, Porter stemming performs about 43% better than Lancaster stemming in terms of producing fewer errors, since (31.23 - 21.89) / 21.89 is approximately 43%. The resulting stemmed texts from both stemmers are then visualized to compare the visualization results from the different stemming outputs.
Table 1. Comparison of stemming errors between Porter and Lancaster stemming
Article No.   Stemming Error
              Porter    Lancaster
Article 1     22.33%    29.13%
Article 2     16.95%    31.36%
Article 3     25.00%    31.03%
Article 4     24.66%    35.62%
Article 5     27.97%    32.20%
Article 6     11.35%    18.44%
Article 7     22.41%    32.76%
Article 8     24.31%    34.72%
Article 9     21.77%    32.65%
Article 10    22.14%    34.35%
Average       21.89%    31.23%
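A minimal sketch of how the per-article error in Table 1 could be computed follows; the paper does not spell out the exact formula, so counting the share of words whose automatic stem differs from the manual baseline is our assumption, and the toy word lists are ours.

def stemming_error(auto_stems, manual_stems):
    # Share of words whose automatic stem disagrees with the manual
    # baseline, expressed as a percentage (assumed error definition).
    mismatches = sum(a != m for a, m in zip(auto_stems, manual_stems))
    return 100.0 * mismatches / len(manual_stems)

manual = ["save", "price", "week", "finance"]        # manual baseline stems
porter_out = ["save", "price", "weekli", "financi"]  # automatic stems
print(f"{stemming_error(porter_out, manual):.2f}%")  # 50.00% on this toy input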
3.2. Visualization outputs
The visualization tool summarizes the number of unique words in the articles. The manually stemmed articles contain 464 unique words; the Porter stemmed articles, 493; and the Lancaster stemmed articles, 499. The manually stemmed articles thus contain a much smaller number of unique words than the latter two. This is caused by the stemming variation from Porter and Lancaster stemming. For example, the word 'countries' is manually stemmed as 'country', but by Porter stemming it is stemmed as 'countri'. This results in the Porter stemmed articles containing two variants for 'country', namely 'country' and 'countri'. For Lancaster stemming, as an example, the word 'africa' is stemmed as 'afric' while 'african' is stemmed as 'afr', resulting in two variants for 'africa', namely 'afric' and 'afr', whereas both words are manually stemmed into 'africa'. In short, Porter and Lancaster stemming increase the number of unique words in text data compared to manual stemming. We refer to the statistical summary of the frequency and relative frequency values of each word. While frequency is the count of appearances of every word, relative frequency is the fraction of occurrences of a word [27]. In this study, it indicates the count of appearances of a word relative to the number of words in a segment of the corpus. The segments are the fractions of sentence separation, either by punctuation marks or by separators. In the case of this study, the visualization tool segments the corpus by the serial number of the uploaded articles, from 1 to 10:
relative frequency of word w in segment s = f(w, s) / N(s)     (1)
where f(w, s) is the number of occurrences of w in segment s and N(s) is the total number of words in that segment.
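A minimal sketch of (1), with toy segments standing in for the 10 uploaded articles (the word lists are ours):

from collections import Counter

segments = [
    ["petrol", "price", "petrol", "malaysia"],  # toy segment 1
    ["blockchain", "petrol", "blockchain"],     # toy segment 2
]
for i, seg in enumerate(segments, start=1):
    counts = Counter(seg)  # f(w, s) for each word w in segment s
    for word, freq in counts.most_common(2):
        print(i, word, freq, round(freq / len(seg), 6))  # (1): f(w, s) / N(s)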
The statistical summary of the frequency and relative frequency of the top 10 words in the articles is shown in Table 2.
Table 2. The relative frequency of words from different stemming techniques
Terms                  Manual            Porter            Lancaster
                       freq  rel. freq   freq  rel. freq   freq  rel. freq
price/prices/pric      17    0.086420*   13    0.061728    17    0.061728
baobab                 6     0.084507*   6     0.084507*   13    0.070423*
petrol                 17    0.078652*   17    0.078652*   10    0.078652*
blockchain             8     0.065574    8     0.065574*   8     0.065574*
africa                 4     0.056338    4     0.056338    -     -
finance/financi/fin    9     0.056338    5     0.056338    6     0.056338
Hellogold              4     0.056338    4     0.056338    4     0.056338
university/univers     5     0.054348    5     0.053763    5     0.053763
price/prices/pric**    -     0.053763    4     0.049383    4     0.049383
week/weekli            9     0.053763    -     -           -     -
increase               6     -           6     0.049383    5     0.049383
country                6     -           -     -           4     0.049180
Note: * top three rankings based on the relative frequency; ** appearance of the word in the second segment.
The result shows that with manual stemming, the word 'price' acquires the highest relative frequency, 0.086420, followed by 'baobab' (0.084507) and 'petrol' (0.078652). However, with Porter and Lancaster stemming, the top three words with the highest relative frequency do not include the word 'price'. The top three words from Porter stemming are 'baobab' (0.084507), 'petrol' (0.078652), and 'blockchain' (0.065574), while the top three words from Lancaster stemming are 'petrol' (0.078652), 'baobab' (0.070423), and 'blockchain' (0.065574). This shows some inconsistency between the relative frequency representation of the manually stemmed articles and that of the Porter and Lancaster stemmed articles, since the word with the highest relative frequency from manual stemming is ranked fourth for both Porter and Lancaster stemming. The resulting relative frequency is visualized in terms of the document trend at every segment of the articles, as presented in Figure 3 (a), while the frequency value is visualized in the word cloud, as in Figure 3 (b).
Figure 3. (a) Document trend visualization of manually stemmed articles; (b) word cloud visualization of the corpus of manually stemmed articles
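As an aside, the paper's word clouds come from the Voyant web tool; purely as an offline illustration, a similar cloud could be rendered from the manual-stemming frequencies in Table 2 with the third-party wordcloud package (an assumption that it is installed, e.g. via pip install wordcloud matplotlib):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Manual-stemming frequencies taken from Table 2 of this paper.
freqs = {"petrol": 17, "price": 17, "week": 9, "finance": 9, "baobab": 6}
cloud = WordCloud(width=400, height=300, background_color="white")
cloud.generate_from_frequencies(freqs)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()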
The visualizations in Figure 3 (a) and 3 (b) show that, for the manually stemmed articles, 'price' has the highest relative frequency in the first segment of the text data, followed by 'petrol' in the tenth segment. The words with the highest relative frequency over the 10 articles are 'petrol', followed by 'price', 'finance', 'malaysia', and 'week'. This indicates that the articles mainly discuss the petrol prices in Malaysia, which are set by the Government of Malaysia on a weekly basis, and how the petrol price
setting is influenced by the financial condition in Malaysia. Two words appear as the key words in the word cloud: 'price' and 'petrol'. The results from the manually stemmed articles are compared with the visualizations of the Porter stemmed articles in Figure 4 (a) and 4 (b).
Figure 4. (a) Document trend visualization using Porter stemmed articles; (b) word cloud visualization using Porter stemmed articles
For the Porter stemmed articles, the visualizations in Figure 4 show that the word with the highest relative frequency is 'petrol' in the tenth segment, followed by 'prices', 'malaysia', 'blockchain', and 'countri'. As with the manually stemmed articles, the top two words with the highest frequency are 'petrol' and 'prices', which signify the topic of the articles on the petrol prices issue. They are followed by 'malaysia', which signifies that the articles discuss petrol price issues in Malaysia. The next highest frequency words are 'blockchain' and 'countri', which indicate that blockchain technology has an influence on petrol prices in Malaysia; for the word 'countri', the implication is not clear whether it refers only to Malaysia or to other countries as well, since the word appears as a singular noun. In the word cloud, 'petrol' retains its original frequency of 17 counts, but the size of 'prices' has decreased to 13 counts due to occurrences of under-stemming that separate the root word 'price' from the word 'prices'. Since the size of 'prices' is reduced, the word 'Malaysia' appears more significant.
The visualization in Figure 5 (a) and 5 (b) is similar to that of the Porter stemmed articles: the highest relative frequency word is 'petrol' in the tenth segment, followed by 'pric', 'malays', 'blockchain', and 'fin' (financial). In the word cloud, the significant sizes of 'petrol', 'pric', 'malays', and 'blockchain' can be seen. As with the result from the Porter stemmed articles, the visualization may signify that the articles discuss petrol price issues in Malaysia which influence, or are influenced by, blockchain technology.
Figure 5. (a) Document trend visualization using Lancaster stemmed articles; (b) word cloud visualization using Lancaster stemmed articles
From the study as outlined in this paper, it has been found that Porter stemming performs about 43% better than Lancaster stemming, based on the stemming errors produced. The visualization of the three types of stemming has resulted in the following words being visualized with the highest relative frequency; these are the key words of the articles, listed in order from the highest to the fifth highest relative frequency.
Manual: 'petrol', 'price', 'finance', 'malaysia', 'week'
Porter: 'petrol', 'prices', 'malaysia', 'blockchain', 'countri'
Lancaster: 'petrol', 'pric', 'malays', 'blockchain', 'fin'
From the manually stemmed text data, the main topic of the text data is the petrol price setting in Malaysia, which is affected by the financial situation of the country and is reviewed and set on a weekly basis. As observed here, the term 'blockchain' is visualized as less significant. This is due to the low count of the term 'blockchain' in its related articles: it is mentioned only once or twice per article. On the other hand, with Porter and Lancaster stemming, the word 'blockchain' is visualized as one of the most frequent words. The reason is that, with these two stemmers, the words 'finance' and 'week' are under-stemmed. The original word 'financial' is reduced to 'financi' by Porter and to 'financ' by Lancaster. The original word 'weekly' is reduced to 'weekli' by both stemmers, so 'weekli' fails to be combined with the root word 'week' in the frequency counting. This leads to the missed information that the articles stress the weekly setting of the petrol prices. Another interesting finding is that the words 'Malaysia' and 'Malaysian' have been reduced to 'Malays', which indicates that Lancaster stemming is not able to differentiate between 'Malaysia', a country, and 'Malay', a race. This is critical when stemming articles related to the country of Malaysia, or articles related to the races in Malaysia, since Malaysia is a multi-racial country and the resulting stemming by Lancaster can produce
misguided visualization outputs on these matters. The word 'financial' has also been reduced to just 'fin', which forces users to guess at its root word, which could be, for example, 'fine', 'finish', or 'final'. As for the relative frequency result shown in Table 2, neither the Porter nor the Lancaster stemmed articles obtained 'price' as the word with the highest relative frequency. This is because the root word 'price' has been stemmed to incorrect stems such as 'prices' and 'pric', which makes the statistical information incorrect.
4. CONCLUSION
In conclusion, in terms of stemming performance as measured by stemming errors, Porter stemming performs about 43% better than Lancaster stemming, but the stemming errors are still high, affecting more than 20% of the words in the articles. As for the visualization process, visualization can still be accommodated by the text data stemmed by the two stemmers, since some of the visualization is correctly rendered compared to the output from the manually stemmed text data. However, some understanding of the background of the articles is needed by tool users to ensure that correct interpretations can be made of the visualization outputs. The study can be continued by comparing Porter stemming and Lancaster stemming for the Malay or English languages on issues related to the country of Malaysia (as Lancaster stemming fails to stem 'Malaysia' correctly) and on issues concerning the multiple races in Malaysia (as Lancaster stemming stems the word 'Malaysia' into 'Malays'), to see the impact of stemming errors that may result in significant differences in the visualization output. Beyond that, there is still potential for the algorithms to be improved for the English language, to increase the stemming accuracy and thus the accuracy of the visualization output.
ACKNOWLEDGEMENTS
The authors would like to thank Universiti Teknologi MARA (UiTM) for the research funding and
support via grant number 600-IRMI/PERDANA 5/3 BESTARI (107/2018).
REFERENCES
[1] J. Gantz and D. Reinsel, "The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East," pp. 1-16, December 2012.
[2] R. Feldman and I. Dagan, "Knowledge Discovery in Textual Databases (KDT)," First International Conference on Knowledge Discovery and Data Mining, vol. 95, pp. 112-117, 1995.
[3] Z. Zainol, M. T. H. Jaymes, and P. N. E. Nohuddin, "VisualUrText: A Text Analytics Tool for Unstructured Textual Data," Journal of Physics: Conference Series, vol. 1018, no. 1, 2018.
[4] A. Joshi, N. Thomas, and M. Dabhade, "Modified Porter Stemming Algorithm," International Journal of Computer Science and Information Technologies, vol. 7, no. 1, pp. 266-269, 2016.
[5] C. D. Paice, "An Evaluation Method for Stemming Algorithms," in SIGIR '94, London: Springer London, pp. 42-50, 1994.
[6] "Voyant Tools." [Online]. Available: https://voyant-tools.org/. [Accessed: 11-May-2019].
[7] M. Burghardt, J. Pörsch, B. Tirlea, and C. Wolff, "WebNLP: An Integrated Web-Interface for Python NLTK and Voyant," Proceedings of the 12th KONVENS, pp. 1-5, September 2015.
[8] M. E. Welsh, "Review of Voyant Tools," Collaborative Librarianship, vol. 6, no. 2, pp. 96-97, 2014.
[9] X. Li, W. Shang, and S. Wang, "Text-based crude oil price forecasting: A deep learning approach," International Journal of Forecasting, 2018.
[10] L. S. Sankar, M. Sindhu, and M. Sethumadhavan, "Survey of consensus protocols on blockchain applications," 2017 4th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, pp. 1-5, 2017.
[11] M. N. Saadat, S. Abdul Halim, H. Osman, R. Mohammad Nassr, and M. F. Zuhairi, "Blockchain based crowdfunding systems," Indonesian Journal of Electrical Engineering and Computer Science (IJEECS), vol. 15, no. 1, pp. 409-413, July 2019.
[12] S. Dang and P. H. Ahmad, "Text Mining: Techniques and its Application," International Journal of Engineering & Technology Innovations, vol. 1, no. 4, pp. 22-25, 2014.
[13] S. Sinclair and G. Rockwell, "Text Analysis and Visualization," A New Companion to Digital Humanities, pp. 274-290, 2015.
[14] M. Allahyari et al., "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques," 2017.
[15] "Python NLTK Stemming and Lemmatization Demo." [Online]. Available: http://text-processing.com/demo/stem/. [Accessed: 11-May-2019].
[16] M. Sankupellay and S. Valliappan, "Malay-Language Stemmer," Sunway Academic Journal, vol. 3, pp. 147-153, 2006.
[17] W. B. Frakes and C. J. Fox, "Strength and Similarity of Affix Removal Stemming Algorithms," ACM SIGIR Forum, New York, NY, USA, vol. 37, no. 1, pp. 26-30, 2003.
[18] D. A. Hull and G. Grefenstette, "A Detailed Analysis of English Stemming Algorithms," Rank Xerox Research Centre, pp. 1-16, 1996.
[19] M. Anjali and G. Jivani, "A Comparative Study of Stemming Algorithms," International Journal of Computer Technology and Applications, vol. 2, no. 6, pp. 1930-1938, 2011.
[20] M. Hadni, A. Lachkar, and S. A. Ouatik, "A new and efficient stemming technique for Arabic Text Categorization," 2012 International Conference on Multimedia Computing and Systems, Tangier, pp. 791-796, 2012.
[21] P. Willett, "The Porter stemming algorithm: then and now," Program: Electronic Library and Information Systems, vol. 40, no. 3, pp. 219-223, 2006.
[22] M. F. Porter, "An Algorithm for Suffix Stripping," Program: Electronic Library and Information Systems, vol. 14, no. 3, pp. 211-218, 2006.
[23] S. R. Sirsat, D. V. Chavan, and H. S. Mahalle, "Strength and Accuracy Analysis of Affix Removal Stemming Algorithms," International Journal of Computer Science and Information Technologies, vol. 4, no. 2, pp. 265-269, 2013.
[24] M. Lennon, D. S. Peirce, B. D. Tarry, and P. Willett, "An evaluation of some conflation algorithms for information retrieval," Journal of Information Science, vol. 3, no. 4, pp. 177-183, 1981.
[25] W. Ben and A. Karaa, "A New Stemmer to Improve Information Retrieval," International Journal of Network Security & Its Applications, vol. 5, no. 4, pp. 143-154, 2013.
[26] J. Köhler, S. Philippi, M. Specht, and A. Rüegg, "Ontology based text indexing and querying for the semantic web," Knowledge-Based Systems, vol. 19, no. 8, pp. 744-754, December 2006.
[27] S. Dean and B. Illowsky, "Sampling and Data: Frequency, Relative Frequency, and Cumulative Frequency," pp. 1-6, 2010.